Hi Hans,

Yes, changing the "Temp location" in the Beam-Direct Pipeline Run Configuration 
to the GCS URL was sufficient. Now it works, both when running via Beam-Direct 
locally and when running a pipeline with a "Workflow executor" transform on 
Dataflow.

cheers

Fabian

> On 31.08.2022 at 17:39, Hans Van Akelyen <[email protected]> wrote:
> 
> Thank you Israel for the support!
> 
> @Fabian, so the solution in your case was to change the Temp location in the 
> Direct runner to a GCS path?
> 
> In our documentation about the runner config we state that it should be a GCS 
> location for the Dataflow runner; for the Direct runner we do not state this 
> explicitly. I'll add a note to the docs that when using non-local IOs such 
> as BigQuery a GCS path is required here.
> 
> I have also created a ticket to expose the gcpTempLocation, currently only 
> the tempLocation is configurable via the UI.
> 
> Cheers,
> Hans
> 
> On Wed, 31 Aug 2022 at 17:10, Israel Herraiz via users <[email protected]> wrote:
> I have searched Google, and I cannot find any reference.
> 
> I seem to remember that the stack trace tells you something like "temp 
> location is not in GCS".
> 
> In any case, that temp location depends on the method used to write to 
> BigQuery. The default method in Beam is FILE_LOADS, which creates a BigQuery 
> load job (https://cloud.google.com/bigquery/docs/batch-loading-data). Those 
> jobs read data from GCS.
> 
> For FILE_LOADS, Beam creates Avro files in the tempLocation of the pipeline, 
> and uses the location of those files as an input parameter for the BQ job. So 
> it has to be a location in GCS.
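> 
> If you want the staging location for those files to be different from the 
> pipeline's tempLocation, BigQueryIO also lets you set it explicitly. A minimal 
> sketch (the table and bucket names are just placeholders):
> 
>     import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
>     import org.apache.beam.sdk.options.ValueProvider;
> 
>     BigQueryIO.writeTableRows()
>         .to("my-project:my_dataset.my_table")  // placeholder table
>         .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
>         // FILE_LOADS stages temporary Avro files here; it must be a gs://
>         // path so the BigQuery load job can read them
>         .withCustomGcsTempLocation(
>             ValueProvider.StaticValueProvider.of("gs://my-bucket/bq-tmp"));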
> 
> Now, tempLocation is used for more things. If you want to use a different 
> tempLocation for the rest of the pipeline, you can use the option 
> --gcpTempLocation in combination with --tempLocation. BigQueryIO will use 
> gcpTempLocation if it is set, and it will fall back to tempLocation if 
> gcpTempLocation is not set.
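> 
> In code the two options look roughly like this (a sketch, assuming you set 
> the options in Java from the command-line args rather than via a UI; the 
> bucket name is a placeholder):
> 
>     import org.apache.beam.sdk.extensions.gcp.options.GcpOptions;
>     import org.apache.beam.sdk.options.PipelineOptionsFactory;
> 
>     // args holds the flags, e.g. --tempLocation=... --gcpTempLocation=...
>     GcpOptions options = PipelineOptionsFactory.fromArgs(args).as(GcpOptions.class);
>     // used by the rest of the pipeline, e.g. a local directory
>     options.setTempLocation("/tmp/beam");
>     // used by GCP IOs such as BigQueryIO; must be a gs:// path
>     options.setGcpTempLocation("gs://my-bucket/tmp");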
> 
> Also bear in mind that if you are using a different write method (e.g. 
> STORAGE_WRITE_API), Beam will not generate files, so it does not matter 
> whether tempLocation is in GCS; the data will be written directly to BigQuery 
> (https://cloud.google.com/bigquery/docs/write-api-batch).
> 
> These are the write methods that can be used with Beam and BigQuery: 
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html
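> 
> For example, switching to the Storage Write API looks like this (a sketch; 
> the table name is a placeholder):
> 
>     import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
> 
>     BigQueryIO.writeTableRows()
>         .to("my-project:my_dataset.my_table")  // placeholder table
>         // writes rows directly to BigQuery, so no temp files in GCS are needed
>         .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API);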
> 
> Kind regards,
> Israel
> 
> 
> On Wed, 31 Aug 2022 at 16:50, Fabian Peters <[email protected]> wrote:
> Hi Israel,
> 
> That was it, many thanks! I had it set to "${java.io.tmpdir}". Is the 
> requirement to use a GCS location documented somewhere?
> 
> cheers
> 
> Fabian
> 
>> On 31.08.2022 at 11:25, Israel Herraiz via users <[email protected]> wrote:
>> 
>> What are the command line arguments that you are using for those direct 
>> runner pipelines? For instance, for BigQuery you will need to set 
>> --tempLocation to a GCS location for the BQ jobs to work.
>> 
>> 
>> On Wed, 31 Aug 2022 at 09:50, Fabian Peters <[email protected]> wrote:
>> Good morning!
>> 
>> I'm putting together my Dataflow deployment and am running into another 
>> problem I don't know how to deal with: I'm running a pipeline via Dataflow, 
>> which contains a "Workflow executor" transform. The workflow contains a 
>> number of pipelines that have their run configuration set to Beam-Direct. In 
>> principle, this works fine. (Yay!)
>> 
>> However, in this setup a BigQuery Output fails with a 
>> "java.lang.RuntimeException: Failed to create job with prefix 
>> beam_bq_job_LOAD_sites_FOO_ID, reached max retries: 3, last failed job: 
>> null." I see the the same when running just the pipeline (or any other with 
>> BigQuery Output) via Beam-Direct locally, which makes me think that the GCP 
>> credentials are not being picked up? Is there something I need to configure?
>> 
>> cheers
>> 
>> Fabian
>> 
>> P.S.: Logs from running locally with Beam-Direct:
>> 
>> 2022/08/31 09:30:07 - sites - ERROR: Error starting the Beam pipeline
>> 2022/08/31 09:30:07 - sites - ERROR: 
>> org.apache.hop.core.exception.HopException: 
>> 2022/08/31 09:30:07 - sites - Error executing pipeline with runner Direct
>> 2022/08/31 09:30:07 - sites - java.lang.RuntimeException: Failed to create 
>> job with prefix 
>> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
>>  reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites - 
>> 2022/08/31 09:30:07 - sites -   at 
>> org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:258)
>> 2022/08/31 09:30:07 - sites -   at 
>> org.apache.hop.beam.engines.BeamPipelineEngine.lambda$startThreads$0(BeamPipelineEngine.java:305)
>> 2022/08/31 09:30:07 - sites -   at 
>> java.base/java.lang.Thread.run(Thread.java:829)
>> 2022/08/31 09:30:07 - sites - Caused by: 
>> org.apache.beam.sdk.Pipeline$PipelineExecutionException: 
>> java.lang.RuntimeException: Failed to create job with prefix 
>> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
>>  reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites -   at 
>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
>> 2022/08/31 09:30:07 - sites -   at 
>> org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
>> 2022/08/31 09:30:07 - sites -   at 
>> org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
>> 2022/08/31 09:30:07 - sites -   at 
>> org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:246)
>> 2022/08/31 09:30:07 - sites -   ... 2 more
>> 2022/08/31 09:30:07 - sites - Caused by: java.lang.RuntimeException: Failed 
>> to create job with prefix 
>> beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000,
>>  reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites -   at 
>> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
>> 2022/08/31 09:30:07 - sites -   at 
>> org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
>> 2022/08/31 09:30:07 - sites -   at 
>> org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:380)
>> 
> 
