Hi Hans,

Yes, changing the "Temp location" in the Beam-Direct Pipeline Run Configuration to a GCS URL was sufficient. It now works, both when running via Beam-Direct locally and when running a pipeline with a "Workflow executor" transform on Dataflow.
cheers
Fabian

> On 31.08.2022 at 17:39, Hans Van Akelyen <[email protected]> wrote:
>
> Thank you Israel for the support!
>
> @Fabian, so the solution in your case was to change the Temp location in the
> Direct runner to a GCS path?
>
> In our documentation about the runner config we state that it should be a GCS
> location for the Dataflow runner; in the Direct runner we do not state this
> explicitly. I'll add a note to the docs that when using non-local IOs such
> as BigQuery, a GCS path is required here.
>
> I have also created a ticket to expose the gcpTempLocation; currently only
> the tempLocation is configurable via the UI.
>
> Cheers,
> Hans
>
> On Wed, 31 Aug 2022 at 17:10, Israel Herraiz via users <[email protected]> wrote:
> I am searching in Google, and I cannot find any reference.
>
> I seem to remember that the stack trace will tell you something like "temp
> location is not in GCS".
>
> In any case, that temp location depends on the method used to write to
> BigQuery. The default method in Beam is FILE_LOADS, which creates a BigQuery
> load job (https://cloud.google.com/bigquery/docs/batch-loading-data). Those
> jobs read data from GCS.
>
> For FILE_LOADS, Beam creates Avro files in the tempLocation of the pipeline
> and uses the location of those files as an input parameter for the BQ job, so
> it has to be a location in GCS.
>
> Now, tempLocation is used for more things. If you want to use a different
> tempLocation for the rest of the pipeline, you can use the option
> --gcpTempLocation in combination with --tempLocation. BigQueryIO will use
> gcpTempLocation if it is set, and will fall back to tempLocation if
> gcpTempLocation is not set.
>
> Bear also in mind that if you are using a different write method (e.g.
> STORAGE_WRITE_API), Beam will not generate files, so whether tempLocation is
> in GCS or not does not matter; the data will be written directly to BigQuery
> (https://cloud.google.com/bigquery/docs/write-api-batch).
>
> These are the write methods that can be used with Beam and BigQuery:
> https://beam.apache.org/releases/javadoc/current/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html
>
> Kind regards,
> Israel
>
> On Wed, 31 Aug 2022 at 16:50, Fabian Peters <[email protected]> wrote:
> Hi Israel,
>
> That was it, many thanks! I had it set to "${java.io.tmpdir}". Is the
> requirement to use a GCS location documented somewhere?
>
> cheers
>
> Fabian
>
>> On 31.08.2022 at 11:25, Israel Herraiz via users <[email protected]> wrote:
>>
>> What are the command line arguments that you are using for those direct
>> runner pipelines? For instance, for BigQuery you will need to set
>> --tempLocation to a GCS location for the BQ jobs to work.
>>
>> On Wed, 31 Aug 2022 at 09:50, Fabian Peters <[email protected]> wrote:
>> Good morning!
>>
>> I'm putting together my Dataflow deployment and am running into another
>> problem I don't know how to deal with: I'm running a pipeline via Dataflow
>> which contains a "Workflow executor" transform. The workflow contains a
>> number of pipelines that have their run configuration set to Beam-Direct.
>> In principle, this works fine. (Yeah!)
>>
>> However, in this setup a BigQuery Output fails with a
>> "java.lang.RuntimeException: Failed to create job with prefix
>> beam_bq_job_LOAD_sites_FOO_ID, reached max retries: 3, last failed job:
>> null."
>> I see the same when running just the pipeline (or any other with BigQuery
>> Output) via Beam-Direct locally, which makes me think that the GCP
>> credentials are not being picked up. Is there something I need to configure?
>>
>> cheers
>>
>> Fabian
>>
>> P.S.: Logs from running locally with Beam-Direct:
>>
>> 2022/08/31 09:30:07 - sites - ERROR: Error starting the Beam pipeline
>> 2022/08/31 09:30:07 - sites - ERROR: org.apache.hop.core.exception.HopException:
>> 2022/08/31 09:30:07 - sites - Error executing pipeline with runner Direct
>> 2022/08/31 09:30:07 - sites - java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites -
>> 2022/08/31 09:30:07 - sites - at org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:258)
>> 2022/08/31 09:30:07 - sites - at org.apache.hop.beam.engines.BeamPipelineEngine.lambda$startThreads$0(BeamPipelineEngine.java:305)
>> 2022/08/31 09:30:07 - sites - at java.base/java.lang.Thread.run(Thread.java:829)
>> 2022/08/31 09:30:07 - sites - Caused by: org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites - at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:373)
>> 2022/08/31 09:30:07 - sites - at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:341)
>> 2022/08/31 09:30:07 - sites - at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:218)
>> 2022/08/31 09:30:07 - sites - at org.apache.hop.beam.engines.BeamPipelineEngine.executePipeline(BeamPipelineEngine.java:246)
>> 2022/08/31 09:30:07 - sites - ... 2 more
>> 2022/08/31 09:30:07 - sites - Caused by: java.lang.RuntimeException: Failed to create job with prefix beam_bq_job_LOAD_sites_65dba39290c04240933e3a982c0c5699_b77cb1586fc969929097729a4a6cdf2a_00001_00000, reached max retries: 3, last failed job: null.
>> 2022/08/31 09:30:07 - sites - at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJob.runJob(BigQueryHelpers.java:199)
>> 2022/08/31 09:30:07 - sites - at org.apache.beam.sdk.io.gcp.bigquery.BigQueryHelpers$PendingJobManager.waitForDone(BigQueryHelpers.java:152)
>> 2022/08/31 09:30:07 - sites - at org.apache.beam.sdk.io.gcp.bigquery.WriteTables$WriteTablesDoFn.finishBundle(WriteTables.java:380)
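
The resolution Israel describes in the quoted thread (BigQueryIO uses gcpTempLocation if set, otherwise falls back to tempLocation, and for FILE_LOADS that location must be in GCS) can be sketched as a small standalone snippet. This is an illustration of the rule only, not Beam's actual implementation; the class and method names are made up:

```java
// Standalone sketch (not Beam source code): with FILE_LOADS, BigQueryIO
// stages Avro files at gcpTempLocation if set, otherwise at tempLocation,
// and the BigQuery load job can only read those files from a gs:// path.
public class BqStagingLocation {
    static String resolve(String tempLocation, String gcpTempLocation) {
        String loc = (gcpTempLocation != null && !gcpTempLocation.isEmpty())
                ? gcpTempLocation
                : tempLocation;
        if (loc == null || !loc.startsWith("gs://")) {
            throw new IllegalArgumentException(
                "BigQuery FILE_LOADS needs a gs:// staging path, got: " + loc);
        }
        return loc;
    }

    public static void main(String[] args) {
        // gcpTempLocation wins when both are set.
        System.out.println(resolve("/tmp/pipeline", "gs://bucket/bq-tmp"));
        // Falls back to tempLocation when gcpTempLocation is unset.
        System.out.println(resolve("gs://bucket/tmp", null));
    }
}
```

A local default such as "${java.io.tmpdir}" fails the gs:// check here, which matches the "Failed to create job" error seen in the logs above.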

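For pipelines launched directly from the command line (rather than through Hop's run configuration UI), the split between the two options Israel mentions might look like the following. The jar name and bucket paths are placeholders, not values from this thread:

```shell
# Hypothetical invocation: --tempLocation serves the pipeline generally,
# while --gcpTempLocation dedicates a GCS path to BigQuery load-job staging.
# BigQueryIO prefers --gcpTempLocation and falls back to --tempLocation.
java -jar my-beam-pipeline-bundled.jar \
  --runner=DirectRunner \
  --tempLocation=gs://my-bucket/tmp \
  --gcpTempLocation=gs://my-bucket/bq-staging
```

When only the Hop UI is available, only tempLocation can currently be set (hence Hans's ticket to expose gcpTempLocation), so that single value must be a GCS path for BigQuery Output to work.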