[jira] [Updated] (BEAM-11742) ParquetSink fails for nullable fields
[ https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai updated BEAM-11742: --- Fix Version/s: 2.30.0 Assignee: Wenbing Bai Resolution: Fixed Status: Resolved (was: Open) https://github.com/apache/beam/commit/c97e17eba04e56f9f7bdcec991e179c003ae128b#diff-33b0b6b112036df96f341aa83b88efba9215ec14dfabc9db9e9ffe66a23154a2 > ParquetSink fails for nullable fields > - > > Key: BEAM-11742 > URL: https://issues.apache.org/jira/browse/BEAM-11742 > Project: Beam > Issue Type: Improvement > Components: io-py-parquet > Reporter: Wenbing Bai > Assignee: Wenbing Bai > Priority: P3 > Fix For: 2.30.0 > > Time Spent: 2h 50m > Remaining Estimate: 0h > > Before pyarrow 0.15 it was not possible to create a pyarrow record batch with a schema, so in apache_beam.io.parquetio._ParquetSink the record batch is created with > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, self._schema.names){code} > An error is then raised because the parquet table to be created (the record batch schema) has a different schema from the one specified (self._schema). For example, when the specified schema marks a field as "is not null" but the record batch schema does not, the error is raised. > The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays: > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (BEAM-11742) ParquetSink fails for nullable fields
[ https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-11742 started by Wenbing Bai. -- > ParquetSink fails for nullable fields > - > > Key: BEAM-11742 > URL: https://issues.apache.org/jira/browse/BEAM-11742 > Project: Beam > Issue Type: Improvement > Components: io-py-parquet > Reporter: Wenbing Bai > Assignee: Wenbing Bai > Priority: P2 > Time Spent: 2h 50m > Remaining Estimate: 0h > > Before pyarrow 0.15 it was not possible to create a pyarrow record batch with a schema, so in apache_beam.io.parquetio._ParquetSink the record batch is created with > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, self._schema.names){code} > An error is then raised because the parquet table to be created (the record batch schema) has a different schema from the one specified (self._schema). For example, when the specified schema marks a field as "is not null" but the record batch schema does not, the error is raised. > The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays: > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-11742) ParquetSink fails for nullable fields
[ https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai reassigned BEAM-11742: -- Assignee: Wenbing Bai > ParquetSink fails for nullable fields > - > > Key: BEAM-11742 > URL: https://issues.apache.org/jira/browse/BEAM-11742 > Project: Beam > Issue Type: Improvement > Components: io-py-parquet > Reporter: Wenbing Bai > Assignee: Wenbing Bai > Priority: P2 > Time Spent: 2h 50m > Remaining Estimate: 0h > > Before pyarrow 0.15 it was not possible to create a pyarrow record batch with a schema, so in apache_beam.io.parquetio._ParquetSink the record batch is created with > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, self._schema.names){code} > An error is then raised because the parquet table to be created (the record batch schema) has a different schema from the one specified (self._schema). For example, when the specified schema marks a field as "is not null" but the record batch schema does not, the error is raised. > The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays: > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (BEAM-11742) ParquetSink fails for nullable fields
[ https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-11742 started by Wenbing Bai. -- > ParquetSink fails for nullable fields > - > > Key: BEAM-11742 > URL: https://issues.apache.org/jira/browse/BEAM-11742 > Project: Beam > Issue Type: Improvement > Components: io-py-parquet > Reporter: Wenbing Bai > Assignee: Wenbing Bai > Priority: P2 > Time Spent: 2h 50m > Remaining Estimate: 0h > > Before pyarrow 0.15 it was not possible to create a pyarrow record batch with a schema, so in apache_beam.io.parquetio._ParquetSink the record batch is created with > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, self._schema.names){code} > An error is then raised because the parquet table to be created (the record batch schema) has a different schema from the one specified (self._schema). For example, when the specified schema marks a field as "is not null" but the record batch schema does not, the error is raised. > The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays: > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-11742) Use schema when creating record batch in ParquetSink
[ https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai reassigned BEAM-11742: -- Assignee: Wenbing Bai > Use schema when creating record batch in ParquetSink > > > Key: BEAM-11742 > URL: https://issues.apache.org/jira/browse/BEAM-11742 > Project: Beam > Issue Type: Improvement > Components: io-py-parquet > Reporter: Wenbing Bai > Assignee: Wenbing Bai > Priority: P2 > > Before pyarrow 0.15 it was not possible to create a pyarrow record batch with a schema, so in apache_beam.io.parquetio._ParquetSink the record batch is created with > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, self._schema.names){code} > An error is then raised because the parquet table to be created (the record batch schema) has a different schema from the one specified (self._schema). For example, when the specified schema marks a field as "is not null" but the record batch schema does not, the error is raised. > The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays: > {code:python} > rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-11742) Use schema when creating record batch in ParquetSink
Wenbing Bai created BEAM-11742: -- Summary: Use schema when creating record batch in ParquetSink Key: BEAM-11742 URL: https://issues.apache.org/jira/browse/BEAM-11742 Project: Beam Issue Type: Improvement Components: io-py-parquet Reporter: Wenbing Bai Assignee: Wenbing Bai Before pyarrow 0.15 it was not possible to create a pyarrow record batch with a schema, so in apache_beam.io.parquetio._ParquetSink the record batch is created with {code:python} rb = pa.RecordBatch.from_arrays(arrays, self._schema.names){code} An error is then raised because the parquet table to be created (the record batch schema) has a different schema from the one specified (self._schema). For example, when the specified schema marks a field as "is not null" but the record batch schema does not, the error is raised. The fix is to pass the schema instead of the names to pa.RecordBatch.from_arrays: {code:python} rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs
[ https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249372#comment-17249372 ] Wenbing Bai commented on BEAM-11277: Any progress on this ticket? > WriteToBigQuery with batch file loads does not respect schema update options > when there are multiple load jobs > -- > > Key: BEAM-11277 > URL: https://issues.apache.org/jira/browse/BEAM-11277 > Project: Beam > Issue Type: Bug > Components: io-py-gcp, runner-dataflow >Affects Versions: 2.21.0, 2.25.0 >Reporter: Chun Yang >Priority: P2 > Attachments: repro.py > > > When multiple load jobs are needed to write data to a destination table, > e.g., when the data is spread over more than > [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, > WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and > then copy the temporary tables into the destination table. > When WriteToBigQuery is used with > {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and > {{additional_bq_parameters=\{"schemaUpdateOptions": > ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by > the jobs that copy data from temporary tables into the destination table. The > effect is that for small jobs (<10K source URIs), schema field addition is > allowed, however, if the job is scaled to >10K source URIs, then schema field > addition will fail with an error such as: > {code:none}Provided Schema does not match Table project:dataset.table. Cannot > add fields (field: field_name){code} > I've been able to reproduce this issue with Python 3.7 and DataflowRunner on > Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with > DirectRunner. A minimal reproducible example is attached. -- This message was sent by Atlassian Jira (v8.3.4#803005)
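The configuration that triggers BEAM-11277 is a plain FILE_LOADS write with schema update options. A sketch of the write step under stated assumptions ("project:dataset.table" is a placeholder; the surrounding pipeline, schema, and credentials are elided, so this is a configuration fragment rather than a runnable job):

```python
import apache_beam as beam

def write_step(rows):
    # Sketch of the write described above. With >10K source URIs, Beam
    # loads into temporary tables and then copies them to the destination;
    # per the report, those copy jobs do not carry schemaUpdateOptions,
    # so ALLOW_FIELD_ADDITION is silently dropped and field addition fails.
    return rows | beam.io.WriteToBigQuery(
        table="project:dataset.table",  # placeholder destination
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        additional_bq_parameters={
            "schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]},
    )
```

The same configuration behaves correctly for small jobs (<10K URIs), which is why the bug only surfaces at scale.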
[jira] [Resolved] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner
[ https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai resolved BEAM-8965. --- Fix Version/s: 2.20.0 Assignee: Wenbing Bai Resolution: Fixed > WriteToBigQuery failed in BundleBasedDirectRunner > - > > Key: BEAM-8965 > URL: https://issues.apache.org/jira/browse/BEAM-8965 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0 > Reporter: Wenbing Bai > Assignee: Wenbing Bai > Priority: P2 > Fix For: 2.20.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > *{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error > {{PCollection of size 2 with more than one element accessed as a singleton view.}} > Here is the code > {code:python} > with Pipeline() as p: > query_results = ( > p > | beam.io.Read(beam.io.BigQuerySource( > query='SELECT ... FROM ...')) > ) > query_results | beam.io.gcp.WriteToBigQuery( > table=, > method=WriteToBigQuery.Method.FILE_LOADS, > schema={"fields": []} > ) > {code} > Here is the error > {code:none} > File "apache_beam/runners/common.py", line 778, in apache_beam.runners.common.DoFnRunner.process > def process(self, windowed_value): > File "apache_beam/runners/common.py", line 782, in apache_beam.runners.common.DoFnRunner.process > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 849, in apache_beam.runners.common.DoFnRunner._reraise_augmented > raise_with_traceback(new_exn) > File "apache_beam/runners/common.py", line 780, in apache_beam.runners.common.DoFnRunner.process > return self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.PerWindowInvoker.invoke_process > self._invoke_process_per_window( > File "apache_beam/runners/common.py", line 610, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window > [si[global_window] for si in self.side_inputs])) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py", line 65, in __getitem__ > _FilteringIterable(self._iterable, target_window), self._view_options) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", line 443, in _from_runtime_iterable > len(head), str(head[0]), str(head[1]))) > ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9457) Allow WriteToBigQuery with external data resource
[ https://issues.apache.org/jira/browse/BEAM-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai updated BEAM-9457: -- Status: Open (was: Triage Needed) > Allow WriteToBigQuery with external data resource > - > > Key: BEAM-9457 > URL: https://issues.apache.org/jira/browse/BEAM-9457 > Project: Beam > Issue Type: New Feature > Components: io-py-gcp > Reporter: Wenbing Bai > Priority: Major > > Create another WriteToBigQuery.Method that lets users write to BigQuery with an external data source such as GCS, instead of loading the data into BigQuery. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9457) Allow WriteToBigQuery with external data resource
Wenbing Bai created BEAM-9457: - Summary: Allow WriteToBigQuery with external data resource Key: BEAM-9457 URL: https://issues.apache.org/jira/browse/BEAM-9457 Project: Beam Issue Type: New Feature Components: io-py-gcp Reporter: Wenbing Bai Create another WriteToBigQuery.Method that lets users write to BigQuery with an external data source such as GCS, instead of loading the data into BigQuery. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner
[ https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-8965 started by Wenbing Bai. - > WriteToBigQuery failed in BundleBasedDirectRunner > - > > Key: BEAM-8965 > URL: https://issues.apache.org/jira/browse/BEAM-8965 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0 > Reporter: Wenbing Bai > Assignee: Wenbing Bai > Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > *{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error > {{PCollection of size 2 with more than one element accessed as a singleton view.}} > Here is the code > {code:python} > with Pipeline() as p: > query_results = ( > p > | beam.io.Read(beam.io.BigQuerySource( > query='SELECT ... FROM ...')) > ) > query_results | beam.io.gcp.WriteToBigQuery( > table=, > method=WriteToBigQuery.Method.FILE_LOADS, > schema={"fields": []} > ) > {code} > Here is the error > {code:none} > File "apache_beam/runners/common.py", line 778, in apache_beam.runners.common.DoFnRunner.process > def process(self, windowed_value): > File "apache_beam/runners/common.py", line 782, in apache_beam.runners.common.DoFnRunner.process > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 849, in apache_beam.runners.common.DoFnRunner._reraise_augmented > raise_with_traceback(new_exn) > File "apache_beam/runners/common.py", line 780, in apache_beam.runners.common.DoFnRunner.process > return self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.PerWindowInvoker.invoke_process > self._invoke_process_per_window( > File "apache_beam/runners/common.py", line 610, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window > [si[global_window] for si in self.side_inputs])) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py", line 65, in __getitem__ > _FilteringIterable(self._iterable, target_window), self._view_options) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", line 443, in _from_runtime_iterable > len(head), str(head[0]), str(head[1]))) > ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner
[ https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai updated BEAM-8965: -- Affects Version/s: 2.17.0 2.18.0 2.19.0 > WriteToBigQuery failed in BundleBasedDirectRunner > - > > Key: BEAM-8965 > URL: https://issues.apache.org/jira/browse/BEAM-8965 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0 > Reporter: Wenbing Bai > Assignee: Wenbing Bai > Priority: Major > > *{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error > {{PCollection of size 2 with more than one element accessed as a singleton view.}} > Here is the code > {code:python} > with Pipeline() as p: > query_results = ( > p > | beam.io.Read(beam.io.BigQuerySource( > query='SELECT ... FROM ...')) > ) > query_results | beam.io.gcp.WriteToBigQuery( > table=, > method=WriteToBigQuery.Method.FILE_LOADS, > schema={"fields": []} > ) > {code} > Here is the error > {code:none} > File "apache_beam/runners/common.py", line 778, in apache_beam.runners.common.DoFnRunner.process > def process(self, windowed_value): > File "apache_beam/runners/common.py", line 782, in apache_beam.runners.common.DoFnRunner.process > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 849, in apache_beam.runners.common.DoFnRunner._reraise_augmented > raise_with_traceback(new_exn) > File "apache_beam/runners/common.py", line 780, in apache_beam.runners.common.DoFnRunner.process > return self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.PerWindowInvoker.invoke_process > self._invoke_process_per_window( > File "apache_beam/runners/common.py", line 610, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window > [si[global_window] for si in self.side_inputs])) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py", line 65, in __getitem__ > _FilteringIterable(self._iterable, target_window), self._view_options) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", line 443, in _from_runtime_iterable > len(head), str(head[0]), str(head[1]))) > ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner
[ https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai reassigned BEAM-8965: - Assignee: Wenbing Bai > WriteToBigQuery failed in BundleBasedDirectRunner > - > > Key: BEAM-8965 > URL: https://issues.apache.org/jira/browse/BEAM-8965 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.16.0 > Reporter: Wenbing Bai > Assignee: Wenbing Bai > Priority: Major > > *{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error > {{PCollection of size 2 with more than one element accessed as a singleton view.}} > Here is the code > {code:python} > with Pipeline() as p: > query_results = ( > p > | beam.io.Read(beam.io.BigQuerySource( > query='SELECT ... FROM ...')) > ) > query_results | beam.io.gcp.WriteToBigQuery( > table=, > method=WriteToBigQuery.Method.FILE_LOADS, > schema={"fields": []} > ) > {code} > Here is the error > {code:none} > File "apache_beam/runners/common.py", line 778, in apache_beam.runners.common.DoFnRunner.process > def process(self, windowed_value): > File "apache_beam/runners/common.py", line 782, in apache_beam.runners.common.DoFnRunner.process > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 849, in apache_beam.runners.common.DoFnRunner._reraise_augmented > raise_with_traceback(new_exn) > File "apache_beam/runners/common.py", line 780, in apache_beam.runners.common.DoFnRunner.process > return self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.PerWindowInvoker.invoke_process > self._invoke_process_per_window( > File "apache_beam/runners/common.py", line 610, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window > [si[global_window] for si in self.side_inputs])) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py", line 65, in __getitem__ > _FilteringIterable(self._iterable, target_window), self._view_options) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", line 443, in _from_runtime_iterable > len(head), str(head[0]), str(head[1]))) > ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner
[ https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai updated BEAM-8965: -- Status: Open (was: Triage Needed) > WriteToBigQuery failed in BundleBasedDirectRunner > - > > Key: BEAM-8965 > URL: https://issues.apache.org/jira/browse/BEAM-8965 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.16.0 > Reporter: Wenbing Bai > Priority: Major > > *{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error > {{PCollection of size 2 with more than one element accessed as a singleton view.}} > Here is the code > {code:python} > with Pipeline() as p: > query_results = ( > p > | beam.io.Read(beam.io.BigQuerySource( > query='SELECT ... FROM ...')) > ) > query_results | beam.io.gcp.WriteToBigQuery( > table=, > method=WriteToBigQuery.Method.FILE_LOADS, > schema={"fields": []} > ) > {code} > Here is the error > {code:none} > File "apache_beam/runners/common.py", line 778, in apache_beam.runners.common.DoFnRunner.process > def process(self, windowed_value): > File "apache_beam/runners/common.py", line 782, in apache_beam.runners.common.DoFnRunner.process > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 849, in apache_beam.runners.common.DoFnRunner._reraise_augmented > raise_with_traceback(new_exn) > File "apache_beam/runners/common.py", line 780, in apache_beam.runners.common.DoFnRunner.process > return self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.PerWindowInvoker.invoke_process > self._invoke_process_per_window( > File "apache_beam/runners/common.py", line 610, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window > [si[global_window] for si in self.side_inputs])) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py", line 65, in __getitem__ > _FilteringIterable(self._iterable, target_window), self._view_options) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", line 443, in _from_runtime_iterable > len(head), str(head[0]), str(head[1]))) > ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner
[ https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995849#comment-16995849 ] Wenbing Bai commented on BEAM-8965: --- I did a small investigation; I think the singleton object is somehow evaluated twice in BundleBasedDirectRunner. Here is the singleton object: [https://github.com/apache/beam/blob/de30361359b70e9fe9729f0f3d52f6c6e8462cfb/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L802] And it is used twice in [https://github.com/apache/beam/blob/de30361359b70e9fe9729f0f3d52f6c6e8462cfb/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L656] *Also note*: this happens only in BundleBasedDirectRunner; DataflowRunner and FnApiRunner are fine. > WriteToBigQuery failed in BundleBasedDirectRunner > - > > Key: BEAM-8965 > URL: https://issues.apache.org/jira/browse/BEAM-8965 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.16.0 > Reporter: Wenbing Bai > Priority: Major > > *{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error > {{PCollection of size 2 with more than one element accessed as a singleton view.}} > Here is the code > {code:python} > with Pipeline() as p: > query_results = ( > p > | beam.io.Read(beam.io.BigQuerySource( > query='SELECT ... FROM ...')) > ) > query_results | beam.io.gcp.WriteToBigQuery( > table=, > method=WriteToBigQuery.Method.FILE_LOADS, > schema={"fields": []} > ) > {code} > Here is the error > {code:none} > File "apache_beam/runners/common.py", line 778, in apache_beam.runners.common.DoFnRunner.process > def process(self, windowed_value): > File "apache_beam/runners/common.py", line 782, in apache_beam.runners.common.DoFnRunner.process > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 849, in apache_beam.runners.common.DoFnRunner._reraise_augmented > raise_with_traceback(new_exn) > File "apache_beam/runners/common.py", line 780, in apache_beam.runners.common.DoFnRunner.process > return self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.PerWindowInvoker.invoke_process > self._invoke_process_per_window( > File "apache_beam/runners/common.py", line 610, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window > [si[global_window] for si in self.side_inputs])) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py", line 65, in __getitem__ > _FilteringIterable(self._iterable, target_window), self._view_options) > File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", line 443, in _from_runtime_iterable > len(head), str(head[0]), str(head[1]))) > ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner
Wenbing Bai created BEAM-8965: - Summary: WriteToBigQuery failed in BundleBasedDirectRunner Key: BEAM-8965 URL: https://issues.apache.org/jira/browse/BEAM-8965 Project: Beam Issue Type: Bug Components: io-py-gcp Affects Versions: 2.16.0 Reporter: Wenbing Bai *{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error {{PCollection of size 2 with more than one element accessed as a singleton view.}} Here is the code {code:python} with Pipeline() as p: query_results = ( p | beam.io.Read(beam.io.BigQuerySource( query='SELECT ... FROM ...')) ) query_results | beam.io.gcp.WriteToBigQuery( table=, method=WriteToBigQuery.Method.FILE_LOADS, schema={"fields": []} ) {code} Here is the error {code:none} File "apache_beam/runners/common.py", line 778, in apache_beam.runners.common.DoFnRunner.process def process(self, windowed_value): File "apache_beam/runners/common.py", line 782, in apache_beam.runners.common.DoFnRunner.process self._reraise_augmented(exn) File "apache_beam/runners/common.py", line 849, in apache_beam.runners.common.DoFnRunner._reraise_augmented raise_with_traceback(new_exn) File "apache_beam/runners/common.py", line 780, in apache_beam.runners.common.DoFnRunner.process return self.do_fn_invoker.invoke_process(windowed_value) File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.PerWindowInvoker.invoke_process self._invoke_process_per_window( File "apache_beam/runners/common.py", line 610, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window [si[global_window] for si in self.side_inputs])) File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py", line 65, in __getitem__ _FilteringIterable(self._iterable, target_window), self._view_options) File "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py", line 443, in _from_runtime_iterable len(head), str(head[0]), str(head[1]))) ValueError: PCollection of size 2 with more than one element accessed as a singleton view. First two elements encountered are "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs
[ https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai updated BEAM-8186: -- Status: Open (was: Triage Needed) > WaitForBQJobs doesn't work for RUNNING jobs > --- > > Key: BEAM-8186 > URL: https://issues.apache.org/jira/browse/BEAM-8186 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.13.0 > Reporter: Wenbing Bai > Priority: Major > > Python SDK > WaitForBQJobs doesn't work for a job in the RUNNING state -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs
[ https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai updated BEAM-8186: -- Status: Triage Needed (was: Open) > WaitForBQJobs doesn't work for RUNNING jobs > --- > > Key: BEAM-8186 > URL: https://issues.apache.org/jira/browse/BEAM-8186 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.13.0 > Reporter: Wenbing Bai > Priority: Major > > Python SDK > WaitForBQJobs doesn't work for a job in the RUNNING state -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs
[ https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai updated BEAM-8186: -- Status: Open (was: Triage Needed) > WaitForBQJobs doesn't work for RUNNING jobs > --- > > Key: BEAM-8186 > URL: https://issues.apache.org/jira/browse/BEAM-8186 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.13.0 > Reporter: Wenbing Bai > Priority: Major > > Python SDK > WaitForBQJobs doesn't work for a job in the RUNNING state -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs
[ https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenbing Bai updated BEAM-8186: -- Description: Python SDK WaitForBQJobs doesn't work for a job in the RUNNING state was: Python SDK WaitForBQJobs doesn't work for Job which has RUNNING status > WaitForBQJobs doesn't work for RUNNING jobs > --- > > Key: BEAM-8186 > URL: https://issues.apache.org/jira/browse/BEAM-8186 > Project: Beam > Issue Type: Bug > Components: io-py-gcp > Affects Versions: 2.13.0 > Reporter: Wenbing Bai > Priority: Major > > Python SDK > WaitForBQJobs doesn't work for a job in the RUNNING state -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Created] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs
Wenbing Bai created BEAM-8186: - Summary: WaitForBQJobs doesn't work for RUNNING jobs Key: BEAM-8186 URL: https://issues.apache.org/jira/browse/BEAM-8186 Project: Beam Issue Type: Bug Components: io-py-gcp Affects Versions: 2.13.0 Reporter: Wenbing Bai Python SDK WaitForBQJobs doesn't work for a job with RUNNING status -- This message was sent by Atlassian Jira (v8.3.2#803003)