[jira] [Updated] (BEAM-11742) ParquetSink fails for nullable fields

2021-09-07 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai updated BEAM-11742:
---
Fix Version/s: 2.30.0
 Assignee: Wenbing Bai
   Resolution: Fixed
   Status: Resolved  (was: Open)

https://github.com/apache/beam/commit/c97e17eba04e56f9f7bdcec991e179c003ae128b#diff-33b0b6b112036df96f341aa83b88efba9215ec14dfabc9db9e9ffe66a23154a2

> ParquetSink fails for nullable fields
> -
>
> Key: BEAM-11742
> URL: https://issues.apache.org/jira/browse/BEAM-11742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-parquet
>Reporter: Wenbing Bai
>Assignee: Wenbing Bai
>Priority: P3
> Fix For: 2.30.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Before pyarrow 0.15, it was not possible to create a pyarrow record batch 
> with an explicit schema.
> So in apache_beam.io.parquetio._ParquetSink, when creating the pyarrow record 
> batch, we used
> {code:python}
> rb = pa.RecordBatch.from_arrays(arrays, self._schema.names){code}
> An error is raised because the record batch to be written has a different 
> schema from the one specified (self._schema).
> For example, when the specified schema marks a field as "not null" but the 
> inferred record batch schema does not, the error is raised.
> The fix is to pass schema instead of names to pa.RecordBatch.from_arrays:
> {code:python}
> rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-11742) ParquetSink fails for nullable fields

2021-05-17 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-11742 started by Wenbing Bai.
--
> ParquetSink fails for nullable fields
> -
>
> Key: BEAM-11742
> URL: https://issues.apache.org/jira/browse/BEAM-11742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-parquet
>Reporter: Wenbing Bai
>Assignee: Wenbing Bai
>Priority: P2
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>





[jira] [Assigned] (BEAM-11742) ParquetSink fails for nullable fields

2021-05-17 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai reassigned BEAM-11742:
--

Assignee: Wenbing Bai

> ParquetSink fails for nullable fields
> -
>
> Key: BEAM-11742
> URL: https://issues.apache.org/jira/browse/BEAM-11742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-parquet
>Reporter: Wenbing Bai
>Assignee: Wenbing Bai
>Priority: P2
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>





[jira] [Work started] (BEAM-11742) ParquetSink fails for nullable fields

2021-04-07 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-11742 started by Wenbing Bai.
--
> ParquetSink fails for nullable fields
> -
>
> Key: BEAM-11742
> URL: https://issues.apache.org/jira/browse/BEAM-11742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-parquet
>Reporter: Wenbing Bai
>Assignee: Wenbing Bai
>Priority: P2
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>





[jira] [Assigned] (BEAM-11742) Use schema when creating record batch in ParquetSink

2021-03-12 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai reassigned BEAM-11742:
--

Assignee: Wenbing Bai

> Use schema when creating record batch in ParquetSink
> 
>
> Key: BEAM-11742
> URL: https://issues.apache.org/jira/browse/BEAM-11742
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-parquet
>Reporter: Wenbing Bai
>Assignee: Wenbing Bai
>Priority: P2
>





[jira] [Created] (BEAM-11742) Use schema when creating record batch in ParquetSink

2021-02-02 Thread Wenbing Bai (Jira)
Wenbing Bai created BEAM-11742:
--

 Summary: Use schema when creating record batch in ParquetSink
 Key: BEAM-11742
 URL: https://issues.apache.org/jira/browse/BEAM-11742
 Project: Beam
  Issue Type: Improvement
  Components: io-py-parquet
Reporter: Wenbing Bai
Assignee: Wenbing Bai


Before pyarrow 0.15, it was not possible to create a pyarrow record batch with 
an explicit schema.

So in apache_beam.io.parquetio._ParquetSink, when creating the pyarrow record 
batch, we used

{code:python}
rb = pa.RecordBatch.from_arrays(arrays, self._schema.names){code}
An error is raised because the record batch to be written has a different 
schema from the one specified (self._schema).

For example, when the specified schema marks a field as "not null" but the 
inferred record batch schema does not, the error is raised.

The fix is to pass schema instead of names to pa.RecordBatch.from_arrays:
{code:python}
rb = pa.RecordBatch.from_arrays(arrays, schema=self._schema){code}
 





[jira] [Commented] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2020-12-14 Thread Wenbing Bai (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249372#comment-17249372
 ] 

Wenbing Bai commented on BEAM-11277:


Any progress on this ticket? 

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.25.0
>Reporter: Chun Yang
>Priority: P2
> Attachments: repro.py
>
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.





[jira] [Resolved] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner

2020-06-10 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai resolved BEAM-8965.
---
Fix Version/s: 2.20.0
 Assignee: Wenbing Bai
   Resolution: Fixed

> WriteToBigQuery failed in BundleBasedDirectRunner
> -
>
> Key: BEAM-8965
> URL: https://issues.apache.org/jira/browse/BEAM-8965
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Wenbing Bai
>Assignee: Wenbing Bai
>Priority: P2
> Fix For: 2.20.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> *{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error 
> {{PCollection of size 2 with more than one element accessed as a singleton 
> view}}.
> Here is the code
>  
> {code:python}
> with Pipeline() as p:
>     query_results = (
>         p
>         | beam.io.Read(beam.io.BigQuerySource(
>             query='SELECT ... FROM ...')))
>     query_results | beam.io.gcp.WriteToBigQuery(
>         table=,
>         method=WriteToBigQuery.Method.FILE_LOADS,
>         schema={"fields": []})
> {code}
>  
> Here is the error
>  
> {code:none}
>   File "apache_beam/runners/common.py", line 778, in 
> apache_beam.runners.common.DoFnRunner.process
>     def process(self, windowed_value):
>   File "apache_beam/runners/common.py", line 782, in 
> apache_beam.runners.common.DoFnRunner.process
>     self._reraise_augmented(exn)
>   File "apache_beam/runners/common.py", line 849, in 
> apache_beam.runners.common.DoFnRunner._reraise_augmented
>     raise_with_traceback(new_exn)
>   File "apache_beam/runners/common.py", line 780, in 
> apache_beam.runners.common.DoFnRunner.process
>     return self.do_fn_invoker.invoke_process(windowed_value)
>   File "apache_beam/runners/common.py", line 587, in 
> apache_beam.runners.common.PerWindowInvoker.invoke_process
>     self._invoke_process_per_window(
>   File "apache_beam/runners/common.py", line 610, in 
> apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
>     [si[global_window] for si in self.side_inputs]))
>   File 
> "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py",
>  line 65, in __getitem__
>     _FilteringIterable(self._iterable, target_window), self._view_options)
>   File 
> "/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py",
>  line 443, in _from_runtime_iterable
>     len(head), str(head[0]), str(head[1])))
> ValueError: PCollection of size 2 with more than one element accessed as a 
> singleton view. First two elements encountered are 
> "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", 
> "gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 
> 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)']
> {code}
>  
>  
>  
>  





[jira] [Updated] (BEAM-9457) Allow WriteToBigQuery with external data resource

2020-03-05 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai updated BEAM-9457:
--
Status: Open  (was: Triage Needed)

> Allow WriteToBigQuery with external data resource
> -
>
> Key: BEAM-9457
> URL: https://issues.apache.org/jira/browse/BEAM-9457
> Project: Beam
>  Issue Type: New Feature
>  Components: io-py-gcp
>Reporter: Wenbing Bai
>Priority: Major
>
> Create another WriteToBigQuery.Method that lets users write to BigQuery 
> with an external data source such as GCS, instead of loading the data into 
> BigQuery.





[jira] [Created] (BEAM-9457) Allow WriteToBigQuery with external data resource

2020-03-05 Thread Wenbing Bai (Jira)
Wenbing Bai created BEAM-9457:
-

 Summary: Allow WriteToBigQuery with external data resource
 Key: BEAM-9457
 URL: https://issues.apache.org/jira/browse/BEAM-9457
 Project: Beam
  Issue Type: New Feature
  Components: io-py-gcp
Reporter: Wenbing Bai


Create another WriteToBigQuery.Method that lets users write to BigQuery with 
an external data source such as GCS, instead of loading the data into BigQuery.





[jira] [Work started] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner

2020-02-19 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-8965 started by Wenbing Bai.
-
> WriteToBigQuery failed in BundleBasedDirectRunner
> -
>
> Key: BEAM-8965
> URL: https://issues.apache.org/jira/browse/BEAM-8965
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Wenbing Bai
>Assignee: Wenbing Bai
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>





[jira] [Updated] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner

2020-02-18 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai updated BEAM-8965:
--
Affects Version/s: 2.17.0
   2.18.0
   2.19.0

> WriteToBigQuery failed in BundleBasedDirectRunner
> -
>
> Key: BEAM-8965
> URL: https://issues.apache.org/jira/browse/BEAM-8965
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Wenbing Bai
>Assignee: Wenbing Bai
>Priority: Major
>





[jira] [Assigned] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner

2020-02-14 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai reassigned BEAM-8965:
-

Assignee: Wenbing Bai

> WriteToBigQuery failed in BundleBasedDirectRunner
> -
>
> Key: BEAM-8965
> URL: https://issues.apache.org/jira/browse/BEAM-8965
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.16.0
>Reporter: Wenbing Bai
>Assignee: Wenbing Bai
>Priority: Major
>





[jira] [Updated] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner

2019-12-13 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai updated BEAM-8965:
--
Status: Open  (was: Triage Needed)

> WriteToBigQuery failed in BundleBasedDirectRunner
> -
>
> Key: BEAM-8965
> URL: https://issues.apache.org/jira/browse/BEAM-8965
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.16.0
>Reporter: Wenbing Bai
>Priority: Major
>





[jira] [Commented] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner

2019-12-13 Thread Wenbing Bai (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995849#comment-16995849
 ] 

Wenbing Bai commented on BEAM-8965:
---

I did a small investigation; I think the singleton object is somehow evaluated 
twice in BundleBasedDirectRunner.

Here is the singleton object: 
[https://github.com/apache/beam/blob/de30361359b70e9fe9729f0f3d52f6c6e8462cfb/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L802]

And the singleton object is used twice in: 
[https://github.com/apache/beam/blob/de30361359b70e9fe9729f0f3d52f6c6e8462cfb/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L656]

*Also note*: this only affects BundleBasedDirectRunner; it works fine on 
DataflowRunner and FnApiRunner.

 

> WriteToBigQuery failed in BundleBasedDirectRunner
> -
>
> Key: BEAM-8965
> URL: https://issues.apache.org/jira/browse/BEAM-8965
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.16.0
>Reporter: Wenbing Bai
>Priority: Major
>





[jira] [Created] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner

2019-12-13 Thread Wenbing Bai (Jira)
Wenbing Bai created BEAM-8965:
-

 Summary: WriteToBigQuery failed in BundleBasedDirectRunner
 Key: BEAM-8965
 URL: https://issues.apache.org/jira/browse/BEAM-8965
 Project: Beam
  Issue Type: Bug
  Components: io-py-gcp
Affects Versions: 2.16.0
Reporter: Wenbing Bai


*{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error 
{{PCollection of size 2 with more than one element accessed as a singleton view}}.

Here is the code

 
{code:python}
with Pipeline() as p:
    query_results = (
        p
        | beam.io.Read(beam.io.BigQuerySource(
            query='SELECT ... FROM ...')))
    query_results | beam.io.gcp.WriteToBigQuery(
        table=,
        method=WriteToBigQuery.Method.FILE_LOADS,
        schema={"fields": []})
{code}
 

Here is the error

 
{code:none}
  File "apache_beam/runners/common.py", line 778, in 
apache_beam.runners.common.DoFnRunner.process
    def process(self, windowed_value):
  File "apache_beam/runners/common.py", line 782, in 
apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 849, in 
apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise_with_traceback(new_exn)
  File "apache_beam/runners/common.py", line 780, in 
apache_beam.runners.common.DoFnRunner.process
    return self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 587, in 
apache_beam.runners.common.PerWindowInvoker.invoke_process
    self._invoke_process_per_window(
  File "apache_beam/runners/common.py", line 610, in 
apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
    [si[global_window] for si in self.side_inputs]))
  File 
"/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py",
 line 65, in __getitem__
    _FilteringIterable(self._iterable, target_window), self._view_options)
  File 
"/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py",
 line 443, in _from_runtime_iterable
    len(head), str(head[0]), str(head[1])))
ValueError: PCollection of size 2 with more than one element accessed as a 
singleton view. First two elements encountered are 
"gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", 
"gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 
'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)']
{code}
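For context, the check that raises this error can be sketched in plain Python (an illustrative simplification; Beam's real guard lives in apache_beam.pvalue and is more involved):

```python
from itertools import islice

def as_singleton(iterable):
    # Simplified sketch of the guard behind singleton side-input views:
    # peek at the first two elements; exactly one element is required.
    head = list(islice(iter(iterable), 2))
    if len(head) != 1:
        raise ValueError(
            "PCollection of size %d with more than one element accessed as "
            "a singleton view." % len(head))
    return head[0]

print(as_singleton(["gs://temp-dev/temp/bq_load/file"]))  # the happy path
try:
    # A view that materializes two elements reproduces the reported error.
    as_singleton(["gs://a", "gs://a"])
except ValueError as exc:
    print(exc)
```

If the singleton really is evaluated twice by BundleBasedDirectRunner, the side input iterable would hold two copies of the same element, which is exactly the failing branch above.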
 

 

 

 





[jira] [Updated] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs

2019-11-05 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai updated BEAM-8186:
--
Status: Open  (was: Triage Needed)

> WaitForBQJobs doesn't work for RUNNING jobs
> ---
>
> Key: BEAM-8186
> URL: https://issues.apache.org/jira/browse/BEAM-8186
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.13.0
>Reporter: Wenbing Bai
>Priority: Major
>
> Python SDK
> WaitForBQJobs doesn't work for a job in the RUNNING state





[jira] [Updated] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs

2019-11-05 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai updated BEAM-8186:
--
Status: Triage Needed  (was: Open)

> WaitForBQJobs doesn't work for RUNNING jobs
> ---
>
> Key: BEAM-8186
> URL: https://issues.apache.org/jira/browse/BEAM-8186
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.13.0
>Reporter: Wenbing Bai
>Priority: Major
>
> Python SDK
> WaitForBQJobs doesn't work for a job in the RUNNING state





[jira] [Updated] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs

2019-11-05 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai updated BEAM-8186:
--
Status: Open  (was: Triage Needed)

> WaitForBQJobs doesn't work for RUNNING jobs
> ---
>
> Key: BEAM-8186
> URL: https://issues.apache.org/jira/browse/BEAM-8186
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.13.0
>Reporter: Wenbing Bai
>Priority: Major
>
> Python SDK
> WaitForBQJobs doesn't work for a job in the RUNNING state





[jira] [Updated] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs

2019-09-09 Thread Wenbing Bai (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenbing Bai updated BEAM-8186:
--
Description: 
Python SDK

WaitForBQJobs doesn't work for Job which has RUNNING state

  was:
Python SDK

WaitForBQJobs doesn't work for Job which has RUNNING status


> WaitForBQJobs doesn't work for RUNNING jobs
> ---
>
> Key: BEAM-8186
> URL: https://issues.apache.org/jira/browse/BEAM-8186
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.13.0
>Reporter: Wenbing Bai
>Priority: Major
>
> Python SDK
> WaitForBQJobs doesn't work for a job in the RUNNING state



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs

2019-09-09 Thread Wenbing Bai (Jira)
Wenbing Bai created BEAM-8186:
-

 Summary: WaitForBQJobs doesn't work for RUNNING jobs
 Key: BEAM-8186
 URL: https://issues.apache.org/jira/browse/BEAM-8186
 Project: Beam
  Issue Type: Bug
  Components: io-py-gcp
Affects Versions: 2.13.0
Reporter: Wenbing Bai


Python SDK

WaitForBQJobs doesn't work for a job with RUNNING status
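A fix along these lines would keep polling while any job is still unfinished (a hypothetical sketch; wait_for_bq_jobs and poll_states are made-up names, not Beam's actual WaitForBQJobs DoFn):

```python
def wait_for_bq_jobs(poll_states, sleep=lambda: None, max_polls=100):
    """poll_states() returns the current state of every job, e.g. ["RUNNING", "DONE"]."""
    for _ in range(max_polls):
        states = poll_states()
        # Treating only PENDING as unfinished would let the loop exit while a
        # job is still RUNNING; a job is finished only once it reaches DONE.
        if all(state == "DONE" for state in states):
            return states
        sleep()
    raise RuntimeError("BigQuery jobs did not finish within the polling budget")

# Simulated lifecycle: the second poll still reports a RUNNING job.
polls = iter([["PENDING", "DONE"], ["RUNNING", "DONE"], ["DONE", "DONE"]])
print(wait_for_bq_jobs(lambda: next(polls)))  # ['DONE', 'DONE']
```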


