[jira] [Updated] (BEAM-14112) ReadFromBigQuery cannot be used with the interactive runner

2022-03-23 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-14112:
-
Fix Version/s: (was: 2.38.0)

> ReadFromBigQuery cannot be used with the interactive runner
> ---
>
> Key: BEAM-14112
> URL: https://issues.apache.org/jira/browse/BEAM-14112
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp, runner-py-interactive
>Affects Versions: 2.35.0, 2.36.0, 2.37.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> A change in Apache Beam 2.35.0 caused ReadFromBigQuery to no longer work with 
> the Python interactive runner.
> The error can be reproduced with the following code:
> {code:python}
> #!/usr/bin/env python
> """Reproduce pickle issue when using RFBQ in interactive runner."""
> import apache_beam as beam
> from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
> import apache_beam.runners.interactive.interactive_beam as ib
>
> options = beam.options.pipeline_options.PipelineOptions(
>     project="...",
>     temp_location="...",
> )
>
> pipeline = beam.Pipeline(InteractiveRunner(), options=options)
> pcoll = pipeline | beam.io.ReadFromBigQuery(query="SELECT 1")
> print(ib.collect(pcoll))
> {code}
> {code:none}
> Traceback (most recent call last):
>   File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
>   File "apache_beam/runners/common.py", line 536, in apache_beam.runners.common.SimpleInvoker.invoke_process
>   File "apache_beam/runners/common.py", line 1361, in apache_beam.runners.common._OutputProcessor.process_outputs
>   File "apache_beam/runners/worker/operations.py", line 214, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
>   File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
>   File "apache_beam/runners/worker/opcounters.py", line 211, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
>   File "apache_beam/runners/worker/opcounters.py", line 250, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
>   File "apache_beam/coders/coder_impl.py", line 1425, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 1436, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 207, in apache_beam.coders.coder_impl.CoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 1514, in apache_beam.coders.coder_impl.LengthPrefixCoderImpl.estimate_size
>   File "apache_beam/coders/coder_impl.py", line 246, in apache_beam.coders.coder_impl.StreamCoderImpl.estimate_size
>   File "apache_beam/coders/coder_impl.py", line 441, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
>   File "apache_beam/coders/coder_impl.py", line 268, in apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/coders/coders.py", line 802, in 
>     lambda x: dumps(x, protocol), pickle.loads)
> TypeError: can't pickle generator objects
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "repro.py", line 16, in 
>     print(ib.collect(pcoll))
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/runners/interactive/utils.py", line 270, in run_within_progress_indicator
>     return func(*args, **kwargs)
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/runners/interactive/interactive_beam.py

[jira] [Reopened] (BEAM-14112) ReadFromBigQuery cannot be used with the interactive runner

2022-03-23 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang reopened BEAM-14112:
--

Fix was reverted in https://github.com/apache/beam/pull/17141

> ReadFromBigQuery cannot be used with the interactive runner
> ---
>
> Key: BEAM-14112
> URL: https://issues.apache.org/jira/browse/BEAM-14112
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp, runner-py-interactive
>Affects Versions: 2.35.0, 2.36.0, 2.37.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
> Fix For: 2.38.0
>
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> A change in Apache Beam 2.35.0 caused ReadFromBigQuery to no longer work with 
> the Python interactive runner.
> The error can be reproduced with the following code:
> {code:python}
> #!/usr/bin/env python
> """Reproduce pickle issue when using RFBQ in interactive runner."""
> import apache_beam as beam
> from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
> import apache_beam.runners.interactive.interactive_beam as ib
>
> options = beam.options.pipeline_options.PipelineOptions(
>     project="...",
>     temp_location="...",
> )
>
> pipeline = beam.Pipeline(InteractiveRunner(), options=options)
> pcoll = pipeline | beam.io.ReadFromBigQuery(query="SELECT 1")
> print(ib.collect(pcoll))
> {code}
> {code:none}
> Traceback (most recent call last):
>   File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
>   File "apache_beam/runners/common.py", line 536, in apache_beam.runners.common.SimpleInvoker.invoke_process
>   File "apache_beam/runners/common.py", line 1361, in apache_beam.runners.common._OutputProcessor.process_outputs
>   File "apache_beam/runners/worker/operations.py", line 214, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
>   File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
>   File "apache_beam/runners/worker/opcounters.py", line 211, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
>   File "apache_beam/runners/worker/opcounters.py", line 250, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
>   File "apache_beam/coders/coder_impl.py", line 1425, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 1436, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 207, in apache_beam.coders.coder_impl.CoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 1514, in apache_beam.coders.coder_impl.LengthPrefixCoderImpl.estimate_size
>   File "apache_beam/coders/coder_impl.py", line 246, in apache_beam.coders.coder_impl.StreamCoderImpl.estimate_size
>   File "apache_beam/coders/coder_impl.py", line 441, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
>   File "apache_beam/coders/coder_impl.py", line 268, in apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/coders/coders.py", line 802, in 
>     lambda x: dumps(x, protocol), pickle.loads)
> TypeError: can't pickle generator objects
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "repro.py", line 16, in 
>     print(ib.collect(pcoll))
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/runners/interactive/utils.py", line 270, in run_within_progress_indicator
>     return func(*args, **kwargs)
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-

[jira] [Assigned] (BEAM-14112) ReadFromBigQuery cannot be used with the interactive runner

2022-03-15 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang reassigned BEAM-14112:


Assignee: Chun Yang

> ReadFromBigQuery cannot be used with the interactive runner
> ---
>
> Key: BEAM-14112
> URL: https://issues.apache.org/jira/browse/BEAM-14112
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp, runner-py-interactive
>Affects Versions: 2.35.0, 2.36.0, 2.37.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A change in Apache Beam 2.35.0 caused ReadFromBigQuery to no longer work with 
> the Python interactive runner.
> The error can be reproduced with the following code:
> {code:python}
> #!/usr/bin/env python
> """Reproduce pickle issue when using RFBQ in interactive runner."""
> import apache_beam as beam
> from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
> import apache_beam.runners.interactive.interactive_beam as ib
>
> options = beam.options.pipeline_options.PipelineOptions(
>     project="...",
>     temp_location="...",
> )
>
> pipeline = beam.Pipeline(InteractiveRunner(), options=options)
> pcoll = pipeline | beam.io.ReadFromBigQuery(query="SELECT 1")
> print(ib.collect(pcoll))
> {code}
> {code:none}
> Traceback (most recent call last):
>   File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
>   File "apache_beam/runners/common.py", line 536, in apache_beam.runners.common.SimpleInvoker.invoke_process
>   File "apache_beam/runners/common.py", line 1361, in apache_beam.runners.common._OutputProcessor.process_outputs
>   File "apache_beam/runners/worker/operations.py", line 214, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
>   File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
>   File "apache_beam/runners/worker/opcounters.py", line 211, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
>   File "apache_beam/runners/worker/opcounters.py", line 250, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
>   File "apache_beam/coders/coder_impl.py", line 1425, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 1436, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 207, in apache_beam.coders.coder_impl.CoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 1514, in apache_beam.coders.coder_impl.LengthPrefixCoderImpl.estimate_size
>   File "apache_beam/coders/coder_impl.py", line 246, in apache_beam.coders.coder_impl.StreamCoderImpl.estimate_size
>   File "apache_beam/coders/coder_impl.py", line 441, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
>   File "apache_beam/coders/coder_impl.py", line 268, in apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/coders/coders.py", line 802, in 
>     lambda x: dumps(x, protocol), pickle.loads)
> TypeError: can't pickle generator objects
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "repro.py", line 16, in 
>     print(ib.collect(pcoll))
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/runners/interactive/utils.py", line 270, in run_within_progress_indicator
>     return func(*args, **kwargs)
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/runners/interactive/interactive_beam.py", lin

[jira] [Work logged] (BEAM-14112) ReadFromBigQuery cannot be used with the interactive runner

2022-03-15 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-14112?focusedWorklogId=741963&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-741963
 ]

Chun Yang logged work on BEAM-14112:


Author: Chun Yang
Created on: 16/Mar/22 00:32
Start Date: 16/Mar/22 00:32
Worklog Time Spent: 2h 

Issue Time Tracking
---

Worklog Id: (was: 741963)
Time Spent: 2h 10m  (was: 10m)

> ReadFromBigQuery cannot be used with the interactive runner
> ---
>
> Key: BEAM-14112
> URL: https://issues.apache.org/jira/browse/BEAM-14112
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp, runner-py-interactive
>Affects Versions: 2.35.0, 2.36.0, 2.37.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> A change in Apache Beam 2.35.0 caused ReadFromBigQuery to no longer work with 
> the Python interactive runner.
> The error can be reproduced with the following code:
> {code:python}
> #!/usr/bin/env python
> """Reproduce pickle issue when using RFBQ in interactive runner."""
> import apache_beam as beam
> from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
> import apache_beam.runners.interactive.interactive_beam as ib
>
> options = beam.options.pipeline_options.PipelineOptions(
>     project="...",
>     temp_location="...",
> )
>
> pipeline = beam.Pipeline(InteractiveRunner(), options=options)
> pcoll = pipeline | beam.io.ReadFromBigQuery(query="SELECT 1")
> print(ib.collect(pcoll))
> {code}
> {code:none}
> Traceback (most recent call last):
>   File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
>   File "apache_beam/runners/common.py", line 536, in apache_beam.runners.common.SimpleInvoker.invoke_process
>   File "apache_beam/runners/common.py", line 1361, in apache_beam.runners.common._OutputProcessor.process_outputs
>   File "apache_beam/runners/worker/operations.py", line 214, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
>   File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
>   File "apache_beam/runners/worker/opcounters.py", line 211, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
>   File "apache_beam/runners/worker/opcounters.py", line 250, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
>   File "apache_beam/coders/coder_impl.py", line 1425, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 1436, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 207, in apache_beam.coders.coder_impl.CoderImpl.get_estimated_size_and_observables
>   File "apache_beam/coders/coder_impl.py", line 1514, in apache_beam.coders.coder_impl.LengthPrefixCoderImpl.estimate_size
>   File "apache_beam/coders/coder_impl.py", line 246, in apache_beam.coders.coder_impl.StreamCoderImpl.estimate_size
>   File "apache_beam/coders/coder_impl.py", line 441, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
>   File "apache_beam/coders/coder_impl.py", line 268, in apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream
>   File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/coders/coders.py", line 802, in 
>     lambda x: dumps(x, protocol), pickle.loads)
> TypeError: can't pickle generator objects
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "repro.py", line 16, in 
>     print(ib.collect(pcoll))
>   File "/home/chuck.yang/src/venv/py3-general/lib/pyth

[jira] [Created] (BEAM-14112) ReadFromBigQuery cannot be used with the interactive runner

2022-03-15 Thread Chun Yang (Jira)
Chun Yang created BEAM-14112:


 Summary: ReadFromBigQuery cannot be used with the interactive 
runner
 Key: BEAM-14112
 URL: https://issues.apache.org/jira/browse/BEAM-14112
 Project: Beam
  Issue Type: Improvement
  Components: io-py-gcp, runner-py-interactive
Affects Versions: 2.37.0, 2.36.0, 2.35.0
Reporter: Chun Yang


A change in Apache Beam 2.35.0 caused ReadFromBigQuery to no longer work with 
the Python interactive runner.

The error can be reproduced with the following code:
{code:python}
#!/usr/bin/env python
"""Reproduce pickle issue when using RFBQ in interactive runner."""
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
import apache_beam.runners.interactive.interactive_beam as ib

options = beam.options.pipeline_options.PipelineOptions(
    project="...",
    temp_location="...",
)

pipeline = beam.Pipeline(InteractiveRunner(), options=options)
pcoll = pipeline | beam.io.ReadFromBigQuery(query="SELECT 1")
print(ib.collect(pcoll))
{code}

{code:none}
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 536, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1361, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 214, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 178, in apache_beam.runners.worker.operations.ConsumerSet.update_counters_start
  File "apache_beam/runners/worker/opcounters.py", line 211, in apache_beam.runners.worker.opcounters.OperationCounters.update_from
  File "apache_beam/runners/worker/opcounters.py", line 250, in apache_beam.runners.worker.opcounters.OperationCounters.do_sample
  File "apache_beam/coders/coder_impl.py", line 1425, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 1436, in apache_beam.coders.coder_impl.WindowedValueCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 987, in apache_beam.coders.coder_impl.AbstractComponentCoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 207, in apache_beam.coders.coder_impl.CoderImpl.get_estimated_size_and_observables
  File "apache_beam/coders/coder_impl.py", line 1514, in apache_beam.coders.coder_impl.LengthPrefixCoderImpl.estimate_size
  File "apache_beam/coders/coder_impl.py", line 246, in apache_beam.coders.coder_impl.StreamCoderImpl.estimate_size
  File "apache_beam/coders/coder_impl.py", line 441, in apache_beam.coders.coder_impl.FastPrimitivesCoderImpl.encode_to_stream
  File "apache_beam/coders/coder_impl.py", line 268, in apache_beam.coders.coder_impl.CallbackCoderImpl.encode_to_stream
  File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/coders/coders.py", line 802, in 
    lambda x: dumps(x, protocol), pickle.loads)
TypeError: can't pickle generator objects

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "repro.py", line 16, in 
    print(ib.collect(pcoll))
  File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/runners/interactive/utils.py", line 270, in run_within_progress_indicator
    return func(*args, **kwargs)
  File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/runners/interactive/interactive_beam.py", line 664, in collect
    recording = recording_manager.record([pcoll], max_n=n, max_duration=duration)
  File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/runners/interactive/recording_manager.py", line 458, in record
    self.user_pipeline.options).run()
  File "/home/chuck.yang/src/venv/py3-general/lib/python3.7/site-packages/apache_beam/runners/interactive/pipeline_fragment.py", line 113, in run
    return self.deduce_fragmen
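
For context on the root error: "TypeError: can't pickle generator objects" is Python's general restriction that generator objects cannot be pickled; in this stack the FastPrimitivesCoder falls back to pickling an element it does not recognize, and that element is (or contains) a generator. A minimal sketch outside Beam, only to illustrate the error class (this is not the Beam code path above):

{code:python}
import pickle

# A generator object, standing in for lazily yielded rows.
gen = (row for row in range(3))

try:
    pickle.dumps(gen)
except TypeError as err:
    # Raises a TypeError like the one above; exact wording varies by Python version.
    print(err)
{code}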

[jira] [Updated] (BEAM-11359) Clean up temporary dataset after ReadAllFromBQ executes

2021-10-20 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-11359:
-
Fix Version/s: 2.32.0
   Resolution: Fixed
   Status: Resolved  (was: Open)

Resolved via this commit AFAICT: 
https://github.com/apache/beam/commit/a92965969fda3d7be5b65996d1a9a5b7dd373119

> Clean up temporary dataset after ReadAllFromBQ executes
> ---
>
> Key: BEAM-11359
> URL: https://issues.apache.org/jira/browse/BEAM-11359
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Pablo Estrada
>Priority: P3
> Fix For: 2.32.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently, the transform creates (or receives) a temp dataset and it does not 
> clean it up. Only one is created per pipeline, so it's not too bad, but it's 
> not ideal.
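
For illustration, the cleanup being asked for amounts to deleting the pipeline's temporary dataset (and its result tables) once the read finishes. A minimal sketch using the google-cloud-bigquery client directly, not the Beam-internal implementation; the project and dataset names are placeholders:

{code:python}
from google.cloud import bigquery

def delete_temp_dataset(project, temp_dataset_id):
    """Best-effort removal of a temporary BigQuery dataset created for a read."""
    client = bigquery.Client(project=project)
    client.delete_dataset(
        temp_dataset_id,
        delete_contents=True,   # also drop the temporary result tables
        not_found_ok=True,      # tolerate a dataset that was already removed
    )

delete_temp_dataset("my-project", "my_temp_dataset")
{code}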



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-12669) UpdateDestinationSchema PTransform does not respect source format

2021-10-13 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-12669:
-
Fix Version/s: 2.33.0
   Resolution: Fixed
   Status: Resolved  (was: Triage Needed)

> UpdateDestinationSchema PTransform does not respect source format
> -
>
> Key: BEAM-12669
> URL: https://issues.apache.org/jira/browse/BEAM-12669
> Project: Beam
>  Issue Type: Bug
>  Components: io-go-gcp, runner-dataflow
>Affects Versions: 2.30.0
>Reporter: Sayat Satybaldiyev
>Priority: P3
> Fix For: 2.33.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 10,000 URIs, WriteToBigQuery in 
> FILE_LOADS mode will write data into temporary tables and then update the 
> temporary tables' schemas if schema addition is allowed.
> However, the schema update of the temporary tables does not respect the 
> specified source format of the load files (i.e., JSON or AVRO).
> UpdateDestinationSchema issues the schema modification command with the 
> default CSV setting, which causes AVRO or JSON loads with nested schemas to 
> fail with the error:
> {code:java}
> apache_beam.io.gcp.bigquery_file_loads: INFO: Triggering schema modification 
> job 
> beam_bq_job_LOAD_satybald7_SCHEMA_MOD_STEP_994_3869e4dc1dd08c68d20fd047e242161a_7c553f684cce4963a75d669f38a4ec46
>  on   datasetId: 'python_write_to_table_162743435'
>  projectId: 'DELETED'
>  tableId: 'python_append_schema_update'>
> apache_beam.io.gcp.bigquery_tools: INFO: Failed to insert job   jobId: 
> 'beam_bq_job_LOAD7_SCHEMA_MOD_STEP_994_3869e4dc1dd08c68d20fd047e242161a_7c553f684cce4963a75d669f38a4ec46'
>  projectId: 'DELETED'>: HttpError accessing 
>  'content-type': 'application/json; charset=UTF-8', 'content-length': '332', 
> 'date': 'Wed, 28 Jul 2021 00:12:03 GMT', 'server': 'UploadServer', 'status': 
> '400'}>, content <{
>   "error": {
> "code": 400,
> "message": "Cannot load CSV data with a nested schema. Field: 
> nested_field",
> "errors": [
>   {
> "message": "Cannot load CSV data with a nested schema. Field: 
> nested_field",
> "domain": "global",
> "reason": "invalid"
>   }
> ],
> "status": "INVALID_ARGUMENT"
>   }
> }
> {code}
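
For reference, a rough sketch of the shape of pipeline described above: FILE_LOADS with WRITE_APPEND, schema additions allowed, and a nested (non-CSV-loadable) schema. The table, schema, and values are placeholders, and the temp_file_format parameter name is an assumption to be checked against the SDK version in use:

{code:python}
import apache_beam as beam

nested_schema = {
    "fields": [
        {"name": "id", "type": "STRING", "mode": "REQUIRED"},
        {
            "name": "nested_field",
            "type": "RECORD",
            "mode": "NULLABLE",
            "fields": [{"name": "value", "type": "INTEGER", "mode": "NULLABLE"}],
        },
    ]
}

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"id": "a", "nested_field": {"value": 1}}])
        | beam.io.WriteToBigQuery(
            "project:dataset.python_append_schema_update",
            schema=nested_schema,
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            # Schema additions are allowed, which triggers the schema
            # modification job shown in the log above.
            additional_bq_parameters={"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]},
            # Load files are written as JSON rather than CSV; the parameter
            # name is an assumption here, check the SDK version you run.
            temp_file_format="NEWLINE_DELIMITED_JSON",
        )
    )
{code}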



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-12913) Enable configuring query priority in Python ReadFromBigQuery

2021-09-26 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-12913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-12913:
-
Fix Version/s: 2.34.0
   Resolution: Fixed
   Status: Resolved  (was: Open)

> Enable configuring query priority in Python ReadFromBigQuery
> 
>
> Key: BEAM-12913
> URL: https://issues.apache.org/jira/browse/BEAM-12913
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Affects Versions: 2.32.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P3
> Fix For: 2.34.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> Submitting BigQuery queries with BATCH priority allows queries to be started 
> when idle resources are available and allows queries submitted from Beam to 
> not count toward a project's concurrent rate limit.
> The Java BigQueryIO enables configuring the query priority. We should enable 
> the same configuration in the Python ReadFromBigQuery.
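
For reference, this is what BATCH priority means at the BigQuery API level, shown with the google-cloud-bigquery client rather than the Beam parameter being requested; the project name is a placeholder:

{code:python}
from google.cloud import bigquery

# BATCH priority queues the query until idle slots are available and does not
# count against the project's concurrent interactive-query limit.
client = bigquery.Client(project="my-project")
job_config = bigquery.QueryJobConfig(priority=bigquery.QueryPriority.BATCH)
job = client.query("SELECT 1", job_config=job_config)
print(list(job.result()))
{code}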



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work stopped] (BEAM-12913) Enable configuring query priority in Python ReadFromBigQuery

2021-09-17 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-12913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-12913 stopped by Chun Yang.

> Enable configuring query priority in Python ReadFromBigQuery
> 
>
> Key: BEAM-12913
> URL: https://issues.apache.org/jira/browse/BEAM-12913
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Affects Versions: 2.32.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P3
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Submitting BigQuery queries with BATCH priority allows queries to be started 
> when idle resources are available and allows queries submitted from Beam to 
> not count toward a project's concurrent rate limit.
> The Java BigQueryIO enables configuring the query priority. We should enable 
> the same configuration in the Python ReadFromBigQuery.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-12913) Enable configuring query priority in Python ReadFromBigQuery

2021-09-17 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-12913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-12913 started by Chun Yang.

> Enable configuring query priority in Python ReadFromBigQuery
> 
>
> Key: BEAM-12913
> URL: https://issues.apache.org/jira/browse/BEAM-12913
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Affects Versions: 2.32.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P3
>
> Submitting BigQuery queries with BATCH priority allows queries to be started 
> when idle resources are available and allows queries submitted from Beam to 
> not count toward a project's concurrent rate limit.
> The Java BigQueryIO enables configuring the query priority. We should enable 
> the same configuration in the Python ReadFromBigQuery.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-12913) Enable configuring query priority in Python ReadFromBigQuery

2021-09-17 Thread Chun Yang (Jira)
Chun Yang created BEAM-12913:


 Summary: Enable configuring query priority in Python 
ReadFromBigQuery
 Key: BEAM-12913
 URL: https://issues.apache.org/jira/browse/BEAM-12913
 Project: Beam
  Issue Type: Improvement
  Components: io-py-gcp
Affects Versions: 2.32.0
Reporter: Chun Yang
Assignee: Chun Yang


Submitting BigQuery queries with BATCH priority allows queries to be started 
when idle resources are available and allows queries submitted from Beam to not 
count toward a project's concurrent rate limit.

The Java BigQueryIO enables configuring the query priority. We should enable 
the same configuration in the Python ReadFromBigQuery.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-12079) Deterministic coding enforcement causes _StreamToBigQuery/CommitInsertIds/GroupByKey to fail

2021-08-17 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-12079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-12079:
-
Fix Version/s: 2.30.0

> Deterministic coding enforcement causes 
> _StreamToBigQuery/CommitInsertIds/GroupByKey to fail
> 
>
> Key: BEAM-12079
> URL: https://issues.apache.org/jira/browse/BEAM-12079
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Reporter: Ludovic Post
>Priority: P2
>  Labels: stale-P2
> Fix For: 2.30.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> {code:none}
> TypeError: Unable to deterministically encode '<TableReference
>  datasetId: 'my-dataset-id'
>  projectId: 'my-project-id'
>  tableId: 'my-table-id'>' of type '<class 'apache_beam.io.gcp.internal.clients.bigquery.bigquery_v2_messages.TableReference'>',
>  please provide a type hint for the input of 
> 'WriteToBigQuery/_StreamToBigQuery/CommitInsertIds/GroupByKey' [while running 
> 'WriteToBigQuery/_StreamToBigQuery/CommitInsertIds/Map(reify_timestamps)-ptransform-171']
> {code}
>  
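
For context, this error is Beam asking for an element type hint so it can pick a deterministic key coder instead of falling back to pickling. A generic sketch of what providing such a hint looks like in user code; it illustrates the mechanism only and is not a fix for the WriteToBigQuery-internal transform named above:

{code:python}
from typing import Tuple

import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([("table_a", 1), ("table_b", 2), ("table_a", 3)])
        # Declaring the element type lets Beam choose a deterministic coder
        # for the GroupByKey key instead of falling back to pickling.
        | beam.Map(lambda kv: kv).with_output_types(Tuple[str, int])
        | beam.GroupByKey()
        | beam.Map(print)
    )
{code}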



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-12641) Support GCP auth using custom token URIs

2021-07-20 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-12641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-12641 started by Chun Yang.

> Support GCP auth using custom token URIs
> 
>
> Key: BEAM-12641
> URL: https://issues.apache.org/jira/browse/BEAM-12641
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P3
>  Labels: auth, gcp, python
>
> Feature request: Allow authenticating to GCP with service account credentials 
> that use a custom token URI.
> {quote}We use the {{service_account}} credential type. Older versions 
> (oauth2client) supported this type, BUT they only supported using google 
> endpoints for issuing credentials, where we use a custom {{token_uri}} to 
> issue credentials. New versions (google-auth) will reference our custom 
> {{token_uri}}.{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work stopped] (BEAM-12641) Support GCP auth using custom token URIs

2021-07-20 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-12641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-12641 stopped by Chun Yang.

> Support GCP auth using custom token URIs
> 
>
> Key: BEAM-12641
> URL: https://issues.apache.org/jira/browse/BEAM-12641
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P3
>  Labels: auth, gcp, python
>
> Feature request: Allow authenticating to GCP with service account credentials 
> that use a custom token URI.
> {quote}We use the {{service_account}} credential type. Older versions 
> (oauth2client) supported this type, BUT they only supported using google 
> endpoints for issuing credentials, where we use a custom {{token_uri}} to 
> issue credentials. New versions (google-auth) will reference our custom 
> {{token_uri}}.{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-12641) Support GCP auth using custom token URIs

2021-07-20 Thread Chun Yang (Jira)
Chun Yang created BEAM-12641:


 Summary: Support GCP auth using custom token URIs
 Key: BEAM-12641
 URL: https://issues.apache.org/jira/browse/BEAM-12641
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Chun Yang
Assignee: Chun Yang


Feature request: Allow authenticating to GCP with service account credentials 
that use a custom token URI.

{quote}We use the {{service_account}} credential type. Older versions 
(oauth2client) supported this type, BUT they only supported using google 
endpoints for issuing credentials, where we use a custom {{token_uri}} to issue 
credentials. New versions (google-auth) will reference our custom 
{{token_uri}}.{quote}
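
For reference, a minimal sketch of the google-auth behavior referenced in the quote: the key file's "token_uri" field determines where the signed JWT is exchanged for an access token. The file path, endpoint, and scopes are placeholders:

{code:python}
from google.oauth2 import service_account

# The service-account key file is a placeholder; the relevant detail is that
# google-auth honors a custom "token_uri" field in the key, e.g.:
#   "token_uri": "https://auth.example.com/token"
# and uses it when exchanging the signed JWT for an access token.
credentials = service_account.Credentials.from_service_account_file(
    "/path/to/key-with-custom-token-uri.json",
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
{code}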



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2021-04-28 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-11277:
-
Fix Version/s: 2.30.0
   Resolution: Fixed
   Status: Resolved  (was: Open)

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.24.0, 2.25.0, 2.28.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
> Fix For: 2.30.0
>
> Attachments: repro.py
>
>  Time Spent: 12h 40m
>  Remaining Estimate: 0h
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.
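
For reference, a minimal sketch of the WriteToBigQuery configuration described above; the table, schema, and input row are placeholders:

{code:python}
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | beam.Create([{"existing_field": "x", "field_name": "new"}])
        | beam.io.WriteToBigQuery(
            "project:dataset.table",
            schema="existing_field:STRING,field_name:STRING",
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            # Allows the load jobs to add new fields to the destination schema;
            # per this report, the copy jobs used beyond ~10K source URIs do
            # not receive these options.
            additional_bq_parameters={"schemaUpdateOptions": ["ALLOW_FIELD_ADDITION"]},
        )
    )
{code}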



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2021-03-05 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-11277:
-
Status: Open  (was: Triage Needed)

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.24.0, 2.25.0, 2.28.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
> Attachments: repro.py
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2021-03-05 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-11277 started by Chun Yang.

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.24.0, 2.25.0, 2.28.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
> Attachments: repro.py
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work stopped] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2021-03-05 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-11277 stopped by Chun Yang.

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.24.0, 2.25.0, 2.28.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
> Attachments: repro.py
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2021-03-01 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-11277:
-
Affects Version/s: 2.28.0

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.24.0, 2.25.0, 2.28.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
> Attachments: repro.py
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-11884) Deterministic coding enforcement causes BigQueryBatchFieldLoads/GroupFilesByTableDestinations to fail

2021-02-27 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang reassigned BEAM-11884:


Assignee: Chun Yang

> Deterministic coding enforcement causes 
> BigQueryBatchFieldLoads/GroupFilesByTableDestinations to fail
> -
>
> Key: BEAM-11884
> URL: https://issues.apache.org/jira/browse/BEAM-11884
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P1
>  Labels: pull-request-available
>
> {code:none}
> TypeError: Unable to deterministically encode '<TableReference
>  datasetId: 'my-dataset-id'
>  projectId: 'my-project-id'
>  tableId: 'my-table-id'>' of type '<class 'apache_beam.io.gcp.internal.clients.bigquery.bigquery_v2_messages.TableReference'>',
>  please provide a type hint for the input of 
> 'WriteToBigQuery/BigQueryBatchFileLoads/GroupFilesByTableDestinations' [while 
> running 'WriteToBigQuery/BigQueryBatchFileLoads/IdentityWorkaround']
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-11884) Deterministic coding enforcement causes BigQueryBatchFieldLoads/GroupFilesByTableDestinations to fail

2021-02-27 Thread Chun Yang (Jira)
Chun Yang created BEAM-11884:


 Summary: Deterministic coding enforcement causes 
BigQueryBatchFieldLoads/GroupFilesByTableDestinations to fail
 Key: BEAM-11884
 URL: https://issues.apache.org/jira/browse/BEAM-11884
 Project: Beam
  Issue Type: Bug
  Components: io-py-gcp
Reporter: Chun Yang


{code:none}
TypeError: Unable to deterministically encode '<TableReference
 datasetId: 'my-dataset-id'
 projectId: 'my-project-id'
 tableId: 'my-table-id'>' of type '<class 'apache_beam.io.gcp.internal.clients.bigquery.bigquery_v2_messages.TableReference'>',
 please provide a type hint for the input of 
'WriteToBigQuery/BigQueryBatchFileLoads/GroupFilesByTableDestinations' [while 
running 'WriteToBigQuery/BigQueryBatchFileLoads/IdentityWorkaround']
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2021-02-26 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-11277:
-
Labels:   (was: stale-P2)

I will take a stab at this issue this week.

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.25.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
> Attachments: repro.py
>
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2021-02-26 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-11277:
-
Affects Version/s: 2.24.0

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.24.0, 2.25.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
> Attachments: repro.py
>
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2021-02-26 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang reassigned BEAM-11277:


Assignee: Chun Yang

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.25.0
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P2
>  Labels: stale-P2
> Attachments: repro.py
>
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2020-11-17 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-11277:
-
Description: 
When multiple load jobs are needed to write data to a destination table, e.g., 
when the data is spread over more than 
[10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
then copy the temporary tables into the destination table.

When WriteToBigQuery is used with 
{{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
{{additional_bq_parameters=\{"schemaUpdateOptions": 
["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
the jobs that copy data from temporary tables into the destination table. The 
effect is that for small jobs (<10K source URIs), schema field addition is 
allowed, however, if the job is scaled to >10K source URIs, then schema field 
addition will fail with an error such as:

{code:none}Provided Schema does not match Table project:dataset.table. Cannot 
add fields (field: field_name){code}

I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with DirectRunner. 
A minimal reproducible example is attached.

  was:
When multiple load jobs are needed to write data to a destination table, e.g., 
when the data is spread over more than 
[10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
then copy the temporary tables into the destination table.

When WriteToBigQuery is used with 
{{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
{{additional_bq_parameters=\{"schemaUpdateOptions": 
["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
the jobs that copy data from temporary tables into the destination table. The 
effect is that for small jobs (<10K source URIs), schema field addition is 
allowed, however, if the job is scaled to >10K source URIs, then schema field 
addition will fail with an error such as:

{code:none}Provided Schema does not match Table project:dataset.table. Cannot 
add fields (field: field_name){code}

I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using 
DirectRunner.


> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.25.0
>Reporter: Chun Yang
>Priority: P2
> Attachments: repro.py
>
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed, however, if the job is scaled to >10K source URIs, then schema field 
> addition will fail with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. I could not reproduce the issue with 
> DirectRunner. A minimal reproducible example is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2020-11-17 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-11277:
-
Attachment: repro.py

> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.25.0
>Reporter: Chun Yang
>Priority: P2
> Attachments: repro.py
>
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed; however, once the job scales beyond 10K source URIs, schema field 
> addition fails with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using 
> DirectRunner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2020-11-17 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-11277:
-
Description: 
When multiple load jobs are needed to write data to a destination table, e.g., 
when the data is spread over more than 
[10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
then copy the temporary tables into the destination table.

When WriteToBigQuery is used with 
{{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
{{additional_bq_parameters=\{"schemaUpdateOptions": 
["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
the jobs that copy data from temporary tables into the destination table. The 
effect is that for small jobs (<10K source URIs), schema field addition is 
allowed; however, once the job scales beyond 10K source URIs, schema field 
addition fails with an error such as:

{code:none}Provided Schema does not match Table project:dataset.table. Cannot 
add fields (field: field_name){code}

I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using 
DirectRunner.

  was:
When multiple load jobs are needed to write data to a destination table, e.g., 
when the data is spread over more than 
[10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
then copy the temporary tables into the destination table.

When WriteToBigQuery is used with 
{{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
{{additional_bq_parameters=\{"schemaUpdateOptions": 
["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
the jobs that copy data from temporary tables into the destination table. The 
effect is that for small jobs (<10K source URIs), schema field addition is 
allowed; however, once the job scales beyond 10K source URIs, schema field 
addition fails with an error such as:

{code:none}Provided Schema does not match Table project:dataset.table. Cannot 
add fields (field: field_name){code}

I've been able to reproduce this issue with Python 2.7 and DataflowRunner on 
Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using 
DirectRunner.


> WriteToBigQuery with batch file loads does not respect schema update options 
> when there are multiple load jobs
> --
>
> Key: BEAM-11277
> URL: https://issues.apache.org/jira/browse/BEAM-11277
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp, runner-dataflow
>Affects Versions: 2.21.0, 2.25.0
>Reporter: Chun Yang
>Priority: P2
>
> When multiple load jobs are needed to write data to a destination table, 
> e.g., when the data is spread over more than 
> [10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
> WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
> then copy the temporary tables into the destination table.
> When WriteToBigQuery is used with 
> {{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
> {{additional_bq_parameters=\{"schemaUpdateOptions": 
> ["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
> the jobs that copy data from temporary tables into the destination table. The 
> effect is that for small jobs (<10K source URIs), schema field addition is 
> allowed; however, once the job scales beyond 10K source URIs, schema field 
> addition fails with an error such as:
> {code:none}Provided Schema does not match Table project:dataset.table. Cannot 
> add fields (field: field_name){code}
> I've been able to reproduce this issue with Python 3.7 and DataflowRunner on 
> Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using 
> DirectRunner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-11277) WriteToBigQuery with batch file loads does not respect schema update options when there are multiple load jobs

2020-11-17 Thread Chun Yang (Jira)
Chun Yang created BEAM-11277:


 Summary: WriteToBigQuery with batch file loads does not respect 
schema update options when there are multiple load jobs
 Key: BEAM-11277
 URL: https://issues.apache.org/jira/browse/BEAM-11277
 Project: Beam
  Issue Type: Bug
  Components: io-py-gcp, runner-dataflow
Affects Versions: 2.25.0, 2.21.0
Reporter: Chun Yang


When multiple load jobs are needed to write data to a destination table, e.g., 
when the data is spread over more than 
[10,000|https://cloud.google.com/bigquery/quotas#load_jobs] URIs, 
WriteToBigQuery in FILE_LOADS mode will write data into temporary tables and 
then copy the temporary tables into the destination table.

When WriteToBigQuery is used with 
{{write_disposition=BigQueryDisposition.WRITE_APPEND}} and 
{{additional_bq_parameters=\{"schemaUpdateOptions": 
["ALLOW_FIELD_ADDITION"]\}}}, the schema update options are not respected by 
the jobs that copy data from temporary tables into the destination table. The 
effect is that for small jobs (<10K source URIs), schema field addition is 
allowed; however, once the job scales beyond 10K source URIs, schema field 
addition fails with an error such as:

{code:none}Provided Schema does not match Table project:dataset.table. Cannot 
add fields (field: field_name){code}

I've been able to reproduce this issue with Python 2.7 and DataflowRunner on 
Beam 2.21.0 and Beam 2.25.0. The issue does not manifest when using 
DirectRunner.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs

2020-07-15 Thread Chun Yang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17158819#comment-17158819
 ] 

Chun Yang commented on BEAM-8186:
-

Fixed by https://github.com/apache/beam/pull/9856 (I think)

> WaitForBQJobs doesn't work for RUNNING jobs
> ---
>
> Key: BEAM-8186
> URL: https://issues.apache.org/jira/browse/BEAM-8186
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.13.0
>Reporter: Wenbing Bai
>Priority: P3
> Fix For: 2.17.0
>
>
> Python SDK
> WaitForBQJobs doesn't work for jobs that are in the RUNNING state



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-8186) WaitForBQJobs doesn't work for RUNNING jobs

2020-07-15 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang closed BEAM-8186.
---
Fix Version/s: 2.17.0
   Resolution: Fixed

> WaitForBQJobs doesn't work for RUNNING jobs
> ---
>
> Key: BEAM-8186
> URL: https://issues.apache.org/jira/browse/BEAM-8186
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.13.0
>Reporter: Wenbing Bai
>Priority: P3
> Fix For: 2.17.0
>
>
> Python SDK
> WaitForBQJobs doesn't work for jobs that are in the RUNNING state



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10176) Support BigQuery type aliases in WriteToBigQuery with Avro format

2020-06-23 Thread Chun Yang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143500#comment-17143500
 ] 

Chun Yang commented on BEAM-10176:
--

Nope, this was merged, closing.

> Support BigQuery type aliases in WriteToBigQuery with Avro format
> -
>
> Key: BEAM-10176
> URL: https://issues.apache.org/jira/browse/BEAM-10176
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P3
> Fix For: 2.23.0
>
>   Original Estimate: 3h
>  Time Spent: 1h 40m
>  Remaining Estimate: 1h 20m
>
> Support BigQuery type aliases in WriteToBigQuery with avro temp file format. 
> The following aliases are missing:
> * {{STRUCT}} ({{==RECORD}})
> * {{FLOAT64}} ({{==FLOAT}})
> * {{INT64}} ({{==INTEGER}})
> Reference:
> https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
> https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableFieldSchema.html#setType-java.lang.String-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-10176) Support BigQuery type aliases in WriteToBigQuery with Avro format

2020-06-23 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang resolved BEAM-10176.
--
Fix Version/s: 2.23.0
   Resolution: Fixed

> Support BigQuery type aliases in WriteToBigQuery with Avro format
> -
>
> Key: BEAM-10176
> URL: https://issues.apache.org/jira/browse/BEAM-10176
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P3
> Fix For: 2.23.0
>
>   Original Estimate: 3h
>  Time Spent: 1h 40m
>  Remaining Estimate: 1h 20m
>
> Support BigQuery type aliases in WriteToBigQuery with avro temp file format. 
> The following aliases are missing:
> * {{STRUCT}} ({{==RECORD}})
> * {{FLOAT64}} ({{==FLOAT}})
> * {{INT64}} ({{==INTEGER}})
> Reference:
> https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
> https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableFieldSchema.html#setType-java.lang.String-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10176) Support BigQuery type aliases in WriteToBigQuery with Avro format

2020-06-02 Thread Chun Yang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124264#comment-17124264
 ] 

Chun Yang commented on BEAM-10176:
--

Patch available: 
https://github.com/apache/beam/compare/master...chunyang:cyang/more-type-names?expand=1

I'll wait for https://github.com/apache/beam/pull/11896 to land first, since 
there are conflicts.

> Support BigQuery type aliases in WriteToBigQuery with Avro format
> -
>
> Key: BEAM-10176
> URL: https://issues.apache.org/jira/browse/BEAM-10176
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P3
> Fix For: 2.23.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Support BigQuery type aliases in WriteToBigQuery with avro temp file format. 
> The following aliases are missing:
> * {{STRUCT}} ({{==RECORD}})
> * {{FLOAT64}} ({{==FLOAT}})
> * {{INT64}} ({{==INTEGER}})
> Reference:
> https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
> https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableFieldSchema.html#setType-java.lang.String-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-10176) Support BigQuery type aliases in WriteToBigQuery with Avro format

2020-06-02 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-10176 started by Chun Yang.

> Support BigQuery type aliases in WriteToBigQuery with Avro format
> -
>
> Key: BEAM-10176
> URL: https://issues.apache.org/jira/browse/BEAM-10176
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: P3
> Fix For: 2.23.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Support BigQuery type aliases in WriteToBigQuery with avro temp file format. 
> The following aliases are missing:
> * {{STRUCT}} ({{==RECORD}})
> * {{FLOAT64}} ({{==FLOAT}})
> * {{INT64}} ({{==INTEGER}})
> Reference:
> https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
> https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableFieldSchema.html#setType-java.lang.String-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10176) Support BigQuery type aliases in WriteToBigQuery with Avro format

2020-06-02 Thread Chun Yang (Jira)
Chun Yang created BEAM-10176:


 Summary: Support BigQuery type aliases in WriteToBigQuery with 
Avro format
 Key: BEAM-10176
 URL: https://issues.apache.org/jira/browse/BEAM-10176
 Project: Beam
  Issue Type: Improvement
  Components: io-py-gcp
Reporter: Chun Yang
Assignee: Chun Yang
 Fix For: 2.23.0


Support BigQuery type aliases in WriteToBigQuery with avro temp file format. 
The following aliases are missing:
* {{STRUCT}} ({{==RECORD}})
* {{FLOAT64}} ({{==FLOAT}})
* {{INT64}} ({{==INTEGER}})

Reference:
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/TableFieldSchema.html#setType-java.lang.String-
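Illustrative sketch only (not the actual patch): the missing aliases amount to a small normalization step before the BigQuery-to-Avro type lookup. The names below are hypothetical:

{code:python}
# Hypothetical sketch: normalize BigQuery type aliases to their canonical
# names so the Avro conversion only needs to know the canonical types.
_BQ_TYPE_ALIASES = {
    "STRUCT": "RECORD",
    "FLOAT64": "FLOAT",
    "INT64": "INTEGER",
}

def canonical_bq_type(field_type):
    """Return the canonical BigQuery type for a possibly-aliased type name."""
    upper = field_type.upper()
    return _BQ_TYPE_ALIASES.get(upper, upper)

assert canonical_bq_type("int64") == "INTEGER"
{code}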



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8841) Add ability to perform BigQuery file loads using avro

2020-03-05 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang resolved BEAM-8841.
-
Resolution: Fixed

> Add ability to perform BigQuery file loads using avro
> -
>
> Key: BEAM-8841
> URL: https://issues.apache.org/jira/browse/BEAM-8841
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: Minor
> Fix For: 2.21.0
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Currently, JSON format is used for file loads into BigQuery in the Python 
> SDK. JSON has some disadvantages including size of serialized data and 
> inability to represent NaN and infinity float values.
> BigQuery supports loading files in avro format, which can overcome these 
> disadvantages. The Java SDK already supports loading files using avro format 
> (BEAM-2879) so it makes sense to support it in the Python SDK as well.
> The change will be somewhere around 
> [{{BigQueryBatchFileLoads}}|https://github.com/apache/beam/blob/3e7865ee6c6a56e51199515ec5b4b16de1ddd166/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L554].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-8841) Add ability to perform BigQuery file loads using avro

2020-03-05 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-8841 started by Chun Yang.
---
> Add ability to perform BigQuery file loads using avro
> -
>
> Key: BEAM-8841
> URL: https://issues.apache.org/jira/browse/BEAM-8841
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: Minor
> Fix For: 2.21.0
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Currently, JSON format is used for file loads into BigQuery in the Python 
> SDK. JSON has some disadvantages including size of serialized data and 
> inability to represent NaN and infinity float values.
> BigQuery supports loading files in avro format, which can overcome these 
> disadvantages. The Java SDK already supports loading files using avro format 
> (BEAM-2879) so it makes sense to support it in the Python SDK as well.
> The change will be somewhere around 
> [{{BigQueryBatchFileLoads}}|https://github.com/apache/beam/blob/3e7865ee6c6a56e51199515ec5b4b16de1ddd166/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L554].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8841) Add ability to perform BigQuery file loads using avro

2020-03-05 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-8841:

Fix Version/s: 2.21.0

> Add ability to perform BigQuery file loads using avro
> -
>
> Key: BEAM-8841
> URL: https://issues.apache.org/jira/browse/BEAM-8841
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Chun Yang
>Assignee: Chun Yang
>Priority: Minor
> Fix For: 2.21.0
>
>  Time Spent: 8h
>  Remaining Estimate: 0h
>
> Currently, JSON format is used for file loads into BigQuery in the Python 
> SDK. JSON has some disadvantages including size of serialized data and 
> inability to represent NaN and infinity float values.
> BigQuery supports loading files in avro format, which can overcome these 
> disadvantages. The Java SDK already supports loading files using avro format 
> (BEAM-2879) so it makes sense to support it in the Python SDK as well.
> The change will be somewhere around 
> [{{BigQueryBatchFileLoads}}|https://github.com/apache/beam/blob/3e7865ee6c6a56e51199515ec5b4b16de1ddd166/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L554].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9192) BigQuery IO on Dataflow runner fails (java.lang.ClassCastException) with --experiment=beam_fn_api

2020-02-14 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-9192:

Affects Version/s: 2.16.0
   2.19.0

> BigQuery IO on Dataflow runner fails (java.lang.ClassCastException) with 
> --experiment=beam_fn_api
> -
>
> Key: BEAM-9192
> URL: https://issues.apache.org/jira/browse/BEAM-9192
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Bo Shi
>Priority: Major
>
> {noformat}
> python repro.py \
>   --project=CHANGEME \
>   --runner=DataflowRunner \
>   --temp_location=gs://change-me/bshi/tmp \
>   --staging_location=gs://change-me/bshi/stg \
>   --experiment=beam_fn_api \
>   --save_main_session
> {noformat}
> The same repro code works with --runner=Direct.  On Dataflow, the error is
> {noformat}
> java.util.concurrent.ExecutionException: java.lang.ClassCastException: [B 
> cannot be cast to org.apache.beam.sdk.values.KV
>   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>   at 
> org.apache.beam.sdk.fn.data.CompletableFutureInboundDataClient.awaitCompletion(CompletableFutureInboundDataClient.java:48)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:87)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.data.BeamFnDataGrpcService$DeferredInboundDataClient.awaitCompletion(BeamFnDataGrpcService.java:134)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortReadOperation.finish(RemoteGrpcPortReadOperation.java:83)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:125)
>   at 
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:411)
>   at 
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:380)
>   at 
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
>   at 
> org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.start(DataflowRunnerHarness.java:195)
>   at 
> org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.main(DataflowRunnerHarness.java:123)
>   Suppressed: java.lang.IllegalStateException: Already closed.
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataBufferingOutboundObserver.close(BeamFnDataBufferingOutboundObserver.java:93)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.abort(RemoteGrpcPortWriteOperation.java:220)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:91)
>   ... 6 more
> Caused by: java.lang.ClassCastException: [B cannot be cast to 
> org.apache.beam.sdk.values.KV
>   at 
> org.apache.beam.runners.dataflow.worker.ReifyTimestampAndWindowsParDoFnFactory$ReifyTimestampAndWindowsParDoFn.processElement(ReifyTimestampAndWindowsParDoFnFactory.java:72)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortReadOperation.consumeOutput(RemoteGrpcPortReadOperation.java:103)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:78)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:31)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:138)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:125)
>   at 
> org.apache.beam.vendor.grpc.v1p21p0.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:249)
>   at 
> org.apache.beam.vendor.grpc.v1p21p0.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
>   at 
> org.apache.beam.vendor.grpc.v1p21p0.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)
>   at 
> org.apache.beam.vendor.grpc.v1p21p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:297

[jira] [Commented] (BEAM-9192) BigQuery IO on Dataflow runner fails (java.lang.ClassCastException) with --experiment=beam_fn_api

2020-02-14 Thread Chun Yang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037377#comment-17037377
 ] 

Chun Yang commented on BEAM-9192:
-

I see this in 2.16.0 as well.

> BigQuery IO on Dataflow runner fails (java.lang.ClassCastException) with 
> --experiment=beam_fn_api
> -
>
> Key: BEAM-9192
> URL: https://issues.apache.org/jira/browse/BEAM-9192
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Affects Versions: 2.16.0, 2.17.0, 2.18.0, 2.19.0
>Reporter: Bo Shi
>Priority: Major
>
> {noformat}
> python repro.py \
>   --project=CHANGEME \
>   --runner=DataflowRunner \
>   --temp_location=gs://change-me/bshi/tmp \
>   --staging_location=gs://change-me/bshi/stg \
>   --experiment=beam_fn_api \
>   --save_main_session
> {noformat}
> The same repro code works with --runner=Direct.  On Dataflow, the error is
> {noformat}
> java.util.concurrent.ExecutionException: java.lang.ClassCastException: [B 
> cannot be cast to org.apache.beam.sdk.values.KV
>   at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
>   at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
>   at 
> org.apache.beam.sdk.fn.data.CompletableFutureInboundDataClient.awaitCompletion(CompletableFutureInboundDataClient.java:48)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.awaitCompletion(BeamFnDataInboundObserver.java:87)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.data.BeamFnDataGrpcService$DeferredInboundDataClient.awaitCompletion(BeamFnDataGrpcService.java:134)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortReadOperation.finish(RemoteGrpcPortReadOperation.java:83)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:85)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.control.BeamFnMapTaskExecutor.execute(BeamFnMapTaskExecutor.java:125)
>   at 
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.executeWork(BatchDataflowWorker.java:411)
>   at 
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.doWork(BatchDataflowWorker.java:380)
>   at 
> org.apache.beam.runners.dataflow.worker.BatchDataflowWorker.getAndPerformWork(BatchDataflowWorker.java:305)
>   at 
> org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.start(DataflowRunnerHarness.java:195)
>   at 
> org.apache.beam.runners.dataflow.worker.DataflowRunnerHarness.main(DataflowRunnerHarness.java:123)
>   Suppressed: java.lang.IllegalStateException: Already closed.
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataBufferingOutboundObserver.close(BeamFnDataBufferingOutboundObserver.java:93)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortWriteOperation.abort(RemoteGrpcPortWriteOperation.java:220)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:91)
>   ... 6 more
> Caused by: java.lang.ClassCastException: [B cannot be cast to 
> org.apache.beam.sdk.values.KV
>   at 
> org.apache.beam.runners.dataflow.worker.ReifyTimestampAndWindowsParDoFnFactory$ReifyTimestampAndWindowsParDoFn.processElement(ReifyTimestampAndWindowsParDoFnFactory.java:72)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
>   at 
> org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
>   at 
> org.apache.beam.runners.dataflow.worker.fn.data.RemoteGrpcPortReadOperation.consumeOutput(RemoteGrpcPortReadOperation.java:103)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:78)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataInboundObserver.accept(BeamFnDataInboundObserver.java:31)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:138)
>   at 
> org.apache.beam.sdk.fn.data.BeamFnDataGrpcMultiplexer$InboundObserver.onNext(BeamFnDataGrpcMultiplexer.java:125)
>   at 
> org.apache.beam.vendor.grpc.v1p21p0.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:249)
>   at 
> org.apache.beam.vendor.grpc.v1p21p0.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
>   at 
> org.apache.beam.vendor.grpc.v1p21p0.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)
>   at 
> org.apache.beam.vendor.grpc.v1p21p0.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvail

[jira] [Updated] (BEAM-8965) WriteToBigQuery failed in BundleBasedDirectRunner

2020-02-04 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-8965:

Description: 
*{{WriteToBigQuery}}* fails in *{{BundleBasedDirectRunner}}* with the error 
{{PCollection of size 2 with more than one element accessed as a singleton 
view.}}

Here is the code

 
{code:python}
with Pipeline() as p:
    query_results = (
        p
        | beam.io.Read(beam.io.BigQuerySource(
            query='SELECT ... FROM ...'))
    )
    query_results | beam.io.WriteToBigQuery(
        table=...,  # table reference elided in the original report
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        schema={"fields": []},
    )
{code}
 

Here is the error

 
{code:none}
  File "apache_beam/runners/common.py", line 778, in 
apache_beam.runners.common.DoFnRunner.process
    def process(self, windowed_value):
  File "apache_beam/runners/common.py", line 782, in 
apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 849, in 
apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise_with_traceback(new_exn)
  File "apache_beam/runners/common.py", line 780, in 
apache_beam.runners.common.DoFnRunner.process
    return self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 587, in 
apache_beam.runners.common.PerWindowInvoker.invoke_process
    self._invoke_process_per_window(
  File "apache_beam/runners/common.py", line 610, in 
apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
    [si[global_window] for si in self.side_inputs]))
  File 
"/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py",
 line 65, in __getitem__
    _FilteringIterable(self._iterable, target_window), self._view_options)
  File 
"/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py",
 line 443, in _from_runtime_iterable
    len(head), str(head[0]), str(head[1])))
ValueError: PCollection of size 2 with more than one element accessed as a 
singleton view. First two elements encountered are 
"gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", 
"gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 
'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)']
{code}

  was:
{{*{{WriteToBigQuery}}* failed in }}{{*BundleBasedDirectRunner*}}{{ with error 
PCollection of size 2 with more than one element accessed as a singleton view.}}

Here is the code

 
{code:java}
with Pipeline() as p:
query_results = (
p 
| beam.io.Read(beam.io.BigQuerySource(
query='SELECT ... FROM ...')
)
query_results | beam.io.gcp.WriteToBigQuery(
table=,
method=WriteToBigQuery.Method.FILE_LOADS,
schema={"fields": []}
)
{code}
 

Here is the error

 
{code:java}
  File "apache_beam/runners/common.py", line 778, in 
apache_beam.runners.common.DoFnRunner.process
    def process(self, windowed_value):
  File "apache_beam/runners/common.py", line 782, in 
apache_beam.runners.common.DoFnRunner.process
    self._reraise_augmented(exn)
  File "apache_beam/runners/common.py", line 849, in 
apache_beam.runners.common.DoFnRunner._reraise_augmented
    raise_with_traceback(new_exn)
  File "apache_beam/runners/common.py", line 780, in 
apache_beam.runners.common.DoFnRunner.process
    return self.do_fn_invoker.invoke_process(windowed_value)
  File "apache_beam/runners/common.py", line 587, in 
apache_beam.runners.common.PerWindowInvoker.invoke_process
    self._invoke_process_per_window(
  File "apache_beam/runners/common.py", line 610, in 
apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
    [si[global_window] for si in self.side_inputs]))
  File 
"/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/transforms/sideinputs.py",
 line 65, in __getitem__
    _FilteringIterable(self._iterable, target_window), self._view_options)
  File 
"/home/wbai/terra/terra_py2/local/lib/python2.7/site-packages/apache_beam/pvalue.py",
 line 443, in _from_runtime_iterable
    len(head), str(head[0]), str(head[1])))
ValueError: PCollection of size 2 with more than one element accessed as a 
singleton view. First two elements encountered are 
"gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f", 
"gs://temp-dev/temp/bq_load/3edbf2172dd540edb5c8e9597206b10f". [while running 
'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)']
{code}


> WriteToBigQuery failed in BundleBasedDirectRunner
> -
>
> Key: BEAM-8965
> URL: https://issues.apache.org/jira/browse/BEAM-8965
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.16.0

[jira] [Updated] (BEAM-8841) Add ability to perform BigQuery file loads using avro

2019-11-27 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-8841:

Description: 
Currently, JSON format is used for file loads into BigQuery in the Python SDK. 
JSON has some disadvantages including size of serialized data and inability to 
represent NaN and infinity float values.

BigQuery supports loading files in avro format, which can overcome these 
disadvantages. The Java SDK already supports loading files using avro format 
(BEAM-2879) so it makes sense to support it in the Python SDK as well.

The change will be somewhere around 
[{{BigQueryBatchFileLoads}}|https://github.com/apache/beam/blob/3e7865ee6c6a56e51199515ec5b4b16de1ddd166/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L554].
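A hedged sketch of the intended user-facing usage, assuming a {{temp_file_format}}-style option on WriteToBigQuery; table, schema, and input values are placeholders:

{code:python}
# Sketch only: assumes a temp_file_format option on WriteToBigQuery; table,
# schema, input element, and pipeline options are placeholders.
import apache_beam as beam

with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create([{"value": float("nan")}])  # representable in Avro, not JSON
        | beam.io.WriteToBigQuery(
            table="project:dataset.table",  # placeholder
            schema="value:FLOAT",           # placeholder
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            temp_file_format="AVRO",
        )
    )
{code}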

  was:
Currently, JSON format is used for file loads into BigQuery in the Python SDK. 
JSON has some disadvantages including size of serialized data and inability to 
represent NaN and infinity float values.

BigQuery supports loading files in avro format, which can overcome these 
disadvantages. The Java SDK already supports loading files using avro format 
(BEAM-2879) so it makes sense to support it in the Python SDK as well.


> Add ability to perform BigQuery file loads using avro
> -
>
> Key: BEAM-8841
> URL: https://issues.apache.org/jira/browse/BEAM-8841
> Project: Beam
>  Issue Type: Improvement
>  Components: io-py-gcp
>Reporter: Chun Yang
>Priority: Minor
>
> Currently, JSON format is used for file loads into BigQuery in the Python 
> SDK. JSON has some disadvantages including size of serialized data and 
> inability to represent NaN and infinity float values.
> BigQuery supports loading files in avro format, which can overcome these 
> disadvantages. The Java SDK already supports loading files using avro format 
> (BEAM-2879) so it makes sense to support it in the Python SDK as well.
> The change will be somewhere around 
> [{{BigQueryBatchFileLoads}}|https://github.com/apache/beam/blob/3e7865ee6c6a56e51199515ec5b4b16de1ddd166/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L554].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8841) Add ability to perform BigQuery file loads using avro

2019-11-27 Thread Chun Yang (Jira)
Chun Yang created BEAM-8841:
---

 Summary: Add ability to perform BigQuery file loads using avro
 Key: BEAM-8841
 URL: https://issues.apache.org/jira/browse/BEAM-8841
 Project: Beam
  Issue Type: Improvement
  Components: io-py-gcp
Reporter: Chun Yang


Currently, JSON format is used for file loads into BigQuery in the Python SDK. 
JSON has some disadvantages including size of serialized data and inability to 
represent NaN and infinity float values.

BigQuery supports loading files in avro format, which can overcome these 
disadvantages. The Java SDK already supports loading files using avro format 
(BEAM-2879) so it makes sense to support it in the Python SDK as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8451) Interactive Beam example failing from stack overflow

2019-11-22 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang resolved BEAM-8451.
-
  Assignee: Chun Yang  (was: Igor Durovic)
Resolution: Fixed

This was fixed in https://github.com/apache/beam/pull/9865

> Interactive Beam example failing from stack overflow
> 
>
> Key: BEAM-8451
> URL: https://issues.apache.org/jira/browse/BEAM-8451
> Project: Beam
>  Issue Type: Bug
>  Components: examples-python, runner-py-interactive
>Reporter: Igor Durovic
>Assignee: Chun Yang
>Priority: Major
> Fix For: 2.18.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
>  
> RecursionError: maximum recursion depth exceeded in __instancecheck__
> at 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py#L405]
>  
> This occurred after the execution of the last cell in 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/examples/Interactive%20Beam%20Example.ipynb]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8451) Interactive Beam example failing from stack overflow

2019-11-22 Thread Chun Yang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun Yang updated BEAM-8451:

Fix Version/s: 2.18.0

> Interactive Beam example failing from stack overflow
> 
>
> Key: BEAM-8451
> URL: https://issues.apache.org/jira/browse/BEAM-8451
> Project: Beam
>  Issue Type: Bug
>  Components: examples-python, runner-py-interactive
>Reporter: Igor Durovic
>Assignee: Igor Durovic
>Priority: Major
> Fix For: 2.18.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
>  
> RecursionError: maximum recursion depth exceeded in __instancecheck__
> at 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/pipeline_analyzer.py#L405]
>  
> This occurred after the execution of the last cell in 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/interactive/examples/Interactive%20Beam%20Example.ipynb]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)