[jira] [Commented] (BEAM-48) BigQueryIO.Read reimplemented as BoundedSource

2017-05-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993646#comment-15993646
 ] 

ASF GitHub Bot commented on BEAM-48:


Github user asfgit closed the pull request at:

https://github.com/apache/beam/pull/2832


> BigQueryIO.Read reimplemented as BoundedSource
> --
>
> Key: BEAM-48
> URL: https://issues.apache.org/jira/browse/BEAM-48
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-gcp
>Reporter: Daniel Halperin
>Assignee: Pei He
> Fix For: 0.1.0-incubating
>
>
> BigQueryIO.Read is currently implemented in a hacky way: the 
> DirectPipelineRunner streams all rows in the table or query result directly 
> using the JSON API, in a single-threaded manner.
> In contrast, the DataflowPipelineRunner uses an entirely different code path 
> implemented in the Google Cloud Dataflow service (a BigQuery export job to 
> GCS, followed by a parallel read from GCS).
> We need to reimplement BigQueryIO as a BoundedSource in order to support 
> other runners in a scalable way.
> I additionally suggest that we revisit the design of the BigQueryIO source in 
> the process. A short list:
> * Do not use TableRow as the default value for rows. It could be Map<String, 
> Object> with well-defined types, for example, or an Avro GenericRecord. 
> Dropping TableRow will get around a variety of issues with types, fields 
> named 'f', etc., and it will also reduce confusion as we use TableRow objects 
> differently than usual (for good reason).
> * We could also directly add support for a RowParser to a user's POJO.
> * We should expose TableSchema as a side output from the BigQueryIO.Read.
> * Our builders for BigQueryIO.Read are useful and we should keep them. Where 
> possible we should also allow users to provide the JSON objects that 
> configure the underlying intermediate tables, query export, etc. This would 
> let users directly control result flattening, location of intermediate 
> tables, table decorators, etc., and also optimistically let users take 
> advantage of some new BigQuery features without code changes.
> * We could switch between a BigQuery export + parallel scan and a direct 
> API read based on factors such as the size of the table at pipeline 
> construction time.
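The last bullet above, choosing between an export-based scan and a direct API read at pipeline construction time, can be sketched as a simple size-based heuristic. This is a hypothetical plain-Java sketch: neither the `ReadPath` enum, the `choose` method, nor the threshold value exists in BigQueryIO.

```java
// Hypothetical construction-time heuristic for picking a read path.
// Illustrates the decision proposed in the last bullet; the threshold
// and names are assumptions, not part of the Beam API.
public class ReadPathChooser {
    enum ReadPath { EXPORT_AND_PARALLEL_SCAN, DIRECT_API_READ }

    // Assumed cutoff: small tables are cheap to stream via the JSON API,
    // while large tables amortize the cost of a GCS export job.
    static final long EXPORT_THRESHOLD_BYTES = 128L * 1024 * 1024;

    static ReadPath choose(long tableSizeBytes) {
        return tableSizeBytes >= EXPORT_THRESHOLD_BYTES
            ? ReadPath.EXPORT_AND_PARALLEL_SCAN
            : ReadPath.DIRECT_API_READ;
    }
}
```

The table size is known from table metadata at construction time, so no data needs to be read to make the choice.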



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-48) BigQueryIO.Read reimplemented as BoundedSource

2017-05-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993368#comment-15993368
 ] 

ASF GitHub Bot commented on BEAM-48:


GitHub user dhalperi opened a pull request:

https://github.com/apache/beam/pull/2832

[BEAM-48] BigQuery: swap from asSingleton to asIterable for Cleanup

asIterable can be simpler for runners to implement, as it does not semantically
require that the PCollection being viewed contains exactly one element.
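The contract difference behind that change can be illustrated outside of Beam: a singleton view is only defined when the viewed collection holds exactly one element, while an iterable view carries no cardinality requirement. This is a hedged stand-in sketch; the helpers below are not Beam's View.asSingleton/View.asIterable implementations.

```java
import java.util.List;

// Plain-Java stand-ins for the semantic contracts of View.asSingleton
// and View.asIterable (hypothetical helpers, not the Beam API).
public class Views {
    // Singleton view: fails on zero or many elements.
    static <T> T asSingleton(List<T> viewed) {
        if (viewed.size() != 1) {
            throw new IllegalArgumentException(
                "singleton view requires exactly one element, got " + viewed.size());
        }
        return viewed.get(0);
    }

    // Iterable view: any element count is acceptable, including empty.
    static <T> Iterable<T> asIterable(List<T> viewed) {
        return viewed;
    }
}
```

For a cleanup step that only needs to see whatever results exist, the iterable contract is the weaker and therefore easier one for a runner to satisfy.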


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dhalperi/beam bq-singleton

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/beam/pull/2832.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2832







