On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <[email protected]> wrote:

> This is a bit confusing - BigQueryQuerySource and BigQueryTableSource
> indeed use the REST API to read rows if you read them unsplit - however,
> in split() they run extract jobs and produce a bunch of Avro sources that
> are read in parallel. I'm not sure we have any use cases for reading them
> unsplit (except unit tests) - perhaps that code path can be removed?

I believe split() will always be called in production. Maybe not in unit
tests?

> About outputting non-TableRow: per
> https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections,
> it is recommended to output the native type of the connector, unless it's
> impossible to provide a coder for it. This is the case for
> AvroIO.parseGenericRecords, but it's not the case for TableRow, so I would
> recommend against it: you can always map a TableRow to something else
> using MapElements.
>
> On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax <[email protected]> wrote:
>
> > Hi Steve,
> >
> > The BigQuery source should always use extract jobs, regardless of
> > withTemplateCompatibility. What makes you think otherwise?
> >
> > Reuven
> >
> > On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <[email protected]>
> > wrote:
> >
> > > Hello!
> > >
> > > Until now I've been using a custom-built alternative to
> > > BigQueryIO.Read that manually runs a BigQuery extract job (to Avro),
> > > then uses AvroIO.parseGenericRecords() to read the output.
> > >
> > > I'm investigating enhancing the actual BigQueryIO.Read to allow
> > > something similar instead, since it appears a good amount of the
> > > plumbing is already in place to do this. However, I'm confused by
> > > some of the implementation details.
> > >
> > > To start, it seems there are two different read paths:
> > >
> > > - If "withTemplateCompatibility" is set, a method similar to the one
> > > I described above is used: an extract job is started to export the
> > > table to Avro, and AvroSource is used to read the files and transform
> > > them into TableRows.
> > >
> > > - However, if it is not set, the BigQueryReader class simply uses the
> > > REST API to read rows from the tables. In practice, I've seen this
> > > method hit some significant performance limitations.
> > >
> > > It seems to me that for large tables I'd always want to use the first
> > > method, but I'm not sure why the implementation is tied to the oddly
> > > named "withTemplateCompatibility" option. Does anyone have insight
> > > into the implementation details here?
> > >
> > > Additionally, would the community in general be open to enhancements
> > > to BigQueryIO that allow the final output to be something other than
> > > "TableRow" instances, similar to how AvroIO.parseGenericRecords takes
> > > a parseFn?
> > >
> > > Thanks!
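
For illustration, here is a minimal sketch of the pattern Eugene
recommends above - keep the connector's native TableRow output and convert
it downstream with MapElements. It assumes the Beam Java SDK of that era
(BigQueryIO.read(), which newer SDKs spell BigQueryIO.readTableRows()); the
MyRecord type, the table name, and the field names are hypothetical.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;

public class TableRowToDomain {

  // Hypothetical domain type; Serializable so Beam's fallback
  // SerializableCoder can encode it without a custom coder.
  public static class MyRecord implements java.io.Serializable {
    public final String name;
    public final long count;

    public MyRecord(String name, long count) {
      this.name = name;
      this.count = count;
    }
  }

  public static PCollection<MyRecord> read(Pipeline p) {
    return p
        // The connector's native output type: PCollection<TableRow>.
        .apply(BigQueryIO.read().from("project:dataset.table"))
        // Convert to the domain type downstream, per the style guide.
        .apply(MapElements
            .into(TypeDescriptor.of(MyRecord.class))
            .via((TableRow row) -> new MyRecord(
                (String) row.get("name"),
                Long.parseLong(String.valueOf(row.get("count"))))));
  }
}

MyRecord implements Serializable only so that coder inference succeeds; a
hand-rolled coder or AvroCoder would serve equally well.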

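And a sketch of the read half of the custom approach Steve describes -
after a BigQuery extract job (run separately against the BigQuery API) has
written Avro files to GCS, AvroIO.parseGenericRecords() reads them through
a user-supplied parseFn. The GCS filepattern and the record field are
hypothetical, and the extract job itself is not shown.

import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.values.PCollection;

public class ParseExtractedAvro {

  public static PCollection<String> read(Pipeline p) {
    return p.apply(
        AvroIO.parseGenericRecords(
                // parseFn: GenericRecord -> T, here T = String.
                (GenericRecord record) -> String.valueOf(record.get("name")))
            // Filepattern the (separately run) extract job wrote to.
            .from("gs://my-bucket/extract/output-*.avro")
            // Coder inference can fail for lambda parseFns, so set one.
            .withCoder(StringUtf8Coder.of()));
  }
}

This is essentially the shape that a parseFn-taking BigQueryIO read would
generalize, which is what the final question in the thread is asking about.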