On Thu, Sep 14, 2017 at 1:10 PM Steve Niemitz <[email protected]> wrote:
> I spent a little time trying to refactor it yesterday, but ran into a couple tricky points:
>
> - The BigQuerySource implementation's non-split version (mentioned above) doesn't read from avro files and this would be pretty difficult to "fake" into GenericRecords. It sounds like this could just be dropped completely from what was mentioned previously though.

Agreed. This would be a very welcome contribution.

> - A user-provided SerializableFunction<GenericRecord, T> seems like a good route to providing an extension point (which could be passed directly to the AvroSource used to read the files), but a) goes against the PTransform style guide, and b) is tricky to shoehorn into the current implementation, due to the existing tight coupling of the GenericRecord -> TableRow transform and the rest of BigQuerySource.

If we create a version of BigQueryIO that goes through GenericRecord, then providing a SerializableFunction<GenericRecord, T> will be reasonable, and not against the PTransform Style Guide, because GenericRecord is not encodable unless you know the schema, which in the case of reading from BigQuery you don't.

> - I feel like the ideal solution would really be to provide a coder for GenericRecord, and allow anyone to hook up a MapElements to the output of a BigQueryIO that produces them. However, the fact that GenericRecord bundles the avro schema along with the object means every record serialized would need to also include the full schema. I was playing around with ways to possibly memoize the schema so it doesn't need to be serialized for each record, but I'm not familiar enough with the guarantees a Coder is provided to know if it's safe to do.

I suggest not spending time looking in this direction - I think it's impossible to provide a Coder for GenericRecord with the current Coder API [changing the Coder API may be possible, but it would be a huge undertaking]. A SerializableFunction is a reasonable way to go.
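For concreteness, a minimal sketch of the kind of parse function being discussed - the UserEvent type and its field names are made up for illustration, and only Beam's SerializableFunction and Avro's GenericRecord are real APIs:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.transforms.SerializableFunction;

    // Hypothetical output type; any user type that can be given a coder would do.
    class UserEvent implements java.io.Serializable {
      final String userId;
      final long timestampMillis;

      UserEvent(String userId, long timestampMillis) {
        this.userId = userId;
        this.timestampMillis = timestampMillis;
      }
    }

    // The proposed extension point: GenericRecord -> T.
    class ParseUserEvent implements SerializableFunction<GenericRecord, UserEvent> {
      @Override
      public UserEvent apply(GenericRecord record) {
        // Avro string fields typically come back as Utf8, so convert explicitly.
        return new UserEvent(
            record.get("user_id").toString(),
            (Long) record.get("timestamp_millis"));
      }
    }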
> I hope to have something concrete soon as an example, but in the meantime I'm interested to hear other people's thoughts on the above.
>
> On Thu, Sep 14, 2017 at 3:29 PM, Eugene Kirpichov <[email protected]> wrote:
>
> > Oh, I see what you mean. Yeah, I agree that having BigQueryIO use TableRow as the native format was a suboptimal decision in retrospect, and I agree that it would be reasonable to provide the ability to go through Avro GenericRecord instead. I'm just not sure how to provide it in an API-compatible way - that would be particularly challenging since BigQueryIO is a beast in terms of the amount of code and intermediate transforms involved. But if you have ideas, they would be welcome.
> >
> > On Sat, Sep 9, 2017 at 11:18 AM Steve Niemitz <[email protected]> wrote:
> >
> > > Ah, that makes sense wrt splitting, but is indeed confusing! Thanks for the explanation. :)
> > >
> > > wrt native types and TableRow, I understand your point, but one could also argue that the raw avro records are just as "native" to the BigQuery connector as the TableRow JSON objects, since both are directly exposed by BigQuery.
> > >
> > > Maybe my use-case is more specialized, but I already have a good amount of code that I used pre-Beam to process BigQuery avro extract files, and avro is significantly smaller and more performant than JSON, which is why I'm using it rather than just using TableRows.
> > >
> > > In any case, if there's no desire for such a feature I can always replicate the functionality of BigQueryIO in my own codebase, so it's not a big deal; it just seems like a feature that would be useful for other people as well.
> > >
> > > On Sat, Sep 9, 2017 at 1:55 PM, Reuven Lax <[email protected]> wrote:
> > >
> > > > On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <[email protected]> wrote:
> > > >
> > > > > This is a bit confusing - BigQueryQuerySource and BigQueryTableSource indeed use the REST API to read rows if you read them unsplit - however, in split() they run extract jobs and produce a bunch of Avro sources that are read in parallel. I'm not sure we have any use cases for reading them unsplit (except unit tests) - perhaps that code path can be removed?
> > > >
> > > > I believe split() will always be called in production. Maybe not in unit tests?
> > > >
> > > > > About outputting non-TableRow: per https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections, it is recommended to output the native type of the connector, unless it's impossible to provide a coder for it. This is the case for AvroIO.parseGenericRecords, but it's not the case for TableRow, so I would recommend against it: you can always map a TableRow to something else using MapElements.
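A rough sketch of that TableRow-plus-MapElements pattern, reusing the hypothetical UserEvent type from above - the table name and field names are placeholders, `pipeline` is assumed to be an existing Pipeline, and depending on SDK version the entry point is BigQueryIO.read() or BigQueryIO.readTableRows():

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptor;

    // Read TableRows from BigQuery, then map them to a user type downstream.
    PCollection<UserEvent> events =
        pipeline
            .apply(BigQueryIO.readTableRows().from("my-project:my_dataset.events"))
            .apply(MapElements
                .into(TypeDescriptor.of(UserEvent.class))
                .via((TableRow row) -> new UserEvent(
                    (String) row.get("user_id"),
                    // TableRow values are schema-dependent; numeric fields often
                    // arrive as strings, hence the explicit parse here.
                    Long.parseLong(row.get("timestamp_millis").toString()))));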
> > > > > On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax <[email protected]> wrote:
> > > > >
> > > > > > Hi Steve,
> > > > > >
> > > > > > The BigQuery source should always use extract jobs, regardless of withTemplateCompatibility. What makes you think otherwise?
> > > > > >
> > > > > > Reuven
> > > > > >
> > > > > > On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <[email protected]> wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > Until now I've been using a custom-built alternative to BigQueryIO.Read that manually runs a BigQuery extract job (to avro), then uses AvroIO.parseGenericRecords() to read the output.
> > > > > > >
> > > > > > > I'm investigating instead enhancing the actual BigQueryIO.Read to allow something similar, since it appears a good amount of the plumbing is already in place to do this. However, I'm confused by some of the implementation details.
> > > > > > >
> > > > > > > To start, it seems like there are two different read paths:
> > > > > > >
> > > > > > > - If "withTemplateCompatibility" is set, a similar method to what I described above is used; an extract job is started to export to avro, and AvroSource is used to read the files and transform them into TableRows.
> > > > > > >
> > > > > > > - However, if not set, the BigQueryReader class simply uses the REST API to read rows from the tables. This method, I've seen in practice, has some significant performance limitations.
> > > > > > >
> > > > > > > It seems to me that for large tables I'd always want to use the first method; however, I'm not sure why the implementation is tied to the oddly named "withTemplateCompatibility" option. Does anyone have insight as to the implementation details here?
> > > > > > >
> > > > > > > Additionally, would the community in general be accepting of enhancements to BigQueryIO to allow the final output to be something other than "TableRow" instances, similar to how AvroIO.parseGenericRecords takes a parseFn?
> > > > > > >
> > > > > > > Thanks!
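For reference, a minimal sketch of the custom approach described at the start of this thread (a separately run BigQuery extract job to Avro, then AvroIO.parseGenericRecords over its output). The GCS path is a placeholder, `pipeline` is assumed to be an existing Pipeline, and ParseUserEvent is the hypothetical parse function sketched earlier:

    import org.apache.beam.sdk.coders.SerializableCoder;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.values.PCollection;

    // Parse the extract job's Avro output directly into the user type,
    // skipping the GenericRecord -> TableRow -> user type detour entirely.
    PCollection<UserEvent> events =
        pipeline.apply(
            AvroIO.parseGenericRecords(new ParseUserEvent())
                .from("gs://my-bucket/bq-extract/events-*.avro")
                .withCoder(SerializableCoder.of(UserEvent.class)));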
