Heh maybe we can merge our merge requests :P I just posted mine here as well: https://github.com/apache/beam/pull/3894
On Mon, Sep 25, 2017 at 11:30 AM, Eugene Kirpichov <[email protected]> wrote:

Hi,
I actually also made a PR for this myself the other day, but forgot to update this thread, sorry :-|
https://github.com/apache/beam/pull/3891

I also hit this conversion issue, and it is intended behavior, because TableRow is the JSON representation of the data and it matches what BigQuery does when exporting to JSON format: it converts numbers to STRING. Yes, some tests were incorrect.

On Mon, Sep 25, 2017 at 7:13 AM Steve Niemitz <[email protected]> wrote:

I've just about got something to send for a PR, I'll push it shortly.

One thing I noticed, though, while running the BigQueryIO unit tests that I found strange: it looks like the TableRow / Avro / TableRowJsonCoder interaction currently changes BigQuery INTEGERs to strings in the public interface, e.g. TableRow.get("someInteger") returns a String, not a Long. The strange thing is that some unit tests do not mimic this behavior. For example, testBigQueryTableSourceInitSplit asserts that the long values exist in the result set, not the string versions. On the other hand, testValidateReadSetsDefaultProject gets the value as a String and converts it to a Long.

It seems like the behavior of converting to strings is intended, and the unit tests are incorrect? Can anyone confirm?
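A minimal, hand-constructed illustration of the conversion being described ("someInteger" is just the column name from the example above; the row literal is made up):

    import com.google.api.services.bigquery.model.TableRow;

    public class TableRowIntegerExample {
      public static void main(String[] args) {
        // After the Avro -> TableRow conversion in BigQueryIO, an INTEGER
        // column is carried as a String, matching BigQuery's JSON export.
        TableRow row = new TableRow().set("someInteger", "42");

        // So the value has to be re-parsed rather than cast to Long:
        long someInteger = Long.parseLong((String) row.get("someInteger"));
        System.out.println(someInteger);
      }
    }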
On Thu, Sep 14, 2017 at 5:31 PM, Steve Niemitz <[email protected]> wrote:

Cool, thanks for the shove in the right direction!

I'll probably have some more time to spend on this early next week; hopefully I'll have a PR to submit after that. :)

On Thu, Sep 14, 2017 at 4:37 PM, Eugene Kirpichov <[email protected]> wrote:

On Thu, Sep 14, 2017 at 1:10 PM Steve Niemitz <[email protected]> wrote:

> I spent a little time trying to refactor it yesterday, but ran into a couple of tricky points:
>
> - The BigQuerySource implementation's non-split version (mentioned above) doesn't read from avro files, and this would be pretty difficult to "fake" into GenericRecords. It sounds like this could just be dropped completely from what was mentioned previously, though.

Agreed. This would be a very welcome contribution.

> - A user-provided SerializableFunction<GenericRecord, T> seems like a good route to providing an extension point (which could be passed directly to the AvroSource used to read the files), but a) it goes against the PTransform style guide, and b) it is tricky to shoehorn into the current implementation, due to the existing tight coupling of the GenericRecord -> TableRow transform and the rest of BigQuerySource.

If we create a version of BigQueryIO that goes through GenericRecord, then providing a SerializableFunction<GenericRecord, T> will be reasonable, and not against the PTransform style guide, because GenericRecord is not encodable unless you know the schema, which in the case of reading from BigQuery you don't.

> - I feel like the ideal solution would really be to provide a coder for GenericRecord, and allow anyone to hook up a MapElements to the output of a BigQueryIO that produces them. However, the fact that GenericRecord bundles the avro schema along with the object means every serialized record would also need to include the full schema. I was playing around with ways to possibly memoize the schema so it doesn't need to be serialized for each record, but I'm not familiar enough with the guarantees a Coder is given to know if it's safe to do.

I suggest not spending time looking in this direction - I think it's impossible to provide a Coder for GenericRecord with the current Coder API (changing the Coder API may be possible, but it would be a huge undertaking). A SerializableFunction is a reasonable way to go.

> I hope to have something concrete soon as an example, but in the meantime I'm interested to hear other people's thoughts on the above.

On Thu, Sep 14, 2017 at 3:29 PM, Eugene Kirpichov <[email protected]> wrote:

Oh, I see what you mean. Yeah, I agree that having BigQueryIO use TableRow as the native format was a suboptimal decision in retrospect, and I agree that it would be reasonable to provide the ability to go through Avro GenericRecord instead. I'm just not sure how to provide it in an API-compatible way - that would be particularly challenging, since BigQueryIO is a beast in terms of the amount of code and intermediate transforms involved. But if you have ideas, they would be welcome.

On Sat, Sep 9, 2017 at 11:18 AM Steve Niemitz <[email protected]> wrote:

Ah, that makes sense wrt splitting, but is indeed confusing! Thanks for the explanation. :)

wrt native types and TableRow, I understand your point, but one could also argue that the raw avro records are just as "native" to the BigQuery connector as the TableRow JSON objects, since both are directly exposed by BigQuery.

Maybe my use case is more specialized, but I already have a good amount of code that I used pre-Beam to process BigQuery avro extract files, and avro is significantly smaller and more performant than JSON, which is why I'm using it rather than just using TableRows.

In any case, if there's no desire for such a feature I can always replicate the functionality of BigQueryIO in my own codebase, so it's not a big deal; it just seems like a feature that would be useful for other people as well.
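To make the parse-function route from the Sep 14 exchange above concrete, here is a rough sketch - illustrative only, not the API from either PR; the Event type and the "user_id" / "score" columns are made up. The point is that the Avro record carries typed values, so a BigQuery INTEGER can be read as a long instead of being re-parsed from a String:

    import java.io.Serializable;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.transforms.SerializableFunction;

    // Hypothetical domain type, just for the example.
    class Event implements Serializable {
      final long userId;
      final double score;

      Event(long userId, double score) {
        this.userId = userId;
        this.score = score;
      }
    }

    // The kind of parseFn a GenericRecord-based BigQueryIO read could accept,
    // analogous to what AvroIO.parseGenericRecords() takes today.
    class ParseEventFn implements SerializableFunction<GenericRecord, Event> {
      @Override
      public Event apply(GenericRecord record) {
        // BigQuery INTEGER/FLOAT columns map to Avro long/double in extract
        // files, so the values come back with their native types.
        return new Event((Long) record.get("user_id"), (Double) record.get("score"));
      }
    }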
On Sat, Sep 9, 2017 at 1:55 PM, Reuven Lax <[email protected]> wrote:

On Sat, Sep 9, 2017 at 10:53 AM, Eugene Kirpichov <[email protected]> wrote:

> This is a bit confusing - BigQueryQuerySource and BigQueryTableSource indeed use the REST API to read rows if you read them unsplit - however, in split() they run extract jobs and produce a bunch of Avro sources that are read in parallel. I'm not sure we have any use cases for reading them unsplit (except unit tests) - perhaps that code path can be removed?

I believe split() will always be called in production. Maybe not in unit tests?

> About outputting non-TableRow: per https://beam.apache.org/contribute/ptransform-style-guide/#choosing-types-of-input-and-output-pcollections, it is recommended to output the native type of the connector, unless it's impossible to provide a coder for it. This is the case for AvroIO.parseGenericRecords, but it's not the case for TableRow, so I would recommend against it: you can always map a TableRow to something else using MapElements.

On Sat, Sep 9, 2017 at 10:37 AM Reuven Lax <[email protected]> wrote:

Hi Steve,

The BigQuery source should always use extract jobs, regardless of withTemplateCompatibility. What makes you think otherwise?

Reuven
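As a concrete (made-up) version of the "you can always map a TableRow to something else using MapElements" suggestion above - the table and the "user_id" column are placeholders, and the Long.parseLong reflects the String representation of INTEGER values discussed in the Sep 25 messages at the top of the thread:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class MapTableRowsExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read TableRows and immediately convert them to the type the rest
        // of the pipeline wants; the table and column names are made up.
        PCollection<Long> userIds =
            p.apply(BigQueryIO.readTableRows().from("my_project:my_dataset.events"))
             .apply(MapElements.into(TypeDescriptors.longs())
                 .via((TableRow row) -> Long.parseLong((String) row.get("user_id"))));

        p.run().waitUntilFinish();
      }
    }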
On Sat, Sep 9, 2017 at 9:35 AM, Steve Niemitz <[email protected]> wrote:

Hello!

Until now I've been using a custom-built alternative to BigQueryIO.Read that manually runs a BigQuery extract job (to avro), then uses AvroIO.parseGenericRecords() to read the output.

I'm investigating instead enhancing the actual BigQueryIO.Read to allow something similar, since it appears a good amount of the plumbing is already in place to do this. However, I'm confused by some of the implementation details.

To start, it seems like there are two different read paths:

- If "withTemplateCompatibility" is set, a method similar to what I described above is used: an extract job is started to export to avro, and AvroSource is used to read the files and transform them into TableRows.

- However, if it is not set, the BigQueryReader class simply uses the REST API to read rows from the tables. This method, I've seen in practice, has some significant performance limitations.

It seems to me that for large tables I'd always want to use the first method; however, I'm not sure why the implementation is tied to the oddly named "withTemplateCompatibility" option. Does anyone have insight into the implementation details here?

Additionally, would the community in general be accepting of enhancements to BigQueryIO that allow the final output to be something other than "TableRow" instances, similar to how AvroIO.parseGenericRecords takes a parseFn?

Thanks!
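For reference, the read half of that workflow (parsing already-extracted Avro files with a parseFn) looks roughly like this; the GCS path and the "user_id" column are placeholders, and the extract job itself is assumed to have been run separately:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.coders.VarLongCoder;
    import org.apache.beam.sdk.io.AvroIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadExtractedAvroExample {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Parse each GenericRecord from the (hypothetical) extract location
        // directly into a typed value - no TableRow/JSON step involved.
        PCollection<Long> userIds =
            p.apply(AvroIO.parseGenericRecords(
                    (GenericRecord record) -> (Long) record.get("user_id"))
                .from("gs://my-bucket/extracts/events-*.avro")
                .withCoder(VarLongCoder.of()));

        p.run().waitUntilFinish();
      }
    }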
