[ 
https://issues.apache.org/jira/browse/BEAM-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191085#comment-16191085
 ] 

Ryan Skraba commented on BEAM-2993:
-----------------------------------

Hello -- I'm chiming in to help clarify our use case, which is a bit 
specialized.  However, if it's useful for us, it's potentially useful to others!

As part of our work using Beam, we help users assemble pipelines to run using 
configured "components".  These are eventually translated to PTransforms, of 
course, acting on PCollections -- nothing surprising!   We've picked Avro 
IndexedRecords (not GenericRecords, but that's a detail for the moment) as the 
common currency between the PTransforms.  This works well, especially if you 
know every schema on every collection at design-time, when you're building your 
pipeline.   

[~jkff] is correct that *if* the schema is known  *and* you are using 
{{AvroCoder}}, you already have the schema in your hands when you build the 
{{AvroIO.write}} and all is well.

We have some advanced functionality, however, where we deduce the schema at 
runtime -- either at the start of the pipeline (such as reading from a JDBC 
table and converting the row + table metadata into a consistent IndexedRecord) 
but also in the middle of a pipeline (we can infer a schema after some 
user-defined processing).  In brief, we can't directly use AvroCoder in this 
case, but we can write our own {{Coder<IndexedRecord>}} that takes care of 
sharing the schema between interested nodes when necessary (not with every 
record).

In this case, we've managed to create a {{PCollection<IndexedRecord>}} while 
designing the Pipeline, using our Coder that doesn't require the schema, but we 
still can't attach it to the {{AvroIO.write}}...

Note that "sharing the schema between interested nodes" in our custom coder 
introduces a distributed state between nodes, which is not the ideal for 
parallelization -- in this case, we've measured the cost to be acceptable since 
it only occurs the first time a node tries to write or read an avro-encoded 
record from the collection.

That's a long detour to explain why {{AvroIO.write}} without schema would be 
interesting to us, but I hope you find it useful.   Our technique for sharing 
the schema as distributed state in the Coder is a much larger view but I'm very 
sure we'd be interested in contributing!

> AvroIO.write without specifying a schema
> ----------------------------------------
>
>                 Key: BEAM-2993
>                 URL: https://issues.apache.org/jira/browse/BEAM-2993
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>
> Similarly to https://issues.apache.org/jira/browse/BEAM-2677, we should be 
> able to write to avro files using {{AvroIO}} without specifying a schema at 
> build time. Consider the following use case: a user has a 
> {{PCollection<GenericRecord>}}  but the schema is only known while running 
> the pipeline.  {{AvroIO.writeGenericRecords}} needs the schema, but the 
> schema is already available in {{GenericRecord}}. We should be able to call 
> {{AvroIO.writeGenericRecords()}} with no schema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to