[ 
https://issues.apache.org/jira/browse/BEAM-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192777#comment-16192777
 ] 

Ryan Skraba commented on BEAM-2993:
-----------------------------------

Very good points, and they anticipate some of our plans!  Fortunately, I'm 
pretty sure we can require that *if* someone chooses to use {{AvroIO.write()}} 
without specifying a schema, they *must* provide a homogeneous collection (all 
records sharing the same schema)! 

But looking ahead, we *are* moving towards heterogeneous collections (or at 
least heterogeneous-ish with a limited number of possible schemas) and there 
are intelligent things we can do in intermediate transforms, such as 
reconciling them into a good, "known" schema.  I don't think it would be 
reasonable or desirable to ask {{AvroIO.write()}} to implement any of this logic.

That being said, {{SchemaRefAndRecord}} is probably what we would need to solve 
the heterogeneous collection problem, but I don't consider that problem related 
to this issue.

For info, before Beam 2.0, we used the Hadoop input format Sink, lazily 
configured when the first record was received, which actually worked very well 
-- but we're pretty motivated to move entirely to the BFS as soon as possible!
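
To illustrate the lazy-configuration idea together with the homogeneous-collection requirement, here is a minimal sketch. It deliberately uses hypothetical stand-in types ({{SchemaCarryingRecord}}, {{LazySchemaWriter}}, and a plain string in place of an Avro schema) rather than the actual Beam, Hadoop, or Avro classes, so it shows only the shape of the approach and not how any existing sink is implemented.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a record that carries its own schema (as a GenericRecord does). */
final class SchemaCarryingRecord {
    final String schema;   // assumption: a String stands in for a real Avro Schema
    final String payload;

    SchemaCarryingRecord(String schema, String payload) {
        this.schema = schema;
        this.payload = payload;
    }
}

/** Hypothetical writer that configures itself from the first record it receives. */
final class LazySchemaWriter {
    private String schema;                                  // null until the first record arrives
    private final List<String> written = new ArrayList<>(); // stand-in for the output file

    void write(SchemaCarryingRecord record) {
        if (schema == null) {
            // Lazy configuration: adopt the first record's schema.
            // A real sink would open the file and write its header here.
            schema = record.schema;
        } else if (!schema.equals(record.schema)) {
            // Enforce the homogeneous-collection contract.
            throw new IllegalArgumentException(
                "Expected schema " + schema + " but got " + record.schema);
        }
        written.add(record.payload);
    }

    String schema() { return schema; }

    int count() { return written.size(); }
}
```

With this shape, the writer needs no schema at construction time, and a record with a different schema fails fast instead of producing a corrupt file.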

> AvroIO.write without specifying a schema
> ----------------------------------------
>
>                 Key: BEAM-2993
>                 URL: https://issues.apache.org/jira/browse/BEAM-2993
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Etienne Chauchot
>            Assignee: Etienne Chauchot
>
> Similarly to https://issues.apache.org/jira/browse/BEAM-2677, we should be 
> able to write to Avro files using {{AvroIO}} without specifying a schema at 
> build time. Consider the following use case: a user has a 
> {{PCollection<GenericRecord>}}, but the schema is only known while running 
> the pipeline.  {{AvroIO.writeGenericRecords}} needs the schema, but the 
> schema is already available in each {{GenericRecord}}. We should be able to 
> call {{AvroIO.writeGenericRecords()}} with no schema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
