[ https://issues.apache.org/jira/browse/BEAM-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191085#comment-16191085 ]
Ryan Skraba commented on BEAM-2993: ----------------------------------- Hello -- I'm chiming in to help clarify our use case, which is a bit specialized. However, if it's useful for us, it's potentially useful to others! As part of our work using Beam, we help users assemble pipelines to run using configured "components". These are eventually translated to PTransforms, of course, acting on PCollections -- nothing surprising! We've picked Avro IndexedRecords (not GenericRecords, but that's a detail for the moment) as the common currency between the PTransforms. This works well, especially if you know every schema on every collection at design-time, when you're building your pipeline. [~jkff] is correct that *if* the schema is known *and* you are using {{AvroCoder}}, you already have the schema in your hands when you build the {{AvroIO.write}} and all is well. We have some advanced functionality, however, where we deduce the schema at runtime -- either at the start of the pipeline (such as reading from a JDBC table and converting the row + table metadata into a consistent IndexedRecord) but also in the middle of a pipeline (we can infer a schema after some user-defined processing). In brief, we can't directly use AvroCoder in this case, but we can write our own {{Coder<IndexedRecord>}} that takes care of sharing the schema between interested nodes when necessary (not with every record). In this case, we've managed to create a {{PCollection<IndexedRecord>}} while designing the Pipeline, using our Coder that doesn't require the schema, but we still can't attach it to the {{AvroIO.write}}... Note that "sharing the schema between interested nodes" in our custom coder introduces a distributed state between nodes, which is not the ideal for parallelization -- in this case, we've measured the cost to be acceptable since it only occurs the first time a node tries to write or read an avro-encoded record from the collection. That's a long detour to explain why {{AvroIO.write}} without schema would be interesting to us, but I hope you find it useful. Our technique for sharing the schema as distributed state in the Coder is a much larger view but I'm very sure we'd be interested in contributing! > AvroIO.write without specifying a schema > ---------------------------------------- > > Key: BEAM-2993 > URL: https://issues.apache.org/jira/browse/BEAM-2993 > Project: Beam > Issue Type: Improvement > Components: sdk-java-extensions > Reporter: Etienne Chauchot > Assignee: Etienne Chauchot > > Similarly to https://issues.apache.org/jira/browse/BEAM-2677, we should be > able to write to avro files using {{AvroIO}} without specifying a schema at > build time. Consider the following use case: a user has a > {{PCollection<GenericRecord>}} but the schema is only known while running > the pipeline. {{AvroIO.writeGenericRecords}} needs the schema, but the > schema is already available in {{GenericRecord}}. We should be able to call > {{AvroIO.writeGenericRecords()}} with no schema. -- This message was sent by Atlassian JIRA (v6.4.14#64029)