Hi all,

We are currently running Beam on Spark, reading and writing Avro files on
HDFS.

Until now we have been using HDFSFileSource for reading and HadoopIO for
writing, essentially reading and writing a PCollection<AvroKey<GenericRecord>>.
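
For context, the current setup looks roughly like the sketch below (paths are
placeholders, and the factory-method signatures are written from memory of the
pre-BEAM-1497 API, so they may not be exact):

  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.mapred.AvroKey;
  import org.apache.avro.mapreduce.AvroKeyInputFormat;
  import org.apache.avro.mapreduce.AvroKeyOutputFormat;
  import org.apache.beam.runners.spark.io.hadoop.HadoopIO;
  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.io.Read;
  import org.apache.beam.sdk.io.hdfs.HDFSFileSource;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;
  import org.apache.hadoop.io.NullWritable;

  public class AvroOnHdfs {
    public static void main(String[] args) {
      Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

      // Read Avro container files from HDFS as KV<AvroKey<GenericRecord>, NullWritable>;
      // the AvroKey part is what relied on AvroWrapperCoder.
      p.apply(Read.from(HDFSFileSource.from(
              "hdfs:///data/input/*.avro",   // placeholder path
              AvroKeyInputFormat.class,
              AvroKey.class,
              NullWritable.class)))
          // ... transforms on the records ...
          // Write back via the Spark runner's HadoopIO and Avro's output format.
          .apply(HadoopIO.Write.to(
              "hdfs:///data/output",         // placeholder path
              AvroKeyOutputFormat.class,
              AvroKey.class,
              NullWritable.class));

      p.run();
    }
  }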

With the changes introduced by
https://issues.apache.org/jira/browse/BEAM-1497 this no longer seems to be
directly supported by Beam, as the required AvroWrapperCoder has been
removed.

Since we have to change our code anyway, we are wondering what the
recommended approach is for reading/writing Avro files from/to HDFS with Beam
on Spark:

- use the new implementation of HDFSFileSource/HDFSFileSink
- use the Spark runner's HadoopIO (and probably reimplement AvroWrapperCoder
ourselves? see the sketch below)
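
For the second option, I imagine the coder could be rebuilt on top of
AvroCoder along these lines (just an untested sketch; the AvroKeyCoder class
is our own name, and the Context-based encode/decode signatures may differ
between SDK versions):

  import java.io.IOException;
  import java.io.InputStream;
  import java.io.OutputStream;
  import org.apache.avro.Schema;
  import org.apache.avro.generic.GenericRecord;
  import org.apache.avro.mapred.AvroKey;
  import org.apache.beam.sdk.coders.AtomicCoder;
  import org.apache.beam.sdk.coders.AvroCoder;

  /** Sketch of a replacement coder: delegates to AvroCoder for the wrapped datum. */
  public class AvroKeyCoder extends AtomicCoder<AvroKey<GenericRecord>> {

    private final AvroCoder<GenericRecord> datumCoder;

    private AvroKeyCoder(Schema schema) {
      this.datumCoder = AvroCoder.of(schema);
    }

    public static AvroKeyCoder of(Schema schema) {
      return new AvroKeyCoder(schema);
    }

    @Override
    public void encode(AvroKey<GenericRecord> value, OutputStream out, Context context)
        throws IOException {
      // Unwrap the AvroKey and let AvroCoder serialize the GenericRecord.
      datumCoder.encode(value.datum(), out, context);
    }

    @Override
    public AvroKey<GenericRecord> decode(InputStream in, Context context)
        throws IOException {
      // Rebuild the AvroKey wrapper around the decoded record.
      return new AvroKey<>(datumCoder.decode(in, context));
    }
  }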

What are the trade-offs here, also taking into account changes to the IOs
that are already planned? Is there an advantage to using the Spark runner's
HadoopIO, given that our underlying engine is currently Spark, or will it
eventually be deprecated and exist only for 'historical' reasons?

Any thoughts and advice here?

Regards,

michel
