Thanks JB for the detailed notes.

On Fri, Mar 23, 2018 at 2:43 PM Eugene Kirpichov <[email protected]> wrote:
> Hi! Thanks for the notes.
>
> On Fri, Mar 23, 2018 at 3:07 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> Hi all,
>>
>> Sorry for the delay, but I got issues with my e-mail provider (I was not
>> able to send e-mails :( ).
>>
>> Last week during Beam Summit, I had the chance to participate in the IO
>> brainstorming session.
>>
>> Here are the minutes:
>>
>> 1. IOs set
>> We now have a decent number of IOs in Beam, and new ones are coming
>> (ParquetIO, RabbitMQIO). Users mentioned a new file format you could
>> support: HDF5. It would be a Python IO.
>> I will create the Jira about HDF5.
>> Other IOs will also be in preparation, coming along with SDF support.
>>
>

As Eila mentioned, we are talking to the HDF5 group to determine if there's
somebody who's willing to write an HDF5 IO for the Python SDK. I'll be happy
to review it. Looks like Eila created
https://issues.apache.org/jira/browse/BEAM-3850 for this.

>
>> 2. IOs and SDKs
>> This point was related to the portability layer: how can I use a Java IO
>> in Python or the opposite? Today, most of the IOs are related to the Java
>> SDK, and it's a bit frustrating for Python SDK users. Users are looking
>> forward to the portability layer, however they also expressed some
>> questions about Docker requirements. I think we should prepare a clean
>> answer to this point.
>>
>
> I'm pretty sure this is on the radar this quarter, but I don't remember
> whose radar.
>

I hope to look into some aspects of this in the next few months. Created
https://issues.apache.org/jira/browse/BEAM-3923 with more info.

>
>
>>
>> 3. PCollection Headers
>> Users want more "dynamic" IOs, maybe that an IO's behavior could change
>> depending on the element they are considering in the PCollection. I
>> introduced what we are using in Apache Camel: Message Headers. The Camel
>> components endpoints (equivalent of Beam IOs) can use the headers: for
>> instance the camel-http component can use a Camel.HTTP_URL header.
>> We already discussed PCollection headers/hints/annotation/metadata
>> (whatever name we give it) and I still think it would be a great feature
>> for both IOs and even the runners.
>> I'm proposing to create a Jira about that, I will be more than happy to
>> work on this one.
>>
>
> Do you have a use case in mind that cannot be solved within the current
> approach to IOs? I think we have a pretty reasonable approach to "dynamic"
> IOs too, exemplified by FileIO.writeDynamic().
>
>
>>
>> 4. Schema
>> As you might know, we are working on adding schema support in
>> PCollection. This feature can be leveraged by IOs. Especially, I think it
>> would reduce the "wrapping" made by IOs (like KafkaRecord, JmsRecord, ...)
>> and ease data conversion.
>>
>> 5. Error Handling
>> Users would need generic error handling in the IOs. Today the error
>> handling is managed by each IO. I introduced the error handler we are
>> using in Apache Camel (sorry again ;)) and especially the default error
>> handler features like: redelivery policy, recoverable/irrecoverable error
>> handling, onWhen, onException, whileTrue, ...
>> The error handler is not at component level but at routing engine level.
>> We could imagine something similar at pipeline level.
>> Thoughts?
>>
>
> Can you give some example use cases here too?
> I'm sure we can add some useful abstractions related to error handling,
> but picking the right level of abstraction for such an API will require
> very careful design. E.g. something like "a pipeline-global deadletter
> collection of records that failed processing" sounds useful in theory, but
> I think is impossible to define in a useful way compatible with the Beam
> model, and I think it has to be left to individual transforms.
>
>
>> I hope I didn't forget something ;)
>>
>> To summarize:
>> - I will create new Jiras for HDF5 and other new IOs
>> - We have to work on documentation/explanation about portability layer &
>>   IOs
>> - I will start a separate thread for error handling discussion
>> - Nothing to do about schema: it has already started.
>>
>> Regards
>> JB
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
