[DISCUSS] HadoopInputFormat based IOs

Ismaël Mejía Tue, 23 May 2017 09:53:28 -0700

Hello, I am bringing this subject to the mailing list to hear
everybody’s opinion on it.


The recent inclusion of HadoopInputFormatIO (HiFiIO) gave Beam users
the option to ‘easily’ include data stores that support the
Hadoop-based partitioning scheme. There are currently examples of how
to use it, for instance to read from Elasticsearch and Cassandra. In
both cases we already have specific IOs on master or as WIP, so a
HiFiIO-based IO is not needed for them.

During the review of the recent IO for Hive (HCatalog) that uses
HiFiIO instead of a native API, there was a discussion about the fact
that it shouldn’t be included as a specific IO, and that it would be
better to add tests/documentation showing how to read Hive records
using the existing HiFiIO. This makes sense from an abstraction point
of view; however, there are visibility issues, since end users would
need to repackage and discover the supported (and tested) HiFi-based
IOs that won’t be explicit in the code base.

I would like to know what other members of the community think about
this: is it worth having individual IOs based on HiFiIO for things
that we currently don’t support (e.g. Hive or Amazon Redshift) (option
1), or is it better to add just the tests/docs of how to use them, as
proposed in the PR (option 2)?

Feel free to comment/vote, or to propose a third option if you think
there is a better one.

Regards,
Ismaël Mejía

[1] https://issues.apache.org/jira/browse/BEAM-1158
