Hello, I am bringing this topic to the mailing list to hear everybody’s opinion.
The recent inclusion of HadoopInputFormatIO (HiFiIO) gave Beam users the option to ‘easily’ read from data stores that support the Hadoop-based partitioning scheme. There are currently examples of how to use it, for instance to read from Elasticsearch and Cassandra. In both cases we already have specific IOs, either on master or as WIP, so a HiFiIO-based IO is not needed there.

During the review of the recent IO for Hive (HCatalog), which uses HiFiIO instead of a native API, there was a discussion about whether it should be included as a specific IO at all, or whether it would be better to just add tests and documentation showing how to read Hive records with the existing HiFiIO. The latter makes sense from an abstraction point of view. However, it creates visibility issues: the supported (and tested) HiFi-based IOs would not be explicit in the code base, so end users would have to discover them and do the packaging themselves.

I would like to know what other members of the community think about this:

Option 1: have individual IOs based on HiFiIO for data stores that we currently don’t support (e.g. Hive or Amazon Redshift).

Option 2: add only the tests/docs showing how to use them with HiFiIO, as proposed in the PR.

Feel free to comment or vote, or to propose a third option if you think there is a better one.

Regards,
Ismaël Mejía

[1] https://issues.apache.org/jira/browse/BEAM-1158