Re: [DISCUSS] HadoopInputFormat based IOs

Stephen Sisk Tue, 23 May 2017 10:25:46 -0700

hey,

Thanks for bringing this up! It's definitely an interesting question and I
can see both sides of the argument.

I can see the appeal of HIFIO wrapper IOs as stop-gaps, and if they have
good test coverage, they do ensure that the HIFIO route is working. If we
have good IT coverage, it also means there are fewer steps involved in
building a native IO, since the ITs will already be written.

However, I think I'm still assuming that the community will implement
native IOs for most data stores that users want to interact with, and thus
I'd still discourage building IOs that are just HIFIO/jdbc wrappers. I'd
personally rather devote time and resources to native IOs. If we don't see
traction on building more native IOs, then I'd be more open to wrappers.

If we do choose to go down this "Don't build HIFIO wrappers, just improve
discoverability" route, one idea I had floating around in my head was that
we might add a section to the Built-in IO Transforms page that covers
"non-native but readable" IOs (better name suggestions appreciated :) ).
That could include a list of data stores that jdbc/jms/hifio support and
link to HIFIO's info on how to use them. (That might also be a good place
to document the performance tradeoffs of using HIFIO.)

S

On Tue, May 23, 2017 at 9:53 AM Ismaël Mejía <ieme...@gmail.com> wrote:

> Hello, I bring this subject to the mailing list to see everybody’s
> opinion on the subject.
>
> The recent inclusion of HadoopInputFormatIO (HiFiIO) gave Beam users
> the option to ‘easily’ include data stores that support the
> Hadoop-based partitioning scheme. There are currently examples of how
> to use it for example to read from Elasticsearch and Cassandra. In
> both cases we already have specific IOs on master or as WIP so using
> HiFiIO based IO is not needed.
>
> During the review of the recent IO for Hive (HCatalog) that uses
> HiFiIO instead of a native API, there was a discussion about the fact
> that this shouldn’t be included as a specific IO but better to add the
> tests/documentation of how to read Hive records using the existing
> HiFiIO. This makes sense from an abstraction point of view, however
> there are visibility issues since end users would need to repackage
> and discover the supported (and tested) HiFi-based IOs that won’t be
> explicit in the code base.
>
> I would like to know what other members of the community think about
> this, is it worth to have individual IOs based on HiFiIO for things
> that we currently don’t support (e.g. Hive or Amazon Redshift) (option
> 1) or maybe it is just better to add just the tests/docs of how to use
> them as proposed in the PR (option 2).
>
> Feel free to comment/vote or maybe add an eventual third option if you
> think there is one better option.
>
> Regards,
> Ismaël Mejía
>
> [1] https://issues.apache.org/jira/browse/BEAM-1158
>
