I wanted to follow up on this thread since I see some potentially blocking questions arising, and I'm trying to help Dipti along with her PR.
Dipti's PR [1] is currently written to put files into io/hadoop/inputformat. The recent changes to create hadoop-common created io/hadoop-common. This means that if we take the HIFIO PR as-is, the overall structure would be:

  io/hadoop/inputformat - the HIFIO (copies of some code in hadoop-common and hdfs, but no dependency on hadoop-common)
  io/hadoop-common      - module with some shared code
  io/hbase              - HBase IO transforms
  io/hdfs               - FileInputFormat IO transforms; much shared code with hadoop/inputformat

I don't think that's great, because there's a common dir but only some directories use it, and there's lots of similar-but-slightly-different code in hadoop/inputformat and hdfs. I don't believe anyone intends this to be the final result.

After looking at the comments in this thread, I'd like to recommend the following end result:

(#1)
  io/hadoop        - the HIFIO (depends on hadoop-common); contains both HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (i.e. the contents of hdfs and hadoop/inputformat)
  io/hadoop-common - module with some shared code
  io/hbase         - HBase IO transforms

To get there I propose the following steps:

1. Finish the current PR [1], changing it only to rename the containing module from hadoop/inputformat to hadoop and to take a dependency on hadoop-common.
2. Someone does cleanup to reconcile the hdfs and hadoop directories, including renaming the files so they make sense.

I would also be fine with:

(#2)
  io/hadoop - container dir only
  io/hadoop/common
  io/hadoop/hbase
  io/hadoop/inputformat

I think the downside of #2 is that it hides hbase, which I think deserves to be top level.
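For concreteness, the dependency change in step 1 would look roughly like the fragment below in the io/hadoop module's pom.xml. This is a sketch only: the groupId/artifactId shown are assumptions based on Beam's usual naming, not taken from the actual PR.

```xml
<!-- Sketch: assumed coordinates for depending on the hadoop-common module -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-hadoop-common</artifactId>
  <version>${project.version}</version>
</dependency>
```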
Other comments:

It should be noted that when we have all modules use hadoop-common, we'll be forcing all Hadoop modules to have the same dependencies on Hadoop. I think this makes sense, but it's worth noting as the one advantage of the "every Hadoop IO transform has its own Hadoop dependency" approach.

On the naming discussion: I personally prefer "inputformat" as the name of the directory, but I defer to the folks who know the Hadoop community better.

S

[1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
[2] HdfsIO dependency change PR - https://github.com/apache/beam/pull/2087

On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <dipti_dkulka...@persistent.com> wrote:

> Thank you all for your inputs!
>
> -----Original Message-----
> From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
> Sent: Friday, February 17, 2017 12:17 PM
> To: dev@beam.apache.org
> Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module
>
> Raghu, Amit -- +1 to your expertise :)
>
> On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
>
> > I agree with Dan on everything regarding HdfsFileSystem - it's super
> > convenient for users to use TextIO with HdfsFileSystem rather than
> > replacing the IO and also specifying the InputFormat type.
> >
> > I disagree on "HadoopIO" - I think that people who work with Hadoop
> > would find this name intuitive, and that's what's important.
> > Even more, and joining Raghu's comment, it is also recognized as
> > "compatible with Hadoop", so for example someone running a Beam
> > pipeline using the Spark runner on Amazon's S3 who wants to read/write
> > Hadoop sequence files would simply use HadoopIO and provide the
> > appropriate runtime dependencies (actually true for GS as well).
> >
> > On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi <rang...@google.com.invalid> wrote:
> >
> > > FileInputFormat is extremely widely used; pretty much all the file
> > > based input formats extend it.
> > > All of them call into it to list the input files and split (with
> > > some tweaks on top of that). The special API
> > > (*FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes)*)
> > > is how the split size is normally communicated.
> > > New IO can use the API directly.
> > >
> > > HdfsIO as implemented in Beam is not HDFS-specific at all. There are
> > > no hdfs imports, and the HDFS name does not appear anywhere other
> > > than in HdfsIO's own class and method names. AvroHdfsFileSource etc.
> > > would work just as well with the new IO.
> > >
> > > On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin <dhalp...@google.com.invalid> wrote:
> > >
> > > > (And I think renaming to HadoopIO doesn't make sense. "InputFormat"
> > > > is the key component of the name -- it reads things that implement
> > > > the InputFormat interface. "Hadoop" means a lot more than that.)
> > >
> > > Often 'IO' in Beam implies both sources and sinks. It might not be
> > > long before we support Hadoop OutputFormat as well. In addition,
> > > HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a
> > > lot of things depending on the context. In the 'IO' context it might
> > > not be too broad. Normally it implies 'any FileSystem supported in
> > > Hadoop, e.g. S3'.
> > >
> > > Either way, I am quite confident that once HadoopInputFormatIO is
> > > written, it can easily replace HdfsIO. That decision could be made
> > > later.
> > >
> > > Raghu.
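As a rough illustration of the split sizing discussed in the quoted thread: this is a simplified sketch of the arithmetic FileInputFormat performs, showing why setMinInputSplitSize(job, desiredBundleSizeBytes) is how a desired bundle size gets communicated. The class and method names here are illustrative, not Beam or Hadoop API, and the real getSplits() also applies a slop factor on the last split, which is omitted.

```java
// Simplified sketch of FileInputFormat-style split sizing.
public class SplitSizeSketch {

    // Mirrors the formula splitSize = max(minSize, min(maxSize, blockSize)),
    // so raising minSize (via setMinInputSplitSize) forces splits to be at
    // least as large as the desired bundle size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for a file of the given length (slop factor ignored).
    static int numSplits(long fileLength, long splitSize) {
        return (int) ((fileLength + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20;      // 128 MB HDFS block
        long desiredBundle = 256L << 20;  // runner asks for 256 MB bundles
        long splitSize = computeSplitSize(blockSize, desiredBundle, Long.MAX_VALUE);
        // With a 256 MB minimum, a 1 GB file is read as 4 splits of 256 MB.
        System.out.println(splitSize + " " + numSplits(1L << 30, splitSize));
    }
}
```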