I wanted to follow up on this thread since I see some potentially blocking questions arising, and I'm trying to help Dipti along with her PR.
Dipti's PR [1] is currently written to put files into io/hadoop/inputformat. The recent changes to create hadoop-common created io/hadoop-common. This means that if we take the HIFIO PR as-is, the overall structure would be:

  io/hadoop/inputformat - the HIFIO (copies of some code in hadoop-common and hdfs, but no dependency on hadoop-common)
  io/hadoop-common      - module with some shared code
  io/hbase              - HBase IO transforms
  io/hdfs               - FileInputFormat IO transforms; much shared code with hadoop/inputformat

I don't think that's great, because there's a common dir but only some directories use it, and there's lots of similar-but-slightly-different code in hadoop/inputformat and hdfs. I don't believe anyone intends this to be the final result.

After looking at the comments in this thread, I'd like to recommend the following end result:

(#1)
  io/hadoop        - the HIFIO (depends on hadoop-common); contains both HadoopInputFormatIO.java and HDFSFileSink/HDFSFileSource (i.e. the contents of hdfs and hadoop/inputformat)
  io/hadoop-common - module with some shared code
  io/hbase         - HBase IO transforms

To get there I propose the following steps:

1. Finish the current PR [1], changing it only to rename the containing module from hadoop/inputformat to hadoop and to take a dependency on hadoop-common.
2. Someone does cleanup to reconcile the hdfs and hadoop directories, including renaming the files so they make sense.

I would also be fine with:

(#2)
  io/hadoop - container dir only
  io/hadoop/common
  io/hadoop/hbase
  io/hadoop/inputformat

I think the downside of #2 is that it hides hbase, which I think deserves to be top level.
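For concreteness, the dependency change in step 1 would look roughly like the fragment below in the io/hadoop module's pom.xml. This is a sketch only: the groupId/artifactId shown are assumptions based on Beam's usual naming, not taken from the actual PR.

```xml
<!-- Sketch: assumed coordinates for depending on the hadoop-common module -->
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-hadoop-common</artifactId>
  <version>${project.version}</version>
</dependency>
```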
Other comments:

It should be noted that when we have all modules use hadoop-common, we'll be forcing all Hadoop modules to have the same dependencies on Hadoop. I think this makes sense, but it's worth noting as the one advantage of the "every Hadoop IO transform has its own Hadoop dependency" approach.

On the naming discussion: I personally prefer "inputformat" as the name of the directory, but I defer to the folks who know the Hadoop community better.

S

[1] HadoopInputFormatIO PR - https://github.com/apache/beam/pull/1994
[2] HdfsIO dependency change PR - https://github.com/apache/beam/pull/2087

On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni <dipti_dkulka...@persistent.com> wrote:

> Thank you all for your inputs!
>
> -----Original Message-----
> From: Dan Halperin [mailto:dhalp...@google.com.INVALID]
> Sent: Friday, February 17, 2017 12:17 PM
> To: dev@beam.apache.org
> Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module
>
> Raghu, Amit -- +1 to your expertise :)
>
> On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <amitsel...@gmail.com> wrote:
>
> > I agree with Dan on everything regarding HdfsFileSystem - it's super
> > convenient for users to use TextIO with HdfsFileSystem rather than
> > replacing the IO and also specifying the InputFormat type.
> >
> > I disagree on "HadoopIO" - I think that people who work with Hadoop
> > would find this name intuitive, and that's what's important.
> > Even more, and joining Raghu's comment, it is also recognized as
> > "compatible with Hadoop", so for example someone running a Beam
> > pipeline using the Spark runner on Amazon's S3 who wants to read/write
> > Hadoop sequence files would simply use HadoopIO and provide the
> > appropriate runtime dependencies (actually true for GS as well).
> >
> > On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi <rang...@google.com.invalid> wrote:
> >
> > > FileInputFormat is extremely widely used; pretty much all the file
> > > based input formats extend it.
> > > All of them call into it to list the input files and split (with
> > > some tweaks on top of that). The special API
> > > (*FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes)*)
> > > is how the split size is normally communicated.
> > > New IO can use the API directly.
> > >
> > > HdfsIO as implemented in Beam is not HDFS-specific at all. There are
> > > no hdfs imports, and the HDFS name does not appear anywhere other
> > > than in HdfsIO's own class and method names. AvroHdfsFileSource etc.
> > > would work just as well with the new IO.
> > >
> > > On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin <dhalp...@google.com.invalid> wrote:
> > >
> > > > (And I think renaming to HadoopIO doesn't make sense. "InputFormat"
> > > > is the key component of the name -- it reads things that implement
> > > > the InputFormat interface. "Hadoop" means a lot more than that.)
> > >
> > > Often 'IO' in Beam implies both sources and sinks. It might not be
> > > long before we support Hadoop OutputFormat as well. In addition,
> > > HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a
> > > lot of things depending on the context. In the 'IO' context it might
> > > not be too broad. Normally it implies 'any FileSystem supported in
> > > Hadoop, e.g. S3'.
> > >
> > > Either way, I am quite confident that once HadoopInputFormatIO is
> > > written, it can easily replace HdfsIO. That decision could be made
> > > later.
> > >
> > > Raghu.
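As a rough illustration of the split sizing discussed in the quoted thread: this is a simplified sketch of the arithmetic FileInputFormat performs, showing why setMinInputSplitSize(job, desiredBundleSizeBytes) is how a desired bundle size gets communicated. The class and method names here are illustrative, not Beam or Hadoop API, and the real getSplits() also applies a slop factor on the last split, which is omitted.

```java
// Simplified sketch of FileInputFormat-style split sizing.
public class SplitSizeSketch {

    // Mirrors the formula splitSize = max(minSize, min(maxSize, blockSize)),
    // so raising minSize (via setMinInputSplitSize) forces splits to be at
    // least as large as the desired bundle size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits for a file of the given length (slop factor ignored).
    static int numSplits(long fileLength, long splitSize) {
        return (int) ((fileLength + splitSize - 1) / splitSize);
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20;      // 128 MB HDFS block
        long desiredBundle = 256L << 20;  // runner asks for 256 MB bundles
        long splitSize = computeSplitSize(blockSize, desiredBundle, Long.MAX_VALUE);
        // With a 256 MB minimum, a 1 GB file is read as 4 splits of 256 MB.
        System.out.println(splitSize + " " + numSplits(1L << 30, splitSize));
    }
}
```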