Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-02 Thread Stephen Sisk
o/hadoop/inputformat I think the downside of #2 is that it hides hbase, which I think deserves to be top level. Other comments: It should be noted that when we have all modules use hadoop-common, we

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-02 Thread Ismaël Mejía
IO transform has its own hadoop dependency" On the naming discussion: I personally prefer "inputformat" as the name of the directory, but I defer to the folks who know the hadoop community more. S

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-02 Thread Jean-Baptiste Onofré
17, 2017 at 9:38 AM, Dipti Kulkarni < dipti_dkulka...@persistent.com> wrote: Thank you all for your inputs! -Original Message- From: Dan Halperin [mailto:dhalp...@google.com.INVALID] Sent: Friday, February 17, 2017 12:17 PM To: dev@beam.apache.org Subject: Re: Merge HadoopInputFormat

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-02 Thread Jean-Baptiste Onofré
ub.com/apache/beam/pull/2087 On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni < dipti_dkulka...@persistent.com> wrote: Thank you all for your inputs! -Original Message- From: Dan Halperin [mailto:dhalp...@google.com.INVALID] Sent: Friday, February 17, 2017 12:17 PM To: dev@beam.apache.o

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-03-01 Thread Stephen Sisk
eam/pull/2087 On Fri, Feb 17, 2017 at 9:38 AM, Dipti Kulkarni < dipti_dkulka...@persistent.com> wrote: Thank you all for your inputs! -Original Message- From: Dan Halperin [mailto:dhalp...@google.com.INVALID] Sent: Friday, February 17, 2017 12:17 PM

RE: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-17 Thread Dipti Kulkarni
Thank you all for your inputs! -Original Message- From: Dan Halperin [mailto:dhalp...@google.com.INVALID] Sent: Friday, February 17, 2017 12:17 PM To: dev@beam.apache.org Subject: Re: Merge HadoopInputFormatIO and HDFSIO in a single module Raghu, Amit -- +1 to your expertise :) On

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-16 Thread Dan Halperin
Raghu, Amit -- +1 to your expertise :) On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela wrote: I agree with Dan on everything regarding HdfsFileSystem - it's super convenient for users to use TextIO with HdfsFileSystem rather than replacing the IO and also specifying the InputFormat type. I

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-16 Thread Amit Sela
I agree with Dan on everything regarding HdfsFileSystem - it's super convenient for users to use TextIO with HdfsFileSystem rather than replacing the IO and also specifying the InputFormat type. I disagree on "HadoopIO" - I think that people who work with Hadoop would find this name intuitive, and

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-16 Thread Raghu Angadi
FileInputFormat is extremely widely used; pretty much all the file-based input formats extend it. All of them call into it to list the input files and split them (with some tweaks on top of that). The special API ( *FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes)* ) is how the split size is

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-16 Thread Dan Halperin
Chiming in a bit late, but here's my 2 cents. HdfsFileSystem vs Hadoop*InputFormatIO is a red herring: * HdfsFileSystem is for file-format-specific, Beam-native, parsers of files. It will make TextIO, AvroIO, etc., work for files that happen to be located at hdfs:// URIs. * This is complementa

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Raghu Angadi
Dipti, Also how about calling it just HadoopIO? On Wed, Feb 15, 2017 at 11:13 AM, Raghu Angadi wrote: I skimmed through HdfsIO and I think it is essentially HadoopInputFormatIO with FileInputFormat. I would pretty much move most of the code to HadoopInputFormatIO (just make HdfsIO a speci

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Stephen Sisk
Hi Dipti! It sounds like there are two possible implementation options: 1. HdfsIO that is implemented using HadoopInputFormatIO 2. HdfsIO that is implemented using IOChannelFactory (I think BeamFileSystem is the new name?) Either way, I agree that it makes sense to have one module that contains t

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Jean-Baptiste Onofré
Hi, I guess you saw my comment in the PR. Basically, I was waiting for the refactoring of IOChannelFactory to refactor hdfs IO as hadoop file format on top of IOChannelFactory. I would have waited a bit, and I would be more than happy to help you on the PR. Regards JB On Feb 15, 2017, 14:55, at 14:5

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Jean-Baptiste Onofré
Hi, It's what I said in the hadoop file format PR. When I discussed the refactoring of the IOChannelFactory with Davor and Pei, I proposed to refactor hdfs IO to deal with hadoop file format on top of the file IO. Regards JB On Feb 15, 2017, 15:13, at 15:13, Raghu Angadi wrote: I skim

Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Raghu Angadi
I skimmed through HdfsIO and I think it is essentially HadoopInputFormatIO with FileInputFormat. I would pretty much move most of the code to HadoopInputFormatIO (just make HdfsIO a specific instance of HIF_IO). On Wed, Feb 15, 2017 at 9:15 AM, Dipti Kulkarni < dipti_dkulka...@persistent.com> wrot

Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Dipti Kulkarni
Hello there! I am working on writing a Read IO for Hadoop InputFormat. This will enable reading from any datasource which supports Hadoop InputFormat, i.e. provides source to read from InputFormat for integration with Hadoop. It makes sense for the HadoopInputFormatIO to share some code with the