Raghu, Amit -- +1 to your expertise :)

On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <[email protected]> wrote:
> I agree with Dan on everything regarding HdfsFileSystem - it's super
> convenient for users to use TextIO with HdfsFileSystem rather than
> replacing the IO and also specifying the InputFormat type.
>
> I disagree on "HadoopIO" - I think that people who work with Hadoop would
> find this name intuitive, and that's what's important.
> Even more, and joining Raghu's comment, it is also recognized as
> "compatible with Hadoop", so for example someone running a Beam pipeline
> using the Spark runner on Amazon's S3 who wants to read/write Hadoop
> sequence files would simply use HadoopIO and provide the appropriate
> runtime dependencies (actually true for GS as well).
>
> On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi <[email protected]>
> wrote:
>
> > FileInputFormat is extremely widely used; pretty much all the file-based
> > input formats extend it. All of them call into it to list the input
> > files and split them (with some tweaks on top of that). The special API (
> > *FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes)* )
> > is how the split size is normally communicated.
> > New IO can use the API directly.
> >
> > HdfsIO as implemented in Beam is not HDFS specific at all. There are no
> > hdfs imports, and the HDFS name does not appear anywhere other than in
> > HdfsIO's own class and method names. AvroHdfsFileSource etc. would work
> > just as well with the new IO.
> >
> > On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin <[email protected]>
> > wrote:
> >
> > > (And I think renaming to HadoopIO doesn't make sense. "InputFormat" is
> > > the key component of the name -- it reads things that implement the
> > > InputFormat interface. "Hadoop" means a lot more than that.)
> >
> > Often 'IO' in Beam implies both sources and sinks. It might not be long
> > before we are supporting Hadoop OutputFormat as well. In addition,
> > HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a lot of
> > things depending on the context. In the 'IO' context it might not be too
> > broad. Normally it implies 'any FileSystem supported in Hadoop, e.g. S3'.
> >
> > Either way, I am quite confident that once HadoopInputFormatIO is
> > written, it can easily replace HdfsIO. That decision could be made later.
> >
> > Raghu.
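
For anyone following along, here is a rough sketch of why FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes) works as the knob for bundle sizes. As far as I know, FileInputFormat picks the split size as max(minSize, min(maxSize, blockSize)), so raising the minimum above the block size makes the splits larger. The class name, the sizes, and the simplified split count below are my own assumptions for illustration; this is not Hadoop or Beam code, and it ignores the slack Hadoop allows on the last split.

```java
class SplitSizeSketch {
    // Same shape as the formula FileInputFormat applies:
    // max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Simplified split count: plain ceiling division. Real Hadoop lets the
    // last split run slightly over, which this sketch does not model.
    static long countSplits(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long blockSize = 128 * mb;               // hypothetical block size
        long fileLength = 1024 * mb;             // hypothetical 1 GiB input file
        long desiredBundleSizeBytes = 256 * mb;  // hypothetical runner request

        // No minimum set: the split size falls back to the block size.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        // Minimum set to the desired bundle size: splits grow to match it.
        long bundleSplit =
            computeSplitSize(blockSize, desiredBundleSizeBytes, Long.MAX_VALUE);

        System.out.println(countSplits(fileLength, defaultSplit)); // prints 8
        System.out.println(countSplits(fileLength, bundleSplit));  // prints 4
    }
}
```

Note that a minimum smaller than the block size has no effect in this formula, so the setting can only make splits coarser, not finer.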
