I agree with Dan on everything regarding HdfsFileSystem - it's super convenient for users to use TextIO with HdfsFileSystem rather than replacing the IO and also specifying the InputFormat type.
I disagree on "HadoopIO" - I think that people who work with Hadoop would find this name intuitive, and that's what's important. What's more, echoing Raghu's comment, it is also recognized as "compatible with Hadoop", so for example someone running a Beam pipeline using the Spark runner on Amazon's S3 who wants to read/write Hadoop sequence files would simply use HadoopIO and provide the appropriate runtime dependencies (this is actually true for GS as well).

On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi <[email protected]> wrote:

> FileInputFormat is extremely widely used, pretty much all the file based
> input formats extend it. All of them call into to list the input files,
> split (with some tweaks on top of that). The special API (
> *FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes)* ) is
> how the split size is normally communicated. New IO can use the api
> directly.
>
> HdfsIO as implemented in Beam is not HDFS specific at all. There are no
> hdfs imports and HDFS name does not appear anywhere other than in HdfsIO's
> own class and method names. AvroHdfsFileSource etc would work just as well
> with new IO.
>
> On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin <[email protected]>
> wrote:
>
> > (And I think renaming to HadoopIO doesn't make sense. "InputFormat" is the
> > key component of the name -- it reads things that implement the InputFormat
> > interface. "Hadoop" means a lot more than that.)
>
> Often 'IO' in Beam implies both sources and sinks. It might not be long
> before we might be supporting Hadoop OutputFormat as well. In addition
> HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a lot of
> things depending on the context. In 'IO' context it might not be too broad.
> Normally it implies 'any FileSystem supported in Hadoop, e.g. S3'.
>
> Either way, I am quite confident once HadoopInputFormatIO is written, it
> can easily replace HdfsIO. That decision could be made later.
>
> Raghu.
