FileInputFormat is extremely widely used; pretty much all of the file-based input formats extend it. All of them call into it to list the input files and compute splits (with some tweaks on top of that). The split size is normally communicated through the dedicated API ( *FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes)* ). A new IO can use that API directly.
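To make the hand-off concrete, here is a minimal sketch (the 64 MB value is made up; in a Beam source it would come from the runner) of a new IO passing the desired bundle size straight to FileInputFormat before asking it for splits:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeHint {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Hypothetical bundle size; a runner would supply this at split time.
        long desiredBundleSizeBytes = 64L * 1024 * 1024;
        // Standard Hadoop FileInputFormat API: bounding the split size from
        // both sides makes getSplits(job) produce splits near the desired size.
        FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes);
        FileInputFormat.setMaxInputSplitSize(job, desiredBundleSizeBytes);
      }
    }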
HdfsIO as implemented in Beam is not HDFS specific at all. There are no HDFS imports, and the HDFS name does not appear anywhere other than in HdfsIO's own class and method names. AvroHdfsFileSource etc. would work just as well with a new IO (a small sketch at the end of this mail shows the point).

On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin <[email protected]> wrote:
> (And I think renaming to HadoopIO doesn't make sense. "InputFormat" is the
> key component of the name -- it reads things that implement the InputFormat
> interface. "Hadoop" means a lot more than that.)

Often 'IO' in Beam implies both sources and sinks, and it might not be long before we support Hadoop OutputFormat as well. In addition, HadoopInputFormatIO is quite a mouthful.

Agreed, Hadoop can mean a lot of things depending on the context. In an 'IO' context it might not be too broad; normally it implies 'any FileSystem supported in Hadoop, e.g. S3'.

Either way, I am quite confident that once HadoopInputFormatIO is written, it can easily replace HdfsIO. That decision could be made later.

Raghu.
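For illustration (AvroKeyInputFormat and the paths below are just examples I picked, not anything HdfsIO actually hard-codes): the InputFormat wiring does not care which FileSystem the input path resolves to, so the same code reads from HDFS, S3, or anything else with a registered Hadoop FileSystem:

    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SchemeAgnosticInput {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(AvroKeyInputFormat.class);
        // Hypothetical paths; only the scheme differs, the rest of the
        // wiring is identical.
        FileInputFormat.setInputPaths(job, "hdfs:///data/events/*.avro");
        // FileInputFormat.setInputPaths(job, "s3a://my-bucket/events/*.avro");
      }
    }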
