FileInputFormat is extremely widely used; pretty much all of the file-based input formats extend it. All of them call into it to list the input files and compute splits (with some tweaks on top of that). The split size is normally communicated through the dedicated API ( *FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes)* ). A new IO can use that API directly.
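To make the hand-off concrete, here is a minimal sketch (the 64 MB value is made up; in a Beam source it would come from the runner) of a new IO passing the desired bundle size straight to FileInputFormat before asking it for splits:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeHint {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Hypothetical bundle size; a runner would supply this at split time.
        long desiredBundleSizeBytes = 64L * 1024 * 1024;
        // Standard Hadoop FileInputFormat API: bounding the split size from
        // both sides makes getSplits(job) produce splits near the desired size.
        FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes);
        FileInputFormat.setMaxInputSplitSize(job, desiredBundleSizeBytes);
      }
    }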
HdfsIO as implemented in Beam is not HDFS specific at all. There are no HDFS imports, and the HDFS name does not appear anywhere other than in HdfsIO's own class and method names. AvroHdfsFileSource etc. would work just as well with a new IO (a small sketch at the end of this mail shows the point).

On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin <[email protected]> wrote:
> (And I think renaming to HadoopIO doesn't make sense. "InputFormat" is the
> key component of the name -- it reads things that implement the InputFormat
> interface. "Hadoop" means a lot more than that.)

Often 'IO' in Beam implies both sources and sinks, and it might not be long before we support Hadoop OutputFormat as well. In addition, HadoopInputFormatIO is quite a mouthful.

Agreed, Hadoop can mean a lot of things depending on the context. In an 'IO' context it might not be too broad; normally it implies 'any FileSystem supported in Hadoop, e.g. S3'.

Either way, I am quite confident that once HadoopInputFormatIO is written, it can easily replace HdfsIO. That decision could be made later.

Raghu.
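For illustration (AvroKeyInputFormat and the paths below are just examples I picked, not anything HdfsIO actually hard-codes): the InputFormat wiring does not care which FileSystem the input path resolves to, so the same code reads from HDFS, S3, or anything else with a registered Hadoop FileSystem:

    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SchemeAgnosticInput {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(AvroKeyInputFormat.class);
        // Hypothetical paths; only the scheme differs, the rest of the
        // wiring is identical.
        FileInputFormat.setInputPaths(job, "hdfs:///data/events/*.avro");
        // FileInputFormat.setInputPaths(job, "s3a://my-bucket/events/*.avro");
      }
    }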
