I agree with Dan on everything regarding HdfsFileSystem - it's super
convenient for users to use TextIO with HdfsFileSystem rather than
replacing the IO and also specifying the InputFormat type.

I disagree on "HadoopIO" - I think that people who work with Hadoop would
find this name intuitive, and that's what's important.
Moreover, echoing Raghu's comment, the name is also recognized as meaning
"compatible with Hadoop". For example, someone who runs a Beam pipeline
with the Spark runner against Amazon S3 and wants to read/write Hadoop
sequence files would simply use HadoopIO and provide the appropriate
runtime dependencies (this actually holds for GS as well).
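
On Raghu's point (quoted below) about communicating the split size through
FileInputFormat.setMinInputSplitSize: as far as I understand it, Hadoop's
FileInputFormat picks the split size as max(minSize, min(maxSize, blockSize)).
Here is a minimal standalone sketch of that formula, with no Hadoop
dependency. The class name and the block and bundle sizes are made up for
illustration, and the default min/max values are my assumption of the usual
Hadoop defaults.

```java
public class SplitSizeSketch {

    // Mirrors the formula I believe FileInputFormat uses to pick a split size.
    // This is a local re-implementation for illustration, not a Hadoop call.
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;              // hypothetical block size
        long desiredBundleSizeBytes = 256L * 1024 * 1024; // hypothetical bundle size

        // Assumed defaults: min split size 1, max split size Long.MAX_VALUE.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

        // What setMinInputSplitSize(job, desiredBundleSizeBytes) would feed in.
        long tunedSplit = computeSplitSize(blockSize, desiredBundleSizeBytes, Long.MAX_VALUE);

        System.out.println("default split size: " + defaultSplit);
        System.out.println("tuned split size:   " + tunedSplit);
    }
}
```

If that formula is right, the minimum can only raise the split size above the
block size. A desired bundle smaller than a block would presumably need the
maximum split size lowered as well.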

On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi <rang...@google.com.invalid>
wrote:

> FileInputFormat is extremely widely used; pretty much all the file-based
> input formats extend it. All of them call into it to list the input files
> and split them (with some tweaks on top of that). The special API (
> *FileInputFormat.setMinInputSplitSize(job,
> desiredBundleSizeBytes)* ) is how the split size is normally communicated.
> The new IO can use that API directly.
>
> HdfsIO as implemented in Beam is not HDFS specific at all. There are no
> hdfs imports, and the HDFS name does not appear anywhere other than in
> HdfsIO's own class and method names. AvroHdfsFileSource etc. would work
> just as well with the new IO.
>
> On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin <dhalp...@google.com.invalid>
> wrote:
>
> > (And I think renaming to HadoopIO doesn't make sense. "InputFormat" is
> > the key component of the name -- it reads things that implement the
> > InputFormat interface. "Hadoop" means a lot more than that.)
> >
>
> Often 'IO' in Beam implies both sources and sinks. It might not be long
> before we support Hadoop OutputFormat as well. In addition,
> HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a lot of
> things depending on the context. In the 'IO' context it might not be too
> broad. Normally it implies 'any FileSystem supported in Hadoop, e.g. S3'.
>
> Either way, I am quite confident that once HadoopInputFormatIO is written,
> it can easily replace HdfsIO. That decision could be made later.
>
> Raghu.
>
