Raghu, Amit -- +1 to your expertise :)

On Thu, Feb 16, 2017 at 3:39 PM, Amit Sela <[email protected]> wrote:

> I agree with Dan on everything regarding HdfsFileSystem - it's super
> convenient for users to use TextIO with HdfsFileSystem rather than
> replacing the IO and also specifying the InputFormat type.
>
> I disagree on "HadoopIO" - I think that people who work with Hadoop would
> find this name intuitive, and that's what's important.
> Even more, and joining Raghu's comment, it is also recognized as
> "compatible with Hadoop", so for example someone running a Beam pipeline
> using the Spark runner on Amazon's S3 who wants to read/write Hadoop
> sequence files would simply use HadoopIO and provide the appropriate
> runtime dependencies (actually true for GS as well).
>
> On Thu, Feb 16, 2017 at 9:08 PM Raghu Angadi <[email protected]>
> wrote:
>
> > FileInputFormat is extremely widely used; pretty much all the file-based
> > input formats extend it. All of them call into it to list the input files
> > and split them (with some tweaks on top of that). The special API (
> > *FileInputFormat.setMinInputSplitSize(job, desiredBundleSizeBytes)* ) is
> > how the split size is normally communicated.
> > New IO can use the API directly.
> >
> > HdfsIO as implemented in Beam is not HDFS specific at all. There are no
> > hdfs imports and HDFS name does not appear anywhere other than in HdfsIO's
> > own class and method names. AvroHdfsFileSource etc would work just as well
> > with new IO.
> >
> > On Thu, Feb 16, 2017 at 8:17 AM, Dan Halperin <[email protected]>
> > wrote:
> >
> > > (And I think renaming to HadoopIO doesn't make sense. "InputFormat" is the
> > > key component of the name -- it reads things that implement the InputFormat
> > > interface. "Hadoop" means a lot more than that.)
> > >
> >
> > Often 'IO' in Beam implies both sources and sinks. It might not be long
> > before we support Hadoop OutputFormat as well. In addition,
> > HadoopInputFormatIO is quite a mouthful. Agreed, Hadoop can mean a lot of
> > things depending on the context. In 'IO' context it might not be too broad.
> > Normally it implies 'any FileSystem supported in Hadoop, e.g. S3'.
> >
> > Either way, I am quite confident that once HadoopInputFormatIO is written,
> > it can easily replace HdfsIO. That decision could be made later.
> >
> > Raghu.
> >
>
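
For anyone following the split-size point in Raghu's message: as far as I know, Hadoop's FileInputFormat picks a split size by clamping the block size between the configured minimum and maximum, which is why setting the minimum to desiredBundleSizeBytes pushes splits toward the bundle size. Below is a rough, dependency-free sketch of that idea. The class SplitSizeSketch, the splits helper, and the sizes in main are made up for illustration; this is not Hadoop or Beam code, and it ignores details such as how Hadoop treats a small trailing remainder.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a simplified model of split sizing, not Hadoop or Beam code.
class SplitSizeSketch {

    // Split size is the block size clamped to the range [minSize, maxSize].
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns {offset, length} pairs that cover a file of fileLength bytes.
    // Assumes splitSize > 0.
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> result = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            result.add(new long[] {offset, length});
        }
        return result;
    }

    public static void main(String[] args) {
        long blockSize = 128L << 20;              // assumed 128 MiB block size
        long desiredBundleSizeBytes = 256L << 20; // hypothetical size requested by a runner
        long fileLength = 1L << 30;               // hypothetical 1 GiB input file

        // Using the desired bundle size as the minimum split size raises the
        // effective split size above the block size.
        long splitSize = computeSplitSize(blockSize, desiredBundleSizeBytes, Long.MAX_VALUE);

        for (long[] s : splits(fileLength, splitSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With the made-up numbers above, the minimum (256 MiB) wins over the block size (128 MiB), so the 1 GiB file is covered by four splits rather than eight.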
