Re: TextIO binary file

2017-02-05 Thread Aviem Zur
AvroIO is great for POJOs. But for use cases with more complex
serializable objects, or objects which are compatible with some coder, it
falls short.

Also, expecting less savvy users to know they need to use AvroIO might be a
stretch.
A simpler API along the lines of ObjectFile might be more user-friendly
(even if, for optimization, it uses Avro under the hood for POJOs).
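For illustration, here is a minimal stand-alone sketch of what such an ObjectFile-style format could do under the hood, using plain Java serialization and no Beam APIs. The class and method names are invented for this sketch, not a proposal for the actual API surface:

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch: write/read a sequence of serializable objects,
// in the spirit of Spark's saveAsObjectFile. Not a real Beam API.
public class ObjectFileSketch {
    // Append each object to the stream via Java serialization,
    // preceded by a record-count header.
    public static void write(OutputStream os, List<? extends Serializable> records)
            throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(os)) {
            out.writeInt(records.size());
            for (Serializable r : records) {
                out.writeObject(r);
            }
        }
    }

    // Read the objects back in the order they were written.
    public static List<Object> read(InputStream is)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(is)) {
            int n = in.readInt();
            List<Object> records = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                records.add(in.readObject());
            }
            return records;
        }
    }
}
```

A real IO would of course still need the sharding, filesystem, and coder plumbing this sketch ignores.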

On Sun, Feb 5, 2017, 22:00 Eugene Kirpichov 
wrote:

> OK, I see what you mean; however I still think this can be solved without
> introducing a new "Beam object file" (or whatever) file format, and without
> thereby introducing additional use cases and compatibility constraints on
> coders.
>
> I asked before in the thread why not just use AvroIO (it can serialize
> arbitrary POJOs using reflection); I skimmed the thread and it doesn't seem
> like that got answered properly. I also like Dan's suggestion to use AvroIO
> to serialize byte[] arrays and you can do whatever you want with them (e.g.
> use another serialization library, say, Kryo, or Java serialization, etc.)
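As a sketch of that byte[] route: the AvroIO part is omitted here, and this only shows turning an object into a byte[] with plain Java serialization and back. Kryo or any other serialization library could be substituted:

```java
import java.io.*;

// Sketch: object <-> byte[] using Java serialization. In the suggested
// setup, the resulting byte[] values would then be written/read with
// AvroIO (not shown). Kryo or another serializer could be swapped in.
public class BytesRoundTrip {
    public static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        }
        return bos.toByteArray();
    }

    public static Object fromBytes(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }
}
```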
>
> On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur  wrote:
>
> > I agree that these files will serve no use outside of Beam pipelines.
> >
> > The rationale was that you might want to have one pipeline write output
> to
> > files and then have a different pipeline that uses those files as inputs.
> >
> > Say one team in your organization creates a pipeline and a different team
> > utilizes those files as input for a different pipeline. The contract
> > between them is the file, in a Beam-readable format.
> > This is similar to Spark's `saveAsObjectFile`:
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> >
> > The merit for something like this in my eyes is to not burden the user
> > with writing a custom IO.
> >
> > On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
> >  wrote:
> >
> > +1 to Robert. Either this will be a Beam-specific file format (and then
> > nothing except Beam will be able to read it - which I doubt is what you
> > want), or it is an existing well-known file format and then we should
> just
> > develop an IO for it.
> > Note that any file format that involves encoding elements with a Coder is
> > Beam-specific, because wire format of coders is Beam-specific.
> >
> > On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
> >  wrote:
> >
> > > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur 
> wrote:
> > > > +1 on what Stas said.
> > > > I think there is value in not having the user write a custom IO for a
> > > > protocol they use which is not covered by Beam IOs. Plus having them
> > deal
> > > > with not only the encoding but also the IO part is not ideal.
> > > > I think having a basic FileIO that can write to the Filesystems
> > supported
> > > > by Beam (GS/HDFS/Local/...) which you can use any coder with,
> including
> > > > your own custom coder, can be beneficial.
> > >
> > > What would the format of the file be? Just the concatenation of the
> > > elements encoded according to the coder? Or is there a delimiter
> > > needed to separate records? In which case, how does one ensure the
> > > delimiter does not also appear in the middle of an encoded element? At
> > > this point you're developing a file format, and might as well stick
> > > with one of the standard ones. https://xkcd.com/927
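One standard answer to the delimiter problem is length-prefixed framing rather than a delimiter: each record is preceded by its byte length, so the payload can contain arbitrary bytes and no escaping scheme is needed. A minimal stand-alone sketch in plain Java (no Beam APIs):

```java
import java.io.*;

// Sketch: length-prefixed records. Each record is written as a 4-byte
// length followed by the record's bytes, so the payload may contain any
// byte value and no delimiter/escaping is needed.
public class LengthPrefixed {
    public static void writeRecord(DataOutputStream out, byte[] record)
            throws IOException {
        out.writeInt(record.length);
        out.write(record);
    }

    public static byte[] readRecord(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] record = new byte[length];
        in.readFully(record);
        return record;
    }
}
```

This still amounts to inventing a file format, which is exactly the xkcd-927 concern above, but it shows the framing question has standard solutions.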
> > >
> > > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin 
> > wrote:
> > > >
> > > > I believe the motivation is to have an abstraction that allows one to
> > > write
> > > > stuff to a file in a way that is agnostic to the coder.
> > > > If one needs to write a non-Avro protocol to a file, and this
> > particular
> > > > protocol does not meet the assumption made by TextIO, one might need
> to
> > > > duplicate the file IO related code from AvroIO.
> > > >
> > > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> > > >  wrote:
> > > >
> > > >> Could you clarify why it would be useful to write objects to files
> > using
> > > >> Beam coders, as opposed to just using e.g. AvroIO?
> > > >>
> > > >> Coders (should) make no promise as to what their wire format is, so
> > such
> > > >> files could be read back only by other Beam pipelines using the same
> > IO.
> > > >>
> > > >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur 
> wrote:
> > > >>
> > > >> > So if I understand correctly, the general agreement is that TextIO
> > > >> > should not support anything but lines from files as strings.
> > > >> > I'll go ahead and file a ticket that says the Javadoc should be
> > > >> > changed to reflect this and the `withCoder` method should be removed.
> > > >> >
> > > >> > Is there merit for Beam to supply an IO which does allow writing
> > > >> > objects to a file using Beam coders and Beam FS (to write these
> > > >> > files to GS/Hadoop/Local)?

Re: TextIO binary file

2017-02-05 Thread Aviem Zur
I agree that these files will serve no use outside of Beam pipelines.

The rationale was that you might want to have one pipeline write output to
files and then have a different pipeline that uses those files as inputs.

Say one team in your organization creates a pipeline and a different team
utilizes those files as input for a different pipeline. The contract
between them is the file, in a Beam-readable format.
This is similar to Spark's `saveAsObjectFile`:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512

The merit for something like this in my eyes is to not burden the user with
writing a custom IO.

On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
 wrote:

+1 to Robert. Either this will be a Beam-specific file format (and then
nothing except Beam will be able to read it - which I doubt is what you
want), or it is an existing well-known file format and then we should just
develop an IO for it.
Note that any file format that involves encoding elements with a Coder is
Beam-specific, because wire format of coders is Beam-specific.

On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
 wrote:

> On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur  wrote:
> > +1 on what Stas said.
> > I think there is value in not having the user write a custom IO for a
> > protocol they use which is not covered by Beam IOs. Plus having them
deal
> > with not only the encoding but also the IO part is not ideal.
> > I think having a basic FileIO that can write to the Filesystems
supported
> > by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> > your own custom coder, can be beneficial.
>
> What would the format of the file be? Just the concatenation of the
> elements encoded according to the coder? Or is there a delimiter
> needed to separate records? In which case, how does one ensure the
> delimiter does not also appear in the middle of an encoded element? At
> this point you're developing a file format, and might as well stick
> with one of the standard ones. https://xkcd.com/927
>
> > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin  wrote:
> >
> > I believe the motivation is to have an abstraction that allows one to
> write
> > stuff to a file in a way that is agnostic to the coder.
> > If one needs to write a non-Avro protocol to a file, and this particular
> > protocol does not meet the assumption made by TextIO, one might need to
> > duplicate the file IO related code from AvroIO.
> >
> > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> >  wrote:
> >
> >> Could you clarify why it would be useful to write objects to files
using
> >> Beam coders, as opposed to just using e.g. AvroIO?
> >>
> >> Coders (should) make no promise as to what their wire format is, so
such
> >> files could be read back only by other Beam pipelines using the same
IO.
> >>
> >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur  wrote:
> >>
> >> > So if I understand correctly, the general agreement is that TextIO
> >> > should not support anything but lines from files as strings.
> >> > I'll go ahead and file a ticket that says the Javadoc should be
> >> > changed to reflect this and the `withCoder` method should be removed.
> >> >
> >> > Is there merit for Beam to supply an IO which does allow writing objects
> >> > to a file using Beam coders and Beam FS (to write these files to
> >> > GS/Hadoop/Local)?
> >> >
> >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> >> >  wrote:
> >> >
> >> > P.S. Note that this point (about coders) is also mentioned in the
> >> > now-being-reviewed PTransform Style Guide
> >> > https://github.com/apache/beam-site/pull/134
> >> > currently staged at
> >> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> >> >
> >> >
> >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamik...@apache.org>
> >> > wrote:
> >> >
> >> > > +1 to what Eugene said.
> >> > >
> >> > > I've seen a number of Python SDK users incorrectly assuming that
> >> > > coder.decode() is needed when developing their own file-based sources
> >> > > (since many users usually refer to the text source first). Probably the
> >> > > coder parameter should not be configurable for the text source/sink,
> >> > > and they should be updated to only read/write UTF-8 encoded strings.
> >> > >
> >> > > - Cham
> >> > >
> >> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> >> > >  wrote:
> >> > >
> >> > > > The use of Coder in TextIO is a long-standing design issue, because
> >> > > > coders are not intended to be used for general-purpose conversion of
> >> > > > things from and to bytes; their only proper use is letting the runner
> >> > > > materialize and restore objects if