AvroIO is great for POJOs. But for use cases with more complex
serializable objects, or objects which are compatible with some coder, it
falls short.

Also, expecting less savvy users to know that they need to use AvroIO might
be a stretch.
A simpler API along the lines of ObjectFile might be more user-friendly
(even if, as an optimization, it uses Avro under the hood for POJOs).
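
To make the framing question from the quoted thread concrete, here is a minimal
sketch of the "serialize to byte[] yourself, then store the bytes" route. It
uses only the JDK: `FramedRecords` and its methods are hypothetical names, not
part of Beam, Avro, or Spark, and the 4-byte length prefix is just one possible
choice for illustration. A container format such as Avro would take care of
the record boundaries for you.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Beam API): stores opaque byte[] records with a length prefix,
// so a payload may contain '\n' or any other byte without being mistaken for a delimiter.
final class FramedRecords {
  private FramedRecords() {}

  // Serialize one object with plain Java serialization; any other library could be swapped in.
  static byte[] toBytes(Serializable value) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(value);
    }
    return bytes.toByteArray();
  }

  // Inverse of toBytes.
  static Object fromBytes(byte[] record) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(record))) {
      return in.readObject();
    }
  }

  // Write each record as a 4-byte big-endian length followed by the payload bytes.
  static byte[] frame(List<byte[]> records) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      for (byte[] record : records) {
        out.writeInt(record.length);
        out.write(record);
      }
    }
    return bytes.toByteArray();
  }

  // Read records back by trusting the length prefix instead of scanning for a delimiter.
  static List<byte[]> unframe(byte[] framed) throws IOException {
    List<byte[]> records = new ArrayList<>();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed))) {
      // available() is exact here because the underlying stream is an in-memory byte array.
      while (in.available() > 0) {
        int length = in.readInt();
        if (length < 0) {
          throw new IOException("Corrupt length prefix: " + length);
        }
        byte[] record = new byte[length];
        in.readFully(record); // throws EOFException if the payload is truncated
        records.add(record);
      }
    }
    return records;
  }
}
```

For example, `FramedRecords.unframe(FramedRecords.frame(records))` returns the
same byte arrays even when a record contains '\n'. Note that this is exactly
the "you are now developing a file format" situation described in the thread,
which is why reusing an existing container format is attractive.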

On Sun, Feb 5, 2017, 22:00 Eugene Kirpichov <kirpic...@google.com.invalid>
wrote:

> OK, I see what you mean; however I still think this can be solved without
> introducing a new "Beam object file" (or whatever) file format, and without
> thereby introducing additional use cases and compatibility constraints on
> coders.
>
> I asked before in the thread why not just use AvroIO (it can serialize
> arbitrary POJOs using reflection); I skimmed the thread and it doesn't seem
> like that got answered properly. I also like Dan's suggestion to use AvroIO
> to serialize byte[] arrays, and then you can do whatever you want with them
> (e.g. use another serialization library, say, Kryo, or Java serialization, etc.).
>
> On Sun, Feb 5, 2017 at 11:37 AM Aviem Zur <aviem...@gmail.com> wrote:
>
> > I agree that these files will serve no use outside of Beam pipelines.
> >
> > The rationale was that you might want to have one pipeline write output to
> > files and then have a different pipeline that uses those files as inputs.
> >
> > Say one team in your organization creates a pipeline and a different team
> > utilizes those files as input for a different pipeline. The contract
> > between them is the file, in a Beam-readable format.
> > This is similar to Spark's `saveAsObjectFile`:
> > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> >
> > The merit for something like this in my eyes is to not burden the user with
> > writing a custom IO
> >
> > On Tue, Jan 31, 2017 at 10:23 PM Eugene Kirpichov
> > <kirpic...@google.com.invalid> wrote:
> >
> > +1 to Robert. Either this will be a Beam-specific file format (and then
> > nothing except Beam will be able to read it - which I doubt is what you
> > want), or it is an existing well-known file format and then we should just
> > develop an IO for it.
> > Note that any file format that involves encoding elements with a Coder is
> > Beam-specific, because the wire format of coders is Beam-specific.
> >
> > On Tue, Jan 31, 2017 at 12:20 PM Robert Bradshaw
> > <rober...@google.com.invalid> wrote:
> >
> > > On Tue, Jan 31, 2017 at 12:04 PM, Aviem Zur <aviem...@gmail.com> wrote:
> > > > +1 on what Stas said.
> > > > I think there is value in not having the user write a custom IO for a
> > > > protocol they use which is not covered by Beam IOs. Plus having them deal
> > > > with not only the encoding but also the IO part is not ideal.
> > > > I think having a basic FileIO that can write to the Filesystems supported
> > > > by Beam (GS/HDFS/Local/...) which you can use any coder with, including
> > > > your own custom coder, can be beneficial.
> > >
> > > What would the format of the file be? Just the concatenation of the
> > > elements encoded according to the coder? Or is a delimiter needed to
> > > separate records? In which case, how does one ensure the
> > > delimiter does not also appear in the middle of an encoded element? At
> > > this point you're developing a file format, and might as well stick
> > > with one of the standard ones. https://xkcd.com/927
> > >
> > > > On Tue, Jan 31, 2017 at 7:56 PM Stas Levin <stasle...@gmail.com> wrote:
> > > >
> > > > I believe the motivation is to have an abstraction that allows one to write
> > > > stuff to a file in a way that is agnostic to the coder.
> > > > If one needs to write a non-Avro protocol to a file, and this particular
> > > > protocol does not meet the assumption made by TextIO, one might need to
> > > > duplicate the file IO related code from AvroIO.
> > > >
> > > > On Tue, Jan 31, 2017 at 6:50 PM Eugene Kirpichov
> > > > <kirpic...@google.com.invalid> wrote:
> > > >
> > > >> Could you clarify why it would be useful to write objects to files using
> > > >> Beam coders, as opposed to just using e.g. AvroIO?
> > > >>
> > > >> Coders (should) make no promise as to what their wire format is, so such
> > > >> files could be read back only by other Beam pipelines using the same IO.
> > > >>
> > > >> On Tue, Jan 31, 2017 at 2:48 AM Aviem Zur <aviem...@gmail.com> wrote:
> > > >>
> > > >> > So if I understand correctly, the general agreement is that TextIO
> > > >> > should not support anything but lines from files as strings.
> > > >> > I'll go ahead and file a ticket that says the Javadoc should be changed
> > > >> > to reflect this and the `withCoder` method should be removed.
> > > >> >
> > > >> > Is there merit for Beam to supply an IO which does allow writing objects
> > > >> > to a file using Beam coders and Beam FS (to write these files to
> > > >> > GS/Hadoop/Local)?
> > > >> >
> > > >> > On Tue, Jan 31, 2017 at 2:28 AM Eugene Kirpichov
> > > >> > <kirpic...@google.com.invalid> wrote:
> > > >> >
> > > >> > P.S. Note that this point (about coders) is also mentioned in the
> > > >> > now-being-reviewed PTransform Style Guide
> > > >> > https://github.com/apache/beam-site/pull/134
> > > >> > currently staged at
> > > >> > http://apache-beam-website-pull-requests.storage.googleapis.com/134/contribute/ptransform-style-guide/index.html#coders
> > > >> >
> > > >> >
> > > >> > On Mon, Jan 30, 2017 at 4:25 PM Chamikara Jayalath <chamik...@apache.org> wrote:
> > > >> >
> > > >> > > +1 to what Eugene said.
> > > >> > >
> > > >> > > I've seen a number of Python SDK users incorrectly assuming that
> > > >> > > coder.decode() is needed when developing their own file-based sources
> > > >> > > (since many users usually refer to the text source first). The coder
> > > >> > > parameter probably should not be configurable for the text source/sink,
> > > >> > > and they should be updated to only read/write UTF-8 encoded strings.
> > > >> > >
> > > >> > > - Cham
> > > >> > >
> > > >> > > On Mon, Jan 30, 2017 at 3:38 PM Eugene Kirpichov
> > > >> > > <kirpic...@google.com.invalid> wrote:
> > > >> > >
> > > >> > > > The use of Coder in TextIO is a long-standing design issue, because
> > > >> > > > coders are not intended to be used for general-purpose conversion of
> > > >> > > > things from and to bytes; their only proper use is letting the runner
> > > >> > > > materialize and restore objects if the runner thinks it's necessary.
> > > >> > > > IMO it should have been called LineIO, document that it reads lines of
> > > >> > > > text as String, and not have a withCoder parameter at all.
> > > >> > > >
> > > >> > > > The proper way to address your use case is to write a custom
> > > >> > > > FileBasedSource.
> > > >> > > > On Mon, Jan 30, 2017 at 2:52 AM Aviem Zur <aviem...@gmail.com> wrote:
> > > >> > > >
> > > >> > > > > The Javadoc of TextIO states:
> > > >> > > > >
> > > >> > > > >  * <p>By default, {@link TextIO.Read} returns a {@link PCollection} of {@link String Strings},
> > > >> > > > >  * each corresponding to one line of an input UTF-8 text file. To convert directly from the raw
> > > >> > > > >  * bytes (split into lines delimited by '\n', '\r', or '\r\n') to another object of type {@code T},
> > > >> > > > >  * supply a {@code Coder<T>} using {@link TextIO.Read#withCoder(Coder)}.
> > > >> > > > >
> > > >> > > > > However, as I stated, `withCoder` doesn't seem to have tests, and
> > > >> > > > > probably won't work given the hard-coded '\n' delimiter.
> > > >> > > > >
> > > >> > > > > On Mon, Jan 30, 2017 at 12:18 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > > >> > > > >
> > > >> > > > > > Hi Aviem,
> > > >> > > > > >
> > > >> > > > > > TextIO is not designed to write/read binary files: it's pure text,
> > > >> > > > > > so String.
> > > >> > > > > >
> > > >> > > > > > Regards
> > > >> > > > > > JB
> > > >> > > > > >
> > > >> > > > > > On 01/30/2017 09:24 AM, Aviem Zur wrote:
> > > >> > > > > > > Hi,
> > > >> > > > > > >
> > > >> > > > > > > While trying to use TextIO to write/read a binary file rather than
> > > >> > > > > > > String lines from a textual file I ran into an issue - the
> > > >> > > > > > > delimiter TextIO uses seems to be hardcoded '\n'.
> > > >> > > > > > > See `findSeparatorBounds` -
> > > >> > > > > > > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L1024
> > > >> > > > > > >
> > > >> > > > > > > The use case is to have a file of objects, encoded into bytes
> > > >> > > > > > > using a coder. However, '\n' is not a good delimiter here, as
> > > >> > > > > > > you can imagine.
> > > >> > > > > > > A similar pattern is found in Spark's `saveAsObjectFile`
> > > >> > > > > > > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1512
> > > >> > > > > > > where they use a more appropriate delimiter, to avoid such issues.
> > > >> > > > > > >
> > > >> > > > > > > I did not find any unit tests which use TextIO to read anything
> > > >> > > > > > > other than Strings.
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Jean-Baptiste Onofré
> > > >> > > > > > jbono...@apache.org
> > > >> > > > > > http://blog.nanthrax.net
> > > >> > > > > > Talend - http://www.talend.com
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
>
