Re: PCollection to PCollection Conversion

Vikas Kedigehalli Wed, 28 Dec 2016 11:38:58 -0800

Hi All,

  Not being aware of the discussion here, I sent out a PR
<https://github.com/apache/beam/pull/1704>, but JB and others directed me to
this thread. Having converted PCollection<T> to PCollection<String> several
times, I feel something like a 'ToString' transform is common enough to be
part of the core. What do you all think?
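
For readers following along, here is a minimal sketch of the element-wise idea in plain Java, outside of Beam. It is not the code from the PR: the class name ToStringSketch and the method elements are hypothetical, and String.valueOf stands in for whatever conversion such a transform would apply.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an element-wise "ToString" step; not the Beam API.
final class ToStringSketch {
    private ToStringSketch() {}

    // Produces exactly one String per input element, using String.valueOf.
    static <T> List<String> elements(Iterable<T> input) {
        List<String> out = new ArrayList<>();
        for (T element : input) {
            out.add(String.valueOf(element));
        }
        return out;
    }

    public static void main(String[] args) {
        // Map.Entry plays the role of a key/value pair in this sketch.
        List<Map.Entry<String, Long>> counts = new ArrayList<>();
        counts.add(new AbstractMap.SimpleEntry<String, Long>("ace", 4L));
        counts.add(new AbstractMap.SimpleEntry<String, Long>("king", 4L));
        System.out.println(elements(counts)); // prints [ace=4, king=4]
    }
}
```

The output format is whatever each element's own toString() returns, which is why the quoted discussion below goes on to delimiters, prefixes and suffixes.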


Also, if someone else is already working on or interested in tackling this,
then I am happy to discard the PR.

Regards,
Vikas

On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <[email protected]> wrote:

> It seems that there were a lot of good points raised here, and I tend to
> agree that something as trivial and lean as "ToString" should be a part of
> core.
> I'm particularly fond of makeString(prefix, toString, suffix) in various
> combinations (Scala-like).
> For "fromString", I think JB has a good point leveraging JAXB and Jackson -
> though I think this should be in extensions as it is not as lean as
> toString.
>
> Thanks,
> Amit
>
> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Hi Jesse,
> >
> > yes, I started something there (using JAXB and Jackson). Let me polish
> > and push.
> >
> > Regards
> > JB
> >
> > On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > > I went through the string conversions. Do you have an example of
> > > writing out XML/JSON/etc too?
> > >
> > > On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <[email protected]>
> > > wrote:
> > >
> > >> Hi Jesse,
> > >>
> > >>
> > >>
> > >> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> > >>
> > >> it's very simple and stupid and of course not complete at all (I have
> > >> other commits but not merged as they need some polishing), but as I
> > >> said, it's a base of discussion.
> > >>
> > >> Regards
> > >> JB
> > >>
> > >> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > >>> @jb Sounds good. Just let us know once you've pushed.
> > >>>
> > >>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> [email protected]>
> > >>> wrote:
> > >>>
> > >>>> Good point Eugene.
> > >>>>
> > >>>> Right now, it's a DoFn collection to experiment a bit (a pure
> > >>>> extension). It's pretty stupid ;)
> > >>>>
> > >>>> But, you are right, depending on the direction of such an extension,
> > >>>> it could cover more use cases (even if it's not my first intention ;)).
> > >>>>
> > >>>> Let me push the branch (pretty small) as an illustration, and in the
> > >>>> mean time, I'm preparing a document (more focused on the use cases).
> > >>>>
> > >>>> WDYT ?
> > >>>>
> > >>>> Regards
> > >>>> JB
> > >>>>
> > >>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > >>>>> Hi JB,
> > >>>>> Depending on the scope of what you want to ultimately accomplish with
> > >>>>> this extension, I think it may make sense to write a proposal document
> > >>>>> and discuss it.
> > >>>>> If it's just a collection of utility DoFn's for various well-defined
> > >>>>> source/target format pairs, then that's probably not needed, but if
> > >>>>> it's anything more, then I think it is.
> > >>>>> That will help avoid a lot of churn if people propose reasonable
> > >>>>> significant changes.
> > >>>>>
> > >>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > [email protected]
> > >>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch to my GitHub
> > >>>>>> and I will post on the dev mailing list when done.
> > >>>>>>
> > >>>>>> Regards
> > >>>>>> JB
> > >>>>>>
> > >>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > >>>>>>> I want to bring this thread back up since we've had time to think
> > >>>>>>> about it more and make a plan.
> > >>>>>>>
> > >>>>>>> I think a format-specific converter will be a more time-consuming
> > >>>>>>> task than we originally thought. It'd have to be a writer that takes
> > >>>>>>> another writer as a parameter.
> > >>>>>>>
> > >>>>>>> I think a string converter can be done as a simple transform.
> > >>>>>>>
> > >>>>>>> I think we should start with a simple string converter and plan for
> > >>>>>>> a format-specific writer.
> > >>>>>>>
> > >>>>>>> What are your thoughts?
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>>
> > >>>>>>> Jesse
> > >>>>>>>
> > >>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > >> [email protected]
> > >>>>>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>> I was thinking about what the outputs would look like last night. I
> > >>>>>>> realized that more complex formats like JSON and XML may or may not
> > >>>>>>> output the data in a valid format.
> > >>>>>>>
> > >>>>>>> Doing a direct conversion on unbounded collections would work just
> > >>>>>>> fine. They're self-contained. For writing out bounded collections,
> > >>>>>>> that's where we'll hit the issues. This changes the uber conversion
> > >>>>>>> transform into a transform that needs to be a writer.
> > >>>>>>>
> > >>>>>>> If a transform executes a JSON conversion on a per element basis,
> > >>>>>>> we'd get this:
> > >>>>>>> {
> > >>>>>>> "key": "value"
> > >>>>>>> }, {
> > >>>>>>> "key": "value"
> > >>>>>>> },
> > >>>>>>>
> > >>>>>>> That isn't valid JSON.
> > >>>>>>>
> > >>>>>>> The conversion transform would need to do several things when
> > >>>>>>> writing out a file. It would need to add brackets for an array.
> > >>>>>>> Now we have:
> > >>>>>>> [
> > >>>>>>> {
> > >>>>>>> "key": "value"
> > >>>>>>> }, {
> > >>>>>>> "key": "value"
> > >>>>>>> },
> > >>>>>>> ]
> > >>>>>>>
> > >>>>>>> We still don't have valid JSON. We have to remove the last comma or
> > >>>>>>> have the uber transform start putting in the commas, except for the
> > >>>>>>> last element.
> > >>>>>>>
> > >>>>>>> [
> > >>>>>>> {
> > >>>>>>> "key": "value"
> > >>>>>>> }, {
> > >>>>>>> "key": "value"
> > >>>>>>> }
> > >>>>>>> ]
> > >>>>>>>
> > >>>>>>> Only by doing this do we have valid JSON.
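
The framing steps above can be sketched in plain Java, without Beam. This is only an illustration under two assumptions: every element has already been serialized to a JSON object string, and all of them are available in one place (which is what makes this writer-like rather than a per-element transform). JsonArrayFraming and toJsonArray are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper: frames already-serialized JSON objects as one JSON array.
final class JsonArrayFraming {
    private JsonArrayFraming() {}

    static String toJsonArray(Iterable<String> serializedElements) {
        // StringJoiner inserts the separator only between elements, so there is
        // no trailing comma to strip, and the brackets are added exactly once.
        StringJoiner array = new StringJoiner(", ", "[", "]");
        for (String element : serializedElements) {
            array.add(element);
        }
        return array.toString();
    }

    public static void main(String[] args) {
        List<String> elements = Arrays.asList(
            "{\"key\": \"value\"}",
            "{\"key\": \"value\"}");
        System.out.println(toJsonArray(elements));
    }
}
```

The helper does not validate or escape its inputs; it only handles the brackets and commas discussed above.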
> > >>>>>>>
> > >>>>>>> I'd argue we'd have a similar issue with XML. Some parsers require a
> > >>>>>>> root element for everything. The uber transform would have to put the
> > >>>>>>> root element tags at the beginning and end of the file.
> > >>>>>>>
> > >>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > [email protected]>
> > >>>>>> wrote:
> > >>>>>>>
> > >>>>>>> I would love to see a lean core and abundant Transforms at the same
> > >>>>>>> time.
> > >>>>>>>
> > >>>>>>> Maybe we can look at what Confluent <https://github.com/confluentinc>
> > >>>>>>> does for kafka-connect. They have official extensions support for
> > >>>>>>> JDBC, HDFS and ElasticSearch under https://github.com/confluentinc.
> > >>>>>>> They put them along with other community extensions on
> > >>>>>>> https://www.confluent.io/product/connectors/ for visibility.
> > >>>>>>>
> > >>>>>>> Although we are not a commercial company, can we have a GitHub user
> > >>>>>>> like beam-community to host projects we build around Beam that are
> > >>>>>>> not suitable for https://github.com/apache/incubator-beam? In the
> > >>>>>>> future, we may have beam-algebra like
> > >>>>>>> http://github.com/twitter/algebird for algebra operations and
> > >>>>>>> beam-ml / beam-dl for machine learning / deep learning. Also, there
> > >>>>>>> will be Beam-related projects elsewhere maintained by other
> > >>>>>>> communities. We can put all of them on the beam-website or on
> > >>>>>>> something like spark packages, as mentioned by Amit.
> > >>>>>>>
> > >>>>>>> My $0.02
> > >>>>>>> Manu
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > >> <[email protected]
> > >>>>>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit from a
> > >>>>>>>> place for miscellaneous non-core helper transformations.
> > >>>>>>>>
> > >>>>>>>> We have sdks/java/extensions but it is organized as separate
> > >>>>>>>> artifacts. I think that is fine, considering the nature of Join and
> > >>>>>>>> SortValues. But for simpler transforms, importing one artifact per
> > >>>>>>>> tiny transform is too much overhead. It also seems unlikely that we
> > >>>>>>>> will have enough commonality among the transforms to call the
> > >>>>>>>> artifact anything other than [some synonym for] "miscellaneous".
> > >>>>>>>>
> > >>>>>>>> I wouldn't want to take this too far - even though the SDK has many
> > >>>>>>>> transforms* that are not required for the model [1], I like that the
> > >>>>>>>> SDK artifact has everything a user might need in their "getting
> > >>>>>>>> started" phase of use. This user-friendliness (the user doesn't care
> > >>>>>>>> that ParDo is core and Sum is not) plus the difficulty of judging
> > >>>>>>>> which transforms go where, are probably why we have them mostly all
> > >>>>>>>> in one place.
> > >>>>>>>>
> > >>>>>>>> Models to look at, off the top of my head, include Pig's PiggyBank
> > >>>>>>>> and Apex's Malhar. These have different levels of support implied.
> > >>>>>>>> Others?
> > >>>>>>>>
> > >>>>>>>> Kenn
> > >>>>>>>>
> > >>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
> > >>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, Values,
> > >>>>>>>> KvSwap, Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> > >>>>>>>> WithTimestamps
> > >>>>>>>>
> > >>>>>>>> * at least they are separate classes and not methods on PCollection :-)
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <[email protected]
> >
> > >>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Nice discussion, and thanks Jesse for bringing this subject back.
> > >>>>>>>>>
> > >>>>>>>>> I agree 100% with Amit and the idea of having a home for those
> > >>>>>>>>> transforms that are not core enough to be part of the sdk, but that
> > >>>>>>>>> we all end up re-writing somehow.
> > >>>>>>>>>
> > >>>>>>>>> This is a needed improvement to be more developer friendly, but also
> > >>>>>>>>> as a reference of good practices of Beam development, and for this
> > >>>>>>>>> reason I agree with JB that at this moment it would be better for
> > >>>>>>>>> these transforms to reside in the Beam repository at least for
> > >>>>>>>>> visibility reasons.
> > >>>>>>>>>
> > >>>>>>>>> One additional question is if these transforms represent a different
> > >>>>>>>>> DSL or if those could be grouped with the current extensions (e.g.
> > >>>>>>>>> Join and SortValues) into something more general that we as a
> > >>>>>>>>> community could maintain, but well even if it is not the case, it
> > >>>>>>>>> would be really nice to start working on something like this.
> > >>>>>>>>>
> > >>>>>>>>> Ismaël Mejía
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> > >>>> [email protected]
> > >>>>>>>
> > >>>>>>>>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Related to spark-package, we also have Apache Bahir to host
> > >>>>>>>>>> connectors/transforms for Spark and Flink.
> > >>>>>>>>>>
> > >>>>>>>>>> IMHO, right now, Beam should host this, not sure if it makes sense
> > >>>>>>>>>> directly in the core.
> > >>>>>>>>>>
> > >>>>>>>>>> It reminds me of the "Integration" DSL we discussed in the technical
> > >>>>>>>>>> vision document.
> > >>>>>>>>>>
> > >>>>>>>>>> Regards
> > >>>>>>>>>> JB
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> I think Jesse has a very good point on one hand, while Luke's and
> > >>>>>>>>>>> Kenneth's worries about committing users to specific implementations
> > >>>>>>>>>>> are well placed.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The Spark community has a 3rd party repository for useful libraries
> > >>>>>>>>>>> that for various reasons are not a part of the Apache Spark project:
> > >>>>>>>>>>> https://spark-packages.org/.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Maybe a "common-transformations" package would serve both users'
> > >>>>>>>>>>> quick ramp-up and ease-of-use while keeping Beam more "enabling"?
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > >>>>>>> <[email protected]
> > >>>>>>>>>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> It seems useful for small scale debugging / demoing to have
> > >>>>>>>>>>>> Dump.toString(). I think it should be named to clearly indicate its
> > >>>>>>>>>>>> limited scope. Maybe other stuff could go in the Dump namespace, but
> > >>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it should be pretty
> > >>>>>>>>>>>> printed, not treated as a machine-to-machine wire format.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The broader question of representing data in JSON or XML, etc, is
> > >>>>>>>>>>>> already the subject of many mature libraries which are already easy
> > >>>>>>>>>>>> to use with Beam.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit coercions
> > >>>>>>>>>>>> seems like it is also already addressed in many ways elsewhere.
> > >>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> > >>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In both of the last cases, there are many reasonable approaches, and
> > >>>>>>>>>>>> we shouldn't commit our users to one of them.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > >>>>>>>> <[email protected]
> > >>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> The suggestions you give seem good except for the XML cases.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Might want to have the XML be a document per line similar to the
> > >>>>>>>>>>>>> JSON examples you have been giving.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> > >>>>>>>>> [email protected]>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was more
> > >>>>>>>>>>>>>> thinking that whatever the addition, it shouldn't just handle KV.
> > >>>>>>>>>>>>>> It should handle Iterables, Lists, Sets, and KVs.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give someone
> > >>>>>>>>>>>>>> something general purpose enough, or whether you would just end
> > >>>>>>>>>>>>>> up writing your own code to handle it anyway.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Here are some ideas on what it could look like with a method and
> > >>>>>>>>>>>>>> the resulting string output:
> > >>>>>>>>>>>>>> *Stringify.toJSON()*
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>> {"key": "value"}
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>> ["one", "two", "three"]
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>> <rootelement key="value" />
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>> <rootelement>
> > >>>>>>>>>>>>>>   <item>one</item>
> > >>>>>>>>>>>>>>   <item>two</item>
> > >>>>>>>>>>>>>>   <item>three</item>
> > >>>>>>>>>>>>>> </rootelement>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>> key,value
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>> one,two,three
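
As a rough illustration of the delimited case only, here is a plain-Java sketch that produces the two outputs listed above. It does not use Beam: Map.Entry stands in for KV, and DelimitedSketch and toDelimited are hypothetical names chosen to mirror the proposal. No quoting or escaping of the delimiter is attempted.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of a "toDelimited" helper; Map.Entry stands in for KV.
final class DelimitedSketch {
    private DelimitedSketch() {}

    // Key/value pair: key, delimiter, value.
    static String toDelimited(Map.Entry<?, ?> kv, String delimiter) {
        return String.valueOf(kv.getKey()) + delimiter + String.valueOf(kv.getValue());
    }

    // Iterable: items joined with the delimiter placed between them only.
    static String toDelimited(Iterable<?> items, String delimiter) {
        StringJoiner joined = new StringJoiner(delimiter);
        for (Object item : items) {
            joined.add(String.valueOf(item));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Map.Entry<String, String> kv = new AbstractMap.SimpleEntry<String, String>("key", "value");
        System.out.println(toDelimited(kv, ","));                                   // prints key,value
        System.out.println(toDelimited(Arrays.asList("one", "two", "three"), ",")); // prints one,two,three
    }
}
```

Fields that themselves contain the delimiter would need escaping or quoting, which is the kind of format-specific variance raised elsewhere in the thread.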
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Do you think that would strike a good balance between reusable
> > >>>>>>>>>>>>>> code and writing your own for more difficult formatting?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > >>>>>>>> <[email protected]
> > >>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in TextIO,
> > >>>>>>>>>>>>>> people will then ask why JSON, XML, ... are not also supported.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Also, the example that you provide is using the fact that the
> > >>>>>>>>>>>>>> input format is an Iterable<Item>. You had posted a question about
> > >>>>>>>>>>>>>> using KV with TextIO.Write, which wouldn't align with the proposed
> > >>>>>>>>>>>>>> input format and would still require writing a type conversion
> > >>>>>>>>>>>>>> function, this time from KV to Iterable<Item> instead of KV to
> > >>>>>>>>>>>>>> string.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> > >>>>>>>>> [email protected]>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Lukasz,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> I don't think you'd need complicated logic for TextIO.Write. For
> > >>>>>>>>>>>>>>> CSV the call would look like:
> > >>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix, delimiter,
> > >>>>>>>>>>>>>>> suffix).
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The code would be something like:
> > >>>>>>>>>>>>>>> StringBuilder buffer = new StringBuilder(prefix);
> > >>>>>>>>>>>>>>> Iterator<Item> items = list.iterator();
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> while (items.hasNext()) {
> > >>>>>>>>>>>>>>>   buffer.append(items.next().toString());
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>   // Append the delimiter after every item except the last.
> > >>>>>>>>>>>>>>>   if (items.hasNext()) {
> > >>>>>>>>>>>>>>>     buffer.append(delimiter);
> > >>>>>>>>>>>>>>>   }
> > >>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> buffer.append(suffix);
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> c.output(buffer.toString());
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and other formats
> > >>>>>>>>>>>>>>> without complicated logic. The same sort of thing could be done
> > >>>>>>>>>>>>>>> for TextIO.Write.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > >>>>>>>>> <[email protected]
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The conversion from object to string will have uses outside of
> > >>>>>>>>>>>>>>>> just TextIO.Write so it seems logical that we would want to have
> > >>>>>>>>>>>>>>>> a ParDo do the conversion.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you consider
> > >>>>>>>>>>>>>>>> the subset of CSV like formats where it could have fixed width
> > >>>>>>>>>>>>>>>> fields, or escaping and quoting around other fields, or headers
> > >>>>>>>>>>>>>>>> that should be placed at the top.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Having all these format conversions within TextIO.Write seems
> > >>>>>>>>>>>>>>>> like a lot of logic to contain in that transform which should
> > >>>>>>>>>>>>>>>> just focus on writing to files.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> [email protected]>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing list.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I think there needs to be a way to convert a PCollection<KV> to a
> > >>>>>>>>>>>>>>>>> PCollection<String>.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually convert the KV to a
> > >>>>>>>>>>>>>>>>> String:
> > >>>>>>>>>>>>>>>>>         p
> > >>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>                 .apply(MapElements.via((KV<String, Long> count) ->
> > >>>>>>>>>>>>>>>>>                             count.getKey() + ":" + count.getValue()
> > >>>>>>>>>>>>>>>>>                         ).withOutputType(TypeDescriptors.strings()))
> > >>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> This code really should be something like:
> > >>>>>>>>>>>>>>>>>         p
> > >>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>                 .apply(ToString.stringify())
> > >>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> To summarize the discussion:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any KV or
> > >>>>>>>>>>>>>>>>>    list
> > >>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and runs
> > >>>>>>>>>>>>>>>>>    toString() on it:
> > >>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > >>>>>>>>>>>>>>>>>        @Override
> > >>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > >>>>>>>>>>>>>>>>>            return input.toString();
> > >>>>>>>>>>>>>>>>>        }
> > >>>>>>>>>>>>>>>>>    }
> > >>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in Apache Camel.
> > >>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would write out
> > >>>>>>>>>>>>>>>>>    the toString of any Object.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> My thoughts:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed when you're
> > >>>>>>>>>>>>>>>>> using TextIO.Write? Will a general purpose transform only work in
> > >>>>>>>>>>>>>>>>> certain cases and you'll normally have to write custom code to
> > >>>>>>>>>>>>>>>>> format the strings the way you want them?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to
> > >>>>>>>>>>>>>>>>> TextIO.Write or a SimpleFunction that takes a delimiter as an
> > >>>>>>>>>>>>>>>>> argument. Making a SimpleFunction that's able to specify a
> > >>>>>>>>>>>>>>>>> delimiter (and perhaps a prefix and suffix) should cover the
> > >>>>>>>>>>>>>>>>> majority of formats and cases.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>> --
> > >>>>>>>>>> Jean-Baptiste Onofré
> > >>>>>>>>>> [email protected]
> > >>>>>>>>>> http://blog.nanthrax.net
> > >>>>>>>>>> Talend - http://www.talend.com
> > >>>>>>>>>>
>
