Re: PCollection to PCollection Conversion

Amit Sela Tue, 13 Dec 2016 01:58:08 -0800

It seems that there were a lot of good points raised here, and I tend to
agree that something as trivial and lean as "ToString" should be a part of
core.
I'm particularly fond of makeString(prefix, toString, suffix) in various
combinations (Scala-like).
For "fromString", I think JB has a good point leveraging JAXB and Jackson -
though I think this should be in extensions as it is not as lean as
toString.
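
To make the makeString idea concrete, here is a minimal sketch in plain
Java with no Beam dependency. The class name MakeStringSketch and the
explicit delimiter argument are my own assumptions for illustration (the
delimiter mirrors the Stringify.to(prefix, delimiter, suffix) shape
discussed further down the thread); it is not an existing Beam API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the Beam SDK.
public class MakeStringSketch {

    // Joins the toString() of every element with the delimiter,
    // wrapped in the given prefix and suffix.
    public static String makeString(
            Iterable<?> items, String prefix, String delimiter, String suffix) {
        StringBuilder sb = new StringBuilder(prefix);
        Iterator<?> it = items.iterator();
        while (it.hasNext()) {
            sb.append(String.valueOf(it.next()));
            // Only add the delimiter between elements, never after the last.
            if (it.hasNext()) {
                sb.append(delimiter);
            }
        }
        return sb.append(suffix).toString();
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("one", "two", "three");
        // prints [one, two, three]
        System.out.println(makeString(items, "[", ", ", "]"));
    }
}
```

In a pipeline, something like this would presumably sit inside a
SimpleFunction or DoFn applied per element, which is why it stays lean
enough for core.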


Thanks,
Amit

On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Jesse,
>
> yes, I started something there (using JAXB and Jackson). Let me polish
> and push.
>
> Regards
> JB
>
> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > I went through the string conversions. Do you have an example of writing
> > out XML/JSON/etc too?
> >
> > On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> > wrote:
> >
> >> Hi Jesse,
> >>
> >>
> >>
> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> >>
> >> it's very simple and stupid and of course not complete at all (I have
> >> other commits but not merged as they need some polishing), but as I
> >> said, it's a base of discussion.
> >>
> >> Regards
> >> JB
> >>
> >> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> >>> @jb Sounds good. Just let us know once you've pushed.
> >>>
> >>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> >>> wrote:
> >>>
> >>>> Good point Eugene.
> >>>>
> >>>> Right now, it's a DoFn collection to experiment a bit (a pure
> >>>> extension). It's pretty stupid ;)
> >>>>
> >>>> But, you are right, depending on the direction of such an extension, it
> >>>> could cover more use cases (even if it's not my first intention ;)).
> >>>>
> >>>> Let me push the branch (pretty small) as an illustration, and in the
> >>>> mean time, I'm preparing a document (more focused on the use cases).
> >>>>
> >>>> WDYT ?
> >>>>
> >>>> Regards
> >>>> JB
> >>>>
> >>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> >>>>> Hi JB,
> >>>>> Depending on the scope of what you want to ultimately accomplish with this
> >>>>> extension, I think it may make sense to write a proposal document and
> >>>>> discuss it.
> >>>>> If it's just a collection of utility DoFn's for various well-defined
> >>>>> source/target format pairs, then that's probably not needed, but if it's
> >>>>> anything more, then I think it is.
> >>>>> That will help avoid a lot of churn if people propose reasonable
> >>>>> significant changes.
> >>>>>
> >>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> j...@nanthrax.net
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch to my GitHub and
> >>>>>> I will post on the dev mailing list when done.
> >>>>>>
> >>>>>> Regards
> >>>>>> JB
> >>>>>>
> >>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> >>>>>>> I want to bring this thread back up since we've had time to think about it
> >>>>>>> more and make a plan.
> >>>>>>>
> >>>>>>> I think a format-specific converter will be a more time-consuming task than
> >>>>>>> we originally thought. It'd have to be a writer that takes another writer
> >>>>>>> as a parameter.
> >>>>>>>
> >>>>>>> I think a string converter can be done as a simple transform.
> >>>>>>>
> >>>>>>> I think we should start with a simple string converter and plan for a
> >>>>>>> format-specific writer.
> >>>>>>>
> >>>>>>> What are your thoughts?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>>
> >>>>>>> Jesse
> >>>>>>>
> >>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> >> je...@smokinghand.com
> >>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>> I was thinking about what the outputs would look like last night. I
> >>>>>>> realized that more complex formats like JSON and XML may or may not output
> >>>>>>> the data in a valid format.
> >>>>>>>
> >>>>>>> Doing a direct conversion on unbounded collections would work just fine.
> >>>>>>> They're self-contained. For writing out bounded collections, that's where
> >>>>>>> we'll hit the issues. This changes the uber conversion transform into a
> >>>>>>> transform that needs to be a writer.
> >>>>>>>
> >>>>>>> If a transform executes a JSON conversion on a per element basis, we'd get
> >>>>>>> this:
> >>>>>>> {
> >>>>>>> "key": "value"
> >>>>>>> }, {
> >>>>>>> "key": "value"
> >>>>>>> },
> >>>>>>>
> >>>>>>> That isn't valid JSON.
> >>>>>>>
> >>>>>>> The conversion transform would need to do several things when writing
> >>>>>>> out a file. It would need to add brackets for an array. Now we have:
> >>>>>>> [
> >>>>>>> {
> >>>>>>> "key": "value"
> >>>>>>> }, {
> >>>>>>> "key": "value"
> >>>>>>> },
> >>>>>>> ]
> >>>>>>>
> >>>>>>> We still don't have valid JSON. We have to remove the last comma or have
> >>>>>>> the uber transform start putting in the commas, except for the last element.
> >>>>>>>
> >>>>>>> [
> >>>>>>> {
> >>>>>>> "key": "value"
> >>>>>>> }, {
> >>>>>>> "key": "value"
> >>>>>>> }
> >>>>>>> ]
> >>>>>>>
> >>>>>>> Only by doing this do we have valid JSON.
> >>>>>>>
> >>>>>>> I'd argue we'd have a similar issue with XML. Some parsers require a root
> >>>>>>> element for everything. The uber transform would have to put the root
> >>>>>>> element tags at the beginning and end of the file.
> >>>>>>>
> >>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> owenzhang1...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> I would love to see a lean core and abundant Transforms at the same time.
> >>>>>>>
> >>>>>>> Maybe we can look at what Confluent <https://github.com/confluentinc> does
> >>>>>>> for kafka-connect. They have official extensions support for JDBC, HDFS and
> >>>>>>> ElasticSearch under https://github.com/confluentinc. They put them along
> >>>>>>> with other community extensions on
> >>>>>>> https://www.confluent.io/product/connectors/ for visibility.
> >>>>>>>
> >>>>>>> Although we are not a commercial company, can we have a GitHub user like
> >>>>>>> beam-community to host projects we build around beam but not suitable for
> >>>>>>> https://github.com/apache/incubator-beam? In the future, we may have
> >>>>>>> beam-algebra like http://github.com/twitter/algebird for algebra operations
> >>>>>>> and beam-ml / beam-dl for machine learning / deep learning. Also, there
> >>>>>>> will be beam related projects elsewhere maintained by other
> >>>>>>> communities. We can put all of them on the beam-website or like spark
> >>>>>>> packages as mentioned by Amit.
> >>>>>>>
> >>>>>>> My $0.02
> >>>>>>> Manu
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> >> <k...@google.com.invalid
> >>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit from a place
> >>>>>>>> for miscellaneous non-core helper transformations.
> >>>>>>>>
> >>>>>>>> We have sdks/java/extensions but it is organized as separate artifacts. I
> >>>>>>>> think that is fine, considering the nature of Join and SortValues. But for
> >>>>>>>> simpler transforms, importing one artifact per tiny transform is too much
> >>>>>>>> overhead. It also seems unlikely that we will have enough commonality among
> >>>>>>>> the transforms to call the artifact anything other than [some synonym for]
> >>>>>>>> "miscellaneous".
> >>>>>>>>
> >>>>>>>> I wouldn't want to take this too far - even though the SDK has many
> >>>>>>>> transforms* that are not required for the model [1], I like that the SDK
> >>>>>>>> artifact has everything a user might need in their "getting started" phase
> >>>>>>>> of use. This user-friendliness (the user doesn't care that ParDo is core and
> >>>>>>>> Sum is not) plus the difficulty of judging which transforms go where, are
> >>>>>>>> probably why we have them mostly all in one place.
> >>>>>>>>
> >>>>>>>> Models to look at, off the top of my head, include Pig's PiggyBank and
> >>>>>>>> Apex's Malhar. These have different levels of support implied. Others?
> >>>>>>>>
> >>>>>>>> Kenn
> >>>>>>>>
> >>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
> >>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, Values, KvSwap,
> >>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys, WithTimestamps
> >>>>>>>>
> >>>>>>>> * at least they are separate classes and not methods on PCollection :-)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <ieme...@gmail.com>
> >>>> wrote:
> >>>>>>>>
> >>>>>>>>> Nice discussion, and thanks Jesse for bringing this subject back.
> >>>>>>>>>
> >>>>>>>>> I agree 100% with Amit and the idea of having a home for those transforms
> >>>>>>>>> that are not core enough to be part of the sdk, but that we all end up
> >>>>>>>>> re-writing somehow.
> >>>>>>>>>
> >>>>>>>>> This is a needed improvement to be more developer friendly, but also as a
> >>>>>>>>> reference of good practices of Beam development, and for this reason I
> >>>>>>>>> agree with JB that at this moment it would be better for these transforms
> >>>>>>>>> to reside in the Beam repository at least for visibility reasons.
> >>>>>>>>>
> >>>>>>>>> One additional question is if these transforms represent a different DSL or
> >>>>>>>>> if those could be grouped with the current extensions (e.g. Join and
> >>>>>>>>> SortValues) into something more general that we as a community could
> >>>>>>>>> maintain, but well even if it is not the case, it would be really nice to
> >>>>>>>>> start working on something like this.
> >>>>>>>>>
> >>>>>>>>> Ismaël Mejía
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> >>>> j...@nanthrax.net
> >>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Related to spark-package, we also have Apache Bahir to host
> >>>>>>>>>> connectors/transforms for Spark and Flink.
> >>>>>>>>>>
> >>>>>>>>>> IMHO, right now, Beam should host this, not sure if it makes sense
> >>>>>>>>>> directly in the core.
> >>>>>>>>>>
> >>>>>>>>>> It reminds me of the "Integration" DSL we discussed in the technical vision
> >>>>>>>>>> document.
> >>>>>>>>>>
> >>>>>>>>>> Regards
> >>>>>>>>>> JB
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I think Jesse has a very good point on one hand, while Luke's and
> >>>>>>>>>>> Kenneth's worries about committing users to specific implementations are
> >>>>>>>>>>> well placed.
> >>>>>>>>>>>
> >>>>>>>>>>> The Spark community has a 3rd party repository for useful libraries that
> >>>>>>>>>>> for various reasons are not a part of the Apache Spark project:
> >>>>>>>>>>> https://spark-packages.org/.
> >>>>>>>>>>>
> >>>>>>>>>>> Maybe a "common-transformations" package would serve both users' quick
> >>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more "enabling" ?
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> >>>>>>> <k...@google.com.invalid
> >>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> It seems useful for small scale debugging / demoing to have
> >>>>>>>>>>>> Dump.toString(). I think it should be named to clearly indicate its limited
> >>>>>>>>>>>> scope. Maybe other stuff could go in the Dump namespace, but
> >>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it should be pretty
> >>>>>>>>>>>> printed, not treated as a machine-to-machine wire format.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The broader question of representing data in JSON or XML, etc, is already
> >>>>>>>>>>>> the subject of many mature libraries which are already easy to use with
> >>>>>>>>>>>> Beam.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit coercions seems
> >>>>>>>>>>>> like it is also already addressed in many ways elsewhere.
> >>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> >>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In both of the last cases, there are many reasonable approaches, and we
> >>>>>>>>>>>> shouldn't commit our users to one of them.
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> >>>>>>>> <lc...@google.com.invalid
> >>>>>>>>>>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> The suggestions you give seem good except for the XML cases.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Might want to have the XML be a document per line similar to the JSON
> >>>>>>>>>>>>> examples you have been giving.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> >>>>>>>>> je...@smokinghand.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was more thinking
> >>>>>>>>>>>>>> that whatever the addition, it shouldn't just handle KV. It should handle
> >>>>>>>>>>>>>> Iterables, Lists, Sets, and KVs.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give someone something
> >>>>>>>>>>>>>> general purpose enough, or whether you would just end up writing your own
> >>>>>>>>>>>>>> code to handle it anyway.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Here are some ideas on what it could look like with a method and the
> >>>>>>>>>>>>>> resulting string output:
> >>>>>>>>>>>>>> *Stringify.toJSON()*
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With KV:
> >>>>>>>>>>>>>> {"key": "value"}
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With Iterables:
> >>>>>>>>>>>>>> ["one", "two", "three"]
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With KV:
> >>>>>>>>>>>>>> <rootelement key="value" />
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With Iterables:
> >>>>>>>>>>>>>> <rootelement>
> >>>>>>>>>>>>>>   <item>one</item>
> >>>>>>>>>>>>>>   <item>two</item>
> >>>>>>>>>>>>>>   <item>three</item>
> >>>>>>>>>>>>>> </rootelement>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With KV:
> >>>>>>>>>>>>>> key,value
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> With Iterables:
> >>>>>>>>>>>>>> one,two,three
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Do you think that would strike a good balance between reusable code and
> >>>>>>>>>>>>>> writing your own for more difficult formatting?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jesse
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> >>>>>>>> <lc...@google.com.invalid
> >>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in TextIO, people
> >>>>>>>>>>>>>> will then ask why JSON, XML, ... aren't also supported.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Also, the example that you provide is using the fact that the input format
> >>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question about using KV with
> >>>>>>>>>>>>>> TextIO.Write which wouldn't align with the proposed input format and still
> >>>>>>>>>>>>>> would require writing a type conversion function, this time from KV to
> >>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> >>>>>>>>> je...@smokinghand.com>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Lukasz,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I don't think you'd need complicated logic for TextIO.Write. For CSV the
> >>>>>>>>>>>>>>> call would look like:
> >>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix, delimiter, suffix).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> The code would be something like:
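> >>>>>>>>>>>>>>> StringBuilder buffer = new StringBuilder(prefix);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> // Walk the items with an iterator so we know when we reach the last one.
> >>>>>>>>>>>>>>> Iterator<Item> items = list.iterator();
> >>>>>>>>>>>>>>> while (items.hasNext()) {
> >>>>>>>>>>>>>>>   buffer.append(items.next().toString());
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>   // Put the delimiter between items, not after the last one.
> >>>>>>>>>>>>>>>   if (items.hasNext()) {
> >>>>>>>>>>>>>>>     buffer.append(delimiter);
> >>>>>>>>>>>>>>>   }
> >>>>>>>>>>>>>>> }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> buffer.append(suffix);
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> c.output(buffer.toString());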
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and other formats without
> >>>>>>>>>>>>>>> complicated logic. The same sort of thing could be done for TextIO.Write.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Jesse
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> >>>>>>>>> <lc...@google.com.invalid
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The conversion from object to string will have uses outside of just
> >>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want to have a ParDo do the
> >>>>>>>>>>>>>>>> conversion.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you consider the subset
> >>>>>>>>>>>>>>>> of CSV like formats where it could have fixed width fields, or escaping and
> >>>>>>>>>>>>>>>> quoting around other fields, or headers that should be placed at the top.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Having all these format conversions within TextIO.Write seems like a lot of
> >>>>>>>>>>>>>>>> logic to contain in that transform which should just focus on writing to
> >>>>>>>>>>>>>>>> files.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> je...@smokinghand.com>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is a thread moved over from the user mailing list.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I think there needs to be a way to convert a PCollection<KV> to a
> >>>>>>>>>>>>>>>>> PCollection<String>.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually convert the KV to a
> >>>>>>>>>>>>>>>>> String:
> >>>>>>>>>>>>>>>>>         p
> >>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> >>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> >>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String, Long> count) ->*
> >>>>>>>>>>>>>>>>> *                            count.getKey() + ":" + count.getValue()*
> >>>>>>>>>>>>>>>>> *                        ).withOutputType(TypeDescriptors.strings()))*
> >>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> This code really should be something like:
> >>>>>>>>>>>>>>>>>         p
> >>>>>>>>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> >>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> >>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> >>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> To summarize the discussion:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any KV or list
> >>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and runs toString()
> >>>>>>>>>>>>>>>>>    on it:
> >>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> >>>>>>>>>>>>>>>>>        @Override
> >>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> >>>>>>>>>>>>>>>>>            return input.toString();
> >>>>>>>>>>>>>>>>>        }
> >>>>>>>>>>>>>>>>>    }
> >>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in Apache Camel.
> >>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would write out the
> >>>>>>>>>>>>>>>>>    toString of any Object.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> My thoughts:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed when you're using
> >>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only work in certain cases
> >>>>>>>>>>>>>>>>> and you'll normally have to write custom code to format the strings the way
> >>>>>>>>>>>>>>>>> you want them?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to TextIO.Write or
> >>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an argument. Making a
> >>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a delimiter (and perhaps a prefix and
> >>>>>>>>>>>>>>>>> suffix) should cover the majority of formats and cases.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Jesse
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Jean-Baptiste Onofré
> >>>>>>>>>> jbono...@apache.org
> >>>>>>>>>> http://blog.nanthrax.net
> >>>>>>>>>> Talend - http://www.talend.com
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Jean-Baptiste Onofré
> >>>>>> jbono...@apache.org
> >>>>>> http://blog.nanthrax.net
> >>>>>> Talend - http://www.talend.com
> >>>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Jean-Baptiste Onofré
> >>>> jbono...@apache.org
> >>>> http://blog.nanthrax.net
> >>>> Talend - http://www.talend.com
> >>>>
> >>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
> >>
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
