Re: PCollection to PCollection Conversion

Eugene Kirpichov Tue, 29 Nov 2016 11:48:13 -0800

Hi JB,
Depending on the scope of what you want to ultimately accomplish with this
extension, I think it may make sense to write a proposal document and
discuss it.
If it's just a collection of utility DoFn's for various well-defined
source/target format pairs, then that's probably not needed, but if it's
anything more, then I think it is.
That will help avoid a lot of churn if people propose reasonable
significant changes.


On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <[email protected]>
wrote:

> By the way Jesse, I'm going to push my DATAFORMAT branch to my GitHub and I
> will post on the dev mailing list when done.
>
> Regards
> JB
>
> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > I want to bring this thread back up since we've had time to think about it
> > more and make a plan.
> >
> > I think a format-specific converter will be a more time-consuming task than
> > we originally thought. It'd have to be a writer that takes another writer
> > as a parameter.
> >
> > I think a string converter can be done as a simple transform.
> >
> > I think we should start with a simple string converter and plan for a
> > format-specific writer.
> >
> > What are your thoughts?
> >
> > Thanks,
> >
> > Jesse
> >
> > On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <[email protected]>
> > wrote:
> >
> > I was thinking about what the outputs would look like last night. I
> > realized that more complex formats like JSON and XML may or may not output
> > the data in a valid format.
> >
> > Doing a direct conversion on unbounded collections would work just fine.
> > They're self-contained. For writing out bounded collections, that's where
> > we'll hit the issues. This changes the uber conversion transform into a
> > transform that needs to be a writer.
> >
> > If a transform executes a JSON conversion on a per-element basis, we'd get
> > this:
> > {
> > "key": "value"
> > }, {
> > "key": "value"
> > },
> >
> > That isn't valid JSON.
> >
> > The conversion transform would need to do several things when writing
> > out a file. It would need to add brackets for an array. Now we have:
> > [
> > {
> > "key": "value"
> > }, {
> > "key": "value"
> > },
> > ]
> >
> > We still don't have valid JSON. We have to remove the last comma or have
> > the uber transform start putting in the commas, except for the last element.
> >
> > [
> > {
> > "key": "value"
> > }, {
> > "key": "value"
> > }
> > ]
> >
> > Only by doing this do we have valid JSON.
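
The bracket-and-comma handling described above can be sketched in plain Java. This is only a minimal illustration, not Beam API: the class name JsonArrayWriter and the method toJsonArray are hypothetical, and the sketch assumes every element has already been serialized to a valid JSON value and that the writer sees all elements of the bounded collection.

```java
import java.util.Arrays;
import java.util.List;

public class JsonArrayWriter {
    // Hypothetical helper: wraps already-serialized JSON values in a single array.
    static String toJsonArray(List<String> jsonElements) {
        StringBuilder out = new StringBuilder("[");
        boolean first = true;
        for (String element : jsonElements) {
            // Emit the comma before every element except the first,
            // so there is never a trailing comma to remove.
            if (!first) {
                out.append(", ");
            }
            out.append(element);
            first = false;
        }
        return out.append("]").toString();
    }

    public static void main(String[] args) {
        List<String> elements =
                Arrays.asList("{\"key\": \"value\"}", "{\"key\": \"value\"}");
        System.out.println(toJsonArray(elements));
    }
}
```

Writing the separator before each element rather than after it avoids needing to know which element is the last one.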
> >
> > I'd argue we'd have a similar issue with XML. Some parsers require a root
> > element for everything. The uber transform would have to put the root
> > element tags at the beginning and end of the file.
> >
> > On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <[email protected]>
> wrote:
> >
> > I would love to see a lean core and abundant Transforms at the same time.
> >
> > Maybe we can look at what Confluent <https://github.com/confluentinc> does
> > for kafka-connect. They have official extension support for JDBC, HDFS and
> > ElasticSearch under https://github.com/confluentinc. They put them along
> > with other community extensions on
> > https://www.confluent.io/product/connectors/ for visibility.
> >
> > Although we are not a commercial company, can we have a GitHub user like
> > beam-community to host projects we build around Beam that are not suitable
> > for https://github.com/apache/incubator-beam? In the future, we may have
> > beam-algebra like http://github.com/twitter/algebird for algebra operations
> > and beam-ml / beam-dl for machine learning / deep learning. Also, there
> > will be Beam-related projects elsewhere maintained by other
> > communities. We can put all of them on the Beam website or on something
> > like Spark packages, as mentioned by Amit.
> >
> > My $0.02
> > Manu
> >
> >
> >
> > On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles <[email protected]>
> > wrote:
> >
> >> On this point from Amit and Ismaël, I agree: we could benefit from a place
> >> for miscellaneous non-core helper transformations.
> >>
> >> We have sdks/java/extensions but it is organized as separate artifacts. I
> >> think that is fine, considering the nature of Join and SortValues. But for
> >> simpler transforms, importing one artifact per tiny transform is too much
> >> overhead. It also seems unlikely that we will have enough commonality among
> >> the transforms to call the artifact anything other than [some synonym for]
> >> "miscellaneous".
> >>
> >> I wouldn't want to take this too far - even though the SDK has many
> >> transforms* that are not required for the model [1], I like that the SDK
> >> artifact has everything a user might need in their "getting started" phase
> >> of use. This user-friendliness (the user doesn't care that ParDo is core
> >> and Sum is not) plus the difficulty of judging which transforms go where
> >> are probably why we have them mostly all in one place.
> >>
> >> Models to look at, off the top of my head, include Pig's PiggyBank and
> >> Apex's Malhar. These have different levels of support implied. Others?
> >>
> >> Kenn
> >>
> >> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
> >> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, KvSwap,
> >> Partition, Regex, Sample, Sum, Top, Values, WithKeys, WithTimestamps
> >>
> >> * at least they are separate classes and not methods on PCollection :-)
> >>
> >>
> >> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <[email protected]> wrote:
> >>
> >>> Nice discussion, and thanks Jesse for bringing this subject back.
> >>>
> >>> I agree 100% with Amit and the idea of having a home for those transforms
> >>> that are not core enough to be part of the sdk, but that we all end up
> >>> re-writing somehow.
> >>>
> >>> This is a needed improvement to be more developer friendly, but also as a
> >>> reference of good practices of Beam development, and for this reason I
> >>> agree with JB that at this moment it would be better for these transforms
> >>> to reside in the Beam repository at least for visibility reasons.
> >>>
> >>> One additional question is if these transforms represent a different DSL or
> >>> if those could be grouped with the current extensions (e.g. Join and
> >>> SortValues) into something more general that we as a community could
> >>> maintain, but well even if it is not the case, it would be really nice to
> >>> start working on something like this.
> >>>
> >>> Ismaël Mejía
> >>>
> >>>
> >>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <[email protected]
> >
> >>> wrote:
> >>>
> >>>> Related to spark-package, we also have Apache Bahir to host
> >>>> connectors/transforms for Spark and Flink.
> >>>>
> >>>> IMHO, right now, Beam should host this, not sure if it makes sense
> >>>> directly in the core.
> >>>>
> >>>> It reminds me of the "Integration" DSL we discussed in the technical
> >>>> vision document.
> >>>>
> >>>> Regards
> >>>> JB
> >>>>
> >>>>
> >>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> >>>>
> >>>>> I think Jesse has a very good point on one hand, while Luke's and
> >>>>> Kenneth's worries about committing users to specific implementations
> >>>>> are well placed.
> >>>>>
> >>>>> The Spark community has a 3rd party repository for useful libraries that
> >>>>> for various reasons are not a part of the Apache Spark project:
> >>>>> https://spark-packages.org/.
> >>>>>
> >>>>> Maybe a "common-transformations" package would serve both users' quick
> >>>>> ramp-up and ease-of-use while keeping Beam more "enabling"?
> >>>>>
> >>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > <[email protected]
> >>>
> >>>>> wrote:
> >>>>>
> >>>>>> It seems useful for small scale debugging / demoing to have
> >>>>>> Dump.toString(). I think it should be named to clearly indicate its
> >>>>>> limited scope. Maybe other stuff could go in the Dump namespace, but
> >>>>>> "Dump.toJson()" would be for humans to read - so it should be pretty
> >>>>>> printed, not treated as a machine-to-machine wire format.
> >>>>>>
> >>>>>> The broader question of representing data in JSON or XML, etc, is already
> >>>>>> the subject of many mature libraries which are already easy to use with
> >>>>>> Beam.
> >>>>>>
> >>>>>> The more esoteric practice of implicit or semi-implicit coercions seems
> >>>>>> like it is also already addressed in many ways elsewhere.
> >>>>>> Transform.via(TypeConverter) is basically the same as
> >>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
> >>>>>>
> >>>>>> In both of the last cases, there are many reasonable approaches, and we
> >>>>>> shouldn't commit our users to one of them.
> >>>>>>
> >>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> >> <[email protected]
> >>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> The suggestions you give seem good except for the XML cases.
> >>>>>>>
> >>>>>>> Might want to have the XML be a document per line similar to the JSON
> >>>>>>> examples you have been giving.
> >>>>>>>
> >>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> >>> [email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> @lukasz Agreed there would have to be KV handling. I was thinking more
> >>>>>>>> that whatever the addition, it shouldn't just handle KV. It should
> >>>>>>>> handle Iterables, Lists, Sets, and KVs.
> >>>>>>>>
> >>>>>>>> For JSON and XML, I wonder if we'd be able to give someone something
> >>>>>>>> general purpose enough, or whether you would just end up writing your
> >>>>>>>> own code to handle it anyway.
> >>>>>>>>
> >>>>>>>> Here are some ideas on what it could look like with a method and the
> >>>>>>>> resulting string output:
> >>>>>>>> *Stringify.toJSON()*
> >>>>>>>>
> >>>>>>>> With KV:
> >>>>>>>> {"key": "value"}
> >>>>>>>>
> >>>>>>>> With Iterables:
> >>>>>>>> ["one", "two", "three"]
> >>>>>>>>
> >>>>>>>> *Stringify.toXML("rootelement")*
> >>>>>>>>
> >>>>>>>> With KV:
> >>>>>>>> <rootelement key="value" />
> >>>>>>>>
> >>>>>>>> With Iterables:
> >>>>>>>> <rootelement>
> >>>>>>>>   <item>one</item>
> >>>>>>>>   <item>two</item>
> >>>>>>>>   <item>three</item>
> >>>>>>>> </rootelement>
> >>>>>>>>
> >>>>>>>> *Stringify.toDelimited(",")*
> >>>>>>>>
> >>>>>>>> With KV:
> >>>>>>>> key,value
> >>>>>>>>
> >>>>>>>> With Iterables:
> >>>>>>>> one,two,three
> >>>>>>>>
> >>>>>>>> Do you think that would strike a good balance between reusable code and
> >>>>>>>> writing your own for more difficult formatting?
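
As a rough sketch of how the delimited case above might treat KVs and Iterables with one method, here is a plain-Java version. It is not Beam code: java.util.Map.Entry stands in for Beam's KV so the example runs without the SDK, and the names DelimitedSketch and toDelimited are hypothetical. Escaping and quoting of values are deliberately left out.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;
import java.util.StringJoiner;

public class DelimitedSketch {
    // Hypothetical helper; Map.Entry is a stand-in for Beam's KV.
    static String toDelimited(Object input, String delimiter) {
        if (input instanceof Map.Entry) {
            // Key/value pair: key, delimiter, value.
            Map.Entry<?, ?> kv = (Map.Entry<?, ?>) input;
            return kv.getKey() + delimiter + kv.getValue();
        }
        if (input instanceof Iterable) {
            // Lists, Sets and other Iterables: join the items in iteration order.
            StringJoiner joiner = new StringJoiner(delimiter);
            for (Object item : (Iterable<?>) input) {
                joiner.add(String.valueOf(item));
            }
            return joiner.toString();
        }
        // Anything else falls back to its string form.
        return String.valueOf(input);
    }

    public static void main(String[] args) {
        System.out.println(toDelimited(new SimpleEntry<>("key", "value"), ","));
        System.out.println(toDelimited(Arrays.asList("one", "two", "three"), ","));
    }
}
```

Anything more involved than this, such as nested values or quoting rules, is where custom formatting code would likely take over.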
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jesse
> >>>>>>>>
> >>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> >> <[email protected]
> >>>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Jesse, I believe if one format gets special treatment in TextIO, people
> >>>>>>>> will then ask why JSON, XML, ... are not also supported.
> >>>>>>>>
> >>>>>>>> Also, the example that you provide is using the fact that the input
> >>>>>>>> format is an Iterable<Item>. You had posted a question about using KV
> >>>>>>>> with TextIO.Write which wouldn't align with the proposed input format
> >>>>>>>> and still would require writing a type conversion function, this time
> >>>>>>>> from KV to Iterable<Item> instead of KV to string.
> >>>>>>>>
> >>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> >>> [email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Lukasz,
> >>>>>>>>>
> >>>>>>>>> I don't think you'd need complicated logic for TextIO.Write. For CSV
> >>>>>>>>> the call would look like:
> >>>>>>>>> Stringify.to("", ",", "\n");
> >>>>>>>>>
> >>>>>>>>> Where the arguments would be Stringify.to(prefix, delimiter, suffix).
> >>>>>>>>> The code would be something like:
> >>>>>>>>> StringBuilder buffer = new StringBuilder(prefix);
> >>>>>>>>> boolean first = true;
> >>>>>>>>>
> >>>>>>>>> for (Item item : list) {
> >>>>>>>>>   // Put the delimiter between items, never after the last one.
> >>>>>>>>>   if (!first) {
> >>>>>>>>>     buffer.append(delimiter);
> >>>>>>>>>   }
> >>>>>>>>>
> >>>>>>>>>   buffer.append(item.toString());
> >>>>>>>>>   first = false;
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> buffer.append(suffix);
> >>>>>>>>>
> >>>>>>>>> c.output(buffer.toString());
> >>>>>>>>>
> >>>>>>>>> That would allow you to do the basic CSV, TSV, and other formats without
> >>>>>>>>> complicated logic. The same sort of thing could be done for
> >>>>>>>>> TextIO.Write.
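
For what it's worth, the prefix/delimiter/suffix idea above maps closely onto java.util.StringJoiner from the Java 8 standard library. The following is a minimal sketch under that assumption, not the proposed Beam transform: the class StringifySketch and the method stringify are hypothetical names, and inside a pipeline the resulting string would still have to be emitted from a DoFn or SimpleFunction.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class StringifySketch {
    // Hypothetical stand-in for the proposed Stringify.to(prefix, delimiter, suffix).
    static String stringify(Iterable<?> items, String prefix, String delimiter, String suffix) {
        // StringJoiner takes (delimiter, prefix, suffix) and only places
        // the delimiter between elements.
        StringJoiner joiner = new StringJoiner(delimiter, prefix, suffix);
        for (Object item : items) {
            joiner.add(String.valueOf(item));
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        List<String> row = Arrays.asList("ace", "spades", "1");
        // CSV: empty prefix, comma delimiter, newline suffix.
        System.out.print(stringify(row, "", ",", "\n"));
        // TSV: only the delimiter changes.
        System.out.print(stringify(row, "", "\t", "\n"));
    }
}
```

Fixed-width fields, quoting, and headers are outside what a joiner like this can express, which matches the concern raised earlier in the thread.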
> >>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Jesse
> >>>>>>>>>
> >>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> >>> <[email protected]
> >>>>>>>>>
> >>>>>>>>
> >>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> The conversion from object to string will have uses outside of just
> >>>>>>>>>> TextIO.Write so it seems logical that we would want to have a ParDo do
> >>>>>>>>>> the conversion.
> >>>>>>>>>>
> >>>>>>>>>> Text file formats have a lot of variance, even if you consider the
> >>>>>>>>>> subset of CSV like formats where it could have fixed width fields, or
> >>>>>>>>>> escaping and quoting around other fields, or headers that should be
> >>>>>>>>>> placed at the top.
> >>>>>>>>>>
> >>>>>>>>>> Having all these format conversions within TextIO.Write seems like a
> >>>>>>>>>> lot of logic to contain in that transform which should just focus on
> >>>>>>>>>> writing to files.
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> >>>>>>>>>>
> >>>>>>>>> [email protected]>
> >>>>>>>
> >>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> This is a thread moved over from the user mailing list.
> >>>>>>>>>>>
> >>>>>>>>>>> I think there needs to be a way to convert a PCollection<KV> to a
> >>>>>>>>>>> PCollection<String>.
> >>>>>>>>>>>
> >>>>>>>>>>> To do a minimal WordCount, you have to manually convert the KV to a
> >>>>>>>>>>> String:
> >>>>>>>>>>>         p
> >>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> >>>>>>>>>>>                 .apply(Count.perElement())
> >>>>>>>>>>>                 .apply(MapElements.via((KV<String, Long> count) ->
> >>>>>>>>>>>                             count.getKey() + ":" + count.getValue()
> >>>>>>>>>>>                         ).withOutputType(TypeDescriptors.strings()))
> >>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> >>>>>>>>>>>
> >>>>>>>>>>> This code really should be something like:
> >>>>>>>>>>>         p
> >>>>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> >>>>>>>>>>>                 .apply(Count.perElement())
> >>>>>>>>>>>                 .apply(ToString.stringify())
> >>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> >>>>>>>>>>>
> >>>>>>>>>>> To summarize the discussion:
> >>>>>>>>>>>
> >>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any KV or list
> >>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and runs
> >>>>>>>>>>>    toString() on it:
> >>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> >>>>>>>>>>>        @Override
> >>>>>>>>>>>        public String apply(InputT input) {
> >>>>>>>>>>>            return input.toString();
> >>>>>>>>>>>        }
> >>>>>>>>>>>    }
> >>>>>>>>>>>    - JB: Add a general purpose type converter like in Apache Camel.
> >>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would write out the
> >>>>>>>>>>>    toString of any Object.
> >>>>>>>>>>>
> >>>>>>>>>>> My thoughts:
> >>>>>>>>>>>
> >>>>>>>>>>> Is converting to a PCollection<String> mostly needed when you're using
> >>>>>>>>>>> TextIO.Write? Will a general purpose transform only work in certain
> >>>>>>>>>>> cases and you'll normally have to write custom code to format the
> >>>>>>>>>>> strings the way you want them?
> >>>>>>>>>>>
> >>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to
> >>>>>>>>>>> TextIO.Write or a SimpleFunction that takes a delimiter as an argument.
> >>>>>>>>>>> Making a SimpleFunction that's able to specify a delimiter (and perhaps
> >>>>>>>>>>> a prefix and suffix) should cover the majority of formats and cases.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Jesse
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>> --
> >>>> Jean-Baptiste Onofré
> >>>> [email protected]
> >>>> http://blog.nanthrax.net
> >>>> Talend - http://www.talend.com
> >>>>
> >>>
> >>
> >
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
