I was thinking last night about what the outputs would look like. I realized that the more complex formats like JSON and XML won't necessarily come out valid.
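To make the problem concrete, here is a sketch in plain Java of the "finishing" work a writer would have to do. `JsonArrayWriter` and `joinAsJsonArray` are hypothetical names for illustration, not anything in Beam: the helper wraps per-element JSON fragments in array brackets and puts a comma between elements (but not after the last one), which is exactly the bookkeeping a per-element conversion cannot do on its own.

```java
import java.util.Arrays;
import java.util.StringJoiner;

public class JsonArrayWriter {

    public static String joinAsJsonArray(Iterable<String> jsonElements) {
        // StringJoiner handles the "no comma after the last element" problem:
        // the delimiter only goes between added elements.
        StringJoiner joiner = new StringJoiner(",\n", "[\n", "\n]");
        for (String element : jsonElements) {
            joiner.add(element);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String out = joinAsJsonArray(Arrays.asList(
                "{ \"key\": \"value\" }",
                "{ \"key\": \"value\" }"));
        System.out.println(out);
    }
}
```

The same shape works for the XML case: swap the bracket prefix/suffix for the root element's opening and closing tags.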
Doing a direct conversion on unbounded collections would work just fine; they're self-contained. Writing out bounded collections is where we'll hit issues. This changes the uber conversion transform into something that needs to be a writer.

If a transform executes a JSON conversion on a per-element basis, we'd get this:

{ "key": "value" },
{ "key": "value" },

That isn't valid JSON. The conversion transform would need to do several things when writing out a file. It would need to add the brackets for an array. Now we have:

[
{ "key": "value" },
{ "key": "value" },
]

We still don't have valid JSON. We have to remove the last comma, or have the uber transform insert the commas itself and skip the comma after the last element:

[
{ "key": "value" },
{ "key": "value" }
]

Only by doing this do we have valid JSON.

I'd argue we'd have a similar issue with XML. Some parsers require a root element for everything, so the uber transform would have to put the root element's opening and closing tags at the beginning and end of the file.

On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <owenzhang1...@gmail.com> wrote:

> I would love to see a lean core and abundant Transforms at the same time.
>
> Maybe we can look at what Confluent <https://github.com/confluentinc> does
> for kafka-connect. They have official extension support for JDBC, HDFS,
> and ElasticSearch under https://github.com/confluentinc, and they put
> them, along with other community extensions, on
> https://www.confluent.io/product/connectors/ for visibility.
>
> Although we are not a commercial company, can we have a GitHub user like
> beam-community to host projects we build around Beam that are not suitable
> for https://github.com/apache/incubator-beam? In the future, we may have
> beam-algebra like http://github.com/twitter/algebird for algebra
> operations, and beam-ml / beam-dl for machine learning / deep learning.
> Also, there will be Beam-related projects elsewhere maintained by other
> communities.
> We can put all of them on the Beam website, or do something like Spark
> packages, as mentioned by Amit.
>
> My $0.02
> Manu
>
> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles <k...@google.com.invalid> wrote:
>
> > On this point from Amit and Ismaël, I agree: we could benefit from a
> > place for miscellaneous non-core helper transformations.
> >
> > We have sdks/java/extensions, but it is organized as separate artifacts.
> > I think that is fine, considering the nature of Join and SortValues. But
> > for simpler transforms, importing one artifact per tiny transform is too
> > much overhead. It also seems unlikely that we will have enough
> > commonality among the transforms to call the artifact anything other
> > than [some synonym for] "miscellaneous".
> >
> > I wouldn't want to take this too far - even though the SDK has many
> > transforms* that are not required for the model [1], I like that the SDK
> > artifact has everything a user might need in their "getting started"
> > phase of use. This user-friendliness (the user doesn't care that ParDo
> > is core and Sum is not), plus the difficulty of judging which transforms
> > go where, is probably why we have them mostly all in one place.
> >
> > Models to look at, off the top of my head, include Pig's PiggyBank and
> > Apex's Malhar. These have different levels of support implied. Others?
> >
> > Kenn
> >
> > [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
> > FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, Values,
> > KvSwap, Partition, Regex, Sample, Sum, Top, WithKeys, WithTimestamps
> >
> > * at least they are separate classes and not methods on PCollection :-)
> >
> > On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> > > Nice discussion, and thanks Jesse for bringing this subject back.
> > > I agree 100% with Amit and the idea of having a home for those
> > > transforms that are not core enough to be part of the SDK, but that we
> > > all end up re-writing somehow.
> > >
> > > This is a needed improvement to be more developer-friendly, but also as
> > > a reference of good practices for Beam development, and for this reason
> > > I agree with JB that at this moment it would be better for these
> > > transforms to reside in the Beam repository, at least for visibility
> > > reasons.
> > >
> > > One additional question is whether these transforms represent a
> > > different DSL, or whether they could be grouped with the current
> > > extensions (e.g. Join and SortValues) into something more general that
> > > we as a community could maintain. But even if that is not the case, it
> > > would be really nice to start working on something like this.
> > >
> > > Ismaël Mejía
> > >
> > > On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >
> > > > Related to spark-package, we also have Apache Bahir to host
> > > > connectors/transforms for Spark and Flink.
> > > >
> > > > IMHO, right now, Beam should host this; I'm not sure it makes sense
> > > > directly in the core.
> > > >
> > > > It reminds me of the "Integration" DSL we discussed in the technical
> > > > vision document.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 11/09/2016 11:17 AM, Amit Sela wrote:
> > > >
> > > >> I think Jesse has a very good point on one hand, while Luke's and
> > > >> Kenneth's worries about committing users to specific implementations
> > > >> are in place on the other.
> > > >>
> > > >> The Spark community has a third-party repository for useful
> > > >> libraries that for various reasons are not a part of the Apache
> > > >> Spark project: https://spark-packages.org/.
> > > >> Maybe a "common-transformations" package would serve both users'
> > > >> quick ramp-up and ease of use while keeping Beam more "enabling"?
> > > >>
> > > >> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles <k...@google.com.invalid> wrote:
> > > >>
> > > >>> It seems useful for small-scale debugging / demoing to have
> > > >>> Dump.toString(). I think it should be named to clearly indicate its
> > > >>> limited scope. Maybe other stuff could go in the Dump namespace,
> > > >>> but "Dump.toJson()" would be for humans to read - so it should be
> > > >>> pretty printed, not treated as a machine-to-machine wire format.
> > > >>>
> > > >>> The broader question of representing data in JSON or XML, etc., is
> > > >>> already the subject of many mature libraries which are already easy
> > > >>> to use with Beam.
> > > >>>
> > > >>> The more esoteric practice of implicit or semi-implicit coercions
> > > >>> seems like it is also already addressed in many ways elsewhere.
> > > >>> Transform.via(TypeConverter) is basically the same as
> > > >>> MapElements.via(<lambda>) and also easy to use with Beam.
> > > >>>
> > > >>> In both of the last cases, there are many reasonable approaches,
> > > >>> and we shouldn't commit our users to one of them.
> > > >>>
> > > >>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik <lc...@google.com.invalid> wrote:
> > > >>>
> > > >>>> The suggestions you give seem good, except for the XML cases.
> > > >>>>
> > > >>>> Might want to have the XML be a document per line, similar to the
> > > >>>> JSON examples you have been giving.
> > > >>>>
> > > >>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <je...@smokinghand.com> wrote:
> > > >>>>
> > > >>>>> @lukasz Agreed there would have to be KV handling. I was more
> > > >>>>> thinking that whatever the addition, it shouldn't just handle KV.
> > > >>>>> It should handle Iterables, Lists, Sets, and KVs.
> > > >>>>>
> > > >>>>> For JSON and XML, I wonder whether we'd be able to make something
> > > >>>>> general-purpose enough, or whether you'd just end up writing your
> > > >>>>> own code to handle it anyway.
> > > >>>>>
> > > >>>>> Here are some ideas on what it could look like, with a method and
> > > >>>>> the resulting string output:
> > > >>>>>
> > > >>>>> *Stringify.toJSON()*
> > > >>>>>
> > > >>>>> With KV:
> > > >>>>> {"key": "value"}
> > > >>>>>
> > > >>>>> With Iterables:
> > > >>>>> ["one", "two", "three"]
> > > >>>>>
> > > >>>>> *Stringify.toXML("rootelement")*
> > > >>>>>
> > > >>>>> With KV:
> > > >>>>> <rootelement key="value" />
> > > >>>>>
> > > >>>>> With Iterables:
> > > >>>>> <rootelement>
> > > >>>>>   <item>one</item>
> > > >>>>>   <item>two</item>
> > > >>>>>   <item>three</item>
> > > >>>>> </rootelement>
> > > >>>>>
> > > >>>>> *Stringify.toDelimited(",")*
> > > >>>>>
> > > >>>>> With KV:
> > > >>>>> key,value
> > > >>>>>
> > > >>>>> With Iterables:
> > > >>>>> one,two,three
> > > >>>>>
> > > >>>>> Do you think that would strike a good balance between reusable
> > > >>>>> code and writing your own for more difficult formatting?
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>>
> > > >>>>> Jesse
> > > >>>>>
> > > >>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik <lc...@google.com.invalid> wrote:
> > > >>>>>
> > > >>>>>> Jesse, I believe if one format gets special treatment in TextIO,
> > > >>>>>> people will then ask why JSON, XML, ... are not also supported.
> > > >>>>>>
> > > >>>>>> Also, the example that you provide relies on the input format
> > > >>>>>> being an Iterable<Item>.
> > > >>>>>> You had posted a question about using KV with TextIO.Write,
> > > >>>>>> which wouldn't align with the proposed input format and would
> > > >>>>>> still require writing a type-conversion function, this time from
> > > >>>>>> KV to Iterable<Item> instead of KV to String.
> > > >>>>>>
> > > >>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <je...@smokinghand.com> wrote:
> > > >>>>>>
> > > >>>>>>> Lukasz,
> > > >>>>>>>
> > > >>>>>>> I don't think you'd need complicated logic for TextIO.Write.
> > > >>>>>>> For CSV the call would look like:
> > > >>>>>>>
> > > >>>>>>> Stringify.to("", ",", "\n");
> > > >>>>>>>
> > > >>>>>>> where the arguments would be Stringify.to(prefix, delimiter, suffix).
> > > >>>>>>>
> > > >>>>>>> The code would be something like:
> > > >>>>>>>
> > > >>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> > > >>>>>>>
> > > >>>>>>> for (Item item : list) {
> > > >>>>>>>   buffer.append(item.toString());
> > > >>>>>>>
> > > >>>>>>>   if (notLast) {
> > > >>>>>>>     buffer.append(delimiter);
> > > >>>>>>>   }
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>> buffer.append(suffix);
> > > >>>>>>>
> > > >>>>>>> c.output(buffer.toString());
> > > >>>>>>>
> > > >>>>>>> That would allow you to do basic CSV, TSV, and other formats
> > > >>>>>>> without complicated logic. The same sort of thing could be done
> > > >>>>>>> for TextIO.Write.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>>
> > > >>>>>>> Jesse
> > > >>>>>>>
> > > >>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik <lc...@google.com.invalid> wrote:
> > > >>>>>>>
> > > >>>>>>>> The conversion from object to string will have uses outside of
> > > >>>>>>>> just TextIO.Write, so it seems logical that we would want a
> > > >>>>>>>> ParDo to do the conversion.
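For reference, the Stringify.to(prefix, delimiter, suffix) snippet quoted above can be made self-contained like this. The class name and parameter order follow the proposal in this thread, not an existing Beam API; the item list is passed in explicitly here (instead of coming from a ProcessContext) only so the sketch compiles and runs on its own.

```java
import java.util.Arrays;
import java.util.Iterator;

public class Stringify {

    public static String to(Iterable<?> items, String prefix, String delimiter, String suffix) {
        StringBuilder buffer = new StringBuilder(prefix);
        Iterator<?> it = items.iterator();
        while (it.hasNext()) {
            buffer.append(it.next());
            if (it.hasNext()) {
                // Append the delimiter between elements, but not after the last one.
                buffer.append(delimiter);
            }
        }
        return buffer.append(suffix).toString();
    }

    public static void main(String[] args) {
        // CSV-style row: no prefix, comma delimiter, newline suffix.
        System.out.print(to(Arrays.asList("one", "two", "three"), "", ",", "\n"));
    }
}
```

Using the iterator's hasNext() for the "not last" check avoids having to track indices or trim a trailing delimiter.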
> > > >>>>>>>> Text file formats have a lot of variance, even if you consider
> > > >>>>>>>> only the subset of CSV-like formats, which could have
> > > >>>>>>>> fixed-width fields, escaping and quoting around other fields,
> > > >>>>>>>> or headers that should be placed at the top.
> > > >>>>>>>>
> > > >>>>>>>> Having all these format conversions within TextIO.Write seems
> > > >>>>>>>> like a lot of logic to contain in a transform which should
> > > >>>>>>>> just focus on writing to files.
> > > >>>>>>>>
> > > >>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <je...@smokinghand.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> This is a thread moved over from the user mailing list.
> > > >>>>>>>>>
> > > >>>>>>>>> I think there needs to be a way to convert a PCollection<KV>
> > > >>>>>>>>> to a PCollection<String>.
> > > >>>>>>>>> To do a minimal WordCount, you have to manually convert the
> > > >>>>>>>>> KV to a String:
> > > >>>>>>>>>
> > > >>>>>>>>> p
> > > >>>>>>>>>     .apply(TextIO.Read.from("playing_cards.tsv"))
> > > >>>>>>>>>     .apply(Regex.split("\\W+"))
> > > >>>>>>>>>     .apply(Count.perElement())
> > > >>>>>>>>>     *.apply(MapElements.via((KV<String, Long> count) ->*
> > > >>>>>>>>>     *    count.getKey() + ":" + count.getValue()*
> > > >>>>>>>>>     *).withOutputType(TypeDescriptors.strings()))*
> > > >>>>>>>>>     .apply(TextIO.Write.to("output/stringcounts"));
> > > >>>>>>>>>
> > > >>>>>>>>> This code really should be something like:
> > > >>>>>>>>>
> > > >>>>>>>>> p
> > > >>>>>>>>>     .apply(TextIO.Read.from("playing_cards.tsv"))
> > > >>>>>>>>>     .apply(Regex.split("\\W+"))
> > > >>>>>>>>>     .apply(Count.perElement())
> > > >>>>>>>>>     *.apply(ToString.stringify())*
> > > >>>>>>>>>     .apply(TextIO.Write.to("output/stringcounts"));
> > > >>>>>>>>>
> > > >>>>>>>>> To summarize the discussion:
> > > >>>>>>>>>
> > > >>>>>>>>> - JA: Add a method to StringDelegateCoder to output any KV or
> > > >>>>>>>>>   list.
> > > >>>>>>>>> - JA and DH: Add a SimpleFunction that takes a type and runs
> > > >>>>>>>>>   toString() on it:
> > > >>>>>>>>>
> > > >>>>>>>>>   class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > > >>>>>>>>>     public String apply(InputT input) {
> > > >>>>>>>>>       return input.toString();
> > > >>>>>>>>>     }
> > > >>>>>>>>>   }
> > > >>>>>>>>>
> > > >>>>>>>>> - JB: Add a general-purpose type converter like in Apache
> > > >>>>>>>>>   Camel.
> > > >>>>>>>>> - JA: Add Object support to TextIO.Write that would write out
> > > >>>>>>>>>   the toString() of any Object.
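The ToStringFn idea in the summary above boils down to a one-line function applied per element. Here is a plain-Java sketch without the Beam dependency; the KV class below is a minimal stand-in for Beam's KV (with the "key:value" rendering from the WordCount example), included only so the example is self-contained.

```java
import java.util.function.Function;

public class ToStringDemo {

    // Minimal stand-in for Beam's KV; NOT the real Beam type.
    static final class KV<K, V> {
        final K key;
        final V value;

        KV(K key, V value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + ":" + value;
        }
    }

    // The generic "stringify" function: works for any element type,
    // because it just delegates to the element's own toString().
    static <T> Function<T, String> stringify() {
        return Object::toString;
    }

    public static void main(String[] args) {
        KV<String, Long> count = new KV<>("spades", 13L);
        System.out.println(stringify().apply(count));
    }
}
```

The trade-off the thread is debating is visible here: the generic function is trivial, but the output format is whatever toString() happens to produce, which is why formats beyond "toString per element" quickly push you toward custom code.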
> > > >>>>>>>>> My thoughts:
> > > >>>>>>>>>
> > > >>>>>>>>> Is converting to a PCollection<String> mostly needed when
> > > >>>>>>>>> you're using TextIO.Write? Will a general-purpose transform
> > > >>>>>>>>> only work in certain cases, so that you'll normally have to
> > > >>>>>>>>> write custom code to format the strings the way you want them?
> > > >>>>>>>>>
> > > >>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to
> > > >>>>>>>>> TextIO.Write or a SimpleFunction that takes a delimiter as an
> > > >>>>>>>>> argument. Making a SimpleFunction that's able to specify a
> > > >>>>>>>>> delimiter (and perhaps a prefix and suffix) should cover the
> > > >>>>>>>>> majority of formats and cases.
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>>
> > > >>>>>>>>> Jesse
>
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com