I was thinking last night about what the outputs would look like. I realized that the more complex formats like JSON and XML won't necessarily come out valid.
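To make the problem concrete, here is a sketch in plain Java of the "finishing" work a writer would have to do. `JsonArrayWriter` and `joinAsJsonArray` are hypothetical names for illustration, not anything in Beam: the helper wraps per-element JSON fragments in array brackets and puts a comma between elements (but not after the last one), which is exactly the bookkeeping a per-element conversion cannot do on its own.

```java
import java.util.Arrays;
import java.util.StringJoiner;

public class JsonArrayWriter {

    public static String joinAsJsonArray(Iterable<String> jsonElements) {
        // StringJoiner handles the "no comma after the last element" problem:
        // the delimiter only goes between added elements.
        StringJoiner joiner = new StringJoiner(",\n", "[\n", "\n]");
        for (String element : jsonElements) {
            joiner.add(element);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        String out = joinAsJsonArray(Arrays.asList(
                "{ \"key\": \"value\" }",
                "{ \"key\": \"value\" }"));
        System.out.println(out);
    }
}
```

The same shape works for the XML case: swap the bracket prefix/suffix for the root element's opening and closing tags.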
Doing a direct conversion on unbounded collections would work just fine; they're self-contained. Writing out bounded collections is where we'll hit issues. This changes the uber conversion transform into something that needs to be a writer.

If a transform executes a JSON conversion on a per-element basis, we'd get this:

{ "key": "value" },
{ "key": "value" },

That isn't valid JSON. The conversion transform would need to do several things when writing out a file. It would need to add the brackets for an array. Now we have:

[
{ "key": "value" },
{ "key": "value" },
]

We still don't have valid JSON. We have to remove the last comma, or have the uber transform insert the commas itself and skip the comma after the last element:

[
{ "key": "value" },
{ "key": "value" }
]

Only by doing this do we have valid JSON.

I'd argue we'd have a similar issue with XML. Some parsers require a root element for everything, so the uber transform would have to put the root element's opening and closing tags at the beginning and end of the file.

On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <owenzhang1...@gmail.com> wrote:

> I would love to see a lean core and abundant Transforms at the same time.
>
> Maybe we can look at what Confluent <https://github.com/confluentinc> does
> for kafka-connect. They have official extension support for JDBC, HDFS,
> and ElasticSearch under https://github.com/confluentinc, and they put
> them, along with other community extensions, on
> https://www.confluent.io/product/connectors/ for visibility.
>
> Although we are not a commercial company, can we have a GitHub user like
> beam-community to host projects we build around Beam that are not suitable
> for https://github.com/apache/incubator-beam? In the future, we may have
> beam-algebra like http://github.com/twitter/algebird for algebra
> operations, and beam-ml / beam-dl for machine learning / deep learning.
> Also, there will be Beam-related projects elsewhere maintained by other
> communities.
> We can put all of them on the Beam website, or do something like Spark
> packages, as mentioned by Amit.
>
> My $0.02
> Manu
>
> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles <k...@google.com.invalid> wrote:
>
> > On this point from Amit and Ismaël, I agree: we could benefit from a
> > place for miscellaneous non-core helper transformations.
> >
> > We have sdks/java/extensions, but it is organized as separate artifacts.
> > I think that is fine, considering the nature of Join and SortValues. But
> > for simpler transforms, importing one artifact per tiny transform is too
> > much overhead. It also seems unlikely that we will have enough
> > commonality among the transforms to call the artifact anything other
> > than [some synonym for] "miscellaneous".
> >
> > I wouldn't want to take this too far - even though the SDK has many
> > transforms* that are not required for the model [1], I like that the SDK
> > artifact has everything a user might need in their "getting started"
> > phase of use. This user-friendliness (the user doesn't care that ParDo
> > is core and Sum is not), plus the difficulty of judging which transforms
> > go where, is probably why we have them mostly all in one place.
> >
> > Models to look at, off the top of my head, include Pig's PiggyBank and
> > Apex's Malhar. These have different levels of support implied. Others?
> >
> > Kenn
> >
> > [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
> > FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, Values,
> > KvSwap, Partition, Regex, Sample, Sum, Top, WithKeys, WithTimestamps
> >
> > * at least they are separate classes and not methods on PCollection :-)
> >
> > On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <ieme...@gmail.com> wrote:
> >
> > > Nice discussion, and thanks Jesse for bringing this subject back.
> > > I agree 100% with Amit and the idea of having a home for those
> > > transforms that are not core enough to be part of the SDK, but that we
> > > all end up re-writing somehow.
> > >
> > > This is a needed improvement to be more developer-friendly, but also as
> > > a reference of good practices for Beam development, and for this reason
> > > I agree with JB that at this moment it would be better for these
> > > transforms to reside in the Beam repository, at least for visibility
> > > reasons.
> > >
> > > One additional question is whether these transforms represent a
> > > different DSL, or whether they could be grouped with the current
> > > extensions (e.g. Join and SortValues) into something more general that
> > > we as a community could maintain. But even if that is not the case, it
> > > would be really nice to start working on something like this.
> > >
> > > Ismaël Mejía
> > >
> > > On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> > >
> > > > Related to spark-package, we also have Apache Bahir to host
> > > > connectors/transforms for Spark and Flink.
> > > >
> > > > IMHO, right now, Beam should host this; I'm not sure it makes sense
> > > > directly in the core.
> > > >
> > > > It reminds me of the "Integration" DSL we discussed in the technical
> > > > vision document.
> > > >
> > > > Regards
> > > > JB
> > > >
> > > > On 11/09/2016 11:17 AM, Amit Sela wrote:
> > > >
> > > >> I think Jesse has a very good point on one hand, while Luke's and
> > > >> Kenneth's worries about committing users to specific implementations
> > > >> are in place on the other.
> > > >>
> > > >> The Spark community has a third-party repository for useful
> > > >> libraries that for various reasons are not a part of the Apache
> > > >> Spark project: https://spark-packages.org/.
> > > >> Maybe a "common-transformations" package would serve both users'
> > > >> quick ramp-up and ease of use while keeping Beam more "enabling"?
> > > >>
> > > >> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles <k...@google.com.invalid> wrote:
> > > >>
> > > >>> It seems useful for small-scale debugging / demoing to have
> > > >>> Dump.toString(). I think it should be named to clearly indicate its
> > > >>> limited scope. Maybe other stuff could go in the Dump namespace,
> > > >>> but "Dump.toJson()" would be for humans to read - so it should be
> > > >>> pretty printed, not treated as a machine-to-machine wire format.
> > > >>>
> > > >>> The broader question of representing data in JSON or XML, etc., is
> > > >>> already the subject of many mature libraries which are already easy
> > > >>> to use with Beam.
> > > >>>
> > > >>> The more esoteric practice of implicit or semi-implicit coercions
> > > >>> seems like it is also already addressed in many ways elsewhere.
> > > >>> Transform.via(TypeConverter) is basically the same as
> > > >>> MapElements.via(<lambda>) and also easy to use with Beam.
> > > >>>
> > > >>> In both of the last cases, there are many reasonable approaches,
> > > >>> and we shouldn't commit our users to one of them.
> > > >>>
> > > >>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik <lc...@google.com.invalid> wrote:
> > > >>>
> > > >>>> The suggestions you give seem good, except for the XML cases.
> > > >>>>
> > > >>>> Might want to have the XML be a document per line, similar to the
> > > >>>> JSON examples you have been giving.
> > > >>>>
> > > >>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <je...@smokinghand.com> wrote:
> > > >>>>
> > > >>>>> @lukasz Agreed there would have to be KV handling. I was more
> > > >>>>> thinking that whatever the addition, it shouldn't just handle KV.
> > > >>>>> It should handle Iterables, Lists, Sets, and KVs.
> > > >>>>>
> > > >>>>> For JSON and XML, I wonder whether we'd be able to make something
> > > >>>>> general-purpose enough, or whether you'd just end up writing your
> > > >>>>> own code to handle it anyway.
> > > >>>>>
> > > >>>>> Here are some ideas on what it could look like, with a method and
> > > >>>>> the resulting string output:
> > > >>>>>
> > > >>>>> *Stringify.toJSON()*
> > > >>>>>
> > > >>>>> With KV:
> > > >>>>> {"key": "value"}
> > > >>>>>
> > > >>>>> With Iterables:
> > > >>>>> ["one", "two", "three"]
> > > >>>>>
> > > >>>>> *Stringify.toXML("rootelement")*
> > > >>>>>
> > > >>>>> With KV:
> > > >>>>> <rootelement key="value" />
> > > >>>>>
> > > >>>>> With Iterables:
> > > >>>>> <rootelement>
> > > >>>>>   <item>one</item>
> > > >>>>>   <item>two</item>
> > > >>>>>   <item>three</item>
> > > >>>>> </rootelement>
> > > >>>>>
> > > >>>>> *Stringify.toDelimited(",")*
> > > >>>>>
> > > >>>>> With KV:
> > > >>>>> key,value
> > > >>>>>
> > > >>>>> With Iterables:
> > > >>>>> one,two,three
> > > >>>>>
> > > >>>>> Do you think that would strike a good balance between reusable
> > > >>>>> code and writing your own for more difficult formatting?
> > > >>>>>
> > > >>>>> Thanks,
> > > >>>>>
> > > >>>>> Jesse
> > > >>>>>
> > > >>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik <lc...@google.com.invalid> wrote:
> > > >>>>>
> > > >>>>>> Jesse, I believe if one format gets special treatment in TextIO,
> > > >>>>>> people will then ask why JSON, XML, ... are not also supported.
> > > >>>>>>
> > > >>>>>> Also, the example that you provide relies on the input format
> > > >>>>>> being an Iterable<Item>.
> > > >>>>>> You had posted a question about using KV with TextIO.Write,
> > > >>>>>> which wouldn't align with the proposed input format and would
> > > >>>>>> still require writing a type-conversion function, this time from
> > > >>>>>> KV to Iterable<Item> instead of KV to String.
> > > >>>>>>
> > > >>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <je...@smokinghand.com> wrote:
> > > >>>>>>
> > > >>>>>>> Lukasz,
> > > >>>>>>>
> > > >>>>>>> I don't think you'd need complicated logic for TextIO.Write.
> > > >>>>>>> For CSV the call would look like:
> > > >>>>>>>
> > > >>>>>>> Stringify.to("", ",", "\n");
> > > >>>>>>>
> > > >>>>>>> where the arguments would be Stringify.to(prefix, delimiter, suffix).
> > > >>>>>>>
> > > >>>>>>> The code would be something like:
> > > >>>>>>>
> > > >>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> > > >>>>>>>
> > > >>>>>>> for (Item item : list) {
> > > >>>>>>>   buffer.append(item.toString());
> > > >>>>>>>
> > > >>>>>>>   if (notLast) {
> > > >>>>>>>     buffer.append(delimiter);
> > > >>>>>>>   }
> > > >>>>>>> }
> > > >>>>>>>
> > > >>>>>>> buffer.append(suffix);
> > > >>>>>>>
> > > >>>>>>> c.output(buffer.toString());
> > > >>>>>>>
> > > >>>>>>> That would allow you to do basic CSV, TSV, and other formats
> > > >>>>>>> without complicated logic. The same sort of thing could be done
> > > >>>>>>> for TextIO.Write.
> > > >>>>>>>
> > > >>>>>>> Thanks,
> > > >>>>>>>
> > > >>>>>>> Jesse
> > > >>>>>>>
> > > >>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik <lc...@google.com.invalid> wrote:
> > > >>>>>>>
> > > >>>>>>>> The conversion from object to string will have uses outside of
> > > >>>>>>>> just TextIO.Write, so it seems logical that we would want a
> > > >>>>>>>> ParDo to do the conversion.
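For reference, the Stringify.to(prefix, delimiter, suffix) snippet quoted above can be made self-contained like this. The class name and parameter order follow the proposal in this thread, not an existing Beam API; the item list is passed in explicitly here (instead of coming from a ProcessContext) only so the sketch compiles and runs on its own.

```java
import java.util.Arrays;
import java.util.Iterator;

public class Stringify {

    public static String to(Iterable<?> items, String prefix, String delimiter, String suffix) {
        StringBuilder buffer = new StringBuilder(prefix);
        Iterator<?> it = items.iterator();
        while (it.hasNext()) {
            buffer.append(it.next());
            if (it.hasNext()) {
                // Append the delimiter between elements, but not after the last one.
                buffer.append(delimiter);
            }
        }
        return buffer.append(suffix).toString();
    }

    public static void main(String[] args) {
        // CSV-style row: no prefix, comma delimiter, newline suffix.
        System.out.print(to(Arrays.asList("one", "two", "three"), "", ",", "\n"));
    }
}
```

Using the iterator's hasNext() for the "not last" check avoids having to track indices or trim a trailing delimiter.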
> > > >>>>>>>> Text file formats have a lot of variance, even if you consider
> > > >>>>>>>> only the subset of CSV-like formats, which could have
> > > >>>>>>>> fixed-width fields, escaping and quoting around other fields,
> > > >>>>>>>> or headers that should be placed at the top.
> > > >>>>>>>>
> > > >>>>>>>> Having all these format conversions within TextIO.Write seems
> > > >>>>>>>> like a lot of logic to contain in a transform which should
> > > >>>>>>>> just focus on writing to files.
> > > >>>>>>>>
> > > >>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <je...@smokinghand.com> wrote:
> > > >>>>>>>>
> > > >>>>>>>>> This is a thread moved over from the user mailing list.
> > > >>>>>>>>>
> > > >>>>>>>>> I think there needs to be a way to convert a PCollection<KV>
> > > >>>>>>>>> to a PCollection<String>.
> > > >>>>>>>>> To do a minimal WordCount, you have to manually convert the
> > > >>>>>>>>> KV to a String:
> > > >>>>>>>>>
> > > >>>>>>>>> p
> > > >>>>>>>>>     .apply(TextIO.Read.from("playing_cards.tsv"))
> > > >>>>>>>>>     .apply(Regex.split("\\W+"))
> > > >>>>>>>>>     .apply(Count.perElement())
> > > >>>>>>>>>     *.apply(MapElements.via((KV<String, Long> count) ->*
> > > >>>>>>>>>     *    count.getKey() + ":" + count.getValue()*
> > > >>>>>>>>>     *).withOutputType(TypeDescriptors.strings()))*
> > > >>>>>>>>>     .apply(TextIO.Write.to("output/stringcounts"));
> > > >>>>>>>>>
> > > >>>>>>>>> This code really should be something like:
> > > >>>>>>>>>
> > > >>>>>>>>> p
> > > >>>>>>>>>     .apply(TextIO.Read.from("playing_cards.tsv"))
> > > >>>>>>>>>     .apply(Regex.split("\\W+"))
> > > >>>>>>>>>     .apply(Count.perElement())
> > > >>>>>>>>>     *.apply(ToString.stringify())*
> > > >>>>>>>>>     .apply(TextIO.Write.to("output/stringcounts"));
> > > >>>>>>>>>
> > > >>>>>>>>> To summarize the discussion:
> > > >>>>>>>>>
> > > >>>>>>>>> - JA: Add a method to StringDelegateCoder to output any KV or
> > > >>>>>>>>>   list.
> > > >>>>>>>>> - JA and DH: Add a SimpleFunction that takes a type and runs
> > > >>>>>>>>>   toString() on it:
> > > >>>>>>>>>
> > > >>>>>>>>>   class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > > >>>>>>>>>     public String apply(InputT input) {
> > > >>>>>>>>>       return input.toString();
> > > >>>>>>>>>     }
> > > >>>>>>>>>   }
> > > >>>>>>>>>
> > > >>>>>>>>> - JB: Add a general-purpose type converter like in Apache
> > > >>>>>>>>>   Camel.
> > > >>>>>>>>> - JA: Add Object support to TextIO.Write that would write out
> > > >>>>>>>>>   the toString() of any Object.
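The ToStringFn idea in the summary above boils down to a one-line function applied per element. Here is a plain-Java sketch without the Beam dependency; the KV class below is a minimal stand-in for Beam's KV (with the "key:value" rendering from the WordCount example), included only so the example is self-contained.

```java
import java.util.function.Function;

public class ToStringDemo {

    // Minimal stand-in for Beam's KV; NOT the real Beam type.
    static final class KV<K, V> {
        final K key;
        final V value;

        KV(K key, V value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + ":" + value;
        }
    }

    // The generic "stringify" function: works for any element type,
    // because it just delegates to the element's own toString().
    static <T> Function<T, String> stringify() {
        return Object::toString;
    }

    public static void main(String[] args) {
        KV<String, Long> count = new KV<>("spades", 13L);
        System.out.println(stringify().apply(count));
    }
}
```

The trade-off the thread is debating is visible here: the generic function is trivial, but the output format is whatever toString() happens to produce, which is why formats beyond "toString per element" quickly push you toward custom code.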
> > > >>>>>>>>> My thoughts:
> > > >>>>>>>>>
> > > >>>>>>>>> Is converting to a PCollection<String> mostly needed when
> > > >>>>>>>>> you're using TextIO.Write? Will a general-purpose transform
> > > >>>>>>>>> only work in certain cases, so that you'll normally have to
> > > >>>>>>>>> write custom code to format the strings the way you want them?
> > > >>>>>>>>>
> > > >>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to
> > > >>>>>>>>> TextIO.Write or a SimpleFunction that takes a delimiter as an
> > > >>>>>>>>> argument. Making a SimpleFunction that's able to specify a
> > > >>>>>>>>> delimiter (and perhaps a prefix and suffix) should cover the
> > > >>>>>>>>> majority of formats and cases.
> > > >>>>>>>>>
> > > >>>>>>>>> Thanks,
> > > >>>>>>>>>
> > > >>>>>>>>> Jesse
>
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com