Hi JB,

Depending on the scope of what you ultimately want to accomplish with this extension, I think it may make sense to write a proposal document and discuss it. If it's just a collection of utility DoFns for various well-defined source/target format pairs, then that's probably not needed, but if it's anything more, then I think it is. That will help avoid a lot of churn if people later propose reasonable but significant changes.
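
To make the "collection of utility DoFns" end of that range concrete, here is a minimal sketch of the prefix/delimiter/suffix helper Jesse describes further down the thread. It is plain Java with no Beam dependency. The class name Stringify and the four-argument to(...) signature are my own assumptions for illustration, not an existing Beam API; inside a pipeline the same logic would sit in a DoFn or SimpleFunction.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper for illustration only; not part of any Beam API. */
final class Stringify {
  private Stringify() {}

  /** Returns prefix + items joined by delimiter + suffix, using each item's toString(). */
  static String to(String prefix, String delimiter, String suffix, Iterable<?> items) {
    StringBuilder out = new StringBuilder(prefix);
    Iterator<?> it = items.iterator();
    while (it.hasNext()) {
      out.append(it.next());      // append(Object) uses String.valueOf, so null becomes "null"
      if (it.hasNext()) {
        out.append(delimiter);    // delimiter goes between items, never after the last one
      }
    }
    return out.append(suffix).toString();
  }

  public static void main(String[] args) {
    List<String> row = Arrays.asList("one", "two", "three");
    // CSV-style line: empty prefix, comma delimiter, newline suffix.
    System.out.print(Stringify.to("", ",", "\n", row));
  }
}
```

Checking the iterator for a next item replaces the undefined notLast flag in the quoted pseudocode. Note that this does no quoting or escaping, so it only covers the simplest CSV/TSV cases.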

On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <[email protected]> wrote:

> By the way Jesse, I'm going to push my DATAFORMAT branch to my GitHub and
> I will post on the dev mailing list when done.
>
> Regards
> JB
>
> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > I want to bring this thread back up since we've had time to think about
> > it more and make a plan.
> >
> > I think a format-specific converter will be a more time-consuming task
> > than we originally thought. It'd have to be a writer that takes another
> > writer as a parameter.
> >
> > I think a string converter can be done as a simple transform.
> >
> > I think we should start with a simple string converter and plan for a
> > format-specific writer.
> >
> > What are your thoughts?
> >
> > Thanks,
> >
> > Jesse
> >
> > On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <[email protected]>
> > wrote:
> >
> > I was thinking about what the outputs would look like last night. I
> > realized that more complex formats like JSON and XML may or may not
> > output the data in a valid format.
> >
> > Doing a direct conversion on unbounded collections would work just fine.
> > They're self-contained. For writing out bounded collections, that's where
> > we'll hit the issues. This changes the uber conversion transform into a
> > transform that needs to be a writer.
> >
> > If a transform executes a JSON conversion on a per element basis, we'd
> > get this:
> > {
> >   "key": "value"
> > }, {
> >   "key": "value"
> > },
> >
> > That isn't valid JSON.
> >
> > The conversion transform would need to do several things when writing
> > out a file. It would need to add brackets for an array. Now we have:
> > [
> >   {
> >     "key": "value"
> >   }, {
> >     "key": "value"
> >   },
> > ]
> >
> > We still don't have valid JSON. We have to remove the last comma or have
> > the uber transform start putting in the commas, except for the last
> > element.
> >
> > [
> >   {
> >     "key": "value"
> >   }, {
> >     "key": "value"
> >   }
> > ]
> >
> > Only by doing this do we have valid JSON.
> >
> > I'd argue we'd have a similar issue with XML. Some parsers require a root
> > element for everything. The uber transform would have to put the root
> > element tags at the beginning and end of the file.
> >
> > On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <[email protected]> wrote:
> >
> > I would love to see a lean core and abundant Transforms at the same time.
> >
> > Maybe we can look at what Confluent <https://github.com/confluentinc>
> > does for kafka-connect. They have official extensions support for JDBC,
> > HDFS and ElasticSearch under https://github.com/confluentinc. They put
> > them along with other community extensions on
> > https://www.confluent.io/product/connectors/ for visibility.
> >
> > Although not a commercial company, can we have a GitHub user like
> > beam-community to host projects we build around beam but not suitable for
> > https://github.com/apache/incubator-beam? In the future, we may have
> > beam-algebra like http://github.com/twitter/algebird for algebra
> > operations and beam-ml / beam-dl for machine learning / deep learning.
> > Also, there will be beam related projects elsewhere maintained by other
> > communities. We can put all of them on the beam-website or like spark
> > packages as mentioned by Amit.
> >
> > My $0.02
> > Manu
> >
> > On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles <[email protected]>
> > wrote:
> >
> >> On this point from Amit and Ismaël, I agree: we could benefit from a
> >> place for miscellaneous non-core helper transformations.
> >>
> >> We have sdks/java/extensions but it is organized as separate artifacts.
> >> I think that is fine, considering the nature of Join and SortValues. But
> >> for simpler transforms, importing one artifact per tiny transform is too
> >> much overhead.
> >> It also seems unlikely that we will have enough commonality among
> >> the transforms to call the artifact anything other than [some synonym
> >> for] "miscellaneous".
> >>
> >> I wouldn't want to take this too far - even though the SDK has many
> >> transforms* that are not required for the model [1], I like that the SDK
> >> artifact has everything a user might need in their "getting started"
> >> phase of use. This user-friendliness (the user doesn't care that ParDo
> >> is core and Sum is not) plus the difficulty of judging which transforms
> >> go where, are probably why we have them mostly all in one place.
> >>
> >> Models to look at, off the top of my head, include Pig's PiggyBank and
> >> Apex's Malhar. These have different levels of support implied. Others?
> >>
> >> Kenn
> >>
> >> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
> >> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, Values,
> >> KvSwap, Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> >> WithTimestamps
> >>
> >> * at least they are separate classes and not methods on PCollection :-)
> >>
> >> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <[email protected]> wrote:
> >>
> >>> Nice discussion, and thanks Jesse for bringing this subject back.
> >>>
> >>> I agree 100% with Amit and the idea of having a home for those
> >>> transforms that are not core enough to be part of the sdk, but that we
> >>> all end up re-writing somehow.
> >>>
> >>> This is a needed improvement to be more developer friendly, but also as
> >>> a reference of good practices of Beam development, and for this reason
> >>> I agree with JB that at this moment it would be better for these
> >>> transforms to reside in the Beam repository at least for visibility
> >>> reasons.
> >>>
> >>> One additional question is if these transforms represent a different
> >>> DSL or if those could be grouped with the current extensions (e.g.
> >>> Join and SortValues) into something more general that we as a community
> >>> could maintain, but well even if it is not the case, it would be really
> >>> nice to start working on something like this.
> >>>
> >>> Ismaël Mejía
> >>>
> >>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <[email protected]>
> >>> wrote:
> >>>
> >>>> Related to spark-package, we also have Apache Bahir to host
> >>>> connectors/transforms for Spark and Flink.
> >>>>
> >>>> IMHO, right now, Beam should host this, not sure if it makes sense
> >>>> directly in the core.
> >>>>
> >>>> It reminds me of the "Integration" DSL we discussed in the technical
> >>>> vision document.
> >>>>
> >>>> Regards
> >>>> JB
> >>>>
> >>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> >>>>
> >>>>> I think Jesse has a very good point on one hand, while Luke's and
> >>>>> Kenneth's worries about committing users to specific implementations
> >>>>> are in place.
> >>>>>
> >>>>> The Spark community has a 3rd party repository for useful libraries
> >>>>> that for various reasons are not a part of the Apache Spark project:
> >>>>> https://spark-packages.org/.
> >>>>>
> >>>>> Maybe a "common-transformations" package would serve both users' quick
> >>>>> ramp-up and ease-of-use while keeping Beam more "enabling"?
> >>>>>
> >>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> It seems useful for small scale debugging / demoing to have
> >>>>>> Dump.toString(). I think it should be named to clearly indicate its
> >>>>>> limited scope. Maybe other stuff could go in the Dump namespace, but
> >>>>>> "Dump.toJson()" would be for humans to read - so it should be pretty
> >>>>>> printed, not treated as a machine-to-machine wire format.
> >>>>>>
> >>>>>> The broader question of representing data in JSON or XML, etc, is
> >>>>>> already the subject of many mature libraries which are already easy
> >>>>>> to use with Beam.
> >>>>>>
> >>>>>> The more esoteric practice of implicit or semi-implicit coercions
> >>>>>> seems like it is also already addressed in many ways elsewhere.
> >>>>>> Transform.via(TypeConverter) is basically the same as
> >>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
> >>>>>>
> >>>>>> In both of the last cases, there are many reasonable approaches, and
> >>>>>> we shouldn't commit our users to one of them.
> >>>>>>
> >>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> The suggestions you give seem good except for the XML cases.
> >>>>>>>
> >>>>>>> Might want to have the XML be a document per line similar to the
> >>>>>>> JSON examples you have been giving.
> >>>>>>>
> >>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <[email protected]>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> @lukasz Agreed there would have to be KV handling. I was more
> >>>>>>>> thinking that whatever the addition, it shouldn't just handle KV.
> >>>>>>>> It should handle Iterables, Lists, Sets, and KVs.
> >>>>>>>>
> >>>>>>>> For JSON and XML, I wonder if we'd be able to give someone
> >>>>>>>> something general purpose enough that you would just end up
> >>>>>>>> writing your own code to handle it anyway.
> >>>>>>>>
> >>>>>>>> Here are some ideas on what it could look like with a method and
> >>>>>>>> the resulting string output:
> >>>>>>>>
> >>>>>>>> *Stringify.toJSON()*
> >>>>>>>>
> >>>>>>>> With KV:
> >>>>>>>> {"key": "value"}
> >>>>>>>>
> >>>>>>>> With Iterables:
> >>>>>>>> ["one", "two", "three"]
> >>>>>>>>
> >>>>>>>> *Stringify.toXML("rootelement")*
> >>>>>>>>
> >>>>>>>> With KV:
> >>>>>>>> <rootelement key=value />
> >>>>>>>>
> >>>>>>>> With Iterables:
> >>>>>>>> <rootelement>
> >>>>>>>>   <item>one</item>
> >>>>>>>>   <item>two</item>
> >>>>>>>>   <item>three</item>
> >>>>>>>> </rootelement>
> >>>>>>>>
> >>>>>>>> *Stringify.toDelimited(",")*
> >>>>>>>>
> >>>>>>>> With KV:
> >>>>>>>> key,value
> >>>>>>>>
> >>>>>>>> With Iterables:
> >>>>>>>> one,two,three
> >>>>>>>>
> >>>>>>>> Do you think that would strike a good balance between reusable
> >>>>>>>> code and writing your own for more difficult formatting?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Jesse
> >>>>>>>>
> >>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Jesse, I believe if one format gets special treatment in TextIO,
> >>>>>>>> people will then ask why JSON, XML, ... are not also supported.
> >>>>>>>>
> >>>>>>>> Also, the example that you provide is using the fact that the
> >>>>>>>> input format is an Iterable<Item>. You had posted a question about
> >>>>>>>> using KV with TextIO.Write which wouldn't align with the proposed
> >>>>>>>> input format and would still require writing a type conversion
> >>>>>>>> function, this time from KV to Iterable<Item> instead of KV to
> >>>>>>>> string.
> >>>>>>>>
> >>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Lukasz,
> >>>>>>>>>
> >>>>>>>>> I don't think you'd need complicated logic for TextIO.Write.
> >>>>>>>>> For CSV the call would look like:
> >>>>>>>>> Stringify.to("", ",", "\n");
> >>>>>>>>>
> >>>>>>>>> Where the arguments would be Stringify.to(prefix, delimiter,
> >>>>>>>>> suffix).
> >>>>>>>>>
> >>>>>>>>> The code would be something like:
> >>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> >>>>>>>>>
> >>>>>>>>> for (Item item : list) {
> >>>>>>>>>   buffer.append(item.toString());
> >>>>>>>>>
> >>>>>>>>>   if(notLast) {
> >>>>>>>>>     buffer.append(delimiter);
> >>>>>>>>>   }
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> buffer.append(suffix);
> >>>>>>>>>
> >>>>>>>>> c.output(buffer.toString());
> >>>>>>>>>
> >>>>>>>>> That would allow you to do the basic CSV, TSV, and other formats
> >>>>>>>>> without complicated logic. The same sort of thing could be done
> >>>>>>>>> for TextIO.Write.
> >>>>>>>>>
> >>>>>>>>> Thanks,
> >>>>>>>>>
> >>>>>>>>> Jesse
> >>>>>>>>>
> >>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik <[email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> The conversion from object to string will have uses outside of
> >>>>>>>>>> just TextIO.Write so it seems logical that we would want to have
> >>>>>>>>>> a ParDo do the conversion.
> >>>>>>>>>>
> >>>>>>>>>> Text file formats have a lot of variance, even if you consider
> >>>>>>>>>> the subset of CSV like formats where it could have fixed width
> >>>>>>>>>> fields, or escaping and quoting around other fields, or headers
> >>>>>>>>>> that should be placed at the top.
> >>>>>>>>>>
> >>>>>>>>>> Having all these format conversions within TextIO.Write seems
> >>>>>>>>>> like a lot of logic to contain in that transform which should
> >>>>>>>>>> just focus on writing to files.
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <[email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> This is a thread moved over from the user mailing list.
> >>>>>>>>>>>
> >>>>>>>>>>> I think there needs to be a way to convert a PCollection<KV> to
> >>>>>>>>>>> a PCollection<String>.
> >>>>>>>>>>>
> >>>>>>>>>>> To do a minimal WordCount, you have to manually convert the KV
> >>>>>>>>>>> to a String:
> >>>>>>>>>>> p
> >>>>>>>>>>>     .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>     .apply(Regex.split("\\W+"))
> >>>>>>>>>>>     .apply(Count.perElement())
> >>>>>>>>>>> *    .apply(MapElements.via((KV<String, Long> count) ->*
> >>>>>>>>>>> *            count.getKey() + ":" + count.getValue()*
> >>>>>>>>>>> *        ).withOutputType(TypeDescriptors.strings()))*
> >>>>>>>>>>>     .apply(TextIO.Write.to("output/stringcounts"));
> >>>>>>>>>>>
> >>>>>>>>>>> This code really should be something like:
> >>>>>>>>>>> p
> >>>>>>>>>>>     .apply(TextIO.Read.from("playing_cards.tsv"))
> >>>>>>>>>>>     .apply(Regex.split("\\W+"))
> >>>>>>>>>>>     .apply(Count.perElement())
> >>>>>>>>>>> *    .apply(ToString.stringify())*
> >>>>>>>>>>>     .apply(TextIO.Write.to("output/stringcounts"));
> >>>>>>>>>>>
> >>>>>>>>>>> To summarize the discussion:
> >>>>>>>>>>>
> >>>>>>>>>>> - JA: Add a method to StringDelegateCoder to output any KV or
> >>>>>>>>>>>   list
> >>>>>>>>>>> - JA and DH: Add a SimpleFunction that takes any type and runs
> >>>>>>>>>>>   toString() on it:
> >>>>>>>>>>>   class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> >>>>>>>>>>>     @Override
> >>>>>>>>>>>     public String apply(InputT input) {
> >>>>>>>>>>>       return input.toString();
> >>>>>>>>>>>     }
> >>>>>>>>>>>   }
> >>>>>>>>>>> - JB: Add a general purpose type converter like in Apache
> >>>>>>>>>>>   Camel.
> >>>>>>>>>>> - JA: Add Object support to TextIO.Write that would write out
> >>>>>>>>>>>   the toString of any Object.
> >>>>>>>>>>>
> >>>>>>>>>>> My thoughts:
> >>>>>>>>>>>
> >>>>>>>>>>> Is converting to a PCollection<String> mostly needed when
> >>>>>>>>>>> you're using TextIO.Write? Will a general purpose transform
> >>>>>>>>>>> only work in certain cases and you'll normally have to write
> >>>>>>>>>>> custom code to format the strings the way you want them?
> >>>>>>>>>>>
> >>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to
> >>>>>>>>>>> TextIO.Write or a SimpleFunction that takes a delimiter as an
> >>>>>>>>>>> argument. Making a SimpleFunction that's able to specify a
> >>>>>>>>>>> delimiter (and perhaps a prefix and suffix) should cover the
> >>>>>>>>>>> majority of formats and cases.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Jesse
> >>>>
> >>>> --
> >>>> Jean-Baptiste Onofré
> >>>> [email protected]
> >>>> http://blog.nanthrax.net
> >>>> Talend - http://www.talend.com
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
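
On the JSON framing problem Jesse raises above: the brackets and commas are a property of the whole file, not of any single element, which is why this pushes toward a writer rather than a per-element transform. A rough sketch in plain Java follows, assuming every element has already been serialized to valid JSON by a library of your choice and that all elements of the bounded collection are visible in one place. JsonArrayFraming and frame(...) are hypothetical names for illustration, not Beam APIs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical sketch for illustration only; not part of any Beam API. */
final class JsonArrayFraming {
  private JsonArrayFraming() {}

  /** Wraps already-serialized JSON values in one array, with commas only between them. */
  static String frame(Iterable<String> elementJson) {
    // StringJoiner takes (delimiter, prefix, suffix) and never emits a trailing delimiter.
    StringJoiner array = new StringJoiner(",\n", "[\n", "\n]");
    array.setEmptyValue("[]");  // an empty collection still yields a valid JSON array
    for (String json : elementJson) {
      array.add(json);
    }
    return array.toString();
  }

  public static void main(String[] args) {
    List<String> elements =
        Arrays.asList("{\"key\": \"value\"}", "{\"key\": \"value\"}");
    System.out.println(frame(elements));
  }
}
```

The per-element conversion stays a simple transform; only the header, footer, and separators need to know about the file as a whole. Lukasz's one-document-per-line suggestion avoids the framing step entirely.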
