Hi Jesse,
https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
it's very simple and stupid and of course not complete at all (I have
other commits that aren't merged yet, as they need some polishing), but as I
said, it's a basis for discussion.
Regards
JB
On 11/29/2016 09:23 PM, Jesse Anderson wrote:
@jb Sounds good. Just let us know once you've pushed.
On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:
Good point Eugene.
Right now, it's a DoFn collection to experiment a bit (a pure
extension). It's pretty stupid ;)
But you are right, depending on the direction of such an extension, it could
cover more use cases (even if that's not my first intention ;)).
Let me push the branch (pretty small) as an illustration, and in the
meantime I'm preparing a document (more focused on the use cases).
WDYT?
Regards
JB
On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
Hi JB,
Depending on the scope of what you want to ultimately accomplish with this
extension, I think it may make sense to write a proposal document and
discuss it.
If it's just a collection of utility DoFn's for various well-defined
source/target format pairs, then that's probably not needed, but if it's
anything more, then I think it is.
That will help avoid a lot of churn if people propose reasonable
significant changes.
On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:
By the way Jesse, I'm going to push my DATAFORMAT branch to my GitHub and I
will post on the dev mailing list when done.
Regards
JB
On 11/29/2016 07:01 PM, Jesse Anderson wrote:
I want to bring this thread back up since we've had time to think about
it more and make a plan.
I think a format-specific converter will be a more time-consuming task than
we originally thought. It'd have to be a writer that takes another writer
as a parameter.
I think a string converter can be done as a simple transform.
I think we should start with a simple string converter and plan for a
format-specific writer.
What are your thoughts?
Thanks,
Jesse
On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <je...@smokinghand.com
wrote:
I was thinking about what the outputs would look like last night. I
realized that more complex formats like JSON and XML may or may not
output the data in a valid format.
Doing a direct conversion on unbounded collections would work just fine.
They're self-contained. For writing out bounded collections, that's where
we'll hit the issues. This changes the uber conversion transform into a
transform that needs to be a writer.
If a transform executes a JSON conversion on a per-element basis, we'd
get this:
{
"key": "value"
}, {
"key": "value"
},
That isn't valid JSON.
The conversion transform would need to do several things when writing
out a file. It would need to add brackets for an array. Now we have:
[
{
"key": "value"
}, {
"key": "value"
},
]
We still don't have valid JSON. We have to remove the last comma, or have
the uber transform start putting in the commas, except for the last
element.
[
{
"key": "value"
}, {
"key": "value"
}
]
Only by doing this do we have valid JSON.
I'd argue we'd have a similar issue with XML. Some parsers require a root
element for everything. The uber transform would have to put the root
element tags at the beginning and end of the file.
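To make that concrete, here is a rough, hypothetical sketch of the per-file
assembly step such a writer would need for the JSON case. It assumes the
elements destined for one output file have already been collected into a
single Iterable (which is itself the hard part for large bounded
collections); the helper name is made up for illustration:

static String assembleJsonArray(Iterable<String> jsonObjects) {
  StringBuilder out = new StringBuilder("[\n");
  boolean first = true;
  for (String jsonObject : jsonObjects) {
    if (!first) {
      // Commas go between elements, never after the last one.
      out.append(",\n");
    }
    out.append(jsonObject);
    first = false;
  }
  // The closing bracket is what finally makes the file valid JSON.
  return out.append("\n]").toString();
}

The XML case would be the same shape, with the root element's opening and
closing tags taking the place of the brackets.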
On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <owenzhang1...@gmail.com>
wrote:
I would love to see a lean core and abundant Transforms at the same time.
Maybe we can look at what Confluent <https://github.com/confluentinc> does
for kafka-connect. They have official extensions support for JDBC, HDFS and
ElasticSearch under https://github.com/confluentinc. They put them along
with other community extensions on
https://www.confluent.io/product/connectors/ for visibility.
Although we are not a commercial company, could we have a GitHub user like
beam-community to host projects that we build around Beam but that aren't
suitable for https://github.com/apache/incubator-beam? In the future, we may
have beam-algebra like http://github.com/twitter/algebird for algebra
operations, and beam-ml / beam-dl for machine learning / deep learning.
Also, there will be Beam-related projects elsewhere maintained by other
communities. We could put all of them on the Beam website, or do something
like Spark packages as mentioned by Amit.
My $0.02
Manu
On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles <k...@google.com.invalid
wrote:
On this point from Amit and Ismaël, I agree: we could benefit from a place
for miscellaneous non-core helper transformations.
We have sdks/java/extensions but it is organized as separate artifacts. I
think that is fine, considering the nature of Join and SortValues. But for
simpler transforms, importing one artifact per tiny transform is too much
overhead. It also seems unlikely that we will have enough commonality among
the transforms to call the artifact anything other than [some synonym for]
"miscellaneous".
I wouldn't want to take this too far - even though the SDK has many
transforms* that are not required for the model [1], I like that the SDK
artifact has everything a user might need in their "getting started" phase
of use. This user-friendliness (the user doesn't care that ParDo is core and
Sum is not), plus the difficulty of judging which transforms go where, are
probably why we have them mostly all in one place.
Models to look at, off the top of my head, include Pig's PiggyBank and
Apex's Malhar. These have different levels of support implied. Others?
Kenn
[1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, Values, KvSwap,
Partition, Regex, Sample, Sum, Top, WithKeys, WithTimestamps
* at least they are separate classes and not methods on PCollection
:-)
On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <ieme...@gmail.com>
wrote:
Nice discussion, and thanks Jesse for bringing this subject back.
I agree 100% with Amit and the idea of having a home for those transforms
that are not core enough to be part of the SDK, but that we all end up
re-writing somehow.
This is a needed improvement for developer friendliness, but also as a
reference of good practices for Beam development, and for this reason I
agree with JB that, at this moment, it would be better for these transforms
to reside in the Beam repository, at least for visibility reasons.
One additional question is whether these transforms represent a different
DSL, or whether they could be grouped with the current extensions (e.g. Join
and SortValues) into something more general that we as a community could
maintain. But even if that's not the case, it would be really nice to start
working on something like this.
Ismaël Mejía
On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
j...@nanthrax.net
wrote:
Related to spark-packages, we also have Apache Bahir to host
connectors/transforms for Spark and Flink.
IMHO, right now, Beam should host this; I'm not sure it makes sense
directly in the core.
It reminds me of the "Integration" DSL we discussed in the technical vision
document.
Regards
JB
On 11/09/2016 11:17 AM, Amit Sela wrote:
I think Jesse has a very good point on one hand, while Luke's and Kenneth's
worries about committing users to specific implementations are valid on the
other.
The Spark community has a 3rd party repository for useful libraries that
for various reasons are not a part of the Apache Spark project:
https://spark-packages.org/.
Maybe a "common-transformations" package would serve both users' quick
ramp-up and ease-of-use while keeping Beam more "enabling"?
On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
<k...@google.com.invalid
wrote:
It seems useful for small scale debugging / demoing to have
Dump.toString(). I think it should be named to clearly indicate its limited
scope. Maybe other stuff could go in the Dump namespace, but
"Dump.toJson()" would be for humans to read - so it should be pretty
printed, not treated as a machine-to-machine wire format.
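A minimal sketch of what such a Dump namespace could look like, assuming
MapElements and made-up names (no such transform exists today; the method
is called elements() because a zero-argument static toString() would clash
with Object#toString in Java):

public class Dump {
  public static <T> MapElements<T, String> elements() {
    return MapElements.via(new SimpleFunction<T, String>() {
      @Override
      public String apply(T input) {
        // Relies on each element's own toString(); intended for humans,
        // not as a machine-to-machine wire format.
        return String.valueOf(input);
      }
    });
  }
}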
The broader question of representing data in JSON or XML, etc, is already
the subject of many mature libraries which are already easy to use with
Beam.
The more esoteric practice of implicit or semi-implicit coercions seems
like it is also already addressed in many ways elsewhere.
Transform.via(TypeConverter) is basically the same as
MapElements.via(<lambda>) and also easy to use with Beam.
In both of the last cases, there are many reasonable approaches, and we
shouldn't commit our users to one of them.
On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
<lc...@google.com.invalid
wrote:
The suggestions you give seem good except for the XML cases.
Might want to have the XML be a document per line, similar to the JSON
examples you have been giving.
On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
je...@smokinghand.com>
wrote:
@lukasz Agreed there would have to be KV handling. I was more thinking that
whatever the addition, it shouldn't just handle KV. It should handle
Iterables, Lists, Sets, and KVs.
For JSON and XML, I wonder whether we'd be able to give someone something
general purpose enough, or whether you would just end up writing your own
code to handle it anyway.
Here are some ideas on what it could look like, with a method and the
resulting string output:
*Stringify.toJSON()*
With KV:
{"key": "value"}
With Iterables:
["one", "two", "three"]
*Stringify.toXML("rootelement")*
With KV:
<rootelement key="value" />
With Iterables:
<rootelement>
<item>one</item>
<item>two</item>
<item>three</item>
</rootelement>
*Stringify.toDelimited(",")*
With KV:
key,value
With Iterables:
one,two,three
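To sketch how the JSON case might be implemented (Stringify is just the
proposed name here; nothing like it exists yet, and this assumes Jackson on
the classpath), the KV variant could be little more than:

static class KvToJsonFn extends SimpleFunction<KV<String, String>, String> {
  // Static so the (non-serializable) ObjectMapper isn't serialized with the function.
  private static final com.fasterxml.jackson.databind.ObjectMapper MAPPER =
      new com.fasterxml.jackson.databind.ObjectMapper();

  @Override
  public String apply(KV<String, String> kv) {
    try {
      // Produces {"key":"value"}-style output, as in the example above.
      return MAPPER.writeValueAsString(
          java.util.Collections.singletonMap(kv.getKey(), kv.getValue()));
    } catch (com.fasterxml.jackson.core.JsonProcessingException e) {
      throw new RuntimeException(e);
    }
  }
}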
Do you think that would strike a good balance between reusable code and
writing your own for more difficult formatting?
Thanks,
Jesse
On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
<lc...@google.com.invalid
wrote:
Jesse, I believe if one format gets special treatment in TextIO, people
will then ask why JSON, XML, ... are not also supported.
Also, the example that you provide is using the fact that the input format
is an Iterable<Item>. You had posted a question about using KV with
TextIO.Write, which wouldn't align with the proposed input format and would
still require writing a type conversion function, this time from KV to
Iterable<Item> instead of KV to string.
On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
je...@smokinghand.com>
wrote:
Lukasz,
I don't think you'd need complicated logic for TextIO.Write. For CSV the
call would look like:
Stringify.to("", ",", "\n");
Where the arguments would be Stringify.to(prefix, delimiter, suffix).
The code would be something like:
StringBuilder buffer = new StringBuilder(prefix);
Iterator<Item> items = list.iterator();
while (items.hasNext()) {
  buffer.append(items.next().toString());
  if (items.hasNext()) {
    // Only add the delimiter between elements, not after the last one.
    buffer.append(delimiter);
  }
}
buffer.append(suffix);
c.output(buffer.toString());
That would allow you to do the basic CSV, TSV, and other formats without
complicated logic. The same sort of thing could be done for TextIO.Write.
Thanks,
Jesse
On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
<lc...@google.com.invalid
wrote:
The conversion from object to string will have uses outside of just
TextIO.Write, so it seems logical that we would want to have a ParDo do the
conversion.
Text file formats have a lot of variance, even if you consider the subset
of CSV-like formats where it could have fixed width fields, or escaping and
quoting around other fields, or headers that should be placed at the top.
Having all these format conversions within TextIO.Write seems like a lot of
logic to contain in that transform, which should just focus on writing to
files.
On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
je...@smokinghand.com>
wrote:
This is a thread moved over from the user mailing list.
I think there needs to be a way to convert a PCollection<KV> to a
PCollection<String>.
To do a minimal WordCount, you have to manually convert the KV to a
String:
p
    .apply(TextIO.Read.from("playing_cards.tsv"))
    .apply(Regex.split("\\W+"))
    .apply(Count.perElement())
    .apply(MapElements.via((KV<String, Long> count) ->
            count.getKey() + ":" + count.getValue())
        .withOutputType(TypeDescriptors.strings()))
    .apply(TextIO.Write.to("output/stringcounts"));
This code really should be something like:
p
    .apply(TextIO.Read.from("playing_cards.tsv"))
    .apply(Regex.split("\\W+"))
    .apply(Count.perElement())
    .apply(ToString.stringify())
    .apply(TextIO.Write.to("output/stringcounts"));
To summarize the discussion:
- JA: Add a method to StringDelegateCoder to output any KV or list
- JA and DH: Add a SimpleFunction that takes a type and runs toString()
on it (a rough usage sketch follows this list):

class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
  @Override
  public String apply(InputT input) {
    return input.toString();
  }
}
- JB: Add a general purpose type converter like in Apache Camel.
- JA: Add Object support to TextIO.Write that would write out the
toString of any Object.
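As a rough usage sketch of the SimpleFunction option (assuming the
hypothetical ToStringFn above), the WordCount pipeline would become:

p
    .apply(TextIO.Read.from("playing_cards.tsv"))
    .apply(Regex.split("\\W+"))
    .apply(Count.perElement())
    // ToStringFn is the hypothetical helper sketched in the list above.
    .apply(MapElements.via(new ToStringFn<KV<String, Long>>()))
    .apply(TextIO.Write.to("output/stringcounts"));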
My thoughts:
Is converting to a PCollection<String> mostly needed when you're using
TextIO.Write? Will a general purpose transform only work in certain cases,
so that you'll normally have to write custom code to format the strings the
way you want them?
IMHO, it's yes to both. I'd prefer to add Object support to TextIO.Write or
a SimpleFunction that takes a delimiter as an argument. Making a
SimpleFunction that's able to specify a delimiter (and perhaps a prefix and
suffix) should cover the majority of formats and cases.
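For illustration, a minimal sketch of that delimiter-aware SimpleFunction
might look like the following (the name is made up; nothing like this
exists in Beam yet):

static class DelimitedToStringFn<T> extends SimpleFunction<Iterable<T>, String> {
  private final String prefix;
  private final String delimiter;
  private final String suffix;

  DelimitedToStringFn(String prefix, String delimiter, String suffix) {
    this.prefix = prefix;
    this.delimiter = delimiter;
    this.suffix = suffix;
  }

  @Override
  public String apply(Iterable<T> elements) {
    StringBuilder out = new StringBuilder(prefix);
    boolean first = true;
    for (T element : elements) {
      if (!first) {
        // Delimiter only between elements, never after the last one.
        out.append(delimiter);
      }
      out.append(element);
      first = false;
    }
    return out.append(suffix).toString();
  }
}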
Thanks,
Jesse
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com