Related to spark-package, we also have Apache Bahir to host connectors/transforms for Spark and Flink.

IMHO, right now, Beam should host this, not sure if it makes sense directly in the core.

It reminds me the "Integration" DSL we discussed in the technical vision document.

Regards
JB

On 11/09/2016 11:17 AM, Amit Sela wrote:
I think Jesse has a very good point on one hand, while Luke's and Kenneth's
worries about committing users to specific implementations is in place.

The Spark community has a 3rd party repository for useful libraries that
for various reasons are not a part of the Apache Spark project:
https://spark-packages.org/.

Maybe a "common-transformations" package would serve both users quick
ramp-up and ease-of-use while keeping Beam more "enabling" ?

On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles <k...@google.com.invalid>
wrote:

It seems useful for small scale debugging / demoing to have
Dump.toString(). I think it should be named to clearly indicate its limited
scope. Maybe other stuff could go in the Dump namespace, but
"Dump.toJson()" would be for humans to read - so it should be pretty
printed, not treated as a machine-to-machine wire format.

The broader question of representing data in JSON or XML, etc, is already
the subject of many mature libraries which are already easy to use with
Beam.

The more esoteric practice of implicit or semi-implicit coercions seems
like it is also already addressed in many ways elsewhere.
Transform.via(TypeConverter) is basically the same as
MapElements.via(<lambda>) and also easy to use with Beam.

In both of the last cases, there are many reasonable approaches, and we
shouldn't commit our users to one of them.

On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik <lc...@google.com.invalid>
wrote:

The suggestions you give seem good except for the the XML cases.

Might want to have the XML be a document per line similar to the JSON
examples you have been giving.

On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <je...@smokinghand.com>
wrote:

@lukasz Agreed there would have to be KV handling. I was more think
that
whatever the addition, it shouldn't just handle KV. It should handle
Iterables, Lists, Sets, and KVs.

For JSON and XML, I wonder if we'd be able to give someone something
general purpose enough that you would just end up writing your own code
to
handle it anyway.

Here are some ideas on what it could look like with a method and the
resulting string output:
*Stringify.toJSON()*

With KV:
{"key": "value"}

With Iterables:
["one", "two", "three"]

*Stringify.toXML("rootelement")*

With KV:
<rootelement key=value />

With Iterables:
<rootelement>
  <item>one</item>
  <item>two</item>
  <item>three</item>
</rootelement>

*Stringify.toDelimited(",")*

With KV:
key,value

With Iterables:
one,two,three

Do you think that would strike a good balance between reusable code and
writing your own for more difficult formatting?

Thanks,

Jesse

On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik <lc...@google.com.invalid>
wrote:

Jesse, I believe if one format gets special treatment in TextIO, people
will then ask why doesn't JSON, XML, ... also not supported.

Also, the example that you provide is using the fact that the input
format
is an Iterable<Item>. You had posted a question about using KV with
TextIO.Write which wouldn't align with the proposed input format and
still
would require to write a type conversion function, this time from KV to
Iterable<Item> instead of KV to string.

On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <je...@smokinghand.com>
wrote:

Lukasz,

I don't think you'd need complicated logic for TextIO.Write. For CSV
the
call would look like:
Stringify.to("", ",", "\n");

Where the arguments would be Stringify.to(prefix, delimiter, suffix).

The code would be something like:
StringBuffer buffer = new StringBuffer(prefix);

for (Item item : list) {
  buffer.append(item.toString());

  if(notLast) {
    buffer.append(delimiter);
  }
}

buffer.append(suffix);

c.output(buffer.toString());

That would allow you to do the basic CSV, TSV, and other formats
without
complicated logic. The same sort of thing could be done for
TextIO.Write.

Thanks,

Jesse

On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik <lc...@google.com.invalid

wrote:

The conversion from object to string will have uses outside of just
TextIO.Write so it seems logical that we would want to have a ParDo
do
the
conversion.

Text file formats have a lot of variance, even if you consider the
subset
of CSV like formats where it could have fixed width fields, or
escaping
and
quoting around other fields, or headers that should be placed at
the
top.

Having all these format conversions within TextIO.Write seems like
a
lot
of
logic to contain in that transform which should just focus on
writing
to
files.

On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
je...@smokinghand.com>
wrote:

This is a thread moved over from the user mailing list.

I think there needs to be a way to convert a PCollection<KV> to
PCollection<String> Conversion.

To do a minimal WordCount, you have to manually convert the KV
to a
String:
        p
                .apply(TextIO.Read.from("playing_cards.tsv"))
                .apply(Regex.split("\\W+"))
                .apply(Count.perElement())
*                .apply(MapElements.via((KV<String, Long> count)
->*
*                            count.getKey() + ":" +
count.getValue()*
*                        ).withOutputType(
TypeDescriptors.strings()))*
                .apply(TextIO.Write.to("output/stringcounts"));

This code really should be something like:
        p
                .apply(TextIO.Read.from("playing_cards.tsv"))
                .apply(Regex.split("\\W+"))
                .apply(Count.perElement())
*                .apply(ToString.stringify())*
                .apply(TextIO.Write.to("output/stringcounts"));

To summarize the discussion:

   - JA: Add a method to StringDelegateCoder to output any KV or
list
   - JA and DH: Add a SimpleFunction that takes an type and runs
toString()
   on it:
   class ToStringFn<InputT> extends SimpleFunction<InputT,
String>
{
       public static String apply(InputT input) {
           return input.toString();
       }
   }
   - JB: Add a general purpose type converter like in Apache
Camel.
   - JA: Add Object support to TextIO.Write that would write out
the
   toString of any Object.

My thoughts:

Is converting to a PCollection<String> mostly needed when you're
using
TextIO.Write? Will a general purpose transform only work in
certain
cases
and you'll normally have to write custom code format the strings
the
way
you want them?

IMHO, it's yes to both. I'd prefer to add Object support to
TextIO.Write
or
a SimpleFunction that takes a delimiter as an argument. Making a
SimpleFunction that's able to specify a delimiter (and perhaps a
prefix
and
suffix) should cover the majority of formats and cases.

Thanks,

Jesse








--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Reply via email to