Re: [DISCUSS] Change "RunnableOnService" To A More Intuitive Name

2016-11-09 Thread Aljoscha Krettek
+1

What I would really like to see is automatic derivation of the capability
matrix from an extended Runner Test Suite, as outlined in Thomas' doc.

On Wed, 9 Nov 2016 at 21:42 Kenneth Knowles  wrote:

> Huge +1 to this.
>
> The two categories I care most about are:
>
> 1. Tests that need a runner, but are testing the other "thing under test";
> today this is NeedsRunner.
> 2. Tests that are intended to test a runner; today this is
> RunnableOnService.
>
> Actually the lines are not necessarily clear between them, but I think we can
> make good choices, like we already do.
>
> The idea of two categories with a common superclass actually has a pitfall:
> what if a test is put in the superclass category, when it does not have a
> clear meaning? And also, I don't have any good ideas for names.
>
> So I think just replacing RunnableOnService with RunnerTest to make clear
> that it is there just to test the runner is good. We might also want
> RunnerIntegrationTest extends NeedsRunner to use in the IO modules.
>
> See also Thomas's doc on capability matrix testing* which is aimed at case
> 2. Those tests should all have a category from the doc, or a new one added.
>
> *
> https://docs.google.com/document/d/1fICxq32t9yWn9qXhmT07xpclHeHX2VlUyVtpi2WzzGM/edit
>
> Kenn
>
> On Wed, Nov 9, 2016 at 12:20 PM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Mark,
> >
> > Generally speaking, I agree.
> >
> > As RunnableOnService extends NeedsRunner, @TestsWithRunner or
> > @RunOnRunner sound clearer.
> >
> > Regards
> > JB
> >
> >
> > On 11/09/2016 09:00 PM, Mark Liu wrote:
> >
> >> Hi all,
> >>
> >> I'm working on building RunnableOnService in Python SDK. After having
> >> discussions with folks, "RunnableOnService" does not seem like a very
> >> intuitive name for those unit tests that require runners and build
> >> lightweight pipelines to test specific components. In particular, they
> >> don't have to run on a service.
> >>
> >> So I want to raise this idea to the community and see if anyone has
> >> similar thoughts. Maybe we can come up with a name that is tied to the
> >> runner. Currently, I have two names in my head:
> >>
> >> - TestsWithRunners
> >> - RunnerExecutable
> >>
> >> Any thoughts?
> >>
> >> Thanks,
> >> Mark
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: PCollection to PCollection Conversion

2016-11-09 Thread Manu Zhang
I would love to see a lean core and abundant Transforms at the same time.

Maybe we can look at what Confluent  does
for kafka-connect. They have official extensions support for JDBC, HDFS and
ElasticSearch under https://github.com/confluentinc. They put them along
with other community extensions on
https://www.confluent.io/product/connectors/ for visibility.

Although we are not a commercial company, can we have a GitHub user like
beam-community to host projects that we build around Beam but that are not
suitable for https://github.com/apache/incubator-beam? In the future, we may
have beam-algebra like http://github.com/twitter/algebird for algebra
operations and beam-ml / beam-dl for machine learning / deep learning. Also,
there will be Beam-related projects elsewhere maintained by other
communities. We can list all of them on the Beam website, like the Spark
packages mentioned by Amit.

My $0.02
Manu



On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles 
wrote:

> On this point from Amit and Ismaël, I agree: we could benefit from a place
> for miscellaneous non-core helper transformations.
>
> We have sdks/java/extensions but it is organized as separate artifacts. I
> think that is fine, considering the nature of Join and SortValues. But for
> simpler transforms, importing one artifact per tiny transform is too much
> overhead. It also seems unlikely that we will have enough commonality among
> the transforms to call the artifact anything other than [some synonym for]
> "miscellaneous".
>
> I wouldn't want to take this too far - even though the SDK has many transforms*
> that are not required for the model [1], I like that the SDK artifact has
> everything a user might need in their "getting started" phase of use. This
> user-friendliness (the user doesn't care that ParDo is core and Sum is not)
> plus the difficulty of judging which transforms go where, are probably why
> we have them mostly all in one place.
>
> Models to look at, off the top of my head, include Pig's PiggyBank and
> Apex's Malhar. These have different levels of support implied. Others?
>
> Kenn
>
> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, KvSwap,
> Partition, Regex, Sample, Sum, Top, Values, WithKeys, WithTimestamps
>
> * at least they are separate classes and not methods on PCollection :-)
>
>
> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía  wrote:
>
> > Nice discussion, and thanks Jesse for bringing this subject back.
> >
> > I agree 100% with Amit and the idea of having a home for those transforms
> > that are not core enough to be part of the sdk, but that we all end up
> > re-writing somehow.
> >
> > This is a needed improvement to be more developer friendly, but also as a
> > reference of good practices of Beam development, and for this reason I
> > agree with JB that at this moment it would be better for these transforms
> > to reside in the Beam repository at least for visibility reasons.
> >
> > One additional question is whether these transforms represent a different
> > DSL or whether they could be grouped with the current extensions (e.g. Join
> > and SortValues) into something more general that we as a community could
> > maintain; but even if that is not the case, it would be really nice to
> > start working on something like this.
> >
> > Ismaël Mejía
> >
> >
> > On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Related to spark-package, we also have Apache Bahir to host
> > > connectors/transforms for Spark and Flink.
> > >
> > > IMHO, right now, Beam should host this, not sure if it makes sense
> > > directly in the core.
> > >
> > > It reminds me of the "Integration" DSL we discussed in the technical
> > > vision document.
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 11/09/2016 11:17 AM, Amit Sela wrote:
> > >
> > >> I think Jesse has a very good point on one hand, while Luke's and
> > >> Kenneth's
> > >> worries about committing users to specific implementations are well
> > >> placed.
> > >>
> > >> The Spark community has a 3rd party repository for useful libraries
> that
> > >> for various reasons are not a part of the Apache Spark project:
> > >> https://spark-packages.org/.
> > >>
> > >> Maybe a "common-transformations" package would serve both users' quick
> > >> ramp-up and ease of use while keeping Beam more "enabling"?
> > >>
> > >> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles wrote:
> > >>
> > >> It seems useful for small scale debugging / demoing to have
> > >>> Dump.toString(). I think it should be named to clearly indicate its
> > >>> limited scope. Maybe other stuff could go in the Dump namespace, but
> > >>> "Dump.toJson()" would be for humans to read - so it should be pretty
> > >>> printed, not treated as a machine-to-machine wire format.
> > >>>
> > >>> The broader question of representing data in JSON or XML, etc, is
> > >>> already the subject of many mature libraries which are already easy
> > >>> to use with Beam.

Re: SBT/ivy dependency issues

2016-11-09 Thread Manu Zhang
Hi all,

I tried and reproduced the issue. "sbt-dependency-graph" doesn't show
beam-sdks-java-core and beam-runners-core-java either.
It's likely an ivy/sbt issue. I'll dig further.

Manu

On Thu, Nov 10, 2016 at 3:07 AM Kenneth Knowles 
wrote:

> Hi Abbass,
>
> Seeing the output from `sbt dependency-tree` from the sbt-dependency-graph
> plugin [1] might help. (caveat: I did not try this out; I don't know the
> state of maintenance)
>
> Kenn
>
> [1] https://github.com/jrudolph/sbt-dependency-graph
>
> On Wed, Nov 9, 2016 at 6:33 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Abbass,
> >
> > As discussed together, it could be related to some changes we did in the
> > Maven profiles and build.
> >
> > Let me investigate.
> >
> > I keep you posted.
> >
> > Thanks !
> > Regards
> > JB
> >
> >
> > On 11/09/2016 03:03 PM, amarouni wrote:
> >
> >> Hi guys,
> >>
> >> I'm facing a weird issue with a Scala project (using SBT/ivy) that uses
> >> *beam-runners-spark:0.3.0-incubating *which depends on
> >> *beam-sdks-java-core *& *beam-runners-core-java*.
> >>
> >> Until recently everything worked as expected, i.e. I had to declare a
> >> single dependency on *beam-runners-spark:0.3.0-incubating *which brought
> >> with it *beam-sdks-java-core *& *beam-runners-core-java*, but a couple
> >> of weeks ago I started having issues where the only workaround was to
> >> explicitly declare dependencies on *beam-runners-spark:0.3.0-incubating
> >> *in addition to its direct beam dependencies : *beam-sdks-java-core *&
> >> *beam-runners-core-java*.
> >>
> >> I verified that *beam-runners-spark's *pom contains both of the
> >> *beam-sdks-java-core *& *beam-runners-core-java *dependencies but still
> >> had to declare them explicitly. I'm not sure if this is an issue with
> >> SBT/ivy because Maven can correctly fetch the required beam dependencies
> >> but this issue appears only with beam dependencies.
> >>
> >> Did anyone with SBT/ivy encounter this issue?
> >>
> >> Thanks,
> >>
> >> Abbass,
> >>
> >>
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Re: [DISCUSS] Change "RunnableOnService" To A More Intuitive Name

2016-11-09 Thread Kenneth Knowles
Huge +1 to this.

The two categories I care most about are:

1. Tests that need a runner, but are testing the other "thing under test";
today this is NeedsRunner.
2. Tests that are intended to test a runner; today this is
RunnableOnService.

Actually the lines are not necessarily clear between them, but I think we can
make good choices, like we already do.

The idea of two categories with a common superclass actually has a pitfall:
what if a test is put in the superclass category, when it does not have a
clear meaning? And also, I don't have any good ideas for names.

So I think just replacing RunnableOnService with RunnerTest to make clear
that it is there just to test the runner is good. We might also want
RunnerIntegrationTest extends NeedsRunner to use in the IO modules.
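
For reference, a minimal sketch of that layout as JUnit category markers
(RunnerTest and RunnerIntegrationTest are names proposed in this thread, not
existing classes; NeedsRunner already exists in org.apache.beam.sdk.testing):

/** Tests whose subject is the runner itself; would replace RunnableOnService. */
public interface RunnerTest extends NeedsRunner {}

/** Runner-backed integration tests for the IO modules. */
public interface RunnerIntegrationTest extends NeedsRunner {}

A test would then opt in with JUnit's @Category, e.g.
@Category(RunnerTest.class) on the test method or class.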

See also Thomas's doc on capability matrix testing* which is aimed at case
2. Those tests should all have a category from the doc, or a new one added.

*
https://docs.google.com/document/d/1fICxq32t9yWn9qXhmT07xpclHeHX2VlUyVtpi2WzzGM/edit

Kenn

On Wed, Nov 9, 2016 at 12:20 PM, Jean-Baptiste Onofré 
wrote:

> Hi Mark,
>
> Generally speaking, I agree.
>
> As RunnableOnService extends NeedsRunner, @TestsWithRunner or @RunOnRunner
> sound clearer.
>
> Regards
> JB
>
>
> On 11/09/2016 09:00 PM, Mark Liu wrote:
>
>> Hi all,
>>
>> I'm working on building RunnableOnService in Python SDK. After having
>> discussions with folks, "RunnableOnService" does not seem like a very
>> intuitive name for those unit tests that require runners and build
>> lightweight pipelines to test specific components. In particular, they
>> don't have to run on a service.
>>
>> So I want to raise this idea to the community and see if anyone has
>> similar thoughts. Maybe we can come up with a name that is tied to the
>> runner. Currently, I have two names in my head:
>>
>> - TestsWithRunners
>> - RunnerExecutable
>>
>> Any thoughts?
>>
>> Thanks,
>> Mark
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] Change "RunnableOnService" To A More Intuitive Name

2016-11-09 Thread Robert Bradshaw
I think it's important to tease apart why we're trying to mark
tests. Generally, nearly all tests should run on all runners. However,
there are some exceptions, namely:

1) Some runners don't support all features (especially at the start).
2) Some tests are incompatible with distributed runners (e.g. rely on
in-process IO fakes).

@RunnableOnService has also been used to mark tests that *should* be
run on the service, as it is prohibitively expensive to run all tests
on all runners. We should also have the notion of a comprehensive
suite of tests a runner should pass to support the full model. This
would exclude many tests that are of unmodified composite transforms
(which hopefully could run on any runner, but the incremental benefit
would be small).
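
As a sketch of how such marks could drive a comprehensive runner suite,
JUnit's Categories runner can include and exclude marked tests (the suite
and the UsesInProcessFakes category below are illustrative names, not
existing Beam classes):

import org.junit.experimental.categories.Categories;
import org.junit.runner.RunWith;
import org.junit.runners.Suite;

@RunWith(Categories.class)
@Categories.IncludeCategory(RunnableOnService.class)
@Categories.ExcludeCategory(UsesInProcessFakes.class)
@Suite.SuiteClasses({ParDoTest.class, GroupByKeyTest.class})
public class RunnerConformanceSuite {}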


On Wed, Nov 9, 2016 at 12:20 PM, Jean-Baptiste Onofré  wrote:
> Hi Mark,
>
> Generally speaking, I agree.
>
> As RunnableOnService extends NeedsRunner, @TestsWithRunner or @RunOnRunner
> sound clearer.
>
> Regards
> JB
>
>
> On 11/09/2016 09:00 PM, Mark Liu wrote:
>>
>> Hi all,
>>
>> I'm working on building RunnableOnService in Python SDK. After having
>> discussions with folks, "RunnableOnService" does not seem like a very
>> intuitive name for those unit tests that require runners and build
>> lightweight pipelines to test specific components. In particular, they
>> don't have to run on a service.
>>
>> So I want to raise this idea to the community and see if anyone has
>> similar thoughts. Maybe we can come up with a name that is tied to the
>> runner. Currently, I have two names in my head:
>>
>> - TestsWithRunners
>> - RunnerExecutable
>>
>> Any thoughts?
>>
>> Thanks,
>> Mark
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com


Re: [DISCUSS] Change "RunnableOnService" To A More Intuitive Name

2016-11-09 Thread Jean-Baptiste Onofré

Hi Mark,

Generally speaking, I agree.

As RunnableOnService extends NeedsRunner, @TestsWithRunner or 
@RunOnRunner sound clearer.


Regards
JB

On 11/09/2016 09:00 PM, Mark Liu wrote:

Hi all,

I'm working on building RunnableOnService in Python SDK. After having
discussions with folks, "RunnableOnService" does not seem like a very
intuitive name for those unit tests that require runners and build
lightweight pipelines to test specific components. In particular, they
don't have to run on a service.

So I want to raise this idea to the community and see if anyone has
similar thoughts. Maybe we can come up with a name that is tied to the
runner. Currently, I have two names in my head:

- TestsWithRunners
- RunnerExecutable

Any thoughts?

Thanks,
Mark



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[DISCUSS] Change "RunnableOnService" To A More Intuitive Name

2016-11-09 Thread Mark Liu
Hi all,

I'm working on building RunnableOnService in Python SDK. After having
discussions with folks, "RunnableOnService" does not seem like a very
intuitive name for those unit tests that require runners and build
lightweight pipelines to test specific components. In particular, they
don't have to run on a service.
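
For reference, such a test looks roughly like this in the Java SDK today (a
minimal sketch using TestPipeline and PAssert; the pipeline contents are made
up for illustration):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.RunnableOnService;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Test;
import org.junit.experimental.categories.Category;

@Test
@Category(RunnableOnService.class)
public void testSumPerKey() {
  Pipeline p = TestPipeline.create();
  PCollection<KV<String, Integer>> sums =
      p.apply(Create.of(KV.of("a", 1), KV.of("a", 2)))
       .apply(Sum.integersPerKey());
  // The assertion itself runs as part of the pipeline on the chosen runner.
  PAssert.that(sums).containsInAnyOrder(KV.of("a", 3));
  p.run();
}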

So I want to raise this idea to the community and see if anyone has
similar thoughts. Maybe we can come up with a name that is tied to the
runner. Currently, I have two names in my head:

- TestsWithRunners
- RunnerExecutable

Any thoughts?

Thanks,
Mark


Re: SBT/ivy dependency issues

2016-11-09 Thread Kenneth Knowles
Hi Abbass,

Seeing the output from `sbt dependency-tree` from the sbt-dependency-graph
plugin [1] might help. (caveat: I did not try this out; I don't know the
state of maintenance)

Kenn

[1] https://github.com/jrudolph/sbt-dependency-graph

On Wed, Nov 9, 2016 at 6:33 AM, Jean-Baptiste Onofré 
wrote:

> Hi Abbass,
>
> As discussed together, it could be related to some changes we did in the
> Maven profiles and build.
>
> Let me investigate.
>
> I keep you posted.
>
> Thanks !
> Regards
> JB
>
>
> On 11/09/2016 03:03 PM, amarouni wrote:
>
>> Hi guys,
>>
>> I'm facing a weird issue with a Scala project (using SBT/ivy) that uses
>> *beam-runners-spark:0.3.0-incubating *which depends on
>> *beam-sdks-java-core *& *beam-runners-core-java*.
>>
>> Until recently everything worked as expected, i.e. I had to declare a
>> single dependency on *beam-runners-spark:0.3.0-incubating *which brought
>> with it *beam-sdks-java-core *& *beam-runners-core-java*, but a couple
>> of weeks ago I started having issues where the only workaround was to
>> explicitly declare dependencies on *beam-runners-spark:0.3.0-incubating
>> *in addition to its direct beam dependencies : *beam-sdks-java-core *&
>> *beam-runners-core-java*.
>>
>> I verified that *beam-runners-spark's *pom contains both of the
>> *beam-sdks-java-core *& *beam-runners-core-java *dependencies but still
>> had to declare them explicitly. I'm not sure if this is an issue with
>> SBT/ivy because Maven can correctly fetch the required beam dependencies
>> but this issue appears only with beam dependencies.
>>
>> Did anyone with SBT/ivy encounter this issue?
>>
>> Thanks,
>>
>> Abbass,
>>
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: PCollection to PCollection Conversion

2016-11-09 Thread Kenneth Knowles
On this point from Amit and Ismaël, I agree: we could benefit from a place
for miscellaneous non-core helper transformations.

We have sdks/java/extensions but it is organized as separate artifacts. I
think that is fine, considering the nature of Join and SortValues. But for
simpler transforms, importing one artifact per tiny transform is too much
overhead. It also seems unlikely that we will have enough commonality among
the transforms to call the artifact anything other than [some synonym for]
"miscellaneous".

I wouldn't want to take this too far - even though the SDK has many transforms*
that are not required for the model [1], I like that the SDK artifact has
everything a user might need in their "getting started" phase of use. This
user-friendliness (the user doesn't care that ParDo is core and Sum is not)
plus the difficulty of judging which transforms go where, are probably why
we have them mostly all in one place.

Models to look at, off the top of my head, include Pig's PiggyBank and
Apex's Malhar. These have different levels of support implied. Others?

Kenn

[1] ApproximateQuantiles, ApproximateUnique, Count, Distinct, Filter,
FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min, KvSwap,
Partition, Regex, Sample, Sum, Top, Values, WithKeys, WithTimestamps

* at least they are separate classes and not methods on PCollection :-)


On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía  wrote:

> Nice discussion, and thanks Jesse for bringing this subject back.
>
> I agree 100% with Amit and the idea of having a home for those transforms
> that are not core enough to be part of the sdk, but that we all end up
> re-writing somehow.
>
> This is a needed improvement to be more developer friendly, but also as a
> reference of good practices of Beam development, and for this reason I
> agree with JB that at this moment it would be better for these transforms
> to reside in the Beam repository at least for visibility reasons.
>
> One additional question is whether these transforms represent a different
> DSL or whether they could be grouped with the current extensions (e.g. Join
> and SortValues) into something more general that we as a community could
> maintain; but even if that is not the case, it would be really nice to
> start working on something like this.
> Ismaël Mejía
>
>
> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré 
> wrote:
>
> > Related to spark-package, we also have Apache Bahir to host
> > connectors/transforms for Spark and Flink.
> >
> > IMHO, right now, Beam should host this, not sure if it makes sense
> > directly in the core.
> >
> > It reminds me of the "Integration" DSL we discussed in the technical vision
> > document.
> >
> > Regards
> > JB
> >
> >
> > On 11/09/2016 11:17 AM, Amit Sela wrote:
> >
> >> I think Jesse has a very good point on one hand, while Luke's and
> >> Kenneth's
> >> worries about committing users to specific implementations are well placed.
> >>
> >> The Spark community has a 3rd party repository for useful libraries that
> >> for various reasons are not a part of the Apache Spark project:
> >> https://spark-packages.org/.
> >>
> >> Maybe a "common-transformations" package would serve both users' quick
> >> ramp-up and ease of use while keeping Beam more "enabling"?
> >>
> >> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles 
> >> wrote:
> >>
> >> It seems useful for small scale debugging / demoing to have
> >>> Dump.toString(). I think it should be named to clearly indicate its
> >>> limited scope. Maybe other stuff could go in the Dump namespace, but
> >>> "Dump.toJson()" would be for humans to read - so it should be pretty
> >>> printed, not treated as a machine-to-machine wire format.
> >>>
> >>> The broader question of representing data in JSON or XML, etc, is
> already
> >>> the subject of many mature libraries which are already easy to use with
> >>> Beam.
> >>>
> >>> The more esoteric practice of implicit or semi-implicit coercions seems
> >>> like it is also already addressed in many ways elsewhere.
> >>> Transform.via(TypeConverter) is basically the same as
> >>> MapElements.via() and also easy to use with Beam.
> >>>
> >>> In both of the last cases, there are many reasonable approaches, and we
> >>> shouldn't commit our users to one of them.
> >>>
> >>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik wrote:
> >>>
> >>> The suggestions you give seem good except for the XML cases.
> 
>  Might want to have the XML be a document per line similar to the JSON
>  examples you have been giving.
> 
>  On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <je...@smokinghand.com> wrote:
> 
> @lukasz Agreed there would have to be KV handling. I was more thinking
> that whatever the addition, it shouldn't just handle KV. It should handle
> Iterables, Lists, Sets, and KVs.
> >
> For JSON and XML, I wonder if we'd be able to give someone something
> general purpose enough that you would just end up writing your own code
> to handle it anyway.

Re: [PROPOSAL] Merge apex-runner to master branch

2016-11-09 Thread Kenneth Knowles
Hi Thomas,

Very good point about establishing more clear definitions of the roles
mentioned in the guidelines. Let's discuss in a separate thread.

Kenn

On Tue, Nov 8, 2016 at 1:03 PM, Thomas Weise  wrote:

> Thanks for the support. It may be helpful to describe the roles of
> "maintainer" and "supporter" in this context, perhaps even capture it on:
>
> http://beam.apache.org/contribute/contribution-guide/
>
> Thanks,
> Thomas
>
>
> On Tue, Nov 8, 2016 at 7:51 PM, Robert Bradshaw wrote:
>
> > Nice. I'm +1 modulo one caveat below (hopefully easily addressed).
> >
> > On Tue, Nov 8, 2016 at 5:54 AM, Thomas Weise  wrote:
> > > Hi,
> > >
> > > As per previous discussion [1], I would like to propose to merge the
> > > apex-runner branch into master. The runner satisfies the criteria
> > outlined
> > > in [2] and merging it to master will give more visibility to other
> > > contributors and users.
> > >
> > > Specifically the Apex runner addresses:
> > >
> > >- Have at least 2 contributors interested in maintaining it, and 1
> > >committer interested in supporting it:  *I'm going to sign up for the
> > >support and there are more folks interested. Some have already
> > >contributed and helped with PR reviews, others from the Apex community
> > >have expressed interest [3].*
> >
> > As anyone in the open source ecosystem knows, maintaining is a much
> > higher bar than contributing, but very important. I'd like to see
> > specific names here.
> >
> > >- Provide both end-user and developer-facing documentation:  *Runner
> > >has README, capability matrix, Javadoc. Planning to add it to the
> > >tutorial later.*
> > >- Have at least a basic level of unit test coverage:  *Has 30 runner
> > >specific tests and passes all Beam RunnableOnService tests.*
> > >- Run all existing applicable integration tests with other Beam
> > >components and create additional tests as appropriate:  *Enabled
> > >runner for examples integration tests in the same way as other runners.*
> > >- Be able to handle a subset of the model that addresses a significant
> > >set of use cases (aka. ‘traditional batch’ or ‘processing time
> > >streaming’):  *Passes RunnableOnService without exclusions and example IT.*
> > >- Update the capability matrix with the current status:  *Done.*
> > >- Add a webpage under learn/runners:  *Same "TODO" page as other
> > >runners added to site.*
> > >
> > > The PR for the merge: https://github.com/apache/incubator-beam/pull/1305
> > >
> > > (There are intermittent test failures in individual Travis runs that
> are
> > > unrelated to the runner.)
> > >
> > > Thanks,
> > > Thomas
> > >
> > > [1]
> > > https://lists.apache.org/thread.html/2b420a35f05e47561f27c19e8ec6484f595553f32da88fe593ad931d@%3Cdev.beam.apache.org%3E
> > >
> > > [2] http://beam.apache.org/contribute/contribution-guide/#feature-branches
> > >
> > > [3]
> > > https://lists.apache.org/thread.html/6e7618768cdcde81c28aa9883a1fcf4d3d4e41de4249547130691d52@%3Cdev.apex.apache.org%3E
> >
>


Re: SBT/ivy dependency issues

2016-11-09 Thread Jean-Baptiste Onofré

Hi Abbass,

As discussed together, it could be related to some changes we did in the 
Maven profiles and build.


Let me investigate.

I keep you posted.

Thanks !
Regards
JB

On 11/09/2016 03:03 PM, amarouni wrote:

Hi guys,

I'm facing a weird issue with a Scala project (using SBT/ivy) that uses
*beam-runners-spark:0.3.0-incubating *which depends on
*beam-sdks-java-core *& *beam-runners-core-java*.

Until recently everything worked as expected, i.e. I had to declare a
single dependency on *beam-runners-spark:0.3.0-incubating *which brought
with it *beam-sdks-java-core *& *beam-runners-core-java*, but a couple
of weeks ago I started having issues where the only workaround was to
explicitly declare dependencies on *beam-runners-spark:0.3.0-incubating
*in addition to its direct beam dependencies : *beam-sdks-java-core *&
*beam-runners-core-java*.

I verified that *beam-runners-spark's *pom contains both of the
*beam-sdks-java-core *& *beam-runners-core-java *dependencies but still
had to declare them explicitly. I'm not sure if this is an issue with
SBT/ivy because Maven can correctly fetch the required beam dependencies
but this issue appears only with beam dependencies.

Did anyone with SBT/ivy encounter this issue?

Thanks,

Abbass,





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: PCollection to PCollection Conversion

2016-11-09 Thread Jean-Baptiste Onofré

Hi Ismaël,

you are right: it's not necessarily a DSL on its own (even if I think it
could make sense, as we could provide convenient notation like .marshal()
or .unmarshal() for instance); it could be an "extension" jar providing
those transforms.
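
To make that concrete, a hypothetical sketch of such a convenience, where
Marshaller is an invented interface and the wiring is plain MapElements
(nothing like this exists in Beam today):

// Assumes the usual SDK imports: MapElements, SimpleFunction, PTransform,
// PCollection, plus java.io.Serializable.
public interface Marshaller<T> extends Serializable {
  String marshal(T value);
}

public static <T> PTransform<PCollection<T>, PCollection<String>> marshal(
    final Marshaller<T> m) {
  // Wrap the marshaller in a SimpleFunction so MapElements can apply it
  // element by element.
  return MapElements.via(new SimpleFunction<T, String>() {
    @Override
    public String apply(T value) {
      return m.marshal(value);
    }
  });
}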


I think the SDKs should be low level, and new "extensions" (for now in 
Beam) can provide convenient transforms or DSLs (I'm thinking about 
a machine learning extension too, for instance).


Clearly, it extends the scope of the project by itself, and I think it's 
a great thing ;) It will allow new contributors to work on different 
parts of the project.


Just my $0.01 ;)

Regards
JB

On 11/09/2016 03:03 PM, Ismaël Mejía wrote:

​Nice discussion, and thanks Jesse for bringing this subject back.

I agree 100% with Amit and the idea of having a home for those transforms
that are not core enough to be part of the sdk, but that we all end up
re-writing somehow.

This is a needed improvement to be more developer friendly, but also as a
reference of good practices of Beam development, and for this reason I
agree with JB that at this moment it would be better for these transforms
to reside in the Beam repository at least for visibility reasons.

One additional question is whether these transforms represent a different DSL
or whether they could be grouped with the current extensions (e.g. Join and
SortValues) into something more general that we as a community could
maintain; but even if that is not the case, it would be really nice to
start working on something like this.

Ismaël Mejía


On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré 
wrote:


Related to spark-package, we also have Apache Bahir to host
connectors/transforms for Spark and Flink.

IMHO, right now, Beam should host this, not sure if it makes sense
directly in the core.

It reminds me of the "Integration" DSL we discussed in the technical vision
document.

Regards
JB


On 11/09/2016 11:17 AM, Amit Sela wrote:


I think Jesse has a very good point on one hand, while Luke's and
Kenneth's
worries about committing users to specific implementations are well placed.

The Spark community has a 3rd party repository for useful libraries that
for various reasons are not a part of the Apache Spark project:
https://spark-packages.org/.

Maybe a "common-transformations" package would serve both users quick
ramp-up and ease-of-use while keeping Beam more "enabling" ?

On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles 
wrote:

It seems useful for small scale debugging / demoing to have
Dump.toString(). I think it should be named to clearly indicate its
limited scope. Maybe other stuff could go in the Dump namespace, but
"Dump.toJson()" would be for humans to read - so it should be pretty
printed, not treated as a machine-to-machine wire format.

The broader question of representing data in JSON or XML, etc, is already
the subject of many mature libraries which are already easy to use with
Beam.

The more esoteric practice of implicit or semi-implicit coercions seems
like it is also already addressed in many ways elsewhere.
Transform.via(TypeConverter) is basically the same as
MapElements.via() and also easy to use with Beam.

In both of the last cases, there are many reasonable approaches, and we
shouldn't commit our users to one of them.

On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik 
wrote:

The suggestions you give seem good except for the XML cases.


Might want to have the XML be a document per line similar to the JSON
examples you have been giving.

On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson 
wrote:

@lukasz Agreed there would have to be KV handling. I was more thinking
that whatever the addition, it shouldn't just handle KV. It should handle
Iterables, Lists, Sets, and KVs.

For JSON and XML, I wonder if we'd be able to give someone something
general purpose enough that you would just end up writing your own code
to handle it anyway.

Here are some ideas on what it could look like with a method and the
resulting string output:
*Stringify.toJSON()*

With KV:
{"key": "value"}

With Iterables:
["one", "two", "three"]

*Stringify.toXML("rootelement")*

With KV:
<rootelement><key>value</key></rootelement>

With Iterables:
<rootelement>
  one
  two
  three
</rootelement>

*Stringify.toDelimited(",")*

With KV:
key,value

With Iterables:
one,two,three

Do you think that would strike a good balance between reusable code and
writing your own for more difficult formatting?
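
As a rough sketch of the delimited case (Stringify does not exist; this
assumes String-valued iterables for simplicity, using the annotation-based
DoFn):

static DoFn<Iterable<String>, String> toDelimited(final String delimiter) {
  return new DoFn<Iterable<String>, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
      StringBuilder out = new StringBuilder();
      String sep = "";
      // Emit the delimiter between items, but not before the first item.
      for (String item : c.element()) {
        out.append(sep).append(item);
        sep = delimiter;
      }
      c.output(out.toString());
    }
  };
}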

Thanks,

Jesse

On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik 
wrote:

Jesse, I believe if one format gets special treatment in TextIO, people
will then ask why JSON, XML, ... are not also supported.

Also, the example that you provide is using the fact that the input
format is an Iterable. You had posted a question about using KV with
TextIO.Write which wouldn't align with the proposed input format and
still would require writing a type conversion function, this time from
KV to Iterable instead of KV to string.

On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson 
wrote:

Lukasz,


I don't think you'd need complicated logic for TextIO.Write.

Re: PCollection to PCollection Conversion

2016-11-09 Thread Ismaël Mejía
​Nice discussion, and thanks Jesse for bringing this subject back.

I agree 100% with Amit and the idea of having a home for those transforms
that are not core enough to be part of the sdk, but that we all end up
re-writing somehow.

This is a needed improvement to be more developer friendly, but also as a
reference of good practices of Beam development, and for this reason I
agree with JB that at this moment it would be better for these transforms
to reside in the Beam repository at least for visibility reasons.

One additional question is whether these transforms represent a different DSL
or whether they could be grouped with the current extensions (e.g. Join and
SortValues) into something more general that we as a community could
maintain; but even if that is not the case, it would be really nice to
start working on something like this.

Ismaël Mejía


On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré 
wrote:

> Related to spark-package, we also have Apache Bahir to host
> connectors/transforms for Spark and Flink.
>
> IMHO, right now, Beam should host this, not sure if it makes sense
> directly in the core.
>
> It reminds me of the "Integration" DSL we discussed in the technical vision
> document.
>
> Regards
> JB
>
>
> On 11/09/2016 11:17 AM, Amit Sela wrote:
>
>> I think Jesse has a very good point on one hand, while Luke's and
>> Kenneth's
>> worries about committing users to specific implementations are well placed.
>>
>> The Spark community has a 3rd party repository for useful libraries that
>> for various reasons are not a part of the Apache Spark project:
>> https://spark-packages.org/.
>>
>> Maybe a "common-transformations" package would serve both users' quick
>> ramp-up and ease of use while keeping Beam more "enabling"?
>>
>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles 
>> wrote:
>>
>> It seems useful for small scale debugging / demoing to have
>>> Dump.toString(). I think it should be named to clearly indicate its
>>> limited scope. Maybe other stuff could go in the Dump namespace, but
>>> "Dump.toJson()" would be for humans to read - so it should be pretty
>>> printed, not treated as a machine-to-machine wire format.
>>>
>>> The broader question of representing data in JSON or XML, etc, is already
>>> the subject of many mature libraries which are already easy to use with
>>> Beam.
>>>
>>> The more esoteric practice of implicit or semi-implicit coercions seems
>>> like it is also already addressed in many ways elsewhere.
>>> Transform.via(TypeConverter) is basically the same as
>>> MapElements.via() and also easy to use with Beam.
>>>
>>> In both of the last cases, there are many reasonable approaches, and we
>>> shouldn't commit our users to one of them.
>>>
>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik 
>>> wrote:
>>>
>>> The suggestions you give seem good except for the XML cases.

 Might want to have the XML be a document per line similar to the JSON
 examples you have been giving.

 On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson 
 wrote:

> @lukasz Agreed there would have to be KV handling. I was more thinking
> that whatever the addition, it shouldn't just handle KV. It should handle
> Iterables, Lists, Sets, and KVs.
>
> For JSON and XML, I wonder if we'd be able to give someone something
> general purpose enough that you would just end up writing your own code
> to handle it anyway.
>
> Here are some ideas on what it could look like with a method and the
> resulting string output:
> *Stringify.toJSON()*
>
> With KV:
> {"key": "value"}
>
> With Iterables:
> ["one", "two", "three"]
>
> *Stringify.toXML("rootelement")*
>
> With KV:
> <rootelement><key>value</key></rootelement>
>
> With Iterables:
> <rootelement>
>   one
>   two
>   three
> </rootelement>
>
> *Stringify.toDelimited(",")*
>
> With KV:
> key,value
>
> With Iterables:
> one,two,three
>
> Do you think that would strike a good balance between reusable code and
> writing your own for more difficult formatting?
>
> Thanks,
>
> Jesse
>
> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik 
> wrote:
>
> Jesse, I believe if one format gets special treatment in TextIO, people
> will then ask why JSON, XML, ... are not also supported.
>
> Also, the example that you provide is using the fact that the input
> format is an Iterable. You had posted a question about using KV with
> TextIO.Write which wouldn't align with the proposed input format and
> still would require writing a type conversion function, this time from
> KV to Iterable instead of KV to string.
>
> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson 
> wrote:
>
> Lukasz,
>>
>> I don't think you'd need complicated logic for TextIO.Write. For CSV
>> the call would look like:
>> Stringify.to("", ",", "\n");

SBT/ivy dependency issues

2016-11-09 Thread amarouni
Hi guys,

I'm facing a weird issue with a Scala project (using SBT/ivy) that uses
*beam-runners-spark:0.3.0-incubating *which depends on
*beam-sdks-java-core *& *beam-runners-core-java*.

Until recently everything worked as expected, i.e. I had to declare a
single dependency on *beam-runners-spark:0.3.0-incubating *which brought
with it *beam-sdks-java-core *& *beam-runners-core-java*, but a couple
of weeks ago I started having issues where the only workaround was to
explicitly declare dependencies on *beam-runners-spark:0.3.0-incubating
*in addition to its direct beam dependencies : *beam-sdks-java-core *&
*beam-runners-core-java*.

I verified that *beam-runners-spark's *pom contains both of the
*beam-sdks-java-core *& *beam-runners-core-java *dependencies but still
had to declare them explicitly. I'm not sure if this is an issue with
SBT/ivy because Maven can correctly fetch the required beam dependencies
but this issue appears only with beam dependencies.

Did anyone with SBT/ivy encounter this issue?

Thanks,

Abbass,




Re: PCollection to PCollection Conversion

2016-11-09 Thread Jean-Baptiste Onofré
Related to spark-package, we also have Apache Bahir to host 
connectors/transforms for Spark and Flink.


IMHO, right now, Beam should host this, not sure if it makes sense 
directly in the core.


It reminds me of the "Integration" DSL we discussed in the technical vision
document.


Regards
JB

On 11/09/2016 11:17 AM, Amit Sela wrote:

I think Jesse has a very good point on one hand, while Luke's and Kenneth's
worries about committing users to specific implementations are well placed.

The Spark community has a 3rd party repository for useful libraries that
for various reasons are not a part of the Apache Spark project:
https://spark-packages.org/.

Maybe a "common-transformations" package would serve both users quick
ramp-up and ease-of-use while keeping Beam more "enabling" ?

On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles 
wrote:


It seems useful for small scale debugging / demoing to have
Dump.toString(). I think it should be named to clearly indicate its limited
scope. Maybe other stuff could go in the Dump namespace, but
"Dump.toJson()" would be for humans to read - so it should be pretty
printed, not treated as a machine-to-machine wire format.

The broader question of representing data in JSON or XML, etc, is already
the subject of many mature libraries which are already easy to use with
Beam.

The more esoteric practice of implicit or semi-implicit coercions seems
like it is also already addressed in many ways elsewhere.
Transform.via(TypeConverter) is basically the same as
MapElements.via() and also easy to use with Beam.

In both of the last cases, there are many reasonable approaches, and we
shouldn't commit our users to one of them.

On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik 
wrote:


The suggestions you give seem good except for the XML cases.

Might want to have the XML be a document per line similar to the JSON
examples you have been giving.

On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson 
wrote:


@lukasz Agreed there would have to be KV handling. I was more thinking
that whatever the addition, it shouldn't just handle KV. It should handle
Iterables, Lists, Sets, and KVs.

For JSON and XML, I wonder if we'd be able to give someone something
general purpose enough that you would just end up writing your own code
to handle it anyway.

Here are some ideas on what it could look like with a method and the
resulting string output:
*Stringify.toJSON()*

With KV:
{"key": "value"}

With Iterables:
["one", "two", "three"]

*Stringify.toXML("rootelement")*

With KV:
<rootelement><key>value</key></rootelement>

With Iterables:
<rootelement>
  one
  two
  three
</rootelement>

*Stringify.toDelimited(",")*

With KV:
key,value

With Iterables:
one,two,three

Do you think that would strike a good balance between reusable code and
writing your own for more difficult formatting?

Thanks,

Jesse

On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik 
wrote:

Jesse, I believe if one format gets special treatment in TextIO, people
will then ask why JSON, XML, ... are not also supported.

Also, the example that you provide is using the fact that the input
format is an Iterable. You had posted a question about using KV with
TextIO.Write which wouldn't align with the proposed input format and
still would require writing a type conversion function, this time from
KV to Iterable instead of KV to string.

On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson 
wrote:


Lukasz,

I don't think you'd need complicated logic for TextIO.Write. For CSV
the call would look like:
Stringify.to("", ",", "\n");

Where the arguments would be Stringify.to(prefix, delimiter, suffix).

The code would be something like:
StringBuilder buffer = new StringBuilder(prefix);

// Append each item, inserting the delimiter between items but not after
// the last one.
Iterator<Item> it = list.iterator();
while (it.hasNext()) {
  buffer.append(it.next().toString());
  if (it.hasNext()) {
    buffer.append(delimiter);
  }
}

buffer.append(suffix);

c.output(buffer.toString());

That would allow you to do the basic CSV, TSV, and other formats
without complicated logic. The same sort of thing could be done for
TextIO.Write.


Thanks,

Jesse

On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik wrote:


The conversion from object to string will have uses outside of just
TextIO.Write so it seems logical that we would want to have a ParDo
do the conversion.

Text file formats have a lot of variance, even if you consider the
subset of CSV-like formats where it could have fixed width fields, or
escaping and quoting around other fields, or headers that should be
placed at the top.

Having all these format conversions within TextIO.Write seems like a
lot of logic to contain in that transform, which should just focus on
writing to files.

On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <je...@smokinghand.com>
wrote:


This is a thread moved over from the user mailing list.

I think there needs to be a way to do a PCollection to PCollection
conversion.

To do a minimal WordCount, you have to manually convert the KV to a
String:

p
.apply(TextIO.Read.from("playing_cards.tsv"))
.apply(Regex.split("\

Re: PCollection to PCollection Conversion

2016-11-09 Thread Amit Sela
I think Jesse has a very good point on one hand, while Luke's and Kenneth's
worries about committing users to specific implementations are well placed.

The Spark community has a 3rd party repository for useful libraries that
for various reasons are not a part of the Apache Spark project:
https://spark-packages.org/.

Maybe a "common-transformations" package would serve both users quick
ramp-up and ease-of-use while keeping Beam more "enabling" ?

On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles 
wrote:

> It seems useful for small scale debugging / demoing to have
> Dump.toString(). I think it should be named to clearly indicate its limited
> scope. Maybe other stuff could go in the Dump namespace, but
> "Dump.toJson()" would be for humans to read - so it should be pretty
> printed, not treated as a machine-to-machine wire format.
>
> The broader question of representing data in JSON or XML, etc, is already
> the subject of many mature libraries which are already easy to use with
> Beam.
>
> The more esoteric practice of implicit or semi-implicit coercions seems
> like it is also already addressed in many ways elsewhere.
> Transform.via(TypeConverter) is basically the same as
> MapElements.via() and also easy to use with Beam.
>
> In both of the last cases, there are many reasonable approaches, and we
> shouldn't commit our users to one of them.
>
> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik 
> wrote:
>
> > The suggestions you give seem good except for the XML cases.
> >
> > Might want to have the XML be a document per line similar to the JSON
> > examples you have been giving.
> >
> > On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson 
> > wrote:
> >
> > > @lukasz Agreed there would have to be KV handling. I was more thinking
> > > that whatever the addition, it shouldn't just handle KV. It should handle
> > > Iterables, Lists, Sets, and KVs.
> > >
> > > For JSON and XML, I wonder if we'd be able to give someone something
> > > general purpose enough that you would just end up writing your own code
> > > to handle it anyway.
> > >
> > > Here are some ideas on what it could look like with a method and the
> > > resulting string output:
> > > *Stringify.toJSON()*
> > >
> > > With KV:
> > > {"key": "value"}
> > >
> > > With Iterables:
> > > ["one", "two", "three"]
> > >
> > > *Stringify.toXML("rootelement")*
> > >
> > > With KV:
> > > <rootelement><key>value</key></rootelement>
> > >
> > > With Iterables:
> > > <rootelement>
> > >   one
> > >   two
> > >   three
> > > </rootelement>
> > >
> > > *Stringify.toDelimited(",")*
> > >
> > > With KV:
> > > key,value
> > >
> > > With Iterables:
> > > one,two,three
> > >
> > > Do you think that would strike a good balance between reusable code and
> > > writing your own for more difficult formatting?
> > >
> > > Thanks,
> > >
> > > Jesse
> > >
> > > On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik 
> > > wrote:
> > >
> > > Jesse, I believe if one format gets special treatment in TextIO, people
> > > will then ask why JSON, XML, ... are not also supported.
> > >
> > > Also, the example that you provide is using the fact that the input
> > > format is an Iterable. You had posted a question about using KV with
> > > TextIO.Write which wouldn't align with the proposed input format and
> > > still would require writing a type conversion function, this time from
> > > KV to Iterable instead of KV to string.
> > >
> > > On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson 
> > > wrote:
> > >
> > > > Lukasz,
> > > >
> > > > I don't think you'd need complicated logic for TextIO.Write. For CSV
> > > > the call would look like:
> > > > Stringify.to("", ",", "\n");
> > > >
> > > > Where the arguments would be Stringify.to(prefix, delimiter, suffix).
> > > >
> > > > The code would be something like:
> > > > StringBuilder buffer = new StringBuilder(prefix);
> > > >
> > > > // Append each item, inserting the delimiter between items but not
> > > > // after the last one.
> > > > Iterator<Item> it = list.iterator();
> > > > while (it.hasNext()) {
> > > >   buffer.append(it.next().toString());
> > > >   if (it.hasNext()) {
> > > >     buffer.append(delimiter);
> > > >   }
> > > > }
> > > >
> > > > buffer.append(suffix);
> > > >
> > > > c.output(buffer.toString());
> > > >
> > > > That would allow you to do the basic CSV, TSV, and other formats
> > > > without complicated logic. The same sort of thing could be done for
> > > > TextIO.Write.
> > > >
> > > > Thanks,
> > > >
> > > > Jesse
> > > >
> > > > > On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik wrote:
> > > >
> > > > > The conversion from object to string will have uses outside of just
> > > > > TextIO.Write so it seems logical that we would want to have a ParDo
> > > > > do the conversion.
> > > > >
> > > > > Text file formats have a lot of variance, even if you consider the
> > > > > subset of CSV-like formats where it could have fixed width fields,
> > > > > or escaping and quoting around other fields, or headers that should
> > > > > be placed at the top.
> > > > >
> > > > > Having all these format conversions within TextIO.Write seems like
> > > > > a lot of logic to contain in that transform, which should just
> > > > > focus on writing to files.