Re: Scala DSL

2016-06-24 Thread Dan Halperin
On Fri, Jun 24, 2016 at 7:05 PM, Dan Halperin  wrote:

> On Fri, Jun 24, 2016 at 2:03 PM, Raghu Angadi 
> wrote:
>
>> DSL is a pretty generic term..
>>
>
> I agree and am not married to it. Neville?
>
>
>> The fact that scio uses Java SDK is an implementation detail.
>
>
> Reasonable, which is why I am also not pushing hard for '/java/scio' to be
> in the path.
>
>
>> I love the
>> name scio. But I think sdks/scala might be most appropriate and would make
>> it a first class citizen for Beam.
>>
>
> I am strongly against it being in the 'sdks/' top-level module -- it's not
> a Beam SDK. Unlike DSL, SDK is a very specific term in Beam.
>
>
>> Where would a future python sdk reside?
>>
>
> The Python SDK is in the python-sdk branch on Apache already, and it lives
> in `sdks/python`. (And it is aiming to become a proper Beam SDK. ;)
>

Now with a link:
https://github.com/apache/incubator-beam/tree/python-sdk/sdks

>
> Thanks,
> Dan
>
> On Fri, Jun 24, 2016 at 1:50 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>> > Agree for dsls/scio
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 06/24/2016 10:22 PM, Lukasz Cwik wrote:
>> >
>> >> +1 for dsls/scio for the already listed reasons
>> >>
>> >> On Fri, Jun 24, 2016 at 11:21 AM, Rafal Wojdyla
>> 
>> >> wrote:
>> >>
>> >> Hello. When it comes to SDK vs DSL - I fully agree with Frances. About
>> >>> dsls/java/scio or dsls/scio - dsls/java/scio may cause confusion, scio
>> >>> is a
>> >>> scala DSL but lives under java directory (?) - that makes sense only
>> once
>> >>> you get that scio is using java SDK under the hood. Thus, +1 to
>> >>> dsls/scio.
>> >>> - Rafal
>> >>>
>> >>> On Fri, Jun 24, 2016 at 2:01 PM, Kenneth Knowles
>> > >>> >
>> >>> wrote:
>> >>>
>> >>> My +1 goes to dsls/scio. It already has a cool name, so let's use it.
>> And
>>  there might be other Scala-based DSLs.
>> 
>>  On Fri, Jun 24, 2016 at 8:39 AM, Ismaël Mejía 
>>  wrote:
>> 
>>  ​Hello everyone,
>> >
>> > Neville, thanks a lot for your contribution. Your work is amazing
>> and I
>> >
>>  am
>> 
>> > really happy that this scala integration is finally happening.
>> > Congratulations to you and your team.
>> >
>> > I *strongly* disagree about the DSL classification for scio for one
>> >
>>  reason,
>> 
>> > if you go to the root of the term, Domain Specific Languages are
>> about
>> >
>>  a
>> >>>
>>  domain, and the domain in this case is writing Beam pipelines, which
>> >
>>  is a
>> >>>
>>  really broad domain.
>> >
>> > I agree with Frances’ argument that scio is not an SDK e.g. it
>> reuses
>> >
>>  the
>> >>>
>>  existing Beam java SDK. My proposition is that scio will be called
>> the
>> > Scala API because in the end this is what it is. I think the
>> confusion
>> > comes from the common definition of SDK which is normally an API + a
>> > Runtime. In this case scio will share the runtime with what we call
>> the
>> > Beam Java SDK.
>> >
>> > One additional point of using the term API is that it sends the
>> clear
>> > message that Beam has a Scala API too (which is good for visibility
>> as
>> >
>>  JB
>> >>>
>>  mentioned).
>> >
>> > Regards,
>> > Ismaël​
>> >
>> >
>> > On Fri, Jun 24, 2016 at 5:08 PM, Jean-Baptiste Onofré <
>> j...@nanthrax.net
>> >
>> 
>>  wrote:
>> >
>> > Hi Dan,
>> >>
>> >> fair enough.
>> >>
>> >> As I'm also working on new DSLs (XML, JSON), I already created the
>> >>
>> > dsls
>> >>>
>>  module.
>> >>
>> >> So, I would say dsls/scala.
>> >>
>> >> WDYT ?
>> >>
>> >> Regards
>> >> JB
>> >>
>> >>
>> >> On 06/24/2016 05:07 PM, Dan Halperin wrote:
>> >>
>> >> I don't think that sdks/scala is the right place -- scio is not a
>> >>>
>> >> Beam
>> >>>
>>  Scala SDK; it wraps the existing Java SDK.
>> >>>
>> >>> Some options:
>> >>> * sdks/java/extensions  (Scio builds on the Java SDK) -- mentally
>> >>>
>> >> vetoed
>> 
>> > since Scio isn't an extension for the Java SDK, but rather a wrapper
>> >>>
>> >>> * dsls/java/scio (Scio is a Beam DSL that uses the Java SDK)
>> >>> * dsls/scio  (Scio is a Beam DSL that could eventually use
>> multiple
>> >>>
>> >> SDKs)
>> >
>> >> * extensions/java/scio  (Scio is an extension of Beam that uses the
>> >>>
>> >> Java
>> 
>> > SDK)
>> >>> * extensions/scio  (Scio is an extension of Beam that is not
>> limited
>> >>>
>> >> to
>> 
>> > one
>> >>> SDK)
>> >>>
>> >>> I lean towards either dsls/java/scio or extensions/java/scio,
>> since
>> >>>
>> >> I
>> >>>
>>  don't
>> >>> think there are plans for Scio to handle multiple different SDKs
>> (in
>> >>> different languages). The question between these two is whether we
>> >>>
>> 
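For readers skimming the thread, the candidate locations discussed above can be summarized in one sketch of the repository layout (hypothetical; at this point in the thread no final choice had been made):

```
incubator-beam/
├── sdks/
│   ├── java/        # Beam Java SDK (a full SDK: API + runtime support)
│   └── python/      # Python SDK, on the python-sdk branch
├── dsls/            # module JB created for DSLs (XML, JSON, ...)
│   └── scio/        # option favored by most of the thread
└── extensions/
    └── java/
        └── scio/    # alternative Dan also floated
```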


Re: How to control the parallelism when run ParDo on PCollection?

2016-06-24 Thread Shen Li
Hi Thomas,

Thanks for the follow-up.

Shen

On Fri, Jun 24, 2016 at 4:49 PM, Thomas Groh 
wrote:

> We do also have an active JIRA issue to support limiting parallelism on a
> per-step basis, BEAM-68
>
> https://issues.apache.org/jira/browse/BEAM-68
>
> As Kenn noted, this is not equivalent to controls over bundling, which is
> entirely determined by the runner.
>
> On Fri, Jun 24, 2016 at 1:25 PM, Shen Li  wrote:
>
> > Hi Kenn,
> >
> > Thanks for the explanation.
> >
> > Regards,
> >
> > Shen
> >
> > On Fri, Jun 24, 2016 at 4:09 PM, Kenneth Knowles  >
> > wrote:
> >
> > > Hi Shen,
> > >
> > > It is completely up to the runner how to divide things into bundles: a
> > > bundle is one item of work that should fail or succeed atomically. Bundling
> limits
> > > parallelism, but does not determine it. For example, a streaming
> > execution
> > > may have many bundles over time as elements arrive, regardless of
> > > parallelism.
> > >
> > > Kenn
> > >
> > > On Fri, Jun 24, 2016 at 12:13 PM, Shen Li  wrote:
> > >
> > > > Hi,
> > > >
> > > > The document says "when a ParDo transform is executed, the elements of
> > > > the input PCollection are first divided up into some number of bundles".
> > > >
> > > > How do users control the number of bundles/parallelism? Or is it
> > > > completely up to the runner?
> > > >
> > > > Thanks,
> > > >
> > > > Shen
> > > >
> > >
> >
>

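The distinction Kenn draws above (bundling limits parallelism but does not determine it) can be illustrated with a toy model. This is not Beam API; all names here are hypothetical, and real runners pick bundle boundaries themselves:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a runner dividing a PCollection's elements into bundles.
// A bundle is the unit of work that fails or succeeds atomically; how many
// bundles run concurrently is still the runner's choice.
public class Bundling {

    // Split elements into bundles of at most maxBundleSize elements each.
    public static <T> List<List<T>> divideIntoBundles(List<T> elements, int maxBundleSize) {
        List<List<T>> bundles = new ArrayList<>();
        for (int i = 0; i < elements.size(); i += maxBundleSize) {
            bundles.add(new ArrayList<>(
                elements.subList(i, Math.min(i + maxBundleSize, elements.size()))));
        }
        return bundles;
    }

    public static void main(String[] args) {
        List<List<Integer>> bundles = divideIntoBundles(List.of(1, 2, 3, 4, 5), 2);
        // Three bundles: [1, 2], [3, 4], [5]. At most three can run in
        // parallel -- bundling caps parallelism but does not fix it, and a
        // streaming runner may keep creating bundles as elements arrive.
        System.out.println(bundles.size());  // 3
    }
}
```

A runner could just as well emit one bundle per element or one bundle for everything; that freedom is why there is no user-facing knob here.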

Re: Running examples with different runners

2016-06-24 Thread Thomas Weise
Thanks! I will try that out.

Regarding the View translation, it still fails with HEAD of master (just
did a pull):

---
Test set: org.apache.beam.sdk.testing.PAssertTest
---
Tests run: 10, Failures: 1, Errors: 3, Skipped: 0, Time elapsed: 29.566 sec
<<< FAILURE! - in org.apache.beam.sdk.testing.PAssertTest
testIsEqualTo(org.apache.beam.sdk.testing.PAssertTest)  Time elapsed: 1.518
sec  <<< ERROR!
java.lang.IllegalStateException: no translator registered for
View.CreatePCollectionView
at
org.apache.beam.runners.apex.ApexPipelineTranslator.visitPrimitiveTransform(ApexPipelineTranslator.java:98)


On Fri, Jun 24, 2016 at 5:28 PM, Lukasz Cwik 
wrote:

> Below I outline a different approach from the DirectRunner's. The DirectRunner
> didn't require an override for Create, since it knows when there is no data
> remaining and can correctly shut the pipeline down by pushing the watermark
> all the way through the pipeline. That is a superior approach, but I believe
> it is more difficult to get right.
>
> PAssert emits an aggregator with a specific name which states that the
> PAssert succeeded or failed:
>
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java#L110
>
> The test Dataflow runner counts how many PAsserts were applied and then
> polls itself every 10 seconds checking to see if the aggregator has any
> failures or all the successes for streaming pipelines.
> Polling logic:
>
> https://github.com/apache/incubator-beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/testing/TestDataflowRunner.java#L114
> Check logic:
>
> https://github.com/apache/incubator-beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/testing/TestDataflowRunner.java#L177
>
> As for overriding a transform, the runner is currently invoked during
> application of a transform and is able to inject/replace/modify the
> transform that was being applied. The test Dataflow runner uses this a
> little bit to do the PAssert counting while the normal Dataflow runner does
> this a lot for its own specific needs.
>
> Finally, I believe Kenn just made some changes which removed the requirement
> to support View.YYY and replaced it with GroupByKey, so the "no translator
> registered for View..." error may go away.
>
>
> On Fri, Jun 24, 2016 at 4:52 PM, Thomas Weise 
> wrote:
>
> > Kenneth and Lukasz, thanks for the direction.
> >
> > Is there any information about other requirements to run the cross-runner
> > tests, and hints to troubleshoot? On first attempt they mostly fail due to
> > missing translator:
> >
> > PAssertTest.testIsEqualTo:219 ▒ IllegalState no translator registered for
> > View...
> >
> > Also, for run() to be synchronous or wait, there needs to be an exit
> > condition. I know how to solve this for the Apex runner specific tests.
> But
> > for the cross runner tests, what is the recommended way to do this?
> Kenneth
> > mentioned that Create could signal end of stream. Should I look to
> override
> > the Create transformation to configure the behavior (just for this test
> > suite), and if so, is there an example of how to do this cleanly?
> >
> > Thanks,
> > Thomas
> >
> >
> >
> >
> > On Tue, Jun 21, 2016 at 7:32 PM, Kenneth Knowles  >
> > wrote:
> >
> > > To expand on the RunnableOnService test suggestion, here [1] is the
> > commit
> > > from the Spark runner. You will get a lot more information if you can
> > port
> > > this for your runner than you would from an example end-to-end test.
> > >
> > > Note that this just pulls in the tests from the core SDK. For testing
> > with
> > > other I/O connectors, you'll add them to the dependenciesToScan.
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/incubator-beam/commit/4254749bf103c4bb6f68e316768c0aa46d9f7df0
> > >
> > > On Tue, Jun 21, 2016 at 4:06 PM, Lukasz Cwik  >
> > > wrote:
> > >
> > > > There is a start to getting more e2e-like integration tests going
> with
> > > the
> > > > first being WordCount.
> > > >
> > > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/WordCountIT.java
> > > > You could add WindowedWordCountIT.java which will be launched with
> the
> > > > proper configuration of the Apex runner pom.xml
> > > >
> > > > I would also suggest that you take a look at the @RunnableOnService
> > tests
> > > > which are a comprehensive validation suite of ~200ish tests that test
> > > > everything from triggers to side inputs. It requires some pom changes
> > and
> > > > creating a test runner which is able to set up an Apex environment.
> > > >
> > > > Furthermore, we could really use an addition to the Beam wiki about
> > > testing
> > > > and how runners write tests/execute tests/...
> > > >
> > > > Some r
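The polling scheme Lukasz describes (count the PAsserts applied, then repeatedly check aggregator values until every assert has succeeded or any has failed) can be sketched abstractly. This is a toy model with hypothetical names, not the TestDataflowRunner API:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the test-runner check: the pipeline carries success and
// failure counters (standing in for PAssert's named aggregators), and the
// harness polls them periodically. All names here are illustrative only.
public class AssertPolling {

    static class FakePipeline {
        final int expectedAsserts;  // counted when the pipeline is built
        final AtomicInteger successes = new AtomicInteger();
        final AtomicInteger failures = new AtomicInteger();

        FakePipeline(int expectedAsserts) { this.expectedAsserts = expectedAsserts; }
    }

    // One poll iteration: false as soon as any failure is observed, true once
    // every expected assert has succeeded, null while still pending.
    static Boolean checkOnce(FakePipeline p) {
        if (p.failures.get() > 0) return false;
        if (p.successes.get() >= p.expectedAsserts) return true;
        return null;  // not enough information yet; poll again later
    }

    public static void main(String[] args) {
        FakePipeline p = new FakePipeline(2);
        System.out.println(checkOnce(p));  // null (still pending)
        p.successes.incrementAndGet();
        p.successes.incrementAndGet();
        System.out.println(checkOnce(p));  // true (all asserts succeeded)
    }
}
```

The real runner wraps this check in a loop with a sleep (every 10 seconds, per the linked code) because a streaming pipeline never "finishes" on its own.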

Re: Running examples with different runners

2016-06-24 Thread Lukasz Cwik
Below I outline a different approach from the DirectRunner's. The DirectRunner
didn't require an override for Create, since it knows when there is no data
remaining and can correctly shut the pipeline down by pushing the watermark
all the way through the pipeline. That is a superior approach, but I believe
it is more difficult to get right.

PAssert emits an aggregator with a specific name which states that the
PAssert succeeded or failed:
https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java#L110

The test Dataflow runner counts how many PAsserts were applied and then
polls itself every 10 seconds checking to see if the aggregator has any
failures or all the successes for streaming pipelines.
Polling logic:
https://github.com/apache/incubator-beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/testing/TestDataflowRunner.java#L114
Check logic:
https://github.com/apache/incubator-beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/testing/TestDataflowRunner.java#L177

As for overriding a transform, the runner is currently invoked during
application of a transform and is able to inject/replace/modify the
transform that was being applied. The test Dataflow runner uses this a
little bit to do the PAssert counting while the normal Dataflow runner does
this a lot for its own specific needs.

Finally, I believe Kenn just made some changes which removed the requirement
to support View.YYY and replaced it with GroupByKey, so the "no translator
registered for View..." error may go away.


On Fri, Jun 24, 2016 at 4:52 PM, Thomas Weise 
wrote:

> Kenneth and Lukasz, thanks for the direction.
>
> Is there any information about other requirements to run the cross-runner
> tests, and hints to troubleshoot? On first attempt they mostly fail due to
> missing translator:
>
> PAssertTest.testIsEqualTo:219 ▒ IllegalState no translator registered for
> View...
>
> Also, for run() to be synchronous or wait, there needs to be an exit
> condition. I know how to solve this for the Apex runner specific tests. But
> for the cross runner tests, what is the recommended way to do this? Kenneth
> mentioned that Create could signal end of stream. Should I look to override
> the Create transformation to configure the behavior (just for this test
> suite), and if so, is there an example of how to do this cleanly?
>
> Thanks,
> Thomas
>
>
>
>
> On Tue, Jun 21, 2016 at 7:32 PM, Kenneth Knowles 
> wrote:
>
> > To expand on the RunnableOnService test suggestion, here [1] is the
> commit
> > from the Spark runner. You will get a lot more information if you can
> port
> > this for your runner than you would from an example end-to-end test.
> >
> > Note that this just pulls in the tests from the core SDK. For testing
> with
> > other I/O connectors, you'll add them to the dependenciesToScan.
> >
> > [1]
> >
> >
> https://github.com/apache/incubator-beam/commit/4254749bf103c4bb6f68e316768c0aa46d9f7df0
> >
> > On Tue, Jun 21, 2016 at 4:06 PM, Lukasz Cwik 
> > wrote:
> >
> > > There is a start to getting more e2e-like integration tests going with
> > the
> > > first being WordCount.
> > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/WordCountIT.java
> > > You could add WindowedWordCountIT.java which will be launched with the
> > > proper configuration of the Apex runner pom.xml
> > >
> > > I would also suggest that you take a look at the @RunnableOnService
> tests
> > > which are a comprehensive validation suite of ~200ish tests that test
> > > everything from triggers to side inputs. It requires some pom changes
> and
> > > creating a test runner which is able to set up an Apex environment.
> > >
> > > Furthermore, we could really use an addition to the Beam wiki about
> > testing
> > > and how runners write tests/execute tests/...
> > >
> > > Some relevant links:
> > > Older presentation about getting cross runner tests going:
> > >
> > >
> >
> https://docs.google.com/presentation/d/1uTb7dx4-Y2OM_B0_3XF_whwAL2FlDTTuq2QzP9sJ4Mg/edit#slide=id.g127d614316_19_39
> > >
> > > Examples of test runners:
> > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/testing/TestDataflowRunner.java
> > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/TestFlinkRunner.java
> > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/TestSparkRunner.java
> > >
> > > Section of pom dedicated to enabling runnable on service tests:
> > >
> > >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/spark/pom.xml#L54
> > >
> > > On Tue, Jun 21, 2016 at 2:21 PM, Thomas Weise 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > >

Re: Running examples with different runners

2016-06-24 Thread Thomas Weise
Kenneth and Lukasz, thanks for the direction.

Is there any information about other requirements to run the cross-runner
tests, and hints to troubleshoot? On first attempt they mostly fail due to
missing translator:

PAssertTest.testIsEqualTo:219 ▒ IllegalState no translator registered for
View...

Also, for run() to be synchronous or wait, there needs to be an exit
condition. I know how to solve this for the Apex runner specific tests. But
for the cross runner tests, what is the recommended way to do this? Kenneth
mentioned that Create could signal end of stream. Should I look to override
the Create transformation to configure the behavior (just for this test
suite), and if so, is there an example of how to do this cleanly?

Thanks,
Thomas




On Tue, Jun 21, 2016 at 7:32 PM, Kenneth Knowles 
wrote:

> To expand on the RunnableOnService test suggestion, here [1] is the commit
> from the Spark runner. You will get a lot more information if you can port
> this for your runner than you would from an example end-to-end test.
>
> Note that this just pulls in the tests from the core SDK. For testing with
> other I/O connectors, you'll add them to the dependenciesToScan.
>
> [1]
>
> https://github.com/apache/incubator-beam/commit/4254749bf103c4bb6f68e316768c0aa46d9f7df0
>
> On Tue, Jun 21, 2016 at 4:06 PM, Lukasz Cwik 
> wrote:
>
> > There is a start to getting more e2e-like integration tests going with
> the
> > first being WordCount.
> >
> >
> https://github.com/apache/incubator-beam/blob/master/examples/java/src/test/java/org/apache/beam/examples/WordCountIT.java
> > You could add WindowedWordCountIT.java which will be launched with the
> > proper configuration of the Apex runner pom.xml
> >
> > I would also suggest that you take a look at the @RunnableOnService tests
> > which are a comprehensive validation suite of ~200ish tests that test
> > everything from triggers to side inputs. It requires some pom changes and
> > creating a test runner which is able to set up an Apex environment.
> >
> > Furthermore, we could really use an addition to the Beam wiki about
> testing
> > and how runners write tests/execute tests/...
> >
> > Some relevant links:
> > Older presentation about getting cross runner tests going:
> >
> >
> https://docs.google.com/presentation/d/1uTb7dx4-Y2OM_B0_3XF_whwAL2FlDTTuq2QzP9sJ4Mg/edit#slide=id.g127d614316_19_39
> >
> > Examples of test runners:
> >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/testing/TestDataflowRunner.java
> >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/TestFlinkRunner.java
> >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/TestSparkRunner.java
> >
> > Section of pom dedicated to enabling runnable on service tests:
> >
> >
> https://github.com/apache/incubator-beam/blob/master/runners/spark/pom.xml#L54
> >
> > On Tue, Jun 21, 2016 at 2:21 PM, Thomas Weise 
> > wrote:
> >
> > > Hi,
> > >
> > > As part of the Apex runner, we have a few unit tests for the supported
> > > transformations. Next, I would like to test the WindowedWordCount
> > example.
> > >
> > > Is there an example of configuring this pipeline for another runner? Is
> > it
> > > recommended to supply such configuration as a JUnit test? What is the
> > > general (repeatable?) approach to exercise different runners with the
> set
> > > of example pipelines?
> > >
> > > Thanks,
> > > Thomas
> > >
> >
>
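The RunnableOnService setup Kenn and Lukasz point at boils down to a Surefire configuration that scans the core SDK's test artifact for tests in the RunnableOnService category and feeds them the runner under test. A trimmed sketch of the shape of that configuration, modeled loosely on the linked Spark runner pom (coordinates and values are approximations, not copied from a released pom):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- Run only tests in the RunnableOnService JUnit category -->
    <groups>org.apache.beam.sdk.testing.RunnableOnService</groups>
    <!-- Pull the test classes in from the core SDK's test jar -->
    <dependenciesToScan>
      <dependency>org.apache.beam:beam-sdks-java-core</dependency>
    </dependenciesToScan>
    <!-- Point the suite at this runner via pipeline options -->
    <systemPropertyVariables>
      <beamTestPipelineOptions>["--runner=TestSparkRunner"]</beamTestPipelineOptions>
    </systemPropertyVariables>
  </configuration>
</plugin>
```

An Apex runner would swap in its own test-runner class and whatever extra options its environment setup needs; the linked Spark pom is the authoritative template.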


Re: Scala DSL

2016-06-24 Thread Raghu Angadi
DSL is a pretty generic term..

The fact that scio uses Java SDK is an implementation detail. I love the
name scio. But I think sdks/scala might be most appropriate and would make
it a first class citizen for Beam.

Where would a future python sdk reside?

On Fri, Jun 24, 2016 at 1:50 PM, Jean-Baptiste Onofré 
wrote:

> Agree for dsls/scio
>
> Regards
> JB
>
>
> On 06/24/2016 10:22 PM, Lukasz Cwik wrote:
>
>> +1 for dsls/scio for the already listed reasons
>>
>> On Fri, Jun 24, 2016 at 11:21 AM, Rafal Wojdyla 
>> wrote:
>>
>> Hello. When it comes to SDK vs DSL - I fully agree with Frances. About
>>> dsls/java/scio or dsls/scio - dsls/java/scio may cause confusion, scio
>>> is a
>>> scala DSL but lives under java directory (?) - that makes sense only once
>>> you get that scio is using java SDK under the hood. Thus, +1 to
>>> dsls/scio.
>>> - Rafal
>>>
>>> On Fri, Jun 24, 2016 at 2:01 PM, Kenneth Knowles >> >
>>> wrote:
>>>
>>> My +1 goes to dsls/scio. It already has a cool name, so let's use it. And
 there might be other Scala-based DSLs.

 On Fri, Jun 24, 2016 at 8:39 AM, Ismaël Mejía 
 wrote:

 ​Hello everyone,
>
> Neville, thanks a lot for your contribution. Your work is amazing and I
>
 am

> really happy that this scala integration is finally happening.
> Congratulations to you and your team.
>
> I *strongly* disagree about the DSL classification for scio for one
>
 reason,

> if you go to the root of the term, Domain Specific Languages are about
>
 a
>>>
 domain, and the domain in this case is writing Beam pipelines, which
>
 is a
>>>
 really broad domain.
>
> I agree with Frances’ argument that scio is not an SDK e.g. it reuses
>
 the
>>>
 existing Beam java SDK. My proposition is that scio will be called the
> Scala API because in the end this is what it is. I think the confusion
> comes from the common definition of SDK which is normally an API + a
> Runtime. In this case scio will share the runtime with what we call the
> Beam Java SDK.
>
> One additional point of using the term API is that it sends the clear
> message that Beam has a Scala API too (which is good for visibility as
>
 JB
>>>
 mentioned).
>
> Regards,
> Ismaël​
>
>
> On Fri, Jun 24, 2016 at 5:08 PM, Jean-Baptiste Onofré 

 wrote:
>
> Hi Dan,
>>
>> fair enough.
>>
>> As I'm also working on new DSLs (XML, JSON), I already created the
>>
> dsls
>>>
 module.
>>
>> So, I would say dsls/scala.
>>
>> WDYT ?
>>
>> Regards
>> JB
>>
>>
>> On 06/24/2016 05:07 PM, Dan Halperin wrote:
>>
>> I don't think that sdks/scala is the right place -- scio is not a
>>>
>> Beam
>>>
 Scala SDK; it wraps the existing Java SDK.
>>>
>>> Some options:
>>> * sdks/java/extensions  (Scio builds on the Java SDK) -- mentally
>>>
>> vetoed

> since Scio isn't an extension for the Java SDK, but rather a wrapper
>>>
>>> * dsls/java/scio (Scio is a Beam DSL that uses the Java SDK)
>>> * dsls/scio  (Scio is a Beam DSL that could eventually use multiple
>>>
>> SDKs)
>
>> * extensions/java/scio  (Scio is an extension of Beam that uses the
>>>
>> Java

> SDK)
>>> * extensions/scio  (Scio is an extension of Beam that is not limited
>>>
>> to

> one
>>> SDK)
>>>
>>> I lean towards either dsls/java/scio or extensions/java/scio, since
>>>
>> I
>>>
 don't
>>> think there are plans for Scio to handle multiple different SDKs (in
>>> different languages). The question between these two is whether we
>>>
>> think

> DSLs are "big enough" to be a top level concept.
>>>
>>> On Thu, Jun 23, 2016 at 11:05 PM, Jean-Baptiste Onofré <
>>>
>> j...@nanthrax.net

>
>> wrote:
>>>
>>> Good point about new Fn and the fact it's based on the Java SDK.
>>>

 It's just that in term of "marketing", it's a good message to

>>> provide a

> Scala SDK even if technically it's more a DSL.

 For instance, a valid "marketing" DSL would be a Java fluent DSL on

>>> top

> of
 the Java SDK, or a declarative XML DSL.

 However, from a technical perspective, it can go into dsl module.

 My $0.02 ;)

 Regards
 JB


 On 06/24/2016 06:51 AM, Frances Perry wrote:

 +Rafal & Andrew again

>
> I am leaning DSL for two reasons: (1) scio uses the existing java
> execution
> environment (and won't have a language-specific fn harness of its
>
 own),
>
>> and
(2) it changes the abstractions that users interact with.

Re: Scala DSL

2016-06-24 Thread Jean-Baptiste Onofré

Agree for dsls/scio

Regards
JB

On 06/24/2016 10:22 PM, Lukasz Cwik wrote:

+1 for dsls/scio for the already listed reasons

On Fri, Jun 24, 2016 at 11:21 AM, Rafal Wojdyla 
wrote:


Hello. When it comes to SDK vs DSL - I fully agree with Frances. About
dsls/java/scio or dsls/scio - dsls/java/scio may cause confusion, scio is a
scala DSL but lives under java directory (?) - that makes sense only once
you get that scio is using java SDK under the hood. Thus, +1 to dsls/scio.
- Rafal

On Fri, Jun 24, 2016 at 2:01 PM, Kenneth Knowles 
wrote:


My +1 goes to dsls/scio. It already has a cool name, so let's use it. And
there might be other Scala-based DSLs.

On Fri, Jun 24, 2016 at 8:39 AM, Ismaël Mejía  wrote:


Hello everyone,

Neville, thanks a lot for your contribution. Your work is amazing and I am
really happy that this scala integration is finally happening.
Congratulations to you and your team.

I *strongly* disagree about the DSL classification for scio for one reason:
if you go to the root of the term, Domain Specific Languages are about a
domain, and the domain in this case is writing Beam pipelines, which is a
really broad domain.

I agree with Frances’ argument that scio is not an SDK, e.g. it reuses the
existing Beam java SDK. My proposition is that scio will be called the
Scala API because in the end this is what it is. I think the confusion
comes from the common definition of SDK which is normally an API + a
Runtime. In this case scio will share the runtime with what we call the
Beam Java SDK.

One additional point of using the term API is that it sends the clear
message that Beam has a Scala API too (which is good for visibility as JB
mentioned).

Regards,
Ismaël


On Fri, Jun 24, 2016 at 5:08 PM, Jean-Baptiste Onofré 


wrote:


Hi Dan,

fair enough.

As I'm also working on new DSLs (XML, JSON), I already created the dsls
module.

So, I would say dsls/scala.

WDYT ?

Regards
JB


On 06/24/2016 05:07 PM, Dan Halperin wrote:


I don't think that sdks/scala is the right place -- scio is not a Beam
Scala SDK; it wraps the existing Java SDK.

Some options:
* sdks/java/extensions  (Scio builds on the Java SDK) -- mentally vetoed
  since Scio isn't an extension for the Java SDK, but rather a wrapper
* dsls/java/scio  (Scio is a Beam DSL that uses the Java SDK)
* dsls/scio  (Scio is a Beam DSL that could eventually use multiple SDKs)
* extensions/java/scio  (Scio is an extension of Beam that uses the Java
  SDK)
* extensions/scio  (Scio is an extension of Beam that is not limited to one
  SDK)

I lean towards either dsls/java/scio or extensions/java/scio, since I don't
think there are plans for Scio to handle multiple different SDKs (in
different languages). The question between these two is whether we think
DSLs are "big enough" to be a top level concept.

On Thu, Jun 23, 2016 at 11:05 PM, Jean-Baptiste Onofré <

j...@nanthrax.net



wrote:

Good point about new Fn and the fact it's based on the Java SDK.


It's just that in terms of "marketing", it's a good message to provide a
Scala SDK even if technically it's more a DSL.

For instance, a valid "marketing" DSL would be a Java fluent DSL on top of
the Java SDK, or a declarative XML DSL.

However, from a technical perspective, it can go into dsl module.

My $0.02 ;)

Regards
JB


On 06/24/2016 06:51 AM, Frances Perry wrote:

+Rafal & Andrew again


I am leaning DSL for two reasons: (1) scio uses the existing java execution
environment (and won't have a language-specific fn harness of its own), and
(2) it changes the abstractions that users interact with.

I recently saw a scio repl demo from Reuven -- there's some really cool
stuff in there. I'd love to dive into it a bit more and see what can be
generalized beyond scio. The repl-like interactive graph construction is
very similar to what we've seen with ipython, in that it doesn't always
play nicely with the graph construction / graph execution distinction. I
wonder what changes to Beam might more generally support this. The
materialize stuff looks similar to some functionality in FlumeJava we used
to support multi-segment pipelines with some shared intermediate
PCollections.

On Thu, Jun 23, 2016 at 9:22 PM, Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Hi Neville,



thanks for the update !

As it's another language support, and to clearly identify the purpose, I
would say sdks/scala.

Regards
JB

On 06/23/2016 11:56 PM, Neville Li wrote:

+folks in my team

On Thu, Jun 23, 2016 at 5:57 PM Neville Li <neville@gmail.com> wrote:

Hi all,

I'm the co-author of Scio and am in the
progress of moving code to Beam (BEAM-302). Just wondering if
sdks/scala is the right place for this code or if something like dsls/scio
is a better choice? What do you think?

A little background: Scio was built as a high-level Scala API for Google
Cloud Dataflow (now also Apache Beam) and is heavily influenced by Spark
and Scalding. It wraps around the Dataflow/Beam Java SDK while also
providing features comparable to other Scala data frameworks. We use Scio
on Dataflow for production extensively inside Spotify.

Cheers,
Neville
Re: How to control the parallelism when run ParDo on PCollection?

2016-06-24 Thread Thomas Groh
We do also have an active JIRA issue to support limiting parallelism on a
per-step basis, BEAM-68

https://issues.apache.org/jira/browse/BEAM-68

As Kenn noted, this is not equivalent to controls over bundling, which is
entirely determined by the runner.

On Fri, Jun 24, 2016 at 1:25 PM, Shen Li  wrote:

> Hi Kenn,
>
> Thanks for the explanation.
>
> Regards,
>
> Shen
>
> On Fri, Jun 24, 2016 at 4:09 PM, Kenneth Knowles 
> wrote:
>
> > Hi Shen,
> >
> > It is completely up to the runner how to divide things into bundles: it
> is
> > one item of work that should fail or succeed atomically. Bundling limits
> > parallelism, but does not determine it. For example, a streaming
> execution
> > may have many bundles over time as elements arrive, regardless of
> > parallelism.
> >
> > Kenn
> >
> > On Fri, Jun 24, 2016 at 12:13 PM, Shen Li  wrote:
> >
> > > Hi,
> > >
> > > The document says "when a ParDo transform is executed, the elements of
> > the
> > > input PCollection are first divided up into some number of bundles".
> > >
> > > How do users control the number of bundles/parallelism? Or is it
> > completely
> > > up to the runner?
> > >
> > > Thanks,
> > >
> > > Shen
> > >
> >
>


Re: How to control the parallelism when run ParDo on PCollection?

2016-06-24 Thread Shen Li
Hi Kenn,

Thanks for the explanation.

Regards,

Shen

On Fri, Jun 24, 2016 at 4:09 PM, Kenneth Knowles 
wrote:

> Hi Shen,
>
> It is completely up to the runner how to divide things into bundles: it is
> one item of work that should fail or succeed atomically. Bundling limits
> parallelism, but does not determine it. For example, a streaming execution
> may have many bundles over time as elements arrive, regardless of
> parallelism.
>
> Kenn
>
> On Fri, Jun 24, 2016 at 12:13 PM, Shen Li  wrote:
>
> > Hi,
> >
> > The document says "when a ParDo transform is executed, the elements of
> the
> > input PCollection are first divided up into some number of bundles".
> >
> > How do users control the number of bundles/parallelism? Or is it
> completely
> > up to the runner?
> >
> > Thanks,
> >
> > Shen
> >
>


Re: Scala DSL

2016-06-24 Thread Lukasz Cwik
+1 for dsls/scio for the already listed reasons

On Fri, Jun 24, 2016 at 11:21 AM, Rafal Wojdyla 
wrote:

> Hello. When it comes to SDK vs DSL - I fully agree with Frances. About
> dsls/java/scio or dsls/scio - dsls/java/scio may cause confusion, scio is a
> scala DSL but lives under java directory (?) - that makes sense only once
> you get that scio is using java SDK under the hood. Thus, +1 to dsls/scio.
> - Rafal
>
> On Fri, Jun 24, 2016 at 2:01 PM, Kenneth Knowles 
> wrote:
>
> > My +1 goes to dsls/scio. It already has a cool name, so let's use it. And
> > there might be other Scala-based DSLs.
> >
> > On Fri, Jun 24, 2016 at 8:39 AM, Ismaël Mejía  wrote:
> >
> > > ​Hello everyone,
> > >
> > > Neville, thanks a lot for your contribution. Your work is amazing and I
> > am
> > > really happy that this scala integration is finally happening.
> > > Congratulations to you and your team.
> > >
> > > I *strongly* disagree about the DSL classification for scio for one
> > reason,
> > > if you go to the root of the term, Domain Specific Languages are about
> a
> > > domain, and the domain in this case is writing Beam pipelines, which
> is a
> > > really broad domain.
> > >
> > > I agree with Frances’ argument that scio is not an SDK e.g. it reuses
> the
> > > existing Beam java SDK. My proposition is that scio will be called the
> > > Scala API because in the end this is what it is. I think the confusion
> > > comes from the common definition of SDK which is normally an API + a
> > > Runtime. In this case scio will share the runtime with what we call the
> > > Beam Java SDK.
> > >
> > > One additional point of using the term API is that it sends the clear
> > > message that Beam has a Scala API too (which is good for visibility as
> JB
> > > mentioned).
> > >
> > > Regards,
> > > Ismaël​
> > >
> > >
> > > On Fri, Jun 24, 2016 at 5:08 PM, Jean-Baptiste Onofré wrote:
> > >
> > > > Hi Dan,
> > > >
> > > > fair enough.
> > > >
> > > > As I'm also working on new DSLs (XML, JSON), I already created the
> dsls
> > > > module.
> > > >
> > > > So, I would say dsls/scala.
> > > >
> > > > WDYT ?
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 06/24/2016 05:07 PM, Dan Halperin wrote:
> > > >
> > > >> I don't think that sdks/scala is the right place -- scio is not a
> Beam
> > > >> Scala SDK; it wraps the existing Java SDK.
> > > >>
> > > >> Some options:
> > > >> * sdks/java/extensions  (Scio builds on the Java SDK) -- mentally
> > vetoed
> > > >> since Scio isn't an extension for the Java SDK, but rather a wrapper
> > > >>
> > > >> * dsls/java/scio (Scio is a Beam DSL that uses the Java SDK)
> > > >> * dsls/scio  (Scio is a Beam DSL that could eventually use multiple
> > > SDKs)
> > > >> * extensions/java/scio  (Scio is an extension of Beam that uses the
> > Java
> > > >> SDK)
> > > >> * extensions/scio  (Scio is an extension of Beam that is not limited
> > to
> > > >> one
> > > >> SDK)
> > > >>
> > > >> I lean towards either dsls/java/scio or extensions/java/scio, since
> I
> > > >> don't
> > > >> think there are plans for Scio to handle multiple different SDKs (in
> > > >> different languages). The question between these two is whether we
> > think
> > > >> DSLs are "big enough" to be a top level concept.
> > > >>
> > > >> On Thu, Jun 23, 2016 at 11:05 PM, Jean-Baptiste Onofré <
> > j...@nanthrax.net
> > > >
> > > >> wrote:
> > > >>
> > > >> Good point about new Fn and the fact it's based on the Java SDK.
> > > >>>
> > > >>> It's just that in term of "marketing", it's a good message to
> > provide a
> > > >>> Scala SDK even if technically it's more a DSL.
> > > >>>
> > > >>> For instance, a valid "marketing" DSL would be a Java fluent DSL on
> > top
> > > >>> of
> > > >>> the Java SDK, or a declarative XML DSL.
> > > >>>
> > > >>> However, from a technical perspective, it can go into dsl module.
> > > >>>
> > > >>> My $0.02 ;)
> > > >>>
> > > >>> Regards
> > > >>> JB
> > > >>>
> > > >>>
> > > >>> On 06/24/2016 06:51 AM, Frances Perry wrote:
> > > >>>
> > > >>> +Rafal & Andrew again
> > > 
> > >  I am leaning DSL for two reasons: (1) scio uses the existing java
> > >  execution
> > >  environment (and won't have a language-specific fn harness of its
> > > own),
> > >  and
> > >  (2) it changes the abstractions that users interact with.
> > > 
> > >  I recently saw a scio repl demo from Reuven -- there's some really
> > > cool
> > >  stuff in there. I'd love to dive into it a bit more and see what
> can
> > > be
> > >  generalized beyond scio. The repl-like interactive graph
> > construction
> > > is
> > >  very similar to what we've seen with ipython, in that it doesn't
> > > always
> > >  play nicely with the graph construction / graph execution
> > > distinction. I
> > >  wonder what changes to Beam might more generally support this. The
> > >  materialize stuff looks similar to some functionality in FlumeJava
> > we
> > >  used
> > > >>

Re: How to control the parallelism when run ParDo on PCollection?

2016-06-24 Thread Kenneth Knowles
Hi Shen,

It is completely up to the runner how to divide things into bundles: it is
one item of work that should fail or succeed atomically. Bundling limits
parallelism, but does not determine it. For example, a streaming execution
may have many bundles over time as elements arrive, regardless of
parallelism.

Kenn

On Fri, Jun 24, 2016 at 12:13 PM, Shen Li  wrote:

> Hi,
>
> The document says "when a ParDo transform is executed, the elements of the
> input PCollection are first divided up into some number of bundles".
>
> How do users control the number of bundles/parallelism? Or is it completely
> up to the runner?
>
> Thanks,
>
> Shen
>
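Kenn's answer pins down a contract worth making concrete: the runner alone chooses how elements are grouped into bundles, and each bundle succeeds or fails as a unit. Below is a toy illustration of that contract, not the Beam API; the class and method names are invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Toy model of the bundling contract: the runner picks the grouping,
// and a bundle is the atomic unit of failure/retry, not the element.
class BundlingSketch {

    // The grouping into bundles is entirely the runner's choice; here we
    // use fixed-size chunks, but any partitioning would be legal.
    static <A> List<List<A>> intoBundles(List<A> elems, int bundleSize) {
        List<List<A>> bundles = new ArrayList<>();
        for (int i = 0; i < elems.size(); i += bundleSize) {
            bundles.add(new ArrayList<>(
                elems.subList(i, Math.min(i + bundleSize, elems.size()))));
        }
        return bundles;
    }

    // Process one bundle atomically: a single failing element fails the
    // whole bundle, and a runner would retry the bundle as a whole.
    static <A, B> Optional<List<B>> processBundle(List<A> bundle,
                                                  Function<A, B> fn) {
        try {
            List<B> out = new ArrayList<>();
            for (A a : bundle) {
                out.add(fn.apply(a));
            }
            return Optional.of(out);
        } catch (RuntimeException e) {
            return Optional.empty(); // whole bundle discarded for retry
        }
    }
}
```

Ten elements with a bundle size of 3 give four bundles, so at most four bundles could run concurrently: bundling caps parallelism but, as Kenn says, does not determine it, and a streaming runner could just as well emit many small bundles over time.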


How to control the parallelism when run ParDo on PCollection?

2016-06-24 Thread Shen Li
Hi,

The document says "when a ParDo transform is executed, the elements of the
input PCollection are first divided up into some number of bundles".

How do users control the number of bundles/parallelism? Or is it completely
up to the runner?

Thanks,

Shen


Re: What is the "Keyed State" in the capability matrix?

2016-06-24 Thread Shen Li
Hi Kenn,

Thanks for replying. I look forward to the technical docs.

Shen

On Fri, Jun 24, 2016 at 2:17 PM, Kenneth Knowles 
wrote:

> Hi Shen,
>
> The row refers to the ability for a DoFn in a ParDo to access per-key (and
> window) state cells that persist beyond the lifetime of an element or
> bundle. This is a feature that was in the later stages of design when the
> Beam code was donated. Hence it is a row in the graph, but even the Beam Model
> column says "no", pending a public design proposal & consensus. Most
> runners already have a similar capability at a low level; this feature
> refers to exposing it in a nice way for users.
>
> I have a design doc that I'm busily revising to make sense for the whole
> community. I will send the doc to this list and add it to our technical
> docs folder as soon as I can get it ready. You can follow BEAM-25 [1] if
> you like, too.
>
> Kenn
>
> [1] https://issues.apache.org/jira/browse/BEAM-25
>
>
> On Fri, Jun 24, 2016 at 10:56 AM, Shen Li  wrote:
>
> > Hi,
> >
> > There is a "Keyed State" row in the  "What is being computed" section of
> > the capability matrix. What does the "Keyed State" refer to? Is it a
> global
> > key-value store?
> >
> > (
> >
> >
> http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
> > )
> >
> > Thanks,
> >
> > Shen
> >
>


Re: Scala DSL

2016-06-24 Thread Rafal Wojdyla
Hello. When it comes to SDK vs DSL - I fully agree with Frances. About
dsls/java/scio or dsls/scio - dsls/java/scio may cause confusion, scio is a
scala DSL but lives under java directory (?) - that makes sense only once
you get that scio is using java SDK under the hood. Thus, +1 to dsls/scio.
- Rafal

On Fri, Jun 24, 2016 at 2:01 PM, Kenneth Knowles 
wrote:

> My +1 goes to dsls/scio. It already has a cool name, so let's use it. And
> there might be other Scala-based DSLs.
>
> On Fri, Jun 24, 2016 at 8:39 AM, Ismaël Mejía  wrote:
>
> > ​Hello everyone,
> >
> > Neville, thanks a lot for your contribution. Your work is amazing and I
> am
> > really happy that this scala integration is finally happening.
> > Congratulations to you and your team.
> >
> > I *strongly* disagree about the DSL classification for scio for one
> reason,
> > if you go to the root of the term, Domain Specific Languages are about a
> > domain, and the domain in this case is writing Beam pipelines, which is a
> > really broad domain.
> >
> > I agree with Frances’ argument that scio is not an SDK e.g. it reuses the
> > existing Beam java SDK. My proposition is that scio will be called the
> > Scala API because in the end this is what it is. I think the confusion
> > comes from the common definition of SDK which is normally an API + a
> > Runtime. In this case scio will share the runtime with what we call the
> > Beam Java SDK.
> >
> > One additional point of using the term API is that it sends the clear
> > message that Beam has a Scala API too (which is good for visibility as JB
> > mentioned).
> >
> > Regards,
> > Ismaël​
> >
> >
> > On Fri, Jun 24, 2016 at 5:08 PM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Hi Dan,
> > >
> > > fair enough.
> > >
> > > As I'm also working on new DSLs (XML, JSON), I already created the dsls
> > > module.
> > >
> > > So, I would say dsls/scala.
> > >
> > > WDYT ?
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 06/24/2016 05:07 PM, Dan Halperin wrote:
> > >
> > >> I don't think that sdks/scala is the right place -- scio is not a Beam
> > >> Scala SDK; it wraps the existing Java SDK.
> > >>
> > >> Some options:
> > >> * sdks/java/extensions  (Scio builds on the Java SDK) -- mentally
> vetoed
> > >> since Scio isn't an extension for the Java SDK, but rather a wrapper
> > >>
> > >> * dsls/java/scio (Scio is a Beam DSL that uses the Java SDK)
> > >> * dsls/scio  (Scio is a Beam DSL that could eventually use multiple
> > SDKs)
> > >> * extensions/java/scio  (Scio is an extension of Beam that uses the
> Java
> > >> SDK)
> > >> * extensions/scio  (Scio is an extension of Beam that is not limited
> to
> > >> one
> > >> SDK)
> > >>
> > >> I lean towards either dsls/java/scio or extensions/java/scio, since I
> > >> don't
> > >> think there are plans for Scio to handle multiple different SDKs (in
> > >> different languages). The question between these two is whether we
> think
> > >> DSLs are "big enough" to be a top level concept.
> > >>
> > >> On Thu, Jun 23, 2016 at 11:05 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> > >
> > >> wrote:
> > >>
> > >> Good point about new Fn and the fact it's based on the Java SDK.
> > >>>
> > >>> It's just that in term of "marketing", it's a good message to
> provide a
> > >>> Scala SDK even if technically it's more a DSL.
> > >>>
> > >>> For instance, a valid "marketing" DSL would be a Java fluent DSL on
> top
> > >>> of
> > >>> the Java SDK, or a declarative XML DSL.
> > >>>
> > >>> However, from a technical perspective, it can go into dsl module.
> > >>>
> > >>> My $0.02 ;)
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>>
> > >>> On 06/24/2016 06:51 AM, Frances Perry wrote:
> > >>>
> > >>> +Rafal & Andrew again
> > 
> >  I am leaning DSL for two reasons: (1) scio uses the existing java
> >  execution
> >  environment (and won't have a language-specific fn harness of its
> > own),
> >  and
> >  (2) it changes the abstractions that users interact with.
> > 
> >  I recently saw a scio repl demo from Reuven -- there's some really
> > cool
> >  stuff in there. I'd love to dive into it a bit more and see what can
> > be
> >  generalized beyond scio. The repl-like interactive graph
> construction
> > is
> >  very similar to what we've seen with ipython, in that it doesn't
> > always
> >  play nicely with the graph construction / graph execution
> > distinction. I
> >  wonder what changes to Beam might more generally support this. The
> >  materialize stuff looks similar to some functionality in FlumeJava
> we
> >  used
> >  to support multi-segment pipelines with some shared intermediate
> >  PCollections.
> > 
> >  On Thu, Jun 23, 2016 at 9:22 PM, Jean-Baptiste Onofré <
> > j...@nanthrax.net>
> >  wrote:
> > 
> >  Hi Neville,
> > 
> > >
> > > thanks for the update !
> > >
> > > As it's another language support, and to clearly identify the
> > purpose,
> > > I
> > >>

Re: What is the "Keyed State" in the capability matrix?

2016-06-24 Thread Kenneth Knowles
Hi Shen,

The row refers to the ability for a DoFn in a ParDo to access per-key (and
window) state cells that persist beyond the lifetime of an element or
bundle. This is a feature that was in the later stages of design when the
Beam code was donated. Hence it is a row in the graph, but even the Beam Model
column says "no", pending a public design proposal & consensus. Most
runners already have a similar capability at a low level; this feature
refers to exposing it in a nice way for users.
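As a rough illustration of the idea (not the Beam API, which as noted was still a design proposal at the time of this thread), a keyed-state cell behaves like a value addressed by the (key, window) pair that survives across elements and bundles. All names below are invented for this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of per-key-and-window state: each (key, window) pair gets its
// own cell, initialized to a zero value, persisting beyond any one bundle.
class KeyedStateSketch<K, W, V> {
    private final Map<List<Object>, V> cells = new HashMap<>();
    private final V zero;

    KeyedStateSketch(V zero) {
        this.zero = zero;
    }

    V read(K key, W window) {
        return cells.getOrDefault(Arrays.asList(key, window), zero);
    }

    void write(K key, W window, V value) {
        cells.put(Arrays.asList(key, window), value);
    }
}
```

A DoFn with access to such a cell could, for example, keep a running per-key count that accumulates across bundles, which is the kind of thing that is hard to express with ParDo alone.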

I have a design doc that I'm busily revising to make sense for the whole
community. I will send the doc to this list and add it to our technical
docs folder as soon as I can get it ready. You can follow BEAM-25 [1] if
you like, too.

Kenn

[1] https://issues.apache.org/jira/browse/BEAM-25


On Fri, Jun 24, 2016 at 10:56 AM, Shen Li  wrote:

> Hi,
>
> There is a "Keyed State" row in the  "What is being computed" section of
> the capability matrix. What does the "Keyed State" refer to? Is it a global
> key-value store?
>
> (
>
> http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
> )
>
> Thanks,
>
> Shen
>


Re: Scala DSL

2016-06-24 Thread Kenneth Knowles
My +1 goes to dsls/scio. It already has a cool name, so let's use it. And
there might be other Scala-based DSLs.

On Fri, Jun 24, 2016 at 8:39 AM, Ismaël Mejía  wrote:

> ​Hello everyone,
>
> Neville, thanks a lot for your contribution. Your work is amazing and I am
> really happy that this scala integration is finally happening.
> Congratulations to you and your team.
>
> I *strongly* disagree about the DSL classification for scio for one reason,
> if you go to the root of the term, Domain Specific Languages are about a
> domain, and the domain in this case is writing Beam pipelines, which is a
> really broad domain.
>
> I agree with Frances’ argument that scio is not an SDK e.g. it reuses the
> existing Beam java SDK. My proposition is that scio will be called the
> Scala API because in the end this is what it is. I think the confusion
> comes from the common definition of SDK which is normally an API + a
> Runtime. In this case scio will share the runtime with what we call the
> Beam Java SDK.
>
> One additional point of using the term API is that it sends the clear
> message that Beam has a Scala API too (which is good for visibility as JB
> mentioned).
>
> Regards,
> Ismaël​
>
>
> On Fri, Jun 24, 2016 at 5:08 PM, Jean-Baptiste Onofré 
> wrote:
>
> > Hi Dan,
> >
> > fair enough.
> >
> > As I'm also working on new DSLs (XML, JSON), I already created the dsls
> > module.
> >
> > So, I would say dsls/scala.
> >
> > WDYT ?
> >
> > Regards
> > JB
> >
> >
> > On 06/24/2016 05:07 PM, Dan Halperin wrote:
> >
> >> I don't think that sdks/scala is the right place -- scio is not a Beam
> >> Scala SDK; it wraps the existing Java SDK.
> >>
> >> Some options:
> >> * sdks/java/extensions  (Scio builds on the Java SDK) -- mentally vetoed
> >> since Scio isn't an extension for the Java SDK, but rather a wrapper
> >>
> >> * dsls/java/scio (Scio is a Beam DSL that uses the Java SDK)
> >> * dsls/scio  (Scio is a Beam DSL that could eventually use multiple
> SDKs)
> >> * extensions/java/scio  (Scio is an extension of Beam that uses the Java
> >> SDK)
> >> * extensions/scio  (Scio is an extension of Beam that is not limited to
> >> one
> >> SDK)
> >>
> >> I lean towards either dsls/java/scio or extensions/java/scio, since I
> >> don't
> >> think there are plans for Scio to handle multiple different SDKs (in
> >> different languages). The question between these two is whether we think
> >> DSLs are "big enough" to be a top level concept.
> >>
> >> On Thu, Jun 23, 2016 at 11:05 PM, Jean-Baptiste Onofré wrote:
> >>
> >> Good point about new Fn and the fact it's based on the Java SDK.
> >>>
> >>> It's just that in term of "marketing", it's a good message to provide a
> >>> Scala SDK even if technically it's more a DSL.
> >>>
> >>> For instance, a valid "marketing" DSL would be a Java fluent DSL on top
> >>> of
> >>> the Java SDK, or a declarative XML DSL.
> >>>
> >>> However, from a technical perspective, it can go into dsl module.
> >>>
> >>> My $0.02 ;)
> >>>
> >>> Regards
> >>> JB
> >>>
> >>>
> >>> On 06/24/2016 06:51 AM, Frances Perry wrote:
> >>>
> >>> +Rafal & Andrew again
> 
>  I am leaning DSL for two reasons: (1) scio uses the existing java
>  execution
>  environment (and won't have a language-specific fn harness of its
> own),
>  and
>  (2) it changes the abstractions that users interact with.
> 
>  I recently saw a scio repl demo from Reuven -- there's some really
> cool
>  stuff in there. I'd love to dive into it a bit more and see what can
> be
>  generalized beyond scio. The repl-like interactive graph construction
> is
>  very similar to what we've seen with ipython, in that it doesn't
> always
>  play nicely with the graph construction / graph execution
> distinction. I
>  wonder what changes to Beam might more generally support this. The
>  materialize stuff looks similar to some functionality in FlumeJava we
>  used
>  to support multi-segment pipelines with some shared intermediate
>  PCollections.
> 
>  On Thu, Jun 23, 2016 at 9:22 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net>
>  wrote:
> 
>  Hi Neville,
> 
> >
> > thanks for the update !
> >
> > As it's another language support, and to clearly identify the
> purpose,
> > I
> > would say sdks/scala.
> >
> > Regards
> > JB
> >
> >
> > On 06/23/2016 11:56 PM, Neville Li wrote:
> >
> > +folks in my team
> >
> >>
> >> On Thu, Jun 23, 2016 at 5:57 PM Neville Li 
> >> wrote:
> >>
> >> Hi all,
> >>
> >>
> >>> I'm the co-author of Scio  and am
> >>> in
> >>> the
> >>> progress of moving code to Beam (BEAM-302
> >>> ). Just wondering
> if
> >>> sdks/scala is the right place for this code or if something like
> >>> dsls/scio
> >>> is a better choice? What do you think?

What is the "Keyed State" in the capability matrix?

2016-06-24 Thread Shen Li
Hi,

There is a "Keyed State" row in the  "What is being computed" section of
the capability matrix. What does the "Keyed State" refer to? Is it a global
key-value store?

(
http://beam.incubator.apache.org/beam/capability/2016/03/17/capability-matrix.html
)

Thanks,

Shen


Re: Scala DSL

2016-06-24 Thread Ismaël Mejía
​Hello everyone,

Neville, thanks a lot for your contribution. Your work is amazing and I am
really happy that this scala integration is finally happening.
Congratulations to you and your team.

I *strongly* disagree about the DSL classification for scio for one reason,
if you go to the root of the term, Domain Specific Languages are about a
domain, and the domain in this case is writing Beam pipelines, which is a
really broad domain.

I agree with Frances’ argument that scio is not an SDK e.g. it reuses the
existing Beam java SDK. My proposition is that scio will be called the
Scala API because in the end this is what it is. I think the confusion
comes from the common definition of SDK which is normally an API + a
Runtime. In this case scio will share the runtime with what we call the
Beam Java SDK.

One additional point of using the term API is that it sends the clear
message that Beam has a Scala API too (which is good for visibility as JB
mentioned).

Regards,
Ismaël​


On Fri, Jun 24, 2016 at 5:08 PM, Jean-Baptiste Onofré 
wrote:

> Hi Dan,
>
> fair enough.
>
> As I'm also working on new DSLs (XML, JSON), I already created the dsls
> module.
>
> So, I would say dsls/scala.
>
> WDYT ?
>
> Regards
> JB
>
>
> On 06/24/2016 05:07 PM, Dan Halperin wrote:
>
>> I don't think that sdks/scala is the right place -- scio is not a Beam
>> Scala SDK; it wraps the existing Java SDK.
>>
>> Some options:
>> * sdks/java/extensions  (Scio builds on the Java SDK) -- mentally vetoed
>> since Scio isn't an extension for the Java SDK, but rather a wrapper
>>
>> * dsls/java/scio (Scio is a Beam DSL that uses the Java SDK)
>> * dsls/scio  (Scio is a Beam DSL that could eventually use multiple SDKs)
>> * extensions/java/scio  (Scio is an extension of Beam that uses the Java
>> SDK)
>> * extensions/scio  (Scio is an extension of Beam that is not limited to
>> one
>> SDK)
>>
>> I lean towards either dsls/java/scio or extensions/java/scio, since I
>> don't
>> think there are plans for Scio to handle multiple different SDKs (in
>> different languages). The question between these two is whether we think
>> DSLs are "big enough" to be a top level concept.
>>
>> On Thu, Jun 23, 2016 at 11:05 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Good point about new Fn and the fact it's based on the Java SDK.
>>>
>>> It's just that in term of "marketing", it's a good message to provide a
>>> Scala SDK even if technically it's more a DSL.
>>>
>>> For instance, a valid "marketing" DSL would be a Java fluent DSL on top
>>> of
>>> the Java SDK, or a declarative XML DSL.
>>>
>>> However, from a technical perspective, it can go into dsl module.
>>>
>>> My $0.02 ;)
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 06/24/2016 06:51 AM, Frances Perry wrote:
>>>
>>> +Rafal & Andrew again

 I am leaning DSL for two reasons: (1) scio uses the existing java
 execution
 environment (and won't have a language-specific fn harness of its own),
 and
 (2) it changes the abstractions that users interact with.

 I recently saw a scio repl demo from Reuven -- there's some really cool
 stuff in there. I'd love to dive into it a bit more and see what can be
 generalized beyond scio. The repl-like interactive graph construction is
 very similar to what we've seen with ipython, in that it doesn't always
 play nicely with the graph construction / graph execution distinction. I
 wonder what changes to Beam might more generally support this. The
 materialize stuff looks similar to some functionality in FlumeJava we
 used
 to support multi-segment pipelines with some shared intermediate
 PCollections.

 On Thu, Jun 23, 2016 at 9:22 PM, Jean-Baptiste Onofré 
 wrote:

 Hi Neville,

>
> thanks for the update !
>
> As it's another language support, and to clearly identify the purpose,
> I
> would say sdks/scala.
>
> Regards
> JB
>
>
> On 06/23/2016 11:56 PM, Neville Li wrote:
>
> +folks in my team
>
>>
>> On Thu, Jun 23, 2016 at 5:57 PM Neville Li 
>> wrote:
>>
>> Hi all,
>>
>>
>>> I'm the co-author of Scio  and am
>>> in
>>> the
>>> progress of moving code to Beam (BEAM-302
>>> ). Just wondering if
>>> sdks/scala is the right place for this code or if something like
>>> dsls/scio
>>> is a better choice? What do you think?
>>>
>>> A little background: Scio was built as a high-level Scala API for
>>> Google
>>> Cloud Dataflow (now also Apache Beam) and is heavily influenced by
>>> Spark
>>> and Scalding. It wraps around the Dataflow/Beam Java SDK while also
>>> providing features comparable to other Scala data frameworks. We use
>>> Scio
>>> on Dataflow for production extensively inside Spotify.
>>>
>>> Cheers,
>>> Neville
>>>
>>>
>>>
>>> --

Re: Scala DSL

2016-06-24 Thread Jean-Baptiste Onofré

Hi Dan,

fair enough.

As I'm also working on new DSLs (XML, JSON), I already created the dsls 
module.


So, I would say dsls/scala.

WDYT ?

Regards
JB

On 06/24/2016 05:07 PM, Dan Halperin wrote:

I don't think that sdks/scala is the right place -- scio is not a Beam
Scala SDK; it wraps the existing Java SDK.

Some options:
* sdks/java/extensions  (Scio builds on the Java SDK) -- mentally vetoed
since Scio isn't an extension for the Java SDK, but rather a wrapper

* dsls/java/scio (Scio is a Beam DSL that uses the Java SDK)
* dsls/scio  (Scio is a Beam DSL that could eventually use multiple SDKs)
* extensions/java/scio  (Scio is an extension of Beam that uses the Java
SDK)
* extensions/scio  (Scio is an extension of Beam that is not limited to one
SDK)

I lean towards either dsls/java/scio or extensions/java/scio, since I don't
think there are plans for Scio to handle multiple different SDKs (in
different languages). The question between these two is whether we think
DSLs are "big enough" to be a top level concept.

On Thu, Jun 23, 2016 at 11:05 PM, Jean-Baptiste Onofré 
wrote:


Good point about new Fn and the fact it's based on the Java SDK.

It's just that in term of "marketing", it's a good message to provide a
Scala SDK even if technically it's more a DSL.

For instance, a valid "marketing" DSL would be a Java fluent DSL on top of
the Java SDK, or a declarative XML DSL.

However, from a technical perspective, it can go into dsl module.

My $0.02 ;)

Regards
JB


On 06/24/2016 06:51 AM, Frances Perry wrote:


+Rafal & Andrew again

I am leaning DSL for two reasons: (1) scio uses the existing java
execution
environment (and won't have a language-specific fn harness of its own),
and
(2) it changes the abstractions that users interact with.

I recently saw a scio repl demo from Reuven -- there's some really cool
stuff in there. I'd love to dive into it a bit more and see what can be
generalized beyond scio. The repl-like interactive graph construction is
very similar to what we've seen with ipython, in that it doesn't always
play nicely with the graph construction / graph execution distinction. I
wonder what changes to Beam might more generally support this. The
materialize stuff looks similar to some functionality in FlumeJava we used
to support multi-segment pipelines with some shared intermediate
PCollections.

On Thu, Jun 23, 2016 at 9:22 PM, Jean-Baptiste Onofré 
wrote:

Hi Neville,


thanks for the update !

As it's another language support, and to clearly identify the purpose, I
would say sdks/scala.

Regards
JB


On 06/23/2016 11:56 PM, Neville Li wrote:

+folks in my team


On Thu, Jun 23, 2016 at 5:57 PM Neville Li 
wrote:

Hi all,



I'm the co-author of Scio  and am in
the
progress of moving code to Beam (BEAM-302
). Just wondering if
sdks/scala is the right place for this code or if something like
dsls/scio
is a better choice? What do you think?

A little background: Scio was built as a high-level Scala API for
Google
Cloud Dataflow (now also Apache Beam) and is heavily influenced by
Spark
and Scalding. It wraps around the Dataflow/Beam Java SDK while also
providing features comparable to other Scala data frameworks. We use
Scio
on Dataflow for production extensively inside Spotify.

Cheers,
Neville




--

Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Scala DSL

2016-06-24 Thread Dan Halperin
I don't think that sdks/scala is the right place -- scio is not a Beam
Scala SDK; it wraps the existing Java SDK.

Some options:
* sdks/java/extensions  (Scio builds on the Java SDK) -- mentally vetoed
since Scio isn't an extension for the Java SDK, but rather a wrapper

* dsls/java/scio (Scio is a Beam DSL that uses the Java SDK)
* dsls/scio  (Scio is a Beam DSL that could eventually use multiple SDKs)
* extensions/java/scio  (Scio is an extension of Beam that uses the Java
SDK)
* extensions/scio  (Scio is an extension of Beam that is not limited to one
SDK)

I lean towards either dsls/java/scio or extensions/java/scio, since I don't
think there are plans for Scio to handle multiple different SDKs (in
different languages). The question between these two is whether we think
DSLs are "big enough" to be a top level concept.

On Thu, Jun 23, 2016 at 11:05 PM, Jean-Baptiste Onofré 
wrote:

> Good point about new Fn and the fact it's based on the Java SDK.
>
> It's just that, in terms of "marketing", it's a good message to provide a
> Scala SDK even if technically it's more of a DSL.
>
> For instance, a valid "marketing" DSL would be a Java fluent DSL on top of
> the Java SDK, or a declarative XML DSL.
>
> However, from a technical perspective, it can go into the dsl module.
>
> My $0.02 ;)
>
> Regards
> JB
>
>
> On 06/24/2016 06:51 AM, Frances Perry wrote:
>
>> +Rafal & Andrew again
>>
>> I am leaning DSL for two reasons: (1) scio uses the existing Java
>> execution environment (and won't have a language-specific fn harness of
>> its own), and (2) it changes the abstractions that users interact with.
>>
>> I recently saw a scio repl demo from Reuven -- there's some really cool
>> stuff in there. I'd love to dive into it a bit more and see what can be
>> generalized beyond scio. The repl-like interactive graph construction is
>> very similar to what we've seen with ipython, in that it doesn't always
>> play nicely with the graph construction / graph execution distinction. I
>> wonder what changes to Beam might more generally support this. The
>> materialize stuff looks similar to some functionality in FlumeJava we used
>> to support multi-segment pipelines with some shared intermediate
>> PCollections.
>>
>> On Thu, Jun 23, 2016 at 9:22 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>> Hi Neville,
>>>
>>> thanks for the update !
>>>
>>> As it's another language support, and to clearly identify the purpose, I
>>> would say sdks/scala.
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 06/23/2016 11:56 PM, Neville Li wrote:
>>>
>>> +folks in my team

 On Thu, Jun 23, 2016 at 5:57 PM Neville Li 
 wrote:

 Hi all,

>
> I'm the co-author of Scio and am in the process of moving code to Beam
> (BEAM-302). Just wondering if sdks/scala is the right place for this code,
> or if something like dsls/scio is a better choice? What do you think?
>
> A little background: Scio was built as a high-level Scala API for Google
> Cloud Dataflow (now also Apache Beam) and is heavily influenced by Spark
> and Scalding. It wraps around the Dataflow/Beam Java SDK while also
> providing features comparable to other Scala data frameworks. We use Scio
> extensively in production on Dataflow inside Spotify.
>
> Cheers,
> Neville
>
>
>
 --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>