Re: Beam Fn API

2017-05-31 Thread Robert Bradshaw
Thank! Looks good. I've added some comments to the doc.

On Wed, May 31, 2017 at 7:00 AM, Etienne Chauchot 
wrote:

> Thanks for all these docs! They are exactly what was needed for new
> contributors as discussed in this thread
>
> https://lists.apache.org/thread.html/ac93d29424e19d57097373b
> 78f3f5bcbc701e4b51385a52a6e27b7ed@%3Cdev.beam.apache.org%3E
>
> Etienne
>
>
> Le 31/05/2017 à 11:12, Aljoscha Krettek a écrit :
>
>> Thanks for banging these out Lukasz. I’ll try and read them all this week.
>>
>> We’re also planning to add support for the Fn API to the Flink Runner so
>> that we can execute Python programs. I’m sure we’ll get some valuable
>> feedback for you while doing that.
>>
>> On 26. May 2017, at 22:49, Lukasz Cwik  wrote:
>>>
>>> I would like to share another document about the Fn API. This document
>>> specifically discusses how to access side inputs, access remote
>>> references
>>> (e.g. large iterables for hot keys produced by a GBK), and support user
>>> state.
>>> https://s.apache.org/beam-fn-state-api-and-bundle-processing
>>>
>>> The document does require a strong foundation in the Apache Beam model
>>> and
>>> a good understanding of the prior shared docs:
>>> * How to process a bundle: https://s.apache.org/beam-fn-api
>>> -processing-a-bundle
>>> * How to send and receive data: https://s.apache.org/beam-fn-api
>>> -send-and-receive-data
>>>
>>> I could really use the help of runner contributors to review the caching
>>> semantics within the SDK harness and whether they would work well for the
>>> runner they contribute to the most.
>>>
>>> On Sun, May 21, 2017 at 6:40 PM, Lukasz Cwik  wrote:
>>>
>>> Manu, the goal is to share here initially, update the docs addressing
 people's comments, and then publish them on the website once they are
 stable enough.

 On Sun, May 21, 2017 at 5:54 PM, Manu Zhang 
 wrote:

 Thanks Lukasz. The following two links were somehow incorrectly
> formatted
> in your mail.
>
> * How to process a bundle:
> https://s.apache.org/beam-fn-api-processing-a-bundle
> * How to send and receive data:
> https://s.apache.org/beam-fn-api-send-and-receive-data
>
> By the way, is there a way to find them from the Beam website ?
>
>
> On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik 
> wrote:
>
> Now that I'm back from vacation and the 2.0.0 release is not taking all
>>
> my
>
>> time, I am focusing my attention on working on the Beam Portability
>> framework, specifically the Fn API so that we can get Python and other
>> language integrations work with any runner.
>>
>> For new comers, I would like to reshare the overview:
>> https://s.apache.org/beam-fn-api
>>
>> And for those of you who have been following this thread and
>>
> contributors
>
>> focusing on Runner integration with Apache Beam:
>> * How to process a bundle: https://s.apache.org/beam-fn-a
>>
> pi-processing-a-
>
>> bundle
>> * How to send and receive data: https://s.apache.org/
>> beam-fn-api-send-and-receive-data
>>
>> If you want to dive deeper, you should look at:
>> * Runner API Protobuf: https://github.com/apache/beam
>> /blob/master/sdks/
>> common/runner-api/src/main/proto/beam_runner_api.proto
>> >
> er-api/src/main/proto/beam_runner_api.proto>
>
>> * Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/
>> common/fn-api/src/main/proto/beam_fn_api.proto
>> >
> api/src/main/proto/beam_fn_api.proto>
>
>> * Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/
>> java/harness
>> 
>> * Python SDK Harness: https://github.com/apache/beam
>> /tree/master/sdks/
>> python/apache_beam/runners/worker
>> >
> he_beam/runners/worker>
>
>> Next I'm planning on talking about Beam Fn State API and will need
>> help
>> from Runner contributors to talk about caching semantics and key
>> spaces
>>
> and
>
>> whether the integrations mesh well with current Runner
>> implementations.
>>
> The
>
>> State API is meant to support user state, side inputs, and
>> re-iteration
>>
> for
>
>> large values produced by GroupByKey.
>>
>> On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik 
>> wrote:
>>
>> Yes, I was using a Pipeline that was:
>>> Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a
>>>
>> batch
>
>> pipeline in the global window using the default trigger)
>>>
>>> In Google Cloud Dataflow, the shuffle step uses the binary
>>>
>> representation

Re: Beam Fn API

2017-05-31 Thread Etienne Chauchot
Thanks for all these docs! They are exactly what was needed for new 
contributors as discussed in this thread


https://lists.apache.org/thread.html/ac93d29424e19d57097373b78f3f5bcbc701e4b51385a52a6e27b7ed@%3Cdev.beam.apache.org%3E

Etienne

Le 31/05/2017 à 11:12, Aljoscha Krettek a écrit :

Thanks for banging these out Lukasz. I’ll try and read them all this week.

We’re also planning to add support for the Fn API to the Flink Runner so that 
we can execute Python programs. I’m sure we’ll get some valuable feedback for 
you while doing that.


On 26. May 2017, at 22:49, Lukasz Cwik  wrote:

I would like to share another document about the Fn API. This document
specifically discusses how to access side inputs, access remote references
(e.g. large iterables for hot keys produced by a GBK), and support user
state.
https://s.apache.org/beam-fn-state-api-and-bundle-processing

The document does require a strong foundation in the Apache Beam model and
a good understanding of the prior shared docs:
* How to process a bundle: https://s.apache.org/beam-fn-api
-processing-a-bundle
* How to send and receive data: https://s.apache.org/beam-fn-api
-send-and-receive-data

I could really use the help of runner contributors to review the caching
semantics within the SDK harness and whether they would work well for the
runner they contribute to the most.

On Sun, May 21, 2017 at 6:40 PM, Lukasz Cwik  wrote:


Manu, the goal is to share here initially, update the docs addressing
people's comments, and then publish them on the website once they are
stable enough.

On Sun, May 21, 2017 at 5:54 PM, Manu Zhang 
wrote:


Thanks Lukasz. The following two links were somehow incorrectly formatted
in your mail.

* How to process a bundle:
https://s.apache.org/beam-fn-api-processing-a-bundle
* How to send and receive data:
https://s.apache.org/beam-fn-api-send-and-receive-data

By the way, is there a way to find them from the Beam website ?


On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik 
wrote:


Now that I'm back from vacation and the 2.0.0 release is not taking all

my

time, I am focusing my attention on working on the Beam Portability
framework, specifically the Fn API so that we can get Python and other
language integrations work with any runner.

For new comers, I would like to reshare the overview:
https://s.apache.org/beam-fn-api

And for those of you who have been following this thread and

contributors

focusing on Runner integration with Apache Beam:
* How to process a bundle: https://s.apache.org/beam-fn-a

pi-processing-a-

bundle
* How to send and receive data: https://s.apache.org/
beam-fn-api-send-and-receive-data

If you want to dive deeper, you should look at:
* Runner API Protobuf: https://github.com/apache/beam/blob/master/sdks/
common/runner-api/src/main/proto/beam_runner_api.proto


* Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/
common/fn-api/src/main/proto/beam_fn_api.proto


* Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/
java/harness

* Python SDK Harness: https://github.com/apache/beam/tree/master/sdks/
python/apache_beam/runners/worker


Next I'm planning on talking about Beam Fn State API and will need help
from Runner contributors to talk about caching semantics and key spaces

and

whether the integrations mesh well with current Runner implementations.

The

State API is meant to support user state, side inputs, and re-iteration

for

large values produced by GroupByKey.

On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik  wrote:


Yes, I was using a Pipeline that was:
Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a

batch

pipeline in the global window using the default trigger)

In Google Cloud Dataflow, the shuffle step uses the binary

representation

to compare keys, so the above pipeline would normally be converted to

the

following two stages:
Read -> GBK Writer
GBK Reader -> IdentityParDo

Note that the GBK Writer and GBK Reader need to use a coder to encode

and

decode the value.

When using the Fn API, those two stages expanded because of the Fn Api
crossings using a gRPC Write/Read pair:
Read -> gRPC Write -> gRPC Read -> GBK Writer
GBK Reader -> gRPC Write -> gRPC Read -> IdentityParDo

In my naive prototype implementation, the coder was used to encode
elements at the gRPC steps. This meant that the coder was
encoding/decoding/encoding in the first stage and
decoding/encoding/decoding in the second stage. This tripled the

amount

of

times the coder was being invoked per element. This additional use of

the

coder accounted for ~12% (80% of the 15%) of the extra execution time.

This

implem

Re: Beam Fn API

2017-05-31 Thread Aljoscha Krettek
Thanks for banging these out Lukasz. I’ll try and read them all this week.

We’re also planning to add support for the Fn API to the Flink Runner so that 
we can execute Python programs. I’m sure we’ll get some valuable feedback for 
you while doing that.

> On 26. May 2017, at 22:49, Lukasz Cwik  wrote:
> 
> I would like to share another document about the Fn API. This document
> specifically discusses how to access side inputs, access remote references
> (e.g. large iterables for hot keys produced by a GBK), and support user
> state.
> https://s.apache.org/beam-fn-state-api-and-bundle-processing
> 
> The document does require a strong foundation in the Apache Beam model and
> a good understanding of the prior shared docs:
> * How to process a bundle: https://s.apache.org/beam-fn-api
> -processing-a-bundle
> * How to send and receive data: https://s.apache.org/beam-fn-api
> -send-and-receive-data
> 
> I could really use the help of runner contributors to review the caching
> semantics within the SDK harness and whether they would work well for the
> runner they contribute to the most.
> 
> On Sun, May 21, 2017 at 6:40 PM, Lukasz Cwik  wrote:
> 
>> Manu, the goal is to share here initially, update the docs addressing
>> people's comments, and then publish them on the website once they are
>> stable enough.
>> 
>> On Sun, May 21, 2017 at 5:54 PM, Manu Zhang 
>> wrote:
>> 
>>> Thanks Lukasz. The following two links were somehow incorrectly formatted
>>> in your mail.
>>> 
>>> * How to process a bundle:
>>> https://s.apache.org/beam-fn-api-processing-a-bundle
>>> * How to send and receive data:
>>> https://s.apache.org/beam-fn-api-send-and-receive-data
>>> 
>>> By the way, is there a way to find them from the Beam website ?
>>> 
>>> 
>>> On Fri, May 19, 2017 at 6:44 AM Lukasz Cwik 
>>> wrote:
>>> 
 Now that I'm back from vacation and the 2.0.0 release is not taking all
>>> my
 time, I am focusing my attention on working on the Beam Portability
 framework, specifically the Fn API so that we can get Python and other
 language integrations work with any runner.
 
 For new comers, I would like to reshare the overview:
 https://s.apache.org/beam-fn-api
 
 And for those of you who have been following this thread and
>>> contributors
 focusing on Runner integration with Apache Beam:
 * How to process a bundle: https://s.apache.org/beam-fn-a
>>> pi-processing-a-
 bundle
 * How to send and receive data: https://s.apache.org/
 beam-fn-api-send-and-receive-data
 
 If you want to dive deeper, you should look at:
 * Runner API Protobuf: https://github.com/apache/beam/blob/master/sdks/
 common/runner-api/src/main/proto/beam_runner_api.proto
 >> er-api/src/main/proto/beam_runner_api.proto>
 * Fn API Protobuf: https://github.com/apache/beam/blob/master/sdks/
 common/fn-api/src/main/proto/beam_fn_api.proto
 >> api/src/main/proto/beam_fn_api.proto>
 * Java SDK Harness: https://github.com/apache/beam/tree/master/sdks/
 java/harness
 
 * Python SDK Harness: https://github.com/apache/beam/tree/master/sdks/
 python/apache_beam/runners/worker
 >> he_beam/runners/worker>
 
 Next I'm planning on talking about Beam Fn State API and will need help
 from Runner contributors to talk about caching semantics and key spaces
>>> and
 whether the integrations mesh well with current Runner implementations.
>>> The
 State API is meant to support user state, side inputs, and re-iteration
>>> for
 large values produced by GroupByKey.
 
 On Tue, Jan 24, 2017 at 9:46 AM, Lukasz Cwik  wrote:
 
> Yes, I was using a Pipeline that was:
> Read(10 GiBs of KV (10,000,000 values)) -> GBK -> IdentityParDo (a
>>> batch
> pipeline in the global window using the default trigger)
> 
> In Google Cloud Dataflow, the shuffle step uses the binary
>>> representation
> to compare keys, so the above pipeline would normally be converted to
>>> the
> following two stages:
> Read -> GBK Writer
> GBK Reader -> IdentityParDo
> 
> Note that the GBK Writer and GBK Reader need to use a coder to encode
>>> and
> decode the value.
> 
> When using the Fn API, those two stages expanded because of the Fn Api
> crossings using a gRPC Write/Read pair:
> Read -> gRPC Write -> gRPC Read -> GBK Writer
> GBK Reader -> gRPC Write -> gRPC Read -> IdentityParDo
> 
> In my naive prototype implementation, the coder was used to encode
> elements at the gRPC steps. This meant that the coder was
> encoding/decoding/encoding in the first stage and
> decoding/encoding/decoding in the second stage. This tripled the
>>> amou

Re: Precommit Jenkins Linkage Broken

2017-05-31 Thread Jean-Baptiste Onofré

Hi Ted,

I changed to "Critical".

Regards
JB

On 05/31/2017 12:00 AM, Ted Yu wrote:

INFRA-14247 is currently marked Major.

Suggest raising the priority so that it gets more attention.

Cheers

On Tue, May 30, 2017 at 2:59 PM, Jason Kuster <
jasonkus...@google.com.invalid> wrote:


Hey folks,

Just wanted to mention on the dev list that Jenkins precommit breakage is a
known issue and has been escalated to Infra (thanks JB!)[1]. I'm monitoring
the issue and will ping back here with any updates and when it starts
working again.

Best,

Jason

[1] https://issues.apache.org/jira/browse/INFRA-14247

--
---
Jason Kuster
Apache Beam / Google Cloud Dataflow





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Build failed in Jenkins: beam_Release_NightlySnapshot #433

2017-05-31 Thread Apache Jenkins Server
See 


Changes:

[aljoscha.krettek] [BEAM-2380] Forward additional outputs to DoFnRunner in 
Flink Batch

[aljoscha.krettek] Fix flushing of pushed-back data in Flink Runner on +Inf 
watermark

[dhalperi] fix javadoc of View

[jbonofre] [BEAM-2379] Avoid reading projectId from environment variable in 
tests.

[lcwik] [BEAM-1347] Remove the usage of a thread local on a potentially hot path

[kirpichov] [BEAM-2248] KafkaIO support to use start read time to set start 
offset

[altay] Clean up pyc files before running tests

[lcwik] [BEAM-1544] Java cross-JDK version tests on Jenkins

--
[...truncated 3.28 MB...]
2017-05-31T08:06:55.439 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-streaming-java_2.10/1.2.1/flink-streaming-java_2.10-1.2.1.jar
2017-05-31T08:06:55.446 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/com/twitter/chill-java/0.7.4/chill-java-0.7.4.jar
 (50 KB at 59.0 KB/sec)
2017-05-31T08:06:55.446 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/sling/org.apache.sling.commons.json/2.0.6/org.apache.sling.commons.json-2.0.6.jar
2017-05-31T08:06:55.450 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/com/twitter/chill_2.10/0.7.4/chill_2.10-0.7.4.jar
 (217 KB at 258.1 KB/sec)
2017-05-31T08:06:55.450 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-core/1.2.1/flink-core-1.2.1-tests.jar
2017-05-31T08:06:55.483 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/sling/org.apache.sling.commons.json/2.0.6/org.apache.sling.commons.json-2.0.6.jar
 (47 KB at 53.7 KB/sec)
2017-05-31T08:06:55.484 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-runtime_2.10/1.2.1/flink-runtime_2.10-1.2.1-tests.jar
2017-05-31T08:06:55.530 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/scala-lang/scala-library/2.10.4/scala-library-2.10.4.jar
 (6960 KB at 7589.3 KB/sec)
2017-05-31T08:06:55.530 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-streaming-java_2.10/1.2.1/flink-streaming-java_2.10-1.2.1-tests.jar
2017-05-31T08:06:55.553 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-core/1.2.1/flink-core-1.2.1-tests.jar
 (716 KB at 761.5 KB/sec)
2017-05-31T08:06:55.553 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-test-utils_2.10/1.2.1/flink-test-utils_2.10-1.2.1.jar
2017-05-31T08:06:55.627 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-streaming-java_2.10/1.2.1/flink-streaming-java_2.10-1.2.1-tests.jar
 (953 KB at 939.7 KB/sec)
2017-05-31T08:06:55.628 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-test-utils-junit/1.2.1/flink-test-utils-junit-1.2.1.jar
2017-05-31T08:06:55.656 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-test-utils-junit/1.2.1/flink-test-utils-junit-1.2.1.jar
 (24 KB at 22.1 KB/sec)
2017-05-31T08:06:55.656 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/curator/curator-test/2.8.0/curator-test-2.8.0.jar
2017-05-31T08:06:55.686 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/curator/curator-test/2.8.0/curator-test-2.8.0.jar
 (39 KB at 36.3 KB/sec)
2017-05-31T08:06:55.705 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-runtime_2.10/1.2.1/flink-runtime_2.10-1.2.1-tests.jar
 (2432 KB at 2226.8 KB/sec)
2017-05-31T08:06:55.709 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-streaming-java_2.10/1.2.1/flink-streaming-java_2.10-1.2.1.jar
 (3035 KB at 2768.4 KB/sec)
2017-05-31T08:06:55.785 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-test-utils_2.10/1.2.1/flink-test-utils_2.10-1.2.1.jar
 (2366 KB at 2018.5 KB/sec)
2017-05-31T08:06:56.075 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/flink/flink-shaded-hadoop2/1.2.1/flink-shaded-hadoop2-1.2.1.jar
 (17860 KB at 12215.7 KB/sec)
2017-05-31T08:06:56.087 [INFO] 
2017-05-31T08:06:56.087 [INFO] --- maven-clean-plugin:3.0.0:clean 
(default-clean) @ beam-runners-flink_2.10 ---
2017-05-31T08:06:56.089 [INFO] Deleting 
 
(includes = [**/*.pyc, **/*.egg-info/, **/sdks/python/LICENSE, 
**/sdks/python/NOTICE, **/sdks/python/README.md], excludes = [])
2017-05-31T08:06:56.142 [INFO] 
2017-05-31T08:06:56.142 [INFO] --- maven-enforcer-plugin:1.4.1:enforce 
(enforce) @ beam-runners-flink_2.10 ---
2017-05-31T08:06:58.286 [INFO] 
2017-05-31T08:06:58.286 [INFO] --- maven-enforcer-plugin:1.4.1:enforce 
(enforce-banned-dependencies) @ beam-runners-flink_2.10 ---
2017-05-31T08:06:58.359 [INFO] 
2017-05-31T08:06:58.359 [INFO] --- maven-remote-resources-plugin:1.5:process 
(process