Re: Artifact staging in cross-language pipelines

2019-04-19 Thread Chamikara Jayalath
OK, sounds like this is a good path forward then.

* When starting up the expansion service, the user (who starts up the service)
provides the dependencies necessary to expand transforms. We will later add
support for adding new transforms to an already running expansion service.
* As part of the transform configuration, transform authors have the option of
providing a list of dependencies that will be needed to run the transform.
* These dependencies will be sent back to the pipeline SDK as part of the
expansion response, and the pipeline SDK will stage these resources.
* Pipeline authors have the option of specifying the dependencies using a
pipeline option (for example, https://github.com/apache/beam/pull/8340).

I think the last option is important to (1) make existing transforms easily
available for cross-language usage without additional configuration and (2)
allow pipeline authors to override dependency versions specified in the
transform configuration (for example, to apply security patches) without
updating the expansion service.
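
For illustration, a rough sketch of what that last option could look like from
the pipeline author's side, assuming a --jar_package style option along the
lines of https://github.com/apache/beam/pull/8340 (the exact option name and
staging semantics are still open in that PR, so this is not a final API):

# Hedged sketch only: the option name/semantics follow the --jar_package idea
# from https://github.com/apache/beam/pull/8340 and may differ in the end.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',
    # An extra jar the SDK should stage so the expanded (Java) transform can run.
    '--jar_package=/path/to/external-transform-deps.jar',
])
# Any cross-language transform expanded via the expansion service would then
# have this jar staged alongside the pipeline's own artifacts.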

There are a few unanswered questions though.
(1) In what form will a transform author specify dependencies? For
example, a URL to a Maven repo, a URL to a local file, a blob?
(2) How will dependencies be included in the expansion response proto?
String (URL), bytes (blob)?
(3) How will we manage/share transitive dependencies required at runtime?
(4) How will dependencies be staged for various runner/SDK combinations?
(for example, portable runner/Flink, Dataflow runner)

Thanks,
Cham

On Fri, Apr 19, 2019 at 4:49 AM Maximilian Michels  wrote:

> Thank you for your replies.
>
> I did not suggest that the Expansion Service does the staging, but it
> would return the required resources (e.g. jars) for the external
> transform's runtime environment. The client then has to take care of
> staging the resources.
>
> The Expansion Service itself also needs resources to do the expansion. I
> assumed those to be provided when starting the expansion service. I
> consider it less important but we could also provide a way to add new
> transforms to the Expansion Service after startup.
>
> Good point on Docker vs externally provided environments. For the PR [1]
> it will suffice then to add Kafka to the container dependencies. The
> "--jar_package" pipeline option is ok for now but I'd like to see work
> towards staging resources for external transforms via information
> returned by the Expansion Service. That avoids users having to take care
> of including the correct jars in their pipeline options.
>
> These issues are related and we could discuss them in separate threads:
>
> * Auto-discovery of Expansion Service and its external transforms
> * Credentials required during expansion / runtime
>
> Thanks,
> Max
>
> [1] https://github.com/apache/beam/pull/8322
>
> On 19.04.19 07:35, Thomas Weise wrote:
> > Good discussion :)
> >
> > Initially the expansion service was considered a user responsibility,
> > but I think that isn't necessarily the case. I can also see the
> > expansion service provided as part of the infrastructure and the user
> > not wanting to deal with it at all. For example, users may want to write
> > Python transforms and use external IOs, without being concerned how
> > these IOs are provided. Under such a scenario it would be good if:
> >
> > * Expansion service(s) can be auto-discovered via the job service
> endpoint
> > * Available external transforms can be discovered via the expansion
> > service(s)
> > * Dependencies for external transforms are part of the metadata returned
> > by expansion service
> >
> > Dependencies could then be staged either by the SDK client or the
> > expansion service. The expansion service could provide the locations to
> > stage to the SDK, it would still be transparent to the user.
> >
> > I also agree with Luke regarding the environments. Docker is the choice
> > for generic deployment. Other environments are used when the flexibility
> > offered by Docker isn't needed (or gets into the way). Then the
> > dependencies are provided in different ways. Whether these are Python
> > packages or jar files, by opting out of Docker the decision is made to
> > manage dependencies externally.
> >
> > Thomas
> >
> >
> > On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath  > > wrote:
> >
> >
> >
> > On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath <chamik...@google.com> wrote:
> >
> > Thanks for raising the concern about credentials Ankur, I agree
> > that this is a significant issue.
> >
> > On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik  > > wrote:
> >
> > I can understand the concern about credentials, the same
> > access concern will exist for several cross language
> > transforms (mostly IOs) since some will need access to
> > credentials to read/write to an external service.
> >
> > Are there any ideas 

Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Pablo Estrada
Aah that makes more sense... I'll try that out. Thanks!

On Fri, Apr 19, 2019 at 4:12 PM Kenneth Knowles  wrote:

> Oh, wait I didn't even read the pipeline well. You don't have a GBK so
> triggers don't do anything. They only apply to aggregations. Since it is
> just a ParDo the elements flow right through and your results are expected.
> If you did have a GBK then you would have this:
>
> Expected: [ ['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5', '6', '7',
> '8', '9', '10'] ]
> Actual: [ ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'] ]
>
> Where both outer lists are PCollections, hence could be reordered, and
> both inner lists are also in an undefined order. They have a pane index
> that says their logical order but they can be reordered. It is unusual and
> runner-dependent but best to check PCollection contents without
> order-dependence.
>
> Do you have PAssert for these sorts of checks?
>
> Kenn
>
> On Fri, Apr 19, 2019 at 4:02 PM Pablo Estrada  wrote:
>
>> created https://jira.apache.org/jira/browse/BEAM-7122
>> Best
>> -P.
>>
>> On Fri, Apr 19, 2019 at 3:50 PM Pablo Estrada  wrote:
>>
>>> Ah sorry for the lack of clarification. Each element appears only once in
>>> the final output. The failure is:
>>>
>>> ==
 FAIL: test_multiple_accumulating_firings
 (apache_beam.transforms.trigger_test.TriggerPipelineTest)
 --
 Traceback (most recent call last):
   File "apache_beam/transforms/trigger_test.py", line 491, in
 test_multiple_accumulating_firings
 TriggerPipelineTest.all_records)
 AssertionError: Lists differ: ['1', '2', '3', '4', '5', '1',... !=
 ['1', '2', '3', '4', '5', '6',...

>>> [...other output...]
>>>
>>> (expected is:)
>>>
 - ['1', '2', '3', '4', '5', '1', '2', '3', '4', '5', '6', '7', '8',
 '9', '10']
 ?   -

>>> (actual is:)
>>>
 + ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
 --
>>>
>>>
>>>
>>>
>>> On Fri, Apr 19, 2019 at 3:45 PM Kenneth Knowles  wrote:
>>>
 What is the behavior you are seeing?

 Kenn

 On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:

>
>
> On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada 
> wrote:
>
>> Hello all,
>> I've been slowly learning a bit about life in streaming, with state,
>> timers, triggers, etc.
>>
>> The other day, I tried out a trigger pipeline that did not have the
>> behavior that I was expecting, and I am looking for feedback on whether 
>> I'm
>> missing something, or this is a bug.
>>
>> Please take a look at this unit test:
>>
>>
>> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>>
>> Is the check correct that we would expect range [1, 6) to appear
>> twice? i.e. concat([1, 6), [1, 10]) ?
>>
>
> This is what I would expect. Your test code looks good to me. Could
> you file an issue?
>
>
>>
>> I have not tested this in other runners.
>> Thanks
>> -P.
>>
>


Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Ahmet Altay
I missed the lack of GBK.

assert_that would be the PAssert equivalent, but that has known issues in
streaming mode.
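
For illustration, a minimal sketch of that kind of order-insensitive check:
assert_that with equal_to compares PCollection contents without relying on
element or pane order (the streaming-mode caveats above still apply):

import apache_beam as beam
from apache_beam.testing.util import assert_that, equal_to

with beam.Pipeline() as p:
  result = p | beam.Create(['1', '2', '3'])
  # equal_to ignores ordering, so this passes even though the list is reversed.
  assert_that(result, equal_to(['3', '2', '1']))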

On Fri, Apr 19, 2019 at 4:12 PM Kenneth Knowles  wrote:

> Oh, wait I didn't even read the pipeline well. You don't have a GBK so
> triggers don't do anything. They only apply to aggregations. Since it is
> just a ParDo the elements flow right through and your results are expected.
> If you did have a GBK then you would have this:
>
> Expected: [ ['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5', '6', '7',
> '8', '9', '10'] ]
> Actual: [ ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'] ]
>
> Where both outer lists are PCollections, hence could be reordered, and
> both inner lists are also in an undefined order. They have a pane index
> that says their logical order but they can be reordered. It is unusual and
> runner-dependent but best to check PCollection contents without
> order-dependence.
>
> Do you have PAssert for these sorts of checks?
>
> Kenn
>
> On Fri, Apr 19, 2019 at 4:02 PM Pablo Estrada  wrote:
>
>> created https://jira.apache.org/jira/browse/BEAM-7122
>> Best
>> -P.
>>
>> On Fri, Apr 19, 2019 at 3:50 PM Pablo Estrada  wrote:
>>
>>> Ah sorry for the lack of clarification. Each element appears only once in
>>> the final output. The failure is:
>>>
>>> ==
 FAIL: test_multiple_accumulating_firings
 (apache_beam.transforms.trigger_test.TriggerPipelineTest)
 --
 Traceback (most recent call last):
   File "apache_beam/transforms/trigger_test.py", line 491, in
 test_multiple_accumulating_firings
 TriggerPipelineTest.all_records)
 AssertionError: Lists differ: ['1', '2', '3', '4', '5', '1',... !=
 ['1', '2', '3', '4', '5', '6',...

>>> [...other output...]
>>>
>>> (expected is:)
>>>
 - ['1', '2', '3', '4', '5', '1', '2', '3', '4', '5', '6', '7', '8',
 '9', '10']
 ?   -

>>> (actual is:)
>>>
 + ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
 --
>>>
>>>
>>>
>>>
>>> On Fri, Apr 19, 2019 at 3:45 PM Kenneth Knowles  wrote:
>>>
 What is the behavior you are seeing?

 Kenn

 On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:

>
>
> On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada 
> wrote:
>
>> Hello all,
>> I've been slowly learning a bit about life in streaming, with state,
>> timers, triggers, etc.
>>
>> The other day, I tried out a trigger pipeline that did not have the
>> behavior that I was expecting, and I am looking for feedback on whether 
>> I'm
>> missing something, or this is a bug.
>>
>> Please take a look at this unit test:
>>
>>
>> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>>
>> Is the check correct that we would expect range [1, 6) to appear
>> twice? i.e. concat([1, 6), [1, 10]) ?
>>
>
> This is what I would expect. Your test code looks good to me. Could
> you file an issue?
>
>
>>
>> I have not tested this in other runners.
>> Thanks
>> -P.
>>
>


Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Kenneth Knowles
Oh, wait I didn't even read the pipeline well. You don't have a GBK so
triggers don't do anything. They only apply to aggregations. Since it is
just a ParDo the elements flow right through and your results are expected.
If you did have a GBK then you would have this:

Expected: [ ['1', '2', '3', '4', '5'], ['1', '2', '3', '4', '5', '6', '7',
'8', '9', '10'] ]
Actual: [ ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10'] ]

Where both outer lists are PCollections, hence could be reordered, and both
inner lists are also in an undefined order. They have a pane index that
says their logical order but they can be reordered. It is unusual and
runner-dependent but best to check PCollection contents without
order-dependence.
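
For illustration, a rough sketch (not the test from the PR) of a pipeline shape
where an accumulating trigger actually produces repeated panes; exactly which
panes fire, and when, remains timing- and runner-dependent:

import apache_beam as beam
from apache_beam.transforms import trigger, window

with beam.Pipeline() as p:
  _ = (p
       | beam.Create([str(i) for i in range(1, 11)])
       | beam.Map(lambda v: ('key', v))
       | beam.WindowInto(
           window.GlobalWindows(),
           trigger=trigger.Repeatedly(trigger.AfterCount(5)),
           accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
       # Each firing of the GroupByKey re-emits everything seen so far for the
       # key, e.g. roughly ['1'..'5'] and then ['1'..'10'] in separate panes.
       | beam.GroupByKey()
       | beam.Map(print))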

Do you have PAssert for these sorts of checks?

Kenn

On Fri, Apr 19, 2019 at 4:02 PM Pablo Estrada  wrote:

> created https://jira.apache.org/jira/browse/BEAM-7122
> Best
> -P.
>
> On Fri, Apr 19, 2019 at 3:50 PM Pablo Estrada  wrote:
>
>> Ah sorry for the lack of clarification. Each element appears only once in
>> the final output. The failure is:
>>
>> ==
>>> FAIL: test_multiple_accumulating_firings
>>> (apache_beam.transforms.trigger_test.TriggerPipelineTest)
>>> --
>>> Traceback (most recent call last):
>>>   File "apache_beam/transforms/trigger_test.py", line 491, in
>>> test_multiple_accumulating_firings
>>> TriggerPipelineTest.all_records)
>>> AssertionError: Lists differ: ['1', '2', '3', '4', '5', '1',... != ['1',
>>> '2', '3', '4', '5', '6',...
>>>
>> [...other output...]
>>
>> (expected is:)
>>
>>> - ['1', '2', '3', '4', '5', '1', '2', '3', '4', '5', '6', '7', '8', '9',
>>> '10']
>>> ?   -
>>>
>> (actual is:)
>>
>>> + ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
>>> --
>>
>>
>>
>>
>> On Fri, Apr 19, 2019 at 3:45 PM Kenneth Knowles  wrote:
>>
>>> What is the behavior you are seeing?
>>>
>>> Kenn
>>>
>>> On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:
>>>


 On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada 
 wrote:

> Hello all,
> I've been slowly learning a bit about life in streaming, with state,
> timers, triggers, etc.
>
> The other day, I tried out a trigger pipeline that did not have the
> behavior that I was expecting, and I am looking for feedback on whether 
> I'm
> missing something, or this is a bug.
>
> Please take a look at this unit test:
>
>
> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>
> Is the check correct that we would expect range [1, 6) to appear
> twice? i.e. concat([1, 6), [1, 10]) ?
>

 This is what I would expect. Your test code looks good to me. Could you
 file an issue?


>
> I have not tested this in other runners.
> Thanks
> -P.
>



Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Pablo Estrada
created https://jira.apache.org/jira/browse/BEAM-7122
Best
-P.

On Fri, Apr 19, 2019 at 3:50 PM Pablo Estrada  wrote:

> Ah sorry for the lack of clarification. Each element appears only once in
> the final output. The failure is:
>
> ==
>> FAIL: test_multiple_accumulating_firings
>> (apache_beam.transforms.trigger_test.TriggerPipelineTest)
>> --
>> Traceback (most recent call last):
>>   File "apache_beam/transforms/trigger_test.py", line 491, in
>> test_multiple_accumulating_firings
>> TriggerPipelineTest.all_records)
>> AssertionError: Lists differ: ['1', '2', '3', '4', '5', '1',... != ['1',
>> '2', '3', '4', '5', '6',...
>>
> [...other output...]
>
> (expected is:)
>
>> - ['1', '2', '3', '4', '5', '1', '2', '3', '4', '5', '6', '7', '8', '9',
>> '10']
>> ?   -
>>
> (actual is:)
>
>> + ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
>> --
>
>
>
>
> On Fri, Apr 19, 2019 at 3:45 PM Kenneth Knowles  wrote:
>
>> What is the behavior you are seeing?
>>
>> Kenn
>>
>> On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:
>>
>>>
>>>
>>> On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada 
>>> wrote:
>>>
 Hello all,
 I've been slowly learning a bit about life in streaming, with state,
 timers, triggers, etc.

 The other day, I tried out a trigger pipeline that did not have the
 behavior that I was expecting, and I am looking for feedback on whether I'm
 missing something, or this is a bug.

 Please take a look at this unit test:


 https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451

 Is the check correct that we would expect range [1, 6) to appear twice?
 i.e. concat([1, 6), [1, 10]) ?

>>>
>>> This is what I would expect. Your test code looks good to me. Could you
>>> file an issue?
>>>
>>>

 I have not tested this in other runners.
 Thanks
 -P.

>>>


Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Pablo Estrada
Ah sorry for the lack of clarification. Each element appears only once in
the final output. The failure is:

==
> FAIL: test_multiple_accumulating_firings
> (apache_beam.transforms.trigger_test.TriggerPipelineTest)
> --
> Traceback (most recent call last):
>   File "apache_beam/transforms/trigger_test.py", line 491, in
> test_multiple_accumulating_firings
> TriggerPipelineTest.all_records)
> AssertionError: Lists differ: ['1', '2', '3', '4', '5', '1',... != ['1',
> '2', '3', '4', '5', '6',...
>
[...other output...]

(expected is:)

> - ['1', '2', '3', '4', '5', '1', '2', '3', '4', '5', '6', '7', '8', '9',
> '10']
> ?   -
>
(actual is:)

> + ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10']
> --




On Fri, Apr 19, 2019 at 3:45 PM Kenneth Knowles  wrote:

> What is the behavior you are seeing?
>
> Kenn
>
> On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:
>
>>
>>
>> On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada  wrote:
>>
>>> Hello all,
>>> I've been slowly learning a bit about life in streaming, with state,
>>> timers, triggers, etc.
>>>
>>> The other day, I tried out a trigger pipeline that did not have the
>>> behavior that I was expecting, and I am looking for feedback on whether I'm
>>> missing something, or this is a bug.
>>>
>>> Please take a look at this unit test:
>>>
>>>
>>> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>>>
>>> Is the check correct that we would expect range [1, 6) to appear twice?
>>> i.e. concat([1, 6), [1, 10]) ?
>>>
>>
>> This is what I would expect. Your test code looks good to me. Could you
>> file an issue?
>>
>>
>>>
>>> I have not tested this in other runners.
>>> Thanks
>>> -P.
>>>
>>


Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Kenneth Knowles
For example, my immediate suspicion, with rather little to go on, would be
a `>` versus `>=` issue in firing processing-time triggers. Coincidentally
still showing up, as in https://github.com/apache/beam/pull/8366

If we had a portable runner with TestStream support, I would suggest using
it.
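
For reference, a sketch of the kind of TestStream-driven check this alludes to,
on the regular (non-portable) DirectRunner; the element values, the trigger
delay, and the use of the --streaming flag are illustrative assumptions only:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.testing.test_stream import TestStream
from apache_beam.transforms import trigger, window

stream = (TestStream()
          .add_elements(['1', '2', '3', '4', '5'])
          .advance_processing_time(60)  # give a processing-time trigger a chance to fire
          .add_elements(['6', '7', '8', '9', '10'])
          .advance_watermark_to_infinity())

with beam.Pipeline(options=PipelineOptions(['--streaming'])) as p:
  _ = (p
       | stream
       | beam.Map(lambda v: ('key', v))
       | beam.WindowInto(
           window.GlobalWindows(),
           trigger=trigger.Repeatedly(trigger.AfterProcessingTime(10)),
           accumulation_mode=trigger.AccumulationMode.ACCUMULATING)
       | beam.GroupByKey()
       | beam.Map(print))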

Kenn

On Fri, Apr 19, 2019 at 3:45 PM Kenneth Knowles  wrote:

> What is the behavior you are seeing?
>
> Kenn
>
> On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:
>
>>
>>
>> On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada  wrote:
>>
>>> Hello all,
>>> I've been slowly learning a bit about life in streaming, with state,
>>> timers, triggers, etc.
>>>
>>> The other day, I tried out a trigger pipeline that did not have the
>>> behavior that I was expecting, and I am looking for feedback on whether I'm
>>> missing something, or this is a bug.
>>>
>>> Please take a look at this unit test:
>>>
>>>
>>> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>>>
>>> Is the check correct that we would expect range [1, 6) to appear twice?
>>> i.e. concat([1, 6), [1, 10]) ?
>>>
>>
>> This is what I would expect. Your test code looks good to me. Could you
>> file an issue?
>>
>>
>>>
>>> I have not tested this in other runners.
>>> Thanks
>>> -P.
>>>
>>


Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Kenneth Knowles
What is the behavior you are seeing?

Kenn

On Fri, Apr 19, 2019 at 3:14 PM Ahmet Altay  wrote:

>
>
> On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada  wrote:
>
>> Hello all,
>> I've been slowly learning a bit about life in streaming, with state,
>> timers, triggers, etc.
>>
>> The other day, I tried out a trigger pipeline that did not have the
>> behavior that I was expecting, and I am looking for feedback on whether I'm
>> missing something, or this is a bug.
>>
>> Please take a look at this unit test:
>>
>>
>> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>>
>> Is the check correct that we would expect range [1, 6) to appear twice?
>> i.e. concat([1, 6), [1, 10]) ?
>>
>
> This is what I would expect. Your test code looks good to me. Could you
> file an issue?
>
>
>>
>> I have not tested this in other runners.
>> Thanks
>> -P.
>>
>


Re: Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Ahmet Altay
On Fri, Apr 19, 2019 at 1:58 PM Pablo Estrada  wrote:

> Hello all,
> I've been slowly learning a bit about life in streaming, with state,
> timers, triggers, etc.
>
> The other day, I tried out a trigger pipeline that did not have the
> behavior that I was expecting, and I am looking for feedback on whether I'm
> missing something, or this is a bug.
>
> Please take a look at this unit test:
>
>
> https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451
>
> Is the check correct that we would expect range [1, 6) to appear twice?
> i.e. concat([1, 6), [1, 10]) ?
>

This is what I would expect. Your test code looks good to me. Could you
file an issue?


>
> I have not tested this in other runners.
> Thanks
> -P.
>


Re: New IOIT Dashboards

2019-04-19 Thread Pablo Estrada
Woah this is excellent. It's very nice to see that the metrics are more
consistent now, and the signal of the tests will be much more useful. Love
it!
Best
-P.

On Fri, Apr 19, 2019 at 1:59 PM Łukasz Gajowy 
wrote:

> @Kenn Yes, seconds. I added suffixes to the widgets' legends to clarify.
> @Ankur It's still Dataflow only, but we're closer to using Flink for that too
> (we already have the cluster setup in our codebase). Meanwhile, I fixed the
> dashboard title to clarify this too.
>
> Thanks,
> Łukasz
>
> On Fri, Apr 19, 2019 at 21:35 Ankur Goenka  wrote:
>
>> This looks great!
>> Which runner are we using for the pipeline?
>>
>> On Fri, Apr 19, 2019 at 12:03 PM Kenneth Knowles  wrote:
>>
>>> Very cool! I assume times are all in seconds?
>>>
>>> On Fri, Apr 19, 2019 at 6:26 AM Łukasz Gajowy 
>>> wrote:
>>>
 Hi,

 just wanted to announce that we improved the way we collect metrics
 from IOIT. Now we use Metrics API for this which allowed us to get more
 insights and collect run/read/write time (and possibly other metrics in the
 future) separately.

 The new dashboards are available here:
 https://s.apache.org/io-test-dashboards
 I also updated the docs in this PR:
 https://github.com/apache/beam/pull/8356

 Thanks,
 Łukasz





Re: Handling join with late/delayed data in one side

2019-04-19 Thread Reuven Lax
Do you have a bound on how delayed the read event is? If you do, you could
use session windows for this. You could also just use the state API to do
this type of join.
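
For reference, a rough sketch of the state-API flavour of this join (not a
drop-in solution; the event shape, the key, and the coder are assumptions, and
a real version would also set a timer to expire stale "sent" state):

import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import BagStateSpec

class JoinSentWithRead(beam.DoFn):
  # Buffers "sent" events per message id until the matching "read" arrives.
  SENT = BagStateSpec('sent', PickleCoder())

  def process(self, keyed_event, sent_state=beam.DoFn.StateParam(SENT)):
    message_id, event = keyed_event
    if event['type'] == 'sent':
      sent_state.add(event)
    else:  # a "read" event: join it against everything buffered so far
      for sent in sent_state.read():
        yield (message_id, sent, event)

# Usage: key both streams by message id, flatten them, then
#   joined = (sent_events, read_events) | beam.Flatten() | beam.ParDo(JoinSentWithRead())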

On Fri, Apr 19, 2019 at 1:49 PM Khai Tran  wrote:

> Hello beam community,
>
> I'm looking for a Beam API/implementation of joins with asymmetric arrival
> times. For example, for the same message, a message-sent event arrives at 9am,
> but the message-read event may arrive at 11am or even the next day. So when
> joining two streams of those two kinds of events together, we need to keep the
> buffer/state for message-sent events long enough to be able to catch
> late/delayed message-read events.
>
> Currently with Beam, I have seen a few examples of implementing joins with
> CoGroupByKey and FixedWindows, so I'm thinking of a few options here:
>
>
>1. Increase the size of the fixed windows; however, this will add latency
>to the application.
>2. If I choose early firing to reduce latency, then I would have a
>correctness issue:
>   - If I choose to accumulate events, I would end up with duplicated
>   results each time early firing is triggered.
>   - If I choose to discard events, then results would miss the
>   late/delayed scenario described above.
>
>
> Any comments or suggestions on how to solve this problem? I just found
> that Spark Streaming can provide what I need in this blog post:
>
> https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html
>
>
> Thanks,
> Khai
>
>
>
>


Re: New IOIT Dashboards

2019-04-19 Thread Łukasz Gajowy
@Kenn Yes, seconds. I added suffixes to the widgets' legends to clarify.
@Ankur It's still Dataflow only, but we're closer to using Flink for that too
(we already have the cluster setup in our codebase). Meanwhile, I fixed the
dashboard title to clarify this too.

Thanks,
Łukasz

On Fri, Apr 19, 2019 at 21:35 Ankur Goenka  wrote:

> This looks great!
> Which runner are we using for the pipeline?
>
> On Fri, Apr 19, 2019 at 12:03 PM Kenneth Knowles  wrote:
>
>> Very cool! I assume times are all in seconds?
>>
>> On Fri, Apr 19, 2019 at 6:26 AM Łukasz Gajowy  wrote:
>>
>>> Hi,
>>>
>>> just wanted to announce that we improved the way we collect metrics from
>>> IOIT. Now we use Metrics API for this which allowed us to get more insights
>>> and collect run/read/write time (and possibly other metrics in the future)
>>> separately.
>>>
>>> The new dashboards are available here:
>>> https://s.apache.org/io-test-dashboards
>>> I also updated the docs in this PR:
>>> https://github.com/apache/beam/pull/8356
>>>
>>> Thanks,
>>> Łukasz
>>>
>>>
>>>


Possible bug in accumulating triggers Python DirectRunner?

2019-04-19 Thread Pablo Estrada
Hello all,
I've been slowly learning a bit about life in streaming, with state,
timers, triggers, etc.

The other day, I tried out a trigger pipeline that did not have the
behavior that I was expecting, and I am looking for feedback on whether I'm
missing something, or this is a bug.

Please take a look at this unit test:

https://github.com/apache/beam/pull/8364/files#diff-38fb631ae11ed485e2b99507e96ff9ffR451

Is the check correct that we would expect range [1, 6) to appear twice?
i.e. concat([1, 6), [1, 10]) ?

I have not tested this in other runners.
Thanks
-P.


Re: Hazelcast Jet Runner

2019-04-19 Thread Kenneth Knowles
The ValidatesRunner tests are the best source we have for knowing the
capabilities of a runner. Are there instructions for running the tests?

Assuming we can check it out, then just open a PR to the website with the
current capabilities and caveats. Since it is a big deal and could use lots
of eyes, I would share the PR link on this thread.

Kenn

On Thu, Apr 18, 2019 at 11:53 AM Jozsef Bartok  wrote:

> Hi. We at Hazelcast Jet have been working for a while now to implement a
> Java Beam Runner (non-portable) based on Hazelcast Jet (
> https://jet.hazelcast.org/). The process is still ongoing (
> https://github.com/hazelcast/hazelcast-jet-beam-runner), but we are
> aiming for a fully functional, reliable Runner which can proudly join the
> Capability Matrix. For that purpose I would like to ask what’s your process
> of validating runners? We are already running the @ValidatesRunner tests
> and the Nexmark test suite, but beyond that what other steps do we need to
> take to get our Runner to the level it needs to be at?
>


Handling join with late/delayed data in one side

2019-04-19 Thread Khai Tran
Hello beam community,

I'm looking for a Beam API/implementation of joins with asymmetric arrival times.
For example, for the same message, a message-sent event arrives at 9am, but the
message-read event may arrive at 11am or even the next day. So when joining two
streams of those two kinds of events together, we need to keep the buffer/state
for message-sent events long enough to be able to catch late/delayed
message-read events.

Currently with Beam, I have seen a few examples of implementing joins with
CoGroupByKey and FixedWindows, so I'm thinking of a few options here:

  1.  Increase the size of the fixed windows; however, this will add latency to
      the application.
  2.  If I choose early firing to reduce latency, then I would have a
      correctness issue:
      *   If I choose to accumulate events, I would end up with duplicated
          results each time early firing is triggered.
      *   If I choose to discard events, then results would miss the
          late/delayed scenario described above.


Any comments or suggestions on how to solve this problem? I just found that 
Spark Streaming can provide what I need in this blog post:
https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html


Thanks,
Khai






Re: New IOIT Dashboards

2019-04-19 Thread Ankur Goenka
This looks great!
Which runner are we using for the pipeline?

On Fri, Apr 19, 2019 at 12:03 PM Kenneth Knowles  wrote:

> Very cool! I assume times are all in seconds?
>
> On Fri, Apr 19, 2019 at 6:26 AM Łukasz Gajowy  wrote:
>
>> Hi,
>>
>> just wanted to announce that we improved the way we collect metrics from
>> IOIT. Now we use Metrics API for this which allowed us to get more insights
>> and collect run/read/write time (and possibly other metrics in the future)
>> separately.
>>
>> The new dashboards are available here:
>> https://s.apache.org/io-test-dashboards
>> I also updated the docs in this PR:
>> https://github.com/apache/beam/pull/8356
>>
>> Thanks,
>> Łukasz
>>
>>
>>


Re: New IOIT Dashboards

2019-04-19 Thread Kenneth Knowles
Very cool! I assume times are all in seconds?

On Fri, Apr 19, 2019 at 6:26 AM Łukasz Gajowy  wrote:

> Hi,
>
> just wanted to announce that we improved the way we collect metrics from
> IOIT. Now we use Metrics API for this which allowed us to get more insights
> and collect run/read/write time (and possibly other metrics in the future)
> separately.
>
> The new dashboards are available here:
> https://s.apache.org/io-test-dashboards
> I also updated the docs in this PR:
> https://github.com/apache/beam/pull/8356
>
> Thanks,
> Łukasz
>
>
>


Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-19 Thread Kenneth Knowles
WindowedValue has always been an interface, not a concrete representation:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/WindowedValue.java
.
It is an abstract class because we started in Java 7 where you could not
have default methods, and just due to legacy style concerns. It is not just
discussed, but implemented, that there are WindowedValue implementations
with fewer allocations.
At the coder level, it was also always intended to have multiple encodings.
We already do have separate encodings based on whether there is 1 window or
multiple windows. The coder for a particular kind of WindowedValue should
decide this. Before the Fn API none of this had to be standardized, because
the runner could just choose whatever it wants. Now we have to standardize
any encodings that runners and harnesses both need to know. There should be
many, and adding more should be just a matter of standardization, no new
design.

None of this should be user-facing or in the runner API / pipeline graph -
that is critical to making it flexible on the backend between the runner &
SDK harness.

If I understand it, from our offline discussion, you are interested in the
case where you issue a ProcessBundleRequest to the SDK harness and none of
the primitives in the subgraph will ever observe the metadata. So you want
to not even have a tiny
"WindowedValueWithNoMetadata". Is that accurate?

Kenn

On Fri, Apr 19, 2019 at 10:17 AM jincheng sun 
wrote:

> Thank you! And have a nice weekend!
>
>
> On Sat, Apr 20, 2019 at 1:14 AM Lukasz Cwik  wrote:
>
>> I have added you as a contributor.
>>
>> On Fri, Apr 19, 2019 at 9:56 AM jincheng sun 
>> wrote:
>>
>>> Hi Lukasz,
>>>
>>> Thanks for your affirmation and for providing more contextual information. :)
>>>
>>> Would you please give me the contributor permission?  My JIRA ID is
>>> sunjincheng121.
>>>
>>> I would like to create/assign tickets for this work.
>>>
>>> Thanks,
>>> Jincheng
>>>
>>> On Sat, Apr 20, 2019 at 12:26 AM Lukasz Cwik  wrote:
>>>
 Since I don't think this is a contentious change.

 On Fri, Apr 19, 2019 at 9:25 AM Lukasz Cwik  wrote:

> Yes, using T makes sense.
>
> The WindowedValue was meant to be a context object in the SDK harness
> that propagates various information about the current element. We have
> discussed in the past about:
> * making optimizations which would pass around less of the context
> information if we know that the DoFns don't need it (for example, all the
> values share the same window).
> * versioning the encoding separately from the WindowedValue context
> object (see recent discussion about element timestamp precision [1])
> * the runner may want its own representation of a context object that
> makes sense for it which isn't a WindowedValue necessarily.
>
> Feel free to cut a JIRA about this and start working on a change
> towards this.
>
> 1:
> https://lists.apache.org/thread.html/221b06e81bba335d0ea8d770212cc7ee047dba65bec7978368a51473@%3Cdev.beam.apache.org%3E
>
> On Fri, Apr 19, 2019 at 3:18 AM jincheng sun 
> wrote:
>
>> Hi Beam devs,
>>
>> I read some of the docs about `Communicating over the Fn API` in
>> Beam. I feel that Beam has a very good design for Control Plane/Data
>> Plane/State Plane/Logging services, and it is described in the <How to send
>> and receive data> document. When communicating between Runner and SDK
>> Harness,
>> the DataPlane API will be WindowedValue<T> (an immutable triple of value,
>> timestamp, and windows) as a contract object between Runner and SDK
>> Harness. I see the interface definitions for sending and receiving data 
>> in
>> the code as follows:
>>
>> - org.apache.beam.runners.fnexecution.data.FnDataService
>>
>> public interface FnDataService {
>>>   <T> InboundDataClient receive(LogicalEndpoint inputLocation,
>>>       Coder<WindowedValue<T>> coder, FnDataReceiver<WindowedValue<T>> listener);
>>>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
>>> }
>>
>>
>>
>> - org.apache.beam.fn.harness.data.BeamFnDataClient
>>
>> public interface BeamFnDataClient {
>>>   <T> InboundDataClient receive(ApiServiceDescriptor apiServiceDescriptor,
>>>       LogicalEndpoint inputLocation, Coder<WindowedValue<T>> coder,
>>>       FnDataReceiver<WindowedValue<T>> receiver);
>>>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>>       Endpoints.ApiServiceDescriptor apiServiceDescriptor,
>>>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
>>> }
>>
>>
>> Both `Coder<WindowedValue<T>>` and `FnDataReceiver<WindowedValue<T>>`
>> use `WindowedValue<T>` as the data structure that both sides, the Runner and
>> the SDK Harness, know. Control Plane/Data Plane/State Plane/Logging is
>> a
>> 

Re: CVE audit gradle plugin

2019-04-19 Thread Lukasz Cwik
 Common Vulnerabilities and Exposures (CVE)

On Fri, Apr 19, 2019 at 10:33 AM Robert Burke  wrote:

> Ah! What's CVE stand for then?
>
> Re the PR: Sadly, it's more complicated than that, which I'll explain in
> the PR. Otherwise it would have been done already. It's not too bad if the
> time is put in though.
>
> On Fri, 19 Apr 2019 at 10:17, Lukasz Cwik  wrote:
>
>> Robert, I believe what is being suggested is a tool that integrates into
>> CVE reports automatically and tells us if we have a dependency with a
>> security issue (not just whether there is a newer version). Also, there is
>> a sweet draft PR to add Go modules[1].
>>
>> 1: https://github.com/apache/beam/pull/8354
>>
>> On Fri, Apr 19, 2019 at 10:12 AM Robert Burke  wrote:
>>
>>> If we move to Go Modules, the go.mod file specifies direct dependencies
>>> and versions, and the go.sum file includes checksums of the full transitive
>>> set of dependencies. There's likely going to be a tool for detecting if an
>>> update is possible, if one doesn't exist in the go tooling already.
>>>
>>> On Fri, 19 Apr 2019 at 09:44, Lukasz Cwik  wrote:
>>>
 This seems worthwhile IMO.

 Ahmet, Pyup[1] is free for open source projects and has an API that
 allows for dependency checking. They can scan Github repos automatically it
 seems but it may not be compatible with how Apache permissions with Github
 work. I'm not sure if there is such a thing for Go.

 1: https://pyup.io/

 On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  wrote:

> I want to bring this subject back. Any chance we can get this running
> in our main repo, maybe on a weekly basis like we do for the dependency
> reports? It looks totally worth it.
>
> On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  wrote:
> >
> > Thank you, I agree this is very important. Does anyone know a
> similar tool for python and go?
> >
> > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot <
> echauc...@apache.org> wrote:
> >>
> >> Hi guys,
> >>
> >> I came by this [1] gradle plugin that is a client to the Sonatype
> OSS Index CVE database.
> >>
> >> I have set it up here in a branch [2], though the cache is not
> configured and the number of requests is limited. It can be run with
> "gradle --info audit"
> >>
> >> It could be nice to have something like this to track the CVEs in
> the libs we use. I know we have been spammed by libs upgrade automatic
> requests in the past but CVE are more important IMHO.
> >>
> >> This plugin is in BSD-3-Clause which is compatible with Apache V2
> licence [3]
> >>
> >> WDYT ?
> >>
> >> Etienne
> >>
> >> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
> >> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
> >> [3] https://www.apache.org/legal/resolved.html
>



Re: CVE audit gradle plugin

2019-04-19 Thread Robert Burke
Ah! What's CVE stand for then?

Re the PR: Sadly, it's more complicated than that, which I'll explain in
the PR. Otherwise it would have been done already. It's not too bad if the
time is put in though.

On Fri, 19 Apr 2019 at 10:17, Lukasz Cwik  wrote:

> Robert, I believe what is being suggested is a tool that integrates into
> CVE reports automatically and tells us if we have a dependency with a
> security issue (not just whether there is a newer version). Also, there is
> a sweet draft PR to add Go modules[1].
>
> 1: https://github.com/apache/beam/pull/8354
>
> On Fri, Apr 19, 2019 at 10:12 AM Robert Burke  wrote:
>
>> If we move to Go Modules, the go.mod file specifies direct dependencies
>> and versions, and the go.sum file includes checksums of the full transitive
>> set of dependencies. There's likely going to be a tool for detecting if an
>> update is possible, if one doesn't exist in the go tooling already.
>>
>> On Fri, 19 Apr 2019 at 09:44, Lukasz Cwik  wrote:
>>
>>> This seems worthwhile IMO.
>>>
>>> Ahmet, Pyup[1] is free for open source projects and has an API that
>>> allows for dependency checking. They can scan Github repos automatically it
>>> seems but it may not be compatible with how Apache permissions with Github
>>> work. I'm not sure if there is such a thing for Go.
>>>
>>> 1: https://pyup.io/
>>>
>>> On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  wrote:
>>>
 I want to bring this subject back. Any chance we can get this running
 in our main repo, maybe on a weekly basis like we do for the dependency
 reports? It looks totally worth it.

 On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  wrote:
 >
 > Thank you, I agree this is very important. Does anyone know a similar
 tool for python and go?
 >
 > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot <
 echauc...@apache.org> wrote:
 >>
 >> Hi guys,
 >>
 >> I came by this [1] gradle plugin that is a client to the Sonatype
 OSS Index CVE database.
 >>
 >> I have set it up here in a branch [2], though the cache is not
 configured and the number of requests is limited. It can be run with
 "gradle --info audit"
 >>
 >> It could be nice to have something like this to track the CVEs in
 the libs we use. I know we have been spammed by libs upgrade automatic
 requests in the past but CVE are more important IMHO.
 >>
 >> This plugin is in BSD-3-Clause which is compatible with Apache V2
 licence [3]
 >>
 >> WDYT ?
 >>
 >> Etienne
 >>
 >> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
 >> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
 >> [3] https://www.apache.org/legal/resolved.html

>>>


Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-19 Thread jincheng sun
Thank you! And have a nice weekend!


On Sat, Apr 20, 2019 at 1:14 AM Lukasz Cwik  wrote:

> I have added you as a contributor.
>
> On Fri, Apr 19, 2019 at 9:56 AM jincheng sun 
> wrote:
>
>> Hi Lukasz,
>>
>> Thanks for your affirmation and for providing more contextual information. :)
>>
>> Would you please give me the contributor permission?  My JIRA ID is
>> sunjincheng121.
>>
>> I would like to create/assign tickets for this work.
>>
>> Thanks,
>> Jincheng
>>
>> On Sat, Apr 20, 2019 at 12:26 AM Lukasz Cwik  wrote:
>>
>>> Since I don't think this is a contentious change.
>>>
>>> On Fri, Apr 19, 2019 at 9:25 AM Lukasz Cwik  wrote:
>>>
 Yes, using T makes sense.

 The WindowedValue was meant to be a context object in the SDK harness
 that propagates various information about the current element. We have
 discussed in the past about:
 * making optimizations which would pass around less of the context
 information if we know that the DoFns don't need it (for example, all the
 values share the same window).
 * versioning the encoding separately from the WindowedValue context
 object (see recent discussion about element timestamp precision [1])
 * the runner may want its own representation of a context object that
 makes sense for it which isn't a WindowedValue necessarily.

 Feel free to cut a JIRA about this and start working on a change
 towards this.

 1:
 https://lists.apache.org/thread.html/221b06e81bba335d0ea8d770212cc7ee047dba65bec7978368a51473@%3Cdev.beam.apache.org%3E

 On Fri, Apr 19, 2019 at 3:18 AM jincheng sun 
 wrote:

> Hi Beam devs,
>
> I read some of the docs about `Communicating over the Fn API` in Beam.
> I feel that Beam has a very good design for Control Plane/Data Plane/State
> Plane/Logging services, and it is described in the <How to send and receive
> data> document. When communicating between Runner and SDK Harness, the
> DataPlane API will be WindowedValue<T> (an immutable triple of value,
> timestamp, and windows) as a contract object between Runner and SDK
> Harness. I see the interface definitions for sending and receiving data in
> the code as follows:
>
> - org.apache.beam.runners.fnexecution.data.FnDataService
>
> public interface FnDataService {
>>   <T> InboundDataClient receive(LogicalEndpoint inputLocation,
>>       Coder<WindowedValue<T>> coder, FnDataReceiver<WindowedValue<T>> listener);
>>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
>> }
>
>
>
> - org.apache.beam.fn.harness.data.BeamFnDataClient
>
> public interface BeamFnDataClient {
>>   <T> InboundDataClient receive(ApiServiceDescriptor apiServiceDescriptor,
>>       LogicalEndpoint inputLocation, Coder<WindowedValue<T>> coder,
>>       FnDataReceiver<WindowedValue<T>> receiver);
>>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>       Endpoints.ApiServiceDescriptor apiServiceDescriptor,
>>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
>> }
>
>
> Both `Coder<WindowedValue<T>>` and `FnDataReceiver<WindowedValue<T>>`
> use `WindowedValue<T>` as the data structure that both sides, the Runner and
> the SDK Harness, know. Control Plane/Data Plane/State Plane/Logging is a
> high-level abstraction; parts of it, such as the Control Plane and Logging,
> are common requirements for all multi-language platforms. For example, the
> Flink community is also discussing how to support Python UDFs, as well as how
> to deal with the docker environment, how to transfer data, how to access
> state, how to do logging, etc. If Beam can further abstract these service
> interfaces, i.e., make the interface definitions compatible with multiple
> engines, and finally provide them to other projects in the form of class
> libraries, it definitely will help other platforms that want to support
> multiple languages. So could Beam further abstract the interface definitions
> of FnDataService/BeamFnDataClient? Here I am throwing out a minnow to catch
> a whale: take the FnDataService#receive interface as an example, and turn
> `WindowedValue<T>` into `T` so that other platforms can be extended
> arbitrarily, as follows:
>
> <T> InboundDataClient receive(LogicalEndpoint inputLocation, Coder<T>
> coder, FnDataReceiver<T> listener);
>
> What do you think?
>
> Feel free to correct me if there is any incorrect understanding. And
> welcome any feedback!
>
>
> Regards,
> Jincheng
>



Re: CVE audit gradle plugin

2019-04-19 Thread Lukasz Cwik
Robert, I believe what is being suggested is a tool that integrates into
CVE reports automatically and tells us if we have a dependency with a
security issue (not just whether there is a newer version). Also, there is
a sweet draft PR to add Go modules[1].

1: https://github.com/apache/beam/pull/8354

On Fri, Apr 19, 2019 at 10:12 AM Robert Burke  wrote:

> If we move to Go Modules, the go.mod file specifies direct dependencies
> and versions, and the go.sum file includes checksums of the full transitive
> set of dependencies. There's likely going to be a tool for detecting if an
> update is possible, if one doesn't exist in the go tooling already.
>
> On Fri, 19 Apr 2019 at 09:44, Lukasz Cwik  wrote:
>
>> This seems worthwhile IMO.
>>
>> Ahmet, Pyup[1] is free for open source projects and has an API that
>> allows for dependency checking. They can scan Github repos automatically it
>> seems but it may not be compatible with how Apache permissions with Github
>> work. I'm not sure if there is such a thing for Go.
>>
>> 1: https://pyup.io/
>>
>> On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  wrote:
>>
>>> I want to bring this subject back. Any chance we can get this running
>>> in our main repo, maybe on a weekly basis like we do for the dependency
>>> reports? It looks totally worth it.
>>>
>>> On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  wrote:
>>> >
>>> > Thank you, I agree this is very important. Does anyone know a similar
>>> tool for python and go?
>>> >
>>> > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot 
>>> wrote:
>>> >>
>>> >> Hi guys,
>>> >>
>>> >> I came by this [1] gradle plugin that is a client to the Sonatype OSS
>>> Index CVE database.
>>> >>
>>> >> I have set it up here in a branch [2], though the cache is not
>>> configured and the number of requests is limited. It can be run with
>>> "gradle --info audit"
>>> >>
>>> >> It could be nice to have something like this to track the CVEs in the
>>> libs we use. I know we have been spammed by libs upgrade automatic requests
>>> in the past but CVE are more important IMHO.
>>> >>
>>> >> This plugin is in BSD-3-Clause which is compatible with Apache V2
>>> licence [3]
>>> >>
>>> >> WDYT ?
>>> >>
>>> >> Etienne
>>> >>
>>> >> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
>>> >> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
>>> >> [3] https://www.apache.org/legal/resolved.html
>>>
>>


Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-19 Thread Lukasz Cwik
I have added you as a contributor.

On Fri, Apr 19, 2019 at 9:56 AM jincheng sun 
wrote:

> Hi Lukasz,
>
> Thanks for your affirmation and for providing more contextual information. :)
>
> Would you please give me the contributor permission?  My JIRA ID is
> sunjincheng121.
>
> I would like to create/assign tickets for this work.
>
> Thanks,
> Jincheng
>
> On Sat, Apr 20, 2019 at 12:26 AM Lukasz Cwik  wrote:
>
>> Since I don't think this is a contentious change.
>>
>> On Fri, Apr 19, 2019 at 9:25 AM Lukasz Cwik  wrote:
>>
>>> Yes, using T makes sense.
>>>
>>> The WindowedValue was meant to be a context object in the SDK harness
>>> that propagates various information about the current element. We have
>>> discussed in the past about:
>>> * making optimizations which would pass around less of the context
>>> information if we know that the DoFns don't need it (for example, all the
>>> values share the same window).
>>> * versioning the encoding separately from the WindowedValue context
>>> object (see recent discussion about element timestamp precision [1])
>>> * the runner may want its own representation of a context object that
>>> makes sense for it which isn't a WindowedValue necessarily.
>>>
>>> Feel free to cut a JIRA about this and start working on a change towards
>>> this.
>>>
>>> 1:
>>> https://lists.apache.org/thread.html/221b06e81bba335d0ea8d770212cc7ee047dba65bec7978368a51473@%3Cdev.beam.apache.org%3E
>>>
>>> On Fri, Apr 19, 2019 at 3:18 AM jincheng sun 
>>> wrote:
>>>
 Hi Beam devs,

 I read some of the docs about `Communicating over the Fn API` in Beam.
 I feel that Beam has a very good design for Control Plane/Data Plane/State
 Plane/Logging services, and it is described in the <How to send and receive
 data> document. When communicating between Runner and SDK Harness, the
 DataPlane API will be WindowedValue<T> (an immutable triple of value,
 timestamp, and windows) as a contract object between Runner and SDK
 Harness. I see the interface definitions for sending and receiving data in
 the code as follows:

 - org.apache.beam.runners.fnexecution.data.FnDataService

 public interface FnDataService {
>   <T> InboundDataClient receive(LogicalEndpoint inputLocation,
>       Coder<WindowedValue<T>> coder, FnDataReceiver<WindowedValue<T>> listener);
>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
> }



 - org.apache.beam.fn.harness.data.BeamFnDataClient

 public interface BeamFnDataClient {
>   <T> InboundDataClient receive(ApiServiceDescriptor apiServiceDescriptor,
>       LogicalEndpoint inputLocation, Coder<WindowedValue<T>> coder,
>       FnDataReceiver<WindowedValue<T>> receiver);
>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>       Endpoints.ApiServiceDescriptor apiServiceDescriptor,
>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
> }


 Both `Coder<WindowedValue<T>>` and `FnDataReceiver<WindowedValue<T>>`
 use `WindowedValue<T>` as the data structure that both sides, the Runner and
 the SDK Harness, know. Control Plane/Data Plane/State Plane/Logging is a
 high-level abstraction; parts of it, such as the Control Plane and Logging,
 are common requirements for all multi-language platforms. For example, the
 Flink community is also discussing how to support Python UDFs, as well as how
 to deal with the docker environment, how to transfer data, how to access
 state, how to do logging, etc. If Beam can further abstract these service
 interfaces, i.e., make the interface definitions compatible with multiple
 engines, and finally provide them to other projects in the form of class
 libraries, it definitely will help other platforms that want to support
 multiple languages. So could Beam further abstract the interface definitions
 of FnDataService/BeamFnDataClient? Here I am throwing out a minnow to catch
 a whale: take the FnDataService#receive interface as an example, and turn
 `WindowedValue<T>` into `T` so that other platforms can be extended
 arbitrarily, as follows:

  <T> InboundDataClient receive(LogicalEndpoint inputLocation, Coder<T>
 coder, FnDataReceiver<T> listener);

 What do you think?

 Feel free to correct me if there is any incorrect understanding. And
 welcome any feedback!


 Regards,
 Jincheng

>>>


Re: CVE audit gradle plugin

2019-04-19 Thread Robert Burke
If we move to Go Modules, the go.mod file specifies direct dependencies and
versions, and the go.sum file includes checksums of the full transitive set
of dependencies. There's likely going to be a tool for detecting if an
update is possible, if one doesn't exist in the go tooling already.

On Fri, 19 Apr 2019 at 09:44, Lukasz Cwik  wrote:

> This seems worthwhile IMO.
>
> Ahmet, Pyup[1] is free for open source projects and has an API that allows
> for dependency checking. They can scan Github repos automatically it seems
> but it may not be compatible with how Apache permissions with Github work.
> I'm not sure if there is such a thing for Go.
>
> 1: https://pyup.io/
>
> On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  wrote:
>
>> I want to bring this subject back. Any chance we can get this running
>> in our main repo, maybe on a weekly basis like we do for the dependency
>> reports? It looks totally worth it.
>>
>> On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  wrote:
>> >
>> > Thank you, I agree this is very important. Does anyone know a similar
>> tool for python and go?
>> >
>> > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot 
>> wrote:
>> >>
>> >> Hi guys,
>> >>
>> >> I came by this [1] gradle plugin that is a client to the Sonatype OSS
>> Index CVE database.
>> >>
>> >> I have set it up here in a branch [2], though the cache is not
>> configured and the number of requests is limited. It can be run with
>> "gradle --info audit"
>> >>
>> >> It could be nice to have something like this to track the CVEs in the
>> libs we use. I know we have been spammed by libs upgrade automatic requests
>> in the past but CVE are more important IMHO.
>> >>
>> >> This plugin is in BSD-3-Clause which is compatible with Apache V2
>> licence [3]
>> >>
>> >> WDYT ?
>> >>
>> >> Etienne
>> >>
>> >> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
>> >> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
>> >> [3] https://www.apache.org/legal/resolved.html
>>
>


Re: CVE audit gradle plugin

2019-04-19 Thread Lukasz Cwik
This seems worthwhile IMO.

Ahmet, Pyup[1] is free for open source projects and has an API that allows
for dependency checking. They can scan Github repos automatically it seems
but it may not be compatible with how Apache permissions with Github work.
I'm not sure if there is such a thing for Go.

1: https://pyup.io/

On Fri, Apr 19, 2019 at 2:31 AM Ismaël Mejía  wrote:

> I want to bring this subject back. Any chance we can get this running
> in our main repo, maybe on a weekly basis like we do for the dependency
> reports? It looks totally worth it.
>
> On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  wrote:
> >
> > Thank you, I agree this is very important. Does anyone know a similar
> tool for python and go?
> >
> > On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot 
> wrote:
> >>
> >> Hi guys,
> >>
> >> I came by this [1] gradle plugin that is a client to the Sonatype OSS
> Index CVE database.
> >>
> >> I have set it up here in a branch [2], though the cache is not
> configured and the number of requests is limited. It can be run with
> "gradle --info audit"
> >>
> >> It could be nice to have something like this to track the CVEs in the
> libs we use. I know we have been spammed by libs upgrade automatic requests
> in the past but CVE are more important IMHO.
> >>
> >> This plugin is in BSD-3-Clause which is compatible with Apache V2
> licence [3]
> >>
> >> WDYT ?
> >>
> >> Etienne
> >>
> >> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
> >> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
> >> [3] https://www.apache.org/legal/resolved.html
>


Re: [DISCUSS] Turn `WindowedValue` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-19 Thread Lukasz Cwik
Since I don't think this is a contentious change.

On Fri, Apr 19, 2019 at 9:25 AM Lukasz Cwik  wrote:

> Yes, using T makes sense.
>
> The WindowedValue was meant to be a context object in the SDK harness that
> propagates various information about the current element. We have discussed
> in the past about:
> * making optimizations which would pass around less of the context
> information if we know that the DoFns don't need it (for example, all the
> values share the same window).
> * versioning the encoding separately from the WindowedValue context object
> (see recent discussion about element timestamp precision [1])
> * the runner may want its own representation of a context object that
> makes sense for it which isn't a WindowedValue necessarily.
>
> Feel free to cut a JIRA about this and start working on a change towards
> this.
>
> 1:
> https://lists.apache.org/thread.html/221b06e81bba335d0ea8d770212cc7ee047dba65bec7978368a51473@%3Cdev.beam.apache.org%3E
>
> On Fri, Apr 19, 2019 at 3:18 AM jincheng sun 
> wrote:
>
>> Hi Beam devs,
>>
>> I read some of the docs about `Communicating over the Fn API` in Beam. I
>> feel that Beam has a very good design for Control Plane/Data Plane/State
>> Plane/Logging services, and it is described in the <How to send and receive
>> data> document. When communicating between Runner and SDK Harness, the
>> DataPlane API will be WindowedValue<T> (an immutable triple of value,
>> timestamp, and windows) as a contract object between Runner and SDK
>> Harness. I see the interface definitions for sending and receiving data in
>> the code as follows:
>>
>> - org.apache.beam.runners.fnexecution.data.FnDataService
>>
>> public interface FnDataService {
>>>   <T> InboundDataClient receive(LogicalEndpoint inputLocation,
>>>       Coder<WindowedValue<T>> coder, FnDataReceiver<WindowedValue<T>> listener);
>>>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
>>> }
>>
>>
>>
>> - org.apache.beam.fn.harness.data.BeamFnDataClient
>>
>> public interface BeamFnDataClient {
>>>   <T> InboundDataClient receive(ApiServiceDescriptor apiServiceDescriptor,
>>>       LogicalEndpoint inputLocation, Coder<WindowedValue<T>> coder,
>>>       FnDataReceiver<WindowedValue<T>> receiver);
>>>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>>       Endpoints.ApiServiceDescriptor apiServiceDescriptor,
>>>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
>>> }
>>
>>
>> Both `Coder<WindowedValue<T>>` and `FnDataReceiver<WindowedValue<T>>` use
>> `WindowedValue` as the data structure that both sides, the Runner and the SDK
>> Harness, understand. Control Plane/Data Plane/State Plane/Logging is a
>> high-level abstraction, and parts of it, such as the Control Plane and Logging,
>> are common requirements for all multi-language platforms. For example, the Flink
>> community is also discussing how to support Python UDFs, as well as how to
>> deal with the Docker environment, how to transfer data, how to access state,
>> how to do logging, etc. If Beam can further abstract these service interfaces,
>> i.e., make the interface definitions compatible with multiple engines, and
>> finally provide them to other projects in the form of class libraries, it
>> will definitely help other platforms that want to support multiple
>> languages. So could Beam further abstract the interface definitions of
>> FnDataService and BeamFnDataClient? Here I am throwing out a minnow to catch
>> a whale: take the FnDataService#receive interface as an example, and turn
>> `WindowedValue<T>` into `T` so that other platforms can be extended
>> arbitrarily, as follows:
>>
>>   <T> InboundDataClient receive(LogicalEndpoint inputLocation, Coder<T> coder,
>>       FnDataReceiver<T> listener);
>>
>> What do you think?
>>
>> Feel free to correct me if there is any incorrect understanding. And any
>> feedback is welcome!
>>
>>
>> Regards,
>> Jincheng
>>
>


Re: [DISCUSS] Turn `WindowedValue<T>` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-19 Thread Lukasz Cwik
Yes, using T makes sense.

The WindowedValue was meant to be a context object in the SDK harness that
propagates various information about the current element. We have discussed
in the past:
* making optimizations which would pass around less of the context
information if we know that the DoFns don't need it (for example, all the
values share the same window).
* versioning the encoding separately from the WindowedValue context object
(see recent discussion about element timestamp precision [1])
* the runner may want its own representation of a context object that makes
sense for it which isn't a WindowedValue necessarily.

Feel free to cut a JIRA about this and start working on a change towards
this.

1:
https://lists.apache.org/thread.html/221b06e81bba335d0ea8d770212cc7ee047dba65bec7978368a51473@%3Cdev.beam.apache.org%3E

On Fri, Apr 19, 2019 at 3:18 AM jincheng sun 
wrote:

> Hi Beam devs,
>
> I read some of the docs about `Communicating over the Fn API` in Beam. I
> feel that Beam has a very good design for the Control Plane/Data Plane/State
> Plane/Logging services, and it is described in the "Beam Fn API: How to send
> and receive data" document. When communicating between the Runner and the SDK
> Harness, the DataPlane API will use WindowedValue<T> (an immutable triple of
> value, timestamp, and windows) as the contract object between the Runner and
> the SDK Harness. I see the interface definitions for sending and receiving
> data in the code as follows:
>
> - org.apache.beam.runners.fnexecution.data.FnDataService
>
> public interface FnDataService {
>>   <T> InboundDataClient receive(LogicalEndpoint inputLocation,
>>       Coder<WindowedValue<T>> coder, FnDataReceiver<WindowedValue<T>> listener);
>>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
>> }
>
>
>
> - org.apache.beam.fn.harness.data.BeamFnDataClient
>
> public interface BeamFnDataClient {
>>   <T> InboundDataClient receive(ApiServiceDescriptor apiServiceDescriptor,
>>       LogicalEndpoint inputLocation, Coder<WindowedValue<T>> coder,
>>       FnDataReceiver<WindowedValue<T>> receiver);
>>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>>       Endpoints.ApiServiceDescriptor apiServiceDescriptor,
>>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
>> }
>
>
> Both `Coder<WindowedValue<T>>` and `FnDataReceiver<WindowedValue<T>>` use
> `WindowedValue` as the data structure that both sides, the Runner and the SDK
> Harness, understand. Control Plane/Data Plane/State Plane/Logging is a
> high-level abstraction, and parts of it, such as the Control Plane and Logging,
> are common requirements for all multi-language platforms. For example, the Flink
> community is also discussing how to support Python UDFs, as well as how to
> deal with the Docker environment, how to transfer data, how to access state,
> how to do logging, etc. If Beam can further abstract these service interfaces,
> i.e., make the interface definitions compatible with multiple engines, and
> finally provide them to other projects in the form of class libraries, it
> will definitely help other platforms that want to support multiple
> languages. So could Beam further abstract the interface definitions of
> FnDataService and BeamFnDataClient? Here I am throwing out a minnow to catch
> a whale: take the FnDataService#receive interface as an example, and turn
> `WindowedValue<T>` into `T` so that other platforms can be extended
> arbitrarily, as follows:
>
>   <T> InboundDataClient receive(LogicalEndpoint inputLocation, Coder<T> coder,
>       FnDataReceiver<T> listener);
>
> What do you think?
>
> Feel free to correct me if there is any incorrect understanding. And any
> feedback is welcome!
>
>
> Regards,
> Jincheng
>


Re: investigating python precommit wordcount_it failure

2019-04-19 Thread Udi Meiri
I believe these are separate issues. BEAM-7111 is about wordcount_it_test
failing on the direct runner in streaming mode.

On Thu, Apr 18, 2019 at 8:09 PM Valentyn Tymofieiev 
wrote:

> I am working on a postcommit wordcount IT failure in BEAM-7063.
>
> On Thu, Apr 18, 2019 at 6:05 PM Udi Meiri  wrote:
>
>> Correction: it's a postcommit failure
>>
>> On Thu, Apr 18, 2019 at 5:43 PM Udi Meiri  wrote:
>>
>>> in https://issues.apache.org/jira/browse/BEAM-7111
>>>
>>> If anyone has state please lmk
>>>
>>




New IOIT Dashboards

2019-04-19 Thread Łukasz Gajowy
Hi,

just wanted to announce that we improved the way we collect metrics from
IOIT. We now use the Metrics API for this, which allowed us to get more insights
and collect run/read/write time (and possibly other metrics in the future)
separately.
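
For anyone who hasn't used the Metrics API yet, here is a minimal sketch of how a
test can report such a timing metric (the namespace and metric names below are only
illustrative, not the actual IOIT ones):

  import org.apache.beam.sdk.metrics.Distribution;
  import org.apache.beam.sdk.metrics.Metrics;
  import org.apache.beam.sdk.transforms.DoFn;

  // Records how long each element takes to write; the reported values end up as a
  // user metric that the test can query after the run and push to the dashboards.
  class TimedWriteFn extends DoFn<String, String> {
    private final Distribution writeTimeMs =
        Metrics.distribution("io_it", "write_time_ms");

    @ProcessElement
    public void processElement(ProcessContext c) {
      long start = System.currentTimeMillis();
      // ... perform the actual write here ...
      c.output(c.element());
      writeTimeMs.update(System.currentTimeMillis() - start);
    }
  }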

The new dashboards are available here:
https://s.apache.org/io-test-dashboards
I also updated the docs in this PR: https://github.com/apache/beam/pull/8356

Thanks,
Łukasz


Re: Artifact staging in cross-language pipelines

2019-04-19 Thread Maximilian Michels

Thank you for your replies.

I did not suggest that the Expansion Service does the staging, but it 
would return the required resources (e.g. jars) for the external 
transform's runtime environment. The client then has to take care of 
staging the resources.


The Expansion Service itself also needs resources to do the expansion. I 
assumed those to be provided when starting the expansion service. I 
consider it less important but we could also provide a way to add new 
transforms to the Expansion Service after startup.


Good point on Docker vs externally provided environments. For the PR [1] 
it will suffice then to add Kafka to the container dependencies. The 
"--jar_package" pipeline option is ok for now but I'd like to see work 
towards staging resources for external transforms via information 
returned by the Expansion Service. That avoids users having to take care 
of including the correct jars in their pipeline options.


These issues are related and we could discuss them in separate threads:

* Auto-discovery of Expansion Service and its external transforms
* Credentials required during expansion / runtime

Thanks,
Max

[1] https://github.com/apache/beam/pull/8322

On 19.04.19 07:35, Thomas Weise wrote:

Good discussion :)

Initially the expansion service was considered a user responsibility, 
but I think that isn't necessarily the case. I can also see the 
expansion service provided as part of the infrastructure and the user 
not wanting to deal with it at all. For example, users may want to write 
Python transforms and use external IOs, without being concerned how 
these IOs are provided. Under such scenario it would be good if:


* Expansion service(s) can be auto-discovered via the job service endpoint
* Available external transforms can be discovered via the expansion 
service(s)
* Dependencies for external transforms are part of the metadata returned 
by expansion service


Dependencies could then be staged either by the SDK client or the 
expansion service. The expansion service could provide the locations to 
stage to the SDK; it would still be transparent to the user.
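
As a rough sketch of that client-side flow (all of the type and field names below are
invented purely to illustrate the idea; no such API exists in Beam yet): the expansion
response would carry a list of dependency locations and the pipeline SDK would hand
them to whatever artifact staging mechanism the runner uses.

  import java.util.List;

  // Invented types, for illustration only.
  class ExpansionResult {
    List<String> dependencyUris;   // e.g. Maven coordinates or file/URL locations
  }

  interface ArtifactStager {
    void stage(String uri);        // stages one artifact with the runner's artifact service
  }

  class ClientSideStaging {
    void stageExternalTransformDeps(ExpansionResult result, ArtifactStager stager) {
      for (String uri : result.dependencyUris) {
        stager.stage(uri);
      }
    }
  }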


I also agree with Luke regarding the environments. Docker is the choice 
for generic deployment. Other environments are used when the flexibility 
offered by Docker isn't needed (or gets into the way). Then the 
dependencies are provided in different ways. Whether these are Python 
packages or jar files, by opting out of Docker the decision is made to 
manage dependencies externally.


Thomas


On Thu, Apr 18, 2019 at 6:01 PM Chamikara Jayalath wrote:




On Thu, Apr 18, 2019 at 5:21 PM Chamikara Jayalath <chamik...@google.com> wrote:

Thanks for raising the concern about credentials Ankur, I agree
that this is a significant issue.

On Thu, Apr 18, 2019 at 4:23 PM Lukasz Cwik <lc...@google.com> wrote:

I can understand the concern about credentials; the same
access concern will exist for several cross-language
transforms (mostly IOs) since some will need access to
credentials to read/write to an external service.

Are there any ideas on how credential propagation could work
to these IOs?


There are some cases where existing IO transforms need
credentials to access remote resources, for example, size
estimation, validation, etc. But usually these are optional (or
the transform can be configured to not perform these functions).


To clarify, I'm only talking about transform expansion here. Many IO
transforms need read/write access to remote services at run time. So
probably we need to figure out a way to propagate these credentials
anyways.

Can we use these mechanisms for staging? 



I think we'll have to find a way to do one of (1) propagate
credentials to other SDKs (2) allow users to configure SDK
containers to have necessary credentials (3) do the artifact
staging from the pipeline SDK environment which already have
credentials. I prefer (1) or (2) since this will give a
transform the same feature set whether used directly (in the same
SDK language as the transform) or remotely, but it might be hard
to do this for an arbitrary service that a transform might
connect to considering the number of ways users can configure
credentials (after an offline discussion with Ankur).


On Thu, Apr 18, 2019 at 3:47 PM Ankur Goenka <goe...@google.com> wrote:

I agree that the Expansion service knows about the
artifacts required for a cross-language transform, and
having a prepackaged folder/zip for transforms based on
language makes sense.

One thing to note here is that the expansion service might
not have the same access privilege as 

Re: Contributing Beam Kata (Java & Python)

2019-04-19 Thread hsuryawirawan
I've created a PR for the Beam Kata.
https://github.com/apache/beam/pull/8358

If you're interested in trying it out, please find the instructions on how to
set it up on your machine:
https://github.com/apache/beam/pull/8358#issuecomment-484855236

Should you have any issue or further question, please do let me know.


Regards,
Henry

On 2019/04/19 08:44:55, hsuryawira...@google.com  
wrote: 
> Thanks Lukasz.
> Yes you can try the kata.
> I will write a short instruction on how to use it, maybe along with the PR.
> 
> 
> On 2019/04/18 21:21:28, Lukasz Cwik  wrote: 
> > Also agree that this is really nice. Is there a place where we can try out
> > what you have created so far?
> > 
> > Opening a PR with what you have created so far may be the easiest way to
> > convey what your thinking would make sense.
> > 
> 
> 


[DISCUSS] Turn `WindowedValue<T>` into `T` in the FnDataService and BeamFnDataClient interface definition

2019-04-19 Thread jincheng sun
Hi Beam devs,

I read some of the docs about `Communicating over the Fn API` in Beam. I
feel that Beam has a very good design for the Control Plane/Data Plane/State
Plane/Logging services, and it is described in the "Beam Fn API: How to send
and receive data" document. When communicating between the Runner and the SDK
Harness, the DataPlane API will use WindowedValue<T> (an immutable triple of
value, timestamp, and windows) as the contract object between the Runner and
the SDK Harness. I see the interface definitions for sending and receiving
data in the code as follows:

- org.apache.beam.runners.fnexecution.data.FnDataService

public interface FnDataService {
>   <T> InboundDataClient receive(LogicalEndpoint inputLocation,
>       Coder<WindowedValue<T>> coder, FnDataReceiver<WindowedValue<T>> listener);
>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
> }



- org.apache.beam.fn.harness.data.BeamFnDataClient

public interface BeamFnDataClient {
>   <T> InboundDataClient receive(ApiServiceDescriptor apiServiceDescriptor,
>       LogicalEndpoint inputLocation, Coder<WindowedValue<T>> coder,
>       FnDataReceiver<WindowedValue<T>> receiver);
>   <T> CloseableFnDataReceiver<WindowedValue<T>> send(
>       Endpoints.ApiServiceDescriptor apiServiceDescriptor,
>       LogicalEndpoint outputLocation, Coder<WindowedValue<T>> coder);
> }


Both `Coder<WindowedValue<T>>` and `FnDataReceiver<WindowedValue<T>>` use
`WindowedValue` as the data structure that both sides, the Runner and the SDK
Harness, understand. Control Plane/Data Plane/State Plane/Logging is a
high-level abstraction, and parts of it, such as the Control Plane and Logging,
are common requirements for all multi-language platforms. For example, the Flink
community is also discussing how to support Python UDFs, as well as how to
deal with the Docker environment, how to transfer data, how to access state,
how to do logging, etc. If Beam can further abstract these service interfaces,
i.e., make the interface definitions compatible with multiple engines, and
finally provide them to other projects in the form of class libraries, it
will definitely help other platforms that want to support multiple
languages. So could Beam further abstract the interface definitions of
FnDataService and BeamFnDataClient? Here I am throwing out a minnow to catch
a whale: take the FnDataService#receive interface as an example, and turn
`WindowedValue<T>` into `T` so that other platforms can be extended
arbitrarily, as follows:

  <T> InboundDataClient receive(LogicalEndpoint inputLocation, Coder<T> coder,
      FnDataReceiver<T> listener);
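
For concreteness, the fully generic form of the whole interface could look like the
following sketch (the same types as in the snippets above, with `WindowedValue<T>`
replaced by a plain `T`); this is only an illustration of the idea, not code that
exists in Beam today:

  public interface FnDataService {
    <T> InboundDataClient receive(
        LogicalEndpoint inputLocation,
        Coder<T> coder,               // previously Coder<WindowedValue<T>>
        FnDataReceiver<T> listener);  // previously FnDataReceiver<WindowedValue<T>>

    <T> CloseableFnDataReceiver<T> send(
        LogicalEndpoint outputLocation,
        Coder<T> coder);
  }

A runner that wants the current behaviour could simply instantiate these with
T = WindowedValue<ElementT>, while other engines could plug in their own element
representation.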

What do you think?

Feel free to correct me if there is any incorrect understanding. And any
feedback is welcome!


Regards,
Jincheng


Re: Contributing Beam Kata (Java & Python)

2019-04-19 Thread Ismaël Mejía
+lars.fran...@gmail.com, who is in the Apache training project and may
be interested in this one, or at least in the JetBrains-like approach.

On Fri, Apr 19, 2019 at 12:01 PM Ismaël Mejía  wrote:
>
> This looks great, nice for bringing this to the project Henry!
>
> On Fri, Apr 19, 2019 at 10:53 AM hsuryawira...@google.com
>  wrote:
> >
> > Thanks Altay.
> > I'll create it under "learning/" first as this is not exactly an example.
> > Please do let me know if it's not the right place.
> >
> > On 2019/04/18 22:49:47, Ahmet Altay  wrote:
> > > This looks great.
> > >
> > > +David Cavazos  was working on interactive colab 
> > > based
> > > examples (https://github.com/apache/beam/pull/7679) perhaps we can have a
> > > shared place for these two similar things.
> > >
> >


Re: [DISCUSS] Backwards compatibility of @Experimental features

2019-04-19 Thread Ismaël Mejía
It seems we mostly agree that @Experimental is important, and that API
changes (removals) on experimental features should happen quickly but still
give some time to users so the Experimental purpose is not lost.

Ahmet's proposal, given our current release calendar, is close to 2 releases.
Can we settle on 2 releases as a 'minimum time' before removal? (This
leaves maintainers the option to support a feature for more time if they
want, as discussed in the related KafkaIO thread, while still being friendly to
users.)
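
(For illustration only: a deprecated experimental API waiting out such a window could
be marked roughly like this, with the Javadoc pointing at the replacement; the class
names here are invented, not a real Beam API.)

  import org.apache.beam.sdk.annotations.Experimental;
  import org.apache.beam.sdk.annotations.Experimental.Kind;

  /**
   * @deprecated Use NewFooIO instead. Scheduled for removal two releases after the
   *     release in which it was deprecated.
   */
  @Deprecated
  @Experimental(Kind.SOURCE_SINK)
  public class OldFooIO {
    // ...
  }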

Do we agree?

Note: for the other subjects (e.g. when an Experimental feature should
become non-experimental) I think we will hardly find an agreement, so I
think this should be treated on a per-case basis by the maintainers, but if
you want to follow up on that discussion we can open another thread for
this.



On Sat, Apr 6, 2019 at 1:04 AM Ahmet Altay  wrote:

> I agree that Experimental feature is still very useful. I was trying to
> argue that we diluted its value so +1 to reclaim that.
>
> Back to the original question, in my opinion removing existing
> "experimental and deprecated" features in n=1 release will confuse users.
> This will likely be a surprise to them because we have been maintaining
> this state release after release now. I would propose in the next release
> warning users of such a change happening and give them at least 3 months to
> upgrade to suggested newer paths. In the future we can have a shorter
> timelines assuming that we will set the user expectations right.
>
> On Fri, Apr 5, 2019 at 3:01 PM Ismaël Mejía  wrote:
>
>> I agree 100% with Kenneth on the multiple advantages that the
>> Experimental feature gave us. I can also count multiple places where this
>> has been essential in modules other than core. I disagree that
>> the @Experimental annotation has lost its meaning; it is simply ill-defined, and
>> probably so by design, because its advantages come from that.
>>
>> Most of the topics in this thread are a consequence of this loose
>> definition, e.g. (1) not defining how a feature becomes stable, and (2)
>> what to do when we want to remove an experimental feature; these are things
>> we need to decide whether to define or to just continue handling as we do today.
>>
>> Defining a target for graduating an Experimental feature is a bit too
>> aggressive and has not much benefit; in this case we could be losing the
>> advantages of Experimental (save if we could change the proposed version in
>> the future). This probably makes sense for the removal of features, but it
>> makes less sense for deciding when a feature becomes stable. Of course, in
>> the case of the core SDK packages this is probably more critical, but
>> nothing guarantees that things will be ready when we expect them to be. When
>> will we tag things like SDF or the portability APIs as stable? We cannot
>> predict when features will be complete.
>>
>> Nobody has mentioned that the LTS releases could be something like the middle
>> points for these decisions. That would at least give LTS some value, because
>> so far I still have trouble understanding the value of that idea given that
>> we can do a minor release of any previously released version.
>>
>> This debate is super important and nice to have, but we lost focus on my
>> initial question. I like the proposal to remove a deprecated experimental
>> feature (or part of it) after one release, in particular if the feature has
>> a clear replacement path; however, for cases like the removal of previously
>> supported versions of Kafka, one release may be too short. Other opinions on
>> this? (or on the other topics).
>>
>> On Fri, Apr 5, 2019 at 10:52 AM Robert Bradshaw 
>> wrote:
>>
>>> If it's technically feasible, I am also in favor of requiring
>>> experimental features to be (per-tag, Python should be updated) opt-in
>>> only. We should probably regularly audit the set of experimental features
>>> we ship (I'd say as part of the release, but that process is laborious
>>> enough, perhaps we should do it on a half-release cycle?) I think imposing
>>> hard deadlines (chosen when a feature is introduced) is too extreme, but
>>> might be valuable if opt-in plus regular audit is insufficient.
>>>
>>> On Thu, Apr 4, 2019 at 5:28 AM Kenneth Knowles  wrote:
>>>
 This all makes me think that we should rethink how we ship experimental
 features. My experience is also that (1) users don't know if something is
 experimental or don't think hard about it, and (2) we don't use the experimental
 time period to gather feedback and make changes.

 How can we change both of these? Perhaps we could require experimental
 features to be opt-in. Flags work and also clearly marked experimental
 dependencies that a user has to add. Changing the core is sometimes tricky
 to put behind a flag but rarely impossible. This way a contributor is also
 motivated to gather feedback to mature their feature to become default
 instead of opt-in.

 The need 

Re: [EXT] Re: [DOC] Portable Spark Runner

2019-04-19 Thread Ismaël Mejía
Thanks for sharing, the diagram really helps with understanding.
Please consider adding it to the design documents webpage.
https://beam.apache.org/contribute/design-documents/


On Tue, Apr 16, 2019 at 12:00 AM Ankur Goenka  wrote:

> Thanks for sharing.
> This looks great!
>
> On Mon, Apr 15, 2019 at 2:54 PM Kenneth Knowles  wrote:
>
>> Great. Thanks for sharing!
>>
>> On Mon, Apr 15, 2019 at 2:38 PM Lei Xu  wrote:
>>
>>> This is super nice! Really look forward to use this.
>>>
>>> On Mon, Apr 15, 2019 at 2:34 PM Thomas Weise  wrote:
>>>
 Great to see the portable Spark runner taking shape. Thanks for the
 update!


 On Mon, Apr 15, 2019 at 10:53 AM Pablo Estrada 
 wrote:

> This is very cool Kyle. Thanks for moving it forward!
> Best
> -P.
>
> On Fri, Apr 12, 2019 at 1:21 PM Lukasz Cwik  wrote:
>
>> Thanks for the doc.
>>
>> On Fri, Apr 12, 2019 at 11:34 AM Kyle Weaver 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> As some of you know, I've been piggybacking on the existing Spark
>>> and Flink runners to create a portable version of the Spark runner. I 
>>> wrote
>>> up a summary of the work I've done so far and what remains to be done. 
>>> I'll
>>> keep updating this going forward to provide a reasonably up-to-date
>>> description of the state of the project. Please comment on the doc if 
>>> you
>>> have any thoughts.
>>>
>>> Link:
>>>
>>> https://docs.google.com/document/d/1j8GERTiHUuc6CzzCXZHc38rBn41uWfATBh2-5JN8hro/edit?usp=sharing
>>>
>>> Thanks,
>>> Kyle
>>>
>>> Kyle Weaver |  Software Engineer | github.com/ibzib |
>>> kcwea...@google.com |  +1650203
>>>
>>
>>>
>>
>>


Re: Contributing Beam Kata (Java & Python)

2019-04-19 Thread Ismaël Mejía
This looks great, nice for bringing this to the project Henry!

On Fri, Apr 19, 2019 at 10:53 AM hsuryawira...@google.com
 wrote:
>
> Thanks Altay.
> I'll create it under "learning/" first as this is not exactly an example.
> Please do let me know if it's not the right place.
>
> On 2019/04/18 22:49:47, Ahmet Altay  wrote:
> > This looks great.
> >
> > +David Cavazos  was working on interactive colab based
> > examples (https://github.com/apache/beam/pull/7679) perhaps we can have a
> > shared place for these two similar things.
> >
>


Re: Postcommit kiosk dashboard

2019-04-19 Thread Ismaël Mejía
Catching up on this one, nice dashboard!
Some jobs are missing, e.g. validatesRunner for both Spark and Flink.
I suppose those are important if this may eventually replace the
README as Thomas suggests.

On Fri, Mar 15, 2019 at 2:18 AM Thomas Weise  wrote:
>
> This is very nice!
>
> Perhaps it can also replace this manually maintained list?  
> https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md
>
>
> On Thu, Mar 14, 2019 at 1:01 PM Mikhail Gryzykhin  wrote:
>>
>> Addressed comments:
>> 1. Added precommits.
>> 2. Limited the timeframe to 7 days. This removed old jobs from the table.
>> 2.1 We keep the history of all jobs in a separate DB that's used by Grafana. Some
>> of the deprecated jobs come from there.
>>
>> --Mikhail
>>
>> Have feedback?
>>
>>
>> On Thu, Mar 14, 2019 at 12:03 PM Michael Luckey  wrote:
>>>
>>> Very nice!
>>>
>>> Two questions though:
>>> - the links on the left should point somewhere?
>>> - where are the beam_PostCommit_[Java|GO]_GradleBuild coming from? Can't
>>> find them on Jenkins...
>>>
>>> On Thu, Mar 14, 2019 at 7:20 PM Mikhail Gryzykhin  wrote:

 we already have https://s.apache.org/beam-community-metrics

 --Mikhail

 Have feedback?


 On Thu, Mar 14, 2019 at 11:15 AM Pablo Estrada  wrote:
>
> Woaahhh very fanc... this is great. Thanks so much. Love it. - I also 
> like the Code Velocity dashboard that you've added.
>
> Let's make these more discoverable. How about adding a shortlink? 
> s.apache.org/beam-dash ? : )
> Best
> -P.
>
> On Thu, Mar 14, 2019 at 10:58 AM Mikhail Gryzykhin  
> wrote:
>>
>> Hi everyone,
>>
>> I've added a kiosk style post-commit status dashboard that can help 
>> decorate your office space with green and red colors.
>>
>> Regards,
>> --Mikhail
>>
>> Have feedback?


Re: CVE audit gradle plugin

2019-04-19 Thread Ismaël Mejía
I want to bring this subject back. Any chance we can get this running
in our main repo, maybe on a weekly basis like we do for the dependency
reports? It looks totally worth it.

On Fri, Mar 1, 2019 at 2:05 AM Ahmet Altay  wrote:
>
> Thank you, I agree this is very important. Does anyone know a similar tool 
> for python and go?
>
> On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot  wrote:
>>
>> Hi guys,
>>
>> I came across this [1] Gradle plugin that is a client to the Sonatype OSS Index
>> CVE database.
>>
>> I have set it up here in a branch [2], though the cache is not configured 
>> and the number of requests is limited. It can be run with "gradle --info 
>> audit"
>>
>> It could be nice to have something like this to track the CVEs in the libs
>> we use. I know we have been spammed by automatic lib-upgrade requests in
>> the past, but CVEs are more important IMHO.
>>
>> This plugin is under the BSD-3-Clause license, which is compatible with the Apache V2 licence [3]
>>
>> WDYT ?
>>
>> Etienne
>>
>> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
>> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
>> [3] https://www.apache.org/legal/resolved.html


Re: SNAPSHOTS have not been updated since february

2019-04-19 Thread Ismaël Mejía
Thanks everyone for the quick answers, and thanks Yifan for taking care of it.

On Thu, Apr 18, 2019 at 7:15 PM Yifan Zou  wrote:
>
> The original build nodes were updated on Jan 24 and the Nexus credentials were
> removed from the filesystem because they are not supposed to be on external
> build nodes (nodes Infra does not own). We now need to set up the role
> account on the new Beam JNLP nodes. I am still contacting Infra to bring the
> snapshots back.
>
> Yifan
>
> On Thu, Apr 18, 2019 at 10:09 AM Lukasz Cwik  wrote:
>>
>> The permissions issue is that the credentials needed to publish to the maven 
>> repository are only deployed on machines managed by Apache Infra. Now that 
>> the machines have been given back to each project to manage Yifan was 
>> investigating some other way to get the permissions on to the machine.
>>
>> On Thu, Apr 18, 2019 at 10:06 AM Boyuan Zhang  wrote:
>>>
>>> There is a test target
>>> https://builds.apache.org/job/beam_Release_NightlySnapshot/ in Beam, which
>>> builds and pushes a snapshot to Maven every day. The current failure is that the
>>> Jenkins machine cannot publish artifacts to Maven owing to some weird
>>> permission issue. I think +Yifan Zou  is working on it actively.
>>>
>>> On Thu, Apr 18, 2019 at 9:44 AM Ismaël Mejía  wrote:

 And is there a way we can detect SNAPSHOTS not being published daily in
 the future?

 On Thu, Apr 18, 2019 at 6:37 PM Ismaël Mejía  wrote:
 >
 > Any progress on this?
 >
 > On Wed, Mar 27, 2019 at 5:38 AM Daniel Oliveira  
 > wrote:
 > >
 > > I made a bug for this specific issue (artifacts not publishing to the 
 > > Apache Maven repo): https://issues.apache.org/jira/browse/BEAM-6919
 > >
 > > While I was gathering info for the bug report I also noticed +Yifan 
 > > Zou has an experimental PR testing a fix: 
 > > https://github.com/apache/beam/pull/8148
 > >
 > > On Tue, Mar 26, 2019 at 11:42 AM Boyuan Zhang  
 > > wrote:
 > >>
 > >> +Daniel Oliveira
 > >>
 > >> On Tue, Mar 26, 2019 at 9:57 AM Boyuan Zhang  
 > >> wrote:
 > >>>
 > >>> Sorry for the typo. Ideally, the snapshot publish is independent 
 > >>> from postrelease_snapshot.
 > >>>
 > >>> On Tue, Mar 26, 2019 at 9:55 AM Boyuan Zhang  
 > >>> wrote:
 > 
 >  Hey,
 > 
 >  I'm trying to publish the artifacts by commenting "Run Gradle 
 >  Publish" in my PR, but there are several errors saying "cannot 
 >  write artifacts into dir", anyone has idea on it? Ideally, the 
 >  snapshot publish is dependent from postrelease_snapshot. The 
 >  publish task is to build and publish artifacts and the 
 >  postrelease_snapshot is to verify whether the snapshot works.
 > 
 >  On Tue, Mar 26, 2019 at 8:45 AM Ahmet Altay  
 >  wrote:
 > >
 > > I believe this is related to 
 > > https://issues.apache.org/jira/browse/BEAM-6840 and +Boyuan Zhang 
 > > has a fix in progress https://github.com/apache/beam/pull/8132
 > >
 > > On Tue, Mar 26, 2019 at 7:09 AM Ismaël Mejía  
 > > wrote:
 > >>
 > >> I was trying to validate a fix on the Spark runner and realized 
 > >> that
 > >> Beam SNAPSHOTS have not been updated since February 24 !
 > >>
 > >> https://repository.apache.org/content/repositories/snapshots/org/apache/beam/beam-sdks-java-core/2.12.0-SNAPSHOT/
 > >>
 > >> Can somebody please take a look at why this is not been updated?
 > >>
 > >> Thanks,
 > >> Ismaël


Re: Contributing Beam Kata (Java & Python)

2019-04-19 Thread hsuryawirawan
Thanks Altay.
I'll create it under "learning/" first as this is not exactly an example.
Please do let me know if it's not the right place.

On 2019/04/18 22:49:47, Ahmet Altay  wrote: 
> This looks great.
> 
> +David Cavazos  was working on interactive colab based
> examples (https://github.com/apache/beam/pull/7679) perhaps we can have a
> shared place for these two similar things.
> 



Re: Contributing Beam Kata (Java & Python)

2019-04-19 Thread hsuryawirawan
Thanks Kenneth.
Yeah from a glance this might fit with the incubating training project.

As of this moment, the kata uses solely the direct runner, as the focus
currently is to teach Beam primitives.
There are ideas on how to also have it running on other runners (e.g. Dataflow),
but so far I haven't created that since it will need integration with a GCP
project, which requires some setup for a student.

On 2019/04/18 21:49:50, Kenneth Knowles  wrote: 
> Great! By the way it might fit in with the newly incubating training
> project [1]. Do you go into specifics on using different runners at all?
> 
> Kenn
> 
> [1] https://wiki.apache.org/incubator/TrainingProposal
> 


Re: Contributing Beam Kata (Java & Python)

2019-04-19 Thread hsuryawirawan
Thanks Lukasz.
Yes you can try the kata.
I will write a short instruction on how to use it, maybe along with the PR.


On 2019/04/18 21:21:28, Lukasz Cwik  wrote: 
> Also agree that this is really nice. Is there a place where we can try out
> what you have created so far?
> 
> Opening a PR with what you have created so far may be the easiest way to
> convey what your thinking would make sense.
> 



Re: Contributing Beam Kata (Java & Python)

2019-04-19 Thread hsuryawirawan
Hi Pablo,

The file structure at the moment is organized around the language.
* beam-kata/
* beam-kata/java/
* beam-kata/java/
* beam-kata/python/
* beam-kata/python/
Adding a new language in the future should be quite easy.

For students to use the kata, they actually don't need to check out the
repository.
We can create a course archive (zip file) that could be versioned and released 
on GitHub.
Another alternative is to release it on the Beam website.
The course archive zip can then be opened by the IDE to start the learning.

IntelliJ/PyCharm itself has integration with Stepik (https://stepik.org) which 
allows an easy way to start the learning.
The course will be downloaded from Stepik to the student's local workspace 
within seconds.
When I'm available, I will try to create a video to show what the kata looks
like.


On 2019/04/18 20:41:13, Pablo Estrada  wrote: 
> Hi Henry!
> this seems quite nice! For my information, what does the file structure
> look like for something like this? Also, maybe it's not the intended way of
> introducing it, but perhaps, could you share a video showing how one would
> try out the first kata?
> 
> Something that comes to mind for having this in Beam is having a
> `learning/katas/` subdirectory in the repository, with a README; and perhaps
> a section on the website for people to try this out.
> 
> Another question that comes to mind: Is it possible to structure it so that
> katas for other languages can be added?
> 
> Thoughts from others?
> Best
> -P.