Re: [Vote] Dev wiki engine

2018-07-19 Thread Jean-Baptiste Onofré
+1 for confluence provided by Apache

Regards
JB

On 19/07/2018 23:21, Mikhail Gryzykhin wrote:
> Hi everyone,
> 
> There is a long lasting discussion on starting Beam Dev Wiki
> 
> ongoing. Seems that the only question remaining is to decide on what
> engine to use for wiki. So far it seems that we have two suggestions:
> confluence and .md files in repo.
> 
> Quick summary can also be found in following doc
> .
> 
> I suggest to vote on which approach to use:
> 1. Apache Confluence
> 2. .md files in code repository (Those can be rendered by Github)
> 
> --Mikhail
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [Vote] Dev wiki engine

2018-07-19 Thread Rafael Fernandez
-1 .md! :-)

On Thu, Jul 19, 2018 at 6:38 PM Gaurav Thakur  wrote:

> +1 for confluence
>
> On Fri, Jul 20, 2018 at 11:42 AM Thomas Weise  wrote:
>
>> +1 for Confluence
>>
>>
>> On Thu, Jul 19, 2018 at 4:00 PM Kai Jiang  wrote:
>>
>>> +1 Apache Confluence
>>>
>>> On Thu, Jul 19, 2018, 15:18 Lukasz Cwik  wrote:
>>>
>>>> +1 for confluence.
>>>>
>>>> On Thu, Jul 19, 2018 at 3:17 PM Anton Kedin wrote:
>>>>
>>>>> +1 for Confluence
>>>>>
>>>>>
>>>>> On Thu, Jul 19, 2018 at 2:56 PM Andrew Pilloud wrote:
>>>>>
>>>>>> +1 Apache Confluence
>>>>>>
>>>>>> Because .md files in code repo require code review and commit.
>>>>>>
>>>>>> On Thu, Jul 19, 2018, 2:22 PM Mikhail Gryzykhin wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> There is a long lasting discussion on starting Beam Dev Wiki
>>>>>>>
>>>>>>> ongoing. Seems that the only question remaining is to decide on what
>>>>>>> engine to use for wiki. So far it seems that we have two suggestions:
>>>>>>> confluence and .md files in repo.
>>>>>>>
>>>>>>> Quick summary can also be found in following doc
>>>>>>>
>>>>>>> .
>>>>>>>
>>>>>>> I suggest to vote on which approach to use:
>>>>>>> 1. Apache Confluence
>>>>>>> 2. .md files in code repository (Those can be rendered by Github)
>>>>>>>
>>>>>>> --Mikhail
>>>>>>>
>>>>>>>


Re: [Vote] Dev wiki engine

2018-07-19 Thread Gaurav Thakur
+1 for confluence

On Fri, Jul 20, 2018 at 11:42 AM Thomas Weise  wrote:

> +1 for Confluence
>
>
> On Thu, Jul 19, 2018 at 4:00 PM Kai Jiang  wrote:
>
>> +1 Apache Confluence
>>
>> On Thu, Jul 19, 2018, 15:18 Lukasz Cwik  wrote:
>>
>>> +1 for confluence.
>>>
>>> On Thu, Jul 19, 2018 at 3:17 PM Anton Kedin  wrote:
>>>
>>>> +1 for Confluence
>>>>
>>>>
>>>> On Thu, Jul 19, 2018 at 2:56 PM Andrew Pilloud wrote:
>>>>
>>>>> +1 Apache Confluence
>>>>>
>>>>> Because .md files in code repo require code review and commit.
>>>>>
>>>>> On Thu, Jul 19, 2018, 2:22 PM Mikhail Gryzykhin wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> There is a long lasting discussion on starting Beam Dev Wiki
>>>>>>
>>>>>> ongoing. Seems that the only question remaining is to decide on what
>>>>>> engine to use for wiki. So far it seems that we have two suggestions:
>>>>>> confluence and .md files in repo.
>>>>>>
>>>>>> Quick summary can also be found in following doc
>>>>>>
>>>>>> .
>>>>>>
>>>>>> I suggest to vote on which approach to use:
>>>>>> 1. Apache Confluence
>>>>>> 2. .md files in code repository (Those can be rendered by Github)
>>>>>>
>>>>>> --Mikhail
>>>>>>
>>>>>>


Re: SQS source

2018-07-19 Thread Raghu Angadi
A timestamp is fundamental to every element in a PCollection.
What do you mean by not knowing the timestamp of a message?
There is a finalizeCheckpoint API [1] in UnboundedSource. Does that help?
Pubsub is very similar: a message needs to be acked within a timeout,
otherwise it will be redelivered to one of the consumers. Pubsub messages
are acked inside finalizeCheckpoint().

[1]:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java#L129
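
To make the ack-on-finalize idea concrete, here is a minimal sketch in plain
Java. It only mirrors the shape of UnboundedSource.CheckpointMark and does not
depend on Beam; AckClient, AckingCheckpointMark and AckingReader are
hypothetical names made up for illustration, and a real SqsIO would call the
SQS delete API where this sketch calls ack().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a queue client; not the AWS or Pubsub SDK. */
interface AckClient {
  void ack(String receiptHandle);
}

/** Mirrors the shape of UnboundedSource.CheckpointMark without depending on Beam. */
class AckingCheckpointMark {
  private final List<String> handles;
  private final AckClient client;

  AckingCheckpointMark(List<String> handles, AckClient client) {
    this.handles = new ArrayList<>(handles); // snapshot of what was read so far
    this.client = client;
  }

  /** Meant to run once the elements read before this mark are durably committed. */
  void finalizeCheckpoint() {
    for (String handle : handles) {
      client.ack(handle);
    }
    handles.clear(); // a repeated call becomes a no-op
  }
}

/** Hypothetical reader that remembers receipt handles until the next checkpoint. */
class AckingReader {
  private final AckClient client;
  private final List<String> readSinceCheckpoint = new ArrayList<>();

  AckingReader(AckClient client) {
    this.client = client;
  }

  void onMessageRead(String receiptHandle) {
    readSinceCheckpoint.add(receiptHandle);
  }

  AckingCheckpointMark getCheckpointMark() {
    AckingCheckpointMark mark = new AckingCheckpointMark(readSinceCheckpoint, client);
    readSinceCheckpoint.clear(); // the next mark starts from an empty list
    return mark;
  }
}
```

Nothing is acked when a message is read; the ack happens only when the mark is
finalized, so a message whose checkpoint never commits is simply redelivered
after its timeout.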

On Thu, Jul 19, 2018 at 3:28 PM John Rudolf Lewis 
wrote:

> hmm... made lots of progress on this today. But need help understanding
> something
>
> UnboundedSource seems to assume that there is some guarantee of message
> ordering, and that you can get the timestamp of the current message. Using
> UnboundedSource.CheckpointMark to help advance the offset. Seems to work ok
> for any source that supports those assumptions. But SQS does not work this
> way.
>
> With a standard SQS queue, there is no guarantee of ordering and there is
> no timestamp for a message.  With SQS, one needs to call the delete api
> using the receipt handle from the message to acknowledge receipt of a
> message and prevent its redelivery after the visibility timeout has expired.
>
> I'm not sure how to adapt these two patterns and would welcome suggestions.
>
>
>
> On Thu, Jul 19, 2018 at 7:40 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Thx John !
>>
>> Regards
>> JB
>>
>> On 19/07/2018 16:39, John Rudolf Lewis wrote:
>> > Thank you.
>> >
>> > I've created a jira ticket to add SQS and have assigned it to
>> > myself: https://issues.apache.org/jira/browse/BEAM-4828
>> >
>> > Modified the documentation to show it as in-progress:
>> > https://github.com/apache/beam/pull/5995
>> >
>> > And will be starting my work
>> > here: https://github.com/JohnRudolfLewis/beam/tree/Add-SqsIO
>> >
>> > On Thu, Jul 19, 2018 at 1:43 AM, Jean-Baptiste Onofré wrote:
>> >
>> > Agree with Ismaël.
>> >
>> > I would be more than happy to help on this one (as I contributed on
>> AMQP
>> > and JMS IOs ;)).
>> >
>> > Regards
>> > JB
>> >
>> > On 19/07/2018 10:39, Ismaël Mejía wrote:
>> > > Thanks for your interest John, it would be a really nice
>> contribution
>> > > to add SQS support.
>> > >
>> > > Some context on the kinesis stuff:
>> > >
>> > > The reason why kinesis is still in a separate module is more
>> related
>> > > to a licensing problem. Kinesis uses some native libraries that
>> are
>> > > published under a not 100% apache compatible license and we are
>> not
>> > > allowed to shade and republish them but it seems there is a
>> workaround
>> > > now, for more details see
>> > > https://issues.apache.org/jira/browse/BEAM-3549
>> > 
>> > > In any case if to use SQS you only need the Apache licensed
>> aws-sdk
>> > > deps it is ok (and a good idea) if you put it in the
>> > > amazon-web-services module.
>> > >
>> > > The kinesis connector is way more complex for multiple reasons,
>> first,
>> > > the raw version of the amazon client libraries is not so
>> ‘friendly’
>> > > and the guys who created KinesisIO had to do some workarounds to
>> > > provide accurate checkpointing/watermarks. So since SQS is a way
>> > > simpler system you should probably be ok basing it in simpler
>> sources
>> > > like AMQP or JMS.
>> > >
>> > > If you feel like to, please create the JIRA and don’t hesitate to
>> ask
>> > > questions if you find issues or if you need some review.
>> > >
>> > > On Thu, Jul 19, 2018 at 12:55 AM Lukasz Cwik wrote:
>> > >>
>> > >>
>> > >>
>> > >> On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis wrote:
>> > >>>
>> > >>> I need an SQS source for my project that is using beam. A brief
>> > search did not turn up any in-progress work in this area. Please
>> > point me to the right repo if I missed it.
>> > >>
>> > >>
>> > >> To my knowledge there is none and nobody has marked it in
>> > progress on https://beam.apache.org/documentation/io/built-in/
>> > . It would be
>> > good to create a JIRA issue on https://issues.apache.org/ and send
>> a
>> > PR to add SQS to the inprogress list referencing your JIRA. I added
>> > you as a contributor in JIRA so you should be able to assign
>> > yourself to any issues that you create.
>> > >>
>> > >>>
>> > >>> Assuming there is no in-progress effort, I would like to
>> > contribute an Amazon SQS source. I have a few questions before I
>> begin.
>> > >>
>> > >>
>> > >> Great, note that this is a good starting point for authoring an
>> > IO transform:
>> > https://beam.apache.org

Re: [Vote] Dev wiki engine

2018-07-19 Thread Thomas Weise
+1 for Confluence


On Thu, Jul 19, 2018 at 4:00 PM Kai Jiang  wrote:

> +1 Apache Confluence
>
> On Thu, Jul 19, 2018, 15:18 Lukasz Cwik  wrote:
>
>> +1 for confluence.
>>
>> On Thu, Jul 19, 2018 at 3:17 PM Anton Kedin  wrote:
>>
>>> +1 for Confluence
>>>
>>>
>>> On Thu, Jul 19, 2018 at 2:56 PM Andrew Pilloud 
>>> wrote:
>>>
>>>> +1 Apache Confluence
>>>>
>>>> Because .md files in code repo require code review and commit.
>>>>
>>>> On Thu, Jul 19, 2018, 2:22 PM Mikhail Gryzykhin wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> There is a long lasting discussion on starting Beam Dev Wiki
>>>>>
>>>>> ongoing. Seems that the only question remaining is to decide on what
>>>>> engine to use for wiki. So far it seems that we have two suggestions:
>>>>> confluence and .md files in repo.
>>>>>
>>>>> Quick summary can also be found in following doc
>>>>>
>>>>> .
>>>>>
>>>>> I suggest to vote on which approach to use:
>>>>> 1. Apache Confluence
>>>>> 2. .md files in code repository (Those can be rendered by Github)
>>>>>
>>>>> --Mikhail
>>>>>
>>>>>


Re: [Vote] Dev wiki engine

2018-07-19 Thread Kai Jiang
+1 Apache Confluence

On Thu, Jul 19, 2018, 15:18 Lukasz Cwik  wrote:

> +1 for confluence.
>
> On Thu, Jul 19, 2018 at 3:17 PM Anton Kedin  wrote:
>
>> +1 for Confluence
>>
>>
>> On Thu, Jul 19, 2018 at 2:56 PM Andrew Pilloud 
>> wrote:
>>
>>> +1 Apache Confluence
>>>
>>> Because .md files in code repo require code review and commit.
>>>
>>> On Thu, Jul 19, 2018, 2:22 PM Mikhail Gryzykhin 
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> There is a long lasting discussion on starting Beam Dev Wiki
>>>>
>>>> ongoing. Seems that the only question remaining is to decide on what engine
>>>> to use for wiki. So far it seems that we have two suggestions: confluence
>>>> and .md files in repo.
>>>>
>>>> Quick summary can also be found in following doc
>>>>
>>>> .
>>>>
>>>> I suggest to vote on which approach to use:
>>>> 1. Apache Confluence
>>>> 2. .md files in code repository (Those can be rendered by Github)
>>>>
>>>> --Mikhail




Re: SQS source

2018-07-19 Thread Lukasz Cwik
Some queue-based technologies don't have an explicit timestamp (and even
if they do, it's usually the queued time, which is not the timestamp the
user typically wants). Typically an option is exposed where the user
supplies the name of a property in the SQS message that should be
interpreted as the timestamp. PubsubIO provides a watermark estimate by
keeping track of the minimum timestamp seen over the last minute [1]. You can
start with something simple and improve the watermark tracking once you
have something working end to end, by looking at how the other unbounded
sources track watermarks effectively (a library of such
statistical methods would be useful for future IO authors as well).
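
As a rough illustration of both ideas, the sketch below reads a timestamp from
a user-named message attribute and estimates a watermark as the minimum event
time among messages that arrived within a sliding window. It is plain Java with
no Beam dependency; the class names, the epoch-millis attribute format, and the
fall-back to the current time when the window is empty are all assumptions made
for the example, not how PubsubIO is actually implemented.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

/** Hypothetical helper: reads the event time from a user-named message attribute. */
class TimestampAttribute {
  /** Assumes the attribute holds epoch milliseconds; falls back to the receive time. */
  static Instant extract(Map<String, String> attributes, String attributeName, Instant receiveTime) {
    String raw = attributes.get(attributeName);
    return raw == null ? receiveTime : Instant.ofEpochMilli(Long.parseLong(raw));
  }
}

/** Hypothetical estimator: minimum event time over messages that arrived in the last window. */
class SlidingMinWatermark {
  private static final class Sample {
    final Instant arrival;
    final Instant eventTime;

    Sample(Instant arrival, Instant eventTime) {
      this.arrival = arrival;
      this.eventTime = eventTime;
    }
  }

  private final Duration window;
  private final Deque<Sample> samples = new ArrayDeque<>();
  private Instant lastWatermark = Instant.EPOCH;

  SlidingMinWatermark(Duration window) {
    this.window = window;
  }

  /** Records one message; calls are assumed to come in arrival order. */
  void observe(Instant arrival, Instant eventTime) {
    samples.addLast(new Sample(arrival, eventTime));
  }

  Instant estimate(Instant now) {
    Instant cutoff = now.minus(window);
    // Drop samples that arrived before the start of the window.
    while (!samples.isEmpty() && samples.peekFirst().arrival.isBefore(cutoff)) {
      samples.removeFirst();
    }
    Instant candidate = now; // assumption: an idle source lets the watermark follow the clock
    for (Sample s : samples) {
      if (s.eventTime.isBefore(candidate)) {
        candidate = s.eventTime;
      }
    }
    // A watermark must never move backwards.
    if (candidate.isAfter(lastWatermark)) {
      lastWatermark = candidate;
    }
    return lastWatermark;
  }
}
```

The linear scan over the window is deliberately naive; it keeps the sketch
short and is the kind of thing that can be improved once the source works end
to end.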

SplittableDoFn will address the need for a callback to delete state
after it is read, but for now you can model a set of transforms like this:
Read(UnboundedSource) --message id-> Reshuffle -> ParDo(DeleteFromSQS)
  \--message-> ... rest of pipeline
...
The reshuffle forces a materialization in all runners which means that the
message id will only get to "DeleteFromSQS" if the message was successfully
read by the unbounded source.

1:
https://github.com/apache/beam/blob/0e18bf4c81e09c193e113c74cac7301dc26dac9e/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.java#L91

On Thu, Jul 19, 2018 at 3:28 PM John Rudolf Lewis 
wrote:

> hmm... made lots of progress on this today. But need help understanding
> something
>
> UnboundedSource seems to assume that there is some guarantee of message
> ordering, and that you can get the timestamp of the current message. Using
> UnboundedSource.CheckpointMark to help advance the offset. Seems to work ok
> for any source that supports those assumptions. But SQS does not work this
> way.
>
> With a standard SQS queue, there is no guarantee of ordering and there is
> no timestamp for a message.  With SQS, one needs to call the delete api
> using the receipt handle from the message to acknowledge receipt of a
> message and prevent its redelivery after the visibility timeout has expired.
>
> I'm not sure how to adapt these two patterns and would welcome suggestions.
>
>
>
> On Thu, Jul 19, 2018 at 7:40 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Thx John !
>>
>> Regards
>> JB
>>
>> On 19/07/2018 16:39, John Rudolf Lewis wrote:
>> > Thank you.
>> >
>> > I've created a jira ticket to add SQS and have assigned it to
>> > myself: https://issues.apache.org/jira/browse/BEAM-4828
>> >
>> > Modified the documentation to show it as in-progress:
>> > https://github.com/apache/beam/pull/5995
>> >
>> > And will be starting my work
>> > here: https://github.com/JohnRudolfLewis/beam/tree/Add-SqsIO
>> >
>> > On Thu, Jul 19, 2018 at 1:43 AM, Jean-Baptiste Onofré wrote:
>> >
>> > Agree with Ismaël.
>> >
>> > I would be more than happy to help on this one (as I contributed on
>> AMQP
>> > and JMS IOs ;)).
>> >
>> > Regards
>> > JB
>> >
>> > On 19/07/2018 10:39, Ismaël Mejía wrote:
>> > > Thanks for your interest John, it would be a really nice
>> contribution
>> > > to add SQS support.
>> > >
>> > > Some context on the kinesis stuff:
>> > >
>> > > The reason why kinesis is still in a separate module is more
>> related
>> > > to a licensing problem. Kinesis uses some native libraries that
>> are
>> > > published under a not 100% apache compatible license and we are
>> not
>> > > allowed to shade and republish them but it seems there is a
>> workaround
>> > > now, for more details see
>> > > https://issues.apache.org/jira/browse/BEAM-3549
>> > 
>> > > In any case if to use SQS you only need the Apache licensed
>> aws-sdk
>> > > deps it is ok (and a good idea) if you put it in the
>> > > amazon-web-services module.
>> > >
>> > > The kinesis connector is way more complex for multiple reasons,
>> first,
>> > > the raw version of the amazon client libraries is not so
>> ‘friendly’
>> > > and the guys who created KinesisIO had to do some workarounds to
>> > > provide accurate checkpointing/watermarks. So since SQS is a way
>> > > simpler system you should probably be ok basing it in simpler
>> sources
>> > > like AMQP or JMS.
>> > >
>> > > If you feel like to, please create the JIRA and don’t hesitate to
>> ask
>> > > questions if you find issues or if you need some review.
>> > >
>> > > On Thu, Jul 19, 2018 at 12:55 AM Lukasz Cwik wrote:
>> > >>
>> > >>
>> > >>
>> > >> On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis wrote:
>> > >>>
>> > >>> I need an SQS source for my project that is using beam. A brief
>> > search did not turn up any in-progress work in this area. Please
>> >

Re: SQS source

2018-07-19 Thread John Rudolf Lewis
Hmm... I made lots of progress on this today, but I need help understanding
something.

UnboundedSource seems to assume that there is some guarantee of message
ordering, that you can get the timestamp of the current message, and that
UnboundedSource.CheckpointMark is used to advance the offset. That seems to
work OK for any source that supports those assumptions, but SQS does not
work this way.

With a standard SQS queue, there is no guarantee of ordering and there is
no timestamp for a message. With SQS, one needs to call the delete API
using the receipt handle from the message to acknowledge receipt of a
message and prevent its redelivery after the visibility timeout has expired.

I'm not sure how to adapt these two patterns and would welcome suggestions.
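
For readers less familiar with these semantics, here is a toy in-memory model
of the receive / visibility timeout / delete cycle described above. It is plain
Java and does not use the AWS SDK; VisibilityQueue and its methods are
hypothetical names, and time is passed in explicitly so the behaviour is easy
to follow.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy in-memory model of a queue with a visibility timeout; not the AWS SDK. */
class VisibilityQueue {
  private final long visibilityTimeoutMillis;
  private final Map<String, String> bodies = new LinkedHashMap<>(); // message id -> body
  private final Map<String, Long> invisibleUntil = new HashMap<>(); // message id -> millis
  private final Map<String, String> handleToId = new HashMap<>();   // receipt handle -> message id
  private int handleCounter = 0;

  VisibilityQueue(long visibilityTimeoutMillis) {
    this.visibilityTimeoutMillis = visibilityTimeoutMillis;
  }

  void send(String id, String body) {
    bodies.put(id, body);
  }

  /** Returns a fresh receipt handle for some visible message, or null if none is visible. */
  String receive(long nowMillis) {
    for (String id : bodies.keySet()) {
      Long until = invisibleUntil.get(id);
      if (until == null || until <= nowMillis) {
        // Hide the message from other receivers until the timeout expires.
        invisibleUntil.put(id, nowMillis + visibilityTimeoutMillis);
        String handle = "rh-" + (++handleCounter);
        handleToId.put(handle, id);
        return handle;
      }
    }
    return null;
  }

  /** Acknowledges a message; without this call it becomes visible again after the timeout. */
  void delete(String receiptHandle) {
    String id = handleToId.remove(receiptHandle);
    if (id != null) {
      bodies.remove(id);
      invisibleUntil.remove(id);
    }
  }

  int size() {
    return bodies.size();
  }
}
```

In this model a message that is received but never deleted comes back from
receive() once the timeout has passed, which is the redelivery behaviour a
source has to account for.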



On Thu, Jul 19, 2018 at 7:40 AM, Jean-Baptiste Onofré 
wrote:

> Thx John !
>
> Regards
> JB
>
> On 19/07/2018 16:39, John Rudolf Lewis wrote:
> > Thank you.
> >
> > I've created a jira ticket to add SQS and have assigned it to
> > myself: https://issues.apache.org/jira/browse/BEAM-4828
> >
> > Modified the documentation to show it as in-progress:
> > https://github.com/apache/beam/pull/5995
> >
> > And will be starting my work
> > here: https://github.com/JohnRudolfLewis/beam/tree/Add-SqsIO
> >
> > On Thu, Jul 19, 2018 at 1:43 AM, Jean-Baptiste Onofré wrote:
> >
> > Agree with Ismaël.
> >
> > I would be more than happy to help on this one (as I contributed on
> AMQP
> > and JMS IOs ;)).
> >
> > Regards
> > JB
> >
> > On 19/07/2018 10:39, Ismaël Mejía wrote:
> > > Thanks for your interest John, it would be a really nice
> contribution
> > > to add SQS support.
> > >
> > > Some context on the kinesis stuff:
> > >
> > > The reason why kinesis is still in a separate module is more
> related
> > > to a licensing problem. Kinesis uses some native libraries that are
> > > published under a not 100% apache compatible license and we are not
> > > allowed to shade and republish them but it seems there is a
> workaround
> > > now, for more details see
> > > https://issues.apache.org/jira/browse/BEAM-3549
> > 
> > > In any case if to use SQS you only need the Apache licensed aws-sdk
> > > deps it is ok (and a good idea) if you put it in the
> > > amazon-web-services module.
> > >
> > > The kinesis connector is way more complex for multiple reasons,
> first,
> > > the raw version of the amazon client libraries is not so ‘friendly’
> > > and the guys who created KinesisIO had to do some workarounds to
> > > provide accurate checkpointing/watermarks. So since SQS is a way
> > > simpler system you should probably be ok basing it in simpler
> sources
> > > like AMQP or JMS.
> > >
> > > If you feel like to, please create the JIRA and don’t hesitate to
> ask
> > > questions if you find issues or if you need some review.
> > >
> > > On Thu, Jul 19, 2018 at 12:55 AM Lukasz Cwik wrote:
> > >>
> > >>
> > >>
> > >> On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis wrote:
> > >>>
> > >>> I need an SQS source for my project that is using beam. A brief
> > search did not turn up any in-progress work in this area. Please
> > point me to the right repo if I missed it.
> > >>
> > >>
> > >> To my knowledge there is none and nobody has marked it in
> > progress on https://beam.apache.org/documentation/io/built-in/
> > . It would be
> > good to create a JIRA issue on https://issues.apache.org/ and send a
> > PR to add SQS to the inprogress list referencing your JIRA. I added
> > you as a contributor in JIRA so you should be able to assign
> > yourself to any issues that you create.
> > >>
> > >>>
> > >>> Assuming there is no in-progress effort, I would like to
> > contribute an Amazon SQS source. I have a few questions before I
> begin.
> > >>
> > >>
> > >> Great, note that this is a good starting point for authoring an
> > IO transform:
> > https://beam.apache.org/documentation/io/authoring-overview/
> > 
> > >>
> > >>>
> > >>>
> > >>> It seems that the current AWS code is split into two different
> > modules: sdk/java/io/amazon-web-services which contains the
> > S3FileSystem, AwsOptions, etc, and sdk/java/io/kinesis which
> > contains an unbounded source based on a kinesis topic. I'd like to
> > add this source to the amazon-web-services module since I'd like to
> > depend on AwsOptions. Does adding this source to the
> > amazon-web-services module make sense?
> > >>
> > >>
> > >> Putting it inside of amazon-web-services makes a lot of se

Re: [Vote] Dev wiki engine

2018-07-19 Thread Lukasz Cwik
+1 for confluence.

On Thu, Jul 19, 2018 at 3:17 PM Anton Kedin  wrote:

> +1 for Confluence
>
>
> On Thu, Jul 19, 2018 at 2:56 PM Andrew Pilloud 
> wrote:
>
>> +1 Apache Confluence
>>
>> Because .md files in code repo require code review and commit.
>>
>> On Thu, Jul 19, 2018, 2:22 PM Mikhail Gryzykhin 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> There is a long lasting discussion on starting Beam Dev Wiki
>>> 
>>> ongoing. Seems that the only question remaining is to decide on what engine
>>> to use for wiki. So far it seems that we have two suggestions: confluence
>>> and .md files in repo.
>>>
>>> Quick summary can also be found in following doc
>>> 
>>> .
>>>
>>> I suggest to vote on which approach to use:
>>> 1. Apache Confluence
>>> 2. .md files in code repository (Those can be rendered by Github)
>>>
>>> --Mikhail
>>>
>>>


Re: [Vote] Dev wiki engine

2018-07-19 Thread Anton Kedin
+1 for Confluence


On Thu, Jul 19, 2018 at 2:56 PM Andrew Pilloud  wrote:

> +1 Apache Confluence
>
> Because .md files in code repo require code review and commit.
>
> On Thu, Jul 19, 2018, 2:22 PM Mikhail Gryzykhin  wrote:
>
>> Hi everyone,
>>
>> There is a long lasting discussion on starting Beam Dev Wiki
>> 
>> ongoing. Seems that the only question remaining is to decide on what engine
>> to use for wiki. So far it seems that we have two suggestions: confluence
>> and .md files in repo.
>>
>> Quick summary can also be found in following doc
>> 
>> .
>>
>> I suggest to vote on which approach to use:
>> 1. Apache Confluence
>> 2. .md files in code repository (Those can be rendered by Github)
>>
>> --Mikhail
>>
>>


Re: [Vote] Dev wiki engine

2018-07-19 Thread Andrew Pilloud
+1 Apache Confluence

Because .md files in code repo require code review and commit.

On Thu, Jul 19, 2018, 2:22 PM Mikhail Gryzykhin  wrote:

> Hi everyone,
>
> There is a long lasting discussion on starting Beam Dev Wiki
> 
> ongoing. Seems that the only question remaining is to decide on what engine
> to use for wiki. So far it seems that we have two suggestions: confluence
> and .md files in repo.
>
> Quick summary can also be found in following doc
> 
> .
>
> I suggest to vote on which approach to use:
> 1. Apache Confluence
> 2. .md files in code repository (Those can be rendered by Github)
>
> --Mikhail
>
>


unsubscribe

2018-07-19 Thread Apurva Desai
unsubscribe apur...@google.com


Re: [DISCUSS] Use Confluence wiki for non-user-facing stuff

2018-07-19 Thread Lukasz Cwik
I have added all PMC members who had an account on the Confluence wiki as
administrators.

Current list of administrators is:
Aljoscha Krettek (aljoscha)
Davor Bonaci (davor)
Daniel Kulp (dkulp)
Gavin (gmcdonald)
Josh Wills (jwills)
Kenneth Knowles (kenn)
Lukasz Cwik (lukecwik)
Jean-Baptiste Onofré (nanthrax)
Robert Bradshaw (rbradshaw)
Stephan Ewen (sewen)
Thomas Weise (thw)

If you need access, reach out to one of the existing administrators or a
PMC member to get access.

Also, what permissions should we grant to non-PMC members (the list
below enumerates them all)?

 All:          View, Delete Own
 Pages:        Add, Delete
 Blog:         Add, Delete
 Attachments:  Add, Delete
 Comments:     Add, Delete
 Restrictions: Add/Delete
 Mail:         Delete
 Space:        Export, Admin

On Mon, Jul 16, 2018 at 9:42 AM Andrew Pilloud  wrote:

> Your doc looks good to me. It looks like only one question remains: should
> it be a Confluence or Github wiki. I see other Apache projects using both,
> so it seems like either one is possible with support of the Beam community.
> It might be time to call a vote on this?
>
> Andrew
>
> On Fri, Jul 13, 2018 at 9:14 PM Mikhail Gryzykhin 
> wrote:
>
>> Hello everyone it's Mikhail and I'm here to revive this long-sleeping
>> thread.
>>
>> I have summarized discussion above into a design/proposal document
>> 
>> .
>>
>> The initial proposal is what I consider the best approach, so it is open
>> for change.
>>
>> Please comment on the following topics:
>> 1. Any other engines you have in mind.
>> 2. Whether you have access to configure the corresponding engine.
>> 3. General ideas.
>>
>> Since this is a long-desired change, please, be active.
>>
>> --Mikhail
>>
>> Have feedback ?
>>
>>
>> On Tue, Jun 12, 2018 at 5:24 PM Griselda Cuevas  wrote:
>>
>>>
>>> Hi Everyone,
>>>
>>>
>>> (a) should we do it? -- I like the idea of having a wiki, yes. Mainly to
>>> differentiate the documentation we cater to users and the one we cater to
>>> contributors. For things like user examples, and more demo-y content, I'd
>>> suggest we still host it in the Website.
>>>
>>> (b) what should go there? -- The ultimate purpose of the wiki should be
>>> to host everything needed to a) get started (with official documentation)
>>> and b) how to get the most out of Beam (here is where I see things like
>>> what Robert suggested could fit, tips, tricks and other cool things created
>>> by and for our contributors.)
>>>
>>> (c) what should not go there? -- Any demos, examples or showcases. I
>>> think that material should be either embedded or linked (listed) in the
>>> website.
>>>
>>> To summarize, I'd like to see the wiki be a knowledge collection for
>>> *people who contribute* to the project and the website the collection of
>>> information that allows *someone to make the decision to use Beam* (or
>>> join the community).
>>>
>>> When we are ready to vote on the creation of a wiki, I'd like to propose
>>> that the first thing we document there is the Beam Improvement plan,
>>> alongside a concrete "Get Started Contributing to Beam" cheatsheet.
>>>
>>> WDYT?
>>>
>>>
>>> On Tue, 12 Jun 2018 at 09:34, Alexey Romanenko 
>>> wrote:
>>>
>>>> +1 for having a wiki for devs and users.
>>>>
>>>> Even though the editing interface is not as native and obvious as
>>>> Google Docs, at least the content will be in one place and
>>>> should be much easier to search and discover.
>>>>
>>>> My only concern about a wiki (based on using one in other
>>>> projects) is that, over time, the information becomes outdated and
>>>> poorly structured, which makes it less valuable and even misleading.
>>>>
>>>> WBR,
>>>> Alexey
>>>>
>>>> On 12 Jun 2018, at 18:01, Robert Bradshaw wrote:
>>>>
>>>> On Mon, Jun 11, 2018 at 2:40 PM Kenneth Knowles wrote:
>>>>
>>>>> OK, yea, that all makes sense to me. Like this?
>>>>>
>>>>>  - site/documentation: writing just for users
>>>>>  - site/contribute: basic stuff as-is, writing for users to entice
>>>>> them, links to the next...
>>>>>  - wiki/contributors: contributors writing just for each other
>>>>>
>>>>> And you also have
>>>>>
>>>>>  - wiki/users: users writing for users
>>>>>
>>>>> That's interesting.
>>>>>
>>>>
>>>> Yep. We don't have to start wiki/users right away, but it could be
>>>> useful down the line.
>>>>
>>>>
>>>>
>>>>> On Mon, Jun 11, 2018 at 2:30 PM Robert Bradshaw wrote:
>>>>>
>>>>>> On Fri, Jun 8, 2018 at 2:18 PM Kenneth Knowles wrote:
>>>>>>
>>>>>>
>>>>>>> I disagree strongly here - I don't think the wiki will have
>>>>>>> appropriate polish for users. Even if carefully polished I don't
>>>>>>> think the presentation style is right, and it is not flexible.
>>>>>>> Power users will find it, of course.
>>>>>>>
>>>>>>
>>>>>> I wasn't imagining a wiki as a platform for developers to author
>>>>>> documentation, rather a place for users to author content for other users
>>>>>> (tips and tri

[Vote] Dev wiki engine

2018-07-19 Thread Mikhail Gryzykhin
Hi everyone,

There is a long-running discussion about starting a Beam Dev Wiki. It
seems that the only remaining question is which engine to use for the
wiki. So far we have two suggestions: Confluence and .md files in the repo.

A quick summary can also be found in the following doc

.

I suggest we vote on which approach to use:
1. Apache Confluence
2. .md files in the code repository (these can be rendered by GitHub)

--Mikhail


Re: [FEEDBACK REQUEST] Re: [ANNOUNCEMENT] Nexmark included to the CI

2018-07-19 Thread Andrew Pilloud
The doc changes look good to me; I'll add Dataflow once it is ready. Thanks
for opening the issue on the DirectRunner. I'll try to make some progress on
a dedicated perf node while you are gone. We can talk about increasing the size
of the Nexmark input collection for the runs once we know what the
utilization on that node looks like.

Enjoy your time off!

Andrew

On Thu, Jul 19, 2018 at 9:00 AM Etienne Chauchot 
wrote:

> Hi guys,
> As suggested by Anton below, I opened a PR on the website to reference
> the Nexmark dashboards.
> As I did not want users to take them for proper neutral benchmarks of the
> runners / engines, but more for a CI piece of software, I added a
> disclaimer.
>
> Please:
> - tell if you agree on the publication of such performance results
> - comment on the PR for the disclaimer.
>
> PR: https://github.com/apache/beam-site/pull/500
>
> Thanks
>
> Etienne
>
>
> On Thursday, July 19, 2018 at 12:30 +0200, Etienne Chauchot wrote:
>
> Hi Anton,
>
> Yes, good idea, I'll update nexmark website page
>
> Etienne
>
> On Wednesday, July 18, 2018 at 10:17 -0700, Anton Kedin wrote:
>
> These dashboards look great!
>
> Can we publish the links to the dashboards somewhere, for better visibility?
> E.g. in the jenkins website / emails, or the wiki.
>
> Regards,
> Anton
>
> On Wed, Jul 18, 2018 at 10:08 AM Andrew Pilloud 
> wrote:
>
> Hi Etienne,
>
> I've been asking around and it sounds like we should be able to get a
> dedicated Jenkins node for performance tests. Another thing that might help
> is making the runs a few times longer. They are currently running around 2
> seconds each, so the total time of the build probably exceeds testing.
> Internally at Google we are running them with 2000x as many events on
> Dataflow, but a job of that size won't even complete on the Direct Runner.
>
> I didn't see the query 3 issues, but now that you point it out it looks
> like a bug to me too.
>
> Andrew
>
> On Wed, Jul 18, 2018 at 1:13 AM Etienne Chauchot 
> wrote:
>
> Hi Andrew,
>
> Yes I saw that, except dedicating jenkins nodes to nexmark, I see no other
> way.
>
> Also, did you see query 3 output size on direct runner? Should be a
> straight line and it is not. I'm wondering if there is a problem with the state
> and timers implementation in the direct runner.
>
> Etienne
>
> On Tuesday, July 17, 2018 at 11:38 -0700, Andrew Pilloud wrote:
>
> I'm noticing the graphs are really noisy. It looks like we are running
> these on shared Jenkins executors, so our perf tests are fighting with
> other builds for CPU. I've opened an issue
> https://issues.apache.org/jira/browse/BEAM-4804 and am wondering if
> anyone knows an easy fix to isolate these jobs.
>
> Andrew
>
> On Fri, Jul 13, 2018 at 2:39 AM Łukasz Gajowy  wrote:
>
> @Etienne: Nice to see the graphs! :)
>
> @Ismael: Good idea, there's no document yet. I think we could create a
> small google doc with instructions on how to do this.
>
> On Fri, Jul 13, 2018 at 10:46 Etienne Chauchot wrote:
>
> Hi,
>
> @Andrew, this is because I did not find a way to set 2 scales on the Y
> axis on the perfkit graphs. Indeed numResults varies from 1 to 100 000 and
> runtimeSec is usually below 10s.
>
> Etienne
>
> On Thursday, July 12, 2018 at 12:04 -0700, Andrew Pilloud wrote:
>
> This is great, should make performance work much easier! I'm going to get
> the Beam SQL Nexmark jobs publishing as well. (Opened
> https://issues.apache.org/jira/browse/BEAM-4774 to track.) I might take
> on the Dataflow runner as well if no one else volunteers.
>
> I am curious as to why you have two separate graphs for runtime and count
> rather than graphing runtime/count to get the throughput rate for each run?
> Or should that be a third graph? Looks like it would just be a small tweak
> to the query in perfkit.
>
>
>
> Andrew
>
> On Thu, Jul 12, 2018 at 11:40 AM Pablo Estrada  wrote:
>
> This is really cool Etienne : ) thanks for working on this.
> Out of curiosity, do you know how often the tests run on each runner?
>
> Best
> -P.
>
> On Thu, Jul 12, 2018 at 2:15 AM Romain Manni-Bucau 
> wrote:
>
> Awesome Etienne, this is really important for the (user) community to have
> that visibility, since it is one of the most important aspects of Beam's
> quality. Kudos!
>
>
> Romain Manni-Bucau
> @rmannibucau
>
>
> On Thu, Jul 12, 2018 at 10:59, Jean-Baptiste Onofré wrote:
>
> It's really great to have these dashboards and integration in Jenkins !
>
> Thanks Etienne for driving this !
>
> Regards
> JB
>
> On 11/07/2018 15:13, Etienne Chauchot wrote:
> >
> > Hi guys,
> >
> > I'm glad to announce that the CI of Beam has much improved ! Indeed
> > Nexmark is now included in the perfkit dashboards.

Re: Updating Many Dependency Versions

2018-07-19 Thread Pablo Estrada
This is long overdue. Thanks a lot for working on this Garrett and Luke!
-P.

On Thu, Jul 19, 2018 at 10:16 AM Lukasz Cwik  wrote:

> I wanted to raise awareness of https://github.com/apache/beam/pull/5988
> since it updates a lot of the Google dependency client versions and many of
> their transitive dependencies including protobuf, gRPC and Netty.
>
> With the latest change to vendor protobuf/gRPC in many places and since
> all the validates runner tests have passed I believe this change will have
> minimal impact on our community and users. I'm planning to merge the PR on
> Friday unless issues are raised.
>
-- 
Got feedback? go/pabloem-feedback


Updating Many Dependency Versions

2018-07-19 Thread Lukasz Cwik
I wanted to raise awareness of https://github.com/apache/beam/pull/5988
since it updates a lot of the Google dependency client versions and many of
their transitive dependencies including protobuf, gRPC and Netty.

With the latest change to vendor protobuf/gRPC in many places and since all
the validates runner tests have passed I believe this change will have
minimal impact on our community and users. I'm planning to merge the PR on
Friday unless issues are raised.


Re: off for 2 - 3 weeks

2018-07-19 Thread Alexey Romanenko
Enjoy your vacation, Etienne! 


> On 19 Jul 2018, at 18:00, Ahmet Altay  wrote:
> 
> Have fun!
> 
> On Thu, Jul 19, 2018 at 8:46 AM, Łukasz Gajowy wrote:
> Enjoy! :)
> 
> On Thu, 19 Jul 2018 at 16:36, Reuven Lax wrote:
> Enjoy!
> 
> Reuven
> 
> On Thu, Jul 19, 2018 at 6:34 AM Etienne Chauchot wrote:
> Hi all,
> 
> Just a quick update to let you know that I'll be off for 2 or 3 weeks starting on 
> Friday evening.
> 
> Etienne
> 



Re: Vendoring / Shading Protobuf and gRPC

2018-07-19 Thread Lukasz Cwik
A lot of the package structure is a mess because we have a bunch of runner
support code in various runner/* packages which was responsible for
translating pipelines. I don't really have a strong opinion about whether
the translation happens within sdks/java/core or another package under
sdks/java/... but I do believe that all of this logic should get moved out
of runners/... The issue right now is that we are in a mode where we are
supporting both the old way and the portable way and we were placing
classes close to the old way because gRPC/Protobuf wasn't being shaded. Now
that they are shaded we have some more freedom but moving them around
doesn't seem to be a high priority.

Unfortunately, I don't know what we can do about GCP IO in a backwards
compatible manner since I believe one of the GCP connectors exposes protobuf
generated classes as part of its API surface.

We still have the ApiSurface testing like SdkCoreApiSurfaceTest to ensure
that we aren't exposing classes that we shouldn't. Since we are vendoring
gRPC/protobuf for the model/job-management/fn-execution packages, does it
still matter to not expose vendored classes?
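
The kind of check an API-surface test performs can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual ApiSurface implementation: the `exposed_violations` helper, the allowed prefixes, and the list of exposed type names are all hypothetical.

```python
# Hypothetical sketch of an API-surface check; not Beam's ApiSurface code.
def exposed_violations(exposed_types, allowed_prefixes):
    """Return exposed type names that fall outside every allowed package prefix."""
    return sorted(
        name
        for name in exposed_types
        if not any(name.startswith(prefix) for prefix in allowed_prefixes)
    )


# Assumed inputs: package prefixes the public API may expose, and the type
# names reachable from public signatures (made up for this example).
allowed = ("org.apache.beam.sdk.", "java.")
exposed = {
    "org.apache.beam.sdk.values.PCollection",
    "java.util.List",
    "com.google.protobuf.Message",
}

violations = exposed_violations(exposed, allowed)
```

With such a check, whether vendored classes count as a violation comes down to whether their relocated package prefix is on the allowed list.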


On Thu, Jul 19, 2018 at 1:44 AM Ismaël Mejía  wrote:

> Luke, thanks for explaining the idea, that’s what I expected, I was a
> bit confused after seeing that the proto transformation of the
> Pipeline happens in runners/core-construction-java. I thought this was
> intended (maybe to keep the SDK smaller). So what is the plan for the
> future? Move all this machinery into sdks/java/core or a different module
> and make sdks/java/core depend on that?
>
> In any case this looks like a non-public API matter. What I would like
> is to be vigilant about the exposure of Protobuf stuff in the public
> API to avoid some of the issues we have had in the past (and still
> have on GCP IO). Do we have this kind of validation in place after the
> move to Gradle (like not exposing Guava/protobuf in public APIs)?
> On Wed, Jul 18, 2018 at 6:29 PM Lukasz Cwik  wrote:
> >
> > Ismael, the SDK should perform the pipeline translation to proto because
> I expect the flow to be:
> > User Code -> SDK -> Proto Translation -> Job API -> Runner
> > I don't expect "runners" to live within the user's process anymore
> (excluding the direct runner). There will be one portable "runner" and it
> will be responsible for communicating with the job management APIs. It
> shouldn't be called a runner but for backwards compatibility it will behave
> like a runner does today. Flink/Spark/... will all live on the other side
> of the job management API.
> >
> > Thomas, I can run RemoteExecutionTest from commit
> ae2bebaf8b277e99840fa63f1b95d828f2093d16 without needing to modify the
> project/module structure in Intellij. Adding the jars manually only helps
> with code completion.
> >
> > https://github.com/apache/beam/pull/5977 works around the duplicate
> content root issue in Intellij. I have also run into the -Werror issue
> occasionally and don't know any fix or why it gets triggered as it doesn't
> happen to me all the time.
> >
> > On Tue, Jul 17, 2018 at 7:01 PM Thomas Weise  wrote:
> >>
> >> Thanks, the classpath order matters indeed.
> >>
> >> Still not able to run RemoteExecutionTest, but I was able to get the
> Flink portable test to work by adding the following to the top of the
> dependency list of beam-runners-flink_2.11_test
> >>
> >>
> vendor/sdks-java-extensions-protobuf/build/libs/beam-vendor-sdks-java-extensions-protobuf-2.6.0-SNAPSHOT.jar
> >> model/fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
> >>
> >>
> >> On Tue, Jul 17, 2018 at 6:00 PM Ankur Goenka  wrote:
> >>>
> >>> Yes, I am able to run it.
> >>>
> >>> For tests, you also need to add dependencies to
> ":beam-runners-java-fn-execution/beam-runners-java-fn-execution_test"
> module.
> >>>
> >>> Also, I only added
> >>> :beam-model-job-management-2.6.0-SNAPSHOT.jar
> >>> :beam-model-fn-execution-2.6.0-SNAPSHOT.jar
> >>> to the dependencies manually so not sure if you want to add
> >>> io.grpc:grpc-core:1.12.0 and com.google.protobuf:protobuf-java:3.5.1
> to the dependencies.
> >>>
> >>> Note, you need to move them up in the dependencies list.
> >>>
> >>>
> >>> On Tue, Jul 17, 2018 at 5:54 PM Thomas Weise  wrote:
> 
>  Are you able to run
> org.apache.beam.runners.fnexecution.control.RemoteExecutionTest from within
> Intellij ?
> 
>  I can get the compile errors to disappear by adding
> beam-model-job-management-2.6.0-SNAPSHOT.jar, io.grpc:grpc-core:1.12.0 and
> com.google.protobuf:protobuf-java:3.5.1
> 
>  Running the test still fails since other dependencies are missing.
> 
> 
>  On Tue, Jul 17, 2018 at 4:02 PM Ankur Goenka 
> wrote:
> >
> > For reference:
> > I was able to make intellij work with the master by doing following
> steps
> >
> > Remove module :beam:vendor-sdks-java-extensions-protobuf from
> intellij.
> > Adding
> :beam-model-fn-execution/build/libs/beam-model-f

Re: off for 2 - 3 weeks

2018-07-19 Thread Ahmet Altay
Have fun!

On Thu, Jul 19, 2018 at 8:46 AM, Łukasz Gajowy 
wrote:

> Enjoy! :)
>
> On Thu, 19 Jul 2018 at 16:36, Reuven Lax wrote:
>
>> Enjoy!
>>
>> Reuven
>>
>> On Thu, Jul 19, 2018 at 6:34 AM Etienne Chauchot 
>> wrote:
>>
>>> Hi all,
>>>
>>> Just a quick update to let you know that I'll be off for 2 or 3 weeks
>>> starting on Friday evening.
>>>
>>> Etienne
>>>
>>


Re: off for 2 - 3 weeks

2018-07-19 Thread Łukasz Gajowy
Enjoy! :)

On Thu, 19 Jul 2018 at 16:36, Reuven Lax wrote:

> Enjoy!
>
> Reuven
>
> On Thu, Jul 19, 2018 at 6:34 AM Etienne Chauchot 
> wrote:
>
>> Hi all,
>>
>> Just a quick update to let you know that I'll be off for 2 or 3 weeks
>> starting on Friday evening.
>>
>> Etienne
>>
>


Re: SQS source

2018-07-19 Thread Jean-Baptiste Onofré
Thx John !

Regards
JB

On 19/07/2018 16:39, John Rudolf Lewis wrote:
> Thank you.
> 
> I've created a jira ticket to add SQS and have assigned it to
> myself: https://issues.apache.org/jira/browse/BEAM-4828
> 
> Modified the documentation to show it as in-progress:
> https://github.com/apache/beam/pull/5995
> 
> And will be starting my work
> here: https://github.com/JohnRudolfLewis/beam/tree/Add-SqsIO
> 
> On Thu, Jul 19, 2018 at 1:43 AM, Jean-Baptiste Onofré wrote:
> 
> Agree with Ismaël.
> 
> I would be more than happy to help on this one (as I contributed on AMQP
> and JMS IOs ;)).
> 
> Regards
> JB
> 
> On 19/07/2018 10:39, Ismaël Mejía wrote:
> > Thanks for your interest John, it would be a really nice contribution
> > to add SQS support.
> >
> > Some context on the kinesis stuff:
> >
> > The reason why kinesis is still in a separate module is more related
> > to a licensing problem. Kinesis uses some native libraries that are
> > published under a not 100% apache compatible license and we are not
> > allowed to shade and republish them but it seems there is a workaround
> > now, for more details see
> > https://issues.apache.org/jira/browse/BEAM-3549
> 
> > In any case if to use SQS you only need the Apache licensed aws-sdk
> > deps it is ok (and a good idea) if you put it in the
> > amazon-web-services module.
> >
> > The kinesis connector is way more complex for multiple reasons, first,
> > the raw version of the amazon client libraries is not so ‘friendly’
> > and the guys who created KinesisIO had to do some workarounds to
> > provide accurate checkpointing/watermarks. So since SQS is a way
> > simpler system you should probably be ok basing it on simpler sources
> > like AMQP or JMS.
> >
> > If you feel like it, please create the JIRA and don’t hesitate to ask
> > questions if you find issues or if you need some review.
> >
> > On Thu, Jul 19, 2018 at 12:55 AM Lukasz Cwik wrote:
> >>
> >>
> >>
> >> On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis wrote:
> >>>
> >>> I need an SQS source for my project that is using beam. A brief
> search did not turn up any in-progress work in this area. Please
> point me to the right repo if I missed it.
> >>
> >>
> >> To my knowledge there is none and nobody has marked it in
> progress on https://beam.apache.org/documentation/io/built-in/. It would be
> good to create a JIRA issue on https://issues.apache.org/ and send a
> PR to add SQS to the in-progress list referencing your JIRA. I added
> you as a contributor in JIRA so you should be able to assign
> yourself to any issues that you create.
> >>
> >>>
> >>> Assuming there is no in-progress effort, I would like to
> contribute an Amazon SQS source. I have a few questions before I begin.
> >>
> >>
> >> Great, note that this is a good starting point for authoring an
> IO transform:
> https://beam.apache.org/documentation/io/authoring-overview/
> 
> >>
> >>>
> >>>
> >>> It seems that the current AWS code is split into two different
> modules: sdks/java/io/amazon-web-services which contains the
> S3FileSystem, AwsOptions, etc, and sdks/java/io/kinesis which
> contains an unbounded source based on a kinesis topic. I'd like to
> add this source to the amazon-web-services module since I'd like to
> depend on AwsOptions. Does adding this source to the
> amazon-web-services module make sense?
> >>
> >>
> >> Putting it inside of amazon-web-services makes a lot of sense.
> The Google connectors all live within the one package and there has
> been discussion to consolidate all the AWS stuff under
> amazon-web-services.
> >>
> >>>
> >>> Also, the kinesis source looks a touch more complex than other
> sources. Both the JMS and AMQP sources look like better examples to
> follow. Which existing source would be the best to model this
> contribution after?
> >>
> >>
> >> Some of it has to do with how many ways a source can be read and
> how complicated the watermark tracking is, but it would be best if the
> IO authors comment on implementation details.
> >>
> >>>
> >>> If anyone has put some thoughts into this, or better yet some
> code, I'd appreciate hearing from you.
> >>>
> >>> Thanks!
> >>>
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org 
> http://blog.nanthrax.net
> Talend - http://www.talend.com
> 
> 

-- 
Jean-Baptiste Onofré
jbono...@apache.

Re: SQS source

2018-07-19 Thread John Rudolf Lewis
Thank you.

I've created a jira ticket to add SQS and have assigned it to myself:
https://issues.apache.org/jira/browse/BEAM-4828

Modified the documentation to show it as in-progress:
https://github.com/apache/beam/pull/5995

And will be starting my work here:
https://github.com/JohnRudolfLewis/beam/tree/Add-SqsIO

On Thu, Jul 19, 2018 at 1:43 AM, Jean-Baptiste Onofré 
wrote:

> Agree with Ismaël.
>
> I would be more than happy to help on this one (as I contributed on AMQP
> and JMS IOs ;)).
>
> Regards
> JB
>
> On 19/07/2018 10:39, Ismaël Mejía wrote:
> > Thanks for your interest John, it would be a really nice contribution
> > to add SQS support.
> >
> > Some context on the kinesis stuff:
> >
> > The reason why kinesis is still in a separate module is more related
> > to a licensing problem. Kinesis uses some native libraries that are
> > published under a not 100% apache compatible license and we are not
> > allowed to shade and republish them but it seems there is a workaround
> > now, for more details see
> > https://issues.apache.org/jira/browse/BEAM-3549
> > In any case if to use SQS you only need the Apache licensed aws-sdk
> > deps it is ok (and a good idea) if you put it in the
> > amazon-web-services module.
> >
> > The kinesis connector is way more complex for multiple reasons, first,
> > the raw version of the amazon client libraries is not so ‘friendly’
> > and the guys who created KinesisIO had to do some workarounds to
> > provide accurate checkpointing/watermarks. So since SQS is a way
> > simpler system you should probably be ok basing it on simpler sources
> > like AMQP or JMS.
> >
> > If you feel like it, please create the JIRA and don’t hesitate to ask
> > questions if you find issues or if you need some review.
> >
> > On Thu, Jul 19, 2018 at 12:55 AM Lukasz Cwik  wrote:
> >>
> >>
> >>
> >> On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis 
> wrote:
> >>>
> >>> I need an SQS source for my project that is using beam. A brief search
> did not turn up any in-progress work in this area. Please point me to the
> right repo if I missed it.
> >>
> >>
> >> To my knowledge there is none and nobody has marked it in progress on
> https://beam.apache.org/documentation/io/built-in/. It would be good to
> create a JIRA issue on https://issues.apache.org/ and send a PR to add
> SQS to the in-progress list referencing your JIRA. I added you as a
> contributor in JIRA so you should be able to assign yourself to any issues
> that you create.
> >>
> >>>
> >>> Assuming there is no in-progress effort, I would like to contribute an
> Amazon SQS source. I have a few questions before I begin.
> >>
> >>
> >> Great, note that this is a good starting point for authoring an IO
> transform: https://beam.apache.org/documentation/io/authoring-overview/
> >>
> >>>
> >>>
> >>> It seems that the current AWS code is split into two different
> modules: sdks/java/io/amazon-web-services which contains the S3FileSystem,
> AwsOptions, etc, and sdks/java/io/kinesis which contains an unbounded source
> based on a kinesis topic. I'd like to add this source to the
> amazon-web-services module since I'd like to depend on AwsOptions. Does
> adding this source to the amazon-web-services module make sense?
> >>
> >>
> >> Putting it inside of amazon-web-services makes a lot of sense. The
> Google connectors all live within the one package and there has been
> discussion to consolidate all the AWS stuff under amazon-web-services.
> >>
> >>>
> >>> Also, the kinesis source looks a touch more complex than other
> sources. Both the JMS and AMQP sources look like better examples to follow.
> Which existing source would be the best to model this contribution after?
> >>
> >>
> >> Some of it has to do with how many ways a source can be read and how
> complicated the watermark tracking is, but it would be best if the IO authors
> comment on implementation details.
> >>
> >>>
> >>> If anyone has put some thoughts into this, or better yet some code,
> I'd appreciate hearing from you.
> >>>
> >>> Thanks!
> >>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
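
One common pattern for queue-backed sources, which may or may not be what the eventual SQS connector does, is to delete a message only once the checkpoint that covers it has been finalized. The sketch below is a toy model under stated assumptions: `FakeQueue` and `Reader` are hypothetical in-memory stand-ins, not the AWS SDK and not Beam's source API.

```python
from collections import deque


class FakeQueue:
    """In-memory stand-in for a queue service (hypothetical, not the AWS SDK)."""

    def __init__(self, bodies):
        self._visible = deque(enumerate(bodies))  # (handle, body) pairs
        self._in_flight = {}

    def receive(self):
        """Hand out the next visible message, or None if there is none."""
        if not self._visible:
            return None
        handle, body = self._visible.popleft()
        self._in_flight[handle] = body
        return handle, body

    def delete(self, handle):
        """Acknowledge a message so it is never delivered again."""
        self._in_flight.pop(handle, None)

    def redeliver_in_flight(self):
        """Simulate a visibility timeout: unacknowledged messages come back."""
        for handle in sorted(self._in_flight):
            self._visible.append((handle, self._in_flight[handle]))
        self._in_flight.clear()


class Reader:
    """Hypothetical reader that acknowledges only on checkpoint finalization."""

    def __init__(self, queue):
        self.queue = queue
        self.pending = []  # handles read since the last finalized checkpoint

    def advance(self):
        message = self.queue.receive()
        if message is None:
            return None
        handle, body = message
        self.pending.append(handle)
        return body

    def finalize_checkpoint(self):
        for handle in self.pending:
            self.queue.delete(handle)
        self.pending = []


queue = FakeQueue(["m1", "m2", "m3"])
reader = Reader(queue)
first = [reader.advance(), reader.advance()]
reader.finalize_checkpoint()   # m1 and m2 are deleted for good
third = reader.advance()       # m3 is read but never checkpointed
queue.redeliver_in_flight()    # simulated failure: m3 becomes visible again
replayed = Reader(queue).advance()
```

The design choice here is at-least-once delivery: a message read after the last finalized checkpoint can be seen again after a failure, so downstream logic would need to tolerate duplicates.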


Re: off for 2 - 3 weeks

2018-07-19 Thread Reuven Lax
Enjoy!

Reuven

On Thu, Jul 19, 2018 at 6:34 AM Etienne Chauchot 
wrote:

> Hi all,
>
> Just a quick update to let you know that I'll be off for 2 or 3 weeks
> starting on Friday evening.
>
> Etienne
>


Re: off for 2 - 3 weeks

2018-07-19 Thread Jean-Baptiste Onofré
Hi Etienne,

enjoy !

See you in 2 or 3 weeks !

Regards
JB

On 19/07/2018 15:34, Etienne Chauchot wrote:
> Hi all,
> 
> Just a quick update to let you know that I'll be off for 2 or 3 weeks
> starting on Friday evening.
> 
> Etienne

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


off for 2 - 3 weeks

2018-07-19 Thread Etienne Chauchot
Hi all, 

Just a quick update to let you know that I'll be off for 2 or 3 weeks starting on 
Friday evening.

Etienne

Re: [portablility] metrics interrogations

2018-07-19 Thread Etienne Chauchot
Thanks for the confirmations, Luke.

On Wednesday, 18 July 2018 at 07:56 -0700, Lukasz Cwik wrote:
> On Wed, Jul 18, 2018 at 7:01 AM Etienne Chauchot  wrote:
> > Hi,
> > Luke, Alex, I have some questions about portable metrics; can you confirm 
> > my understanding? 
> > 
> > 1 - As it is the SDK harness that will run the code of the UDFs, if a UDF 
> > defines a metric, then the SDK harness
> > will give updates through gRPC calls to the runner so that the runner can 
> > update metrics cells, right?
> 
> Yes. 
> 
> > 2 - Alex, you mentioned in the proto and design doc that there will be no 
> > aggregation of metrics. But some runners
> > (Spark/Flink) rely on accumulators, and when they are merged, it triggers 
> > the merging of the whole chain down to the
> > metric cells. I know that Dataflow does not do the same; it uses 
> > non-aggregated metrics and sends them to an
> > aggregation service. Will there be a change of paradigm with portability 
> > for runners that do the merging themselves? 
> 
> There will be local aggregation of metrics scoped to a bundle; after the 
> bundle is finished processing they are
> discarded. This will require some kind of global aggregation support from a 
> runner, whether that runner does it via
> accumulators or via an aggregation service is up to the runner.
> 
> > 3 - Please confirm that the distinction between attempted and committed 
> > metrics is not the business of portable
> > metrics. Indeed, it does not involve communication between the runner 
> > harness and the SDK harness as it is a runner-only
> > matter. I mean, when a runner commits a bundle it just updates its 
> > committed metrics and does not need to inform
> > the SDK harness. But, of course, when the user requests committed metrics 
> > through the SDK, then the SDK harness will
> > ask the runner harness to give them.
> > 
> > 
> You are correct in saying that during execution, the SDK does not 
> differentiate between attempted and committed
> metrics and only the runner does. We still lack an API definition and 
> contract for how an SDK would query for metrics
> from a runner, but you're right in saying that an SDK could request committed 
> metrics and the runner would supply them
> somehow. 
> > Thanks
> > Best
> > Etienne
> > 
> > 
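
The bundle-scoped aggregation described in this exchange can be sketched as follows. This is a minimal sketch, assuming a simple counter model: `BundleMetrics` and `RunnerAggregator` are hypothetical names and do not correspond to the Beam Fn API or any runner's implementation. The runner-side aggregation could equally be done with accumulators or a separate aggregation service.

```python
from collections import Counter


class BundleMetrics:
    """Hypothetical SDK-side container: counters live only for one bundle."""

    def __init__(self):
        self._counters = Counter()

    def inc(self, name, amount=1):
        self._counters[name] += amount

    def finish(self):
        """Return the bundle's totals and discard the local state."""
        snapshot = dict(self._counters)
        self._counters.clear()
        return snapshot


class RunnerAggregator:
    """Hypothetical runner-side global aggregation across bundles."""

    def __init__(self):
        self.totals = Counter()

    def report(self, bundle_snapshot):
        # Counter.update adds the reported counts to the running totals.
        self.totals.update(bundle_snapshot)


runner = RunnerAggregator()
for bundle in (["a", "b", "a"], ["a"]):
    metrics = BundleMetrics()          # fresh local state per bundle
    for element in bundle:
        metrics.inc("elements")
        if element == "a":
            metrics.inc("a_seen")
    runner.report(metrics.finish())    # local values are discarded afterwards
```

In this model the attempted/committed distinction would live entirely in `RunnerAggregator`, for example by applying a snapshot only when the bundle commits, which matches the point that it is a runner-only matter.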

[FEEDBACK REQUEST] Re: [ANNOUNCEMENT] Nexmark included to the CI

2018-07-19 Thread Etienne Chauchot
Hi guys,

As suggested by Anton below, I opened a PR on the website to reference 
the Nexmark dashboards. As I did not
want users to take them for proper neutral benchmarks of the runners / engines, 
but rather for a CI piece of software, I
added a disclaimer.

Please:
- tell me if you agree on the publication of such performance results
- comment on the PR for the disclaimer.

PR: https://github.com/apache/beam-site/pull/500

Thanks
Etienne

On Thursday, 19 July 2018 at 12:30 +0200, Etienne Chauchot wrote:
> Hi Anton, 
> Yes, good idea, I'll update the Nexmark page on the website.
> Etienne
> On Wednesday, 18 July 2018 at 10:17 -0700, Anton Kedin wrote:
> > These dashboards look great!
> > 
> > Can we publish the links to the dashboards somewhere, for better visibility? 
> > E.g. in the Jenkins website / emails, or
> > the wiki.
> > 
> > Regards,
> > Anton
> > On Wed, Jul 18, 2018 at 10:08 AM Andrew Pilloud wrote:
> > > Hi Etienne,
> > > 
> > > I've been asking around and it sounds like we should be able to get a 
> > > dedicated Jenkins node for performance
> > > tests. Another thing that might help is making the runs a few times 
> > > longer. They are currently running around 2
> > > seconds each, so the total time of the build probably exceeds testing. 
> > > Internally at Google we are running them
> > > with 2000x as many events on Dataflow, but a job of that size won't even 
> > > complete on the Direct Runner.
> > > I didn't see the query 3 issues, but now that you point it out it looks 
> > > like a bug to me too.
> > > 
> > > Andrew
> > > On Wed, Jul 18, 2018 at 1:13 AM Etienne Chauchot  
> > > wrote:
> > > > Hi Andrew,
> > > > > Yes, I saw that. Apart from dedicating Jenkins nodes to Nexmark, I see no 
> > > > > other way.
> > > > > Also, did you see query 3 output size on direct runner? Should be a 
> > > > > straight line and it is not, I'm wondering
> > > > > if there is a problem with the state and timers implementation in the direct runner.
> > > > > Etienne
> > > > > On Tuesday, 17 July 2018 at 11:38 -0700, Andrew Pilloud wrote:
> > > > > > I'm noticing the graphs are really noisy. It looks like we are 
> > > > > > running these on shared Jenkins executors, so
> > > > > > our perf tests are fighting with other builds for CPU. I've opened an 
> > > > > > issue https://issues.apache.org/jira/browse/BEAM-4804
> > > > > > and am wondering if anyone knows an easy fix to isolate 
> > > > > > these jobs.
> > > > > Andrew
> > > > > On Fri, Jul 13, 2018 at 2:39 AM Łukasz Gajowy  
> > > > > wrote:
> > > > > > @Etienne: Nice to see the graphs! :)
> > > > > > 
> > > > > > @Ismael: Good idea, there's no document yet. I think we could 
> > > > > > create a small google doc with instructions on
> > > > > > how to do this.
> > > > > > 
> > > > > > > On Fri, 13 Jul 2018 at 10:46, Etienne Chauchot wrote:
> > > > > > > > Hi, 
> > > > > > > > @Andrew, this is because I did not find a way to set 2 scales on 
> > > > > > > > the Y axis on the perfkit graphs. Indeed
> > > > > > > > numResults varies from 1 to 100 000 and runtimeSec is usually 
> > > > > > > > below 10s.
> > > > > > > > Etienne
> > > > > > > > On Thursday, 12 July 2018 at 12:04 -0700, Andrew Pilloud wrote:
> > > > > > > > This is great, should make performance work much easier! I'm 
> > > > > > > > going to get the Beam SQL Nexmark jobs
> > > > > > > > publishing as well. (Opened 
> > > > > > > > https://issues.apache.org/jira/browse/BEAM-4774 to track.) I 
> > > > > > > > might take on
> > > > > > > > the Dataflow runner as well if no one else volunteers.
> > > > > > > > 
> > > > > > > > > I am curious as to why you have two separate graphs for runtime 
> > > > > > > > > and count rather than graphing
> > > > > > > > runtime/count to get the throughput rate for each run? Or 
> > > > > > > > should that be a third graph? Looks like it
> > > > > > > > would just be a small tweak to the query in perfkit.
> > > > > > > > Andrew
> > > > > > > > On Thu, Jul 12, 2018 at 11:40 AM Pablo Estrada 
> > > > > > > >  wrote:
> > > > > > > > > > This is really cool Etienne : ) thanks for working on 
> > > > > > > > > > this. Out of curiosity, do you know how often the
> > > > > > > > > > tests run on each runner?
> > > > > > > > > 
> > > > > > > > > Best
> > > > > > > > > -P.
> > > > > > > > > 
> > > > > > > > > On Thu, Jul 12, 2018 at 2:15 AM Romain Manni-Bucau 
> > > > > > > > >  wrote:
> > > > > > > > > > Awesome Etienne, this is really important for the (user) 
> > > > > > > > > > community to have that visibility since it
> > > > > > > > > > > is one of the most important aspects of Beam's quality, 
> > > > > > > > > > > kudos!
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Romain Manni-Bucau
> > > > > > > > > > @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
> > > > > > > > > > 
> > > > > > > > > > > On Thu, 12 Jul 2018 at 10:59, Jean-Baptiste Onofré wrote:
> > > > > > > > > > > It's really great to have these dashboards and 
> > > > > > > > > > > integration in Jenkins !
> > > > > > 

Re: [ANNOUNCEMENT] Nexmark included to the CI

2018-07-19 Thread Etienne Chauchot
Andrew, here is the ticket about query 3: 
https://issues.apache.org/jira/browse/BEAM-4825
I added details with pseudo code of query 3 in the ticket and a link to the 
dashboards to see how flaky this query is on
the Direct Runner.
Etienne

On Thursday, 19 July 2018 at 12:22 +0200, Etienne Chauchot wrote:
> Hi Andrew,
> On Wednesday, 18 July 2018 at 10:08 -0700, Andrew Pilloud wrote:
> > Hi Etienne,
> > I've been asking around and it sounds like we should be able to get a 
> > dedicated Jenkins node for performance tests. 
> 
> Cool !
> > Another thing that might help is making the runs a few times longer. They 
> > are currently running around 2 seconds
> > each, so the total time of the build probably exceeds testing.
> 
> You mean increasing the size of the nexmark input collection? Currently it is 
> set to 100 000 events and is narrowed
> down to 10 000 IIRC for some queries (4 and 10 IIRC)
> >  Internally at Google we are running them with 2000x as many events on 
> > Dataflow, but a job of that size won't even
> > complete on the Direct Runner.
> 
> Yes probably not
> > I didn't see the query 3 issues, but now that you point it out it looks 
> > like a bug to me too.
> 
> It seems to me too, I'll open a ticket.
> Etienne
> > Andrew
> > On Wed, Jul 18, 2018 at 1:13 AM Etienne Chauchot  
> > wrote:
> > > Hi Andrew,
> > > > Yes, I saw that. Apart from dedicating Jenkins nodes to Nexmark, I see no 
> > > > other way.
> > > > Also, did you see query 3 output size on direct runner? Should be a 
> > > > straight line and it is not, I'm wondering if
> > > > there is a problem with the state and timers implementation in the direct runner.
> > > > Etienne
> > > > On Tuesday, 17 July 2018 at 11:38 -0700, Andrew Pilloud wrote:
> > > > > I'm noticing the graphs are really noisy. It looks like we are running 
> > > > > these on shared Jenkins executors, so our
> > > > > perf tests are fighting with other builds for CPU. I've opened an issue 
> > > > > https://issues.apache.org/jira/browse/BEAM-4804
> > > > > and am wondering if anyone knows an easy fix to isolate these 
> > > > > jobs.
> > > > Andrew
> > > > On Fri, Jul 13, 2018 at 2:39 AM Łukasz Gajowy  
> > > > wrote:
> > > > > @Etienne: Nice to see the graphs! :)
> > > > > 
> > > > > @Ismael: Good idea, there's no document yet. I think we could create 
> > > > > a small google doc with instructions on
> > > > > how to do this.
> > > > > 
> > > > > > On Fri, 13 Jul 2018 at 10:46, Etienne Chauchot wrote:
> > > > > > > Hi, 
> > > > > > > @Andrew, this is because I did not find a way to set 2 scales on 
> > > > > > > the Y axis on the perfkit graphs. Indeed
> > > > > > > numResults varies from 1 to 100 000 and runtimeSec is usually 
> > > > > > > below 10s.
> > > > > > > Etienne
> > > > > > > On Thursday, 12 July 2018 at 12:04 -0700, Andrew Pilloud wrote:
> > > > > > > This is great, should make performance work much easier! I'm 
> > > > > > > going to get the Beam SQL Nexmark jobs
> > > > > > > publishing as well. (Opened 
> > > > > > > https://issues.apache.org/jira/browse/BEAM-4774 to track.) I 
> > > > > > > might take on the
> > > > > > > Dataflow runner as well if no one else volunteers.
> > > > > > > 
> > > > > > > > I am curious as to why you have two separate graphs for runtime 
> > > > > > > > and count rather than graphing
> > > > > > > runtime/count to get the throughput rate for each run? Or should 
> > > > > > > that be a third graph? Looks like it
> > > > > > > would just be a small tweak to the query in perfkit.
> > > > > > > Andrew
> > > > > > > On Thu, Jul 12, 2018 at 11:40 AM Pablo Estrada 
> > > > > > >  wrote:
> > > > > > > > > This is really cool Etienne : ) thanks for working on this. Out 
> > > > > > > > > of curiosity, do you know how often the
> > > > > > > > > tests run on each runner?
> > > > > > > > 
> > > > > > > > Best
> > > > > > > > -P.
> > > > > > > > 
> > > > > > > > On Thu, Jul 12, 2018 at 2:15 AM Romain Manni-Bucau 
> > > > > > > >  wrote:
> > > > > > > > > Awesome Etienne, this is really important for the (user) 
> > > > > > > > > community to have that visibility since it is
> > > > > > > > > > one of the most important aspects of Beam's quality, kudos!
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Romain Manni-Bucau
> > > > > > > > > @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
> > > > > > > > > 
> > > > > > > > > > On Thu, 12 Jul 2018 at 10:59, Jean-Baptiste Onofré wrote:
> > > > > > > > > > It's really great to have these dashboards and integration 
> > > > > > > > > > in Jenkins !
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Thanks Etienne for driving this !
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Regards
> > > > > > > > > > 
> > > > > > > > > > JB
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > On 11/07/2018 15:13, Etienne Chauchot wrote:
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > Hi guys,
>

Re: [ANNOUNCEMENT] Nexmark included to the CI

2018-07-19 Thread Etienne Chauchot
Hi Anton, 
Yes, good idea, I'll update the Nexmark page on the website.
Etienne

On Wednesday, 18 July 2018 at 10:17 -0700, Anton Kedin wrote:
> These dashboards look great!
> 
> Can we publish the links to the dashboards somewhere, for better visibility? 
> E.g. in the Jenkins website / emails, or the
> wiki.
> 
> Regards,
> Anton
> On Wed, Jul 18, 2018 at 10:08 AM Andrew Pilloud wrote:
> > Hi Etienne,
> > 
> > I've been asking around and it sounds like we should be able to get a 
> > dedicated Jenkins node for performance tests.
> > Another thing that might help is making the runs a few times longer. They 
> > are currently running around 2 seconds
> > each, so the total time of the build probably exceeds testing. Internally 
> > at Google we are running them with 2000x
> > as many events on Dataflow, but a job of that size won't even complete on 
> > the Direct Runner.
> > I didn't see the query 3 issues, but now that you point it out it looks 
> > like a bug to me too.
> > 
> > Andrew
> > On Wed, Jul 18, 2018 at 1:13 AM Etienne Chauchot  
> > wrote:
> > > Hi Andrew,
> > > > Yes, I saw that. Apart from dedicating Jenkins nodes to Nexmark, I see no 
> > > > other way.
> > > > Also, did you see query 3 output size on direct runner? Should be a 
> > > > straight line and it is not, I'm wondering if
> > > > there is a problem with the state and timers implementation in the direct runner.
> > > > Etienne
> > > > On Tuesday, 17 July 2018 at 11:38 -0700, Andrew Pilloud wrote:
> > > > > I'm noticing the graphs are really noisy. It looks like we are running 
> > > > > these on shared Jenkins executors, so our
> > > > > perf tests are fighting with other builds for CPU. I've opened an issue 
> > > > > https://issues.apache.org/jira/browse/BEAM-4804
> > > > > and am wondering if anyone knows an easy fix to isolate these 
> > > > > jobs.
> > > > Andrew
> > > > On Fri, Jul 13, 2018 at 2:39 AM Łukasz Gajowy  
> > > > wrote:
> > > > > @Etienne: Nice to see the graphs! :)
> > > > > 
> > > > > @Ismael: Good idea, there's no document yet. I think we could create 
> > > > > a small google doc with instructions on
> > > > > how to do this.
> > > > > 
> > > > > > On Fri, 13 Jul 2018 at 10:46, Etienne Chauchot wrote:
> > > > > > > Hi, 
> > > > > > > @Andrew, this is because I did not find a way to set 2 scales on 
> > > > > > > the Y axis on the perfkit graphs. Indeed
> > > > > > > numResults varies from 1 to 100 000 and runtimeSec is usually 
> > > > > > > below 10s.
> > > > > > > Etienne
> > > > > > > On Thursday, 12 July 2018 at 12:04 -0700, Andrew Pilloud wrote:
> > > > > > > This is great, should make performance work much easier! I'm 
> > > > > > > going to get the Beam SQL Nexmark jobs
> > > > > > > publishing as well. (Opened 
> > > > > > > https://issues.apache.org/jira/browse/BEAM-4774 to track.) I 
> > > > > > > might take on the
> > > > > > > Dataflow runner as well if no one else volunteers.
> > > > > > > 
> > > > > > > > I am curious as to why you have two separate graphs for runtime 
> > > > > > > > and count rather than graphing
> > > > > > > runtime/count to get the throughput rate for each run? Or should 
> > > > > > > that be a third graph? Looks like it
> > > > > > > would just be a small tweak to the query in perfkit.
> > > > > > > Andrew
> > > > > > > On Thu, Jul 12, 2018 at 11:40 AM Pablo Estrada 
> > > > > > >  wrote:
> > > > > > > > > This is really cool Etienne : ) thanks for working on this. Out 
> > > > > > > > > of curiosity, do you know how often the
> > > > > > > > tests run on each runner?
> > > > > > > > 
> > > > > > > > Best
> > > > > > > > -P.
> > > > > > > > 
> > > > > > > > On Thu, Jul 12, 2018 at 2:15 AM Romain Manni-Bucau 
> > > > > > > >  wrote:
> > > > > > > > > Awesome Etienne, this is really important for the (user) 
> > > > > > > > > community to have that visibility since it is
> > > > > > > > > > one of the most important aspects of Beam's quality, kudos!
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Romain Manni-Bucau
> > > > > > > > > @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
> > > > > > > > > 
> > > > > > > > > > On Thu, 12 Jul 2018 at 10:59, Jean-Baptiste Onofré wrote:
> > > > > > > > > > > It's really great to have these dashboards and integration 
> > > > > > > > > > > in Jenkins!
> > > > > > > > > > > 
> > > > > > > > > > > Thanks Etienne for driving this!
> > > > > > > > > > > 
> > > > > > > > > > > Regards
> > > > > > > > > > > JB
> > > > > > > > > > > 
> > > > > > > > > > > On 11/07/2018 15:13, Etienne Chauchot wrote:
> > > > > > > > > > > > Hi guys,
> > > > > > > > > > > > 
> > > > > > > > > > > > I'm glad to announce that the CI of Beam has much 
> > > > > > > > > > > > improved! Indeed
> > > > > > > > > > > > Nexmark is now included in the perf

Re: [ANNOUNCEMENT] Nexmark included to the CI

2018-07-19 Thread Etienne Chauchot
Hi Andrew,
On Wednesday, 18 July 2018 at 10:08 -0700, Andrew Pilloud wrote:
> Hi Etienne,
> I've been asking around and it sounds like we should be able to get a 
> dedicated Jenkins node for performance tests. 

Cool !
> Another thing that might help is making the runs a few times longer. They are 
> currently running around 2 seconds each,
> so the total time of the build probably exceeds testing.

You mean increasing the size of the nexmark input collection? Currently it is 
set to 100 000 events and is narrowed down
to 10 000 for some queries (4 and 10, IIRC).
>  Internally at Google we are running them with 2000x as many events on 
> Dataflow, but a job of that size won't even
> complete on the Direct Runner.

Yes probably not
> I didn't see the query 3 issues, but now that you point it out it looks like 
> a bug to me too.

It looks like one to me too; I'll open a ticket.
Etienne
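
A minimal sketch of why longer runs might give steadier numbers, assuming each run pays a roughly fixed startup cost on top of the time spent actually processing events. The rate and overhead values below are made up for illustration; they are not measured Nexmark results, and the function names are hypothetical.

```python
def throughput(events: int, runtime_sec: float) -> float:
    """Events processed per second for one run."""
    if runtime_sec <= 0:
        raise ValueError("runtime_sec must be positive")
    return events / runtime_sec


def measured_runtime(events: int, true_rate: float, overhead_sec: float) -> float:
    """Wall-clock time of a run: fixed overhead plus processing time.

    Assumption: overhead (startup, teardown) does not depend on input size.
    """
    return overhead_sec + events / true_rate


def apparent_rate(events: int, true_rate: float, overhead_sec: float) -> float:
    """Throughput as it would appear on a graph of events / runtime."""
    return throughput(events, measured_runtime(events, true_rate, overhead_sec))


if __name__ == "__main__":
    # Hypothetical pipeline: 100 000 events/s with 1 s of fixed overhead.
    for n in (100_000, 1_000_000, 10_000_000):
        print(n, round(apparent_rate(n, 100_000.0, 1.0)))
```

With these made-up values the 100 000 event run reports half of the underlying rate, and the figure approaches the underlying rate as the input grows. Under the same assumption, a given amount of timing noise from a shared executor is also a smaller fraction of a longer run.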
> Andrew
> On Wed, Jul 18, 2018 at 1:13 AM Etienne Chauchot  wrote:
> > Hi Andrew,
> > Yes I saw that, except dedicating jenkins nodes to nexmark, I see no other 
> > way.
> > Also, did you see query 3 output size on direct runner? Should be a 
> > straight line and it is not, I'm wondering if
> > there is a problem with the state and timers impl in the direct runner.
> > Etienne
> > On Tuesday, 17 July 2018 at 11:38 -0700, Andrew Pilloud wrote:
> > > I'm noticing the graphs are really noisy. It looks like we are running 
> > > these on shared Jenkins executors, so our
> > > perf tests are fighting with other builds for CPU. I've opened an issue,
> > > https://issues.apache.org/jira/browse/BEAM-4804, and am wondering if
> > > anyone knows an easy fix to isolate these jobs.
> > > Andrew
> > > On Fri, Jul 13, 2018 at 2:39 AM Łukasz Gajowy  wrote:
> > > > @Etienne: Nice to see the graphs! :)
> > > > 
> > > > @Ismael: Good idea, there's no document yet. I think we could create a 
> > > > small google doc with instructions on how
> > > > to do this.
> > > > 
> > > > On Fri, 13 Jul 2018 at 10:46, Etienne Chauchot wrote:
> > > > > Hi, 
> > > > > @Andrew, this is because I did not find a way to set 2 scales on the 
> > > > > Y axis on the perfkit graphs. Indeed
> > > > > numResults varies from 1 to 100 000 and runtimeSec is usually below 
> > > > > 10s.
> > > > > Etienne
> > > > > On Thursday, 12 July 2018 at 12:04 -0700, Andrew Pilloud wrote:
> > > > > > This is great, should make performance work much easier! I'm going 
> > > > > > to get the Beam SQL Nexmark jobs
> > > > > > publishing as well. (Opened 
> > > > > > https://issues.apache.org/jira/browse/BEAM-4774 to track.) I might 
> > > > > > take on the
> > > > > > Dataflow runner as well if no one else volunteers.
> > > > > > 
> > > > > > I am curious as to why you have two separate graphs for runtime and 
> > > > > > count rather than graphing runtime/count
> > > > > > to get the throughput rate for each run? Or should that be a third 
> > > > > > graph? Looks like it would just be a
> > > > > > small tweak to the query in perfkit.
> > > > > > Andrew
> > > > > > On Thu, Jul 12, 2018 at 11:40 AM Pablo Estrada  
> > > > > > wrote:
> > > > > > > This is really cool Etienne : ) thanks for working on this. Out of 
> > > > > > > curiosity, do you know how often the
> > > > > > > tests run on each runner?
> > > > > > > 
> > > > > > > Best
> > > > > > > -P.
> > > > > > > 
> > > > > > > On Thu, Jul 12, 2018 at 2:15 AM Romain Manni-Bucau 
> > > > > > >  wrote:
> > > > > > > > Awesome Etienne, this is really important for the (user) 
> > > > > > > > community to have that visibility since it is
> > > > > > > > one of the most important aspects of Beam's quality, kudos!
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Romain Manni-Bucau
> > > > > > > > @rmannibucau |  Blog | Old Blog | Github | LinkedIn | Book
> > > > > > > > 
> > > > > > > > On Thu, 12 Jul 2018 at 10:59, Jean-Baptiste Onofré wrote:
> > > > > > > > > It's really great to have these dashboards and integration in 
> > > > > > > > > Jenkins!
> > > > > > > > > 
> > > > > > > > > Thanks Etienne for driving this!
> > > > > > > > > 
> > > > > > > > > Regards
> > > > > > > > > JB
> > > > > > > > > 
> > > > > > > > > On 11/07/2018 15:13, Etienne Chauchot wrote:
> > > > > > > > > > Hi guys,
> > > > > > > > > > 
> > > > > > > > > > I'm glad to announce that the CI of Beam has much improved!
> > > > > > > > > > Indeed Nexmark is now included in the perfkit dashboards.
> > > > > > > > > > 
> > > > > > > > > > At each commit on master, nexmark suites are run and plots 
> > > > > > > > > > are created on the graphs.

mongoio

2018-07-19 Thread Chaim Turkel
Hi,
  I am seeing very long load times when reading data from MongoDB.
Is there a way to diagnose this or fine-tune it?

chaim



Re: SQS source

2018-07-19 Thread Jean-Baptiste Onofré
Agree with Ismaël.

I would be more than happy to help on this one (as I contributed on AMQP
and JMS IOs ;)).

Regards
JB

On 19/07/2018 10:39, Ismaël Mejía wrote:
> Thanks for your interest John, it would be a really nice contribution
> to add SQS support.
> 
> Some context on the kinesis stuff:
> 
> The reason why kinesis is still in a separate module is mostly a
> licensing problem. Kinesis uses some native libraries that are
> published under a license that is not 100% Apache-compatible, and we are
> not allowed to shade and republish them. It seems there is a workaround
> now; for more details see
> https://issues.apache.org/jira/browse/BEAM-3549
> In any case, if SQS only needs the Apache-licensed aws-sdk
> deps, it is OK (and a good idea) to put it in the
> amazon-web-services module.
> 
> The kinesis connector is far more complex for multiple reasons. First,
> the raw version of the Amazon client libraries is not so ‘friendly’,
> and the guys who created KinesisIO had to do some workarounds to
> provide accurate checkpointing/watermarks. Since SQS is a much
> simpler system, you should probably be OK basing it on simpler sources
> like AMQP or JMS.
> 
> If you feel like it, please create the JIRA and don’t hesitate to ask
> questions if you find issues or if you need some review.
> 
> On Thu, Jul 19, 2018 at 12:55 AM Lukasz Cwik  wrote:
>>
>>
>>
>> On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis  
>> wrote:
>>>
>>> I need an SQS source for my project that is using beam. A brief search did 
>>> not turn up any in-progress work in this area. Please point me to the right 
>>> repo if I missed it.
>>
>>
>> To my knowledge there is none and nobody has marked it in progress on 
>> https://beam.apache.org/documentation/io/built-in/. It would be good to 
>> create a JIRA issue on https://issues.apache.org/ and send a PR to add SQS 
>> to the inprogress list referencing your JIRA. I added you as a contributor 
>> in JIRA so you should be able to assign yourself to any issues that you 
>> create.
>>
>>>
>>> Assuming there is no in-progress effort, I would like to contribute an 
>>> Amazon SQS source. I have a few questions before I begin.
>>
>>
>> Great, note that this is a good starting point for authoring an IO 
>> transform: https://beam.apache.org/documentation/io/authoring-overview/
>>
>>>
>>>
>>> It seems that the current AWS code is split into two different modules: 
>>> sdk/java/io/amazon-web-services which contains the S3FileSystem, 
>>> AwsOptions, etc, and sdk/java/io/kinesis which contains an unbounded source 
>>> based on a kinesis topic. I'd like to add this source to the 
>>> amazon-web-services module since I'd like to depend on AwsOptions. Does 
>>> adding this source to the amazon-web-services module make sense?
>>
>>
>> Putting it inside of amazon-web-services makes a lot of sense. The Google 
>> connectors all live within the one package and there has been discussion to 
>> consolidate all the AWS stuff under amazon-web-services.
>>
>>>
>>> Also, the kinesis source looks a touch more complex than other sources. 
>>> Both the JMS and AMQP sources look like better examples to follow. Which 
>>> existing source would be the best to model this contribution after?
>>
>>
>> Some of it has to do with how many ways a source can be read and how 
>> complicated the watermark tracking is, but it would be best if the IO authors 
>> comment on implementation details.
>>
>>>
>>> If anyone has put some thoughts into this, or better yet some code, I'd 
>>> appreciate hearing from you.
>>>
>>> Thanks!
>>>

-- 
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Vendoring / Shading Protobuf and gRPC

2018-07-19 Thread Ismaël Mejía
Luke, thanks for explaining the idea, that’s what I expected. I was a
bit confused after seeing that the proto transformation of the
Pipeline happens in runners/core-construction-java. I thought this was
intended (maybe to keep the SDK smaller). So what is the plan for the
future? Move all this machinery into sdks/java/core or a different module
and make sdks/java/core depend on that?

In any case this looks like a non-public API matter. What I would like
is to be vigilant about the exposure of Protobuf stuff in the public
API to avoid some of the issues we have had in the past (and still
have on GCPIO). Do we have this kind of validation in place after the
move to gradle (e.g. not exposing guava/protobuf in public APIs)?
On Wed, Jul 18, 2018 at 6:29 PM Lukasz Cwik  wrote:
>
> Ismael, the SDK should perform the pipeline translation to proto because I 
> expect the flow to be:
> User Code -> SDK -> Proto Translation -> Job API -> Runner
> I don't expect "runners" to live within the users process anymore (excluding 
> the direct runner). There will be one portable "runner" and it will be 
> responsible for communicating with the job management APIs. It shouldn't be 
> called a runner but for backwards compatibility it will behave like a runner 
> does today. Flink/Spark/... will all live on the other side of the job 
> management API.
>
> Thomas, I can run RemoteExecutionTest from commit 
> ae2bebaf8b277e99840fa63f1b95d828f2093d16 without needing to modify the 
> project/module structure in Intellij. Adding the jars manually only helps 
> with code completion.
>
> https://github.com/apache/beam/pull/5977 works around the duplicate content 
> root issue in Intellij. I have also run into the -Werror issue occasionally 
> and don't know any fix or why it gets triggered as it doesn't happen to me 
> all the time.
>
> On Tue, Jul 17, 2018 at 7:01 PM Thomas Weise  wrote:
>>
>> Thanks, the classpath order matters indeed.
>>
>> Still not able to run RemoteExecutionTest, but I was able to get the Flink 
>> portable test to work by adding the following to the top of the dependency 
>> list of beam-runners-flink_2.11_test
>>
>> vendor/sdks-java-extensions-protobuf/build/libs/beam-vendor-sdks-java-extensions-protobuf-2.6.0-SNAPSHOT.jar
>> model/fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>
>>
>> On Tue, Jul 17, 2018 at 6:00 PM Ankur Goenka  wrote:
>>>
>>> Yes, I am able to run it.
>>>
>>> For tests, you also need to add dependencies to 
>>> ":beam-runners-java-fn-execution/beam-runners-java-fn-execution_test" 
>>> module.
>>>
>>> Also, I only added
>>> :beam-model-job-management-2.6.0-SNAPSHOT.jar
>>> :beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>> to the dependencies manually so not sure if you want to add
>>> io.grpc:grpc-core:1.12.0 and com.google.protobuf:protobuf-java:3.5.1 to the 
>>> dependencies.
>>>
>>> Note, you need to move them up in the dependencies list.
>>>
>>>
>>> On Tue, Jul 17, 2018 at 5:54 PM Thomas Weise  wrote:

>>>> Are you able to run 
>>>> org.apache.beam.runners.fnexecution.control.RemoteExecutionTest from 
>>>> within Intellij ?
>>>>
>>>> I can get the compile errors to disappear by adding 
>>>> beam-model-job-management-2.6.0-SNAPSHOT.jar, io.grpc:grpc-core:1.12.0 and 
>>>> com.google.protobuf:protobuf-java:3.5.1
>>>>
>>>> Running the test still fails since other dependencies are missing.
>>>>
>>>>
>>>> On Tue, Jul 17, 2018 at 4:02 PM Ankur Goenka  wrote:
>>>>>
>>>>> For reference:
>>>>> I was able to make intellij work with the master by doing following steps
>>>>>
>>>>> Remove module :beam:vendor-sdks-java-extensions-protobuf from intellij.
>>>>> Adding 
>>>>> :beam-model-fn-execution/build/libs/beam-model-fn-execution-2.6.0-SNAPSHOT.jar
>>>>>  and 
>>>>> :beam-model-job-management/build/libs/beam-model-job-management-2.6.0-SNAPSHOT.jar
>>>>>  to the appropriate modules at the top of the dependency list.
>>>>>
>>>>>
>>>>> On Tue, Jul 17, 2018 at 2:29 PM Thomas Weise  wrote:
>>>>>>
>>>>>> Adding the external jar in Intellij (2018.1) currently fails due to a 
>>>>>> duplicate source directory (sdks/java/extensions/protobuf/src/main/java).
>>>>>>
>>>>>> The build as such also fails, with:  error: warnings found and -Werror 
>>>>>> specified
>>>>>>
>>>>>> Ismaël found removing 
>>>>>> https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L538
>>>>>>  as workaround.
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 12, 2018 at 1:55 PM Ismaël Mejía  wrote:
>>>>>>>
>>>>>>> Seems reasonable, but why exactly may we need the model (or protobuf
>>>>>>> related things) in the future in the SDK ? wasn’t it supposed to be
>>>>>>> translated into the Pipeline proto representation via the runners (and
>>>>>>> in this case the dep reside in the runner side) ?
>>>>>>> On Thu, Jul 12, 2018 at 2:50 AM Lukasz Cwik  wrote:
>>>>>>> >
>>>>>>> > Got a fix[1] for Andrews issue which turned out to be a release 
>>>>>>> > blocker since it broke performing the r

Re: SQS source

2018-07-19 Thread Ismaël Mejía
Thanks for your interest John, it would be a really nice contribution
to add SQS support.

Some context on the kinesis stuff:

The reason why kinesis is still in a separate module is mostly a
licensing problem. Kinesis uses some native libraries that are
published under a license that is not 100% Apache-compatible, and we are
not allowed to shade and republish them. It seems there is a workaround
now; for more details see
https://issues.apache.org/jira/browse/BEAM-3549
In any case, if SQS only needs the Apache-licensed aws-sdk
deps, it is OK (and a good idea) to put it in the
amazon-web-services module.

The kinesis connector is far more complex for multiple reasons. First,
the raw version of the Amazon client libraries is not so ‘friendly’,
and the guys who created KinesisIO had to do some workarounds to
provide accurate checkpointing/watermarks. Since SQS is a much
simpler system, you should probably be OK basing it on simpler sources
like AMQP or JMS.

If you feel like it, please create the JIRA and don’t hesitate to ask
questions if you find issues or if you need some review.
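
To make the checkpointing/watermark point a bit more concrete, here is a rough, self-contained sketch of the ack-on-finalize pattern a simple queue reader could follow. It is plain Python with an in-memory stand-in for the queue: FakeQueue, QueueReader and Checkpoint are hypothetical names, not Beam or AWS SDK APIs, and the watermark rule (latest timestamp seen so far) is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Message:
    receipt: str      # handle used to delete (acknowledge) the message
    body: str
    timestamp: float  # event time in seconds


class FakeQueue:
    """In-memory stand-in for a queue with receive/delete semantics."""

    def __init__(self, messages: List[Message]) -> None:
        self._visible: List[Message] = list(messages)
        self._in_flight: Dict[str, Message] = {}

    def receive(self) -> Optional[Message]:
        if not self._visible:
            return None
        msg = self._visible.pop(0)
        self._in_flight[msg.receipt] = msg  # received but not yet deleted
        return msg

    def delete(self, receipt: str) -> None:
        self._in_flight.pop(receipt, None)

    def in_flight_count(self) -> int:
        return len(self._in_flight)


class Checkpoint:
    """Remembers which messages were read; deletes them only on finalize."""

    def __init__(self, queue: FakeQueue, receipts: List[str]) -> None:
        self._queue = queue
        self._receipts = receipts

    def finalize(self) -> None:
        for receipt in self._receipts:
            self._queue.delete(receipt)


class QueueReader:
    def __init__(self, queue: FakeQueue) -> None:
        self._queue = queue
        self._pending: List[str] = []  # read since the last checkpoint
        self.watermark = float("-inf")

    def advance(self) -> Optional[Message]:
        msg = self._queue.receive()
        if msg is None:
            return None
        self._pending.append(msg.receipt)
        # Simplified watermark: the latest timestamp observed so far.
        self.watermark = max(self.watermark, msg.timestamp)
        return msg

    def checkpoint(self) -> Checkpoint:
        receipts, self._pending = self._pending, []
        return Checkpoint(self._queue, receipts)
```

In this sketch messages stay in flight until the checkpoint is finalized, so a reader that fails before finalizing never deletes them. A real source would also have to deal with redelivery after a visibility timeout and with out-of-order timestamps, both of which are left out here.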

On Thu, Jul 19, 2018 at 12:55 AM Lukasz Cwik  wrote:
>
>
>
> On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis  
> wrote:
>>
>> I need an SQS source for my project that is using beam. A brief search did 
>> not turn up any in-progress work in this area. Please point me to the right 
>> repo if I missed it.
>
>
> To my knowledge there is none and nobody has marked it in progress on 
> https://beam.apache.org/documentation/io/built-in/. It would be good to 
> create a JIRA issue on https://issues.apache.org/ and send a PR to add SQS to 
> the inprogress list referencing your JIRA. I added you as a contributor in 
> JIRA so you should be able to assign yourself to any issues that you create.
>
>>
>> Assuming there is no in-progress effort, I would like to contribute an 
>> Amazon SQS source. I have a few questions before I begin.
>
>
> Great, note that this is a good starting point for authoring an IO transform: 
> https://beam.apache.org/documentation/io/authoring-overview/
>
>>
>>
>> It seems that the current AWS code is split into two different modules: 
>> sdk/java/io/amazon-web-services which contains the S3FileSystem, AwsOptions, 
>> etc, and sdk/java/io/kinesis which contains an unbounded source based on a 
>> kinesis topic. I'd like to add this source to the amazon-web-services module 
>> since I'd like to depend on AwsOptions. Does adding this source to the 
>> amazon-web-services module make sense?
>
>
> Putting it inside of amazon-web-services makes a lot of sense. The Google 
> connectors all live within the one package and there has been discussion to 
> consolidate all the AWS stuff under amazon-web-services.
>
>>
>> Also, the kinesis source looks a touch more complex than other sources. Both 
>> the JMS and AMQP sources look like better examples to follow. Which existing 
>> source would be the best to model this contribution after?
>
>
> Some of it has to do with how many ways a source can be read and how 
> complicated the watermark tracking is, but it would be best if the IO authors 
> comment on implementation details.
>
>>
>> If anyone has put some thoughts into this, or better yet some code, I'd 
>> appreciate hearing from you.
>>
>> Thanks!
>>


Jenkins build is back to normal : beam_Release_Gradle_NightlySnapshot #105

2018-07-19 Thread Apache Jenkins Server
See