Re: Virtualenv setup issues on new machine

2019-02-28 Thread Ankur Goenka
Not sure, the dot seems to be the only meaningful difference between the two scenarios.

On Thu, Feb 28, 2019 at 8:04 PM Udi Meiri  wrote:

> Weird. Is that a known bug?
>
> On Thu, Feb 28, 2019 at 3:19 PM Ankur Goenka  wrote:
>
>> The issue seems to be with "." in the virtualenv path.
>> virtualenv works after moving from
>> "/usr/local/google/home/goenka/.local/bin" to "/usr/bin"
>>
>> On Thu, Feb 28, 2019 at 2:57 PM Udi Meiri  wrote:
>>
>>> I think gradle is complaining that the path can't be found.
>>> Is there more information if you run it with --info?
>>>
>>> On Thu, Feb 28, 2019, 14:35 Ankur Goenka  wrote:
>>>
 Hi Beamers,

I am trying to build the Python SDK from a fresh git checkout on a new Linux
machine, but the setupVirtualEnv task is failing with the error below. The
complete build scan is at
https://scans.gradle.com/s/h3jwzeg5aralk/failure?openFailures=WzBd=WzQsM10#top=0

From the error it seems that gradle is trying to find the virtualenv
command in the beam/python folder.
I am able to run virtualenv from bash directly, and PATH seems to be
set up correctly.

Any pointers about what might be happening?


org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':beam-sdks-python:setupVirtualenv'.
Caused by: org.gradle.process.internal.ExecException: A problem occurred starting process 'command 'virtualenv''
Caused by: net.rubygrapefruit.platform.NativeException: Could not start 'virtualenv'
Caused by: java.io.IOException: Cannot run program "virtualenv" (in directory "/tmp/beam/beam/sdks/python"): error=2, No such file or directory
at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
Caused by: java.io.IOException: error=2, No such file or directory
at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)

>>>


Re: Virtualenv setup issues on new machine

2019-02-28 Thread Udi Meiri
Weird. Is that a known bug?

On Thu, Feb 28, 2019 at 3:19 PM Ankur Goenka  wrote:

> The issue seems to be with "." in the virtualenv path.
> virtualenv works after moving from
> "/usr/local/google/home/goenka/.local/bin" to "/usr/bin"
>
> On Thu, Feb 28, 2019 at 2:57 PM Udi Meiri  wrote:
>
>> I think gradle is complaining that the path can't be found.
>> Is there more information if you run it with --info?
>>
>> On Thu, Feb 28, 2019, 14:35 Ankur Goenka  wrote:
>>
>>> Hi Beamers,
>>>
>>> I am trying to build the Python SDK from a fresh git checkout on a new Linux
>>> machine, but the setupVirtualEnv task is failing with the error below. The
>>> complete build scan is at
>>> https://scans.gradle.com/s/h3jwzeg5aralk/failure?openFailures=WzBd=WzQsM10#top=0
>>>
>>> From the error it seems that gradle is trying to find the virtualenv
>>> command in the beam/python folder.
>>> I am able to run virtualenv from bash directly, and PATH seems to be
>>> set up correctly.
>>>
>>> Any pointers about what might be happening?
>>>
>>>
>>> org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':beam-sdks-python:setupVirtualenv'.
>>> Caused by: org.gradle.process.internal.ExecException: A problem occurred starting process 'command 'virtualenv''
>>> Caused by: net.rubygrapefruit.platform.NativeException: Could not start 'virtualenv'
>>> Caused by: java.io.IOException: Cannot run program "virtualenv" (in directory "/tmp/beam/beam/sdks/python"): error=2, No such file or directory
>>> at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
>>> at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
>>> at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
>>> at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
>>> at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
>>> at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
>>> at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
>>> Caused by: java.io.IOException: error=2, No such file or directory
>>> at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
>>> at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
>>> at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
>>> at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
>>> at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
>>> at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
>>> at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
>>>
>>




Re: CVE audit gradle plugin

2019-02-28 Thread Ahmet Altay
Thank you, I agree this is very important. Does anyone know a similar tool
for Python and Go?

On Thu, Feb 28, 2019 at 8:26 AM Etienne Chauchot 
wrote:

> Hi guys,
>
> I came by this [1] gradle plugin that is a client to the Sonatype OSS
> Index CVE database.
>
> I have set it up here in a branch [2], though the cache is not configured
> and the number of requests is limited. It can be run with "gradle --info
> audit"
>
> It would be nice to have something like this to track the CVEs in the libs
> we use. I know we have been spammed by automatic lib-upgrade requests in
> the past, but CVEs are more important IMHO.
>
> This plugin is licensed under BSD-3-Clause, which is compatible with the
> Apache v2 license [3]
>
> WDYT ?
>
> Etienne
>
> [1] https://github.com/OSSIndex/ossindex-gradle-plugin
> [2] https://github.com/echauchot/beam/tree/cve_audit_plugin
> [3] https://www.apache.org/legal/resolved.html
>


Re: [ANNOUNCE] New committer announcement: Michael Luckey

2019-02-28 Thread Daniel Oliveira
Congrats Michael!

On Thu, Feb 28, 2019 at 3:12 AM Maximilian Michels  wrote:

> Welcome, it's great to have you onboard Michael!
>
> On 28.02.19 11:46, Michael Luckey wrote:
> > Thanks to all of you for the warm welcome. Really happy to be part of
> > this great community!
> >
> > michel
> >
> > On Thu, Feb 28, 2019 at 8:39 AM David Morávek wrote:
> >
> > Congrats Michael! 
> >
> > D.
> >
> >  > On 28 Feb 2019, at 03:27, Ismaël Mejía wrote:
> >  >
> >  > Congratulations Michael, and thanks for all the contributions!
> >  >
> >  >> On Wed, Feb 27, 2019 at 6:30 PM Ankur Goenka wrote:
> >  >>
> >  >> Congratulations Michael!
> >  >>
> >  >>> On Wed, Feb 27, 2019 at 2:25 PM Thomas Weise <thomas.we...@gmail.com> wrote:
> >  >>>
> >  >>> Congrats Michael!
> >  >>>
> >  >>>
> >   On Wed, Feb 27, 2019 at 12:41 PM Gleb Kanterov <g...@spotify.com> wrote:
> >  
> >   Congratulations and welcome!
> >  
> >  > On Wed, Feb 27, 2019 at 8:57 PM Connell O'Callaghan <conne...@google.com> wrote:
> >  >
> >  > Excellent thank you for sharing Kenn!!!
> >  >
> >  > Michael congratulations for this recognition of your
> > contributions to advancing BEAM
> >  >
> >  >> On Wed, Feb 27, 2019 at 11:52 AM Kenneth Knowles <k...@apache.org> wrote:
> >  >>
> >  >> Hi all,
> >  >>
> >  >> Please join me and the rest of the Beam PMC in welcoming a
> > new committer: Michael Luckey
> >  >>
> >  >> Michael has been contributing to Beam since early 2017. He
> > has fixed many build and developer environment issues, noted and
> > root-caused breakages on master, generously reviewed many others'
> > changes to the build. In consideration of Michael's contributions,
> > the Beam PMC trusts Michael with the responsibilities of a Beam
> > committer [1].
> >  >>
> >  >> Thank you, Michael, for your contributions.
> >  >>
> >  >> Kenn
> >  >>
> >  >> [1]
> >
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> >  
> >  
> >  
> >   --
> >   Cheers,
> >   Gleb
> >
>


Re: Virtualenv setup issues on new machine

2019-02-28 Thread Ankur Goenka
The issue seems to be with "." in the virtualenv path.
virtualenv works after moving from
"/usr/local/google/home/goenka/.local/bin" to "/usr/bin"

On Thu, Feb 28, 2019 at 2:57 PM Udi Meiri  wrote:

> I think gradle is complaining that the path can't be found.
> Is there more information if you run it with --info?
>
> On Thu, Feb 28, 2019, 14:35 Ankur Goenka  wrote:
>
>> Hi Beamers,
>>
>> I am trying to build the Python SDK from a fresh git checkout on a new Linux
>> machine, but the setupVirtualEnv task is failing with the error below. The
>> complete build scan is at
>> https://scans.gradle.com/s/h3jwzeg5aralk/failure?openFailures=WzBd=WzQsM10#top=0
>>
>> From the error it seems that gradle is trying to find the virtualenv
>> command in the beam/python folder.
>> I am able to run virtualenv from bash directly, and PATH seems to be
>> set up correctly.
>>
>> Any pointers about what might be happening?
>>
>>
>> org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':beam-sdks-python:setupVirtualenv'.
>> Caused by: org.gradle.process.internal.ExecException: A problem occurred starting process 'command 'virtualenv''
>> Caused by: net.rubygrapefruit.platform.NativeException: Could not start 'virtualenv'
>> Caused by: java.io.IOException: Cannot run program "virtualenv" (in directory "/tmp/beam/beam/sdks/python"): error=2, No such file or directory
>> at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
>> at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
>> at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
>> at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
>> at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
>> at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
>> at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
>> Caused by: java.io.IOException: error=2, No such file or directory
>> at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
>> at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
>> at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
>> at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
>> at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
>> at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
>> at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
>>
>


Re: KafkaIO Exactly-Once & Flink Runner

2019-02-28 Thread Raghu Angadi
On Thu, Feb 28, 2019 at 2:42 PM Raghu Angadi  wrote:

> On Thu, Feb 28, 2019 at 2:34 PM Kenneth Knowles  wrote:
>
>> I'm not sure what a hard fail is. I probably have a shallow
>> understanding, but doesn't @RequiresStableInput work for 2PC? The
>> preCommit() phase should establish the transaction and commit() is not
>> called until after checkpoint finalization. Can you describe the way that
>> it does not work a little bit more?
>>
>
> - preCommit() is called before the checkpoint. Kafka EOS in Flink starts the
> transaction before this and makes sure it flushes all records in
> preCommit(). So far so good.
> - commit() is called after the checkpoint is persisted. Now, imagine commit()
> fails for some reason. There is no option to rerun the 1st phase to write
> the records again in a new transaction. This is a hard failure for the
> job. In practice Flink might attempt to commit again (not sure how many
> times), which is likely to fail and eventually results in job failure.
>

In Apache Beam, the records could be stored in state and written inside
commit() to work around this issue. This could have scalability issues
if checkpoints are not frequent enough in the Flink runner.
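
A rough sketch of that buffering idea, with illustrative names and toy types (this is not actual Beam or KafkaIO code): records are staged in Beam state so that a retried commit can rebuild the Kafka transaction from state instead of needing the 1st phase to rerun.

import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

class BufferUntilCommitFn extends DoFn<KV<Integer, String>, Void> {

  // Per-key bag holding records that have not yet been written to Kafka.
  @StateId("buffered")
  private final StateSpec<BagState<String>> bufferedSpec = StateSpecs.bag();

  @ProcessElement
  public void processElement(
      ProcessContext ctx, @StateId("buffered") BagState<String> buffered) {
    // Stage only; nothing goes to Kafka until the commit signal.
    buffered.add(ctx.element().getValue());
  }

  // On the commit signal (e.g. checkpoint finalization), a real implementation
  // would open a Kafka transaction, write everything in "buffered", commit,
  // and clear the state; a retried commit replays the same staged records.
}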

Raghu.


>
>
>> Kenn
>>
>> On Thu, Feb 28, 2019 at 1:25 PM Raghu Angadi  wrote:
>>
>>> On Thu, Feb 28, 2019 at 11:01 AM Kenneth Knowles 
>>> wrote:
>>>
 I believe the way you would implement the logic behind Flink's
 KafkaProducer would be to have two steps:

 1. Start transaction
 2. @RequiresStableInput Close transaction

>>>
>>> I see. What happens if closing the transaction fails in (2)? Flink's
>>> 2PC requires that commit() should never hard fail once preCommit()
>>> succeeds. I think that is the cost of not having an extra shuffle. It is
>>> alright since this policy has worked well for Flink so far.
>>>
>>> Overall, it will be great to have @RequiresStableInput support in Flink
>>> runner.
>>>
>>> Raghu.
>>>
 The FlinkRunner would need to insert the "wait until checkpoint
 finalization" logic wherever it sees @RequiresStableInput, which is already
 what it would have to do.

 This matches the KafkaProducer's logic - delay closing the transaction
 until checkpoint finalization. This answers my main question, which is "is
 @RequiresStableInput expressive enough to allow Beam-on-Flink to have
 exactly once behavior with the same performance characteristics as native
 Flink checkpoint finalization?"

 Kenn

 [1] https://github.com/apache/beam/pull/7955

 On Thu, Feb 28, 2019 at 10:43 AM Reuven Lax  wrote:

>
>
> On Thu, Feb 28, 2019 at 10:41 AM Raghu Angadi 
> wrote:
>
>>
>> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
>>> KafkaProducer supports exactly-once. It simply commits the pending
>>> transaction once it has completed a checkpoint.
>>
>>
>>
>> On Thu, Feb 28, 2019 at 9:59 AM Maximilian Michels 
>> wrote:
>>
>>> Hi,
>>>
>>> I came across KafkaIO's Runner whitelist [1] for enabling
>>> exactly-once
>>> semantics (EOS). I think it is questionable to exclude Runners from
>>> inside a transform, but I see that the intention was to save users
>>> from
>>> surprises.
>>>
>>> Now why does the Flink Runner not support KafkaIO EOS? Flink's
>>> native
>>> KafkaProducer supports exactly-once. It simply commits the pending
>>> transaction once it has completed a checkpoint.
>>>
>>
>>
>> When we discussed this in Aug 2017, the understanding was that the 2-phase
>> commit utility in Flink used to implement Flink's Kafka EOS could not
>> be implemented in Beam's context.
>> See this message in that dev thread. Has anything changed in this regard?
>> The whole thread is relevant to this topic and worth going through.
>>
>
> I think that TwoPhaseCommit utility class wouldn't work. The Flink
> runner would probably want to directly use notifySnapshotComplete in order
> to implement @RequiresStableInput.
>
>>
>>
>>>
>>> A checkpoint is realized by sending barriers through all channels
>>> starting from the source until reaching all sinks. Every operator
>>> persists its state once it has received a barrier on all its input
>>> channels, it then forwards it to the downstream operators.
>>>
>>> The architecture of Beam's KafkaExactlyOnceSink is as follows[2]:
>>>
>>> Input -> AssignRandomShardIds -> GroupByKey -> AssignSequenceIds ->
>>> GroupByKey -> ExactlyOnceWriter
>>>
>>> As I understood, Spark or Dataflow use the GroupByKey stages to
>>> persist
>>> the input. That is not required in Flink to be able to take a
>>> consistent
>>> snapshot of the pipeline.
>>>
>>> Basically, for Flink we don't need any of 

Re: Virtualenv setup issues on new machine

2019-02-28 Thread Udi Meiri
I think gradle is complaining that the path can't be found.
Is there more information if you run it with --info?

On Thu, Feb 28, 2019, 14:35 Ankur Goenka  wrote:

> Hi Beamers,
>
> I am trying to build the Python SDK from a fresh git checkout on a new Linux
> machine, but the setupVirtualEnv task is failing with the error below. The
> complete build scan is at
> https://scans.gradle.com/s/h3jwzeg5aralk/failure?openFailures=WzBd=WzQsM10#top=0
>
> From the error it seems that gradle is trying to find the virtualenv
> command in the beam/python folder.
> I am able to run virtualenv from bash directly, and PATH seems to be
> set up correctly.
>
> Any pointers about what might be happening?
>
>
> org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':beam-sdks-python:setupVirtualenv'.
> Caused by: org.gradle.process.internal.ExecException: A problem occurred starting process 'command 'virtualenv''
> Caused by: net.rubygrapefruit.platform.NativeException: Could not start 'virtualenv'
> Caused by: java.io.IOException: Cannot run program "virtualenv" (in directory "/tmp/beam/beam/sdks/python"): error=2, No such file or directory
> at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
> at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
> at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
> at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
> at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
> at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
> at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
> Caused by: java.io.IOException: error=2, No such file or directory
> at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
> at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
> at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
> at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
> at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
> at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
> at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
>




Re: KafkaIO Exactly-Once & Flink Runner

2019-02-28 Thread Raghu Angadi
On Thu, Feb 28, 2019 at 2:34 PM Kenneth Knowles  wrote:

> I'm not sure what a hard fail is. I probably have a shallow understanding,
> but doesn't @RequiresStableInput work for 2PC? The preCommit() phase should
> establish the transaction and commit() is not called until after checkpoint
> finalization. Can you describe the way that it does not work a little bit
> more?
>

- preCommit() is called before the checkpoint. Kafka EOS in Flink starts the
transaction before this and makes sure it flushes all records in
preCommit(). So far so good.
- commit() is called after the checkpoint is persisted. Now, imagine commit()
fails for some reason. There is no option to rerun the 1st phase to write
the records again in a new transaction. This is a hard failure for the
job. In practice Flink might attempt to commit again (not sure how many
times), which is likely to fail and eventually results in job failure.
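
For readers less familiar with Flink, here is a paraphrase of the 2PC contract being discussed (a standalone interface for illustration only; the method names mirror Flink's TwoPhaseCommitSinkFunction, but this is not Flink code):

interface TwoPhaseCommit<IN, TXN> {
  TXN beginTransaction() throws Exception;         // open a new transaction
  void invoke(TXN txn, IN value) throws Exception; // write records into the open txn
  void preCommit(TXN txn) throws Exception;        // flush; runs before the checkpoint completes
  void commit(TXN txn);                            // runs after the checkpoint is persisted;
                                                   // must not hard-fail, since phase 1 cannot rerun
  void abort(TXN txn);                             // roll back on failure before the checkpoint
}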


> Kenn
>
> On Thu, Feb 28, 2019 at 1:25 PM Raghu Angadi  wrote:
>
>> On Thu, Feb 28, 2019 at 11:01 AM Kenneth Knowles  wrote:
>>
>>> I believe the way you would implement the logic behind Flink's
>>> KafkaProducer would be to have two steps:
>>>
>>> 1. Start transaction
>>> 2. @RequiresStableInput Close transaction
>>>
>>
>> I see. What happens if closing the transaction fails in (2)? Flink's 2PC
>> requires that commit() should never hard fail once preCommit() succeeds. I
>> think that is the cost of not having an extra shuffle. It is alright since
>> this policy has worked well for Flink so far.
>>
>> Overall, it will be great to have @RequiresStableInput support in Flink
>> runner.
>>
>> Raghu.
>>
>>> The FlinkRunner would need to insert the "wait until checkpoint
>>> finalization" logic wherever it sees @RequiresStableInput, which is already
>>> what it would have to do.
>>>
>>> This matches the KafkaProducer's logic - delay closing the transaction
>>> until checkpoint finalization. This answers my main question, which is "is
>>> @RequiresStableInput expressive enough to allow Beam-on-Flink to have
>>> exactly once behavior with the same performance characteristics as native
>>> Flink checkpoint finalization?"
>>>
>>> Kenn
>>>
>>> [1] https://github.com/apache/beam/pull/7955
>>>
>>> On Thu, Feb 28, 2019 at 10:43 AM Reuven Lax  wrote:
>>>


 On Thu, Feb 28, 2019 at 10:41 AM Raghu Angadi  wrote:

>
> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
>> KafkaProducer supports exactly-once. It simply commits the pending
>> transaction once it has completed a checkpoint.
>
>
>
> On Thu, Feb 28, 2019 at 9:59 AM Maximilian Michels 
> wrote:
>
>> Hi,
>>
>> I came across KafkaIO's Runner whitelist [1] for enabling
>> exactly-once
>> semantics (EOS). I think it is questionable to exclude Runners from
>> inside a transform, but I see that the intention was to save users
>> from
>> surprises.
>>
>> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
>> KafkaProducer supports exactly-once. It simply commits the pending
>> transaction once it has completed a checkpoint.
>>
>
>
> When we discussed this in Aug 2017, the understanding was that the 2-phase
> commit utility in Flink used to implement Flink's Kafka EOS could not be
> implemented in Beam's context.
> See this message in that dev thread. Has anything changed in this regard?
> The whole thread is relevant to this topic and worth going through.
>

 I think that TwoPhaseCommit utility class wouldn't work. The Flink
 runner would probably want to directly use notifySnapshotComplete in order
 to implement @RequiresStableInput.

>
>
>>
>> A checkpoint is realized by sending barriers through all channels
>> starting from the source until reaching all sinks. Every operator
>> persists its state once it has received a barrier on all its input
>> channels, it then forwards it to the downstream operators.
>>
>> The architecture of Beam's KafkaExactlyOnceSink is as follows[2]:
>>
>> Input -> AssignRandomShardIds -> GroupByKey -> AssignSequenceIds ->
>> GroupByKey -> ExactlyOnceWriter
>>
>> As I understood, Spark or Dataflow use the GroupByKey stages to
>> persist
>> the input. That is not required in Flink to be able to take a
>> consistent
>> snapshot of the pipeline.
>>
>> Basically, for Flink we don't need any of that magic that KafkaIO
>> does.
>> What we would need to support EOS is a way to tell the
>> ExactlyOnceWriter
>> (a DoFn) to commit once a checkpoint has completed.
>
> I know that the new version of SDF supports checkpointing which should
>> solve this issue. But there is still a lot of work to do to make this
>> reality.
>>
>
I don't see how SDF solves this problem. Maybe 

Virtualenv setup issues on new machine

2019-02-28 Thread Ankur Goenka
Hi Beamers,

I am trying to build the Python SDK from a fresh git checkout on a new Linux
machine, but the setupVirtualEnv task is failing with the error below. The
complete build scan is at
https://scans.gradle.com/s/h3jwzeg5aralk/failure?openFailures=WzBd=WzQsM10#top=0

From the error it seems that gradle is trying to find the virtualenv
command in the beam/python folder.
I am able to run virtualenv from bash directly, and PATH seems to be
set up correctly.

Any pointers about what might be happening?


org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':beam-sdks-python:setupVirtualenv'.
Caused by: org.gradle.process.internal.ExecException: A problem occurred starting process 'command 'virtualenv''
Caused by: net.rubygrapefruit.platform.NativeException: Could not start 'virtualenv'
Caused by: java.io.IOException: Cannot run program "virtualenv" (in directory "/tmp/beam/beam/sdks/python"): error=2, No such file or directory
at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)
Caused by: java.io.IOException: error=2, No such file or directory
at net.rubygrapefruit.platform.internal.DefaultProcessLauncher.start(DefaultProcessLauncher.java:25)
at net.rubygrapefruit.platform.internal.WrapperProcessLauncher.start(WrapperProcessLauncher.java:36)
at org.gradle.process.internal.ExecHandleRunner.run(ExecHandleRunner.java:67)
at org.gradle.internal.operations.CurrentBuildOperationPreservingRunnable.run(CurrentBuildOperationPreservingRunnable.java:42)
at org.gradle.internal.concurrent.ExecutorPolicy$CatchAndRecordFailures.onExecute(ExecutorPolicy.java:63)
at org.gradle.internal.concurrent.ManagedExecutorImpl$1.run(ManagedExecutorImpl.java:46)
at org.gradle.internal.concurrent.ThreadFactoryImpl$ManagedThreadRunnable.run(ThreadFactoryImpl.java:55)


Re: KafkaIO Exactly-Once & Flink Runner

2019-02-28 Thread Kenneth Knowles
I'm not sure what a hard fail is. I probably have a shallow understanding,
but doesn't @RequiresStableInput work for 2PC? The preCommit() phase should
establish the transaction and commit() is not called until after checkpoint
finalization. Can you describe the way that it does not work a little bit
more?

Kenn

On Thu, Feb 28, 2019 at 1:25 PM Raghu Angadi  wrote:

> On Thu, Feb 28, 2019 at 11:01 AM Kenneth Knowles  wrote:
>
>> I believe the way you would implement the logic behind Flink's
>> KafkaProducer would be to have two steps:
>>
>> 1. Start transaction
>> 2. @RequiresStableInput Close transaction
>>
>
> I see. What happens if closing the transaction fails in (2)? Flink's 2PC
> requires that commit() should never hard fail once preCommit() succeeds. I
> think that is the cost of not having an extra shuffle. It is alright since
> this policy has worked well for Flink so far.
>
> Overall, it will be great to have @RequiresStableInput support in Flink
> runner.
>
> Raghu.
>
>> The FlinkRunner would need to insert the "wait until checkpoint
>> finalization" logic wherever it sees @RequiresStableInput, which is already
>> what it would have to do.
>>
>> This matches the KafkaProducer's logic - delay closing the transaction
>> until checkpoint finalization. This answers my main question, which is "is
>> @RequiresStableInput expressive enough to allow Beam-on-Flink to have
>> exactly once behavior with the same performance characteristics as native
>> Flink checkpoint finalization?"
>>
>> Kenn
>>
>> [1] https://github.com/apache/beam/pull/7955
>>
>> On Thu, Feb 28, 2019 at 10:43 AM Reuven Lax  wrote:
>>
>>>
>>>
>>> On Thu, Feb 28, 2019 at 10:41 AM Raghu Angadi  wrote:
>>>

 Now why does the Flink Runner not support KafkaIO EOS? Flink's native
> KafkaProducer supports exactly-once. It simply commits the pending
> transaction once it has completed a checkpoint.



 On Thu, Feb 28, 2019 at 9:59 AM Maximilian Michels 
 wrote:

> Hi,
>
> I came across KafkaIO's Runner whitelist [1] for enabling exactly-once
> semantics (EOS). I think it is questionable to exclude Runners from
> inside a transform, but I see that the intention was to save users
> from
> surprises.
>
> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
> KafkaProducer supports exactly-once. It simply commits the pending
> transaction once it has completed a checkpoint.
>


When we discussed this in Aug 2017, the understanding was that the 2-phase
commit utility in Flink used to implement Flink's Kafka EOS could not be
implemented in Beam's context.
See this message in that dev thread. Has anything changed in this regard?
The whole thread is relevant to this topic and worth going through.

>>>
>>> I think that TwoPhaseCommit utility class wouldn't work. The Flink
>>> runner would probably want to directly use notifySnapshotComplete in order
>>> to implement @RequiresStableInput.
>>>


>
> A checkpoint is realized by sending barriers through all channels
> starting from the source until reaching all sinks. Every operator
> persists its state once it has received a barrier on all its input
> channels, it then forwards it to the downstream operators.
>
> The architecture of Beam's KafkaExactlyOnceSink is as follows[2]:
>
> Input -> AssignRandomShardIds -> GroupByKey -> AssignSequenceIds ->
> GroupByKey -> ExactlyOnceWriter
>
> As I understood, Spark or Dataflow use the GroupByKey stages to
> persist
> the input. That is not required in Flink to be able to take a
> consistent
> snapshot of the pipeline.
>
> Basically, for Flink we don't need any of that magic that KafkaIO
> does.
> What we would need to support EOS is a way to tell the
> ExactlyOnceWriter
> (a DoFn) to commit once a checkpoint has completed.

 I know that the new version of SDF supports checkpointing which should
> solve this issue. But there is still a lot of work to do to make this
> reality.
>

I don't see how SDF solves this problem. Maybe pseudo code would make it
more clear. But if it helps, that is great!

 So I think it would make sense to think about a way to make KafkaIO's
> EOS more accessible to Runners which support a different way of
> checkpointing.
>

Absolutely. I would love to support EOS in KafkaIO for Flink. I think
that will help many future exactly-once sinks and address the fundamental
incompatibility between the Beam model and Flink's horizontal checkpointing
for such applications.

 Raghu.


> Cheers,
> Max
>
> PS: I found this document about RequiresStableInput [3], but IMHO
> defining an annotation only manifests the conceptual difference
> between

Re: KafkaIO Exactly-Once & Flink Runner

2019-02-28 Thread Raghu Angadi
On Thu, Feb 28, 2019 at 11:01 AM Kenneth Knowles  wrote:

> I believe the way you would implement the logic behind Flink's
> KafkaProducer would be to have two steps:
>
> 1. Start transaction
> 2. @RequiresStableInput Close transaction
>

I see. What happens if closing the transaction fails in (2)? Flink's 2PC
requires that commit() should never hard fail once preCommit() succeeds. I
think that is the cost of not having an extra shuffle. It is alright since
this policy has worked well for Flink so far.

Overall, it will be great to have @RequiresStableInput support in Flink
runner.

Raghu.

> The FlinkRunner would need to insert the "wait until checkpoint
> finalization" logic wherever it sees @RequiresStableInput, which is already
> what it would have to do.
>
> This matches the KafkaProducer's logic - delay closing the transaction
> until checkpoint finalization. This answers my main question, which is "is
> @RequiresStableInput expressive enough to allow Beam-on-Flink to have
> exactly once behavior with the same performance characteristics as native
> Flink checkpoint finalization?"
>
> Kenn
>
> [1] https://github.com/apache/beam/pull/7955
>
> On Thu, Feb 28, 2019 at 10:43 AM Reuven Lax  wrote:
>
>>
>>
>> On Thu, Feb 28, 2019 at 10:41 AM Raghu Angadi  wrote:
>>
>>>
>>> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
 KafkaProducer supports exactly-once. It simply commits the pending
 transaction once it has completed a checkpoint.
>>>
>>>
>>>
>>> On Thu, Feb 28, 2019 at 9:59 AM Maximilian Michels 
>>> wrote:
>>>
 Hi,

 I came across KafkaIO's Runner whitelist [1] for enabling exactly-once
 semantics (EOS). I think it is questionable to exclude Runners from
 inside a transform, but I see that the intention was to save users from
 surprises.

 Now why does the Flink Runner not support KafkaIO EOS? Flink's native
 KafkaProducer supports exactly-once. It simply commits the pending
 transaction once it has completed a checkpoint.

>>>
>>>
>>> When we discussed this in Aug 2017, the understanding was that the 2-phase
>>> commit utility in Flink used to implement Flink's Kafka EOS could not be
>>> implemented in Beam's context.
>>> See this message in that dev thread. Has anything changed in this regard?
>>> The whole thread is relevant to this topic and worth going through.
>>>
>>
>> I think that TwoPhaseCommit utility class wouldn't work. The Flink runner
>> would probably want to directly use notifySnapshotComplete in order to
>> implement @RequiresStableInput.
>>
>>>
>>>

 A checkpoint is realized by sending barriers through all channels
 starting from the source until reaching all sinks. Every operator
 persists its state once it has received a barrier on all its input
 channels, it then forwards it to the downstream operators.

 The architecture of Beam's KafkaExactlyOnceSink is as follows[2]:

 Input -> AssignRandomShardIds -> GroupByKey -> AssignSequenceIds ->
 GroupByKey -> ExactlyOnceWriter

 As I understood, Spark or Dataflow use the GroupByKey stages to persist
 the input. That is not required in Flink to be able to take a
 consistent
 snapshot of the pipeline.

 Basically, for Flink we don't need any of that magic that KafkaIO does.
 What we would need to support EOS is a way to tell the
 ExactlyOnceWriter
 (a DoFn) to commit once a checkpoint has completed.
>>>
>>> I know that the new version of SDF supports checkpointing which should
 solve this issue. But there is still a lot of work to do to make this
 reality.

>>>
>>> I don't see how SDF solves this problem. Maybe pseudo code would make it
>>> more clear. But if it helps, that is great!
>>>
>>> So I think it would make sense to think about a way to make KafkaIO's
 EOS more accessible to Runners which support a different way of
 checkpointing.

>>>
>>> Absolutely. I would love to support EOS in KafkaIO for Flink. I think
>>> that will help many future exactly-once sinks and address the fundamental
>>> incompatibility between the Beam model and Flink's horizontal checkpointing
>>> for such applications.
>>>
>>> Raghu.
>>>
>>>
 Cheers,
 Max

 PS: I found this document about RequiresStableInput [3], but IMHO
 defining an annotation only manifests the conceptual difference between
 the Runners.


 [1]

 https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1144
 [2]

 https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaExactlyOnceSink.java#L166
 [3]

 https://docs.google.com/document/d/117yRKbbcEdm3eIKB_26BHOJGmHSZl1YNoF0RqWGtqAM

>>>


Re: KafkaIO Exactly-Once & Flink Runner

2019-02-28 Thread Kenneth Knowles
I believe the way you would implement the logic behind Flink's
KafkaProducer would be to have two steps:

1. Start transaction
2. @RequiresStableInput Close transaction
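
A hedged sketch of that two-step shape in Beam Java (all names, and the TxnHandle type, are made up for illustration; this is not KafkaIO code):

import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical stand-in for a reference to an open Kafka transaction.
class TxnHandle implements java.io.Serializable {
  static TxnHandle begin() { return new TxnHandle(); }
  void write(String record) { /* stage the record in the transaction */ }
  void commit() { /* commit the transaction */ }
}

class StartTxnFn extends DoFn<String, TxnHandle> {
  @ProcessElement
  public void processElement(ProcessContext ctx) {
    // Step 1: open a transaction and write the element into it.
    TxnHandle txn = TxnHandle.begin();
    txn.write(ctx.element());
    ctx.output(txn);
  }
}

class CloseTxnFn extends DoFn<TxnHandle, Void> {
  @RequiresStableInput // the runner must replay identical input; on Flink this
  @ProcessElement      // maps to "wait for checkpoint finalization"
  public void processElement(ProcessContext ctx) {
    // Step 2: commit only once the input is stable.
    ctx.element().commit();
  }
}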

The FlinkRunner would need to insert the "wait until checkpoint
finalization" logic wherever it sees @RequiresStableInput, which is already
what it would have to do.

This matches the KafkaProducer's logic - delay closing the transaction
until checkpoint finalization. This answers my main question, which is "is
@RequiresStableInput expressive enough to allow Beam-on-Flink to have
exactly once behavior with the same performance characteristics as native
Flink checkpoint finalization?"

Kenn

[1] https://github.com/apache/beam/pull/7955

On Thu, Feb 28, 2019 at 10:43 AM Reuven Lax  wrote:

>
>
> On Thu, Feb 28, 2019 at 10:41 AM Raghu Angadi  wrote:
>
>>
>> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
>>> KafkaProducer supports exactly-once. It simply commits the pending
>>> transaction once it has completed a checkpoint.
>>
>>
>>
>> On Thu, Feb 28, 2019 at 9:59 AM Maximilian Michels 
>> wrote:
>>
>>> Hi,
>>>
>>> I came across KafkaIO's Runner whitelist [1] for enabling exactly-once
>>> semantics (EOS). I think it is questionable to exclude Runners from
>>> inside a transform, but I see that the intention was to save users from
>>> surprises.
>>>
>>> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
>>> KafkaProducer supports exactly-once. It simply commits the pending
>>> transaction once it has completed a checkpoint.
>>>
>>
>>
>> When we discussed this in Aug 2017, the understanding was that the 2-phase
>> commit utility in Flink used to implement Flink's Kafka EOS could not be
>> implemented in Beam's context.
>> See this message in that dev thread. Has anything changed in this regard?
>> The whole thread is relevant to this topic and worth going through.
>>
>
> I think that TwoPhaseCommit utility class wouldn't work. The Flink runner
> would probably want to directly use notifySnapshotComplete in order to
> implement @RequiresStableInput.
>
>>
>>
>>>
>>> A checkpoint is realized by sending barriers through all channels
>>> starting from the source until reaching all sinks. Every operator
>>> persists its state once it has received a barrier on all its input
>>> channels, it then forwards it to the downstream operators.
>>>
>>> The architecture of Beam's KafkaExactlyOnceSink is as follows[2]:
>>>
>>> Input -> AssignRandomShardIds -> GroupByKey -> AssignSequenceIds ->
>>> GroupByKey -> ExactlyOnceWriter
>>>
>>> As I understood, Spark or Dataflow use the GroupByKey stages to persist
>>> the input. That is not required in Flink to be able to take a consistent
>>> snapshot of the pipeline.
>>>
>>> Basically, for Flink we don't need any of that magic that KafkaIO does.
>>> What we would need to support EOS is a way to tell the ExactlyOnceWriter
>>> (a DoFn) to commit once a checkpoint has completed.
>>
>> I know that the new version of SDF supports checkpointing which should
>>> solve this issue. But there is still a lot of work to do to make this
>>> reality.
>>>
>>
>> I don't see how SDF solves this problem. Maybe pseudo code would make it
>> more clear. But if it helps, that is great!
>>
>> So I think it would make sense to think about a way to make KafkaIO's
>>> EOS more accessible to Runners which support a different way of
>>> checkpointing.
>>>
>>
>> Absolutely. I would love to support EOS in KafkaIO for Flink. I think that
>> will help many future exactly-once sinks and address the fundamental
>> incompatibility between the Beam model and Flink's horizontal checkpointing
>> for such applications.
>>
>> Raghu.
>>
>>
>>> Cheers,
>>> Max
>>>
>>> PS: I found this document about RequiresStableInput [3], but IMHO
>>> defining an annotation only manifests the conceptual difference between
>>> the Runners.
>>>
>>>
>>> [1]
>>>
>>> https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1144
>>> [2]
>>>
>>> https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaExactlyOnceSink.java#L166
>>> [3]
>>>
>>> https://docs.google.com/document/d/117yRKbbcEdm3eIKB_26BHOJGmHSZl1YNoF0RqWGtqAM
>>>
>>


Re: KafkaIO Exactly-Once & Flink Runner

2019-02-28 Thread Reuven Lax
On Thu, Feb 28, 2019 at 10:41 AM Raghu Angadi  wrote:

>
> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
>> KafkaProducer supports exactly-once. It simply commits the pending
>> transaction once it has completed a checkpoint.
>
>
>
> On Thu, Feb 28, 2019 at 9:59 AM Maximilian Michels  wrote:
>
>> Hi,
>>
>> I came across KafkaIO's Runner whitelist [1] for enabling exactly-once
>> semantics (EOS). I think it is questionable to exclude Runners from
>> inside a transform, but I see that the intention was to save users from
>> surprises.
>>
>> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
>> KafkaProducer supports exactly-once. It simply commits the pending
>> transaction once it has completed a checkpoint.
>>
>
>
> When we discussed this in Aug 2017, the understanding was that the 2-phase
> commit utility in Flink used to implement Flink's Kafka EOS could not be
> implemented in Beam's context.
> See this message in that dev thread. Has anything changed in this regard?
> The whole thread is relevant to this topic and worth going through.
>

I think that TwoPhaseCommit utility class wouldn't work. The Flink runner
would probably want to directly use notifySnapshotComplete in order to
implement @RequiresStableInput.

>
>
>>
>> A checkpoint is realized by sending barriers through all channels
>> starting from the source until reaching all sinks. Every operator
>> persists its state once it has received a barrier on all its input
>> channels, it then forwards it to the downstream operators.
>>
>> The architecture of Beam's KafkaExactlyOnceSink is as follows[2]:
>>
>> Input -> AssignRandomShardIds -> GroupByKey -> AssignSequenceIds ->
>> GroupByKey -> ExactlyOnceWriter
>>
>> As I understood, Spark or Dataflow use the GroupByKey stages to persist
>> the input. That is not required in Flink to be able to take a consistent
>> snapshot of the pipeline.
>>
>> Basically, for Flink we don't need any of that magic that KafkaIO does.
>> What we would need to support EOS is a way to tell the ExactlyOnceWriter
>> (a DoFn) to commit once a checkpoint has completed.
>
> I know that the new version of SDF supports checkpointing which should
>> solve this issue. But there is still a lot of work to do to make this
>> reality.
>>
>
> I don't see how SDF solves this problem. Maybe pseudo code would make it
> more clear. But if it helps, that is great!
>
> So I think it would make sense to think about a way to make KafkaIO's
>> EOS more accessible to Runners which support a different way of
>> checkpointing.
>>
>
> Absolutely. I would love to support EOS in KafkaIO for Flink. I think that
> will help many future exactly-once sinks and address the fundamental
> incompatibility between the Beam model and Flink's horizontal checkpointing
> for such applications.
>
> Raghu.
>
>
>> Cheers,
>> Max
>>
>> PS: I found this document about RequiresStableInput [3], but IMHO
>> defining an annotation only manifests the conceptual difference between
>> the Runners.
>>
>>
>> [1]
>>
>> https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1144
>> [2]
>>
>> https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaExactlyOnceSink.java#L166
>> [3]
>>
>> https://docs.google.com/document/d/117yRKbbcEdm3eIKB_26BHOJGmHSZl1YNoF0RqWGtqAM
>>
>


Re: KafkaIO Exactly-Once & Flink Runner

2019-02-28 Thread Raghu Angadi
> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
> KafkaProducer supports exactly-once. It simply commits the pending
> transaction once it has completed a checkpoint.



On Thu, Feb 28, 2019 at 9:59 AM Maximilian Michels  wrote:

> Hi,
>
> I came across KafkaIO's Runner whitelist [1] for enabling exactly-once
> semantics (EOS). I think it is questionable to exclude Runners from
> inside a transform, but I see that the intention was to save users from
> surprises.
>
> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
> KafkaProducer supports exactly-once. It simply commits the pending
> transaction once it has completed a checkpoint.
>


When we discussed this in Aug 2017, the understanding was that the 2-phase
commit utility in Flink used to implement Flink's Kafka EOS could not be
implemented in Beam's context.
See this message in that dev thread. Has anything changed in this regard?
The whole thread is relevant to this topic and worth going through.


>
> A checkpoint is realized by sending barriers through all channels
> starting from the source until reaching all sinks. Every operator
> persists its state once it has received a barrier on all its input
> channels, it then forwards it to the downstream operators.
>
> The architecture of Beam's KafkaExactlyOnceSink is as follows[2]:
>
> Input -> AssignRandomShardIds -> GroupByKey -> AssignSequenceIds ->
> GroupByKey -> ExactlyOnceWriter
>
> As I understood, Spark or Dataflow use the GroupByKey stages to persist
> the input. That is not required in Flink to be able to take a consistent
> snapshot of the pipeline.
>
> Basically, for Flink we don't need any of that magic that KafkaIO does.
> What we would need to support EOS is a way to tell the ExactlyOnceWriter
> (a DoFn) to commit once a checkpoint has completed.

I know that the new version of SDF supports checkpointing which should
> solve this issue. But there is still a lot of work to do to make this
> reality.
>

I don't see how SDF solves this problem. Maybe pseudo code would make it
more clear. But if it helps, that is great!

So I think it would make sense to think about a way to make KafkaIO's
> EOS more accessible to Runners which support a different way of
> checkpointing.
>

Absolutely. I would love to support EOS in KafkaIO for Flink. I think that
will help many future exactly-once sinks and address the fundamental
incompatibility between the Beam model and Flink's horizontal checkpointing
for such applications.

Raghu.


> Cheers,
> Max
>
> PS: I found this document about RequiresStableInput [3], but IMHO
> defining an annotation only manifests the conceptual difference between
> the Runners.
>
>
> [1]
>
> https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1144
> [2]
>
> https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaExactlyOnceSink.java#L166
> [3]
>
> https://docs.google.com/document/d/117yRKbbcEdm3eIKB_26BHOJGmHSZl1YNoF0RqWGtqAM
>


Re: KafkaIO Exactly-Once & Flink Runner

2019-02-28 Thread Reuven Lax
This is exactly what RequiresStableInput is supposed to do. On the Flink
runner, this would be implemented by delaying processing until the current
checkpoint is done. In fact many sinks are probably subtly broken on the
Flink runner today without RequiresStableInput, so we really need to finish
this work and add a Flink implementation of it.
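
A hedged sketch of what that delay could look like as a Flink operator (an assumed shape, not the FlinkRunner's actual implementation; a production version would also have to snapshot the buffer into operator state so buffered elements survive failover):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

class HoldUntilCheckpointOperator<T> extends AbstractStreamOperator<T>
    implements OneInputStreamOperator<T, T> {

  // Elements seen since the last completed checkpoint; not yet stable.
  private final List<StreamRecord<T>> held = new ArrayList<>();

  @Override
  public void processElement(StreamRecord<T> element) {
    held.add(element.copy(element.getValue())); // buffer until input is durable
  }

  @Override
  public void notifyCheckpointComplete(long checkpointId) {
    // Input up to this checkpoint is now persisted; release it downstream.
    for (StreamRecord<T> record : held) {
      output.collect(record);
    }
    held.clear();
  }
}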

Reuven

On Thu, Feb 28, 2019 at 9:59 AM Maximilian Michels  wrote:

> Hi,
>
> I came across KafkaIO's Runner whitelist [1] for enabling exactly-once
> semantics (EOS). I think it is questionable to exclude Runners from
> inside a transform, but I see that the intention was to save users from
> surprises.
>
> Now why does the Flink Runner not support KafkaIO EOS? Flink's native
> KafkaProducer supports exactly-once. It simply commits the pending
> transaction once it has completed a checkpoint.
>
> A checkpoint is realized by sending barriers through all channels
> starting from the source until reaching all sinks. Every operator
> persists its state once it has received a barrier on all its input
> channels, it then forwards it to the downstream operators.
>
> The architecture of Beam's KafkaExactlyOnceSink is as follows[2]:
>
> Input -> AssignRandomShardIds -> GroupByKey -> AssignSequenceIds ->
> GroupByKey -> ExactlyOnceWriter
>
> As I understood, Spark or Dataflow use the GroupByKey stages to persist
> the input. That is not required in Flink to be able to take a consistent
> snapshot of the pipeline.
>
> Basically, for Flink we don't need any of that magic that KafkaIO does.
> What we would need to support EOS is a way to tell the ExactlyOnceWriter
> (a DoFn) to commit once a checkpoint has completed.
>
> I know that the new version of SDF supports checkpointing which should
> solve this issue. But there is still a lot of work to do to make this
> reality.
>
> So I think it would make sense to think about a way to make KafkaIO's
> EOS more accessible to Runners which support a different way of
> checkpointing.
>
> Cheers,
> Max
>
> PS: I found this document about RequiresStableInput [3], but IMHO
> defining an annotation only manifests the conceptual difference between
> the Runners.
>
>
> [1]
>
> https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1144
> [2]
>
> https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaExactlyOnceSink.java#L166
> [3]
>
> https://docs.google.com/document/d/117yRKbbcEdm3eIKB_26BHOJGmHSZl1YNoF0RqWGtqAM
>


KafkaIO Exactly-Once & Flink Runner

2019-02-28 Thread Maximilian Michels

Hi,

I came across KafkaIO's Runner whitelist [1] for enabling exactly-once 
semantics (EOS). I think it is questionable to exclude Runners from 
inside a transform, but I see that the intention was to save users from 
surprises.


Now why does the Flink Runner not support KafkaIO EOS? Flink's native 
KafkaProducer supports exactly-once. It simply commits the pending 
transaction once it has completed a checkpoint.


A checkpoint is realized by sending barriers through all channels 
starting from the source until reaching all sinks. Every operator 
persists its state once it has received a barrier on all its input 
channels, it then forwards it to the downstream operators.


The architecture of Beam's KafkaExactlyOnceSink is as follows[2]:

Input -> AssignRandomShardIds -> GroupByKey -> AssignSequenceIds -> 
GroupByKey -> ExactlyOnceWriter
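
Read as Beam transforms, the shape is roughly the following. This is a sketch with toy types (String records, Integer shard keys) and only the first two stages spelled out, not the actual KafkaExactlyOnceSink source (see [2] for that):

import java.util.Random;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;

public class EosShapeSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    p.apply(Create.of("record-a", "record-b", "record-c"))
        // AssignRandomShardIds: key each record by a random shard id.
        .apply(ParDo.of(new DoFn<String, KV<Integer, String>>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            c.output(KV.of(new Random().nextInt(4), c.element()));
          }
        }))
        // First GroupByKey: on Spark/Dataflow this is where the input gets
        // materialized, making the per-shard batches replay-stable.
        .apply(GroupByKey.create());
    // AssignSequenceIds, the second GroupByKey, and ExactlyOnceWriter follow
    // the same pattern; elided here to keep the sketch short.
    p.run().waitUntilFinish();
  }
}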


As I understood, Spark or Dataflow use the GroupByKey stages to persist 
the input. That is not required in Flink to be able to take a consistent 
snapshot of the pipeline.


Basically, for Flink we don't need any of that magic that KafkaIO does. 
What we would need to support EOS is a way to tell the ExactlyOnceWriter 
(a DoFn) to commit once a checkpoint has completed.


I know that the new version of SDF supports checkpointing which should 
solve this issue. But there is still a lot of work to do to make this 
reality.


So I think it would make sense to think about a way to make KafkaIO's 
EOS more accessible to Runners which support a different way of 
checkpointing.


Cheers,
Max

PS: I found this document about RequiresStableInput [3], but IMHO 
defining an annotation only manifests the conceptual difference between 
the Runners.



[1] 
https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L1144
[2] 
https://github.com/apache/beam/blob/988a22f01bb133dd65b3cc75e05978d695aed76c/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaExactlyOnceSink.java#L166
[3] 
https://docs.google.com/document/d/117yRKbbcEdm3eIKB_26BHOJGmHSZl1YNoF0RqWGtqAM


CVE audit gradle plugin

2019-02-28 Thread Etienne Chauchot
Hi guys,

I came by this [1] gradle plugin that is a client to the Sonatype OSS Index CVE 
database.

I have set it up here in a branch [2], though the cache is not configured and 
the number of requests is limited. It can
be run with "gradle --info audit"

It would be nice to have something like this to track the CVEs in the libs we
use. I know we have been spammed by automatic lib-upgrade requests in the past,
but CVEs are more important IMHO.

This plugin is licensed under BSD-3-Clause, which is compatible with the Apache
v2 license [3]

WDYT ?

Etienne

[1] https://github.com/OSSIndex/ossindex-gradle-plugin
[2] https://github.com/echauchot/beam/tree/cve_audit_plugin
[3] https://www.apache.org/legal/resolved.html


Re: Merge of vendored Guava (Some PRs need a rebase)

2019-02-28 Thread Kenneth Knowles
If someone is using BigTableIO with bigtable-client-core then having
BigTableIO and bigtable-client-core both depend on Guava 26.0 is fine,
right? Specifically, a user of BigTableIO after
https://github.com/apache/beam/pull/7957 will still have non-vendored Guava
on the classpath due to the transitive deps of bigtable-client-core.

In any case it seems very wrong for the Beam root project to manage the
version of Guava in BigTableIO since the whole point is to be compatible
with bigtable-client-core. Would it work to delete our pinned Guava version
[1] and chase down all the places it breaks, moving Guava dependencies
local to places where an IO or extension must use it for interop? Then you
don't need adapters.

In both of the above approaches, diamond dependency problems between IOs
are quite possible.

I don't know if we can do better. For example, producing a
bigtable-client-core where we have relocated Guava internally and using
that could really be an interop nightmare as things that look like the same
type would not be. Less likely to be broken would be bigtable-client-core
entirely relocated and vendored, but generally IO connectors exchange
objects with users and the users would have to use the relocated versions,
so that's gross.

Kenn

[1]
https://github.com/apache/beam/blob/master/buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy#L353


On Thu, Feb 28, 2019 at 2:29 AM Gleb Kanterov  wrote:

> For the past week, two independent people have asked me if I can help with
> guava NoSuchMethodError in BigtableIO. It turns out we still have a
> potential problem with dependencies that don't vendor guava, in this case,
> it was bigtable-client-core that depends on guava-26.0. However, the root
> cause of the classpath problem was in the usage of a deprecated method from
> non-vendored guava in BigtableServiceClientImpl in the code path where we
> integrate with the Bigtable client.
>
> I created apache/beam#7957 [1]
> to fix that. There are a few other IOs where we use non-vendored guava that
> we can fix using adapters.
>
> And there is an unknown number of conflicts between guava versions in our
> dependencies that don't vendor it, which, as I understand it, could be fixed
> by relocating them, similar to what we do for Calcite [2].
>
> [1]: https://github.com/apache/beam/pull/7957
> [2]:
> https://github.com/apache/beam/blob/61de62ecbe8658de866280a8976030a0cb877041/sdks/java/extensions/sql/build.gradle#L30-L39
>
> Gleb
>
> On Sun, Jan 20, 2019 at 11:43 AM Gleb Kanterov  wrote:
>
>> I didn't look deep into it, but it seems we can put
>> .idea/codeInsightSettings.xml into our repository, where we blacklist
>> packages from auto-import. See the example in
>> JetBrains/kotlin/.idea/codeInsightSettings.xml.
>>
>> On Sat, Jan 19, 2019 at 8:03 PM Reuven Lax  wrote:
>>
>>> Bad IDEs automatically generate the wrong import. I think we need to
>>> automatically prevent this, otherwise the bad imports will inevitably slip
>>> back in.
>>>
>>> Reuven
>>>
>>> On Tue, Jan 15, 2019 at 2:54 AM Łukasz Gajowy 
>>> wrote:
>>>
 Great news. Thanks all for this work!

 +1 to enforcing this on dependency level as Kenn suggested.

 Łukasz

On Tue, Jan 15, 2019 at 01:18 Kenneth Knowles wrote:

> We can enforce at the dependency level, since it is a compile error. I
> think some IDEs and build tools may allow the compile-time classpath to 
> get
> polluted by transitive runtime deps, so protecting against bad imports is
> also a good idea.
>
> Kenn
>
> On Mon, Jan 14, 2019 at 8:42 AM Ismaël Mejía 
> wrote:
>
>> Not yet, we need to add that too. There are still some tasks to be
>> done, like improving the contribution guide with this info, and
>> documenting how to generate a src build artifact locally, since I doubt
>> we can publish that into Apache for copyright reasons.
>> I will send a message in the future for awareness when most of the
>> pending tasks are finished.
>>
>>
>> On Mon, Jan 14, 2019 at 3:51 PM Maximilian Michels 
>> wrote:
>> >
>> > Thanks for the heads up, Ismaël! Great to see the vendored Guava
>> version is used
>> > everywhere now.
>> >
>> > Do we already have a Checkstyle rule that prevents people from
>> using the
>> > unvendored Guava? If not, such a rule could be useful because it's
>> almost
>> > inevitable that the unvendored Guava will slip back in.
>> >
>> > Cheers,
>> > Max
>> >
>> > On 14.01.19 05:55, Ismaël Mejía wrote:
>> > > We merged today the PR [1] that changes most of the code to use
>> our
>> > > new guava vendored dependency. In practice it means that most of
>> the
>> > > imports of the classes were changed from `com.google.common.` to
>> > > 

Re: [ANNOUNCE] New committer announcement: Michael Luckey

2019-02-28 Thread Maximilian Michels

Welcome, it's great to have you onboard Michael!

On 28.02.19 11:46, Michael Luckey wrote:
Thanks to all of you for the warm welcome. Really happy to be part of 
this great community!


michel

On Thu, Feb 28, 2019 at 8:39 AM David Morávek wrote:


Congrats Michael! 

D.

> On 28 Feb 2019, at 03:27, Ismaël Mejía <ieme...@gmail.com> wrote:
 >
 > Congratulations Michael, and thanks for all the contributions!
 >
>> On Wed, Feb 27, 2019 at 6:30 PM Ankur Goenka <goe...@google.com> wrote:
 >>
 >> Congratulations Michael!
 >>
>>> On Wed, Feb 27, 2019 at 2:25 PM Thomas Weise <thomas.we...@gmail.com> wrote:
 >>>
 >>> Congrats Michael!
 >>>
 >>>
On Wed, Feb 27, 2019 at 12:41 PM Gleb Kanterov <g...@spotify.com> wrote:
 
  Congratulations and welcome!
 
> On Wed, Feb 27, 2019 at 8:57 PM Connell O'Callaghan <conne...@google.com> wrote:
 >
 > Excellent thank you for sharing Kenn!!!
 >
 > Michael congratulations for this recognition of your
contributions to advancing BEAM
 >
>> On Wed, Feb 27, 2019 at 11:52 AM Kenneth Knowles <k...@apache.org> wrote:
 >>
 >> Hi all,
 >>
 >> Please join me and the rest of the Beam PMC in welcoming a
new committer: Michael Luckey
 >>
 >> Michael has been contributing to Beam since early 2017. He
has fixed many build and developer environment issues, noted and
root-caused breakages on master, generously reviewed many others'
changes to the build. In consideration of Michael's contributions,
the Beam PMC trusts Michael with the responsibilities of a Beam
committer [1].
 >>
 >> Thank you, Michael, for your contributions.
 >>
 >> Kenn
 >>
 >> [1]

https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
 
 
 
  --
  Cheers,
  Gleb



Re: [ANNOUNCE] New committer announcement: Michael Luckey

2019-02-28 Thread Michael Luckey
Thanks to all of you for the warm welcome. Really happy to be part of this
great community!

michel

On Thu, Feb 28, 2019 at 8:39 AM David Morávek 
wrote:

> Congrats Michael! 
>
> D.
>
> > On 28 Feb 2019, at 03:27, Ismaël Mejía  wrote:
> >
> > Congratulations Michael, and thanks for all the contributions!
> >
> >> On Wed, Feb 27, 2019 at 6:30 PM Ankur Goenka  wrote:
> >>
> >> Congratulations Michael!
> >>
> >>> On Wed, Feb 27, 2019 at 2:25 PM Thomas Weise 
> wrote:
> >>>
> >>> Congrats Michael!
> >>>
> >>>
>  On Wed, Feb 27, 2019 at 12:41 PM Gleb Kanterov 
> wrote:
> 
>  Congratulations and welcome!
> 
> > On Wed, Feb 27, 2019 at 8:57 PM Connell O'Callaghan <
> conne...@google.com> wrote:
> >
> > Excellent thank you for sharing Kenn!!!
> >
> > Michael congratulations for this recognition of your contributions
> to advancing BEAM
> >
> >> On Wed, Feb 27, 2019 at 11:52 AM Kenneth Knowles 
> wrote:
> >>
> >> Hi all,
> >>
> >> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Michael Luckey
> >>
> >> Michael has been contributing to Beam since early 2017. He has
> fixed many build and developer environment issues, noted and root-caused
> breakages on master, generously reviewed many others' changes to the build.
> In consideration of Michael's contributions, the Beam PMC trusts Michael
> with the responsibilities of a Beam committer [1].
> >>
> >> Thank you, Michael, for your contributions.
> >>
> >> Kenn
> >>
> >> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> 
> 
> 
>  --
>  Cheers,
>  Gleb
>


Re: [ANNOUNCE] New committer announcement: Michael Luckey

2019-02-28 Thread Łukasz Gajowy
Congrats and welcome! :)

On Thu, Feb 28, 2019 at 08:39 David Morávek wrote:

> Congrats Michael! 
>
> D.
>
> > On 28 Feb 2019, at 03:27, Ismaël Mejía  wrote:
> >
> > Congratulations Michael, and thanks for all the contributions!
> >
> >> On Wed, Feb 27, 2019 at 6:30 PM Ankur Goenka  wrote:
> >>
> >> Congratulations Michael!
> >>
> >>> On Wed, Feb 27, 2019 at 2:25 PM Thomas Weise 
> wrote:
> >>>
> >>> Congrats Michael!
> >>>
> >>>
>  On Wed, Feb 27, 2019 at 12:41 PM Gleb Kanterov 
> wrote:
> 
>  Congratulations and welcome!
> 
> > On Wed, Feb 27, 2019 at 8:57 PM Connell O'Callaghan <
> conne...@google.com> wrote:
> >
> > Excellent thank you for sharing Kenn!!!
> >
> > Michael congratulations for this recognition of your contributions
> to advancing BEAM
> >
> >> On Wed, Feb 27, 2019 at 11:52 AM Kenneth Knowles 
> wrote:
> >>
> >> Hi all,
> >>
> >> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Michael Luckey
> >>
> >> Michael has been contributing to Beam since early 2017. He has
> fixed many build and developer environment issues, noted and root-caused
> breakages on master, generously reviewed many others' changes to the build.
> In consideration of Michael's contributions, the Beam PMC trusts Michael
> with the responsibilities of a Beam committer [1].
> >>
> >> Thank you, Michael, for your contributions.
> >>
> >> Kenn
> >>
> >> [1]
> https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
> 
> 
> 
>  --
>  Cheers,
>  Gleb
>


Re: Merge of vendored Guava (Some PRs need a rebase)

2019-02-28 Thread Gleb Kanterov
For the past week, two independent people have asked me if I can help with
guava NoSuchMethodError in BigtableIO. It turns out we still have a
potential problem with dependencies that don't vendor guava, in this case,
it was bigtable-client-core that depends on guava-26.0. However, the root
cause of the classpath problem was in the usage of a deprecated method from
non-vendored guava in BigtableServiceClientImpl in the code path where we
integrate with bigtable client.

I created apache/beam#7957 [1]
to fix that. There are a few other IOs where we use non-vendored guava
that we can fix using adapters.

And there is an unknown number of conflicts between guava versions in our
dependencies that don't vendor it, which, as I understand it, could be
fixed by relocating them, in a similar way to what we do for Calcite [2].

[1]: https://github.com/apache/beam/pull/7957
[2]:
https://github.com/apache/beam/blob/61de62ecbe8658de866280a8976030a0cb877041/sdks/java/extensions/sql/build.gradle#L30-L39
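
For completeness, the Calcite-style relocation in [2] amounts to
something like the following with the shadow plugin (a sketch; the
target package name is just a placeholder, not the one Beam would use):

    // build.gradle of a module that shades its own copy of guava
    apply plugin: 'com.github.johnrengelman.shadow'

    shadowJar {
        // move guava into a module-private namespace so it can't clash
        // with whatever version other dependencies pull in
        relocate 'com.google.common',
                 'org.apache.beam.repackaged.somemodule.com.google.common'
    }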

Gleb

On Sun, Jan 20, 2019 at 11:43 AM Gleb Kanterov  wrote:

> I didn't look deep into it, but it seems we can put
> .idea/codeInsightSettings.xml into our repository where we blacklist
> packages from auto-import. See an example in
> JetBrains/kotlin/.idea/codeInsightSettings.xml
> 
> .
>
> On Sat, Jan 19, 2019 at 8:03 PM Reuven Lax  wrote:
>
>> Bad IDEs automatically generate the wrong import. I think we need to
>> automatically prevent this, otherwise the bad imports will inevitably slip
>> back in.
>>
>> Reuven
>>
>> On Tue, Jan 15, 2019 at 2:54 AM Łukasz Gajowy 
>> wrote:
>>
>>> Great news. Thanks all for this work!
>>>
>>> +1 to enforcing this on dependency level as Kenn suggested.
>>>
>>> Łukasz
>>>
>>> On Tue, Jan 15, 2019 at 01:18 Kenneth Knowles wrote:
>>>
 We can enforce at the dependency level, since it is a compile error. I
 think some IDEs and build tools may allow the compile-time classpath to get
 polluted by transitive runtime deps, so protecting against bad imports is
 also a good idea.

 Kenn

 On Mon, Jan 14, 2019 at 8:42 AM Ismaël Mejía  wrote:

> Not yet, we need to add that too. There are still some tasks to be
> done, like improving the contribution guide with this info, and
> documenting how to generate a src build artifact locally, since I doubt
> we can publish that into Apache for copyright reasons.
> I will send a message in the future for awareness when most of the
> pending tasks are finished.
>
>
> On Mon, Jan 14, 2019 at 3:51 PM Maximilian Michels 
> wrote:
> >
> > Thanks for the heads up, Ismaël! Great to see the vendored Guava
> version is used
> > everywhere now.
> >
> > Do we already have a Checkstyle rule that prevents people from using
> the
> > unvendored Guava? If not, such a rule could be useful because it's
> almost
> > inevitable that the unvendored Guava will slip back in.
> >
> > Cheers,
> > Max
> >
> > On 14.01.19 05:55, Ismaël Mejía wrote:
> > > We merged today the PR [1] that changes most of the code to use our
> > > new guava vendored dependency. In practice it means that most of
> the
> > > imports of the classes were changed from `com.google.common.` to
> > > `org.apache.beam.vendor.guava.v20_0.com.google.common.`
> > >
> > > This is a great improvement to fix a long existing problem of guava
> > > leaking through some Beam modules. This also reduces the size of
> most
> > > jars in the project because they don't need to relocate and include
> > > guava anymore, they just use the vendored dependency.
> > >
> > > Kudos to Kenn Knowles, Lukasz Cwik, Scott Wegner and the others
> that
> > > worked (are working) to make this possible.
> > >
> > > Sadly as a side effect of the merge of this PR multiple PRs were
> > > broken so please review if yours was and do a rebase and fix the
> > > imports to use the new vendored dependency. Sorry for the
> > > inconvenience. From now on all uses of guava should use the
> vendored
> > > version. Expect some updates in the docs.
> > >
> > > [1]  https://github.com/apache/beam/pull/6809
> > >
>

>
> --
> Cheers,
> Gleb
>


-- 
Cheers,
Gleb