Re: [PROPOSAL] Prepare Beam 2.6.0 release

2018-07-30 Thread Pablo Estrada
Hello everyone,
we will explore a workaround for the issues with the Dataflow worker.

Because of this, the 2.6.0 release is no longer blocked by the Dataflow
workers at the moment. I hope to produce a release candidate by end-of-day
tomorrow.

Please reach out on this thread, or to me directly if you have any concerns
or questions.

Thanks!
-P.

On Sun, Jul 29, 2018 at 9:23 PM Reuven Lax  wrote:

> No problem :) I'm not entirely sure why the PR fixed his issue, but it
> couldn't cause any harm so I went ahead and merged it.
>
> On Sun, Jul 29, 2018 at 11:56 AM Jean-Baptiste Onofré 
> wrote:
>
>> Yes, it's this one and I planned to do the review tonight.
>>
>> Just saw that you did the review and merged it. Thanks.
>>
>> Regards
>> JB
>>
>> On 29/07/2018 19:40, Reuven Lax wrote:
>> > Jozef, you reference a pull request in the bug, are you referring to
>> > this one?
>> >
>> > https://github.com/apache/beam/pull/6075/files
>> >
>> > On Sun, Jul 29, 2018 at 12:13 AM Jozef Vilcek wrote:
>> >
>> > Hello, is there a chance the fix for this bug makes it into the release?
>> >
>> > https://issues.apache.org/jira/browse/BEAM-5028
>> >
>> > On Sat, Jul 28, 2018 at 1:38 AM Pablo Estrada wrote:
>> >
>> > Hello all,
>> > I will start daily updates of progress on the 2.6.0 release.
>> > As of today, the main release blockers are issues in Dataflow
>> > that are preventing us from cutting the Dataflow workers.
>> >
>> >   * One issue in Java SDK related to FnAPI. Specifically PR 5709
>> > requires Dataflow worker changes[1].
>> >   * One issue in the Python SDK related to context management.
>> > PR 5356 also requires Dataflow worker changes [2].
>> >
>> > Please reach out to me if you have any questions.
>> > Best
>> > -P.
>> >
>> > [1] https://github.com/apache/beam/pull/5709
>> > [2] https://github.com/apache/beam/pull/5356
>> >
>> >
>> > On Thu, Jul 26, 2018 at 2:16 PM Pablo Estrada
>> > <pabl...@google.com> wrote:
>> >
>> > Hello everyone,
>> > I wanted to give an update on the state of the release, as
>> > there hasn't been news on this for a while.
>> > We have found a few issues that broke postcommits a few
>> > weeks back, but we hadn't noticed. Some people are tackling
>> > these to try to stabilize the release branch[1].
>> >
>> > In the meantime, the release has been blocked, but Boyuan
>> > Zhang has taken advantage of this to code up a few scripts
>> > to try and automate release steps. (Thanks Boyuan!). We will
>> > try these as soon as the release is unblocked.
>> >
>> > Best
>> > -P.
>> >
>> > [1] https://github.com/apache/beam/pull/6072
>> >
>> > On Wed, Jul 18, 2018 at 11:03 AM Pablo Estrada
>> > <pabl...@google.com> wrote:
>> >
>> > Hello all!
>> > I've cut the release branch (release-2.6.0), with some
>> > help from Ahmet and Boyuan. From now on, please
>> > cherry-pick 2.6.0 blockers into the branch.
>> > Now we start stabilizing it.
>> >
>> > Thanks!
>> >
>> > -P.
>> >
>> > On Tue, Jul 17, 2018 at 9:34 PM Jean-Baptiste Onofré
>> > <j...@nanthrax.net> wrote:
>> >
>> > Hi Pablo,
>> >
>> > I'm investigating this issue, but it's a fairly long
>> > process.
>> >
>> > So, I propose you start the release process by
>> > cutting the branch, and then I will create a
>> > cherry-pick PR for this one.
>> >
>> > Regards
>> > JB
>> >
>> > On 17/07/2018 20:19, Pablo Estrada wrote:
>> > > Checking once more:
>> > > What does the community think we should do about
>> > > https://issues.apache.org/jira/browse/BEAM-4750 ?
>> > > Should I bump it to 2.7.0?
>> > > Best
>> > > -P.
>> > >
>> > > On Fri, Jul 13, 2018 at 5:15 PM Ahmet Altay
>> > > <al...@google.com> wrote:
>> > >
>> > > Update: https://issues.apache.org/jira/browse/BEAM-4784 is
>> > > not a release blocker, details in the JIRA issue.
>> > >

Beam Dependency Check Report (2018-07-31)

2018-07-30 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:


Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release
google-cloud-bigquery | 0.25.0 | 1.4.0 | 2017-06-26 | 2018-07-16
google-cloud-core | 0.25.0 | 0.28.1 | 2018-06-07 | 2018-06-07
google-cloud-pubsub | 0.26.0 | 0.35.4 | 2017-06-26 | 2018-06-08
ply | 3.8 | 3.11 | 2018-06-07 | 2018-06-07


High Priority Dependency Updates Of Beam Java SDK:


Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release
org.assertj:assertj-core | 2.5.0 | 3.10.0 | 2016-07-03 | 2018-05-11
com.google.auto.service:auto-service | 1.0-rc2 | 1.0-rc4 | 2018-06-25 | 2017-12-11
biz.aQute:bndlib | 1.43.0 | 2.0.0.20130123-133441 | 2018-06-25 | 2018-06-25
org.apache.cassandra:cassandra-all | 3.9 | 3.11.2 | 2016-09-26 | 2018-02-14
org.apache.commons:commons-dbcp2 | 2.1.1 | 2.5.0 | 2015-08-02 | 2018-07-16
de.flapdoodle.embed:de.flapdoodle.embed.mongo | 1.50.1 | 2.1.1 | 2015-12-11 | 2018-06-25
de.flapdoodle.embed:de.flapdoodle.embed.process | 1.50.1 | 2.0.5 | 2015-12-11 | 2018-06-25
org.apache.derby:derby | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
org.apache.derby:derbyclient | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
org.apache.derby:derbynet | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
org.elasticsearch:elasticsearch | 5.6.3 | 6.3.2 | 2017-10-06 | 2018-07-30
org.elasticsearch:elasticsearch-hadoop | 5.0.0 | 6.3.2 | 2016-10-26 | 2018-07-30
org.elasticsearch.client:elasticsearch-rest-client | 5.6.3 | 6.3.2 | 2017-10-06 | 2018-07-30
com.alibaba:fastjson | 1.2.12 | 1.2.47 | 2016-05-21 | 2018-03-15
org.elasticsearch.test:framework | 5.6.3 | 6.3.2 | 2017-10-06 | 2018-07-30
org.freemarker:freemarker | 2.3.25-incubating | 2.3.28 | 2016-06-14 | 2018-03-30
net.ltgt.gradle:gradle-apt-plugin | 0.13 | 0.18 | 2017-11-01 | 2018-07-23
com.commercehub.gradle.plugin:gradle-avro-plugin | 0.11.0 | 0.14.2 | 2018-01-30 | 2018-06-06
gradle.plugin.com.palantir.gradle.docker:gradle-docker | 0.13.0 | 0.20.1 | 2017-04-05 | 2018-07-09
com.github.ben-manes:gradle-versions-plugin | 0.17.0 | 0.20.0 | 2018-06-06 | 2018-06-25
org.codehaus.groovy:groovy-all | 2.4.13 | 3.0.0-alpha-3 | 2017-11-22 | 2018-06-26
com.google.guava:guava | 20.0 | 25.1-jre | 2018-07-16 | 2018-07-16
org.apache.hbase:hbase-common | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-hadoop-compat | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-hadoop2-compat | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-server | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-shaded-client | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-shaded-server | 1.2.6 | 2.0.0-alpha2 | 2017-05-29 | 2018-05-31
org.apache.hive:hive-cli | 2.1.0 | 3.1.0.3.0.0.0-1634 | 2016-06-16 | 2018-07-16
org.apache.hive:hive-common | 2.1.0 | 3.1.0.3.0.0.0-1634 | 2016-06-16 | 2018-07-16
org.apache.hive:hive-exec | 2.1.0 | 3.1.0.3.0.0.0-1634 | 2016-06-16 | 2018-07-16
org.apache.hive.hcatalog:hive-hcatalog-core | 2.1.0 | 3.1.0.3.0.0.0-1634 | 2016-06-16 | 2018-07-16
org.apache.httpcomponents:httpasyncclient | 4.1.2 | 4.1.4 | 2016-06-18 | 2018-07-23
org.apache.httpcomponents:httpclient | 4.5.2 | 4.5.6 | 2016-02-21 | 2018-07-09


Re: SQS source

2018-07-30 Thread John Rudolf Lewis
I created a PR for my SqsIO contribution. I look forward to your comments.

https://github.com/apache/beam/pull/6101

Any chance this could be a part of the 2.6.0 release?

On Thu, Jul 19, 2018 at 7:39 AM, John Rudolf Lewis 
wrote:

> Thank you.
>
> I've created a jira ticket to add SQS and have assigned it to myself:
> https://issues.apache.org/jira/browse/BEAM-4828
>
> Modified the documentation to show it as in-progress:
> https://github.com/apache/beam/pull/5995
>
> And will be starting my work here:
> https://github.com/JohnRudolfLewis/beam/tree/Add-SqsIO
>
>
> On Thu, Jul 19, 2018 at 1:43 AM, Jean-Baptiste Onofré 
> wrote:
>
>> Agree with Ismaël.
>>
>> I would be more than happy to help on this one (as I contributed on AMQP
>> and JMS IOs ;)).
>>
>> Regards
>> JB
>>
>> On 19/07/2018 10:39, Ismaël Mejía wrote:
>> > Thanks for your interest John, it would be a really nice contribution
>> > to add SQS support.
>> >
>> > Some context on the kinesis stuff:
>> >
>> > The reason why kinesis is still in a separate module is more related
>> > to a licensing problem. Kinesis uses some native libraries that are
>> > published under a license that is not 100% Apache compatible, and we
>> > are not allowed to shade and republish them, but it seems there is a
>> > workaround now; for more details see
>> > https://issues.apache.org/jira/browse/BEAM-3549
>> > In any case, if SQS only needs the Apache-licensed aws-sdk
>> > deps, it is ok (and a good idea) to put it in the
>> > amazon-web-services module.
>> >
>> > The kinesis connector is way more complex for multiple reasons. First,
>> > the raw version of the amazon client libraries is not so ‘friendly’,
>> > and the people who created KinesisIO had to do some workarounds to
>> > provide accurate checkpointing/watermarks. Since SQS is a much
>> > simpler system, you should probably be ok basing it on simpler sources
>> > like AMQP or JMS.
>> >
>> > If you feel like it, please create the JIRA and don’t hesitate to ask
>> > questions if you find issues or if you need some review.
>> >
>> > On Thu, Jul 19, 2018 at 12:55 AM Lukasz Cwik wrote:
>> >>
>> >> On Wed, Jul 18, 2018 at 3:30 PM John Rudolf Lewis
>> >> <johnrle...@gmail.com> wrote:
>> >>>
>> >>> I need an SQS source for my project that is using beam. A brief
>> >>> search did not turn up any in-progress work in this area. Please
>> >>> point me to the right repo if I missed it.
>> >>
>> >> To my knowledge there is none and nobody has marked it in progress on
>> >> https://beam.apache.org/documentation/io/built-in/. It would be good
>> >> to create a JIRA issue on https://issues.apache.org/ and send a PR to
>> >> add SQS to the in-progress list referencing your JIRA. I added you as
>> >> a contributor in JIRA so you should be able to assign yourself to any
>> >> issues that you create.
>> >>
>> >>> Assuming there is no in-progress effort, I would like to contribute
>> >>> an Amazon SQS source. I have a few questions before I begin.
>> >>
>> >> Great, note that this is a good starting point for authoring an IO
>> >> transform: https://beam.apache.org/documentation/io/authoring-overview/
>> >>
>> >>> It seems that the current AWS code is split into two different
>> >>> modules: sdk/java/io/amazon-web-services which contains the
>> >>> S3FileSystem, AwsOptions, etc, and sdk/java/io/kinesis which
>> >>> contains an unbounded source based on a kinesis topic. I'd like to
>> >>> add this source to the amazon-web-services module since I'd like to
>> >>> depend on AwsOptions. Does adding this source to the
>> >>> amazon-web-services module make sense?
>> >>
>> >> Putting it inside of amazon-web-services makes a lot of sense. The
>> >> Google connectors all live within the one package and there has been
>> >> discussion to consolidate all the AWS stuff under amazon-web-services.
>> >>
>> >>> Also, the kinesis source looks a touch more complex than other
>> >>> sources. Both the JMS and AMQP sources look like better examples to
>> >>> follow. Which existing source would be the best to model this
>> >>> contribution after?
>> >>
>> >> Some of it has to do with how many ways a source can be read and how
>> >> complicated the watermark tracking is, but it would be best if the IO
>> >> authors comment on implementation details.
>> >>
>> >>> If anyone has put some thoughts into this, or better yet some code,
>> >>> I'd appreciate hearing from you.
>> >>>
>> >>> Thanks!
>> >>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>


Re: delayed emit (timer) in py-beam?

2018-07-30 Thread Austin Bennett
Fantastic; thanks, Charles!



On Mon, Jul 30, 2018 at 3:49 PM, Charles Chen  wrote:

> Hey Austin,
>
> This API is not yet implemented in the Python SDK.  I am working on this
> feature:  the next step from my end is to finish a reference implementation
> in the local DirectRunner.  As you note, the doc at
> https://s.apache.org/beam-python-user-state-and-timers describes the
> design.
>
> You can track progress on the mailing list thread here:
> https://lists.apache.org/thread.html/51ba1a00027ad8635bc1d2c0df805ce873995170c75d6a08dfe21997@%3Cdev.beam.apache.org%3E
>
> Best,
> Charles
>
> On Mon, Jul 30, 2018 at 3:34 PM Austin Bennett
> <whatwouldausti...@gmail.com> wrote:
>
>> What's going on with timers and python?
>>
>> Am looking at building a pipeline (assuming another group in my company
>> will grant access to the Kafka topic):
>>
>> Kafka -> beam -> have beam wait 24 hours -> do transform(s) and emit a
>> record.  If I read things correctly that's not currently possible in python
>> on beam.  What all is needed?  (trying to figure out whether that is
>> something that I am capable of and there is room for me to help with).
>> Looking for similar functionality to
>> https://www.rabbitmq.com/blog/2015/04/16/scheduling-messages-with-rabbitmq/
>> (though don't need alternate routing, nor is that example in python).
>>
>>
>> For example, I see:
>> https://beam.apache.org/blog/2017/08/28/timely-processing.html
>>
>> and tickets like:  https://issues.apache.org/jira/browse/BEAM-4594
>>
>>
>>


Re: delayed emit (timer) in py-beam?

2018-07-30 Thread Charles Chen
Hey Austin,

This API is not yet implemented in the Python SDK.  I am working on this
feature:  the next step from my end is to finish a reference implementation
in the local DirectRunner.  As you note, the doc at
https://s.apache.org/beam-python-user-state-and-timers describes the design.

You can track progress on the mailing list thread here:
https://lists.apache.org/thread.html/51ba1a00027ad8635bc1d2c0df805ce873995170c75d6a08dfe21997@%3Cdev.beam.apache.org%3E

Best,
Charles

On Mon, Jul 30, 2018 at 3:34 PM Austin Bennett 
wrote:

> What's going on with timers and python?
>
> Am looking at building a pipeline (assuming another group in my company
> will grant access to the Kafka topic):
>
> Kafka -> beam -> have beam wait 24 hours -> do transform(s) and emit a
> record.  If I read things correctly that's not currently possible in python
> on beam.  What all is needed?  (trying to figure out whether that is
> something that I am capable of and there is room for me to help with).
> Looking for similar functionality to
> https://www.rabbitmq.com/blog/2015/04/16/scheduling-messages-with-rabbitmq/
> (though don't need alternate routing, nor is that example in python).
>
>
> For example, I see:
> https://beam.apache.org/blog/2017/08/28/timely-processing.html
>
> and tickets like:  https://issues.apache.org/jira/browse/BEAM-4594
>
>
>


delayed emit (timer) in py-beam?

2018-07-30 Thread Austin Bennett
What's going on with timers and python?

Am looking at building a pipeline (assuming another group in my company
will grant access to the Kafka topic):

Kafka -> beam -> have beam wait 24 hours -> do transform(s) and emit a
record.  If I read things correctly that's not currently possible in python
on beam.  What all is needed?  (trying to figure out whether that is
something that I am capable of and there is room for me to help with).
Looking for similar functionality to
https://www.rabbitmq.com/blog/2015/04/16/scheduling-messages-with-rabbitmq/
(though don't need alternate routing, nor is that example in python).


For example, I see:
https://beam.apache.org/blog/2017/08/28/timely-processing.html

and tickets like:  https://issues.apache.org/jira/browse/BEAM-4594
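
For readers following this thread, the "hold each record for 24 hours, then emit" behaviour asked about above can be sketched in plain Python. This is only a toy model under stated assumptions: it does not use any Beam API, and the class and method names (DelayedEmitter, add, advance_to) are hypothetical. In Beam itself this would presumably be expressed with state and timers as described in the design doc linked in the replies.

```python
import heapq
import itertools


class DelayedEmitter:
    """Toy model: holds each record until `delay_seconds` after it arrived.

    Hypothetical helper for illustration only; not part of Apache Beam.
    """

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self._pending = []  # min-heap of (fire_time, sequence, record)
        self._sequence = itertools.count()  # tie-breaker for equal fire times

    def add(self, record, arrival_time):
        """Buffer a record and set its 'timer' to arrival_time + delay."""
        fire_time = arrival_time + self.delay_seconds
        heapq.heappush(self._pending, (fire_time, next(self._sequence), record))

    def advance_to(self, now):
        """Return, in firing order, the records whose timers fire at or before `now`."""
        ready = []
        while self._pending and self._pending[0][0] <= now:
            _, _, record = heapq.heappop(self._pending)
            ready.append(record)
        return ready


DAY = 24 * 60 * 60
emitter = DelayedEmitter(delay_seconds=DAY)
emitter.add("record-1", arrival_time=0)
emitter.add("record-2", arrival_time=3600)
print(emitter.advance_to(DAY - 1))     # nothing is due yet
print(emitter.advance_to(DAY))         # record-1 becomes due
print(emitter.advance_to(DAY + 3600))  # record-2 becomes due
```

The explicit `now` argument stands in for whatever notion of time the runner tracks; a real pipeline would not poll like this.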


Re: Live coding & reviewing adventures

2018-07-30 Thread Holden Karau
So small schedule changes.
I’ll be doing some poking at the Go SDK at 2pm today -
https://www.youtube.com/watch?v=9UAu1DOZJhM and the one with Gris setting
up Beam on a new machine will be moved to Friday because her laptop got
delayed - https://www.youtube.com/watch?v=x8Wg7qCDA5k

On Tue, Jul 24, 2018 at 8:41 PM Holden Karau  wrote:

> I'll be doing this again this week & next looking at a few different
> topics.
>
> Tomorrow (July 25th @ 10am pacific) Gris & I will be updating the PR from
> my last live stream (adding Python dependency handling) -
> https://www.twitch.tv/events/P92irbgYR9Sx6nMQ-lGY3g /
> https://www.youtube.com/watch?v=4xDsY5QL2zM
>
> In the afternoon @ 3 pm pacific I'll be looking at the dev tools we've had
> some discussions around with respect to reviews -
> https://www.twitch.tv/events/vNzcZ7DdSuGFNYURW_9WEQ /
> https://www.youtube.com/watch?v=6cTmC_fP9B0
>
> Next week on Thursday August 1st @ 2pm pacific Gris & I will be setting up
> Beam on her new laptop together, so for any new users looking to see how to
> install Beam from source this one is for you (or for devs looking to see
> how painful set up is) -
> https://www.twitch.tv/events/YAYvNp3tT0COkcpNBxnp6A /
> https://www.youtube.com/watch?v=x8Wg7qCDA5k
>
> P.S.
>
> As always I'll be doing my regular Friday code reviews in Spark -
> https://www.youtube.com/watch?v=O4rRx-3PTiM . You can see the other ones
> I have planned on my twitch events and youtube.
>
> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
> wrote:
>
>> Hi folks! I've been doing some live coding in my other projects and I
>> figured I'd do some with Apache Beam as well.
>>
>> Today @ 3pm pacific I'm going to be doing some impromptu exploration of
>> better review tooling possibilities (looking at forking spark-pr-dashboard
>> for other projects like beam and setting up mentionbot to work with ASF
>> infra) - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>
>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>> trying to get easier dependency management for the Python portable runner
>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>
>> If you're interested in seeing more of the development process I hope you
>> will join me :)
>>
>> P.S.
>>
>> You can also follow on twitch which does a better job of notifications
>> https://www.twitch.tv/holdenkarau
>>
>> Also, one of the other things I do is "live reviews" of PRs, but they are
>> generally opt-in and I don't have enough opt-ins from the Beam community to
>> do live reviews in Beam. If you work on Beam and would be OK with me doing
>> a live streamed review of your PRs let me know (if you're curious what
>> they look like you can see some of them here in Spark land).
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Beam Docs Contributor

2018-07-30 Thread Thomas Weise
Welcome Rose, and looking forward to the docs update!

On Mon, Jul 30, 2018 at 9:15 AM Henning Rohde  wrote:

> Welcome Rose! Great to have you here.
>
> On Mon, Jul 30, 2018 at 2:23 AM Ismaël Mejía  wrote:
>
>> Welcome !
>> Great to see someone new working in this important area for the project.
>>
>>
>> On Mon, Jul 30, 2018 at 5:57 AM Kai Jiang  wrote:
>>
>>> Welcome Rose!
>>> ᐧ
>>>
>>> On Sun, Jul 29, 2018 at 8:53 PM Rui Wang  wrote:
>>>
>>>> Welcome!
>>>>
>>>> -Rui
>>>>
>>>> On Sun, Jul 29, 2018 at 7:07 PM Griselda Cuevas wrote:
>>>>
>>>>> Welcome Rose, very glad to have you in the community :)
>>>>>
>>>>> On Fri, 27 Jul 2018 at 16:29, Ahmet Altay wrote:
>>>>>
>>>>>> Welcome Rose! Looking forward to your contributions.
>>>>>>
>>>>>> On Fri, Jul 27, 2018 at 4:08 PM, Rose Nguyen wrote:
>>>>>>
>>>>>>> Hi all:
>>>>>>>
>>>>>>> I'm Rose! I've worked on Cloud Dataflow documentation and now I'm
>>>>>>> starting a project to refresh the Beam docs and improve the onboarding
>>>>>>> experience. We're planning on splitting up the programming guide into
>>>>>>> multiple pages, making the docs more accessible for new users. I've got
>>>>>>> lots of ideas for doc improvements, some of which are motivated by the
>>>>>>> UX research, and am excited to share them with you all and work on them.
>>>>>>>
>>>>>>> I look forward to interacting with everybody in the community. I
>>>>>>> welcome comments, thoughts, feedback, etc.
>>>>>>> --
>>>>>>>
>>>>>>> Rose Thi Nguyen
>>>>>>>
>>>>>>>   Technical Writer
>>>>>>>
>>>>>>> (281) 683-6900
>>>>>>>


Re: Beam Docs Contributor

2018-07-30 Thread Henning Rohde
Welcome Rose! Great to have you here.

On Mon, Jul 30, 2018 at 2:23 AM Ismaël Mejía  wrote:

> Welcome !
> Great to see someone new working in this important area for the project.
>
>
> On Mon, Jul 30, 2018 at 5:57 AM Kai Jiang  wrote:
>
>> Welcome Rose!
>> ᐧ
>>
>> On Sun, Jul 29, 2018 at 8:53 PM Rui Wang  wrote:
>>
>>> Welcome!
>>>
>>> -Rui
>>>
>>> On Sun, Jul 29, 2018 at 7:07 PM Griselda Cuevas  wrote:
>>>
>>>> Welcome Rose, very glad to have you in the community :)
>>>>
>>>> On Fri, 27 Jul 2018 at 16:29, Ahmet Altay wrote:
>>>>
>>>>> Welcome Rose! Looking forward to your contributions.
>>>>>
>>>>> On Fri, Jul 27, 2018 at 4:08 PM, Rose Nguyen wrote:
>>>>>
>>>>>> Hi all:
>>>>>>
>>>>>> I'm Rose! I've worked on Cloud Dataflow documentation and now I'm
>>>>>> starting a project to refresh the Beam docs and improve the onboarding
>>>>>> experience. We're planning on splitting up the programming guide into
>>>>>> multiple pages, making the docs more accessible for new users. I've got
>>>>>> lots of ideas for doc improvements, some of which are motivated by the UX
>>>>>> research, and am excited to share them with you all and work on them.
>>>>>>
>>>>>> I look forward to interacting with everybody in the community. I
>>>>>> welcome comments, thoughts, feedback, etc.
>>>>>> --
>>>>>>
>>>>>> Rose Thi Nguyen
>>>>>>
>>>>>>   Technical Writer
>>>>>>
>>>>>> (281) 683-6900
>>>>>>


Re: Issues with Beam SQL on Spark

2018-07-30 Thread Andrew Pilloud
That sounds great. I'm subscribed to that list as well, so I'll keep an eye
out for your email.

Andrew

On Sun, Jul 29, 2018 at 9:07 PM Kai Jiang  wrote:

> Hi Andrew,
>
> I tried replacing "jdbc:calcite" with "jdbc:beam" in calcite and
> re-shading. After that, Beam SQL can run on Spark.
> However, I didn't find a way to modify the code while shading the Calcite
> library. I think the second method you mentioned is feasible.
> I'll forward this thread to dev@calcite to see if we can connect
> between calcite modules without using the DriverManager.
>
> Best,
> Kai
> ᐧ
>
> On Tue, Jul 24, 2018 at 1:04 PM Kai Jiang  wrote:
>
>> Thank you Andrew! I will take a look at whether it is feasible to rewrite
>> "jdbc:calcite:" in Beam's repackaged calcite.
>>
>> Best,
>> Kai
>>
>> On 2018/07/24 19:08:17, Andrew Pilloud wrote:
>> > I don't really think this is something that involves changes to
>> > DriverManager. Beam is causing the problem by relocating calcite's path
>> > but not also modifying the global state it creates.
>> >
>> > Andrew
>> >
>> > On Tue, Jul 24, 2018 at 12:03 PM Kai Jiang wrote:
>> >
>> > > Thanks Andrew! It's really helpful. I'll try shading calcite while
>> > > rewriting the "jdbc:calcite".
>> > > I also had a look at the doc of DriverManager. Do you think including
>> > > all repackaged jdbc drivers in a property setting like the one below
>> > > would be helpful?
>> > >  jdbc.drivers=org.apache.beam.repackaged.beam.
>> > >
>> > > Best,
>> > > Kai
>> > >
>> > > On 2018/07/24 16:56:50, Andrew Pilloud wrote:
>> > > > Looks like calcite isn't easily repackageable. This issue can be
>> > > > fixed either in our shading (by also rewriting the "jdbc:calcite:"
>> > > > string when we shade calcite) or in calcite (by not using the
>> > > > driver manager to connect between calcite modules).
>> > > >
>> > > > Andrew
>> > > >
>> > > > On Mon, Jul 23, 2018 at 11:18 PM Kai Jiang wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I met an issue when I ran Beam SQL on Spark. I want to check and
>> > > > > see if anyone has the same issue. I believe getting Beam SQL
>> > > > > running on Spark is important. If you have encountered the same
>> > > > > problem, it would be really helpful if you could give some input.
>> > > > >
>> > > > > Context:
>> > > > > I set up a TPC framework to run SQL on Spark. The code
>> > > > > <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/src/main/java/org/apache/beam/sdk/extensions/tpc/BeamTpc.java>
>> > > > > is simple: it just ingests csv data and applies SQL to it. The
>> > > > > Gradle
>> > > > > <https://github.com/vectorijk/beam/blob/tpch/sdks/java/extensions/tpc/build.gradle>
>> > > > > setting includes `runner-spark` and the necessary libraries. The
>> > > > > exception stack trace
>> > > > > <https://gist.github.com/vectorijk/849cbcd5bce558e5e7c97916ca4c793a>
>> > > > > shows some details. However, the same code runs on Flink and
>> > > > > Dataflow successfully.
>> > > > >
>> > > > > Investigations:
>> > > > > BEAM-3386 also describes a similar issue. It took me some time to
>> > > > > investigate it. I guess there is a version conflict between the
>> > > > > Calcite library in Spark and Beam SQL's repackaged Calcite. The
>> > > > > version of the Calcite library that Spark ( * - 2.3.1) uses is
>> > > > > very old (1.2.0-incubating).
>> > > > >
>> > > > > After packaging a fat jar and submitting it to Spark, Spark
>> > > > > registered both the old version's calcite jdbc driver and Beam's
>> > > > > repackaged jdbc driver in registeredDrivers
>> > > > > (DriverManager.java#L294
>> > > > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L294>).
>> > > > > Jdbc's DriverManager always connects to the old version of
>> > > > > calcite's jdbc driver in Spark instead of Beam's repackaged
>> > > > > calcite.
>> > > > >
>> > > > > I looked into DriverManager.java#L556
>> > > > > <https://github.com/JetBrains/jdk8u_jdk/blob/master/src/share/classes/java/sql/DriverManager.java#L556>
>> > > > > and inserted a breakpoint at
>> > > > > aClass = Class.forName(driver.getClass().getName(), true, classLoader);
>> > > > >
>> > > > > driver.getClass().getName() -> "org.apache.calcite.jdbc.Driver"
>> > > > > classLoader only has class 'org.apache.beam.**' and
>> > > > > 'org.apache.beam.repackaged.beam_***'. (There is no path of class
>> > > > > 'org.apache.calcite.*')
>> > > > >
>> > > > > Oddly, aClass is assigned the Class
>> > > > > "org.apache.calcite.jdbc.Driver". I think it should raise an
>> > > > > exception and be skipped. Actually, it did not. So Spark's
>> > > > > calcite jdbc driver was connected. All logic afterwards goes to
>> > > > > Spark's calcite classpath. I believe that's the pivotal point.
>> > > > >
>> > > > > Potential solutions:
>> > > > >
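
The driver clash discussed in this thread can be illustrated with a small Python model. This is a deliberately simplified sketch and not the real java.sql.DriverManager: it assumes a registry that tries drivers in registration order and returns the first non-null connection, it assumes Spark's bundled driver is registered first, and it leaves out the class loader visibility check mentioned above. The names Driver, DriverRegistry and build_registry are hypothetical.

```python
class Driver:
    """Hypothetical JDBC-like driver that handles URLs with one prefix."""

    def __init__(self, name, url_prefix):
        self.name = name
        self.url_prefix = url_prefix

    def connect(self, url):
        # Mirrors the JDBC convention of returning null (None here) for
        # URLs that this driver does not handle.
        if url.startswith(self.url_prefix):
            return f"connection via {self.name}"
        return None


class DriverRegistry:
    """Simplified stand-in for a global driver registry."""

    def __init__(self):
        self._drivers = []

    def register(self, driver):
        self._drivers.append(driver)

    def get_connection(self, url):
        # Drivers are tried in registration order; the first one that
        # returns a connection wins.
        for driver in self._drivers:
            connection = driver.connect(url)
            if connection is not None:
                return connection
        raise LookupError(f"No suitable driver for {url}")


def build_registry(beam_prefix):
    """Register Spark's bundled driver first (an assumption), then Beam's."""
    registry = DriverRegistry()
    registry.register(Driver("spark-bundled calcite", "jdbc:calcite:"))
    registry.register(Driver("beam-repackaged calcite", beam_prefix))
    return registry


# Both drivers claim "jdbc:calcite:", so the one registered first is used.
clash = build_registry("jdbc:calcite:")
print(clash.get_connection("jdbc:calcite:"))

# After rewriting the prefix during shading, the URLs no longer collide.
fixed = build_registry("jdbc:beam:")
print(fixed.get_connection("jdbc:beam:"))
```

In this model, relocating the classes without changing the URL prefix leaves both drivers competing for the same URL, which is why rewriting "jdbc:calcite:" during shading resolves the lookup.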

Beam Dependency Check Report (2018-07-30)

2018-07-30 Thread Apache Jenkins Server

High Priority Dependency Updates Of Beam Python SDK:


Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release
google-cloud-bigquery | 0.25.0 | 1.4.0 | 2017-06-26 | 2018-07-16
google-cloud-core | 0.25.0 | 0.28.1 | 2018-06-07 | 2018-06-07
google-cloud-pubsub | 0.26.0 | 0.35.4 | 2017-06-26 | 2018-06-08
ply | 3.8 | 3.11 | 2018-06-07 | 2018-06-07


High Priority Dependency Updates Of Beam Java SDK:


Dependency Name | Current Version | Latest Version | Release Date Of The Current Used Version | Release Date Of The Latest Release
org.assertj:assertj-core | 2.5.0 | 3.10.0 | 2016-07-03 | 2018-05-11
com.google.auto.service:auto-service | 1.0-rc2 | 1.0-rc4 | 2018-06-25 | 2017-12-11
biz.aQute:bndlib | 1.43.0 | 2.0.0.20130123-133441 | 2018-06-25 | 2018-06-25
org.apache.cassandra:cassandra-all | 3.9 | 3.11.2 | 2016-09-26 | 2018-02-14
org.apache.commons:commons-dbcp2 | 2.1.1 | 2.5.0 | 2015-08-02 | 2018-07-16
de.flapdoodle.embed:de.flapdoodle.embed.mongo | 1.50.1 | 2.1.1 | 2015-12-11 | 2018-06-25
de.flapdoodle.embed:de.flapdoodle.embed.process | 1.50.1 | 2.0.5 | 2015-12-11 | 2018-06-25
org.apache.derby:derby | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
org.apache.derby:derbyclient | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
org.apache.derby:derbynet | 10.12.1.1 | 10.14.2.0 | 2015-10-10 | 2018-05-03
org.elasticsearch:elasticsearch | 5.6.3 | 6.3.2 | 2017-10-06 | 2018-07-30
org.elasticsearch:elasticsearch-hadoop | 5.0.0 | 6.3.2 | 2016-10-26 | 2018-07-30
org.elasticsearch.client:elasticsearch-rest-client | 5.6.3 | 6.3.2 | 2017-10-06 | 2018-07-30
com.alibaba:fastjson | 1.2.12 | 1.2.47 | 2016-05-21 | 2018-03-15
org.elasticsearch.test:framework | 5.6.3 | 6.3.2 | 2017-10-06 | 2018-07-30
org.freemarker:freemarker | 2.3.25-incubating | 2.3.28 | 2016-06-14 | 2018-03-30
net.ltgt.gradle:gradle-apt-plugin | 0.13 | 0.18 | 2017-11-01 | 2018-07-23
com.commercehub.gradle.plugin:gradle-avro-plugin | 0.11.0 | 0.14.2 | 2018-01-30 | 2018-06-06
gradle.plugin.com.palantir.gradle.docker:gradle-docker | 0.13.0 | 0.20.1 | 2017-04-05 | 2018-07-09
com.github.ben-manes:gradle-versions-plugin | 0.17.0 | 0.20.0 | 2018-06-06 | 2018-06-25
org.codehaus.groovy:groovy-all | 2.4.13 | 3.0.0-alpha-3 | 2017-11-22 | 2018-06-26
com.google.guava:guava | 20.0 | 25.1-jre | 2018-07-16 | 2018-07-16
org.apache.hbase:hbase-common | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-hadoop-compat | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-hadoop2-compat | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-server | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-shaded-client | 1.2.6 | 2.1.0 | 2017-05-29 | 2018-07-23
org.apache.hbase:hbase-shaded-server | 1.2.6 | 2.0.0-alpha2 | 2017-05-29 | 2018-05-31
org.apache.hive:hive-cli | 2.1.0 | 3.0.0 | 2016-06-16 | 2018-07-30
org.apache.hive:hive-common | 2.1.0 | 3.0.0 | 2016-06-16 | 2018-07-30
org.apache.hive:hive-exec | 2.1.0 | 3.0.0 | 2016-06-16 | 2018-07-30
org.apache.hive.hcatalog:hive-hcatalog-core | 2.1.0 | 3.0.0 | 2016-06-16 | 2018-07-30
org.apache.httpcomponents:httpasyncclient | 4.1.2 | 4.1.4 | 2016-06-18 | 2018-07-23
org.apache.httpcomponents:httpclient | 4.5.2 | 4.5.6 | 2016-02-21 | 2018-07-09
org.apache.httpcomponents:httpcore | 4.4.5

Re: ElasticsearchIO bulk delete

2018-07-30 Thread Tim Robertson
> we decided to postpone the feature

That makes sense.

I believe the ES6 branch is partly working (I've read the code but not
used it). You can see it here [1], and the Jira to watch or contribute to
is [2]. It would be useful to test it independently and report any
observations or improvement requests on that Jira.

The offer to assist in your first PR remains open for the future - please
don't hesitate to ask.

Thanks,
Tim

[1] https://github.com/jsteggink/beam/tree/BEAM-3199/sdks/java/io/elasticsearch-6/src/main/java/org/apache/beam/sdk/io/elasticsearch
[2] https://issues.apache.org/jira/browse/BEAM-3199

On Mon, Jul 30, 2018 at 10:55 AM, Wout Scheepers <
wout.scheep...@vente-exclusive.com> wrote:

> Hey Tim,
>
>
>
> Thanks for your proposal to mentor me through my first PR.
>
> As we’re definitely planning to upgrade to ES6 when Beam supports it, we
> decided to postpone the feature (we have a fix that works for us, for now).
>
> When Beam supports ES6, I’ll be happy to make a contribution to get bulk
> deletes working.
>
>
>
> For reference, I opened a ticket
> (https://issues.apache.org/jira/browse/BEAM-5042).
>
>
>
> Cheers,
>
> Wout
>
>
>
>
>
> *From: *Tim Robertson 
> *Reply-To: *"u...@beam.apache.org" 
> *Date: *Friday, 27 July 2018 at 17:43
> *To: *"u...@beam.apache.org" 
> *Subject: *Re: ElasticsearchIO bulk delete
>
>
>
> Hi Wout,
>
>
>
> This is great, thank you. I wrote the partial update support you reference
> and I'll be happy to mentor you through your first PR - welcome aboard. Can
> you please open a Jira to reference this work and we'll assign it to you?
>
>
>
> We discussed having the "_xxx" fields in the document and triggering
> actions based on that in the partial update jira but opted to avoid
> it. Based on that discussion the ActionFn would likely be the preferred
> approach.  Would that be possible?
>
>
>
> It will be important to provide unit and integration tests as well.
>
>
>
> Please be aware that there is a branch and work underway for ES6 already
> which is rather different on the write() path so this may become redundant
> rather quickly.
>
>
>
> Thanks,
>
> Tim
>
>
>
> @timrobertson100 on the Beam slack channel
>
>
>
>
>
>
>
> On Fri, Jul 27, 2018 at 2:53 PM, Wout Scheepers
> <wout.scheep...@vente-exclusive.com> wrote:
>
> Hey all,
>
>
>
> A while ago, I patched ElasticsearchIO to be able to do partial updates
> and deletes.
>
> However, I did not consider my patch pull-request-worthy, as the JSON
> parsing was inefficient (each document was parsed twice).
>
>
>
> Partial updates have been supported since Beam 2.5.0, so the only thing
> I’m missing is the ability to send bulk *delete* requests.
>
> We’re using entity updates for event sourcing in our data lake and need to
> persist deleted entities in elastic.
>
> We’ve been using my patch in production for the last year, but I would
> like to contribute to get the functionality we need into one of the next
> releases.
>
>
>
> I’ve created a gist that works for me, but is still inefficient (parsing
> twice: once to check the ‘_action’ field, once to get the metadata).
>
> Each document I want to delete needs an additional ‘_action’ field with
> the value ‘delete’. It doesn’t matter that the document still contains the
> redundant field, as the delete action only requires the metadata.
>
> I’ve added the method isDelete() and made some changes to the
> processElement() method.
>
> https://gist.github.com/wscheep/26cca4bda0145ffd38faf7efaf2c21b9
>
>
>
> I would like to make my solution more generic to fit into the current
> ElasticsearchIO and create a proper pull request.
>
> As this would be my first pull request for Beam, can anyone point me in
> the right direction before I spend too much time creating something that
> will be rejected?
>
>
>
> Some questions at the top of my mind:
>
> - Is it a good idea to make the ‘action’ part of the bulk API generic?
> - Should it be even more generic (e.g. by setting an ‘ActionFn’ on
>   ElasticsearchIO)?
> - If I want to avoid parsing twice, the parsing should be done outside
>   of the getDocumentMetaData() method. Would this be acceptable?
> - Is it possible to avoid passing the action as a field in the
>   document?
> - Is there another or better way to get the delete functionality in
>   general?
>
>
>
> All feedback is more than welcome.
>
>
> Cheers,
> Wout
>
>
>
>
>
>
>
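
The single-parse idea raised in the questions above can be sketched outside
of Beam. The following is a minimal Python illustration, not ElasticsearchIO
code: it parses each JSON document once, reads the ‘_action’ field from the
parsed result, and emits the lines of an Elasticsearch bulk request, where a
delete is a metadata line with no document source after it. The names
bulk_lines and bulk_body, taking the document id from an "id" field, and
dropping ‘_action’ from indexed documents are all assumptions made for this
example; the target index is assumed to come from the request URL.

```python
import json


def bulk_lines(raw_doc, action_field="_action", id_field="id"):
    """Return the bulk-request lines for one JSON document (hypothetical helper).

    The document is parsed once; the same parsed dict is used both to pick
    the action and to build the metadata line.
    """
    doc = json.loads(raw_doc)
    # Assumption: the marker field is removed so it is never indexed.
    action = doc.pop(action_field, "index")
    # Assumption: the document id lives in a top-level "id" field.
    metadata = {"_id": str(doc[id_field])}
    if action == "delete":
        # A bulk delete is a metadata line only; no source line follows it.
        return [json.dumps({"delete": metadata})]
    # An index action is a metadata line followed by the document source.
    return [json.dumps({"index": metadata}), json.dumps(doc)]


def bulk_body(raw_docs):
    """Join the lines for many documents into one newline-delimited body."""
    lines = [line for raw in raw_docs for line in bulk_lines(raw)]
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = ['{"id": 1, "name": "a"}', '{"id": 2, "_action": "delete"}']
    print(bulk_body(docs), end="")
```

With this shape, the choice of action is a small function of the already
parsed document, which is roughly the role an ‘ActionFn’ would play.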


Re: Beam Docs Contributor

2018-07-30 Thread Ismaël Mejía
Welcome!
Great to see someone new working in this important area for the project.


On Mon, Jul 30, 2018 at 5:57 AM Kai Jiang  wrote:

> Welcome Rose!
>
> On Sun, Jul 29, 2018 at 8:53 PM Rui Wang  wrote:
>
>> Welcome!
>>
>> -Rui
>>
>> On Sun, Jul 29, 2018 at 7:07 PM Griselda Cuevas  wrote:
>>
>>> Welcome Rose, very glad to have you in the community :)
>>>
>>>
>>>
>>> On Fri, 27 Jul 2018 at 16:29, Ahmet Altay  wrote:
>>>
>>>> Welcome Rose! Looking forward to your contributions.
>>>>
>>>> On Fri, Jul 27, 2018 at 4:08 PM, Rose Nguyen wrote:
>>>>
> Hi all:
>
> I'm Rose! I've worked on Cloud Dataflow documentation and now I'm
> starting a project to refresh the Beam docs and improve the onboarding
> experience. We're planning on splitting up the programming guide into
> multiple pages, making the docs more accessible for new users. I've got
> lots of ideas for doc improvements, some of which are motivated by the UX
> research, and am excited to share them with you all and work on them.
>
> I look forward to interacting with everybody in the community. I
> welcome comments, thoughts, feedback, etc.
> --
>
>
> Rose Thi Nguyen
>
>   Technical Writer
>
> (281) 683-6900
>