Re: makes bundle concept usable?

2017-11-17 Thread Eugene Kirpichov
I must admit I'm still failing to understand the problem, so let's step back even further. Could you give an example of an IO that is currently difficult to implement specifically because of lack of the feature you're talking about? I'm asking because I've reviewed almost all Beam IOs and don't r

Re: makes bundle concept usable?

2017-11-17 Thread Romain Manni-Bucau
Yep, just take ES IO, if a part of a bundle fails you are in an unmanaged state. This is the case for all O (of IO ;)). Issue is not much about "1" (the code it takes) but more the fact it doesn't integrate with runner features and retries potentially: what happens if a bundle has a failure? => und

Re: Improving integration test coverage of I/O transforms

2017-11-17 Thread Chamikara Jayalath
I created the following JIRAs. Integration test for TextIO ReadAll and dynamic writes: https://issues.apache.org/jira/browse/BEAM-3211 Integration test for MongoDBIO: https://issues.apache.org/jira/browse/BEAM-3212 Large scale performance test for MongoDBIO: https://issues.apache.org/jira/browse/BEAM-

Re: [VOTE] Choose the "new" Spark runner

2017-11-17 Thread Kenneth Knowles
For your convenience, here's the compare view: https://github.com/apache/beam/compare/master...jbonofre:BEAM-1920-SPARK2-MODULES https://github.com/apache/beam/compare/master...jbonofre:BEAM-1920-SPARK2-ONLY Kenn On Thu, Nov 16, 2017 at 5:08 AM, Jean-Baptiste Onofré wrote: > Hi guys, > > To

Re: [VOTE] Choose the "new" Spark runner

2017-11-17 Thread Kenneth Knowles
> > [ ] Use Spark 1 & Spark 2 Support Branch [X] Use Spark 2 Only Branch Kenn

Re: makes bundle concept usable?

2017-11-17 Thread Eugene Kirpichov
The behavior if a bundle has a failure is quite defined: the entire bundle is considered failed and processing of the bundle's elements will get retried. The level at which retries are performed is unspecified: a runner would be allowed to retry the bundle, or it would be allowed to split the remai
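
The retry behavior described in this message can be illustrated with a small simulation. This is only a sketch in plain Python, not Beam code: `run_bundle_with_retries` and `flaky_write` are hypothetical names, and it assumes the simplest option the message allows, namely that the runner retries the whole bundle. A real runner may choose a different retry granularity.

```python
def run_bundle_with_retries(bundle, process, max_attempts=3):
    """Process every element of a bundle; if any element fails, the whole
    bundle is treated as failed and all of its elements are retried.

    Hypothetical helper: it models only the whole-bundle retry case.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            for element in bundle:
                process(element)
            return attempt  # every element succeeded, the bundle is done
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt


calls = []           # every element the sink was asked to write, in order
failed_once = set()  # elements that have already failed one time


def flaky_write(element):
    """Stand-in for a sink write that fails once for element "b"."""
    calls.append(element)
    if element == "b" and element not in failed_once:
        failed_once.add(element)
        raise RuntimeError("transient failure")


attempts = run_bundle_with_retries(["a", "b", "c"], flaky_write)
print(attempts)  # number of attempts the bundle needed
print(calls)     # "a" appears twice: it was re-processed with the retried bundle
```

The point of the sketch is that element "a" is written again even though it succeeded the first time, which is why side effects performed inside a bundle need to tolerate being repeated.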

Re: makes bundle concept usable?

2017-11-17 Thread Eugene Kirpichov
In case of Elasticsearch: Elasticsearch takes a PCollection with JSON documents, which may contain a document id. ES will overwrite a document with the same id if it exists, so in case of retries inserting the same document multiple times will not lead to duplicates. I guess the solution is to simp
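
The overwrite-by-id argument can be sketched with an in-memory stand-in for an index. This is not the Elasticsearch client or ElasticsearchIO: `InMemoryIndex` and its `index` method are hypothetical, and the only behavior assumed is the one stated above (a document written with an existing id replaces the old one) plus the assumption that documents without an id receive a fresh generated id.

```python
import itertools


class InMemoryIndex:
    """Hypothetical stand-in for an index that upserts documents by id."""

    def __init__(self):
        self.docs = {}
        self._auto_ids = itertools.count()

    def index(self, doc, doc_id=None):
        # Without an explicit id, a new id is generated for each write,
        # so a retried write creates a second copy of the document.
        if doc_id is None:
            doc_id = f"auto-{next(self._auto_ids)}"
        # With an explicit id, the write overwrites any existing document.
        self.docs[doc_id] = doc
        return doc_id


bundle = [{"id": "1", "msg": "hello"}, {"id": "2", "msg": "world"}]

with_ids = InMemoryIndex()
without_ids = InMemoryIndex()

# Simulate a retried bundle by writing every document twice.
for _ in range(2):
    for doc in bundle:
        with_ids.index(doc, doc_id=doc["id"])
        without_ids.index(doc)

print(len(with_ids.docs))     # retries were absorbed by the overwrite
print(len(without_ids.docs))  # retries produced duplicate documents
```

Under these assumptions, retried writes are harmless only when each document carries a stable id.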

Re: makes bundle concept usable?

2017-11-17 Thread Raghu Angadi
On Thu, Nov 16, 2017 at 10:40 PM, Eugene Kirpichov < kirpic...@google.com.invalid> wrote: > > [...] So it would help if you could give a > more concrete example: for example, take some IO that you think could be > easier to write with your proposed API, give the contents of a hypothetical > PCollec

Re: [VOTE] Choose the "new" Spark runner

2017-11-17 Thread Ted Yu
[ ] Use Spark 1 & Spark 2 Support Branch [X] Use Spark 2 Only Branch On Thu, Nov 16, 2017 at 5:08 AM, Jean-Baptiste Onofré wrote: > Hi guys, > > To illustrate the current discussion about Spark versions support, you can > take a look on: > > -- > Spark 1 & Spark 2 Support Branch > > http

Re: [VOTE] Choose the "new" Spark runner

2017-11-17 Thread Ben Sidhom
[ ] Use Spark 1 & Spark 2 Support Branch [X] Use Spark 2 Only Branch On Fri, Nov 17, 2017 at 9:46 AM, Ted Yu wrote: > [ ] Use Spark 1 & Spark 2 Support Branch > [X] Use Spark 2 Only Branch > > On Thu, Nov 16, 2017 at 5:08 AM, Jean-Baptiste Onofré > wrote: > > > Hi guys, > > > >

Re: Sink API question

2017-11-17 Thread Chet Aldrich
Hey JB, I wasn’t really thinking about open-sourcing the I/O transform with Beam in this case because of the API’s proprietary nature, but then again I suppose that BigQuery and other Google services are similarly proprietary and have transforms included with Beam. If you feel like it’d be a

Re: makes bundle concept usable?

2017-11-17 Thread Raghu Angadi
On Fri, Nov 17, 2017 at 1:02 AM, Romain Manni-Bucau wrote: > Yep, just take ES IO, if a part of a bundle fails you are in an > unmanaged state. This is the case for all O (of IO ;)). Issue is not > much about "1" (the code it takes) but more the fact it doesn't > integrate with runner features an

ElasticSearch and dynamic indexes

2017-11-17 Thread NerdyNick
So I'm looking to expand the ElasticSearchIO writer to support dynamic indexes. Wanted to kick off talking about design. I'll create a Jira for it once I get going. Right now I'm thinking to expand the ConnectionConfiguration class to accept either an index (Current), dot-path string to look into
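
One way to picture the dot-path idea is the sketch below. It is written in plain Python rather than against the ElasticsearchIO API, and it assumes the dot-path is resolved against each JSON document to pick that document's index. The names `resolve_dot_path` and `choose_index`, and the sample field `user.tenant`, are hypothetical and not part of the proposal.

```python
import json


def resolve_dot_path(document, path):
    """Walk a nested dict along a dot-separated path such as "user.tenant"."""
    current = document
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"path {path!r} not found at segment {key!r}")
        current = current[key]
    return current


def choose_index(json_text, index_path=None, static_index=None):
    """Return the index name for one document.

    Falls back to the static index (the current behavior) when no
    dot-path is configured.
    """
    if index_path is None:
        return static_index
    return str(resolve_dot_path(json.loads(json_text), index_path))


doc = '{"user": {"tenant": "acme"}, "message": "hi"}'
dynamic = choose_index(doc, index_path="user.tenant")
static = choose_index(doc, static_index="logs")
print(dynamic, static)
```

Raising `KeyError` for a missing path is a choice made for the sketch; a real writer would need to decide whether to fail the element, skip it, or route it to a default index.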

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-17 Thread Kenneth Knowles
Hi all, Following up on past discussions and https://issues.apache.org/jira/browse/BEAM-1189 I have prepared a spreadsheet so we can sign up for validation steps that must be done by a human. The spreadsheet for 2.2.0 is at https://s.apache.org/beam-2.2.0-release-validation. Everyone can edit, so

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-17 Thread Valentyn Tymofieiev
I have verified: SHA & MD5 signatures of Python artifacts in [2], and checked Python side of the validation checklist on Linux. There is one known issue in UserScore example for Dataflow runner. The issue has been fixed on master branch and does not require a cherry-pick at this point. A workaroun
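
For readers repeating the checksum part of this validation, a minimal sketch using Python's standard `hashlib` follows. The helper names `file_digest` and `matches_published` are hypothetical, the demo uses a temporary file rather than a real release artifact, and it assumes the published checksum is a bare hex string. It does not cover verifying GPG signatures.

```python
import hashlib
import os
import tempfile


def file_digest(path, algorithm, chunk_size=65536):
    """Hash a file in chunks so large artifacts are not loaded into memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, algorithm, published_hex):
    """Compare a file's digest with a published bare hex digest."""
    return file_digest(path, algorithm) == published_hex.strip().lower()


# Demo with a throwaway file standing in for a downloaded artifact.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    artifact_path = tmp.name

try:
    md5_ok = matches_published(
        artifact_path, "md5", hashlib.md5(b"hello").hexdigest())
    sha_ok = matches_published(
        artifact_path, "sha1", hashlib.sha1(b"hello").hexdigest())
    tampered_ok = matches_published(artifact_path, "md5", "0" * 32)
finally:
    os.remove(artifact_path)

print(md5_ok, sha_ok, tampered_ok)
```

Checksum files that also contain a file name after the digest would need to be split before being passed to `matches_published`.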

Re: ElasticSearch and dynamic indexes

2017-11-17 Thread NerdyNick
Jira for this is https://issues.apache.org/jira/browse/BEAM-3222 Also looking at https://github.com/json-path/JsonPath for providing the json fetching. On Fri, Nov 17, 2017 at 11:43 AM, NerdyNick wrote: > So I'm looking to expand the ElasticSearchIO writer to support dynamic > indexes. Wanted t

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-17 Thread Lukasz Cwik
+1, Approve the release I have verified the wordcount quickstart on the Apache Beam website using Apex, DirectRunner, Flink & Spark on Linux. The Gearpump runner is yet to have a quickstart listed on our website. Adding the quickstart is already represented by this existing issue: https://issues.

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-17 Thread Valentyn Tymofieiev
I have a process question: is the vote open for committers only or for all contributors? On Fri, Nov 17, 2017 at 4:06 PM, Lukasz Cwik wrote: > +1, Approve the release > > I have verified the wordcount quickstart on the Apache Beam website using > Apex, DirectRunner, Flink & Spark on Linux. > > T

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-17 Thread Lukasz Cwik
It's open to all; it's just that there are binding votes and non-binding votes. On Fri, Nov 17, 2017 at 4:26 PM, Valentyn Tymofieiev < valen...@google.com.invalid> wrote: > I have a process question: is the vote open for committers only or for all > contributors? > > On Fri, Nov 17, 2017 at 4:06 PM

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-17 Thread Eugene Kirpichov
How can I specify a dependency on the staged RC? E.g. I'm trying to validate the quickstart per https://beam.apache.org/get-started/quickstart-java/ and specifying version 2.2.0 doesn't work I suppose because it's not released yet. Should I pass some command-line flag to mvn to make it fetch the ve

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-17 Thread Robert Bradshaw
The source distribution contains a couple of files not on github (e.g. folders that were added on master, Python generated files). The pom files differed only by missing -SNAPSHOT, other than that presumably the source release should just be "wget https://github.com/apache/beam/archive/release-2.2.

Re: [VOTE] Release 2.2.0, release candidate #4

2017-11-17 Thread Reuven Lax
hmmm, I thought I removed those generated files from the zip file before sending this email. Let me check again. Reuven On Sat, Nov 18, 2017 at 8:52 AM, Robert Bradshaw < rober...@google.com.invalid> wrote: > The source distribution contains a couple of files not on github (e.g. > folders that w

Jenkins build is back to normal : beam_SeedJob #637

2017-11-17 Thread Apache Jenkins Server
See

Re: [Proposal] IOIT test parameters validation

2017-11-17 Thread Chamikara Jayalath
On Thu, Nov 16, 2017 at 10:13 AM Łukasz Gajowy wrote: > Hi all! > > We are currently working on the IO IT "test harness" that will allow to > run the IOITs on various runners, filesystems and with changing amount of > data. It is described in a doc some of you have probably seen and put > comment