Re: [VOTE] Choose the "new" Spark runner

2017-11-18 Thread Reuven Lax
[ ] Use Spark 1 & Spark 2 Support Branch [X] Use Spark 2 Only Branch On Sat, Nov 18, 2017 at 1:54 AM, Ben Sidhom wrote: > [ ] Use Spark 1 & Spark 2 Support Branch > [X] Use Spark 2 Only Branch > > On Fri, Nov 17, 2017 at 9:46 AM, Ted Yu

Re: makes bundle concept usable?

2017-11-18 Thread Romain Manni-Bucau
@Eugene: "workaround" in the sense that it is specific to each IO and therefore still highlights a gap in the core. Other comments inline 2017-11-19 7:40 GMT+01:00 Robert Bradshaw : > There is a possible fourth issue that we don't handle well: efficiency. For > very large

Re: makes bundle concept usable?

2017-11-18 Thread Robert Bradshaw
There is a possible fourth issue that we don't handle well: efficiency. For very large bundles, it may be advantageous to avoid replaying a bunch of idempotent operations if there were a way to record which ones we're sure went through. Not sure if that's the issue here (though one could possibly
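
[Editor's illustration, not part of the original message.] The idea of recording which idempotent operations are known to have gone through, so that a retried bundle does not replay them, could be sketched roughly as follows. This is a minimal sketch under stated assumptions: `process_bundle`, `apply_op`, and the in-memory `completed` set are hypothetical names, not Beam APIs, and a real runner would need durable storage for the record.

```python
from typing import Any, Callable, Iterable, Set, Tuple


def process_bundle(
    elements: Iterable[Tuple[str, Any]],
    apply_op: Callable[[str, Any], None],
    completed: Set[str],
) -> int:
    """Apply apply_op to each (element_id, payload) pair.

    Elements whose id is already in `completed` are skipped, so a retry of
    the same bundle only replays the operations that did not go through.
    Returns the number of operations applied in this attempt.
    """
    applied = 0
    for element_id, payload in elements:
        if element_id in completed:
            # Already known to have succeeded in an earlier attempt.
            continue
        apply_op(element_id, payload)
        # Record success only after the operation returned without raising.
        completed.add(element_id)
        applied += 1
    return applied
```

If `apply_op` raises partway through, the elements processed before the failure stay in `completed`, and calling `process_bundle` again with the same bundle resumes from the first unrecorded element.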

Re: makes bundle concept usable?

2017-11-18 Thread Eugene Kirpichov
I disagree that the usage of document id in ES is a "workaround" - it does not address any *accidental* complexity coming from shortcomings of Beam, it addresses the *essential* complexity that a distributed system forces one to take it as a fact of
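
[Editor's illustration, not part of the original message.] The document-id approach mentioned here can be sketched as below: if the id is derived deterministically from the record, writing the same record twice (for example after a bundle retry) overwrites rather than duplicates. This is a minimal sketch under stated assumptions: `document_id`, `InMemoryIndex`, and `write_bundle` are hypothetical names, and the in-memory dict only stands in for an index keyed by document id; it is not the Elasticsearch client or the Beam ElasticsearchIO API.

```python
import hashlib
import json
from typing import Dict, Iterable


def document_id(record: dict) -> str:
    """Derive a stable id from the record's content (key order is ignored)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InMemoryIndex:
    """Hypothetical stand-in for a store that upserts by document id."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def upsert(self, doc_id: str, record: dict) -> None:
        # Writing the same id again replaces the document instead of adding one.
        self.docs[doc_id] = record


def write_bundle(index: InMemoryIndex, records: Iterable[dict]) -> None:
    """Write every record under its content-derived id."""
    for record in records:
        index.upsert(document_id(record), record)
```

Calling `write_bundle` twice with the same records leaves the index with one document per distinct record, which is the property that makes a replayed write harmless.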

Re: makes bundle concept usable?

2017-11-18 Thread Romain Manni-Bucau
Eugene, the point (and the issue with a single sample) is that you can always find *workarounds* on a case-by-case basis, such as the id one with ES, but Beam doesn't solve the problem as a framework. From my past, I clearly don't see how batch frameworks solved that for years and Beam is not able to do it - keep

Re: makes bundle concept usable?

2017-11-18 Thread Eugene Kirpichov
After giving this thread my best attempt at understanding exactly what the problem and the proposed solution are, I'm afraid I still fail to understand both. To reiterate, I think the only way to make progress here is to be more concrete: (quote) take some IO that you think could be easier to write

Re: Questions with containerized runners plans?

2017-11-18 Thread Holden Karau
Cool, thanks! It seems like some good follow-ups might exist to simplify things for Python users so they don’t have to roll their own Dockerfiles (like allowing them to provide a requirements.txt which is used in the Dockerfile) :) I’m really excited about the direction with the containerized runners

Re: Questions with containerized runners plans?

2017-11-18 Thread Reuven Lax
On Sat, Nov 18, 2017 at 10:33 PM, Holden Karau wrote: > So I was looking through https://beam.apache.org/contribute/portability/ > which led me to BEAM-2900, and then to > https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit# . > > I

Re: Questions with containerized runners plans?

2017-11-18 Thread Henning Rohde
A benefit of using docker containers is that (nearly) arbitrary native dependencies can be installed in the container image itself by either the user or SDK. For example, the (minimal, in progress) Python container Dockerfile is here:

Questions with containerized runners plans?

2017-11-18 Thread Holden Karau
So I was looking through https://beam.apache.org/contribute/portability/ which led me to BEAM-2900, and then to https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit# . I was wondering if any consideration is being given to native dependencies that user code

Re: makes bundle concept usable?

2017-11-18 Thread Romain Manni-Bucau
First, bundle retry is unusable with some runners like Spark, where the bundle size is the collection size / number of workers. This means a user can't use the bundle API or feature reliably and portably - which is Beam's promise. Aligning chunking and bundles would guarantee that but can be not desired, that