Jenkins build became unstable: beam_Release_NightlySnapshot #175

2016-09-22 Thread Apache Jenkins Server
See

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
Generally this makes sense, though I thought that this is what IOChannelFactory was (also) about, and eventually the runner needs to facilitate the splitting/partitioning of the source, so I was wondering if the source could have a generic mechanism for locality as well. On Thu, Sep 22, 2016 at 6:

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Jesse Anderson
I think the runners should. Each framework has put far more effort into data locality than Beam should. Beam should just take advantage of it. On Thu, Sep 22, 2016, 7:57 AM Amit Sela wrote: > Not where in the file, where in the cluster. > > Like you said - mapper - in MapReduce the mapper instan

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
Not where in the file, where in the cluster. Like you said - mapper - in MapReduce the mapper instance will *prefer* to start on the same machine as the Node hosting it (unless that's changed, I've been out of touch with MR for a while...). And for Spark - https://databricks.gitbooks.io/databrick

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Jesse Anderson
I've only ever seen that being used to figure out which file the runner/mapper/operation is working on. Otherwise, I haven't seen those operations care where in the file they're working. On Thu, Sep 22, 2016 at 5:57 AM Amit Sela wrote: > Wouldn't it force all runners to implement this for all di

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
Wouldn't it force all runners to implement this for all distributed filesystems? It's true that each runner has its own "partitioning" mechanism, but I assume (maybe I'm wrong) that open-source runners use the Hadoop InputFormat/InputSplit for that, and the proper connectors for that to run on t

Re: Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Jean-Baptiste Onofré
Hi Amit, as the purpose is to remove IOChannelFactory, then I would suggest it's a runner concern. The Read.Bounded should "locate" the bundles on an executor close to the read data (even if it's not always possible, depending on the source). My $0.01 Regards JB On 09/22/2016 02:26 PM, Amit

Preferred locations (or data locality) for batch pipelines.

2016-09-22 Thread Amit Sela
It's not new that batch pipelines can optimize for data locality; my question is about where this responsibility lies in Beam. If runners should implement generic Read.Bounded support, should they also implement locating the input blocks? Or should it be a part of IOChannelFactory implementations? Or an