Re: streamed splitting

2015-03-13 Thread Siddharth Seth
Johannes, Couple of questions - do you happen to know why the splits are actually taking 21 minutes to generate? Is the namenode overloaded, or is it just the large number of files ? Is the input format (assuming you're using an inputFormat for splits) going beyond analyzing block boundaries to

RE: streamed splitting

2015-03-12 Thread Bikas Saha
That's not it. Please open a new one. Thanks! -Original Message- From: Johannes Zillmann [mailto:jzillm...@googlemail.com] Sent: Thursday, March 12, 2015 1:14 PM To: user@tez.apache.org Subject: Re: streamed splitting So.. it's complex ;) Regarding the JIRA, the closest thing I fou

Re: streamed splitting

2015-03-12 Thread Johannes Zillmann
So.. it's complex ;) Regarding the JIRA, the closest thing I found is https://issues.apache.org/jira/browse/TEZ-1166 Should I add to this or create a new one? Johannes > On 12 Mar 2015, at 15:44, Hitesh Shah wrote: > > Hello Johannes, > > This is something we have discussed quite often but have

Re: streamed splitting

2015-03-12 Thread Hitesh Shah
Hello Johannes, This is something we have discussed quite often but have not got around to implementing. There might be an open JIRA related to “pipelining” of splits. If you cannot find it, please go ahead and create one. The general issues with these are: - how to handle dynamic crea

Re: streamed splitting

2015-03-12 Thread Johannes Zillmann
Hey Jeff, so one scenario I recently encountered was a job on about 300,000 files in HDFS. The splitting alone took 21 minutes. So I thought that, by the time the splitting has fully completed, a lot of splits could already have been processed… Thanks for your answer! Johannes > On 12 Mar 2015, at

Re: streamed splitting

2015-03-12 Thread Jianfeng (Jeff) Zhang
Hi Johannes, If the input initializer is not done, workers cannot be started. What's your scenario? Why do you want to start the workers before the splits are generated? Just to save launch time, or to let the workers do other stuff? Best Regards, Jeff Zhang On 3/12/15, 5:38 PM, "Johannes Z

streamed splitting

2015-03-12 Thread Johannes Zillmann
Hey guys, dumb question. With Tez, can I have an input initializer which doesn't require creating every split before starting the processing of already created splits? That is, if I have a lot of splits and my splitting process takes a long time, can the workers start working already while still doi
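
For readers skimming the thread, the idea being asked about (workers consuming splits while the splitter is still producing them) can be sketched roughly as a producer/consumer pipeline. This is only an illustration in plain Python and does not use any Tez API. The names `generate_splits`, `worker`, and `run_pipelined` are hypothetical, and "processing" a split is faked by upper-casing its path.

```python
import queue
import threading

_DONE = object()  # sentinel telling a worker that no more splits will arrive


def generate_splits(paths, out_queue, num_workers):
    """Hypothetical splitter: hands over each split as soon as it is ready."""
    for path in paths:
        out_queue.put(path)  # blocks if the bounded queue is full
    for _ in range(num_workers):
        out_queue.put(_DONE)  # one sentinel per worker


def worker(in_queue, results, lock):
    """Processes splits as they arrive instead of waiting for the full list."""
    while True:
        split = in_queue.get()
        if split is _DONE:
            break
        processed = split.upper()  # stand-in for real work on the split
        with lock:
            results.append(processed)


def run_pipelined(paths, num_workers=4):
    splits = queue.Queue(maxsize=16)  # bounded, so the splitter cannot run far ahead
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(splits, results, lock))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()  # workers start before any split exists
    generate_splits(paths, splits, num_workers)
    for w in workers:
        w.join()
    return sorted(results)
```

In this sketch the workers are started before splitting begins, which is exactly what the replies in this thread say Tez did not support at the time, since workers could not start until the input initializer had finished.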