Ok, I'll try to clarify:
1) The worker is the logic inside my Mapper, and it is the same for both cases (see the sketch below).
2) I have two cases. In the first, I use Hadoop to execute my worker; in the second, I execute my worker without Hadoop (a simple read of the file). For both cases I measured the time the worker and the surroundings need (so I have two values for each case). The worker took the same time in both cases for the same input (as expected). But the surroundings took 17% more time when using Hadoop.
3) ~3 GB.
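To make 1) a bit more concrete, here is roughly how the worker sits inside my Mapper (a simplified sketch against the 0.20 mapreduce API that ships with CDH3; the Worker class and its methods are placeholders, not the actual code):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WorkerMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

        /** Placeholder for the actual processing logic (identical in both test cases). */
        static class Worker {
            Worker(Configuration conf) { /* initialize the worker's variables */ }
            String process(String line) { return line; /* real logic goes here */ }
        }

        private Worker worker;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // configuration step: initialize the worker once per Mapper
            worker = new Worker(context.getConfiguration());
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // each value is one line of the input file; only the worker call is
            // counted as "worker time", everything around it is "surroundings"
            String result = worker.process(line.toString());
            context.write(NullWritable.get(), new Text(result));
        }
    }

In the sequential case, the same Worker is called directly in a loop over the lines of the file, without any of the Hadoop machinery around it.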
I want to know how to reduce this difference and where it comes from. I hope that helps? If not, feel free to ask again :)

Greetings,
MK

P.S. Just for your information, I ran the same test with Hypertable as well. I got:
* worker without anything: 15% overhead
* worker with Hadoop: 32% overhead
* worker with Hypertable: 53% overhead
Remark: overhead was measured relative to the worker, e.g. Hypertable uses 53% of the whole process time, while the worker uses 47%.

2012/8/13 Bertrand Dechoux <decho...@gmail.com>

> I am not sure I understand, and I guess I am not the only one.
>
> 1) What is a worker in your context? Only the logic inside your Mapper, or
> something else?
> 2) You should clarify your cases. You seem to have two cases, but both are
> given as overhead, so I am assuming there is a baseline? Hadoop vs. sequential,
> so sequential is not Hadoop?
> 3) What is the size of the file?
>
> Bertrand
>
>
> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
> matthias.mk.kri...@gmail.com> wrote:
>
>> Hello all,
>>
>> I'm using CDH3u3.
>> If I process one file that is set to non-splittable, Hadoop starts one
>> Mapper and no Reducer (that's fine for this test scenario). The Mapper
>> goes through a configuration step where some variables for the worker
>> inside the Mapper are initialized.
>> The Mapper then gives me K,V pairs, where each value is a line of the
>> input file. I process the V with the worker.
>>
>> When I compare the run time of Hadoop to the run time of the same process
>> run sequentially, I get:
>>
>> worker time --> same in both cases
>>
>> case: mapper --> overhead of ~32% relative to the worker process (same for
>> bigger chunk sizes)
>> case: sequential --> overhead of ~15% relative to the worker process
>>
>> It shouldn't be that much slower: because the file is non-splittable, the
>> Mapper should be executed where the data is stored by HDFS, shouldn't it?
>> Where did those 17% go? How can I reduce this? Does Hadoop need the whole
>> time for reading or streaming the data out of HDFS?
>>
>> I would appreciate your help,
>>
>> Greetings
>> mk
>>
>>
>
>
> --
> Bertrand Dechoux
>
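For completeness, the "non-splittable" setup described in the quoted mail boils down to an input format along these lines (again only a simplified sketch against the 0.20 mapreduce API; the class name is a placeholder):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // One InputSplit per file: the whole file is read line by line by exactly
    // one Mapper, which should be scheduled near the data stored in HDFS.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

The driver then only needs job.setInputFormatClass(NonSplittableTextInputFormat.class) and job.setNumReduceTasks(0) to get the one-Mapper, no-Reducer job described above.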