OK, I'll try to clarify:

1) The worker is the logic inside my Mapper and is identical in both cases.
2) I have two cases. In the first one I execute my worker with Hadoop, and
in the second one I execute the same worker without Hadoop (a plain
sequential read of the file).
   For both cases I measured the time the worker needs and the time the
surroundings need, so I have two values per case; see the sketch below.
The worker took the same time in both cases for the same input (as
expected), but the surroundings took up 17 percentage points more of the
total time when using Hadoop (32% vs. 15%).
3) The file is ~3 GB.
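
For reference, in the Hadoop case the measurement is done roughly like the
sketch below (the class and method names here, e.g. Worker and process(),
are placeholders, not my real code): the time accumulated around
worker.process() is the "worker" value, and everything else in the task,
i.e. total task time minus that value, counts as "surroundings".

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TimedMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private Worker worker;         // placeholder for the actual worker logic
  private long workerNanos = 0L; // accumulated time spent inside the worker

  @Override
  protected void setup(Context context) {
    // the "configuration step": initialize the worker's variables
    worker = new Worker(context.getConfiguration());
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    long start = System.nanoTime();
    worker.process(value.toString()); // the worker sees one line (one V) at a time
    workerNanos += System.nanoTime() - start;
  }

  @Override
  protected void cleanup(Context context) {
    // report the worker time; total task time minus this is the "surroundings"
    context.getCounter("timing", "WORKER_MILLIS")
           .increment(workerNanos / 1000000L);
  }
}

The sequential case wraps the same two System.nanoTime() calls around the
same worker, just fed line by line from something like a plain
BufferedReader instead of the Mapper.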

I want to know where this difference comes from and how I can reduce it.
I hope that helps; if not, feel free to ask again :)

Greetings,
MK

P.S. Just for your information, I ran the same test with Hypertable as
well.
I got:
 * worker standalone (no framework): 15% overhead
 * worker with Hadoop: 32% overhead
 * worker with Hypertable: 53% overhead
Remark: overhead is measured as the share of the whole process time that is
not spent in the worker, i.e. overhead = surrounding time / total time.
E.g. in the Hypertable case the surroundings use 53% of the whole process
time, while the worker uses the remaining 47%.

2012/8/13 Bertrand Dechoux <decho...@gmail.com>

> I am not sure I understand, and I guess I am not the only one.
>
> 1) What is a worker in your context? Only the logic inside your Mapper, or
> something else?
> 2) You should clarify your cases. You seem to have two cases, but both are
> given as overhead, so I am assuming there is a baseline? Hadoop vs.
> sequential, where sequential means without Hadoop?
> 3) What is the size of the file?
>
> Bertrand
>
>
> On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke <
> matthias.mk.kri...@gmail.com> wrote:
>
>> Hello all,
>>
>> I'm using CDH3u3.
>> If I process a single file that is set to non-splittable, Hadoop starts
>> one Mapper and no Reducer (that's fine for this test scenario). The
>> Mapper goes through a configuration step where some variables for the
>> worker inside the Mapper are initialized.
>> The Mapper then gives me K,V pairs, where each V is a line of the input
>> file, and I process the V with the worker.
>>
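>> (For completeness, the non-splittable part is done roughly like this;
>> just a sketch, the class name is a placeholder:)
>>
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.mapreduce.JobContext;
>> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>>
>> public class NonSplittableTextInputFormat extends TextInputFormat {
>>     @Override
>>     protected boolean isSplitable(JobContext context, Path file) {
>>         // never split, so the whole file goes to one Mapper
>>         return false;
>>     }
>> }
>>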
>> When I compare the run time with Hadoop to the run time of the same
>> process run sequentially, I get:
>>
>> worker time --> same in both cases
>>
>> case: mapper --> overhead of ~32% compared to the worker process (same
>> for a bigger chunk size)
>> case: sequential --> overhead of ~15% compared to the worker process
>>
>> It shouldn't be that much slower: since the file is non-splittable, the
>> Mapper will be executed where the data is stored by HDFS, won't it?
>> Where do those 17 percentage points go? How can I reduce this? Does
>> Hadoop spend that time reading or streaming the data out of HDFS?
>>
>> I would appreciate your help,
>>
>> Greetings
>> mk
>>
>>
>
>
> --
> Bertrand Dechoux
>
