Hi Cheolsoo,
Thanks - let's see, I'll give it a try now.
Best Regards,
Ben
On 25 August 2013 02:27, Cheolsoo Park piaozhe...@gmail.com wrote:
Hi Benjamin,
Thanks for letting us know. That means my original assumption was wrong.
The size of bags is not small. In fact, you can compute the
Hi Cheolsoo,
Just ran the benchmarks: no luck.
No combiner + mapPartAgg set to true is slower than without the combiner:
real 752.85
real 757.41
real 749.03
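For context, the in-map aggregation switch being toggled here is a per-script Pig
property. A minimal sketch of how it can be set, assuming Pig 0.10+ where
pig.exec.mapPartAgg is available (the minReduction line is optional and its value
below is only illustrative):

    set pig.exec.mapPartAgg true;             -- enable in-map (partial) aggregation
    set pig.exec.mapPartAgg.minReduction 10;  -- illustrative: only keep partial agg if it shrinks records ~10x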
On 25 August 2013 17:11, Benjamin Jakobus jakobusbe...@gmail.com wrote:
Hi Cheolsoo,
Thanks - let's see, I'll give it a try now.
combiner + mapPartAgg set to true - yup!
On 25 August 2013 18:57, Cheolsoo Park piaozhe...@gmail.com wrote:
I guess you mean combiner + mapPartAgg set to true not no combiner +
mapPartAgg set to true.
On Sun, Aug 25, 2013 at 10:10 AM, Benjamin Jakobus jakobusbe...@gmail.com wrote:
Hi
I have no more suggestion. If you find anything, please share with us. I
would be interested in understanding what you're seeing.
On Sun, Aug 25, 2013 at 11:14 AM, Benjamin Jakobus jakobusbe...@gmail.com wrote:
combiner + mapPartAgg set to true - yup!
On 25 August 2013 18:57, Cheolsoo Park piaozhe...@gmail.com wrote:
Thanks. Will do!
On 25 August 2013 20:31, Cheolsoo Park piaozhe...@gmail.com wrote:
I have no more suggestion. If you find anything, please share with us. I
would be interested in understanding what you're seeing.
On Sun, Aug 25, 2013 at 11:14 AM, Benjamin Jakobus jakobusbe...@gmail.com wrote:
Ah, I see. Thank you for the explanation and for taking the time! Makes sense.
On 22 August 2013 16:38, Alan Gates ga...@hortonworks.com wrote:
When data comes out of a map task, Hadoop serializes it so that it can
know its exact size as it writes it into the output buffer. To run it
through the
Hi Alan, Cheolsoo,
I re-ran the benchmarks with and without the combiner. Enabling the
combiner is faster:
With combiner:
real 668.44
real 663.10
real 665.05
Without combiner:
real 795.97
real 810.51
real 810.16
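For reference, the combiner can be disabled per script via Pig's
pig.exec.nocombiner property; a sketch (the exact invocation for these runs may
have differed):

    set pig.exec.nocombiner true;   -- turn the combiner off for this script
    -- or on the command line: pig -Dpig.exec.nocombiner=true myscript.pig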
Best Regards,
Ben
On 22 August 2013 16:33, Cheolsoo Park piaozhe...@gmail.com wrote:
Hi Benjamin,
Thanks for letting us know. That means my original assumption was wrong.
The size of bags is not small. In fact, you can compute the avg size of
bags as follows: total number of input records / ( reduce input groups x
number of reducers ).
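For example, with purely illustrative numbers: 30,000,000 input records,
10,000 reduce input groups and 3 reducers would give an average bag size of
30,000,000 / (10,000 x 3) = 1,000 records per bag.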
One more thing you can try is turning on mapPartAgg (in-map partial aggregation).
Hi Cheolsoo,
Thanks - I will try this now and get back to you.
Out of interest, could you explain (or point me towards resources that
would) why the combiner would be a problem?
Also, could the fact that Pig builds an intermediary data structure (?)
whilst Hive just performs a sort then the ...
Hi Benjamin,
To answer your question: the way the Hadoop combiner works is that 1) mappers
write their outputs to disk, and 2) combiners read them back, combine them, and
write them out again. So you're paying extra disk I/O as well as
serialization/deserialization.
This will pay off if combiners significantly reduce the amount of data that has
to be shuffled to the reducers.
When data comes out of a map task, Hadoop serializes it so that it can know its
exact size as it writes it into the output buffer. To run it through the
combiner it needs to deserialize it again, and then re-serialize it when it
comes out. So each pass through the combiner costs an extra deserialization and
re-serialization.
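(To make the trade-off concrete: in a typical Pig aggregation such as the sketch
below, COUNT and SUM are algebraic, so Pig can evaluate them partially in the
combiner; whether the extra serialize/deserialize pass pays off depends on how
much the per-map data shrinks. File, alias and field names here are made up for
illustration.)

    raw     = LOAD 'events.tsv' AS (key:chararray, val:int);  -- hypothetical input
    grouped = GROUP raw BY key;
    -- COUNT and SUM are algebraic, so Pig can apply them partially in the combiner
    agg     = FOREACH grouped GENERATE group, COUNT(raw) AS cnt, SUM(raw.val) AS total;
    STORE agg INTO 'events_by_key';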
Hi Cheolsoo,
What's your query like? Can you share it? Do you call any algebraic UDF
after group by? I am wondering whether combiner matters in your test.
I have been running 3 different types of queries.
The first was performed on datasets of 6 different sizes:
- Dataset size 1: 30,000
Hi Benjamin,
Thank you very much for sharing detailed information!
1) From the runtime numbers that you provided, the mappers are very slow.
CPU time spent (ms): map 5,081,610 / reduce 168,740 / total 5,250,350
CPU time spent (ms): map 5,052,700 / reduce 178,220 / total 5,230,920
CPU time spent (ms): map 5,084,430 / reduce 193,480 / total 5,277,910
2) In your GROUP BY