Re: Slow Group By operator

2013-08-25 Thread Benjamin Jakobus
Hi Cheolsoo, Thanks - let's see, I'll give it a try now. Best Regards, Ben On 25 August 2013 02:27, Cheolsoo Park piaozhe...@gmail.com wrote: Hi Benjamin, Thanks for letting us know. That means my original assumption was wrong. The size of bags is not small. In fact, you can compute the

Re: Slow Group By operator

2013-08-25 Thread Benjamin Jakobus
Hi Cheolsoo, Just ran the benchmarks: no luck. No combiner + mapPartAgg set to true is slower than without the combiner: real 752.85, real 757.41, real 749.03. On 25 August 2013 17:11, Benjamin Jakobus jakobusbe...@gmail.com wrote: Hi Cheolsoo, Thanks - let's see, I'll give it a try now.

Re: Slow Group By operator

2013-08-25 Thread Benjamin Jakobus
combiner + mapPartAgg set to true - yup! On 25 August 2013 18:57, Cheolsoo Park piaozhe...@gmail.com wrote: I guess you mean combiner + mapPartAgg set to true, not no combiner + mapPartAgg set to true. On Sun, Aug 25, 2013 at 10:10 AM, Benjamin Jakobus jakobusbe...@gmail.com wrote: Hi

Re: Slow Group By operator

2013-08-25 Thread Cheolsoo Park
I have no more suggestions. If you find anything, please share it with us. I would be interested in understanding what you're seeing. On Sun, Aug 25, 2013 at 11:14 AM, Benjamin Jakobus jakobusbe...@gmail.com wrote: combiner + mapPartAgg set to true - yup! On 25 August 2013 18:57, Cheolsoo Park

Re: Slow Group By operator

2013-08-25 Thread Benjamin Jakobus
Thanks. Will do! On 25 August 2013 20:31, Cheolsoo Park piaozhe...@gmail.com wrote: I have no more suggestions. If you find anything, please share it with us. I would be interested in understanding what you're seeing. On Sun, Aug 25, 2013 at 11:14 AM, Benjamin Jakobus

Re: Slow Group By operator

2013-08-24 Thread Benjamin Jakobus
Ah, I see. Thank you for the explanation and for taking the time! Makes sense. On 22 August 2013 16:38, Alan Gates ga...@hortonworks.com wrote: When data comes out of a map task, Hadoop serializes it so that it can know its exact size as it writes it into the output buffer. To run it through the

Re: Slow Group By operator

2013-08-24 Thread Benjamin Jakobus
Hi Alan, Cheolsoo, I re-ran the benchmarks with and without the combiner. Enabling the combiner is faster. With combiner: real 668.44, real 663.10, real 665.05. Without combiner: real 795.97, real 810.51, real 810.16. Best Regards, Ben On 22 August 2013 16:33, Cheolsoo Park piaozhe...@gmail.com
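To put the two sets of timings above side by side, here is a minimal Python sketch that averages each set of three runs and reports the relative difference. It assumes the `real` figures are wall-clock seconds as reported by `time`; the variable names are illustrative only and do not come from the thread.

```python
from statistics import mean

# Wall-clock "real" timings quoted in the message (assumed to be seconds).
with_combiner = [668.44, 663.10, 665.05]
without_combiner = [795.97, 810.51, 810.16]

mean_with = mean(with_combiner)
mean_without = mean(without_combiner)

# Ratio > 1 means the run without the combiner took longer on average.
slowdown_without = mean_without / mean_with
# Fraction of wall-clock time saved by enabling the combiner.
time_saved = 1 - mean_with / mean_without

print(f"mean with combiner:    {mean_with:.2f}")
print(f"mean without combiner: {mean_without:.2f}")
print(f"slowdown without combiner: {slowdown_without:.2f}x")
print(f"time saved with combiner:  {time_saved:.1%}")
```

With only three runs per configuration this is a rough comparison, not a statistical test.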

Re: Slow Group By operator

2013-08-24 Thread Cheolsoo Park
Hi Benjamin, Thanks for letting us know. That means my original assumption was wrong. The size of bags is not small. In fact, you can compute the avg size of bags as follows: total number of input records / ( reduce input groups x number of reducers ). One more thing you can try is turning on
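The average-bag-size formula in the message above can be sketched in Python as follows. The formula is taken exactly as written in the thread; the function name, the parameter names and the example figures are hypothetical, and whether "reduce input groups" is meant per reducer or as a job-wide counter is an assumption you should check against your own job counters.

```python
def average_bag_size(total_input_records, reduce_input_groups, num_reducers):
    """Average bag size, using the formula as stated in the thread:
    total input records / (reduce input groups x number of reducers).
    """
    if reduce_input_groups <= 0 or num_reducers <= 0:
        raise ValueError("group and reducer counts must be positive")
    return total_input_records / (reduce_input_groups * num_reducers)


# Hypothetical counter values, for illustration only (not from the thread).
example = average_bag_size(
    total_input_records=30_000_000,
    reduce_input_groups=1_000,
    num_reducers=10,
)
print(example)
```

A large result suggests that each group holds many records, which is the case where combining on the map side has the most to gain.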

Re: Slow Group By operator

2013-08-22 Thread Benjamin Jakobus
Hi Cheolsoo, Thanks - I will try this now and get back to you. Out of interest, could you explain (or point me towards resources that would) why the combiner would be a problem? Also, could the fact that Pig builds an intermediary data structure (?) whilst Hive just performs a sort then the

Re: Slow Group By operator

2013-08-22 Thread Cheolsoo Park
Hi Benjamin, To answer your question, the Hadoop combiner works as follows: 1) mappers write their outputs to disk, and 2) combiners read them, combine them, and write them again. So you're paying for extra disk I/O as well as serialization/deserialization. This will pay off if combiners significantly reduce the

Re: Slow Group By operator

2013-08-22 Thread Alan Gates
When data comes out of a map task, Hadoop serializes it so that it can know its exact size as it writes it into the output buffer. To run it through the combiner it needs to deserialize it again, and then re-serialize it when it comes out. So each pass through the combiner costs a

Re: Slow Group By operator

2013-08-21 Thread Benjamin Jakobus
Hi Cheolsoo, "What's your query like? Can you share it? Do you call any algebraic UDF after group by? I am wondering whether combiner matters in your test." I have been running 3 different types of queries. The first was performed on datasets of 6 different sizes: - Dataset size 1: 30,000

Re: Slow Group By operator

2013-08-21 Thread Cheolsoo Park
Hi Benjamin, Thank you very much for sharing detailed information! 1) From the runtime numbers that you provided, the mappers are very slow.
CPU time spent (ms)  5,081,610  168,740  5,250,350
CPU time spent (ms)  5,052,700  178,220  5,230,920
CPU time spent (ms)  5,084,430  193,480  5,277,910
2) In your GROUP BY
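The observation that the mappers dominate the runtime can be checked with a short Python sketch over the three counter rows quoted above. It assumes the three values in each row are the map, reduce and total CPU times, in that order; the thread does not label the columns, but in every row the first two values add up to the third, which is consistent with that reading.

```python
# (map_ms, reduce_ms, total_ms) per run; the column meaning is an assumption.
cpu_time_rows = [
    (5_081_610, 168_740, 5_250_350),
    (5_052_700, 178_220, 5_230_920),
    (5_084_430, 193_480, 5_277_910),
]

map_shares = []
for map_ms, reduce_ms, total_ms in cpu_time_rows:
    # Sanity check: the first two columns should add up to the third.
    assert map_ms + reduce_ms == total_ms
    map_shares.append(map_ms / total_ms)

for share in map_shares:
    print(f"map share of CPU time: {share:.1%}")
```

Under that assumption, well over nine tenths of the CPU time in each run is spent in the map phase.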