Hi all, I am trying to understand the difference between how Pig implements the Group By operator and how Hive does it. My hypothesis is that Pig, being a procedural and lazy language and hence creates a aliases for each "stage" in the execution.
Background: After benchmarking Hive and Pig, I found that the Group By operator in Pig is drastically slower that Hive's. I was wondering whether anybody has experienced the same? And whether people may have any tips for improving the performance of this operation? (Adding a DISTINCT as suggested by an earlier post on here doesn't help. I am currently re-running the benchmark with LZO compression enabled). For the given Group By, Pig launches 99 mappers, Hive only 8. I have been running 3 different types of queries. The first was performed on datasets of 6 different sizes: - Dataset size 1: 30,000 records (772KB) - Dataset size 2: 300,000 records (6.4MB) - Dataset size 3: 3,000,000 records (63MB) - Dataset size 4: 30 million records (628MB) - Dataset size 5: 300 million records (6.2GB) - Dataset size 6: 3 billion records (62GB) The datasets scale linearly, whereby the size equates to 3000 * 10n . A seventh dataset consisting of 1,000 records (23KB) was produced to perform join operations on. Its schema is as follows: name - string marks - integer gpa - float The data was generated using the generate data.pl perl script available for download from https://issues.apache.org/jira/browse/PIG-200 to produce the datasets. The results are as follows: * * * * * * *Set 1 * *Set 2** * *Set 3** * *Set 4** * *Set 5** * *Set 6* *Arithmetic** * 32.82* * 36.21* * 49.49* * 83.25* * 423.63* * 3900.78 *Filter 10%** * 32.94* * 34.32* * 44.56* * 66.68* * 295.59* * 2640.52 *Filter 90%** * 33.93* * 32.55* * 37.86* * 53.22* * 197.36* * 1657.37 *Group** * * *49.43* * 53.34* * 69.84* * 105.12* *497.61* * 4394.21 *Join** * * * 49.89* * 50.08* * 78.55* * 150.39* *1045.34* *10258.19 *Averaged performance of arithmetic, join, group, order, distinct select and filter operations on six datasets using Pig. Scripts were configured as to use 8 reduce and 11 map tasks.* * * * Set 1** * *Set 2** * *Set 3** * *Set 4** * *Set 5** * *Set 6* *Arithmetic** * 32.84* * 37.33* * 72.55* * 300.08 2633.72 27821.19 *Filter 10% * 32.36* * 53.28* * 59.22* * 209.5* * 1672.3* *18222.19 *Filter 90% * 31.23* * 32.68* * 36.8* * 69.55* * 331.88* *3320.59 *Group * * * 48.27* * 47.68* * 46.87* * 53.66* *141.36* *1233.4 *Join * * * * *48.54* *56.86* * 104.6* * 517.5* * 4388.34* * - *Distinct** * * *48.73* *53.28* * 72.54* * 109.77* * - * * * * - *Averaged performance of arithmetic, join, group, distinct select and filter operations on six datasets using Hive. Scripts were configured as to use 8 reduce and 11 map tasks.* (If you want to see the standard deviation, let me know). So, to summarize the results: Pig outperforms Hive, with the exception of using *Group By*. The Pig scripts used for this benchmark are as follows: *Arithmetic* -- Generate with basic arithmetic A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers; B = foreach A generate age * gpa + 3, age/gpa - 1.5 PARALLEL $reducers; store B into '$output/dataset_300000000_projection' using PigStorage() PARALLEL $reducers; * * *Filter 10%* -- Filter that removes 10% of data A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers; B = filter A by gpa < '3.6' PARALLEL $reducers; store B into '$output/dataset_300000000_filter_10' using PigStorage() PARALLEL $reducers; *Filter 90%* -- Filter that removes 90% of data A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers; B = filter A by age < '25' PARALLEL $reducers; store B into '$output/dataset_300000000_filter_90' using PigStorage() PARALLEL $reducers; * * *Group* A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers; B = group A by name PARALLEL $reducers; C = foreach B generate flatten(group), COUNT(A.age) PARALLEL $reducers; store C into '$output/dataset_300000000_group' using PigStorage() PARALLEL $reducers; * * *Join* A = load '$input/dataset_300000000' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers; B = load '$input/dataset_join' using PigStorage('\t') as (name, age, gpa) PARALLEL $reducers; C = cogroup A by name inner, B by name inner PARALLEL $reducers; D = foreach C generate flatten(A), flatten(B) PARALLEL $reducers; store D into '$output/dataset_300000000_cogroup_big' using PigStorage() PARALLEL $reducers; Similarly, here the Hive scripts: *Arithmetic* SELECT (dataset.age * dataset.gpa + 3) AS F1, (dataset.age/dataset.gpa - 1.5) AS F2 FROM dataset WHERE dataset.gpa > 0; *Filter 10%* SELECT * FROM dataset WHERE dataset.gpa < 3.6; *Filter 90%* SELECT * FROM dataset WHERE dataset.age < 25; *Group* SELECT COUNT(dataset.age) FROM dataset GROUP BY dataset.name; *Join* SELECT * FROM dataset JOIN dataset_join ON dataset.name = dataset_join.name; I will re-run the benchmarks to see whether it is the reduce or map side that is slower and get back to you later today. The other two benchmarks were slightly different: I performed transitive self joins in which Pig outperformed Hive. However once I added a Group By, Hive began outperforming Pig. I also ran the TPC-H benchmarks and noticed that Hive (surprisingly) outperformed Pig. However what *seems* to cause the actual performance difference is the heavy usage of the Group By operator in all but 3 TPC-H test scripts. Re-running the scripts whilst omitting the the grouping of data produces the expected results. For example, running script 3 (q3_shipping_priority.pig) whilst omitting the Group By operator significantly reduces the runtime (to 1278.49 seconds real time runtime or a total of 12,257,630ms CPU time). The fact that the Group By operator skews the TPC-H benchmark in favour of Apache Hive is supported by further experiments: as noted earlier a benchmark was carried out on a transitive self-join. The former took Pig an average of 45.36 seconds (real time runtime) to execute; it took Hive 56.73 seconds. The latter took Pig 157.97 and Hive 180.19 seconds (again, on average). However adding the Group By operator to the scripts turned the tides: Pig is now significantly slower than Hive, requiring an average of 278.15 seconds. Hive on the other hand required only 204.01 to perform the JOIN and GROUP operations. Real time runtime is measured using the time -p command. Disabling the combiner made things even slower. Regards, Ben
