On Tue, Jun 23, 2009 at 1:36 PM, Alan Gates (JIRA)<j...@apache.org> wrote:
>
> [ https://issues.apache.org/jira/browse/HIVE-396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723199#action_12723199 ]
>
> Alan Gates commented on HIVE-396:
> ---------------------------------
>
> Comments on how to speed up the Pig Latin scripts used in this benchmark.
>
> grep_select.pig:
>
> Adding types in the LOAD statement will force Pig to cast the key field, even
> though it doesn't need to (it only reads and writes the key field). So I'd
> change the query to be:
>
> {code}
> rmf output/PIG_bench/grep_select;
> a = load '/data/grep/*' using PigStorage as (key,field);
> b = filter a by field matches '.*XYZ.*';
> store b into 'output/PIG_bench/grep_select';
> {code}
>
> field will still be cast to a chararray for the matches, but we won't waste
> time casting key and then turning it back into bytes for the store.
>
> rankings_select.pig:
>
> Same comment: remove the casts. pagerank will still be properly cast to an
> integer for the comparison.
>
> {code}
> rmf output/PIG_bench/rankings_select;
> a = load '/data/rankings/*' using PigStorage('|') as (pagerank,pageurl,aveduration);
> b = filter a by pagerank > 10;
> store b into 'output/PIG_bench/rankings_select';
> {code}
>
> rankings_uservisits_join.pig:
>
> Here you want to keep the cast of pagerank so that it is handled as the right
> type, since AVG can take either double or int and would default to double.
> adRevenue will default to double in SUM when you don't specify a type.
>
> You want to project out all unneeded columns as soon as possible.
>
> You should set PARALLEL on the join to use the number of reducers appropriate
> for your cluster. Given that you have 10 machines and 5 reduce slots per
> machine, and speculative execution is off, you probably want 50 reducers.
> (I'm assuming here that when you say you have a 10 node cluster you mean 10
> data nodes, not counting your name node and job tracker. The reduce formula
> should be 5 * number of data nodes.)
>
> I notice you set parallel to 60 on the group by. That will give you 10
> trailing reducers. Unless you have a need for the result to be split 60 ways
> you should reduce that to 50 as well.
>
> A last question is how large are the uservisits and rankings data sets? If
> either is < 80M or so you can use the fragment/replicate join, which is much
> faster than the general join. The following script assumes that isn't the
> case; but if it is let me know and I can show you the syntax for it.
>
> So the end query looks like:
>
> {code}
> rmf output/PIG_bench/html_join;
> a = load '/data/uservisits/*' using PigStorage('|') as (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
> b = load '/data/rankings/*' using PigStorage('|') as (pagerank:int,pageurl,aveduration);
> c = filter a by visitDate > '1999-01-01' AND visitDate < '2000-01-01';
> c1 = foreach c generate sourceIP, destURL, adRevenue;
> b1 = foreach b generate pagerank, pageurl;
> d = JOIN c1 by destURL, b1 by pageurl parallel 50;
> d1 = foreach d generate sourceIP, pagerank, adRevenue;
> e = group d1 by sourceIP parallel 50;
> f = FOREACH e GENERATE group, AVG(d1.pagerank), SUM(d1.adRevenue);
> store f into 'output/PIG_bench/html_join';
> {code}
>
> uservisits_aggre.pig:
>
> Same comments as above on projecting out as early as possible and on setting
> parallel appropriately for your cluster.
>
> {code}
> rmf output/PIG_bench/uservisits_aggre;
> a = load '/data/uservisits/*' using PigStorage('|') as (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
> a1 = foreach a generate sourceIP, adRevenue;
> b = group a1 by sourceIP parallel 50;
> c = FOREACH b GENERATE group, SUM(a1.adRevenue);
> store c into 'output/PIG_bench/uservisits_aggre';
> {code}
>
>
>> Hive performance benchmarks
>> ---------------------------
>>
>>                 Key: HIVE-396
>>                 URL: https://issues.apache.org/jira/browse/HIVE-396
>>             Project: Hadoop Hive
>>          Issue Type: New Feature
>>            Reporter: Zheng Shao
>>         Attachments: hive_benchmark_2009-06-18.pdf, hive_benchmark_2009-06-18.tar.gz
>>
>>
>> We need some performance benchmark to measure and track the performance
>> improvements of Hive.
>> Some references:
>> PIG performance benchmarks PIG-200
>> PigMix: http://wiki.apache.org/pig/PigMix
>
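The reducer arithmetic in the quoted comment can be checked with a few lines of Python. This is only a rough sketch: it assumes every reduce slot is free for this job and that all reducers take about the same time. The helper name reducer_plan is made up for illustration and is not part of Pig or Hadoop.

```python
import math

def reducer_plan(data_nodes, slots_per_node, requested_reducers):
    """Return (total reduce slots, number of reduce waves, trailing reducers).

    Illustrative helper only; the model ignores speculative execution
    and assumes the whole cluster is available to one job.
    """
    total_slots = data_nodes * slots_per_node
    # Reducers run in waves of at most total_slots at a time.
    waves = math.ceil(requested_reducers / total_slots)
    # Reducers left over after the full waves run in a mostly idle last wave.
    trailing = requested_reducers % total_slots
    return total_slots, waves, trailing

# 10 data nodes with 5 reduce slots each, PARALLEL 60 as in the original script.
print(reducer_plan(10, 5, 60))
# Same cluster with PARALLEL 50 as suggested in the comment.
print(reducer_plan(10, 5, 50))
```

With 60 reducers on 50 slots the job needs a second wave that keeps only 10 slots busy, which is the "10 trailing reducers" in the comment. With 50 reducers everything fits in one wave.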
We should also benchmark against CloudBase. They announce releases on the Hadoop list, and I believe they are open source. It would be nice to see an open source system vs a closed source one.