On Tue, Jun 23, 2009 at 1:36 PM, Alan Gates (JIRA)<j...@apache.org> wrote:
>
>    [ https://issues.apache.org/jira/browse/HIVE-396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723199#action_12723199 ]
>
> Alan Gates commented on HIVE-396:
> ---------------------------------
>
> Comments on how to speed up the Pig Latin scripts used in this benchmark.
>
> grep_select.pig:
>
> Adding types in the LOAD statement will force Pig to cast the key field, even 
> though it doesn't need to (it only reads and writes the key field).  So I'd 
> change the query to be:
>
> {code}
> rmf output/PIG_bench/grep_select;
> a = load '/data/grep/*' using PigStorage as (key,field);
> b = filter a by field matches '.*XYZ.*';
> store b into 'output/PIG_bench/grep_select';
> {code}
>
> field will still be cast to a chararray for the matches, but we won't waste 
> time casting key and then turning it back into bytes for the store.
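
(For contrast, the typed load being advised against would look roughly like
the sketch below; the :chararray annotations are illustrative assumptions,
not taken from the original benchmark script.)

{code}
-- hypothetical typed variant: declaring types forces key to be cast to
-- chararray and then back to bytes on the store, even though key is never
-- used in the filter
a = load '/data/grep/*' using PigStorage as (key:chararray, field:chararray);
b = filter a by field matches '.*XYZ.*';
store b into 'output/PIG_bench/grep_select';
{code}
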
>
> rankings_select.pig:
>
> Same comment, remove the casts.  pagerank will be properly cast to an integer.
>
> {code}
> rmf output/PIG_bench/rankings_select;
> a = load '/data/rankings/*' using PigStorage('|') as (pagerank,pageurl,aveduration);
> b = filter a by pagerank > 10;
> store b into 'output/PIG_bench/rankings_select';
> {code}
>
> rankings_uservisits_join.pig:
>
> Here you want to keep the cast of pagerank so that it is handled as the right 
> type, since AVG can take either double or int and would default to double.  
> adRevenue will default to double in SUM when you don't specify a type.
>
> You want to project out all unneeded columns as soon as possible.
>
> You should set PARALLEL on the join to use the number of reducers appropriate 
> for your cluster.  Given that you have 10 machines and 5 reduce slots per 
> machine, and speculative execution is off, you probably want 50 reducers.  
> (I'm assuming here that when you say you have a 10 node cluster you mean 10 data 
> nodes, not counting your name node and job tracker.  The reduce formula 
> should be 5 * number of data nodes.)
>
> I notice you set parallel to 60 on the group by.  That will give you 10 
> trailing reducers.  Unless you have a need for the result to be split 60 ways, 
> you should reduce that to 50 as well.
>
> A last question: how large are the uservisits and rankings data sets?  If 
> either is < 80M or so, you can use the fragment/replicate join, which is much 
> faster than the general join.  The following script assumes that isn't the 
> case, but if it is, let me know and I can show you the syntax for it.
>
> So the end query looks like:
>
> {code}
> rmf output/PIG_bench/html_join;
> a = load '/data/uservisits/*' using PigStorage('|') as
>     (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
> b = load '/data/rankings/*' using PigStorage('|') as (pagerank:int,pageurl,aveduration);
> c = filter a by visitDate > '1999-01-01' AND visitDate < '2000-01-01';
> c1 = foreach c generate sourceIP, destURL, adRevenue;
> b1 = foreach b generate pagerank, pageurl;
> d = JOIN c1 by destURL, b1 by pageurl parallel 50;
> d1 = foreach d generate sourceIP, pagerank, adRevenue;
> e = group d1 by sourceIP parallel 50;
> f = FOREACH e GENERATE group, AVG(d1.pagerank), SUM(d1.adRevenue);
> store f into 'output/PIG_bench/html_join';
> {code}
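
(If the rankings data does turn out to be small enough to fit in memory, my
understanding is that the join above could be switched to Pig's
fragment/replicate join. A minimal sketch, keeping the aliases from the
script above and assuming b1 is the small side:)

{code}
-- replicated join: the last-listed relation (b1) is loaded into memory on
-- each map task, so the join runs map-side and needs no parallel clause
d = join c1 by destURL, b1 by pageurl using 'replicated';
{code}
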
>
> uservisits_aggre.pig:
>
> Same comments as above on projecting out as early as possible and on setting 
> parallel appropriately for your cluster.
>
> {code}
> rmf output/PIG_bench/uservisits_aggre;
> a = load '/data/uservisits/*' using PigStorage('|') as
>     (sourceIP,destURL,visitDate,adRevenue,userAgent,countryCode,languageCode,searchWord,duration);
> a1 = foreach a generate sourceIP, adRevenue;
> b = group a1 by sourceIP parallel 50;
> c = FOREACH b GENERATE group, SUM(a1.adRevenue);
> store c into 'output/PIG_bench/uservisits_aggre';
> {code}
>
>
>> Hive performance benchmarks
>> ---------------------------
>>
>>                 Key: HIVE-396
>>                 URL: https://issues.apache.org/jira/browse/HIVE-396
>>             Project: Hadoop Hive
>>          Issue Type: New Feature
>>            Reporter: Zheng Shao
>>         Attachments: hive_benchmark_2009-06-18.pdf, 
>> hive_benchmark_2009-06-18.tar.gz
>>
>>
>> We need some performance benchmark to measure and track the performance 
>> improvements of Hive.
>> Some references:
>> PIG performance benchmarks PIG-200
>> PigMix: http://wiki.apache.org/pig/PigMix
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

We should also benchmark against CloudBase. They announce releases on
the Hadoop list, and I believe it is open source. It would be nice to
see an open source system vs a closed source one.
