[jira] Commented: (HIVE-600) Running TPC-H queries on Hive

2010-02-28 Thread Kamil Bajda-Pawlikowski (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839478#action_12839478
 ] 

Kamil Bajda-Pawlikowski commented on HIVE-600:
--

Hi Yuntao,

I have attempted to run TPC-H on Hive. Thanks for the really well-prepared 
scripts!

During the first query, I realized that things were not going well. It seems 
that Aaron's concern about the number of reducers was a valid one.
However, the problem is that Hive schedules too many reducers! The default 
configuration of Hive tries to determine the number of reduce tasks 
automatically using the value of the "hive.exec.reducers.bytes.per.reducer" 
property (the default setting is to have one reduce task per 1GB of input 
data). When the size of the data is huge, this is inefficient. This needs to 
be capped!
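
As a rough sketch of the heuristic described above (a model for discussion, not Hive's actual source), the reducer count can be thought of as the input size divided by "hive.exec.reducers.bytes.per.reducer", rounded up and capped by "hive.exec.reducers.max". The function name and the use of decimal gigabytes are assumptions made for illustration; the defaults of 1GB and 999 are the values mentioned in this comment.

```python
import math

GB = 1_000_000_000  # decimal gigabyte, assumed here for simplicity


def estimate_reducers(input_bytes, bytes_per_reducer=1 * GB, max_reducers=999):
    """Model of the automatic estimate: one reducer per bytes_per_reducer, capped."""
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, wanted))


# 50GB of input with the default settings asks for 50 reducers.
print(estimate_reducers(50 * GB))                  # 50
# Capping at 2 (the number of reduce slots) limits the job to 2 reducers.
print(estimate_reducers(50 * GB, max_reducers=2))  # 2
```

With the default cap of 999, the cap only takes effect for inputs approaching 1TB, which is why smaller inputs are effectively uncapped in this model.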

For example, in my case there is 50GB of data per node, but only 2 reduce task 
slots, so I'm getting 25 reduce task waves. Q1 ran for 1h49min. In contrast, 
when I set the "hive.exec.reducers.max" property to the number of reduce slots 
in my Hadoop installation, the query running time is only about 23min. Of 
note, the default value for "hive.exec.reducers.max" is 999.

The above issue was not too bad for the data size you used. The TPC-H dataset 
with SF=100 translates into at most 100 reducers per job, and with 40 reduce 
slots in total, each job had at most 2.5 reduce task waves. Still, your 
numbers could be somewhat better by capping "hive.exec.reducers.max" at 40, 
per Tom White's tip #9 from 
http://www.cloudera.com/blog/2009/05/10-mapreduce-tips.
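
The wave counts quoted above follow from simple arithmetic: the number of reducers divided by the number of available reduce slots. The sketch below reproduces both cases. The helper names are hypothetical, and treating a partly filled wave as costing a full wave is an assumption about scheduler behavior, not something measured here.

```python
import math


def reduce_waves(num_reducers, reduce_slots):
    """Average number of reduce waves: reducers per available slot."""
    return num_reducers / reduce_slots


def full_waves(num_reducers, reduce_slots):
    """Waves actually run, assuming a partly filled wave still costs a full one."""
    return math.ceil(num_reducers / reduce_slots)


# 50GB at 1GB per reducer gives 50 reducers for every 2 reduce slots.
print(reduce_waves(50, 2))    # 25.0
# SF=100: at most 100 reducers over 40 slots.
print(reduce_waves(100, 40))  # 2.5
print(full_waves(100, 40))    # 3
# Capped at the slot count, all reducers run in a single wave.
print(full_waves(40, 40))     # 1
```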

Could you please confirm whether my understanding is correct?

Thank you,
Kamil





> Running TPC-H queries on Hive
> -
>
> Key: HIVE-600
> URL: https://issues.apache.org/jira/browse/HIVE-600
> Project: Hadoop Hive
>  Issue Type: New Feature
>Reporter: Yuntao Jia
>Assignee: Yuntao Jia
> Attachments: TPC-H_on_Hive_2009-08-11.pdf, 
> TPC-H_on_Hive_2009-08-11.tar.gz, TPC-H_on_Hive_2009-08-14.tar.gz
>
>
> The goal is to run all TPC-H (http://www.tpc.org/tpch/) benchmark queries on 
> Hive for two reasons. First, through those queries, we would like to find the 
> new features that we need to put into Hive so that Hive supports common SQL 
> queries. Second, we would like to measure the performance of Hive to find out 
> what Hive is not good at. We can then improve Hive based on those 
> information. 
> For queries that are not supported now in Hive, I will try to rewrite them to 
> one or more Hive-supported queries. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-600) Running TPC-H queries on Hive

2009-08-11 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742189#action_12742189
 ] 

Aaron Kimball commented on HIVE-600:


Sounds good to me :)




[jira] Commented: (HIVE-600) Running TPC-H queries on Hive

2009-08-11 Thread Yuntao Jia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742185#action_12742185
 ] 

Yuntao Jia commented on HIVE-600:
-

To the 1st question, the number of reducers is set in Hive. In particular, in 
hive-default.xml, one property is:


<property>
  <name>mapred.reduce.tasks</name>
  <value>-1</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local". Hadoop set this to 1 by default, whereas hive 
  uses -1 as its default value.
  By setting this property to -1, Hive will automatically figure out what 
  should be the number of reducers.
  </description>
</property>
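
To illustrate the behavior this description refers to, here is a small sketch of how a value of -1 for mapred.reduce.tasks could be resolved into a concrete reducer count. It is a model for discussion, not Hive's code: the function name and the dictionary-based configuration are hypothetical, and the fallback values (1GB per reducer, at most 999 reducers) are the defaults reported elsewhere in this thread.

```python
import math


def resolve_reducers(conf, input_bytes):
    """Return the reducer count for a job under the assumed rules."""
    explicit = int(conf.get("mapred.reduce.tasks", -1))
    if explicit >= 0:
        # An explicit, non-negative setting is used as-is.
        return explicit
    # -1 means "automatic": estimate from the input size, then apply the cap.
    per_reducer = int(conf.get("hive.exec.reducers.bytes.per.reducer", 1_000_000_000))
    cap = int(conf.get("hive.exec.reducers.max", 999))
    return max(1, min(cap, math.ceil(input_bytes / per_reducer)))


hundred_gb = 100 * 1_000_000_000
print(resolve_reducers({"mapred.reduce.tasks": "-1"}, hundred_gb))  # 100
print(resolve_reducers({"mapred.reduce.tasks": "40"}, hundred_gb))  # 40
```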
  



To the 2nd question, in the actual Hadoop configuration, we did use four paths. 
However, for security reasons, we anonymized the configuration file and put one 
path instead.

Hope that answers your questions.





[jira] Commented: (HIVE-600) Running TPC-H queries on Hive

2009-08-11 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742169#action_12742169
 ] 

Aaron Kimball commented on HIVE-600:


Yuntao,

Thanks. I took a look through this file and have some questions:

1) {{mapred.reduce.tasks}} isn't set in hadoop-site.xml, nor do any of the 
scripts explicitly set it. This means it's left at the default value of '1'. 
A single reducer is necessary for anything with an {{ORDER BY}} clause, but it 
slows down everything else (you could set this to 40 on your cluster for any 
situations where you don't need total ordering). Could some of these queries 
be refactored to make use of multiple reducers in the middle? 

2) Your writeup says that you've got 4 HDDs per machine, but {{dfs.data.dir}} 
and {{mapred.local.dir}} each reference just a single path. Are you doing 
something unusual in your filesystem to spread this across all 4 disks? 
Or could three of them be going unused?

Thank you
- Aaron




[jira] Commented: (HIVE-600) Running TPC-H queries on Hive

2009-08-11 Thread Yuntao Jia (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742166#action_12742166
 ] 

Yuntao Jia commented on HIVE-600:
-

The hadoop-site.xml is included in the attached package: 
TPC-H_on_Hive_2009-08-11.tar.gz. You can download it and check it out. Since it 
has more than 300 lines, I had better not post it here.




[jira] Commented: (HIVE-600) Running TPC-H queries on Hive

2009-08-11 Thread Aaron Kimball (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12742116#action_12742116
 ] 

Aaron Kimball commented on HIVE-600:


Interesting results. Can you please post the hadoop-site.xml file used to run 
the test? I'm curious what Hadoop performance-tuning settings you used.
