RE: Partition performance

2016-01-26 Thread Mich Talebzadeh
Check the threads in hive user group under “Impact of partitioning on certain 
queries”

 

HTH

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf

Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", 
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one 
out shortly

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

From: Shubhvardhan Manjayya [mailto:shub...@gmail.com] 
Sent: 27 January 2016 04:14
To: user@hive.apache.org
Subject: Partition performance

 

Hi see this cloudera blog at:
http://blog.cloudera.com/blog/2014/08/improving-query-performance-using-partitioning-in-apache-hive/

That mentions "Do not over-partition the data. With too many small partitions, 
the task of recursively scanning the directories becomes more expensive than a 
full table scan of the table."

If I have two tables with this partition structure:

1. table1 pointing to hdfs location  /c1/c2/data

2. table2 pointing to hdfs location /c1/c2/c3/data

and hadoop fs -du -h -s /c1 has the same result for both these,

 

and the files are splitttable, snappy compressed,

when I compare the two queries,

select count(1) from table1;

select count(1) from table2;

For  which usecases do the two queries have different execution time? I am 
guessing both should perform the same always as long as we dont use the c3 
partitioned column in the where clause?


-- 


-Shubh



RE: Partition performance

2013-07-04 Thread Peter Marron
Hi,

Just to check that I understand this problem, my reading suggests that the 
overhead of
many partitions is currently unavoidable. Specifically this means that any 
query on a table that has, let’s say, 10,000 partitions
will be significantly slower (than on un-partitioned table with the “same” 
data) even if
the query explicitly specifies a single partition.
(I mean I _could_ actually do the experiments myself…)

Regards,

Z

From: Owen O'Malley [mailto:omal...@apache.org]
Sent: 02 July 2013 15:52
To: user@hive.apache.org
Subject: Re: Partition performance

On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron 
peter.mar...@trilliumsoftware.commailto:peter.mar...@trilliumsoftware.com 
wrote:
Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental 
reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000?

The precise numbers don't matter. I wanted to give people a ballpark range that 
they should be looking at. Most tables at 1000 partitions won't cause big slow 
downs, but the cost scales with the number of partitions. By the time you are 
at 10,000 the cost is noticeable. I have one customer who has a table with 1.2 
million partitions. That causes a lot of slow downs.

And the improvements
that you mention are they going to be in version 12? Is there a JIRA raised so 
that I can track them?
(It’s not currently a problem for me but I can see that I am going to need to 
be able to explain the situation.)

I think this is the one they will use: 
https://issues.apache.org/jira/browse/HIVE-4051

-- Owen


RE: Partition performance

2013-07-04 Thread Peter Marron
Sorry, just caught up with the last couple of day’s email and I feel that this 
question
has already been answered fairly comprehensively. Apologies.

Z

From: Peter Marron [mailto:peter.mar...@trilliumsoftware.com]
Sent: 04 July 2013 08:37
To: user@hive.apache.org
Subject: RE: Partition performance

Hi,

Just to check that I understand this problem, my reading suggests that the 
overhead of
many partitions is currently unavoidable. Specifically this means that any 
query on a table that has, let’s say, 10,000 partitions
will be significantly slower (than on un-partitioned table with the “same” 
data) even if
the query explicitly specifies a single partition.
(I mean I _could_ actually do the experiments myself…)

Regards,

Z

From: Owen O'Malley [mailto:omal...@apache.org]
Sent: 02 July 2013 15:52
To: user@hive.apache.orgmailto:user@hive.apache.org
Subject: Re: Partition performance

On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron 
peter.mar...@trilliumsoftware.commailto:peter.mar...@trilliumsoftware.com 
wrote:
Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental 
reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000?

The precise numbers don't matter. I wanted to give people a ballpark range that 
they should be looking at. Most tables at 1000 partitions won't cause big slow 
downs, but the cost scales with the number of partitions. By the time you are 
at 10,000 the cost is noticeable. I have one customer who has a table with 1.2 
million partitions. That causes a lot of slow downs.

And the improvements
that you mention are they going to be in version 12? Is there a JIRA raised so 
that I can track them?
(It’s not currently a problem for me but I can see that I am going to need to 
be able to explain the situation.)

I think this is the one they will use: 
https://issues.apache.org/jira/browse/HIVE-4051

-- Owen


Re: Partition performance

2013-07-03 Thread David Morel
On 2 Jul 2013, at 16:51, Owen O'Malley wrote:

 On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron 
 peter.mar...@trilliumsoftware.com wrote:

 Hi Owen,

 ** **

 I’m curious about this advice about partitioning. Is there some
 fundamental reason why Hive

 is slow when the number of partitions is 10,000 rather than 1,000?


 The precise numbers don't matter. I wanted to give people a ballpark range
 that they should be looking at. Most tables at 1000 partitions won't cause
 big slow downs, but the cost scales with the number of partitions. By the
 time you are at 10,000 the cost is noticeable. I have one customer who has
 a table with 1.2 million partitions. That causes a lot of slow downs.

That is still not really answering the question, which is: why is it slower
to run a query on a heavily partitioned table than it is on the same number 
of files in a less heavily partitioned table.

David


Re: Partition performance

2013-07-03 Thread Dean Wampler
How big were the files in each case in your experiment? Having lots of
small files will add Hadoop overhead.

Also, it would be useful to know the execution times of the map and reduce
tasks. The rule of thumb is that under 20 seconds each, or so, you're
paying a significant of the execution time in startup and shutdown overhead.

Of course, another factor is the number of tasks your cluster can run in
parallel. Scanning 20K partitions with a 1K MapReduce slot capacity over
the cluster will obviously take ~20 passes vs. ~1 pass for 1K partitions.

dean

On Tue, Jul 2, 2013 at 4:34 AM, Peter Marron 
peter.mar...@trilliumsoftware.com wrote:

  ...

 ** **

 *From: *Ian liu...@yahoo.com
 *Reply-To: *user@hive.apache.org user@hive.apache.org, Ian 
 liu...@yahoo.com
 *Date: *Thursday, April 4, 2013 4:01 PM
 *To: *user@hive.apache.org user@hive.apache.org
 *Subject: *Partition performance

 ** **

 Hi,

  

 I created 3 years of hourly log files (totally 26280 files), and use
 External Table with partition to query. I tried two partition methods.

  

 1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per
 hour). Use date and hour as partition keys. Add 3 years of directories to
 the table partitions. So there are 26280 partitions.

 CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
 string, hr int);

 ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION
 '/test1/2013/04/02/16';

  

 2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per
 day, 24 files in each directory). Use date as partition key. Add 3 years of
 directories to the table partitions. So there are 1095 partitions.

 CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
 string);

 ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
 '/test2/2013/04/02';

  

 When doing a simple query like 

 SELECT * FROM  test1/test2  WHERE  dt = '2013-02-01' and dt =
 '2013-02-14'

 Using approach #1 takes 320 seconds, but #2 only takes 70 seconds. 

  

 I'm wondering why there is a big performance difference between these two?
 These two approaches have the same number of files, only the directory
 structure is different. So Hive is going to load the same amount of files.
 Why does the number of partitions have such big impact? Does that mean #2
 is a better partition strategy?

  

 Thanks.

  

  

 ** **

 CONFIDENTIALITY NOTICE
 ==
 This email message and any attachments are for the exclusive use of the
 intended recipient(s) and may contain confidential and privileged
 information. Any unauthorized review, use, disclosure or distribution is
 prohibited. If you are not the intended recipient, please contact the
 sender by reply email and destroy all copies of the original message along
 with any attachments, from your computer system. If you are the intended
 recipient, please be advised that the content of this message is subject to
 access, review and disclosure by the sender's Email System Administrator.*
 ***

 ** **

 ** **




-- 
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com


Re: Partition performance

2013-07-03 Thread Edward Capriolo
1) each partition object is a row in the metastore usually mysql, querying
large tables with many partitions has longer startup time as the hive query
planner has to fetch and process all of this meta-information. This is not
a distributed process. It is usually fast within a few seconds but for very
large partitions it can be slow.

2) hadoop's small files problem. - google that. Small files end up being
much more overhead for a given map reduce job, generally the more
files/partitions the more map/reduce tasks. More map reduce tasks is more
overhead, more overhead is less throughput.

::SHAMELESS PLUG:: We discuss this in detail the book programming hive, in
the schema design section



On Wed, Jul 3, 2013 at 8:19 AM, David Morel dmore...@gmail.com wrote:

 On 2 Jul 2013, at 16:51, Owen O'Malley wrote:

  On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron 
  peter.mar...@trilliumsoftware.com wrote:
 
  Hi Owen,
 
  ** **
 
  I’m curious about this advice about partitioning. Is there some
  fundamental reason why Hive
 
  is slow when the number of partitions is 10,000 rather than 1,000?
 
 
  The precise numbers don't matter. I wanted to give people a ballpark
 range
  that they should be looking at. Most tables at 1000 partitions won't
 cause
  big slow downs, but the cost scales with the number of partitions. By the
  time you are at 10,000 the cost is noticeable. I have one customer who
 has
  a table with 1.2 million partitions. That causes a lot of slow downs.

 That is still not really answering the question, which is: why is it slower
 to run a query on a heavily partitioned table than it is on the same number
 of files in a less heavily partitioned table.

 David



Re: Partition performance

2013-07-03 Thread Owen O'Malley
On Wed, Jul 3, 2013 at 5:19 AM, David Morel dmore...@gmail.com wrote:


 That is still not really answering the question, which is: why is it slower
 to run a query on a heavily partitioned table than it is on the same number
 of files in a less heavily partitioned table.


According to Gopal's investigations in
https://issues.apache.org/jira/browse/HIVE-4051, each time Hive plans a
query, it does a query per a partition to the backing SQL database. That
would explain a lot of the latency for tables with large numbers of
partitions.

-- Owen


RE: Partition performance

2013-07-02 Thread Peter Marron
Hi Owen,

I’m curious about this advice about partitioning. Is there some fundamental 
reason why Hive
is slow when the number of partitions is 10,000 rather than 1,000? And the 
improvements
that you mention are they going to be in version 12? Is there a JIRA raised so 
that I can track them?
(It’s not currently a problem for me but I can see that I am going to need to 
be able to explain the situation.)

Warm regards,

Z

From: Owen O'Malley [mailto:omal...@apache.org]
Sent: 05 April 2013 00:26
To: user@hive.apache.org
Subject: Re: Partition performance

See slide #9 from my Optimizing Hive Queries talk 
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will 
improve it, but for now you are much better off with 1,000 partitions than 
10,000.

-- Owen

On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle 
ramki.pa...@gmail.commailto:ramki.pa...@gmail.com wrote:
Is it possible for you to send the explain plan of these two queries?
Regards,
Ramki.

On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.commailto:sanjay.subraman...@wizecommerce.com
 wrote:
The slow down is most possibly due to large number of partitions.
I believe the Hive book authors tell us to be cautious with large number of 
partitions :-)  and I abide by that.

Users
Please add your points of view and experiences

Thanks
sanjay

From: Ian liu...@yahoo.commailto:liu...@yahoo.com
Reply-To: user@hive.apache.orgmailto:user@hive.apache.org 
user@hive.apache.orgmailto:user@hive.apache.org, Ian 
liu...@yahoo.commailto:liu...@yahoo.com
Date: Thursday, April 4, 2013 4:01 PM
To: user@hive.apache.orgmailto:user@hive.apache.org 
user@hive.apache.orgmailto:user@hive.apache.org
Subject: Partition performance

Hi,

I created 3 years of hourly log files (totally 26280 files), and use External 
Table with partition to query. I tried two partition methods.

1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per 
hour). Use date and hour as partition keys. Add 3 years of directories to the 
table partitions. So there are 26280 partitions.
CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, 
hr int);
ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION 
'/test1/2013/04/02/16';

2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per day, 
24 files in each directory). Use date as partition key. Add 3 years of 
directories to the table partitions. So there are 1095 partitions.
CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string);
ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION 
'/test2/2013/04/02';

When doing a simple query like
SELECT * FROM  test1/test2  WHERE  dt = '2013-02-01' and dt = '2013-02-14'
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

I'm wondering why there is a big performance difference between these two? 
These two approaches have the same number of files, only the directory 
structure is different. So Hive is going to load the same amount of files. Why 
does the number of partitions have such big impact? Does that mean #2 is a 
better partition strategy?

Thanks.



CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.




Re: Partition performance

2013-07-02 Thread Owen O'Malley
On Tue, Jul 2, 2013 at 2:34 AM, Peter Marron 
peter.mar...@trilliumsoftware.com wrote:

  Hi Owen,

 ** **

 I’m curious about this advice about partitioning. Is there some
 fundamental reason why Hive

 is slow when the number of partitions is 10,000 rather than 1,000?


The precise numbers don't matter. I wanted to give people a ballpark range
that they should be looking at. Most tables at 1000 partitions won't cause
big slow downs, but the cost scales with the number of partitions. By the
time you are at 10,000 the cost is noticeable. I have one customer who has
a table with 1.2 million partitions. That causes a lot of slow downs.


 And the improvements

 that you mention are they going to be in version 12? Is there a JIRA
 raised so that I can track them?

 (It’s not currently a problem for me but I can see that I am going to need
 to be able to explain the situation.)


I think this is the one they will use:
https://issues.apache.org/jira/browse/HIVE-4051

-- Owen


Re: Partition performance

2013-04-11 Thread Ian
There is only one map task because it's using the CombineHiveInputFormat (In my 
test cases, all files are very small). If I set hive.input.format to 
HiveInputFormat, then it has 336 map tasks in the first case. But the 
performance is even worse since there are too many map tasks and each one is 
only handling a small file.
 
It takes a lot of time before it actually submits the job. So maybe querying 
the metastore for partition info takes time?
  


 From: Ramki Palle ramki.pa...@gmail.com
To: user@hive.apache.org; Ian liu...@yahoo.com 
Sent: Friday, April 5, 2013 1:12 PM
Subject: Re: Partition performance
  


Can you tell how many map tasks are there in each scenario?

If my assumption is correct, you should have 336 in the first case and 14 in 
second case.



It looks like it is combing all small files in a folder and running as one map 
task for all 24 files in a folder, whereas it is running a separate task in 
these files are there in different partitions (folders).


You can try to reuse the JVM and see if the response time is similar.

Can you please try the following and let us know how long each strategy takes?


hive set mapred.job.reuse.jvm.num.tasks = 24;


Run your  query that has more partitions and see if the response time is lower.


Regards,

Ramki.




On Fri, Apr 5, 2013 at 11:36 AM, Ian liu...@yahoo.com wrote:

Thanks. This is just a test from my local box. So each file is only 1kb. I 
shared the query plans of these two tests at: 
http://codetidy.com/paste/raw/5198
http://codetidy.com/paste/raw/5199 
 
Also in the Hadoop log, there is this line for each 
partition:org.apache.hadoop.hive.ql.exec.MapOperator: Adding alias test1 to 
work list for file hdfs://localhost:8020/test1/2011/02/01/01 
Does that mean each partition will become a map task?

I'm still new in Hive, just wondering what are the common strategy for 
partitioning the hourly logs? I know we shouldn't have too many partitions but 
I'm wondering what's the reason behind it? If I run this on a real cluster, 
maybe it won't perform so differently?

Thanks.  
 From: Dean Wampler dean.wamp...@thinkbiganalytics.com
To: user@hive.apache.org 
Sent: Thursday, April 4, 2013 4:28 PM
Subject: Re: Partition performance
 


Also, how big are the files in each directory? Are they roughly the size of 
one HDFS block or a multiple. Lots of small files will mean lots of mapper 
tasks will little to do.


You can also compare the job tracker console output for each job. I bet the 
slow one has a lot of very short map and reduce tasks, while the faster one 
has fewer tasks that run longer. A rule of thumb is that any one task should 
take 20 seconds or more to amortize over the few seconds spent in start up per 
task. 


In other words, if you think about what's happening at the HDFS and MR level, 
you can learn to predict how fast or slow things will run. Learning to read 
the output of EXPLAIN or EXPLAIN EXTENDED helps with this. 


dean


On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley omal...@apache.org wrote:

See slide #9 from my Optimizing Hive Queries talk 
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will 
improve it, but for now you are much better off with 1,000 partitions than 
10,000.

-- Owen



On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle ramki.pa...@gmail.com wrote:

Is it possible for you to send the explain plan of these two queries?

Regards,
Ramki.




On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:

The slow down is most possibly due to large number of partitions. 
I believe the Hive book authors tell us to be cautious with large number of 
partitions :-)  and I abide by that. 

 
Users 
Please add your points of view and experiences 

 
Thanks 
sanjay 

 From: Ian liu...@yahoo.com
Reply-To: user@hive.apache.org user@hive.apache.org, Ian 
liu...@yahoo.com
Date: Thursday, April 4, 2013 4:01 PM
To: user@hive.apache.org user@hive.apache.org
Subject: Partition performance


 
Hi, 

I created 3 years of hourly log files (totally 26280 files), and use 
External Table with partition to query. I tried two partition methods. 

1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per 
hour). Use date and hour as partition keys. Add 3 years of directories to 
the table partitions. So there are 26280 partitions. 
CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt 
string, hr int); 
ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION 
'/test1/2013/04/02/16'; 
  
2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per 
day, 24 files in each directory). Use date as partition key. Add 3 years of 
directories to the table partitions. So there are 1095 partitions. 
CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string); 
ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION 
'/test2/2013/04/02'; 
  
When doing

Re: Partition performance

2013-04-05 Thread Ian
Thanks. This is just a test from my local box. So each file is only 1kb. I 
shared the query plans of these two tests at:
http://codetidy.com/paste/raw/5198
http://codetidy.com/paste/raw/5199
 
Also in the Hadoop log, there is this line for each 
partition:org.apache.hadoop.hive.ql.exec.MapOperator: Adding alias test1 to 
work list for file hdfs://localhost:8020/test1/2011/02/01/01
Does that mean each partition will become a map task?
 
I'm still new in Hive, just wondering what are the common strategy for 
partitioning the hourly logs? I know we shouldn't have too many partitions but 
I'm wondering what's the reason behind it? If I run this on a real cluster, 
maybe it won't perform so differently?
 
Thanks.  


 From: Dean Wampler dean.wamp...@thinkbiganalytics.com
To: user@hive.apache.org 
Sent: Thursday, April 4, 2013 4:28 PM
Subject: Re: Partition performance
  

Also, how big are the files in each directory? Are they roughly the size of one 
HDFS block or a multiple. Lots of small files will mean lots of mapper tasks 
will little to do.

You can also compare the job tracker console output for each job. I bet the 
slow one has a lot of very short map and reduce tasks, while the faster one has 
fewer tasks that run longer. A rule of thumb is that any one task should take 
20 seconds or more to amortize over the few seconds spent in start up per task. 

In other words, if you think about what's happening at the HDFS and MR level, 
you can learn to predict how fast or slow things will run. Learning to read the 
output of EXPLAIN or EXPLAIN EXTENDED helps with this. 

dean


On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley omal...@apache.org wrote:

See slide #9 from my Optimizing Hive Queries talk 
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will 
improve it, but for now you are much better off with 1,000 partitions than 
10,000.

-- Owen



On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle ramki.pa...@gmail.com wrote:

Is it possible for you to send the explain plan of these two queries?

Regards,
Ramki.




On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:

The slow down is most possibly due to large number of partitions. 
I believe the Hive book authors tell us to be cautious with large number of 
partitions :-)  and I abide by that. 

 
Users 
Please add your points of view and experiences 

 
Thanks 
sanjay 

 From: Ian liu...@yahoo.com
Reply-To: user@hive.apache.org user@hive.apache.org, Ian 
liu...@yahoo.com
Date: Thursday, April 4, 2013 4:01 PM
To: user@hive.apache.org user@hive.apache.org
Subject: Partition performance


 
Hi, 

I created 3 years of hourly log files (totally 26280 files), and use 
External Table with partition to query. I tried two partition methods. 

1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per 
hour). Use date and hour as partition keys. Add 3 years of directories to 
the table partitions. So there are 26280 partitions. 
CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt 
string, hr int); 
ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION 
'/test1/2013/04/02/16'; 
  
2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per 
day, 24 files in each directory). Use date as partition key. Add 3 years of 
directories to the table partitions. So there are 1095 partitions. 
CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string); 
ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION 
'/test2/2013/04/02'; 
  
When doing a simple query like  
SELECT * FROM  test1/test2  WHERE  dt = '2013-02-01' and dt = 
'2013-02-14'  
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.  

I'm wondering why there is a big performance difference between these two? 
These two approaches have the same number of files, only the directory 
structure is different. So Hive is going to load the same amount of files. 
Why does the number of partitions have such big impact? Does that mean #2 is 
a better partition strategy? 
  
Thanks.  

   

CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged 
information. Any unauthorized review, use, disclosure or distribution is 
prohibited. If you are not the intended recipient,
 please contact the sender by reply email and destroy all copies of the 
original message along with any attachments, from your computer system. If you 
are the intended recipient, please be advised that the content of this message 
is subject to access, review
 and disclosure by the sender's Email System Administrator.
  
 
 


-- 
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330

Re: Partition performance

2013-04-05 Thread Ramki Palle
Can you tell how many map tasks are there in each scenario?

If my assumption is correct, you should have 336 in the first case and 14
in second case.


It looks like it is combing all small files in a folder and running as one
map task for all 24 files in a folder, whereas it is running a separate
task in these files are there in different partitions (folders).

You can try to reuse the JVM and see if the response time is similar.

Can you please try the following and let us know how long each strategy
takes?

hive set mapred.job.reuse.jvm.num.tasks = 24;

Run your  query that has more partitions and see if the response time is
lower.

Regards,
Ramki.


On Fri, Apr 5, 2013 at 11:36 AM, Ian liu...@yahoo.com wrote:

 Thanks. This is just a test from my local box. So each file is only 1kb. I
 shared the query plans of these two tests at:
 http://codetidy.com/paste/raw/5198
 http://codetidy.com/paste/raw/5199

 Also in the Hadoop log, there is this line for each partition:
 org.apache.hadoop.hive.ql.exec.MapOperator: Adding alias test1 to work
 list for file hdfs://localhost:8020/test1/2011/02/01/01
 Does that mean each partition will become a map task?

 I'm still new in Hive, just wondering what are the common strategy for
 partitioning the hourly logs? I know we shouldn't have too many partitions
 but I'm wondering what's the reason behind it? If I run this on a real
 cluster, maybe it won't perform so differently?

 Thanks.

   *From:* Dean Wampler dean.wamp...@thinkbiganalytics.com
 *To:* user@hive.apache.org
 *Sent:* Thursday, April 4, 2013 4:28 PM
 *Subject:* Re: Partition performance

 Also, how big are the files in each directory? Are they roughly the size
 of one HDFS block or a multiple. Lots of small files will mean lots of
 mapper tasks will little to do.

 You can also compare the job tracker console output for each job. I bet
 the slow one has a lot of very short map and reduce tasks, while the faster
 one has fewer tasks that run longer. A rule of thumb is that any one task
 should take 20 seconds or more to amortize over the few seconds spent in
 start up per task.

 In other words, if you think about what's happening at the HDFS and MR
 level, you can learn to predict how fast or slow things will run. Learning
 to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this.

 dean

 On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley omal...@apache.org wrote:

 See slide #9 from my Optimizing Hive Queries talk
 http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we
 will improve it, but for now you are much better off with 1,000 partitions
 than 10,000.

 -- Owen


 On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle ramki.pa...@gmail.com wrote:

 Is it possible for you to send the explain plan of these two queries?

 Regards,
 Ramki.


 On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian 
 sanjay.subraman...@wizecommerce.com wrote:

  The slow down is most possibly due to large number of partitions.
 I believe the Hive book authors tell us to be cautious with large number
 of partitions :-)  and I abide by that.

  Users
 Please add your points of view and experiences

  Thanks
 sanjay

   From: Ian liu...@yahoo.com
 Reply-To: user@hive.apache.org user@hive.apache.org, Ian 
 liu...@yahoo.com
 Date: Thursday, April 4, 2013 4:01 PM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Partition performance

   Hi,

 I created 3 years of hourly log files (totally 26280 files), and use
 External Table with partition to query. I tried two partition methods.

 1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per
 hour). Use date and hour as partition keys. Add 3 years of directories to
 the table partitions. So there are 26280 partitions.
 CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
 string, hr int);
 ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION
 '/test1/2013/04/02/16';

 2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per
 day, 24 files in each directory). Use date as partition key. Add 3 years of
 directories to the table partitions. So there are 1095 partitions.
  CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
 string);
 ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
 '/test2/2013/04/02';

 When doing a simple query like
 SELECT * FROM  test1/test2  WHERE  dt = '2013-02-01' and dt =
 '2013-02-14'
  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

 I'm wondering why there is a big performance difference between these two?
 These two approaches have the same number of files, only the directory
 structure is different. So Hive is going to load the same amount of files.
 Why does the number of partitions have such big impact? Does that mean #2
 is a better partition strategy?

 Thanks.



 CONFIDENTIALITY NOTICE
 ==
 This email message and any attachments are for the exclusive use of the
 intended

Re: Partition performance

2013-04-04 Thread Sanjay Subramanian
The slow down is most possibly due to large number of partitions.
I believe the Hive book authors tell us to be cautious with large number of 
partitions :-)  and I abide by that.

Users
Please add your points of view and experiences

Thanks
sanjay

From: Ian liu...@yahoo.commailto:liu...@yahoo.com
Reply-To: user@hive.apache.orgmailto:user@hive.apache.org 
user@hive.apache.orgmailto:user@hive.apache.org, Ian 
liu...@yahoo.commailto:liu...@yahoo.com
Date: Thursday, April 4, 2013 4:01 PM
To: user@hive.apache.orgmailto:user@hive.apache.org 
user@hive.apache.orgmailto:user@hive.apache.org
Subject: Partition performance

Hi,

I created 3 years of hourly log files (totally 26280 files), and use External 
Table with partition to query. I tried two partition methods.

1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per 
hour). Use date and hour as partition keys. Add 3 years of directories to the 
table partitions. So there are 26280 partitions.
CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt string, 
hr int);
ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION 
'/test1/2013/04/02/16';

2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per day, 
24 files in each directory). Use date as partition key. Add 3 years of 
directories to the table partitions. So there are 1095 partitions.
CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt string);
ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION 
'/test2/2013/04/02';

When doing a simple query like
SELECT * FROM  test1/test2  WHERE  dt = '2013-02-01' and dt = '2013-02-14'
Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

I'm wondering why there is a big performance difference between these two? 
These two approaches have the same number of files, only the directory 
structure is different. So Hive is going to load the same amount of files. Why 
does the number of partitions have such big impact? Does that mean #2 is a 
better partition strategy?

Thanks.



CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the 
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized review, use, disclosure or distribution is prohibited. If you 
are not the intended recipient, please contact the sender by reply email and 
destroy all copies of the original message along with any attachments, from 
your computer system. If you are the intended recipient, please be advised that 
the content of this message is subject to access, review and disclosure by the 
sender's Email System Administrator.


Re: Partition performance

2013-04-04 Thread Ramki Palle
Is it possible for you to send the explain plan of these two queries?

Regards,
Ramki.


On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian 
sanjay.subraman...@wizecommerce.com wrote:

  The slow down is most possibly due to large number of partitions.
 I believe the Hive book authors tell us to be cautious with large number
 of partitions :-)  and I abide by that.

  Users
 Please add your points of view and experiences

  Thanks
 sanjay

   From: Ian liu...@yahoo.com
 Reply-To: user@hive.apache.org user@hive.apache.org, Ian 
 liu...@yahoo.com
 Date: Thursday, April 4, 2013 4:01 PM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Partition performance

   Hi,

 I created 3 years of hourly log files (totally 26280 files), and use
 External Table with partition to query. I tried two partition methods.

 1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory per
 hour). Use date and hour as partition keys. Add 3 years of directories to
 the table partitions. So there are 26280 partitions.
 CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
 string, hr int);
 ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16) LOCATION
 '/test1/2013/04/02/16';

 2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory per
 day, 24 files in each directory). Use date as partition key. Add 3 years of
 directories to the table partitions. So there are 1095 partitions.
  CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
 string);
 ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
 '/test2/2013/04/02';

 When doing a simple query like
 SELECT * FROM  test1/test2  WHERE  dt = '2013-02-01' and dt =
 '2013-02-14'
  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

 I'm wondering why there is a big performance difference between these two?
 These two approaches have the same number of files, only the directory
 structure is different. So Hive is going to load the same amount of files.
 Why does the number of partitions have such big impact? Does that mean #2
 is a better partition strategy?

 Thanks.



 CONFIDENTIALITY NOTICE
 ==
 This email message and any attachments are for the exclusive use of the
 intended recipient(s) and may contain confidential and privileged
 information. Any unauthorized review, use, disclosure or distribution is
 prohibited. If you are not the intended recipient, please contact the
 sender by reply email and destroy all copies of the original message along
 with any attachments, from your computer system. If you are the intended
 recipient, please be advised that the content of this message is subject to
 access, review and disclosure by the sender's Email System Administrator.



Re: Partition performance

2013-04-04 Thread Owen O'Malley
See slide #9 from my Optimizing Hive Queries talk
http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we
will improve it, but for now you are much better off with 1,000 partitions
than 10,000.

-- Owen


On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle ramki.pa...@gmail.com wrote:

 Is it possible for you to send the explain plan of these two queries?

 Regards,
 Ramki.


 On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian 
 sanjay.subraman...@wizecommerce.com wrote:

  The slow down is most possibly due to large number of partitions.
 I believe the Hive book authors tell us to be cautious with large number
 of partitions :-)  and I abide by that.

  Users
 Please add your points of view and experiences

  Thanks
 sanjay

   From: Ian liu...@yahoo.com
 Reply-To: user@hive.apache.org user@hive.apache.org, Ian 
 liu...@yahoo.com
 Date: Thursday, April 4, 2013 4:01 PM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Partition performance

   Hi,

 I created 3 years of hourly log files (totally 26280 files), and use
 External Table with partition to query. I tried two partition methods.

 1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory
 per hour). Use date and hour as partition keys. Add 3 years of directories
 to the table partitions. So there are 26280 partitions.
 CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
 string, hr int);
 ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16)
 LOCATION '/test1/2013/04/02/16';

 2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory
 per day, 24 files in each directory). Use date as partition key. Add 3
 years of directories to the table partitions. So there are 1095 partitions.
  CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY (dt
 string);
 ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
 '/test2/2013/04/02';

 When doing a simple query like
 SELECT * FROM  test1/test2  WHERE  dt = '2013-02-01' and dt =
 '2013-02-14'
  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

 I'm wondering why there is a big performance difference between these
 two? These two approaches have the same number of files, only the directory
 structure is different. So Hive is going to load the same amount of files.
 Why does the number of partitions have such big impact? Does that mean #2
 is a better partition strategy?

 Thanks.



 CONFIDENTIALITY NOTICE
 ==
 This email message and any attachments are for the exclusive use of the
 intended recipient(s) and may contain confidential and privileged
 information. Any unauthorized review, use, disclosure or distribution is
 prohibited. If you are not the intended recipient, please contact the
 sender by reply email and destroy all copies of the original message along
 with any attachments, from your computer system. If you are the intended
 recipient, please be advised that the content of this message is subject to
 access, review and disclosure by the sender's Email System Administrator.





Re: Partition performance

2013-04-04 Thread Dean Wampler
Also, how big are the files in each directory? Are they roughly the size of
one HDFS block or a multiple. Lots of small files will mean lots of mapper
tasks will little to do.

You can also compare the job tracker console output for each job. I bet the
slow one has a lot of very short map and reduce tasks, while the faster one
has fewer tasks that run longer. A rule of thumb is that any one task
should take 20 seconds or more to amortize over the few seconds spent in
start up per task.

In other words, if you think about what's happening at the HDFS and MR
level, you can learn to predict how fast or slow things will run. Learning
to read the output of EXPLAIN or EXPLAIN EXTENDED helps with this.

dean

On Thu, Apr 4, 2013 at 6:25 PM, Owen O'Malley omal...@apache.org wrote:

 See slide #9 from my Optimizing Hive Queries talk
 http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we
 will improve it, but for now you are much better off with 1,000 partitions
 than 10,000.

 -- Owen


 On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle ramki.pa...@gmail.com wrote:

 Is it possible for you to send the explain plan of these two queries?

 Regards,
 Ramki.


 On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian 
 sanjay.subraman...@wizecommerce.com wrote:

  The slow down is most possibly due to large number of partitions.
 I believe the Hive book authors tell us to be cautious with large number
 of partitions :-)  and I abide by that.

  Users
 Please add your points of view and experiences

  Thanks
 sanjay

   From: Ian liu...@yahoo.com
 Reply-To: user@hive.apache.org user@hive.apache.org, Ian 
 liu...@yahoo.com
 Date: Thursday, April 4, 2013 4:01 PM
 To: user@hive.apache.org user@hive.apache.org
 Subject: Partition performance

   Hi,

 I created 3 years of hourly log files (totally 26280 files), and use
 External Table with partition to query. I tried two partition methods.

 1). Log files are stored as /test1/2013/04/02/16/00_0 (A directory
 per hour). Use date and hour as partition keys. Add 3 years of directories
 to the table partitions. So there are 26280 partitions.
 CREATE EXTERNAL TABLE test1 (logline string) PARTITIONED BY (dt
 string, hr int);
 ALTER TABLE test1 ADD PARTITION (dt='2013-04-02', hr=16)
 LOCATION '/test1/2013/04/02/16';

 2). Log files are stored as /test2/2013/04/02/16_00_0 (A directory
 per day, 24 files in each directory). Use date as partition key. Add 3
 years of directories to the table partitions. So there are 1095 partitions.
  CREATE EXTERNAL TABLE test2 (logline string) PARTITIONED BY
 (dt string);
 ALTER TABLE test2 ADD PARTITION (dt='2013-04-02') LOCATION
 '/test2/2013/04/02';

 When doing a simple query like
 SELECT * FROM  test1/test2  WHERE  dt = '2013-02-01' and dt =
 '2013-02-14'
  Using approach #1 takes 320 seconds, but #2 only takes 70 seconds.

 I'm wondering why there is a big performance difference between these
 two? These two approaches have the same number of files, only the directory
 structure is different. So Hive is going to load the same amount of files.
 Why does the number of partitions have such big impact? Does that mean #2
 is a better partition strategy?

 Thanks.



 CONFIDENTIALITY NOTICE
 ==
 This email message and any attachments are for the exclusive use of the
 intended recipient(s) and may contain confidential and privileged
 information. Any unauthorized review, use, disclosure or distribution is
 prohibited. If you are not the intended recipient, please contact the
 sender by reply email and destroy all copies of the original message along
 with any attachments, from your computer system. If you are the intended
 recipient, please be advised that the content of this message is subject to
 access, review and disclosure by the sender's Email System Administrator.






-- 
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330