Re: Handling hierarchical data in Hive

Prasan Samtani Tue, 25 Mar 2014 07:58:11 -0700

Hi Saumitra,

You might want to look into clustering within the partition. That is, partition 
by "day", but cluster by "generated_by" (within those partitions), and see if 
that improves performance. Refer to the CLUSTERED BY clause (bucketed tables) 
in the Hive Language Manual.
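
To make the idea concrete, here is a minimal HiveQL sketch, not a tested
setup. In table DDL the clause is spelled CLUSTERED BY ... INTO n BUCKETS. The
table names analyze_bucketed and analyze_staging, the payload column, the
bucket count of 64, and the S3 location are all assumptions made up for
illustration; only "day" and "generated_by" come from this thread.

```sql
-- Sketch only: one partition per day, with generated_by hashed into a fixed
-- number of buckets instead of one partition per generated_by value.
-- analyze_bucketed, payload, 64 and the LOCATION are hypothetical.
CREATE EXTERNAL TABLE analyze_bucketed (
  generated_by INT,
  payload      STRING
)
PARTITIONED BY (day STRING)
CLUSTERED BY (generated_by) INTO 64 BUCKETS
LOCATION 's3://analyze-bucketed/';

-- On Hive releases before 2.0 this setting makes inserts honour the declared
-- bucketing; later releases always enforce it.
SET hive.enforce.bucketing = true;

-- Load one day; analyze_staging is a hypothetical source table that has
-- the same columns plus a day column.
INSERT OVERWRITE TABLE analyze_bucketed PARTITION (day = '20140320')
SELECT generated_by, payload
FROM analyze_staging
WHERE day = '20140320';
```

With this layout the metastore holds one partition per day rather than
thousands. Whether lookups on a single generated_by stay fast depends on the
Hive version and the bucket count, so it is worth measuring both query shapes.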


-Prasan


On Mar 25, 2014, at 4:26 AM, "Saumitra Shahapure (Vizury)" 
<saumitra.shahap...@vizury.com> wrote:

Hi Nitin,

We are not facing the small-files problem, since the data is in S3. We also do 
not want to merge files: merging files, or creating one large analyze table 
for, say, one day, would slow down queries fired on a specific day and 
generated_by.

Let me explain my problem in other words.
Right now we are over-partitioning our table. The benefit of over-partitioning 
is that a query on 1-2 partitions is very fast. Its side effect is that if we 
try to query a large number of partitions, the query is too slow. Is there a 
way to get good performance in both scenarios?

--
Regards,
Saumitra Shahapure


On Tue, Mar 25, 2014 at 4:25 PM, Nitin Pawar 
<nitinpawar...@gmail.com> wrote:
See if this is what you are looking for: https://github.com/sskaje/hive_merge




On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) 
<saumitra.shahap...@vizury.com> wrote:
Hello,

We are using Hive to query S3 data. For one of our tables, named analyze, we 
generate data hierarchically. The first level of the hierarchy is the date and 
the second level is a field named generated_by. E.g., for 20 March we may have 
S3 directories such as
s3://analyze/20140320/111/
s3://analyze/20140320/222/
s3://analyze/20140320/333/
The files in each folder are typically small.

Until now we have been using static partitioning so that queries on a specific 
date and generated_by would be faster.

The problem now is that the number of generated_by folders has increased to 
1000s. Every day we end up adding 1000s of partitions to Hive, so queries on 
analyze that span one month have slowed down.

Is there any way to get rid of the partitions and still maintain good 
performance for queries fired on a specific day and generated_by?
--
Regards,
Saumitra Shahapure



--
Nitin Pawar
