Partitioning can greatly increase performance for WHERE clauses since hive can 
omit parsing the data in the partitions which do no meet the requirement.
For example if you partition by date (I do it by INT dateint, in which case I 
set dateint to be YYYYMMDD) and you do WHERE dateint >= 20120101 then it won't 
even have to touch any of the data from before 2012-01-01 and in my case that 
means I don't parse the last 2 years of data, reducing the time the query takes 
by about 70% :-)

Buckets are the second awesome way of getting a big optimization in, 
specifically for joins! If you have 2 tables you're joining onto each other 
then if they're both bucketed on their join column it will also greatly 
increase speed.
Another good join optimization is MAPJOIN, if one of the tables you're joining 
is rather small (below 30mb) then you can force it to MAPJOIN or you can enable 
automatic mapjoin, I personally prefere explicit behavory instead of automagic 
so use a hint:
SELECT /* +MAPJOIN(the_small_table) */ fields FROM table JOIN the_small_table, 
etc.
Sorted by is for sorting within buckets, only relevant if you're doing a lot of 
ordering I think.

I'm assuming sequencefiles are faster, but I wouldn't really know :( need 
someone else to tell us more about that ;)


-----Original Message-----
From: Avdeev V. M. [mailto:[email protected]] 
Sent: Monday, May 28, 2012 7:17 AM
To: [email protected]
Subject: table design and performance questions

Question from novice.

Where I can read table design best practices? I have a measure table with 
millions of rows and many dimension tables with less than 1000 rows each. I 
can't find out the way to get optimal design of both kind of tables. Is there 
performance tuning guides or performance FAQ?

Specifically
1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases using 
these statements make sense?
2) DDL language manual says 'This can improve performance on certain kinds of 
queries.' about CLUSTERED BY statement. What kind of queries can be improved?
3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of 
performance? What aspects should be taken into account when choosing a file 
format?
4) Compressed storage article says 'Keeping data compressed in Hive tables has, 
in some cases, known to give better performance that uncompressed storage;' and 
again - What is these cases? 

Thanks!
Vyacheslav

Reply via email to