Partitioning can greatly increase performance for WHERE clauses since hive can omit parsing the data in the partitions which do no meet the requirement. For example if you partition by date (I do it by INT dateint, in which case I set dateint to be YYYYMMDD) and you do WHERE dateint >= 20120101 then it won't even have to touch any of the data from before 2012-01-01 and in my case that means I don't parse the last 2 years of data, reducing the time the query takes by about 70% :-)
Buckets are the second awesome way of getting a big optimization in, specifically for joins! If you have 2 tables you're joining onto each other then if they're both bucketed on their join column it will also greatly increase speed. Another good join optimization is MAPJOIN, if one of the tables you're joining is rather small (below 30mb) then you can force it to MAPJOIN or you can enable automatic mapjoin, I personally prefere explicit behavory instead of automagic so use a hint: SELECT /* +MAPJOIN(the_small_table) */ fields FROM table JOIN the_small_table, etc. Sorted by is for sorting within buckets, only relevant if you're doing a lot of ordering I think. I'm assuming sequencefiles are faster, but I wouldn't really know :( need someone else to tell us more about that ;) -----Original Message----- From: Avdeev V. M. [mailto:[email protected]] Sent: Monday, May 28, 2012 7:17 AM To: [email protected] Subject: table design and performance questions Question from novice. Where I can read table design best practices? I have a measure table with millions of rows and many dimension tables with less than 1000 rows each. I can't find out the way to get optimal design of both kind of tables. Is there performance tuning guides or performance FAQ? Specifically 1) PARTITIONED BY, CLUSTERED BY, SORTED BY statements. In which cases using these statements make sense? 2) DDL language manual says 'This can improve performance on certain kinds of queries.' about CLUSTERED BY statement. What kind of queries can be improved? 3) What is preferable - SEQUENCEFILE, RCFILE or TEXTFILE - in terms of performance? What aspects should be taken into account when choosing a file format? 4) Compressed storage article says 'Keeping data compressed in Hive tables has, in some cases, known to give better performance that uncompressed storage;' and again - What is these cases? Thanks! Vyacheslav
