Hi,

Can you please send the output of

    DESC FORMATTED <TABLE_NAME>

after running (if you have not done so already)

    ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS

for both tables?

HTH,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 23 June 2016 at 23:49, Lalitha MV <lalitham...@gmail.com> wrote:

> Hi,
>
> I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1.
>
> I created a Hive table from a text file of size ~141 MB.
> show tblproperties of this table (textfile):
>
>     numFiles       1
>     numRows        1000000
>     rawDataSize    141869803
>     totalSize      142869803
>
> I then created a Hive table with ORC compression from the above table.
> The compressed file size is ~50 MB.
>
> show tblproperties for the new table (orc):
>
>     numFiles       1
>     numRows        1000000
>     rawDataSize    471000000
>     totalSize      50444668
>
> I have two questions regarding this:
>
> 1. Why is the rawDataSize so high in the case of the ORC table (3.3 times
> the text file size)? How is the rawDataSize calculated in this case? (Is it
> the sum of the datatype sizes of the columns, multiplied by numRows?)
>
> 2. In Hive query plans, the estimated data size of the tables in each
> phase (map and reduce) is equal to the rawDataSize. The number of
> reducers gets calculated from this size (at least in Tez, though not in
> the case of MR). Isn't this wrong? Shouldn't it pick the totalSize
> instead? Is there a way to force Hive/Tez to pick the totalSize in query
> plans, or at least while calculating the number of reducers?
>
> Thanks in advance.
>
> Cheers,
> Lalitha
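As a quick sanity check on the figures quoted above, the ratios can be worked out directly from the two SHOW TBLPROPERTIES listings. This is only a sketch of the arithmetic: the variable names are made up for illustration, and the interpretations in the comments are assumptions rather than confirmed Hive behaviour.

```python
# Figures copied from the SHOW TBLPROPERTIES output in the quoted message.
NUM_ROWS = 1_000_000
TEXT_RAW_DATA_SIZE = 141_869_803   # textfile rawDataSize, bytes
TEXT_TOTAL_SIZE = 142_869_803      # textfile totalSize, bytes
ORC_RAW_DATA_SIZE = 471_000_000    # ORC rawDataSize, bytes
ORC_TOTAL_SIZE = 50_444_668        # ORC totalSize, bytes

# How much larger the ORC rawDataSize is than the textfile rawDataSize.
raw_ratio = ORC_RAW_DATA_SIZE / TEXT_RAW_DATA_SIZE

# Average rawDataSize per row for each table.
orc_bytes_per_row = ORC_RAW_DATA_SIZE / NUM_ROWS
text_bytes_per_row = TEXT_RAW_DATA_SIZE / NUM_ROWS

# On-disk size of the ORC table relative to the textfile table.
on_disk_ratio = ORC_TOTAL_SIZE / TEXT_TOTAL_SIZE

# Gap between totalSize and rawDataSize for the textfile table.
# Assumption: a gap of exactly one byte per row may be a line terminator.
text_gap_per_row = (TEXT_TOTAL_SIZE - TEXT_RAW_DATA_SIZE) / NUM_ROWS

print(f"ORC rawDataSize / text rawDataSize: {raw_ratio:.2f}")
print(f"ORC rawDataSize per row:            {orc_bytes_per_row} bytes")
print(f"text rawDataSize per row:           {text_bytes_per_row:.1f} bytes")
print(f"ORC totalSize / text totalSize:     {on_disk_ratio:.2f}")
print(f"text totalSize gap per row:         {text_gap_per_row} bytes")
```

The ORC rawDataSize works out to a whole number of bytes per row (471), which would be consistent with a fixed per-row estimate derived from the column types, as suggested in question 1. The DESC FORMATTED output requested above should help confirm or rule that out.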