Storing statistics of input dataset

Prasanth J Sun, 05 Aug 2012 01:13:15 -0700

Hello everyone

I came across this excellent post about storing column statistics in Hive:
http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/


Does Pig gather statistics similar to what Hive does? I think gathering such
statistics would be very helpful, not only for a cost-based optimizer but also
in other cases, such as knowing the row count or the histogram of the
underlying data. In my case, I am working on cube computation for holistic
measures, where I need to know the row count so that I can load a sample of the
data set to determine the partition factor for large groups. I am sure
gathering statistics and persisting them will help with other cases and
optimizations as well.

If I am right, Pig doesn't use cost-based estimation while optimizing the
logical plan; instead, I believe it uses rules of thumb (please correct me if I
am wrong). Having statistics about the datasets would help provide better
optimization (similar to the join optimization in the blog post). Any thoughts
about having such statistics in Pig and implementing an ANALYZE command for
gathering them?

Thanks
-- Prasanth Jayachandran
