Re: Storing statistics of input dataset

Alan Gates Mon, 06 Aug 2012 10:36:05 -0700

Pig does not have a metadata store, so it doesn't store statistics on data.  
However, through HCatalog it will have access to the same statistics that Hive 
stores.


As far as using this data to optimize Pig operations, I'd like to rework the 
backend to start taking advantage of such statistics when available (either 
from metadata like this or statistics that are generated on the fly as scripts 
are executed).  I also hope to share as much of this work as possible with Hive 
so that both can benefit.

Alan.

On Aug 5, 2012, at 1:12 AM, Prasanth J wrote:

> Hello everyone
> 
> Came across this excellent post about storing column statistics in Hive 
> http://www.cloudera.com/blog/2012/08/column-statistics-in-hive/
> 
> Does Pig gather statistics similar to what Hive does? I think gathering such 
> statistics would be very helpful, not only for a cost-based optimizer but also 
> in other cases, such as knowing the row count or the histogram of the 
> underlying data. In my case, I am working on cube computation for holistic 
> measures, where I need to know the row count; based on it, I can load a 
> sample data set to determine the partition factor for large groups. I am sure 
> gathering and persisting statistics will help in other cases and 
> optimizations as well.
> 
> If I am right, Pig doesn't use cost-based estimation when optimizing the 
> logical plan; instead, I believe it uses rules of thumb (please correct me if 
> I am wrong). Having statistics about the datasets would help provide better 
> optimization (similar to the join optimization in the blog post). Any 
> thoughts about having such statistics in Pig and implementing an ANALYZE 
> command for gathering statistics?
> 
> Thanks
> -- Prasanth Jayachandran
>
