Re: Review Request 28417: DRILL-1742 Use Hive stats when planning queries on Hive data sources

abdelhakim deneche Tue, 25 Nov 2014 07:20:02 -0800


> On Nov. 25, 2014, 1:23 a.m., Aman Sinha wrote:
> > contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScan.java,
> >  line 93
> > <https://reviews.apache.org/r/28417/diff/3/?file=775081#file775081line93>
> >
> >     I still think this needs to be initialized and not depend on 
> > getSplits() since obviously after your latest changes, the rowCount 
> > property is not assumed to be available.  Also, see my later comment about 
> > distinguishing between an empty table (0 rowcount) and one where stats is 
> > not available.

The problem is that, at least with Hive 0.12.0, when a table has rows but its 
statistics haven't been computed yet, the "numRows" property is still present 
and contains the value 0. So numRows=0 doesn't actually tell us much about the 
size of the table, which is why leaving rowCount initialized to 0 is correct.

It should be noted that, according to the Hive 
[documentation](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-NewlyCreatedTables),
tables should always have their statistics computed, so an uncomputed "numRows" 
(reported as 0) looks more like a bug in Hive 0.12.0 (I will try a more recent 
version of Hive to see if the problem persists).

> On Nov. 25, 2014, 1:23 a.m., Aman Sinha wrote:
> > contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScan.java,
> >  lines 298-300
> > <https://reviews.apache.org/r/28417/diff/3/?file=775081#file775081line298>
> >
> >     This is not necessarily true;  if you have empty tables, the rowcount 
> > will be 0. So you need to distinguish between the case where the stats are 
> > not available (maybe use -1 as an indicator) from the case where it is 
> > available and has 0 rowcount.

The problem is that numRows=0 in the stats can actually mean the stats have 
not been computed yet, so we still need to estimate the row count using the 
size of the input splits.
I ran some tests with empty tables, and the estimated row count for those 
tables is 0 too, so the result is correct.
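
To make the fallback concrete, here is a rough sketch of the logic described 
above. It is not the code in the patch: the class name, the method name, and 
the ASSUMED_BYTES_PER_ROW constant are all hypothetical, and the way the real 
HiveScan turns split sizes into a row estimate may differ.

```java
import java.util.List;

// Hypothetical illustration only; not the actual HiveScan implementation.
final class RowCountEstimate {

    // Assumed average row width in bytes, used only by the size-based fallback.
    static final long ASSUMED_BYTES_PER_ROW = 100L;

    /**
     * Returns the summed "numRows" stats when they are positive. A total of 0
     * is treated as "stats not computed" and the count is estimated from the
     * total size of the input splits instead.
     */
    static long estimateRowCount(List<Long> numRowsPerPartition,
                                 List<Long> splitSizesInBytes) {
        long statsRows = 0;
        for (Long numRows : numRowsPerPartition) {
            if (numRows != null && numRows > 0) {
                statsRows += numRows;
            }
        }
        if (statsRows > 0) {
            return statsRows; // stats are present and usable
        }

        long totalBytes = 0;
        for (long size : splitSizesInBytes) {
            totalBytes += size;
        }
        // Ceiling division: any non-empty input yields at least one row,
        // while an empty table (0 bytes) still yields 0.
        return (totalBytes + ASSUMED_BYTES_PER_ROW - 1) / ASSUMED_BYTES_PER_ROW;
    }
}
```

With this shape, an empty table ends up with an estimate of 0 through either 
path, which is why a separate -1 marker is not strictly needed to get a 
correct result for empty tables.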

- abdelhakim

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28417/#review62916
-----------------------------------------------------------

On Nov. 25, 2014, 12:56 a.m., abdelhakim deneche wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28417/
> -----------------------------------------------------------
> 
> (Updated Nov. 25, 2014, 12:56 a.m.)
> 
> 
> Review request for drill.
> 
> 
> Bugs: DRILL-1742
>     https://issues.apache.org/jira/browse/DRILL-1742
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> HiveScan.getSplits() already gets the table and partitions metadata using 
> MetaStoreUtils.
> We compute the total number of rows using the numRows property and store the 
> computed number of rows in rowCount attribute which is later returned by 
> getScanStats().
> 
> 
> Diffs
> -----
> 
>   
> contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScan.java
>  ddbc100 
> 
> Diff: https://reviews.apache.org/r/28417/diff/
> 
> 
> Testing
> -------
> 
> created several partitioned and non-partitioned tables, loaded data in hive.
> 
> used explain plan to check the number of rows when the whole table is queried 
> and also when specific partitions are queried (to make sure the row count 
> takes hive partition pruning into account)
> 
> 
> Thanks,
> 
> abdelhakim deneche
> 
>
