> On Nov. 25, 2014, 1:23 a.m., Aman Sinha wrote:
> > contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScan.java, line 93
> > <https://reviews.apache.org/r/28417/diff/3/?file=775081#file775081line93>
> >
> >     I still think this needs to be initialized and not depend on getSplits() since obviously after your latest changes, the rowCount property is not assumed to be available. Also, see my later comment about distinguishing between an empty table (0 rowcount) and one where stats is not available.

The problem is that, at least with Hive 0.12.0, when a table has rows but its statistics haven't been computed yet, the "numRows" property is still present and contains the value 0. So numRows=0 doesn't actually tell us much about the size of the table, which is why leaving rowCount initialized to 0 is correct.
According to the Hive [documentation](https://cwiki.apache.org/confluence/display/Hive/StatsDev#StatsDev-NewlyCreatedTables), tables should always have their statistics computed, so a missing "numRows" property looks more like a bug in Hive 0.12.0 (I will try a more recent version of Hive to see if the problem persists).


> On Nov. 25, 2014, 1:23 a.m., Aman Sinha wrote:
> > contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScan.java, lines 298-300
> > <https://reviews.apache.org/r/28417/diff/3/?file=775081#file775081line298>
> >
> >     This is not necessarily true; if you have empty tables, the rowcount will be 0. So you need to distinguish between the case where the stats are not available (maybe use -1 as an indicator) from the case where it is available and has 0 rowcount.

The problem is that numRows=0 in the stats can actually mean the stats have not been computed yet, so we still need to estimate the row count using the size of the input splits. I ran some tests using empty tables, and the estimated row count for those tables is 0 too, so it's correct.


- abdelhakim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28417/#review62916
-----------------------------------------------------------


On Nov. 25, 2014, 12:56 a.m., abdelhakim deneche wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail.
> To reply, visit:
> https://reviews.apache.org/r/28417/
> -----------------------------------------------------------
>
> (Updated Nov. 25, 2014, 12:56 a.m.)
>
>
> Review request for drill.
>
>
> Bugs: DRILL-1742
>     https://issues.apache.org/jira/browse/DRILL-1742
>
>
> Repository: drill-git
>
>
> Description
> -------
>
> HiveScan.getSplits() already gets the table and partitions metadata using MetaStoreUtils.
> We compute the total number of rows using the numRows property and store the computed number of rows in rowCount attribute which is later returned by getScanStats().
>
>
> Diffs
> -----
>
>   contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveScan.java ddbc100
>
> Diff: https://reviews.apache.org/r/28417/diff/
>
>
> Testing
> -------
>
> created several partitioned and non-partitioned tables, loaded data in hive.
>
> used explain plan to check the number of rows when the whole table is queried and also when specific partitions are queried (to make sure the row count takes hive partition pruning into account)
>
>
> Thanks,
>
> abdelhakim deneche
>
>
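
For readers following the thread, the fallback discussed above could be sketched roughly as below. This is not the code from the patch: the class name `RowCountSketch`, both method names, and the `ASSUMED_BYTES_PER_ROW` constant are hypothetical, and the thread does not say how the split sizes are converted into a row estimate. Only the "numRows" property name and the rule "treat 0 as unknown and fall back to the input split sizes" come from the discussion.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and the bytes-per-row constant are assumptions.
final class RowCountSketch {
    // Hypothetical average row width, used only to turn bytes into a row estimate.
    static final long ASSUMED_BYTES_PER_ROW = 100L;

    // Reads "numRows" from table or partition parameters.
    // Returns 0 when the property is missing, unparsable, or negative.
    static long numRowsFromStats(Map<String, String> parameters) {
        String value = parameters.get("numRows");
        if (value == null) {
            return 0L;
        }
        try {
            return Math.max(Long.parseLong(value.trim()), 0L);
        } catch (NumberFormatException e) {
            return 0L;
        }
    }

    // Uses the stats when they report a positive row count. A value of 0 is
    // treated as "stats may not have been computed", so the estimate falls
    // back to the total size of the input splits.
    static long estimateRowCount(Map<String, String> parameters, List<Long> splitLengths) {
        long fromStats = numRowsFromStats(parameters);
        if (fromStats > 0) {
            return fromStats;
        }
        long totalBytes = 0L;
        for (long length : splitLengths) {
            totalBytes += length;
        }
        return totalBytes / ASSUMED_BYTES_PER_ROW;
    }
}
```

With this shape, a genuinely empty table has no bytes in its splits, so the fallback also yields 0, which matches the behavior reported for the empty-table tests.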
