-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6878/
-----------------------------------------------------------

(Updated Oct. 3, 2012, 7:16 p.m.)


Review request for hive and Carl Steinbach.


Changes
-------

This revision addresses the review comments from revision#3, particularly the 
following,

* Fixes the TODOs. There is still one outstanding TODO - make the accuracy a 
user provided parameter for Flajolet-Martin sketch in 
NumDisinctValueEstimator.java
* Fixes the formatting
* Uses java generics on LHS except in StatsSemanticAnalyzer.java. 
StatsSemanticAnalyzer.java inherits from BaseSemanticAnalyzer.java and one of 
methods StatsSemanticAnalyzer over rides from BaseSemanticAnalyzer returns a 
HashSet instead of a Set. This patch doesn't use generics on the LHS in this 
particular instance. This is beyond the scope of this JIRA, will be happy to do 
it as part of a cleanup JIRA.
* Replaces shortened variable names with long variable names


Description
-------

This patch implements version 1 of the column statistics project in Hive. It 
adds support for computing and persisting statistical summary of column values 
in Hive Tables and Partitions. In order to support column statistics in Hive, 
this patch does the following,

* Adds a new compute stats UDAF to compute scalar statistics for all primitive 
Hive data types. In version 1 of the project, we support the following scalar 
statistics on primitive types - estimate of number of distinct values, number 
of null values, number of trues/falses for boolean typed columsn, max and avg 
length for string and binary typed columns, max and min value for long and 
double typed columns. Note that version 1 of the column stats project includes 
support for column statistics both at the table and partition level.

* Adds Metastore schema tables to persist the newly added statistics both at 
table and partition level.
* Adds Metastore Thrift API to persist, retrieve and delete column statistics 
at both table and partition level. 
Please refer to the following wiki link for the details of the schema and the 
Thrift API changes - 
https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive

* Extends the analyze table compute statistics statement to trigger statistics 
computation and persistence for one or more columns. Please note that 
statistics for multiple columns is computed through a single scan of the table 
data. Please refer to the following wiki link for the syntax changes - 
https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive

One thing missing from the patch at this point is the metastore upgrade scrips 
for MySQL/Derby/Postgres/Oracle. I'm waiting for the review to finalize the 
metastore schema changes before I go ahead and add the upgrade scripts.

In a follow on patch, as part of version 2 of the column statistics project, we 
will add support for computing, persisting and retrieving histograms on long 
and double typed column values.

Generated Thrift files have been removed for viewing pleasure. JIRA page has 
the patch with the generated Thrift files.


This addresses bug HIVE-1362.
    https://issues.apache.org/jira/browse/HIVE-1362


Diffs (updated)
-----

  data/files/UserVisits.dat PRE-CREATION 
  data/files/binary.txt PRE-CREATION 
  data/files/bool.txt PRE-CREATION 
  data/files/double.txt PRE-CREATION 
  data/files/employee.dat PRE-CREATION 
  data/files/employee2.dat PRE-CREATION 
  data/files/int.txt PRE-CREATION 
  ivy/libraries.properties 7ac6778 
  metastore/if/hive_metastore.thrift d4fad72 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 
8fec13d 
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java 
17b986c 
  metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java 
3883b5b 
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java eff44b1 
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java bf5ae3a 
  metastore/src/java/org/apache/hadoop/hive/metastore/Warehouse.java 77d1caa 
  
metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java
 PRE-CREATION 
  
metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java
 PRE-CREATION 
  metastore/src/model/package.jdo 38ce6d5 
  
metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java
 528a100 
  metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java 
925938d 
  ql/build.xml 5de3f78 
  ql/if/queryplan.thrift 05fbf58 
  ql/ivy.xml aa3b8ce 
  ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsTask.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 425900d 
  ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 4446952 
  ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 79b87f1 
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 7440889 
  
ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java
 0b55ac4 
  ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 344dc69 
  ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java f7257cd 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ExplainSemanticAnalyzer.java 
e75a075 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ExportSemanticAnalyzer.java 
61bc7fd 
  ql/src/java/org/apache/hadoop/hive/ql/parse/FunctionSemanticAnalyzer.java 
6024dd4 
  ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 356779a 
  ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 
09ef969 
  ql/src/java/org/apache/hadoop/hive/ql/parse/LoadSemanticAnalyzer.java 22fa20f 
  ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java a0ccbe6 
  ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java b38c002 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 5ce31f1 
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java 
ad1a14c 
  ql/src/java/org/apache/hadoop/hive/ql/parse/StatsSemanticAnalyzer.java 
PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsWork.java PRE-CREATION 
  ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java cb54753 
  
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java
 PRE-CREATION 
  
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java 
PRE-CREATION 
  
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java
 PRE-CREATION 
  
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java
 PRE-CREATION 
  
ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java
 PRE-CREATION 
  ql/src/test/queries/clientpositive/columnstats_partlvl.q PRE-CREATION 
  ql/src/test/queries/clientpositive/columnstats_tbllvl.q PRE-CREATION 
  ql/src/test/queries/clientpositive/compute_stats_binary.q PRE-CREATION 
  ql/src/test/queries/clientpositive/compute_stats_boolean.q PRE-CREATION 
  ql/src/test/queries/clientpositive/compute_stats_double.q PRE-CREATION 
  ql/src/test/queries/clientpositive/compute_stats_long.q PRE-CREATION 
  ql/src/test/queries/clientpositive/compute_stats_string.q PRE-CREATION 
  ql/src/test/results/clientpositive/columnstats_partlvl.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/columnstats_tbllvl.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/compute_stats_binary.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/compute_stats_boolean.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/compute_stats_double.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/compute_stats_long.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/compute_stats_string.q.out PRE-CREATION 
  ql/src/test/results/clientpositive/show_functions.q.out 02f6a94 
  ql/src/test/results/clientpositive/udaf_histogram.q.out PRE-CREATION 
  
serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java
 5430814 

Diff: https://reviews.apache.org/r/6878/diff/


Testing
-------

All the existing hive tests pass. Additionally this patch adds the following 
unit tests,

* Tests to TestHiveMetaStore.java to test the Metastore schema and Thrift API 
changes,
* Tests to exercise compute_stats UDAF for all primitive types,
* End to end test both at table and partition level for computing stats on 
multiple columns. Note that these tests use the extended syntax of the analyze 
command.


Thanks,

Shreepadma Venugopalan

Reply via email to