-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/6878/
-----------------------------------------------------------

(Updated Oct. 3, 2012, 7:16 p.m.)


Review request for hive and Carl Steinbach.


Changes
-------

This revision addresses the review comments from revision #3, in particular:

* Fixes the TODOs. One TODO is still outstanding: making the accuracy of the Flajolet-Martin sketch in NumDistinctValueEstimator.java a user-provided parameter.
* Fixes the formatting.
* Uses Java generics on the LHS, except in StatsSemanticAnalyzer.java. StatsSemanticAnalyzer.java inherits from BaseSemanticAnalyzer.java, and one of the methods that StatsSemanticAnalyzer overrides from BaseSemanticAnalyzer returns a HashSet instead of a Set, so this patch doesn't use generics on the LHS in that particular instance. Changing it is beyond the scope of this JIRA; I will be happy to do it as part of a cleanup JIRA.
* Replaces shortened variable names with long variable names.


Description
-------

This patch implements version 1 of the column statistics project in Hive. It adds support for computing and persisting a statistical summary of column values in Hive tables and partitions. To support column statistics in Hive, this patch does the following:

* Adds a new compute stats UDAF to compute scalar statistics for all primitive Hive data types. In version 1 of the project, we support the following scalar statistics on primitive types: estimate of the number of distinct values, number of null values, number of trues/falses for boolean typed columns, max and avg length for string and binary typed columns, and max and min value for long and double typed columns. Note that version 1 of the column stats project includes support for column statistics at both the table and partition level.
* Adds Metastore schema tables to persist the newly added statistics at both the table and partition level.
* Adds Metastore Thrift API to persist, retrieve and delete column statistics at both the table and partition level.
  Please refer to the following wiki link for the details of the schema and the Thrift API changes:
  https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
* Extends the analyze table compute statistics statement to trigger statistics computation and persistence for one or more columns. Please note that statistics for multiple columns are computed through a single scan of the table data. Please refer to the following wiki link for the syntax changes:
  https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive

One thing missing from the patch at this point is the metastore upgrade scripts for MySQL/Derby/Postgres/Oracle. I'm waiting for the review to finalize the metastore schema changes before I go ahead and add the upgrade scripts.

In a follow-on patch, as part of version 2 of the column statistics project, we will add support for computing, persisting and retrieving histograms on long and double typed column values.

Generated Thrift files have been removed from this diff to make it easier to review. The JIRA page has the patch with the generated Thrift files.


This addresses bug HIVE-1362.
    https://issues.apache.org/jira/browse/HIVE-1362


Diffs (updated)
-----

  data/files/UserVisits.dat PRE-CREATION
  data/files/binary.txt PRE-CREATION
  data/files/bool.txt PRE-CREATION
  data/files/double.txt PRE-CREATION
  data/files/employee.dat PRE-CREATION
  data/files/employee2.dat PRE-CREATION
  data/files/int.txt PRE-CREATION
  ivy/libraries.properties 7ac6778
  metastore/if/hive_metastore.thrift d4fad72
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java 8fec13d
  metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java 17b986c
  metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java 3883b5b
  metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java eff44b1
  metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java bf5ae3a
  metastore/src/java/org/apache/hadoop/hive/metastore/Warehouse.java 77d1caa
  metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java PRE-CREATION
  metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java PRE-CREATION
  metastore/src/model/package.jdo 38ce6d5
  metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java 528a100
  metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java 925938d
  ql/build.xml 5de3f78
  ql/if/queryplan.thrift 05fbf58
  ql/ivy.xml aa3b8ce
  ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsTask.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 425900d
  ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 4446952
  ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 79b87f1
  ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 7440889
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java 0b55ac4
  ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java 344dc69
  ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java f7257cd
  ql/src/java/org/apache/hadoop/hive/ql/parse/ExplainSemanticAnalyzer.java e75a075
  ql/src/java/org/apache/hadoop/hive/ql/parse/ExportSemanticAnalyzer.java 61bc7fd
  ql/src/java/org/apache/hadoop/hive/ql/parse/FunctionSemanticAnalyzer.java 6024dd4
  ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 356779a
  ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java 09ef969
  ql/src/java/org/apache/hadoop/hive/ql/parse/LoadSemanticAnalyzer.java 22fa20f
  ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java a0ccbe6
  ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java b38c002
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 5ce31f1
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java ad1a14c
  ql/src/java/org/apache/hadoop/hive/ql/parse/StatsSemanticAnalyzer.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsWork.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java cb54753
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java PRE-CREATION
  ql/src/test/queries/clientpositive/columnstats_partlvl.q PRE-CREATION
  ql/src/test/queries/clientpositive/columnstats_tbllvl.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_binary.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_boolean.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_double.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_long.q PRE-CREATION
  ql/src/test/queries/clientpositive/compute_stats_string.q PRE-CREATION
  ql/src/test/results/clientpositive/columnstats_partlvl.q.out PRE-CREATION
  ql/src/test/results/clientpositive/columnstats_tbllvl.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_binary.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_boolean.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_double.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_long.q.out PRE-CREATION
  ql/src/test/results/clientpositive/compute_stats_string.q.out PRE-CREATION
  ql/src/test/results/clientpositive/show_functions.q.out 02f6a94
  ql/src/test/results/clientpositive/udaf_histogram.q.out PRE-CREATION
  serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java 5430814

Diff: https://reviews.apache.org/r/6878/diff/


Testing
-------

All the existing Hive tests pass. Additionally, this patch adds the following unit tests:

* Tests in TestHiveMetaStore.java to test the Metastore schema and Thrift API changes.
* Tests to exercise the compute_stats UDAF for all primitive types.
* End-to-end tests at both the table and partition level for computing stats on multiple columns. Note that these tests use the extended syntax of the analyze command.


Thanks,

Shreepadma Venugopalan
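
Background on the distinct-value estimate mentioned under Changes and Description: the following is a minimal Python sketch of a single-pass, mergeable summary for a long typed column (null count, min, max, and a Flajolet-Martin style estimate of the number of distinct values). It is an illustration only and not the code in this patch. The class name LongColumnStats, the use of SHA-256 as the hash function, and the default of 64 bit vectors are all assumptions made for the example; NumDistinctValueEstimator.java may differ in each of these.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


class LongColumnStats:
    """Hypothetical single-pass summary of a long typed column."""

    def __init__(self, num_bit_vectors=64):
        self.num_nulls = 0
        self.min_value = None
        self.max_value = None
        # One bitmap per (seeded) hash function, stored as a Python int.
        self.bitmaps = [0] * num_bit_vectors

    @staticmethod
    def _hash(value, seed):
        # Assumption: any well-mixed 64-bit hash would do; SHA-256 is used
        # here only because it is in the standard library.
        digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, value):
        if value is None:
            self.num_nulls += 1
            return
        if self.min_value is None or value < self.min_value:
            self.min_value = value
        if self.max_value is None or value > self.max_value:
            self.max_value = value
        for seed in range(len(self.bitmaps)):
            h = self._hash(value, seed)
            # Position of the lowest set bit (number of trailing zeros).
            rho = (h & -h).bit_length() - 1 if h else 63
            self.bitmaps[seed] |= 1 << rho

    def merge(self, other):
        """Combine two partial summaries, e.g. from two splits of one scan."""
        if len(self.bitmaps) != len(other.bitmaps):
            raise ValueError("summaries must use the same number of bit vectors")
        self.num_nulls += other.num_nulls
        for v in (other.min_value, other.max_value):
            if v is not None:
                if self.min_value is None or v < self.min_value:
                    self.min_value = v
                if self.max_value is None or v > self.max_value:
                    self.max_value = v
        self.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]

    def estimate_distinct(self):
        if not any(self.bitmaps):
            return 0.0
        total = 0
        for bitmap in self.bitmaps:
            r = 0  # index of the lowest zero bit in this bitmap
            while bitmap & (1 << r):
                r += 1
            total += r
        return 2 ** (total / len(self.bitmaps)) / PHI


if __name__ == "__main__":
    stats = LongColumnStats()
    for v in [3, 7, 7, None, 42, 3]:
        stats.add(v)
    print(stats.num_nulls, stats.min_value, stats.max_value)
    print(round(stats.estimate_distinct(), 1))
```

Because the bitmaps are combined with a bitwise OR, merging partial summaries gives the same sketch as scanning all the values once, which is the property that lets an aggregate of this kind be computed in a single distributed scan. Using more bit vectors reduces the variance of the estimate at the cost of more hashing per value.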