> On Oct. 3, 2012, 11:50 a.m., namit jain wrote: > > Some questions: > > > > How much this interact with hive.stats.reliable ? > > > > There are many places with a TODO > > The formatting needs to be fixed in many places. > > Some functions are returning hashset etc. - they should be changed to > > return a set instead ? > > Can you make sure you use complete variable names -
This patch doesn't interact in any way with hive.stats.reliable. Will rename any variables with shortened name to use the full name. Will return generic java type instead of ArrayList, HashSet etc. Will fix formatting. There is only one place with a real TODO - the implementation of Flajolet-Martin Sketch. I was planning to fix the TODO by making the accuracy percentage a configurable parameter. The other places with a TODO are auto generated code which says "TODO - Auto Generated method". > On Oct. 3, 2012, 11:50 a.m., namit jain wrote: > > ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java, line 52 > > <https://reviews.apache.org/r/6878/diff/3/?file=173533#file173533line52> > > > > can you use full variable name instead of Rwt Will do. > On Oct. 3, 2012, 11:50 a.m., namit jain wrote: > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java, > > line 161 > > <https://reviews.apache.org/r/6878/diff/3/?file=173542#file173542line161> > > > > LHS should not be an arraylist > > > > Please fix all such occurences Will change any LHS occurrences to use Java generic type. - Shreepadma ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/6878/#review12131 ----------------------------------------------------------- On Oct. 3, 2012, 3:10 a.m., Shreepadma Venugopalan wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/6878/ > ----------------------------------------------------------- > > (Updated Oct. 3, 2012, 3:10 a.m.) > > > Review request for hive and Carl Steinbach. > > > Description > ------- > > This patch implements version 1 of the column statistics project in Hive. It > adds support for computing and persisting statistical summary of column > values in Hive Tables and Partitions. In order to support column statistics > in Hive, this patch does the following, > > * Adds a new compute stats UDAF to compute scalar statistics for all > primitive Hive data types. In version 1 of the project, we support the > following scalar statistics on primitive types - estimate of number of > distinct values, number of null values, number of trues/falses for boolean > typed columsn, max and avg length for string and binary typed columns, max > and min value for long and double typed columns. Note that version 1 of the > column stats project includes support for column statistics both at the table > and partition level. > > * Adds Metastore schema tables to persist the newly added statistics both at > table and partition level. > * Adds Metastore Thrift API to persist, retrieve and delete column statistics > at both table and partition level. > Please refer to the following wiki link for the details of the schema and the > Thrift API changes - > https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive > > * Extends the analyze table compute statistics statement to trigger > statistics computation and persistence for one or more columns. Please note > that statistics for multiple columns is computed through a single scan of the > table data. Please refer to the following wiki link for the syntax changes - > https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive > > One thing missing from the patch at this point is the metastore upgrade > scrips for MySQL/Derby/Postgres/Oracle. I'm waiting for the review to > finalize the metastore schema changes before I go ahead and add the upgrade > scripts. > > In a follow on patch, as part of version 2 of the column statistics project, > we will add support for computing, persisting and retrieving histograms on > long and double typed column values. > > Generated Thrift files have been removed for viewing pleasure. JIRA page has > the patch with the generated Thrift files. > > > This addresses bug HIVE-1362. > https://issues.apache.org/jira/browse/HIVE-1362 > > > Diffs > ----- > > data/files/UserVisits.dat PRE-CREATION > data/files/binary.txt PRE-CREATION > data/files/bool.txt PRE-CREATION > data/files/double.txt PRE-CREATION > data/files/employee.dat PRE-CREATION > data/files/employee2.dat PRE-CREATION > data/files/int.txt PRE-CREATION > ivy/libraries.properties 7ac6778 > metastore/if/hive_metastore.thrift d4fad72 > metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java > 8fec13d > > metastore/src/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java > 17b986c > metastore/src/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java > 3883b5b > metastore/src/java/org/apache/hadoop/hive/metastore/ObjectStore.java > eff44b1 > metastore/src/java/org/apache/hadoop/hive/metastore/RawStore.java bf5ae3a > metastore/src/java/org/apache/hadoop/hive/metastore/Warehouse.java 77d1caa > > metastore/src/model/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java > PRE-CREATION > > metastore/src/model/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java > PRE-CREATION > metastore/src/model/package.jdo 38ce6d5 > > metastore/src/test/org/apache/hadoop/hive/metastore/DummyRawStoreForJdoConnection.java > 528a100 > metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java > 925938d > ql/build.xml 5de3f78 > ql/if/queryplan.thrift 05fbf58 > ql/ivy.xml aa3b8ce > ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnStatsTask.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java 425900d > ql/src/java/org/apache/hadoop/hive/ql/exec/MapRedTask.java 4c8831f > ql/src/java/org/apache/hadoop/hive/ql/exec/Task.java 4446952 > ql/src/java/org/apache/hadoop/hive/ql/exec/TaskFactory.java 79b87f1 > ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java 7440889 > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/index/RewriteParseContextGenerator.java > 0b55ac4 > ql/src/java/org/apache/hadoop/hive/ql/parse/BaseSemanticAnalyzer.java > 344dc69 > ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java > f7257cd > ql/src/java/org/apache/hadoop/hive/ql/parse/ExplainSemanticAnalyzer.java > e75a075 > ql/src/java/org/apache/hadoop/hive/ql/parse/ExportSemanticAnalyzer.java > 61bc7fd > ql/src/java/org/apache/hadoop/hive/ql/parse/FunctionSemanticAnalyzer.java > 6024dd4 > ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 356779a > ql/src/java/org/apache/hadoop/hive/ql/parse/ImportSemanticAnalyzer.java > 09ef969 > ql/src/java/org/apache/hadoop/hive/ql/parse/LoadSemanticAnalyzer.java > 22fa20f > ql/src/java/org/apache/hadoop/hive/ql/parse/QB.java a0ccbe6 > ql/src/java/org/apache/hadoop/hive/ql/parse/QBParseInfo.java b38c002 > ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 5ce31f1 > ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzerFactory.java > ad1a14c > ql/src/java/org/apache/hadoop/hive/ql/parse/StatsSemanticAnalyzer.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsDesc.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/plan/ColumnStatsWork.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/plan/HiveOperation.java cb54753 > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/DoubleNumDistinctValueEstimator.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/LongNumDistinctValueEstimator.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java > PRE-CREATION > > ql/src/java/org/apache/hadoop/hive/ql/udf/generic/StringNumDistinctValueEstimator.java > PRE-CREATION > ql/src/test/queries/clientpositive/columnstats_partlvl.q PRE-CREATION > ql/src/test/queries/clientpositive/columnstats_tbllvl.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_binary.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_boolean.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_double.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_long.q PRE-CREATION > ql/src/test/queries/clientpositive/compute_stats_string.q PRE-CREATION > ql/src/test/results/clientpositive/columnstats_partlvl.q.out PRE-CREATION > ql/src/test/results/clientpositive/columnstats_tbllvl.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_binary.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_boolean.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_double.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_long.q.out PRE-CREATION > ql/src/test/results/clientpositive/compute_stats_string.q.out PRE-CREATION > ql/src/test/results/clientpositive/show_functions.q.out 02f6a94 > ql/src/test/results/clientpositive/udaf_histogram.q.out PRE-CREATION > > serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java > 5430814 > > Diff: https://reviews.apache.org/r/6878/diff/ > > > Testing > ------- > > All the existing hive tests pass. Additionally this patch adds the following > unit tests, > > * Tests to TestHiveMetaStore.java to test the Metastore schema and Thrift API > changes, > * Tests to exercise compute_stats UDAF for all primitive types, > * End to end test both at table and partition level for computing stats on > multiple columns. Note that these tests use the extended syntax of the > analyze command. > > > Thanks, > > Shreepadma Venugopalan > >