[ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022277#comment-13022277 ]
jirapos...@reviews.apache.org commented on HIVE-2121:
-----------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/
-----------------------------------------------------------

Review request for hive, Ning Zhang and namit jain.


Summary
-------

We need better input sampling to serve at least two purposes:
1. let users test their queries against a smaller data set
2. understand more about what the data look like without scanning the whole table.
A simple function that gives a subset of the splits will help in those cases. It doesn't have to be strict sampling.

This diff allows the syntax .. table TABLESAMPLE(n PERCENT), which samples input splits whose total size is at least n% of the original input.


This addresses bug HIVE-2121.
    https://issues.apache.org/jira/browse/HIVE-2121


Diffs
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244
  trunk/conf/hive-default.xml 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1095244
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244
  trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION
  trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
  trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
  trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION
  trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
  trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244
  trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244
  trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244
  trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244

Diff: https://reviews.apache.org/r/633/diff


Testing
-------

TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.


Thanks,

Siying


> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. let users test their queries against a smaller data set
> 2. understand more about what the data look like without scanning the whole
> table.
> A simple function that gives a subset of the splits will help in those cases. It
> doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
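
For readers who want a concrete picture of the TABLESAMPLE(n PERCENT) behaviour described in the summary above, here is a minimal sketch of "pick whole splits until at least n% of the input size is covered". It is not the code from the patch: the function name sample_splits, the list-of-sizes input, and the choice to take splits in their given order are all assumptions made for illustration.

```python
from typing import List, Sequence


def sample_splits(split_sizes: Sequence[int], percent: float) -> List[int]:
    """Return indices of splits whose combined size is at least percent% of the total.

    Hypothetical helper, not Hive's API. Splits are taken whole and in the
    order given, so the result can overshoot the requested percentage.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")

    total = sum(split_sizes)
    target = total * percent / 100.0

    chosen: List[int] = []
    covered = 0
    for index, size in enumerate(split_sizes):
        # Stop as soon as the splits picked so far reach the target size.
        if covered >= target:
            break
        chosen.append(index)
        covered += size
    return chosen
```

With four splits of 100 units each, sample_splits([100, 100, 100, 100], 30) selects the first two splits: one split covers only 25% of the input, so a second whole split is added and 50% is read. This matches the "at least n%" wording and the note that the sampling does not have to be strict.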