[
https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026204#comment-13026204
]
[email protected] commented on HIVE-2121:
-----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/
-----------------------------------------------------------
(Updated 2011-04-28 08:32:17.534107)
Review request for hive, Ning Zhang and namit jain.
Changes
-------
Two changes made in response to Namit's comments:
1. EXPLAIN now prints some information about the sampling. (This might not be
the best way to print it, but it follows the existing framework.)
2. The granularity of sampling is reduced from split level to HDFS block level.
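One way to picture the move to block-level granularity is the following minimal sketch. It is not the code from the patch: the function name, its parameters, and the fixed block size are hypothetical, chosen only to illustrate trimming the last selected split to whole blocks instead of taking it entirely.

```python
def bytes_to_keep(last_split_size: int, remaining: int, block_size: int) -> int:
    """Bytes to read from the final selected split (illustrative only).

    Assumption: every block has the same size `block_size`, and `remaining`
    is how many bytes are still needed to reach the sampling target.
    """
    if remaining <= 0:
        return 0
    # Round the remaining bytes up to a whole number of blocks.
    blocks_needed = -(-remaining // block_size)  # ceiling division
    # Never read past the end of the split itself.
    return min(last_split_size, blocks_needed * block_size)
```

With split-level granularity the whole final split would be read; with this kind of trimming, only enough blocks to reach the target are kept.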
Summary
-------
We need better input sampling to serve at least two purposes:
1. let users test their queries against a smaller data set
2. understand more about what the data looks like without scanning the whole
table.
A simple function that returns a subset of the splits will help in those cases.
It doesn't have to be strict sampling.
This diff allows the syntax .. table TABLESAMPLE(n PERCENT), which samples
input splits whose total size is at least n% of the original input.
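The "at least n%" behaviour can be sketched as a simple greedy selection over split sizes. This is a minimal illustration under stated assumptions, not the implementation in the diff: the function name, the in-order traversal, and the (0, 100] range check are all hypothetical.

```python
from typing import List


def sample_splits(split_sizes: List[int], percent: float) -> List[int]:
    """Return indices of splits whose combined size first reaches
    at least `percent` of the total input size (illustrative only)."""
    # Assumption: valid percentages lie in (0, 100].
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")

    total = sum(split_sizes)
    target = total * percent / 100.0

    chosen = []
    covered = 0
    for index, size in enumerate(split_sizes):
        # Stop as soon as the selected splits cover the target size.
        if covered >= target:
            break
        chosen.append(index)
        covered += size
    return chosen
```

Because whole splits are taken, the selected size can overshoot the target, which matches the "doesn't have to be strict sampling" goal. In a query this would be written along the lines of SELECT * FROM t TABLESAMPLE(10 PERCENT), where t is a placeholder table name.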
This addresses bug HIVE-2121.
https://issues.apache.org/jira/browse/HIVE-2121
Diffs (updated)
-----
trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852
trunk/conf/hive-default.xml 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852
trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852
trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION
trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION
trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852
trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852
trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852
trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852
trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852
trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852
trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852
trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852
Diff: https://reviews.apache.org/r/633/diff
Testing
-------
TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.
Thanks,
Siying
> Input Sampling By Splits
> ------------------------
>
> Key: HIVE-2121
> URL: https://issues.apache.org/jira/browse/HIVE-2121
> Project: Hive
> Issue Type: New Feature
> Reporter: Siying Dong
> Assignee: Siying Dong
> Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch,
> HIVE-2121.4.patch, HIVE-2121.5.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. let users test their queries against a smaller data set
> 2. understand more about what the data looks like without scanning the whole
> table.
> A simple function that returns a subset of the splits will help in those cases.
> It doesn't have to be strict sampling.