[ https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025498#comment-13025498 ]
[email protected] commented on HIVE-2121:
-----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review567
-----------------------------------------------------------
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1205>
This function could be quite expensive when called inside the loops. You may
want to test a case with a large number of partitions, each containing a large
number of small files.
- Ning
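One way to set up the stress case described above is to generate a synthetic directory tree with many partition directories, each holding many tiny files. The sketch below is only an illustration: the class name SmallFilesFixture, the "part=<n>" directory names and the file contents are hypothetical, and it writes to the local file system rather than to HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical test fixture; not part of the patch under review. */
final class SmallFilesFixture {

    /**
     * Creates `partitions` directories under `root`, each containing
     * `filesPerPartition` one-line files, and returns the number of files written.
     */
    static long create(Path root, int partitions, int filesPerPartition) throws IOException {
        long created = 0;
        for (int p = 0; p < partitions; p++) {
            // One directory per partition, named in key=value style.
            Path dir = Files.createDirectories(root.resolve("part=" + p));
            for (int f = 0; f < filesPerPartition; f++) {
                byte[] row = ("row-" + p + "-" + f + "\n").getBytes(StandardCharsets.UTF_8);
                Files.write(dir.resolve("file_" + f + ".txt"), row);
                created++;
            }
        }
        return created;
    }

    /** Convenience wrapper that builds the tree in a fresh temporary directory. */
    static long createInTempDir(int partitions, int filesPerPartition) {
        try {
            Path root = Files.createTempDirectory("many-small-files");
            return create(root, partitions, filesPerPartition);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Loading such a tree into a partitioned table and timing split generation with and without the change would show whether the per-file cost becomes noticeable as the file count grows.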
On 2011-04-26 21:19:18, Siying Dong wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/633/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-04-26 21:19:18)
bq.
bq.
bq. Review request for hive, Ning Zhang and namit jain.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. We need better input sampling to serve at least two purposes:
bq. 1. let users test their queries against a smaller data set
bq. 2. understand what the data looks like without scanning the whole table.
bq. A simple function that returns a subset of the splits will help in those cases. It does not have to be strict sampling.
bq.
bq. This diff adds the syntax .. table TABLESAMPLE(n PERCENT), which samples input splits whose total size is at least n% of the original input.
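The "at least n%" rule can be sketched as follows. This is a minimal illustration, not the SplitSample code from the patch: the class and method names are hypothetical, splits are taken in the order given, and the accepted range for n (greater than 0, at most 100) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not the SplitSample class from the patch. */
final class SplitSampleSketch {

    /**
     * Returns the indices of a prefix of the splits whose combined size
     * is at least `percent`% of the total input size.
     *
     * @param splitSizes size in bytes of each input split, in the given order
     * @param percent    requested percentage; the range (0, 100] is assumed
     */
    static List<Integer> sampleSplits(long[] splitSizes, double percent) {
        if (percent <= 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in (0, 100]: " + percent);
        }
        long total = 0;
        for (long size : splitSizes) {
            total += size;
        }
        // Number of bytes the sample has to cover.
        double target = total * percent / 100.0;

        List<Integer> selected = new ArrayList<>();
        long covered = 0;
        // Keep adding whole splits until the target is reached.
        for (int i = 0; i < splitSizes.length && covered < target; i++) {
            selected.add(i);
            covered += splitSizes[i];
        }
        return selected;
    }
}
```

Because whole splits are selected, the sample can overshoot the requested percentage, which matches the summary's remark that this does not have to be strict sampling.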
bq.
bq.
bq. This addresses bug HIVE-2121.
bq. https://issues.apache.org/jira/browse/HIVE-2121
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852
bq. trunk/conf/hive-default.xml 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852
bq. trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852
bq.
bq. Diff: https://reviews.apache.org/r/633/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.
bq.
bq.
bq. Thanks,
bq.
bq. Siying
bq.
bq.
> Input Sampling By Splits
> ------------------------
>
> Key: HIVE-2121
> URL: https://issues.apache.org/jira/browse/HIVE-2121
> Project: Hive
> Issue Type: New Feature
> Reporter: Siying Dong
> Assignee: Siying Dong
> Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. let users test their queries against a smaller data set
> 2. understand what the data looks like without scanning the whole table.
> A simple function that returns a subset of the splits will help in those cases. It
> does not have to be strict sampling.