[
https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025107#comment-13025107
]
[email protected] commented on HIVE-2121:
-----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review558
-----------------------------------------------------------
trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
<https://reviews.apache.org/r/633/#comment1178>
The naming of this parameter is a little bit confusing: the parameter key
is called "randomnumber" but the value of it is a fixed number. Do you mean
this number is actually the seed to generate samples?
trunk/conf/hive-default.xml
<https://reviews.apache.org/r/633/#comment1179>
same as above.
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1195>
better add a comment for this function explain what the rationale behind
sampling splits.
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1192>
we should declare retLists as interface (List) rather than implementation
(ArrayList)
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1193>
same here, should declare it as Map rather than HashMap.
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/633/#comment1194>
can you add comments here?
trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1182>
can you add a comment on what this Map is used for and what are the key and
value of the Map?
trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1184>
I think we don't need to introduce this parameter at all. For one it is a
new feature rather than a different code path for an old feature. We don't need
the "fallback" protection by a new parameter. Secondly, throwing a
SemanticException here can only make the user asking how to solve this problem,
which is to set the parameter to true. So it seems that it doesn't make sense
to set the parameter to false in any cases. So why not remove the this
parameter to make it cleaner.
trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1189>
you may want to check the percentage number (if it is a valid double and
within the range [0,100]) and throw SemanticException if it is invalid before
creating a SplitSample object.
trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
<https://reviews.apache.org/r/633/#comment1190>
why limit cannot be combined with block sampling?
trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java
<https://reviews.apache.org/r/633/#comment1191>
This comment doesn't belong to this class.
- Ning
On 2011-04-20 18:28:29, Siying Dong wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/633/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-04-20 18:28:29)
bq.
bq.
bq. Review request for hive, Ning Zhang and namit jain.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. We need a better input sampling to serve at least two purposes:
bq. 1. test their queries against a smaller data set
bq. 2. understand more about how the data look like without scanning the whole
table.
bq. A simple function that gives a subset splits will help in those cases. It
doesn't have to be strict sampling.
bq.
bq. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which
samples input splits with size at least n% of the original inputs.
bq.
bq.
bq. This addresses bug HIVE-2121.
bq. https://issues.apache.org/jira/browse/HIVE-2121
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1095244
bq. trunk/conf/hive-default.xml 1095244
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
1095244
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java
1095244
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java
1095244
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java
1095244
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
1095244
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
1095244
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java
1095244
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1095244
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
1095244
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
1095244
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java
PRE-CREATION
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1095244
bq. trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q
PRE-CREATION
bq. trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q
PRE-CREATION
bq. trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out
PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out
PRE-CREATION
bq. trunk/ql/src/test/results/clientpositive/bucket1.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/bucket2.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/bucket3.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample1.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample10.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample2.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample3.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample4.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample5.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample6.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample7.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample8.q.out 1095244
bq. trunk/ql/src/test/results/clientpositive/sample9.q.out 1095244
bq.
bq. Diff: https://reviews.apache.org/r/633/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.
bq.
bq. Thanks,
bq.
bq. Siying
bq.
bq.
> Input Sampling By Splits
> ------------------------
>
> Key: HIVE-2121
> URL: https://issues.apache.org/jira/browse/HIVE-2121
> Project: Hive
> Issue Type: New Feature
> Reporter: Siying Dong
> Assignee: Siying Dong
> Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole
> table.
> A simple function that gives a subset splits will help in those cases. It
> doesn't have to be strict sampling.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira