[
https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025468#comment-13025468
]
[email protected] commented on HIVE-2121:
-----------------------------------------------------
bq. On 2011-04-26 20:50:30, Siying Dong wrote:
bq. > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java,
line 498
bq. > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
bq. >
bq. > I feel like it is a little hard to explain what this sample
guarantees. It basically only guarantees that we fetch at least the sampled
percentage of source data. Not exact number, nor guarantee for #rows. I think
an option to disable it is a way to avoid confusion in some ways. How do you
think?
bq.
bq. Ning Zhang wrote:
bq. I think if we specify clearly the semantics of block-level sample in
the wiki/documentation, there shouldn't be much confusion. In fact I think it
is much easier to explain than the bucket-based sampling. In addition if the
user has confusions about the semantics, throwing an SemanticException won't
help them understand. I think the only use case for this parameter is to act
as a gatekeeper to this feature if we found a bug in it and want to disable the
feature quickly. That should be able to be achieved by switching branches
quickly. If we have a gatekeeper parameter for each feature, the conf will grow
unnecessarily large quickly.
OK. I'll remove this switch. We need to document and communicate very well to
users. People will easily misunderstand this.
bq. On 2011-04-26 20:50:30, Siying Dong wrote:
bq. > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java,
line 6392
bq. > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
bq. >
bq. > limit can be combined with block sampling. Just this optimization
for limit doesn't make sense when users already sample the input data and we
won't get much benefit.
bq.
bq. Ning Zhang wrote:
bq. I think combining these two still makes sense: 1) as you mentioned
block sampling is not limiting on the # of rows, but limit is. Combining these
two allows the users to get approximately N rows quickly. 2) this restriction
makes an exception in terms of the query language composition. From the
language syntax, it is allowed and makes senses to combine block-sampling and
limit, but the user will get a SemanticException if they do. I think
SemanticException should be thrown only when there is a legitimate semantic
error (e.g., the percentage number is negative). If you feel that it is not a
major use case and would rather do it in a follow-up JIRA, we should document
it in TODO and file a JIRA for it.
I think it is the misunderstanding here. Limit works with split sampling well.
No exception will be thrown in any of those two combination and the result will
be what we expected.
This condition only disabled the optimization that runs against a smaller data
set for some limit queries. With split sampling, user already specific what a
percentage to sample, there is no need to further run the query against a small
subset of the inputs. From test case split_sample.q, you can already see how
they work together well.
- Siying
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------
On 2011-04-26 21:19:18, Siying Dong wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/633/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-04-26 21:19:18)
bq.
bq.
bq. Review request for hive, Ning Zhang and namit jain.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. We need a better input sampling to serve at least two purposes:
bq. 1. test their queries against a smaller data set
bq. 2. understand more about how the data look like without scanning the whole
table.
bq. A simple function that gives a subset splits will help in those cases. It
doesn't have to be strict sampling.
bq.
bq. This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which
samples input splits with size at least n% of the original inputs.
bq.
bq.
bq. This addresses bug HIVE-2121.
bq. https://issues.apache.org/jira/browse/HIVE-2121
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852
bq. trunk/conf/hive-default.xml 1096852
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java
1096852
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java
1096852
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java
1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
1096852
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
1096852
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java
1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java
1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
1096852
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java
PRE-CREATION
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852
bq. trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q
PRE-CREATION
bq. trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q
PRE-CREATION
bq. trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q
PRE-CREATION
bq. trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out
PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out
PRE-CREATION
bq. trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out
PRE-CREATION
bq. trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852
bq. trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852
bq.
bq. Diff: https://reviews.apache.org/r/633/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
bq.
bq.
bq. Thanks,
bq.
bq. Siying
bq.
bq.
> Input Sampling By Splits
> ------------------------
>
> Key: HIVE-2121
> URL: https://issues.apache.org/jira/browse/HIVE-2121
> Project: Hive
> Issue Type: New Feature
> Reporter: Siying Dong
> Assignee: Siying Dong
> Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch
>
>
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole
> table.
> A simple function that gives a subset splits will help in those cases. It
> doesn't have to be strict sampling.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira