Re: Review Request: Input Sampling Splits

Siying Dong Tue, 26 Apr 2011 15:10:14 -0700


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, 
> > line 498
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line498>
> >
> >     I feel like it is a little hard to explain what this sample guarantees. 
> > It basically only guarantees that we fetch at least the sampled percentage 
> > of source data. Not exact number, nor guarantee for #rows. I think an 
> > option to disable it is a way to avoid confusion in some ways. How do you 
> > think?
> 
> Ning Zhang wrote:
>     I think if we specify clearly the semantics of block-level sample in the 
> wiki/documentation, there shouldn't be much confusion. In fact I think it is 
> much easier to explain than the bucket-based sampling. In addition if the 
> user has confusions about the semantics, throwing an SemanticException won't 
> help them understand.  I think the only use case for this parameter is to act 
> as a gatekeeper to this feature if we found a bug in it and want to disable 
> the feature quickly. That should be able to be achieved by switching branches 
> quickly. If we have a gatekeeper parameter for each feature, the conf will 
> grow unnecessarily large quickly.


OK. I'll remove this switch. We need to document and communicate very well to 
users. People will easily misunderstand this.


> On 2011-04-26 20:50:30, Siying Dong wrote:
> > trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java, 
> > line 6392
> > <https://reviews.apache.org/r/633/diff/1/?file=16093#file16093line6392>
> >
> >     limit can be combined with block sampling. Just this optimization for 
> > limit doesn't make sense when users already sample the input data and we 
> > won't get much benefit.
> 
> Ning Zhang wrote:
>     I think combining these two still makes sense: 1) as you mentioned block 
> sampling is not limiting on the # of rows, but limit is. Combining these two 
> allows the users to get approximately N rows quickly. 2) this restriction 
> makes an exception in terms of the query language composition. From the 
> language syntax, it is allowed and makes senses to combine block-sampling and 
> limit, but the user will get a SemanticException if they do. I think 
> SemanticException should be thrown only when there is a legitimate semantic 
> error (e.g., the percentage number is negative). If you feel that it is not a 
> major use case and would rather do it in a follow-up JIRA, we should document 
> it in TODO and file a JIRA for it.

I think it is the misunderstanding here. Limit works with split sampling well. 
No exception will be thrown in any of those two combination and the result will 
be what we expected.
This condition only disabled the optimization that runs against a smaller data 
set for some limit queries. With split sampling, user already specific what a 
percentage to sample, there is no need to further run the query against a small 
subset of the inputs. From test case split_sample.q, you can already see how 
they work together well.


- Siying


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/#review561
-----------------------------------------------------------


On 2011-04-26 21:19:18, Siying Dong wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/633/
> -----------------------------------------------------------
> 
> (Updated 2011-04-26 21:19:18)
> 
> 
> Review request for hive, Ning Zhang and namit jain.
> 
> 
> Summary
> -------
> 
> We need a better input sampling to serve at least two purposes:
> 1. test their queries against a smaller data set
> 2. understand more about how the data look like without scanning the whole 
> table.
> A simple function that gives a subset splits will help in those cases. It 
> doesn't have to be strict sampling.
> 
> This diff allows a syntax of .. table TABLESAMPLE(n PERCENT), which samples 
> input splits with size at least n% of the original inputs.
> 
> 
> This addresses bug HIVE-2121.
>     https://issues.apache.org/jira/browse/HIVE-2121
> 
> 
> Diffs
> -----
> 
>   trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
>   trunk/conf/hive-default.xml 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 
> 1096852 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java 
> PRE-CREATION 
>   trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
>   trunk/ql/src/test/queries/clientnegative/split_sample_disabled.q 
> PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q 
> PRE-CREATION 
>   trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q 
> PRE-CREATION 
>   trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_disabled.q.out 
> PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out 
> PRE-CREATION 
>   trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out 
> PRE-CREATION 
>   trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
>   trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
> 
> Diff: https://reviews.apache.org/r/633/diff
> 
> 
> Testing
> -------
> 
> TestCliDriver TestNegativeCliDriver, manual tests on real clusters.
> 
> 
> Thanks,
> 
> Siying
> 
>

Re: Review Request: Input Sampling Splits

Reply via email to