[jira] [Commented] (HIVE-2121) Input Sampling By Splits

[email protected] (JIRA) Thu, 28 Apr 2011 01:34:45 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026204#comment-13026204
 ]


[email protected] commented on HIVE-2121:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/633/
-----------------------------------------------------------

(Updated 2011-04-28 08:32:17.534107)


Review request for hive, Ning Zhang and namit jain.


Changes
-------

Two changes made in response to Namit's comments:
1. EXPLAIN now prints some information about the sampling. (It might not be 
the best way to print it, but it follows the existing framework.)
2. The granularity of sampling is reduced from split level to HDFS block level.


Summary
-------

We need better input sampling to serve at least two purposes:
1. let users test their queries against a smaller data set
2. let users understand what the data look like without scanning the whole 
table.
A simple function that returns a subset of the splits will help in those 
cases. It doesn't have to be strict sampling.

This diff adds the syntax .. table TABLESAMPLE(n PERCENT), which samples 
input splits whose total size is at least n% of the original input.
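
To make the "at least n%" semantics concrete, here is a minimal sketch of one 
way such a selection could work: take splits in order until their combined 
size reaches n% of the total. This is not the code in the patch. The class 
name SplitSampleSketch, the method sampleSplits, and the accepted range of 
0 to 100 are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: NOT the SplitSample class added by this patch.
class SplitSampleSketch {

    /**
     * Returns the indices of the splits kept for an n-percent sample.
     * Splits are taken in their given order until their combined size
     * reaches at least `percent` percent of the total input size.
     */
    public static List<Integer> sampleSplits(long[] splitSizes, double percent) {
        // Assumed valid range; the real patch may validate differently.
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException(
                "percent must be between 0 and 100: " + percent);
        }

        long total = 0;
        for (long size : splitSizes) {
            total += size;
        }
        double target = total * percent / 100.0;

        List<Integer> kept = new ArrayList<>();
        long accumulated = 0;
        // Stop as soon as the kept splits cover the requested fraction.
        for (int i = 0; i < splitSizes.length && accumulated < target; i++) {
            kept.add(i);
            accumulated += splitSizes[i];
        }
        return kept;
    }
}
```

With four splits of 64 units each and percent = 30, the sketch keeps splits 0 
and 1, which is 50% of the input. Because whole splits are taken, the sample 
can be noticeably larger than n%, which is why the feature is described as not 
being strict sampling.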


This addresses bug HIVE-2121.
    https://issues.apache.org/jira/browse/HIVE-2121


Diffs (updated)
-----

  trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1096852 
  trunk/conf/hive-default.xml 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRTableScan1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinFactory.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/Hive.g 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 1096852 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SplitSample.java PRE-CREATION 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1096852 
  trunk/ql/src/test/queries/clientnegative/split_sample_out_of_range.q PRE-CREATION 
  trunk/ql/src/test/queries/clientnegative/split_sample_wrong_format.q PRE-CREATION 
  trunk/ql/src/test/queries/clientpositive/split_sample.q PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_out_of_range.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientnegative/split_sample_wrong_format.q.out PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/bucket1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucket3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample1.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample10.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample2.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample3.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample4.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample5.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample6.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample7.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample8.q.out 1096852 
  trunk/ql/src/test/results/clientpositive/sample9.q.out 1096852 
  trunk/shims/src/0.20/java/org/apache/hadoop/hive/shims/Hadoop20Shims.java 1096852 
  trunk/shims/src/0.20S/java/org/apache/hadoop/hive/shims/Hadoop20SShims.java 1096852 
  trunk/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java 1096852 

Diff: https://reviews.apache.org/r/633/diff


Testing
-------

TestCliDriver, TestNegativeCliDriver, and manual tests on real clusters.


Thanks,

Siying



> Input Sampling By Splits
> ------------------------
>
>                 Key: HIVE-2121
>                 URL: https://issues.apache.org/jira/browse/HIVE-2121
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Siying Dong
>            Assignee: Siying Dong
>         Attachments: HIVE-2121.1.patch, HIVE-2121.2.patch, HIVE-2121.3.patch, 
> HIVE-2121.4.patch, HIVE-2121.5.patch
>
>
> We need better input sampling to serve at least two purposes:
> 1. let users test their queries against a smaller data set
> 2. let users understand what the data look like without scanning the whole 
> table.
> A simple function that returns a subset of the splits will help in those 
> cases. It doesn't have to be strict sampling.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
