[ 
https://issues.apache.org/jira/browse/HIVE-5283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13770196#comment-13770196
 ] 

Tony Murphy commented on HIVE-5283:
-----------------------------------

[~cwsteinbach] Thanks for the questions. I would definitely appreciate some 
feedback on how to appropriately document the test strategy I used here.

Regarding your question about magic numbers in the queries: the values are 
effectively random, but they are important. If you look at 
[ql/src/test/org/apache/hadoop/hive/ql/exec/vector/util/OrcFileGenerator.java|https://reviews.apache.org/r/14130/diff/?page=17#337]
, which is the data generation class, you'll see that those values are specified 
in initializeFixedPointValues for each data type. When I created the 
queries, I used those values wherever I needed scalar values, to ensure that 
when the queries execute, their predicates filter on values that are 
guaranteed to exist.
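
As a rough illustration of that idea (a sketch in Python, not the actual 
OrcFileGenerator code), the snippet below plants a handful of fixed values 
into otherwise random column data. The column names, the fixed values, and the 
generate_column helper are all hypothetical.

```python
import random

# Hypothetical fixed values; the real ones are defined in OrcFileGenerator.java.
FIXED_VALUES = {"cint": [-5, 0, 42], "cdouble": [-1.5, 3.25]}

def generate_column(name, n_rows, rng):
    """Return random data with every fixed value planted at least once."""
    fixed = FIXED_VALUES[name]
    if name == "cint":
        rows = [rng.randint(-1000, 1000) for _ in range(n_rows)]
    else:
        rows = [rng.uniform(-1000, 1000) for _ in range(n_rows)]
    # Overwrite distinct random positions so each fixed value is present.
    for pos, value in zip(rng.sample(range(n_rows), len(fixed)), fixed):
        rows[pos] = value
    return rows

rng = random.Random(0)
cint = generate_column("cint", 100, rng)
# A predicate such as "cint = 42" now selects at least one row.
matches = [v for v in cint if v == 42]
```

A query that filters on one of the planted values can therefore never come 
back empty just because the random data happened not to contain it.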

Beyond those values, all the other data in the alltypesorc file is random, but 
there is a specific pattern to the data that is important for coverage. In ORC, 
and subsequently in vectorization, there are a number of optimizations for 
certain data patterns: AllValues, NoNulls, RepeatingValue, RepeatingNull. The 
data in alltypesorc is generated such that each column has exactly 3 batches of 
each data pattern. This gives us coverage for the vector expression 
optimizations and ensures the metadata is appropriately set on the row batch 
objects, which are reused across batches. 
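
A minimal sketch of that layout, in Python rather than the generator's Java: 
the batch size, the value range, and the exact meaning given to each pattern 
name (in particular, AllValues as "random values with some nulls mixed in") are 
assumptions made for illustration only.

```python
import random

BATCH_SIZE = 1024          # assumed batch size, for illustration
BATCHES_PER_PATTERN = 3    # "exactly 3 batches of each data pattern"
PATTERNS = ["AllValues", "NoNulls", "RepeatingValue", "RepeatingNull"]

def make_batch(pattern, rng):
    """Build one batch of a column following the named pattern (assumed semantics)."""
    if pattern == "AllValues":
        # Assumed: random values with occasional nulls.
        return [None if rng.random() < 0.1 else rng.randint(-100, 100)
                for _ in range(BATCH_SIZE)]
    if pattern == "NoNulls":
        return [rng.randint(-100, 100) for _ in range(BATCH_SIZE)]
    if pattern == "RepeatingValue":
        return [rng.randint(-100, 100)] * BATCH_SIZE
    if pattern == "RepeatingNull":
        return [None] * BATCH_SIZE
    raise ValueError(pattern)

def batch_metadata(batch):
    """Per-batch flags that must be refreshed when a batch object is reused."""
    return {
        "no_nulls": all(v is not None for v in batch),
        "is_repeating": all(v == batch[0] for v in batch),
    }

rng = random.Random(0)
column = [(p, make_batch(p, rng)) for p in PATTERNS
          for _ in range(BATCHES_PER_PATTERN)]
flags = [(p, batch_metadata(b)) for p, b in column]
```

Because the flags change from batch to batch, a reader that reuses a row batch 
object without resetting them would produce wrong results on the next pattern, 
which is the kind of bug this data layout is meant to expose.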
 
For the queries themselves, in order to efficiently cover as much of the new 
vectorization functionality as I could, I used a number of different techniques 
to create the vectorization_*.q test suites, primarily equivalence classes and 
pairwise combinations.

First I divided the search space into a number of dimensions such as type, 
aggregate function, filter operation, arithmetic operation, etc. The types were 
explored as equivalence classes of long, double, time, string, and bool. Also, 
rather than creating a very large number of small queries, the resulting vectors 
were grouped by compatible dimensions to reduce the number of queries.
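
To make the pairwise idea concrete, here is a small Python sketch of a greedy 
pairwise generator. It is not the procedure actually used to build the .q 
files: only the type classes come from the description above, the other 
dimension values are hypothetical, and compatibility between dimensions (for 
example, arithmetic on strings) is ignored.

```python
from itertools import combinations, product

DIMENSIONS = {
    "type": ["long", "double", "time", "string", "bool"],  # equivalence classes
    "aggregate": ["SUM", "AVG", "COUNT"],                  # hypothetical values
    "filter": ["=", "<", "BETWEEN"],                       # hypothetical values
    "arithmetic": ["+", "*"],                              # hypothetical values
}

def pairs_of(row):
    """All value pairs (tagged by dimension index) covered by one combination."""
    return {((i, row[i]), (j, row[j]))
            for i, j in combinations(range(len(row)), 2)}

def pairwise_suite(dimensions):
    """Greedily pick combinations until every pair of values is covered."""
    candidates = list(product(*dimensions.values()))
    uncovered = set().union(*(pairs_of(c) for c in candidates))
    suite = []
    while uncovered:
        # Take the combination that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda c: len(pairs_of(c) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best)
    return suite

suite = pairwise_suite(DIMENSIONS)
exhaustive = len(list(product(*DIMENSIONS.values())))
```

Each tuple in suite is one test vector. The suite covers every pair of values 
across any two dimensions while needing fewer vectors than the exhaustive 
cross product counted in exhaustive.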

It wouldn't be too much work to add comments to the .q files that summarize 
the coverage they provide, based on the vectors used to create each scenario.

> Merge vectorization branch to trunk
> -----------------------------------
>
>                 Key: HIVE-5283
>                 URL: https://issues.apache.org/jira/browse/HIVE-5283
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: HIVE-5283.1.patch, HIVE-5283.2.patch
>
>
> The purpose of this jira is to upload vectorization patch, run tests etc. The 
> actual work will continue under HIVE-4160 umbrella jira.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
