[ 
https://issues.apache.org/jira/browse/HIVE-13496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Seth updated HIVE-13496:
----------------------------------
    Attachment: HIVE-13496.01.patch

Attached a patch to make this change. It implements the third point mentioned 
in the previous post.

If the data does not exist, create it and copy it to a known location for 
future runs.
If the data exists in the known location, copy it over for the current run.

mvn clean gets rid of the cached data, in case it needs to be re-generated.
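
The two cases above can be sketched roughly as follows. This is only an 
illustration of the generate-once-then-copy idea, not the code in the patch: 
prepare_test_data, cache_dir, run_dir and generate are hypothetical names, and 
the staging-directory rename is an extra assumption added so that an 
interrupted run does not leave a half-written cache behind.

```python
import shutil
from pathlib import Path
from typing import Callable


def prepare_test_data(cache_dir: Path, run_dir: Path,
                      generate: Callable[[Path], None]) -> bool:
    """Populate run_dir with test data, generating it only on a cache miss.

    All names here are hypothetical. Returns True if the data had to be
    generated, False if it was served from the cached copy.
    """
    generated = False
    if not cache_dir.is_dir():
        # Cache miss: generate into a staging directory first, then rename
        # it to the known location once generation has finished.
        staging = cache_dir.with_name(cache_dir.name + ".tmp")
        shutil.rmtree(staging, ignore_errors=True)
        staging.mkdir(parents=True)
        generate(staging)
        staging.rename(cache_dir)
        generated = True

    # In both cases the current run works on its own copy of the data,
    # so tests cannot modify the cached copy.
    shutil.rmtree(run_dir, ignore_errors=True)
    shutil.copytree(cache_dir, run_dir)
    return generated
```

Deleting cache_dir (which is what mvn clean effectively does for the cached 
data) forces the next call to regenerate.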

For mvn test -Dtest=TestCliDriver -Dqfile="udf_md5.q":
Without patch:
Run1: Total time: 1:09.271s
Run2: Total time: 1:07.661s
Run3: Total time: 1:09.281s

With patch: 
Run1: Total time: 1:08.162s
Run2: Total time: 18.754s
Run3: Total time: 18.680s

For Precommit tests, TestCliDriver runs 2131 tests in ~143 batches on 14 
nodes, so an average of about 10 batches per node. Looking at existing test 
results (specifically the mvn output against the test xml), there's over a 
minute of data gen overhead per batch on the build machines. The patch should 
take 10+ minutes off the runtime.
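
The estimate works out as below. This is a back-of-the-envelope sketch: the 
one-minute figure is a lower bound taken from "over a minute", and it assumes 
every batch on a node can reuse the cached data and ignores the cost of the 
copy itself.

```python
batches = 143                 # TestCliDriver batches in a precommit run
nodes = 14                    # build machines the batches are spread over
overhead_min_per_batch = 1.0  # assumed lower bound for data gen per batch

batches_per_node = batches / nodes
saved_min_per_node = batches_per_node * overhead_min_per_batch

print(f"{batches_per_node:.1f} batches per node, "
      f"~{saved_min_per_node:.1f} minutes of data gen per node")
# prints: 10.2 batches per node, ~10.2 minutes of data gen per node
```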

This is only done for TestCliDriver right now. I think we should get this 
change in (ideally without pre-commit), and then look at the other tests. 
[~ashutoshc], [~thejas] - could you please take a look?

> Create initial test data once across multiple test runs
> -------------------------------------------------------
>
>                 Key: HIVE-13496
>                 URL: https://issues.apache.org/jira/browse/HIVE-13496
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Test
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: HIVE-13496.01.patch
>
>
> All TestCliDriver, TezMiniTezCliDriver, etc. tests create a standard data 
> set when they start up. When running on a box with SSDs, this step takes 
> over a minute.
> Running a single qtest cannot be faster than this. On the ptest framework, 
> all batches end up doing this, which wastes a lot of time.
> Instead, this data generation should be shared across runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
