[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2018-01-13 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325452#comment-16325452
 ] 

Lefty Leverenz commented on HIVE-18149:
---

Thanks Zoltan, I tweaked the doc to show the old default as well as the new.

> Stats: rownum estimation from datasize underestimates in most cases
> ---
>
> Key: HIVE-18149
> URL: https://issues.apache.org/jira/browse/HIVE-18149
> Project: Hive
>  Issue Type: Sub-task
>  Components: Statistics
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
> Fix For: 3.0.0
>
> Attachments: HIVE-18149.01.patch, HIVE-18149.01wip01.patch, 
> HIVE-18149.02.patch, HIVE-18149.03.patch, HIVE-18149.03wip01.patch, 
> HIVE-18149.03wip02.patch
>
>
> rownum estimation is currently based on the following:
> * datasize is taken from one of the following sources:
> ** basicstats aggregates the loaded "on-heap" row sizes; other readers are 
> able to give a "raw size" estimation - I've checked orc, but I'm sure others 
> will do the same; the api docs are a bit vague about the methods' purpose...
> ** if the basicstats level info is not available, the filesystem level 
> "file-size-sums" are used as the "raw data size", which is multiplied by the 
> [deserialization 
> ratio|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L261],
> which is currently 1.
> the problem with all of this is that the deser factor is 1, and that the rowsize 
> counts in the online object headers.
> example: 20 rows are loaded into a partition 
> [columnstats_partlvl_dp.q|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L7];
> after HIVE-18108 [this 
> explain|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L25]
>  will estimate the rowsize of the table to be 404 bytes; however, the 20 rows 
> of text are only 169 bytes... so it ends up with 0 rows.
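
The arithmetic behind the quoted example can be sketched as follows. This is a minimal illustration, not Hive's code: it assumes the estimate is simply the file size scaled by the deserialization factor, divided by the estimated row size and truncated to a whole number. The function name `estimate_rows` and its parameters are hypothetical. The inputs are the 169-byte file size and 404-byte row size from the description, and 10.0 is the new default factor mentioned in the doc note elsewhere in this thread.

```python
def estimate_rows(raw_data_size_bytes: int, deser_factor: float, row_size_bytes: int) -> int:
    """Hypothetical helper: estimate a row count from a file-size-based data size."""
    if row_size_bytes <= 0:
        raise ValueError("row_size_bytes must be positive")
    # Scale the on-disk size by the deserialization factor to approximate
    # the in-memory data size, then divide by the estimated size of one row.
    estimated_data_size = int(raw_data_size_bytes * deser_factor)
    return estimated_data_size // row_size_bytes


# Numbers from the example: 20 rows of text occupy 169 bytes on disk,
# while the estimated row size is 404 bytes.
print(estimate_rows(169, 1.0, 404))   # factor 1: 169 // 404 -> 0
print(estimate_rows(169, 10.0, 404))  # factor 10: 1690 // 404 -> 4
```

Under these assumptions a factor of 10 still yields 4 rows rather than the actual 20, so in this example the larger factor reduces the underestimation without eliminating it.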



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2018-01-02 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308351#comment-16308351
 ] 

Zoltan Haindrich commented on HIVE-18149:
-

I've added some addendums... I had missed TestAcidOnTez; fortunately it had 
already set noconditionalthreshold






[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-30 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307135#comment-16307135
 ] 

Lefty Leverenz commented on HIVE-18149:
---

Doc note:  This changes the default value of 
*hive.stats.deserialization.factor* from 1.0 to 10.0, so the wiki needs to be 
updated.

* [hive.stats.deserialization.factor | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.stats.deserialization.factor]

Added a TODOC3.0 label.  (Please add your own TODOC labels and doc notes in the 
future.)






[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-19 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297194#comment-16297194
 ] 

Ashutosh Chauhan commented on HIVE-18149:
-

+1






[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-19 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297104#comment-16297104
 ] 

Hive QA commented on HIVE-18149:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12902874/HIVE-18149.03.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 11528 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2]
 (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] 
(batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[smb_mapjoin_15]
 (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketizedhiveinputformat]
 (batchId=178)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_aggregator_error_1]
 (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_10]
 (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketsortoptimize_insert_7]
 (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] 
(batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] 
(batchId=113)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut 
(batchId=209)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=223)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=226)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8322/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8322/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8322/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 19 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12902874 - PreCommit-HIVE-Build


[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-19 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297048#comment-16297048
 ] 

Hive QA commented on HIVE-18149:


| (/) *+1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | @author | 0m 1s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| 0 | mvndep | 1m 14s | Maven dependency ordering for branch |
| +1 | mvninstall | 5m 14s | master passed |
| +1 | compile | 1m 49s | master passed |
| +1 | checkstyle | 1m 8s | master passed |
| +1 | javadoc | 1m 30s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 21s | Maven dependency ordering for patch |
| +1 | mvninstall | 2m 10s | the patch passed |
| +1 | compile | 1m 47s | the patch passed |
| +1 | javac | 1m 47s | the patch passed |
| +1 | checkstyle | 1m 10s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | javadoc | 1m 29s | the patch passed |
|| || || || Other Tests ||
| +1 | asflicense | 0m 12s | The patch does not generate ASF License warnings. |
| | | 19m 47s | |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 9efed65 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8322/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-19 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16296858#comment-16296858
 ] 

Zoltan Haindrich commented on HIVE-18149:
-

Patch #3:

* updated the q.out files






[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295505#comment-16295505
 ] 

Hive QA commented on HIVE-18149:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12902677/HIVE-18149.03wip02.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 41 failed/errored test(s), 11531 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=249)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=249)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join25] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_12] 
(batchId=33)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] 
(batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby10] (batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_cube1] 
(batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_rollup1] 
(batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input17] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input3_limit] 
(batchId=63)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input4] (batchId=81)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input5] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath2] 
(batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] 
(batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge5] (batchId=56)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge6] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat1] 
(batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat2] 
(batchId=83)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_gather_stats] 
(batchId=86)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_reduce_groupby_duplicate_cols]
 (batchId=35)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2]
 (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] 
(batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketizedhiveinputformat]
 (batchId=178)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning_4]
 (batchId=179)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part]
 (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_aggregator_error_1]
 (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_10]
 (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_12]
 (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketsortoptimize_insert_7]
 (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] 
(batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] 
(batchId=113)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut 
(batchId=209)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=223)
org.apache.hadoop.hive.ql.TestTxnCommandsForOrcMmTable.testInsertOverwriteWithDynamicPartition
 (batchId=278)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=226)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8300/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8300/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8300/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 41 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 

[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295393#comment-16295393
 ] 

Hive QA commented on HIVE-18149:


| (/) *+1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || Prechecks ||
| 0 | findbugs | 0m 0s | Findbugs executables are not available. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
|| || || || master Compile Tests ||
| 0 | mvndep | 1m 22s | Maven dependency ordering for branch |
| +1 | mvninstall | 6m 33s | master passed |
| +1 | compile | 2m 14s | master passed |
| +1 | checkstyle | 1m 23s | master passed |
| +1 | javadoc | 1m 47s | master passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 24s | Maven dependency ordering for patch |
| +1 | mvninstall | 2m 30s | the patch passed |
| +1 | compile | 1m 59s | the patch passed |
| +1 | javac | 1m 59s | the patch passed |
| +1 | checkstyle | 1m 12s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | javadoc | 1m 26s | the patch passed |
|| || || || Other Tests ||
| +1 | asflicense | 0m 13s | The patch does not generate ASF License warnings. |
| | | 23m 19s | |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 8259022 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8300/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-14 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291929#comment-16291929
 ] 

Hive QA commented on HIVE-18149:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12902078/HIVE-18149.03wip01.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 36 failed/errored test(s), 11527 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=249)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=249)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_12] 
(batchId=33)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] 
(batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[fp_literal_arithmetic] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby10] (batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_cube1] 
(batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_rollup1] 
(batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input17] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input3_limit] 
(batchId=63)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input4] (batchId=81)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input5] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath2] 
(batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] 
(batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] 
(batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge5] (batchId=56)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge6] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat1] 
(batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat2] 
(batchId=83)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_gather_stats] 
(batchId=86)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2]
 (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] 
(batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast]
 (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[quotedid_smb]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=160)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part]
 (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_10]
 (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_12]
 (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketsortoptimize_insert_7]
 (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] 
(batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] 
(batchId=113)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=223)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=226)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8250/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8250/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8250/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 36 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12902078 - PreCommit-HIVE-Build


[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-14 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291891#comment-16291891
 ] 

Hive QA commented on HIVE-18149:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  1m 
14s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  2m  
1s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
15s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
33s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
21s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 21m  2s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / e120bd8 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8250/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-09 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285012#comment-16285012
 ] 

Hive QA commented on HIVE-18149:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12900887/HIVE-18149.02.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8164/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8164/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8164/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Tests exited with: ExecutionException: java.util.concurrent.ExecutionException: 
org.apache.hive.ptest.execution.ssh.SSHExecutionException: RSyncResult 
[localFile=/data/hiveptest/logs/PreCommit-HIVE-Build-8164/succeeded/205_UTBatch_service_8_tests,
 remoteFile=/home/hiveptest/104.198.217.87-hiveptest-0/logs/, 
getExitCode()=255, getException()=null, getUser()=hiveptest, 
getHost()=104.198.217.87, getInstance()=0]: 'Warning: Permanently added 
'104.198.217.87' (ECDSA) to the list of known hosts.
receiving incremental file list
./
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.TestLdapAtnProviderWithMiniDS.xml
  0  0%  0.00kB/s  0:00:00
  91,994  100%  2.58MB/s  0:00:00 (xfr#1, to-chk=12/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.ldap.TestCustomQueryFilter.xml
  0  0%  0.00kB/s  0:00:00
  87,958  100%  2.33MB/s  0:00:00 (xfr#2, to-chk=11/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.ldap.TestQuery.xml
  0  0%  0.00kB/s  0:00:00
  87,939  100%  1.22MB/s  0:00:00 (xfr#3, to-chk=10/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.ldap.TestUserFilter.xml
  0  0%  0.00kB/s  0:00:00
  87,929  100%  1.22MB/s  0:00:00 (xfr#4, to-chk=9/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.ldap.TestUserSearchFilter.xml
  0  0%  0.00kB/s  0:00:00
  88,367  100%  837.82kB/s  0:00:00 (xfr#5, to-chk=8/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.cli.TestCLIServiceConnectionLimits.xml
  0  0%  0.00kB/s  0:00:00
  89,238  100%  837.95kB/s  0:00:00 (xfr#6, to-chk=7/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.cli.TestCLIServiceRestore.xml
  0  0%  0.00kB/s  0:00:00
  87,707  100%  823.57kB/s  0:00:00 (xfr#7, to-chk=6/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.cli.TestHiveSQLException.xml
  0  0%  0.00kB/s  0:00:00
  88,448  100%  822.62kB/s  0:00:00 (xfr#8, to-chk=5/14)
maven-test.txt
  0  0%  0.00kB/s  0:00:00
  6,086  100%  56.07kB/s  0:00:00 (xfr#9, to-chk=4/14)
logs/
logs/derby.log
  0  0%  0.00kB/s  0:00:00
  989  100%  9.11kB/s  0:00:00 (xfr#10, to-chk=1/14)
logs/hive.log
  0  0%  0.00kB/s  0:00:00
  35,487,744  0%  33.84MB/s  0:02:01
  91,553,792  2%  43.66MB/s  0:01:32
  148,144,128  3%  47.09MB/s  0:01:24
  205,651,968  4%  49.04MB/s  0:01:20
  262,864,896  6%  54.22MB/s  0:01:11
  319,324,160  7%  54.35MB/s  0:01:10
  377,126,912  8%  54.63MB/s  0:01:08
Timeout, server 104.198.217.87 not responding.

rsync: connection unexpectedly closed (391788893 bytes received so far) 
[receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) 
[receiver=3.1.1]
rsync: connection unexpectedly closed (904 bytes received so far) [generator]
rsync error: unexplained error (code 255) at io.c(226) [generator=3.1.1]
ssh: connect to host 104.198.217.87 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
ssh: connect to host 104.198.217.87 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
ssh: connect to host 104.198.217.87 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
ssh: connect to host 104.198.217.87 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
'
{noformat}

This message is automatically 

[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-09 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284829#comment-16284829
 ] 

Hive QA commented on HIVE-18149:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
56s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 
26s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
49s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
29s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
20s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 19m 55s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 5bbd864 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8164/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-05 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279486#comment-16279486
 ] 

Ashutosh Chauhan commented on HIVE-18149:
-

+1

> Stats: rownum estimation from datasize underestimates in most cases
> ---
>
> Key: HIVE-18149
> URL: https://issues.apache.org/jira/browse/HIVE-18149
> Project: Hive
>  Issue Type: Sub-task
>  Components: Statistics
> Reporter: Zoltan Haindrich
> Assignee: Zoltan Haindrich
> Attachments: HIVE-18149.01.patch, HIVE-18149.01wip01.patch
>
>
> rownum estimation is based on the following fact as of now:
> * datasize being used from the following sources:
> ** basicstats aggregates the loaded "on-heap" row sizes ; other readers are 
> able to give "raw size" estimation - I've checked orc; but I'm sure others 
> will do the same; api docs are a bit vague about the methods purpose...
> ** if the basicstats level info is not available; the filesystem level 
> "file-size-sums" are used as the "raw data size" ; which is multiplied by the 
> [deserialization 
> ratio|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L261]
>  ; which is currently 1.
> the problem with all of this is that deser factor is 1; and that rowsize 
> counts in the online object headers..
> example; 20 rows are loaded into a partition 
> [columnstats_partlvl_dp.q|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L7]
> after HIVE-18108 [this 
> explain|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L25]
>  will estimate the rowsize of the table to be 404 bytes; however the 20 rows 
> of text is only 169 bytes...so it ends up with 0 rows...
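
The arithmetic in the description above can be sketched as follows. This is only an illustration, not the code in StatsUtils: the class name, the method name and the second factor value (10.0) are assumptions made for the example. Only the 404-byte row size, the 169-byte file size and the factor of 1 come from the issue text.

```java
// Hypothetical sketch of "rows = (file size * deserialization factor) / row size".
// Not Hive's StatsUtils API; names and the 10.0 factor are illustrative assumptions.
class RowCountEstimate {

    static long estimateRows(long fileSizeBytes, double deserFactor, long avgRowSizeBytes) {
        if (avgRowSizeBytes <= 0) {
            throw new IllegalArgumentException("avgRowSizeBytes must be positive");
        }
        // The file size stands in for the "raw data size" once scaled by the factor.
        double rawDataSize = fileSizeBytes * deserFactor;
        // Truncating division: anything below one full row becomes 0 rows.
        return (long) (rawDataSize / avgRowSizeBytes);
    }

    public static void main(String[] args) {
        // Values from the issue: 169 bytes on disk, 404-byte estimated row size, factor 1.
        System.out.println(estimateRows(169, 1.0, 404));
        // Same inputs with a larger, purely illustrative factor.
        System.out.println(estimateRows(169, 10.0, 404));
    }
}
```

With a factor of 1 the scaled data size (169 bytes) is smaller than a single estimated row (404 bytes), so the truncating division yields 0 rows even though 20 rows were loaded. A larger factor raises the estimate, but as long as the row size includes on-heap overhead the result can still fall short of the real row count.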



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-05 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279445#comment-16279445
 ] 

Hive QA commented on HIVE-18149:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12900726/HIVE-18149.01.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 11 failed/errored test(s), 11509 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[runtime_skewjoin_mapjoin_spark]
 (batchId=54)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_character_length] 
(batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_octet_length] 
(batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_view] (batchId=15)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynpart_sort_opt_vectorization]
 (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2]
 (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata]
 (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] 
(batchId=160)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[union_view] 
(batchId=110)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=224)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints 
(batchId=227)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8117/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8117/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8117/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 11 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12900726 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-05 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279361#comment-16279361
 ] 

Hive QA commented on HIVE-18149:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  
0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
47s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  4m 
59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
26s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 
19s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  2m 
 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
 6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 
24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
12s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 18m 43s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 
3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / fb85336 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | 
http://104.198.109.242/logs//PreCommit-HIVE-Build-8117/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-05 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279331#comment-16279331
 ] 

Ashutosh Chauhan commented on HIVE-18149:
-

Since ORC and Parquet are the most common formats these days, bumping up this 
ratio makes sense: columnar formats usually compress very well, and the data 
then bloats further once it is loaded into memory.



[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-05 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278516#comment-16278516
 ] 

Hive QA commented on HIVE-18149:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12899660/HIVE-18149.01wip01.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 333 failed/errored test(s), 11509 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[select_dummy_source] 
(batchId=247)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[smb_mapjoin_1] 
(batchId=247)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[smb_mapjoin_2] 
(batchId=247)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[smb_mapjoin_3] 
(batchId=247)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[explain] 
(batchId=250)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_dynamic_partitions]
 (batchId=250)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table]
 (batchId=250)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_dynamic_partitions]
 (batchId=250)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table]
 (batchId=250)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_part] 
(batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[autoColumnStats_5] 
(batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[autoColumnStats_5a] 
(batchId=53)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join_stats2] 
(batchId=86)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join_stats] 
(batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_12] 
(batchId=33)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_5] 
(batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[binarysortable_1] 
(batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucket_map_join_1] 
(batchId=65)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucket_map_join_2] 
(batchId=57)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucketcontext_5] 
(batchId=23)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucketmapjoin_negative3] 
(batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] 
(batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_join1] 
(batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udaf_percentile_approx_23]
 (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnarserde_create_shortcut]
 (batchId=66)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnstats_tbllvl] 
(batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[combine2] (batchId=6)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[compute_stats_date] 
(batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[concat_op] (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[correlationoptimizer5] 
(batchId=69)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2] 
(batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision] 
(batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf2] 
(batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf] (batchId=9)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[display_colstats_tbllvl] 
(batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[distinct_windowing] 
(batchId=11)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[distinct_windowing_no_cbo]
 (batchId=64)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[drop_table_with_index] 
(batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[filter_cond_pushdown2] 
(batchId=64)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[gen_udf_example_add10] 
(batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby10] (batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_cube1] 
(batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_id3] 
(batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets1] 
(batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets2] 
(batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets3] 
(batchId=1)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets4] 
(batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets5] 
(batchId=49)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets6] 
(batchId=70)

[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-12-05 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278468#comment-16278468
 ] 

Hive QA commented on HIVE-18149:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 56s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 15s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 50s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  5s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 20s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  7s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 15m 28s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / f631241 |
| Default Java | 1.8.0_111 |
| modules | C: common ql U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8109/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.



> Stats: rownum estimation from datasize underestimates in most cases
> ---
>
> Key: HIVE-18149
> URL: https://issues.apache.org/jira/browse/HIVE-18149
> Project: Hive
>  Issue Type: Sub-task
>  Components: Statistics
> Reporter: Zoltan Haindrich
> Assignee: Zoltan Haindrich
> Attachments: HIVE-18149.01wip01.patch

[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-11-28 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268492#comment-16268492
 ] 

Zoltan Haindrich commented on HIVE-18149:
-

Unfortunately, these changes are starting to stick together... because of this 
problem, some table stats are demoted to PARTIAL in HIVE-18108, since the 
estimated rowsize is greater than the whole dataset size...

> Stats: rownum estimation from datasize underestimates in most cases
> ---
>
> Key: HIVE-18149
> URL: https://issues.apache.org/jira/browse/HIVE-18149
> Project: Hive
>  Issue Type: Sub-task
>  Components: Statistics
> Reporter: Zoltan Haindrich


[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-11-27 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266714#comment-16266714
 ] 

Zoltan Haindrich commented on HIVE-18149:
-

Possibly an alternative option would be to estimate the deserialization factor 
by estimating the "online" rowsize and dividing it by an estimated "offline" 
rowsize...
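
A minimal sketch of that ratio, assuming both row sizes are already available as byte counts. The class and method names are hypothetical and are not part of Hive's StatsUtils. The "offline" size used in main (169 bytes / 20 rows) comes from the example in the issue description and is for illustration only, since the real row count would not be known at estimation time:

```java
// Hypothetical illustration only; not Hive code.
final class DeserFactorEstimate {

    // factor = estimated in-memory ("online") row size / estimated on-disk ("offline") row size
    static double estimateFactor(double onlineRowSizeBytes, double offlineRowSizeBytes) {
        if (offlineRowSizeBytes <= 0) {
            throw new IllegalArgumentException("offline row size must be positive");
        }
        return onlineRowSizeBytes / offlineRowSizeBytes;
    }

    public static void main(String[] args) {
        double online = 404.0;         // estimated row size from the issue description
        double offline = 169.0 / 20.0; // 169 bytes of text for 20 rows (illustrative)
        System.out.println(estimateFactor(online, offline));
    }
}
```

With these numbers the ratio comes out far above 1, which is consistent with the observation that a factor of 1 underestimates the row count.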

> Stats: rownum estimation from datasize underestimates in most cases
> ---
>
> Key: HIVE-18149
> URL: https://issues.apache.org/jira/browse/HIVE-18149
> Project: Hive
>  Issue Type: Sub-task
> Reporter: Zoltan Haindrich


[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases

2017-11-27 Thread Zoltan Haindrich (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266711#comment-16266711
 ] 

Zoltan Haindrich commented on HIVE-18149:
-

I think setting {{hive.stats.deserialization.factor}} to about {{10.0}} might 
yield more realistic estimates... for the example in the description (169 
bytes of text, estimated rowsize of 404 bytes) it would estimate 4 rows, which 
is much better than zero rows.
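
The arithmetic behind those two numbers can be sketched as follows. This is an illustration under stated assumptions, not Hive's actual code path: the helper name is hypothetical, and the result is assumed to be truncated to a whole number of rows, which matches the "0 rows" outcome reported in the description:

```java
// Hypothetical illustration only; not Hive's StatsUtils.
final class RowCountEstimate {

    // rows = floor(fileSize * deserializationFactor / estimatedRowSize)
    static long estimateRows(long fileSizeBytes, double deserFactor, long rowSizeBytes) {
        double rawDataSize = fileSizeBytes * deserFactor; // file size scaled to "raw data size"
        return (long) (rawDataSize / rowSizeBytes);       // truncation is an assumption
    }

    public static void main(String[] args) {
        long fileSize = 169; // bytes of text holding the 20 rows
        long rowSize = 404;  // estimated row size in bytes
        System.out.println(estimateRows(fileSize, 1.0, rowSize));  // prints 0
        System.out.println(estimateRows(fileSize, 10.0, rowSize)); // prints 4
    }
}
```

Under the same assumptions, 4 is still well below the 20 rows actually loaded, so a factor of 10.0 narrows the gap rather than closing it.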


> Stats: rownum estimation from datasize underestimates in most cases
> ---
>
> Key: HIVE-18149
> URL: https://issues.apache.org/jira/browse/HIVE-18149
> Project: Hive
>  Issue Type: Sub-task
> Reporter: Zoltan Haindrich