[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16325452#comment-16325452 ]

Lefty Leverenz commented on HIVE-18149:
---

Thanks Zoltan, I tweaked the doc to show the old default as well as the new.

> Stats: rownum estimation from datasize underestimates in most cases
> ---
>
> Key: HIVE-18149
> URL: https://issues.apache.org/jira/browse/HIVE-18149
> Project: Hive
> Issue Type: Sub-task
> Components: Statistics
> Reporter: Zoltan Haindrich
> Assignee: Zoltan Haindrich
> Fix For: 3.0.0
>
> Attachments: HIVE-18149.01.patch, HIVE-18149.01wip01.patch,
> HIVE-18149.02.patch, HIVE-18149.03.patch, HIVE-18149.03wip01.patch,
> HIVE-18149.03wip02.patch
>
>
> rownum estimation is based on the following fact as of now:
> * datasize being used from the following sources:
> ** basicstats aggregates the loaded "on-heap" row sizes ; other readers are
> able to give "raw size" estimation - I've checked orc; but I'm sure others
> will do the same
> api docs are a bit vague about the methods purpose...
> ** if the basicstats level info is not available; the filesystem level
> "file-size-sums" are used as the "raw data size" ; which is multiplied by the
> [deserialization
> ratio|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L261]
> ; which is currently 1.
> the problem with all of this is that deser factor is 1; and that rowsize
> counts in the online object headers..
> example; 20 rows are loaded into a partition
> [columnstats_partlvl_dp.q|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L7]
> after HIVE-18108 [this
> explain|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L25]
> will estimate the rowsize of the table to be 404 bytes; however the 20 rows
> of text is only 169 bytes...so it ends up with 0 rows...

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
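The arithmetic in the issue description can be reproduced with a small sketch. This is not Hive's `StatsUtils` code: the class and method names below are hypothetical, and the truncating division is an assumption chosen because it matches the reported result of 0 rows.

```java
// Hypothetical sketch of the estimation described in the issue; not Hive's actual code.
class RowCountEstimate {

    /**
     * Picks the data size: the raw size from basic stats when it is available
     * (modelled here as a value > 0), otherwise the summed file size scaled by
     * the deserialization factor.
     */
    public static long dataSize(long rawDataSize, long totalFileSize, double deserFactor) {
        if (rawDataSize > 0) {
            return rawDataSize;
        }
        return (long) (totalFileSize * deserFactor);
    }

    /** Estimated rows = data size / estimated row size, truncated (assumption). */
    public static long estimateRows(long dataSize, long avgRowSize) {
        if (avgRowSize <= 0) {
            return 0;
        }
        return dataSize / avgRowSize;
    }

    public static void main(String[] args) {
        // Numbers from the description: 169 bytes of text, 404-byte row size, factor 1.
        long size = dataSize(0, 169, 1.0);
        long rows = estimateRows(size, 404);
        System.out.println("dataSize=" + size + " rows=" + rows); // dataSize=169 rows=0
    }
}
```

With a factor of 1 the file size is used unscaled, so any table whose on-disk size is smaller than one estimated in-memory row is reported as empty under these assumptions.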
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16308351#comment-16308351 ]

Zoltan Haindrich commented on HIVE-18149:
---

I've added some addendums... I've missed TestAcidOnTez - fortunately it had set noconditionalthreshold already
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16307135#comment-16307135 ]

Lefty Leverenz commented on HIVE-18149:
---

Doc note: This changes the default value of *hive.stats.deserialization.factor* from 1.0 to 10.0, so the wiki needs to be updated.

* [hive.stats.deserialization.factor | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.stats.deserialization.factor]

Added a TODOC3.0 label. (Please add your own TODOC labels and doc notes in the future.)
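To see what the new default means for the example in the issue description (169 bytes on disk, 404-byte estimated row size), here is a rough comparison of the two factor values. The class and method are hypothetical, and the truncating division is an assumption rather than a statement about Hive's implementation.

```java
// Hypothetical comparison of the old and new default factors; not Hive's actual code.
class DeserFactorComparison {

    /** Scales the on-disk size by the factor, then divides by the row size (truncating). */
    public static long estimateRows(long totalFileSize, double deserFactor, long avgRowSize) {
        long dataSize = (long) (totalFileSize * deserFactor);
        return dataSize / avgRowSize;
    }

    public static void main(String[] args) {
        for (double factor : new double[] {1.0, 10.0}) {
            System.out.println("factor=" + factor + " rows=" + estimateRows(169, factor, 404));
        }
        // factor=1.0 rows=0
        // factor=10.0 rows=4
    }
}
```

Under these assumptions the estimate is no longer zero with a factor of 10.0, although it still sits below the 20 rows actually loaded.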
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297194#comment-16297194 ]

Ashutosh Chauhan commented on HIVE-18149:
---

+1
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297104#comment-16297104 ]

Hive QA commented on HIVE-18149:
---

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12902874/HIVE-18149.03.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 19 failed/errored test(s), 11528 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] (batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[smb_mapjoin_15] (batchId=168)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketizedhiveinputformat] (batchId=178)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part] (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_aggregator_error_1] (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] (batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] (batchId=113)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=209)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=223)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=226)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8322/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8322/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8322/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 19 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12902874 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297048#comment-16297048 ]

Hive QA commented on HIVE-18149:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 14s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 14s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 49s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 8s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 30s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 19m 47s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 9efed65 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8322/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16296858#comment-16296858 ]

Zoltan Haindrich commented on HIVE-18149:
---

#3)
* updated q.out-s
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295505#comment-16295505 ]

Hive QA commented on HIVE-18149:
---

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12902677/HIVE-18149.03wip02.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 41 failed/errored test(s), 11531 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table] (batchId=249)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table] (batchId=249)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join25] (batchId=72)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_12] (batchId=33)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby10] (batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_cube1] (batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_rollup1] (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input17] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input3_limit] (batchId=63)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input4] (batchId=81)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input5] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath2] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] (batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge5] (batchId=56)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge6] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat1] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat2] (batchId=83)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_gather_stats] (batchId=86)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_reduce_groupby_duplicate_cols] (batchId=35)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[bucketizedhiveinputformat] (batchId=178)
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[spark_dynamic_partition_pruning_4] (batchId=179)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part] (batchId=93)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[stats_aggregator_error_1] (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_12] (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] (batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] (batchId=113)
org.apache.hadoop.hive.cli.control.TestDanglingQOuts.checkDanglingQOut (batchId=209)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=223)
org.apache.hadoop.hive.ql.TestTxnCommandsForOrcMmTable.testInsertOverwriteWithDynamicPartition (batchId=278)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=226)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8300/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8300/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8300/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 41 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID:
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295393#comment-16295393 ]

Hive QA commented on HIVE-18149:
---

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 22s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 33s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 14s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 23s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 47s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 24s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 30s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 12s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 26s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 13s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 23m 19s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 8259022 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8300/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291929#comment-16291929 ]

Hive QA commented on HIVE-18149:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12902078/HIVE-18149.03wip01.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 36 failed/errored test(s), 11527 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table] (batchId=249)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table] (batchId=249)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_12] (batchId=33)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[fp_literal_arithmetic] (batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby10] (batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_cube1] (batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_rollup1] (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input17] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input3_limit] (batchId=63)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input4] (batchId=81)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input5] (batchId=14)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath2] (batchId=37)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[input_testxpath] (batchId=30)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[mapjoin_hook] (batchId=12)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge5] (batchId=56)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge6] (batchId=34)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat1] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[orc_merge_incompat2] (batchId=83)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ppd_join5] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_gather_stats] (batchId=86)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[bucketsortoptimize_insert_2] (batchId=152)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid] (batchId=169)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[llap_acid_fast] (batchId=160)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[quotedid_smb] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=160)
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_part] (batchId=93)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_10] (batchId=138)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[auto_sortmerge_join_12] (batchId=119)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[bucketsortoptimize_insert_7] (batchId=128)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[ppd_join5] (batchId=120)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multi] (batchId=113)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=223)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=226)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8250/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8250/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8250/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 36 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12902078 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291891#comment-16291891 ]

Hive QA commented on HIVE-18149:

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 14s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 2m 1s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 15s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 33s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 21s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 18s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 21m 2s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / e120bd8 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8250/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16285012#comment-16285012 ]

Hive QA commented on HIVE-18149:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12900887/HIVE-18149.02.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8164/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8164/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8164/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Tests exited with: ExecutionException: java.util.concurrent.ExecutionException: org.apache.hive.ptest.execution.ssh.SSHExecutionException: RSyncResult [localFile=/data/hiveptest/logs/PreCommit-HIVE-Build-8164/succeeded/205_UTBatch_service_8_tests, remoteFile=/home/hiveptest/104.198.217.87-hiveptest-0/logs/, getExitCode()=255, getException()=null, getUser()=hiveptest, getHost()=104.198.217.87, getInstance()=0]: 'Warning: Permanently added '104.198.217.87' (ECDSA) to the list of known hosts.
receiving incremental file list
./
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.TestLdapAtnProviderWithMiniDS.xml
0 0% 0.00kB/s 0:00:00
91,994 100% 2.58MB/s 0:00:00 (xfr#1, to-chk=12/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.ldap.TestCustomQueryFilter.xml
0 0% 0.00kB/s 0:00:00
87,958 100% 2.33MB/s 0:00:00 (xfr#2, to-chk=11/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.ldap.TestQuery.xml
0 0% 0.00kB/s 0:00:00
87,939 100% 1.22MB/s 0:00:00 (xfr#3, to-chk=10/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.ldap.TestUserFilter.xml
0 0% 0.00kB/s 0:00:00
87,929 100% 1.22MB/s 0:00:00 (xfr#4, to-chk=9/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.auth.ldap.TestUserSearchFilter.xml
0 0% 0.00kB/s 0:00:00
88,367 100% 837.82kB/s 0:00:00 (xfr#5, to-chk=8/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.cli.TestCLIServiceConnectionLimits.xml
0 0% 0.00kB/s 0:00:00
89,238 100% 837.95kB/s 0:00:00 (xfr#6, to-chk=7/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.cli.TestCLIServiceRestore.xml
0 0% 0.00kB/s 0:00:00
87,707 100% 823.57kB/s 0:00:00 (xfr#7, to-chk=6/14)
TEST-205_UTBatch_service_8_tests-TEST-org.apache.hive.service.cli.TestHiveSQLException.xml
0 0% 0.00kB/s 0:00:00
88,448 100% 822.62kB/s 0:00:00 (xfr#8, to-chk=5/14)
maven-test.txt
0 0% 0.00kB/s 0:00:00
6,086 100% 56.07kB/s 0:00:00 (xfr#9, to-chk=4/14)
logs/
logs/derby.log
0 0% 0.00kB/s 0:00:00
989 100% 9.11kB/s 0:00:00 (xfr#10, to-chk=1/14)
logs/hive.log
0 0% 0.00kB/s 0:00:00
35,487,744 0% 33.84MB/s 0:02:01
91,553,792 2% 43.66MB/s 0:01:32
148,144,128 3% 47.09MB/s 0:01:24
205,651,968 4% 49.04MB/s 0:01:20
262,864,896 6% 54.22MB/s 0:01:11
319,324,160 7% 54.35MB/s 0:01:10
377,126,912 8% 54.63MB/s 0:01:08
Timeout, server 104.198.217.87 not responding.
rsync: connection unexpectedly closed (391788893 bytes received so far) [receiver]
rsync error: error in rsync protocol data stream (code 12) at io.c(226) [receiver=3.1.1]
rsync: connection unexpectedly closed (904 bytes received so far) [generator]
rsync error: unexplained error (code 255) at io.c(226) [generator=3.1.1]
ssh: connect to host 104.198.217.87 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
ssh: connect to host 104.198.217.87 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
ssh: connect to host 104.198.217.87 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
ssh: connect to host 104.198.217.87 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(226) [Receiver=3.1.1]
'
{noformat}

This message is automatically generated.
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16284829#comment-16284829 ]

Hive QA commented on HIVE-18149:

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 56s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 26s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 49s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 29s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 14s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 11s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 32s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 19m 55s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / 5bbd864 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8164/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279486#comment-16279486 ]

Ashutosh Chauhan commented on HIVE-18149:

+1
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279445#comment-16279445 ]

Hive QA commented on HIVE-18149:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12900726/HIVE-18149.01.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 11 failed/errored test(s), 11509 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[runtime_skewjoin_mapjoin_spark] (batchId=54)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_character_length] (batchId=38)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[udf_octet_length] (batchId=32)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_view] (batchId=15)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[dynpart_sort_opt_vectorization] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[hybridgrace_hashjoin_2] (batchId=157)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[insert_values_orig_table_use_metadata] (batchId=165)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[sysdb] (batchId=160)
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[union_view] (batchId=110)
org.apache.hadoop.hive.ql.TestAcidOnTez.testMapJoinOnTez (batchId=224)
org.apache.hadoop.hive.ql.parse.TestReplicationScenarios.testConstraints (batchId=227)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/8117/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/8117/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-8117/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 11 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12900726 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279361#comment-16279361 ]

Hive QA commented on HIVE-18149:

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 47s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 4m 59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 11s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 26s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 19s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 2m 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 44s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 6s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 24s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 18m 43s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / fb85336 |
| Default Java | 1.8.0_111 |
| modules | C: common ql contrib itests/hive-blobstore U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8117/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16279331#comment-16279331 ]

Ashutosh Chauhan commented on HIVE-18149:

Since ORC and Parquet are the most common formats these days, bumping up this ratio makes sense: columnar formats usually compress very well, and there is additional bloat in the in-memory size on top of that.

> Stats: rownum estimation from datasize underestimates in most cases
> ---
>
> Key: HIVE-18149
> URL: https://issues.apache.org/jira/browse/HIVE-18149
> Project: Hive
> Issue Type: Sub-task
> Components: Statistics
> Reporter: Zoltan Haindrich
> Assignee: Zoltan Haindrich
> Attachments: HIVE-18149.01.patch, HIVE-18149.01wip01.patch
>
>
> rownum estimation is based on the following fact as of now:
> * datasize being used from the following sources:
> ** basicstats aggregates the loaded "on-heap" row sizes ; other readers are able to give "raw size" estimation - I've checked orc; but I'm sure others will do the same
> api docs are a bit vague about the methods purpose...
> ** if the basicstats level info is not available; the filesystem level "file-size-sums" are used as the "raw data size" ; which is multiplied by the [deserialization ratio|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L261] ; which is currently 1.
> the problem with all of this is that deser factor is 1; and that rowsize counts in the online object headers..
> example; 20 rows are loaded into a partition [columnstats_partlvl_dp.q|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L7]
> after HIVE-18108 [this explain|https://github.com/apache/hive/blob/d9924ab3e285536f7e2cc15ecbea36a78c59c66d/ql/src/test/queries/clientpositive/columnstats_partlvl_dp.q#L25] will estimate the rowsize of the table to be 404 bytes; however the 20 rows of text is only 169 bytes...so it ends up with 0 rows...
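
To make the arithmetic in the issue description concrete, here is a minimal Python sketch of a size-based row-count estimate. It is not Hive's StatsUtils code: the function name estimate_row_count and its parameters are hypothetical, integer truncation is an assumption, and the ratio of 10 in the second call is chosen only for illustration. The inputs 169 bytes, 404 bytes, and a ratio of 1 are the ones quoted in the description.

```python
def estimate_row_count(file_size_bytes: int,
                       avg_row_size_bytes: int,
                       deser_ratio: float = 1.0) -> int:
    """Hypothetical helper: estimate rows from on-disk size.

    Assumed model (not taken from Hive source):
        raw_data_size = file_size_bytes * deser_ratio
        rows          = raw_data_size // avg_row_size_bytes
    """
    if avg_row_size_bytes <= 0:
        raise ValueError("avg_row_size_bytes must be positive")
    raw_data_size = int(file_size_bytes * deser_ratio)
    # Integer division truncates, so a small file with a large
    # estimated row size collapses to zero rows.
    return raw_data_size // avg_row_size_bytes


# Numbers from the description: 20 rows of text occupy 169 bytes on disk,
# while the estimated in-memory row size is 404 bytes.
rows_ratio_1 = estimate_row_count(169, 404, deser_ratio=1.0)

# Illustrative only: a larger ratio moves the estimate away from zero.
rows_ratio_10 = estimate_row_count(169, 404, deser_ratio=10.0)

print(rows_ratio_1, rows_ratio_10)
```

Under these assumptions the estimate with a ratio of 1 is zero rows, and a larger ratio yields a small non-zero count that is still below the true 20 rows, which matches the direction of the underestimate reported in the issue.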
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278516#comment-16278516 ]

Hive QA commented on HIVE-18149:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12899660/HIVE-18149.01wip01.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 333 failed/errored test(s), 11509 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[select_dummy_source] (batchId=247)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[smb_mapjoin_1] (batchId=247)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[smb_mapjoin_2] (batchId=247)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[smb_mapjoin_3] (batchId=247)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[explain] (batchId=250)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_dynamic_partitions] (batchId=250)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_into_table] (batchId=250)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_dynamic_partitions] (batchId=250)
org.apache.hadoop.hive.cli.TestBlobstoreCliDriver.testCliDriver[insert_overwrite_table] (batchId=250)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[annotate_stats_part] (batchId=15)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[autoColumnStats_5] (batchId=41)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[autoColumnStats_5a] (batchId=53)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join_stats2] (batchId=86)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_join_stats] (batchId=48)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_12] (batchId=33)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[auto_sortmerge_join_5] (batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[binarysortable_1] (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucket_map_join_1] (batchId=65)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucket_map_join_2] (batchId=57)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucketcontext_5] (batchId=23)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[bucketmapjoin_negative3] (batchId=28)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[case_sensitivity] (batchId=67)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_join1] (batchId=71)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cbo_rp_udaf_percentile_approx_23] (batchId=35)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnarserde_create_shortcut] (batchId=66)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[columnstats_tbllvl] (batchId=8)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[combine2] (batchId=6)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[compute_stats_date] (batchId=44)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[concat_op] (batchId=73)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[correlationoptimizer5] (batchId=69)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision2] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_precision] (batchId=52)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf2] (batchId=87)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[decimal_udf] (batchId=9)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[display_colstats_tbllvl] (batchId=3)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[distinct_windowing] (batchId=11)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[distinct_windowing_no_cbo] (batchId=64)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[drop_table_with_index] (batchId=36)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[filter_cond_pushdown2] (batchId=64)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[gen_udf_example_add10] (batchId=45)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby10] (batchId=62)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_cube1] (batchId=4)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_id3] (batchId=26)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets1] (batchId=68)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets2] (batchId=25)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets3] (batchId=1)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets4] (batchId=31)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets5] (batchId=49)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[groupby_grouping_sets6] (batchId=70)
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16278468#comment-16278468 ]

Hive QA commented on HIVE-18149:

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 1s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 56s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 45s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 15s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 50s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 5s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 20s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 33s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 20s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 47s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 7s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 15m 28s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus/dev-support/hive-personality.sh |
| git revision | master / f631241 |
| Default Java | 1.8.0_111 |
| modules | C: common ql U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-8109/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268492#comment-16268492 ]

Zoltan Haindrich commented on HIVE-18149:

Unfortunately these changes are starting to stick together: because of this problem, some table stats are demoted to PARTIAL in HIVE-18108, since the estimated rowsize is greater than the whole dataset size...
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266714#comment-16266714 ]

Zoltan Haindrich commented on HIVE-18149:

Possibly an alternative option would be to estimate a deserialization factor by estimating the "online" rowsize and dividing it by an estimated "offline" rowsize...
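The ratio proposed in this comment can be sketched as follows. This is a minimal illustration in Python, not Hive's code: `estimate_deser_factor` and its parameters are hypothetical names, and the offline rowsize used here is simply the 169 bytes of text divided by the 20 rows from the issue description. A real estimator would have to derive the offline rowsize some other way, since the row count is exactly what is unknown.

```python
def estimate_deser_factor(online_row_size: float, offline_row_size: float) -> float:
    """Hypothetical helper: ratio of in-memory bytes per row to on-disk bytes per row."""
    if online_row_size <= 0 or offline_row_size <= 0:
        raise ValueError("row sizes must be positive")
    return online_row_size / offline_row_size


# Numbers from the issue description: 404 bytes estimated on-heap rowsize,
# and 169 bytes of text holding 20 rows (about 8.45 bytes per row on disk).
factor = estimate_deser_factor(404.0, 169.0 / 20)
print(f"estimated deserialization factor: {factor:.1f}")
```

For this particular text table the ratio comes out near 48, well above the factor of 1 in use at the time.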
[jira] [Commented] (HIVE-18149) Stats: rownum estimation from datasize underestimates in most cases
[ https://issues.apache.org/jira/browse/HIVE-18149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266711#comment-16266711 ]

Zoltan Haindrich commented on HIVE-18149:

I think setting {{hive.stats.deserialization.factor}} to about {{10.0}} might yield more realistic estimates; for the example in the issue description (169 bytes of text against an estimated rowsize of 404 bytes) it would estimate 4 rows, which is much better than zero rows.
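The arithmetic behind the 0-row and 4-row estimates can be checked with a short sketch. It assumes the row count is the data size times the deserialization factor, divided by the estimated rowsize and rounded down; `estimate_row_count` is a hypothetical name used for illustration, not a method of Hive's StatsUtils.

```python
import math


def estimate_row_count(data_size: int, avg_row_size: int, deser_factor: float) -> int:
    """Hypothetical helper: rows ~= floor(data_size * deser_factor / avg_row_size)."""
    if avg_row_size <= 0:
        raise ValueError("avg_row_size must be positive")
    return math.floor(data_size * deser_factor / avg_row_size)


# 169 bytes of text and a 404-byte estimated rowsize, as in the issue description.
print(estimate_row_count(169, 404, 1.0))   # factor of 1 -> 0
print(estimate_row_count(169, 404, 10.0))  # factor of 10.0 from the comment -> 4
```

Both results still fall short of the 20 rows actually loaded, which is why the factor is described only as giving more realistic estimates.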