[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809694#comment-16809694 ]

Adam Szita commented on HIVE-21509:

Thanks Slim!

> LLAP may cache corrupted column vectors and return wrong query result
>
>                 Key: HIVE-21509
>                 URL: https://issues.apache.org/jira/browse/HIVE-21509
>             Project: Hive
>          Issue Type: Bug
>          Components: llap
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>         Attachments: HIVE-21509.0.wip.patch, HIVE-21509.1.wip.patch, HIVE-21509.2.patch, HIVE-21509.3.patch, HIVE-21509.4.patch
>
>
> In some scenarios, LLAP might store column vectors in cache that are getting reused and reset just before their original content would be written.
> This is a concurrency issue and is therefore flaky. It is not easy to reproduce, but the odds of surfacing it can be improved by setting the LLAP executor and IO thread counts this way:
> * set hive.llap.daemon.num.executors=32;
> * set hive.llap.io.threadpool.size=1;
> * use the TPCDS input data of the store_sales table, with at least a few hundred thousand rows, in text format:
> {code:java}
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES ( 'field.delim'='|', 'serialization.format'='|')
> STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'{code}
> * having more splits increases the chance of the issue showing itself, so it is worth setting _set tez.grouping.min-size=1024; set tez.grouping.max-size=1024;_
> * run a query on this table: select min(ss_sold_date_sk) from store_sales;
> The first query result is correct (2450816 in my case). Repeating the query will trigger reading from the LLAP cache and produce a wrong result: 0.
> If one wants to make sure of running into this issue, place a Thread.sleep(250) at the beginning of VectorDeserializeOrcWriter#run().
> -- This message was sent by Atlassian JIRA (v7.6.3#76005)
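The race described in the issue can be illustrated with a toy model. The following is only a sketch under stated assumptions: `ToyVector` and `CacheRaceSketch` are hypothetical names, not Hive classes, and a `CountDownLatch` stands in for the `Thread.sleep(250)` delay so that the outcome is deterministic. The asynchronous writer holds a reference to a reusable vector, the producer resets that vector before the writer has read it, and the value that ends up "cached" is the reset value rather than the original content.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a reusable column vector; not a Hive class.
final class ToyVector {
    final long[] values = new long[1];

    void reset() {
        values[0] = 0L;
    }
}

final class CacheRaceSketch {
    // Returns the value the asynchronous writer observes when the producer
    // resets the shared vector before the writer has read it.
    static long cachedValueAfterEarlyReset() {
        ExecutorService writer = Executors.newSingleThreadExecutor();
        try {
            ToyVector vector = new ToyVector();
            vector.values[0] = 2450816L; // original content
            CountDownLatch producerDone = new CountDownLatch(1);

            // The writer keeps a reference to the vector, not a copy of it.
            Future<Long> cached = writer.submit(() -> {
                producerDone.await();    // models the writer starting late
                return vector.values[0]; // reads whatever is there by now
            });

            vector.reset();              // producer reuses the vector too early
            producerDone.countDown();
            return cached.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            writer.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(cachedValueAfterEarlyReset()); // prints 0
    }
}
```

In this model the writer never sees 2450816, which matches the symptom reported above: the second query reads the cached data and returns 0.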
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809294#comment-16809294 ]

slim bouguerra commented on HIVE-21509:

+1
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807209#comment-16807209 ]

Hive QA commented on HIVE-21509:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12964448/HIVE-21509.4.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 15890 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/16800/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16800/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16800/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12964448 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16807173#comment-16807173 ]

Hive QA commented on HIVE-21509:

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 2m 10s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 53s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 40s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 27s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 33s{color} | {color:blue} storage-api in master has 48 extant Findbugs warnings. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 53s{color} | {color:blue} llap-server in master has 81 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 30s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 23s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 23s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 23s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 15s{color} | {color:red} llap-server: The patch generated 1 new + 29 unchanged - 1 fixed = 30 total (was 30) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 21s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 15s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 17m 38s{color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-16800/dev-support/hive-personality.sh |
| git revision | master / 34d2bda |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.1 |
| mvninstall | http://104.198.109.242/logs//PreCommit-HIVE-Build-16800/yetus/patch-mvninstall-llap-server.txt |
| compile | http://104.198.109.242/logs//PreCommit-HIVE-Build-16800/yetus/patch-compile-llap-server.txt |
| javac | http://104.198.109.242/logs//PreCommit-HIVE-Build-16800/yetus/patch-compile-llap-server.txt |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-16800/yetus/diff-checkstyle-llap-server.txt |
| findbugs | http://104.198.109.242/logs//PreCommit-HIVE-Build-16800/yetus/patch-findbugs-llap-server.txt |
| modules | C: storage-api llap-server U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-16800/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806649#comment-16806649 ]

Hive QA commented on HIVE-21509:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12964421/HIVE-21509.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 15886 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/16793/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16793/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16793/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12964421 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16806601#comment-16806601 ]

Hive QA commented on HIVE-21509:

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 57s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 59s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 38s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 26s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 33s{color} | {color:blue} storage-api in master has 48 extant Findbugs warnings. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 49s{color} | {color:blue} llap-server in master has 81 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 30s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 23s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 23s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 23s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 15s{color} | {color:red} llap-server: The patch generated 4 new + 29 unchanged - 1 fixed = 33 total (was 30) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 22s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 15s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 17m 15s{color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-16793/dev-support/hive-personality.sh |
| git revision | master / 7bbd93f |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.1 |
| mvninstall | http://104.198.109.242/logs//PreCommit-HIVE-Build-16793/yetus/patch-mvninstall-llap-server.txt |
| compile | http://104.198.109.242/logs//PreCommit-HIVE-Build-16793/yetus/patch-compile-llap-server.txt |
| javac | http://104.198.109.242/logs//PreCommit-HIVE-Build-16793/yetus/patch-compile-llap-server.txt |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-16793/yetus/diff-checkstyle-llap-server.txt |
| findbugs | http://104.198.109.242/logs//PreCommit-HIVE-Build-16793/yetus/patch-findbugs-llap-server.txt |
| modules | C: storage-api llap-server U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-16793/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805527#comment-16805527 ]

Hive QA commented on HIVE-21509:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12964206/HIVE-21509.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 15883 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/16762/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/16762/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-16762/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12964206 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805481#comment-16805481 ]

Hive QA commented on HIVE-21509:

| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 1m 47s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 41s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 35s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 25s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 32s{color} | {color:blue} storage-api in master has 48 extant Findbugs warnings. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 49s{color} | {color:blue} llap-server in master has 81 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue} 0m 27s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:red}-1{color} | {color:red} mvninstall {color} | {color:red} 0m 22s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} compile {color} | {color:red} 0m 22s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} javac {color} | {color:red} 0m 22s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 15s{color} | {color:red} llap-server: The patch generated 4 new + 29 unchanged - 1 fixed = 33 total (was 30) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} findbugs {color} | {color:red} 0m 22s{color} | {color:red} llap-server in the patch failed. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 14s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 16m 32s{color} | {color:black} {color} |
\\ \\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-16762/dev-support/hive-personality.sh |
| git revision | master / 23ab7f2 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.1 |
| mvninstall | http://104.198.109.242/logs//PreCommit-HIVE-Build-16762/yetus/patch-mvninstall-llap-server.txt |
| compile | http://104.198.109.242/logs//PreCommit-HIVE-Build-16762/yetus/patch-compile-llap-server.txt |
| javac | http://104.198.109.242/logs//PreCommit-HIVE-Build-16762/yetus/patch-compile-llap-server.txt |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-16762/yetus/diff-checkstyle-llap-server.txt |
| findbugs | http://104.198.109.242/logs//PreCommit-HIVE-Build-16762/yetus/patch-findbugs-llap-server.txt |
| modules | C: storage-api llap-server U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-16762/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805103#comment-16805103 ]

Adam Szita commented on HIVE-21509:

I also attached a new patch, [^HIVE-21509.2.patch], which has some additional null checks and adds a test that can deterministically reproduce the issue.
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804828#comment-16804828 ]

Adam Szita commented on HIVE-21509:

[~kgyrtkirk] I think what you're suggesting is to shallow copy the refcount part too. I agree: this way, every shallow-copied instance of a CV (column vector) will track the refcount correctly, which is what we want. I updated my patch accordingly.
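A minimal sketch of the idea agreed on in this comment, under stated assumptions: `RefCountedVector`, `incRef`, `decRef` and `resetIfUnreferenced` are hypothetical names chosen for illustration and do not mirror the actual patch or Hive's `ColumnVector` API. The shallow copy shares both the data array and the counter object, so a reference taken through any copy is visible to whoever wants to reuse the vector.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical miniature of a ref-counted vector; not Hive's ColumnVector.
final class RefCountedVector {
    long[] values = new long[1];
    AtomicInteger refCount = new AtomicInteger(0);

    void incRef() {
        refCount.incrementAndGet();
    }

    void decRef() {
        refCount.decrementAndGet();
    }

    // Shallow copy: share the data array AND the counter object itself.
    void shallowCopyTo(RefCountedVector other) {
        other.values = values;
        other.refCount = refCount;
    }

    // Resets the content only if no holder still needs it.
    // Returns true when the reset actually happened.
    boolean resetIfUnreferenced() {
        if (refCount.get() > 0) {
            return false;
        }
        values[0] = 0L;
        return true;
    }
}

final class SharedRefCountSketch {
    public static void main(String[] args) {
        RefCountedVector original = new RefCountedVector();
        original.values[0] = 2450816L;
        RefCountedVector copy = new RefCountedVector();
        original.shallowCopyTo(copy);

        copy.incRef(); // a writer pins the data through the copy
        System.out.println(original.resetIfUnreferenced()); // prints false
        copy.decRef(); // the writer is done
        System.out.println(original.resetIfUnreferenced()); // prints true
    }
}
```

The check in `resetIfUnreferenced` is a check-then-act sequence and is shown single-threaded for clarity; a real implementation would need to make the check and the reuse safe against concurrent `incRef` calls.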
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801734#comment-16801734 ] Zoltan Haindrich commented on HIVE-21509: -
{code}
@@ -258,5 +279,6 @@ public void shallowCopyTo(ColumnVector otherCv) {
     otherCv.isRepeating = isRepeating;
     otherCv.preFlattenIsRepeating = preFlattenIsRepeating;
     otherCv.preFlattenNoNulls = preFlattenNoNulls;
+    otherCv.refCount.set(refCount.get());
{code}
This seems to "shallow copy" the "refCount" as well. I feel that something is not right with this: I would instead expect that "other" should either refer to the same reference counter or retain its own counter...
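The difference between copying the counter's value and sharing the counter object can be illustrated with a minimal sketch. The class and method names here (Vec, copyCountValueTo, shareCounterWith) are hypothetical stand-ins and are not taken from the patch; only the use of an atomic counter with set/get mirrors the quoted diff.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

final class RefCountCopySketch {
    /** Hypothetical stand-in for a ColumnVector that carries a reference counter. */
    static final class Vec {
        AtomicInteger refCount = new AtomicInteger();

        /** Copies only the current value; the two counters diverge afterwards. */
        void copyCountValueTo(Vec other) {
            other.refCount.set(refCount.get());
        }

        /** Makes the other vector point at the very same counter object. */
        void shareCounterWith(Vec other) {
            other.refCount = refCount;
        }
    }

    /** Returns the counts seen by {original, value copy, shared copy} after one release. */
    static int[] demo() {
        Vec original = new Vec();
        Vec byValue = new Vec();
        Vec shared = new Vec();

        original.refCount.incrementAndGet();   // a writer takes a reference
        original.copyCountValueTo(byValue);    // byValue now holds its own counter at 1
        original.shareCounterWith(shared);     // shared now uses original's counter
        original.refCount.decrementAndGet();   // the writer releases its reference

        return new int[] {
            original.refCount.get(), byValue.refCount.get(), shared.refCount.get()
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [0, 1, 0]
    }
}
```

In this sketch the value copy never observes the release and stays at 1, while the copy that shares the counter object drops to 0 together with the original.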
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801716#comment-16801716 ] Adam Szita commented on HIVE-21509: --- I've attached a proposed solution (work in progress) that would fix this problem, see [^HIVE-21509.0.wip.patch]. It adds a reference counter to ColumnVectors, which is incremented before a write begins and decremented once the CV has been passed to the actual writer. EncodedDataConsumer checks this counter and does not return CVs whose ref count is greater than zero to the object pool.
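A minimal sketch of the reference-counting idea described in this comment follows. It is not the attached patch: Vec, Pool and offerIfUnreferenced are hypothetical names, and the check-then-offer step is shown single-threaded, so a real implementation would also have to make it safe against a concurrent increment.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

final class RefCountedPoolSketch {
    /** Hypothetical stand-in for a ColumnVector with a reference counter. */
    static final class Vec {
        final AtomicInteger refCount = new AtomicInteger();
        void incRef() { refCount.incrementAndGet(); }
        void decRef() { refCount.decrementAndGet(); }
    }

    /** Hypothetical object pool that refuses vectors which are still referenced. */
    static final class Pool {
        private final ArrayDeque<Vec> free = new ArrayDeque<>();

        /** Returns true if the vector was pooled, false if it is still in use. */
        boolean offerIfUnreferenced(Vec v) {
            if (v.refCount.get() > 0) {
                return false;   // a writer still needs the original content
            }
            free.offer(v);
            return true;
        }
    }

    /** Returns {pooled while the write is pending, pooled after the write}. */
    static boolean[] demo() {
        Pool pool = new Pool();
        Vec cv = new Vec();

        cv.incRef();                                          // before handing cv to the async writer
        boolean whilePending = pool.offerIfUnreferenced(cv);  // rejected, count is 1
        cv.decRef();                                          // the writer has taken over the data
        boolean afterWrite = pool.offerIfUnreferenced(cv);    // accepted, count is 0

        return new boolean[] { whilePending, afterWrite };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(demo())); // prints [false, true]
    }
}
```

The vector only becomes eligible for reuse once the pending write has released it, which is the property the proposed counter is meant to guarantee.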
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801561#comment-16801561 ] Adam Szita commented on HIVE-21509: ---
This looks like quite a serious issue. The root cause is as follows:
# On the first execution of the query, the LLAP IO thread has to read the input text file and produce (Long)ColumnVectors (CVs) from the data, wrapped into VectorizedRowBatches (VRBs).
# These VRBs are
## passed by VectorDeserializeOrcWriter to a newly created async ORC writer thread for ORC encoding and cache persistence;
## also propagated back to the consumers, namely to OrcEncodedDataConsumer and then finally to LLAPRecordReader.
# The ORC writer thread may get to writing out the VRB (and therefore the CV) only after the following has happened:
## The IO thread created a CVB (ColumnVectorBatch) in OrcEncodedDataConsumer#decodeBatch to wrap the CV coming from the VRB, and passed the batch to LLAPRecordReader.
## LLAPRecordReader used this batch and is receiving a new one. At this point (on the Tez thread) it returns the previous CVB, offering it back to an object pool so that the next decodeBatch call may reuse it.
## The next decodeBatch call polls this reused CVB from the pool, calls CV.reset() on the CVs wrapped inside, and finally overwrites the existing data in them.
## Only now does the ORC writer thread get to writing the VRB, and therefore the very same CVs, into the cache; these have been modified in the meantime by the reuse logic of LLAPRecordReader and OrcEncodedDataConsumer.
## (I guess this is why a high executor count combined with a low IO thread count helps surface this issue: the 32 Tez threads return the used CVBs very quickly, while the single IO thread and its single ORC writer thread are outnumbered and cannot write the data out in time, before it gets corrupted.)
# Because of this, the first run displays the correct query result (LLAPRecordReader does receive all the correct CVs), but the content written into the cache is corrupted.
# The second run of the query goes directly to the cache and uses the corrupted data there, producing a wrong result.
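The interleaving described in the root-cause analysis can be replayed deterministically in a single thread. The sketch below is an illustration only, not Hive code: Vec is a hypothetical stand-in for a LongColumnVector, the pool is a plain ArrayDeque, and the steps that run on different threads in LLAP are simply executed in the problematic order. Refilling the vector with the next batch's data is omitted.

```java
import java.util.ArrayDeque;
import java.util.Arrays;

final class ReusedVectorHazard {
    /** Hypothetical stand-in for a LongColumnVector: a reusable long[] buffer. */
    static final class Vec {
        final long[] vector = new long[3];
        void reset() { Arrays.fill(vector, 0L); }
    }

    /** Replays the interleaving and returns what the delayed cache write observes. */
    static long[] cachedAfterReuse() {
        ArrayDeque<Vec> pool = new ArrayDeque<>();

        // IO thread: decode the first batch from the text input.
        Vec cv = new Vec();
        cv.vector[0] = 2450816L;
        cv.vector[1] = 2450817L;
        cv.vector[2] = 2450818L;

        // The async writer is handed a reference to cv but has not run yet.
        Vec pendingWrite = cv;

        // The reader has consumed the batch and offers it back to the pool.
        pool.offer(cv);

        // The next decode call polls the same instance and resets it.
        Vec reused = pool.poll();
        reused.reset();

        // Only now does the writer copy the data into the "cache".
        return Arrays.copyOf(pendingWrite.vector, pendingWrite.vector.length);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(cachedAfterReuse())); // prints [0, 0, 0]
    }
}
```

Because the writer and the pool hold the same instance, the delayed write sees the reset buffer rather than the values that were originally decoded.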
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801560#comment-16801560 ] Ivan Suller commented on HIVE-21509: [~kgyrtkirk] it is possible. I already closed the ticket tracking that issue, because I couldn't reproduce it anymore. But if it is a cache issue this is expected.
[jira] [Commented] (HIVE-21509) LLAP may cache corrupted column vectors and return wrong query result
[ https://issues.apache.org/jira/browse/HIVE-21509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801542#comment-16801542 ] Zoltan Haindrich commented on HIVE-21509: - cc: [~isuller] this could be the same issue you have been running into?