[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545875#comment-16545875 ] Vihang Karajgaonkar commented on HIVE-19668: Patch merged into branch-3 as well. Thanks for your contribution [~mi...@cloudera.com] > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Fix For: 4.0.0, 3.2.0 > > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > HIVE-19668.03.patch, HIVE-19668.04.patch, HIVE-19668.05.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545833#comment-16545833 ] Vihang Karajgaonkar commented on HIVE-19668: The general convention is to fix newly introduced check-style errors. But I have seen other devs ignoring the check-style issues especially since precommit takes a long time to come back. I am not sure if there is a way to run yetus style checks locally. Since you have already tried to fix it and there are no code changes in the latest patch, I can commit v5 of the patch attached. If the precommit comes back with style issues we can commit an addendum patch later. > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > HIVE-19668.03.patch, HIVE-19668.04.patch, HIVE-19668.05.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545643#comment-16545643 ] Misha Dmitriev commented on HIVE-19668: --- [~vihangk1] yes, the previous patch passed all tests, but looks like there were some checkstyle problems with it. Unfortunately, the checkstyle report is gone, so I've just fixed several lines which I think could be problematic, and resubmitted the patch. Are these checkstyle issues considered important? > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > HIVE-19668.03.patch, HIVE-19668.04.patch, HIVE-19668.05.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16545629#comment-16545629 ] Vihang Karajgaonkar commented on HIVE-19668: Hi [~mi...@cloudera.com] The previous patch showed green run. Looks like you are still actively working on the patch. I will wait if this is still a work in progress. > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > HIVE-19668.03.patch, HIVE-19668.04.patch, HIVE-19668.05.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544209#comment-16544209 ] Hive QA commented on HIVE-19668: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12931615/HIVE-19668.04.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:green}SUCCESS:{color} +1 due to 14649 tests passed Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12606/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12606/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12606/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.YetusPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase {noformat} This message is automatically generated. ATTACHMENT ID: 12931615 - PreCommit-HIVE-Build > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > HIVE-19668.03.patch, HIVE-19668.04.patch, image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544178#comment-16544178 ] Hive QA commented on HIVE-19668: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 1s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 24s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 44s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 3m 52s{color} | {color:blue} ql in master has 2291 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 54s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 21s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 45s{color} | {color:red} ql: The patch generated 4 new + 773 unchanged - 0 fixed = 777 total (was 773) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 8s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 22m 48s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Optional Tests | asflicense javac javadoc findbugs checkstyle compile | | uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux | | Build tool | maven | | Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12606/dev-support/hive-personality.sh | | git revision | master / 1b5903b | | Default Java | 1.8.0_111 | | findbugs | v3.0.0 | | checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-12606/yetus/diff-checkstyle-ql.txt | | modules | C: ql U: ql | | Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12606/yetus.txt | | Powered by | Apache Yetushttp://yetus.apache.org | This message was automatically generated. > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > HIVE-19668.03.patch, HIVE-19668.04.patch, image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543655#comment-16543655 ] Vihang Karajgaonkar commented on HIVE-19668: +1 > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > HIVE-19668.03.patch, image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542861#comment-16542861 ] Hive QA commented on HIVE-19668: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12931396/HIVE-19668.03.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:green}SUCCESS:{color} +1 due to 14650 tests passed Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12579/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12579/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12579/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.YetusPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase {noformat} This message is automatically generated. ATTACHMENT ID: 12931396 - PreCommit-HIVE-Build > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > HIVE-19668.03.patch, image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542792#comment-16542792 ] Hive QA commented on HIVE-19668: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | {color:red} patch {color} | {color:red} 0m 10s{color} | {color:red} /data/hiveptest/logs/PreCommit-HIVE-Build-12579/patches/PreCommit-HIVE-Build-12579.patch does not apply to master. Rebase required? Wrong Branch? See http://cwiki.apache.org/confluence/display/Hive/HowToContribute for help. {color} | \\ \\ || Subsystem || Report/Notes || | Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12579/yetus.txt | | Powered by | Apache Yetushttp://yetus.apache.org | This message was automatically generated. > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > HIVE-19668.03.patch, image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542265#comment-16542265 ] Misha Dmitriev commented on HIVE-19668: --- Thank you for checking, [~vihangk1] [~aihuaxu] and [~stakiar]. In the end, it turns out that at least some failures are reproducible locally, and my changes are responsible. Not allĀ {{CommonToken}}s can be madeĀ {{ImmutableToken}}s, because for some of them the type may be rewritten in some special operators later. I've already found one such type in the past, and now eliminating others. Will post the updated patch once I am done. > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542258#comment-16542258 ] Sahil Takiar commented on HIVE-19668: - [~mi...@cloudera.com] unless you can re-produce the test failures locally, its unlikely the failed tests are related to your patch. If you can re-produce them locally, let me know the stack-trace and I can help you debug. Otherwise, can you re-base the patch and post and updated version? This will re-trigger Hive QA and re-run all the tests. > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528883#comment-16528883 ] Misha Dmitriev commented on HIVE-19668: --- [~aihuaxu] I've checked the logs of failed tests, but couldn't find anything obviously related to my changes. However, my experience with Hive is not deep at all. May I ask you to check these logs before they disappear? > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528387#comment-16528387 ] Hive QA commented on HIVE-19668: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12929477/HIVE-19668.02.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 14632 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_view_delete] (batchId=35) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_multiinsert] (batchId=87) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_unqual_corr_expr] (batchId=8) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multiinsert] (batchId=145) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12256/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12256/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12256/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.YetusPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 4 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12929477 - PreCommit-HIVE-Build > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16528368#comment-16528368 ] Hive QA commented on HIVE-19668: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 1s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 47s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 4m 6s{color} | {color:blue} ql in master has 2287 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 57s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 24s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 45s{color} | {color:red} ql: The patch generated 4 new + 725 unchanged - 0 fixed = 729 total (was 725) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 24m 6s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Optional Tests | asflicense javac javadoc findbugs checkstyle compile | | uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux | | Build tool | maven | | Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12256/dev-support/hive-personality.sh | | git revision | master / 7eac7f6 | | Default Java | 1.8.0_111 | | findbugs | v3.0.0 | | checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-12256/yetus/diff-checkstyle-ql.txt | | modules | C: ql U: ql | | Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12256/yetus.txt | | Powered by | Apache Yetushttp://yetus.apache.org | This message was automatically generated. > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new >
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526593#comment-16526593 ] Aihua Xu commented on HIVE-19668: - The patch looks good to me. +1 pending tests. > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, HIVE-19668.02.patch, > image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16525477#comment-16525477 ] Aihua Xu commented on HIVE-19668: - [~mi...@cloudera.com] The patch looks good to me. There are some style issues like missing apache header not from your change. Can you fix those issues? > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502638#comment-16502638 ] Hive QA commented on HIVE-19668: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12926217/HIVE-19668.01.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 14466 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_view_delete] (batchId=35) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_multiinsert] (batchId=87) org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[subquery_unqual_corr_expr] (batchId=8) org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver[subquery_multiinsert] (batchId=145) {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/11536/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/11536/console Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-11536/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.YetusPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 4 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12926217 - PreCommit-HIVE-Build > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where such objects are > apparently repeatedly created with e.g. {{new > CommonToken(HiveParser.TOK_INSERT, "TOK_INSERT")}} If these 33 token kinds > are instead created once and reused, we will save more than 1/10th of the > heap in this scenario. Plus, since these objects are small but very numerous, > getting rid of them will remove a gread deal of pressure from the GC. > Another source of waste are duplicate strings, that collectively waste 26.1% > of memory. Some of them come from CommonToken objects that have the same text > (i.e. for multiple CommonToken objects the contents of their 'text' Strings > are the same, but each has its own copy of that String). Other duplicate > strings come from other sources, that are easy enough to fix by adding > String.intern() calls. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19668) Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and duplicate strings
[ https://issues.apache.org/jira/browse/HIVE-19668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502604#comment-16502604 ] Hive QA commented on HIVE-19668: | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 6m 42s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 56s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} master passed {color} | | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 3m 19s{color} | {color:blue} ql in master has 2280 extant Findbugs warnings. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 48s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 13s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} | | {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 39s{color} | {color:red} ql: The patch generated 5 new + 720 unchanged - 0 fixed = 725 total (was 720) {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 3m 30s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 50s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} asflicense {color} | {color:red} 0m 11s{color} | {color:red} The patch generated 2 ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black} 20m 15s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Optional Tests | asflicense javac javadoc findbugs checkstyle compile | | uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux | | Build tool | maven | | Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-11536/dev-support/hive-personality.sh | | git revision | master / afc5fa4 | | Default Java | 1.8.0_111 | | findbugs | v3.0.0 | | checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-11536/yetus/diff-checkstyle-ql.txt | | asflicense | http://104.198.109.242/logs//PreCommit-HIVE-Build-11536/yetus/patch-asflicense-problems.txt | | modules | C: ql U: ql | | Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-11536/yetus.txt | | Powered by | Apache Yetushttp://yetus.apache.org | This message was automatically generated. > Over 30% of the heap wasted by duplicate org.antlr.runtime.CommonToken's and > duplicate strings > -- > > Key: HIVE-19668 > URL: https://issues.apache.org/jira/browse/HIVE-19668 > Project: Hive > Issue Type: Improvement > Components: HiveServer2 >Affects Versions: 3.0.0 >Reporter: Misha Dmitriev >Assignee: Misha Dmitriev >Priority: Major > Attachments: HIVE-19668.01.patch, image-2018-05-22-17-41-39-572.png > > > I've recently analyzed a HS2 heap dump, obtained when there was a huge memory > spike during compilation of some big query. The analysis was done with jxray > ([www.jxray.com).|http://www.jxray.com)./] It turns out that more than 90% of > the 20G heap was used by data structures associated with query parsing > ({{org.apache.hadoop.hive.ql.parse.QBExpr}}). There are probably multiple > opportunities for optimizations here. One of them is to stop the code from > creating duplicate instances of {{org.antlr.runtime.CommonToken}} class. See > a sample of these objects in the attached image: > !image-2018-05-22-17-41-39-572.png|width=879,height=399! > Looks like these particular {{CommonToken}} objects are constants, that don't > change once created. I see some code, e.g. in > {{org.apache.hadoop.hive.ql.parse.CalcitePlanner}}, where