[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560102#comment-16560102 ]

Szehon Ho commented on HIVE-20153:
----------------------------------

Thanks Aihua for the fix. Yes, I can test it. I am out of town at the moment, so I need to wait until I get back, and I hope I can do it sometime next week. If you don't want to wait, feel free to go ahead; I can post my findings afterwards.

> Count and Sum UDF consume more memory in Hive 2+
>
>                 Key: HIVE-20153
>                 URL: https://issues.apache.org/jira/browse/HIVE-20153
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 2.3.2
>            Reporter: Szehon Ho
>            Assignee: Aihua Xu
>            Priority: Major
>         Attachments: HIVE-20153.1.patch, Screen Shot 2018-07-12 at 6.41.28 PM.png
>
> While playing with Hive2, we noticed that queries with a lot of count() and sum() aggregations run out of memory on the Hadoop side, where they worked before in Hive1.
> In many queries, we have to double the mapper memory settings (in our particular case mapreduce.map.java.opts from -Xmx2000M to -Xmx4000M), which makes it not so easy to upgrade to Hive 2.
> Taking a heap dump, we see that one of the main culprits is the field 'uniqueObjects' in GenericUDAFSum and GenericUDAFCount, which was added to support window functions.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559163#comment-16559163 ]

Hive QA commented on HIVE-20153:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933260/HIVE-20153.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 14812 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12886/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12886/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12886/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12933260 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559161#comment-16559161 ]

Hive QA commented on HIVE-20153:
--------------------------------

| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 58m 0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 3s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 3m 57s{color} | {color:blue} ql in master has 2296 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 56s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 4m 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 59s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 14s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 73m 42s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests | asflicense javac javadoc findbugs checkstyle compile |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12886/dev-support/hive-personality.sh |
| git revision | master / 1ad4882 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12886/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558917#comment-16558917 ]

Gopal V commented on HIVE-20153:
--------------------------------

LGTM - +1 tests pending.

This extra field is still taking up meaningful amounts of memory for the objects in the heap.

From JOL.

{code}
* 64-bit VM: **
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum$GenericUDAFSumEvaluator$SumAgg object internals:
 OFFSET  SIZE  TYPE               DESCRIPTION              VALUE
      0    16                     (object header)          N/A
     16     1  boolean            SumAgg.empty             N/A
     17     7                     (alignment/padding gap)
     24     8  java.lang.Object   SumAgg.sum               N/A
     32     8  java.util.HashSet  SumAgg.uniqueObjects     N/A
Instance size: 40 bytes
Space losses: 7 bytes internal + 0 bytes external = 7 bytes total
...
* 64-bit VM, compressed references enabled: ***
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum$GenericUDAFSumEvaluator$SumAgg object internals:
 OFFSET  SIZE  TYPE               DESCRIPTION              VALUE
      0    12                     (object header)          N/A
     12     1  boolean            SumAgg.empty             N/A
     13     3                     (alignment/padding gap)
     16     4  java.lang.Object   SumAgg.sum               N/A
     20     4  java.util.HashSet  SumAgg.uniqueObjects     N/A
Instance size: 24 bytes
Space losses: 3 bytes internal + 0 bytes external = 3 bytes total
{code}

A PTF-specific sub-class would remove that part, and let me think of a way of having a SumAggEmpty class (the "which class is it" goes into the 12-byte object header).
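The sub-class idea in the comment above could be read roughly as follows. This is only a sketch under stated assumptions, not the contents of HIVE-20153.1.patch: the names `SumAggSplitSketch` and `PtfSumAgg`, the `double` sum (the real buffer holds its sum as an object), and the de-duplicating `add` are all hypothetical, since the thread does not show how `uniqueObjects` is actually used.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; these are not the Hive classes.
class SumAggSplitSketch {

  /** Buffer for ordinary GROUP BY sums: it has no uniqueObjects field at all. */
  static class SumAgg {
    boolean empty = true;
    double sum; // assumption: the real buffer stores the sum as an object

    void add(double value) {
      sum += value;
      empty = false;
    }
  }

  /** PTF-specific variant (hypothetical name): only these instances allocate a set. */
  static class PtfSumAgg extends SumAgg {
    final Set<Double> uniqueObjects = new HashSet<>();

    @Override
    void add(double value) {
      // Assumed behaviour: a value contributes to the sum only the first time it is seen.
      if (uniqueObjects.add(value)) {
        super.add(value);
      }
    }
  }

  public static void main(String[] args) {
    SumAgg plain = new SumAgg();
    plain.add(2.0);
    plain.add(2.0);

    SumAgg windowed = new PtfSumAgg();
    windowed.add(2.0);
    windowed.add(2.0);
    windowed.add(3.0);

    System.out.println("plain sum = " + plain.sum);
    System.out.println("PTF sum   = " + windowed.sum);
  }
}
```

In a split like this, most of the saving would likely come from never allocating a set for plain buffers; whether the buffer object itself gets smaller depends on field alignment in the JVM in use.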
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544665#comment-16544665 ]

Aihua Xu commented on HIVE-20153:
---------------------------------

[~gopalv] If you want to take a look, feel free to take it. Otherwise, I will be on PTO for a week and will investigate after that.
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543672#comment-16543672 ]

Aihua Xu commented on HIVE-20153:
---------------------------------

Yes. I'm able to download it.
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542758#comment-16542758 ]

Szehon Ho commented on HIVE-20153:
----------------------------------

Hello Aihua, nice to see you too, thanks for looking at it! Yes, in fact they are all hashmaps with 0 items.

I can't get jxray to work on Mac, but I shared the heap dump on my Drive; does it work? [https://drive.google.com/open?id=1nKe43ybfgEEe0yQvtsyQPVyxghGa5X2A]
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542550#comment-16542550 ]

Gopal V commented on HIVE-20153:
--------------------------------

From a quick look, it looks like they are hashmaps with 0 items.

{code}
@Override
public void reset(AggregationBuffer agg) throws HiveException {
  ((CountAgg) agg).value = 0;
  ((CountAgg) agg).uniqueObjects = new HashSet();
}
{code}
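The `reset` quoted above allocates a fresh `HashSet` for every buffer, including buffers that never need one. A minimal sketch of one alternative follows, assuming it is acceptable to leave the field `null` until first use. `LazyCountAggSketch`, its `CountAgg`, and the `count`/`countDistinct` helpers are hypothetical stand-ins rather than Hive's classes, and this is not a description of the attached patch.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; these are not the Hive classes.
class LazyCountAggSketch {

  /** Stand-in for the aggregation buffer. */
  static class CountAgg {
    long value;
    // Stays null until a distinct value actually has to be remembered.
    Set<Object> uniqueObjects;
  }

  /** Same role as the quoted reset(), but drops the set instead of allocating a new one. */
  static void reset(CountAgg agg) {
    agg.value = 0;
    agg.uniqueObjects = null;
  }

  /** Plain count: never touches uniqueObjects, so no set is ever created. */
  static void count(CountAgg agg) {
    agg.value++;
  }

  /** Distinct-style count (assumed usage): allocate the set on first use only. */
  static void countDistinct(CountAgg agg, Object key) {
    if (agg.uniqueObjects == null) {
      agg.uniqueObjects = new HashSet<>();
    }
    if (agg.uniqueObjects.add(key)) {
      agg.value++;
    }
  }

  public static void main(String[] args) {
    CountAgg plain = new CountAgg();
    reset(plain);
    count(plain);
    count(plain);
    System.out.println("plain count = " + plain.value
        + ", set allocated = " + (plain.uniqueObjects != null));

    CountAgg distinct = new CountAgg();
    reset(distinct);
    countDistinct(distinct, "a");
    countDistinct(distinct, "a");
    countDistinct(distinct, "b");
    System.out.println("distinct count = " + distinct.value
        + ", set allocated = " + (distinct.uniqueObjects != null));
  }
}
```

The trade-off is a null check on the distinct path in exchange for no per-buffer allocation on the plain path.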
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542161#comment-16542161 ]

Aihua Xu commented on HIVE-20153:
---------------------------------

[~szehon] Nice to see you again. :) I will take a look. Do you have the full heap dump? If it's too big, you may try to use http://www.jxray.com/ to generate a small file.
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541929#comment-16541929 ]

Szehon Ho commented on HIVE-20153:
----------------------------------

[~aihuaxu] do you think there is some way to improve this? (I haven't yet looked at this code closely enough to understand it deeply.) It seems to consume memory whether or not it is used in a window function.

The query is something like (generalizing the table):

select count(distinct), count(), count(), count(), min(), min(), max(), max(), min(), max() from table group by field;

I also attach the heap dump of a mapper that was killed with OOM for reference. There are 3 million GenericUDAFCountEvaluator, each with a hashmap; I also don't know if that is weird or not.

!Screen Shot 2018-07-12 at 6.41.28 PM.png!

> Count and Sum UDF consume more memory in Hive 2+
>
>                 Key: HIVE-20153
>                 URL: https://issues.apache.org/jira/browse/HIVE-20153
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 2.3.2
>            Reporter: Szehon Ho
>            Priority: Major
>         Attachments: Screen Shot 2018-07-12 at 6.41.28 PM.png
>
> While playing with Hive2, we noticed that queries with a lot of count() and sum() aggregations run out of memory on the Hadoop side much faster than in Hive1. In many queries, we have to double the memory.
>
> Taking a heap dump, we see that one of the main culprits is the field 'uniqueObjects' in GenericUDAFSum and GenericUDAFCount, which was added to support window functions.
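For a rough sense of scale, the 3 million figure above can be turned into a back-of-the-envelope estimate. The per-object sizes below are assumptions, not measurements from this heap dump: about 16 bytes for an empty `HashSet` plus about 48 bytes for its backing `HashMap`, which are the shallow sizes commonly reported for a 64-bit HotSpot JVM with compressed references. The class and method names are hypothetical.

```java
// Back-of-the-envelope estimate only; the per-object sizes are assumptions.
class EmptySetOverheadEstimate {

  /** Total bytes held by a given number of empty sets of a given size. */
  static long estimateBytes(long buffers, long bytesPerEmptySet) {
    return Math.multiplyExact(buffers, bytesPerEmptySet);
  }

  public static void main(String[] args) {
    long buffers = 3_000_000L;   // count reported in the comment above
    long hashSetShell = 16L;     // assumption: empty HashSet, compressed references
    long backingHashMap = 48L;   // assumption: its empty backing HashMap

    long total = estimateBytes(buffers, hashSetShell + backingHashMap);
    System.out.println("approx. " + (total / 1_000_000L) + " MB held by empty sets");
  }
}
```

Under these assumptions the empty sets alone would account for roughly 192 MB of a mapper heap that was originally capped at -Xmx2000M, before counting the buffers themselves.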
[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+
[ https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541927#comment-16541927 ]

Sahil Takiar commented on HIVE-20153:
-------------------------------------

CC: [~aihuaxu]