[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-27 Thread Szehon Ho (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560102#comment-16560102
 ] 

Szehon Ho commented on HIVE-20153:
--

Thanks Aihua for the fix.  Yes, I can test it. I am out of town at the moment, 
so I need to wait until I get back, and I hope to do it sometime next week.  If 
you don't want to wait, feel free to go ahead; I can comment with my findings 
afterwards.

> Count and Sum UDF consume more memory in Hive 2+
>
> Key: HIVE-20153
> URL: https://issues.apache.org/jira/browse/HIVE-20153
> Project: Hive
> Issue Type: Bug
> Components: UDF
> Affects Versions: 2.3.2
> Reporter: Szehon Ho
> Assignee: Aihua Xu
> Priority: Major
> Attachments: HIVE-20153.1.patch, Screen Shot 2018-07-12 at 6.41.28 PM.png
>
> While playing with Hive 2, we noticed that queries with a lot of count() and 
> sum() aggregations run out of memory on the Hadoop side where they worked 
> before in Hive 1. 
> In many queries, we have to double the mapper memory settings (in our 
> particular case mapreduce.map.java.opts from -Xmx2000M to -Xmx4000M), which 
> makes it harder to upgrade to Hive 2.
> Taking a heap dump, we see that one of the main culprits is the field 
> 'uniqueObjects' in GenericUDAFSum and GenericUDAFCount, which was added to 
> support window functions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-26 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559163#comment-16559163
 ] 

Hive QA commented on HIVE-20153:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12933260/HIVE-20153.1.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:green}SUCCESS:{color} +1 due to 14812 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/12886/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/12886/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-12886/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12933260 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-26 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559161#comment-16559161
 ] 

Hive QA commented on HIVE-20153:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 58m  0s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  3s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 40s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 57s{color} | {color:blue} ql in master has 2296 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 56s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m  5s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m  4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 59s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 14s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 73m 42s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-12886/dev-support/hive-personality.sh |
| git revision | master / 1ad4882 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| modules | C: ql U: ql |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-12886/yetus.txt |
| Powered by | Apache Yetus http://yetus.apache.org |


This message was automatically generated.





[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-26 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558917#comment-16558917
 ] 

Gopal V commented on HIVE-20153:


LGTM - +1 tests pending.

This extra field is still taking up meaningful amounts of memory for the 
objects in the heap. 

From JOL:

{code}
* 64-bit VM: **
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum$GenericUDAFSumEvaluator$SumAgg object internals:
 OFFSET  SIZE                TYPE DESCRIPTION               VALUE
      0    16                     (object header)           N/A
     16     1             boolean SumAgg.empty              N/A
     17     7                     (alignment/padding gap)
     24     8    java.lang.Object SumAgg.sum                N/A
     32     8   java.util.HashSet SumAgg.uniqueObjects      N/A
Instance size: 40 bytes
Space losses: 7 bytes internal + 0 bytes external = 7 bytes total
...
* 64-bit VM, compressed references enabled: ***
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum$GenericUDAFSumEvaluator$SumAgg object internals:
 OFFSET  SIZE                TYPE DESCRIPTION               VALUE
      0    12                     (object header)           N/A
     12     1             boolean SumAgg.empty              N/A
     13     3                     (alignment/padding gap)
     16     4    java.lang.Object SumAgg.sum                N/A
     20     4   java.util.HashSet SumAgg.uniqueObjects      N/A
Instance size: 24 bytes
Space losses: 3 bytes internal + 0 bytes external = 3 bytes total
{code}

A PTF-specific subclass would remove that part. Let me think of a way of 
having a SumAggEmpty class (the "which class is it" information goes into the 
12-byte object header).
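
The subclass idea above can be sketched roughly as follows. This is a minimal illustration, not the attached patch and not Hive's actual classes: SumAggSketch, DistinctSumAgg, and newBuffer are hypothetical names, and the buffers sum plain long values rather than Hive objects. The point is only that the plain buffer carries no set field, so it pays neither for the reference slot nor for an empty HashSet.

```java
import java.util.HashSet;
import java.util.Set;

public class SumAggSketch {
    /** Plain GROUP BY buffer: no distinct tracking, so no set field at all. */
    static class SumAgg {
        boolean empty = true;
        long sum;

        void add(long v) {
            sum += v;
            empty = false;
        }
    }

    /** Hypothetical windowing (PTF) variant that carries the distinct-tracking set. */
    static class DistinctSumAgg extends SumAgg {
        final Set<Long> uniqueObjects = new HashSet<>();

        @Override
        void add(long v) {
            // Only values not seen before contribute to the sum.
            if (uniqueObjects.add(v)) {
                super.add(v);
            }
        }
    }

    /** Picks the buffer class once, based on whether distinct tracking is needed. */
    static SumAgg newBuffer(boolean distinctWindow) {
        return distinctWindow ? new DistinctSumAgg() : new SumAgg();
    }

    public static void main(String[] args) {
        SumAgg plain = newBuffer(false);
        SumAgg distinct = newBuffer(true);
        for (long v : new long[] {1, 2, 2, 3}) {
            plain.add(v);
            distinct.add(v);
        }
        System.out.println("plain sum = " + plain.sum);
        System.out.println("distinct sum = " + distinct.sum);
    }
}
```

Because the choice of class is recorded in the object header, the common non-windowing path needs no extra field to say "no distinct set here".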



[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-15 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544665#comment-16544665
 ] 

Aihua Xu commented on HIVE-20153:
-

[~gopalv] If you want to take a look, feel free to take it. Otherwise, I will 
be on PTO for a week and will investigate after that.



[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-13 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16543672#comment-16543672
 ] 

Aihua Xu commented on HIVE-20153:
-

Yes. I'm able to download it. 



[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-13 Thread Szehon Ho (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542758#comment-16542758
 ] 

Szehon Ho commented on HIVE-20153:
--

Hello Aihua, nice to see you too, thanks for looking at it! 

Yes, in fact they are all hashmaps with 0 items.

I can't get jxray to work on Mac, but I shared the heap dump on my Drive. Does 
it work?  

[https://drive.google.com/open?id=1nKe43ybfgEEe0yQvtsyQPVyxghGa5X2A]



[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-12 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542550#comment-16542550
 ] 

Gopal V commented on HIVE-20153:


From a quick look, it looks like they are hashmaps with 0 items.

{code}
@Override
public void reset(AggregationBuffer agg) throws HiveException {
  ((CountAgg) agg).value = 0;
  ((CountAgg) agg).uniqueObjects = new HashSet();
}
{code}
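
One way to avoid allocating an empty set on every reset is to create it lazily. The sketch below assumes that approach purely for illustration; it is not taken from the attached patch, and LazyCountAggSketch, count, and countDistinct are hypothetical names rather than Hive APIs.

```java
import java.util.HashSet;
import java.util.Set;

public class LazyCountAggSketch {
    /** Simplified stand-in for the CountAgg buffer quoted above. */
    static class CountAgg {
        long value;
        Set<Object> uniqueObjects; // stays null until a distinct count needs it
    }

    /** Resets the buffer without allocating a new HashSet. */
    static void reset(CountAgg agg) {
        agg.value = 0;
        agg.uniqueObjects = null;
    }

    /** Plain count: never touches the set. */
    static void count(CountAgg agg) {
        agg.value++;
    }

    /** Distinct count: allocates the set on first use only. */
    static void countDistinct(CountAgg agg, Object v) {
        if (agg.uniqueObjects == null) {
            agg.uniqueObjects = new HashSet<>();
        }
        if (agg.uniqueObjects.add(v)) {
            agg.value++;
        }
    }

    public static void main(String[] args) {
        CountAgg plain = new CountAgg();
        reset(plain);
        count(plain);
        count(plain);
        System.out.println("plain count = " + plain.value
            + ", set allocated = " + (plain.uniqueObjects != null));

        CountAgg distinct = new CountAgg();
        reset(distinct);
        countDistinct(distinct, "a");
        countDistinct(distinct, "a");
        countDistinct(distinct, "b");
        System.out.println("distinct count = " + distinct.value);
    }
}
```

With this shape the buffer still carries the reference field shown in the JOL output, but a plain count() never pays for the HashSet object itself.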



[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-12 Thread Aihua Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16542161#comment-16542161
 ] 

Aihua Xu commented on HIVE-20153:
-

[~szehon] Nice to see you again. :) I will take a look. Do you have the full 
heap dump? If it's too big, you may try to use http://www.jxray.com/ to 
generate a small file.



[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-12 Thread Szehon Ho (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541929#comment-16541929
 ] 

Szehon Ho commented on HIVE-20153:
--

[~aihuaxu] do you think there is some way to improve this?  (I haven't yet 
looked at this code closely enough to understand it deeply.)  It seems to 
consume memory whether or not it is used in a window function.

The query is something like (generalizing the table):

select count(distinct), count(), count(), count(), min(), min(), max(), max(), 
min(), max() from table group by field;

For reference, I also attach the heap dump of a mapper that was killed with an 
OOM. There are 3 million GenericUDAFCountEvaluator objects, each with a 
hashmap; I don't know whether that is unusual or not.
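
To get a feel for the scale, here is a back-of-the-envelope estimate of what 3 million empty sets could cost. The per-object sizes are assumptions, not measurements from this heap dump: roughly 16 bytes for an empty java.util.HashSet plus 48 bytes for its backing HashMap on a 64-bit HotSpot JDK 8 with compressed references. EmptySetOverheadEstimate and estimateBytes are hypothetical names.

```java
public class EmptySetOverheadEstimate {
    // Assumed shallow sizes (64-bit HotSpot JDK 8, compressed references); not measured here.
    static final long HASH_SET_BYTES = 16;
    static final long BACKING_HASH_MAP_BYTES = 48;

    /** Total bytes held by bufferCount empty sets of the given per-set size. */
    static long estimateBytes(long bufferCount, long bytesPerEmptySet) {
        return Math.multiplyExact(bufferCount, bytesPerEmptySet);
    }

    public static void main(String[] args) {
        long buffers = 3_000_000L; // count reported in the comment above
        long perSet = HASH_SET_BYTES + BACKING_HASH_MAP_BYTES;
        long total = estimateBytes(buffers, perSet);
        System.out.printf("%d empty sets x %d bytes = %d bytes (~%d MiB)%n",
            buffers, perSet, total, total / (1024 * 1024));
    }
}
```

Under those assumptions the empty sets alone account for about 192 MB, which is a noticeable share of a 2000 MB mapper heap even before any data is aggregated.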

 

 

!Screen Shot 2018-07-12 at 6.41.28 PM.png!

 

> Count and Sum UDF consume more memory in Hive 2+
>
> Key: HIVE-20153
> URL: https://issues.apache.org/jira/browse/HIVE-20153
> Project: Hive
> Issue Type: Bug
> Components: UDF
> Affects Versions: 2.3.2
> Reporter: Szehon Ho
> Priority: Major
> Attachments: Screen Shot 2018-07-12 at 6.41.28 PM.png
>
> While playing with Hive 2, we noticed that queries with a lot of count() and 
> sum() aggregations run out of memory on the Hadoop side much faster than in 
> Hive 1.  In many queries, we have to double the memory.
>
> Taking a heap dump, we see that one of the main culprits is the field 
> 'uniqueObjects' in GenericUDAFSum and GenericUDAFCount, which was added to 
> support window functions.





[jira] [Commented] (HIVE-20153) Count and Sum UDF consume more memory in Hive 2+

2018-07-12 Thread Sahil Takiar (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541927#comment-16541927
 ] 

Sahil Takiar commented on HIVE-20153:
-

CC: [~aihuaxu]
