[jira] [Updated] (HIVE-16079) HS2: high memory pressure due to duplicate Properties objects

Misha Dmitriev (JIRA) Wed, 01 Mar 2017 12:41:47 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Misha Dmitriev updated HIVE-16079:
----------------------------------
    Description: 
I've created a Hive table with 2000 partitions, each backed by two files, with 
one row in each file. When I execute some number of concurrent queries against 
this table, e.g. as follows

{code}
for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:10000 -n admin -p admin -e "select count(i_f_1) from misha_table;" & done
{code}

it results in a big memory spike. With 20 queries I caused an OOM in an HS2 
server running with -Xmx200m, and with 50 queries in one running with -Xmx500m.

I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
that was generated in the 50queries/500m heap scenario. It suggests that there 
are several opportunities to reduce memory pressure with fairly non-invasive 
changes to the code. One (duplicate strings) has been addressed in 
https://issues.apache.org/jira/browse/HIVE-15882. In this ticket, I am going to 
address the fact that almost 20% of memory is used by instances of 
java.util.Properties. These objects are heavily duplicated, since for each 
partition each concurrently running query creates its own copy of Partition, 
PartitionDesc and Properties. Thus we have nearly 100,000 (50 queries * 2,000 
partitions) Properties in memory. By interning/deduplicating these objects we 
may be able to save perhaps 15% of memory.
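
For illustration only, interning Properties could look roughly like the sketch 
below. The class name PropertiesInternerSketch is hypothetical and this is not 
the code from the patch. It assumes canonical objects are never mutated after 
being interned, and it holds strong references, so a real implementation would 
probably want a weak interner instead.

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch: returns one canonical instance for every group of
 * Properties objects with equal contents. Not taken from the Hive code base.
 */
final class PropertiesInternerSketch {
    // Properties.equals/hashCode compare contents, so equal tables collide here.
    // Assumption: interned objects are never mutated while they are in the pool.
    private final ConcurrentHashMap<Properties, Properties> pool = new ConcurrentHashMap<>();

    /** Returns the canonical instance equal to p, registering p if none exists yet. */
    Properties intern(Properties p) {
        Properties existing = pool.putIfAbsent(p, p);
        return existing != null ? existing : p;
    }

    /** Number of distinct Properties contents seen so far. */
    int size() {
        return pool.size();
    }
}
```

With such an interner, the 50 copies created for one partition by 50 concurrent 
queries would collapse to a single shared instance, provided nothing writes to 
it.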

Note, however, that if there are queries that mutate partitions, the 
corresponding Properties would be mutated as well. Thus we cannot simply use a 
single "canonicalized" Properties object at all times for all Partition objects 
representing the same DB partition. Instead, I am going to introduce a special 
CopyOnFirstWriteProperties class. Such an object initially holds an internal 
reference to a canonicalized Properties object, and keeps using it as long as 
only read methods are called. However, once any mutating method is called, the 
CopyOnFirstWriteProperties object copies the data from the canonicalized table 
into its own table, and uses that private copy from then on.
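
The copy-on-first-write idea described above can be sketched as follows. This 
is a simplified illustration under stated assumptions, not the class from the 
patch: the name CopyOnFirstWritePropertiesSketch and its methods are 
hypothetical, only a few read and write methods are shown, and no thread safety 
is attempted. A real class would likely extend java.util.Properties and 
override every read and mutating method.

```java
import java.util.Properties;

/**
 * Hypothetical sketch: shares a canonical Properties object until the first
 * mutating call, then switches to a private copy. Not thread-safe.
 */
final class CopyOnFirstWritePropertiesSketch {
    // Shared canonical table; never mutated through this wrapper.
    private Properties shared;
    // Private table, created lazily by the first mutating call.
    private Properties own;

    CopyOnFirstWritePropertiesSketch(Properties canonical) {
        this.shared = canonical;
    }

    /** Read path: uses the private copy if one exists, else the shared table. */
    String getProperty(String key) {
        return current().getProperty(key);
    }

    int size() {
        return current().size();
    }

    /** Write path: copies the shared table first, then mutates only the copy. */
    Object setProperty(String key, String value) {
        return writable().setProperty(key, value);
    }

    Object remove(String key) {
        return writable().remove(key);
    }

    /** True while no mutating method has been called on this object. */
    boolean isShared() {
        return own == null;
    }

    private Properties current() {
        return own != null ? own : shared;
    }

    private Properties writable() {
        if (own == null) {
            own = new Properties();
            own.putAll(shared); // copies the entries; any defaults are not copied
            shared = null;      // this object no longer needs the canonical table
        }
        return own;
    }
}
```

In this sketch, two wrappers created over the same canonical object share it 
until one of them calls setProperty or remove. After that, only the writer pays 
for a private table, while the other wrapper and the canonical object are left 
unchanged.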

  was:
I've created a Hive table with 2000 partitions, each backed by two files, with 
one row in each file. When I execute some number of concurrent queries against 
this table, e.g. as follows

{code}
for i in `seq 1 50`; do beeline -u jdbc:hive2://localhost:10000 -n admin -p admin -e "select count(i_f_1) from misha_table;" & done
{code}

it results in a big memory spike. With 20 queries I caused an OOM in an HS2 
server running with -Xmx200m, and with 50 queries in one running with -Xmx500m.

I am attaching the results of jxray (www.jxray.com) analysis of a heap dump 
that was generated in the 50queries/500m heap scenario. It suggests that there 
are several opportunities to reduce memory pressure with not very invasive 
changes to the code:

1. 24.5% of memory is wasted by duplicate strings (see section 6). With 
String.intern() calls added in the ~10 relevant places in the code, this 
overhead can be greatly reduced.

2. Almost 20% of memory is wasted due to various suboptimally used collections 
(see section 8). There are many maps and lists that are either empty or have 
just 1 element. By modifying the code that creates and populates these 
collections, we may likely save 5-10% of memory.

3. Almost 20% of memory is used by instances of java.util.Properties. It looks 
like these objects are heavily duplicated, since for each partition each 
concurrently running query creates its own copy of Partition, PartitionDesc and 
Properties. Thus we have nearly 100,000 (50 queries * 2,000 partitions) 
Properties in memory. By interning/deduplicating these objects we may be able 
to save perhaps 15% of memory.

So overall, I think there is a good chance to reduce HS2 memory consumption in 
this scenario by ~40%.



> HS2: high memory pressure due to duplicate Properties objects
> -------------------------------------------------------------
>
>                 Key: HIVE-16079
>                 URL: https://issues.apache.org/jira/browse/HIVE-16079
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: hs2-crash-2000p-500m-50q.txt



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
