[
https://issues.apache.org/jira/browse/HIVE-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626037#comment-13626037
]
Phabricator commented on HIVE-4248:
-----------------------------------
kevinwilfong has commented on the revision "HIVE-4248 [jira] Implement a memory
manager for ORC".
This allows for cases where the memory used can exceed the amount of memory
allocated by a significant amount.
E.g. say totalMemoryPool = 256 MB = stripe size, and say we have a writer
that writes 255 MB to a stripe, then a second writer is created (e.g. a new
dynamic partition value is encountered) and all new rows get written to this
second writer. Then nothing will get written out until the second writer
accumulates 128 MB of data in its stripe (with two writers the allocation
scale drops to 0.5), using a total of 383 MB against the allocated 256 MB.
In theory, with some terrible luck, these could be chained together to use
significantly more memory (the first writer writes 255 MB, the second writes
127 MB, the third writes 85 MB, etc.).
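To make that worst case concrete, here is a small, purely illustrative
computation (not code from the patch), assuming each newly added writer can
buffer just under stripeSize * 1/n before the scaled check would flush it:

public class OrcMemoryWorstCase {
  public static void main(String[] args) {
    final long stripeSize = 256L * 1024 * 1024; // 256 MB, also the total pool
    long totalBuffered = 0;
    // When the n-th writer is added the allocation scale drops to 1/n, so that
    // writer only flushes once it buffers stripeSize / n bytes; earlier writers
    // keep whatever they had already buffered.
    for (int n = 1; n <= 5; n++) {
      totalBuffered += stripeSize / n;
      System.out.printf("writers=%d  total buffered ~ %.0f MB%n",
          n, totalBuffered / (1024.0 * 1024.0));
    }
    // Grows like 256 MB * (1 + 1/2 + 1/3 + ...), which is how
    // 255 + 127 + 85 + ... MB ends up well past the 256 MB pool.
  }
}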
Could you loop through the writers whenever a new writer is added (this
shouldn't happen too frequently) and check whether the estimated stripe size
of any of them exceeds stripeSize * memoryManager.getAllocationScale()?
(This should be doable by making a couple of methods public and storing a
reference to the WriterImpl along with, or instead of, the Path.)
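Roughly what I have in mind (a sketch only; the interface and method names
below are stand-ins for illustration, not the actual classes in the patch):

import java.util.ArrayList;
import java.util.List;

class MemoryManagerSketch {
  /** Minimal stand-in for WriterImpl, exposing only what the check needs. */
  interface ManagedWriter {
    long estimateStripeSize(); // bytes currently buffered for the open stripe
    void flushStripe();        // write the current stripe out early
  }

  private final long totalMemoryPool;
  private final long stripeSize;
  private final List<ManagedWriter> writers = new ArrayList<ManagedWriter>();

  MemoryManagerSketch(long totalMemoryPool, long stripeSize) {
    this.totalMemoryPool = totalMemoryPool;
    this.stripeSize = stripeSize;
  }

  /** Scale shrinks as more writers share the pool. */
  double getAllocationScale() {
    long wanted = stripeSize * writers.size();
    return wanted <= totalMemoryPool ? 1.0 : (double) totalMemoryPool / wanted;
  }

  /** Called whenever a new writer registers; rechecks every existing writer. */
  void addWriter(ManagedWriter writer) {
    writers.add(writer);
    long limit = (long) (stripeSize * getAllocationScale());
    for (ManagedWriter w : writers) {
      if (w.estimateStripeSize() > limit) {
        w.flushStripe(); // flush any writer now over its scaled budget
      }
    }
  }
}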
Also (this could be done in a follow up), could there be an additional check
on the total HeapMemoryUsage? E.g. in the shouldBeFlushed method of
GroupByOperator, every 1000 rows it checks that no more than 90% of the total
heap has been used, and if so it flushes the hash map. Something similar could
be done for WriterImpl, and given the MemoryManager, it could even flush the
largest stripe rather than just the one that pushed it over the edge. This
would be particularly useful because, in the case of a map join followed by a
map aggregation, the map join is allowed to use 55% of the memory and the
group by another 30%; if there were also a FileSinkOperator, allowing the ORC
WriterImpl to use 50% could be too much.
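For reference, such a check could poll the JVM's MemoryMXBean on the same
cadence. The class below is only a sketch of that idea; the 1000-row interval
and 90% threshold mirror the GroupByOperator heuristic, and nothing here is
an existing MemoryManager method:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

class HeapPressureCheck {
  private static final int CHECK_INTERVAL_ROWS = 1000;
  private static final double MAX_HEAP_FRACTION = 0.90;

  private final MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
  private long rowsSinceCheck = 0;

  /** Called once per row written; occasionally inspects total heap usage. */
  boolean shouldFlush() {
    if (++rowsSinceCheck < CHECK_INTERVAL_ROWS) {
      return false;
    }
    rowsSinceCheck = 0;
    MemoryUsage heap = memoryBean.getHeapMemoryUsage();
    // getMax() can be -1 when the maximum is undefined; fall back to committed.
    long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
    return heap.getUsed() > (long) (max * MAX_HEAP_FRACTION);
  }
}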
INLINE COMMENTS
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:490 Could you add
this to conf/hive-default.xml.template as well?
REVISION DETAIL
https://reviews.facebook.net/D9993
To: JIRA, omalley
Cc: kevinwilfong
> Implement a memory manager for ORC
> ----------------------------------
>
> Key: HIVE-4248
> URL: https://issues.apache.org/jira/browse/HIVE-4248
> Project: Hive
> Issue Type: New Feature
> Components: Serializers/Deserializers
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Attachments: HIVE-4248.D9993.1.patch, HIVE-4248.D9993.2.patch
>
>
> With the large default stripe size (256MB) and dynamic partitions, it is
> quite easy for users to run out of memory when writing ORC files. We probably
> need a solution that keeps track of the total number of concurrent ORC
> writers and divides the available heap space between them.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira