[
https://issues.apache.org/jira/browse/HIVE-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646783#comment-13646783
]
Phabricator commented on HIVE-4421:
-----------------------------------
ashutoshc has requested changes to the revision "HIVE-4421 [jira] Improve
memory usage by ORC dictionaries".
Logic in the patch mostly looks good. Just requesting more comments, though
ORC already has pretty good comments. Also, I didn't understand the changes in
RedBlackTree. I assume you have improved memory accounting for it, but it
would be great if you could spell out what the earlier problem was that you
are fixing in this patch.
INLINE COMMENTS
ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:44 "added
to.." is repeated.
ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:47 I think
it's better to define this in HiveConf as well, so that we can tune this value
up or down without needing to recompile Hive. Especially since the size of a
row is unbounded, the size of 5K rows is very much data dependent; e.g., I
recently saw a table which had more than 100 string columns.
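The configurability suggested above could look roughly like this (a minimal
sketch using java.util.Properties as a stand-in for HiveConf; the property
name and default value here are hypothetical, not the actual Hive
configuration keys):

```java
import java.util.Properties;

// Sketch: read the row-check interval from configuration instead of a
// compile-time constant, so it can be tuned without recompiling Hive.
// The key name is invented for illustration.
public class OrcMemoryConfig {
  static final String ROWS_BETWEEN_CHECKS_KEY =
      "hive.exec.orc.rows.between.memory.checks"; // hypothetical key
  static final int ROWS_BETWEEN_CHECKS_DEFAULT = 5000;

  // fall back to the default when the property is unset
  static int rowsBetweenChecks(Properties conf) {
    String v = conf.getProperty(ROWS_BETWEEN_CHECKS_KEY);
    return (v == null) ? ROWS_BETWEEN_CHECKS_DEFAULT : Integer.parseInt(v);
  }
}
```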
ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:141 I think
it will be good to add a note in a comment about the usage of the synchronized
keyword, i.e., the scenario where this method might be invoked from different
threads.
ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:94 It will be
good to add a comment on when oldVal could possibly be null. On first reading
of the code, it wasn't obvious to me.
ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicIntArray.java:138 Better
name: getSizeInBytes?
ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:142 It will
be good to add a comment saying that every 5,000 rows added across all
writers, we ask each writer to flush its content to disk if it is using
memory beyond its quota.
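The behavior described in that comment can be sketched roughly as follows (a
simplified, hypothetical sketch, not the actual ORC MemoryManager; the class,
interface, and method names are invented, and the real writer callback also
deals with IOException):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch: every ROWS_BETWEEN_CHECKS rows added across all
// writers, each writer whose memory use exceeds its share of the pool
// is asked to flush to disk. Names are hypothetical.
public class MemoryManagerSketch {

  // stand-in for the callback each writer registers
  interface Callback {
    long memoryUsed();   // bytes currently buffered by this writer
    void flush();        // spill the current stripe to disk
  }

  static final int ROWS_BETWEEN_CHECKS = 5000; // check interval, in rows

  private final long totalPool;                // bytes shared by all writers
  private final List<Callback> writers = new ArrayList<>();
  private long rowsSinceCheck = 0;

  public MemoryManagerSketch(long totalPool) {
    this.totalPool = totalPool;
  }

  public void addWriter(Callback writer) {
    writers.add(writer);
  }

  // called once per row added by any writer
  public void addedRow() {
    if (++rowsSinceCheck >= ROWS_BETWEEN_CHECKS) {
      rowsSinceCheck = 0;
      long quota = totalPool / Math.max(1, writers.size());
      for (Callback writer : writers) {
        if (writer.memoryUsed() > quota) {
          writer.flush();  // over quota: ask the writer to spill now
        }
      }
    }
  }
}
```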
ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:233 I didn't get
when the current ByteBuffer could be null. It will always be non-null when
this method is invoked, won't it? It will be good to add a comment if the
case is otherwise.
ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:236 Just for my
own clarity, this will be null when compression is off, right?
ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:37 It will be
good to add a comment for each of these three ByteBuffers describing what
kind of data it holds.
ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java:687 Pardon my
ignorance. I didn't get what countOutput was meant for earlier and why it is no
longer required.
REVISION DETAIL
https://reviews.facebook.net/D10545
BRANCH
h-4421
ARCANIST PROJECT
hive
To: JIRA, ashutoshc, omalley
> Improve memory usage by ORC dictionaries
> ----------------------------------------
>
> Key: HIVE-4421
> URL: https://issues.apache.org/jira/browse/HIVE-4421
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
> Fix For: 0.11.0
>
> Attachments: HIVE-4421.D10545.1.patch, HIVE-4421.D10545.2.patch,
> HIVE-4421.D10545.3.patch
>
>
> Currently, for tables with many string columns, it is possible to
> significantly underestimate the memory used by the ORC dictionaries and cause
> the query to run out of memory in the task.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira