[ https://issues.apache.org/jira/browse/HIVE-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151496#comment-14151496 ]

Dong Chen commented on HIVE-7685:
---------------------------------

Hi Brock,

I think a brief design for this memory manager is:
Every new writer registers itself with the manager, so the manager has an 
overall view of all the writers. When a condition is met (such as every 1000 
rows written), it notifies the writers to check their memory usage and flush 
if necessary.
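
To make that concrete, here is a minimal sketch of what I mean. All of the 
names here (MemoryManager, FlushableWriter, recordWritten) are made up for 
illustration and are not existing Hive or Parquet API:

    import java.io.IOException;
    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    // Hypothetical interface each registered writer would implement.
    interface FlushableWriter {
      long bufferedSize();              // current dynamic buffer size, bytes
      void flush() throws IOException;  // flush the buffered row group
    }

    class MemoryManager {
      private static final long CHECK_INTERVAL_ROWS = 1000;  // rows per check
      private final long totalLimitBytes;                    // overall budget
      private final List<FlushableWriter> writers =
          new CopyOnWriteArrayList<>();
      private long rowsSinceCheck = 0;

      MemoryManager(long totalLimitBytes) {
        this.totalLimitBytes = totalLimitBytes;
      }

      void register(FlushableWriter w)   { writers.add(w); }
      void unregister(FlushableWriter w) { writers.remove(w); }

      // Each writer calls this per record. Every CHECK_INTERVAL_ROWS rows,
      // the manager sums buffer sizes across all writers and asks them to
      // flush if the total exceeds the budget.
      void recordWritten() throws IOException {
        if (++rowsSinceCheck < CHECK_INTERVAL_ROWS) return;
        rowsSinceCheck = 0;
        long total = 0;
        for (FlushableWriter w : writers) total += w.bufferedSize();
        if (total > totalLimitBytes) {
          for (FlushableWriter w : writers) w.flush();
        }
      }
    }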

However, there is a Parquet-specific problem: Hive only has a wrapper around 
ParquetRecordWriter, and ParquetRecordWriter in turn wraps the real writer 
(InternalParquetRecordWriter) in the Parquet project. Since the behaviors of 
measuring the dynamic buffer size and flushing are private to the real 
writer, I think we also have to add code to InternalParquetRecordWriter to 
implement the memory manager functionality. 
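
Roughly, the hook inside InternalParquetRecordWriter might look like the 
sketch below. Everything added here (the memoryManager field, implementing 
the hypothetical FlushableWriter interface from the sketch above, the 
bufferedBytes stand-in for the private column store's size accounting) is an 
assumption about how we could wire it in, not existing Parquet code:

    import java.io.IOException;

    // Rough sketch only; not the real InternalParquetRecordWriter.
    class InternalParquetRecordWriterSketch<T> implements FlushableWriter {
      private final MemoryManager memoryManager;  // assumed new collaborator
      private long bufferedBytes = 0;             // stand-in for the private
                                                  // column store's memSize()

      InternalParquetRecordWriterSketch(MemoryManager mm) {
        this.memoryManager = mm;
        mm.register(this);                        // join the manager's view
      }

      public void write(T value) throws IOException {
        // ... existing logic buffers the record into the column store ...
        bufferedBytes += 100;                     // placeholder for real sizing
        memoryManager.recordWritten();            // manager triggers checks
      }

      @Override
      public long bufferedSize() { return bufferedBytes; }

      @Override
      public void flush() throws IOException {
        // ... the existing private row-group flush logic would run here ...
        bufferedBytes = 0;
      }
    }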

It seems changing only Hive code cannot fix this JIRA. 
I am not sure whether we should move this problem to the Parquet project and 
fix it there, since it may be generic enough and not Hive-specific. 

Any other ideas?

Best Regards,
Dong

> Parquet memory manager
> ----------------------
>
>                 Key: HIVE-7685
>                 URL: https://issues.apache.org/jira/browse/HIVE-7685
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>            Reporter: Brock Noland
>
> Similar to HIVE-4248, Parquet tries to write very large "row groups". This 
> causes Hive to run out of memory during dynamic partition inserts, when a 
> reducer may have many Parquet files open at a given time.
> As such, we should implement a memory manager which ensures that we don't 
> run out of memory due to writing too many row groups within a single JVM.


