[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584008#comment-14584008
 ] 

Rohini Palaniswamy edited comment on PIG-3251 at 6/12/15 8:25 PM:
------------------------------------------------------------------

Hadoop 2.x's splittable bzip2 implementation seems to be more stable, with bug 
fixes for bzip2. I think we can make it the default for Hadoop 2 (falling back 
when the codec is not splittable, as with the native bz2 implementation, and 
adding a config setting to switch back in case of issues) and keep Pig's 
Bzip2TextInputFormat for Hadoop 1.x. 


was (Author: rohini):
I think we can make it the default for Hadoop 2 (with a setting to switch back 
in case of issues) and keep Pig's Bzip2TextInputFormat for Hadoop 1.x. Hadoop 
2.x's implementation seems to be more stable, with bug fixes for bzip2.

> Bzip2TextInputFormat requires double the memory of maximum record size
> ----------------------------------------------------------------------
>
>                 Key: PIG-3251
>                 URL: https://issues.apache.org/jira/browse/PIG-3251
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Minor
>         Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat consumes memory both in
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and in the actual Text that is returned as the line.
> For example, with a single 160 MB record, the buffer was 268 MB and the 
> Text was 160 MB.
> We can probably eliminate one of them.
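
One possible reading of the 268 MB figure in the description above: a buffer that grows by doubling from a power-of-two capacity first reaches at least 160 MB at 2^28 = 268,435,456 bytes. The sketch below only simulates that growth policy. The class name, the method name, and the initial capacity of 32 bytes (the java.io.ByteArrayOutputStream default) are assumptions made for illustration and are not taken from the Pig source.

```java
// Sketch: simulate a byte buffer that doubles its capacity whenever it runs
// out of room, and report the capacity reached for a record of a given size.
// BufferGrowthSketch and finalCapacity are hypothetical names, not Pig code.
class BufferGrowthSketch {

    // Returns the capacity a doubling buffer ends up with once it can hold
    // recordBytes bytes that were appended in small pieces.
    static long finalCapacity(long initialCapacity, long recordBytes) {
        long capacity = initialCapacity;
        while (capacity < recordBytes) {
            capacity *= 2; // grow by doubling
        }
        return capacity;
    }

    public static void main(String[] args) {
        long record = 160_000_000L;              // one 160 MB record
        long buffer = finalCapacity(32, record); // assumed initial capacity: 32
        System.out.println("buffer capacity: " + buffer + " bytes");
        System.out.println("buffer + Text copy: " + (buffer + record) + " bytes");
    }
}
```

Under these assumptions, the buffer's backing array alone is larger than the record, and the Text copy of the line comes on top of it, which would be consistent with the sizes seen in the heap dump.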



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
