[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Noguchi updated PIG-3251:
------------------------------
Attachment: pig-3251-trunk-v02.patch
(1) Current status (before any patch)
||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java ||
| 0.20 | [i] SLOW due to HADOOP-6109 | (iii) Needs EXTRA MEMORY. This Jira. |
| 0.23 | [ii] Good. | (iv) Needs EXTRA MEMORY. This Jira. |
(2) My initial patch (pig-3251-trunk-v01.patch) changes this to
||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java ||
| 0.20 | [i] SLOW due to HADOOP-6109 | (iii) Slow due to HADOOP-6109 |
| 0.23 | [ii] Good. | (iv) Good |
(3) If we can backport HADOOP-6109 to 0.20 and apply my pig-3251-trunk-v01.patch, that solves
all the problems.
||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java ||
| 0.20+HADOOP-6109 | [i] Good | (iii) Good |
| 0.23 | [ii] Good. | (iv) Good |
However, I've seen a discussion about Pig supporting 0.20.2 users,
so I guess we can't ask them to backport HADOOP-6109.
I think my remaining options are:
(a) Give up. Wait until everyone upgrades to 0.23/2.0, or backport HADOOP-6109
to Hadoop 1.2* and wait until Pig moves off 0.20.2/1.0.*.
(b) Try to work around it without touching the Hadoop code.
I think (a) is reasonable, but I tried (b). This patch changes the status as follows.
(4) Patch (pig-3251-trunk-v02.patch)
||hadoop version || PigTextInputFormat || Bzip2TextInputFormat.java ||
| 0.20 | [i] SLOW due to HADOOP-6109 | (iii) Good |
| 0.23 | [ii] Good. | (iv) Good |
The penalty for not touching the Hadoop code is that my patch adds two unnecessary
byte array copies whenever the Text is extended. But extensions are infrequent because
the sizes increase exponentially, so I hope the overall overhead is negligible.
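
To put rough numbers on the "exponentially increasing sizes" argument, here is a minimal sketch. It is not code from the patch. It assumes the capacity doubles on every extension, starts from a hypothetical 32-byte initial capacity, and that each extension copies the current contents `copiesPerResize` times. The class and method names are invented for illustration.

```java
// Illustration only; names and growth policy are assumptions, not taken from the patch.
class GrowthCost {

    /** Number of extensions a doubling buffer needs before it can hold target bytes. */
    static int resizeCount(long initialCapacity, long target) {
        int resizes = 0;
        long capacity = initialCapacity;
        while (capacity < target) {
            capacity *= 2;
            resizes++;
        }
        return resizes;
    }

    /**
     * Total bytes copied over all extensions, assuming the buffer is full at
     * each extension and its contents are copied copiesPerResize times.
     */
    static long bytesCopied(long initialCapacity, long target, int copiesPerResize) {
        long copied = 0;
        long capacity = initialCapacity;
        while (capacity < target) {
            copied += capacity * copiesPerResize;
            capacity *= 2;
        }
        return copied;
    }

    public static void main(String[] args) {
        long record = 160L * 1000 * 1000; // one record of roughly 160 MBytes
        System.out.println("extensions: " + resizeCount(32, record));
        System.out.println("bytes copied, 1 copy per extension:   " + bytesCopied(32, record, 1));
        System.out.println("bytes copied, 3 copies per extension: " + bytesCopied(32, record, 3));
    }
}
```

Under these assumptions the number of extensions grows only logarithmically with the record size, and the total bytes copied stay within a small constant multiple of the final capacity, even with extra copies per extension.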
> Bzip2TextInputFormat requires double the memory of maximum record size
> ----------------------------------------------------------------------
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
> Issue Type: Improvement
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Priority: Minor
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's
> Bzip2TextInputFormat consumes memory in both
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream)
> and the actual Text that is returned as the line.
> For example, for one record of 160 MBytes, the buffer was 268 MBytes and
> the Text was 160 MBytes.
> We can probably eliminate one of them.
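
The double footprint described above can be reproduced at a smaller scale with a minimal sketch. It is not Pig's Bzip2TextInputFormat code. It accumulates a 160,000-byte record byte by byte into a `java.io.ByteArrayOutputStream` and then copies it out, with a plain `byte[]` standing in for the Text. The `MeasuredBuffer` subclass and `footprint` method are hypothetical helpers that exist only to expose the backing array size.

```java
import java.io.ByteArrayOutputStream;

// Illustration only; a scaled-down stand-in, not Pig's Bzip2TextInputFormat code.
class DoubleBufferDemo {

    /** ByteArrayOutputStream that exposes the length of its protected backing array. */
    static class MeasuredBuffer extends ByteArrayOutputStream {
        int capacity() {
            return buf.length;
        }
    }

    /** Returns {buffer capacity, copied line length} after accumulating one record. */
    static int[] footprint(int recordSize) {
        MeasuredBuffer buffer = new MeasuredBuffer();
        for (int i = 0; i < recordSize; i++) {
            buffer.write('x'); // single-byte writes make the backing array double as it fills
        }
        byte[] line = buffer.toByteArray(); // second copy, standing in for the returned Text
        return new int[] { buffer.capacity(), line.length };
    }

    public static void main(String[] args) {
        int[] f = footprint(160_000);
        System.out.println("buffer capacity: " + f[0] + ", line copy: " + f[1]);
    }
}
```

With single-byte writes the backing array doubles from its default 32 bytes up to the next power of two at or above the record size. At full scale that would be 2^28 = 268,435,456 bytes for a 160 MByte record, which is consistent with the 268 MBytes reported, assuming the buffer in the heap dump grew the same way. Both arrays are live at the same time, which is the extra memory this issue is about.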