[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096796#comment-15096796 ]

Koji Noguchi commented on PIG-3251:
-----------------------------------

Thanks [~rohini]. Created PIG-4779 for tracking.
In my test environment it was throwing a different IOException and incorrectly passing the test. I'll check.

> Bzip2TextInputFormat requires double the memory of maximum record size
> ----------------------------------------------------------------------
>
>                 Key: PIG-3251
>                 URL: https://issues.apache.org/jira/browse/PIG-3251
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>             Fix For: 0.16.0
>
>         Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch,
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch,
> pig-3251-trunk-v06.patch, pig-3251-trunk-v07.patch, pig-3251-trunk-v08.patch,
> pig-3251-trunk-v09.patch
>
>
> While looking at user's OOM heap dump, noticed that pig's
> Bzip2TextInputFormat consumes memory at both
> Bzip2TextInputFormat.buffer (ByteArrayOutputStream)
> and actual Text that is returned as line.
> For example, when having one record with 160MBytes, buffer was 268MBytes and
> Text was 160MBytes.
> We can probably eliminate one of them.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
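The 160 MB record and 268 MB buffer in the issue description are consistent with a buffer whose capacity doubles each time it fills up. The following is only a sketch of that arithmetic, not code from Pig or from the patch: the class and method names are hypothetical, the starting size of 32 bytes is assumed to match the JDK's default ByteArrayOutputStream size, and "160MBytes" is read as 160,000,000 bytes.

```java
public class BufferCapacitySketch {

    // Hypothetical helper: smallest capacity reached by repeated doubling
    // from `initial` that can hold `needed` bytes.
    static long doubledCapacity(long initial, long needed) {
        long capacity = initial;
        while (capacity < needed) {
            capacity *= 2;
        }
        return capacity;
    }

    public static void main(String[] args) {
        long record = 160L * 1000 * 1000;             // assumed reading of "160MBytes"
        long capacity = doubledCapacity(32, record);  // 32 = assumed initial buffer size
        System.out.println(capacity);                 // 268435456 (2^28 bytes)
        // The buffer and the Text copy of the record together:
        System.out.println(capacity + record);        // 428435456
    }
}
```

Under these assumptions the buffer alone ends up at 2^28 bytes, which matches the 268 MB seen in the heap dump, and the Text copy of the record comes on top of that.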
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981296#comment-14981296 ]

Rohini Palaniswamy commented on PIG-3251:
-----------------------------------------

+1
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981093#comment-14981093 ]

Rohini Palaniswamy commented on PIG-3251:
-----------------------------------------

Just noticed one more issue. -Dppig.bzip.use.hadoop.inputformat in one of the e2e tests has ppig instead of pig.

Can we also do the following:
- Use mLog.info instead of mLog.debug. It is just going to appear once or twice in the front-end and backend, so info should be fine and would be useful.
- Add one line of documentation in pig.properties.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965715#comment-14965715 ]

Rohini Palaniswamy commented on PIG-3251:
-----------------------------------------

- Can you change the name of the constant as well to PIG_BZIP_USE_HADOOP_INPUTFORMAT?
{code}
public static final String PIG_BZIPINPUT_USEHADOOPS = "pig.bzip.use.hadoop.inputformat";
{code}
- Equals sign is missing before false
{code}
+'num' => 3,
+'java_params' => ['-Dppig.bzip.use.hadoop.inputformatfalse'],
{code}
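To illustrate why the spelling of the property name in the e2e test matters, here is a rough sketch of how a boolean switch of this kind is typically read. Only the property name and the proposed constant name come from the review comment above; the class, the helper method, and the default value passed in are hypothetical and are not taken from the patch.

```java
import java.util.Properties;

public class BzipFlagSketch {

    // Constant name and value as proposed in the review comment.
    public static final String PIG_BZIP_USE_HADOOP_INPUTFORMAT =
            "pig.bzip.use.hadoop.inputformat";

    // Hypothetical helper: reads the flag, falling back to the caller's default
    // when the property is not set.
    static boolean useHadoopInputFormat(Properties props, boolean defaultValue) {
        String value = props.getProperty(PIG_BZIP_USE_HADOOP_INPUTFORMAT);
        return value == null ? defaultValue : Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        // Correctly spelled key: the requested value is honoured.
        Properties ok = new Properties();
        ok.setProperty(PIG_BZIP_USE_HADOOP_INPUTFORMAT, "false");
        System.out.println(useHadoopInputFormat(ok, true));    // false

        // Misspelled key ("ppig..."): it is never looked up, so the default wins.
        Properties typo = new Properties();
        typo.setProperty("ppig.bzip.use.hadoop.inputformat", "false");
        System.out.println(useHadoopInputFormat(typo, true));  // true
    }
}
```

With a lookup like this, a test that passes the misspelled key would silently run with the default setting instead of the one it meant to exercise.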
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791291#comment-14791291 ]

Rohini Palaniswamy commented on PIG-3251:
-----------------------------------------

Looks good so far. Will wait for the unit test investigation. Just one suggestion for a small change to have better readability:

    public static final String PIG_BZIPINPUT_USEHADOOPS = "pig.bzipinput.usehadoops";

to

    public static final String PIG_BZIP_USE_HADOOP_INPUTFORMAT = "pig.bzip.use.hadoop.inputformat";
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14584008#comment-14584008 ]

Rohini Palaniswamy commented on PIG-3251:
-----------------------------------------

I think we can make it the default for Hadoop 2.x (with a setting to switch back in case of issues) and leave it at Pig's Bzip2TextInputFormat for Hadoop 1.x. Hadoop 2.x's implementation seems to be more stable, with bug fixes for bzip2.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583985#comment-14583985 ]

Rohini Palaniswamy commented on PIG-3251:
-----------------------------------------

Had a discussion with [~knoguchi]. He was suggesting that we should probably start with using hadoop's TextInputFormat for bzip2 processing as an option based on configuration. After more stabilization, we can make it the default. Agree with that idea.

I also noticed that there is one open issue with concatenated bzip support in hadoop - HADOOP-6852. Since we have not supported concatenated bzip at all so far with Pig's bzip implementation, that should be ok.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648788#comment-13648788 ]

Koji Noguchi commented on PIG-3251:
-----------------------------------

bq. FYI, couple of tests from TestBZip are failing after applying my patch. Looking.

3 tests failed.
{noformat}
Testcase: testBZ2Concatenation took 38.266 sec
    FAILED
Expected exception: java.io.IOException
junit.framework.AssertionFailedError: Expected exception: java.io.IOException

Testcase: testBlockHeaderEndingWithCR took 49.539 sec
    FAILED
expected:<82094> but was:<82093>
junit.framework.AssertionFailedError: expected:<82094> but was:<82093>
    at org.apache.pig.test.TestBZip.testCount(TestBZip.java:256)
    at org.apache.pig.test.TestBZip.testBlockHeaderEndingWithCR(TestBZip.java:112)

Testcase: testBlockHeaderEndingAtSplitNotByteAligned took 48.996 sec
    FAILED
expected:<74999> but was:<101591>
junit.framework.AssertionFailedError: expected:<74999> but was:<101591>
    at org.apache.pig.test.TestBZip.testCount(TestBZip.java:256)
    at org.apache.pig.test.TestBZip.testBlockHeaderEndingAtSplitNotByteAligned(TestBZip.java:88)
{noformat}

The "testBZ2Concatenation" failure is expected, since hadoop's bzip2 codec handles concatenated bzip files (whereas pig's TestBZip is testing whether it reliably fails).

The other two are worrisome to me. Asking my colleague to check. It'll take some time.
Depending on what we find, we may need to change the condition for using hadoop's bzip codec.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648776#comment-13648776 ]

Daniel Dai commented on PIG-3251:
---------------------------------

bq. Or, are you suggesting I create two silly wrappers instead of one?

No need, if there is no easy way then forget about it.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648745#comment-13648745 ]

Koji Noguchi commented on PIG-3251:
-----------------------------------

FYI, couple of tests from TestBZip are failing after applying my patch. Looking.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648664#comment-13648664 ]

Daniel Dai commented on PIG-3251:
---------------------------------

Hi, [~knoguchi], is the patch ready? Some comments for pig-3251-trunk-v04.patch:
1. Caching the job in PigStorage seems confusing; can we just cache splittable?
2. Is it possible to wrap a codec that deals with both bz2/bz? I just feel it might be less confusing than wrapping bz2 alone and using PigTextInputFormat for bz.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648464#comment-13648464 ]

Koji Noguchi commented on PIG-3251:
-----------------------------------

bq. Using hadoop's bzip codec on 0.23/2.0 would have an additional benefit of having native codec. (HADOOP-8462)

Learned that bzip native codec so far does not support splitting (and falls back to java version for splits).
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13608030#comment-13608030 ]

Daniel Dai commented on PIG-3251:
---------------------------------

Makes sense. We shall move to the new approach for Hadoop 1.1.0+, and use Bzip2TextInputFormat otherwise for backward compatibility.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607886#comment-13607886 ]

Koji Noguchi commented on PIG-3251:
-----------------------------------

bq. With HADOOP-7823, can we remove Bzip2TextInputFormat and just use PigTextInputFormat?

Since our platform has moved to 0.23, I'll be happy if we can simply remove Bzip2TextInputFormat just for hadoop 0.23 or later.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607860#comment-13607860 ]

Koji Noguchi commented on PIG-3251:
-----------------------------------

bq. With HADOOP-7823, can we remove Bzip2TextInputFormat and just use PigTextInputFormat?

That'll have (almost) the same effect as my initial patch pig-3251-trunk-v01.patch, which takes us to status (2) in my previous comment.
With HADOOP-7823 + HADOOP-6109, it'll be (3).

Without a doubt, HADOOP-7823 + HADOOP-6109 is the cleanest approach.
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13607820#comment-13607820 ]

Richard Ding commented on PIG-3251:
-----------------------------------

With HADOOP-7823, can we remove Bzip2TextInputFormat and just use PigTextInputFormat?
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13606445#comment-13606445 ]

Koji Noguchi commented on PIG-3251:
-----------------------------------

bq. Let me know if you find any problem in your testing.

Thanks Daniel. My initial test went well on a 0.23 cluster. It was as fast as the original and required less memory.

However, the patched pig is super slow on a 1.0.2 cluster. The reason is that I'm using the Text directly as the replacement for ByteArrayOutputStream. Without HADOOP-6109, which was committed in 0.21, Text grows linearly whereas ByteArrayOutputStream grows exponentially, so the former requires a lot more copies.
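The cost difference between linear and exponential growth described in the comment above can be sketched with a simple model. This is not the code of Text or ByteArrayOutputStream; the class and method names are hypothetical, and it assumes that every expansion copies all bytes stored so far and that "linear" means growing by a fixed chunk each time.

```java
public class GrowthCostSketch {

    // Bytes copied while growing to `target` when the capacity doubles
    // each time the buffer is full (exponential growth).
    static long copiedBytesDoubling(long initial, long target) {
        long capacity = initial;
        long copied = 0;
        while (capacity < target) {
            copied += capacity;   // the full buffer is copied into the new array
            capacity *= 2;
        }
        return copied;
    }

    // Bytes copied while growing to `target` when the capacity increases
    // by a fixed `chunk` each time the buffer is full (linear growth).
    static long copiedBytesLinear(long chunk, long target) {
        long capacity = chunk;
        long copied = 0;
        while (capacity < target) {
            copied += capacity;
            capacity += chunk;
        }
        return copied;
    }

    public static void main(String[] args) {
        long target = 1024;
        System.out.println(copiedBytesDoubling(32, target)); // 992
        System.out.println(copiedBytesLinear(32, target));   // 15872
    }
}
```

In this model the doubling buffer copies less than the final size in total, while the fixed-chunk buffer copies an amount that grows with the square of the final size, which would explain a large slowdown on very long records.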
[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size
[ https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605874#comment-13605874 ]

Daniel Dai commented on PIG-3251:
---------------------------------

Looks good. It seems the reason we used ByteArrayOutputStream before was to get an auto-expanding byte array for free, which Pig can manage by itself to reduce the memory footprint. Let me know if you find any problem in your testing.