[ https://issues.apache.org/jira/browse/HADOOP-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559812#action_12559812 ]
Runping Qi commented on HADOOP-2608:
------------------------------------

I profiled the program that reads sequence files. It turned out that a lot of CPU was spent on deserializing the values. The values are instances of a JuteRecord class with many fields of ustring type. Deserializing an object of that class involves calling org.apache.hadoop.record.Utils.fromBinaryString, which is very expensive compared with deserializing the buffer type. After I replaced the ustring type with the buffer type in the Jute DDL, the scan throughput improved by 3x! Not surprisingly, org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect then became the most expensive operation (28% of CPU spent on that call).

So, one thing we learned here is that the cost of deserializing a ustring is about 3x that of deserializing a buffer. That seems too high a cost to pay for using ustring for large amounts of data. An obvious question is whether there is some low-hanging fruit in improving org.apache.hadoop.record.Utils.fromBinaryString.

> Reading sequence file consumes 100% cpu with maximum throughput being about 5MB/sec per process
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2608
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2608
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>            Reporter: Runping Qi
>
> I did some tests on the throughput of scanning block-compressed sequence files.
> The sustained throughput was bounded at 5MB/sec per process, with the cpu of each process maxed at 100%.
> It seems to me that the cpu consumption is too high and the throughput is too low for just scanning files.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
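To illustrate the ustring-vs-buffer cost difference discussed in the comment above, here is a minimal, self-contained micro-benchmark sketch. It uses only the plain JDK (not the Hadoop record I/O classes), and the field contents and iteration counts are made up for illustration: a ustring-style field forces a UTF-8 byte-to-char decode on every record, while a buffer-style field is just a raw byte copy.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UstringVsBufferBench {

    public static void main(String[] args) {
        // Hypothetical field payload; real record fields and sizes will differ.
        byte[] field = "some moderately long ustring field value, repeated "
                .repeat(4).getBytes(StandardCharsets.UTF_8);
        int iterations = 2_000_000;

        // Warm-up so the JIT compiles both paths before timing.
        runStringDecode(field, 100_000);
        runBufferCopy(field, 100_000);

        long t0 = System.nanoTime();
        runStringDecode(field, iterations);
        long stringNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        runBufferCopy(field, iterations);
        long bufferNanos = System.nanoTime() - t1;

        System.out.printf("ustring-style decode: %d ms%n", stringNanos / 1_000_000);
        System.out.printf("buffer-style copy:    %d ms%n", bufferNanos / 1_000_000);
    }

    // Roughly what deserializing a ustring field costs: UTF-8 bytes -> java.lang.String.
    static long runStringDecode(byte[] bytes, int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            String s = new String(bytes, StandardCharsets.UTF_8);
            sink += s.length();
        }
        return sink;
    }

    // Roughly what deserializing a buffer field costs: copy the raw bytes, no decoding.
    static long runBufferCopy(byte[] bytes, int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] copy = Arrays.copyOf(bytes, bytes.length);
            sink += copy.length;
        }
        return sink;
    }
}
{code}

The exact ratio will depend on field lengths and JVM version, but the string path does strictly more work per record, which is consistent with the roughly 3x gap observed in the profile.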