[
https://issues.apache.org/jira/browse/HADOOP-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559812#action_12559812
]
Runping Qi commented on HADOOP-2608:
------------------------------------
I profiled the program that reads the sequence files.
It turned out that a lot of CPU time was spent on deserializing the values.
The values are of a JuteRecord class with many fields of ustring type.
Deserializing an object of that class involves calling
org.apache.hadoop.record.Utils.fromBinaryString, which is very expensive
(compared with deserializing the buffer type).
After I replaced the ustring type with the buffer type in the jute ddl, the scan
throughput improved by 3x!
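To illustrate the difference (a minimal sketch of my own, assuming a simple
length-prefixed wire format; this is not the actual Jute-generated code nor the
real Utils implementation): a ustring field has to be decoded from UTF-8 into a
Java String, while a buffer field is just read into a byte array and handed back
as-is.

import java.io.DataInput;
import java.io.IOException;

// Illustrative sketch only -- not the real Jute-generated code or the real
// org.apache.hadoop.record.Utils implementation.  It shows why a ustring field
// is inherently more work to deserialize than a buffer field: the ustring path
// must decode UTF-8 into a Java String (extra allocation plus per-byte decode),
// while the buffer path is a single read into a byte array.
public class FieldDeserSketch {

  // ustring-style field: read length, read bytes, then decode UTF-8 into a String.
  static String readUstring(DataInput in) throws IOException {
    int len = in.readInt();          // length prefix (encoding assumed for illustration)
    byte[] raw = new byte[len];
    in.readFully(raw);
    return new String(raw, "UTF-8"); // UTF-8 decode: extra allocation and per-byte work
  }

  // buffer-style field: read length, read bytes, and stop -- no decoding step.
  static byte[] readBuffer(DataInput in) throws IOException {
    int len = in.readInt();          // length prefix (encoding assumed for illustration)
    byte[] raw = new byte[len];
    in.readFully(raw);
    return raw;                      // raw bytes handed straight to the caller
  }
}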
Not surprisingly,
org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect then became
the most expensive operation (28% of CPU time spent on that call).
So, one thing we learned here is that the cost of deserializing ustring is about
3x that of deserializing buffer.
That seems too high a cost to pay for using ustring on large amounts of data.
An obvious question is whether there is some low-hanging fruit in improving
org.apache.hadoop.record.Utils.fromBinaryString.
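For example (purely an assumption on my part; I have not checked whether
fromBinaryString already does something like this), one candidate would be to
fast-path values whose bytes are all 7-bit ASCII and only fall back to the
general UTF-8 decoder for the rest:

import java.io.UnsupportedEncodingException;

// Sketch of a possible fast path, not the actual Utils.fromBinaryString code.
public class AsciiFastPath {

  static String fromBinaryString(byte[] bytes, int off, int len)
      throws UnsupportedEncodingException {
    // Fast path: if every byte is 7-bit ASCII, the UTF-8 decode is trivial and
    // can be done with a straight byte-to-char copy.
    boolean ascii = true;
    for (int i = off; i < off + len; i++) {
      if (bytes[i] < 0) { ascii = false; break; }  // high bit set => multi-byte UTF-8
    }
    if (ascii) {
      char[] chars = new char[len];
      for (int i = 0; i < len; i++) {
        chars[i] = (char) bytes[off + i];
      }
      return new String(chars);
    }
    // Slow path: fall back to the JDK's UTF-8 decoder for non-ASCII input.
    return new String(bytes, off, len, "UTF-8");
  }
}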
> Reading sequence file consumes 100% cpu with maximum throughput being about
> 5MB/sec per process
> -----------------------------------------------------------------------------------------------
>
> Key: HADOOP-2608
> URL: https://issues.apache.org/jira/browse/HADOOP-2608
> Project: Hadoop
> Issue Type: Improvement
> Components: io
> Reporter: Runping Qi
>
> I did some tests on the throughput of scanning block-compressed sequence
> files.
> The sustained throughput was bounded at 5MB/sec per process, with the CPU of
> each process maxed out at 100%.
> It seems to me that the cpu consumption is too high and the throughput is too
> low for just scanning files.