[ https://issues.apache.org/jira/browse/HADOOP-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559812#action_12559812 ]
Runping Qi commented on HADOOP-2608:
------------------------------------

I profiled the program that reads sequence files. It turned out that a lot of CPU was spent on deserializing the values. The values are instances of a JuteRecord class with many fields of ustring type. Deserializing an object of that class involves calling org.apache.hadoop.record.Utils.fromBinaryString, which is very expensive compared with deserializing the buffer type. After I replaced the ustring type with the buffer type in the Jute DDL, the scan throughput improved by 3x! Not surprisingly, org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect then became the most expensive operation (28% of CPU spent on that call).

So, one thing we learned here is that the cost of deserializing a ustring is about 3x that of deserializing a buffer. That seems too high a cost to pay for using ustring for large amounts of data. An obvious question is whether there is some low-hanging fruit in improving org.apache.hadoop.record.Utils.fromBinaryString.

> Reading sequence file consumes 100% cpu with maximum throughput being about 5MB/sec per process
> -----------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2608
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2608
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: io
>            Reporter: Runping Qi
>
> I did some tests on the throughput of scanning block-compressed sequence files.
> The sustained throughput was bounded at 5MB/sec per process, with the cpu of each process maxed at 100%.
> It seems to me that the cpu consumption is too high and the throughput is too low for just scanning files.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
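To illustrate the ustring-vs-buffer cost difference discussed in the comment above, here is a minimal, self-contained micro-benchmark sketch. It uses only the plain JDK (not the Hadoop record I/O classes), and the field contents and iteration counts are made up for illustration: a ustring-style field forces a UTF-8 byte-to-char decode on every record, while a buffer-style field is just a raw byte copy.

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class UstringVsBufferBench {

    public static void main(String[] args) {
        // Hypothetical field payload; real record fields and sizes will differ.
        byte[] field = "some moderately long ustring field value, repeated "
                .repeat(4).getBytes(StandardCharsets.UTF_8);
        int iterations = 2_000_000;

        // Warm-up so the JIT compiles both paths before timing.
        runStringDecode(field, 100_000);
        runBufferCopy(field, 100_000);

        long t0 = System.nanoTime();
        runStringDecode(field, iterations);
        long stringNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        runBufferCopy(field, iterations);
        long bufferNanos = System.nanoTime() - t1;

        System.out.printf("ustring-style decode: %d ms%n", stringNanos / 1_000_000);
        System.out.printf("buffer-style copy:    %d ms%n", bufferNanos / 1_000_000);
    }

    // Roughly what deserializing a ustring field costs: UTF-8 bytes -> java.lang.String.
    static long runStringDecode(byte[] bytes, int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            String s = new String(bytes, StandardCharsets.UTF_8);
            sink += s.length();
        }
        return sink;
    }

    // Roughly what deserializing a buffer field costs: copy the raw bytes, no decoding.
    static long runBufferCopy(byte[] bytes, int iterations) {
        long sink = 0;
        for (int i = 0; i < iterations; i++) {
            byte[] copy = Arrays.copyOf(bytes, bytes.length);
            sink += copy.length;
        }
        return sink;
    }
}
{code}

The exact ratio will depend on field lengths and JVM version, but the string path does strictly more work per record, which is consistent with the roughly 3x gap observed in the profile.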