[ https://issues.apache.org/jira/browse/AVRO-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569215#comment-13569215 ]
Yin Huai commented on AVRO-1208:
--------------------------------

Did a quick test with [my benchmark tool|https://github.com/yhuai/tableplacement.git], using the table described [here|https://github.com/yhuai/tableplacement/blob/AVRO-1208/tableplacement-experiment/tableProperties/LazyBinaryColumnarSerDe/t0.4-singleFile-noColumnGroup.properties].
Without prefetching, the throughput is about 70 MiB/s. By prefetching 16 blocks (reading 17 blocks at a time), the throughput is about 82 MiB/s.

My system info:
OS: Ubuntu 12.04 with kernel 3.2.0-37
Java: 1.6.0_24
Disk: WD RE4 WD1003FBYX 1TB 7200 RPM

To play with the patch:
1) Install Avro with this patch applied into your local Maven repo.
2) Check out branch AVRO-1208 from the tool linked above and build it with {code}mvn clean package -DskipTests -P avro-1.7.4{code}
3) In tableplacement-experiment/expScripts/exp0.6.conf, modify the path of the data generated by the benchmark ("DIR") and the device you are using ("DEVICE").
4) In tableplacement-experiment/expScripts/expConf, use {code}sudo ./exp.write.Trevni.sh exp0.6{code} to generate data.
5) In tableplacement-experiment/expScripts/expConf, use {code}sudo ./exp.strace.read.Trevni.sh exp0.6 <io buffer size> cfg1:all <dir of strace> <num of prefetched blocks>{code} to run the test. <io buffer size> does not matter here since I am using InputFile. <dir of strace> is the location where the strace results are stored.

> Improve Trevni's performance on row-oriented data access
> --------------------------------------------------------
>
>                 Key: AVRO-1208
>                 URL: https://issues.apache.org/jira/browse/AVRO-1208
>             Project: Avro
>          Issue Type: Improvement
>    Affects Versions: 1.7.3
>            Reporter: Yin Huai
>         Attachments: AVRO-1208.1.patch
>
>
> Trevni uses a 64KB internal buffer to store values of a column. When
> accessing a column, it reads 64KB (if we do not consider compression and
> checksum) data from the storage layer.
> However, when the table is accessed in
> a row-oriented fashion (an entire row needs to be handed over to the upper
> layer), in the worst case (a full table scan and values of this table are all
> the same size), every 64KB data read can cause a seek.
> This jira is used to discuss whether we should consider the data access pattern
> mentioned above and, if so, how to improve the performance of Trevni.
> Row-oriented data processing engines, e.g. Hive, can benefit from this work.
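
A rough sketch of the worst case described in the issue, to make the effect of prefetching concrete. It is a simplified model, not Trevni code: it assumes every read of a column costs one seek, and that a read with N prefetched blocks fetches N + 1 blocks at once. The function name, the column count, and the column size are hypothetical and chosen only for illustration.

```python
BLOCK_BYTES = 64 * 1024  # Trevni's per-column buffer size, from the issue description


def worst_case_seeks(column_bytes, num_columns, prefetch_blocks=0,
                     block_bytes=BLOCK_BYTES):
    """Upper bound on seeks for a row-oriented full scan (assumed model)."""
    # Number of 64KB blocks needed to cover one column (integer ceiling).
    blocks_per_column = (column_bytes + block_bytes - 1) // block_bytes
    # One read fetches the current block plus the prefetched ones.
    blocks_per_read = prefetch_blocks + 1
    reads_per_column = (blocks_per_column + blocks_per_read - 1) // blocks_per_read
    # Assumption: consecutive reads hit different columns, so each read costs a seek.
    return reads_per_column * num_columns


if __name__ == "__main__":
    one_gib = 1 << 30  # hypothetical column size
    print(worst_case_seeks(one_gib, num_columns=4))
    print(worst_case_seeks(one_gib, num_columns=4, prefetch_blocks=16))
```

For the hypothetical table of 4 columns of 1 GiB each, this gives 65536 seeks without prefetching and 3856 with 16 prefetched blocks. The model only counts seeks and does not predict throughput; the measured gain reported in the comment (70 to 82 MiB/s, about 17%) is much smaller than the drop in seek count.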