[ https://issues.apache.org/jira/browse/SPARK-21113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051134#comment-16051134 ]
Apache Spark commented on SPARK-21113:
--------------------------------------

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/18317

> Support for read ahead input stream to amortize disk IO cost in the Spill reader
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-21113
>                 URL: https://issues.apache.org/jira/browse/SPARK-21113
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.0.2
>            Reporter: Sital Kedia
>            Priority: Minor
>
> Profiling some of our big jobs, we see that around 30% of the time is
> spent reading spill files from disk. To amortize the disk IO cost, the idea
> is to implement a read ahead input stream which asynchronously reads ahead
> from the underlying input stream once a specified amount of data has been
> read from the current buffer. It does this by maintaining two buffers: an
> active buffer and a read ahead buffer. The active buffer contains the data
> that is returned when a read() call is issued. The read ahead buffer is used
> to asynchronously read from the underlying input stream, and once the
> current active buffer is exhausted, we flip the two buffers so that we can
> start reading from the read ahead buffer without being blocked on disk I/O.
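The two-buffer scheme described in the issue can be sketched roughly as follows. This is an illustration only, not the code from the pull request: the class name `ReadAheadSketch`, the single-thread executor, and the `bufferSize` parameter are all assumptions made for the example. As a simplification, it starts the next background read immediately after each flip instead of waiting until a specified amount of the active buffer has been consumed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a double-buffered read ahead stream (not Spark's class).
class ReadAheadSketch extends InputStream {
    private final InputStream in;
    // One daemon thread performs the background reads.
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "read-ahead-sketch");
        t.setDaemon(true);
        return t;
    });
    private byte[] active;           // data handed out by read()
    private byte[] ahead;            // filled asynchronously
    private int pos = 0;             // next byte to return from active
    private int limit = 0;           // number of valid bytes in active
    private Future<Integer> pending; // result of the in-flight background read
    private boolean eof = false;

    ReadAheadSketch(InputStream in, int bufferSize) {
        this.in = in;
        this.active = new byte[bufferSize];
        this.ahead = new byte[bufferSize];
        startReadAhead();
    }

    // Schedule an asynchronous fill of the read ahead buffer.
    private void startReadAhead() {
        final byte[] target = ahead;
        Callable<Integer> task = () -> in.read(target, 0, target.length);
        pending = pool.submit(task);
    }

    // Wait for the background read, swap the buffers, and start the next read.
    private boolean flip() throws IOException {
        if (eof) {
            return false;
        }
        int n;
        try {
            n = pending.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException(e);
        } catch (ExecutionException e) {
            throw new IOException(e.getCause());
        }
        if (n < 0) {
            eof = true;
            return false;
        }
        byte[] tmp = active;
        active = ahead;
        ahead = tmp;
        pos = 0;
        limit = n;
        startReadAhead();
        return true;
    }

    @Override
    public int read() throws IOException {
        while (pos >= limit) {
            if (!flip()) {
                return -1;
            }
        }
        return active[pos++] & 0xFF;
    }

    @Override
    public void close() throws IOException {
        pool.shutdownNow();
        in.close();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1000];
        int count = 0;
        try (InputStream s = new ReadAheadSketch(new ByteArrayInputStream(data), 64)) {
            while (s.read() != -1) {
                count++;
            }
        }
        System.out.println("read " + count + " bytes");
    }
}
```

While the caller drains the active buffer, the background thread is already filling the other one, so a flip usually finds the next chunk ready. A production version would also need bulk `read(byte[], int, int)`, `skip`, and `available` support, plus a configurable read ahead threshold as the issue describes.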