[jira] [Assigned] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills
[ https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13695:

    Assignee: Apache Spark  (was: Josh Rosen)

> Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills
> ---
>
>                 Key: SPARK-13695
>                 URL: https://issues.apache.org/jira/browse/SPARK-13695
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager
>            Reporter: Josh Rosen
>            Assignee: Apache Spark
>
> When a cached block is spilled to disk and read back in serialized form (i.e.
> as bytes), the current BlockManager implementation will attempt to re-insert
> the serialized block into the MemoryStore even if the block's storage level
> requests deserialized caching.
>
> This behavior adds some complexity to the MemoryStore but I don't think it
> offers many performance benefits and I'd like to remove it in order to
> simplify a larger refactoring patch. Therefore, I propose to change the
> behavior such that disk store reads will only cache bytes in the memory store
> for blocks with serialized storage levels.
>
> There are two places where we request serialized bytes from the BlockStore:
>
> 1. getLocalBytes(), which is only called when reading local copies of
> TorrentBroadcast pieces. Broadcast pieces are always cached using a
> serialized storage level, so this won't lead to a mismatch in serialization
> forms if spilled bytes read from disk are cached as bytes in the memory store.
>
> 2. the non-shuffle-block branch in getBlockData(), which is only called by
> the NettyBlockRpcServer when responding to requests to read remote blocks.
> Caching the serialized bytes in memory will only benefit us if those cached
> bytes are read before they're evicted and the likelihood of that happening
> seems low since the frequency of remote reads of non-broadcast cached blocks
> seems very low. Caching these bytes when they have a low probability of being
> read is bad if it risks the eviction of blocks which are cached in their
> expected serialized/deserialized forms, since those blocks seem more likely
> to be read in local computation.
>
> Therefore, I think this is a safe change.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
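
The rule proposed in the issue description can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Spark's BlockManager code: StorageLevel here is a two-flag stand-in, and ToyMemoryStore, should_cache_bytes_after_disk_read, read_spilled_bytes and the block IDs are hypothetical names made up for the example. It shows that a disk read of a block with a deserialized storage level leaves the memory store untouched, while a read of a block with a serialized level may cache its bytes and evict something else.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageLevel:
    """Toy stand-in for a storage level; only the two flags this sketch needs."""
    use_memory: bool
    deserialized: bool


# Hypothetical constants that borrow the names of two Spark storage levels.
MEMORY_AND_DISK = StorageLevel(use_memory=True, deserialized=True)
MEMORY_AND_DISK_SER = StorageLevel(use_memory=True, deserialized=False)


class ToyMemoryStore:
    """Tiny LRU byte cache with a fixed capacity, counted in bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block_id -> bytes, oldest entry first

    def used(self):
        return sum(len(b) for b in self.blocks.values())

    def put_bytes(self, block_id, data):
        if len(data) > self.capacity:
            return False  # can never fit, so do not evict anything
        self.blocks.pop(block_id, None)
        while self.used() + len(data) > self.capacity:
            self.blocks.popitem(last=False)  # evict the oldest block
        self.blocks[block_id] = data
        return True


def should_cache_bytes_after_disk_read(level):
    """Proposed rule: only serialized in-memory levels get their bytes cached."""
    return level.use_memory and not level.deserialized


def read_spilled_bytes(block_id, level, disk, memory):
    """Read a spilled block from the (toy) disk store, caching per the rule."""
    data = disk[block_id]
    if should_cache_bytes_after_disk_read(level):
        memory.put_bytes(block_id, data)
    return data


# Illustrative block IDs and sizes, chosen so that two blocks cannot both fit.
disk = {"rdd_0_0": b"x" * 8, "broadcast_1_piece0": b"y" * 8}
memory = ToyMemoryStore(capacity=10)
memory.put_bytes("hot_block", b"h" * 8)

# Deserialized level: bytes are returned but not cached, hot_block survives.
read_spilled_bytes("rdd_0_0", MEMORY_AND_DISK, disk, memory)
kept_after_deserialized_read = list(memory.blocks)

# Serialized level: bytes are cached, which here evicts hot_block.
read_spilled_bytes("broadcast_1_piece0", MEMORY_AND_DISK_SER, disk, memory)
kept_after_serialized_read = list(memory.blocks)
```

The second read is the case the description argues is acceptable for broadcast pieces, since their cached form already matches their storage level. The first read is the case the proposal changes: the bytes are served without displacing a block that is cached in its expected form.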
[jira] [Assigned] (SPARK-13695) Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills
[ https://issues.apache.org/jira/browse/SPARK-13695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13695:

    Assignee: Josh Rosen  (was: Apache Spark)

> Don't cache MEMORY_AND_DISK blocks as bytes in memory store when reading spills
> ---
>
>                 Key: SPARK-13695
>                 URL: https://issues.apache.org/jira/browse/SPARK-13695
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen