Reynold Xin created SPARK-2496:
----------------------------------

Summary: Compression streams should write its codec info to the stream
Key: SPARK-2496
URL: https://issues.apache.org/jira/browse/SPARK-2496
Project: Spark
Issue Type: Improvement
Reporter: Reynold Xin
Priority: Critical

Spark sometimes stores compressed data outside of Spark (e.g. event logs, blocks in Tachyon), and that data is read back directly using the codec configured by the user. When the configured codec differs between the run that wrote the data and the run that reads it, Spark cannot read the data back.

I'm not sure what the best strategy here is yet. If we write the codec identifier for all streams, then we will be writing a lot of identifiers for shuffle blocks. One possibility is to write it only for blocks that will be shared across different Spark instances (i.e. managed outside of Spark), which includes Tachyon blocks and event log blocks.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
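
To make the trade-off in the description concrete, here is a minimal sketch of the "identifier only for shared blocks" option. It is an illustration under stated assumptions, not Spark's implementation: the one-byte header, the `CODECS` table, and the `write_block` / `read_block` helpers are all hypothetical, and Python's standard-library codecs stand in for whatever codecs Spark is configured with.

```python
import bz2
import lzma
import zlib

# Hypothetical one-byte codec identifiers. The stdlib codecs below are
# stand-ins for illustration only; they are not Spark's codecs.
CODECS = {
    1: ("zlib", zlib.compress, zlib.decompress),
    2: ("bz2", bz2.compress, bz2.decompress),
    3: ("lzma", lzma.compress, lzma.decompress),
}
IDS_BY_NAME = {name: codec_id for codec_id, (name, _, _) in CODECS.items()}


def write_block(data: bytes, codec_name: str, shared: bool) -> bytes:
    """Compress data; prefix the codec id only for externally shared blocks."""
    codec_id = IDS_BY_NAME[codec_name]
    _, compress, _ = CODECS[codec_id]
    payload = compress(data)
    if shared:
        # Shared blocks (e.g. event logs) pay one extra byte for the id.
        return bytes([codec_id]) + payload
    # Internal blocks (e.g. shuffle) carry no header.
    return payload


def read_block(blob: bytes, configured_codec: str, shared: bool) -> bytes:
    """Decompress a block, trusting the header over the configured codec."""
    if shared:
        if not blob or blob[0] not in CODECS:
            raise ValueError("missing or unknown codec identifier")
        codec_id, payload = blob[0], blob[1:]
    else:
        # No header: fall back to whatever the reader is configured with.
        codec_id, payload = IDS_BY_NAME[configured_codec], blob
    _, _, decompress = CODECS[codec_id]
    return decompress(payload)


# A shared block written with one codec is still readable by a reader
# that is configured with a different one.
event_log = write_block(b"event" * 100, "bz2", shared=True)
restored = read_block(event_log, configured_codec="zlib", shared=True)
```

In this sketch the per-block cost is a single byte, and it is paid only on the shared path, so shuffle blocks keep their current layout. The reader must know out of band which blocks are shared, which is the same distinction the description draws between blocks managed inside and outside of Spark.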