[
https://issues.apache.org/jira/browse/SPARK-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reynold Xin updated SPARK-2496:
-------------------------------
Component/s: Spark Core
Shuffle
> Compression streams should write its codec info to the stream
> -------------------------------------------------------------
>
> Key: SPARK-2496
> URL: https://issues.apache.org/jira/browse/SPARK-2496
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Reporter: Reynold Xin
> Priority: Critical
>
> Spark sometimes stores compressed data outside of Spark (e.g. event logs,
> blocks in tachyon), and that data is read back directly using the codec
> configured by the user. If the configured codec differs between runs, Spark
> would not be able to read the data back.
> I'm not sure what the best strategy here is yet. If we write the codec
> identifier for all streams, we will be writing a lot of identifiers for
> shuffle blocks. One possibility is to write it only for blocks that will be
> shared across different Spark instances (i.e. managed outside of Spark),
> which includes tachyon blocks and event log blocks.
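A minimal sketch of the idea on the JVM, assuming gzip as the stand-in codec; the identifier byte, class name, and method names here are hypothetical illustrations, not Spark's actual implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Prefix a compressed stream with a one-byte codec identifier so the reader
// can select the right codec regardless of the currently configured one.
// The id value and codec choice are hypothetical, for illustration only.
public class CodecHeaderDemo {
    static final int GZIP_ID = 1; // hypothetical identifier for gzip

    // Write the codec id uncompressed, then the gzip-compressed payload.
    static byte[] writeWithCodecId(byte[] payload) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(GZIP_ID);
        GZIPOutputStream gz = new GZIPOutputStream(out);
        gz.write(payload);
        gz.close();
        return out.toByteArray();
    }

    // Read the codec id first, then decompress with the codec it names.
    static byte[] readWithCodecId(byte[] data) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        int id = in.read();
        if (id != GZIP_ID) {
            throw new IOException("unknown codec id: " + id);
        }
        GZIPInputStream gz = new GZIPInputStream(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = gz.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "event log entry".getBytes(StandardCharsets.UTF_8);
        byte[] stored = writeWithCodecId(original);
        byte[] restored = readWithCodecId(stored);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

The per-block cost is a single byte, which matters for the many small shuffle blocks but is negligible for event logs and tachyon blocks, which is why limiting the header to externally managed blocks is attractive.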
--
This message was sent by Atlassian JIRA
(v6.2#6252)