Reynold Xin created SPARK-2496:
----------------------------------
Summary: Compression streams should write its codec info to the
stream
Key: SPARK-2496
URL: https://issues.apache.org/jira/browse/SPARK-2496
Project: Spark
Issue Type: Improvement
Reporter: Reynold Xin
Priority: Critical
Spark sometimes stores compressed data outside of Spark (e.g. event logs,
blocks in Tachyon), and that data is read back directly using the codec
configured by the user. When the configured codec differs between runs, Spark
cannot read the data back.
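The failure mode can be reproduced outside Spark. The following Python sketch
is only an illustration, not Spark code: the standard-library zlib and bz2
modules stand in for Spark's codecs, and write_block, read_block and the
configured_codec argument are hypothetical names for "whatever codec the user
configured for this run".

```python
import bz2
import zlib

# Stand-ins for the user-configurable codec; these are not Spark's codec classes.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
}


def write_block(data: bytes, configured_codec: str) -> bytes:
    """Compress with the configured codec; nothing in the output names the codec."""
    compress, _ = CODECS[configured_codec]
    return compress(data)


def read_block(stored: bytes, configured_codec: str) -> bytes:
    """Decompress with whatever codec is configured at read time."""
    _, decompress = CODECS[configured_codec]
    return decompress(stored)


# Run 1 writes an event log with zlib configured.
stored = write_block(b"event log contents", "zlib")

# Run 2 has bz2 configured and tries to read the same bytes.
try:
    read_block(stored, "bz2")
    readable = True
except (OSError, ValueError):
    # The reader has no way to tell which codec produced the bytes.
    readable = False
```

Reading succeeds only when the second run happens to use the same
configuration as the first.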
I'm not sure what the best strategy here is yet. If we write the codec
identifier for all streams, then we will be writing a lot of identifiers for
shuffle blocks. One possibility is to write it only for blocks that will be
shared across different Spark instances (i.e. managed outside of Spark), which
includes Tachyon blocks and event log blocks.
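A minimal sketch of that possibility, in Python rather than Spark's own code:
a one-byte codec identifier is prepended only to blocks marked as shared, and
the reader trusts that byte instead of its own configuration. The identifier
values, the shared flag, and the function names are all assumptions made for
illustration, not an existing Spark format or API.

```python
import bz2
import lzma
import zlib

# Hypothetical one-byte identifiers; not an actual Spark wire format.
CODEC_IDS = {"zlib": 1, "bz2": 2, "lzma": 3}
COMPRESS = {1: zlib.compress, 2: bz2.compress, 3: lzma.compress}
DECOMPRESS = {1: zlib.decompress, 2: bz2.decompress, 3: lzma.decompress}


def write_block(data: bytes, codec: str, shared: bool) -> bytes:
    """Compress data; prefix the codec id only for externally managed blocks."""
    codec_id = CODEC_IDS[codec]
    payload = COMPRESS[codec_id](data)
    return bytes([codec_id]) + payload if shared else payload


def read_block(stored: bytes, configured_codec: str, shared: bool) -> bytes:
    """Shared blocks carry their own codec id; others use the configured codec."""
    if shared:
        codec_id, payload = stored[0], stored[1:]
    else:
        codec_id, payload = CODEC_IDS[configured_codec], stored
    return DECOMPRESS[codec_id](payload)
```

With this layout an event log written as write_block(data, "lzma", shared=True)
can be read back by an instance configured for "zlib", while shuffle blocks
written with shared=False carry no extra byte. The cost is that the writer and
reader must agree out of band on which blocks are shared.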