Nicholas Chammas created SPARK-29102:
----------------------------------------

             Summary: Read gzipped file into multiple partitions without full 
gzip expansion on a single node
                 Key: SPARK-29102
                 URL: https://issues.apache.org/jira/browse/SPARK-29102
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.4.4
            Reporter: Nicholas Chammas


Large gzipped files are a common stumbling block for new users (SPARK-5685, 
SPARK-28366) and an ongoing pain point for users who must process such files 
delivered by external parties who can't or won't break them up into smaller 
files or compress them using a splittable compression format like bzip2.

To deal with large gzipped files today, users must either load them via a 
single task and then repartition the resulting RDD or DataFrame, or they must 
launch a preprocessing step outside of Spark to split up the file or recompress 
it using a splittable format. In either case, the user needs a single host 
capable of holding the entire decompressed file.
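For concreteness, the single-task workaround looks roughly like this in 
PySpark (the file path and partition count here are illustrative, not from 
any real job):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# gzip is not splittable, so this read produces a single partition:
# one task must decompress the entire file by itself.
df = spark.read.csv("s3://bucket/events.csv.gz")

# Workaround: repartition after the fact. The full decompressed file
# still has to pass through a single task (and a single host) first.
df = df.repartition(200)
{code}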

By loading gzipped files directly into multiple partitions across the 
cluster, Spark could a) spare new users the confusion over why only one task 
is processing their gzipped data, and b) relieve new and experienced users 
alike of the need to maintain infrastructure capable of decompressing a large 
gzipped file on a single node.

The rough idea is to have tasks divide a given gzipped file into ranges and 
then concurrently decompress it, with each task discarding the decompressed 
data leading up to its target range. (This kind of partial decompression is 
apparently [doable using standard Unix 
utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
doable in Spark too.)
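A minimal single-machine sketch of that per-task idea, using Python's 
standard gzip module (the function name and offsets are hypothetical; a real 
implementation would also need to align ranges to record boundaries):

{code:python}
import gzip

def read_decompressed_range(path, start, end, chunk_size=1 << 20):
    """Return the decompressed bytes in [start, end).

    Streams the gzip file from the beginning and discards everything
    before `start`, mirroring what each task would do for its range.
    """
    out = bytearray()
    pos = 0  # decompressed bytes consumed so far
    with gzip.open(path, "rb") as f:
        while pos < end:
            data = f.read(min(chunk_size, end - pos))
            if not data:  # hit EOF before reaching `end`
                break
            # Keep only the overlap between this chunk and [start, end).
            lo = max(start - pos, 0)
            hi = min(end - pos, len(data))
            if lo < hi:
                out += data[lo:hi]
            pos += len(data)
    return bytes(out)
{code}

Each task would call something like this with its own (start, end) range, so 
all tasks decompress concurrently while each retains only its own slice.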

In this way, multiple tasks can concurrently load a single gzipped file into 
multiple partitions. Even though every task must decompress the file from the 
beginning up to its target range, and the stage will therefore run no faster 
than it does under Spark's current gzip loading behavior, this nonetheless 
addresses the two problems called out above: users no longer need to load and 
then repartition gzipped files, and their infrastructure no longer needs to 
decompress any large gzipped file on a single node.


