[ https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932997#comment-16932997 ]
Nicholas Chammas commented on SPARK-29102:
------------------------------------------

{quote}It duplicately decompresses and each map task process what they want. And then, each map task stops decompressing if they processes what they want.{quote}

Yup, that's what I was suggesting in this issue. Glad some folks have already tried that out. Hopefully, I'll get lucky and {{nl.basjes.hadoop.io.compress.SplittableGzipCodec}} will just work for me.

{quote}We could resolve this JIRA but if you feel like it's still feasible, I don't mind leaving this JIRA open.{quote}

I've resolved it for now as "Won't Fix". I'll report back here if the solution you pointed me to works.

> Read gzipped file into multiple partitions without full gzip expansion on a single node
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-29102
>                 URL: https://issues.apache.org/jira/browse/SPARK-29102
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.4.4
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, SPARK-28366) and an ongoing pain point for users who must process such files delivered by external parties who can't or won't break them up into smaller files or compress them with a splittable compression format like bzip2.
>
> To deal with large gzipped files today, users must either load them via a single task and then repartition the resulting RDD or DataFrame, or launch a preprocessing step outside of Spark to split up the file or recompress it in a splittable format. In either case, the user needs a single host capable of holding the entire decompressed file.
>
> By loading gzipped files directly into multiple partitions across the cluster, Spark can potentially a) spare new users the confusion over why only one task is processing their gzipped data, and b) relieve new and experienced users alike of the need to maintain infrastructure capable of decompressing a large gzipped file on a single node.
>
> The rough idea is to have tasks divide a given gzipped file into ranges and then all concurrently decompress the file, with each task throwing away the data leading up to its target range. (This kind of partial decompression is apparently [doable using standard Unix utilities|https://unix.stackexchange.com/a/415831/70630], so it should be doable in Spark too.)
>
> In this way multiple tasks can concurrently load a single gzipped file into multiple partitions. Even though every task will need to unpack the file from the beginning up to its target range, and the stage will run no faster than it does with Spark's current gzip loading behavior, this nonetheless addresses the two problems called out above: users no longer need to load and then repartition gzipped files, and their infrastructure does not need to decompress any large gzipped file on a single node.
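
For later readers, two rough, illustrative sketches rather than anything verified. First, a hypothetical PySpark configuration for trying the codec mentioned above. The input path is made up, and I'm assuming the splittablegzip jar is already on the driver and executor classpaths and that registering the codec via Hadoop's {{io.compression.codecs}} setting is sufficient for Spark to treat .gz files as splittable:

{code:python}
from pyspark.sql import SparkSession

# Minimal sketch, not verified end-to-end: registers the splittable gzip
# codec with Hadoop's codec factory via Spark's spark.hadoop.* passthrough.
spark = (
    SparkSession.builder
    .config(
        "spark.hadoop.io.compression.codecs",
        "nl.basjes.hadoop.io.compress.SplittableGzipCodec",
    )
    .getOrCreate()
)

df = spark.read.csv("large-file.csv.gz")  # hypothetical input path
print(df.rdd.getNumPartitions())          # ideally > 1 if the codec takes effect
{code}

Second, a toy, standalone illustration of the "decompress from the start and throw away everything before the target range" idea from the description, using only Python's standard {{zlib}} module. The function name and its offsets are invented for illustration; a real implementation would also need to align ranges to record boundaries (e.g. newlines) rather than raw byte offsets:

{code:python}
import zlib

def read_gzip_range(path, start, end, chunk_size=1 << 20):
    """Decompress `path` from the beginning, discard everything before
    `start`, and yield decompressed bytes in [start, end).

    Offsets are positions in the *decompressed* stream.
    """
    decompressor = zlib.decompressobj(wbits=31)  # wbits=31 -> expect a gzip header
    pos = 0  # current offset in the decompressed stream
    with open(path, "rb") as f:
        while pos < end:
            compressed = f.read(chunk_size)
            if not compressed:
                break
            data = decompressor.decompress(compressed)
            next_pos = pos + len(data)
            if next_pos > start:
                # Keep only the slice that falls inside [start, end).
                lo = max(start - pos, 0)
                hi = min(end - pos, len(data))
                yield data[lo:hi]
            pos = next_pos
{code}

As the quoted comment above describes, the SplittableGzipCodec takes essentially the same approach at the Hadoop input-format level: every split decompresses from the start of the file and discards what comes before its range, which is why each task still pays the full decompression cost up to its split.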