[ https://issues.apache.org/jira/browse/SPARK-29102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933803#comment-16933803 ]

Nicholas Chammas commented on SPARK-29102:
------------------------------------------

[~hyukjin.kwon] - Would you happen to know how to instruct Spark to use a 
custom codec _on read_?

I'm trying out that {{SplittableGzipCodec}}, but I can't seem to get Spark to 
actually use it. I'm starting up PySpark as follows:
{code:java}
pyspark --packages nl.basjes.hadoop:splittablegzip:1.2
{code}
Then I'm trying to read a gzipped CSV as follows:
{code:java}
spark.conf.set('spark.hadoop.io.compression.codecs',
               'nl.basjes.hadoop.io.compress.SplittableGzipCodec')
spark.read.csv(...).count()
{code}
But Spark doesn't seem to be using the codec.

I know Spark can "see" the codec, because I can use it on write:
{code:java}
spark.range(10).write.csv('test.csv', mode='overwrite',
                          compression='nl.basjes.hadoop.io.compress.SplittableGzipCodec')
{code}
However, Spark doesn't offer a {{compression}} option on read.
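
The only other hook I can think of is setting the property directly on the 
SparkContext's underlying Hadoop configuration before reading. A rough, 
untested sketch of what I mean (I don't know whether the CSV reader actually 
picks this up, which is really what I'm asking):
{code:java}
# Untested sketch: write the codec list straight into the Hadoop
# configuration backing this SparkContext, in case a runtime
# spark.conf.set() never reaches it.
spark.sparkContext._jsc.hadoopConfiguration().set(
    'io.compression.codecs',
    'nl.basjes.hadoop.io.compress.SplittableGzipCodec')

spark.read.csv(...).count()
{code}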

Do you know how I can get PySpark to use this codec on read? Apologies if this 
is not appropriate for JIRA; I can take it to Stack Overflow if you prefer.

> Read gzipped file into multiple partitions without full gzip expansion on a 
> single node
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-29102
>                 URL: https://issues.apache.org/jira/browse/SPARK-29102
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.4.4
>            Reporter: Nicholas Chammas
>            Priority: Minor
>
> Large gzipped files are a common stumbling block for new users (SPARK-5685, 
> SPARK-28366) and an ongoing pain point for users who must process such files 
> delivered from external parties who can't or won't break them up into smaller 
> files or compress them using a splittable compression format like bzip2.
> To deal with large gzipped files today, users must either load them via a 
> single task and then repartition the resulting RDD or DataFrame, or they must 
> launch a preprocessing step outside of Spark to split up the file or 
> recompress it using a splittable format. In either case, the user needs a 
> single host capable of holding the entire decompressed file.
> Spark can potentially a) spare new users the confusion over why only one task 
> is processing their gzipped data, and b) relieve new and experienced users 
> alike from needing to maintain infrastructure capable of decompressing a 
> large gzipped file on a single node, by directly loading gzipped files into 
> multiple partitions across the cluster.
> The rough idea is to have tasks divide a given gzipped file into ranges and 
> then have them all concurrently decompress the file, with each task throwing 
> away the data leading up to the target range. (This kind of partial 
> decompression is apparently [doable using standard Unix 
> utilities|https://unix.stackexchange.com/a/415831/70630], so it should be 
> doable in Spark too.)
> In this way multiple tasks can concurrently load a single gzipped file into 
> multiple partitions. Even though every task will need to unpack the file from 
> the beginning to the task's target range, and the stage will run no faster 
> than it does with Spark's current gzip loading behavior, this 
> nonetheless addresses the two problems called out above. Users no longer need 
> to load and then repartition gzipped files, and their infrastructure does not 
> need to decompress any large gzipped file on a single node.
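
To make the skip-and-discard idea in the description quoted above concrete, 
here is a minimal, Spark-free Python sketch (my own illustration, not anything 
Spark does today) of pulling one decompressed byte range out of a gzipped file 
by decompressing from the start and discarding everything before the range:
{code:java}
import gzip

def read_decompressed_range(path, start, length, chunk_size=1 << 20):
    """Return `length` decompressed bytes beginning at decompressed offset
    `start`, by decompressing from the start of the file and throwing away
    everything before `start`."""
    with gzip.open(path, 'rb') as f:
        remaining = start
        # Decompress and discard the bytes leading up to the target range.
        while remaining > 0:
            skipped = f.read(min(chunk_size, remaining))
            if not skipped:
                break  # Hit end of file before reaching the target offset.
            remaining -= len(skipped)
        # Now positioned at `start`; return just the target range.
        return f.read(length)
{code}
Each hypothetical task would call something like this for its own (start, 
length) range, so every task decompresses the same leading bytes but keeps 
only its own slice. (Python's gzip.GzipFile.seek() performs the same 
read-and-discard internally when seeking forward.)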


