Spark uses the Hadoop API to access files, so compressed input is decompressed 
transparently. However, gzip is not splittable: each gzip file can only be 
decompressed by a single task, and bzip2, while splittable, is very slow.
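
For example, a quick check in spark-shell (the paths here are just placeholders) 
shows the difference in splittability:

    // A gzip file is not splittable, so Spark reads it with a single task;
    // a bzip2 file holding the same data can be split across several tasks.
    val gz = spark.sparkContext.textFile("hdfs:///data/events.json.gz")
    println(gz.getNumPartitions)   // 1 -- the whole file goes through one task

    val bz = spark.sparkContext.textFile("hdfs:///data/events.json.bz2")
    println(bz.getNumPartitions)   // > 1 if the file spans multiple HDFS blocks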

The best approach is either to have multiple files (each at least the size of an 
HDFS block) or, better, to use a modern storage format such as Avro, ORC, or 
Parquet, where this issue does not occur: these formats internally create blocks 
that can be decompressed in parallel, regardless of the compression codec.
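
As a rough sketch (column layout and paths are made up), converting the data once 
to Parquet keeps it compressed but still splittable:

    // Parquet compresses per row group / column chunk, so a compressed
    // Parquet dataset can still be read by many tasks in parallel.
    val df = spark.read.json("hdfs:///data/events.json.gz")   // single-task read
    df.write
      .option("compression", "snappy")                        // or "gzip"
      .parquet("hdfs:///data/events.parquet")

    val back = spark.read.parquet("hdfs:///data/events.parquet")
    println(back.rdd.getNumPartitions)                        // scales with splits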

ZIP is not supported out of the box, but it would be trivial to develop that 
yourself.
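
A rough sketch of how that could look (the path is a placeholder), using 
binaryFiles plus java.util.zip; note that, as with gzip, each archive is still 
unpacked by a single task:

    import java.util.zip.ZipInputStream
    import scala.io.Source

    // One record per ZIP archive; each archive is decompressed on an executor.
    val lines = spark.sparkContext
      .binaryFiles("hdfs:///data/archives/*.zip")
      .flatMap { case (_, stream) =>
        val zis = new ZipInputStream(stream.open())
        Iterator.continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .flatMap(_ => Source.fromInputStream(zis).getLines())
      }
    lines.take(5).foreach(println)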

> On 19. Jul 2017, at 22:22, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:
> 
> Hi,
> 
> How does Spark handle compressed files? Are they optimizable in terms of 
> using multiple RDDs against the file, or does one need to uncompress them 
> beforehand, say for bzip2-type files?
> 
> thanks
