Spark uses the Hadoop API to access files, so compressed files are decompressed transparently. However, gzip is not splittable: each .gz file can only be decompressed by a single task, and bzip2, while splittable, is very slow to decompress. The best option is either to split the data across multiple files (each at least the size of an HDFS block) or, better, to use a modern storage format such as Avro, ORC or Parquet, where this issue does not occur: these formats internally store the data in blocks that can be decompressed in parallel, regardless of which compression codec is used.
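For illustration, a minimal sketch of that workflow (the paths, the CSV header option and the partition count are made up for this example): read a gzipped CSV, which comes back as a single partition because gzip is not splittable, then rewrite it once as Parquet so later jobs can read it in parallel.

import org.apache.spark.sql.SparkSession

object GzipToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("gzip-to-parquet").getOrCreate()

    // The .gz file is decompressed transparently via the Hadoop codecs,
    // but gzip is not splittable, so the data arrives in a single partition.
    val df = spark.read.option("header", "true").csv("hdfs:///data/events.csv.gz")
    println(s"Partitions after reading gzip: ${df.rdd.getNumPartitions}")  // typically 1

    // Rewrite once as Parquet: Parquet compresses its row groups internally
    // (snappy by default), so subsequent reads can be split across many tasks.
    df.repartition(8).write.mode("overwrite").parquet("hdfs:///data/events.parquet")

    spark.stop()
  }
}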
Zip is not supported out of the box, but it would be trivial to develop that yourself; a rough sketch follows below the quoted message.

> On 19. Jul 2017, at 22:22, Ashok Kumar <ashok34...@yahoo.com.INVALID> wrote:
>
> Hi,
>
> How does spark handle compressed files? Are they optimizable in terms of
> using multiple RDDs against the file or one needs to uncompress them
> beforehand say bz type files.
>
> thanks
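Here is roughly how that do-it-yourself zip handling could look, using binaryFiles plus java.util.zip. The path, the assumption that the archives contain plain-text entries, and the choice to materialise each archive's content in memory are mine, for simplicity of the sketch.

import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream

import org.apache.spark.sql.SparkSession

object ReadZipFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-zip").getOrCreate()
    val sc = spark.sparkContext

    // One whole zip archive per task: binaryFiles yields the file name and a
    // reopenable stream, and we unpack it ourselves with java.util.zip.
    val lines = sc.binaryFiles("hdfs:///data/archives/*.zip").flatMap {
      case (_, portableStream) =>
        val zis = new ZipInputStream(portableStream.open())
        Iterator
          .continually(zis.getNextEntry)
          .takeWhile(_ != null)
          .flatMap { _ =>
            // Read the current entry completely before asking for the next one.
            val reader = new BufferedReader(new InputStreamReader(zis))
            Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
          }
          .toList  // materialise while the stream is still open in this task
    }

    println(s"Lines across all zip entries: ${lines.count()}")
    spark.stop()
  }
}

Note that each archive is still processed by a single task, so the earlier advice about having many files, or converting once to Parquet/ORC/Avro, applies here as well.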