I would suggest using another compression technique that is splittable, e.g. bzip2, LZO, or LZ4.
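As a minimal sketch of the suggestion above (assuming an existing SparkSession `spark` and a single-string-column DataFrame `df`; the output path is illustrative), a splittable codec can be selected when writing text output:

```scala
// Sketch, not a definitive recipe: write text output with a splittable codec.
// Assumes `spark` (SparkSession) and `df` (DataFrame with one string column) exist.
df.write
  .option("compression", "bzip2") // splittable; gzip and 7z are not
  .text("output/bzip2-text")

// When read back, Spark can split each bzip2 file into multiple partitions,
// so a single large file no longer forces a single partition.
val lines = spark.read.text("output/bzip2-text")
```

This is why the codec choice matters for scaling: with a non-splittable format, one large file maps to one partition and one task, no matter how many executors are available.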
On Wed, Jan 15, 2020, 1:32 AM Enrico Minack <m...@enrico.minack.dev> wrote:

> Hi,
>
> Spark does not support 7z natively, but you can read any file in Spark:
>
>     def read(stream: PortableDataStream): Iterator[String] = {
>       Seq(stream.getPath()).iterator
>     }
>
>     spark.sparkContext
>       .binaryFiles("*.7z")
>       .flatMap(file => read(file._2))
>       .toDF("path")
>       .show(false)
>
> This scales with the number of files. A single large 7z file would not
> scale well (a single partition).
>
> Any file that matches *.7z will be loaded via the read(stream:
> PortableDataStream) method, which returns an iterator over the rows. This
> method is executed on the executor and can implement the 7z-specific code,
> which is independent of Spark and should not be too hard (here it does not
> open the input stream but returns the path only).
>
> If you are planning to read the same files more than once, then it would
> be worth first uncompressing them and converting them into files Spark
> supports. Then Spark can scale much better.
>
> Regards,
> Enrico
>
>
> On 13.01.20 at 13:31, HARSH TAKKAR wrote:
>
> Hi,
>
> Is it possible to read a 7z compressed file in Spark?
>
> Kind Regards
> Harsh Takkar
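A sketch of what the 7z-specific code inside `read` could look like, assuming Apache Commons Compress is available on the executor classpath (an external dependency, not part of Spark); reading the whole archive into memory and the UTF-8/line-based decoding are simplifying assumptions for illustration:

```scala
import java.io.ByteArrayOutputStream
import org.apache.commons.compress.archivers.sevenz.SevenZFile
import org.apache.commons.compress.utils.SeekableInMemoryByteChannel
import org.apache.spark.input.PortableDataStream

// Sketch, not production code: loads the entire 7z file into executor memory
// (7z needs random access, so the stream cannot be decoded incrementally),
// then yields one String per line of each contained entry, assuming UTF-8 text.
def read(stream: PortableDataStream): Iterator[String] = {
  val bytes  = stream.toArray()                                  // whole file in memory
  val sevenZ = new SevenZFile(new SeekableInMemoryByteChannel(bytes))
  Iterator
    .continually(sevenZ.getNextEntry)
    .takeWhile(_ != null)
    .filterNot(_.isDirectory)
    .flatMap { _ =>
      val out = new ByteArrayOutputStream()
      val buf = new Array[Byte](8192)
      var n = sevenZ.read(buf)
      while (n > 0) { out.write(buf, 0, n); n = sevenZ.read(buf) }
      new String(out.toByteArray, "UTF-8").linesIterator
    }
}
```

Plugged into the `binaryFiles(...).flatMap(...)` pipeline above, this parallelizes across files, but each archive is still decoded by a single task, which is why converting to a splittable format first is the better option for repeated reads.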