Re: Reading 7z file in spark

Someshwar Kale Tue, 14 Jan 2020 17:37:03 -0800

I would suggest to use other compression technique which is splittable for
eg. Bzip2, lzo, lz4.


On Wed, Jan 15, 2020, 1:32 AM Enrico Minack <m...@enrico.minack.dev> wrote:

> Hi,
>
> Spark does not support 7z natively, but you can read any file in Spark:
>
> def read(stream: PortableDataStream): Iterator[String] = { 
> Seq(stream.getPath()).iterator }
>
> spark.sparkContext
>   .binaryFiles("*.7z")
>   .flatMap(file => read(file._2))
>   .toDF("path")
>   .show(false)
>
> This scales with the number of files. A single large 7z file would not
> scale well (a single partition).
>
> Any file that matches *.7z will be loaded via the read(stream:
> PortableDataStream) method, which returns an iterator over the rows. This
> method is executed on the executor and can implement the 7z specific code,
> which is independent of Spark and should not be too hard (here it does not
> open the input stream but returns the path only).
>
> If you are planning to read the same files more than once, then it would
> be worth to first uncompress and convert them into files Spark supports.
> Then Spark can scale much better.
>
> Regards,
> Enrico
>
>
> Am 13.01.20 um 13:31 schrieb HARSH TAKKAR:
>
> Hi,
>
>
> Is it possible to read 7z compressed file in spark?
>
>
> Kind Regards
> Harsh Takkar
>
>
>

Re: Reading 7z file in spark

Reply via email to