Hello Oliver,

Yes, Spark makes this possible using the Hadoop compression codecs and the
Hadoop-compatible FileSystem interface [1]. Here is an example of reading:

df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
df.show(10)

This uses a test data set of the complete works of Shakespeare, stored
as text and compressed to a single .bz2 file. The code sample doesn't need
to do anything special to declare that it's working with bzip2 compression,
because the Hadoop compression codecs detect that the file has a .bz2
extension and automatically decompress it before presenting it to our code
in the DataFrame as text.
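
As a rough illustration of that extension-based detection, here is a small
sketch in plain Python. It is not Hadoop's actual implementation; the names
OPENERS, open_by_extension and demo are made up for this example, and it only
uses the standard library so it runs without Spark.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical mapping from file extension to an opener, loosely mirroring
# the idea of choosing a codec from the file name alone.
OPENERS = {
    ".bz2": bz2.open,
    ".gz": gzip.open,
}


def open_by_extension(path):
    """Open path as text, decompressing if the extension is recognized."""
    _, ext = os.path.splitext(path)
    opener = OPENERS.get(ext, open)  # unknown extension: plain open()
    return opener(path, "rt", encoding="utf-8")


def demo():
    """Write a small .bz2 file and read it back without naming the codec."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt.bz2")
        with bz2.open(path, "wt", encoding="utf-8") as f:
            f.write("To be, or not to be\nthat is the question\n")
        with open_by_extension(path) as f:
            return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    print(demo())
```

The caller never names the codec; the file name carries that information,
which is the same reason the Spark read above needs no compression option.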

On the write side, if you want to declare a particular kind of output
compression, you can do it with a write option like this:

df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec") \
    .text("gs://<GCS bucket>/data/shakespeare-bz2-copy")

This writes the contents of the DataFrame as text, compressed into .bz2
files under the destination path.

My example is testing with a GCS bucket (scheme "gs:"), but you can also
switch the Hadoop file system interface to target other file systems like
S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the
S3AFileSystem, including how to pass credentials for access to the S3
bucket [2].
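
As a minimal sketch of what that configuration might look like from the
Spark side: the property names fs.s3a.access.key, fs.s3a.secret.key and
fs.s3a.endpoint come from the Hadoop S3A documentation [2], and Spark copies
any setting prefixed with "spark.hadoop." into the Hadoop configuration. The
helper names s3a_spark_conf and s3a_path are hypothetical, and the bucket and
keys are placeholders.

```python
def s3a_spark_conf(access_key, secret_key, endpoint=None):
    """Return Spark config entries that carry S3A credentials.

    The "spark.hadoop." prefix asks Spark to forward the rest of the key
    (for example "fs.s3a.access.key") to the Hadoop configuration.
    """
    conf = {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }
    if endpoint is not None:
        # Optional: only needed for a non-default S3 endpoint.
        conf["spark.hadoop.fs.s3a.endpoint"] = endpoint
    return conf


def s3a_path(bucket, key):
    """Build a path using the s3a: scheme instead of gs:."""
    return f"s3a://{bucket}/{key}"


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    for name, value in s3a_spark_conf("<access key>", "<secret key>").items():
        print(name, "=", value)
    print(s3a_path("<S3 bucket>", "data/shakespeare-bz2"))
```

Each entry could then be passed to SparkSession.builder.config(key, value)
before getOrCreate(), and the resulting path given to spark.read.text(...)
as in the GCS example. Hard-coding keys is shown only for brevity; [2]
describes the other credential options, and the hadoop-aws module needs to
be on the classpath.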

Note that for big data use cases, other compression codecs like Snappy are
generally preferred because they compress and decompress much faster than
bzip2. (Of course, we're not always in complete control of the data formats
we're given, so the support for bz2 is there.)

[1]
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
[2]
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

Chris Nauroth


On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>      Hello,
>
>   Is it possible to read/write a DataFrame from/to a set of bgzipped
> files? Can it read from/write to AWS S3? Thanks!
>
>      Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, 
> Flannick
> Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>
