Hello Oliver,

Yes, Spark makes this possible using the Hadoop compression codecs and the Hadoop-compatible FileSystem interface [1]. Here is an example of reading:

    df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
    df.show(10)

This uses a test data set of the complete works of Shakespeare, stored as text and compressed to a single .bz2 file. The code sample doesn't need to do anything special to declare that it's working with bzip2 compression, because the Hadoop compression codecs detect that the file has a .bz2 extension and automatically decompress it before presenting it to our code in the DataFrame as text.

On the write side, if you want to declare a particular kind of output compression, you can do it with a write option like this:

    df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec").text("gs://<GCS bucket>/data/shakespeare-bz2-copy")

This writes the contents of the DataFrame as text, compressed to .bz2 files in the destination path.

My example is testing with a GCS bucket (scheme "gs:"), but you can also switch the Hadoop file system interface to target other file systems like S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the S3AFileSystem, including how to pass credentials for access to the S3 bucket [2].

Note that for big data use cases, other compression codecs like Snappy are generally preferred for greater efficiency. (Of course, we're not always in complete control of the data formats we're given, so the support for bz2 is there.)

[1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
[2] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

Chris Nauroth

On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

> Hello,
>
> Is it possible to read/write a DataFrame from/to a set of bgzipped files? Can it read from/write to AWS S3? Thanks!
>
> Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>
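
P.S. For the S3 case, here is a rough sketch of how the same read and write might look against an "s3a:" path. Treat it as an illustration under assumptions, not a tested recipe: the helper names (s3a_spark_conf, build_session, copy_bz2_text) and the app name are made up for this example, it assumes the hadoop-aws module and its AWS SDK dependency are on the Spark classpath, and it passes static keys through the fs.s3a.access.key / fs.s3a.secret.key properties described in [2]. In practice a credentials provider is usually a better choice than keys in code.

```python
def s3a_spark_conf(access_key, secret_key):
    """Build Spark conf entries that hand static credentials to S3A.

    The "spark.hadoop." prefix makes Spark copy each entry into the
    Hadoop configuration used by the FileSystem implementation.
    """
    return {
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def build_session(conf):
    """Create (or reuse) a SparkSession with the given conf entries."""
    # Imported lazily so the conf helper above works without pyspark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName("bz2-s3a-sketch")  # hypothetical name
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def copy_bz2_text(spark, src, dst):
    """Read text (bz2 is detected from the .bz2 extension) and write it back as bz2."""
    df = spark.read.text(src)
    # "bzip2" is Spark's short name for the bzip2 codec; the fully
    # qualified Hadoop codec class name shown earlier works as well.
    df.write.option("compression", "bzip2").text(dst)
    return df
```

Usage would be along the lines of spark = build_session(s3a_spark_conf("<access key>", "<secret key>")) followed by copy_bz2_text(spark, "s3a://<S3 bucket>/data/shakespeare-bz2", "s3a://<S3 bucket>/data/shakespeare-bz2-copy").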