Hello,

Thanks for the response, but I meant files compressed with bgzip <http://www.htslib.org/doc/bgzip.html>, not bzip2.
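To make the distinction concrete, here is a minimal sketch in plain Python (no Spark involved). bgzip writes BGZF, which is a series of concatenated gzip members, so such a file starts with the gzip magic bytes 1f 8b, while a bzip2 file starts with "BZh". The helper name sniff_compression is hypothetical, and the two concatenated gzip members below are only a stand-in for real BGZF blocks, which carry additional header fields.

```python
import bz2
import gzip


def sniff_compression(head: bytes) -> str:
    """Guess the compression family from the first bytes of a file."""
    if head[:2] == b"\x1f\x8b":
        # gzip magic number; bgzip (BGZF) output starts the same way
        return "gzip"
    if head[:3] == b"BZh":
        # bzip2 magic number
        return "bzip2"
    return "unknown"


# Stand-in for a bgzipped file: two gzip members back to back.
# (Assumption: real bgzip output also adds BGZF-specific header fields.)
multi_member = gzip.compress(b"line one\n") + gzip.compress(b"line two\n")

# A bzip2-compressed payload for comparison.
bzipped = bz2.compress(b"line one\nline two\n")

# gzip.decompress reads all concatenated members in sequence.
text = gzip.decompress(multi_member)
```

So a reader that understands multi-member gzip can in principle decode bgzip output sequentially, but a bzip2 codec cannot. Whether a particular Spark or Hadoop setup picks the right codec for these files is something I would still need to verify.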
Best,
Oliver

On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org> wrote:

> Hello Oliver,
>
> Yes, Spark makes this possible using the Hadoop compression codecs and the
> Hadoop-compatible FileSystem interface [1]. Here is an example of reading:
>
> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
> df.show(10)
>
> This is using a test data set of the complete works of Shakespeare, stored
> as text and compressed to a single .bz2 file. This code sample didn't need
> to do anything special to declare that it's working with bzip2 compression,
> because the Hadoop compression codecs detect that the file has a .bz2
> extension and automatically assume it needs to be decompressed before
> presenting it to our code in the DataFrame as text.
>
> On the write side, if you wanted to declare a particular kind of output
> compression, you can do it with a write option like this:
>
> df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec").text("gs://<GCS bucket>/data/shakespeare-bz2-copy")
>
> This writes the contents of the DataFrame, stored as text and compressed
> to .bz2 files in the destination path.
>
> My example is testing with a GCS bucket (scheme "gs:"), but you can also
> switch the Hadoop file system interface to target other file systems like
> S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the
> S3AFileSystem, including how to pass credentials for access to the S3
> bucket [2].
>
> Note that for big data use cases, other compression codecs like Snappy are
> generally preferred for greater efficiency. (Of course, we're not always in
> complete control of the data formats we're given, so the support for bz2 is
> there.)
>
> [1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
> [2] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>
> Chris Nauroth
>
>
> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>
>> Hello,
>>
>> Is it possible to read/write a DataFrame from/to a set of bgzipped
>> files? Can it read from/write to AWS S3? Thanks!
>>
>> Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>

--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>