Hello,

  Thanks for the response, but I mean compressed with bgzip
<http://www.htslib.org/doc/bgzip.html>, not bzip2.
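
  For context, here is a minimal sketch of what I'm hoping will work (the
bucket and path are placeholders). Since bgzip produces BGZF, which is also a
valid gzip stream, I'd expect Spark's gzip codec to pick the files up as long
as they keep a .gz extension:

# Sketch: bgzip (BGZF) output is gzip-compatible, so files named *.gz
# should be handled by the standard gzip codec (though each file is read
# as a single stream, not split in parallel).
df = spark.read.text("s3a://<S3 bucket>/data/*.vcf.gz")
df.show(10)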

     Best, Oliver

On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org> wrote:

> Hello Oliver,
>
> Yes, Spark makes this possible using the Hadoop compression codecs and the
> Hadoop-compatible FileSystem interface [1]. Here is an example of reading:
>
> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
> df.show(10)
>
> This is using a test data set of the complete works of Shakespeare, stored
> as text and compressed to a single .bz2 file. The code sample doesn't need
> to declare that it's working with bzip2 compression, because the Hadoop
> compression codecs detect the .bz2 extension and automatically decompress
> the file before presenting it to our code in the DataFrame as text.
>
> On the write side, if you wanted to declare a particular kind of output
> compression, you can do it with a write option like this:
>
> df.write.option("compression",
> "org.apache.hadoop.io.compress.BZip2Codec").text("gs://<GCS
> bucket>/data/shakespeare-bz2-copy")
>
> This writes the contents of the DataFrame as text, compressed to .bz2 files
> in the destination path.
>
> My example is testing with a GCS bucket (scheme "gs:"), but you can also
> switch the Hadoop file system interface to target other file systems like
> S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the
> S3AFileSystem, including how to pass credentials for access to the S3
> bucket [2].
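>
> As a rough sketch (the keys, bucket, and path below are placeholders; see
> [2] for the full set of options), one way is to pass the S3A credentials
> through Spark's Hadoop configuration properties:
>
> from pyspark.sql import SparkSession
>
> # Sketch: forward S3A credentials to Hadoop via "spark.hadoop.*" properties.
> spark = SparkSession.builder \
>     .config("spark.hadoop.fs.s3a.access.key", "<access key>") \
>     .config("spark.hadoop.fs.s3a.secret.key", "<secret key>") \
>     .getOrCreate()
>
> df = spark.read.text("s3a://<S3 bucket>/data/shakespeare-bz2")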
>
> Note that for big data use cases, other compression codecs like Snappy are
> generally preferred for greater efficiency. (Of course, we're not always in
> complete control of the data formats we're given, so the support for bz2 is
> there.)
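>
> For instance (a sketch, with a placeholder path), Snappy is the default
> compression codec for Parquet output, and it can also be requested
> explicitly:
>
> # Sketch: write the DataFrame as Parquet files compressed with Snappy.
> df.write.option("compression", "snappy").parquet("gs://<GCS bucket>/data/shakespeare-parquet")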
>
> [1]
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
> [2]
> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>
> Chris Nauroth
>
>
> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>      Hello,
>>
>>   Is it possible to read/write a DataFrame from/to a set of bgzipped
>> files? Can it read from/write to AWS S3? Thanks!
>>
>>      Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
