Sorry, I misread that in the original email.

This is my first time looking at bgzip. I see from the documentation that
it puts some additional framing around gzip and produces a series of small
blocks, so that you can build an index of the file and decompress
individual blocks instead of the whole file. That's interesting, because it
could potentially support a splittable format. (Plain gzip isn't
splittable.)
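
Out of curiosity, here is a rough sketch of what that framing looks like,
reading just the first block header (the file name is hypothetical). Per
the BGZF spec, each block is a complete gzip member whose "extra" header
field carries a 'BC' subfield holding the block's compressed size, which is
what makes index-based seeks possible:

import struct

with open("data.txt.gz", "rb") as f:
    # Fixed gzip header fields, plus XLEN, which is present because BGZF
    # always sets the FEXTRA flag.
    id1, id2, cm, flg, mtime, xfl, os_, xlen = struct.unpack("<BBBBIBBH", f.read(12))
    assert (id1, id2) == (0x1F, 0x8B) and (flg & 0x04), "not a BGZF file"
    extra = f.read(xlen)
    pos = 0
    while pos < xlen:
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if (si1, si2) == (ord("B"), ord("C")):
            # BSIZE holds the total block size minus 1.
            bsize = struct.unpack_from("<H", extra, pos + 4)[0]
            print("first block size:", bsize + 1)
        pos += 4 + slen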

I also noticed that it states it is "compatible with" gzip. I tried a basic
test of running bgzip on a file, which produced a .gz output file, and then
running the same spark.read.text code sample from earlier. Sure enough, I
was able to read the data. That confirms at least basic compatibility, so
you could read files created by bgzip. However, the read would not be
optimized in any way to take advantage of an index file, and there would be
no way to produce bgzip-style output like in the df.write.option code
sample. Achieving either of those would require writing a custom Hadoop
compression codec that integrates more closely with the data format.
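
For reference, the test I ran looked roughly like this (the file name is
hypothetical):

df = spark.read.text("gs://<GCS bucket>/data/data.txt.gz")
df.show(10)

If someone did write such a codec, I'd expect it to plug in through the
standard Hadoop io.compression.codecs property, along these lines
(org.example.BgzfCodec is a placeholder; no such codec exists today):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.io.compression.codecs", "org.example.BgzfCodec")
    .getOrCreate()
)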

Chris Nauroth


On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>      Hello,
>
>   Thanks for the response, but I mean compressed with bgzip
> <http://www.htslib.org/doc/bgzip.html>, not bzip2.
>
>      Best, Oliver
>
> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org> wrote:
>
>> Hello Oliver,
>>
>> Yes, Spark makes this possible using the Hadoop compression codecs and
>> the Hadoop-compatible FileSystem interface [1]. Here is an example of
>> reading:
>>
>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
>> df.show(10)
>>
>> This is using a test data set of the complete works of Shakespeare,
>> stored as text and compressed to a single .bz2 file. The code sample
>> didn't need to declare that it's working with bzip2 compression, because
>> the Hadoop compression codecs detect the .bz2 extension and decompress
>> the file automatically before presenting it to our code as text in the
>> DataFrame.
>>
>> On the write side, if you wanted to declare a particular kind of output
>> compression, you can do it with a write option like this:
>>
>> df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec") \
>>     .text("gs://<GCS bucket>/data/shakespeare-bz2-copy")
>>
>> This writes the contents of the DataFrame as text, compressed to .bz2
>> files in the destination path.
>>
>> My example is testing with a GCS bucket (scheme "gs:"), but you can also
>> switch the Hadoop file system interface to target other file systems like
>> S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the
>> S3AFileSystem, including how to pass credentials for access to the S3
>> bucket [2].
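>>
>> For example, pointing the same read at S3 would look something like this.
>> The credential settings shown assume simple access keys; the S3A
>> documentation [2] covers the other credential providers:
>>
>> from pyspark.sql import SparkSession
>>
>> # Placeholders, not real keys.
>> spark = (
>>     SparkSession.builder
>>     .config("spark.hadoop.fs.s3a.access.key", "<access key>")
>>     .config("spark.hadoop.fs.s3a.secret.key", "<secret key>")
>>     .getOrCreate()
>> )
>> df = spark.read.text("s3a://<S3 bucket>/data/shakespeare-bz2")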
>>
>> Note that for big data use cases, other compression codecs like Snappy
>> are generally preferred for faster compression and decompression. (Of
>> course, we're not always in complete control of the data formats we're
>> given, so the support for bz2 is there.)
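>>
>> For instance, writing the same DataFrame with Snappy is just a different
>> option value (the destination path is hypothetical):
>>
>> df.write.option("compression", "snappy").text("gs://<GCS bucket>/data/shakespeare-snappy")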
>>
>> [1]
>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>> [2]
>> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>
>> Chris Nauroth
>>
>>
>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>      Hello,
>>>
>>>   Is it possible to read/write a DataFrame from/to a set of bgzipped
>>> files? Can it read from/write to AWS S3? Thanks!
>>>
>>>      Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>
