Hello Chris,

  Yes, you can use gunzip/gzip to decompress a file created by bgzip, but
to start reading from anywhere other than the beginning of the file, you
need an index that tells you where the blocks start. Originally, a Tabix
index was used, and it remains the most popular choice, although other
types of indices also exist.
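
   For example, with pysam (just a sketch; it assumes pysam is installed
and that a Tabix index (.tbi) sits next to the file; the path and region
below are placeholders):

import pysam

# Open the bgzipped file through its Tabix index and fetch only the
# records overlapping the requested region; the index provides block
# offsets, so we never scan from the beginning of the file.
tbx = pysam.TabixFile("data.vcf.gz")
for row in tbx.fetch("chr1", 1000000, 1001000):
    print(row)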

     Best, Oliver

On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth <cnaur...@apache.org> wrote:

> Sorry, I misread that in the original email.
>
> This is my first time looking at bgzip. I see from the documentation that
> it puts some additional framing around gzip and produces a series of small
> blocks, so that you can build an index of the file and decompress
> individual blocks instead of the whole file. That's interesting, because it
> could potentially support a splittable format. (Plain gzip isn't
> splittable.)
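>
> As a quick illustration of that framing (a sketch based on the BGZF spec,
> not tested here; the file name is a placeholder), each block is an
> ordinary gzip member whose extra field carries the block size, which is
> what makes indexing possible:
>
> import struct
>
> # Read the first BGZF block header and pull BSIZE (total block size
> # minus 1) out of the 'BC' extra subfield defined by the BGZF spec.
> with open("example.gz", "rb") as f:
>     header = f.read(12)
>     assert header[:2] == b"\x1f\x8b" and header[3] & 4  # gzip magic, FEXTRA
>     xlen = struct.unpack("<H", header[10:12])[0]
>     extra = f.read(xlen)
>     pos = 0
>     while pos < len(extra):
>         si1, si2 = extra[pos], extra[pos + 1]
>         slen = struct.unpack("<H", extra[pos + 2:pos + 4])[0]
>         if si1 == 66 and si2 == 67:  # 'B', 'C' marks the BGZF subfield
>             bsize = struct.unpack("<H", extra[pos + 4:pos + 6])[0]
>             print("first block is", bsize + 1, "bytes")
>         pos += 4 + slen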
>
> I also noticed that it states it is "compatible with" gzip. I tried a
> basic test of running bgzip on a file, which produced a .gz output file,
> and then running the same spark.read.text code sample from earlier. Sure
> enough, I was able to read the data. This implies there is at least some
> basic compatibility, so that you could read files created by bgzip.
> However, that read would not be optimized in any way to take advantage of
> an index file. There also would not be any way to produce bgzip-style
> output like in the df.write.option code sample. Achieving either of those
> would require writing a custom Hadoop compression codec that integrates
> more closely with the data format.
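>
> For reference, the basic test amounted to this (bucket and file name are
> placeholders; bgzip keeps the .gz extension, so Spark's gzip codec decodes
> the file transparently, reading it end to end rather than block by block):
>
> df = spark.read.text("gs://<GCS bucket>/data/sample-bgzip.gz")
> df.show(10)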
>
> Chris Nauroth
>
>
> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>      Hello,
>>
>>   Thanks for the response, but I mean compressed with bgzip
>> <http://www.htslib.org/doc/bgzip.html>, not bzip2.
>>
>>      Best, Oliver
>>
>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org> wrote:
>>
>>> Hello Oliver,
>>>
>>> Yes, Spark makes this possible using the Hadoop compression codecs and
>>> the Hadoop-compatible FileSystem interface [1]. Here is an example of
>>> reading:
>>>
>>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
>>> df.show(10)
>>>
>>> This uses a test data set of the complete works of Shakespeare, stored
>>> as text and compressed into a single .bz2 file. The code sample doesn't
>>> need to declare that it's working with bzip2 compression, because the
>>> Hadoop compression codecs detect the .bz2 extension and automatically
>>> decompress the file before presenting it to our code in the DataFrame as
>>> text.
>>>
>>> On the write side, if you wanted to declare a particular kind of output
>>> compression, you can do it with a write option like this:
>>>
>>> df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec") \
>>>     .text("gs://<GCS bucket>/data/shakespeare-bz2-copy")
>>>
>>> This writes the contents of the DataFrame, stored as text and compressed
>>> to .bz2 files in the destination path.
>>>
>>> My example is testing with a GCS bucket (scheme "gs:"), but you can also
>>> switch the Hadoop file system interface to target other file systems like
>>> S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the
>>> S3AFileSystem, including how to pass credentials for access to the S3
>>> bucket [2].
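>>>
>>> For example, one way to pass credentials when building the session (the
>>> key values are placeholders, and environment variables or IAM roles are
>>> usually preferable to hardcoding secrets):
>>>
>>> from pyspark.sql import SparkSession
>>>
>>> # The "spark.hadoop." prefix forwards these settings into the Hadoop
>>> # configuration used by the S3A connector.
>>> spark = (SparkSession.builder
>>>     .config("spark.hadoop.fs.s3a.access.key", "<access key>")
>>>     .config("spark.hadoop.fs.s3a.secret.key", "<secret key>")
>>>     .getOrCreate())
>>> df = spark.read.text("s3a://<S3 bucket>/data/shakespeare-bz2")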
>>>
>>> Note that for big data use cases, other compression codecs like Snappy
>>> are generally preferred for greater efficiency. (Of course, we're not
>>> always in complete control of the data formats we're given, so the support
>>> for bz2 is there.)
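>>>
>>> For instance, with a columnar format like Parquet (path is a
>>> placeholder), Snappy is the default codec, and spelling it out just
>>> makes the choice explicit:
>>>
>>> # Write the DataFrame as Snappy-compressed Parquet files.
>>> df.write.option("compression", "snappy") \
>>>     .parquet("gs://<GCS bucket>/data/shakespeare-parquet")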
>>>
>>> [1]
>>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>>> [2]
>>> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>>
>>> Chris Nauroth
>>>
>>>
>>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>
>>>>
>>>>      Hello,
>>>>
>>>>   Is it possible to read/write a DataFrame from/to a set of bgzipped
>>>> files? Can it read from/write to AWS S3? Thanks!
>>>>
>>>>      Best, Oliver
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
