Take a look at https://github.com/nielsbasjes/splittablegzip :D
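
For the Spark side, wiring that codec in is mostly a matter of registering
it ahead of the built-in GzipCodec. A minimal sketch, assuming the artifact
coordinates and codec class name given in that project's README:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    # version is illustrative; check the README for the current release
    .config("spark.jars.packages", "nl.basjes.hadoop:splittablegzip:1.3")
    .config("spark.hadoop.io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
    .getOrCreate())

# Reads of .gz files can now be split across tasks; each task decompresses
# from the start of the file and skips ahead to its own split, trading CPU
# for parallelism.
df = spark.read.text("gs://<GCS bucket>/data/some-large-file.gz")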

On Tue, Dec 6, 2022 at 7:46 AM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>      Hello Holden,
>
>   Thank you for the response, but what is "splittable gzip"?
>
>      Best, Oliver
>
> On Tue, Dec 6, 2022 at 9:22 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> There is the splittable gzip Hadoop input format; maybe someone could
>> extend that to support bgzip?
>>
>> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>      Hello Chris,
>>>
>>>   Yes, you can use gunzip/gzip to uncompress a file created by bgzip,
>>> but to start reading from somewhere other than the beginning of the file,
>>> you would need to use an index to tell you where the blocks start.
>>> Originally, a Tabix index was used and is still the popular choice,
>>> although other types of indices also exist.
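>>>
>>> Both halves of that are easy to see in plain Python (a sketch; the file
>>> name and the offset are made up, and the offset would really come from
>>> the index):
>>>
>>> import gzip, zlib
>>>
>>> # bgzip output is a series of complete gzip members, so the stdlib
>>> # gzip module reads it end to end, just like gunzip:
>>> with gzip.open("variants.tsv.gz", "rb") as f:
>>>     print(f.readline())
>>>
>>> # To start mid-file, seek to a block boundary from the index and
>>> # decompress that one block on its own:
>>> with open("variants.tsv.gz", "rb") as f:
>>>     f.seek(123456)  # hypothetical block offset taken from the index
>>>     d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip wrapper
>>>     chunk = d.decompress(f.read(65536))  # one block is <= 64 KiB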
>>>
>>>      Best, Oliver
>>>
>>> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth <cnaur...@apache.org>
>>> wrote:
>>>
>>>> Sorry, I misread that in the original email.
>>>>
>>>> This is my first time looking at bgzip. I see from the documentation
>>>> that it is putting some additional framing around gzip and producing a
>>>> series of small blocks, such that you can create an index of the file and
>>>> decompress individual blocks instead of the whole file. That's interesting,
>>>> because it could potentially support a splittable format. (Plain gzip isn't
>>>> splittable.)
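>>>>
>>>> The framing is visible with a few lines of Python: each block's gzip
>>>> header carries a "BC" extra subfield holding the compressed block
>>>> length, so you can hop from block to block without decompressing
>>>> anything. A sketch (it assumes the "BC" subfield comes first in the
>>>> header, which is how bgzip writes it):
>>>>
>>>> import struct
>>>>
>>>> def bgzf_block_offsets(path):
>>>>     """Yield the byte offset of each BGZF block, using the BSIZE
>>>>     value that bgzip stores in every gzip header."""
>>>>     with open(path, "rb") as f:
>>>>         pos = 0
>>>>         while True:
>>>>             header = f.read(18)  # fixed gzip header + "BC" subfield
>>>>             if len(header) < 18:
>>>>                 break
>>>>             if header[12:14] != b"BC":
>>>>                 raise ValueError("no BC subfield; not bgzip output?")
>>>>             yield pos
>>>>             # bytes 16..17 hold BSIZE = total block length - 1
>>>>             pos += struct.unpack("<H", header[16:18])[0] + 1
>>>>             f.seek(pos)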
>>>>
>>>> I also noticed that it states it is "compatible with" gzip. I tried a
>>>> basic test of running bgzip on a file, which produced a .gz output file,
>>>> and then running the same spark.read.text code sample from earlier. Sure
>>>> enough, I was able to read the data. This implies there is at least some
>>>> basic compatibility, so that you could read files created by bgzip.
>>>> However, that read would not be optimized in any way to take advantage of
>>>> an index file. There also would not be any way to produce bgzip-style
>>>> output like in the df.write.option code sample. Achieving either of those
>>>> would require writing a custom Hadoop compression codec that integrates
>>>> more closely with the data format.
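>>>>
>>>> For reference, the basic test above amounted to roughly this (the path
>>>> is hypothetical; bgzip keeps the .gz extension, so the stock GzipCodec
>>>> picks the file up):
>>>>
>>>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bgzip.gz")
>>>> df.show(10)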
>>>>
>>>> Chris Nauroth
>>>>
>>>>
>>>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
>>>> oliv...@broadinstitute.org> wrote:
>>>>
>>>>>
>>>>>      Hello,
>>>>>
>>>>>   Thanks for the response, but I mean compressed with bgzip
>>>>> <http://www.htslib.org/doc/bgzip.html>, not bzip2.
>>>>>
>>>>>      Best, Oliver
>>>>>
>>>>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hello Oliver,
>>>>>>
>>>>>> Yes, Spark makes this possible using the Hadoop compression codecs
>>>>>> and the Hadoop-compatible FileSystem interface [1]. Here is an example of
>>>>>> reading:
>>>>>>
>>>>>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
>>>>>> df.show(10)
>>>>>>
>>>>>> This is using a test data set of the complete works of Shakespeare,
>>>>>> stored as text and compressed to a single .bz2 file. This code sample
>>>>>> didn't need to do anything special to declare that it's working with
>>>>>> bzip2 compression, because the Hadoop compression codecs detect that
>>>>>> the file has a .bz2 extension and automatically assume it needs to be
>>>>>> decompressed before presenting it to our code in the DataFrame as text.
>>>>>>
>>>>>> On the write side, if you wanted to declare a particular kind of
>>>>>> output compression, you can do it with a write option like this:
>>>>>>
>>>>>> df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec") \
>>>>>>     .text("gs://<GCS bucket>/data/shakespeare-bz2-copy")
>>>>>>
>>>>>> This writes the contents of the DataFrame, stored as text and
>>>>>> compressed to .bz2 files in the destination path.
>>>>>>
>>>>>> My example is testing with a GCS bucket (scheme "gs:"), but you can
>>>>>> also switch the Hadoop file system interface to target other file systems
>>>>>> like S3 (scheme "s3a:"). Hadoop maintains documentation on how to
>>>>>> configure the S3AFileSystem, including how to pass credentials for
>>>>>> access to the S3 bucket [2].
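>>>>>>
>>>>>> As a sketch (the property names are the standard S3A ones covered in
>>>>>> [2]; in practice credentials usually come from the environment or an
>>>>>> instance role rather than literal config values):
>>>>>>
>>>>>> from pyspark.sql import SparkSession
>>>>>>
>>>>>> spark = (SparkSession.builder
>>>>>>     .config("spark.hadoop.fs.s3a.access.key", "<access key>")
>>>>>>     .config("spark.hadoop.fs.s3a.secret.key", "<secret key>")
>>>>>>     .getOrCreate())
>>>>>>
>>>>>> df = spark.read.text("s3a://<S3 bucket>/data/shakespeare-bz2")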
>>>>>>
>>>>>> Note that for big data use cases, other compression codecs like
>>>>>> Snappy are generally preferred for greater efficiency. (Of course, we're
>>>>>> not always in complete control of the data formats we're given, so the
>>>>>> support for bz2 is there.)
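>>>>>>
>>>>>> For example, a sketch of writing the same DataFrame as
>>>>>> Snappy-compressed Parquet (the destination path is made up; Snappy
>>>>>> also happens to be Spark's default codec for Parquet):
>>>>>>
>>>>>> df.write.option("compression", "snappy").parquet(
>>>>>>     "gs://<GCS bucket>/data/shakespeare-parquet")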
>>>>>>
>>>>>> [1]
>>>>>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>>>>>> [2]
>>>>>> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>>>>>
>>>>>> Chris Nauroth
>>>>>>
>>>>>>
>>>>>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
>>>>>> oliv...@broadinstitute.org> wrote:
>>>>>>
>>>>>>>
>>>>>>>      Hello,
>>>>>>>
>>>>>>>   Is it possible to read/write a DataFrame from/to a set of bgzipped
>>>>>>> files? Can it read from/write to AWS S3? Thanks!
>>>>>>>
>>>>>>>      Best, Oliver
>>>>>>>
>>>>>>> --
>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>> <http://www.broadinstitute.org/>
>>>>>
>>>>
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
> <http://www.broadinstitute.org/>
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau
