Thank you, yes, it would be great if this could be extended to use an
index.

  In our case, we're reading files from Amazon S3. S3 supports ranged GET
requests that fetch only a byte range of an object, and any efficient solution
would need to use these rather than downloading the whole file multiple times.
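
  For illustration, a minimal sketch of such a ranged read using boto3 (the
bucket and key names are placeholders):

import boto3

# Fetch only the requested byte range; S3 returns just that chunk rather
# than the whole object.
s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="my-bucket",          # placeholder
    Key="data/example.gz",       # placeholder
    Range="bytes=65536-131071",  # standard HTTP Range semantics
)
chunk = resp["Body"].read()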


On Tue, Dec 6, 2022 at 10:47 AM Holden Karau <hol...@pigscanfly.ca> wrote:

> Take a look at https://github.com/nielsbasjes/splittablegzip :D
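>
> For reference, a minimal sketch of plugging that codec into Spark, assuming
> the codec class name from the project's README
> (nl.basjes.hadoop.io.compress.SplittableGzipCodec):
>
> from pyspark.sql import SparkSession
>
> # Register the splittable gzip codec with the Hadoop configuration so
> # Spark can split .gz files across tasks.
> spark = (SparkSession.builder
>          .config("spark.hadoop.io.compression.codecs",
>                  "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
>          .getOrCreate())
>
> df = spark.read.text("s3a://<S3 bucket>/data/example.gz")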
>
> On Tue, Dec 6, 2022 at 7:46 AM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>      Hello Holden,
>>
>>   Thank you for the response, but what is "splittable gzip"?
>>
>>      Best, Oliver
>>
>> On Tue, Dec 6, 2022 at 9:22 AM Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> There is the splittable gzip Hadoop input format; maybe someone could
>>> extend that to support bgzip?
>>>
>>> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>
>>>>
>>>>      Hello Chris,
>>>>
>>>>   Yes, you can use gunzip/gzip to uncompress a file created by bgzip,
>>>> but to start reading from somewhere other than the beginning of the file,
>>>> you would need to use an index to tell you where the blocks start.
>>>> Originally, a Tabix index was used and is still the popular choice,
>>>> although other types of indices also exist.
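>>>>
>>>> As a minimal sketch, assuming Biopython's Bio.bgzf module and a
>>>> hypothetical block offset taken from such an index:
>>>>
>>>> from Bio import bgzf
>>>>
>>>> # A BGZF "virtual offset" packs the compressed start of a block together
>>>> # with a position inside the decompressed block; an index supplies both.
>>>> offset = bgzf.make_virtual_offset(65536, 0)  # hypothetical block start
>>>> handle = bgzf.BgzfReader("example.txt.gz")
>>>> handle.seek(offset)  # jump to that block without reading what precedes it
>>>> print(handle.readline())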
>>>>
>>>>      Best, Oliver
>>>>
>>>> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth <cnaur...@apache.org>
>>>> wrote:
>>>>
>>>>> Sorry, I misread that in the original email.
>>>>>
>>>>> This is my first time looking at bgzip. I see from the documentation
>>>>> that it is putting some additional framing around gzip and producing a
>>>>> series of small blocks, such that you can create an index of the file and
>>>>> decompress individual blocks instead of the whole file. That's
>>>>> interesting, because it could potentially support a splittable format.
>>>>> (Plain gzip isn't splittable.)
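>>>>>
>>>>> To make the framing concrete, a small sketch reading the size of the
>>>>> first block from the 'BC' extra field that the BGZF spec defines (the
>>>>> file name is a placeholder):
>>>>>
>>>>> import struct
>>>>>
>>>>> with open("example.txt.gz", "rb") as f:
>>>>>     header = f.read(12)  # fixed gzip header through the XLEN field
>>>>>     assert header[:2] == b"\x1f\x8b" and header[3] & 4  # FEXTRA is set
>>>>>     xlen = struct.unpack("<H", header[10:12])[0]
>>>>>     extra = f.read(xlen)
>>>>>     while extra:
>>>>>         slen = struct.unpack("<H", extra[2:4])[0]
>>>>>         if extra[:2] == b"BC":  # BGZF's block-size subfield
>>>>>             bsize = struct.unpack("<H", extra[4:6])[0] + 1  # stored as size-1
>>>>>             print("first block:", bsize, "bytes")
>>>>>             break
>>>>>         extra = extra[4 + slen:]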
>>>>>
>>>>> I also noticed that it states it is "compatible with" gzip. I tried a
>>>>> basic test of running bgzip on a file, which produced a .gz output file,
>>>>> and then running the same spark.read.text code sample from earlier. Sure
>>>>> enough, I was able to read the data. This implies there is at least some
>>>>> basic compatibility, so that you could read files created by bgzip.
>>>>> However, that read would not be optimized in any way to take advantage of
>>>>> an index file. There also would not be any way to produce bgzip-style
>>>>> output like in the df.write.option code sample. Achieving either of those
>>>>> would require writing a custom Hadoop compression codec that integrates
>>>>> more closely with the data format.
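>>>>>
>>>>> For completeness, the quick test described above looked roughly like this
>>>>> (the path is a placeholder; bgzip names its output with a .gz extension,
>>>>> so the codec picks it up automatically):
>>>>>
>>>>> # shell: bgzip shakespeare.txt   # produces shakespeare.txt.gz
>>>>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare.txt.gz")
>>>>> df.show(10)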
>>>>>
>>>>> Chris Nauroth
>>>>>
>>>>>
>>>>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
>>>>> oliv...@broadinstitute.org> wrote:
>>>>>
>>>>>>
>>>>>>      Hello,
>>>>>>
>>>>>>   Thanks for the response, but I mean compressed with bgzip
>>>>>> <http://www.htslib.org/doc/bgzip.html>, not bzip2.
>>>>>>
>>>>>>      Best, Oliver
>>>>>>
>>>>>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Oliver,
>>>>>>>
>>>>>>> Yes, Spark makes this possible using the Hadoop compression codecs
>>>>>>> and the Hadoop-compatible FileSystem interface [1]. Here is an example
>>>>>>> of reading:
>>>>>>>
>>>>>>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
>>>>>>> df.show(10)
>>>>>>>
>>>>>>> This is using a test data set of the complete works of Shakespeare,
>>>>>>> stored as text and compressed to a single .bz2 file. This code sample
>>>>>>> didn't need to do anything special to declare that it's working with
>>>>>>> bzip2 compression, because the Hadoop compression codecs detect that the
>>>>>>> file has a .bz2 extension and automatically decompress it before
>>>>>>> presenting it to our code in the DataFrame as text.
>>>>>>>
>>>>>>> On the write side, if you wanted to declare a particular kind of
>>>>>>> output compression, you can do it with a write option like this:
>>>>>>>
>>>>>>> df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec") \
>>>>>>>     .text("gs://<GCS bucket>/data/shakespeare-bz2-copy")
>>>>>>>
>>>>>>> This writes the contents of the DataFrame, stored as text and
>>>>>>> compressed to .bz2 files in the destination path.
>>>>>>>
>>>>>>> My example is testing with a GCS bucket (scheme "gs:"), but you can
>>>>>>> also switch the Hadoop file system interface to target other file
>>>>>>> systems like S3 (scheme "s3a:"). Hadoop maintains documentation on how
>>>>>>> to configure the S3AFileSystem, including how to pass credentials for
>>>>>>> access to the S3 bucket [2].
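>>>>>>>
>>>>>>> As a sketch, credentials can be passed through the Spark session (the
>>>>>>> bucket and keys are placeholders; fs.s3a.access.key and
>>>>>>> fs.s3a.secret.key are among the settings documented in [2]):
>>>>>>>
>>>>>>> from pyspark.sql import SparkSession
>>>>>>>
>>>>>>> # The spark.hadoop. prefix forwards these to the Hadoop configuration.
>>>>>>> spark = (SparkSession.builder
>>>>>>>          .config("spark.hadoop.fs.s3a.access.key", "<access key>")
>>>>>>>          .config("spark.hadoop.fs.s3a.secret.key", "<secret key>")
>>>>>>>          .getOrCreate())
>>>>>>>
>>>>>>> df = spark.read.text("s3a://<S3 bucket>/data/shakespeare-bz2")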
>>>>>>>
>>>>>>> Note that for big data use cases, other compression codecs like
>>>>>>> Snappy are generally preferred for greater efficiency. (Of course, we're
>>>>>>> not always in complete control of the data formats we're given, so the
>>>>>>> support for bz2 is there.)
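>>>>>>>
>>>>>>> For example ("snappy" is one of the short names the text writer's
>>>>>>> compression option accepts; the path is a placeholder):
>>>>>>>
>>>>>>> df.write.option("compression", "snappy").text("gs://<GCS bucket>/data/shakespeare-snappy")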
>>>>>>>
>>>>>>> [1]
>>>>>>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>>>>>>> [2]
>>>>>>> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>>>>>>
>>>>>>> Chris Nauroth
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
>>>>>>> oliv...@broadinstitute.org> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>      Hello,
>>>>>>>>
>>>>>>>>   Is it possible to read/write a DataFrame from/to a set of
>>>>>>>> bgzipped files? Can it read from/write to AWS S3? Thanks!
>>>>>>>>
>>>>>>>>      Best, Oliver
>>>>>>>>
>>>>>>>> --
>>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Oliver Ruebenacker, Ph.D. (he)
>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>> <http://www.broadinstitute.org/>
>>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>> <http://www.broadinstitute.org/>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
