Hello Holden,

Thank you for the response, but what is "splittable gzip"?
Best, Oliver

On Tue, Dec 6, 2022 at 9:22 AM Holden Karau <hol...@pigscanfly.ca> wrote:

> There is the splittable gzip Hadoop input format; maybe someone could
> extend that to support bgzip?
>
> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>
>> Hello Chris,
>>
>> Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but
>> to start reading from somewhere other than the beginning of the file,
>> you would need an index to tell you where the blocks start. Originally,
>> a Tabix index was used, and it is still the popular choice, although
>> other types of indices also exist.
>>
>> Best, Oliver
>>
>> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth <cnaur...@apache.org> wrote:
>>
>>> Sorry, I misread that in the original email.
>>>
>>> This is my first time looking at bgzip. I see from the documentation
>>> that it puts some additional framing around gzip and produces a series
>>> of small blocks, such that you can create an index of the file and
>>> decompress individual blocks instead of the whole file. That's
>>> interesting, because it could potentially support a splittable format.
>>> (Plain gzip isn't splittable.)
>>>
>>> I also noticed that it states it is "compatible with" gzip. I tried a
>>> basic test of running bgzip on a file, which produced a .gz output
>>> file, and then running the same spark.read.text code sample from
>>> earlier. Sure enough, I was able to read the data. This implies there
>>> is at least some basic compatibility, so you can read files created by
>>> bgzip. However, that read would not be optimized in any way to take
>>> advantage of an index file, and there would be no way to produce
>>> bgzip-style output like in the df.write.option code sample. Achieving
>>> either of those would require writing a custom Hadoop compression
>>> codec that integrates more closely with the data format.
>>>
>>> Chris Nauroth
>>>
>>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thanks for the response, but I mean compressed with bgzip
>>>> <http://www.htslib.org/doc/bgzip.html>, not bzip2.
>>>>
>>>> Best, Oliver
>>>>
>>>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org> wrote:
>>>>
>>>>> Hello Oliver,
>>>>>
>>>>> Yes, Spark makes this possible using the Hadoop compression codecs
>>>>> and the Hadoop-compatible FileSystem interface [1]. Here is an
>>>>> example of reading:
>>>>>
>>>>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
>>>>> df.show(10)
>>>>>
>>>>> This uses a test data set of the complete works of Shakespeare,
>>>>> stored as text and compressed to a single .bz2 file. The code sample
>>>>> doesn't need to do anything special to declare that it's working with
>>>>> bzip2 compression, because the Hadoop compression codecs detect the
>>>>> .bz2 extension and automatically decompress the file before
>>>>> presenting it to our code in the DataFrame as text.
>>>>>
>>>>> On the write side, if you want to declare a particular kind of output
>>>>> compression, you can do it with a write option like this:
>>>>>
>>>>> df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec").text("gs://<GCS bucket>/data/shakespeare-bz2-copy")
>>>>>
>>>>> This writes the contents of the DataFrame as text, compressed to .bz2
>>>>> files in the destination path.
>>>>>
>>>>> My example tests with a GCS bucket (scheme "gs:"), but you can also
>>>>> switch the Hadoop file system interface to target other file systems
>>>>> like S3 (scheme "s3a:"). Hadoop maintains documentation on how to
>>>>> configure the S3AFileSystem, including how to pass credentials for
>>>>> access to the S3 bucket [2].
>>>>>
>>>>> Note that for big data use cases, other compression codecs like
>>>>> Snappy are generally preferred for greater efficiency. (Of course,
>>>>> we're not always in complete control of the data formats we're given,
>>>>> so the support for bz2 is there.)
>>>>>
>>>>> [1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>>>>> [2] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>>>>
>>>>> Chris Nauroth
>>>>>
>>>>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Is it possible to read/write a DataFrame from/to a set of bgzipped
>>>>>> files? Can it read from/write to AWS S3? Thanks!
>>>>>>
>>>>>> Best, Oliver
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>>> <http://www.broadinstitute.org/>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
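[Editor's note: the "compatible with gzip" behavior Chris observed follows from the BGZF layout: each block is an independent, standard gzip member, and the members are simply concatenated, so plain gzip readers decompress the whole file, while an index of block offsets enables random access. The sketch below mimics only that member-per-block idea using the Python standard library; the data, block contents, and toy `offsets` index are invented for illustration, and real BGZF additionally carries the compressed block size in a gzip extra field, which tools like Tabix rely on.]

```python
import gzip
import zlib

# Emulate a BGZF-style file: compress each block as its own gzip member
# and concatenate the members, recording where each one starts.
blocks = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr1\t300\tG\n"]

offsets = []        # toy index: byte offset of each compressed block
compressed = b""
for block in blocks:
    offsets.append(len(compressed))
    compressed += gzip.compress(block)

# 1) Whole-file compatibility: a plain gzip reader treats the
#    concatenated members as one continuous stream.
assert gzip.decompress(compressed) == b"".join(blocks)

# 2) Random access: seek to an indexed offset and inflate only that one
#    member; decompressobj stops at the member boundary, leaving the
#    rest of the input in unused_data.
d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
one_block = d.decompress(compressed[offsets[1]:])
assert one_block == blocks[1]
```

A custom Hadoop codec for bgzip would apply the same second trick per split, which is what makes the format potentially splittable.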