Take a look at https://github.com/nielsbasjes/splittablegzip :D
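For anyone who wants to try it: the project README describes registering the codec via Hadoop configuration so it takes precedence over the built-in gzip codec. A sketch of wiring it into a Spark job might look like this (the artifact version and job script name are assumptions; check the repository for current details):

```shell
# Sketch: put the splittable gzip codec on the classpath and register it
# ahead of Hadoop's built-in GzipCodec, so .gz inputs can be split across
# tasks (each task decompresses from the start and discards bytes before
# its own split -- a CPU-for-parallelism trade-off).
# The version number (1.3) and script name below are assumptions.
spark-submit \
  --packages nl.basjes.hadoop:splittablegzip:1.3 \
  --conf spark.hadoop.io.compression.codecs=nl.basjes.hadoop.io.compress.SplittableGzipCodec \
  my_job.py
```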
On Tue, Dec 6, 2022 at 7:46 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

Hello Holden,

Thank you for the response, but what is "splittable gzip"?

Best, Oliver

On Tue, Dec 6, 2022 at 9:22 AM Holden Karau <hol...@pigscanfly.ca> wrote:

There is the splittable gzip Hadoop input format; maybe someone could extend it to support bgzip?

On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

Hello Chris,

Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but to start reading from somewhere other than the beginning of the file, you would need an index to tell you where the blocks start. Originally, a Tabix index was used, and it is still the popular choice, although other types of indices also exist.

Best, Oliver

On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth <cnaur...@apache.org> wrote:

Sorry, I misread that in the original email.

This is my first time looking at bgzip. I see from the documentation that it puts some additional framing around gzip and produces a series of small blocks, so that you can create an index of the file and decompress individual blocks instead of the whole file. That's interesting, because it could potentially support a splittable format. (Plain gzip isn't splittable.)

I also noticed that it states it is "compatible with" gzip. I tried a basic test of running bgzip on a file, which produced a .gz output file, and then running the same spark.read.text code sample from earlier. Sure enough, I was able to read the data. This implies there is at least some basic compatibility, so you can read files created by bgzip. However, that read would not be optimized in any way to take advantage of an index file.
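That compatibility observation can be reproduced without Spark or even bgzip itself: BGZF output is a sequence of standard gzip members, and any multi-member-aware gzip reader, including Python's standard-library gzip module, decompresses them in order. A minimal sketch, with plain gzip members standing in for BGZF blocks:

```python
import gzip

# bgzip's BGZF format is, at the byte level, a series of small standalone
# gzip members; two ordinary gzip members stand in for BGZF blocks here.
block1 = gzip.compress(b"hello ")
block2 = gzip.compress(b"world")
blob = block1 + block2  # mimics a (tiny) BGZF-style file

# A multi-member-aware gzip reader decompresses the whole series...
assert gzip.decompress(blob) == b"hello world"

# ...and, given a block's byte offset (the kind of thing a Tabix-style
# index stores), one block can be decompressed without reading the rest.
offset = len(block1)
assert gzip.decompress(blob[offset:]) == b"world"
```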
There also would not be any way to produce bgzip-style output like in the df.write.option code sample. To achieve either of those, you would need a custom Hadoop compression codec that integrates more closely with the data format.

Chris Nauroth

On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

Hello,

Thanks for the response, but I mean compressed with bgzip <http://www.htslib.org/doc/bgzip.html>, not bzip2.

Best, Oliver

On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org> wrote:

Hello Oliver,

Yes, Spark makes this possible using the Hadoop compression codecs and the Hadoop-compatible FileSystem interface [1]. Here is an example of reading:

    df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
    df.show(10)

This uses a test data set of the complete works of Shakespeare, stored as text and compressed to a single .bz2 file. The code sample didn't need to do anything special to declare that it's working with bzip2 compression: the Hadoop compression codecs detect the .bz2 extension and automatically decompress the file before presenting it to our code in the DataFrame as text.

On the write side, if you want a particular kind of output compression, you can declare it with a write option:

    df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec").text("gs://<GCS bucket>/data/shakespeare-bz2-copy")

This writes the contents of the DataFrame, stored as text and compressed to .bz2 files in the destination path.
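Under the hood, Hadoop's BZip2Codec reads and writes the same bzip2 stream format as Python's standard-library bz2 module, so the per-file effect of that codec can be sketched without a cluster:

```python
import bz2

# The same bzip2 stream format that Hadoop's BZip2Codec produces and
# consumes; this round trip mirrors what happens to each .bz2 part file.
original = b"Shall I compare thee to a summer's day?\n" * 100
compressed = bz2.compress(original)

assert len(compressed) < len(original)          # repetitive text compresses well
assert bz2.decompress(compressed) == original   # lossless round trip
```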
My example tests against a GCS bucket (scheme "gs:"), but you can also switch the Hadoop file system interface to target other file systems like S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the S3AFileSystem, including how to pass credentials for access to the S3 bucket [2].

Note that for big data use cases, other compression codecs like Snappy are generally preferred for greater efficiency. (Of course, we're not always in complete control of the data formats we're given, so the support for bz2 is there.)

[1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
[2] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

Chris Nauroth

On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:

Hello,

Is it possible to read/write a DataFrame from/to a set of bgzipped files? Can it read from/write to AWS S3? Thanks!

Best, Oliver

--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad Institute <http://www.broadinstitute.org/>

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau