Hello Holden,

Thank you for the response, but what is "splittable gzip"?
Best, Oliver

On Tue, Dec 6, 2022 at 9:22 AM Holden Karau <hol...@pigscanfly.ca> wrote:

> There is the splittable gzip Hadoop input format; maybe someone could
> extend that to support bgzip?
>
> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>
>> Hello Chris,
>>
>> Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but
>> to start reading from somewhere other than the beginning of the file,
>> you would need an index to tell you where the blocks start. Originally,
>> a Tabix index was used, and it is still the popular choice, although
>> other types of indices also exist.
>>
>> Best, Oliver
>>
>> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth <cnaur...@apache.org> wrote:
>>
>>> Sorry, I misread that in the original email.
>>>
>>> This is my first time looking at bgzip. I see from the documentation
>>> that it puts some additional framing around gzip and produces a series
>>> of small blocks, such that you can create an index of the file and
>>> decompress individual blocks instead of the whole file. That's
>>> interesting, because it could potentially support a splittable format.
>>> (Plain gzip isn't splittable.)
>>>
>>> I also noticed that it states it is "compatible with" gzip. I tried a
>>> basic test of running bgzip on a file, which produced a .gz output
>>> file, and then running the same spark.read.text code sample from
>>> earlier. Sure enough, I was able to read the data. This implies there
>>> is at least some basic compatibility, so you can read files created by
>>> bgzip. However, that read would not be optimized in any way to take
>>> advantage of an index file, and there would be no way to produce
>>> bgzip-style output like in the df.write.option code sample. Achieving
>>> either of those would require writing a custom Hadoop compression
>>> codec that integrates more closely with the data format.
>>>
>>> Chris Nauroth
>>>
>>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thanks for the response, but I mean compressed with bgzip
>>>> <http://www.htslib.org/doc/bgzip.html>, not bzip2.
>>>>
>>>> Best, Oliver
>>>>
>>>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth <cnaur...@apache.org> wrote:
>>>>
>>>>> Hello Oliver,
>>>>>
>>>>> Yes, Spark makes this possible using the Hadoop compression codecs
>>>>> and the Hadoop-compatible FileSystem interface [1]. Here is an
>>>>> example of reading:
>>>>>
>>>>> df = spark.read.text("gs://<GCS bucket>/data/shakespeare-bz2")
>>>>> df.show(10)
>>>>>
>>>>> This uses a test data set of the complete works of Shakespeare,
>>>>> stored as text and compressed to a single .bz2 file. The code sample
>>>>> doesn't need to do anything special to declare that it's working with
>>>>> bzip2 compression, because the Hadoop compression codecs detect the
>>>>> .bz2 extension and automatically decompress the file before
>>>>> presenting it to our code in the DataFrame as text.
>>>>>
>>>>> On the write side, if you want to declare a particular kind of output
>>>>> compression, you can do it with a write option like this:
>>>>>
>>>>> df.write.option("compression", "org.apache.hadoop.io.compress.BZip2Codec").text("gs://<GCS bucket>/data/shakespeare-bz2-copy")
>>>>>
>>>>> This writes the contents of the DataFrame as text, compressed to .bz2
>>>>> files in the destination path.
>>>>>
>>>>> My example tests with a GCS bucket (scheme "gs:"), but you can also
>>>>> switch the Hadoop file system interface to target other file systems
>>>>> like S3 (scheme "s3a:"). Hadoop maintains documentation on how to
>>>>> configure the S3AFileSystem, including how to pass credentials for
>>>>> access to the S3 bucket [2].
>>>>>
>>>>> Note that for big data use cases, other compression codecs like
>>>>> Snappy are generally preferred for greater efficiency. (Of course,
>>>>> we're not always in complete control of the data formats we're given,
>>>>> so the support for bz2 is there.)
>>>>>
>>>>> [1] https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>>>>> [2] https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>>>>
>>>>> Chris Nauroth
>>>>>
>>>>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Is it possible to read/write a DataFrame from/to a set of bgzipped
>>>>>> files? Can it read from/write to AWS S3? Thanks!
>>>>>>
>>>>>> Best, Oliver
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>>> <http://www.broadinstitute.org/>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
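[Editor's note: the "compatible with gzip" behavior Chris observed follows from the BGZF layout: each block is an independent, standard gzip member, and the members are simply concatenated, so plain gzip readers decompress the whole file, while an index of block offsets enables random access. The sketch below mimics only that member-per-block idea using the Python standard library; the data, block contents, and toy `offsets` index are invented for illustration, and real BGZF additionally carries the compressed block size in a gzip extra field, which tools like Tabix rely on.]

```python
import gzip
import zlib

# Emulate a BGZF-style file: compress each block as its own gzip member
# and concatenate the members, recording where each one starts.
blocks = [b"chr1\t100\tA\n", b"chr1\t200\tC\n", b"chr1\t300\tG\n"]

offsets = []        # toy index: byte offset of each compressed block
compressed = b""
for block in blocks:
    offsets.append(len(compressed))
    compressed += gzip.compress(block)

# 1) Whole-file compatibility: a plain gzip reader treats the
#    concatenated members as one continuous stream.
assert gzip.decompress(compressed) == b"".join(blocks)

# 2) Random access: seek to an indexed offset and inflate only that one
#    member; decompressobj stops at the member boundary, leaving the
#    rest of the input in unused_data.
d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
one_block = d.decompress(compressed[offsets[1]:])
assert one_block == blocks[1]
```

A custom Hadoop codec for bgzip would apply the same second trick per split, which is what makes the format potentially splittable.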