Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Oliver Ruebenacker
  Thank you, yes, it would be great if this could be extended to use an
index.

  In our case, we're reading files from Amazon S3. S3 does offer the option
to request only a chunk out of a file, and any efficient solution would
need to use this rather than downloading the file multiple times.
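
  For reference, this is roughly what a single ranged read looks like outside
of Spark, using boto3 (bucket, key, and byte range below are placeholders; the
real offsets would have to come from a bgzip/Tabix index):

import boto3

s3 = boto3.client("s3")
# Ask S3 for just the bytes of one compressed block instead of the whole object.
response = s3.get_object(
    Bucket="my-bucket",
    Key="data/sample.tsv.bgz",
    Range="bytes=0-65535",
)
block_bytes = response["Body"].read()
print(len(block_bytes))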


On Tue, Dec 6, 2022 at 10:47 AM Holden Karau  wrote:

> Take a look at https://github.com/nielsbasjes/splittablegzip :D
>
> On Tue, Dec 6, 2022 at 7:46 AM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello Holden,
>>
>>   Thank you for the response, but what is "splittable gzip"?
>>
>>  Best, Oliver
>>
>> On Tue, Dec 6, 2022 at 9:22 AM Holden Karau  wrote:
>>
>>> There is the splittable gzip Hadoop input format, maybe someone could
>>> extend that to support bgzip?
>>>
>>> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>

  Hello Chris,

   Yes, you can use gunzip/gzip to uncompress a file created by bgzip,
 but to start reading from somewhere other than the beginning of the file,
 you would need to use an index to tell you where the blocks start.
 Originally, a Tabix index was used and is still the popular choice,
 although other types of indices also exist.

  Best, Oliver

 On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth 
 wrote:

> Sorry, I misread that in the original email.
>
> This is my first time looking at bgzip. I see from the documentation
> that it is putting some additional framing around gzip and producing a
> series of small blocks, such that you can create an index of the file and
> decompress individual blocks instead of the whole file. That's 
> interesting,
> because it could potentially support a splittable format. (Plain gzip 
> isn't
> splittable.)
>
> I also noticed that it states it is "compatible with" gzip. I tried a
> basic test of running bgzip on a file, which produced a .gz output file,
> and then running the same spark.read.text code sample from earlier. Sure
> enough, I was able to read the data. This implies there is at least some
> basic compatibility, so that you could read files created by bgzip.
> However, that read would not be optimized in any way to take advantage of
> an index file. There also would not be any way to produce bgzip-style
> output like in the df.write.option code sample. To achieve either of 
> those,
> it would require writing a custom Hadoop compression codec to integrate
> more closely with the data format.
>
> Chris Nauroth
>
>
> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   Thanks for the response, but I mean compressed with bgzip
>> , not bzip2.
>>
>>  Best, Oliver
>>
>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth 
>> wrote:
>>
>>> Hello Oliver,
>>>
>>> Yes, Spark makes this possible using the Hadoop compression codecs
>>> and the Hadoop-compatible FileSystem interface [1]. Here is an example 
>>> of
>>> reading:
>>>
>>> df = spark.read.text("gs://<bucket>/data/shakespeare-bz2")
>>> df.show(10)
>>>
>>> This is using a test data set of the complete works of Shakespeare,
>>> stored as text and compressed to a single .bz2 file. This code sample
>>> didn't need to do anything special to declare that it's working with 
>>> bzip2
>>> compression, because the Hadoop compression codecs detect that the file 
>>> has
>>> a .bz2 extension and automatically assume it needs to be decompressed
>>> before presenting it to our code in the DataFrame as text.
>>>
>>> On the write side, if you wanted to declare a particular kind of
>>> output compression, you can do it with a write option like this:
>>>
>>> df.write.option("compression",
>>> "org.apache.hadoop.io.compress.BZip2Codec").text("gs://>> bucket>/data/shakespeare-bz2-copy")
>>>
>>> This writes the contents of the DataFrame, stored as text and
>>> compressed to .bz2 files in the destination path.
>>>
>>> My example is testing with a GCS bucket (scheme "gs:"), but you can
>>> also switch the Hadoop file system interface to target other file 
>>> systems
>>> like S3 (scheme "s3a:"). Hadoop maintains documentation on how to 
>>> configure
>>> the S3AFileSystem, including how to pass credentials for access to the
>>> S3
>>> bucket [2].
>>>
>>> Note that for big data use cases, other compression codecs like
>>> Snappy are generally preferred for greater efficiency. (Of course, we're
>>> not always in complete control of the data formats we're given, so the
>>> support for bz2 is there.)
>>>
>>> [1]
>>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
Take a look at https://github.com/nielsbasjes/splittablegzip :D
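
Untested sketch of how that codec might be wired up from PySpark, assuming the
codec class and Maven coordinates given in that project's README (the version
is only an example; check the project for the current release):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("splittable-gzip-example")
    # Ship the codec jar to the executors (coordinates/version per the project's README).
    .config("spark.jars.packages", "nl.basjes.hadoop:splittablegzip:1.3")
    # Register the codec with the Hadoop configuration Spark uses.
    .config("spark.hadoop.io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
    .getOrCreate()
)

# A plain .gz file can then be read with multiple splits per file.
df = spark.read.text("s3a://my-bucket/data/large-file.gz")
df.show(10)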

On Tue, Dec 6, 2022 at 7:46 AM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello Holden,
>
>   Thank you for the response, but what is "splittable gzip"?
>
>  Best, Oliver
>
> On Tue, Dec 6, 2022 at 9:22 AM Holden Karau  wrote:
>
>> There is the splittable gzip Hadoop input format, maybe someone could
>> extend that to support bgzip?
>>
>> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello Chris,
>>>
>>>   Yes, you can use gunzip/gzip to uncompress a file created by bgzip,
>>> but to start reading from somewhere other than the beginning of the file,
>>> you would need to use an index to tell you where the blocks start.
>>> Originally, a Tabix index was used and is still the popular choice,
>>> although other types of indices also exist.
>>>
>>>  Best, Oliver
>>>
>>> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth 
>>> wrote:
>>>
 Sorry, I misread that in the original email.

 This is my first time looking at bgzip. I see from the documentation
 that it is putting some additional framing around gzip and producing a
 series of small blocks, such that you can create an index of the file and
 decompress individual blocks instead of the whole file. That's interesting,
 because it could potentially support a splittable format. (Plain gzip isn't
 splittable.)

 I also noticed that it states it is "compatible with" gzip. I tried a
 basic test of running bgzip on a file, which produced a .gz output file,
 and then running the same spark.read.text code sample from earlier. Sure
 enough, I was able to read the data. This implies there is at least some
 basic compatibility, so that you could read files created by bgzip.
 However, that read would not be optimized in any way to take advantage of
 an index file. There also would not be any way to produce bgzip-style
 output like in the df.write.option code sample. To achieve either of those,
 it would require writing a custom Hadoop compression codec to integrate
 more closely with the data format.

 Chris Nauroth


 On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
 oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   Thanks for the response, but I mean compressed with bgzip
> , not bzip2.
>
>  Best, Oliver
>
> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth 
> wrote:
>
>> Hello Oliver,
>>
>> Yes, Spark makes this possible using the Hadoop compression codecs
>> and the Hadoop-compatible FileSystem interface [1]. Here is an example of
>> reading:
>>
>> df = spark.read.text("gs://<bucket>/data/shakespeare-bz2")
>> df.show(10)
>>
>> This is using a test data set of the complete works of Shakespeare,
>> stored as text and compressed to a single .bz2 file. This code sample
>> didn't need to do anything special to declare that it's working with 
>> bzip2
>> compression, because the Hadoop compression codecs detect that the file 
>> has
>> a .bz2 extension and automatically assume it needs to be decompressed
>> before presenting it to our code in the DataFrame as text.
>>
>> On the write side, if you wanted to declare a particular kind of
>> output compression, you can do it with a write option like this:
>>
>> df.write.option("compression",
>> "org.apache.hadoop.io.compress.BZip2Codec").text("gs://> bucket>/data/shakespeare-bz2-copy")
>>
>> This writes the contents of the DataFrame, stored as text and
>> compressed to .bz2 files in the destination path.
>>
>> My example is testing with a GCS bucket (scheme "gs:"), but you can
>> also switch the Hadoop file system interface to target other file systems
>> like S3 (scheme "s3a:"). Hadoop maintains documentation on how to 
>> configure
>> the S3AFileSystem, including how to pass credentials for access to the S3
>> bucket [2].
>>
>> Note that for big data use cases, other compression codecs like
>> Snappy are generally preferred for greater efficiency. (Of course, we're
>> not always in complete control of the data formats we're given, so the
>> support for bz2 is there.)
>>
>> [1]
>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>> [2]
>> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>
>> Chris Nauroth
>>
>>
>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   Is it possible to read/write a DataFrame from/to a set of bgzipped
>>> files? Can it read from/write to AWS S3? Thanks!
>>>
>>>  Best, Oliver
>>>

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Oliver Ruebenacker
 Hello Holden,

  Thank you for the response, but what is "splittable gzip"?

 Best, Oliver

On Tue, Dec 6, 2022 at 9:22 AM Holden Karau  wrote:

> There is the splittable gzip Hadoop input format, maybe someone could
> extend that to support bgzip?
>
> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello Chris,
>>
>>   Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but
>> to start reading from somewhere other than the beginning of the file, you
>> would need to use an index to tell you where the blocks start. Originally,
>> a Tabix index was used and is still the popular choice, although other
>> types of indices also exist.
>>
>>  Best, Oliver
>>
>> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth  wrote:
>>
>>> Sorry, I misread that in the original email.
>>>
>>> This is my first time looking at bgzip. I see from the documentation
>>> that it is putting some additional framing around gzip and producing a
>>> series of small blocks, such that you can create an index of the file and
>>> decompress individual blocks instead of the whole file. That's interesting,
>>> because it could potentially support a splittable format. (Plain gzip isn't
>>> splittable.)
>>>
>>> I also noticed that it states it is "compatible with" gzip. I tried a
>>> basic test of running bgzip on a file, which produced a .gz output file,
>>> and then running the same spark.read.text code sample from earlier. Sure
>>> enough, I was able to read the data. This implies there is at least some
>>> basic compatibility, so that you could read files created by bgzip.
>>> However, that read would not be optimized in any way to take advantage of
>>> an index file. There also would not be any way to produce bgzip-style
>>> output like in the df.write.option code sample. To achieve either of those,
>>> it would require writing a custom Hadoop compression codec to integrate
>>> more closely with the data format.
>>>
>>> Chris Nauroth
>>>
>>>
>>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>

  Hello,

   Thanks for the response, but I mean compressed with bgzip
 , not bzip2.

  Best, Oliver

 On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth 
 wrote:

> Hello Oliver,
>
> Yes, Spark makes this possible using the Hadoop compression codecs and
> the Hadoop-compatible FileSystem interface [1]. Here is an example of
> reading:
>
> df = spark.read.text("gs://<bucket>/data/shakespeare-bz2")
> df.show(10)
>
> This is using a test data set of the complete works of Shakespeare,
> stored as text and compressed to a single .bz2 file. This code sample
> didn't need to do anything special to declare that it's working with bzip2
> compression, because the Hadoop compression codecs detect that the file 
> has
> a .bz2 extension and automatically assume it needs to be decompressed
> before presenting it to our code in the DataFrame as text.
>
> On the write side, if you wanted to declare a particular kind of
> output compression, you can do it with a write option like this:
>
> df.write.option("compression",
> "org.apache.hadoop.io.compress.BZip2Codec").text("gs:// bucket>/data/shakespeare-bz2-copy")
>
> This writes the contents of the DataFrame, stored as text and
> compressed to .bz2 files in the destination path.
>
> My example is testing with a GCS bucket (scheme "gs:"), but you can
> also switch the Hadoop file system interface to target other file systems
> like S3 (scheme "s3a:"). Hadoop maintains documentation on how to 
> configure
> the S3AFileSystem, including how to pass credentials for access to the S3
> bucket [2].
>
> Note that for big data use cases, other compression codecs like Snappy
> are generally preferred for greater efficiency. (Of course, we're not
> always in complete control of the data formats we're given, so the support
> for bz2 is there.)
>
> [1]
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
> [2]
> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>
> Chris Nauroth
>
>
> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   Is it possible to read/write a DataFrame from/to a set of bgzipped
>> files? Can it read from/write to AWS S3? Thanks!
>>
>>  Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network
>> , Flannick Lab , Broad
>> Institute 
>>
>

 --
 Oliver Ruebenacker, Ph.D. (he)

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
There is the splittable gzip Hadoop input format, maybe someone could
extend that to support bgzip?

On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello Chris,
>
>   Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but
> to start reading from somewhere other than the beginning of the file, you
> would need to use an index to tell you where the blocks start. Originally,
> a Tabix index was used and is still the popular choice, although other
> types of indices also exist.
>
>  Best, Oliver
>
> On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth  wrote:
>
>> Sorry, I misread that in the original email.
>>
>> This is my first time looking at bgzip. I see from the documentation that
>> it is putting some additional framing around gzip and producing a series of
>> small blocks, such that you can create an index of the file and decompress
>> individual blocks instead of the whole file. That's interesting, because it
>> could potentially support a splittable format. (Plain gzip isn't
>> splittable.)
>>
>> I also noticed that it states it is "compatible with" gzip. I tried a
>> basic test of running bgzip on a file, which produced a .gz output file,
>> and then running the same spark.read.text code sample from earlier. Sure
>> enough, I was able to read the data. This implies there is at least some
>> basic compatibility, so that you could read files created by bgzip.
>> However, that read would not be optimized in any way to take advantage of
>> an index file. There also would not be any way to produce bgzip-style
>> output like in the df.write.option code sample. To achieve either of those,
>> it would require writing a custom Hadoop compression codec to integrate
>> more closely with the data format.
>>
>> Chris Nauroth
>>
>>
>> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   Thanks for the response, but I mean compressed with bgzip
>>> , not bzip2.
>>>
>>>  Best, Oliver
>>>
>>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth 
>>> wrote:
>>>
 Hello Oliver,

 Yes, Spark makes this possible using the Hadoop compression codecs and
 the Hadoop-compatible FileSystem interface [1]. Here is an example of
 reading:

 df = spark.read.text("gs://<bucket>/data/shakespeare-bz2")
 df.show(10)

 This is using a test data set of the complete works of Shakespeare,
 stored as text and compressed to a single .bz2 file. This code sample
 didn't need to do anything special to declare that it's working with bzip2
 compression, because the Hadoop compression codecs detect that the file has
 a .bz2 extension and automatically assume it needs to be decompressed
 before presenting it to our code in the DataFrame as text.

 On the write side, if you wanted to declare a particular kind of output
 compression, you can do it with a write option like this:

 df.write.option("compression",
 "org.apache.hadoop.io.compress.BZip2Codec").text("gs://>>> bucket>/data/shakespeare-bz2-copy")

 This writes the contents of the DataFrame, stored as text and
 compressed to .bz2 files in the destination path.

 My example is testing with a GCS bucket (scheme "gs:"), but you can
 also switch the Hadoop file system interface to target other file systems
 like S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure
 the S3AFileSystem, including how to pass credentials for access to the S3
 bucket [2].

 Note that for big data use cases, other compression codecs like Snappy
 are generally preferred for greater efficiency. (Of course, we're not
 always in complete control of the data formats we're given, so the support
 for bz2 is there.)

 [1]
 https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
 [2]
 https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

 Chris Nauroth


 On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
 oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   Is it possible to read/write a DataFrame from/to a set of bgzipped
> files? Can it read from/write to AWS S3? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network ,
> Flannick Lab , Broad Institute
> 
>

>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network , 
>>> Flannick
>>> Lab , Broad Institute
>>> 
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network 

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Oliver Ruebenacker
 Hello Chris,

  Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but
to start reading from somewhere other than the beginning of the file, you
would need to use an index to tell you where the blocks start. Originally,
a Tabix index was used and is still the popular choice, although other
types of indices also exist.
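
  For a sense of what index-based block access looks like outside of Spark,
here is a minimal sketch using pysam (not something this thread uses; the file
name is a placeholder and assumes a bgzipped, tabix-indexed file):

import pysam

# Opens data.tsv.bgz together with its Tabix index (data.tsv.bgz.tbi).
tbx = pysam.TabixFile("data.tsv.bgz")

# The index maps coordinates to bgzip block offsets, so only the relevant
# blocks are decompressed rather than the whole file.
for row in tbx.fetch("chr1", 1000000, 1010000):
    print(row)

tbx.close()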

 Best, Oliver

On Mon, Dec 5, 2022 at 6:17 PM Chris Nauroth  wrote:

> Sorry, I misread that in the original email.
>
> This is my first time looking at bgzip. I see from the documentation that
> it is putting some additional framing around gzip and producing a series of
> small blocks, such that you can create an index of the file and decompress
> individual blocks instead of the whole file. That's interesting, because it
> could potentially support a splittable format. (Plain gzip isn't
> splittable.)
>
> I also noticed that it states it is "compatible with" gzip. I tried a
> basic test of running bgzip on a file, which produced a .gz output file,
> and then running the same spark.read.text code sample from earlier. Sure
> enough, I was able to read the data. This implies there is at least some
> basic compatibility, so that you could read files created by bgzip.
> However, that read would not be optimized in any way to take advantage of
> an index file. There also would not be any way to produce bgzip-style
> output like in the df.write.option code sample. To achieve either of those,
> it would require writing a custom Hadoop compression codec to integrate
> more closely with the data format.
>
> Chris Nauroth
>
>
> On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   Thanks for the response, but I mean compressed with bgzip
>> , not bzip2.
>>
>>  Best, Oliver
>>
>> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth  wrote:
>>
>>> Hello Oliver,
>>>
>>> Yes, Spark makes this possible using the Hadoop compression codecs and
>>> the Hadoop-compatible FileSystem interface [1]. Here is an example of
>>> reading:
>>>
>>> df = spark.read.text("gs://<bucket>/data/shakespeare-bz2")
>>> df.show(10)
>>>
>>> This is using a test data set of the complete works of Shakespeare,
>>> stored as text and compressed to a single .bz2 file. This code sample
>>> didn't need to do anything special to declare that it's working with bzip2
>>> compression, because the Hadoop compression codecs detect that the file has
>>> a .bz2 extension and automatically assume it needs to be decompressed
>>> before presenting it to our code in the DataFrame as text.
>>>
>>> On the write side, if you wanted to declare a particular kind of output
>>> compression, you can do it with a write option like this:
>>>
>>> df.write.option("compression",
>>> "org.apache.hadoop.io.compress.BZip2Codec").text("gs://>> bucket>/data/shakespeare-bz2-copy")
>>>
>>> This writes the contents of the DataFrame, stored as text and compressed
>>> to .bz2 files in the destination path.
>>>
>>> My example is testing with a GCS bucket (scheme "gs:"), but you can also
>>> switch the Hadoop file system interface to target other file systems like
>>> S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the
>>> S3AFileSystem, including how to pass credentials for access to the S3
>>> bucket [2].
>>>
>>> Note that for big data use cases, other compression codecs like Snappy
>>> are generally preferred for greater efficiency. (Of course, we're not
>>> always in complete control of the data formats we're given, so the support
>>> for bz2 is there.)
>>>
>>> [1]
>>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>>> [2]
>>> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>>
>>> Chris Nauroth
>>>
>>>
>>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
>>> oliv...@broadinstitute.org> wrote:
>>>

  Hello,

   Is it possible to read/write a DataFrame from/to a set of bgzipped
 files? Can it read from/write to AWS S3? Thanks!

  Best, Oliver

 --
 Oliver Ruebenacker, Ph.D. (he)
 Senior Software Engineer, Knowledge Portal Network ,
 Flannick Lab , Broad Institute
 

>>>
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network , 
>> Flannick
>> Lab , Broad Institute
>> 
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
, Flannick
Lab , Broad Institute



Re: [PySpark] Reader/Writer for bgzipped data

2022-12-05 Thread Chris Nauroth
Sorry, I misread that in the original email.

This is my first time looking at bgzip. I see from the documentation that
it is putting some additional framing around gzip and producing a series of
small blocks, such that you can create an index of the file and decompress
individual blocks instead of the whole file. That's interesting, because it
could potentially support a splittable format. (Plain gzip isn't
splittable.)

I also noticed that it states it is "compatible with" gzip. I tried a basic
test of running bgzip on a file, which produced a .gz output file, and then
running the same spark.read.text code sample from earlier. Sure enough, I
was able to read the data. This implies there is at least some basic
compatibility, so that you could read files created by bgzip. However, that
read would not be optimized in any way to take advantage of an index file.
There also would not be any way to produce bgzip-style output like in the
df.write.option code sample. To achieve either of those, it would require
writing a custom Hadoop compression codec to integrate more closely with
the data format.
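
That "compatible with gzip" behavior is also easy to reproduce locally with
Python's standard gzip module (the file name is a placeholder for bgzip output):

import gzip

# bgzip output is valid gzip, so the standard library can decompress it,
# just without any block-level random access.
with gzip.open("sample.txt.gz", mode="rt") as f:
    for _ in range(10):
        line = f.readline()
        if not line:
            break
        print(line.rstrip())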

Chris Nauroth


On Mon, Dec 5, 2022 at 2:08 PM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   Thanks for the response, but I mean compressed with bgzip
> , not bzip2.
>
>  Best, Oliver
>
> On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth  wrote:
>
>> Hello Oliver,
>>
>> Yes, Spark makes this possible using the Hadoop compression codecs and
>> the Hadoop-compatible FileSystem interface [1]. Here is an example of
>> reading:
>>
>> df = spark.read.text("gs://<bucket>/data/shakespeare-bz2")
>> df.show(10)
>>
>> This is using a test data set of the complete works of Shakespeare,
>> stored as text and compressed to a single .bz2 file. This code sample
>> didn't need to do anything special to declare that it's working with bzip2
>> compression, because the Hadoop compression codecs detect that the file has
>> a .bz2 extension and automatically assume it needs to be decompressed
>> before presenting it to our code in the DataFrame as text.
>>
>> On the write side, if you wanted to declare a particular kind of output
>> compression, you can do it with a write option like this:
>>
>> df.write.option("compression",
>> "org.apache.hadoop.io.compress.BZip2Codec").text("gs://> bucket>/data/shakespeare-bz2-copy")
>>
>> This writes the contents of the DataFrame, stored as text and compressed
>> to .bz2 files in the destination path.
>>
>> My example is testing with a GCS bucket (scheme "gs:"), but you can also
>> switch the Hadoop file system interface to target other file systems like
>> S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the
>> S3AFileSystem, including how to pass credentials for access to the S3
>> bucket [2].
>>
>> Note that for big data use cases, other compression codecs like Snappy
>> are generally preferred for greater efficiency. (Of course, we're not
>> always in complete control of the data formats we're given, so the support
>> for bz2 is there.)
>>
>> [1]
>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
>> [2]
>> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>>
>> Chris Nauroth
>>
>>
>> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>>
>>>  Hello,
>>>
>>>   Is it possible to read/write a DataFrame from/to a set of bgzipped
>>> files? Can it read from/write to AWS S3? Thanks!
>>>
>>>  Best, Oliver
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network , 
>>> Flannick
>>> Lab , Broad Institute
>>> 
>>>
>>
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>


Re: [PySpark] Reader/Writer for bgzipped data

2022-12-05 Thread Oliver Ruebenacker
 Hello,

  Thanks for the response, but I mean compressed with bgzip
, not bzip2.

 Best, Oliver

On Fri, Dec 2, 2022 at 4:44 PM Chris Nauroth  wrote:

> Hello Oliver,
>
> Yes, Spark makes this possible using the Hadoop compression codecs and the
> Hadoop-compatible FileSystem interface [1]. Here is an example of reading:
>
> df = spark.read.text("gs://<bucket>/data/shakespeare-bz2")
> df.show(10)
>
> This is using a test data set of the complete works of Shakespeare, stored
> as text and compressed to a single .bz2 file. This code sample didn't need
> to do anything special to declare that it's working with bzip2 compression,
> because the Hadoop compression codecs detect that the file has a .bz2
> extension and automatically assume it needs to be decompressed before
> presenting it to our code in the DataFrame as text.
>
> On the write side, if you wanted to declare a particular kind of output
> compression, you can do it with a write option like this:
>
> df.write.option("compression",
> "org.apache.hadoop.io.compress.BZip2Codec").text("gs:// bucket>/data/shakespeare-bz2-copy")
>
> This writes the contents of the DataFrame, stored as text and compressed
> to .bz2 files in the destination path.
>
> My example is testing with a GCS bucket (scheme "gs:"), but you can also
> switch the Hadoop file system interface to target other file systems like
> S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the
> S3AFileSystem, including how to pass credentials for access to the S3
> bucket [2].
>
> Note that for big data use cases, other compression codecs like Snappy are
> generally preferred for greater efficiency. (Of course, we're not always in
> complete control of the data formats we're given, so the support for bz2 is
> there.)
>
> [1]
> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
> [2]
> https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
>
> Chris Nauroth
>
>
> On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>  Hello,
>>
>>   Is it possible to read/write a DataFrame from/to a set of bgzipped
>> files? Can it read from/write to AWS S3? Thanks!
>>
>>  Best, Oliver
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network , 
>> Flannick
>> Lab , Broad Institute
>> 
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network
, Flannick
Lab , Broad Institute



Re: [PySpark] Reader/Writer for bgzipped data

2022-12-02 Thread Chris Nauroth
Hello Oliver,

Yes, Spark makes this possible using the Hadoop compression codecs and the
Hadoop-compatible FileSystem interface [1]. Here is an example of reading:

df = spark.read.text("gs://<bucket>/data/shakespeare-bz2")
df.show(10)

This is using a test data set of the complete works of Shakespeare, stored
as text and compressed to a single .bz2 file. This code sample didn't need
to do anything special to declare that it's working with bzip2 compression,
because the Hadoop compression codecs detect that the file has a .bz2
extension and automatically assume it needs to be decompressed before
presenting it to our code in the DataFrame as text.

On the write side, if you wanted to declare a particular kind of output
compression, you can do it with a write option like this:

df.write.option("compression",
"org.apache.hadoop.io.compress.BZip2Codec").text("gs:///data/shakespeare-bz2-copy")

This writes the contents of the DataFrame, stored as text and compressed to
.bz2 files in the destination path.

My example is testing with a GCS bucket (scheme "gs:"), but you can also
switch the Hadoop file system interface to target other file systems like
S3 (scheme "s3a:"). Hadoop maintains documentation on how to configure the
S3AFileSystem, including how to pass credentials for access to the S3
bucket [2].
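
As a rough sketch, the same read against S3 could look like this from PySpark
(the key names follow the hadoop-aws documentation in [2]; the values and path
are placeholders, and an instance profile or credential provider chain is
usually preferable to inline keys):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-read-example")
    # S3A credential settings, as described in [2].
    .config("spark.hadoop.fs.s3a.access.key", "PLACEHOLDER_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "PLACEHOLDER_SECRET_KEY")
    .getOrCreate()
)

df = spark.read.text("s3a://my-bucket/data/shakespeare-bz2")
df.show(10)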

Note that for big data use cases, other compression codecs like Snappy are
generally preferred for greater efficiency. (Of course, we're not always in
complete control of the data formats we're given, so the support for bz2 is
there.)
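
For example, with a columnar format like Parquet, a Snappy-compressed copy of
the same DataFrame would just be (path is a placeholder; Snappy is also
Parquet's default codec):

df.write.option("compression", "snappy").parquet("gs://<bucket>/data/shakespeare-parquet")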

[1]
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html
[2]
https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html

Chris Nauroth


On Fri, Dec 2, 2022 at 11:32 AM Oliver Ruebenacker <
oliv...@broadinstitute.org> wrote:

>
>  Hello,
>
>   Is it possible to read/write a DataFrame from/to a set of bgzipped
> files? Can it read from/write to AWS S3? Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker, Ph.D. (he)
> Senior Software Engineer, Knowledge Portal Network , 
> Flannick
> Lab , Broad Institute
> 
>