removing the header from a gzip'd string

2006-12-21 Thread Rajarshi
Hi, I have some code that takes a string and obtains a compressed
version using zlib.compress

Does anybody know how I can remove the header portion of the compressed
bytes, such that I only have the compressed data remaining? (Obviously
I do not intend to perform the decompression!)

Thanks,

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: removing the header from a gzip'd string

2006-12-21 Thread Fredrik Lundh
Rajarshi wrote:

> Hi, I have some code that takes a string and obtains a compressed
> version using zlib.compress
> 
> Does anybody know how I can remove the header portion of the compressed
> bytes, such that I only have the compressed data remaining?

what makes you think there's a "header portion" in the data you get
from zlib.compress ?  it's just a continuous stream of bits, all of 
which are needed by the decoder.

 > (Obviously I do not intend to perform the decompression!)

oh.  in that case, this should be good enough:

 data[random.randint(0,len(data)):]





Re: removing the header from a gzip'd string

2006-12-21 Thread Bjoern Schliessmann
Rajarshi wrote:

> Does anybody know how I can remove the header portion of the
> compressed bytes, such that I only have the compressed data
> remaining? (Obviously I do not intend to perform the
> decompression!)

Just curious: What's your goal? :) A home made hash function?

Regards,


Björn

-- 
BOFH excuse #80:

That's a great computer you have there; have you considered how it
would work as a BSD machine?



Re: removing the header from a gzip'd string

2006-12-21 Thread Gabriel Genellina

On Thursday 21/12/2006 at 18:32, Fredrik Lundh wrote:

> > Hi, I have some code that takes a string and obtains a compressed
> > version using zlib.compress
> >
> > Does anybody know how I can remove the header portion of the compressed
> > bytes, such that I only have the compressed data remaining?
>
> what makes you think there's a "header portion" in the data you get
> from zlib.compress ?  it's just a continuous stream of bits, all of
> which are needed by the decoder.


No. The first 2 bytes (or more if using a preset dictionary) are 
header information. The last 4 bytes are for checksum. In-between 
lies the encoded bit stream.
Using the default options ("deflate", default compression level, no 
custom dictionary) will make those first two bytes 0x78 0x9c.
If you want to encrypt a compressed text, you must remove redundant 
information first. Knowing part of the clear message is a security 
hole. Using a structured container (like a zip/rar/... file) makes it 
worse because the fixed (or "guessable") part is longer, but anyway, 
2 bytes may be bad enough.

See RFC 1950.
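
The layout described above (2-byte header, deflate bit stream, 4-byte checksum) can be checked with a short sketch. The helper name split_zlib_stream is hypothetical, not part of the zlib module, and the sketch assumes no preset dictionary is in use, so the header is exactly two bytes.

```python
import zlib


def split_zlib_stream(stream: bytes):
    """Split a zlib (RFC 1950) stream into (header, deflate_body, trailer).

    Hypothetical helper; assumes no preset dictionary (FDICT bit clear).
    """
    cmf, flg = stream[0], stream[1]
    # RFC 1950: CMF and FLG, read as a 16-bit big-endian number,
    # must be a multiple of 31.
    if (cmf * 256 + flg) % 31 != 0:
        raise ValueError("not a valid zlib header")
    # Bit 5 of FLG signals a preset dictionary, which lengthens the header.
    if flg & 0x20:
        raise ValueError("preset dictionary in use; header is longer")
    return stream[:2], stream[2:-4], stream[-4:]


text = b"an example string to compress"
header, body, trailer = split_zlib_stream(zlib.compress(text))

# The two header bytes, in hex.
print(header.hex())
# The trailer is the Adler-32 checksum of the uncompressed data, big-endian.
print(int.from_bytes(trailer, "big") == zlib.adler32(text))
# The middle part is a raw deflate stream; negative wbits tells zlib
# to expect no header and no checksum.
print(zlib.decompress(body, -zlib.MAX_WBITS) == text)
```

The same raw stream can be produced directly, without slicing, by passing a negative wbits value to zlib.compressobj.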


--
Gabriel Genellina
Softlab SRL 








Re: removing the header from a gzip'd string

2006-12-21 Thread Fredrik Lundh
Gabriel Genellina wrote:

> Using the default options ("deflate", default compression level, no 
> custom dictionary) will make those first two bytes 0x78 0x9c.
 >
 > If you want to encrypt a compressed text, you must remove redundant
 > information first.

encryption?  didn't the OP say that he *didn't* plan to decompress the 
resulting data stream?

 > Knowing part of the clear message is a security hole.

well, knowing the algorithm used to convert from the original clear
text to the text that's actually encrypted also gives an attacker
plenty of clues (especially if the original is regular in some way,
such as "always an XML file" or "always a record having this format"). 
sounds to me like trying to address this potential hole by stripping
off 16 bits from the payload won't really solve that problem...





Re: removing the header from a gzip'd string

2006-12-22 Thread vasudevram

Fredrik Lundh wrote:
> Gabriel Genellina wrote:
>
> > Using the default options ("deflate", default compression level, no
> > custom dictionary) will make those first two bytes 0x78 0x9c.
>  >
>  > If you want to encrypt a compressed text, you must remove redundant
>  > information first.
>
> encryption?  didn't the OP say that he *didn't* plan to decompress the
> resulting data stream?
>
>  > Knowing part of the clear message is a security hole.
>
> well, knowing the algorithm used to convert from the original clear
> text to the text that's actually encrypted also gives an attacker
> plenty of clues (especially if the original is regular in some way,
> such as "always an XML file" or "always a record having this format").
> sounds to me like trying to address this potential hole by stripping
> off 16 bits from the payload won't really solve that problem...
>
> 

Yes, I'm also interested to know why the OP wants to remove the header.

Though I'm not an expert on the zip format, my understanding is that
most binary formats are not of much use in pieces (though some
composite formats might be, e.g. you might be able to meaningfully
extract a piece, such as an image embedded in a Word file). I somehow
don't think a compressed zip file would be of use in pieces (except
possibly for the header itself). But I could be wrong of course.

Vasudev Ram
http://www.dancingbison.com



Re: removing the header from a gzip'd string

2006-12-22 Thread Gabriel Genellina
Fredrik Lundh wrote:
> Gabriel Genellina wrote:
>
>  > If you want to encrypt a compressed text, you must remove redundant
>  > information first.
>
> encryption?  didn't the OP say that he *didn't* plan to decompress the
> resulting data stream?
I was trying to imagine any motivation for asking that question. And I
considered the second part as "I'm not the guy who will reconstruct the
original data". But I'm still intrigued by the actual use case...

-- 
Gabriel Genellina



Re: removing the header from a gzip'd string

2006-12-23 Thread debarchana . ghosh

Bjoern Schliessmann wrote:
> Rajarshi wrote:
>
> > Does anybody know how I can remove the header portion of the
> > compressed bytes, such that I only have the compressed data
> > remaining? (Obviously I do not intend to perform the
> > decompression!)
>
> Just curious: What's your goal? :) A home made hash function?

Actually I was implementing the use of the normalized compression
distance to evaluate molecular similarity as described in an article in
J.Chem.Inf.Model (http://dx.doi.org/10.1021/ci600384z, subscriber
access only, unfortunately).

Essentially, they note that the NCD does not always behave like a
metric and one reason they put forward is that this may be due to the
size of the header portion (they were using the command line gzip and
bzip2 programs) compared to the strings being compressed (which are on
average 48 bytes long).

So I was interested to see if the NCD behaved like a metric if I
removed everything that was not the compressed string. And since I only
need to calculate similarity between two strings, I do not need to do
any decompression.
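
For reference, a minimal sketch of that idea. It assumes the commonly cited definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length; the article may use a variant. The function names deflate_size and ncd are illustrative, and the two strings are arbitrary examples rather than data from the paper.

```python
import zlib


def deflate_size(data: bytes, level: int = 9) -> int:
    """Length of the raw deflate stream for data (no header, no checksum)."""
    # Negative wbits asks zlib for a raw deflate stream.
    comp = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return len(comp.compress(data) + comp.flush())


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, using raw deflate as the compressor."""
    cx, cy = deflate_size(x), deflate_size(y)
    cxy = deflate_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Arbitrary example strings (SMILES-like), not taken from the article.
a = b"CC(=O)OC1=CC=CC=C1C(=O)O"
b = b"NCCc1ccc(O)c(O)c1"

print(ncd(a, a))
print(ncd(a, b))
```

Only compressed lengths are used, so nothing ever needs to be decompressed.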



Re: removing the header from a gzip'd string

2006-12-23 Thread Bjoern Schliessmann
[EMAIL PROTECTED] wrote:

> Actually I was implementing the use of the normalized compression
> distance to evaluate molecular similarity as described in an
> article in J.Chem.Inf.Model (http://dx.doi.org/10.1021/ci600384z,
> subscriber access only, unfortunately).

Interesting. Thanks for the reply.

Regards,


Björn

-- 
BOFH excuse #438:

sticky bit has come loose



Re: removing the header from a gzip'd string

2006-12-24 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote:

> Essentially, they note that the NCD does not always behave like a
> metric and one reason they put forward is that this may be due to the
> size of the header portion (they were using the command line gzip and
> bzip2 programs) compared to the strings being compressed (which are on
> average 48 bytes long).

gzip datastreams have a real header, with a file type identifier, 
optional filenames, comments, and a bunch of flags.

but even if you strip that off (which is basically what happens if you 
use zlib.compress instead of gzip), I doubt you'll get representative 
"compressibility" metrics on strings that short.  like most other 
compression algorithms, these are designed for much larger 
datasets.
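
A rough way to see how much of the output is container overhead for a string of about that length is sketched below. The function name container_sizes and the sample string are made up for illustration. Per RFC 1952 a gzip member carries at least a 10-byte header and an 8-byte trailer, against 2 + 4 bytes for zlib; the command line gzip program may add more, such as the file name.

```python
import gzip
import zlib


def container_sizes(data: bytes, level: int = 6) -> dict:
    """Compressed size of data as raw deflate, zlib and gzip streams."""
    raw_obj = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    raw = raw_obj.compress(data) + raw_obj.flush()
    return {
        "input": len(data),
        "raw deflate": len(raw),
        "zlib": len(zlib.compress(data, level)),
        "gzip": len(gzip.compress(data, compresslevel=level)),
    }


# A made-up 48-byte string, matching the average length mentioned above.
sample = b"CC(C)Cc1ccc(cc1)C(C)C(=O)O.CC(=O)Nc1ccc(O)cc1.OK"
for name, size in container_sizes(sample).items():
    print(name, size)
```

On inputs this short the raw deflate stream may be barely smaller than the input, which is the point made above about such strings.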


