Detecting repeated sub-strings, in strings

Theodore H. Smith Sun, 06 May 2007 14:50:27 -0700

Does anyone know anything about detecting repeated sub-strings in
strings? If you want to reply off the RB list that's fine, but my
ElfData list is a good place to reply.


I've finally settled on a good decompressor design. I'm writing some
compression code for my plugin, you see.

It's pretty cool, actually! I've got quite a few optimisations over
the standard "deflate" algorithm. Mine, for example, can even compress
strings of 2 bytes down to 1 byte. Deflate can only compress strings
of 4 bytes minimum, and even then it's harder to pull off, as there
is an increased overhead from breaking the last escaped block and then
having to write another escaped block header. Mine has a minimum
1-byte length for an escaped block header, whereas deflate has a
2-byte escaped block header.

But my escaped blocks are variable, due to some clever carry and
multiply code. Anyhow, it's a far more sophisticated decompressor,
yet it's written in only 1 page of code (if you include the inlined
functions), and it's safe and fast. It can't be hacked to crash by
reading unallocated RAM, for example. So it's good for apps which
need security; we wouldn't want someone to "decompress" variables off
the stack now, would we :). Of course ElfData must do everything
correctly, so that's why I paid attention to details like this.
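
To illustrate the kind of check I mean, here is a minimal sketch in
Python of a bounds-checked back-reference copy, as found in generic
LZ77-style decoders. It is not ElfData's actual format or code; the
names copy_match, distance, length and limit are made up for the
example.

```python
def copy_match(out: bytearray, distance: int, length: int, limit: int) -> None:
    """Append `length` bytes copied from `distance` bytes back in `out`.

    Hypothetical helper: rejects any reference that would read before the
    start of the output or grow the output past `limit` bytes.
    """
    if distance < 1 or distance > len(out):
        raise ValueError("back-reference points outside the decoded data")
    if length < 0 or len(out) + length > limit:
        raise ValueError("match would exceed the output limit")
    start = len(out) - distance
    # Copy one byte at a time so overlapping matches (distance < length)
    # repeat the most recent bytes, as LZ77-style formats expect.
    for i in range(length):
        out.append(out[start + i])


buf = bytearray(b"ab")
copy_match(buf, distance=2, length=4, limit=16)
print(bytes(buf))
```

The point is that both checks happen before any byte is read, so a
malformed input is rejected instead of being "decompressed" from
memory the decoder never wrote.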

I'm going to add entropy encoding, later, as a final stage, to  
compress things further. And experiment with BOCU compression for  
UTF-8, and byte-math-difference based code (similar idea to BOCU) for  
non-UTF-8.

This basically means I can achieve tighter compression due to a
tighter syntax, and also compress patterns that other compressors
would have to ignore, increasing the compression even more!

Well, that's all very well, but it's only the decompressor :) I
haven't written the compressor yet.

I don't know anything about pattern recognition theory. In theory, I
could just dump all the strings found into an ElfDataDictionary...
but that would be a lot of strings. And I'm not so sure how it would
find suffixes. Dumping everything from a sliding window of 1-257
bytes into an ElfDataDictionary would be ridiculously slow and
require huge amounts of RAM, making it basically like a PPM
compressor. It would take about 66,000 sets/gets of dictionary keys
per byte! So I can't do that.
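
To make the brute-force idea concrete, here is a rough sketch in
Python. A plain dict stands in for an ElfDataDictionary (that is an
assumption; this is not ElfData code), and the function names are made
up for the example. It counts dictionary operations so the cost is
visible; the exact per-byte figure depends on how the dictionary is
used, so treat the counter as illustrative only.

```python
def index_all_substrings(data: bytes, max_len: int = 257):
    """Record every substring of length 1..max_len with its start positions.

    Returns the table and the number of dictionary operations performed.
    """
    table = {}
    operations = 0
    for start in range(len(data)):
        longest = min(max_len, len(data) - start)
        for length in range(1, longest + 1):
            key = data[start:start + length]
            table.setdefault(key, []).append(start)
            operations += 1
    return table, operations


def repeated(table, min_len: int = 2):
    """Keep only substrings of at least min_len bytes seen more than once."""
    return {k: v for k, v in table.items() if len(k) >= min_len and len(v) > 1}


table, operations = index_all_substrings(b"abcabc")
print(operations)
print(repeated(table))
```

Every start position inserts up to max_len keys, and each key is a
fresh copy of the bytes, which is where both the time and the RAM go.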

Obviously, there has to be a faster, more practical way, but I'm not
familiar with it yet. I'm sure I can see (see, meaning invent) some
good optimisations over the standard common practice, though :)

Any tips or pointers on recognising repeating sub-strings are much  
appreciated!

--
http://elfdata.com/plugin/
"String processing, done right"

