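I'll reiterate my previous comment. The problem with many compression algorithms is that they are adaptive: they constantly change the "dictionary" based on what was previously seen and on the current "window". As a result, a verse in the middle of the stream cannot be decompressed without first processing the data that precedes it.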

To compress so that decompression can work one verse at a time, use two-pass compression. The first pass analyzes the whole input to determine the dictionary; the second pass uses that fixed dictionary to compress the input.
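
A minimal sketch of the two-pass idea, assuming whole words as dictionary entries and whitespace tokenization (both are assumptions for illustration, as are the function names; this is not SWORD code). Pass one ranks the words by frequency; pass two replaces each word with its rank, so each verse can be decoded on its own given the shared word list.

```python
from collections import Counter

def build_dictionary(verses):
    """Pass 1: scan every verse and rank the words by frequency."""
    counts = Counter(word for verse in verses for word in verse.split())
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    return sorted(counts, key=lambda w: (-counts[w], w))

def compress_verse(verse, word_to_id):
    """Pass 2: replace each word with its id from the fixed dictionary."""
    return [word_to_id[w] for w in verse.split()]

def decompress_verse(ids, words):
    """Decode one verse; needs only the shared word list, not other verses."""
    return " ".join(words[i] for i in ids)

verses = [
    "In the beginning God created the heaven and the earth",
    "And the earth was without form and void",
]
words = build_dictionary(verses)
word_to_id = {w: i for i, w in enumerate(words)}
compressed = [compress_verse(v, word_to_id) for v in verses]

# The second verse is decoded without touching the first.
second = decompress_verse(compressed[1], words)
```

Because the dictionary never changes after pass one, the ids in one verse mean the same thing everywhere. Note that this sketch normalizes runs of whitespace to single spaces.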

Decompression can then start at any byte sequence. It may be necessary to skip a few bits to synchronize, but synchronizing is easy. Once synchronized, a byte sequence can be matched with one of the better literal string-matching algorithms.


On May 19, 2006, at 1:59 AM, David Cary wrote:

Dear SWORD developers,

From: L.Allan-pbio
...
I can think of several reasons for rawtext (non-compressed):
...
2. Search speed can be significantly faster. ...

That may be true for zText. However, other compression formats are
faster to search than plain text.

3. It is easier to debug/examine a module. You can use a text editor ...

I think this is the overwhelming reason in favor of plain text.
http://c2.com/cgi/wiki?PowerOfPlainText
has convinced me to stick with plain text format (and plain-text-like
formats, such as HTML) if at all possible.

From: L.Allan-pbio
...
I defined a sourceforge project BibleDb that would
be optimized for Bible decompression/decryption/search speed (not
necessarily for compression ratio).
...
BibleDB is only in pre-alpha stage.
http://sourceforge.net/project/admin/?group_id=117234

Interesting. I will look at this soon.

Perhaps we can apply some of the ideas from this article:

"Compression: A Key for Next-Generation Text Retrieval Systems"
by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo
Baeza-Yates
in
_Computer_ magazine November 2000

Their decompressor takes 1, 2, or 3 whole bytes of compressed data
and decompresses (using a vocabulary list) into a whole word. This
makes many kinds of searches *much* faster. One can directly search the
compressed text for words or phrases, which turns out to be faster
than searching uncompressed text.

(Rather than *uncompressing* the entire Bible, and comparing the
uncompressed Bible to the search string, we can *compress* just the
search string, then compare the compressed Bible directly to the
compressed search string).
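
A rough sketch of searching the compressed text directly. The byte code below is a simplified stand-in for the article's scheme, not a reproduction of it: each word becomes 1, 2, or 3 bytes, and only the first byte of a codeword has its high bit set. The names encode_rank, build_codes, compress, and find_phrase are hypothetical.

```python
from collections import Counter

def encode_rank(rank):
    """Encode a word rank as 1-3 bytes; only the first byte has its high bit set."""
    if rank < 1 << 7:
        return bytes([0x80 | rank])
    if rank < 1 << 14:
        return bytes([0x80 | (rank >> 7), rank & 0x7F])
    if rank < 1 << 21:
        return bytes([0x80 | (rank >> 14), (rank >> 7) & 0x7F, rank & 0x7F])
    raise ValueError("vocabulary too large for 3-byte codes")

def build_codes(text):
    """Give the most frequent words the shortest codewords."""
    counts = Counter(text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: encode_rank(i) for i, w in enumerate(ranked)}

def compress(text, codes):
    return b"".join(codes[w] for w in text.split())

def find_phrase(compressed, phrase, codes):
    """Return byte offsets in `compressed` where `phrase` occurs as whole words."""
    try:
        needle = compress(phrase, codes)
    except KeyError:
        return []  # a word missing from the vocabulary cannot occur in the text
    if not needle:
        return []
    hits = []
    pos = compressed.find(needle)
    while pos != -1:
        end = pos + len(needle)
        # Reject a match that stops in the middle of a longer codeword.
        if end == len(compressed) or compressed[end] & 0x80:
            hits.append(pos)
        pos = compressed.find(needle, pos + 1)
    return hits

text = "in the beginning was the word and the word was with god"
codes = build_codes(text)
data = compress(text, codes)
print(find_phrase(data, "the word", codes))  # [4, 7]
```

The needle always begins with a high-bit byte, so a match can only start on a codeword boundary; the check after the match handles the other end. The text itself is never decompressed during the search.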

The article also has lots of other ideas about compressing indexes and
approximate-match searching.

From: L.Allan-pbio
My limited experience is that if you don't have a large block of data
(book), then the compression ratio isn't very good.

That's very true. But I hope you can see that:
* Ziviani's technique *does* have a large block of data, so
potentially the compression ratio can be good. To give the best
compression, the compressor scans the entire Bible (in order to pick
out the most-common words and give them one-byte representations).
* Ziviani's technique lets you point to any word in the text with a
normal (byte) pointer and start decompressing immediately from that
point. The decompressor can decompress a single verse -- it doesn't
need to start at the first verse. (The decompressor needs more
information than just the compressed version of the verse -- it also
needs the global wordlist generated by the compressor).
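
The two points above can be sketched as follows, under stated assumptions: the same simplified 1-3 byte code as a stand-in for the article's coding (high bit set only on the first byte of a codeword), a frequency-ranked global wordlist, and a hypothetical verse_index holding each verse's byte offset and word count. None of these names come from SWORD or from the article.

```python
from collections import Counter

def encode_rank(rank):
    """Encode a word rank as 1-3 bytes; only the first byte has its high bit set."""
    if rank < 1 << 7:
        return bytes([0x80 | rank])
    if rank < 1 << 14:
        return bytes([0x80 | (rank >> 7), rank & 0x7F])
    if rank < 1 << 21:
        return bytes([0x80 | (rank >> 14), (rank >> 7) & 0x7F, rank & 0x7F])
    raise ValueError("vocabulary too large for 3-byte codes")

def sync(data, pos):
    """Advance pos to the next codeword start (a byte with its high bit set)."""
    while pos < len(data) and not data[pos] & 0x80:
        pos += 1
    return pos

def decode_words(data, start, count, words):
    """Decode up to `count` words beginning at or after byte offset `start`."""
    pos = sync(data, start)
    out = []
    while pos < len(data) and len(out) < count:
        rank = data[pos] & 0x7F
        pos += 1
        # Continuation bytes (high bit clear) each add 7 more bits.
        while pos < len(data) and not data[pos] & 0x80:
            rank = (rank << 7) | data[pos]
            pos += 1
        out.append(words[rank])
    return out

verses = [
    "In the beginning God created the heaven and the earth",
    "And the earth was without form and void",
]
# Compressor side: the global wordlist comes from scanning the whole text.
counts = Counter(w for v in verses for w in v.split())
words = sorted(counts, key=lambda w: (-counts[w], w))
rank_of = {w: i for i, w in enumerate(words)}

stream = bytearray()
verse_index = []  # (byte offset, word count) per verse
for v in verses:
    tokens = v.split()
    verse_index.append((len(stream), len(tokens)))
    for w in tokens:
        stream += encode_rank(rank_of[w])
data = bytes(stream)

# Decompressor side: jump straight to the second verse.
offset, count = verse_index[1]
second = " ".join(decode_words(data, offset, count, words))
```

The decompressor needs only the compressed bytes, the wordlist, and a starting offset. If the offset lands inside a codeword, sync skips forward to the next word boundary.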

I am interested in other ways of decompressing just a verse or so,
without needing to decompress everything from the beginning (and which
still gives adequate compression).

--
David Cary
http://theconnexion.net/compass/index.php/User:DavidCary
http://groups.google.com/groups/search?q=%22Compressing+the+Bible+for+a+PDA%22

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
