Package: wnpp
Severity: wishlist

* Package name    : libcgrep
  Version         : x.y.z
  Upstream Author : Paolo Ferragina <[EMAIL PROTECTED]>
* URL             : http://roquefort.di.unipi.it/~ferrax/CompressedSearch/
* License         : GPL
  Description     : Search through compressed files (fast)

Summary:
- CGrep is about four times faster than ZGrep, and it additionally supports fast, advanced IR searches that are completely absent from the grep family of tools.

Preconditions:
- doxygen
- agrep

The following programs are included:
- huffw: (de)compresses a textual file according to the Huffword paradigm
- cgrep: searches for complex patterns over Huffword-compressed files

The following libraries are included:

Huffword Library:
This library offers algorithms and data structures for the efficient (de)compression of textual files. The compression paradigm is a variant of the Huffman algorithm, adapted here to be byte-aligned, tagged, and word-oriented. More precisely, the tokens are alphanumeric words, and the codewords are byte sequences with tag bits that guarantee synchronization. The tagging is obtained as follows: the Huffman tree has fan-out 128, and each 7-bit arc label is stored in the least significant bits of a byte; the codeword is then completed by setting the first bit of its first byte to 1 and the first bit of every other byte to 0 (a minimal sketch of this scheme is appended at the end of this report). This compression paradigm can be improved by removing every single space that occurs between two alphanumeric tokens; this variant is hereafter called the Spaceless Model.

CGrep Library:
This library offers all the pattern-matching functionality needed to search over Huffword-compressed files. The search consists of three steps: first, the complex pattern query (possibly involving errors, regular expressions, ...) is resolved against the dictionary of tokens built at compression time by the Huffword algorithm; the tokens that satisfy the query are then encoded, and their codewords are searched for directly in the compressed file (note that this is an exact multiple-pattern search; a sketch of these steps is also appended below). This approach yields a twofold speedup: the search cost is proportional to the length of the compressed file, since no decompression at all is required; and the search cost for complicated patterns (regular expressions, errors, ...) depends on the dictionary size, which is small compared to the entire file. The features we have implemented in addition to the original paper are: proximity queries, an arbitrary number of query patterns (each with its own search options), and snippet extraction.

-- System Information:
Debian Release: 3.1
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: i386 (i686)
Kernel: Linux 2.6.9-1-686
Locale: LANG=C, LC_CTYPE=C (charmap=ISO-8859-1) (ignored: LC_ALL set to en_US)
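
To make the tagging scheme above concrete, here is a minimal C++ sketch. It only illustrates the paradigm as described, not libcgrep's actual API; the function names (tag_codeword, next_codeword_start) and the example arc labels are hypothetical.

// Minimal sketch (hypothetical names, not the libcgrep API) of the
// tagged, byte-aligned codewords described above: a codeword is the
// sequence of 7-bit arc labels on a root-to-leaf path of the
// fan-out-128 Huffman tree, with the high bit of the first byte set
// to 1 and the high bit of every other byte set to 0.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Turn root-to-leaf arc labels (each < 128) into a tagged codeword.
std::vector<uint8_t> tag_codeword(const std::vector<uint8_t>& arc_labels) {
    std::vector<uint8_t> cw;
    for (size_t i = 0; i < arc_labels.size(); ++i) {
        uint8_t b = arc_labels[i] & 0x7F;  // 7-bit arc label, high bit 0
        if (i == 0) b |= 0x80;             // tag: marks a codeword start
        cw.push_back(b);
    }
    return cw;
}

// The tag bits give synchronization: from any byte offset, the next
// codeword boundary is simply the next byte whose high bit is set.
size_t next_codeword_start(const std::vector<uint8_t>& stream, size_t pos) {
    while (pos < stream.size() && !(stream[pos] & 0x80)) ++pos;
    return pos;
}

int main() {
    // Hypothetical token whose root-to-leaf path is <5, 17>.
    std::vector<uint8_t> cw = tag_codeword({5, 17});
    std::printf("codeword: %02x %02x\n",
                (unsigned)cw[0], (unsigned)cw[1]);  // prints: 85 11
    // Starting inside the codeword, we skip the continuation byte.
    std::printf("resync from byte 1: %zu\n", next_codeword_start(cw, 1));
}

Because only the first byte of each codeword has its high bit set, a scanner landing at an arbitrary position can always resynchronize, and a byte-level match for a full codeword can never begin in the middle of another codeword.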
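
In the same spirit, here is a hypothetical sketch of the three-step search described above: the query is resolved against the small token dictionary only, the codewords of the matching tokens are collected, and those codewords are searched exactly in the compressed bytes. None of these names belong to the real libcgrep interface, and the toy substring query merely stands in for the complex queries (errors, regular expressions) the library supports.

// Hypothetical sketch, not the libcgrep API: three-step search over a
// Huffword-compressed stream.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Codeword = std::vector<uint8_t>;

// Steps 1-2: evaluate the (possibly complex) query on the dictionary
// alone and return the codewords of the tokens that satisfy it.
std::vector<Codeword> resolve_query(
        const std::map<std::string, Codeword>& dictionary,
        const std::function<bool(const std::string&)>& matches) {
    std::vector<Codeword> out;
    for (const auto& [token, cw] : dictionary)
        if (matches(token)) out.push_back(cw);
    return out;
}

// Step 3: exact multiple-pattern scan over the compressed bytes; the
// tag bits mean candidate positions are exactly the tagged bytes, so
// a match can never be misaligned with codeword boundaries.
std::vector<size_t> scan(const std::vector<uint8_t>& stream,
                         const std::vector<Codeword>& patterns) {
    std::vector<size_t> hits;
    for (size_t i = 0; i < stream.size(); ++i) {
        if (!(stream[i] & 0x80)) continue;  // continuation byte, skip
        for (const Codeword& p : patterns)
            if (i + p.size() <= stream.size() &&
                std::equal(p.begin(), p.end(), stream.begin() + i))
                hits.push_back(i);
    }
    return hits;
}

int main() {
    // Toy dictionary built "at compression time" (invented codewords).
    std::map<std::string, Codeword> dict = {{"grep", {0x85, 0x11}},
                                            {"zip", {0x90}}};
    // Toy stand-in for a complex query: tokens containing "rep".
    std::vector<Codeword> cws =
        resolve_query(dict, [](const std::string& t) {
            return t.find("rep") != std::string::npos;
        });
    std::vector<uint8_t> stream = {0x90, 0x85, 0x11, 0x90};
    for (size_t h : scan(stream, cws))
        std::printf("hit at byte %zu\n", h);  // prints: hit at byte 1
}

Note that the expensive, complex matching touches only the small dictionary, while the (much larger) compressed file sees nothing but an exact byte search and is never decompressed; this is exactly the twofold speedup claimed above.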