On approximately 3/25/2004 9:16 PM, came the following characters from
the keyboard of Randy W. Sims:
On 3/25/2004 11:50 PM, Glenn Linderman wrote:

For sorted lists of text, like dictionaries, one quick-to-decode technique that saves a fair amount of space is to start each string with the number of leading bytes it shares with the previous string, and then append the remainder of the string.

In other words, the list of words

though
thought
thoughtful

would reduce to

0though
6t
7ful
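
For illustration only, here is a minimal sketch of that encoding and its decoder, written in Python rather than Perl. The names front_encode and front_decode are hypothetical, not taken from any module mentioned in this thread. It assumes the words themselves never begin with a digit, since the count is written as plain decimal digits in front of the remainder, as in the example above.

```python
import os.path
import re

def front_encode(sorted_words):
    """Prefix each word with the number of leading characters it
    shares with the previous word, then keep only the remainder."""
    encoded = []
    previous = ""
    for word in sorted_words:
        # os.path.commonprefix compares character by character,
        # so it works on arbitrary strings, not just paths.
        shared = len(os.path.commonprefix([previous, word]))
        encoded.append(f"{shared}{word[shared:]}")
        previous = word
    return encoded

def front_decode(encoded):
    """Rebuild the original words from the encoded entries.
    Assumption: no word starts with a digit."""
    decoded = []
    previous = ""
    for entry in encoded:
        match = re.match(r"(\d+)(.*)", entry, re.DOTALL)
        shared = int(match.group(1))
        word = previous[:shared] + match.group(2)
        decoded.append(word)
        previous = word
    return decoded

words = ["though", "thought", "thoughtful"]
packed = front_encode(words)
print(packed)                         # ['0though', '6t', '7ful']
print(front_decode(packed) == words)  # True
```

Decoding needs only the previous word and one slice per entry, which is why the scheme is quick to decode. A real format would also need an unambiguous way to store the count (a fixed-width byte, say) if words can start with digits.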

I seem to recall stumbling across a Perl module that does this sort of thing once, but I'm not getting the right keywords in my searches to find it again. Or else I'm searching in the wrong places (CPAN, Google).

Anyone know where such a module might be hiding?


Hi Glenn,

I think the term you're thinking of is stemming. Maybe Lingua-Stem <http://search.cpan.org/dist/Lingua-Stem/> is what you're looking for?

Thanks for the quick response, Randy. I guess stemming is a similar topic, but is linguistic in nature. Looks to me like instead of operating on a sequence (or pair) of words, it just analyzes the word to reduce it to the "root" word.

I'm looking for an algorithm that operates pairwise on a sorted sequence
of words, that returns the number of common leading characters, so that
they can be eliminated to save (disk, memory) space.

I could write it, as a fairly simple character oriented loop, but
thought I'd seen it somewhere.... and figured if someone had made a
module, maybe they'd already dropped into C for performance, because the
loop I envision doesn't seem like it would be all that efficient in
Perl.  Of course, maybe it wouldn't be all that efficient in C either.
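
The character-oriented loop described above might look roughly like the following. This is a sketch in Python for illustration, not Perl and not an existing module; common_prefix_len and pairwise_prefix_lens are hypothetical names. It makes no claim about performance.

```python
def common_prefix_len(a, b):
    """Count how many leading characters a and b have in common."""
    n = 0
    limit = min(len(a), len(b))
    while n < limit and a[n] == b[n]:
        n += 1
    return n

def pairwise_prefix_lens(sorted_words):
    """For each word in a sorted sequence, return the number of
    leading characters it shares with the word just before it.
    The first word is compared against the empty string."""
    lens = []
    previous = ""
    for word in sorted_words:
        lens.append(common_prefix_len(previous, word))
        previous = word
    return lens

print(pairwise_prefix_lens(["though", "thought", "thoughtful"]))  # [0, 6, 7]
```

Each word is compared only with its immediate predecessor, so the total work is bounded by the combined length of the words.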

--
Glenn -- http://nevcal.com/
===========================
The best part about procrastination is that you are never bored,
because you have all kinds of things that you should be doing.

_______________________________________________
Perl-Win32-Users mailing list
[EMAIL PROTECTED]
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs
