On approximately 3/25/2004 9:16 PM, came the following characters from
the keyboard of Randy W. Sims:
On 3/25/2004 11:50 PM, Glenn Linderman wrote:
For sorted lists of text, like dictionaries, one quick-to-decode
technique that saves a fair amount of space is to start each string
with the number of bytes that match the previous string, and then
append the remainder of the string.
In other words, the list of words
though
thought
thoughtful
would reduce to
0though
6t
7ful
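
A minimal sketch of the scheme described above, in Perl, in case it helps
pin down what is being asked for. The names common_prefix_len,
front_encode and front_decode are made up for illustration and do not
come from any existing module. It also assumes the input is already
sorted, and it counts characters with length/substr, which equals bytes
only for single-byte text.

```perl
use strict;
use warnings;

# Number of leading characters two strings share (plain character loop).
sub common_prefix_len {
    my ($prev, $cur) = @_;
    my $max = length($prev) < length($cur) ? length($prev) : length($cur);
    my $n = 0;
    $n++ while $n < $max && substr($prev, $n, 1) eq substr($cur, $n, 1);
    return $n;
}

# Encode a sorted word list as [shared_count, suffix] pairs.
sub front_encode {
    my @words = @_;
    my $prev  = '';
    my @out;
    for my $word (@words) {
        my $n = common_prefix_len($prev, $word);
        push @out, [ $n, substr($word, $n) ];
        $prev = $word;
    }
    return @out;
}

# Rebuild the original words from the [shared_count, suffix] pairs.
sub front_decode {
    my @pairs = @_;
    my $prev  = '';
    my @words;
    for my $pair (@pairs) {
        my ($n, $suffix) = @$pair;
        $prev = substr($prev, 0, $n) . $suffix;
        push @words, $prev;
    }
    return @words;
}

my @encoded = front_encode(qw(though thought thoughtful));
print join(' ', map { "$_->[0]$_->[1]" } @encoded), "\n";   # 0though 6t 7ful
```

The pairs are kept separate here because gluing the count straight onto
the suffix becomes ambiguous once a word can start with a digit. A real
on-disk format would need a fixed-width count or a delimiter.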
I seem to recall stumbling across a Perl module that does this sort of
thing once, but I'm not getting the right keywords in my searches to
find it again. Or else I'm searching in the wrong places (CPAN, Google).
Does anyone know where such a module might be hiding?
Hi Glenn,
I think the term you're thinking of is stemming. Maybe Lingua-Stem
<http://search.cpan.org/dist/Lingua-Stem/> is what you're looking for?
Thanks for the quick response, Randy. I guess stemming is a similar
topic, but is linguistic in nature. Looks to me like instead of
operating on a sequence (or pair) of words, it just analyzes the word to
reduce it to the "root" word.
I'm looking for an algorithm that operates pairwise on a sorted sequence
of words and returns the number of common leading characters, so that
they can be eliminated to save (disk, memory) space.
I could write it as a fairly simple character-oriented loop, but
thought I'd seen it somewhere.... and figured if someone had made a
module, maybe they'd already dropped into C for performance, because the
loop I envision doesn't seem like it would be all that efficient in
Perl. Of course, maybe it wouldn't be all that efficient in C either.
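
On the efficiency worry, one possible way to avoid the per-character
loop in pure Perl is the bitwise string XOR operator: bytes that agree
XOR to "\0", so the run of leading NULs is the common prefix. This is
only a sketch, and common_prefix_len_xor is a made-up name. It assumes
byte strings (no characters above 0xFF), no embedded NUL bytes, and that
neither value has been used as a number, since ^ then does a numeric
XOR instead. Whether it actually beats a substr loop would need
benchmarking.

```perl
use strict;
use warnings;

# Length of the common leading prefix of two byte strings, via string XOR.
sub common_prefix_len_xor {
    my ($x, $y) = @_;

    # XOR the strings bytewise; matching bytes become "\0". The shorter
    # string is padded with zero bits, so the extra bytes of the longer
    # one come through unchanged (non-NUL under the stated assumptions).
    my ($nuls) = ($x ^ $y) =~ /^(\0*)/;
    my $n = length $nuls;

    # Never report more than the shorter string's length.
    my $min = length($x) < length($y) ? length($x) : length($y);
    return $n < $min ? $n : $min;
}

print common_prefix_len_xor('though',  'thought'),    "\n";   # 6
print common_prefix_len_xor('thought', 'thoughtful'), "\n";   # 7
```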
--
Glenn -- http://nevcal.com/
===========================
The best part about procrastination is that you are never bored,
because you have all kinds of things that you should be doing.