--- Begin Message ---
Thank You!
-cam

On Tue, Jul 28, 2015 at 11:39 AM, Cameron Sanders via Pharo-users <
pharo-users@lists.pharo.org> wrote:

>
>
> ---------- Forwarded message ----------
> From: Cameron Sanders <camsand...@aol.com>
> To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
> Cc:
> Date: Tue, 28 Jul 2015 11:00:11 -0400
> Subject: Re: [Pharo-users] New methods for the String class
> What fuzzy-string matching tools & packages are available today?
>
> -cam
>
> On Wed, Feb 26, 2014 at 9:09 AM, Hernán Morales Durand <
> hernan.mora...@gmail.com> wrote:
>
>>
>>
>>
>> 2014-02-26 7:10 GMT-03:00 Norbert Hartl <norb...@hartl.name>:
>>
>>>
>>> Am 26.02.2014 um 09:50 schrieb Pharo4Stef <pharo4s...@free.fr>:
>>>
>>>
>>> We could have an information retrieval API for approximate string matching,
>>> e.g. Levenshtein distance (already implemented, in various versions) and
>>> Hamming distance; these two are the most used and simplest edit distances.
>>> Then you have longest common subsequence and longest common substring (they
>>> are implemented in a package called "Fuzz", #longestCommonSubsequenceWith:).
>>> There is also the shift-or algorithm adapted for approximate matches (also
>>> implemented); fuzzy phrasing is another world entirely. Many applications use
>>> Damerau edit distance. Bioinformatics uses the Needleman-Wunsch and
>>> Smith-Waterman, but they call them "aligners" :) but you don't want to code
>>> the optimized version in Smalltalk, some say it could take years.
>>> All edit distances out there have specific requirements and no one is
>>> better than another for all cases. For example Jaro-Winkler is useful for
>>> one-word short strings.
>>>
>>>
>>> I’m not sure that all these edit distances should be part of the String
>>> core api.
>>> Now what would be good is to have a chapter describing them. This
>>> chapter would work well with the bioSmalltalk one :)
>>>
>>> I’m pretty sure they shouldn’t. Most of these are most likely for
>>> special applications. So a perfect candidate for a string extension
>>> package. A real modular entity that could load each of them individually
>>> would be perfect but we don’t have the proper tools, yet. Unless of course
>>> each of those algorithms is composed of multiple classes and would fit
>>> naturally in a package.
>>>
>>
>> Absolutely for a separate package for information retrieval algorithms.
>> From what I've seen, some algorithms require optimization through dynamic
>> programming (automata, matrices, etc.) and that would lead to multiple
>> classes, assuming you don't want to clutter the String class.
>>
>>
>>> But the most important prerequisite would be to make a separate package
>>> out of it. Did I understand correctly that those are part of bioSmalltalk?
>>>
>>
>> No. Those algorithms are spread over different packages in repositories
>> like SqueakSource, Cincom Store, etc.
>>
>> Hernán
>>
>>
>>
>
>
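For readers unfamiliar with the two simplest distances named in the quoted thread (Hamming and Levenshtein), here is a minimal sketch. It is written in Python rather than Pharo, and it is not taken from any of the packages mentioned above; the function names `hamming` and `levenshtein` are illustrative only. The Levenshtein version uses the dynamic-programming approach mentioned in the thread, keeping just one row of the matrix at a time.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(hamming("karolin", "kathrin"))
    print(levenshtein("kitten", "sitting"))
```

Hamming distance is only defined for strings of equal length, which is why the sketch raises an error otherwise; Levenshtein has no such restriction, at the cost of time proportional to the product of the two lengths.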

--- End Message ---
