Jason Kohles schrieb:
On Mar 2, 2007, at 10:42 AM, Mario Minati wrote:

Hello @all,

I'm looking for a solution to find out if there is already some data in my dataset that is similar to a new entry.

Example:
Companynames
I would like to find out if there are already companies in my addressbook (DB) which are similar to a given name to avoid double entries.

How to measure similarity:
I'am thinking of the hammingdistance. That means the difference between Linux and Linus is 1 as there is one letter different. The distance between Linux and Lisa is 3 as there is one letter more and two are different.

Does anyone have an idea how to realize that?
Can one realize this with code running on the database (PL/SQL or something) or is there a way doing that with DBIx::Class (drawback: all data had to read before processing).

Many databases have either built-in or extensions available to do this. The only ones that I have used are from the PostgreSQL contrib module fuzzystrmatch, which provides functions for calculating Levenshtein Distance and Soundex (possibly others, I can't recall).

If your database doesn't support it directly, for some solutions (Soundex, Metaphone) where you are calculating a checksum of the value and then looking for ones where the checksum is the same, you can precalculate the value when updating the database, and then just do a simple comparison search. Something like this would probably work...

package MyDB::Company;
use base 'DBIx::Class';
use Text::Metaphone qw();

__PACKAGE__->add_columns( qw( company company_metaphone ) );

sub calculate_company_metaphone {
my ( $self, $value ) = @_;

my $co = lc( $value );
$co =~ s/[^\w\s]//g;
$co =~ s/\b(incorporated|inc|corporation|corp)\b/INC/;
$co =~ s/\b(company|co)\b/CO/;
# probably some other useful standardization stuff here
return Text::Metaphone::Metaphone( $co );
}

sub store_column {
my ( $self, $name, $value ) = @_;

if ( $name eq 'company' ) {
$self->company_metaphone( $self->calculate_comany_metaphone( $value ) );
}
$self->next::method( $name, $value );
}

Then you can simply search something like...

my $rs = $schema->resultset( 'Company' )->search( {
company_metaphone => MyDB::Company->calculate_company_metaphone( $company_name ),
} );
Hello Jason,

your idea of an adapted store_column function is very clever.
But in the meantime I did some more googlesports and found out that the soundex functions (Metaphone is one of them) was build for english language and that there are some serious implications with the german language. As I am living in germany ...

So I thought of useing the Levenshtein algorithm which cannot be precalculated. So I'll do it in the database. I found out that there is an addon for Levenshtein and it was pretty easy to install. No I've just to create a stored procedure which takes the company names and split's them at whitespaces and calculates the Levenshtein difference to a given word and returns the lowest value per company name. This way I'll get a good value, as the 'string' is an ajax user input.

Once again thank you for the code sample, I'll use it soon for an other problem.

Greets,
Mario

_______________________________________________
List: http://lists.rawmode.org/cgi-bin/mailman/listinfo/dbix-class
Wiki: http://dbix-class.shadowcatsystems.co.uk/
IRC: irc.perl.org#dbix-class
SVN: http://dev.catalyst.perl.org/repos/bast/trunk/DBIx-Class/
Searchable Archive: http://www.mail-archive.com/[email protected]/

Reply via email to