On Mar 2, 2007, at 10:42 AM, Mario Minati wrote:

Hello @all,

I'm looking for a solution to find out if there is already some data in my dataset that is similar to a new entry.

Example:
Companynames
I would like to find out if there are already companies in my addressbook (DB) which are similar to a given name to avoid double entries.

How to measure similarity:
I'am thinking of the hammingdistance. That means the difference between Linux and Linus is 1 as there is one letter different. The distance between Linux and Lisa is 3 as there is one letter more and two are different.

Does anyone have an idea how to realize that?
Can one realize this with code running on the database (PL/SQL or something) or is there a way doing that with DBIx::Class (drawback: all data had to read before processing).

Many databases have either built-in or extensions available to do this. The only ones that I have used are from the PostgreSQL contrib module fuzzystrmatch, which provides functions for calculating Levenshtein Distance and Soundex (possibly others, I can't recall).

If your database doesn't support it directly, for some solutions (Soundex, Metaphone) where you are calculating a checksum of the value and then looking for ones where the checksum is the same, you can precalculate the value when updating the database, and then just do a simple comparison search. Something like this would probably work...

package MyDB::Company;
use base 'DBIx::Class';
use Text::Metaphone qw();

__PACKAGE__->add_columns( qw( company company_metaphone ) );

sub calculate_company_metaphone {
        my ( $self, $value ) = @_;

        my $co = lc( $value );
        $co =~ s/[^\w\s]//g;
        $co =~ s/\b(incorporated|inc|corporation|corp)\b/INC/;
        $co =~ s/\b(company|co)\b/CO/;
        # probably some other useful standardization stuff here
        return Text::Metaphone::Metaphone( $co );
}

sub store_column {
        my ( $self, $name, $value ) = @_;

        if ( $name eq 'company' ) {
$self->company_metaphone( $self->calculate_comany_metaphone ( $value ) );
        }
        $self->next::method( $name, $value );
}

Then you can simply search something like...

my $rs = $schema->resultset( 'Company' )->search( {
company_metaphone => MyDB::Company->calculate_company_metaphone ( $company_name ),
} );

--
Jason Kohles
[EMAIL PROTECTED]
http://www.jasonkohles.com/
"A witty saying proves nothing."  -- Voltaire


_______________________________________________
List: http://lists.rawmode.org/cgi-bin/mailman/listinfo/dbix-class
Wiki: http://dbix-class.shadowcatsystems.co.uk/
IRC: irc.perl.org#dbix-class
SVN: http://dev.catalyst.perl.org/repos/bast/trunk/DBIx-Class/
Searchable Archive: http://www.mail-archive.com/[email protected]/

Reply via email to