Hi All,

Problem Description

I'm trying to implement a custom algorithm to match user provided
free-text input, a company name such as "Ford Motor", against a
reference data source consisting of 1.4 million company names.

The algorithm executes following steps:

Step 1) Performs an "Exact Match", followed by "Begins Match" and
finally "Contains Match" of user provided search input. Results from
this step are also sorted in the same order.

Step 2) Performs a token by token match of search input with reference
company name.

Every token is matched in following order: Exact, Begins, Contains,
Levenshtein Distance (< 0.2) and Refined Soundex.

E.g. If user input is "Foord Motur Holding" and it's being matched
against "The Ford Motor Holdings Company" then first token "Foord"
will match "Ford" based on Soundex match, second token "Motur" will
match "Motor" based on Edit Distance Algo and and last token "Holding"
will match "Holdings" via Begins match.

Scoring: Every token match is first scored on a scale that rates the
matching technique, with Exact match being the best and Soundex being
the worst.

The overall score is calculated, on a scale of 0-100%, by calculating
a weighted average of individual token-match scores. Weights are
assigned based on index-order of token i.e. the first token has
highest weight and last token has lowest.



My Partial Solution

I have implemented a simple schema in solr to store referance company
names. A String field (called companyName), a simple text field
(called as companyText) copied from string and another text field
(called as companySoundex) copied from string and using
PhoneticFilterFactory for Refined Soundex based matching.

I have been able to replicate step 1) in a single solr query.

For step 2) I plan to fire 3 parallel queries to solr server. First
query performing a simple text search on companyText field, second
query performing fuzzy match using ~ operator on companyText field and
third query performing soundex match on companySoundex field. I plan
to somehow combine the results from these 3 parallel queries to get
desired final result.



Questions:

1) Is there a better way to replicate Step 2) of original algorithm?

2) Even if I go with my "three-parallel-queries" approach then how to
get the "right" sorting order as I get in the original algorithm ? I
guess the main problem is how to compare the solr scores from these 3
entirely different queries to do the final combining of results

Thanks for reading this long question. Any help/pointers would be
greatly appreciated.

-- 
Thanks

Reply via email to