Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
On 19 July 2010 19:46, tedd wrote: > At 12:39 PM +0100 7/19/10, Richard Quadling wrote: >> >> I'm using MS SQL, not mySQL. >> >> Found a extended stored procedure with a UDF. >> >> Testing it looks excellent. >> >> Searching for a match on 30,000 vehicles next to no additional time - >> a few seconds in total, compared to the over 3 minutes to search using >> SQL code. > > That seems a bit slow. > > For example, currently I'm searching over 4,000 records (which contains > 4,000 paragraphs taken from the text of the King James version of the Bible) > for matching words, such as %created% and the times are typically around > 0.009 seconds. > > As such, searching ten times that amount should be in the range of tenths of > a second and not seconds -- so taking a few seconds to search 30,000 records > seems excessive to me. Tedd, I'm not looking for a "word". I'm looking for similar "wrds". Word is closer to the misspelled wrds that it is to wars. select dbo.DamerauLevenshteinDistance('words', 'wars'), dbo.DamerauLevenshteinDistance('words', 'wrds') (No column name)(No column name) 2 1 Lower is better. Also, I have to compare every row in the set and then sort it to find the lowest values for the Damerau-Levenshtein or the highest for the Jaro–Winkler distance. As the value entered is always going to be the unknown, I can't pre-calculate the distances. I do an exact match test first. -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
On Mon, Jul 19, 2010 at 2:46 PM, tedd wrote: > At 12:39 PM +0100 7/19/10, Richard Quadling wrote: >> >> I'm using MS SQL, not mySQL. >> >> Found a extended stored procedure with a UDF. >> >> Testing it looks excellent. >> >> Searching for a match on 30,000 vehicles next to no additional time - >> a few seconds in total, compared to the over 3 minutes to search using >> SQL code. > > That seems a bit slow. > > For example, currently I'm searching over 4,000 records (which contains > 4,000 paragraphs taken from the text of the King James version of the Bible) > for matching words, such as %created% and the times are typically around > 0.009 seconds. > > As such, searching ten times that amount should be in the range of tenths of > a second and not seconds -- so taking a few seconds to search 30,000 records > seems excessive to me. > > Cheers, > > tedd I would be surprised if a Levenshtein or similar_text comparison in a database were NOT slower than even a wildcard search because of the calculations that have to be performed on each row in the column being compared. That, and the fact that user-defined functions in SQL Server often have a performance penalty of their own. Just for kicks, you could try loading the values in that column into an array in PHP and then time iterating the array to calculate the Levenshtein distances for each value to see how it compares. Andrew -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
At 12:39 PM +0100 7/19/10, Richard Quadling wrote: I'm using MS SQL, not mySQL. Found a extended stored procedure with a UDF. Testing it looks excellent. Searching for a match on 30,000 vehicles next to no additional time - a few seconds in total, compared to the over 3 minutes to search using SQL code. That seems a bit slow. For example, currently I'm searching over 4,000 records (which contains 4,000 paragraphs taken from the text of the King James version of the Bible) for matching words, such as %created% and the times are typically around 0.009 seconds. As such, searching ten times that amount should be in the range of tenths of a second and not seconds -- so taking a few seconds to search 30,000 records seems excessive to me. Cheers, tedd -- --- http://sperling.com http://ancientstones.com http://earthstones.com -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
There is an algorithm called longest common sub sequence. If you can find the longest common sub sequence of the strings of database for the given string and sort it, you'll get the most matched word. But I think this algo is developed already and available in your context. It's name can be different. As a developer I am not sure actually what function in php or mysql serves the purpose. Shiplu Mokadd.im My talks, http://talk.cmyweb.net Follow me, http://twitter.com/shiplu SUST Programmers, http://groups.google.com/group/p2psust Innovation distinguishes bet ... ... (ask Steve Jobs the rest) -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
At 12:09 PM -0400 7/15/10, Daniel P. Brown wrote: On Thu, Jul 15, 2010 at 11:54, Richard Quadling wrote: I'm looking for is a way to compare what they've entered against a known list and to provide my 10 best guesses. Look into the following functions and families: levenshtein() similar_text() What would one use to have MySQL do the heavy lifting here? In other words, are there similar MySQL functions? Cheers, tedd PS: I thought of this Friday, but finally got my fingers to work today. -- --- http://sperling.com http://ancientstones.com http://earthstones.com -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
On 16 July 2010 13:47, Daniel P. Brown wrote: > On Fri, Jul 16, 2010 at 07:09, Richard Quadling wrote: >> >> Having just found a levenshtein() UDF for MS SQL [1] >> >> I'm very impressed. >> >> Thank you for the suggestion. >> >> Regards, >> >> Richard. >> >> [1] http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540&whichpage=2#425160 > > Dear dearest sir; > > Thank you for your formal reply. Your email is important to me. > Please contact me at your convenience to discuss this reply to your > reply. > > Sincerely, > > Bozo. > > > > -- > > UNADVERTISED DEDICATED SERVER SPECIALS > SAME-DAY SETUP > Just ask me what we're offering today! > daniel.br...@parasane.net || danbr...@php.net > http://www.parasane.net/ || http://www.pilotpig.net/ > Oh. It must be Friday! -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
On Fri, Jul 16, 2010 at 07:09, Richard Quadling wrote: > > Having just found a levenshtein() UDF for MS SQL [1] > > I'm very impressed. > > Thank you for the suggestion. > > Regards, > > Richard. > > [1] http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540&whichpage=2#425160 Dear dearest sir; Thank you for your formal reply. Your email is important to me. Please contact me at your convenience to discuss this reply to your reply. Sincerely, Bozo. -- UNADVERTISED DEDICATED SERVER SPECIALS SAME-DAY SETUP Just ask me what we're offering today! daniel.br...@parasane.net || danbr...@php.net http://www.parasane.net/ || http://www.pilotpig.net/ -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
On 15 July 2010 17:09, Daniel P. Brown wrote: > Look into the following functions and families: > > levenshtein() > similar_text() > Having just found a levenshtein() UDF for MS SQL [1] I'm very impressed. Thank you for the suggestion. Regards, Richard. [1] http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540&whichpage=2#425160 -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
On Thu, Jul 15, 2010 at 11:54, Richard Quadling wrote: > > I'm looking for is a way to compare what they've entered against a > known list and to provide my 10 best guesses. Look into the following functions and families: levenshtein() similar_text() You wouldn't want to use metaphone() or soundex(), because you're working with alphanumeric combinations, but if you want to be a glutton for punishment, it could be fun. Something along the lines of using levenshtein() to get the best matches, then similar_text() to score each of the matches would probably be a viable solution for you. The last time I incorporated something like that was for my own use to track down a record in a MySQL database for which I could remember just a small blip of text. It looked through about 23 million records in under two minutes and I found exactly what it was for which I was searching --- even though I was off by one word. Would've taken me forever to find it using alternative methods. -- UNADVERTISED DEDICATED SERVER SPECIALS SAME-DAY SETUP Just ask me what we're offering today! daniel.br...@parasane.net || danbr...@php.net http://www.parasane.net/ || http://www.pilotpig.net/ -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
On Thu, 2010-07-15 at 16:54 +0100, Richard Quadling wrote: > Hi. > > It seems that users cannot enter a vehicle registration 100% accurately. > > We have recently released a small mobile web app which allows service > engineers/inspectors to enter a vehicle registration number and a pin > number to get service history for the vehicle. > > We are getting around a 40% fail rate on the registrations for the > first time of entry. This drops to around a 1.5% error rate on the > second attempt. > > Most of the time it is simply a case of 2 letters/numbers being > swapped. Sometimes a letter/number is entered for a number/letter. > > 0/O > 1/I/l > 2/Z > 3/E > 4/A > 5/S > 6/G > 7/T > 8/B > 9/q > > Some of the registrations are private and don't obey any format (NOTE: > Trailer registrations aren't the same as vehicle registrations - they > can be anything the owner wants. In some cases we have them as > straight numbers - 1, 2, 3, etc.). > > I'm looking for is a way to compare what they've entered against a > known list and to provide my 10 best guesses. > > What I'm stuck on is what criteria do I use. > > I think something like the old style colour Mastermind game (right > colour in the right place, right colour wrong place, wrong colour). > But that's going to be slow. One of the contracts has over 30,000 > vehicles/trailers available to them. > > Any suggestions really. > > Regards, > > Richard. > I'm not sure how well it would work for things like a registration number, but metaphone keys are pretty good for detecting similar words based on phonetic sounds. If you don't get a match first time, maybe convert all the digits in the reg no. to letters, and create a metaphone key that you compare against the list of other keys in your DB that you already created in the same manner (you'll need to create these first time) Like I said, I'm not sure how well it will work, but it might possibly reduce the failure rate a bit. Also, there's the electric shock treatment. Find out which engineers are the worst typists, and... well you get the idea! Thanks, Ash http://www.ashleysheridan.co.uk
[PHP] Determining the similarity between a user supplied short piece of text (between 5 and 15 characters) and a list of similar length text items.
Hi. It seems that users cannot enter a vehicle registration 100% accurately. We have recently released a small mobile web app which allows service engineers/inspectors to enter a vehicle registration number and a pin number to get service history for the vehicle. We are getting around a 40% fail rate on the registrations for the first time of entry. This drops to around a 1.5% error rate on the second attempt. Most of the time it is simply a case of 2 letters/numbers being swapped. Sometimes a letter/number is entered for a number/letter. 0/O 1/I/l 2/Z 3/E 4/A 5/S 6/G 7/T 8/B 9/q Some of the registrations are private and don't obey any format (NOTE: Trailer registrations aren't the same as vehicle registrations - they can be anything the owner wants. In some cases we have them as straight numbers - 1, 2, 3, etc.). I'm looking for is a way to compare what they've entered against a known list and to provide my 10 best guesses. What I'm stuck on is what criteria do I use. I think something like the old style colour Mastermind game (right colour in the right place, right colour wrong place, wrong colour). But that's going to be slow. One of the contracts has over 30,000 vehicles/trailers available to them. Any suggestions really. Regards, Richard. -- PHP General Mailing List (http://www.php.net/) To unsubscribe, visit: http://www.php.net/unsub.php