Re: Fast relevance sorting of full text search results

Shawn Green Sun, 06 Jan 2008 07:58:08 -0800

Urms wrote:

I'm using pretty standard approach to sorting search results by relevancy:
SELECT DISTINCT product_name,MATCH (keywords) AGAINST ('CONSOLIDATED* 16* bearing*' IN BOOLEAN MODE) AS
rate
FROM _TTWHERE MATCH ( keywords ) AGAINST ('CONSOLIDATED* 16* bearing*' IN BOOLEAN
MODE ) >0
ORDER BY rate DESC

It works fine as long as the quantity of results is not big. Once the
quantity is about 50,000 and more (I have a very big database) the query
starts working way too slow. Total number of records is about 4 million. It
takes about 2 sec when there are 50,000 records in the result but at the
same time it takes only about 0.006 sec without ORDER BY clause.
I understand that ORDER BY is time consuming but maybe someone knows a
different way to have sorting by relevancy.

Thanks in advance!

I think it's your SELECT DISTINCT that is slowing you down. For each newrow being considered for inclusion to your result set, you are askingthe engine to compare that row against all other rows you already havein the set. So what's happening is that you are doing a longer andlonger linear search the larger your datasets become.

One option is to cache your results in a temporary table thende-duplicate your results from there.

Another option is to create a temporary table with a UNIQUE key on thecolumns you want to remain unique and use an INSERT IGNORE. Because ofthe UNIQUE key (or PRIMARY KEY if that's your choice) you will be doingan indexed search of all values rather than a linear search through theentire list. This would look something like:


CREATE TEMPORARY TABLE tmpFT_results (
  product_name <insert datatype here>
, rate <insert datatype here>
, primary key (product_name)
) ENGINE = MEMORY;

INSERT IGNORE tmpFT_results
SELECT product_name,
MATCH (keywords) AGAINST ('CONSOLIDATED* 16* bearing*' IN BOOLEAN MODE) AS
rate
FROM _TT
WHERE MATCH ( keywords ) AGAINST ('CONSOLIDATED* 16* bearing*' IN BOOLEAN
MODE ) >0;

SELECT product_name, rate
FROM tmpFT_results
ORDER BY rate DESC;

Sure it's three separate steps but it's tuned to the process you aretrying to perform. The SELECT DISTINCT processing has no idea that youonly need to keep the values of product_name distinct as we would hopethe `rate` component may be duplicated.

If there is the possibility of different `rate` results for the sameproduct_name value then you may also want to use the temporary tablemethod to somehow weight (sum or average comes to mind) the match valuesacross all responses before returning the results.


I hope these ideas help your performance and search accuracy.

Best wishes,
--
Shawn Green, Support Engineer
MySQL Inc., USA, www.mysql.com
Office: Blountville, TN
    __  ___     ___ ____  __
   /  |/  /_ __/ __/ __ \/ /
  / /|_/ / // /\ \/ /_/ / /__
 /_/  /_/\_, /___/\___\_\___/
        <___/
 Join the Quality Contribution Program Today!
 http://dev.mysql.com/qualitycontribution.html


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: Fast relevance sorting of full text search results

Reply via email to