Hi!

On Jan 11, Matt W wrote:
> Hi,
> 
> ----- Original Message -----
> From: "Sergei Golubchik"
> To: "Steven Roussey"
> Sent: Wednesday, December 10, 2003 7:44 AM
> Subject: Re: 4.1.1 FTS 2-level?
> 
> 
> > Hi!
> >
> > On Dec 09, Steven Roussey wrote:
> > > Does Mysql 4.1.1 have the two level index system integrated into
> > > it for full text searches?
> >
> > What do you mean ?
> > Is it used to optimize searches ? No.
> >
> > Still there could be some speedup because, e.g, MyISAM will use
> > binary search in the key pages instead of linear one, etc.
>
> You're right.  Wow!
> 
> I assume you were comparing it to 4.0.x, not 4.1.0.  Once the data rows
> are cached, the index search in boolean mode seems to be about *7 times*
> faster than 4.0. :-)  2 test searches on the same data went from 48s ->
> 7 and 35 -> 5.  Nice!

Cool :)

> So are these faster index searches only the result of binary vs linear
> search?  (I don't know the exact difference, but binary sounds good.
> ;-))  The actual full-text code itself is NOT any more optimized than
> 4.0?

As far as I remember - no.

The difference is simple - if one has, say, a string or words like

 "just,simply,words,zyx"

that is, comma-separated words in alphabetic order.
Then to find a particular word one needs to compare word1 ("just") with
this particular word, then word "simply", etc. It is because words - as
a sequences of characters - may have different lengths. So the only way
to find where in the string starts the word N is to read all the words
before it.

If it's an array of, say, integers

  int array[100]

then i-th entry is simply array[i], and to find a word in a sorted array
one can use binary search.

These are two different types of pages of a btree in MyISAM index.
If an index includes variable-length segments or(and) compression is used
then different key entries on the page may have different lengths,
MyISAM will use linear search for each key page, time is linear - O(N).
(for keys starting from a char/varchar - like a fulltext index - MyISAM
uses a special variant of the linear search, that does not unpacks each
key for comparison, it's ~2-4 times faster than "normal" linear search,
but still it's linear - O(N))

Otherwise MyISAM will use binary search, time is logarithmic - O(log N)

In normal fulltext index key entry is <varchar, float, offset>
In second level it's only <float, offset> - and MyISAM uses binary
seacrh for the pages on the 2nd level.

> Are the 2-level indexes solely for FTS to use, or can MyISAM use them in
> general for any indexes?

Fulltext only.

> Just wondering, since you said "Is it used to
> optimize searches? No."  Which sounds like it's being used for
> *storage*, just not the word count statistics for optimization, etc.

Yes, for storage.
But it also used for word count statistics, as new there is only one
entry per word (well, for common enough words, rare words don't matter),
and this entry is the place where per-word statistics is be stored.

> And my index file was reduced from 1.74G in 4.0 to 1.59G, so I thought
> maybe this is where some space was saved.

Yes.
MyISAM does compress fulltext indexes with prefix compression.
That is for each word it stores only the difference from the previous
word on this page. If the words are identical, second entry will take
one or two bytes only (I don't remember exactly). And the first entry on
the page is always not compressed.

So, by converting all entries for some particular word to 2-level
structure saves these one-two bytes per word and
length(word)*(number_of_pages-1)

Usually - with prefix compression - one get about ~250 entries on the
page (I mean for the same common word), thus number_of_pages is,
approximately number_or_rows_with_the_word/250

> BTW, would (re)building the index be slower with 4.1 for any reason?  I
> thought maybe it was, but I'd have to try again to be sure.

I didn't compare :(

But out of my head I don't know any reason why it should be.

Regards,
Sergei

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik <[EMAIL PROTECTED]>
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
       <___/  www.mysql.com

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to