Re: Lots of FULLTEXT stuff (suggestions)

Sergei Golubchik Wed, 27 Aug 2003 20:29:23 +0000

Hi!

First: thanks for ideas - I'm adding them to my todo :)


About dates - it's very difficult to say when a particular feature will
be implemented. Anyway, first I'm going to finish with this 2-level
index structure - to implement optimizations that rely on it.

> Any speed/optimization improvements are welcome for gigs of data,
> especially with IN BOOLEAN MODE (e.g. automagically sorted by relevance
> like a natural language query, although this is probably difficult if a
> wildcard* is used?).

It's not possible - at least I don't know to do it.
In natural language mode the fulltext search is done in in "Fulltext
initialization" stage - as you noticed. So an engine can sort documents
on relevance. In boolean mode each found document is returned at once -
that's why this search mode is faster, it need not support/keep the list
of all matched documents.

> And the FULLTEXT index shouldn't always be chosen
> for non-const join types when another index would find less rows first.
> e.g. ... WHERE MATCH ... AND primary_key IN (1, 2); should use the
> PRIMARY key, not the FULLTEXT. :-) But maybe that's not possible, since
> I guess it's a problem auto sorting by relevance if it's not using the
> FULLTEXT index.

Hmm. The logic in making FULLTEXT index always the preferred one is that
even if it's not the index as reported by EXPLAIN, it is still used
in "Fulltext initialization". So, using it in join to retrieve rows
adds no extra costs.

But now I think that there is still the cost of reading row data from
disk, so using PRIMARY/UNIQUE index can be faster in some cases.

I am not sure, though, optimizer can take this into account properly -
to know the number of matched rows before choosing an index
would mean doing fulltext search for EXPLAIN too - I doubt it will be
appreciated :)

Still, with 2-level index some estimations can be made...
Great - thanks for the idea!

Anyway, in boolean mode there's no "initialization" so there's no
reasons (besides historical) for it to be "preferred" - it'll be fixed.

> To the developers: any word on if and when any of these things would be
> implemented? I know from the TODO and other list messages that some
> will. Any *estimates* (in months or MySQL version) on when would be
> great. Just any info on new full-text features, even ones that I didn't
> mention, would be awesome to hear. :-) And like how they would be
> implemented and used by us (syntax, etc.).

As I told - it's very difficult to predict this :(
Anyway, I doubt anything that requires changing .frm
file structure will get into 4.1
 
> How about changing the default min/max (or just min if you want) word
> length? I think everyone *really* wishes ft_min_word_len was 3. Seems
> like that and indexing numbers shorter than min_word_len could be easily
> done. Please? :-)

Yes, it's safe enough for 4.1

> There Sergei is talking about a new .frm format (plain text) that will
> allow more of these features. Will it allow us to somehow define how to
> parse things or something?? Could you elaborate more on what this will
> bring? In November 2001, he said the new .frm format would be here "this
> year." It's been almost 2 years since then, so when is it do?

It's now planned for 5.1 - plain text .frm comes together with complete
redesign of internal table structure handling, table structure cache,
etc.

But even without it .frm format was extended in 4.1 so I don't need it
for adding per-index options anymore.

> Also, are the current MySQL versions using the "2 level" full-text index
> format yet? I'm thinking not?

4.1.0 is using it.
This index structure was done to make possible new powerful
optimizations. It is these optimizations what is not implemented yet :(
It's in my highest-priority todo.
 
> Finally, in the full-text TODO, it says "Generic user-suppliable UDF
> preparser." Could you also elaborate on this? The "generic" part almost
> makes it sound like some sort of "script" to define how to parse the
> text. But UDF makes it sound like a separate thing that has to be loaded
> with CREATE FUNCTION. But UDFs won't work with your MySQL binaries, will
> they, since they're complied statically?

mysql-max binary is compiled dynamically - so it works with UDF's.
And "UDF" in the todo item does not mean it will MySQL "User-Definable
Function" yet - it could be a Stored Procedure, e.g.

The idea is to be able to supply a function (whatever it is) that takes
a column's data and returns a list of words that this "data" contain.
It could be used e.g. to fulltext-index pdf's or xml's or MS Word files,
or whatever.

Regards,
Sergei

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik <[EMAIL PROTECTED]>
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
       <___/  www.mysql.com

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: Lots of FULLTEXT stuff (suggestions)

Reply via email to