Re: Lots of FULLTEXT stuff (suggestions)

2003-09-12 Thread Sergei Golubchik
Hi!

On Sep 11, Matt W wrote:
 Hi Sergei!
 
 I'll try to keep my observations/ideas below as short and simple to
 understand as possible. :-)

Thanks :)

Here I reply very quickly to some questions.
Another reply will follow...
 
 Sure, boolean mode is faster in *some* cases, since, as you said, it
 doesn't need the list of all matched documents. From my experience,
 it's [only] faster in searches like ' some words ' (no boolean
 operators; and yes, I know some is a stopword ;-)), especially with
 LIMIT, but it just returns the first documents it finds with any single
 word.

No, it's MUCH faster then search string contains popular words.  E.g.
1,000,000 rows, and the few (say, five) words that are present in
400,000 rows each, but intersection is small (numbers similar to my
benchmarks).  This will make NL fulltext search engine to keep a list of
1,000,000 documents (w/o actual row data of course) in the memory - and
search through it - it'll be hell as slow.

 ...
 the ones that contain more of the words
 will usually be ranked highest from my experience.

Normally yes.
But this is not guaranteed. E.g. the document that contain only one of
the words, but is short can have higher relevance than the other one
than contains both words, but also has many other words, so relative
weight of these two words in the document is low. This effect is only
noticeable when texts' lengths are significatly different.

 It would be great if these disk seeks could be optimized to read a chunk
 of rows at a time *in row order* as it seems right now that each row is
 read one-at-a-time in relevance order. Like if you could take a chunk
 of, say, 1000 row pointers which are in relevance order, sort them in
 datafile row order, and then read them like that. Wouldn't this cause
 fewer random seeks since you keep moving in the same direction in the
 file?

Yes, that would.
I can say it for sure, as this optimization is already implemented in
MySQL. It is used in filesort. I should've known it can be used here too
:)
Thanks.

 Right now if I want all words in my application, I'm favoring using a
 natural language query with LIMIT 1000 or so and then running another
 query with LIKE to check those 1000 document IDs to see if they contain
 all words.

You can simply do AND MATCH ... AGAINST (... IN BOOLEAN MODE)+0

+0 will guarantee that in this case an index will not be used,
so that it will be used for your MATCH in natural-language mode.
(I cannot say out of my head what MATCH from both optimizer will prefer,
so this simple trick will do)

 I was even thinking that IN BOOLEAN MODE syntax should be removed
 (ignored, actually, to maintain backwards compatibility) and whether or
 not boolean operators are used in the query determines whether or not
 boolean mode is used (relevance sorting in both cases). This seems
 logical. :-)

Yes, it how it worked in the very first version.
Then I though to make it explicit to avoid problems with boolean
operators that can occasionally be present in the natural language query
(especially when it is a big chunk of text - a very typical application
for NL search).
 
   To the developers: any word on if and when any of these things would be
   implemented? I know from the TODO and other list messages that some
   will. Any *estimates* (in months or MySQL version) on when would be
   great. Just any info on new full-text features, even ones that I didn't
   mention, would be awesome to hear. :-) And like how they would be
   implemented and used by us (syntax, etc.).
 
  As I told - it's very difficult to predict this :(
  Anyway, I doubt anything that requires changing .frm
  file structure will get into 4.1
 
   How about changing the default min/max (or just min if you want) word
   length? I think everyone *really* wishes ft_min_word_len was 3.  Seems
   like that and indexing numbers shorter than min_word_len could be easily
   done. Please? :-)
 
  Yes, it's safe enough for 4.1
 
 Sorry, I don't know what this means. :-) You mean ft_min_word_len will
 be 3 by default in 4.1? And what is safe enough?

safe enough means that no big features can be added to 4.1 anymore.
Only small local changes.
Most fulltext-related changes are local :)
 
 P.S. Is there a document somewhere that has information about the
 internals of full-text search or MySQL in general? I noticed this bk
 commit - mysqldoc tree (1.790) message on the Internals list the other
 day:
 http://lists.mysql.com/list.php?3:mss:9961:200309:bpfbpgphemknogaidjep
 and bk commit - mysqldoc tree (1.799)
 http://lists.mysql.com/list.php?3:mss:9996:200309:lejkmpinlmdninacgcpl
 They appear to be updating a document about internals (inc. full-text
 search). However, I can't find this document anywhere (source (for
 Win32), MySQL site/docs). Could you tell me where to get it? :-) Thanks!

It's in the bk tree only.
But you may use http bk interface:

http://mysql.bkbits.net:8080/mysqldoc
http://mysql.bkbits.net:8080/mysqldoc/anno/Docs/[EMAIL 

Re: Lots of FULLTEXT stuff (suggestions)

2003-09-11 Thread Matt W
Hi Sergei!

Thanks for your reply and taking time to read and consider my
suggestions. :-)  I didn't reply sooner because I was deciding what to
say in this message. ;-)

I joined the list specifically for posting these suggestions, and, with
your reply, I wanted to say that it's great to have direct contact with
the developers like this!

I'll try to keep my observations/ideas below as short and simple to
understand as possible. :-)


- Original Message -
From: Sergei Golubchik [EMAIL PROTECTED]
Sent: Wednesday, August 27, 2003 3:31 PM
Subject: Re: Lots of FULLTEXT stuff (suggestions)

 Hi!

 First: thanks for ideas - I'm adding them to my todo :)

 About dates - it's very difficult to say when a particular feature
will
 be implemented. Anyway, first I'm going to finish with this 2-level
 index structure - to implement optimizations that rely on it.

  Any speed/optimization improvements are welcome for gigs of data,
  especially with IN BOOLEAN MODE (e.g. automagically sorted by
relevance
  like a natural language query, although this is probably difficult
if a
  wildcard* is used?).

 It's not possible - at least I don't know to do it.
 In natural language mode the fulltext search is done in in Fulltext
 initialization stage - as you noticed. So an engine can sort
documents
 on relevance. In boolean mode each found document is returned at
once -
 that's why this search mode is faster, it need not support/keep the
list
 of all matched documents.

Yeah, it would be really nice though if boolean searches could auto-sort
by relevance (see below).

I have 2 speed vs. usefulness issues -- 1 with boolean mode and another
with natural language mode:

Sure, boolean mode is faster in *some* cases, since, as you said, it
doesn't need the list of all matched documents. From my experience,
it's [only] faster in searches like ' some words ' (no boolean
operators; and yes, I know some is a stopword ;-)), especially with
LIMIT, but it just returns the first documents it finds with any single
word. I think this is pretty useless as you will get many rows that
would be low relevance. If you use boolean mode without boolean
operators, you should've just used natural language and, if combined
with LIMIT, you would fairly quickly get the most relevant documents
with any of the words -- and the ones that contain more of the words
will usually be ranked highest from my experience. The results are MUCH
better than boolean mode without boolean operators, even though it may
take slightly longer.

And now the issue with natural language mode: If the needed parts of the
datafile aren't cached by the OS, all the random disk seeks are KILLING
performance! However, once the query is run once and the OS has cached
the datafile parts, the search is VERY quick (with LIMIT to prevent a
ton of rows). I would say faster than any but the simplest ' no boolean
operator ' boolean searches. And the results are relevance sorted!

It would be great if these disk seeks could be optimized to read a chunk
of rows at a time *in row order* as it seems right now that each row is
read one-at-a-time in relevance order. Like if you could take a chunk
of, say, 1000 row pointers which are in relevance order, sort them in
datafile row order, and then read them like that. Wouldn't this cause
fewer random seeks since you keep moving in the same direction in the
file? Of course you'd need to get that chunk of rows back in relevance
order if they were read in row order. :-) If you can't hold those rows
in memory to put back in order, maybe you could read/discard,
read/discard, and so on, the rows in row order, then when you read them
in random relevance order, the data would at least be cached by the OS.
Just something I, who doesn't know much about file access, was thinking
about! :-) Boolean mode seems to read the datafile faster because, at
least in my fresh table, the results are found/read in datafile order.

Back to boolean mode. I don't think it's faster than natural language
(especially on subsequent same queries) when you have a search like '
+some +words ' or ' some words '. When some and words appear in
many documents, but rarely, if ever, appear in the same document. It
uses lots of CPU time finding one word in the index and failing on the
boolean criteria.

Right now if I want all words in my application, I'm favoring using a
natural language query with LIMIT 1000 or so and then running another
query with LIKE to check those 1000 document IDs to see if they contain
all words. And the documents that do will almost certainly be in the
first rows returned. e.g. once they're not for more than a few documents
in a row, they probably won't all appear in a document again in the
lower relevance results.

Heck, even if I want to simulate (' some +words ' IN BOOLEAN MODE) with
natural language, I can use (' some words words '). By specifying
words multiple times, it gives it higher relevance, so I can again
check with LIKE in another query

Re: Lots of FULLTEXT stuff (suggestions)

2003-08-27 Thread Sergei Golubchik
Hi!

First: thanks for ideas - I'm adding them to my todo :)

About dates - it's very difficult to say when a particular feature will
be implemented. Anyway, first I'm going to finish with this 2-level
index structure - to implement optimizations that rely on it.

 Any speed/optimization improvements are welcome for gigs of data,
 especially with IN BOOLEAN MODE (e.g. automagically sorted by relevance
 like a natural language query, although this is probably difficult if a
 wildcard* is used?).

It's not possible - at least I don't know to do it.
In natural language mode the fulltext search is done in in Fulltext
initialization stage - as you noticed. So an engine can sort documents
on relevance. In boolean mode each found document is returned at once -
that's why this search mode is faster, it need not support/keep the list
of all matched documents.

 And the FULLTEXT index shouldn't always be chosen
 for non-const join types when another index would find less rows first.
 e.g. ... WHERE MATCH ... AND primary_key IN (1, 2); should use the
 PRIMARY key, not the FULLTEXT. :-) But maybe that's not possible, since
 I guess it's a problem auto sorting by relevance if it's not using the
 FULLTEXT index.

Hmm. The logic in making FULLTEXT index always the preferred one is that
even if it's not the index as reported by EXPLAIN, it is still used
in Fulltext initialization. So, using it in join to retrieve rows
adds no extra costs.

But now I think that there is still the cost of reading row data from
disk, so using PRIMARY/UNIQUE index can be faster in some cases.

I am not sure, though, optimizer can take this into account properly -
to know the number of matched rows before choosing an index
would mean doing fulltext search for EXPLAIN too - I doubt it will be
appreciated :)

Still, with 2-level index some estimations can be made...
Great - thanks for the idea!

Anyway, in boolean mode there's no initialization so there's no
reasons (besides historical) for it to be preferred - it'll be fixed.

 To the developers: any word on if and when any of these things would be
 implemented? I know from the TODO and other list messages that some
 will. Any *estimates* (in months or MySQL version) on when would be
 great. Just any info on new full-text features, even ones that I didn't
 mention, would be awesome to hear. :-) And like how they would be
 implemented and used by us (syntax, etc.).

As I told - it's very difficult to predict this :(
Anyway, I doubt anything that requires changing .frm
file structure will get into 4.1
 
 How about changing the default min/max (or just min if you want) word
 length? I think everyone *really* wishes ft_min_word_len was 3. Seems
 like that and indexing numbers shorter than min_word_len could be easily
 done. Please? :-)

Yes, it's safe enough for 4.1

 There Sergei is talking about a new .frm format (plain text) that will
 allow more of these features. Will it allow us to somehow define how to
 parse things or something?? Could you elaborate more on what this will
 bring? In November 2001, he said the new .frm format would be here this
 year. It's been almost 2 years since then, so when is it do?

It's now planned for 5.1 - plain text .frm comes together with complete
redesign of internal table structure handling, table structure cache,
etc.

But even without it .frm format was extended in 4.1 so I don't need it
for adding per-index options anymore.

 Also, are the current MySQL versions using the 2 level full-text index
 format yet? I'm thinking not?

4.1.0 is using it.
This index structure was done to make possible new powerful
optimizations. It is these optimizations what is not implemented yet :(
It's in my highest-priority todo.
 
 Finally, in the full-text TODO, it says Generic user-suppliable UDF
 preparser. Could you also elaborate on this? The generic part almost
 makes it sound like some sort of script to define how to parse the
 text. But UDF makes it sound like a separate thing that has to be loaded
 with CREATE FUNCTION. But UDFs won't work with your MySQL binaries, will
 they, since they're complied statically?

mysql-max binary is compiled dynamically - so it works with UDF's.
And UDF in the todo item does not mean it will MySQL User-Definable
Function yet - it could be a Stored Procedure, e.g.

The idea is to be able to supply a function (whatever it is) that takes
a column's data and returns a list of words that this data contain.
It could be used e.g. to fulltext-index pdf's or xml's or MS Word files,
or whatever.

Regards,
Sergei

-- 
   __  ___ ___   __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik [EMAIL PROTECTED]
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
   ___/  www.mysql.com

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



RE: Lots of FULLTEXT stuff (suggestions)

2003-08-26 Thread Steven Roussey
 Thanks for replying. Your posts that I've found when searching for
 FULLTEXT information have had great ideas. :-) Searching millions of
 posts efficiently and effectively isn't easy. :-( Heh.

FULLTEXT does not scale very well once the files get bigger than your
RAM.

The redesign of the index where it gets normalized will help quite a bit
in reducing the size of the files. For large tables, it will help
immensely.

 Most 1-3 letter words that you don't want indexed should be
 stopwords anyway, right? So why NOT index the ones that are left?
 Doesn't seem like it'd make the index much larger to me. BTW, what is
 your min_word_len value?

I haven't really thought about it. Although I don't see any value in
one-letter words (or numbers). I use min_word_len=3 and my own stop
list, which is merge of stopwords in many languages. I made both changes
at the same time and ended up with a slightly smaller index.

--steve-


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Re: Lots of FULLTEXT stuff (suggestions)

2003-08-26 Thread Thomas Spahni
Matt,

I fully agree that indexing short words and numbers is a necessity
sometimes. I'm processing legal text where abbreviations are widely used
and people want to search for chunks like:
  Art. 234 Abs. 3 OR
and the search should also find occurrances of
  Art. 234 OR

These are so common that I risk to run into the 50% cutoff unless using
BOOLEAN MODE. Indexing numbers is top of my wish list.

I observed user's queries for some time now and found that the ranking of
the results is optimal (i.e. match the user's expectations) when the words
he typed occur close together in the text (but not necessarily close to
the top).

Regards,
Thomas Spahni

On Sun, 24 Aug 2003, Matt W wrote:

 Hi all,

 I'm planning to use MySQL's full-text search for my forum system
 (possibly 5+ million posts). I've been playing with it a lot lately to
 see the performance and functionality and have some
 suggestions/questions.

 First, since a few of you may be wanting to know, here is a thread where
 I was doing some speed/optimization tests and stuff with 3 million
 posts: http://www.sitepointforums.com/showthread.php?threadid=69555
 (From post #12)

 Especially discovered that IN BOOLEAN MODE is really slow if you want to
 sort by relevance (with a lot of matching rows anyway). :-( For
 non-BOOLEAN searches, though, I can get 1000 relevance-sorted results in
 about 8-10 secs. for searches that match a LOT of rows and everything
 has to be read from disk. The full-text processing seems to be very fast
 (max 1-2 seconds of FULLTEXT initialization in PROCESSLIST). It's the
 disk seeks to read random rows from the data file (Sending data) that
 take the most time (7200 RPM/~8ms seek IDE drive). Searches are *MUCH*
 faster when the needed parts of the data file are cached by the OS!

 Anyway, my suggestions:

 --
 *) Min/Max Word Length -- This should really be able to be set on at
 least a per table basis (others may want per index). Right now, people
 that don't have control of the server are at the mercy of the admin to
 change the min/max word length.

 I would also suggest that ft_min_word_len be 3 and ft_max_word_len be 32
 by default. I think these would be better defaults for everyone than the
 current 4/254.

 Or if we could use

 SET ft_min_word_len=n;

 etc. for the current connection it would be nice.


 *) Parser: Indexing of Any and All Numbers -- I think it would be a good
 idea to index any sequence of digits less than ft_min_word_len long.
 Anything numeric could be very relevant for searching -- software
 versions, ages, dates, etc. -- and shouldn't be excluded.

 Even anything *containing* a number (among letters) is probably relevant
 for searching, again, even if it's shorter than ft_min_word_len. e.g.
 RC1, B2, 8oz, F5, etc.


 *) Parser: Other Things -- I've seen people trying to search
 catalog/item/part numbers with pieces of the number separated by -
 or / for example (making some pieces too short). How about indexing
 words that are on either side of a - or / (with no space) no matter
 their length? I don't mean including the - or / in the index -- just the
 usual word characters on either side (I think) as *separate* words, not
 a *single* word with the - or / removed. This would help with things
 like CD-ROM, TCP/IP, etc.

 Single quotes being counted as a word character is another issue I have.
 (I discovered that they're not counted as part of the word when on the
 end(s): 'quote' (thank God! :-))) Example: if someone searches for
 MySQL, it won't find rows with MySQL's. Since possessive's (sic) are the
 biggest problem, how about stripping any 's from the end of the word in
 the index? So MySQL's would be indexed as MySQL.


 *) Always Index Words -- Like it says in the full-text TODO section of
 the manual. This should be able to be set on at least a per table basis
 (again, others may want per index).


 *) Stopword File -- I would also like to be able to define this per
 table somehow.


 *) Miscellaneous -- Mostly functionality related, from the TODO:
 STEMMING! (controlled more finely than server level I hope), multi-byte
 character set support, proximity operators. Anything to get it closer to
 Verity's full-text functionality. ;-)

 Any speed/optimization improvements are welcome for gigs of data,
 especially with IN BOOLEAN MODE (e.g. automagically sorted by relevance
 like a natural language query, although this is probably difficult if a
 wildcard* is used?). And the FULLTEXT index shouldn't always be chosen
 for non-const join types when another index would find less rows first.
 e.g. ... WHERE MATCH ... AND primary_key IN (1, 2); should use the
 PRIMARY key, not the FULLTEXT. :-) But maybe that's not possible, since
 I guess it's a problem auto sorting by relevance if it's not using the
 FULLTEXT index.
 --

 To other full-text users: what do you think of these suggestions?

 To the 

Re: Lots of FULLTEXT stuff (suggestions)

2003-08-25 Thread Matt W
Hi Steven,

Thanks for replying. Your posts that I've found when searching for
FULLTEXT information have had great ideas. :-) Searching millions of
posts efficiently and effectively isn't easy. :-( Heh.


Thinking about ft_min_word_len some more (and how I wish they'd lower
the default), I don't really see that much use/purpose for a minimum
word length. Most 1-3 letter words that you don't want indexed should be
stopwords anyway, right? So why NOT index the ones that are left?
Doesn't seem like it'd make the index much larger to me. BTW, what is
your min_word_len value?

- Original Message -
From: Steven Roussey
Sent: Sunday, August 24, 2003 3:39 PM
Subject: RE: Lots of FULLTEXT stuff (suggestions)

  And the FULLTEXT index shouldn't always be chosen
  for non-const join types when another index would find less rows
 first.

 The short answer is that it doesn't work that way (also, I think this
is
 why there are no composite indexes between integer and fulltext
 indexes). The two systems don't know anything about each other.

Yeah...


  Also, are the current MySQL versions using the 2 level full-text
  index format yet? I'm thinking not?

 No. MySQL 4.1.0 has some low-level support for this, but FTS needs to
 altered (quite a bit I'd guess) to use it. So there is hope that it
will
 come in the 4.1.x line, but no guarantee.

I'm hoping for 4.1.x!

BTW, I wonder if Sergei is the [only] one who handles all the full-text
coding?


  In November 2001, he said the new .frm format would be here this
  year. It's been almost 2 years since then, so when is it do? ;-/

 I think it was pushed back to version 5.1. I'd figure another two
years.

Aww. :-( Yeah, now I remember seeing that in Features planned for 5.1.

Since the acquisition of SAP DB, I think they're supposed to have more
resources for faster development. Maybe some of these things will come
sooner than expected. That's what I'm hoping anyway.

Matt


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Re: Lots of FULLTEXT stuff (suggestions)

2003-08-25 Thread Santino
My TODO List:

1. Speed
More speed in inserts, deletes and selects.
2. Stoplist
1 stoplist for each index
create fulltext index on x( y) stoplist 'mystop'
the stoplist could be divided in section like my.cfg:
[STOPWORD]

[VARIABLES]
ft_min_word_len=3

[CLASSES]
...
[SPECIAL CHARS]
...
The classes section defines word boundary; assigning chars or group
of chars to one class (Alphanumeric, Word Break, punctuation, etc.).
The parser uses this section to extract words to index.
SPECIAL CHARS = replacement of some character ( ASCII  128) to an
equivalent 7bit character ( eg é=e -- index resumé as resume).
3. Proximity

4. TAGs
Recognize tags to stop/restart index: inside the text I can insert
tags to skip indexing a part of text (stop index, start index).
5. UDF or Filter
Cascade UDF. The output of one function can be sent to input of
another function.
Every record can have a different filter.
This solution allows to handle external documents( Text, Word, Excel,
XML, PDF, etc) without storing their data in a table (just store path
 type) crypt them (usefull in CD-ROM distribution).
This filter can also fill other columns with data from the text.
These features with no great speed degrade can improve MySql and open
new opportunity to MySql.
Santino

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]


RE: Lots of FULLTEXT stuff (suggestions)

2003-08-24 Thread Steven Roussey
 Lots of stuff
 STEMMING! (controlled more finely than server level I hope),
multi-byte
 character set support, proximity operators. Anything to get it closer
to
 Verity's full-text functionality. ;-)

Yes, all these things would be nice... :)

 And the FULLTEXT index shouldn't always be chosen
 for non-const join types when another index would find less rows
first.

The short answer is that it doesn't work that way (also, I think this is
why there are no composite indexes between integer and fulltext
indexes). The two systems don't know anything about each other.

 Also, are the current MySQL versions using the 2 level full-text
index
 format yet? I'm thinking not?

No. MySQL 4.1.0 has some low-level support for this, but FTS needs to
altered (quite a bit I'd guess) to use it. So there is hope that it will
come in the 4.1.x line, but no guarantee.

 In November 2001, he said the new .frm format would be here this
 year. It's been almost 2 years since then, so when is it do? ;-/

I think it was pushed back to version 5.1. I'd figure another two years.

--steve-



-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]