Re: Distributed Fulltext?

2002-02-16 Thread David Axmark

On Fri, 2002-02-15 at 02:44, Alex Aulbach wrote:
 Wednesday, from David Axmark:
   Your other point about exact vs. approximate answers is unclear, I expect
   that Google's answers are exact for their currently available indexes at any
   given time.  But even if they are approximate, I'd be happy with that too.
   The scoring on a FULLTEXT search in Mysql is exact but based on a
   formula that is approximate anyway.
 
 No, MySQL returns all data matching a search. Web search engines return
 what they find on one search machine. So you can get different
 results from Google every time you hit refresh if you are routed to
 different machines. This has happened to me when I was looking at the
 number of matches rather than the results themselves.
 
 So we should try to make fulltext searches with a LIMIT between 10 and
 100 fast, to be closer to Google.
 
 I have also heard about some other things web search engines do, since I
 know some people at FAST, but I have forgotten the details.
 
 My opinion is that MySQL itself should never try to find approximate
 matches. This goes against the definition of SQL itself. SQL is a
 fourth-generation language: if you say SELECT, the engine selects. And the
 result has to be exactly what I searched for, every time, on every
 machine, in any combination, given the same data.

 So SQL needs a new language construct for approximate search. But
 what is an approximate search? How is approximate defined?
 

I agree. If there is a good technical reason for doing so, I would be
willing to add a keyword APPROXIMATE for such a thing. But it would have
to be extremely clear, even to the uninitiated, that this is approximate.

I can see such a feature being used by the people who build search engines
with MySQL.

 I don't think it is a good idea to implement it this way.
 Approximation must always be done at the application level, because what
 an approximate result could be is highly application-dependent.
 
  We will try to make every feature as good as possible. But we do have
  limited resources.
 
 Exactly. FTS is not as important as other features, and people who want
 you to include a new feature should think about supporting MySQL with
 money. :-)
 
 But (yes, we support MySQL! :-) I think the need is growing rapidly, because
 the amount of data that has to be indexed grows over the years, and other
 databases have much more experience with it. Currently we can live with
 the speed. Those who cannot live with it should buy better machines,
 rethink their search-engine concept, or support MySQL.
 
 Search engine techniques are *not* trivial, so the last option is in my
 eyes one of the cheapest.
 
 
 Well, there is always the option of sponsoring further fulltext
 development. We have a guy who has been working on the GNU fulltext
 engines who is interested in working on MySQL fulltext. But for the
 moment we cannot afford it.
 
 This was my first thought: people write about speed problems, how to
 cluster, and so on. Things that I would budget weeks and a high TCO for.
 
 But it may be much cheaper to pay MySQL for this. How much do you estimate
 it would cost to implement inverted files? I think this is difficult,
 because Sergei told me that he couldn't use the MySQL index files any more.

If needed we could add a new index file (.MYT ?). But that would require
internal changes of unknown (to me) size. If this is needed for
continued development of fulltext, we will do it.

 I just ask; nothing special in sight, but many questions from everyone who
 needs it. Because FTS is a feature which greatly improves the value of a
 web site. And customers have no problem paying for things they think will
 make them money. FTS is such a thing.
 
 But perhaps if we knew under which circumstances FTS will be improved, it
 would be easier for us to find a way to share the costs for it or find a
 compromise. I also understand if MySQL doesn't want to talk about it
 here.
 
 I think it is also important for us to know how much it can theoretically
 be improved. My calculations showed a theoretical speed-up of a factor of
 100 or so. That is ... wow. But in real life everything is usually slower...

We want to do things as fast as possible in most cases. We will try to
do this for text search also, but the question is how long it will
take before we finish.

  So if some of you are interested in sponsoring this (or know about
  others who might be) write to [EMAIL PROTECTED]
 
 Or like this... maybe we can find customers who need it. I think it's
 possible.
 
 My personal feeling, and my stomach, says that fulltext indexing is a
 feature which needs to be expanded.

We do have Sergei working on it still, and we do plan to expand it
during the coming months. But the way to get our attention in cases like
this is simple: give us some code or money!

/David


-
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/   (the list archive)

Re: Distributed Fulltext?

2002-02-14 Thread Alex Aulbach

Wednesday, from Mike Wexler:
 I don't think that would be appropriate. For example, our site (tias.com) has
 lots of antiques and collectibles. One popular category is jewelry. If
 somebody does a search for gold jewelry and the search engine interprets this
 as anything that mentions gold or jewelry, it is going to match a lot of items.
 It would be nice if we could use EXPLAIN or something like it to get a rough
 estimate of how many results a query would generate, and if it was really bad,
 we could tell the user to be more specific.

This is not a solution, but we approximate it by using the SQL queries

SELECT COUNT(*) FROM table WHERE MATCH(index) AGAINST ('gold')
(returns, e.g., 100)

and

SELECT COUNT(*) FROM table WHERE MATCH(index) AGAINST ('jewelry')
(returns, e.g., 200)

OR search: the result is between 200 and 300 matches.
AND search: the result is between 0 and 100 matches.

The problem is: these count queries take nearly as long as the complete
search. :)

-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris






Re: Distributed Fulltext?

2002-02-14 Thread Alex Aulbach

Wednesday, from David Axmark:
  Your other point about exact vs. approximate answers is unclear, I expect
  that Google's answers are exact for their currently available indexes at any
  given time.  But even if they are approximate, I'd be happy with that too.
  The scoring on a FULLTEXT search in Mysql is exact but based on a
  formula that is approximate anyway.

 No, MySQL returns all data matching a search. Web search engines return
 what they find on one search machine. So you can get different
 results from Google every time you hit refresh if you are routed to
 different machines. This has happened to me when I was looking at the
 number of matches rather than the results themselves.

 So we should try to make fulltext searches with a LIMIT between 10 and
 100 fast, to be closer to Google.

 I have also heard about some other things web search engines do, since I
 know some people at FAST, but I have forgotten the details.

My opinion is that MySQL itself should never try to find approximate
matches. This goes against the definition of SQL itself. SQL is a
fourth-generation language: if you say SELECT, the engine selects. And the
result has to be exactly what I searched for, every time, on every
machine, in any combination, given the same data.

So SQL needs a new language construct for approximate search. But
what is an approximate search? How is approximate defined?

I don't think it is a good idea to implement it this way.
Approximation must always be done at the application level, because what
an approximate result could be is highly application-dependent.

 We will try to make every feature as good as possible. But we do have
 limited resources.

Exactly. FTS is not as important as other features, and people who want
you to include a new feature should think about supporting MySQL with
money. :-)

But (yes, we support MySQL! :-) I think the need is growing rapidly, because
the amount of data that has to be indexed grows over the years, and other
databases have much more experience with it. Currently we can live with
the speed. Those who cannot live with it should buy better machines,
rethink their search-engine concept, or support MySQL.

Search engine techniques are *not* trivial, so the last option is in my
eyes one of the cheapest.


 Well, there is always the option of sponsoring further fulltext
 development. We have a guy who has been working on the GNU fulltext
 engines who is interested in working on MySQL fulltext. But for the
 moment we cannot afford it.

This was my first thought: people write about speed problems, how to
cluster, and so on. Things that I would budget weeks and a high TCO for.

But it may be much cheaper to pay MySQL for this. How much do you estimate
it would cost to implement inverted files? I think this is difficult,
because Sergei told me that he couldn't use the MySQL index files any more.

I just ask; nothing special in sight, but many questions from everyone who
needs it. Because FTS is a feature which greatly improves the value of a
web site. And customers have no problem paying for things they think will
make them money. FTS is such a thing.

But perhaps if we knew under which circumstances FTS will be improved, it
would be easier for us to find a way to share the costs for it or find a
compromise. I also understand if MySQL doesn't want to talk about it
here.

I think it is also important for us to know how much it can theoretically
be improved. My calculations showed a theoretical speed-up of a factor of
100 or so. That is ... wow. But in real life everything is usually slower...


 So if some of you are interested in sponsoring this (or know about
 others who might be) write to [EMAIL PROTECTED]

Or like this... maybe we can find customers who need it. I think it's
possible.

My personal feeling, and my stomach, says that fulltext indexing is a
feature which needs to be expanded.

-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris







Re: Distributed Fulltext?

2002-02-13 Thread James Montebello


I did this at a previous job, and we split the data up more or less
this way (we used a pre-existing item number for the split, which was
essentially random in relation to the text data), with an aggregator that
issued the query X ways, each to a separate box holding 1/X of the data.
The results from each unit were paged and sorted, so all the aggregator
did was a simple merge sort on a page of the set, which was fast.
On a 6M-record dataset, it produced millisecond-range search results.
Not exactly Google-class, but pretty good for 12 Linux boxes, two
programmers, and about six weeks of effort.
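The aggregator's page-wise merge can be sketched roughly like this (a minimal illustration; the shard contents and function names are made up, and each shard is assumed to return its page already sorted by descending score, as in the setup described above):

```python
import heapq
from itertools import islice

def merge_shard_pages(pages, page_size=10):
    """k-way merge sort over per-shard result pages.

    Scores are negated so that heapq.merge's ascending order yields
    descending-score order; only one output page is materialized."""
    streams = [((-score, doc) for score, doc in page) for page in pages]
    top = islice(heapq.merge(*streams), page_size)
    return [(-neg_score, doc) for neg_score, doc in top]

shard1 = [(0.9, "doc-a"), (0.4, "doc-c")]  # page from shard 1, pre-sorted
shard2 = [(0.8, "doc-b"), (0.1, "doc-d")]  # page from shard 2, pre-sorted
print(merge_shard_pages([shard1, shard2], page_size=3))
# [(0.9, 'doc-a'), (0.8, 'doc-b'), (0.4, 'doc-c')]
```

Because each shard pre-sorts its own page, the aggregator only ever touches one page per shard, regardless of the total result size, which is what keeps the merge step fast.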

james montebello

On Thu, 7 Feb 2002, Brian Bray wrote:

 
 It seems to me like the best solution that could be implemented as-is 
 would be to keep a random int column in your table (with a range of, say, 
 1-100) and then have fulltext server 1 pseudo-replicate records with the 
 random number in the range of 1-10, server 2 11-20, server 3 21-30, 
 and so on.
 
 Then run your query on all 10 servers, merge the result sets, and 
 possibly re-sort them if you use the score column.
 
 The problem with splitting the index up by word is that it messes up all 
 your scoring and ranking.  For example, what if you search using 5 
 keywords, all starting with letters from different groups?  You're going 
 to get a pretty bad score for each match, and it could totally break 
 boolean searches.
 
 --
 Brian Bray
 
 
 
 
 Brian DeFeyter wrote:
  On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
  [snip]
  
 Wouldn't be too tough to write a little query routing system if you are using 
 perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
 server so it routes the query to several places and returns a single result 
 set. Ordering could be achieved as well. I'm sure there are commercial 
 packages out there as well. I don't see why the individual database servers 
 would need to do anything special.
 
  [snip]
  
  If I'm understanding you correctly, I think you're referring to routing
  based on the first character of the word. That would work for cases
  where the query is searching for a word that begins with a certain
  character... However, fulltext searches also return results with the term
  in the middle.
  
  ie: a search for 'foo' could return:
  foo.txt
  foobar
  
  but also could return:
  thisisfoo
  that_is_foolish
  
  I could be wrong, but it's my understanding that MySQL stores its
  fulltext index based on all the 'unique words' found. For such a system
  as you mentioned above, you'd probably have to create your own fulltext
  indexing system to determine: a) where to store the data 'segments' and
  b) how to route queries.  It seems like this could probably be done much
  more efficiently inside of the server.
  
   - Brian
  
  
  




Re: Distributed Fulltext?

2002-02-13 Thread Brian Bray


It seems to me like the best solution that could be implemented as-is 
would be to keep a random int column in your table (with a range of, say, 
1-100) and then have fulltext server 1 pseudo-replicate records with the 
random number in the range of 1-10, server 2 11-20, server 3 21-30, 
and so on.

Then run your query on all 10 servers, merge the result sets, and 
possibly re-sort them if you use the score column.

The problem with splitting the index up by word is that it messes up all 
your scoring and ranking.  For example, what if you search using 5 
keywords, all starting with letters from different groups?  You're going 
to get a pretty bad score for each match, and it could totally break 
boolean searches.
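A minimal sketch of that random-column scheme (all names and the row format are made up for illustration; in practice the column would be filled at INSERT time and each fulltext server would replicate only its slice):

```python
import random

NUM_SERVERS = 10
RANGE_MAX = 100                      # the extra int column holds 1..100
SLICE = RANGE_MAX // NUM_SERVERS     # width of each server's value range

def shard_column_value(rng=random):
    """Value stored in the random int column when a row is inserted."""
    return rng.randint(1, RANGE_MAX)

def server_for(value):
    """Which fulltext server pseudo-replicates a row with this value
    (1-based): server 1 gets 1-10, server 2 gets 11-20, and so on."""
    return (value - 1) // SLICE + 1

def merge_results(per_server_rows):
    """Merge the per-server result sets and re-sort by score, as suggested."""
    merged = [row for rows in per_server_rows for row in rows]
    return sorted(merged, key=lambda row: row["score"], reverse=True)

assert server_for(1) == 1 and server_for(10) == 1
assert server_for(11) == 2 and server_for(100) == 10
```

Since each server indexes a random tenth of the rows rather than a slice of the vocabulary, every server scores matches against its full local word statistics, which is why this split avoids the ranking problems described above.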

--
Brian Bray




Brian DeFeyter wrote:
 On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
 [snip]
 
Wouldn't be too tough to write a little query routing system if you are using 
perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
server so it routes the query to several places and returns a single result 
set. Ordering could be achieved as well. I'm sure there are commercial 
packages out there as well. I don't see why the individual database servers 
would need to do anything special.

 [snip]
 
 If I'm understanding you correctly, I think you're referring to routing
 based on the first character of the word. That would work for cases
 where the query is searching for a word that begins with a certain
 character... However, fulltext searches also return results with the term
 in the middle.
 
 ie: a search for 'foo' could return:
 foo.txt
 foobar
 
 but also could return:
 thisisfoo
 that_is_foolish
 
 I could be wrong, but it's my understanding that MySQL stores its
 fulltext index based on all the 'unique words' found. For such a system
 as you mentioned above, you'd probably have to create your own fulltext
 indexing system to determine: a) where to store the data 'segments' and
 b) how to route queries.  It seems like this could probably be done much
 more efficiently inside of the server.
 
  - Brian
 
 
 




Re: Distributed Fulltext?

2002-02-13 Thread Tod Harter

On Thursday 07 February 2002 14:53, Brian DeFeyter wrote:
 Has anyone made a suggestion or thought about ways to distribute
 databases which focus on fulltext indexes?

 fulltext indexes do a good job of indexing a moderate amount of data,
 but when you get a lot of data to be indexed, the queries slow down
 significantly.

 I have an example table, with about 90 million rows.. and has a fulltext
 index on a varchar(100) field. A single-word query which would return
 approx 300k results takes an average of 15 seconds. A query with smaller
 results (~ 10k) can be as quick as 1 sec.. which I would consider
 acceptable.

 Has anyone thought about splitting the data into distributed files or even
 machines? I.e.: something as simple as 'words' starting with 'X' are split
 into a-h, i-p, q-z... or something more advanced? (Maybe mysqld could
 automatically split results based on (#results per unique 'word' /
 desired # of 'split files/machines').) Would such a system give any
 advantages in search speed and concurrent query scalability? I
 haven't looked at the fulltext internals, so I don't know whether such
 query routing could take place or not.

 If nothing else, does anyone else have experience with a table of this
 size or even larger? What kind of tuning have you done?

 Thanks,

  - Brian

Wouldn't be too tough to write a little query routing system if you are using 
perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
server so it routes the query to several places and returns a single result 
set. Ordering could be achieved as well. I'm sure there are commercial 
packages out there as well. I don't see why the individual database servers 
would need to do anything special.






Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler



Steve Rapaport wrote:
 On Friday 08 February 2002 06:14 pm, James Montebello wrote:
 
Distribution is how Google gets its speed.  You say clustering won't
solve the problem, but distributing the indicies across many processors
*is* going to gain you a huge speed increase through sheer parallelism.

 
 True, but not enough.  The BEST parallelism can do  in a compute-bound
 application is divide the time by the number of processors.  That's assuming
 a PERFECT routing system.  (Correct me if I'm wrong here)

There are actually some exceptions. Specifically, if you have a very large 
problem set and parallelism allows you to move the problem set into a faster 
storage medium, you can sometimes see greater performance increases. Let's say 
you are doing a fulltext search, and you have a 20 GB fulltext index. It may 
not be feasible to build a machine with 20 GB of RAM for storing the index in 
RAM. But it might be more feasible to store 1/20th of the index in each of 20 
1 GB machines. And with RAM being roughly 1000 times as fast as hard disk, you 
could get a huge win.
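The back-of-envelope version of that argument, using the message's own illustrative numbers (the 1000x RAM:disk ratio is the stated assumption, not a measurement):

```python
# All numbers are the illustrative assumptions from the message above.
index_gb = 20
machines = 20
ram_per_machine_gb = 1
ram_vs_disk_speedup = 1000  # assumed RAM:disk speed ratio

# The whole index fits in the machines' combined RAM:
assert index_gb <= machines * ram_per_machine_gb

# Parallelism alone only divides the work across the machines...
parallel_only = machines                        # ~20x
# ...but each slice now lives in RAM instead of on disk, multiplying again.
ram_resident = machines * ram_vs_disk_speedup   # ~20000x
print(parallel_only, ram_resident)
```

This is why sharding can beat the "divide by the number of processors" ceiling: the win comes not from the extra CPUs but from the working set dropping into a faster tier of storage.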


 
 So to make the routing system + parallelism add up to 
 a MILLION times better performance, you would need at
 least a MILLION processors.  I doubt that even Google is
 doing that.  
 
 
Google uses thousands of processors to handle its index, and any given
search is going to be spread over 100s of processors.  

 
 Right, so we can expect Google to do, say, 10,000 times  (10^4) better
 than Mysql at a Fulltext search.  But in fact we're seeing 
 a million, 10^6, being generous.  It's that extra factor
 of a hundred (more likely a thousand, my estimates
 were very generous) that I'm getting all fussy about.
 
 
Asking
a general-purpose RDBMS to be really good at 10 different things is asking
a bit much.  Fulltext searches are well down the list.

 
 Here we get out of the realm of hard numbers and into opinions.   But here's
 mine:  If mysql bothers to support fulltext searches, it's presumably because 
 there's some demand for them in some circumstances.  The level of scaling and
 optimization that could reasonably be expected:  What it takes to make the
 feature useful (i.e. perform similarly) in similar cases to other Mysql 
 features.  With a regular index, I can do a hash lookup on 23 million records
 subsecond.  With a fulltext index (on a small field, only 40 characters) my
 time slips to 3 to 180 seconds.  That extra little factor of 100 is my 
 problem.  Distributing over 4 processors wouldn't really help much.  And 
 because people don't always type a company name the exact same way,
 FULLTEXT really is the best way to do this.
 
 So Monty et al, my request is this:  please put on the wish list, enhancement 
 of FULLTEXT search  to approximately match the performance of an indexed
 search on 25 million records, on the same hardware and with other things
 held equal.
 
 Steve Rapaport
 




Re: Distributed Fulltext?

2002-02-13 Thread alec . cawley



 Why is it that Altavista can index terabytes overnight and return
 a fulltext boolean for the WHOLE WEB
 within a second, and Mysql takes so long?

I don't know about Altavista, but if you read up on Google, they do indeed
do some sort of spreading of keywords across multiple machines - last I saw
they had over 10,000 machines.

But you have to ask the question as to whether this would be an appropriate
development for MySQL. This seems to me to be a very complex development
of interest to a very small subset of MySQL customers. MySQL, it seems to me,
specialises in doing the mainstream database manipulations used by many
people simply, quickly, and (my most important criterion) reliably. Such a
development would, IMO, be away from the whole trend of MySQL, and shouldn't
happen before triggers, stored procedures and the like - if ever. A
multi-machine distributed database is a wholly different animal - IMO again.
If I were going there, I wouldn't start from here - it would need a
clean-sheet look at the whole problem. If you really need such searches,
Google survives by licensing that technology - speak to them.

 Alec Cawley








Re: Distributed Fulltext?

2002-02-13 Thread Steve Rapaport

Ooops, factual error:

  If, say, Google, can search 2 trillion web pages, averaging say 70k
  bytes each, in 1 second, and Mysql can search 22 million records, with
  an index on 40 bytes each, in 3 seconds (my experience) on a good day,
  what's the order of magnitude difference?  Roughly 10^9.


Google has only 2 billion (with a B) web pages.

So the order of magnitude difference is only 10^6.  But that still
leaves a mysql Fulltext search a MILLION times slower than Google.

Someone correctly pointed out today that it's not Mysql's job
to be Google, and I agree.  But it seems to me that it would be
fair for mysql to be able to handle searches in under 1 second
for databases 1 millionth the size of Google.  All I want here
is a decent flexible phone book lookup, not a search engine.

-steve





Re: Distributed Fulltext?

2002-02-13 Thread Steve Rapaport

I said:
  Why is it that Altavista can index terabytes overnight and return
  a fulltext boolean for the WHOLE WEB
  within a second, and Mysql takes so long?

On Friday 08 February 2002 08:56, Vincent Stoessel wrote:

 Apples and oranges.

Yeah, I know.  But let's see if we can make some distinctions.
If, say, Google, can search 2 trillion web pages, averaging say 70k bytes 
each, in 1 second, and Mysql can search 22 million records, with an index
on 40 bytes each, in 3 seconds (my experience) on a good day,
what's the order of magnitude difference?  Roughly 10^9.

 Have you seen the /hardware/ they run that enterprise with?
Irrelevant; you're unlikely to get 9 orders of magnitude difference from
faster hardware, or even from clustering.

 Also, their software is optimized for full text searches and that
 is /all/ they do. Mysql is an SQL database and is optimized as such.

Absolutely granted. You are completely right. 
And I don't expect the data format to change.

 BUT:  thought experiment:

When Mysql decides to generate a FULLTEXT index, it
is using an index file, the .MYI file, which can have ANY FORMAT
it wants to.  If the .MYI format is poorly optimized for fulltext
searches, they can improve the format.  They can even introduce
a new index file type .MYF solely for optimizing fulltext searches.

None of this need have any impact on the data file format or the 
SQL search optimizations, and yet it could still improve the search
speed for fulltext.  It might not help as much for the slow indexing,
but it could certainly improve the performance of the search.
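For illustration, the kind of structure such a search-oriented index file could hold is an inverted file, as mentioned elsewhere in this thread. A toy sketch (this is not MySQL's actual .MYI layout; all names and data are made up):

```python
from collections import defaultdict

# Toy inverted file: word -> sorted list of row ids. A boolean search then
# becomes dictionary lookups plus a set intersection instead of a scan.
def build_index(rows):
    index = defaultdict(list)
    for row_id, text in rows:
        for word in sorted(set(text.lower().split())):
            index[word].append(row_id)
    return index

def search_and(index, *words):
    """Row ids containing every word (a boolean AND search)."""
    postings = [set(index.get(word, ())) for word in words]
    return sorted(set.intersection(*postings)) if postings else []

rows = [(1, "gold jewelry"), (2, "silver jewelry"), (3, "gold coin")]
idx = build_index(rows)
print(search_and(idx, "gold", "jewelry"))  # [1]
```

Building the index is the slow part (every row must be tokenized), but each lookup afterwards costs roughly the size of the posting lists involved rather than the size of the table, which is the asymmetry the paragraph above is pointing at.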

Thinking out loud...
Steve






Re: Distributed Fulltext?

2002-02-13 Thread David Axmark

On Tue, 2002-02-12 at 15:38, Steve Rapaport wrote:
 David Axmark writes:
 
  So the standard answer with Apples and Oranges certainly apply here!
 
 More like Äpplen och Apelsiner, that is, different but similar.  You Swedish 
 guys should know.  Thanks for answering, David, I appreciate the attention
 from a founder.
 
 I also appreciate your point that Google is updating continuously and
 therefore not always caught up to the current state of the web.
 But isn't that a problem with indexing speed, not with search speed?
 Their search speed is still amazing, as is Altavista, and most of the
 other engines.
 
 Your other point about exact vs. approximate answers is unclear, I expect
 that Google's answers are exact for their currently available indexes at any
 given time.  But even if they are approximate, I'd be happy with that too.  
 The scoring on a FULLTEXT search in Mysql is exact but based on a
 formula that is approximate anyway.

No, MySQL returns all data matching a search. Web search engines return
what they find on one search machine. So you can get different
results from Google every time you hit refresh if you are routed to
different machines. This has happened to me when I was looking at the
number of matches rather than the results themselves.

So we should try to make fulltext searches with a LIMIT between 10 and
100 fast, to be closer to Google.

I have also heard about some other things web search engines do, since I
know some people at FAST, but I have forgotten the details.

 I'll summarize this thread best I can.
 
 From the math I used, we started with my estimate of 10^9,
 which was mistaken.  The real figure was 10^6, that is, Google
 searches fulltext about a million times faster than Mysql.
 Then we used Google's 1 machines +DRAM indexing to reduce the
 gap to 10^2, or 100 times faster.  

I would say we should reduce it even further but that could be
discussed.

 It turns out that 100 times is about the factor that is causing
 my application problems.  If it just ran 100 times faster it would be
 about as fast as a regular indexed search, and I'd
 be happy.
 
 A few people suggested that MySQL shouldn't try to be faster;
 I (and some high-support customers like Mike Wexler) disagreed.
 And Alex Aulbach, bless him, actually did his homework and showed that
 things could be much improved with smart index techniques like
 inverted files.
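Alex's inverted-file idea can be sketched in a few lines of Python (a toy illustration only, not MySQL's actual index format): each word maps to a posting list of row ids, so a query touches just the lists for its terms rather than every row.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each word to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.lower().split():
            index[word].add(row_id)
    return index

def search(index, query):
    """Return row ids containing every query word (boolean AND)."""
    postings = [index.get(w, set()) for w in query.lower().split()]
    if not postings:
        return set()
    return set.intersection(*postings)

rows = {1: "gold jewelry ring", 2: "silver jewelry", 3: "gold coin"}
idx = build_inverted_index(rows)
print(search(idx, "gold jewelry"))  # {1}
```

The point is that only the postings for "gold" and "jewelry" are touched; the cost scales with the length of those lists, not with the size of the table.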

We will try to make every feature as good as possible. But we do have
limited resources.

 Then Sergei Golubchik wrote back to say he had taken some of the good ideas
 and inserted them into the TODO list, although he had higher priorities
 at the time.
 
 And I was satisfied for now, although my application still isn't working
 satisfactorily due to a really slow and CPU-hungry FULLTEXT search.

Well, there is always the option of sponsoring further fulltext
development. We have a guy who has been working on the GNU fulltext
engines and who is interested in working on MySQL fulltext. But for the
moment we cannot afford it.

So if some of you are interested in sponsoring this (or know about
others who might be) write to [EMAIL PROTECTED]

 I think that's our story so far.
 
 Steve Rapaport
 Director, Technical Services
 A-Tono


-
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/   (the list archive)

To request this thread, e-mail [EMAIL PROTECTED]
To unsubscribe, e-mail [EMAIL PROTECTED]
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php




Re: Distributed Fulltext?

2002-02-13 Thread Steven Roussey

 [comparisons to Google...]

While any speedup of a full-table fulltext search would be helpful
and useful, there are instances where the search is intersected with
another column; the problem of search is then more complex, but it
also leads to potential optimizations.

In our case we rarely do searches that use just the FTS index. More
commonly, we do searches based on other columns plus an FTS-indexed text
column. The WHERE-clause constraints on the non-text columns usually
eliminate 99% of the rows. Fulltext indexes cannot be combined with
numeric columns in a single composite index, so as far as I can tell,
MySQL performs the FTS first and then filters the results to meet the WHERE
conditions. I believe this pages a much larger amount of the index into
main memory than might otherwise be needed (and since the indexes are
larger than available memory, and a zillion other tables are accessed
simultaneously, this tends to affect all queries).
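A sketch of the filtering Steven describes, in Python with made-up in-memory structures (not MySQL internals): apply the cheap, selective WHERE restriction first and then narrow by each word's posting list, so only a small slice of the fulltext index is ever consulted.

```python
def filtered_fts(postings, query_words, allowed_ids):
    """Intersect fulltext matches with the row ids that already
    passed the selective non-text WHERE restriction."""
    result = set(allowed_ids)
    for word in query_words:
        result &= postings.get(word, set())
        if not result:  # bail out as soon as the intersection is empty
            break
    return result

postings = {"gold": {1, 3, 7}, "jewelry": {1, 2, 7}}
in_stock = {1, 2, 3}  # rows surviving the numeric WHERE clause
print(filtered_fts(postings, ["gold", "jewelry"], in_stock))  # {1}
```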

So besides the obvious desire to lower the size of indexes, it would be
beneficial if a combined index could lower the number of matching rows
and the amount of the index needed to be in memory.

I actually find it odd for people not to be using secondary columns in a
fulltext search, other than to do something like Google (which does not
lend itself to using SQL in the first place).

Of course, I have a myopic view stemming from my own requirements and
experience.

All that said, I'm glad to have the FTS we have. 

Thanks Sergei for such a wonderful improvement to LIKE '%word%'!

Sincerely,
Steven Roussey
http://Network54.com/?pp=e 

PS - is FTS search and update O(n^2)?







Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler

My understanding is that part of how Google and AltaVista get such high speeds 
is that they keep everything in memory. Is it possible to create a HEAP table with a 
fulltext index? If so, does the fulltext index take advantage of being in 
memory? For example, I would imagine that if you were keeping the whole index in 
memory, details like the index page size and the format of the pointers/record 
numbers would be different.

Then you could just do something roughly like (I know the syntax is a little off):

CREATE TABLE fooFast TYPE=HEAP SELECT * FROM fooSlow;
ALTER TABLE fooFast ADD FULLTEXT (a, b, c);

Or maybe you could just have fooSlow on one server, and then have it replicated 
on N other servers. But on the other servers you could alter the table type so 
it was a HEAP table. So you would have one persistent table and a bunch of 
replicated HEAP tables, and all the searches could go against the HEAP tables.


Brian Bray wrote:
 It seems to me like the best solution that could be implemented as-is 
 would be to keep a random int column in your table (with a range of say 
 1-100) and then have fulltext server 1 pseudo-replicate records with 
 the random number in the range 1-10, server 2 11-20, server 3 
 21-30, and so on.
 
 Then run your query on all 10 servers and merge the result sets, 
 possibly re-sorting them if you use the score column.
 
 The problem with splitting the index up by word is that it messes up all 
 your scoring and ranking.  For example, what if you search using 5 
 keywords, all starting with letters from different groups?  You're going 
 to get a pretty bad score for each match, and it could totally break 
 boolean searches.
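Brian's fan-out-and-merge scheme might look roughly like this in Python (a hypothetical sketch; `FakeShard` stands in for a real fulltext server, and the scores are invented): each shard returns its hits sorted by descending score, and the caller merge-sorts them.

```python
import heapq

def search_shards(servers, query, limit=10):
    """Send the query to every shard, then merge the per-shard result
    lists (each sorted by descending score) into one ranked list."""
    per_shard = [server.search(query) for server in servers]
    merged = heapq.merge(*per_shard, reverse=True)
    return list(merged)[:limit]

class FakeShard:
    """Stand-in for one fulltext server holding a slice of the rows;
    returns (score, row_id) pairs sorted by descending score."""
    def __init__(self, hits):
        self.hits = sorted(hits, reverse=True)
    def search(self, query):
        return self.hits

shards = [FakeShard([(0.9, 11), (0.2, 12)]), FakeShard([(0.7, 21)])]
print(search_shards(shards, "gold", limit=2))  # [(0.9, 11), (0.7, 21)]
```

Because every shard sees the full query, per-row scores are computed against a complete slice of the data, which is what keeps the ranking sane compared with splitting the index by word.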
 
 --
 Brian Bray
 
 
 
 
 Brian DeFeyter wrote:
 
On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]


Wouldn't be too tough to write a little query routing system if you are using 
perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
server so it routes the query to several places and returns a single result 
set. Ordering could be achieved as well. I'm sure there are commercial 
packages out there as well. I don't see why the individual database servers 
would need to do anything special.


[snip]

If I'm understanding you correctly, I think you're referring to routing
based on the first character of the word. That would work for cases
where the query is searching for a word that begins with a certain
character.. however, fulltext searches also return results with the term
in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its
fulltext index based on all the 'unique words' found. For such a system
as you mentioned above, you'd probably have to create your own fulltext
indexing system to determine: a) where to store the data 'segments' and
b) how to route queries.  It seems like this could probably be done much
more efficiently inside the server.

 - Brian






 
 
 
 
 
 







Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler



Steve Rapaport wrote:

 
 Someone correctly pointed out today that it's not MySQL's job
 to be Google, and I agree.  But it seems to me that it would be
 fair for MySQL to be able to handle searches in under 1 second
 for databases 1 millionth the size of Google.  All I want here
 is a decent flexible phone book lookup, not a search engine.

That may be what you want. But I want a search engine. For my database. I want 
to be able to search the million records of inventory in our database and 
find the items with certain keywords that are currently available for sale, in 
the venue the user is shopping in, and within the price range the user specifies. 
This is a combination of database and fulltext search that I think is exactly in 
MySQL's marketplace.

And the current situation, where a user fulltext query can take between 1 ms and 
100 secs with no easy way for the application to tell ahead of time, is really 
unacceptable.

So I think several things need to be done:

1) Fulltext search needs to be faster (they are working on it).
2) There needs to be some way (a la EXPLAIN) of figuring out roughly how long a 
particular fulltext search will take. If I could do

EXPLAIN SELECT * FROM inventory WHERE MATCH (title, description) AGAINST ('$userQuery')

and get something useful for predicting search time, then I could at least say 
something to the user like:

That query matches too many records. Please pick a more specific query.

rather than hosing the whole database for 3 or 4 common words.

3) Find ways to use replication and memory to speed up full-text queries. See my 
previous mail about heap tables for a possible example.
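MySQL offers no such estimate for MATCH ... AGAINST, but an application-level approximation is possible if you maintain your own word-to-row-count table (a rough sketch under that assumption; the names and numbers here are invented):

```python
def estimated_matches(doc_freq, query_words, total_rows):
    """Crude upper bound on how many rows an OR-style fulltext query
    can match: sum of per-word row counts, capped at the table size."""
    return min(sum(doc_freq.get(w, 0) for w in query_words), total_rows)

doc_freq = {"gold": 40_000, "jewelry": 55_000, "scrimshaw": 12}
estimate = estimated_matches(doc_freq, ["gold", "jewelry"], 1_000_000)
if estimate > 50_000:
    print("That query matches too many records. Please be more specific.")
```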








Re: Distributed Fulltext?

2002-02-13 Thread Brian DeFeyter

I sorta like that idea. I don't know exactly what you can and can't do
as far as indexing inside of HEAP tables.. but the index size would
likely differ from the on-disk index. Then you can expand the idea and
use the X/(num slices) on (num slices) boxes technique.. sending the
query to each and compiling all of the results.

Your comment about only having a few common words that are searched
makes me wonder if a reverse-stopword function would be valuable (ie:
only index words in a list you generate). It could probably be done other
ways too with what's available. Maybe a pseudo-bitmap index
inside a MySQL table?
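The reverse-stopword idea could be prototyped at the application layer along these lines (a hypothetical Python sketch; MySQL itself has no such option, so the allow list is something you would curate yourself):

```python
def index_terms(text, allow_list):
    """Reverse-stopword tokenizer: keep only words found in a curated
    allow list, instead of dropping words from a stop list."""
    return [w for w in text.lower().split() if w in allow_list]

allow = {"gold", "jewelry", "antique"}
print(index_terms("Rare antique GOLD pocket watch", allow))
# ['antique', 'gold']
```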

 - bdf

On Wed, 2002-02-13 at 12:09, Mike Wexler wrote:
 My understanding is that part of how Google and AltaVista get such high speeds 
 is that they keep everything in memory. Is it possible to create a HEAP table with a 
 fulltext index? If so, does the fulltext index take advantage of being in 
 memory? For example, I would imagine that if you were keeping the whole index in 
 memory, details like the index page size and the format of the pointers/record 
 numbers would be different.
 
 Then you could just do something roughly like (I know the syntax is a little off):
 
 CREATE TABLE fooFast TYPE=HEAP SELECT * FROM fooSlow;
 ALTER TABLE fooFast ADD FULLTEXT (a, b, c);
 
 Or maybe you could just have fooSlow on one server, and then have it replicated 
 on N other servers. But on the other servers you could alter the table type so 
 it was a HEAP table. So you would have one persistent table and a bunch of 
 replicated HEAP tables, and all the searches could go against the HEAP tables.
 
 
 Brian Bray wrote:
  It seems to me like the best solution that could be implemented as-is 
  would be to keep a random int column in your table (with a range of say 
  1-100) and then have fulltext server 1 pseudo-replicate records with 
  the random number in the range 1-10, server 2 11-20, server 3 
  21-30, and so on.
  
  Then run your query on all 10 servers and merge the result sets, 
  possibly re-sorting them if you use the score column.
  
  The problem with splitting the index up by word is that it messes up all 
  your scoring and ranking.  For example, what if you search using 5 
  keywords, all starting with letters from different groups?  You're going 
  to get a pretty bad score for each match, and it could totally break 
  boolean searches.
  
  --
  Brian Bray
  
  
  
  
  Brian DeFeyter wrote:
  
 On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
 [snip]
 
 
 Wouldn't be too tough to write a little query routing system if you are using 
 perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
 server so it routes the query to several places and returns a single result 
 set. Ordering could be achieved as well. I'm sure there are commercial 
 packages out there as well. I don't see why the individual database servers 
 would need to do anything special.
 
 
 [snip]
 
 If I'm understanding you correctly, I think you're referring to routing
 based on the first character of the word. That would work for cases
 where the query is searching for a word that begins with a certain
 character.. however, fulltext searches also return results with the term
 in the middle.
 
 ie: a search for 'foo' could return:
 foo.txt
 foobar
 
 but also could return:
 thisisfoo
 that_is_foolish
 
 I could be wrong, but it's my understanding that MySQL stores its
 fulltext index based on all the 'unique words' found. For such a system
 as you mentioned above, you'd probably have to create your own fulltext
 indexing system to determine: a) where to store the data 'segments' and
 b) how to route queries.  It seems like this could probably be done much
 more efficiently inside the server.
 
  - Brian
 
 
 
 
 
 
  
  
  
  
  
  
 
 
 
 

Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler



Brian DeFeyter wrote:
 I sorta like that idea. I don't know exactly what you can and can't do
 as far as indexing inside of HEAP tables.. but the index size would
 likely differ from the on-disk index. Then you can expand the idea and
 use the X/(num slices) on (num slices) boxes technique.. sending the
 query to each and compiling all of the results.
 
 Your comment about only having a few common words that are searched
 makes me wonder if a reverse-stopword function would be valuable (ie:
 only index words in a list you generate). It could probably be done other
 ways too with what's available. Maybe a pseudo-bitmap index
 inside a MySQL table?

I don't think that would be appropriate. For example, our site (tias.com) has 
lots of antiques and collectibles. One popular category is jewelry. If 
somebody does a search for gold jewelry and the search engine interprets this 
as anything that mentions gold or jewelry, it is going to match a lot of items. 
It would be nice if we could use EXPLAIN or something like it to get a rough 
estimate of how many results a query would generate, and if it was really bad, 
we could tell the user to be more specific.


 
  - bdf
 
 On Wed, 2002-02-13 at 12:09, Mike Wexler wrote:
 
My understanding is that part of how Google and AltaVista get such high speeds 
is that they keep everything in memory. Is it possible to create a HEAP table with a 
fulltext index? If so, does the fulltext index take advantage of being in 
memory? For example, I would imagine that if you were keeping the whole index in 
memory, details like the index page size and the format of the pointers/record 
numbers would be different.

Then you could just do something roughly like (I know the syntax is a little off):

CREATE TABLE fooFast TYPE=HEAP SELECT * FROM fooSlow;
ALTER TABLE fooFast ADD FULLTEXT (a, b, c);

Or maybe you could just have fooSlow on one server, and then have it replicated 
on N other servers. But on the other servers you could alter the table type so 
it was a HEAP table. So you would have one persistent table and a bunch of 
replicated HEAP tables, and all the searches could go against the HEAP tables.


Brian Bray wrote:

It seems to me like the best solution that could be implemented as-is 
would be to keep a random int column in your table (with a range of say 
1-100) and then have fulltext server 1 pseudo-replicate records with 
the random number in the range 1-10, server 2 11-20, server 3 
21-30, and so on.

Then run your query on all 10 servers and merge the result sets, 
possibly re-sorting them if you use the score column.

The problem with splitting the index up by word is that it messes up all 
your scoring and ranking.  For example, what if you search using 5 
keywords, all starting with letters from different groups?  You're going 
to get a pretty bad score for each match, and it could totally break 
boolean searches.

--
Brian Bray




Brian DeFeyter wrote:


On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]



Wouldn't be too tough to write a little query routing system if you are using 
perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
server so it routes the query to several places and returns a single result 
set. Ordering could be achieved as well. I'm sure there are commercial 
packages out there as well. I don't see why the individual database servers 
would need to do anything special.



[snip]

If I'm understanding you correctly, I think you're referring to routing
based on the first character of the word. That would work for cases
where the query is searching for a word that begins with a certain
character.. however, fulltext searches also return results with the term
in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its
fulltext index based on all the 'unique words' found. For such a system
as you mentioned above, you'd probably have to create your own fulltext
indexing system to determine: a) where to store the data 'segments' and
b) how to route queries.  It seems like this could probably be done much
more efficiently inside the server.

- Brian












Re: Distributed Fulltext?

2002-02-13 Thread Brian DeFeyter

On Wed, 2002-02-13 at 16:39, Mike Wexler wrote:
 
 
 Brian DeFeyter wrote:
  I sorta like that idea. I don't know exactly what you can and can't do
  as far as indexing inside of HEAP tables.. but the index size would
  likely differ from the on-disk index. Then you can expand the idea and
  use the X/(num slices) on (num slices) boxes technique.. sending the
  query to each and compiling all of the results.
  
  Your comment about only having a few common words that are searched
  makes me wonder if a reverse-stopword function would be valuable (ie:
  only index words in a list you generate). It could probably be done other
  ways too with what's available. Maybe a pseudo-bitmap index
  inside a MySQL table?
 
 I don't think that would be appropriate. For example, our site (tias.com) has 
 lots of antiques and collectibles. One popular category is jewelry. If 
 somebody does a search for gold jewelry and the search engine interprets this 
 as anything that mentions gold or jewelry, it is going to match a lot of items. 
 It would be nice if we could use EXPLAIN or something like it to get a rough 
 estimate of how many results a query would generate, and if it was really bad, 
 we could tell the user to be more specific.

The FOUND_ROWS() function in mysql-4.0.0+ might help you out in this
case.







Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler



Brian DeFeyter wrote:
 On Wed, 2002-02-13 at 16:39, Mike Wexler wrote:
 

Brian DeFeyter wrote:

I sorta like that idea. I don't know exactly what you can and can't do
as far as indexing inside of HEAP tables.. but the index size would
likely differ from the on-disk index. Then you can expand the idea and
use the X/(num slices) on (num slices) boxes technique.. sending the
query to each and compiling all of the results.

Your comment about only having a few common words that are searched
makes me wonder if a reverse-stopword function would be valuable (ie:
only index words in a list you generate). It could probably be done other
ways too with what's available. Maybe a pseudo-bitmap index
inside a MySQL table?

I don't think that would be appropriate. For example, our site (tias.com) has 
lots of antiques and collectibles. One popular category is jewelry. If 
somebody does a search for gold jewelry and the search engine interprets this 
as anything that mentions gold or jewelry, it is going to match a lot of items. 
It would be nice if we could use EXPLAIN or something like it to get a rough 
estimate of how many results a query would generate, and if it was really bad, 
we could tell the user to be more specific.

 
 The FOUND_ROWS() function in mysql-4.0.0+ might help you out in this
 case.

That only works after you've waited 2 minutes for the query to complete. I want 
something (like EXPLAIN) that is fairly quick and that you can run ahead of time.


 
 







Re: Distributed Fulltext?

2002-02-13 Thread hooker

 Steve Rapaport wrote:
  Someone correctly pointed out today that it's not MySQL's job
  to be Google, and I agree.  But it seems to me that it would be
  fair for MySQL to be able to handle searches in under 1 second
  for databases 1 millionth the size of Google.  All I want here
  is a decent flexible phone book lookup, not a search engine.
 
 That may be what you want. But I want a search engine. For my database. I want 
 to be able to search the million records of inventory in our database and 
 find the items with certain keywords that are currently available for sale, in 
 the venue the user is shopping in, and within the price range the user specifies. 
 This is a combination of database and fulltext search that I think is exactly in 
 MySQL's marketplace.

I think that there are two different requirements here. A database able to handle
fulltext searches is becoming almost mandatory, and MySQL achieves this, within
limits. They appear to recognise the current limitations of the product and are
working to make it more capable, and that's exactly what I'd expect from a
company with good market awareness.

However, a full search engine is a different creature. MySQL is unlikely ever
to equal Google. That requires tailored software and data structures, and more
hardware than most of us are likely to have access to.

MySQL has a stop list; in my view it also needs a yes list, so that indexing on
a closed vocabulary can be implemented easily. The work that I've done trying
to implement fulltext searching (before MySQL offered the functionality) shows
what pretty much everyone else seems to have found: regardless of the capabilities
of the underlying engine, fulltext retrieval only works well with very large
document sets. With smaller volumes of text, you only get good results from a
closed vocabulary - keywords, if you will.

If MySQL could be restricted to a list of words/phrases to be indexed, I think
a lot of the performance issues would be reduced, but it's certainly not 
the answer to every problem.

Paul Wilson
iiNet Ltd






Re: Distributed Fulltext?

2002-02-13 Thread hooker

 While any speedup of a full-table fulltext search would be helpful
 and useful, there are instances where the search is intersected with
 another column; the problem of search is then more complex, but it
 also leads to potential optimizations.
 
 In our case we rarely do searches that use just the FTS index. More
 commonly, we do searches based on other columns plus an FTS-indexed text
 column. The WHERE-clause constraints on the non-text columns usually
 eliminate 99% of the rows. Fulltext indexes cannot be combined with
 numeric columns in a single composite index, so as far as I can tell,
 MySQL performs the FTS first and then filters the results to meet the WHERE
 conditions. I believe this pages a much larger amount of the index into
 main memory than might otherwise be needed (and since the indexes are
 larger than available memory, and a zillion other tables are accessed
 simultaneously, this tends to affect all queries).

There's also a consideration about splitting the workload between the
RDBMS and the application. I've improved the performance of applications
by simplifying the SQL and making the Perl/C do more of the work. When
it is successful (usually), it's because the application can be written
to understand the implications behind the query results, and because the
language is more generalised than SQL.

Paul Wilson
iiNet Ltd






Re: Distributed Fulltext?

2002-02-13 Thread Steven Roussey

Mike Wexler wrote:


That may be what you want. But I want a search engine. For my database. I want
to be able to search the million records of inventory in our database and
find the items with certain keywords that are currently available for sale, in
the venue the user is shopping in, and within the price range the user specifies.
This is a combination of database and fulltext search that I think is exactly in
MySQL's marketplace.


Personally I use the term 'search engine' to mean a Google-like text-only
search. I don't have a term for the above, but FTS seems fine. I agree
that MySQL users will tend to want an index over several columns,
including one that is text (FTS).

As for practical advice now, I'll tell you what we did:

1. Got better hardware. Athlons, DDR memory, RAID: more is better. :)

2. We split the FTS tables into 10 tables. Since we do not need to
search the text without some other constraint (yet -- I'd like FTS to
work with MERGE tables in the few cases where we did), we know which
table to use. Your mileage may vary. This helps for several reasons:
there is less index to pull into memory for the search, and FTS does not
seem to be O(n) but rather O(n^2), though I can't be sure.

3. Replicated the table to do FTS on to another computer. All it does is
FTS. We made our own routine to replicate the table (actually 10 tables),
since our text is compressed on the main server (smaller = less disk
access = faster) and the indexes are different (obviously). In this
case, it can hold a lot of the index in memory -- a big difference. Also,
a long search does not lock the tables; this is worthwhile in and of
itself, even if you have this copy of the table on the same server.
De-normalization can produce huge speed increases depending on what and
how you do things.

4. If you do #2 and #3, you'll notice that you can have x (10 for us)
servers partition the FTS. We don't actually do this, but we could, and
thereby get 'Distributed Fulltext' -- the title of this thread!!!

Numbers 1 and 3 should work for everyone, I think. Only if your app can
partition can you do #2 and #4.
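The 10-way split in step #2 amounts to routing each query to one of N per-partition fulltext tables. Assuming the partition key is known at query time, the routing itself is a one-liner (the table names here are hypothetical):

```python
def fts_table_for(partition_key, num_slices=10):
    """Pick which of the N fulltext tables holds this key's rows,
    so a search only touches 1/N of the total index."""
    return f"messages_fts_{partition_key % num_slices}"

print(fts_table_for(54321))  # messages_fts_1
```

Pointing each slice at a different server is then step #4: the same modulus chooses a host instead of (or as well as) a table.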

Sincerely,
Steven Roussey
http://Network54.com/?pp=e 








Re: Distributed Fulltext?

2002-02-12 Thread Brian DeFeyter


 On Friday 08 February 2002 08:56, Vincent Stoessel wrote:

 Apples and oranges.

 Yeah, I know.  But let's see if we can make some distinctions.
 If, say, Google can search 2 trillion web pages, averaging say 70k 
 bytes each, in 1 second, and MySQL can search 22 million records, with 
 an index on 40 bytes each, in 3 seconds (my experience) on a good day,
 what's the order of magnitude difference?  Roughly 10^9.

 Have you seen the /hardware/ run that enterprise with?
 Irrelevant; you're unlikely to get 9 orders of magnitude difference with
 faster hardware or even with clustering.

Google runs on cheap PC hardware with local IDE disks... nothing fancy. The 
majority of the speed comes from sharding the data into distributed 
indexes which are optimized for what they do, and having a frontend that
knows where to route those requests.

 Also, their software is optimized for fulltext searches, and that
 is /all/ they do. MySQL is an SQL database and is optimized as such. 
 Absolutely granted. You are completely right.
 And I don't expect the data format to change.

  BUT:  thought experiment: 

 When MySQL decides to generate a FULLTEXT index, it is using an index 
 file, the .MYI file, which can have ANY FORMAT it wants to.  If the .MYI 
 format is poorly optimized for fulltext searches, they can improve the 
 format.  They can even introduce a new index file type, .MYF, solely for 
 optimizing fulltext searches.

Agreed... however, I don't know if that's the case either. Does anyone 
(developers?) know enough about the internals to comment on such an 
approach? It seems that if you know your goal is fulltext speed, things 
could probably be done better.

 Thinking out loud...
 Steve









Re: Distributed Fulltext?

2002-02-12 Thread James Montebello


For the slice servers, you simply assume that if one is lost, you lose X%
of the data until it is revived, which is usually not even noticeable by
the end user.  For the aggregators, we had four behind a load balancer.
In practice, we had nearly zero downtime over a roughly 18-month period.

james montebello

On 7 Feb 2002, Amir Aliabadi wrote:

 How do you make something like this fault tolerant?
 The answer is probably what I suspect: two of everything.
 How does the aggregator handle this, or are these machines in a cluster?
 
 We are thinking of how to rebuild our fulltext search.  Currently it is
 in MS SQL 7.0 -  MySQL 4.0 seems to blow the doors off the cataloging
 time as compared to MS SQL 7.0 or even 8.0.
 
 
 On Thu, 2002-02-07 at 15:19, James Montebello wrote:
  
  I did this at a previous job, and we split the data up more or less
  this way (we used a pre-existing item number for the split which was
   essentially random in relation to the text data), with an aggregator that
  did the query X ways, each to a separate box holding 1/X of the data.
  The results from each unit were paged and sorted, so all the aggregator
  did was do a simple merge sort on a page of the set, which was fast.
  On a 6M record dataset, it produced millisecond-range search results.
  Not exactly Google-class, but pretty good for 12 Linux boxes, two
  programmers, and about six weeks of effort.
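
The aggregator pattern described above (each shard returns a pre-sorted page of results, and the aggregator only merge-sorts those pages) can be sketched roughly as follows; the shard data and the `aggregate` helper are hypothetical stand-ins, not the original implementation:

```python
import heapq
from itertools import islice

def aggregate(pages, page_size=10):
    """Merge one sorted page of results from each shard into a single
    ranked page.  Each shard returns its page already sorted by
    descending score, so a k-way merge suffices; no global re-sort."""
    merged = heapq.merge(*pages, key=lambda r: r["score"], reverse=True)
    return list(islice(merged, page_size))

# Three hypothetical shards, each returning a pre-sorted page:
pages = [
    [{"id": 1, "score": 0.9}, {"id": 4, "score": 0.5}],
    [{"id": 2, "score": 0.8}, {"id": 5, "score": 0.4}],
    [{"id": 3, "score": 0.7}, {"id": 6, "score": 0.3}],
]
print([r["id"] for r in aggregate(pages, page_size=4)])  # → [1, 2, 3, 4]
```

The merge is linear in the page size rather than in the total result count, which is why the aggregator stays cheap even with many shards.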
  
  james montebello
  
  On Thu, 7 Feb 2002, Brian Bray wrote:
  
   
    It seems to me like the best solution that could be implemented as-is 
    would be to keep a random int column in your table (with a range of say 
    1-100) and then have fulltext server 1 pseudo-replicate records with 
    the random number in the range of 1-10, server 2 11-20, server 3 
    21-30, and so on.
   
   Then run your query on all 10 servers and merge the result sets and 
   possibly re-sort them if you use the score column.
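
A minimal sketch of that pseudo-replication scheme, with in-memory lists standing in for the ten fulltext servers (the `rand` column and all helper names are hypothetical):

```python
import random

def shard_for(rand_col, per_server=10):
    # Server 1 holds rand values 1-10, server 2 holds 11-20, and so on.
    return (rand_col - 1) // per_server + 1

def insert(servers, row):
    row["rand"] = random.randint(1, 100)          # the random int column
    servers[shard_for(row["rand"])].append(row)

def search(servers, match):
    # Run the same fulltext query on all servers, merge the result
    # sets, and re-sort the merged set by score.
    hits = [r for rows in servers.values() for r in rows if match(r)]
    return sorted(hits, key=lambda r: r["score"], reverse=True)

servers = {i: [] for i in range(1, 11)}
for i, text in enumerate(["foo bar", "foo baz", "quux"]):
    insert(servers, {"id": i, "text": text, "score": 1.0 - 0.1 * i})
print([r["id"] for r in search(servers, lambda r: "foo" in r["text"])])  # → [0, 1]
```

Because every server holds a random slice of whole records, per-record scores stay intact and only the final ordering has to be redone, which is exactly why this split avoids the scoring problem described below.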
   
    The problem with splitting the index up by word is that it messes up all 
    your scoring and ranking.  For example, what if you search using 5 
    keywords, all starting with letters from different groups?  You're going 
    to get a pretty bad score for each match, and it could totally break 
    boolean searches.
   
   --
   Brian Bray
   
   
   
   
   Brian DeFeyter wrote:
On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]

   Wouldn't be too tough to write a little query routing system if you are using 
   perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
   server so it routes the query to several places and returns a single result 
   set. Ordering could be achieved as well. I'm sure there are commercial 
   packages out there as well. I don't see why the individual database servers 
   would need to do anything special.
   
[snip]

If I'm understanding you correctly, I think you're referring to routing
based on the first character of the word. That would work for cases
where the query is searching for a word that begins with a certain
character.. however fulltext searches also return results with the term
in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its
fulltext index based on all the 'unique words' found. For such a system
as you mentioned above, you'd probably have to create your own fulltext
indexing system to determine: a) where to store the data 'segments' and
b) how to route queries.  It seems like this could probably be done much
more efficiently inside of the server.
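
A tiny illustration of the routing gap being described, using made-up first-letter bucket boundaries: the query term routes to one bucket, but rows whose indexed words merely *contain* it can live in any bucket:

```python
def route(word):
    # Naive routing on the first letter: a-h, i-p, q-z buckets
    # (hypothetical boundaries, for illustration only).
    c = word[0].lower()
    if c <= "h":
        return "a-h"
    if c <= "p":
        return "i-p"
    return "q-z"

# The query term 'foo' routes to the a-h bucket...
print(route("foo"))              # → a-h
# ...but words containing 'foo' in the middle live elsewhere,
# so that one bucket alone cannot answer the query:
print(route("thisisfoo"))        # → q-z
print(route("that_is_foolish"))  # → q-z
```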

 - Brian





   
   
   
   
  
  
  
 
 
 
 

Re: Distributed Fulltext?

2002-02-12 Thread George M. Ellenburg

 Last week on Slashdot there was an article where the CEO of Google mentioned he
 uses DRAM (solid state disk arrays) rather than hard drives for the indexes and
 arrays because of the magnitude of difference in speed they provide.

 There's your 10^6 difference in speed (or part of it).

 G.
 Google Articles on Slashdot:
 http://slashdot.org/article.pl?sid=02/02/03/1339216mode=thread









Re: Distributed Fulltext?

2002-02-12 Thread David Axmark

On Fri, 2002-02-08 at 11:11, Steve Rapaport wrote:
 I said:
   Why is it that Altavista can index terabytes overnight and return
   a fulltext boolean for the WHOLE WEB
   within a second, and Mysql takes so long?
 
 On Friday 08 February 2002 08:56, Vincent Stoessel wrote:
 
  Apples and oranges.
 
 Yeah, I know.  But let's see if we can make some distinctions.
 If, say, Google, can search 2 trillion web pages, averaging say 70k bytes 
 each, in 1 second, and Mysql can search 22 million records, with an index
 on 40 bytes each, in 3 seconds (my experience) on a good day,
 what's the order of magnitude difference?  Roughly 10^9.
 
  Have you seen the /hardware/ run that enterprise with?
 Irrelevant, you're unlikely to get 9 orders of magnitude difference with
 faster hardware or even with clustering.

Actually not, since Google runs on thousands of machines. And if you search
for a while you will notice that you get different answers (at
least I have had trouble with this).

Also I have used Google to track occurrences of certain words over time.
And according to Google, some words (with millions of occurrences)
changed by about 40% in a week. So they simply provide approximate
answers while MySQL has to provide exact answers.

So the standard answer of Apples and Oranges certainly applies here!

/David






Re: Distributed Fulltext?

2002-02-12 Thread Steve Rapaport

David Axmark writes:

 So the standard answer with Apples and Oranges certainly apply here!

More like Äpplen och Apelsiner, that is, different but similar.  You Swedish 
guys should know.  Thanks for answering, David, I appreciate the attention
from a founder.

I also appreciate your point that Google is updating continuously and
therefore not always caught up to the current state of the web.
But isn't that a problem with indexing speed, not with search speed?
Their search speed is still amazing, as is Altavista, and most of the
other engines.

Your other point about exact vs. approximate answers is unclear, I expect
that Google's answers are exact for their currently available indexes at any
given time.  But even if they are approximate, I'd be happy with that too.  
The scoring on a FULLTEXT search in Mysql is exact but based on a
formula that is approximate anyway.

I'll summarize this thread best I can.

From the math I used, we started with my estimate of 10^9,
which was mistaken.  The real figure was 10^6; that is, Google
searches fulltext about a million times faster than Mysql.
Then we used Google's thousands of machines plus DRAM indexing to reduce
the gap to 10^2, or 100 times faster.  

It turns out that 100 times is about the factor that is causing
my application problems.  If it just ran 100 times faster it would be
about as fast as a regular indexed search, and I'd
be happy.

A few people suggested that Mysql shouldn't try to be faster,
I (and some high-support customers like Mike Wexler) disagreed.
And Alex Aulbach, bless him, actually did his homework and showed that
things could be much improved with smart index techniques like
inverted files.

Then Sergei Golubchik wrote back to say he had taken some of the good ideas
and inserted them into the TODO list, although he had higher priorities
at the time.

And I was satisfied for now, although my application still isn't working
satisfactorily due to a really slow and CPU-hungry FULLTEXT search.

I think that's our story so far.

Steve Rapaport
Director, Technical Services
A-Tono






Re: Distributed Fulltext?

2002-02-11 Thread Steve Rapaport


I second the question.  It could also reduce the size of the
fulltext index and the time taken to update it.

-steve

 On Thursday 07 February 2002 20:53, Brian  wrote:
  Has anyone made a suggestion or thought about ways to distribute
  databases which focus on fulltext indexes?
 
  fulltext indexes do a good job of indexing a moderate amount of data,
  but when you get a lot of data to be indexed, the queries slow down
  significantly.
 
  I have an example table, with about 90 million rows.. and has a fulltext
  index on a varchar(100) field. A single-word query which would return
  approx 300k results takes an average of 15 seconds. A query with smaller
  results (~ 10k) can be as quick as 1 sec.. which I would consider
  acceptable.
 
  Has anyone thought about splitting the data into distributed files or even
  machines? ie: something as simple as 'words' starting with 'X' are split
  into a-h, i-p, q-z... or something more advanced? (Maybe mysqld could
  automatically split results based on #results per unique 'word' /
  desired # of 'split files/machines'.) Would such a system give any
  advantages to searching speed and concurrent query scalability? I
  haven't looked at the fulltext internals.. so I don't know if such
  query routing could take place or not.
 
  If nothing else, does anyone else have experience with a table of this
  size or even larger? What kind of tuning have you done?







Re: Distributed Fulltext?

2002-02-11 Thread Brian DeFeyter

On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]
 Wouldn't be too tough to write a little query routing system if you are using 
 perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
 server so it routes the query to several places and returns a single result 
 set. Ordering could be achieved as well. I'm sure there are commercial 
 packages out there as well. I don't see why the individual database servers 
 would need to do anything special.
[snip]

If I'm understanding you correctly, I think you're referring to routing
based on the first character of the word. That would work for cases
where the query is searching for a word that begins with a certain
character.. however fulltext searches also return results with the term
in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its
fulltext index based on all the 'unique words' found. For such a system
as you mentioned above, you'd probably have to create your own fulltext
indexing system to determine: a) where to store the data 'segments' and
b) how to route queries.  It seems like this could probably be done much
more efficiently inside of the server.

 - Brian









Re: Distributed Fulltext?

2002-02-11 Thread Steve Rapaport

Also, I have to ask the question:  

Why is it that Altavista can index terabytes overnight and return 
a fulltext boolean for the WHOLE WEB 
within a second, and Mysql takes so long? 

On Friday 08 February 2002 11:50, Steve Rapaport wrote:
 I second the question.  It could also reduce the size of the
 fulltext index and the time taken to update it.

 -steve

  On Thursday 07 February 2002 20:53, Brian  wrote:
   Has anyone made a suggestion or thought about ways to distribute
   databases which focus on fulltext indexes?
  
   fulltext indexes do a good job of indexing a moderate amount of data,
   but when you get a lot of data to be indexed, the queries slow down
   significantly.
  
   I have an example table, with about 90 million rows.. and has a
   fulltext index on a varchar(100) field. A single-word query which would
   return approx 300k results takes an average of 15 seconds. A query with
   smaller results (~ 10k) can be as quick as 1 sec.. which I would
   consider acceptable.
  
   Has anyone thought about splitting the data into distributed files or even
   machines? ie: something as simple as 'words' starting with 'X' are
   split into a-h, i-p, q-z... or something more advanced? (Maybe mysqld
   could automatically split results based on #results per unique 'word'
   / desired # of 'split files/machines'.) Would such a system give any
   advantages to searching speed and concurrent query scalability? I
   haven't looked at the fulltext internals.. so I don't know if such
   query routing could take place or not.
  
   If nothing else, does anyone else have experience with a table of this
   size or even larger? What kind of tuning have you done?








Re: Distributed Fulltext?

2002-02-10 Thread Steve Rapaport





On Friday 08 February 2002 06:14 pm, James Montebello wrote:
 Distribution is how Google gets its speed.  You say clustering won't
 solve the problem, but distributing the indices across many processors
 *is* going to gain you a huge speed increase through sheer parallelism.

True, but not enough.  The BEST parallelism can do  in a compute-bound
application is divide the time by the number of processors.  That's assuming
a PERFECT routing system.  (Correct me if I'm wrong here)

So to make the routing system + parallelism add up to 
a MILLION times better performance, you would need at
least a MILLION processors.  I doubt that even Google is
doing that.  

 Google uses thousands of processors to handle its index, and any given
 search is going to be spread over 100s of processors.  

Right, so we can expect Google to do, say, 10,000 times  (10^4) better
than Mysql at a Fulltext search.  But in fact we're seeing 
a million, 10^6, being generous.  It's that extra factor
of a hundred (more likely a thousand, my estimates
were very generous) that I'm getting all fussy about.

 Asking
 a general purpose RDBMS to be really good at 10 different things is asking
 a bit much.  Fulltext searches are well down the list.

Here we get out of the realm of hard numbers and into opinions.   But here's
mine:  If mysql bothers to support fulltext searches, it's presumably because 
there's some demand for them in some circumstances.  The level of scaling and
optimization that could reasonably be expected is what it takes to make the
feature useful (i.e. perform similarly) in cases similar to other Mysql 
features.  With a regular index, I can do a hash lookup on 23 million records
subsecond.  With a fulltext index (on a small field, only 40 characters) my
time slips to 3 to 180 seconds.  That extra little factor of 100 is my 
problem.  Distributing over 4 processors wouldn't really help much.  And 
because people don't always type a company name the exact same way,
FULLTEXT really is the best way to do this.

So Monty et al, my request is this:  please put on the wish list, enhancement 
of FULLTEXT search  to approximately match the performance of an indexed
search on 25 million records, on the same hardware and with other things
held equal.

Steve Rapaport





Re: Distributed Fulltext?

2002-02-10 Thread George M. Ellenburg

 Last week on Slashdot there was an article where the CEO of Google mentioned he
 uses DRAM (solid state disk arrays) rather than hard drives for the indexes and
 arrays because of the magnitude of difference in speed they provide.

 There's your 10^6 difference in speed (or part of it).

 G.
 Google Articles on Slashdot:
 http://slashdot.org/article.pl?sid=02/02/03/1339216mode=thread







Re: Distributed Fulltext?

2002-02-08 Thread alec . cawley



 Why is it that Altavista can index terabytes overnight and return
 a fulltext boolean for the WHOLE WEB
 within a second, and Mysql takes so long?

I don't know about Altavista, but if you read up on Google, they do indeed
do some sort of spreading of keywords across multiple machines - last I saw
they had over 10,000 machines.

But you have to ask the question of whether this would be an appropriate
development for MySQL. It would seem to me to be a very complex development
of interest to a very small subset of MySQL customers. MySQL, it seems to me,
specialises in doing the mainstream database manipulations used by many
people simply, quickly, and (my most important criterion) reliably. Such a
development would, IMO, be away from the whole trend of MySQL, and shouldn't
happen before triggers, stored procedures and the like - if ever. A
multi-machine distributed database is a wholly different animal - IMO again.
If I was going there I wouldn't start from here - it would need a clean
look at the whole problem. If you really need such searches: Google survives
by licensing that technology - speak to them.

 Alec Cawley








Re: Distributed Fulltext?

2002-02-08 Thread Alex Aulbach

Yesterday, from Brian DeFeyter:

 Has anyone made a suggestion or thought about ways to distribute
 databases which focus on fulltext indexes?

 fulltext indexes do a good job of indexing a moderate amount of data,
 but when you get a lot of data to be indexed, the queries slow down
 significantly.

Yeah, that is because the method used to search the index is a geometric
algorithm.

 I have an example table, with about 90 million rows.. and has a fulltext
 index on a varchar(100) field. A single-word query which would return
 approx 300k results takes an average of 15 seconds. A query with smaller
 results (~ 10k) can be as quick as 1 sec.. which I would consider
 acceptable.

The interesting things are how many words have to be indexed (how many
rows there are is not very important), how big your index grows (does it
fit in memory?), and how many rows can be found for one word.

These are what performance most depends on.

 Has anyone thought about splitting the data into distributed files or even
 machines? ie: something as simple as 'words' starting with 'X' are split
 into a-h, i-p, q-z... or something more advanced? (Maybe mysqld could
 automatically split results based on #results per unique 'word' /
 desired # of 'split files/machines'.) Would such a system give any
 advantages to searching speed and concurrent query scalability? I
 haven't looked at the fulltext internals.. so I don't know if such
 query routing could take place or not.

Hmm, I think it's *much* cheaper to come together and pay the MySQL
people to introduce a new feature into MySQL called inverted files. In
short, with this method you store only each word and the records in which
it can be found. This can reduce the size of indexes, and so improve
speed, dramatically.
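
The inverted-file idea in miniature, as a hedged sketch assuming simple whitespace tokenization (real word parsing, stop-lists, and on-disk layout would be far more involved):

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each word to the set of record ids containing it -- the
    'inverted file' structure described above, in miniature."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.lower().split():
            index[word].add(row_id)
    return index

rows = {1: "full text search", 2: "text indexing", 3: "search engines"}
idx = build_inverted_index(rows)
print(sorted(idx["text"]))    # → [1, 2]
print(sorted(idx["search"]))  # → [1, 3]
```

A single-word query then becomes one dictionary lookup returning the matching record ids directly, instead of a scan over index blocks.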

 If nothing else, does anyone else have experience with a table of this
 size or even larger? What kind of tuning have you done?

We have, for example, made an extra table for indexing on an extra server.
This is good because:

- MySQL can concentrate on this table only, so the
machine doesn't need to swap

- big searches or a reindex don't bother the rest of the system

- indexing can be done via a cron job

- we optimized the write routines; for example, we strip all tags out
and apply special stop-lists before we write the data



BTW: My wishlist for fulltext indexing:
---

- inverted files

- rules to define words

- stop-word-lists and stop-word-regex

- a function which returns a table telling me what can be searched,
e.g. SHOW MATCH(table.index) AGAINST('hugobla hugo bla');

 word     count  rows  stopword  autostopword  score  wscore
 hugobla      2     2  no        no               10       1
 hugo        10     6  no        no                8       1
 bla          0     0  no        yes               0       1

Words can be searched, depending on rules, stop-words and what has been
indexed.

- configurable scoring, e.g. very fast (and simple) scoring for speed

- special scores for special words

- a cache, which stores often used words and the rows


-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris






Re: Distributed Fulltext?

2002-02-08 Thread Steve Rapaport

Ooops, factual error:

  If, say, Google, can search 2 trillion web pages, averaging say 70k
  bytes each, in 1 second, and Mysql can search 22 million records, with
  an index on 40 bytes each, in 3 seconds (my experience) on a good day,
  what's the order of magnitude difference?  Roughly 10^9.


Google has only 2 billion (with a B) web pages.

So the order of magnitude difference is only 10^6.  But that still
leaves a mysql Fulltext search a MILLION times slower than Google.

Someone correctly pointed out today that it's not Mysql's job
to be Google, and I agree.  But it seems to me that it would be
fair for mysql to be able to handle searches in under 1 second
for databases 1 millionth the size of Google.  All I want here
is a decent flexible phone book lookup, not a search engine.

-steve





Re: Distributed Fulltext?

2002-02-08 Thread James Montebello


For the slice servers, you simply assume that if one is lost, you lose X%
of the data until it is revived, which is usually not even noticable by
the end user.  For the aggregators, we had four behind a load-balancer.
In practice, we had nearly zero downtime over a roughly 18 month period.

james montebello

On 7 Feb 2002, Amir Aliabadi wrote:

 How do you make something like this fault tolerant?
 The answer is probably what I suspect: two of everything.
 How does the aggregator handle this, or are these machines in a cluster?
 
 We are thinking of how to rebuild our fulltext search.  Currently it is
 in MS SQL 7.0 -  MySQL 4.0 seems to blow the doors off the cataloging
 time as compared to MS SQL 7.0 or even 8.0.
 
 
 On Thu, 2002-02-07 at 15:19, James Montebello wrote:
  
  I did this at a previous job, and we split the data up more or less
  this way (we used a pre-existing item number for the split which was
   essentially random in relation to the text data), with an aggregator that
  did the query X ways, each to a separate box holding 1/X of the data.
  The results from each unit were paged and sorted, so all the aggregator
  did was do a simple merge sort on a page of the set, which was fast.
  On a 6M record dataset, it produced millisecond-range search results.
  Not exactly Google-class, but pretty good for 12 Linux boxes, two
  programmers, and about six weeks of effort.
  
  james montebello
  
  On Thu, 7 Feb 2002, Brian Bray wrote:
  
   
    It seems to me like the best solution that could be implemented as-is 
    would be to keep a random int column in your table (with a range of say 
    1-100) and then have fulltext server 1 pseudo-replicate records with 
    the random number in the range of 1-10, server 2 11-20, server 3 
    21-30, and so on.
   
   Then run your query on all 10 servers and merge the result sets and 
   possibly re-sort them if you use the score column.
   
    The problem with splitting the index up by word is that it messes up all 
    your scoring and ranking.  For example, what if you search using 5 
    keywords, all starting with letters from different groups?  You're going 
    to get a pretty bad score for each match, and it could totally break 
    boolean searches.
   
   --
   Brian Bray
   
   
   
   
   Brian DeFeyter wrote:
On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]

   Wouldn't be too tough to write a little query routing system if you are using 
   perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
   server so it routes the query to several places and returns a single result 
   set. Ordering could be achieved as well. I'm sure there are commercial 
   packages out there as well. I don't see why the individual database servers 
   would need to do anything special.
   
[snip]

If I'm understanding you correctly, I think you're referring to routing
based on the first character of the word. That would work for cases
where the query is searching for a word that begins with a certain
character.. however fulltext searches also return results with the term
in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its
fulltext index based on all the 'unique words' found. For such a system
as you mentioned above, you'd probably have to create your own fulltext
indexing system to determine: a) where to store the data 'segments' and
b) how to route queries.  It seems like this could probably be done much
more efficiently inside of the server.

 - Brian





   
   
   
   
  
  
  
 
 
 
 

Distributed Fulltext?

2002-02-07 Thread Brian DeFeyter

Has anyone made a suggestion or thought about ways to distribute
databases which focus on fulltext indexes?

fulltext indexes do a good job of indexing a moderate amount of data,
but when you get a lot of data to be indexed, the queries slow down
significantly.

I have an example table, with about 90 million rows.. and has a fulltext
index on a varchar(100) field. A single-word query which would return
approx 300k results takes an average of 15 seconds. A query with smaller
results (~ 10k) can be as quick as 1 sec.. which I would consider
acceptable.

Has anyone thought about splitting the data into distributed files or even
machines? ie: something as simple as 'words' starting with 'X' are split
into a-h, i-p, q-z... or something more advanced? (Maybe mysqld could
automatically split results based on #results per unique 'word' /
desired # of 'split files/machines'.) Would such a system give any
advantages to searching speed and concurrent query scalability? I
haven't looked at the fulltext internals.. so I don't know if such
query routing could take place or not.

If nothing else, does anyone else have experience with a table of this
size or even larger? What kind of tuning have you done?

Thanks,

 - Brian






Re: Distributed Fulltext?

2002-02-07 Thread Tod Harter

On Thursday 07 February 2002 14:53, Brian DeFeyter wrote:
 [snip]

Wouldn't be too tough to write a little query routing system if you are using 
perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
server so it routes the query to several places and returns a single result 
set. Ordering could be achieved as well. I'm sure there are commercial 
packages out there as well. I don't see why the individual database servers 
would need to do anything special.
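
[Editor's note: a rough illustration of that fan-out proxy idea, as a
hypothetical Python sketch rather than the Perl/DBD::Proxy setup described;
the backend callables stand in for real per-server database connections.]

```python
import heapq

def fan_out(query, backends):
    """Send one query to every backend and merge the scored, pre-sorted
    result sets into a single ordered result set."""
    partial_sets = [backend(query) for backend in backends]
    # Each partial set is sorted by descending score, so a lazy merge
    # keyed on the negated score preserves the combined ordering.
    return list(heapq.merge(*partial_sets, key=lambda row: -row[0]))

# Example with two stand-in backends returning (score, row) pairs:
server_a = lambda q: [(0.9, "foo.txt"), (0.4, "foobar")]
server_b = lambda q: [(0.7, "foolish")]
# fan_out("foo", [server_a, server_b])
# -> [(0.9, "foo.txt"), (0.7, "foolish"), (0.4, "foobar")]
```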






Re: Distributed Fulltext?

2002-02-07 Thread Brian DeFeyter

On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]
 Wouldn't be too tough to write a little query routing system if you are using 
 perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
 server so it routes the query to several places and returns a single result 
 set. Ordering could be achieved as well. I'm sure there are commercial 
 packages out there as well. I don't see why the individual database servers 
 would need to do anything special.
[snip]

If I'm understanding you correctly, I think you're referring to routing
based on the first character of the word. That would work for cases
where the query is searching for a word that begins with a certain
character.. however fulltext searches also return results with the term
in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its
fulltext index based on all the 'unique words' found. For such a system
as you mentioned above, you'd probably have to create your own fulltext
indexing system to determine: a) where to store the data 'segments' and
b) how to route queries.  It seems like this could probably be done much
more efficiently inside of the server.

 - Brian







Re: Distributed Fulltext?

2002-02-07 Thread Steve Rapaport


I second the question.  It could also reduce the size of the
fulltext index and the time taken to update it.

-steve

 On Thursday 07 February 2002 20:53, Brian  wrote:
  [snip]





Re: Distributed Fulltext?

2002-02-07 Thread Brian Bray


It seems to me like the best solution that could be implemented as-is 
would be to keep a random int column in your table (with a range of, say, 
1-100) and then have fulltext server 1 pseudo-replicate records with 
the random number in the range of 1-10, server 2 11-20, server 3 
21-30, and so on.

Then run your query on all 10 servers and merge the result sets, and 
possibly re-sort them if you use the score column.
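
[Editor's note: a minimal sketch of that scheme, as hypothetical Python;
100 buckets spread over 10 servers, plus a score-ordered merge of the
per-server result sets. All names are illustrative.]

```python
import random

NUM_SERVERS = 10
NUM_BUCKETS = 100  # each row gets a random bucket 1..100 at insert time

def assign_bucket():
    """Value to store in the row's random int column."""
    return random.randint(1, NUM_BUCKETS)

def server_for_bucket(bucket):
    """Server 0 owns buckets 1-10, server 1 owns 11-20, and so on."""
    return (bucket - 1) * NUM_SERVERS // NUM_BUCKETS

def merge_results(result_sets):
    """Combine per-server result sets and re-sort by descending score."""
    rows = [row for rs in result_sets for row in rs]
    return sorted(rows, key=lambda row: row["score"], reverse=True)
```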

The problem with splitting the index up by word is that it messes up all 
your scoring and ranking.  For example, what if you search using 5 
keywords, all starting with letters from different groups?  You're going 
to get a pretty bad score for each match, and it could totally break 
boolean searches.

--
Brian Bray




Brian DeFeyter wrote:
 [snip]







Re: Distributed Fulltext?

2002-02-07 Thread James Montebello


I did this at a previous job, and we split the data up more or less
this way (we used a pre-existing item number for the split, which was
essentially random in relation to the text data), with an aggregator that
ran the query X ways, each to a separate box holding 1/X of the data.
The results from each unit were paged and sorted, so all the aggregator
did was a simple merge sort on a page of the set, which was fast.
On a 6M record dataset, it produced millisecond-range search results.
Not exactly Google-class, but pretty good for 12 Linux boxes, two
programmers, and about six weeks of effort.
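
[Editor's note: the aggregator's page merge can be sketched like this,
as hypothetical Python; each shard is assumed to have already sorted and
paged its own results, which is what makes the final step cheap.]

```python
import heapq

def aggregate_page(shard_pages, page_size):
    """Merge one pre-sorted page from each shard and keep one page's worth.

    Because every input page is already sorted, this is a simple merge
    sort over a page of the set rather than a full re-sort of everything.
    """
    return list(heapq.merge(*shard_pages))[:page_size]

# aggregate_page([[1, 4, 9], [2, 3, 7]], 4) -> [1, 2, 3, 4]
```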

james montebello

On Thu, 7 Feb 2002, Brian Bray wrote:

 
 [snip]






Re: Distributed Fulltext?

2002-02-07 Thread Amir Aliabadi

How do you make something like this fault tolerant?
The answer is probably what I suspect: two of everything.
How does the aggregator handle this, or are these machines in a cluster?

We are thinking about how to rebuild our fulltext search.  Currently it is
in MS SQL 7.0.  MySQL 4.0 seems to blow the doors off the cataloging
time as compared to MS SQL 7.0 or even 8.0.


On Thu, 2002-02-07 at 15:19, James Montebello wrote:
 
 [snip]







Re: Distributed Fulltext?

2002-02-07 Thread Steve Rapaport

Also, I have to ask the question:

Why is it that Altavista can index terabytes overnight and return
a fulltext boolean search for the WHOLE WEB
within a second, while MySQL takes so long?

On Friday 08 February 2002 11:50, Steve Rapaport wrote:
 [snip]
