Re: Distributed Fulltext?

2002-02-16 Thread David Axmark

On Fri, 2002-02-15 at 02:44, Alex Aulbach wrote:
> Wednesday, from David Axmark:
> > > Your other point about exact vs. approximate answers is unclear, I expect
> > > that Google's answers are exact for their currently available indexes at any
> > > given time.  But even if they are approximate, I'd be happy with that too.
> > > The scoring on a FULLTEXT search in Mysql is "exact" but based on a
> > > formula that is approximate anyway.
> >
> > No, MySQL returns all the data matching a search. Web engines return
> > what they find on one search machine. So you can get different
> > results from Google every time you hit refresh, if you are routed to
> > different machines. This has happened to me when I was looking at the
> > number of matches rather than the results themselves.
> >
> > So we should try to make fulltext searches with a limit between 10 and
> > 100 fast, to be closer to Google.
> >
> > I have also heard about some other things web search engines do, since I
> > know some people at FAST, but I have forgotten the details.
> 
> My opinion is that MySQL itself should never try to find approximate
> matches. This is against the definition of SQL itself. SQL is a
> fourth-generation language: if you say "SELECT", the engine selects.
> And it has to return exactly what I searched for, every time, on
> every machine, in any combination, with the same data.
>
> So SQL needs a new language construct for an approximate search. But
> what is an approximate search? How is "approximate" defined?
> 

I agree. If there is a good technical reason for doing so, I would be
willing to add a keyword APPROXIMATE for such a thing. But it would have
to be extremely clear, even to the uninitiated, that the result is
approximate.

I can see such a feature being used by the people who build search
engines with MySQL.

> I don't think it is a good idea to implement it this way.
> Approximation must always be done at the application level, because
> what counts as an approximate result depends heavily on the application.
> 
> > We will try to make every feature as good as possible. But we do have
> > limited resources.
> 
> Exactly. FTS is not as important as other features, and people who want
> you to include a new feature should think about supporting MySQL with
> money. :-)
> 
> But (yes, we support MySQL! :-) I think the need is growing rapidly,
> because the amount of data that has to be indexed grows year by year, and
> other databases have much more experience with it. Currently we can live
> with the speed. Those who cannot live with it should buy better machines,
> rethink their search-engine concept, or support MySQL.
> 
> Search-engine techniques are *not* trivial, so the last option is, in my
> eyes, one of the cheapest.
> 
> 
> > Well there is always the option of sponsoring further fulltext
> > development. We have a guy who has been working on the GNU fulltext
> > engines who is interested in working on MySQL fulltext. But for the
> > moment we cannot afford it.
> 
> This was my first thought: people write about speed problems, how to
> cluster, and so on. Things that I would estimate at weeks of work and a
> high TCO.
> 
> But it may be much cheaper to pay MySQL for this. How much do you
> estimate it would cost to implement inverted files? I think this is
> difficult, because Sergei told me that he couldn't use the MySQL index
> files any more.

If needed we could add a new index file (.MYT ?). But that would require
internal changes of unknown (to me) size. If this is needed for
continued development of fulltext, we will do it.

> I just ask; nothing special in sight, but many questions from everyone
> who needs it. FTS is a feature that greatly improves the value of a web
> site, and customers have no problem paying for things they believe will
> earn them money. FTS is such a thing.
> 
> But perhaps if we knew under which circumstances FTS will be improved, it
> would be easier for us to find a way to share the costs or find a
> compromise. I also understand if MySQL doesn't want to discuss it here.
> 
> It is also important for us to know how much it can theoretically be
> improved. My calculations showed a theoretical speedup of a factor of 100
> or so. This is ... wow. But in real life everything is usually slower...

We want to do things as fast as possible in most cases. We will try to
do this for text search as well, but the question is how long it will
take before we finish.

> > So if some of you are interested in sponsoring this (or know about
> > others who might be) write to [EMAIL PROTECTED]
> 
> Or like this... maybe we can find customers who need it. I think it's
> possible.
> 
> My personal feeling, and my gut, tells me that fulltext indexing is a
> feature that needs to be expanded.

We do have Sergei working on it still, and we do plan to expand it
during the coming months. But the way to get our attention in cases like
this is simple: give us some code or money!

/David


-

Re: Distributed Fulltext?

2002-02-14 Thread Alex Aulbach

Wednesday, from David Axmark:
> > Your other point about exact vs. approximate answers is unclear, I expect
> > that Google's answers are exact for their currently available indexes at any
> > given time.  But even if they are approximate, I'd be happy with that too.
> > The scoring on a FULLTEXT search in Mysql is "exact" but based on a
> > formula that is approximate anyway.
>
> No, MySQL returns all the data matching a search. Web engines return
> what they find on one search machine. So you can get different
> results from Google every time you hit refresh, if you are routed to
> different machines. This has happened to me when I was looking at the
> number of matches rather than the results themselves.
>
> So we should try to make fulltext searches with a limit between 10 and
> 100 fast, to be closer to Google.
>
> I have also heard about some other things web search engines do, since I
> know some people at FAST, but I have forgotten the details.

My opinion is that MySQL itself should never try to find approximate
matches. This is against the definition of SQL itself. SQL is a
fourth-generation language: if you say "SELECT", the engine selects.
And it has to return exactly what I searched for, every time, on
every machine, in any combination, with the same data.

So SQL needs a new language construct for an approximate search. But
what is an approximate search? How is "approximate" defined?

I don't think it is a good idea to implement it this way.
Approximation must always be done at the application level, because
what counts as an approximate result depends heavily on the application.

> We will try to make every feature as good as possible. But we do have
> limited resources.

Exactly. FTS is not as important as other features, and people who want
you to include a new feature should think about supporting MySQL with
money. :-)

But (yes, we support MySQL! :-) I think the need is growing rapidly,
because the amount of data that has to be indexed grows year by year, and
other databases have much more experience with it. Currently we can live
with the speed. Those who cannot live with it should buy better machines,
rethink their search-engine concept, or support MySQL.

Search-engine techniques are *not* trivial, so the last option is, in my
eyes, one of the cheapest.


> Well there is always the option of sponsoring further fulltext
> development. We have a guy who has been working on the GNU fulltext
> engines who is interested in working on MySQL fulltext. But for the
> moment we cannot afford it.

This was my first thought: people write about speed problems, how to
cluster, and so on. Things that I would estimate at weeks of work and a
high TCO.

But it may be much cheaper to pay MySQL for this. How much do you
estimate it would cost to implement inverted files? I think this is
difficult, because Sergei told me that he couldn't use the MySQL index
files any more.

I just ask; nothing special in sight, but many questions from everyone
who needs it. FTS is a feature that greatly improves the value of a web
site, and customers have no problem paying for things they believe will
earn them money. FTS is such a thing.

But perhaps if we knew under which circumstances FTS will be improved, it
would be easier for us to find a way to share the costs or find a
compromise. I also understand if MySQL doesn't want to discuss it here.

It is also important for us to know how much it can theoretically be
improved. My calculations showed a theoretical speedup of a factor of 100
or so. This is ... wow. But in real life everything is usually slower...


> So if some of you are interested in sponsoring this (or know about
> others who might be) write to [EMAIL PROTECTED]

Or like this... maybe we can find customers who need it. I think it's
possible.

My personal feeling, and my gut, tells me that fulltext indexing is a
feature that needs to be expanded.

-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris



-
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/   (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php




Re: Distributed Fulltext?

2002-02-14 Thread Alex Aulbach

Wednesday, from Mike Wexler:
> I don't think that would be appropriate. For example, our site (tias.com)
> has lots of antiques and collectibles. One popular category is jewelry. If
> somebody does a search for "gold jewelry" and the search engine interprets
> this as anything that mentions gold or jewelry, it is going to match a lot
> of items. It would be nice if we could use EXPLAIN, or something like it,
> to get a rough estimate of how many results a query would generate, and if
> it was really bad, we could tell the user to be more specific.

This is not a full solution, but we approximate it by using the SQL
queries

SELECT COUNT(*) FROM table WHERE MATCH(index) AGAINST ('gold')
(results e.g. in 100)

and

SELECT COUNT(*) FROM table WHERE MATCH(index) AGAINST ('jewelry')
(results e.g. in 200)

OR-search: the result is between 200 and 300 matches.
AND-search: the result is between 0 and 100 matches.

The problem is: each COUNT query takes nearly as long as the complete
search. :)
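
The set-theoretic bounds can be computed from the two counts alone: the
union is at least the larger count and at most the sum, the intersection
is at most the smaller count. A minimal sketch (the counts are the
illustrative 100 and 200 from the queries above):

```python
def or_bounds(a, b):
    # Union of two result sets: at least the larger set, at most the sum.
    return max(a, b), a + b

def and_bounds(a, b):
    # Intersection: at most the smaller set; can be empty.
    return 0, min(a, b)

# Counts from the two COUNT(*) queries above.
gold, jewelry = 100, 200
print(or_bounds(gold, jewelry))   # (200, 300)
print(and_bounds(gold, jewelry))  # (0, 100)
```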

-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris






Re: Distributed Fulltext?

2002-02-14 Thread Alex Aulbach


Hi, I will also explain how we made FTS "fast". (Sorry for my bad
English.)

First some data: the table that has to be indexed has ~600,000 entries.
It contains articles, which are on average 3-4 KB each (which says
nothing!) with about 300 words each (this number is very important!).

That makes, in theory, an index of 180 million word occurrences. But we
have stopwords, and only about 1 million distinct words. So the index for
the relation

  words  <---> article

is about 25-30 million records/relations.

This is the only important thing: how big is the index?
Does it fit completely into memory?
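
The arithmetic behind these figures can be checked directly; note that
180 million occurrences at 300 words per article implies roughly 600,000
articles (the 27.5 million below is just the midpoint of the quoted
25-30 million range):

```python
articles = 600_000          # rows in the indexed table
words_per_article = 300     # average words per article
occurrences = articles * words_per_article
print(occurrences)          # 180000000 (180 million)

# After stopwords and de-duplication, the word<->article relation shrinks
# to roughly 25-30 million records, i.e. about 15% of the raw count.
ratio = 27_500_000 / occurrences
print(round(ratio, 3))      # 0.153
```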


Because of earlier experiences with FTS in MySQL, we decided to put the
FTS engine on a separate DB server. This machine has a 1 GHz CPU, 4 GB of
RAM, and RAID 5.

The main application has a centralized write module; every article has to
go through these routines. So it was easy to write each article not only
into the article database but also into the search database. Before we do
this, we strip away HTML and other markup to reduce the data.
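
The stripping step might look like the following rough sketch using a
regular expression (real-world markup may need a proper HTML parser):

```python
import re

def strip_markup(html):
    """Remove tags and collapse whitespace before writing to the search DB."""
    text = re.sub(r'<[^>]+>', ' ', html)   # drop anything that looks like a tag
    return ' '.join(text.split())          # collapse runs of whitespace

print(strip_markup('<p>gold <b>jewelry</b> for sale</p>'))
# gold jewelry for sale
```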

Some things that make the engine "slow":

* Words that match very often. MySQL has to find all the records, and FTS
has to calculate how good each match is, which is very time-consuming.
Also, SELECT * FROM fts is very bad; SELECT index FROM fts is better,
because the amount of data selected is much lower.

* At startup, and when there are many updates, the machine is very slow.
It's clear why; I think I don't need to explain it.

* Too many requests (>10 per second, but this depends very much on what
is searched). I think this is also clear.


BTW, we found a trick for "AND" searches (by default FTS searches
with OR!):

1. Count the matches for each word (SELECT COUNT(*) FROM fts WHERE
MATCH(ftsindex) AGAINST ('word1')).
2. Sort the words by match count, ascending.
3. SELECT * FROM fts WHERE MATCH(ftsindex) AGAINST (word with the lowest
number of matches) AND MATCH(ftsindex) AGAINST (word with the second
lowest number of matches) AND 

This speeds up the query by about 30-50% compared with the same query
with the words sorted descending! The reason is obvious: if the first set
is the smallest, MySQL has far fewer candidates to compare.
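
The reordering step can be sketched like this; `count_matches` stands in
for the per-word COUNT(*) query (here it is faked with a dictionary of
invented counts):

```python
# Fake per-word match counts; in practice each value would come from
# SELECT COUNT(*) FROM fts WHERE MATCH(ftsindex) AGAINST ('<word>').
counts = {'gold': 100, 'jewelry': 200, 'antique': 40}

def and_query(words, count_matches):
    # Most selective word first, so the engine starts from the smallest set.
    ordered = sorted(words, key=count_matches)
    clauses = ["MATCH(ftsindex) AGAINST ('%s')" % w for w in ordered]
    return "SELECT * FROM fts WHERE " + " AND ".join(clauses)

print(and_query(['gold', 'jewelry', 'antique'], counts.get))
```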



So let's summarize:

- Split the search facilities from your application!

  This is also very important so that your application is not blocked too
  much while the fulltext index is being generated (e.g. because the
  machine's CPU bites the dust).

- Depending on the size of the index:
  1. Use a separate table.
  2. Put more RAM into the machine.
  3. Use a separate database on another machine.
  4. Put even more RAM into the machine.
  5. Split your DB into several DBs on different machines.
  6. Why don't you buy a real Google-style search engine?

  A positive side effect of using a separate DB: you are able to index
  when there is time for it, e.g. when the load is low. Asynchronous
  search facilities can be a good idea for some use cases.

- Write only the data that should be findable into the FTS table. This
  reduces indexing time, search time, and the amount of data.

- Put other search criteria into this table too, if possible.
  Try to avoid joins!

- There are some optimization tricks to speed up searches.

- Caching can also be a very good idea.


The longest searches take about 2 seconds with this method.



-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris






Re: Distributed Fulltext?

2002-02-13 Thread Steven Roussey

Mike Wexler wrote:


That may be what you want. But I want a search engine. For my database.
I want to be able to search the million records of inventory in our
database and find the items with certain keywords that are currently
available for sale, in the venue the user is shopping in, and within the
price range the user specifies. This is a combination of database and
fulltext search that I think is exactly in MySQL's marketplace.


Personally I use the term 'search engine' to mean Google-like text-only
search. I don't have a term for the above, but FTS seems fine. I agree
that MySQL users will tend to want an index over several columns
including one that is text (FTS).

As for practical advice now, I'll tell you what we did:

1. Got better hardware. Athlons, DDR memory, RAID: more is better. :)

2. We split the FTS tables into 10 tables. Since we do not need to
search the text without some other constraint (yet -- I'd like FTS to
work with merge tables in the few cases where we did), we know which
table to use. Your mileage may vary. This helps for several reasons:
less index to pull into memory for the search, and FTS does not seem to
be O(n) but rather O(n^2) though I can't be sure.

3. Replicated the table we do FTS on to another computer. All it does is
FTS. We made our own routine to replicate the table (actually 10 tables),
since our text is compressed on the main server (smaller = less disk
access = faster) and the indexes are different (obviously). In this case,
the machine can hold a lot of the index in memory -- a big difference.
Also, a long search does not lock the tables, which is worthwhile in and
of itself, even if the copy of the table is on the same server.
De-normalization can produce huge speed increases, depending on what you
do and how you do it.

4. If you do #2 and #3 you'll notice that you can have x servers (10 for
us) partition the FTS. We don't actually do this, but we could, and
thereby get 'Distributed Fulltext' -- the title of this thread!!!

Number 1 and 3 should work for everyone, I think. Only if your app can
partition can you do #2 and #4.
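
Steps #2-#4 amount to sharding plus scatter-gather: each server holds one
slice of the rows, every query fans out to all slices, and the per-shard
results are merged and re-sorted by score. A toy sketch of the idea (the
shard contents and scores are invented; a real setup would query N MySQL
servers):

```python
# Each "server" holds one slice of the rows, as (row id, score) pairs.
shards = [
    [('doc1', 0.9), ('doc4', 0.2)],   # results from shard 0
    [('doc2', 0.7)],                  # shard 1
    [('doc3', 0.4), ('doc5', 0.1)],   # shard 2
]

def distributed_search(shards, limit=3):
    merged = [hit for shard in shards for hit in shard]  # gather
    merged.sort(key=lambda h: h[1], reverse=True)        # re-sort by score
    return merged[:limit]

print(distributed_search(shards))
# [('doc1', 0.9), ('doc2', 0.7), ('doc3', 0.4)]
```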

Sincerely,
Steven Roussey
http://Network54.com/?pp=e 








Re: Distributed Fulltext?

2002-02-13 Thread hooker

> While any speed up with a full table fulltext search would be helpful
> and useful, there are instances where the search is intersected with
> another column and the problem of search is therefore more complex but
> also leads to potential optimizations.
> 
> In our case we rarely do searches that use just the FTS index. More
> commonly, we do searches based on other columns plus an FTS-indexed text
> column. The WHERE-clause constraints on the non-text columns usually
> eliminate 99% of the rows. Fulltext indexes cannot be combined with
> numeric columns in a single composite index, so as far as I can tell,
> MySQL performs the FTS and then filters the results to meet the WHERE
> conditions. I believe the amount of the index that must be paged into
> main memory is much larger than might otherwise be needed (since the
> indexes are larger than available memory, and there are a zillion other
> tables accessed simultaneously, this tends to affect all queries).

There's also a consideration about splitting the workload between the
RDBMS and the application. I've improved the performance of applications
by simplifying the SQL and making the Perl/C code do more of the work.
When this succeeds (usually), it's because the application can be written
to "understand" the implications behind the query results, and the
language is more general-purpose than SQL.
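
As an illustration of that split (the table, columns, and predicate are
invented): let cheap SQL do the bulk elimination, then apply the subtle
rule in the application:

```python
# Pretend result of a coarse query: SELECT id, price, title FROM inventory ...
rows = [
    (1, 120, 'gold ring'),
    (2, 45,  'silver chain'),
    (3, 80,  'gold-plated brooch'),
]

def app_side_filter(rows, budget):
    # Application-level rule: within budget, and solid gold only --
    # awkward to express in SQL, trivial in the host language.
    return [r for r in rows
            if r[1] <= budget and 'gold' in r[2] and 'plated' not in r[2]]

print(app_side_filter(rows, budget=150))
# [(1, 120, 'gold ring')]
```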

Paul Wilson
iiNet Ltd






Re: Distributed Fulltext?

2002-02-13 Thread hooker

> Steve Rapaport wrote:
> > Someone correctly pointed out today that it's not Mysql's job
> > to be Google, and I agree.  But it seems to me that it would be
> > fair for mysql to be able to handle searches in under 1 second
> > for databases 1 millionth the size of Google.  All I want here
> > is a decent flexible phone book lookup, not a search engine.
> 
> That may be what you want. But I want a search engine. For my database. I
> want to be able to search the million records of inventory in our database
> and find the items with certain keywords that are currently available for
> sale, in the venue the user is shopping in, and within the price range the
> user specifies. This is a combination of database and fulltext search that
> I think is exactly in MySQL's marketplace.

I think that there are two different requirements here. A database able to handle
fulltext searches is becoming almost mandatory, and MySQL achieves this, within
limits. They appear to recognise the current limitations of the product, and are
working to make it more capable, and that's exactly what I'd expect from a
company with good market awareness.

However, a full search-engine is a different creature. MySQL is unlikely ever
to equal Google. That requires tailored software and data structures, and more
hardware than most of us are likely to have access to.

MySQL has a stop list; in my view it also needs a "yes list" so that
indexing on a closed vocabulary can be implemented easily. The work that
I did on full text searching (before MySQL offered the functionality)
shows what pretty much everyone else seems to have found: regardless of
the capabilities of the underlying engine, full text retrieval only works
well with very large document sets. With smaller volumes of text, you
only get good results from a closed vocabulary - keywords, if you will.

If MySQL could be restricted to a list of words/phrases to be indexed, I
think a lot of the performance issues would be reduced, but it's
certainly not the answer to every problem.
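
A "yes list" can be emulated in the application today by filtering each
document down to the allowed vocabulary before it is written to the FTS
table; a minimal sketch (the vocabulary is invented):

```python
# Closed vocabulary: only these words ever reach the fulltext index.
YES_LIST = {'gold', 'silver', 'jewelry', 'antique', 'ring'}

def index_terms(text):
    """Keep only yes-listed words, preserving order and dropping repeats."""
    seen, kept = set(), []
    for word in text.lower().split():
        if word in YES_LIST and word not in seen:
            seen.add(word)
            kept.append(word)
    return ' '.join(kept)

print(index_terms('Fine antique GOLD ring a gold ring like no other'))
# antique gold ring
```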

Paul Wilson
iiNet Ltd






Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler



Brian DeFeyter wrote:
> On Wed, 2002-02-13 at 16:39, Mike Wexler wrote:
> 
>>
>>Brian DeFeyter wrote:
>>
>>>I sorta like that idea. I don't know exactly what you can and can't do
>>>as far as indexing inside of HEAP tables.. but the index size would
>>>likely differ from the on-disk index. Then you can expand the idea and
>>>use the X/(num slices) on (num slices) boxes technique.. sending the
>>>query to each and compiling all of the results.
>>>
>>>Your comment about only a few common words being searched makes me
>>>wonder if a reverse-stopword function would be valuable (i.e. only index
>>>words in a list you generate). It could probably also be done in other
>>>ways with what's available. Maybe a pseudo-bitmap index inside a MySQL
>>>table?
>>>
>>I don't think that would be appropriate. For example, our site (tias.com)
>>has lots of antiques and collectibles. One popular category is jewelry. If
>>somebody does a search for "gold jewelry" and the search engine interprets
>>this as anything that mentions gold or jewelry, it is going to match a lot
>>of items. It would be nice if we could use EXPLAIN, or something like it,
>>to get a rough estimate of how many results a query would generate, and if
>>it was really bad, we could tell the user to be more specific.
>>
> 
> The FOUND_ROWS() function in mysql-4.0.0+ might help you out in this
> case.

That only works after you've waited 2 minutes for the query to complete. I
want something (like EXPLAIN) that is fairly quick and that you can run
ahead of time.









Re: Distributed Fulltext?

2002-02-13 Thread Brian DeFeyter

On Wed, 2002-02-13 at 16:39, Mike Wexler wrote:
> 
> 
> Brian DeFeyter wrote:
> > I sorta like that idea. I don't know exactly what you can and can't do
> > as far as indexing inside of HEAP tables.. but the index size would
> > likely differ from the written index. Then you can expand the idea and
> > use the X/(num slices) on (num slices) boxes technique.. sending the
> > query to each, and compiling all of the results.
> > 
> > Your comment about only having a few common words that are searched
> > makes me wonder if a reverse-stopword function would be valuable (ie:
> > only index words in a list you generate) Probably could be done other
> > ways though too with what's available. Maybe a psuedo-bitmap index
> > inside a mysql table?
> 
> I don't think that would be appropriate. For example, our site (tias.com)
> has lots of antiques and collectibles. One popular category is jewelry. If
> somebody does a search for "gold jewelry" and the search engine interprets
> this as anything that mentions gold or jewelry, it is going to match a lot
> of items. It would be nice if we could use EXPLAIN, or something like it,
> to get a rough estimate of how many results a query would generate, and if
> it was really bad, we could tell the user to be more specific.

The FOUND_ROWS() function in mysql-4.0.0+ might help you out in this
case.
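
For reference, the pattern (as I understand the 4.0 feature; the table
and columns are illustrative) is to compute the full match count
alongside a LIMITed query:

```sql
-- Computes the total match count while returning only one page of rows.
SELECT SQL_CALC_FOUND_ROWS id, title
FROM inventory
WHERE MATCH(title) AGAINST ('gold jewelry')
LIMIT 10;

-- Immediately afterwards, on the same connection:
SELECT FOUND_ROWS();
```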







Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler



Brian DeFeyter wrote:
> I sorta like that idea. I don't know exactly what you can and can't do
> as far as indexing inside of HEAP tables.. but the index size would
> likely differ from the on-disk index. Then you can expand the idea and
> use the X/(num slices) on (num slices) boxes technique.. sending the
> query to each and compiling all of the results.
> 
> Your comment about only a few common words being searched makes me
> wonder if a reverse-stopword function would be valuable (i.e. only index
> words in a list you generate). It could probably also be done in other
> ways with what's available. Maybe a pseudo-bitmap index inside a MySQL
> table?

I don't think that would be appropriate. For example, our site (tias.com)
has lots of antiques and collectibles. One popular category is jewelry. If
somebody does a search for "gold jewelry" and the search engine interprets
this as anything that mentions gold or jewelry, it is going to match a lot
of items. It would be nice if we could use EXPLAIN, or something like it,
to get a rough estimate of how many results a query would generate, and if
it was really bad, we could tell the user to be more specific.


> 
>  - bdf
> 
> On Wed, 2002-02-13 at 12:09, Mike Wexler wrote:
> 
>>My understanding is that part of how google and Altavista get such high speeds 
>>is to keep everything in memory. Is it possible to create a HEAP table with a 
>>full text index? If so, does the full text index take advantage of being in 
>>memory? For example, I would imagine that if you were keeping the whole index in 
>>memory, details like the index page size, and the format of the pointers/record 
>>numbers would be different.
>>
>>Then you could just do something roughly like (i know the syntax is a little off)
>>
>>CREATE HEAP TABLE fooFast SELECT * FROM fooSlow
>>ALTER fooFast ADD fulltext(a, b, c)
>>
>>Or maybe you could just have fooSlow on one server, and then have it
>>replicated on N other servers. But on the other servers you could alter
>>the table type so it was a HEAP table. So you would have one persistent
>>table and a bunch of replicated HEAP tables, and all the searches could
>>go against the HEAP tables.
>>
>>
>>Brian Bray wrote:
>>
>>>It seems to me like the best solution that could be implemented as-is 
>>>would be to keep a random int column in your table (with a range of say 
>>>1-100) and then have fulltext server 1 pseudo-replicate records with the 
>>>random number in the range 1-10, server 2 11-20, server 3 21-30, and 
>>>so on.
>>>
>>>Then run your query on all 10 servers and merge the result sets, and 
>>>possibly re-sort them if you use the score column.
>>>
>>>The problem with splitting the index up by word is that it messes up all 
>>>your scoring and ranking.  For example, what if you search using 5 
>>>keywords all starting with letters from different groups?  You're going 
>>>to get a pretty bad score for each match, and it could totally break 
>>>boolean searches.
>>>
>>>--
>>>Brian Bray
>>>
>>>
>>>
>>>
>>>Brian DeFeyter wrote:
>>>
>>>
On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]



>Wouldn't be too tough to write a little query routing system if you are using 
>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
>server so it routes the query to several places and returns a single result 
>set. Ordering could be achieved as well. I'm sure there are commercial 
>packages out there as well. I don't see why the individual database servers 
>would need to do anything special.
>
>
>
[snip]

If I'm understanding you correctly, I think you're referring to routing
based on the first character of the word. That would work for cases
where the query searches for a word that begins with a certain
character.. however fulltext searches also return results with the term
in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its
fulltext index based on all the 'unique words' found. For such a system
as you mentioned above, you'd probably have to create your own fulltext
indexing system to determine: a) where to store the data 'segments' and
b) how to route queries.  It seems like this could probably be done much
more efficiently inside of the server.

- Brian




Re: Distributed Fulltext?

2002-02-13 Thread Brian DeFeyter

I sorta like that idea. I don't know exactly what you can and can't do
as far as indexing inside of HEAP tables.. but the index size would
likely differ from the on-disk index. Then you can expand the idea and
use the X/(num slices) on (num slices) boxes technique.. sending the
query to each and compiling all of the results.

Your comment about only a few common words being searched makes me
wonder if a reverse-stopword function would be valuable (i.e. only index
words in a list you generate). It could probably also be done in other
ways with what's available. Maybe a pseudo-bitmap index inside a MySQL
table?

 - bdf

On Wed, 2002-02-13 at 12:09, Mike Wexler wrote:
> My understanding is that part of how google and Altavista get such high speeds 
> is to keep everything in memory. Is it possible to create a HEAP table with a 
> full text index? If so, does the full text index take advantage of being in 
> memory? For example, I would imagine that if you were keeping the whole index in 
> memory, details like the index page size, and the format of the pointers/record 
> numbers would be different.
> 
> Then you could just do something roughly like (i know the syntax is a little off)
> 
> CREATE HEAP TABLE fooFast SELECT * FROM fooSlow
> ALTER fooFast ADD fulltext(a, b, c)
> 
> Or maybe you could just have fooSlow on one server, and then have it
> replicated on N other servers. But on the other servers you could alter
> the table type so it was a HEAP table. So you would have one persistent
> table and a bunch of replicated HEAP tables, and all the searches could
> go against the HEAP tables.
> 
> 
> Brian Bray wrote:
> > It seems to me like the best solution that could be implemented as-is 
> > would be to keep a random int column in your table (with a range of say 
> > 1-100) and then have fulltext server 1 pseudo-replicate records with the 
> > random number in the range 1-10, server 2 11-20, server 3 21-30, and 
> > so on.
> > 
> > Then run your query on all 10 servers and merge the result sets, and 
> > possibly re-sort them if you use the score column.
> > 
> > The problem with splitting the index up by word is that it messes up all 
> > your scoring and ranking.  For example, what if you search using 5 
> > keywords all starting with letters from different groups?  You're going 
> > to get a pretty bad score for each match, and it could totally break 
> > boolean searches.
> > 
> > --
> > Brian Bray
> > 
> > 
> > 
> > 
> > Brian DeFeyter wrote:
> > 
> >>On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
> >>[snip]
> >>
> >>
> >>>Wouldn't be too tough to write a little query routing system if you are using 
> >>>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
> >>>server so it routes the query to several places and returns a single result 
> >>>set. Ordering could be achieved as well. I'm sure there are commercial 
> >>>packages out there as well. I don't see why the individual database servers 
> >>>would need to do anything special.
> >>>
> >>>
> >>[snip]
> >>
>>If I'm understanding you correctly, I think you're referring to routing
> >>based on the first character of the word. That would work for cases
> >>where the query is searching for a word that begins with a certain
> >>character.. however fulltext searches also return results with the term
> >>in the middle.
> >>
> >>ie: a search for 'foo' could return:
> >>foo.txt
> >>foobar
> >>
> >>but also could return:
> >>thisisfoo
> >>that_is_foolish
> >>
>>I could be wrong, but it's my understanding that MySQL stores its
> >>fulltext index based on all the 'unique words' found. For such a system
> >>as you mentioned above, you'd probably have to create your own fulltext
> >>indexing system to determine: a) where to store the data 'segments' and
> >>b) how to route queries.  It seems like this could probably be done much
> >>more efficiently inside of the server.
> >>
> >> - Brian
> >>
> >>
> >>
> >>-
> >>Before posting, please check:
> >>   http://www.mysql.com/manual.php   (the manual)
> >>   http://lists.mysql.com/   (the list archive)
> >>
> >>To request this thread, e-mail <[EMAIL PROTECTED]>
> >>To unsubscribe, e-mail <[EMAIL PROTECTED]>
> >>Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
> >>
> >>
> >>
> > 
> > 
> > 
> > -
> > Before posting, please check:
> >http://www.mysql.com/manual.php   (the manual)
> >http://lists.mysql.com/   (the list archive)
> > 
> > To request this thread, e-mail <[EMAIL PROTECTED]>
> > To unsubscribe, e-mail <[EMAIL PROTECTED]>
> > Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
> > 
> > 
> > -
> > Before posting, please check:
> >http://www.mysql.com/manual.php   (the manual)
> >http://lists.mysql.com/ 

Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler



Steve Rapaport wrote:

> 
> Someone correctly pointed out today that it's not Mysql's job
> to be Google, and I agree.  But it seems to me that it would be
> fair for mysql to be able to handle searches in under 1 second
> for databases 1 millionth the size of Google.  All I want here
> is a decent flexible phone book lookup, not a search engine.

That may be what you want. But I want a search engine for my database. I want 
to be able to search the million records of inventory in our database and find 
the items with certain keywords that are currently available for sale, in the 
venue the user is shopping in, and within the price range the user specifies. 
This is a combination of database and fulltext search that I think is exactly in 
MySQL's marketplace.

And the current situation where a user fulltext query can take between 1 ms and 
100 secs with no easy way for the application to tell ahead of time is really 
unacceptable.

So I think several things need to be done:

1) Full text search needs to be faster (they are working on it).
2) There needs to be some way (a la EXPLAIN) of figuring out roughly how long a 
particular full text search will take. If I could do

EXPLAIN SELECT * FROM inventory WHERE MATCH (title, description) AGAINST ($userQuery)

and get something useful for predicting search time. Then I could at least say 
something to the user like:

That query matches too many records. Please pick a more specific query.

Rather than hosing the whole database for 3 or 4 common words.

3) Find ways to use replication and memory to speed up full-text queries. See my 
previous mail about heap tables for a possible example.
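MySQL offers no such cost estimate for fulltext queries, but an application can approximate point 2 itself by keeping per-word document counts and refusing queries whose terms would touch too many rows. A rough sketch; the names, numbers, and threshold are illustrative assumptions, not anything MySQL provides:

```python
# Application-level stand-in for the proposed EXPLAIN: keep per-word
# document counts and reject a query whose terms match too many rows.

def estimate_matches(terms, doc_freq):
    """Upper bound on rows a boolean OR of `terms` could touch."""
    return sum(doc_freq.get(t.lower(), 0) for t in terms)

def should_run(terms, doc_freq, max_rows=100_000):
    """True if the query is cheap enough to send to the server."""
    return estimate_matches(terms, doc_freq) <= max_rows

doc_freq = {"the": 900_000, "widget": 1_200}
should_run(["the", "widget"], doc_freq)  # False: too many candidate rows
should_run(["widget"], doc_freq)         # True
```

When `should_run` returns False, the application can show the "pick a more specific query" message instead of hosing the database.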








Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler

My understanding is that part of how google and Altavista get such high speeds 
is to keep everything in memory. Is it possible to create a HEAP table with a 
full text index? If so, does the full text index take advantage of being in 
memory? For example, I would imagine that if you were keeping the whole index in 
memory, details like the index page size, and the format of the pointers/record 
numbers would be different.

Then you could just do something roughly like (I know the syntax is a little off):

CREATE TABLE fooFast TYPE=HEAP SELECT * FROM fooSlow;
ALTER TABLE fooFast ADD FULLTEXT (a, b, c);

Or maybe you could just have fooSlow on one server. And then have it replicated 
on N other servers. But on the other servers you could alter the table type so 
it was a heap table. So you would have one persistent table and a bunch of 
replicated heap tables. And all the searches could go against the heap tables.


Brian Bray wrote:
> It seems to me like the best solution that could be implemented as-is 
> would be to keep a random int column in your table (with a range of say 
> 1-100) and then have fulltext server 1 pseudo-replicate records with 
> the random number in the range of 1-10, server 2 11-20 and server 3 
> 21-30 and so on.
> 
> Then run your query on all 10 servers and merge the result sets and 
> possibly re-sort them if you use the score column.
> 
> The problem with splitting the index up by word is that it messes up all 
> your scoring and ranking.  For example, what if you search using 5 
> keywords, all starting with letters from different groups?  You're going 
> to get a pretty bad score for each match, and it could totally break 
> boolean searches.
> 
> --
> Brian Bray
> 
> 
> 
> 
> Brian DeFeyter wrote:
> 
>>On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
>>[snip]
>>
>>
>>>Wouldn't be too tough to write a little query routing system if you are using 
>>>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
>>>server so it routes the query to several places and returns a single result 
>>>set. Ordering could be achieved as well. I'm sure there are commercial 
>>>packages out there as well. I don't see why the individual database servers 
>>>would need to do anything special.
>>>
>>>
>>[snip]
>>
>>If I'm understanding you correctly, I think you're referring to routing
>>based on the first character of the word. That would work for cases
>>where the query is searching for a word that begins with a certain
>>character.. however fulltext searches also return results with the term
>>in the middle.
>>
>>ie: a search for 'foo' could return:
>>foo.txt
>>foobar
>>
>>but also could return:
>>thisisfoo
>>that_is_foolish
>>
>>I could be wrong, but it's my understanding that MySQL stores its
>>fulltext index based on all the 'unique words' found. For such a system
>>as you mentioned above, you'd probably have to create your own fulltext
>>indexing system to determine: a) where to store the data 'segments' and
>>b) how to route queries.  It seems like this could probably be done much
>>more efficiently inside of the server.
>>
>> - Brian
>>
>>
>>







Re: Distributed Fulltext?

2002-02-13 Thread Steven Roussey

> [comparisons to Google...]

While any speed up with a full table fulltext search would be helpful
and useful, there are instances where the search is intersected with
another column and the problem of search is therefore more complex but
also leads to potential optimizations.

In our case we rarely do searches that use just the FTS index. More
commonly, we do searches based on other columns and a FTS-indexed text
column. The WHERE clause constraints on the non-text columns usually
eliminate 99% of the rows. Fulltext indexes cannot be combined with
numeric columns in a single composite index, so as far as I can tell,
MySQL performs the FTS then filters the results to meet the WHERE
conditions. I believe the amount of the index that must be paged into
main memory is much larger than might otherwise be needed (since the
indexes are larger than available memory, and there are a zillion other
tables accessed simultaneously, this tends to affect all queries).

So besides the obvious desire to lower the size of indexes, it would be
beneficial if a combined index could lower the number of matching rows
and the amount of the index needed to be in memory.

I actually find it odd for people not to be using secondary columns in a
fulltext search, other than to do something like Google (which does not
lend itself to using SQL in the first place).

Of course, I have a myopic view stemming from my own requirements and
experience.

All that said, I'm glad to have the FTS we have. 

Thanks Sergei for such a wonderful improvement to LIKE '%word%'!

Sincerely,
Steven Roussey
http://Network54.com/?pp=e 

PS - is FTS search and update O(n^2)?







Re: Distributed Fulltext?

2002-02-13 Thread David Axmark

On Tue, 2002-02-12 at 15:38, Steve Rapaport wrote:
> David Axmark writes:
> 
> > So the standard answer with Apples and Oranges certainly apply here!
> 
> More like Äpplen och Apelsiner, that is, different but similar.  You Swedish 
> guys should know.  Thanks for answering, David, I appreciate the attention
> from a founder.
> 
> I also appreciate your point that Google is updating continuously and
> therefore not always caught up to the current state of the web.
> But isn't that a problem with indexing speed, not with search speed?
> Their search speed is still amazing, as is Altavista, and most of the
> other engines.
> 
> Your other point about exact vs. approximate answers is unclear, I expect
> that Google's answers are exact for their currently available indexes at any
> given time.  But even if they are approximate, I'd be happy with that too.  
> The scoring on a FULLTEXT search in Mysql is "exact" but based on a
> formula that is approximate anyway.

No, MySQL returns all data matching a search. Web engines return
what they find on one search machine. So you can get different
results with Google every time you hit refresh if you are routed to
different machines. This has happened to me when I was looking for the
number of matches and not the result itself. 

So we should try to make fulltext searches with a LIMIT between 10 and
100 fast, to be closer to Google.

I have also heard about some other things web search engines do, since I
know some people at FAST, but I have forgotten the details.

> I'll summarize this thread best I can.
> 
> From the math I used, we started with my estimate of 10^9,
> which was mistaken.  The real figure was 10^6, that is, Google
> searches fulltext about a million times faster than Mysql.
> Then we used Google's 10,000 machines + DRAM indexing to reduce the
> gap to 10^2, or 100 times faster.  

I would say we should reduce it even further but that could be
discussed.

> It turns out that 100 times is about the factor that is causing
> my application problems.  If it just ran 100 times faster it would be
> about as fast as a regular indexed search, and I'd
> be happy.
> 
> A few people suggested that Mysql shouldn't try to be faster,
> I (and some high-support customers like Mike Wexler) disagreed.
> And Alex Aulbach, bless him, actually did his homework and showed that
> things could be much improved with smart index techniques like
> inverted files.

We will try to make every feature as good as possible. But we do have
limited resources.

> Then Sergei Golubchik wrote back to say he had taken some of the good ideas
> and inserted them into the TODO list, although he had higher priorities
> at the time.
> 
> And I was satisfied for now, although my application still isn't working
> satisfactorily due to a really slow and CPU-hungry FULLTEXT search.

Well, there is always the option of sponsoring further fulltext
development. We have a guy who has been working on the GNU fulltext
engines who is interested in working with MySQL fulltext. But for the
moment we cannot afford it.

So if some of you are interested in sponsoring this (or know about
others who might be) write to [EMAIL PROTECTED]

> I think that's our story so far.
> 
> Steve Rapaport
> Director, Technical Services
> A-Tono






Re: Distributed Fulltext?

2002-02-13 Thread Steve Rapaport

I said:
> > Why is it that Altavista can index terabytes overnight and return
> > a fulltext boolean for the WHOLE WEB
> > within a second, and Mysql takes so long?

On Friday 08 February 2002 08:56, Vincent Stoessel wrote:

> Apples and oranges.

Yeah, I know.  But let's see if we can make some distinctions.
If, say, Google can search 2 trillion web pages, averaging say 70k bytes 
each, in 1 second, and MySQL can search 22 million records, with an index
on 40 bytes each, in 3 seconds (my experience) on a good day,
what's the order of magnitude difference?  Roughly 10^9.

> Have you seen the /hardware/ run that enterprise with?
Irrelevant, you're unlikely to get 9 orders of magnitude difference with
faster hardware or even with clustering.

> Also, their software is optimized for full text searches and that
> is /all/ they do. Mysql is an SQL database and is optimized as such.

Absolutely granted. You are completely right. 
And I don't expect the data format to change.

 BUT:  thought experiment:

When Mysql decides to generate a FULLTEXT index, it
is using an index file, the .MYI file, which can have ANY FORMAT
it wants to.  If the .MYI format is poorly optimized for fulltext
searches, they can improve the format.  They can even introduce
a new index file type .MYF solely for optimizing fulltext searches.

None of this need have any impact on the data file format or the 
SQL search optimizations, and yet it could still improve the search
speed for fulltext.  It might not help as much for the slow indexing,
but it could certainly improve the performance of the search.

Thinking out loud...
Steve






Re: Distributed Fulltext?

2002-02-13 Thread Steve Rapaport

Ooops, factual error:

> > If, say, Google, can search 2 trillion web pages, averaging say 70k
> > bytes each, in 1 second, and Mysql can search 22 million records, with
> > an index on 40 bytes each, in 3 seconds (my experience) on a good day,
> > what's the order of magnitude difference?  Roughly 10^9.


Google has only 2 billion (with a B) web pages.

So the order of magnitude difference is only 10^6.  But that still
leaves a MySQL FULLTEXT search a MILLION times slower than Google.

Someone correctly pointed out today that it's not Mysql's job
to be Google, and I agree.  But it seems to me that it would be
fair for mysql to be able to handle searches in under 1 second
for databases 1 millionth the size of Google.  All I want here
is a decent flexible phone book lookup, not a search engine.
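For the record, the corrected arithmetic checks out: comparing bytes scanned per second, using only the figures quoted in this thread, lands at roughly a factor of 10^6.

```python
import math

# Back-of-envelope check using the figures quoted in this thread.
google_rate = 2e9 * 70e3 / 1.0   # 2 billion pages * ~70 KB each, in 1 s
mysql_rate = 22e6 * 40 / 3.0     # 22 M records * 40-byte index, in 3 s

gap = google_rate / mysql_rate
print(round(math.log10(gap)))    # about 6, i.e. a factor of ~a million
```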

-steve





Re: Distributed Fulltext?

2002-02-13 Thread alec . cawley



> Why is it that Altavista can index terabytes overnight and return
> a fulltext boolean for the WHOLE WEB
> within a second, and Mysql takes so long?

I don't know about Altavista, but if you read up on Google, they do indeed
do some sort of spreading of keywords across multiple machines - last I saw
they had over 10,000 machines.

But you have to ask whether this would be an appropriate development for
MySQL. This seems to me a very complex development, of interest to a very
small subset of MySQL customers. MySQL, it seems to me, specialises in doing
the mainstream database manipulations used by many people simply, quickly,
and (my most important criterion) reliably. Such a development would, IMO,
be away from the whole trend of MySQL, and shouldn't happen before triggers,
stored procedures and the like - if ever. A multi-machine distributed
database is a wholly different animal - IMO again. If I was going there I
wouldn't start from here - it would need a "clean brain" look at the whole
problem. If you really need such searches, Google survives by licensing that
technology - speak to them.

 Alec Cawley








Re: Distributed Fulltext?

2002-02-13 Thread Mike Wexler



Steve Rapaport wrote:
> On Friday 08 February 2002 06:14 pm, James Montebello wrote:
> 
>>Distribution is how Google gets its speed.  You say clustering won't
>>solve the problem, but distributing the indicies across many processors
>>*is* going to gain you a huge speed increase through sheer parallelism.
>>
> 
> True, but not enough.  The BEST parallelism can do  in a compute-bound
> application is divide the time by the number of processors.  That's assuming
> a PERFECT routing system.  (Correct me if I'm wrong here)

There are actually some exceptions. Specifically, if you have a very large 
problem set and parallelism allows you to move the problem set into a faster 
storage medium, you can sometimes see greater performance increases. Let's say 
you are doing a full text search and you have a 20 GB full text index. It may 
not be feasible to build a machine with 20 GB of RAM for storing the index in 
RAM, but it might be more feasible to store 1/20th of the index in each of 20 
1 GB machines. And with RAM being >1000 times as fast as hard disk, you could 
get a huge win.


> 
> So to make the routing system + parallelism add up to 
> a MILLION times better performance, you would need at
> least a MILLION processors.  I doubt that even Google is
> doing that.  
> 
> 
>>Google uses thousands of processors to handle its index, and any given
>>search is going to be spread over 100s of processors.  
>>
> 
> Right, so we can expect Google to do, say, 10,000 times  (10^4) better
> than Mysql at a Fulltext search.  But in fact we're seeing 
> a million, 10^6, being generous.  It's that extra factor
> of a hundred (more likely a thousand, my estimates
> were very generous) that I'm getting all fussy about.
> 
> 
>>Asking
>>a general purpose RDBMS to be really good a 10 different things is asking
>>a bit much.  Fulltext searches are well down the list.
>>
> 
> Here we get out of the realm of hard numbers and into opinions.   But here's
> mine:  If mysql bothers to support fulltext searches, it's presumably because 
> there's some demand for them in some circumstances.  The level of scaling and
> optimization that could reasonably be expected:  What it takes to make the
> feature useful (i.e. perform similarly) in similar cases to other Mysql 
> features.  With a regular index, I can do a hash lookup on 23 million records
> subsecond.  With a fulltext index (on a small field, only 40 characters) my
> time slips to 3 to 180 seconds.  That extra little factor of 100 is my 
> problem.  Distributing over 4 processors wouldn't really help much.  And 
> because people don't always type a company name the exact same way,
> FULLTEXT really is the best way to do this.
> 
> So Monty et al, my request is this:  please put on the wish list, enhancement 
> of FULLTEXT search  to approximately match the performance of an indexed
> search on 25 million records, on the same hardware and with other things
> held equal.
> 
> Steve Rapaport
> 
> 







Re: Distributed Fulltext?

2002-02-13 Thread Tod Harter

On Thursday 07 February 2002 14:53, Brian DeFeyter wrote:
> Has anyone made a suggestion or thought about ways to distribute
> databases which focus on fulltext indexes?
>
> fulltext indexes do a good job of indexing a moderate amount of data,
> but when you get a lot of data to be indexed, the queries slow down
> significantly.
>
> I have an example table, with about 90 million rows.. and has a fulltext
> index on a varchar(100) field. A single-word query which would return
> approx 300k results takes an average of 15 seconds. A query with smaller
> results (~ 10k) can be as quick as 1 sec.. which I would consider
> acceptable.
>
> Has any thought about splitting the data into distributed files or even
> machines? ie: something as simple as 'words' starting with 'X' are split
> into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> automatically split results based on (#results per unique 'word' /
> desired # of 'split files/machines') Would such a system give any
> advantages to searching speed and concurrenct query scalability? I
> haven't looked at the fulltext internals.. so I don't know if such
> "query routing" could take place or not.
>
> If nothing else, does anyone else have experience with a table of this
> size or even larger? What kind of tuning have you done?
>
> Thanks,
>
>  - Brian

Wouldn't be too tough to write a little query routing system if you are using 
perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
server so it routes the query to several places and returns a single result 
set. Ordering could be achieved as well. I'm sure there are commercial 
packages out there as well. I don't see why the individual database servers 
would need to do anything special.





Re: Distributed Fulltext?

2002-02-13 Thread Brian Bray


It seems to me like the best solution that could be implemented as-is 
would be to keep a random int column in your table (with a range of say 
1-100) and then have fulltext server 1 pseudo-replicate records with the 
random number in the range of 1-10, server 2 11-20, server 3 21-30, 
and so on.

Then run your query on all 10 servers and merge the result sets and 
possibly re-sort them if you use the score column.

The problem with splitting the index up by word is that it messes up all 
your scoring and ranking.  For example, what if you search using 5 
keywords, all starting with letters from different groups?  You're going 
to get a pretty bad score for each match, and it could totally break 
boolean searches.
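A minimal sketch of the scheme just described, assuming ten fulltext servers and (score, row_id) result tuples; the bucket-to-server mapping and the result shape are illustrative assumptions:

```python
import random

NUM_SERVERS = 10

def assign_bucket():
    """Random 1-100 value stored with the row at insert time."""
    return random.randint(1, 100)

def server_for(bucket):
    """Buckets 1-10 go to server 0, 11-20 to server 1, and so on."""
    return (bucket - 1) * NUM_SERVERS // 100

def merge_results(per_server_hits):
    """Combine (score, row_id) lists from every server, best score first."""
    merged = [hit for hits in per_server_hits for hit in hits]
    return sorted(merged, key=lambda h: h[0], reverse=True)

hits = merge_results([[(0.9, "a"), (0.2, "b")], [(0.7, "c")]])
# hits == [(0.9, 'a'), (0.7, 'c'), (0.2, 'b')]
```

Because each server holds a random slice of the rows rather than a slice of the vocabulary, every server scores every query term, which is what keeps the ranking intact.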

--
Brian Bray




Brian DeFeyter wrote:
> On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
> [snip]
> 
>>Wouldn't be too tough to write a little query routing system if you are using 
>>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
>>server so it routes the query to several places and returns a single result 
>>set. Ordering could be achieved as well. I'm sure there are commercial 
>>packages out there as well. I don't see why the individual database servers 
>>would need to do anything special.
>>
> [snip]
> 
> If I'm understanding you correctly, I think you're referring to routing
> based on the first character of the word. That would work for cases
> where the query is searching for a word that begins with a certain
> character.. however fulltext searches also return results with the term
> in the middle.
> 
> ie: a search for 'foo' could return:
> foo.txt
> foobar
> 
> but also could return:
> thisisfoo
> that_is_foolish
> 
> I could be wrong, but it's my understanding that MySQL stores its
> fulltext index based on all the 'unique words' found. For such a system
> as you mentioned above, you'd probably have to create your own fulltext
> indexing system to determine: a) where to store the data 'segments' and
> b) how to route queries.  It seems like this could probably be done much
> more efficiently inside of the server.
> 
>  - Brian
> 
> 
> 
> 
> 







Re: Distributed Fulltext?

2002-02-13 Thread James Montebello


I did this at a previous job, and we split the data up more or less
this way (we used a pre-existing item number for the split, which was
essentially random in relation to the text data), with an aggregator that
did the query X ways, each to a separate box holding 1/X of the data.
The results from each unit were paged and sorted, so all the aggregator
did was a simple merge sort on a "page" of the set, which was fast.
On a 6M record dataset, it produced millisecond-range search results.
Not exactly Google-class, but pretty good for 12 Linux boxes, two
programmers, and about six weeks of effort.
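The aggregator described above can be sketched as a standard k-way merge, assuming each shard returns one page of (score, row_id) hits already sorted by descending score; the shapes and names are my assumptions, not the original code:

```python
import heapq

def aggregate_page(shard_pages, page_size):
    """shard_pages: per-shard lists of (score, row_id), each already
    sorted by descending score.  Returns the top page_size hits overall."""
    merged = heapq.merge(*shard_pages, key=lambda h: h[0], reverse=True)
    return [hit for _, hit in zip(range(page_size), merged)]

page = aggregate_page(
    [[(0.9, "x"), (0.4, "y")], [(0.8, "p")], [(0.5, "q"), (0.1, "r")]],
    page_size=3)
# page == [(0.9, 'x'), (0.8, 'p'), (0.5, 'q')]
```

Since each shard pre-sorts its own page, the aggregator only touches page_size items per request, which is why the merge step stays in the millisecond range.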

james montebello

On Thu, 7 Feb 2002, Brian Bray wrote:

> 
> It seems to me like the best solution that could be implemented as-is 
> would be to keep a random int column in your table (with a range of say 
> 1-100) and then have fulltext server 1 pseudo-replicate records with 
> the random number in the range of 1-10, server 2 11-20 and server 3 
> 21-30 and so on.
> 
> Then run your query on all 10 servers and merge the result sets and 
> possibly re-sort them if you use the score column.
> 
> The problem with splitting the index up by word is that it messes up all 
> your scoring and ranking.  For example, what if you search using 5 
> keywords, all starting with letters from different groups?  You're going 
> to get a pretty bad score for each match, and it could totally break 
> boolean searches.
> 
> --
> Brian Bray
> 
> 
> 
> 
> Brian DeFeyter wrote:
> > On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
> > [snip]
> > 
> >>Wouldn't be too tough to write a little query routing system if you are using 
> >>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
> >>server so it routes the query to several places and returns a single result 
> >>set. Ordering could be achieved as well. I'm sure there are commercial 
> >>packages out there as well. I don't see why the individual database servers 
> >>would need to do anything special.
> >>
> > [snip]
> > 
> > If I'm understanding you correctly, I think you're referring to routing
> > based on the first character of the word. That would work for cases
> > where the query is searching for a word that begins with a certain
> > character.. however fulltext searches also return results with the term
> > in the middle.
> > 
> > ie: a search for 'foo' could return:
> > foo.txt
> > foobar
> > 
> > but also could return:
> > thisisfoo
> > that_is_foolish
> > 
> > I could be wrong, but it's my understanding that MySQL stores its
> > fulltext index based on all the 'unique words' found. For such a system
> > as you mentioned above, you'd probably have to create your own fulltext
> > indexing system to determine: a) where to store the data 'segments' and
> > b) how to route queries.  It seems like this could probably be done much
> > more efficiently inside of the server.
> > 
> >  - Brian
> > 
> > 
> > 
> > -
> > Before posting, please check:
> >http://www.mysql.com/manual.php   (the manual)
> >http://lists.mysql.com/   (the list archive)
> > 
> > To request this thread, e-mail <[EMAIL PROTECTED]>
> > To unsubscribe, e-mail <[EMAIL PROTECTED]>
> > Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
> > 
> > 
> 
> 
> 








Re: Distributed Fulltext?

2002-02-12 Thread Steve Rapaport

David Axmark writes:

> So the standard answer with Apples and Oranges certainly applies here!

More like Äpplen och Apelsiner, that is, different but similar.  You Swedish 
guys should know.  Thanks for answering, David, I appreciate the attention
from a founder.

I also appreciate your point that Google is updating continuously and
therefore not always caught up to the current state of the web.
But isn't that a problem with indexing speed, not with search speed?
Their search speed is still amazing, as is Altavista, and most of the
other engines.

Your other point about exact vs. approximate answers is unclear, I expect
that Google's answers are exact for their currently available indexes at any
given time.  But even if they are approximate, I'd be happy with that too.  
The scoring on a FULLTEXT search in Mysql is "exact" but based on a
formula that is approximate anyway.

I'll summarize this thread as best I can.

From the math I used, we started with my estimate of 10^9,
which was mistaken.  The real figure was 10^6; that is, Google
searches fulltext about a million times faster than Mysql.
Then we used Google's thousands of machines plus DRAM indexing to reduce
the gap to 10^2, or 100 times faster.

It turns out that 100 times is about the factor that is causing
my application problems.  If it just ran 100 times faster it would be
about as fast as a regular indexed search, and I'd
be happy.

A few people suggested that Mysql shouldn't try to be faster;
I (and some high-support customers like Mike Wexler) disagreed.
And Alex Aulbach, bless him, actually did his homework and showed that
things could be much improved with smart index techniques like
inverted files.

Then Sergei Golubchik wrote back to say he had taken some of the good ideas
and inserted them into the TODO list, although he had higher priorities
at the time.

And I was satisfied for now, although my application still isn't working
satisfactorily due to a really slow and CPU-hungry FULLTEXT search.

I think that's our story so far.

Steve Rapaport
Director, Technical Services
A-Tono






Re: Distributed Fulltext?

2002-02-12 Thread David Axmark

On Fri, 2002-02-08 at 11:11, Steve Rapaport wrote:
> I said:
> > > Why is it that Altavista can index terabytes overnight and return
> > > a fulltext boolean for the WHOLE WEB
> > > within a second, and Mysql takes so long?
> 
> On Friday 08 February 2002 08:56, Vincent Stoessel wrote:
> 
> > Apples and oranges.
> 
> Yeah, I know.  But let's see if we can make some distinctions.
> If, say, Google, can search 2 trillion web pages, averaging say 70k bytes 
> each, in 1 second, and Mysql can search 22 million records, with an index
> on 40 bytes each, in 3 seconds (my experience) on a good day,
> what's the order of magnitude difference?  Roughly 10^9.
> 
> > Have you seen the /hardware/ run that enterprise with?
> Irrelevant, you're unlikely to get 9 orders of magnitude difference with
> faster hardware or even with clustering.

Actually not, since Google runs on thousands of machines. And if you
search for a while you will notice that you get different answers (at
least I have had trouble with this).

Also I have used Google to track occurrences of certain words over time.
And according to Google some words (with millions of occurrences)
changed by about 40% in a week. So they simply provide approximate
answers while MySQL has to provide exact answers.

So the standard answer with Apples and Oranges certainly applies here!

/David






Re: Distributed Fulltext?

2002-02-12 Thread Steve Rapaport





On Friday 08 February 2002 06:14 pm, James Montebello wrote:
> Distribution is how Google gets its speed.  You say clustering won't
> solve the problem, but distributing the indicies across many processors
> *is* going to gain you a huge speed increase through sheer parallelism.

True, but not enough.  The BEST parallelism can do  in a compute-bound
application is divide the time by the number of processors.  That's assuming
a PERFECT routing system.  (Correct me if I'm wrong here)

So to make the routing system + parallelism add up to 
a MILLION times better performance, you would need at
least a MILLION processors.  I doubt that even Google is
doing that.  

> Google uses thousands of processors to handle its index, and any given
> search is going to be spread over 100s of processors.  

Right, so we can expect Google to do, say, 10,000 times  (10^4) better
than Mysql at a Fulltext search.  But in fact we're seeing 
a million, 10^6, being generous.  It's that extra factor
of a hundred (more likely a thousand, my estimates
were very generous) that I'm getting all fussy about.
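The arithmetic in the two paragraphs above can be laid out explicitly; the factors are the estimates from this thread, not measured values:

```python
# Observed gap between Google and a single-box Mysql FULLTEXT search,
# per the revised estimate earlier in this thread (10^9 corrected to 10^6).
observed_gap = 10**6

# Best-case gain from perfect parallelism over ~10,000 processors:
# divide the time by the number of processors, no better.
parallel_gain = 10**4

# Residual factor that parallelism alone cannot explain.
residual = observed_gap // parallel_gain
```

That residual of two orders of magnitude is the "extra factor of a hundred" being argued about.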

> Asking
> a general purpose RDBMS to be really good at 10 different things is asking
> a bit much.  Fulltext searches are well down the list.

Here we get out of the realm of hard numbers and into opinions.   But here's
mine:  If mysql bothers to support fulltext searches, it's presumably because 
there's some demand for them in some circumstances.  The level of scaling and
optimization that could reasonably be expected is what it takes to make the
feature useful (i.e. perform similarly) in cases comparable to other Mysql
features.  With a regular index, I can do a hash lookup on 23 million records
subsecond.  With a fulltext index (on a small field, only 40 characters) my
time slips to 3 to 180 seconds.  That extra little factor of 100 is my 
problem.  Distributing over 4 processors wouldn't really help much.  And 
because people don't always type a company name the exact same way,
FULLTEXT really is the best way to do this.

So Monty et al, my request is this:  please put on the wish list, enhancement 
of FULLTEXT search  to approximately match the performance of an indexed
search on 25 million records, on the same hardware and with other things
held equal.

Steve Rapaport







Re: Distributed Fulltext?

2002-02-12 Thread George M. Ellenburg

> Last week on Slashdot there was an article where the CEO of Google mentioned he
> uses DRAM (solid state disk arrays) rather than hard drives for the indexes and
> arrays because of the magnitude of difference in speed they provide.
>
> There's your 10^6 difference in speed (or part of it).
>
> G.
> Google Articles on Slashdot:
> http://slashdot.org/article.pl?sid=02/02/03/1339216&mode=thread









Re: Distributed Fulltext?

2002-02-12 Thread James Montebello


For the slice servers, you simply assume that if one is lost, you lose X%
of the data until it is revived, which is usually not even noticeable by
the end user.  For the aggregators, we had four behind a load-balancer.
In practice, we had nearly zero downtime over a roughly 18 month period.

james montebello

On 7 Feb 2002, Amir Aliabadi wrote:

> How do you make something like this fault tolerant?
> The answer is probably what I suspect, 2 of every thing.
> How does the aggregator handle this or are these machines in a cluster?
> 
> We are thinking of how to rebuild our fulltext search.  Currently it is
> in MS SQL 7.0 -  MySQL 4.0 seems to blow the doors off the cataloging
> time as compared to MS SQL 7.0 or even 8.0.
> 
> 
> On Thu, 2002-02-07 at 15:19, James Montebello wrote:
> > 
> > I did this at a previous job, and we split the data up more or less
> > this way (we used a pre-existing item number for the split which was
> > essentially random in relation to the text data), with an aggregator that
> > did the query X ways, each to a separate box holding 1/X of the data.
> > The results from each unit were paged and sorted, so all the aggregator
> > did was do a simple merge sort on a "page" of the set, which was fast.
> > On a 6M record dataset, it produced millisecond-range search results.
> > Not exactly Google-class, but pretty good for 12 Linux boxes, two
> > programmers, and about six weeks of effort.
> > 
> > james montebello
> > 
> > On Thu, 7 Feb 2002, Brian Bray wrote:
> > 
> > > 
> > > It seems to me like the best solution that could be implemented as-is 
> > > would be to keep a random int column in your table (with a range of, say, 
> > > 1-100) and then have fulltext server 1 pseudo-replicate records with the 
> > > random number in the range of 1-10, server 2 11-20, server 3 21-30, 
> > > and so on.
> > > 
> > > Then run your query on all 10 servers and merge the result sets and 
> > > possibly re-sort them if you use the score column.
> > > 
> > > The problem with splitting the index up by word is that it messes up all 
> > > your scoring and ranking.  For example, what if you search using 5 
> > > keywords, all starting with letters from different groups?  You're going 
> > > to get a pretty bad score for each match, and it could totally break 
> > > boolean searches.
> > > 
> > > --
> > > Brian Bray
> > > 
> > > 
> > > 
> > > 
> > > Brian DeFeyter wrote:
> > > > On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
> > > > [snip]
> > > > 
> > > >>Wouldn't be too tough to write a little query routing system if you are using 
> > > >>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
> > > >>server so it routes the query to several places and returns a single result 
> > > >>set. Ordering could be achieved as well. I'm sure there are commercial 
> > > >>packages out there as well. I don't see why the individual database servers 
> > > >>would need to do anything special.
> > > >>
> > > > [snip]
> > > > 
> > > > If I'm understanding you correctly, I think you're referring to routing
> > > > based on the first character of the word. That would work for cases
> > > > where the query is searching for a word that begins with a certain
> > > > character.. however fulltext searches also return results with the term
> > > > in the middle.
> > > > 
> > > > ie: a search for 'foo' could return:
> > > > foo.txt
> > > > foobar
> > > > 
> > > > but also could return:
> > > > thisisfoo
> > > > that_is_foolish
> > > > 
> > > > I could be wrong, but it's my understanding that MySQL stores its
> > > > fulltext index based on all the 'unique words' found. For such a system
> > > > as you mentioned above, you'd probably have to create your own fulltext
> > > > indexing system to determine: a) where to store the data 'segments' and
> > > > b) how to route queries.  It seems like this could probably be done much
> > > > more efficiently inside of the server.
> > > > 
> > > >  - Brian
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 

Re: Distributed Fulltext?

2002-02-12 Thread Alex Aulbach

Yesterday, from Brian DeFeyter:

> Has anyone made a suggestion or thought about ways to distribute
> databases which focus on fulltext indexes?
>
> fulltext indexes do a good job of indexing a moderate amount of data,
> but when you get a lot of data to be indexed, the queries slow down
> significantly.

Yeah, that is because the method used to search the index is a geometric
algorithm.

> I have an example table, with about 90 million rows.. and has a fulltext
> index on a varchar(100) field. A single-word query which would return
> approx 300k results takes an average of 15 seconds. A query with smaller
> results (~ 10k) can be as quick as 1 sec.. which I would consider
> acceptable.

The interesting things are how many words have to be indexed (how many
rows is not very important), how big your index grows (does it fit in
memory?), and how many rows can be found for one word.

These are the factors that matter most.

> Has any thought about splitting the data into distributed files or even
> machines? ie: something as simple as 'words' starting with 'X' are split
> into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> automatically split results based on (#results per unique 'word' /
> desired # of 'split files/machines') Would such a system give any
> advantages to searching speed and concurrenct query scalability? I
> haven't looked at the fulltext internals.. so I don't know if such
> "query routing" could take place or not.

Hmm, I think it is *much* cheaper to come together and pay the mySQL
people to introduce a new feature into mySQL called "inverted files". In
short, this method stores only each word and the records in which it
can be found. This can reduce the size of the indexes, and therefore
improve speed, dramatically.
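A toy illustration of the inverted-file idea: per word, store only the list of record ids containing it. This is a sketch of the data structure, not of MySQL internals; the sample records are made up:

```python
from collections import defaultdict

def build_inverted_index(records):
    """Map each word to the set of record ids it occurs in."""
    index = defaultdict(set)
    for rid, text in records.items():
        for word in text.lower().split():
            index[word].add(rid)
    return index

def lookup(index, word):
    """One dictionary probe finds the posting list; no row scan needed."""
    return sorted(index.get(word.lower(), set()))

records = {
    1: "fulltext search in MySQL",
    2: "distributed fulltext index",
    3: "inverted files",
}
idx = build_inverted_index(records)
```

The index size now scales with the vocabulary and posting lists rather than with the raw text, which is where the claimed size and speed win comes from.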

> If nothing else, does anyone else have experience with a table of this
> size or even larger? What kind of tuning have you done?

We have, for example, made an extra table for indexing on an extra server.
This is good because

- mySQL can "concentrate" on only this table, so the
machine doesn't need to swap

- big searches or reindexing don't bother the rest of the system

- indexing can be done via a cron job

- we optimized the write routines; for example, we strip all tags out
and apply special stop-word lists before writing.
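The write-side preprocessing mentioned above (tag stripping and stop-lists applied before the text reaches the index table) might look roughly like this; the stop-word list and tokenizer are illustrative only:

```python
import re

STOP_WORDS = {"the", "a", "and", "of"}  # illustrative stop-list

def preprocess(html_text):
    """Strip tags and stop-words before writing to the index table."""
    text = re.sub(r"<[^>]+>", " ", html_text)       # drop HTML tags
    words = re.findall(r"[a-z0-9]+", text.lower())  # crude tokenizer
    return [w for w in words if w not in STOP_WORDS]

cleaned = preprocess("<b>The</b> fulltext index of a <i>large</i> table")
```

Doing this once at write time keeps junk words out of the index entirely, so every later search pays nothing for them.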



BTW: My wishlist for fulltext indexing:
---

- inverted files

- rules to define words

- stop-word-lists and stop-word-regex

- a function which returns a table, which can tell me, what can be
searched.
E.g SHOW MATCH(table.index) AGAINST('hugobla hugo bla');

 word     count  rows  stopword  autostopword  score  wscore
 hugobla    2     2      no           no         10      1
 hugo      10     6      no           no          8      1
 bla        0     0      no          yes          0      1

Words can be searched, depending on rules, stop-words and what has been
indexed.

- configurable scoring, e.g. very fast (and simple) scoring for speed

- special scores for special words

- a cache, which stores often used words and the rows
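The introspection function wished for above could, in spirit, return rows like the example table; here is a hypothetical sketch (column semantics are guessed from that example, and the sample index is made up):

```python
def show_match(index, stop_words, query):
    """Per query word: posting-list size, whether it is a stop-word,
    and whether it was indexed at all (cf. the example table above)."""
    rows = []
    for word in query.lower().split():
        postings = index.get(word, set())
        rows.append({
            "word": word,
            "rows": len(postings),
            "stopword": word in stop_words,
            "indexed": word in index,
        })
    return rows

index = {"hugobla": {1, 2}, "hugo": {1, 2, 3, 4, 5, 6}}
stats = show_match(index, stop_words={"bla"}, query="hugobla hugo bla")
```

Such a report lets an application explain *why* a term returned nothing (stop-word vs. simply absent) before running the real search.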


-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris








Re: Distributed Fulltext?

2002-02-12 Thread Brian DeFeyter


> On Friday 08 February 2002 08:56, Vincent Stoessel wrote:
>
>> Apples and oranges.
>
> Yeah, I know.  But let's see if we can make some distinctions.
> If, say, Google, can search 2 trillion web pages, averaging say 70k 
> bytes each, in 1 second, and Mysql can search 22 million records, with 
> an index on 40 bytes each, in 3 seconds (my experience) on a good day,
> what's the order of magnitude difference?  Roughly 10^9.
>
>> Have you seen the /hardware/ run that enterprise with?
> Irrelevant, you're unlikely to get 9 orders of magnitude difference with
> faster hardware or even with clustering.

Google runs on cheap pc hardware with local IDE disk... nothing fancy. The 
majority of the speed comes from "sharding" the data to distributed 
indexes which are optimized for what they do, and having a frontend that
knows where to route those requests.

>> Also, their software is optimized for full text searches and that
>> is /all/ they do. Mysql is an SQL database and is optimized as such. 
> Absolutely granted. You are completely right.
> And I don't expect the data format to change.
>
>  BUT:  thought experiment: 
>
> When Mysql decides to generate a FULLTEXT index, it is using an index 
> file, the .MYI file, which can have ANY FORMAT it wants to.  If the .MYI 
> format is poorly optimized for fulltext searches, they can improve the 
> format.  They can even introduce a new index file type .MYF solely for 
> optimizing fulltext searches.

Agreed... however I don't know if that's the case either. Anyone 
(developers?) know enough about the internals to comment on such an 
approach? It seems if you know your goal is fulltext speed.. things could 
probably be done better.

> Thinking out loud...
> Steve









Re: Distributed Fulltext?

2002-02-11 Thread Steve Rapaport

Also, I have to ask the question:  

Why is it that Altavista can index terabytes overnight and return 
a fulltext boolean for the WHOLE WEB 
within a second, and Mysql takes so long? 

On Friday 08 February 2002 11:50, Steve Rapaport wrote:
> I second the question.  It could also reduce the size of the
> fulltext index and the time taken to update it.
>
> -steve
>
> > On Thursday 07 February 2002 20:53, Brian  wrote:
> > > Has anyone made a suggestion or thought about ways to distribute
> > > databases which focus on fulltext indexes?
> > >
> > > fulltext indexes do a good job of indexing a moderate amount of data,
> > > but when you get a lot of data to be indexed, the queries slow down
> > > significantly.
> > >
> > > I have an example table, with about 90 million rows.. and has a
> > > fulltext index on a varchar(100) field. A single-word query which would
> > > return approx 300k results takes an average of 15 seconds. A query with
> > > smaller results (~ 10k) can be as quick as 1 sec.. which I would
> > > consider acceptable.
> > >
> > > Has any thought about splitting the data into distributed files or even
> > > machines? ie: something as simple as 'words' starting with 'X' are
> > > split into a-h, i-p, q-z... or something more advanced? (maybe mysqld
> > > could automatically split results based on (#results per unique 'word'
> > > / desired # of 'split files/machines') Would such a system give any
> > > advantages to searching speed and concurrenct query scalability? I
> > > haven't looked at the fulltext internals.. so I don't know if such
> > > "query routing" could take place or not.
> > >
> > > If nothing else, does anyone else have experience with a table of this
> > > size or even larger? What kind of tuning have you done?
>







Re: Distributed Fulltext?

2002-02-11 Thread Brian DeFeyter

On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]
> Wouldn't be too tough to write a little query routing system if you are using 
> perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
> server so it routes the query to several places and returns a single result 
> set. Ordering could be achieved as well. I'm sure there are commercial 
> packages out there as well. I don't see why the individual database servers 
> would need to do anything special.
[snip]

If I'm understanding you correctly, I think you're referring to routing
based on the first character of the word. That would work for cases
where the query is searching for a word that begins with a certain
character.. however fulltext searches also return results with the term
in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its
fulltext index based on all the 'unique words' found. For such a system
as you mentioned above, you'd probably have to create your own fulltext
indexing system to determine: a) where to store the data 'segments' and
b) how to route queries.  It seems like this could probably be done much
more efficiently inside of the server.
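Since the index is keyed on unique words, a routing layer would have to partition the vocabulary rather than the raw text. A hedged sketch of hash-based word routing (the server count and hash are arbitrary choices for illustration):

```python
NUM_SERVERS = 3

def server_for(word):
    """Deterministically route a word to the server holding its posting list.

    Hashing the whole word (rather than its first letter) keeps the
    partition balanced and independent of how the word appears in a row.
    """
    return sum(word.encode()) % NUM_SERVERS  # stable toy hash

def route_query(words):
    """Group query terms by the server that must answer for them."""
    plan = {}
    for w in words:
        plan.setdefault(server_for(w.lower()), []).append(w)
    return plan

plan = route_query(["foo", "foobar", "foolish"])
```

Note this sidesteps the first-letter problem above, but it still splits multi-word scoring across servers, which is the objection raised earlier in the thread.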

 - Brian









Re: Distributed Fulltext?

2002-02-11 Thread Steve Rapaport


I second the question.  It could also reduce the size of the
fulltext index and the time taken to update it.

-steve

> On Thursday 07 February 2002 20:53, Brian  wrote:
> > Has anyone made a suggestion or thought about ways to distribute
> > databases which focus on fulltext indexes?
> >
> > fulltext indexes do a good job of indexing a moderate amount of data,
> > but when you get a lot of data to be indexed, the queries slow down
> > significantly.
> >
> > I have an example table, with about 90 million rows.. and has a fulltext
> > index on a varchar(100) field. A single-word query which would return
> > approx 300k results takes an average of 15 seconds. A query with smaller
> > results (~ 10k) can be as quick as 1 sec.. which I would consider
> > acceptable.
> >
> > Has any thought about splitting the data into distributed files or even
> > machines? ie: something as simple as 'words' starting with 'X' are split
> > into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> > automatically split results based on (#results per unique 'word' /
> > desired # of 'split files/machines') Would such a system give any
> > advantages to searching speed and concurrenct query scalability? I
> > haven't looked at the fulltext internals.. so I don't know if such
> > "query routing" could take place or not.
> >
> > If nothing else, does anyone else have experience with a table of this
> > size or even larger? What kind of tuning have you done?















> > > >>server so it routes the query to several places and returns a single result 
> > > >>set. Ordering could be achieved as well. I'm sure there are commercial 
> > > >>packages out there as well. I don't see why the individual database servers 
> > > >>would need to do anything special.
> > > >>
> > > > [snip]
> > > > 
> > > > If I'm understanding you correctly, I think you're refering to routing
> > > > based on the first character of the word. That would work for cases
> > > > where the query is searching for a word that begins with a certain
> > > > character.. however fulltext searches also return results with the term
> > > > in the middle.
> > > > 
> > > > ie: a search for 'foo' could return:
> > > > foo.txt
> > > > foobar
> > > > 
> > > > but also could return:
> > > > thisisfoo
> > > > that_is_foolish
> > > > 
> > > > I could be wrong, but it's my understanding that MySQL stores it's
> > > > fulltext index based on all the 'unique words' found. For such a system
> > > > as you mentioned above, you'd probably have to create your own fulltext
> > > > indexing system to determine: a) where to store the data 'segments' and
> > > > b) how to route queries.  It seems like this could probably be done much
> > > > more efficiently inside of the server.
> > > > 
> > > >  - Brian
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 

Re: Distributed Fulltext?

2002-02-08 Thread Steve Rapaport

Ooops, factual error:

> > If, say, Google, can search 2 trillion web pages, averaging say 70k
> > bytes each, in 1 second, and Mysql can search 22 million records, with
> > an index on 40 bytes each, in 3 seconds (my experience) on a good day,
> > what's the order of magnitude difference?  Roughly 10^9.


Google has only 2 Billion with a B web pages.

So the order of magnitude difference is only 10^6.  But that still
leaves a mysql Fulltext search a MILLION times slower than Google.

Someone correctly pointed out today that it's not Mysql's job
to be Google, and I agree.  But it seems to me that it would be
fair for mysql to be able to handle searches in under 1 second
for databases 1 millionth the size of Google.  All I want here
is a decent flexible phone book lookup, not a search engine.

-steve





Re: Distributed Fulltext?

2002-02-08 Thread Alex Aulbach

Yesterday, from Brian DeFeyter:

> Has anyone made a suggestion or thought about ways to distribute
> databases which focus on fulltext indexes?
>
> fulltext indexes do a good job of indexing a moderate amount of data,
> but when you get a lot of data to be indexed, the queries slow down
> significantly.

Yeah, this is because the method used to search the index is a geometric
algorithm.

> I have an example table, with about 90 million rows.. and has a fulltext
> index on a varchar(100) field. A single-word query which would return
> approx 300k results takes an average of 15 seconds. A query with smaller
> results (~ 10k) can be as quick as 1 sec.. which I would consider
> acceptable.

The interesting things are only how many words have to be indexed (how many
rows is not very important), how big your index grows (does it fit in
memory), and how many rows can be found for one word.

These are the factors that matter most.

> Has any thought about splitting the data into distributed files or even
> machines? ie: something as simple as 'words' starting with 'X' are split
> into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> automatically split results based on (#results per unique 'word' /
> desired # of 'split files/machines') Would such a system give any
> advantages to searching speed and concurrenct query scalability? I
> haven't looked at the fulltext internals.. so I don't know if such
> "query routing" could take place or not.

Hmm, I think it's *much* cheaper to come together and pay the mySQL
people to introduce a new feature into mySQL called "inverted files". In
short, with this method you store only each word and the records in which
it can be found. This can reduce the size of indexes, and so the speed,
dramatically.
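The inverted-file idea can be sketched in a few lines: keep one entry per distinct word, mapping to the set of record ids that contain it. This is a toy illustration of the data structure, not MySQL's actual index format.

```python
# Toy inverted index: {word -> set of row ids containing that word}.
from collections import defaultdict

def build_inverted_index(rows):
    """rows: {row_id: text}. Returns {word: set(row_ids)}."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.lower().split():
            index[word].add(row_id)
    return index

def search(index, word):
    """Look up one word; unknown words return the empty set."""
    return index.get(word.lower(), set())

rows = {1: "hugo bla", 2: "hugobla", 3: "bla hugo hugo"}
idx = build_inverted_index(rows)
print(sorted(search(idx, "hugo")))  # -> [1, 3]
```

The space saving comes from storing each word once no matter how often it occurs, with only the (compact) posting list of record ids growing per occurrence.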

> If nothing else, does anyone else have experience with a table of this
> size or even larger? What kind of tuning have you done?

For example, we made an extra table for indexing on a separate server.
This is good because:

- mySQL can "concentrate" only on this table, so the
machine doesn't need to swap

- big searches or reindexing don't bother the rest of the system

- indexing can be done via a cron job

- we optimized the write routines; for example, we strip all tags out
and apply special stop-word lists before writing.



BTW: My wishlist for fulltext indexing:
---

- inverted files

- rules to define words

- stop-word-lists and stop-word-regex

- a function which returns a table that can tell me what can be
searched.
E.g. SHOW MATCH(table.index) AGAINST('hugobla hugo bla');

 word     count  rows  stopword  autostopword  score  wscore
 hugobla      2     2  no        no               10       1
 hugo        10     6  no        no                8       1
 bla          0     0  no        yes               0       1

Words can be searched, depending on rules, stop-words and what has been
indexed.

- configurable scoring, e.g. very fast (and simple) scoring for speed

- special scores for special words

- a cache, which stores often used words and the rows


-- 

SSilk - Alexander Aulbach - Herbipolis/Frankonia Minoris






Re: Distributed Fulltext?

2002-02-08 Thread Steve Rapaport

I said:
> > Why is it that Altavista can index terabytes overnight and return
> > a fulltext boolean for the WHOLE WEB
> > within a second, and Mysql takes so long?

On Friday 08 February 2002 08:56, Vincent Stoessel wrote:

> Apples and oranges.

Yeah, I know.  But let's see if we can make some distinctions.
If, say, Google, can search 2 trillion web pages, averaging say 70k bytes 
each, in 1 second, and Mysql can search 22 million records, with an index
on 40 bytes each, in 3 seconds (my experience) on a good day,
what's the order of magnitude difference?  Roughly 10^9.

> Have you seen the /hardware/ run that enterprise with?
Irrelevant, you're unlikely to get 9 orders of magnitude difference with
faster hardware or even with clustering.

> Also, their software is optimized for full text searches and that
> is /all/ they do. Mysql is an SQL database and is optimized as such.

Absolutely granted. You are completely right. 
And I don't expect the data format to change.

 BUT:  thought experiment:

When Mysql decides to generate a FULLTEXT index, it
is using an index file, the .MYI file, which can have ANY FORMAT
it wants to.  If the .MYI format is poorly optimized for fulltext
searches, they can improve the format.  They can even introduce
a new index file type .MYF solely for optimizing fulltext searches.

None of this need have any impact on the data file format or the 
SQL search optimizations, and yet it could still improve the search
speed for fulltext.  It might not help as much for the slow indexing,
but it could certainly improve the performance of the search.

Thinking out loud...
Steve






Re: Distributed Fulltext?

2002-02-08 Thread alec . cawley



> Why is it that Altavista can index terabytes overnight and return
> a fulltext boolean for the WHOLE WEB
> within a second, and Mysql takes so long?

I don't know about Altavista, but if you read up on Google, they do indeed
do some sort of spreading of keywords across multiple machines - last I saw
they had over 10,000 machines.

But you have to ask whether this would be an appropriate development for
MySQL. It seems to me to be a very complex development of interest to a very
small subset of MySQL customers. MySQL, it seems to me, specialises in doing
the mainstream database manipulations used by many people simply, quickly,
and (my most important criterion) reliably. Such a development would, IMO,
be away from the whole trend of MySQL, and shouldn't happen before triggers,
stored procedures and the like - if ever. A multi-machine distributed
database is a wholly different animal - IMO again. If I was going there I
wouldn't start from here - it would need a "clean brain" look at the whole
problem. If you really need such searches, Google survives by licensing that
technology - speak to them.

 Alec Cawley








Re: Distributed Fulltext?

2002-02-07 Thread Steve Rapaport

Also, I have to ask the question:  

Why is it that Altavista can index terabytes overnight and return 
a fulltext boolean for the WHOLE WEB 
within a second, and Mysql takes so long? 

On Friday 08 February 2002 11:50, Steve Rapaport wrote:
> I second the question.  It could also reduce the size of the
> fulltext index and the time taken to update it.
>
> -steve
>
> > On Thursday 07 February 2002 20:53, Brian  wrote:
> > > Has anyone made a suggestion or thought about ways to distribute
> > > databases which focus on fulltext indexes?
> > >
> > > fulltext indexes do a good job of indexing a moderate amount of data,
> > > but when you get a lot of data to be indexed, the queries slow down
> > > significantly.
> > >
> > > I have an example table, with about 90 million rows.. and has a
> > > fulltext index on a varchar(100) field. A single-word query which would
> > > return approx 300k results takes an average of 15 seconds. A query with
> > > smaller results (~ 10k) can be as quick as 1 sec.. which I would
> > > consider acceptable.
> > >
> > > Has any thought about splitting the data into distributed files or even
> > > machines? ie: something as simple as 'words' starting with 'X' are
> > > split into a-h, i-p, q-z... or something more advanced? (maybe mysqld
> > > could automatically split results based on (#results per unique 'word'
> > > / desired # of 'split files/machines') Would such a system give any
> > > advantages to searching speed and concurrenct query scalability? I
> > > haven't looked at the fulltext internals.. so I don't know if such
> > > "query routing" could take place or not.
> > >
> > > If nothing else, does anyone else have experience with a table of this
> > > size or even larger? What kind of tuning have you done?
>





Re: Distributed Fulltext?

2002-02-07 Thread Amir Aliabadi

How do you make something like this fault tolerant?
The answer is probably what I suspect: 2 of everything.
How does the aggregator handle this, or are these machines in a cluster?

We are thinking of how to rebuild our fulltext search.  Currently it is
in MS SQL 7.0 - MySQL 4.0 seems to blow the doors off the cataloging
time as compared to MS SQL 7.0 or even 8.0.


On Thu, 2002-02-07 at 15:19, James Montebello wrote:
> 
> I did this at a previous job, and we split the data up more or less
> this way (we used a pre-existing item number for the split which was
> essentially random in relation to the text data), with a aggregator that
> did the query X ways, each to a separate box holding 1/X of the data.
> The results from each unit were paged and sorted, so all the aggregator
> did was do a simple merge sort on a "page" of the set, which was fast.
> On a 6M record dataset, it produced millisecond-range search results.
> Not exactly Google-class, but pretty good for 12 Linux boxes, two
> programmers, and about six weeks of effort.
> 
> james montebello
> 
> On Thu, 7 Feb 2002, Brian Bray wrote:
> 
> > 
> > It seems to me like the best solution that could be implemented as-is 
> > would be to keep a random int column in your table (with a range of say 
> > 1-100) and then have fulltext server 1 psudo-replicate records with a 
> > the random number in the range of 1-10, server 2 11-20 and server 3 
> > 21-30 and so on.
> > 
> > Then run your query on all 10 servers and merge the result sets and 
> > possibly re-sort them if you use the score column.
> > 
> > The problem with splitting the index up by word is that is messes up all 
> > your scoring and ranking.  For example what if you search using 5 
> > keywords, all starting with letters from different groups?  Your going 
> > to get pretty bad score for each match, and it could totally break 
> > boolean searches.
> > 
> > --
> > Brian Bray
> > 
> > 
> > 
> > 
> > Brian DeFeyter wrote:
> > > On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
> > > [snip]
> > > 
> > >>Wouldn't be too tough to write a little query routing system if you are using 
> > >>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
> > >>server so it routes the query to several places and returns a single result 
> > >>set. Ordering could be achieved as well. I'm sure there are commercial 
> > >>packages out there as well. I don't see why the individual database servers 
> > >>would need to do anything special.
> > >>
> > > [snip]
> > > 
> > > If I'm understanding you correctly, I think you're refering to routing
> > > based on the first character of the word. That would work for cases
> > > where the query is searching for a word that begins with a certain
> > > character.. however fulltext searches also return results with the term
> > > in the middle.
> > > 
> > > ie: a search for 'foo' could return:
> > > foo.txt
> > > foobar
> > > 
> > > but also could return:
> > > thisisfoo
> > > that_is_foolish
> > > 
> > > I could be wrong, but it's my understanding that MySQL stores it's
> > > fulltext index based on all the 'unique words' found. For such a system
> > > as you mentioned above, you'd probably have to create your own fulltext
> > > indexing system to determine: a) where to store the data 'segments' and
> > > b) how to route queries.  It seems like this could probably be done much
> > > more efficiently inside of the server.
> > > 
> > >  - Brian
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> > 
> > 
> > 
> 
> 
> 




Re: Distributed Fulltext?

2002-02-07 Thread James Montebello


I did this at a previous job, and we split the data up more or less
this way (we used a pre-existing item number for the split which was
essentially random in relation to the text data), with an aggregator that
did the query X ways, each to a separate box holding 1/X of the data.
The results from each unit were paged and sorted, so all the aggregator
did was do a simple merge sort on a "page" of the set, which was fast.
On a 6M record dataset, it produced millisecond-range search results.
Not exactly Google-class, but pretty good for 12 Linux boxes, two
programmers, and about six weeks of effort.
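The aggregator's job as described above is cheap because each slice server returns a page that is already sorted; combining X such pages is just a merge. A minimal sketch, with invented names, of that merge step:

```python
# Each slice returns one page of (score, doc) results, already sorted
# by descending score; the aggregator merge-sorts across the pages.
import heapq

def aggregate_page(slice_pages, page_size):
    """slice_pages: lists sorted by descending score. Returns the top
    page_size results across all slices, still in descending order."""
    # heapq.merge expects ascending key order, so merge on -score.
    merged = heapq.merge(*slice_pages, key=lambda r: -r[0])
    return [r for _, r in zip(range(page_size), merged)]

pages = [
    [(0.9, "doc-a"), (0.4, "doc-b")],  # slice 1's page
    [(0.8, "doc-c"), (0.1, "doc-d")],  # slice 2's page
]
print(aggregate_page(pages, 3))
```

Since only one page's worth of rows is ever materialized per slice, the aggregator's work stays small no matter how large the full result set is.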

james montebello

On Thu, 7 Feb 2002, Brian Bray wrote:

> 
> It seems to me like the best solution that could be implemented as-is 
> would be to keep a random int column in your table (with a range of say 
> 1-100) and then have fulltext server 1 psudo-replicate records with a 
> the random number in the range of 1-10, server 2 11-20 and server 3 
> 21-30 and so on.
> 
> Then run your query on all 10 servers and merge the result sets and 
> possibly re-sort them if you use the score column.
> 
> The problem with splitting the index up by word is that is messes up all 
> your scoring and ranking.  For example what if you search using 5 
> keywords, all starting with letters from different groups?  Your going 
> to get pretty bad score for each match, and it could totally break 
> boolean searches.
> 
> --
> Brian Bray
> 
> 
> 
> 
> Brian DeFeyter wrote:
> > On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
> > [snip]
> > 
> >>Wouldn't be too tough to write a little query routing system if you are using 
> >>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
> >>server so it routes the query to several places and returns a single result 
> >>set. Ordering could be achieved as well. I'm sure there are commercial 
> >>packages out there as well. I don't see why the individual database servers 
> >>would need to do anything special.
> >>
> > [snip]
> > 
> > If I'm understanding you correctly, I think you're refering to routing
> > based on the first character of the word. That would work for cases
> > where the query is searching for a word that begins with a certain
> > character.. however fulltext searches also return results with the term
> > in the middle.
> > 
> > ie: a search for 'foo' could return:
> > foo.txt
> > foobar
> > 
> > but also could return:
> > thisisfoo
> > that_is_foolish
> > 
> > I could be wrong, but it's my understanding that MySQL stores it's
> > fulltext index based on all the 'unique words' found. For such a system
> > as you mentioned above, you'd probably have to create your own fulltext
> > indexing system to determine: a) where to store the data 'segments' and
> > b) how to route queries.  It seems like this could probably be done much
> > more efficiently inside of the server.
> > 
> >  - Brian
> > 
> > 
> > 
> > 
> > 
> 
> 
> 
> 






Re: Distributed Fulltext?

2002-02-07 Thread Brian Bray


It seems to me like the best solution that could be implemented as-is 
would be to keep a random int column in your table (with a range of say 
1-100) and then have fulltext server 1 pseudo-replicate records with the 
random number in the range of 1-10, server 2 11-20, server 3 
21-30, and so on.

Then run your query on all 10 servers and merge the result sets, and 
possibly re-sort them if you use the score column.

The problem with splitting the index up by word is that it messes up all 
your scoring and ranking.  For example, what if you search using 5 
keywords, all starting with letters from different groups?  You're going 
to get a pretty bad score for each match, and it could totally break 
boolean searches.
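The random-bucket scheme above can be sketched as follows. The helper names are hypothetical stand-ins (a real deployment would issue MATCH ... AGAINST queries to actual MySQL servers); the point is the bucket-to-server mapping and the score re-sort after the merge.

```python
# Rows carry a random 1-100 bucket; each of 10 servers holds one
# 10-bucket slice; queries fan out to all servers and the merged
# result set is re-sorted on the score column.
import random

def bucket():
    """Random shard key stored alongside each row."""
    return random.randint(1, 100)

def server_for(b, n_servers=10):
    """Bucket 1-10 -> server 0, 11-20 -> server 1, ..., 91-100 -> 9."""
    return (b - 1) * n_servers // 100

def merged_search(servers, term):
    """servers: callables, each returning [(score, row), ...]."""
    results = []
    for query_server in servers:
        results.extend(query_server(term))
    results.sort(key=lambda r: r[0], reverse=True)  # re-sort by score
    return results
```

Because each server scores its matches against its own slice, the per-slice scores are comparable enough to merge, which is exactly what splitting the index by word would break.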

--
Brian Bray




Brian DeFeyter wrote:
> On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
> [snip]
> 
>>Wouldn't be too tough to write a little query routing system if you are using 
>>perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
>>server so it routes the query to several places and returns a single result 
>>set. Ordering could be achieved as well. I'm sure there are commercial 
>>packages out there as well. I don't see why the individual database servers 
>>would need to do anything special.
>>
> [snip]
> 
> If I'm understanding you correctly, I think you're refering to routing
> based on the first character of the word. That would work for cases
> where the query is searching for a word that begins with a certain
> character.. however fulltext searches also return results with the term
> in the middle.
> 
> ie: a search for 'foo' could return:
> foo.txt
> foobar
> 
> but also could return:
> thisisfoo
> that_is_foolish
> 
> I could be wrong, but it's my understanding that MySQL stores it's
> fulltext index based on all the 'unique words' found. For such a system
> as you mentioned above, you'd probably have to create your own fulltext
> indexing system to determine: a) where to store the data 'segments' and
> b) how to route queries.  It seems like this could probably be done much
> more efficiently inside of the server.
> 
>  - Brian
> 
> 
> 
> 
> 







Re: Distributed Fulltext?

2002-02-07 Thread Steve Rapaport


I second the question.  It could also reduce the size of the
fulltext index and the time taken to update it.

-steve

> On Thursday 07 February 2002 20:53, Brian  wrote:
> > Has anyone made a suggestion or thought about ways to distribute
> > databases which focus on fulltext indexes?
> >
> > fulltext indexes do a good job of indexing a moderate amount of data,
> > but when you get a lot of data to be indexed, the queries slow down
> > significantly.
> >
> > I have an example table, with about 90 million rows.. and has a fulltext
> > index on a varchar(100) field. A single-word query which would return
> > approx 300k results takes an average of 15 seconds. A query with smaller
> > results (~ 10k) can be as quick as 1 sec.. which I would consider
> > acceptable.
> >
> > Has any thought about splitting the data into distributed files or even
> > machines? ie: something as simple as 'words' starting with 'X' are split
> > into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> > automatically split results based on (#results per unique 'word' /
> > desired # of 'split files/machines') Would such a system give any
> > advantages to searching speed and concurrenct query scalability? I
> > haven't looked at the fulltext internals.. so I don't know if such
> > "query routing" could take place or not.
> >
> > If nothing else, does anyone else have experience with a table of this
> > size or even larger? What kind of tuning have you done?





Re: Distributed Fulltext?

2002-02-07 Thread Brian DeFeyter

On Thu, 2002-02-07 at 15:40, Tod Harter wrote:
[snip]
> Wouldn't be too tough to write a little query routing system if you are using 
> perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
> server so it routes the query to several places and returns a single result 
> set. Ordering could be achieved as well. I'm sure there are commercial 
> packages out there as well. I don't see why the individual database servers 
> would need to do anything special.
[snip]

If I'm understanding you correctly, I think you're referring to routing
based on the first character of the word. That would work for cases
where the query is searching for a word that begins with a certain
character.. however fulltext searches also return results with the term
in the middle.

ie: a search for 'foo' could return:
foo.txt
foobar

but also could return:
thisisfoo
that_is_foolish

I could be wrong, but it's my understanding that MySQL stores its
fulltext index based on all the 'unique words' found. For such a system
as you mentioned above, you'd probably have to create your own fulltext
indexing system to determine: a) where to store the data 'segments' and
b) how to route queries.  It seems like this could probably be done much
more efficiently inside of the server.

 - Brian







Re: Distributed Fulltext?

2002-02-07 Thread Tod Harter

On Thursday 07 February 2002 14:53, Brian DeFeyter wrote:
> Has anyone made a suggestion or thought about ways to distribute
> databases which focus on fulltext indexes?
>
> fulltext indexes do a good job of indexing a moderate amount of data,
> but when you get a lot of data to be indexed, the queries slow down
> significantly.
>
> I have an example table, with about 90 million rows.. and has a fulltext
> index on a varchar(100) field. A single-word query which would return
> approx 300k results takes an average of 15 seconds. A query with smaller
> results (~ 10k) can be as quick as 1 sec.. which I would consider
> acceptable.
>
> Has any thought about splitting the data into distributed files or even
> machines? ie: something as simple as 'words' starting with 'X' are split
> into a-h, i-p, q-z... or something more advanced? (maybe mysqld could
> automatically split results based on (#results per unique 'word' /
> desired # of 'split files/machines') Would such a system give any
> advantages to searching speed and concurrenct query scalability? I
> haven't looked at the fulltext internals.. so I don't know if such
> "query routing" could take place or not.
>
> If nothing else, does anyone else have experience with a table of this
> size or even larger? What kind of tuning have you done?
>
> Thanks,
>
>  - Brian

Wouldn't be too tough to write a little query routing system if you are using 
perl. Use DBD::Proxy on the web server side, and just hack the perl proxy 
server so it routes the query to several places and returns a single result 
set. Ordering could be achieved as well. I'm sure there are commercial 
packages out there as well. I don't see why the individual database servers 
would need to do anything special.
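The proxy idea could be sketched like this (in Python rather than the Perl/DBD::Proxy setup Tod describes): the proxy fans the same query out to several backends in parallel and hands back one combined, ordered result set. `run_query` is a hypothetical stand-in for a real database call; the backends themselves need nothing special.

```python
# Fan one query out to several backend servers and return a single
# merged, ordered result set, as a query-routing proxy would.
from concurrent.futures import ThreadPoolExecutor

def route_query(backends, run_query, sql):
    """backends: server addresses; run_query(addr, sql) -> list of rows."""
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        partials = pool.map(lambda addr: run_query(addr, sql), backends)
        rows = [row for part in partials for row in part]
    return sorted(rows)  # single ordered result set for the caller
```

From the web server's point of view this looks like one database handle, which is the appeal of doing the routing in a proxy layer.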
>
>
