Re: [HACKERS] OK, does anyone have any better ideas?
Andrew Snow wrote: > > Could you perhaps post the code you have for splitting a text field up into > keys, then I could work on turning into a new type of index with a new > operator, as Tom suggested? > > (Or is this already what the text search code in contrib already does??) I think that splitting the text is the least of the problems. I've been contemplating integrating my own (currently BSDDB based) full-text-index with postgresql and I've come to the conclusion that the index proper should work on already split list-of-words (so it could be made to work with any array types easily) So my suggestion is to start working on an inverted-index on a field of any array type. Or even better - start digging in the indexing interface of PostgreSQL first documenting it (the current docs just state that it is too complicated for anyone to comprehend ;) and then cleaning it up and making it thus making it easyer for any kind of new index to be written. - Hannu
Re: [HACKERS] OK, does anyone have any better ideas?
Oleg Bartunov wrote: > > On Sat, 9 Dec 2000, Edmar Wiggers wrote: > > > Date: Sat, 9 Dec 2000 14:20:17 -0200 > > From: Edmar Wiggers <[EMAIL PROTECTED]> > > To: mlw <[EMAIL PROTECTED]>, Oleg Bartunov <[EMAIL PROTECTED]> > > Cc: Tom Lane <[EMAIL PROTECTED]>, > > Hackers List <[EMAIL PROTECTED]> > > Subject: RE: [HACKERS] OK, does anyone have any better ideas? > > > > > > One possible idea for SQL integration: can one use index access-method > > functions to query the FTS outside the database? Yes, it would require some > > work, but the results I guess it would be wonderful. Ok, Tom, not so fast as > > an index sitting inside Postgres, but AFAICS that's not going to happen > > anytime soon. > > We did external indexing using suffix arrays ( http://sary.namazu.org ) > for one project because of limiting time :-) But we had to do a lot > of work like ACID (well, we already have some technology) and > everything we get usually from SQL. > Now we're trying to implement fast indexing using GiST. I think I have the answer, or at least as good as I think I can get in the near term. The syntax is as follows: create temp table search_results as select ts_key(10) as "key", ts_rank(10) as "rank" from ts_template where ts_search('bla bla bla', 10)>0; select * from table where search_results.key = table.field; It is not ideal, obviously, but it solves a couple problems, and should not be too big a performance hit. (If ANYONE can come up with a better way, I'd really really love to hear it.) The first call to ts_search(...) does the search and saves the results. Each successive call simply advances the result number. ts_key() and ts_rank() work on the current result number and return the respective value. ts_template is a table of some maximum number of records plus 1, say 1001. When the total number of results have been returned (or maximum has been reached), ts_search frees the search results (because they should be saved in the table) and returns 0, stopping the select call. Anyone see any problems with this approach? It is not ideal, but it is as efficient as I can come up with, without spending a year or two creating a new Postgres index type. -- http://www.mohawksoft.com
Re: [HACKERS] OK, does anyone have any better ideas?
"Edmar Wiggers" <[EMAIL PROTECTED]> writes: > One possible idea for SQL integration: can one use index access-method > functions to query the FTS outside the database? Hm. In principle an index access method can do whatever it darn pleases. In practice, though, I think the main problem with making an index type for FTS is simply learning *how* to make a new index access method. (No one currently associated with the project has ever done it --- the four existing types all date back to Berkeley, I believe.) Seems to me that that learning curve is not going to be made any easier by pushing the guts of the information outside of Postgres ... if anything, it'd probably be harder because the existing examples of index access methods would become less relevant. regards, tom lane PS: by the way, do any of the rest of you find that Mark's email address doesn't work? Everytime I send him something, it sits in my mail queue for five days and then bounces with Name server: mohawksoft.com.: host name lookup failure. The DNS server's syslog entries look like Dec 9 11:34:56 sss2 named[10258]: ns_resp: query(mohawksoft.com) All possible A RR's lame Sorry to use up list bandwidth on this, but I have no other way to reach him --- apparently hub.org's nameserver is less picky than bind 8.2.2.p7?
RE: [HACKERS] OK, does anyone have any better ideas?
On Sat, 9 Dec 2000, Edmar Wiggers wrote: > Date: Sat, 9 Dec 2000 14:20:17 -0200 > From: Edmar Wiggers <[EMAIL PROTECTED]> > To: mlw <[EMAIL PROTECTED]>, Oleg Bartunov <[EMAIL PROTECTED]> > Cc: Tom Lane <[EMAIL PROTECTED]>, > Hackers List <[EMAIL PROTECTED]> > Subject: RE: [HACKERS] OK, does anyone have any better ideas? > > > One possible idea for SQL integration: can one use index access-method > functions to query the FTS outside the database? Yes, it would require some > work, but the results I guess it would be wonderful. Ok, Tom, not so fast as > an index sitting inside Postgres, but AFAICS that's not going to happen > anytime soon. We did external indexing using suffix arrays ( http://sary.namazu.org ) for one project because of limiting time :-) But we had to do a lot of work like ACID (well, we already have some technology) and everything we get usually from SQL. Now we're trying to implement fast indexing using GiST. Regards, Oleg > > Yours sincerely, > > Edmar Wiggers > BRASMAP Information Systems > +55 48 9960 2752 > > _ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
RE: [HACKERS] OK, does anyone have any better ideas?
Personally, I'm not too afraid of having an FTS engine outside the database. Oracle's Intermedia, which has a very powerful/fast FTS engine, uses that approach. I could look into how they do the SQL integration, maybe it would yeld some ideas. Mark, about that row identifier: OID's are no good for fast find. Maybe you could use tuple id (TID). But please note that tuple ID's change after a simple update. It's like the offset of the row inside the table file (and therefore blazing fast to get to that row again). One possible idea for SQL integration: can one use index access-method functions to query the FTS outside the database? Yes, it would require some work, but the results I guess it would be wonderful. Ok, Tom, not so fast as an index sitting inside Postgres, but AFAICS that's not going to happen anytime soon. Yours sincerely, Edmar Wiggers BRASMAP Information Systems +55 48 9960 2752
Re: [HACKERS] OK, does anyone have any better ideas?
Oleg Bartunov wrote: > postgres... It would be great. > > Gotcha. It's impossible to return a set from a function, so the only > way to use perl to parse your bitmap. We did (in one project) external > search using suffix arrays which incredibly fast and use postgres to > return results to perl for processing. Here's a question, and I simply do not know enough about the internals of postgres to know, I had a brainstorm last night and though of a method. Create a table: Is it possible to call "SPI_exec" in a C function which does this: "create temp table fubar as select ts_key(10) as 'key', ts_rank(10) as 'rank' from textsearch_template where ts_placeholder(10) limit ts_count(10)" In the above example, which call would be called first? I assume the count would be called first, but I'm probably wrong. Which ever function would be called first would execute the query. textsearch_template would be a bogus table with 1000 or so zeros. So, in a query one does this: select ts_search('fubar', 'bla bla'); select * from table, fubar where table.field_key = fubar.key; How about this: Is there a construct in Postgres that represents a row ID, so a row can be found quickly without using an index? I tried oid but that didn't seem fast at all. P.S. If you want to see the system working, I have a test fixture running on "http://gateway.mohawksoft.com/music.php3" It calls the text search daemon from PHP and the text search daemon executes a sql query per result (PQExec). Look for a popular song and press "search." A good example is look for "pink floyd pigs," then try "pink floyd pigs -box." (It is running slow because it has debugging code, but it is still pretty fast.) This index has been metaphoned so something like "penk floid" will work too. The "+" operator is "requires" this is the default. The "-" operator is "must not have" and the "?" operator is "may have" (the "?" operator is a big hit because it increases the selection size.) I think if you try it, you'll see why I want to be able to get it deep into postgres, and what the possibilities are. -- http://www.mohawksoft.com
Re: [HACKERS] OK, does anyone have any better ideas?
On Sat, 9 Dec 2000, mlw wrote: > Date: Sat, 09 Dec 2000 08:18:10 -0500 > From: mlw <[EMAIL PROTECTED]> > To: Oleg Bartunov <[EMAIL PROTECTED]> > Cc: Tom Lane <[EMAIL PROTECTED]>, > Hackers List <[EMAIL PROTECTED]> > Subject: Re: [HACKERS] OK, does anyone have any better ideas? > > Oleg Bartunov wrote: > > > > We need multi-key B-tree like index for such problem. > > Our full text search engine is based on arrays and we need to find quickly > > is some number exists in array - some kind of index over int array. > > We're currently testing GiST approach and seems will have some conclusions > > soon. I think multi-key B-tree like index would be better in my > > opinion, but this requires to much work and knowledge of postgres's internals. > > Yesterday I read about UBTree, seems like it's good for index and query > > sets. Currently postgres has no set specific methods. > > The way I do my search indexing is with bitmap objects and a word > dictionary. One creates a searchable dictionary of words by scanning the > selected records. So, in one query that results in 2 million records, > the total aggregate number of words is about 60,000 depending on how you > parse. For each word, you create a "bitmap object" (in one of a few > forms) where bit '0' represents the first record, bit '1' represents the > second, and so on, until you have 2 million bits. > > Set the correct bit in the bitmap for each document that contains that > word. In the end you will have the equivalent 60,000 bitmaps or 2 > million bits. > > During search time, one creates an empty bitmap of 2 million bits as a > work space. One parses the search term, and performs boolean operation > on the workspace from the bitmap retrieved for each word. > > When you are done parsing, you have a bitmap with a bit set for each > document that fits the search criteria. You then enumerate the bits by > bit position, and you now have a list of document numbers. > > If only I could get the list of document numbers back into > postgres... It would be great. Gotcha. It's impossible to return a set from a function, so the only way to use perl to parse your bitmap. We did (in one project) external search using suffix arrays which incredibly fast and use postgres to return results to perl for processing. Regards, Oleg > -- > http://www.mohawksoft.com > _ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Re: [HACKERS] OK, does anyone have any better ideas?
Oleg Bartunov wrote: > > We need multi-key B-tree like index for such problem. > Our full text search engine is based on arrays and we need to find quickly > is some number exists in array - some kind of index over int array. > We're currently testing GiST approach and seems will have some conclusions > soon. I think multi-key B-tree like index would be better in my > opinion, but this requires to much work and knowledge of postgres's internals. > Yesterday I read about UBTree, seems like it's good for index and query > sets. Currently postgres has no set specific methods. The way I do my search indexing is with bitmap objects and a word dictionary. One creates a searchable dictionary of words by scanning the selected records. So, in one query that results in 2 million records, the total aggregate number of words is about 60,000 depending on how you parse. For each word, you create a "bitmap object" (in one of a few forms) where bit '0' represents the first record, bit '1' represents the second, and so on, until you have 2 million bits. Set the correct bit in the bitmap for each document that contains that word. In the end you will have the equivalent 60,000 bitmaps or 2 million bits. During search time, one creates an empty bitmap of 2 million bits as a work space. One parses the search term, and performs boolean operation on the workspace from the bitmap retrieved for each word. When you are done parsing, you have a bitmap with a bit set for each document that fits the search criteria. You then enumerate the bits by bit position, and you now have a list of document numbers. If only I could get the list of document numbers back into postgres... It would be great. -- http://www.mohawksoft.com
Re: [HACKERS] OK, does anyone have any better ideas?
We need multi-key B-tree like index for such problem. Our full text search engine is based on arrays and we need to find quickly is some number exists in array - some kind of index over int array. We're currently testing GiST approach and seems will have some conclusions soon. I think multi-key B-tree like index would be better in my opinion, but this requires to much work and knowledge of postgres's internals. Yesterday I read about UBTree, seems like it's good for index and query sets. Currently postgres has no set specific methods. Regards, Oleg On Fri, 8 Dec 2000, mlw wrote: > Date: Fri, 08 Dec 2000 20:17:34 -0500 > From: mlw <[EMAIL PROTECTED]> > To: Tom Lane <[EMAIL PROTECTED]> > Cc: Hackers List <[EMAIL PROTECTED]> > Subject: Re: [HACKERS] OK, does anyone have any better ideas? > > Tom Lane wrote: > > > > mlw <[EMAIL PROTECTED]> writes: > > > I have a working version of a text search engine. I want to make it work > > > for Postgres (I will be releasing it GPL). It can literally find the > > > occurrence of a string of words within 5 million records in a few > > > milliseconds. > > > > Where are the records coming from? Are they inside the database? > > (If not, why do you care about integrating this with Postgres?) > > > > It seems like the right way to integrate this sort of functionality > > is to turn it into a kind of index, so that you can do > > > > SELECT * FROM mytable WHERE keyfield ~~~ 'search string'; > > > > where ~~~ is the name of some operator that is associated with the > > index. The temporary-table approach you are taking seems inherently > > klugy, and would still be awkward even if we had functions returning > > recordsets... > > Oh! Another method I tried and just could not get working was returning > an array of integers. I as thinking about "select * from table where > key_field in ( textsearch('bla bla') ), but I haven't been able to get > that to work, and as per a previous post and belatedly reading a FAQ, > this would probably still force a full table scan. > > Another method I thought about was having a table with some maximum > number of zero initialized records, and trying something like: > > create table temp_table as select * from ts_template limit > textsearch('bla bla', 10); > > select filltable(temp_table, 10); > > select * from table where key_field = temp_table.key; > > As you can see, all of these ideas are heinous hacks, there has to be a > better way. Surely someone has a better idea. > > -- > http://www.mohawksoft.com > _ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
Re: [HACKERS] OK, does anyone have any better ideas?
Andrew Snow wrote: > > Could you perhaps post the code you have for splitting a text field up into > keys, then I could work on turning into a new type of index with a new > operator, as Tom suggested? > > (Or is this already what the text search code in contrib already does??) > > - Andrew OK, I guess I am not getting everything across. Let me give the basics: There are two programs: sqlindex, and sqlfts. sqlindex, is the SQL indexer. sqlfts, is the SQL full text server. They currently take a config file, which will be replaced by columns in the database. (This technology is aimed at multiple databases and non-SQL uses) The config file currently looks like the example at the bottom of this post. The current incarnation of this server sits outside of Postgres and execute joins based the results of the query. The indexing query returns a number of fields, one must be designated as the "key" field. In websearch lingo, think of it as "document name." During index time, I separate the individual fields and create bitmap files which relate word numbers to document bits. Words are parsed and a dictionary is created. Phrase meta-data is also stored along with the document reference (key field) associated with a document number. When a query is executed, each word is picked out of the dictionary. At various points, phrases are evaluated, the bitmap indexes are ANDed, ORed, or NOTed together, rank is applied. The results are then sorted by rank, and the document numbers are merged in with document "references" (key field value) and return with the rank. This technology works quite well as a search engine sort of thing if I store a URL or file name and a teaser as the document reference. I thought it would be cool (and easy) if I just stored a SQL key field as the URL, and connected this stuff to a SQL database. I chose Postgres because I had used it in a number of projects, and thought since it was open source I would have fewer problems. It has not been easy to do what I thought would be a fairly trivial task. I am starting to get Windows programming flashbacks of the "so close, but yet so far" feeling one gets when one tries to do conceptually simple things on Windows. I'm sorry I am getting discouraged and beginning to think that this project is not going to work. >>> configuration file << # The computer host name used for the database sqlindex=localhost sqlfts=localhost # The name of the database sqldb=cdinfo # Base name of the index files. basename=cdinfo/index # The key field used to index and find records. sqlkey=trackid sqlkeyindex=off metaphone=1 # A SQL query that produces a single result, which is the count of # records to be indexed. sqlcount=select count(trackid) from zsong # The SQL query used to produce data to be indexed. sqlretrieve=select * from songview; sqlfields = all,performer2,title,song # A SQL query that will be used to display a list records found sqldisplay=select zsong.muzenbr, performer, performer2, title, song, trackid from ztitles, zsong where zsong.muzenbr = ztitles.muzenbr and zsong.trackid = %s # The tcport is the TCP/IP port for the server tcpport = 8090 ftsproc = 5 ftsqueue = 32 -- http://www.mohawksoft.com
Re: [HACKERS] OK, does anyone have any better ideas?
Andrew Snow wrote: > > Could you perhaps post the code you have for splitting a text field up into > keys, then I could work on turning into a new type of index with a new > operator, as Tom suggested? > > (Or is this already what the text search code in contrib already does??) Go to a search engine like lycos, alltheweb, or altavista. This is the type of search capability I want to use in Postgres. I have it working as a stand-alone daemon, it is fast and produces very relevant results. I just thought that this sort of functionality would be a big plus if I could tie it down deep in Postgres. The big advantage to the code is high relevance, boolean operations, and very very high speed operation. If I could get easy Postgres record selection out of it, it would rock. -- http://www.mohawksoft.com
Re: [HACKERS] OK, does anyone have any better ideas?
Could you perhaps post the code you have for splitting a text field up into keys, then I could work on turning into a new type of index with a new operator, as Tom suggested? (Or is this already what the text search code in contrib already does??) - Andrew
Re: [HACKERS] OK, does anyone have any better ideas?
mlw <[EMAIL PROTECTED]> writes: > Then you call search with a string, such as "the long and winding road" > or "software OR hardware AND engineer NOT sales." A few milliseconds > later, a list of key/rank pairs are produced. This is FAR faster than > the '~~~' operator because it never does a full table scan. An index-associated operator doesn't imply a full table scan either. The whole purpose of an index is to pull out the rows matched by the WHERE expression without doing a full scan. The thing that bothers me about the way you're doing it is that the search result as such doesn't give you access to anything but the keys themselves. Typically what you want to do is get the whole record(s) in which the matching keys are located --- and that's why the notion of SELECT ... WHERE textfield-matches-search-string looks so attractive. You get the records immediately, in one step. Without that, your next step after the search engine call is to do a join of the search result table against your data table, and poof there goes much of your speed gain. (At best, you can make the join reasonably quick by having an index on the unique key field ... but that just means another index to maintain.) Another advantage of handling it as an index is that you don't have to rely on a periodic recomputation of the index; you can do on-the-fly updates each time the table is altered. (Of course, if your indexing technology can't handle incremental updates efficiently, that might not be of any value to you. But there's nothing in the system design that precludes making an index type that's only updated by REINDEX.) I realize this is probably not what you wanted to hear, since building a new index type is a lot more work than I suppose you were looking for. But if you want a full-text index that's integrated naturally into Postgres, that's the path to travel. The way you're doing it is swimming against the tide. Even when the function-returning-recordset limitation is gone (maybe a version or two away), it's still going to be an awkward and inefficient way to work. regards, tom lane
Re: [HACKERS] OK, does anyone have any better ideas?
Tom Lane wrote: > > mlw <[EMAIL PROTECTED]> writes: > > I have a working version of a text search engine. I want to make it work > > for Postgres (I will be releasing it GPL). It can literally find the > > occurrence of a string of words within 5 million records in a few > > milliseconds. > > Where are the records coming from? Are they inside the database? > (If not, why do you care about integrating this with Postgres?) > > It seems like the right way to integrate this sort of functionality > is to turn it into a kind of index, so that you can do > > SELECT * FROM mytable WHERE keyfield ~~~ 'search string'; > > where ~~~ is the name of some operator that is associated with the > index. The temporary-table approach you are taking seems inherently > klugy, and would still be awkward even if we had functions returning > recordsets... Oh! Another method I tried and just could not get working was returning an array of integers. I as thinking about "select * from table where key_field in ( textsearch('bla bla') ), but I haven't been able to get that to work, and as per a previous post and belatedly reading a FAQ, this would probably still force a full table scan. Another method I thought about was having a table with some maximum number of zero initialized records, and trying something like: create table temp_table as select * from ts_template limit textsearch('bla bla', 10); select filltable(temp_table, 10); select * from table where key_field = temp_table.key; As you can see, all of these ideas are heinous hacks, there has to be a better way. Surely someone has a better idea. -- http://www.mohawksoft.com
Re: [HACKERS] OK, does anyone have any better ideas?
Tom Lane wrote: > > mlw <[EMAIL PROTECTED]> writes: > > I have a working version of a text search engine. I want to make it work > > for Postgres (I will be releasing it GPL). It can literally find the > > occurrence of a string of words within 5 million records in a few > > milliseconds. > > Where are the records coming from? Are they inside the database? > (If not, why do you care about integrating this with Postgres?) > > It seems like the right way to integrate this sort of functionality > is to turn it into a kind of index, so that you can do > > SELECT * FROM mytable WHERE keyfield ~~~ 'search string'; > > where ~~~ is the name of some operator that is associated with the > index. The temporary-table approach you are taking seems inherently > klugy, and would still be awkward even if we had functions returning > recordsets... OK, I get the misunderstanding, you are absolutely right it is VERY kludgy. It is sort of like a bitmap index, but it is more of a search engine. I actually have it working on a commercial website. You run a program periodically (cron job?) that executes a query, the query is then parsed and an index of words, keys, ranks and phrase meta-data is created. You also specify which fields in the query should be indexed and which field will be the "key." (It is not ACID if I understand what they term means.) The data for the text search need not even be in the database, as long as the "key" being indexed is. Then you call search with a string, such as "the long and winding road" or "software OR hardware AND engineer NOT sales." A few milliseconds later, a list of key/rank pairs are produced. This is FAR faster than the '~~~' operator because it never does a full table scan. It is assumed that the "key" field specified is properly indexed. If I had a way of getting the key/rank result pair deeper into Postgres, it would be an amazing platform to make some serious high speed search applications. Think about a million resumes' online and searchable with an arbitrary text string to get a list of candidates, powered by Postgres, handling 100 queries a second. Right now, the way I have it working is PHP makes the search call and then executes a query with the first result (highest rank) and returns the data. If I could get the key/rank pair into postgres as a table or multiple searches into postgres as a set of tables, then you could do some amazing queries really really fast. Still, you said that "select foo from bar where key = textsearch('bla bla',..)" could not be done, and my previous example was the only other way I have been able to even prototype my idea. -- http://www.mohawksoft.com
Re: [HACKERS] OK, does anyone have any better ideas?
mlw <[EMAIL PROTECTED]> writes: > I have a working version of a text search engine. I want to make it work > for Postgres (I will be releasing it GPL). It can literally find the > occurrence of a string of words within 5 million records in a few > milliseconds. Where are the records coming from? Are they inside the database? (If not, why do you care about integrating this with Postgres?) It seems like the right way to integrate this sort of functionality is to turn it into a kind of index, so that you can do SELECT * FROM mytable WHERE keyfield ~~~ 'search string'; where ~~~ is the name of some operator that is associated with the index. The temporary-table approach you are taking seems inherently klugy, and would still be awkward even if we had functions returning recordsets... regards, tom lane