Re[2]: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
Just for information: A full-text indexer based only on SQLite BTree index, not using tables: http://www.codeproject.com/useritems/Text_Indexer.asp - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
Scott Hess wrote: >>I am optimistic that the proper implementation will use even less than 50%: > >Indeed :-). Glad to read this ;-) >>I found that _not_ adding the original text turned out to be a great time >>saver. This makes sense if we know that the original text is about 4 times >>the size of the index. Storing lots of text by itself is already quite time >>consuming even without creating a FTS index. So I do not expect really >>bad slow downs by adding a docid->term index. > >Are you doing your inserts in the implied transactions sqlite provides >for you if you didn't open an explicit transaction? I'm found that >when doing bulk inserts, the maintenance of the content table is a >pretty small part of the overall time, perhaps 10%. My timings vary: I have just measured the insertion speeds with and without storing the original text and was _very_ surprised by the results: WITHtext storage: 1055 KB / sec WITHOUT text storage: 4948 KB / sec FTS without text storage performed almost 5 (five!) times faster than with text storage (running WinXP on a fairly recent system with a 5200 rotations per sec hard drive). The testing scenario: There were no changes to the code except that I commented out the text bindings as described in my earlier message. The same documents were indexed (10739 files, 239959 KB size in total). Insertion took place in a single transaction, SYNCHRONOUS = OFF was used as the only tweak to the database. I ran all tests multiple times consecutively on an empty database to avoid OS file buffering interferences. >>Snippets are of course nice to have out of the box as it is right now. But >>even without storing the original text, snippets could be created by >> >>1. supplying the text through other means (additional parameter or >>callback function), so that not FTS but the application would read >>it from a disk file or decompress it from a database field. >> >>2. constructing token-only snippets from the document tokens and >>offsets. This would of course exclude all non-word characters, but >>would still return legible information. > >A use-case that was considered was indexing PDF data, in which case >the per-document tokenization cost would probably be a couple seconds. >If you ran a query which matched a couple thousand documents and >proceeded to re-tokenize them for snippet generation, you'd be in deep >trouble. This is somewhat addressable by providing scoring mechanisms >and using subselects (basically, have the subselect order by score, >then cap the number of results, and have the main select ask for >snippets). A variant on that would be an index of a CD. In that case >it's pretty much essential that the index be able to efficiently >answer questions without having to seek all over the disk. Quite true. But is this indeed a realistic scenario? It sounds a bit like the "select * from my-million-row-table" problem. Nothing wrong with this per se, but be aware of the consequences. >Option 2 has some attraction, though, because you have the option of >transparently segmenting the document into blocks and thus not having >to re-tokenize the entire document to generate snippets. Thanks! >>>Being able to have an index without storing the original data was a >>>weak goal when fts1 was being developed, but every time we visitted >>>it, we found that the negatives of that approach were substantial >>>enough to discourage us for a time. [The "we" in that sentence means >>>"me and the various people I run wacky ideas past."] I'm keeping an >>>eye out for interesting implementation strategies and the time to >>>explore them, though. >> >>Maybe my arguments could influence the opinion of "we"? I would love >>to see FTS without text storage, especially since I just lost a project to >>another FTS product because duplicating data was unfortunately "out >>of disk space". > >Feel free to drop me a description of the types of things you're doing >out-of-band, maybe something will gel. No promises! Most of the >current use-cases are pretty clear - since the data is already going >to be in the database, letting fts2 store it is no big deal. I can >imagine pretty broad classes of problems which could come up when >indexing data which is not in the database, so one of the challanges >is to narrow down which problems are real, and which are figments. I conclude from your remarks that the offsets() problem is not predominant and could be solved even without storing full text in the database. If so, snippets could be created as well from those offsets. I realize that this will commplicate the FTS2 implementation, so please excuse if I am arguing from a user's perspective. For users, I can see the following benefits in separating FTS index and original text: * Space savings when indexing external documents not stored in the database. * Possibility to add FTS to text stored in compressed format in the database. * Possibility to mix FTS text rows with numer
Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
On 3/13/07, Ralf Junker <[EMAIL PROTECTED]> wrote: Scott Hess wrote: >Keeping track of that information would probably double the >size of the index. With your estimate, the SQLite full text index (without document storage) would still take up only 50% of the documents' size. In my opinion, this is still a very good ratio, even if some specialized full text search engines apparently get away with less than 30%. It's a pretty raw estimate, though, and assumes that we only bother to store the set of terms per document, without retaining ordering. Since it takes X space to store the terms with their list of docids, it should take approximately X space to store the docids with their list of terms, suitably encoded. Actually, a bit of a win could be had because the reverse mapping need not retain the positions, and the termids should encode more tightly than docids, since there are fewer of them. I am optimistic that the proper implementation will use even less than 50%: Indeed :-). I've mostly been trying to stay on the positive side of the 80/20 rule, and avoid some of the complications that come along with some of the bigger dedicated systems. My modifications are completely rudimentary and not at all optimized - the column to store the document text still exists. The only difference is that it is not used - it stores a null value which could be saved. In fact, the entire FTS table (the one without the suffixes) would not be needed and cut down storage space. The overhead of the table should be pretty minimal, and is providing the useful function of docid assignment. A complete implementation might be able to fold docid assignment into the internal data structures, but that might be counterproductive (it's surely useful to have docids assigned in the exact same way as sqlite assigns rowids). >A thing I've considered doing is to keep deletions >as a special index to the side, Would this open the door to "insert only, but no-modify and no-delete" indexes? I am sure users would like pay this cost for the benefit of even smaller FTS indexes! You're 90% there for INSERT-only - the above is a notion for how one could handle UPDATE and DELETE without storing the content data. >which would allow older data to be >deleted during segment merges. Unfortunately, I suspect that this >would slow things down by introducing another bit of data which needs >to be considered during merges. I found that _not_ adding the original text turned out to be a great time saver. This makes sense if we know that the original text is about 4 times the size of the index. Storing lots of text by itself is already quite time consuming even without creating a FTS index. So I do not expect really bad slow downs by adding a docid->term index. Are you doing your inserts in the implied transactions sqlite provides for you if you didn't open an explicit transaction? I'm found that when doing bulk inserts, the maintenance of the content table is a pretty small part of the overall time, perhaps 10%. Snippets are of course nice to have out of the box as it is right now. But even without storing the original text, snippets could be created by 1. supplying the text through other means (additional parameter or callback function), so that not FTS but the application would read it from a disk file or decompress it from a database field. 2. constructing token-only snippets from the document tokens and offsets. This would of course exclude all non-word characters, but would still return legible information. A use-case that was considered was indexing PDF data, in which case the per-document tokenization cost would probably be a couple seconds. If you ran a query which matched a couple thousand documents and proceeded to re-tokenize them for snippet generation, you'd be in deep trouble. This is somewhat addressable by providing scoring mechanisms and using subselects (basically, have the subselect order by score, then cap the number of results, and have the main select ask for snippets). A variant on that would be an index of a CD. In that case it's pretty much essential that the index be able to efficiently answer questions without having to seek all over the disk. Option 2 has some attraction, though, because you have the option of transparently segmenting the document into blocks and thus not having to re-tokenize the entire document to generate snippets. This extends the b-tree characteristics a little further into the data. [Of course, you'd really just store things in a form which made snippets easy to generate without re-tokenizing at all, but that's besides the point.] >Being able to have an index without storing the original data was a >weak goal when fts1 was being developed, but every time we visitted >it, we found that the negatives of that approach were substantial >enough to discourage us for a time. [The "we" in that sentence means >"me and the various people I run wacky ideas past."] I'm keeping a
Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
Hello Scott, I was hoping that you would read my message, many thanks for your reply! >UPDATE and DELETE need to have the previous document text, because the >docids are embedded in the index, and there is no docid->term index >(or, put another way, the previous document text _is_ the docid->term >index). This is very understandable given the present design. >Keeping track of that information would probably double the >size of the index. With your estimate, the SQLite full text index (without document storage) would still take up only 50% of the documents' size. In my opinion, this is still a very good ratio, even if some specialized full text search engines apparently get away with less than 30%. I think you have done an enourmous job on FTS2! I am optimistic that the proper implementation will use even less than 50%: My modifications are completely rudimentary and not at all optimized - the column to store the document text still exists. The only difference is that it is not used - it stores a null value which could be saved. In fact, the entire FTS table (the one without the suffixes) would not be needed and cut down storage space. >A thing I've considered doing is to keep deletions >as a special index to the side, Would this open the door to "insert only, but no-modify and no-delete" indexes? I am sure users would like pay this cost for the benefit of even smaller FTS indexes! >which would allow older data to be >deleted during segment merges. Unfortunately, I suspect that this >would slow things down by introducing another bit of data which needs >to be considered during merges. I found that _not_ adding the original text turned out to be a great time saver. This makes sense if we know that the original text is about 4 times the size of the index. Storing lots of text by itself is already quite time consuming even without creating a FTS index. So I do not expect really bad slow downs by adding a docid->term index. >Of course, there's no way the current system could generate snippets >without the original text, because doclists don't record the set of >adjacent terms. That information could be recorded, but it's doubtful >that doing so would be an improvement on simply storing the original >text in the first place. The current system _does_ have everything >needed to generate the offsets to hits even without the original text, >so the client application could generate snippets, though the code is >not currently in place to expose this information. Snippets are of course nice to have out of the box as it is right now. But even without storing the original text, snippets could be created by 1. supplying the text through other means (additional parameter or callback function), so that not FTS but the application would read it from a disk file or decompress it from a database field. 2. constructing token-only snippets from the document tokens and offsets. This would of course exclude all non-word characters, but would still return legible information. >Being able to have an index without storing the original data was a >weak goal when fts1 was being developed, but every time we visitted >it, we found that the negatives of that approach were substantial >enough to discourage us for a time. [The "we" in that sentence means >"me and the various people I run wacky ideas past."] I'm keeping an >eye out for interesting implementation strategies and the time to >explore them, though. Maybe my arguments could influence the opinion of "we"? I would love to see FTS without text storage, especially since I just lost a project to another FTS product because duplicating data was unfortunately "out of disk space". All the best and keep up your good work, Ralf - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
Ion Silvestru wrote: >Just a question: did you eliminated stop-words in your tests? No, I did not eliminate any stop-words. The two test runs were equal except for the small changes in FTS 2. My stop words question was not intended for source code but for human language texts. Ralf - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
On 3/13/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: Ion Silvestru <[EMAIL PROTECTED]> wrote: > To Ralf: > >As a side effect, the offsets() and snippet() functions stopped working, > >as they seem to rely on the presence of the full document text in the > >current implementation. > > Did you tested "phrase" searching on the index-only version, didn't this > kind of search rely on offsets()? Phrase searches do *not* use the full document text. But UPDATE and DELETE do, ironically. Or at least they used to, unless Scott has changed that in FTS2. Indeed, phrase searches should continue to work, because since we have the terms from the query, we can look them up and compare their token positions in the document (offsets being the character positions of the tokens). UPDATE and DELETE need to have the previous document text, because the docids are embedded in the index, and there is no docid->term index (or, put another way, the previous document text _is_ the docid->term index). Keeping track of that information would probably double the size of the index. A thing I've considered doing is to keep deletions as a special index to the side, which would allow older data to be deleted during segment merges. Unfortunately, I suspect that this would slow things down by introducing another bit of data which needs to be considered during merges. Of course, there's no way the current system could generate snippets without the original text, because doclists don't record the set of adjacent terms. That information could be recorded, but it's doubtful that doing so would be an improvement on simply storing the original text in the first place. The current system _does_ have everything needed to generate the offsets to hits even without the original text, so the client application could generate snippets, though the code is not currently in place to expose this information. Being able to have an index without storing the original data was a weak goal when fts1 was being developed, but every time we visitted it, we found that the negatives of that approach were substantial enough to discourage us for a time. [The "we" in that sentence means "me and the various people I run wacky ideas past."] I'm keeping an eye out for interesting implementation strategies and the time to explore them, though. -scott - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
Ion Silvestru <[EMAIL PROTECTED]> wrote: > To Ralf: > > >As a side effect, the offsets() and snippet() functions stopped working, as > >they seem to rely on the presence of the full document text in the current > >implementation. > > Did you tested "phrase" searching on the index-only version, didn't this > kind of search rely on offsets()? > Phrase searches do *not* use the full document text. But UPDATE and DELETE do, ironically. Or at least they used to, unless Scott has changed that in FTS2. -- D. Richard Hipp <[EMAIL PROTECTED]> - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
To Ralf: >As a side effect, the offsets() and snippet() functions stopped working, as >they seem to rely on the presence of the full document text in the current >implementation. Did you tested "phrase" searching on the index-only version, didn't this kind of search rely on offsets()? - To unsubscribe, send email to [EMAIL PROTECTED] -
Re[2]: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
>Just a question: did you eliminated stop-words in your tests? Sorry, you specified that you indexed source code files, so no stop-words are applicable here. - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
Thank you. Just a question: did you eliminated stop-words in your tests? >Concluding: Given the great database size savings possible by separating full >text index from data storage, I wish that >developers would consider adding such an option to the SQLite FTS interface. If such an option will be added, I see a big future for using SQLite as a simple, but powerful and easily customized (user tokenizers etc) full-text search engine, and not only as a DB engine. Currently we don't have many options for full-text desktop engine, there are some, like DTSearch, Onix, Lucene, but these are over-priced, can't be easily customized or too complex. - To unsubscribe, send email to [EMAIL PROTECTED] -
[sqlite] FTS: index only, no text storage - Was: [sqlite] FTS: Custom Tokenizer / Stop Words
>But what about: > >I am very interested to know if it would be possible to use an FTS indexing >module to store the inverted index only, but >not the document's text. This would safe disk space if the text to index is >stored on disk rather than inside the database. This is possible with just minor modifications to fts2.c (below). I commented out the instructions responsible for inserting and updating the text body into the %_content table. As a side effect, the offsets() and snippet() functions stopped working, as they seem to rely on the presence of the full document text in the current implementation. Neverthelses, I ran FTS2 over a collection of source code files, and the results are astonishing: With the original fts2.c, the database figures are as follows: Number of documents:10739 Files Total size of document text stored: 234 MB Total size of database: ===> 295 MB <=== Size of index within database: 61 MB Index / Text ratio:26 Percent With the modified fts2.c (no text stored), the database size was obviously much smaller: Number of documents:10739 Files Total size of document text stored: 234 MB Total size of database: ===> 61 MB <=== Index / Text ratio:26 Percent I addition to the database size savings, I can think of a number of other benefits in separating text and reverted index storage: 1. Indexing docuements stored in another database would not need to duplicate storage. A small "FTS database" could be attached to the "Data database" if necessary, so the "data" database stays smaller without the index. Deleting the "FTS database" would leave the the data untouched. 2. Point 1 from above would allow to distribute CDs without FTS and let the user create a small FTS index on local storage to speed up searching. This way more data can be shipped on single CD volumes. 3. Indexing compressed text would become possible. The current implementation does not allow text compression because the FTS tables always store uncompressed. 4. Ease maintainance and consistency of data as long as FTS is experimental. If data and FTS are separated, only the FTS index must be rebuild if FTS changes, while the current implementation potentially requires to upgrade entire tables to yet unknown formats. 5. FTS could be removed from a database without touching the data: Only the FTS tables would have to be deleted. Concluding: Given the great database size savings possible by separating full text index from data storage, I wish that developers would consider adding such an option to the SQLite FTS interface. Finally, here are the changes I applied to fts2.c as proof of concept: /* insert into %_content (rowid, ...) values ([rowid], [pValues]) */ static int content_insert(fulltext_vtab *v, sqlite3_value *rowid, sqlite3_value **pValues){ sqlite3_stmt *s; int i; int rc = sql_get_statement(v, CONTENT_INSERT_STMT, &s); if( rc!=SQLITE_OK ) return rc; rc = sqlite3_bind_value(s, 1, rowid); if( rc!=SQLITE_OK ) return rc; /* for(i=0; inColumn; ++i){ rc = sqlite3_bind_value(s, 2+i, pValues[i]); if( rc!=SQLITE_OK ) return rc; } */ return sql_single_step_statement(v, CONTENT_INSERT_STMT, &s); } /* update %_content set col0 = pValues[0], col1 = pValues[1], ... * where rowid = [iRowid] */ static int content_update(fulltext_vtab *v, sqlite3_value **pValues, sqlite_int64 iRowid){ sqlite3_stmt *s; int i; int rc = sql_get_statement(v, CONTENT_UPDATE_STMT, &s); if( rc!=SQLITE_OK ) return rc; /* for(i=0; inColumn; ++i){ rc = sqlite3_bind_value(s, 1+i, pValues[i]); if( rc!=SQLITE_OK ) return rc; } */ rc = sqlite3_bind_int64(s, 1+v->nColumn, iRowid); if( rc!=SQLITE_OK ) return rc; return sql_single_step_statement(v, CONTENT_UPDATE_STMT, &s); } Ralf - To unsubscribe, send email to [EMAIL PROTECTED] -