Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
Were you able to solve the problem? What solution have to chosen? BR, Fedor I'd appreciate any suggestions on good ways to do this, I'm neither an SQL or sqlite expert, so I might be thinking about it all wrong. I have something like a (read-only) address book/rolodex, with interactive searching. As users type into the search box, I need to first know for each section how many rows match the substring typed so far. I only display the rows that are visible on screen. I have two queries: (A) I count the rows in a letter group. If they typed "e": select substr(name,1,1), count(*) from my_table where name like '%e%' group by substr(name,1,1); A|94 B|118 C|131 ... This is too slow, ~3sec, with 2500 rows, and we want to have 1 rows. Worse, when they type "es", the search is as slow after they type "s" as when they typed "e", even though the "es" rows are a sub-set of the rows that matched "e". FTS3 only searches full terms/words by default, but I think if I built a custom tokenizer that returned all the suffix trees for a name: "fu bar" => [ "r", "ar", "bar", " bar", "u bar", "fu bar"] That I could do rewrite query (A) like this: select substr(name,1,1), count(*) from my_table where name match 'e*' group by substr(name,1,1); Is this a reasonable approach? Is there a better way? Has somebody else done this? (B) I access specific rows within a letter group. For visible rows, I fetch them by offset into a letter group, so row 4 in the "g" section of names containing "e" would be: select * from my_table where name like "g%" and name like "%e%" order by name limit 1 offset 4; The performance for this is OK, right now, I think it's because the first LIKE can use the index, so the linear scan is over only a few hundred rows. Or it could be that the on-screen display of each row is slower than the DB search. I think it might become a problem, though. I'm not sure how I would rewrite this to use FTS3 if it turns out to be to slow for a larger DB, maybe a tokenizer that puts the first letter of the name as the first letter of every suffix? ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
Sounds to me like Boyer-Moore is needed http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm And...I would probably pre-load the database table into 26 seperate memory tables to avoid any SQL interactivity at all other than the initial loading. Adding the SQL layer slows things down far too much. Michael D. Black Senior Scientist Advanced Analytics Directorate Northrop Grumman Information Systems From: sqlite-users-boun...@sqlite.org on behalf of Tim Romano Sent: Mon 8/9/2010 7:00 AM To: General Discussion of SQLite Database Subject: EXTERNAL:Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way? First, permit me a little rant. As a user, I dislike this kind of incremental search feature if there's no easy way to toggle it or to configure it and the list of items will be large enough to cause a typing lag. The feature can become an intrusive nuisance, the opposite of what is intended. Browsers put this feature on the URL address bar and Google has it on its search-input. Keystrokes entered often get swallowed up. It's worse than typing on a 300 baud dumb terminal, for at least on those ancient machines your characters would eventually be displayed on the green screen, whereas with today's browsers the characters often just get eaten; I find myself having to retype the first few characters of a URL or search term far too often. I agree with Radzi's suggestion. Once you have the initial set of of hits (rowid, name) in an array, do the rest in procedurally rather than going back against the database with a new SQL query and a longer search string. That will be much faster that issuing a new SQL query after every keystroke. I would wait until the user had typed at least two characters before kicking off the initial search because finding every value that contains a common letter is not helpful when the list of matches is a very long one. Regards Tim Romano Swarthmore PA On Fri, Aug 6, 2010 at 9:54 PM, Scott Hess wrote: > On Fri, Aug 6, 2010 at 6:08 PM, Sam Roberts wrote: > > On Fri, Aug 6, 2010 at 11:32 AM, Scott Hess wrote: > >> On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts > wrote: > >>> FTS3 only searches full terms/words by default, but I think if I built > a custom > >>> tokenizer that returned all the suffix trees for a name: > >> > >> FTS3 can do prefix searches, MATCH 'a*'. Also, it aimed to support > > > > Prefix searches don't allow matching in the middle of words. For > > example, I want "bert" > > to match my name, "roberts". > > Darn. Sorry, was only thinking with half my brain, and that half > connected your problem up with some past idea. You're right, you'd > need the tidbits to get at the interior substrings. > > That said, you should be able to pretty easily copy the current > tokenizer and modify it to return multiple tokens at a single > location. > > -scott > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
First, permit me a little rant. As a user, I dislike this kind of incremental search feature if there's no easy way to toggle it or to configure it and the list of items will be large enough to cause a typing lag. The feature can become an intrusive nuisance, the opposite of what is intended. Browsers put this feature on the URL address bar and Google has it on its search-input. Keystrokes entered often get swallowed up. It's worse than typing on a 300 baud dumb terminal, for at least on those ancient machines your characters would eventually be displayed on the green screen, whereas with today's browsers the characters often just get eaten; I find myself having to retype the first few characters of a URL or search term far too often. I agree with Radzi's suggestion. Once you have the initial set of of hits (rowid, name) in an array, do the rest in procedurally rather than going back against the database with a new SQL query and a longer search string. That will be much faster that issuing a new SQL query after every keystroke. I would wait until the user had typed at least two characters before kicking off the initial search because finding every value that contains a common letter is not helpful when the list of matches is a very long one. Regards Tim Romano Swarthmore PA On Fri, Aug 6, 2010 at 9:54 PM, Scott Hess wrote: > On Fri, Aug 6, 2010 at 6:08 PM, Sam Roberts wrote: > > On Fri, Aug 6, 2010 at 11:32 AM, Scott Hess wrote: > >> On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts > wrote: > >>> FTS3 only searches full terms/words by default, but I think if I built > a custom > >>> tokenizer that returned all the suffix trees for a name: > >> > >> FTS3 can do prefix searches, MATCH 'a*'. Also, it aimed to support > > > > Prefix searches don't allow matching in the middle of words. For > > example, I want "bert" > > to match my name, "roberts". > > Darn. Sorry, was only thinking with half my brain, and that half > connected your problem up with some past idea. You're right, you'd > need the tidbits to get at the interior substrings. > > That said, you should be able to pretty easily copy the current > tokenizer and modify it to return multiple tokens at a single > location. > > -scott > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
On Fri, Aug 6, 2010 at 6:08 PM, Sam Roberts wrote: > On Fri, Aug 6, 2010 at 11:32 AM, Scott Hess wrote: >> On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts wrote: >>> FTS3 only searches full terms/words by default, but I think if I built a >>> custom >>> tokenizer that returned all the suffix trees for a name: >> >> FTS3 can do prefix searches, MATCH 'a*'. Also, it aimed to support > > Prefix searches don't allow matching in the middle of words. For > example, I want "bert" > to match my name, "roberts". Darn. Sorry, was only thinking with half my brain, and that half connected your problem up with some past idea. You're right, you'd need the tidbits to get at the interior substrings. That said, you should be able to pretty easily copy the current tokenizer and modify it to return multiple tokens at a single location. -scott ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
Have you not consider loading the whole rows into memory array and use simple string search or regexp? I'm sure 10,000 records could be search a blink. best regards, Radzi. On 6-Aug-2010, at 3:42 AM, Sam Roberts wrote: > I'd appreciate any suggestions on good ways to do this, I'm neither an SQL or > sqlite expert, so I might be thinking about it all wrong. > > I have something like a (read-only) address book/rolodex, with interactive > searching. As users type into the search box, I need to first know for each > section how many rows match the substring typed so far. I only display the > rows that are visible on screen. > > I have two queries: > > (A) I count the rows in a letter group. > > If they typed "e": > > select substr(name,1,1), count(*) from my_table where name like '%e%' > group by substr(name,1,1); > A|94 > B|118 > C|131 > ... > > This is too slow, ~3sec, with 2500 rows, and we want to have 1 rows. > > Worse, when they type "es", the search is as slow after they type "s" as when > they typed "e", even though the "es" rows are a sub-set of the rows that > matched "e". > > FTS3 only searches full terms/words by default, but I think if I built a > custom > tokenizer that returned all the suffix trees for a name: > > "fu bar" => [ "r", "ar", "bar", " bar", "u bar", "fu bar"] > > That I could do rewrite query (A) like this: > > select substr(name,1,1), count(*) from my_table where name match 'e*' > group by substr(name,1,1); > > Is this a reasonable approach? Is there a better way? Has somebody > else done this? > > > > (B) I access specific rows within a letter group. > > For visible rows, I fetch them by offset into a letter group, so row 4 in the > "g" section of names containing "e" would be: > > select * from my_table where name like "g%" and name like "%e%" order > by name limit 1 offset 4; > > The performance for this is OK, right now, I think it's because the first LIKE > can use the index, so the linear scan is over only a few hundred rows. Or it > could be that the on-screen display of each row is slower than the DB search. > I > think it might become a problem, though. > > I'm not sure how I would rewrite this to use FTS3 if it turns out to be to > slow > for a larger DB, maybe a tokenizer that puts the first letter of the name as > the first letter of every suffix? > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
On Fri, Aug 6, 2010 at 11:32 AM, Scott Hess wrote: > On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts wrote: >> FTS3 only searches full terms/words by default, but I think if I built a >> custom >> tokenizer that returned all the suffix trees for a name: > > FTS3 can do prefix searches, MATCH 'a*'. Also, it aimed to support Prefix searches don't allow matching in the middle of words. For example, I want "bert" to match my name, "roberts". So, I think I'd need to tokenize roberts as "s", "ts", ..., "berts", "oberts", ... etc. Then do a prefix match for "bert*" in order to see that "roberts" matches. Lucky, I don't need or care about any of the snippeting stuff, because I'm matching short strings (names). Cheers, Sam ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
On Thu, Aug 5, 2010 at 12:42 PM, Sam Roberts wrote: > FTS3 only searches full terms/words by default, but I think if I built a > custom > tokenizer that returned all the suffix trees for a name: FTS3 can do prefix searches, MATCH 'a*'. Also, it aimed to support multiple hits at the same position, for stemming purposes. So you might be able to get away with making a copy of fts3_tokenizer1.c, and modifying it to keep an additional flag in the cursor to let you return each token twice (once reversed). I can't offhand think of how to distinguish the resulting prefix matches from suffix matches. Maybe you can work that out yourself by using the rows returned to figure it out. Also note that this will possibly interact poorly with the snippeting and offset functions. As a short-term proof-of-concept hack, you could just have two tables. Insert your originals into one table, then take last_insert_rowid() and insert the document reversed into the other table. -scott ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
On Fri, Aug 6, 2010 at 6:11 AM, Adam DeVita wrote: > A variant on Simon's plan. > Are the 10,000 rows static, slowly changing, or frequently changing? Never change, it's read-only. > Does > it make sense to pre-calculate some counts at the time data is loaded? > Is > this memory constrained so much that you can't afford 1 or 2 MB to let you > look up based on ints? (I'm assuming that one letter is all you are after, > either 'starts with' or 'contains' and not in order combinations.) No, substrings, it's just that I then need a count of matching substrings by first char. Good idea, there are a number of other queries where pre-calculating is linear in the space cost, but here the the usage is interactive search, where as they type more of the name, it narrows down the search results as people type in more. Pre-calculating would be about 40 factorial in space, there are about 64000 3-character strings, and then once they typed the 4th char in it would be slow again. Of course, not all of those exist. Hm. Maybe I'll try to precalculate the suffix tree, and see how many results there really are, I don't need to store zero results. The fastest I've found so far is using FTS3. Its a little slow, but not unusably so. There are only 2500 rows now, I hope that it will scale well as the DB increases in size. I'm still considering other approaches, maybe a custom b-tree. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
A variant on Simon's plan. Are the 10,000 rows static, slowly changing, or frequently changing? Does it make sense to pre-calculate some counts at the time data is loaded? Is this memory constrained so much that you can't afford 1 or 2 MB to let you look up based on ints? (I'm assuming that one letter is all you are after, either 'starts with' or 'contains' and not in order combinations.) Adam On Thu, Aug 5, 2010 at 5:40 PM, Simon Slavin wrote: > > On 5 Aug 2010, at 10:03pm, Sam Roberts wrote: > > > But do you think the section would make the counting faster? I think > > I'd have to get the row counts like this, which would still do the > > slow full table scan: > > > > select section, count(*) from my_table where name like '%e%' group by > section; > > But 'group by section' can profit from the index on the section column so > it should be faster. > > As with all these things, the suggestion is to try it and see. You should > try six or seven different solutions including shuffling columns and indexes > before you settle on the one that will be in your final code. > > Simon. > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > -- VerifEye Technologies Inc. 905-948-0015x245 7100 Warden Ave, Unit 3 Markham ON, L3R 8B5 Canada ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
On 5 Aug 2010, at 10:03pm, Sam Roberts wrote: > But do you think the section would make the counting faster? I think > I'd have to get the row counts like this, which would still do the > slow full table scan: > > select section, count(*) from my_table where name like '%e%' group by > section; But 'group by section' can profit from the index on the section column so it should be faster. As with all these things, the suggestion is to try it and see. You should try six or seven different solutions including shuffling columns and indexes before you settle on the one that will be in your final code. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
On Thu, Aug 5, 2010 at 1:37 PM, Simon Slavin wrote: > > On 5 Aug 2010, at 8:42pm, Sam Roberts wrote: > >> select substr(name,1,1), count(*) from my_table where name like '%e%' >> group by substr(name,1,1); > > If you are constantly going to be using the first character of the name like > that, give it a column of its own with its own index. That's a good idea. I think it would help a lot with row fetching if section was it's own column: select * from my_table where section is "g" and name like "%e%" order by name limit 1 offset 4; But do you think the section would make the counting faster? I think I'd have to get the row counts like this, which would still do the slow full table scan: select section, count(*) from my_table where name like '%e%' group by section; Cheers, Sam ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Substring (LIKE "%key%") searches, would FTS3 with suffix-tree tokenizer be the fast way?
On 5 Aug 2010, at 8:42pm, Sam Roberts wrote: > select substr(name,1,1), count(*) from my_table where name like '%e%' > group by substr(name,1,1); If you are constantly going to be using the first character of the name like that, give it a column of its own with its own index. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users