Awesome! Poor search capability and speed is the primary reason I've been only lurking on this list too for so long instead of contributing. I don't have the C experience to improve that very much easily, and don't have the time to dive in and learn at the moment to invent better searching for sword from scratch. But I have done some work on fast indexing and searching algorithms, maybe we can work together... you can see some of my tests here:
http://beaver.dburry.com/cgi-perl/bible You will see the framework and a lot of implementation of what you're talking about is already there (albeit with different syntax and some added things), but it's not in C it's in Perl, therefore of little use to sword at the moment without porting it. Complicated searches generally take about 200 milliseconds or so (on an old 233 mhz PII). You will notice I did implement full regexp searching on the indexed version, by limiting regexps to only search the list of indexed words not the full text. Yes, it's a bit slow (usually 2 seconds) to search the entire word list with regexps, but hey still beats the pants off a brute force full text regexp! For instance, using your example: http://beaver.dburry.com/cgi-perl/bible?encoding=utf-8&refstr=kjv&srchstr=%2 Fa.*b%2F&maxverses=30 this one is particularly long, and takes about 3.5 seconds to run the regexp... and then 1.2 seconds to find all the bazillion verses that match (see the speed notes at the bottom, this latter part can be optimized since we're only showing the first 30 matches so who cares about locating the rest internally)... don't everyone hit this all at once you could run my poor home linux machine on a DSL out of memory or something, haha, no I don't really think so I have the apache processes limited so you'd have to wait your turns instead ;o) I also have some ideas not yet implemented about creating a "phrase index" that indexes word pairs instead of words... it should be able to be used almost the same as my normal word index for most things, but makes phrase searches faster still and eliminates the loooong failure case for when you search for something like the phrase "the the" (this ties up the machine for like 25 secs and then at last you get a failure notice) The phrase index is no bigger than the normal word index, in fact it can be smaller because the data is more evenly distributed among the hash buckets internally, therefore requiring fewer of them. However, my particular indexing algorithm is probably too large for desktop use, but not bad for a web server because it's so faaaast! It uses a Berkeley DB B+Tree to store the main part of the index. The phrase index will be important to improve the speed for languages like Chinese, where you index each character. I noticed that in such languages my algorithm starts bogging down because everything is found so often, it can't start with an unpopular word to make it quick, and the phrase index can help dramatically there. It would also help in other languages if we ever indexed letter-by-letter instead of word-by-word (which is probably useless in most languages, but... you never know) I've designed but not yet implemented a much more complicated index storage system that should be smaller (about 5 megs per bible version) and still preserve all the extended functionality of my existing one (which is about 25 megs per version) without reducing speed any (in fact it may increase speed, not sure yet). This came out of the idea of trying to design an order preserving minimal perfect hash function for my index, when it dawned on me that my data is so predictable I don't need to use hashes for most things just some complicated tables that reference tables that reference tables that reference tables, etc. i.e. that large hash function by itself would be as good as the index it was trying to describe, yet much smaller because no holes between the hash values. Again, this better index is not yet coded, only on the drawing board. I've also done some work on prediction of how long a search will take beforehand, so as to help eliminate DOS attacks if this were run on a web server (for instance), so that you can refuse to run a search that will display too many results and take too much resources. This is completely disabled in my test script above, because I want to test the real thing... (the non-existant phrase "the the" is currently the longest thing possible in english I think). Dave ----- Original Message ----- From: "Joel Mawhorter" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Saturday, September 07, 2002 10:42 PM Subject: [sword-devel] Comming soon: new improved sword searching > With 95% less time and 7 essential nutrients! > > Hi all, > > Most of you don't know me but I've been hanging out in this list for a few > years. I've been working on a Bible search program that I started in my last > year of University as a guided project. My focus with this Bible program was > to implement full featured searching for non-Latin based languages. What I > want to see is people all over the world able to study the Bible in their own > language. Several times in the past I have evaluated Sword and considered > just putting my effort into that but the support for non-Latin languages just > wasn't there. However, it now seems to be getting much closer and I think > Sword will be more useful than what I could produce on my own. Therefore, > I've decided to join the Sword development project. My first priority is to > make a few improvements to the searching mechanism in Sword. I am writing to > the list to get feedback while I am still in the planning and early > implementation stages of my work. > ...
