|
Assuming the behavior of CLucene is the same as Lucene (the Java version), I think I can partially answer some of these:

Sword is using the standard analyzer. The standard analyzer ignores a list of common words, such as (a, the, in, an, on, ...). In Lucene speak these are stop words. This list may or may not be appropriate for biblical research. The simple analyzer does not have a stop list. The standard analyzer also does a lot of other things, which don't have a net effect on Bibles.

The difference between the two in terms of size of the index is 2M to 2.6M. Both take about the same length of time to build an index, but the simple analyzer is just a touch faster.

Lucene does case-insensitive searching. It always does. It does not matter which analyzer is used.

Lucene only looks for what you tell it to look for. If you want partial words you have to construct a search with wildcards. Lucene cannot wildcard the beginning of a word, so it cannot find words ending with "ration". In addition to the typical wildcards, Lucene also has ~, which when added to the end of a word will find words like the one you entered. This is very useful for finding words whose spelling is close but not correct (e.g. abimeleck~).

By default, Lucene uses "OR" as a connector between words. To require two words to be in a search result use "AND" or prefix the words with "+" (like Google).

Lynn Allan wrote:

Lucene is case insensitive. It always does exact match unless the request uses wildcards.

* Does indexed searching always do a match on the exact word? For example, with "Phrase" or "Multi Word" or "Optimized", there are 2 matches for "regeneration" using the AKJV. Phrase and MultiWord find 275 matches for "ration", including the times it is within "generation" and "regeneration". Optimized search finds 0. Perhaps this is how it is supposed to work, but it seems like an end-user might find it unexpected that Optimized Searching gives results that are very different from "Phrase" and "MultiWord" searching.
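
To make the difference between the two kinds of search concrete, here is a minimal sketch in Python of a toy inverted index. It is not Lucene's or Sword's code: the stop list, the function names and the sample texts are all made up for illustration, and fuzzy (~) matching is left out. It only shows why a whole-word indexed lookup for "ration" can come back empty while a plain substring scan finds it inside "generation", why a stop word such as "of" gets zero hits, and how OR differs from AND.

```python
import re

# Hypothetical stop list for illustration only; not the actual list
# shipped with Lucene's standard analyzer.
STOP_WORDS = {"a", "an", "the", "in", "on", "of"}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(verses):
    """Map each token to the set of verse keys that contain it."""
    index = {}
    for key, text in verses.items():
        for token in tokenize(text):
            index.setdefault(token, set()).add(key)
    return index


def term_hits(index, term):
    """Whole-word lookup; a trailing '*' matches any token with that prefix."""
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        hits = set()
        for token, keys in index.items():
            if token.startswith(prefix):
                hits |= keys
        return hits
    return set(index.get(term, set()))


def search(index, query, require_all=False):
    """Combine per-word hits with OR by default, or AND when require_all is True."""
    results = [term_hits(index, word) for word in query.split()]
    if not results:
        return set()
    combine = set.intersection if require_all else set.union
    return combine(*results)


def substring_scan(verses, fragment):
    """Unindexed scan: matches the fragment anywhere, even inside a longer word."""
    fragment = fragment.lower()
    return {key for key, text in verses.items() if fragment in text.lower()}


# Made-up sample texts, not quotations from any module.
verses = {
    "v1": "The regeneration of the people",
    "v2": "A son was born in that generation",
    "v3": "God spoke",
}
index = build_index(verses)
```

With this sample data, `term_hits(index, "ration")` is empty while `substring_scan(verses, "ration")` returns both v1 and v2, and `term_hits(index, "gener*")` finds v2 through the trailing wildcard. `search(index, "son of god")` returns v2 and v3 because the words are OR'ed and "of" contributes nothing, whereas `search(index, "son god", require_all=True)` returns nothing because no single text holds both words.
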
There aren't "clues" that Optimized Searching has different behavior. Perhaps the "Case Insensitive" checkbox should be unchecked and/or disabled?

Lucene, when given <<son of god>>, will find all verses with <<son>> OR <<god>>.

* Perhaps similarly unexpected, MultiWord searching for "son of god" results in 294 case insensitive matches, "Phrase" found 47, and "Optimized" found 5472. After this search, the Optimizing seemed disabled, because searching for "son" took about 20 seconds. Then the next search for "of" crashed ("floating point division by zero").

BibleDesktop/JSword has the same performance problem when searching for "son" when we show 1000 verses at a time. But showing 50 at a time fixes the problem. Looking at JSword in the debugger, I find that the answer is returned almost immediately, but the processing of it is what is taking the time. Part of the problem is that half of the time is spent fetching verses from the module. Since the verses are spread out across the book, getting them requires a lot of disk hits, and if the module is compressed, a lot of CPU. When we list the hits based on score I find that the module read cache is invalidated very often and we have to re-read from disk. (With one read we cache many adjacent verses and serve them out of there.) I don't know how Sword does it, but since JSword is based upon it, it might not be too far different.

"of" is not indexed. It will return zero hits.

This was the second time it crashed ... sorry don't have repeatable sequence of actions ... except that each time Searching was effectively disabled. The button that should be "Search" was "Halt" and stayed as "Halt" even when the Search dialog was dismissed and reentered. I had to shut down BibleCS to get searching to work again.

Here's a repeatable sequence to cause a crash: AKJV Optimize search for "son of god", then search for "son", then search for "of" ... crash. Actually, it is simpler ... search for a very common word like "of" or "the" or "a".

I found that this did not happen when searching for "buzzard" which is not in the KJV.

Hmmm. I would have expected this to also fail if the problem were in Sword's handling of the answer set.

In case the index needed rebuilding, I deleted the AKJV index and clicked on the "Create Index" button. This caused a "C++ Exception" message to show up???

Odd ... after the crash, the AKJV seemed to have "forgotten" that it had an index file created ... that option wasn't available. I had to switch to another module and back to AKJV for it to realize it had the index file created.

Very odd .... while trying out different searches, it has twice happened that the search source switched from AKJV to "Personal Commentary." This was without the "Choose Module" showing, so I don't think it was anything I did.

I'll rebuild the indices and see if the behavior is repeatable.

HTH

_______________________________________________
sword-devel mailing list: [email protected]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page |
