Assuming the behavior of CLucene is the same as Lucene (the Java version), I think I can partially answer some of these:
Sword is using the standard analyzer.

The standard analyzer ignores a list of common words, such as (a, the, in, an, on, ....). In Lucene speak these are stop words. This list may or may not be appropriate for biblical research. The simple analyzer does not have a stop list. The standard analyzer also does a lot of other things, which don't have a net effect on Bibles. The difference between the two in terms of size of the index is 2M to 2.6M. Both take about the same length of time to build an index, though the simple analyzer is just a touch faster.
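
To make the stop word idea concrete, here is a rough sketch in Python. It is not Lucene code. The stop list is a short illustrative one I made up for the example (not the list Lucene ships), and the function names simple_analyze and stop_analyze are hypothetical.

```python
import re

# Illustrative stop list only; NOT the actual list shipped with Lucene.
STOP_WORDS = {"a", "an", "and", "the", "in", "on", "of"}

def simple_analyze(text):
    """Lowercase the text and split on non-letters; keep every word."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def stop_analyze(text):
    """Same tokenization, but drop any word found on the stop list."""
    return [t for t in simple_analyze(text) if t not in STOP_WORDS]

verse = "In the beginning God created the heaven and the earth."
print(simple_analyze(verse))
print(stop_analyze(verse))
```

A word that is on the stop list never reaches the index, so a query consisting only of such a word has nothing to match against.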

Lucene does case-insensitive searching here. Both the standard and the simple analyzer lowercase the words they index, so it does not matter which of the two is used.

Lucene only looks for what you tell it to look for. If you want partial words you have to construct a search with wildcards. Lucene cannot wildcard the beginning of a word, so it cannot find words ending with "ration". In addition to the typical wildcards (* and ?), Lucene also has ~ which, when added to the end of a word, will find words like the one you entered. This is very useful for finding words whose spelling is close but not correct (e.g. abimeleck~).
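
As a rough illustration of the two kinds of matching, here is a small Python sketch over a toy word list. It is not how Lucene implements either feature. The max_edits cutoff is my own assumption for the example, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

def edit_distance(a, b):
    """Levenshtein distance: fewest single-letter edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def wildcard_search(vocab, pattern):
    """Match indexed words against a pattern using * and ?."""
    return sorted(w for w in vocab if fnmatchcase(w, pattern.lower()))

def fuzzy_search(vocab, word, max_edits=2):
    """Find indexed words within max_edits edits (assumed cutoff)."""
    return sorted(w for w in vocab
                  if edit_distance(w, word.lower()) <= max_edits)

# Toy vocabulary standing in for the words of an index.
vocab = {"abimelech", "abraham", "generation", "generations", "regeneration"}
print(wildcard_search(vocab, "generat*"))
print(fuzzy_search(vocab, "abimeleck"))
```

Note that the sketch happily accepts a leading wildcard such as "*ration", which is exactly the case Lucene refuses.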

By default, Lucene uses "OR" as the connector between words. To require two words to be in a search result, use "AND" or prefix each word with "+" (like Google).
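
The difference between the two connectors is just union versus intersection of the verse lists for each word. A minimal Python sketch, using three made-up phrases as toy "verses" (they are not real module text, and the function names are hypothetical):

```python
from collections import defaultdict

def build_index(verses):
    """Map each lowercased word to the set of verse keys containing it."""
    index = defaultdict(set)
    for ref, text in verses.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(ref)
    return index

def search_or(index, words):
    """Verses containing ANY of the words (the default connector)."""
    hits = set()
    for w in words:
        hits |= index.get(w.lower(), set())
    return hits

def search_and(index, words):
    """Verses containing ALL of the words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()

# Toy data for illustration only.
verses = {"v1": "the son of man", "v2": "the word of god", "v3": "the son of god"}
index = build_index(verses)
print(sorted(search_or(index, ["son", "god"])))
print(sorted(search_and(index, ["son", "god"])))
```

This is why an OR search for several common words returns far more hits than a phrase search for the same words.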

Lynn Allan wrote:
Here is my best attempt at a beta before I leave for the summer. Give it a go and let me know what you think.

Have a great summer.

Some questions about indexed/optimized searching:

* Does it always do case-insensitive searching even when the "Case
Sensitive" checkbox is checked? With the AKJV and "Case Sensitive"
checked, it finds 942 matches for "Jesus" and 942 matches for "jesus".
  
Lucene is case insensitive.
* Does indexed searching always do a match on the exact word? For
example, with "Phrase" or "Multi Word" or "Optimized", there are 2
matches for "regeneration" using the AKJV. Phrase and MultiWord find
275 matches for "ration", including the times it is within
"generation" and "regeneration". Optimized search finds 0. Perhaps
this is how it is supposed to work, but it seems like an end-user
might find it unexpected that Optimized Searching gives results that
are very different from "Phrase" and "MultiWord" searching. There
aren't "clues" that Optimized Searching has different behavior.
Perhaps the "Case Insensitive" checkbox should be unchecked and/or
disabled?
  
It always does exact match unless the request uses wild cards.
* Perhaps similarly unexpected, MultiWord searching for "son of god"
results in 294 case insensitive matches, "Phrase" found 47, and
"Optimized" found 5472. After this search, the Optimizing seemed
disabled, because searching for "son" took about 20 seconds. Then
the next search for "of" crashed ("floating point division by zero")
  
Lucene when given <<son of god>> will find all verses with <<son>> OR <<god>>.
BibleDesktop/JSword has the same performance problem when searching for "son" when we
show 1000 verses at a time. But showing 50 at a time fixes the problem.
Looking at JSword in the debugger, I find that the answer is returned almost immediately, but
the processing of it is what is taking the time. Part of the problem is that half of the time is fetching
verses from the module. Since the verses are spread out across the book, getting them requires
a lot of disk hits. And if the module is compressed, lots of CPU. When we list the hits based on score
I find that the module read cache is invalidated very often and we have to re-read from disk.
(With one read we cache many adjacent verses and serve them out of there.)
I don't know how Sword does it, but since JSword is based upon it, it might not be too
far different.
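
To show why the order of the hits matters, here is a small Python simulation. It assumes a cache that holds a single block of adjacent verses at a time; that is a simplification for the example, not a description of the actual JSword or Sword cache. BLOCK_SIZE and the verse numbers are hypothetical.

```python
BLOCK_SIZE = 50  # hypothetical number of adjacent verses fetched per read

def count_block_reads(verse_ordinals, block_size=BLOCK_SIZE):
    """Count reads needed with a one-block cache, visiting verses in order."""
    reads = 0
    cached_block = None
    for ordinal in verse_ordinals:
        block = ordinal // block_size
        if block != cached_block:   # cache miss: fetch the block again
            reads += 1
            cached_block = block
    return reads

# Made-up hit list, ordered by score, jumping between two parts of the book.
hits_by_score = [10, 5000, 12, 5003, 15, 5001]
print(count_block_reads(hits_by_score))          # 6 reads
print(count_block_reads(sorted(hits_by_score)))  # 2 reads
```

Under that assumption, fetching the same hits in verse order needs far fewer reads than fetching them in score order, and fetching only the 50 being displayed bounds the cost either way.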
This was the second time it crashed ... sorry don't have repeatable
sequence of actions ... except that each time Searching was
effectively disabled. The button that should be "Search" was "Halt"
and stayed as "Halt" even when the Search dialog was dismissed and
reentered. I had to shut-down BibleCS to get searching to work again.

Here's a repeatable sequence to cause a crash: AKJV Optimize search
for "son of god", then search for "son", then search for "of" ...
crash.
Actually, it is simpler ... search for a very common word like "of" or
"the" or "a"
  
"of" is not indexed. It will return zero hits.
I found that this did not happen when searching for "buzzard" which is not in the KJV. Hmmm.
I would have expected this to also fail if the problem were in Sword's handling of the answer set.
In case the index needed rebuilding, I deleted the AKJV index and
clicked on the "Create Index" button. This caused a "C++ Exception"
message to show up???

Odd ... after the crash, the AKJV seemed to have "forgotten" that it
had an index file created ... that option wasn't available. I had to
switch to another module and back to AKJV for it to realize it had the
index file created.

Very odd .... while trying out different searches, it has twice
happened that the search source switched from AKJV to "Personal
Commentary." This was without the "Choose Module" showing, so I don't
think it was anything I did.

I'll rebuild the indices and see if the behavior is repeatable.

HTH


_______________________________________________
sword-devel mailing list: [email protected]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page

  