DM Smith wrote:
Hi,

I hope I am posting to the right list.

Yes.


We (sword and jsword at crosswire.org) are indexing bibles with each verse becoming a document, with the verse text being indexed and the verse reference being stored. This way we can search the text and get which verses have hits.


The problem is that a verse is an artificial document boundary.

You could "smear" the document boundary by adding a number of tokens from adjacent verses, directly preceding or following a given verse. Perhaps even adding a full verse from each side.
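A rough sketch of the smearing idea, in plain Java with no Lucene calls: for each verse, build the text to be indexed from the verse itself plus a few tokens borrowed from each neighbour. The class name VerseSmearing, the method smear and the whitespace tokenization are my own placeholders, not anything from Sword or JSword; in a real index the result would go into the indexed field while the verse reference stays stored as before.

```java
import java.util.ArrayList;
import java.util.List;

public class VerseSmearing {
    /**
     * For each verse, returns the text to index: up to `context` trailing
     * tokens of the previous verse, the verse itself, and up to `context`
     * leading tokens of the next verse. Tokens are split on whitespace
     * (an assumption; a real analyzer would do more).
     */
    static List<String> smear(List<String> verses, int context) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < verses.size(); i++) {
            StringBuilder sb = new StringBuilder();
            if (i > 0) {
                String[] prev = verses.get(i - 1).trim().split("\\s+");
                int start = Math.max(0, prev.length - context);
                for (int j = start; j < prev.length; j++) {
                    sb.append(prev[j]).append(' ');
                }
            }
            sb.append(verses.get(i).trim());
            if (i + 1 < verses.size()) {
                String[] next = verses.get(i + 1).trim().split("\\s+");
                int end = Math.min(context, next.length);
                for (int j = 0; j < end; j++) {
                    sb.append(' ').append(next[j]);
                }
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> verses = List.of("a b c", "d e f", "g h i");
        // With context = 2 the middle entry becomes "b c d e f g h".
        System.out.println(smear(verses, 2));
    }
}
```

Setting context to a large value gives the "full verse from each side" variant.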


If you wish, you could also artificially lower their score by adding gaps (token.setPositionIncrement()). Exact matches would then no longer work across boundaries, so you would have to add a phrase query with a slop to your main query.
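To illustrate why a gap breaks exact matches and why a slop restores them, here is a simplified model in plain Java. It only mimics how token positions add up when the first token of the neighbouring verse gets a larger position increment; it does not call Lucene and does not reproduce Lucene's scoring. The names GapPositions, positions and inOrderWithinSlop are hypothetical, and the slop test covers only the simple case of two terms in order.

```java
public class GapPositions {
    /**
     * Positions for the tokens of two adjacent verses indexed together.
     * Every token advances the position by 1, except the first token of
     * the second verse, which advances it by `gap`.
     */
    static int[] positions(int firstLen, int secondLen, int gap) {
        int[] pos = new int[firstLen + secondLen];
        int p = -1;
        for (int i = 0; i < pos.length; i++) {
            p += (i == firstLen) ? gap : 1;
            pos[i] = p;
        }
        return pos;
    }

    /**
     * Simplified two-term phrase test: term B must follow term A with at
     * most `slop` empty positions in between. slop = 0 is an exact phrase.
     */
    static boolean inOrderWithinSlop(int posA, int posB, int slop) {
        return posB > posA && (posB - posA - 1) <= slop;
    }

    public static void main(String[] args) {
        int[] pos = positions(3, 3, 5);
        // Last token of verse 1 and first token of verse 2:
        System.out.println(inOrderWithinSlop(pos[2], pos[3], 0)); // exact phrase
        System.out.println(inOrderWithinSlop(pos[2], pos[3], 4)); // sloppy phrase
    }
}
```

In this model a gap of g needs a slop of at least g - 1 for a phrase that straddles the boundary, which is why the slop clause has to be sized to match the gap you choose.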


Frequently, verses cut a paragraph into parts, a poem into stanzas, ... and the significant parts are across verses. (But we usually don't have these in our markup)


Is there any thought of adding a NEAR operator that will work across documents?
>
> Specifically, find x NEAR y, where the distance given to near is not
> understood as words but documents.
>

I assume that you also add fields for books and chapters. While the chapter boundary is sometimes disputed, the book boundaries are pretty accurate ;-). You could create an equivalent of the "near" operator by limiting your search within a single book (by adding a required clause), and then from the list of hits (which should be pretty small in that case) you could programmatically select verses that match your proximity criteria.
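The post-filtering step could look roughly like the sketch below. It assumes each verse carries a sequential ordinal within its book (my assumption, you would have to store such a field yourself) and that you have already run one search per term, restricted to a single book, and collected the ordinals of the hits. VerseNear and near are made-up names. The loop is quadratic, which should be acceptable for the small hit lists you get inside one book.

```java
import java.util.ArrayList;
import java.util.List;

public class VerseNear {
    /**
     * Returns all {x, y} pairs of verse ordinals, one from each hit list,
     * that lie at most `maxDistance` verses apart. A distance of 0 means
     * both terms occur in the same verse.
     */
    static List<int[]> near(int[] xHits, int[] yHits, int maxDistance) {
        List<int[]> pairs = new ArrayList<>();
        for (int x : xHits) {
            for (int y : yHits) {
                if (Math.abs(x - y) <= maxDistance) {
                    pairs.add(new int[] {x, y});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical hit ordinals for two terms within one book.
        int[] xHits = {3, 40};
        int[] yHits = {5, 90};
        for (int[] p : near(xHits, yHits, 2)) {
            System.out.println(p[0] + " NEAR " + p[1]);
        }
    }
}
```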

It would also be good to have the ability to have search automatically consider that adjacent documents are flowing unless some token in the document interrupts the flow. In this case, search would return a compound document as a hit.

Lucene doesn't have a notion of compound documents; it's up to the application to do that. However, it's easy to retrieve documents that precede or follow a given document. It's also easy to retrieve documents that contain a given term (similar to a primary key), let's say "John 1:12". You could also add a field to flag a given document as the "end of chapter", or "end of book".
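On the application side, assembling a compound hit from such flags might be sketched as follows. This is plain Java working on an array of flags, one per verse in document order, where true means "this verse ends a chapter or book"; how you load those flags from the index is left out. CompoundHit, expand and the radius limit are my own illustrative choices.

```java
public class CompoundHit {
    /**
     * Expands a hit into a range of neighbouring verses. The range grows
     * by at most `radius` verses in each direction and never crosses a
     * verse whose flag marks the end of a unit (chapter or book).
     * Returns {start, end}, both inclusive.
     */
    static int[] expand(boolean[] endsUnit, int hit, int radius) {
        int start = hit;
        // Step back while the preceding verse does not close a unit.
        while (start > 0 && hit - start < radius && !endsUnit[start - 1]) {
            start--;
        }
        int end = hit;
        // Step forward while the current last verse does not close a unit.
        while (end < endsUnit.length - 1 && end - hit < radius && !endsUnit[end]) {
            end++;
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        // Hypothetical flags: verses 2 and 5 each end a chapter.
        boolean[] endsUnit = {false, false, true, false, false, true};
        int[] range = expand(endsUnit, 3, 2);
        System.out.println(range[0] + ".." + range[1]);
    }
}
```

The application would then fetch the verses in that range and present them together as one hit.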


I would be more than happy to help you find a good solution - I'm a born-again Christian, and I use the Sword application from time to time...

--
Best regards,
Andrzej Bialecki
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

