Hi AJ -

Depending on your needs, you could create a Lucene document for each sentence (in which case searching for and returning sentences is trivial), or create one Lucene document for each of your documents, with embedded sentence start/stop markers (as special symbols). Or, instead of a special symbol, you can increase the token position after each end of sentence so that there is a large gap between sentences; this will give higher scores to intra-sentence matches.
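
To make the position-gap idea concrete, here is a minimal sketch in plain Python. It does not use the Lucene API; the gap size, the naive sentence splitter, and the helper names (index_positions, within_slop) are all assumptions made up for illustration. It only shows why a proximity match with a small slop cannot cross a sentence boundary once the gap is in place.

```python
import re

SENTENCE_GAP = 100  # assumed gap size; must exceed any slop used at query time


def index_positions(text, gap=SENTENCE_GAP):
    """Map each token to its positions, jumping ahead by `gap` after each sentence."""
    positions = {}
    pos = 0
    # Naive split after ., ! or ? (an assumption; real text needs a better splitter).
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for token in re.findall(r"\w+", sentence.lower()):
            positions.setdefault(token, []).append(pos)
            pos += 1
        pos += gap  # leave a large hole between sentences
    return positions


def within_slop(positions, first, second, slop):
    """True if `first` and `second` occur within `slop` positions of each other."""
    return any(
        abs(a - b) <= slop
        for a in positions.get(first, [])
        for b in positions.get(second, [])
    )


text = "Lucene indexes text. Sentences are stored in big files."
pos = index_positions(text)
print(within_slop(pos, "lucene", "text", 5))     # both terms in the same sentence
print(within_slop(pos, "text", "sentences", 5))  # terms on either side of a boundary
```

The first check succeeds and the second fails, because the gap pushes the second sentence's tokens far beyond the slop.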

If you insert special sentence marker symbols, you can use a span search to guarantee that a phrase occurs inside a single sentence. When a match occurs, you can use the document's TermPositionVector to re-create the original sentence or, alternatively, use a sentence number embedded in the Lucene index (perhaps symbols like "__sentence_start" and "__sentence_num_20") to grab the original sentence from a file containing the full text with sentence markers (perhaps XML tags: "<sentence num=20>").
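
As a rough illustration of the marker approach, the plain-Python sketch below (again, not the Lucene API) walks a token stream with embedded marker symbols, finds the sentence number that contains a phrase, and then pulls the original sentence out of a marked-up full text. The sample data and the function names (sentence_of_phrase, original_sentence) are hypothetical.

```python
import re

# Hypothetical marked-up full text, following the "<sentence num=20>" idea.
FULL_TEXT = (
    "<sentence num=19>Lucene builds an inverted index.</sentence>"
    "<sentence num=20>Span queries match terms by position.</sentence>"
)

# Hypothetical indexed token stream with embedded marker symbols.
TOKENS = [
    "__sentence_start", "__sentence_num_19",
    "lucene", "builds", "an", "inverted", "index",
    "__sentence_start", "__sentence_num_20",
    "span", "queries", "match", "terms", "by", "position",
]


def sentence_of_phrase(tokens, phrase):
    """Return the number of the sentence containing `phrase`, or None.

    The phrase must appear as consecutive tokens, so the marker symbols
    between sentences prevent a match from crossing a boundary.
    """
    current = None
    n = len(phrase)
    for i, token in enumerate(tokens):
        marker = re.fullmatch(r"__sentence_num_(\d+)", token)
        if marker:
            current = int(marker.group(1))
        elif tokens[i:i + n] == phrase:
            return current
    return None


def original_sentence(full_text, num):
    """Look up the original sentence text by its number in the marked-up text."""
    pattern = r"<sentence num=%d>(.*?)</sentence>" % num
    found = re.search(pattern, full_text, re.DOTALL)
    return found.group(1) if found else None


num = sentence_of_phrase(TOKENS, ["match", "terms"])
print(num, original_sentence(FULL_TEXT, num))
```

In a real index the sentence lookup would more likely go through a byte offset or a separate store than a regex scan over the whole file; the scan just keeps the sketch short.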

I use techniques such as the above for a very large Lucene index of documents with embedded sentence markers. There are various trade-offs in terms of index size (how much information to keep in the index), expected query performance, and so on.

---marc hadfield



AJ Chen wrote:

I'd appreciate any advice on whether Lucene is appropriate for indexing and
searching sentences.  I have millions of documents broken down into millions of
sentences. The sentences do not exist as separate documents.  All these sentences
are in a small number of big files. How can I use Lucene to index/search the
sentences? A search should return which sentences match the query.  If Lucene
cannot do this, is there a better approach besides using a MySQL database?

Thanks,
AJ


