Hello,

I'm indexing some content (articles) whose text I cannot store in its original 
form for copyright reasons.  I am allowed to index the content, just not store 
it.  However, I still need snippets and search term highlighting.  


Any way to accomplish this elegantly?  Or even not so elegantly?

Here is one idea:

* Create two indices: a main index for indexing (but not storing) the original 
content, and a secondary index for storing individual sentences from the 
original article.

* That is, before indexing an article, split it into sentences.  Then index the 
article in the main index, and index+store each sentence in the secondary 
index.  So for each doc in the main index there will be multiple docs in the 
secondary index with individual sentences.  Each sentence doc includes an ID of 
the "parent" document.

* Then run queries against the main index, and pull individual sentences from 
the secondary index for snippet+highlight purposes.
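
To make the idea concrete, here is a minimal in-memory sketch in Python.  It is 
not Lucene or Solr code: the class name TwoIndexSketch, the naive sentence 
splitter, and the dict-based "indices" are all assumptions made purely for 
illustration.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercased word tokens; a real analyzer would do much more.
    return re.findall(r"\w+", text.lower())


class TwoIndexSketch:
    """Hypothetical stand-in for the main and secondary indices."""

    def __init__(self):
        # Main index: term -> article IDs. The article text is never kept.
        self.main = defaultdict(set)
        # Secondary index: stored sentence docs, each with a parent ID.
        self.sentence_docs = []

    def add_article(self, article_id, text):
        for term in tokenize(text):
            self.main[term].add(article_id)
        for n, sentence in enumerate(split_sentences(text), start=1):
            self.sentence_docs.append(
                {"parent_id": article_id, "sentence_no": n, "text": sentence}
            )

    def search_and(self, terms):
        # Article-level AND query: every term must occur somewhere in the article.
        sets = [self.main.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    def sentences_of(self, article_id):
        # Pull all stored sentences that belong to one parent article.
        return [d for d in self.sentence_docs if d["parent_id"] == article_id]


idx = TwoIndexSketch()
idx.add_article("a1", "Foo is here. Nothing to see. Bar is there.")
matches = idx.search_and(["foo", "bar"])
children = idx.sentences_of("a1")
```

Note that search_and only reports which articles matched; it says nothing about 
which of the child sentences contained the terms.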


The problem I see with this approach (and there may be others that I am not 
seeing yet) is with queries like foo AND bar.  In this case "foo" may be a 
match from sentence #1, and "bar" may be a match from sentence #7.  Or maybe 
"foo" is a match in sentence #1, and "bar" is a match in multiple sentences: 
#7, #10, and #23.

Regardless, when a query is run against the main index, you don't know where 
the match was, so you don't know which sentences to fetch from the secondary 
index.
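
One possible (untested at scale) way around this would be to look up each query 
term separately in the secondary index, restricted to the parent ID of the 
matched article.  The sketch below is plain Python over a list of dicts, not 
Lucene code; the function name sentences_per_term and the field names 
parent_id, sentence_no and text are assumptions for illustration only.

```python
import re


def sentences_per_term(sentence_docs, parent_id, terms):
    """For each term, list the sentence numbers of one parent that contain it."""
    hits = {}
    for term in terms:
        # Whole-word, case-insensitive match on the stored sentence text.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits[term] = [
            d["sentence_no"]
            for d in sentence_docs
            if d["parent_id"] == parent_id and pattern.search(d["text"])
        ]
    return hits


# Hypothetical sentence docs as they might sit in the secondary index.
docs = [
    {"parent_id": "a1", "sentence_no": 1, "text": "Foo starts the article."},
    {"parent_id": "a1", "sentence_no": 7, "text": "Later, bar shows up."},
    {"parent_id": "a1", "sentence_no": 10, "text": "Bar appears again."},
    {"parent_id": "a2", "sentence_no": 1, "text": "Foo and bar in another article."},
]

hits = sentences_per_term(docs, "a1", ["foo", "bar"])
```

This costs one extra lookup per term per displayed hit, and it only handles 
simple term queries; phrases, wildcards and so on would need more thought.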

Does anyone have any suggestions for how to handle this?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
