Otis, just curious: storing the full text is not allowed, but splitting it up into separate sentences is okay?
While you are thinking of using the sentences only as a secondary/additional source, maybe it would help to search in the sentences themselves. Or would that give misleading results in your case?

Stefan

On Wed, Jan 12, 2011 at 12:02 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> I'm indexing some content (articles) whose text I cannot store in its
> original form for copyright reasons. So I can index the content, but
> cannot store it. However, I need snippets and search term highlighting.
>
> Any way to accomplish this elegantly? Or even not so elegantly?
>
> Here is one idea:
>
> * Create 2 indices: main index for indexing (but not storing) the original
>   content, the secondary index for storing individual sentences from the
>   original article.
>
> * That is, before indexing an article, split it into sentences. Then index
>   the article in the main index, and index+store each sentence in the
>   secondary index. So for each doc in the main index there will be multiple
>   docs in the secondary index with individual sentences. Each sentence doc
>   includes an ID of the "parent" document.
>
> * Then run queries against the main index, and pull individual sentences
>   from the secondary index for snippet+highlight purposes.
>
> The problem I see with this approach (and there may be other ones that I am
> not seeing yet) is with queries like foo AND bar. In this case "foo" may be
> a match from sentence #1, and "bar" may be a match from sentence #7. Or
> maybe "foo" is a match in sentence #1, and "bar" is a match in multiple
> sentences: #7 and #10 and #23.
>
> Regardless, when a query is run against the main index, you don't know
> where the match was, so you don't know which sentences to go get from the
> secondary index.
>
> Does anyone have any suggestions for how to handle this?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
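
To make the "search in the sentences" suggestion a bit more concrete, here is a rough sketch of how the two-index idea from the quoted mail could handle a query like foo AND bar. It is only an illustration under several assumptions: plain in-memory Python dicts stand in for the two indices (this is not the Lucene or Solr API), the sentence splitter is deliberately naive, and every name in it (`TwoIndexSketch`, `split_sentences`, `tokenize`, `highlight`, the `<b>` markers) is made up for the example. The main index decides which articles match; the sentence index is then asked, per term and restricted to the parent ID, which sentences contain that term.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercased word tokens; a real analyzer would do much more.
    return re.findall(r"\w+", text.lower())


def highlight(sentence, terms):
    # Wrap whole-word, case-insensitive matches of any term in <b>...</b>.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(r"<b>\1</b>", sentence)


class TwoIndexSketch:
    """Hypothetical in-memory stand-in for the main and secondary index."""

    def __init__(self):
        self.main = defaultdict(set)        # term -> parent doc ids (text not stored)
        self.sent_terms = defaultdict(set)  # term -> (parent doc id, sentence number)
        self.sent_store = {}                # (parent doc id, sentence number) -> sentence

    def add(self, doc_id, text):
        # Main index: indexed only, the article text itself is not kept.
        for term in tokenize(text):
            self.main[term].add(doc_id)
        # Secondary index: each sentence is indexed and stored with its parent ID.
        for number, sentence in enumerate(split_sentences(text), start=1):
            self.sent_store[(doc_id, number)] = sentence
            for term in tokenize(sentence):
                self.sent_terms[term].add((doc_id, number))

    def search_and(self, terms):
        # "foo AND bar" against the main index: every term must occur in the doc.
        postings = [self.main.get(t.lower(), set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    def snippets(self, doc_id, terms):
        # Query the sentence index per term, restricted to the parent doc,
        # so each term may surface a different sentence.
        hits = set()
        for t in terms:
            for key in self.sent_terms.get(t.lower(), set()):
                if key[0] == doc_id:
                    hits.add(key)
        return [
            (number, highlight(self.sent_store[(d, number)], terms))
            for d, number in sorted(hits)
        ]
```

Usage would be along the lines of calling `search_and(["foo", "bar"])` to get the matching parent IDs and then `snippets(parent_id, ["foo", "bar"])` for each of them. The point of the per-term lookup is that "foo" matching in sentence #1 and "bar" in sentence #7 is no longer a problem, since the AND is only enforced at the article level. Whether that gives misleading snippets for phrase or proximity queries is something this sketch does not address.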