Hi Stefan,

Yes, splitting into separate sentences (and storing them) is OK, because from a bunch of sentences you can't really reconstruct the original article unless you know which order to put them in.
Searching against the sentence won't work for queries like foo AND bar because this should match original articles even if foo and bar are in different sentences.

Otis

----- Original Message ----
> From: Stefan Matheis <matheis.ste...@googlemail.com>
> To: solr-user@lucene.apache.org
> Sent: Wed, January 12, 2011 7:02:46 AM
> Subject: Re: Not storing, but highlighting from document sentences
>
> Otis,
>
> just interested in .. storing the full text is not allowed, but splitting up
> in separate sentences is okay?
>
> while you think about using the sentences only as secondary/additional
> source, maybe it would help to search in the sentences itself, or would that
> give misleading results in your case?
>
> Stefan
>
> On Wed, Jan 12, 2011 at 12:02 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
> > Hello,
> >
> > I'm indexing some content (articles) whose text I cannot store in its
> > original form for copyright reason. So I can index the content, but
> > cannot store it. However, I need snippets and search term highlighting.
> >
> > Any way to accomplish this elegantly? Or even not so elegantly?
> >
> > Here is one idea:
> >
> > * Create 2 indices: main index for indexing (but not storing) the
> > original content, the secondary index for storing individual sentences
> > from the original article.
> >
> > * That is, before indexing an article, split it into sentences. Then
> > index the article in the main index, and index+store each sentence in
> > the secondary index. So for each doc in the main index there will be
> > multiple docs in the secondary index with individual sentences. Each
> > sentence doc includes an ID of the "parent" document.
> >
> > * Then run queries against the main index, and pull individual sentences
> > from the secondary index for snippet+highlight purposes.
> >
> > The problem I see with this approach (and there may be other ones that I
> > am not seeing yet) is with queries like foo AND bar. In this case "foo"
> > may be a match from sentence #1, and "bar" may be a match from sentence
> > #7. Or maybe "foo" is a match in sentence #1, and "bar" is a match in
> > multiple sentences: #7 and #10 and #23.
> >
> > Regardless, when a query is run against the main index, you don't know
> > where the match was, so you don't know which sentences to go get from
> > the secondary index.
> >
> > Does anyone have any suggestions for how to handle this?
> >
> > Thanks,
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
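
To make the two-index idea above concrete, here is a rough sketch in plain Python, with no Solr or Lucene calls involved. The main index keeps only term-to-article postings (no text), and the secondary index stores individual sentences with a parent ID. Everything named here (TwoIndexSketch, split_sentences, highlight, the <em> markup, the naive sentence splitter) is made up for illustration. The second step, where the query terms are matched as an OR against the parent's sentences to decide which ones to fetch, is an assumption of this sketch and not something proposed in the thread.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real analyzer."""
    return re.findall(r"\w+", text.lower())


def split_sentences(text):
    """Naive splitter on ., ! and ? (assumption; real text needs better)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def highlight(sentence, terms):
    """Wrap whole-word occurrences of any term in <em> tags."""
    alternatives = "|".join(re.escape(t) for t in sorted(terms))
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(r"<em>\1</em>", sentence)


class TwoIndexSketch:
    def __init__(self):
        # Main index: term -> IDs of articles containing it. No text stored.
        self.main = defaultdict(set)
        # Secondary index: one stored "doc" per sentence, with a parent ID.
        self.sentence_docs = []

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.main[term].add(doc_id)
        # Sorting discards the original sentence order on purpose.
        for sentence in sorted(split_sentences(text)):
            self.sentence_docs.append({"parent": doc_id, "text": sentence})

    def search_and(self, terms):
        """foo AND bar against whole articles, as in the main index."""
        postings = [self.main.get(t.lower(), set()) for t in terms]
        return set.intersection(*postings) if postings else set()

    def snippets(self, doc_id, terms):
        """Sentences of one article that contain ANY of the query terms."""
        wanted = {t.lower() for t in terms}
        return [
            highlight(doc["text"], wanted)
            for doc in self.sentence_docs
            if doc["parent"] == doc_id and wanted & set(tokenize(doc["text"]))
        ]


if __name__ == "__main__":
    index = TwoIndexSketch()
    index.add("a1", "Foo is here. Nothing to see. Bar shows up later.")
    index.add("a2", "Only foo appears in this one.")
    for doc_id in sorted(index.search_and(["foo", "bar"])):
        print(doc_id, index.snippets(doc_id, ["foo", "bar"]))
```

The AND is evaluated only against the whole article, so an article still matches when foo and bar sit in different sentences. The sentence lookup is a separate, looser query limited to the parent ID, which sidesteps the need to know where the main-index match occurred, at the cost of a second lookup per hit.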