On 01/12/2011 12:02 PM, Otis Gospodnetic wrote:
> Hello,
>
> I'm indexing some content (articles) whose text I cannot store in its
> original form for copyright reasons. So I can index the content, but
> cannot store it. However, I need snippets and search term highlighting.
>
> Any way to accomplish this elegantly? Or even not so elegantly?
>
> Here is one idea:
>
> * Create 2 indices: main index for indexing (but not storing) the original
>   content, the secondary index for storing individual sentences from the
>   original article.

How about storing the sentences in the same index, in a separate field but
in random order? Would that be OK?
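
A rough sketch of the idea, in plain Python rather than actual Lucene/Solr
code. The function names (split_sentences, shuffled_sentences, snippets), the
naive regex sentence splitter and the ** highlight markers are all made up for
illustration. The assumption is that the shuffled sentences would go into a
stored, multi-valued field next to the indexed-but-not-stored body field.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def shuffled_sentences(text, seed=None):
    """Return the article's sentences in random order.

    This list is what would go into the stored field, so the original
    running text cannot be read back from the index in order.
    """
    sentences = split_sentences(text)
    random.Random(seed).shuffle(sentences)
    return sentences


def snippets(stored_sentences, terms, max_snippets=3):
    """Pick stored sentences containing any query term and mark the matches."""
    if not terms:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for sentence in stored_sentences:
        if pattern.search(sentence):
            # Wrap every matched term in ** markers (hypothetical highlight style).
            hits.append(pattern.sub(r"**\1**", sentence))
            if len(hits) == max_snippets:
                break
    return hits
```

For a query like foo AND bar, the search already tells you which document
matched, so you would only scan that one document's stored sentences for
either term. It then should not matter that "foo" and "bar" sit in different
sentences, though I have not tried this against a real index.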
Tarjei

> * That is, before indexing an article, split it into sentences. Then index
>   the article in the main index, and index+store each sentence in the
>   secondary index. So for each doc in the main index there will be multiple
>   docs in the secondary index with individual sentences. Each sentence doc
>   includes an ID of the "parent" document.
>
> * Then run queries against the main index, and pull individual sentences
>   from the secondary index for snippet+highlight purposes.
>
> The problem I see with this approach (and there may be other ones that I am
> not seeing yet) is with queries like foo AND bar. In this case "foo" may be
> a match from sentence #1, and "bar" may be a match from sentence #7. Or
> maybe "foo" is a match in sentence #1, and "bar" is a match in multiple
> sentences: #7 and #10 and #23.
>
> Regardless, when a query is run against the main index, you don't know
> where the match was, so you don't know which sentences to go get from the
> secondary index.
>
> Does anyone have any suggestions for how to handle this?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/

--
Regards / Med vennlig hilsen
Tarjei Huse
Mobil: 920 63 413