Otis,

Just curious: storing the full text is not allowed, but splitting it up
into separate sentences is okay?

While you are thinking of using the sentences only as a secondary/additional
source, maybe it would help to search the sentences themselves, or would
that give misleading results in your case?
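
A minimal Python sketch of that trade-off, using a toy in-memory list rather
than Lucene or Solr. The names search_sentences and sentence_docs and the
sample sentences are made up for illustration. It searches the sentence docs
directly and groups hits by parent ID:

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_sentences(sentence_docs, terms):
    """Return {parent_id: [sentence_no, ...]} for sentences containing ALL terms."""
    wanted = {t.lower() for t in terms}
    hits = defaultdict(list)
    for parent_id, sentence_no, text in sentence_docs:
        # A sentence matches only if every query term occurs in that sentence.
        if wanted <= tokenize(text):
            hits[parent_id].append(sentence_no)
    return dict(hits)


# Hypothetical sentence docs: (parent article ID, sentence number, stored text).
sentence_docs = [
    ("article-1", 1, "Foo is mentioned in the opening sentence."),
    ("article-1", 7, "Much later the article turns to bar."),
    ("article-2", 1, "Here foo and bar appear together."),
]

both = search_sentences(sentence_docs, ["foo", "bar"])
only_foo = search_sentences(sentence_docs, ["foo"])
```

Here a search for both terms finds only article-2, although article-1
contains both terms in different sentences. That is the kind of misleading
result an AND query could give when run against sentences alone.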

Stefan

On Wed, Jan 12, 2011 at 12:02 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> I'm indexing some content (articles) whose text I cannot store in its
> original form for copyright reasons.  So I can index the content, but
> cannot store it.  However, I need snippets and search term highlighting.
>
>
> Any way to accomplish this elegantly?  Or even not so elegantly?
>
> Here is one idea:
>
> * Create 2 indices: a main index for indexing (but not storing) the
> original content, and a secondary index for storing individual sentences
> from the original article.
>
> * That is, before indexing an article, split it into sentences.  Then
> index the article in the main index, and index+store each sentence in the
> secondary index.  So for each doc in the main index there will be multiple
> docs in the secondary index with individual sentences.  Each sentence doc
> includes an ID of the "parent" document.
>
> * Then run queries against the main index, and pull individual sentences
> from the secondary index for snippet+highlight purposes.
>
>
> The problem I see with this approach (and there may be other ones that I
> am not seeing yet) is with queries like foo AND bar.  In this case "foo"
> may be a match from sentence #1, and "bar" may be a match from sentence
> #7.  Or maybe "foo" is a match in sentence #1, and "bar" is a match in
> multiple sentences: #7 and #10 and #23.
>
> Regardless, when a query is run against the main index, you don't know
> where the match was, so you don't know which sentences to go get from the
> secondary index.
>
> Does anyone have any suggestions for how to handle this?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
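
One possible way to handle the foo AND bar case described in the quoted
message, sketched as a toy in-memory Python model and not as Lucene or Solr
code: match articles in the main index, then look up each query term
separately among that article's sentences and rank sentences by how many
distinct terms they contain. All names here (TwoIndexSketch, split_sentences,
highlight) are hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def highlight(sentence, terms):
    """Wrap whole-word, case-insensitive occurrences of the terms in <b> tags."""
    if not terms:
        return sentence
    pattern = r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.sub(pattern, r"<b>\1</b>", sentence, flags=re.IGNORECASE)


class TwoIndexSketch:
    def __init__(self):
        self.main = defaultdict(set)            # term -> article IDs (no text stored)
        self.sentences = {}                     # (article ID, sentence no) -> stored text
        self.sentence_terms = defaultdict(set)  # (article ID, term) -> sentence numbers

    def add(self, article_id, text):
        # Main index: index the whole article, store nothing.
        for term in tokenize(text):
            self.main[term].add(article_id)
        # Secondary index: index and store each sentence with its parent ID.
        for no, sentence in enumerate(split_sentences(text), start=1):
            self.sentences[(article_id, no)] = sentence
            for term in tokenize(sentence):
                self.sentence_terms[(article_id, term)].add(no)

    def search(self, terms):
        """AND query against the main index: articles containing every term."""
        sets = [self.main.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

    def snippets(self, article_id, terms, limit=2):
        """Sentences of one article, ranked by number of distinct terms matched."""
        counts = defaultdict(int)
        for t in terms:
            for no in self.sentence_terms.get((article_id, t.lower()), ()):
                counts[no] += 1
        ranked = sorted(counts, key=lambda no: (-counts[no], no))
        return [self.sentences[(article_id, no)] for no in ranked[:limit]]


idx = TwoIndexSketch()
idx.add("a1", "Foo opens the article. Nothing here. Bar closes it.")
idx.add("a2", "Only foo here.")
matches = idx.search(["foo", "bar"])
snips = idx.snippets("a1", ["foo", "bar"])
```

The point of the design is that the sentence lookup is effectively an OR of
the query terms restricted to one parent ID, so it does not need to know
where the main-index match occurred. In a real setup this would presumably be
a second query per displayed hit, which has a cost this sketch does not
model.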
