Re: Not storing, but highlighting from document sentences

2011-01-18 Thread Ahson Iqbal
Hi

A simple solution to this could be: for all such searches (foo AND bar), run
them as-is against the first (primary) index, and when sending these queries to
the secondary index, replace AND with OR.

But in this particular scenario you could also have problems with proximity
and phrase queries, which are much more difficult to tackle.
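
A minimal sketch of that AND-to-OR rewrite, in Python. The function name
relax_for_sentence_index is made up for illustration, and the code only does
text substitution on the query string; it does not use any Solr API. Quoted
phrases are left untouched, which sidesteps only the simplest part of the
phrase/proximity problem mentioned above.

```python
import re

def relax_for_sentence_index(query: str) -> str:
    """Replace AND operators with OR, leaving quoted phrases alone.

    Hypothetical helper: plain string rewriting, not a real query parser.
    """
    # Split into quoted phrases and everything else, so that an AND inside
    # a phrase such as "rock AND roll" is not rewritten.
    parts = re.split(r'("[^"]*")', query)
    relaxed = []
    for part in parts:
        if part.startswith('"'):
            relaxed.append(part)                      # phrase: keep verbatim
        else:
            relaxed.append(re.sub(r'\bAND\b', 'OR', part))
    return ''.join(relaxed)

print(relax_for_sentence_index('foo AND bar'))
```

The primary index would still receive the original query; only the copy sent
to the secondary (sentence) index is relaxed.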

Regards
Ahsan






From: Otis Gospodnetic otis_gospodne...@yahoo.com
To: solr-user@lucene.apache.org
Sent: Tue, January 18, 2011 12:25:12 PM
Subject: Re: Not storing, but highlighting from document sentences

Hi Tarjei,

:)
Yeah, that is the solution we are going with, actually.


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Not storing, but highlighting from document sentences

2011-01-17 Thread Tarjei Huse
On 01/12/2011 12:02 PM, Otis Gospodnetic wrote:
> Hello,
>
> I'm indexing some content (articles) whose text I cannot store in its
> original form for copyright reasons.  So I can index the content, but
> cannot store it.  However, I need snippets and search term highlighting.
>
> Any way to accomplish this elegantly?  Or even not so elegantly?
>
> Here is one idea:
>
> * Create 2 indices: main index for indexing (but not storing) the original
> content, the secondary index for storing individual sentences from the
> original article.

How about storing the sentences in the same index in a separate field
but with random ordering, would that be ok?

Tarjei

> * That is, before indexing an article, split it into sentences.  Then index
> the article in the main index, and index+store each sentence in the
> secondary index.  So for each doc in the main index there will be multiple
> docs in the secondary index with individual sentences.  Each sentence doc
> includes an ID of the parent document.
>
> * Then run queries against the main index, and pull individual sentences
> from the secondary index for snippet+highlight purposes.
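
The two-index layout described in the bullets above could be prepared roughly
like this. It is only a sketch under assumptions: the field names (id, text,
articleID, sentence) are invented for illustration, the sentence splitter is
deliberately naive, and whether a field is stored or only indexed would be
decided in each index's schema, not in this code.

```python
import re
import uuid

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def build_docs(article_id, text):
    """Return (main_doc, sentence_docs) for the two hypothetical indices."""
    # Main index: full text, assumed to be indexed but NOT stored.
    main_doc = {"id": article_id, "text": text}
    # Secondary index: one doc per sentence, indexed and stored.
    # Random IDs are used so the doc IDs do not reveal sentence order.
    sentence_docs = [
        {"id": uuid.uuid4().hex, "articleID": article_id, "sentence": s}
        for s in split_sentences(text)
    ]
    return main_doc, sentence_docs

main_doc, sentence_docs = build_docs("a1", "Foo is here. Bar is there! Done?")
print(len(sentence_docs))
```

Each sentence doc carries the parent article's ID, which is what later lets a
query against the sentence index be restricted to specific articles.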


> The problem I see with this approach (and there may be other ones that I am
> not seeing yet) is with queries like foo AND bar.  In this case foo may be
> a match from sentence #1, and bar may be a match from sentence #7.  Or maybe
> foo is a match in sentence #1, and bar is a match in multiple sentences: #7
> and #10 and #23.
>
> Regardless, when a query is run against the main index, you don't know where
> the match was, so you don't know which sentences to go get from the
> secondary index.
>
> Does anyone have any suggestions for how to handle this?
>
> Thanks,
> Otis
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/



-- 
Regards / Med vennlig hilsen
Tarjei Huse
Mobil: 920 63 413



Re: Not storing, but highlighting from document sentences

2011-01-17 Thread Otis Gospodnetic
Hi Tarjei,

:)
Yeah, that is the solution we are going with, actually.
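
For illustration, the single-index variant (sentences kept in a separate field
of the same document, in random order) might be prepared like this. The field
names text and sentences are assumptions, and the sketch only builds the
document; it says nothing about how the schema marks the fields as stored or
multi-valued.

```python
import random

def build_single_doc(article_id, text, sentences, rng=None):
    """Build one doc with the full text plus its sentences in shuffled order."""
    rng = rng or random.Random()
    shuffled = list(sentences)      # copy, so the caller's list keeps its order
    rng.shuffle(shuffled)
    return {
        "id": article_id,
        "text": text,               # assumed: indexed, not stored
        "sentences": shuffled,      # assumed: multi-valued, stored
    }

doc = build_single_doc("a1", "A. B. C.", ["A.", "B.", "C."])
print(sorted(doc["sentences"]))
```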


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Not storing, but highlighting from document sentences

2011-01-12 Thread Stefan Matheis
Otis,

Just curious: storing the full text is not allowed, but splitting it up
into separate sentences is okay?

While you are thinking about using the sentences only as a secondary/additional
source, maybe it would help to search the sentences themselves, or would that
give misleading results in your case?

Stefan


Re: Not storing, but highlighting from document sentences

2011-01-12 Thread Otis Gospodnetic
Hi Stefan,

Yes, splitting into separate sentences (and storing them) is OK because with a
bunch of sentences you can't really reconstruct the original article unless you
know which order to put them in.

Searching against the sentences won't work for queries like foo AND bar because
such a query should match original articles even if foo and bar are in
different sentences.
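
A toy illustration of that point, using plain Python string matching in place
of a real index (the helper matches_all and the sample sentences are invented
for the example):

```python
def matches_all(text, terms):
    # Crude stand-in for an AND query: every term must occur as a word.
    words = set(text.lower().replace(".", " ").split())
    return all(term in words for term in terms)

sentences = ["Foo opens the article.", "Nothing to see here.", "Bar closes it."]
article = " ".join(sentences)
terms = ["foo", "bar"]

article_hit = matches_all(article, terms)
sentence_hits = [s for s in sentences if matches_all(s, terms)]

print(article_hit)      # the article as a whole matches foo AND bar
print(sentence_hits)    # but no single sentence contains both terms
```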

Otis




RE: Not storing, but highlighting from document sentences

2011-01-12 Thread Steven A Rowe
Hi Otis,

I think you can get what you want by doing the first stage retrieval, and then 
in the second stage, add required constraint(s) to the query for the matching 
docid(s), and change the AND operators in the original query to OR.  
Coordination will cause the best snippet(s) to rise to the top, no?

Hmm, you'll want to run the second stage once for each hit from the first 
stage, though, unless you can afford to collect *all* hits and pull out each 
first stage's hit from the intermixed second stage results...
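
Running the second stage once per first-stage hit could look roughly like the
sketch below. The search callable is a placeholder for whatever client queries
the sentence index; its signature (query string and rows, returning hits
best-first) is an assumption, as is the articleID field name. Article IDs are
assumed not to need query-syntax escaping.

```python
def second_stage_query(relaxed_query, article_id):
    # Both clauses are required: match the relaxed (OR'd) terms, but only
    # within sentences belonging to this one article.
    return f"+({relaxed_query}) +articleID:{article_id}"

def best_sentence_per_hit(article_ids, relaxed_query, search):
    """Call search() once per article and keep the top sentence for each.

    search(query, rows) is hypothetical and returns a list of hits,
    best first.
    """
    best = {}
    for article_id in article_ids:
        hits = search(second_stage_query(relaxed_query, article_id), rows=1)
        if hits:
            best[article_id] = hits[0]
    return best
```

This costs one sentence-index query per first-stage hit, but each query only
needs rows=1, so there is no need to collect all hits.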

Steve


Re: Not storing, but highlighting from document sentences

2011-01-12 Thread Otis Gospodnetic
Hi Steve,



- Original Message 
> From: Steven A Rowe sar...@syr.edu
> Subject: RE: Not storing, but highlighting from document sentences
>
> I think you can get what you want by doing the first stage retrieval, and
> then in the second stage, add required constraint(s) to the query for the
> matching docid(s), and change the AND operators in the original query to OR.
> Coordination will cause the best snippet(s) to rise to the top, no?

Right, right.
So if the original query is: foo AND bar, I'd run it against the main index
and get the top N hits, say N=10.
Then I'd create another query: +(foo OR bar) +articleID:(ORed list of top N
article IDs from main results)
And then I'd use that to get enough sentence docs to have at least 1 of them
for each hit from the main index.
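
As a sketch, that second query string could be assembled like this. The
function name is made up, terms and IDs are assumed to be simple tokens that
need no escaping, and only string building is shown, not the request itself.

```python
def combined_second_stage_query(terms, article_ids):
    """Build +(t1 OR t2 ...) +articleID:(id1 OR id2 ...)."""
    if not terms or not article_ids:
        raise ValueError("need at least one term and one article ID")
    term_clause = " OR ".join(terms)
    id_clause = " OR ".join(article_ids)
    return f"+({term_clause}) +articleID:({id_clause})"

print(combined_second_stage_query(["foo", "bar"], ["a1", "a2", "a3"]))
```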

Hm, I wonder what happens when instead of simple foo AND bar you have a more 
complex query with more elaborate grouping and such...


> Hmm, you'll want to run the second stage once for each hit from the first
> stage, though, unless you can afford to collect *all* hits and pull out each
> first stage's hit from the intermixed second stage results...

Wouldn't the above get me all sentences I need for top N hits from the main 
result in a single shot, assuming I use high enough rows=NNN to minimize the 
possibility of not getting even 1 sentence for any one of those top N hits?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/ 


Re: Not storing, but highlighting from document sentences

2011-01-12 Thread Tomislav Poljak
Hi Steven,
if I understand correctly, you are suggesting query execution in two
phases: first execute the query on the whole-article index core (where whole
articles are indexed, but not stored) to get article IDs (for articles
which match the original query).  Then, for each match in the article core,
change the AND operators from the original query to OR, add an
articleID condition/filter, and execute that query on the sentence-based
index (assuming each sentence-based doc has articleID set).

Is this correct, and is this what "you'll want to run the second
stage once for each hit from the first stage, though" refers to?

An example for this scenario: for the original query q=apples AND
oranges, execute q=apples AND oranges with fl=articleId on the article
core, and for each articleIdX result execute q=(apples OR oranges) AND
articleId:articleIdX on the sentence-based core.

The same thing (with the same results) should be doable with only a single
query in the second phase. For the previous example, that single query
for all of articleId1,...,articleIdN would be something like:

q=((apples OR oranges) AND articleId:articleId1) OR ((apples OR oranges)
AND articleId:articleId2) OR ... OR ((apples OR oranges) AND
articleId:articleIdN)

But in this second case the results are ordered by sentence score
instead of by article, so the results would have to be re-ordered. Is this
what "unless you can afford to collect *all* hits and pull out each first
stage's hit from the intermixed second stage results" refers to?

My actual question after this really long intro is: couldn't this be
done with the single second-level query approach, but on each topN
start/rows chunk as the user iterates through the first-level results?

For example, the user executes the query q=apples AND oranges and this
returns 1000 results, but the first page displays only, say, 20
results, which means the proposed solution would:

1. phase: execute q=apples AND oranges with fl=articleId on the
article core, but with start=0&rows=20
2. phase: q=((apples OR oranges) AND articleId:articleId1) OR ((apples
OR oranges) AND articleId:articleId2) OR ... OR ((apples OR oranges) AND
articleId:articleId20)
3. Reorder sentence results to match the order defined by article matching
scores and return them to the user

Only, the results here would need to be collapsed on unique articleID,
so that only 20 results are provided in the result set (because multiple
sentence-based docs can be returned for a single unique articleID).
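
Step 3 and the collapsing could be done on the client side along these lines.
This is a sketch only: it assumes the sentence hits arrive sorted best-first
by sentence score and that each hit is a dict with an articleId key; it does
not use any Solr collapsing or grouping feature.

```python
def collapse_and_reorder(article_ids, sentence_hits):
    """Keep the best sentence per article, in first-phase article order.

    article_ids: article IDs in the order returned by the first phase.
    sentence_hits: second-phase hits, assumed sorted best-first.
    """
    best = {}
    for hit in sentence_hits:
        # setdefault keeps the first (highest-scoring) hit per article.
        best.setdefault(hit["articleId"], hit)
    # Articles with no sentence hit get None instead of being dropped.
    return [(article_id, best.get(article_id)) for article_id in article_ids]

hits = [
    {"articleId": "a2", "sentence": "apples everywhere"},
    {"articleId": "a1", "sentence": "oranges too"},
    {"articleId": "a2", "sentence": "more apples"},
]
for article_id, hit in collapse_and_reorder(["a1", "a2", "a3"], hits):
    print(article_id, hit["sentence"] if hit else None)
```

An article that comes back with None is one for which the rows limit of the
second phase was too low to surface any of its sentences.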

Would this work?

Thanks,
Tomislav


RE: Not storing, but highlighting from document sentences

2011-01-12 Thread Steven A Rowe
>> I think you can get what you want by doing the first stage retrieval,
>> and then in the second stage, add required constraint(s) to the query
>> for the matching docid(s), and change the AND operators in the
>> original query to OR.  Coordination will cause the best snippet(s) to
>> rise to the top, no?
>
> Right, right.
> So if the original query is: foo AND bar, I'd run it against the main
> index, get top N hits, say N=10.
> Then I'd create another query: +(foo OR bar) +articleID:(ORed list of top
> N article IDs from main results)
> And then I'd use that to get enough sentence docs to have at least 1 of
> them for each hit from the main index.
>
> Hm, I wonder what happens when instead of simple foo AND bar you have a
> more complex query with more elaborate grouping and such...

:) I was hoping that you could limit the query language to exclude grouping...  
If not, you could walk the boolean query, trim all clauses that are PROHIBITED, 
then flatten all of the remaining terms to a single OR'd query?
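
A sketch of that walk-and-flatten idea, using a made-up nested-tuple
representation of a boolean query rather than Lucene's actual query classes.
A node is either ("term", text) or ("bool", [(occur, child), ...]) with occur
being one of MUST, SHOULD or MUST_NOT; all of these names are assumptions for
the example.

```python
def collect_terms(node):
    """Gather terms from a query tree, skipping prohibited clauses."""
    if node[0] == "term":
        return [node[1]]
    terms = []
    for occur, child in node[1]:
        if occur == "MUST_NOT":
            continue                 # trim prohibited clauses entirely
        terms.extend(collect_terms(child))
    return terms

def flatten_to_or(node):
    """Flatten the remaining terms into a single OR'd query string."""
    unique = []
    for term in collect_terms(node):
        if term not in unique:       # drop duplicates, keep first-seen order
            unique.append(term)
    return " OR ".join(unique)

# foo AND (bar OR baz) AND NOT qux
query = ("bool", [
    ("MUST", ("term", "foo")),
    ("MUST", ("bool", [("SHOULD", ("term", "bar")),
                       ("SHOULD", ("term", "baz"))])),
    ("MUST_NOT", ("term", "qux")),
])
print(flatten_to_or(query))
```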

>> Hmm, you'll want to run the second stage once for each hit from the
>> first stage, though, unless you can afford to collect *all* hits and pull
>> out each first stage's hit from the intermixed second stage results...
>
> Wouldn't the above get me all sentences I need for top N hits from the
> main result in a single shot, assuming I use high enough rows=NNN to
> minimize the possibility of not getting even 1 sentence for any one of
> those top N hits?

Yes, but the problem is that the worst case is that you have to retrieve *all* 
second-stage hits to get at least one for each of the first-stage hits.  So if 
you're okay with NNN = numDocs, then no problem.

Steve



RE: Not storing, but highlighting from document sentences

2011-01-12 Thread Steven A Rowe
Hi Tomislav,

> if I understand correctly, you are suggesting query execution in two
> phases: first execute the query on the whole-article index core (where whole
> articles are indexed, but not stored) to get article IDs (for articles
> which match the original query).  Then, for each match in the article core,
> change the AND operators from the original query to OR, add an
> articleID condition/filter, and execute that query on the sentence-based
> index (assuming each sentence-based doc has articleID set).

Yes.

> Is this correct, and is this what "you'll want to run the second
> stage once for each hit from the first stage, though" refers to?
>
> An example for this scenario: for the original query q=apples AND
> oranges, execute q=apples AND oranges with fl=articleId on the article
> core, and for each articleIdX result execute q=(apples OR oranges) AND
> articleId:articleIdX on the sentence-based core.
>
> The same thing (with the same results) should be doable with only a single
> query in the second phase. For the previous example, that single query
> for all of articleId1,...,articleIdN would be something like:
>
> q=((apples OR oranges) AND articleId:articleId1) OR ((apples OR oranges)
> AND articleId:articleId2) OR ... OR ((apples OR oranges) AND
> articleId:articleIdN)
>
> But in this second case the results are ordered by sentence score
> instead of by article, so the results would have to be re-ordered. Is this
> what "unless you can afford to collect *all* hits and pull out each first
> stage's hit from the intermixed second stage results" refers to?

Yes.

> My actual question after this really long intro is: couldn't this be
> done with the single second-level query approach, but on each topN
> start/rows chunk as the user iterates through the first-level results?
>
> For example, the user executes the query q=apples AND oranges and this
> returns 1000 results, but the first page displays only, say, 20
> results, which means the proposed solution would:
>
> 1. phase: execute q=apples AND oranges with fl=articleId on the
> article core, but with start=0&rows=20
> 2. phase: q=((apples OR oranges) AND articleId:articleId1) OR ((apples
> OR oranges) AND articleId:articleId2) OR ... OR ((apples OR oranges) AND
> articleId:articleId20)
> 3. Reorder sentence results to match the order defined by article matching
> scores and return them to the user
>
> Only, the results here would need to be collapsed on unique articleID,
> so that only 20 results are provided in the result set (because multiple
> sentence-based docs can be returned for a single unique articleID).
>
> Would this work?

I think so, but I don't have any experience using collapsing, so I can't say 
for sure.

BTW, Otis' rearrangement of your phase #2 would also work, and would be
theoretically faster to evaluate:
q=+(apples oranges) +articleId:(articleId1 ... articleId20)
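
For illustration, the request parameters for both phases could be URL-encoded
with Python's standard library as below. The parameter names q, fl, start and
rows are the ones used in this thread; the function names are made up, and the
bare-term form +(apples oranges) relies on the default operator being OR,
which is an assumption about the core's configuration.

```python
from urllib.parse import urlencode

def phase_one_params(query, start=0, rows=20):
    # Article core: fetch only the IDs of one page of matching articles.
    return urlencode({"q": query, "fl": "articleId",
                      "start": start, "rows": rows})

def phase_two_params(terms, article_ids, rows):
    # Sentence core: required term group plus required article ID group.
    q = "+({}) +articleId:({})".format(" ".join(terms), " ".join(article_ids))
    return urlencode({"q": q, "rows": rows})

print(phase_one_params("apples AND oranges"))
print(phase_two_params(["apples", "oranges"], ["id1", "id2"], rows=100))
```

Encoding matters here because a literal + in a URL query string is decoded as
a space, so the + operators have to be sent percent-encoded.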

Steve