RE: Post-sort filtering

2013-02-04 Thread Steve Molloy
BTW, I've logged SOLR-4397 for this and submitted a first patch (based on the 
4.1 tag, which is what we use). I need to at least add logic to respect 
timeAllowed, and I would like a better way of handling missing results than 
going back and restarting by asking for more, but it works for now, so I guess 
it's a start.
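
One way a post-sort filter could respect a time budget is sketched below in Python. This is only an illustration and not the SOLR-4397 patch: `filter_with_deadline`, `is_allowed`, `time_allowed_ms` and `clock` are hypothetical names, and the budget only mirrors the idea behind `timeAllowed`, not how Solr implements it. When the budget runs out, the sketch returns what it has collected and flags the result as partial.

```python
import time
from typing import Callable, Iterable, List, Tuple


def filter_with_deadline(
    sorted_ids: Iterable[str],
    is_allowed: Callable[[str], bool],
    rows: int,
    time_allowed_ms: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], bool]:
    """Keep permitted ids, in rank order, until `rows` are found or time runs out.

    Returns (ids, partial); `partial` is True when the budget expired first.
    All names are hypothetical. `is_allowed` stands in for the costly
    external security check.
    """
    deadline = clock() + time_allowed_ms / 1000.0
    kept: List[str] = []
    for doc_id in sorted_ids:
        if len(kept) >= rows:
            break  # response is full, no further checks needed
        if clock() >= deadline:
            return kept, True  # budget spent: return what we have so far
        if is_allowed(doc_id):
            kept.append(doc_id)
    return kept, False
```

The deadline is tested before each security check, so a single slow check can still overrun the budget; a real implementation would also need to bound the check itself.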

Steve Molloy  steve.mol...@opentext.com
Software Architect  |  Information Discovery  Analytics RD   
OpenText  
   
-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org





Re: Post-sort filtering

2013-02-04 Thread Mikhail Khludnev
Steve,
this question pops up from time to time, but the answer is usually no.
This approach is inefficient, and it is usually proposed as a hack or
workaround made in the UI (front-end app).
The current patch ruins facets, and it filters the same top docs again and
again (i.e. you don't exclude the documents from step one in the following
steps). Every step costs O(n log p), whereas Lucene supports deep scrolling,
which makes this much more efficient.
AFAIK the common way is to use Manifold CF to index the security filter
inside of Solr.
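
The point about not re-filtering the same top docs can be sketched as follows, in Python. This is a toy in-memory model under stated assumptions, not Lucene or Solr code: `search_after` is a hypothetical stand-in for a deep-paging call that resumes below the last hit already seen, and `is_allowed` stands in for the external security check. The `checked` counter shows that each document is examined at most once.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Hit = Tuple[float, str]  # (score, doc id); the index is sorted by descending score


def search_after(index: Sequence[Hit], after: Optional[Hit], n: int) -> List[Hit]:
    """Hypothetical deep-paging call: the next n hits ranked below `after`."""
    # A real engine would resume from the sort key instead of scanning the list.
    start = 0 if after is None else index.index(after) + 1
    return list(index[start:start + n])


def collect_allowed(
    index: Sequence[Hit],
    is_allowed: Callable[[str], bool],
    rows: int,
    page_size: int,
) -> Tuple[List[str], int]:
    """Fill `rows` permitted ids, resuming after the last hit seen on each step."""
    kept: List[str] = []
    checked = 0
    after: Optional[Hit] = None
    while len(kept) < rows:
        page = search_after(index, after, page_size)
        if not page:
            break  # no more hits
        for hit in page:
            checked += 1
            if is_allowed(hit[1]):
                kept.append(hit[1])
                if len(kept) == rows:
                    break
        after = page[-1]  # the next step starts below this hit, never above it
    return kept, checked
```

Compared with restarting the search with a larger row count, each step here only pays for the documents it has not yet seen.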





-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


RE: Post-sort filtering

2013-02-04 Thread Steve Molloy
I understand all that (and I do want to avoid revisiting the same documents; 
this is just a first working version). I also know about Manifold CF, or more 
generally about storing security information in the index. But in some cases, 
this is not enough. When access to restricted content can lead to huge legal 
issues, companies want to make sure that there is zero latency between a 
permission change and its effect on access to information. So we want to have a 
security net after results are gathered.

And we do want to avoid putting that logic in an external component (which 
would definitely not be the UI anyhow) so that we can reduce the amount of 
information going back and forth on the wire. Anyhow, I guess you won't be 
casting your vote for that one, but still, I'm open to all suggestions for 
improvement. :)

Steve Molloy
Software Architect  |  Information Discovery  Analytics RD


Re: Post-sort filtering

2013-01-24 Thread Erick Erickson
this has some problems. First, your facet, group, num hits, etc.
counts will be off for that user. Second, you can't sort without
having all of the documents, so unless you're willing to have your
counts be off, you really have to pay the price of post-filtering
everything.

If you can live with the counts being off, consider just having the
application do a couple of round-trips. Get the docs (oversample, say
just get the IDs for the top 100 docs) _without_ any kind of ACL
filtering. Then send those docs back to the server with the ACL
filtering. If you don't get enough to fill up a response, get the next
page of 100, etc.
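
The round-trip approach above can be sketched on the application side as follows, in Python. This is a minimal sketch under stated assumptions: `fetch_ids` and `acl_filter` are hypothetical callables standing in for the two requests (ids only without ACL filtering, then the same ids with ACL filtering), and `acl_filter` is assumed to preserve the order of the ids it is given.

```python
from typing import Callable, List


def fill_page(
    fetch_ids: Callable[[int, int], List[str]],
    acl_filter: Callable[[List[str]], List[str]],
    rows: int,
    batch_size: int = 100,
) -> List[str]:
    """Oversample in batches until `rows` visible ids are found or hits run out."""
    visible: List[str] = []
    start = 0
    while len(visible) < rows:
        batch = fetch_ids(start, batch_size)  # round trip 1: ids only, no ACL filter
        if not batch:
            break  # result set exhausted
        visible.extend(acl_filter(batch))  # round trip 2: same ids, ACL applied
        start += batch_size  # move on to the next page of ids
    return visible[:rows]
```

As the message notes, facet, group and hit counts stay unfiltered with this approach; only the returned documents are guaranteed to pass the ACL check.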

Finally, the users' list is a better place for this kind of question;
this list is for discussing development of the code...

Best
Erick




RE: Post-sort filtering

2013-01-24 Thread Steve Molloy
I was actually looking for an extension point to plug into, which I wasn't able 
to find by looking at the code. And yes, I'm willing to have the counts be off; 
the important thing is that results don't contain the wrong documents. I'd like 
to avoid oversampling and requesting back because of the bandwidth and overall 
resource usage this implies. I'm currently trying out a PostSortFilter 
approach that I'll share if it seems interesting enough.

Steve Molloy
Software Architect  |  Information Discovery  Analytics RD
Website:
www.opentext.com



This email message is confidential, may be privileged, and is intended for the 
exclusive use of the addressee. Any other person is strictly prohibited from 
disclosing or reproducing it. If the addressee cannot be reached or is unknown 
to you, please inform the sender by return email and delete this email message 
and all copies immediately.





Post-sort filtering

2013-01-23 Thread Steve Molloy
Hi,

I'm looking for a way to apply filtering that unfortunately implies a high 
cost because it needs to access external resources (for security). I looked at 
(and tried) the PostFilter technique, which offers some advantages, but still 
implies a lot of matches in a lot of cases. What I'd like to be able to do is to 
filter out ids until I have enough to fill the response, then stop filtering 
(and accept everything). The idea is that the total count is not as important; 
the major thing is that results should not contain documents the requester 
should not see. So, the post filter almost does the trick, except it comes 
before sorting, so the first X documents in the sorted results are not the same 
ones that the post filter sees first.
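
The ordering problem described above can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not Solr's PostFilter API: documents are plain `(id, score)` pairs, `is_allowed` is a hypothetical stand-in for the external security check, and both function names are made up for the comparison.

```python
from typing import Callable, List, Sequence, Tuple

Doc = Tuple[str, float]  # (doc id, score)


def filter_then_sort(docs: Sequence[Doc], is_allowed: Callable[[str], bool], rows: int) -> List[str]:
    """Check docs in collection order, stop checking once `rows` pass, then sort."""
    accepted: List[Doc] = []
    passed = 0
    for doc in docs:
        if passed < rows:
            if not is_allowed(doc[0]):
                continue
            passed += 1
        accepted.append(doc)  # once `rows` have passed, everything is accepted unchecked
    ranked = sorted(accepted, key=lambda d: d[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:rows]]


def sort_then_filter(docs: Sequence[Doc], is_allowed: Callable[[str], bool], rows: int) -> List[str]:
    """Sort first, then check from the top until `rows` permitted docs are found."""
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    kept: List[str] = []
    for doc_id, _ in ranked:
        if len(kept) == rows:
            break
        if is_allowed(doc_id):
            kept.append(doc_id)
    return kept


docs = [("a", 0.2), ("b", 0.9), ("c", 0.4), ("d", 0.8), ("e", 0.7)]
denied = {"b", "d"}
leaky = filter_then_sort(docs, lambda i: i not in denied, 2)  # ["d", "e"]: "d" is forbidden
safe = sort_then_filter(docs, lambda i: i not in denied, 2)   # ["e", "c"]
```

In the first variant the documents that were checked are not the ones that end up at the top after sorting, which is why an early stop is only safe once the results are already in rank order.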

Is there a way to filter out documents after they have been scored and sorted?

Thanks,
Steve