RE: Post-sort filtering

2013-02-04 Thread Steve Molloy
I understand all that (and I do want to avoid revisiting the same documents; 
this is just a first working version). I also know about ManifoldCF and, more 
generally, about storing security information in the index. But in some cases 
this is not enough. When access to restricted content can lead to huge legal 
issues, companies want to make sure that there is zero latency between a 
permission change and its effect on access to information. So we want to have 
a security net after results are gathered.

And we do want to avoid putting that logic in an external component (which 
would definitely not be the UI anyhow) so that we can reduce the amount of 
information going back and forth on the wire. Anyhow, I guess you won't be 
voting for this one, but I'm still open to all suggestions for improvement. :)

Steve Molloy
Software Architect  |  Information Discovery & Analytics R&D


Re: Post-sort filtering

2013-02-04 Thread Mikhail Khludnev
Steve,
this question pops up from time to time, but the answer is usually no.
This approach is inefficient, and it is usually proposed as a hack or
workaround made in the UI (front-end app).
The current patch ruins facets, and it filters the same top docs again and
again (i.e. you don't exclude the documents seen in step one from the
following steps). Every step costs O(n log p), whereas Lucene supports deep
scrolling, which makes this much more efficient.
AFAIK the common way is to use ManifoldCF to index the security filter inside
Solr.
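
A minimal sketch of the re-filtering concern, in plain Java with no Solr or Lucene classes. It only counts permission checks: one strategy restarts with a larger window each round, the other resumes after the last document already checked (the idea behind deep paging, presumably Lucene's `IndexSearcher.searchAfter`). The class and method names are hypothetical, and the per-step collection cost quoted above as O(n log p) is not modelled.

```java
import java.util.List;
import java.util.function.Predicate;

// Plain-Java illustration only; all names here are hypothetical.
final class RefilterCost {

    /** Restart strategy: each round re-checks the whole, larger top window. */
    static int checksWhenRestarting(List<String> sorted, Predicate<String> mayRead,
                                    int rows, int window) {
        if (window <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        int checks = 0;
        for (int size = window; ; size += window) {
            int limit = Math.min(size, sorted.size());
            int kept = 0;
            // Documents rejected in earlier rounds are checked again here.
            for (int i = 0; i < limit && kept < rows; i++) {
                checks++;
                if (mayRead.test(sorted.get(i))) {
                    kept++;
                }
            }
            if (kept >= rows || limit == sorted.size()) {
                return checks;
            }
        }
    }

    /** Resume strategy: continue after the last document already checked. */
    static int checksWhenResuming(List<String> sorted, Predicate<String> mayRead,
                                  int rows) {
        int checks = 0;
        int kept = 0;
        for (int i = 0; i < sorted.size() && kept < rows; i++) {
            checks++;
            if (mayRead.test(sorted.get(i))) {
                kept++;
            }
        }
        return checks;
    }
}
```

With an expensive external permission check, the gap between the two counts grows with every extra round the restart strategy needs.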



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

<http://www.griddynamics.com>
 


RE: Post-sort filtering

2013-02-04 Thread Steve Molloy
BTW, I've logged SOLR-4397 for this and submitted a first patch (based on the 
4.1 tag, which is what we use). It still needs logic to respect timeAllowed at 
the very least, and I would like a better way of handling missing results than 
going back and restarting by asking for more, but it works for now, so I guess 
it's a start.
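
A minimal sketch of what respecting a time budget could look like in a post-sort filtering loop. This is not the SOLR-4397 patch: `TimeBoundedFilter`, `Result`, and the injected clock are hypothetical names, and the budget is assumed to be in milliseconds, as Solr's `timeAllowed` parameter is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

// Hypothetical sketch; not taken from the SOLR-4397 patch.
final class TimeBoundedFilter {

    static final class Result {
        final List<String> ids = new ArrayList<>();
        boolean partial; // true if the budget ran out before the page was full
    }

    static Result filter(List<String> sortedIds, Predicate<String> mayRead,
                         int rows, long timeAllowedMs, LongSupplier clockMs) {
        Result result = new Result();
        long deadline = clockMs.getAsLong() + timeAllowedMs;
        for (String id : sortedIds) {
            if (result.ids.size() >= rows) {
                break; // page is full
            }
            if (clockMs.getAsLong() >= deadline) {
                result.partial = true; // out of time: return what was verified
                break;
            }
            if (mayRead.test(id)) {
                result.ids.add(id);
            }
        }
        return result;
    }
}
```

Passing the clock in (for example `System::currentTimeMillis`) keeps the loop testable. Only documents that passed the check are returned, so running out of time shortens the page rather than exposing unchecked documents.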

Steve Molloy  steve.mol...@opentext.com
Software Architect  |  Information Discovery & Analytics R&D   
OpenText  
   

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



RE: Post-sort filtering

2013-01-24 Thread Steve Molloy
I was actually looking for an extension point to plug into, which I wasn't able 
to find by looking at the code. And yes, I'm willing to have the counts be off; 
the important thing is that results don't contain the wrong documents. I'd like 
to avoid oversampling and requesting back because of the bandwidth and overall 
resource usage this implies. I'm currently trying out a "PostSortFilter" 
approach that I'll share if it seems interesting enough.

Steve Molloy
Software Architect  |  Information Discovery & Analytics R&D
Website:
www.opentext.com



This email message is confidential, may be privileged, and is intended for the 
exclusive use of the addressee. Any other person is strictly prohibited from 
disclosing or reproducing it. If the addressee cannot be reached or is unknown 
to you, please inform the sender by return email and delete this email message 
and all copies immediately.



Re: Post-sort filtering

2013-01-24 Thread Erick Erickson
This has some problems. First, your facet, group, num hits, etc.
counts will be off for that user. Second, you can't sort without
having all of the documents, so unless you're willing to have your
counts be off, you really have to pay the price of post-filtering
everything.

If you can live with the counts being off, consider just having the
application do a couple of round-trips. Get the docs (oversample, say
just get the IDs for the top 100 docs) _without_ any kind of ACL
filtering. Then send those docs back to the server with the ACL
filtering. If you don't get enough to fill up a response, get the next
page of 100, etc.
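
A minimal sketch of the round-trip loop described above, assuming a hypothetical `SearchBackend` abstraction on the application side. Neither of its methods is a real SolrJ call; they stand in for "get the next batch of sorted IDs without ACL filtering" and "send those IDs back with ACL filtering".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side abstraction; neither method is a real SolrJ call.
interface SearchBackend {
    /** Ids of sorted hits [start, start + rows), with no ACL filtering. */
    List<String> topIds(int start, int rows);

    /** The subset of ids the user may see, in the same order. */
    List<String> allowedOf(List<String> ids);
}

final class OversamplingClient {

    /** Fetches batches of candidate ids until `wanted` allowed ids are found. */
    static List<String> fetchPage(SearchBackend backend, int wanted, int batch) {
        List<String> page = new ArrayList<>();
        int start = 0;
        while (page.size() < wanted) {
            List<String> candidates = backend.topIds(start, batch);
            if (candidates.isEmpty()) {
                break; // ran out of hits before filling the page
            }
            for (String id : backend.allowedOf(candidates)) {
                if (page.size() == wanted) {
                    break;
                }
                page.add(id);
            }
            start += batch; // the next batch begins after the ids already checked
        }
        return page;
    }
}
```

Each pass costs two requests, which is the bandwidth trade-off of this approach; a larger `batch` means fewer passes but more IDs on the wire.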

Finally, the user list is a better place for this kind of question;
this list is for discussing development of the code...

Best
Erick



Post-sort filtering

2013-01-23 Thread Steve Molloy
Hi,

I'm looking for a way to apply filtering that unfortunately implies a high 
cost because it needs to access external resources (for security). I looked at 
(and tried) the PostFilter technique, which offers some advantages, but still 
implies a lot of matches in a lot of cases. What I'd like to be able to do is 
to filter out ids until I have enough to fill the response, then stop filtering 
(and accept everything). The idea is that the total count is not as important; 
the major thing is that results should not contain documents the requester 
should not see. So the post filter almost does the trick, except that it runs 
before sorting, so the first X documents it sees are not the first X documents 
of the sorted result.

Is there a way to filter out documents after they have been scored and sorted?
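
A minimal sketch of the behaviour being asked for, written outside of Solr: ids arrive already in final sort order, an expensive permission check is applied to each, and checking stops as soon as the page is full. `PostSortFilterSketch`, `firstAllowed`, and the `mayRead` predicate are hypothetical names, not Solr or Lucene APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper, not a Solr or Lucene class.
final class PostSortFilterSketch {

    /**
     * Walks ids that are already in final sort order and keeps the first
     * `rows` that pass the permission check. The check is not called
     * again once the page is full.
     */
    static List<String> firstAllowed(List<String> sortedIds,
                                     Predicate<String> mayRead,
                                     int rows) {
        List<String> page = new ArrayList<>();
        for (String id : sortedIds) {
            if (page.size() >= rows) {
                break; // page is full: stop paying for permission checks
            }
            if (mayRead.test(id)) {
                page.add(id);
            }
        }
        return page;
    }
}
```

Documents past the cut-off are never examined, which is why total counts would include documents the requester may not be allowed to see.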

Thanks,
Steve