RE: disallowing delete through security.json

2020-11-24 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
Thank you for the response.

The use case I have in mind is trying to approximate incremental updates (as 
are available in Sybase or MSSQL, to which I am more accustomed).

We want to upgrade a large collection from Solr7.4 to Solr8.5. It turns 
out that Solr8.5 cannot run against the current data, because the collection 
was created under Solr6.6. We want to migrate in such a way that, in a year or 
so, we will be able to migrate to Solr9 without worrying about Solr7.4, let 
alone Solr6.6. We want to create a new collection (of the same name) in a brand 
new Solr8.5 SolrCloud, and then select everything from the current Solr7.4 
collection in json format and load it into the new Solr8.5 collection. All of 
the fields have stored="true", with the exception of fields populated by 
copyField. The select will be done by ranges of id values, so as to avoid 
OutOfMemory errors. That process will take several days, and in the meantime 
users will continue to add data. Once all the data has been copied 
(including that which is described below), we can switch port numbers so that 
the new Solr8.5 SolrCloud takes the place of the old Solr7.4 SolrCloud.

The plan is to find a value of _version_ (call it V1) which was in the Solr7.4 
collection when we started the first select, but which is greater than almost 
all values of _version_ in the collection (we are fine with having an overlap 
of _version_ values, but we want to avoid losing anything by having a gap in 
_version_ values). After the initial selects are complete, we can run further 
selects by ranges of id with the additional criterion that the _version_ be 
no lower than the V1 value. As we have seen in test runs, this involves 
less data and runs faster. We will also note a new value of 
_version_ (call it V2) which was in the Solr7.4 collection when we started the 
V1 select, but which is greater than almost all values of _version_ in the V1 
select. We repeat this procedure through as many iterations (V3, V4, ...) as 
it takes. We load the V1 set of selects once the initial set has finished 
loading, then the V2 set once the V1 set has finished loading, and so on. The 
plan is that selecting and loading the last Vn set of selects will 
involve a maintenance window measured in minutes rather than in days.
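
The incremental selects described above could be generated along the following 
lines. This is only a minimal sketch in Python: the helper names, the base URL, 
the id bounds and the rows value are all hypothetical, and it assumes the id 
field supports a range query that partitions the collection sensibly (for a 
string id the range is lexicographic).

```python
from urllib.parse import urlencode


def build_select_params(id_lo, id_hi, min_version=None, rows=1000):
    """Build /select parameters for one id-range slice.

    When min_version is given (V1, V2, ...), only documents whose
    _version_ is no lower than that value are selected.
    """
    params = {
        "q": "*:*",
        "fq": [f"id:[{id_lo} TO {id_hi}]"],  # inclusive id range
        "sort": "id asc",
        "rows": rows,
        "wt": "json",
    }
    if min_version is not None:
        # Inclusive lower bound, so overlapping versions are re-copied
        # rather than lost in a gap.
        params["fq"].append(f"_version_:[{min_version} TO *]")
    return params


def select_url(base_url, params):
    """Render the parameters as a URL; base_url is a hypothetical collection URL."""
    return f"{base_url}/select?{urlencode(params, doseq=True)}"
```

The initial pass would call build_select_params without min_version; each later 
pass (V1, V2, ...) reuses the same id ranges with the noted _version_ value.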

The users claim that they never do deletes, which is good, because a delete 
would be missed by this plan. If (as you describe) the 
users were to update a record so that only the id field (and the _version_ 
field) are left, that update would get picked up by one of these incremental 
selects and would be applied to the new collection. A delete, however, would 
not be noticed, and the new Solr8.5 collection would still have the record 
which had been deleted from the old Solr7.4 collection. Since we have only the 
users' word that they never do deletes, it would seem safer to actually 
disallow deletes during the maintenance.

Let me know if you have any suggestions.

Thank you again for your reply.


-----Original Message-----
From: Jason Gerlowski  
Sent: Tuesday, November 24, 2020 12:35 PM
To: solr-user@lucene.apache.org
Subject: Re: disallowing delete through security.json

Hey Craig,

I think this will be tricky to do with the current Rule-Based
Authorization support.  As you pointed out in your initial post,
there are lots of ways to delete documents.  The Rule-Based Auth code
doesn't inspect request bodies (AFAIK), so it's going to have trouble
telling deletes apart from other traditional "/update" requests: both
use method=POST and are request-body driven.

But to zoom out a bit, does it really make sense to lock down deletes,
but not updates more broadly?  After all, "updates" can remove and add
fields.  Users might submit an update that strips everything but "id"
from your documents.  In many/most usecases that'd be equally
concerning.  Just wondering what your usecase is - if it's generally
applicable this is probably worth a JIRA ticket.

Best,

Jason

On Thu, Nov 19, 2020 at 10:34 AM Oakley, Craig (NIH/NLM/NCBI) [C]
 wrote:
>
> Having not heard back, I thought I would ask again whether anyone else has 
> been able to use security.json to disallow deletes, and/or if anyone has 
> examples of using the "method" section in 
> lucene.apache.org/solr/guide/8_4/rule-based-authorization-plugin.html
>
> -----Original Message-----
> From: Oakley, Craig (NIH/NLM/NCBI) [C] 
> Sent: Monday, October 26, 2020 6:23 PM
> To: solr-user@lucene.apache.org
> Subject: disallowing delete through security.json
>
> I am interested in disallowing delete through security.json
>
> After seeing the "method" section in 
> lucene.apache.org/solr/guide/8_4/rule-based-authorization-plugin.html my 
> first attempt was as follows:
>
> {"set-permission":{
> "name":"NO_delete",
> "path":["/update/*","/update"],
> "collection":col_name,
> "role":"NoSuchR

Re: Query generation is different for search terms with and without "-"

2020-11-24 Thread Samuel Gutierrez
Are there any good workarounds/parameters we can use to fix this so it
doesn't have to be solved client side?

On Tue, Nov 24, 2020 at 7:50 AM matthew sporleder 
wrote:

> Is the normal/standard solution here to regex remove the '-'s and
> combine them into a single token?
>
> On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson 
> wrote:
> >
> > This is a common point of confusion. There are two phases for creating a
> > query,
> > query _parsing_ first, then the analysis chain for the parsed result.
> >
> > So what e-dismax sees in the two cases is:
> >
> > Name_enUS:“high tech” -> two tokens, since there are two of them pf2
> > comes into play.
> >
> > Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply,
> > splitting it on the hyphen comes later.
> >
> > It’s especially confusing since the field analysis then breaks up
> > “high-tech” into two tokens that
> > look the same as “high tech” in the debug response, just without the
> > phrase query.
> >
> > Name_enUS:high
> > Name_enUS:tech
> >
> > Best,
> > Erick
> >
> > > On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez
> > > <samuel.gutier...@iherb.com.INVALID> wrote:
> > >
> > > I am troubleshooting an issue with ranking for search terms that contain a
> > > "-" vs the same query that does not contain the dash e.g. "high-tech" vs
> > > "high tech". The field that I am querying is using the standard tokenizer,
> > > so I would expect that the underlying lucene query should be the same for
> > > both versions of the query, however when printing the debug, it appears
> > > they are generated differently. I know "-" must be escaped as it has
> > > special meaning in lucene, however escaping does not fix the problem. It
> > > appears that with the "-" present, the pf2 edismax parameter is not
> > > respected and omitted from the final query. We use sow=false as we have
> > > multiterm synonyms and need to ensure they are included in the final lucene
> > > query. My expectation is that the final underlying lucene query should be
> > > based on the output  of the field analyzer, however after briefly looking
> > > at the code for ExtendedDismaxQParser, it appears that there is some string
> > > processing happening outside of the analysis step which causes the
> > > unexpected lucene query.
> > >
> > >
> > > Solr Debug for "high tech":
> > >
> > > parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
> > > DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
> > > DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
> > > DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
> > > parsedquery_toString: "+(((Name_enUS:high)~0.4
> > > (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
> > > (Name_enUS:"high tech"~4)~0.4",
> > >
> > >
> > > Solr Debug for "high-tech"
> > >
> > > parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
> > > Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
> > > tech"~5)~0.4)",
> > > parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
> > > (Name_enUS:"high tech"~5)~0.4"
> > >
> > > SolrConfig:
> > >
> > >  
> > >
> > >  true
> > >  true
> > >  json
> > >  3<75%
> > >  Name_enUS
> > >  Name_enUS
> > >  5
> > >  Name_enUS
> > >  4   
> > >  3
> > >  0.4
> > >  explicit
> > >  100
> > >  false
> > >
> > >
> > >  edismax
> > >
> > >  
> > >
> > > Schema:
> > >
> > >   positionIncrementGap="100">
> > >  
> > >
> > >
> > >
> > >
> > >  
> > >  
> > >
> > >
> > > Using Solr 8.6.3
> > >
>

-- 
*The information contained in this message is the sole and exclusive 
property of ***iHerb Inc.*** and may be privileged and confidential. It may 
not be disseminated or distributed to persons or entities other than the 
ones intended without the written authority of ***iHerb Inc.** *If you have 
received this e-mail in error or are not the intended recipient, you may 
not use, copy, disseminate or distribute it. Do not open any attachments. 
Please delete it immediately from your system and notify the sender 
promptly by e-mail that you have done so.*


Re: disallowing delete through security.json

2020-11-24 Thread Jason Gerlowski
Hey Craig,

I think this will be tricky to do with the current Rule-Based
Authorization support.  As you pointed out in your initial post,
there are lots of ways to delete documents.  The Rule-Based Auth code
doesn't inspect request bodies (AFAIK), so it's going to have trouble
telling deletes apart from other traditional "/update" requests: both
use method=POST and are request-body driven.
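
To illustrate the point, here is a minimal sketch in Python (not Solr code) of 
why a rule keyed on path and HTTP method cannot separate the two cases. The 
helper names are hypothetical, and the request dictionaries are only a model of 
what such a rule gets to look at; the JSON bodies follow Solr's JSON update 
syntax.

```python
import json


def add_request(doc):
    """Model of an 'add document' call: POST /update with a JSON array of docs."""
    return {"method": "POST", "path": "/update", "body": json.dumps([doc])}


def delete_request(query):
    """Model of a delete-by-query call: also POST /update, only the body differs."""
    body = json.dumps({"delete": {"query": query}})
    return {"method": "POST", "path": "/update", "body": body}


def rule_view(request):
    """What a path/method-based rule can match on (assumption: no body inspection)."""
    return (request["method"], request["path"])
```

Under that assumption, rule_view(add_request(...)) and 
rule_view(delete_request(...)) are identical, so any permission that blocks one 
blocks the other.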

But to zoom out a bit, does it really make sense to lock down deletes,
but not updates more broadly?  After all, "updates" can remove and add
fields.  Users might submit an update that strips everything but "id"
from your documents.  In many/most usecases that'd be equally
concerning.  Just wondering what your usecase is - if it's generally
applicable this is probably worth a JIRA ticket.

Best,

Jason

On Thu, Nov 19, 2020 at 10:34 AM Oakley, Craig (NIH/NLM/NCBI) [C]
 wrote:
>
> Having not heard back, I thought I would ask again whether anyone else has 
> been able to use security.json to disallow deletes, and/or if anyone has 
> examples of using the "method" section in 
> lucene.apache.org/solr/guide/8_4/rule-based-authorization-plugin.html
>
> -----Original Message-----
> From: Oakley, Craig (NIH/NLM/NCBI) [C] 
> Sent: Monday, October 26, 2020 6:23 PM
> To: solr-user@lucene.apache.org
> Subject: disallowing delete through security.json
>
> I am interested in disallowing delete through security.json
>
> After seeing the "method" section in 
> lucene.apache.org/solr/guide/8_4/rule-based-authorization-plugin.html my 
> first attempt was as follows:
>
> {"set-permission":{
> "name":"NO_delete",
> "path":["/update/*","/update"],
> "collection":col_name,
> "role":"NoSuchRole",
> "method":"DELETE",
> "before":4}}
>
> I found, however, that this did not disallow deletes: I could still run
> curl -u ... "http://.../solr/col_name/update?commit=true" --data 
> "id:11"
>
> After further experimentation, I seemed to have success with
> {"set-permission":
> {"name":"NO_delete6",
> "path":"/update/*",
> "collection":"col_name",
> "role":"NoSuchRole",
> "method":["REGEX:(?i)DELETE"],
> "before":4}}
>
> My initial impression was that this did what I wanted; but now I find that 
> this disallows *any* updates to this collection (which had previously been 
> allowed). Other attempts to tweak this strategy, such as granting permissions 
> for "/update/*" for methods other than DELETE to a role which is granted to 
> the desired user, have not yet been successful.
>
> Does anyone have an example of security.json disallowing a delete while still 
> allowing an update?
>
> Thanks


Re: Query generation is different for search terms with and without "-"

2020-11-24 Thread matthew sporleder
Is the normal/standard solution here to regex remove the '-'s and
combine them into a single token?
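
For reference, a client-side normalisation of this kind could look like the 
sketch below (Python; the function names are hypothetical). It shows two 
variants: replacing an intra-word hyphen with a space, so the query parser sees 
two terms, and removing it, so the words are joined into one token. Which one is 
appropriate depends on how the field is indexed; neither is claimed to be the 
standard solution.

```python
import re

# A hyphen with a word character on both sides, e.g. the one in "high-tech".
# A leading "-" (the Lucene NOT operator) is deliberately left alone.
_INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphens(user_query):
    """Turn "high-tech" into "high tech" before sending the query to Solr."""
    return _INTRA_WORD_HYPHEN.sub(" ", user_query)


def join_hyphens(user_query):
    """Turn "high-tech" into "hightech" before sending the query to Solr."""
    return _INTRA_WORD_HYPHEN.sub("", user_query)
```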

On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson  wrote:
>
> This is a common point of confusion. There are two phases for creating a 
> query,
> query _parsing_ first, then the analysis chain for the parsed result.
>
> So what e-dismax sees in the two cases is:
>
> Name_enUS:“high tech” -> two tokens, since there are two of them pf2 comes 
> into play.
>
> Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply, 
> splitting it on the hyphen comes later.
>
> It’s especially confusing since the field analysis then breaks up “high-tech” 
> into two tokens that
> look the same as “high tech” in the debug response, just without the phrase 
> query.
>
> Name_enUS:high
> Name_enUS:tech
>
> Best,
> Erick
>
> > On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez 
> >  wrote:
> >
> > I am troubleshooting an issue with ranking for search terms that contain a
> > "-" vs the same query that does not contain the dash e.g. "high-tech" vs
> > "high tech". The field that I am querying is using the standard tokenizer,
> > so I would expect that the underlying lucene query should be the same for
> > both versions of the query, however when printing the debug, it appears
> > they are generated differently. I know "-" must be escaped as it has
> > special meaning in lucene, however escaping does not fix the problem. It
> > appears that with the "-" present, the pf2 edismax parameter is not
> > respected and omitted from the final query. We use sow=false as we have
> > multiterm synonyms and need to ensure they are included in the final lucene
> > query. My expectation is that the final underlying lucene query should be
> > based on the output  of the field analyzer, however after briefly looking
> > at the code for ExtendedDismaxQParser, it appears that there is some string
> > processing happening outside of the analysis step which causes the
> > unexpected lucene query.
> >
> >
> > Solr Debug for "high tech":
> >
> > parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
> > DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
> > DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
> > DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
> > parsedquery_toString: "+(((Name_enUS:high)~0.4
> > (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
> > (Name_enUS:"high tech"~4)~0.4",
> >
> >
> > Solr Debug for "high-tech"
> >
> > parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
> > Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
> > tech"~5)~0.4)",
> > parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
> > (Name_enUS:"high tech"~5)~0.4"
> >
> > SolrConfig:
> >
> >  
> >
> >  true
> >  true
> >  json
> >  3<75%
> >  Name_enUS
> >  Name_enUS
> >  5
> >  Name_enUS
> >  4   
> >  3
> >  0.4
> >  explicit
> >  100
> >  false
> >
> >
> >  edismax
> >
> >  
> >
> > Schema:
> >
> >   > positionIncrementGap="100">
> >  
> >
> >
> >
> >
> >  
> >  
> >
> >
> > Using Solr 8.6.3
> >


Re: Query generation is different for search terms with and without "-"

2020-11-24 Thread Erick Erickson
This is a common point of confusion. There are two phases for creating a query:
query _parsing_ first, then the analysis chain for the parsed result.

So what e-dismax sees in the two cases is:

Name_enUS:“high tech” -> two tokens; since there are two of them, pf2 comes into 
play.

Name_enUS:“high-tech” -> there’s only one token, so pf2 doesn’t apply; splitting 
it on the hyphen comes later.

It’s especially confusing since the field analysis then breaks up “high-tech” 
into two tokens that look the same as “high tech” in the debug response, just 
without the phrase query.

Name_enUS:high
Name_enUS:tech
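
The two phases can be mimicked with a toy model. The sketch below is Python, not 
the actual ExtendedDismaxQParser logic: it assumes the parser splits the user 
input on whitespace only and uses a crude regex as a stand-in for the field's 
tokenizer. All function names are hypothetical.

```python
import re


def parser_terms(user_query):
    """Phase 1 (simplified): the query parser sees whitespace-separated terms."""
    return user_query.split()


def analyze(term):
    """Phase 2 (simplified): stand-in tokenizer that splits on non-word characters."""
    return [tok.lower() for tok in re.split(r"\W+", term) if tok]


def pf2_applies(user_query):
    """Word-pair phrases need at least two parser-level terms."""
    return len(parser_terms(user_query)) >= 2


def analyzed_tokens(user_query):
    """Tokens after both phases, as they would appear in the debug output."""
    return [tok for term in parser_terms(user_query) for tok in analyze(term)]
```

In this model analyzed_tokens gives the same result for "high tech" and 
"high-tech", while pf2_applies differs, which is the effect seen in the debug 
output.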

Best,
Erick

> On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez 
>  wrote:
> 
> I am troubleshooting an issue with ranking for search terms that contain a
> "-" vs the same query that does not contain the dash e.g. "high-tech" vs
> "high tech". The field that I am querying is using the standard tokenizer,
> so I would expect that the underlying lucene query should be the same for
> both versions of the query, however when printing the debug, it appears
> they are generated differently. I know "-" must be escaped as it has
> special meaning in lucene, however escaping does not fix the problem. It
> appears that with the "-" present, the pf2 edismax parameter is not
> respected and omitted from the final query. We use sow=false as we have
> multiterm synonyms and need to ensure they are included in the final lucene
> query. My expectation is that the final underlying lucene query should be
> based on the output  of the field analyzer, however after briefly looking
> at the code for ExtendedDismaxQParser, it appears that there is some string
> processing happening outside of the analysis step which causes the
> unexpected lucene query.
> 
> 
> Solr Debug for "high tech":
> 
> parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
> DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
> DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
> DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
> parsedquery_toString: "+(((Name_enUS:high)~0.4
> (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
> (Name_enUS:"high tech"~4)~0.4",
> 
> 
> Solr Debug for "high-tech"
> 
> parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
> Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
> tech"~5)~0.4)",
> parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
> (Name_enUS:"high tech"~5)~0.4"
> 
> SolrConfig:
> 
>  
>
>  true
>  true
>  json
>  3<75%
>  Name_enUS
>  Name_enUS
>  5
>  Name_enUS
>  4   
>  3
>  0.4
>  explicit
>  100
>  false
>
>
>  edismax
>
>  
> 
> Schema:
> 
>  
>  
>
>
>
>
>  
>  
> 
> 
> Using Solr 8.6.3
> 



Re: Atomic update wrongly deletes child documents

2020-11-24 Thread Erick Erickson
Sure, raise a JIRA. Thanks for the update...

> On Nov 24, 2020, at 4:12 AM, Andreas Hubold  
> wrote:
> 
> Hi,
> 
> I was able to work around the issue. I'm now using a custom
> UpdateRequestProcessor that removes undefined fields, which allowed me to
> remove the catch-all dynamic field "ignored" from my schema. Of course, one
> has to be careful not to remove fields that are used for nested documents in
> the URP.
> 
> I think it would still make sense to fix the original issue, or at least
> document it as a caveat. I'm going to create a JIRA ticket for this soon, if
> that's okay.
> 
> Regards,
> Andreas
> 
> 
> 



RE: Use stream result like a query (alternative to innerJoin)

2020-11-24 Thread ufuk yılmaz
Fetch would work for my specific case (since I’m working with ids, there’s no 
one-to-many relationship), if I were able to restrict fetch’s target domain with 
a query. I would first get all possible deleted ids, then use fetch against the 
items collection. But the current fetch implementation would then find all 
deleted items, not something like “deleted items with these names” or “deleted 
items within this time range”.

I came upon your video while researching this stuff: 
https://www.youtube.com/watch?v=kTNe3TaqFvo

I’m trying to use the “let” expression to feed one stream’s result to another 
as a query, using the string concat function and the eval stream. So far I 
haven’t been able to write a working example, but it’s an idea that I’m playing 
with.



From: Joel Bernstein
Sent: 23 November 2020 23:23
To: solr-user@lucene.apache.org
Subject: Re: Use stream result like a query (alternative to innerJoin)

H



Re: Atomic update wrongly deletes child documents

2020-11-24 Thread Andreas Hubold
Hi,

I was able to work around the issue. I'm now using a custom
UpdateRequestProcessor that removes undefined fields, which allowed me to
remove the catch-all dynamic field "ignored" from my schema. Of course, one
has to be careful not to remove fields that are used for nested documents in
the URP.
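
The filtering idea can be sketched as follows. This is Python rather than an 
actual Java UpdateRequestProcessor, so it only shows the logic; the function 
name and the child_fields parameter are hypothetical, dynamic fields are ignored 
for simplicity, and the set of reserved names is an assumption about which 
fields nested documents rely on.

```python
# Fields that nested documents rely on (assumed list; adjust to the schema).
NESTED_DOC_FIELDS = {"_root_", "_nest_path_", "_nest_parent_", "_childDocuments_"}


def strip_undefined_fields(doc, schema_fields, child_fields=frozenset()):
    """Return a copy of doc without fields that the schema does not define.

    schema_fields: names of the explicitly defined schema fields.
    child_fields:  names under which child documents are attached; these
                   must be kept even though they are not schema fields.
    """
    keep = set(schema_fields) | NESTED_DOC_FIELDS | set(child_fields)
    return {name: value for name, value in doc.items() if name in keep}
```

For example, with schema_fields={"id", "title"} and child_fields={"comments"}, 
an unknown field such as "junk" is dropped while the "comments" children 
survive.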

I think it would still make sense to fix the original issue, or at least
document it as a caveat. I'm going to create a JIRA ticket for this soon, if
that's okay.

Regards,
Andreas


