Re: Query generation is different for search terms with and without "-"

2020-11-25 Thread Walter Underwood
Ages ago at Netflix, I fixed this with a few hundred synonyms. If you are
working with a fixed vocabulary (movie titles, product names), that can
work just fine.

babysitter, baby-sitter, baby sitter
fullmetal, full-metal, full metal
manhunter, man-hunter, man hunter
spiderman, spider-man, spider man
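
A minimal sketch of how synonym lines like the ones above could be generated, assuming the vocabulary is available as pairs of word parts. The function name `hyphen_variants` and the `compounds` list are hypothetical, not something from the original setup.

```python
def hyphen_variants(first: str, second: str) -> str:
    """Return one synonyms line: joined, hyphenated, and spaced forms."""
    first, second = first.lower(), second.lower()
    return ", ".join([first + second, f"{first}-{second}", f"{first} {second}"])


# Hypothetical input: each compound split at the position of the hyphen.
compounds = [("baby", "sitter"), ("full", "metal"), ("man", "hunter"), ("spider", "man")]

synonym_lines = [hyphen_variants(a, b) for a, b in compounds]
print("\n".join(synonym_lines))
```

Each printed line lists the three spellings as equivalent terms, in the same comma-separated shape as the examples above.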

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


Solr 8.4.1, NOT NULL query not working on plong & pint type fields (fieldname:* )

2020-11-25 Thread Deepu
Dear Team,

We are in the process of migrating from Solr 5 to Solr 8. During testing we
found that "not null" queries (fieldname:*) on plong and pint field types
return no results, although they work fine on Solr 5.4.

Could you please let me know if you have any suggestions on this issue?

Thanks
Deepu


Re: Query generation is different for search terms with and without "-"

2020-11-25 Thread Erick Erickson
Parameters, no. You could use a PatternReplaceCharFilterFactory. NOTE:

*FilterFactory are _not_ what you want in this case, they are applied to 
individual tokens after parsing

*CharFilterFactory are invoked on the entire input to the field, although I 
can’t say for certain that even that’s early enough.
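
As a rough illustration of what a pattern-replace step on the entire input could do, here is a plain Python sketch. It is not Solr configuration, the regular expression is an assumption chosen for the "high-tech" case, and it says nothing about whether such a step runs early enough, which is the open question above.

```python
import re

# Assumed pattern: a hyphen sitting directly between two word characters.
INTRA_WORD_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def replace_intra_word_hyphens(text: str, replacement: str = " ") -> str:
    """Rewrite the whole input string before any tokenization happens."""
    return INTRA_WORD_HYPHEN.sub(replacement, text)


print(replace_intra_word_hyphens("high-tech"))            # high tech
print(replace_intra_word_hyphens("-excluded high-tech"))  # -excluded high tech
```

A leading "-" is left alone because nothing precedes it, so the exclusion operator is not touched by this pattern.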

There are two other options to consider:
StatelessScriptUpdateProcessor
FieldMutatingUpdateProcessor

Stateless... is probably easiest…

Best,
Erick

> On Nov 24, 2020, at 1:44 PM, Samuel Gutierrez wrote:
> 
> Are there any good workarounds/parameters we can use to fix this so it
> doesn't have to be solved client side?
> 
> On Tue, Nov 24, 2020 at 7:50 AM matthew sporleder wrote:
> 
>> Is the normal/standard solution here to regex remove the '-'s and
>> combine them into a single token?
>> 
>> On Tue, Nov 24, 2020 at 8:00 AM Erick Erickson wrote:
>>> 
>>> This is a common point of confusion. There are two phases for creating a
>>> query, query _parsing_ first, then the analysis chain for the parsed result.
>>> 
>>> So what e-dismax sees in the two cases is:
>>> 
>>> Name_enUS:“high tech” -> two tokens, since there are two of them pf2
>>> comes into play.
>>> 
>>> Name_enUS:“high-tech” -> there’s only one token so pf2 doesn’t apply,
>>> splitting it on the hyphen comes later.
>>> 
>>> It’s especially confusing since the field analysis then breaks up
>>> “high-tech” into two tokens that look the same as “high tech” in the
>>> debug response, just without the phrase query.
>>> 
>>> Name_enUS:high
>>> Name_enUS:tech
>>> 
>>> Best,
>>> Erick
>>> 
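
The two-phase behavior described in the quoted explanation can be sketched with a toy model. This is not the edismax implementation: the function name, the whitespace split, and the `\w+` tokenizer are simplifying assumptions, and only the pf2 word-pair phrases are modeled.

```python
import re


def parse_then_analyze(user_query: str) -> dict:
    """Toy model: whitespace-level parsing first, field analysis second."""
    # Phase 1 (parsing): the parser only sees whitespace-separated chunks,
    # and word-pair phrases (pf2) are built from adjacent chunks.
    clauses = user_query.split()
    pf2_phrases = [f"{a} {b}" for a, b in zip(clauses, clauses[1:])]

    # Phase 2 (analysis): each chunk is tokenized afterwards, which is
    # where a hyphenated chunk finally gets split into two terms.
    terms = [tok.lower() for clause in clauses for tok in re.findall(r"\w+", clause)]

    return {"clauses": clauses, "terms": terms, "pf2_phrases": pf2_phrases}


print(parse_then_analyze("high tech"))
print(parse_then_analyze("high-tech"))
```

Both inputs end up with the terms `high` and `tech`, but only "high tech" yields a word-pair phrase, because "high-tech" is a single chunk when the pairs are formed.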
>>>> On Nov 23, 2020, at 8:32 PM, Samuel Gutierrez <samuel.gutier...@iherb.com.INVALID> wrote:
>>>> 
>>>> I am troubleshooting an issue with ranking for search terms that contain a
>>>> "-" vs the same query that does not contain the dash e.g. "high-tech" vs
>>>> "high tech". The field that I am querying is using the standard tokenizer,
>>>> so I would expect that the underlying lucene query should be the same for
>>>> both versions of the query, however when printing the debug, it appears
>>>> they are generated differently. I know "-" must be escaped as it has
>>>> special meaning in lucene, however escaping does not fix the problem. It
>>>> appears that with the "-" present, the pf2 edismax parameter is not
>>>> respected and omitted from the final query. We use sow=false as we have
>>>> multiterm synonyms and need to ensure they are included in the final lucene
>>>> query. My expectation is that the final underlying lucene query should be
>>>> based on the output of the field analyzer, however after briefly looking
>>>> at the code for ExtendedDismaxQParser, it appears that there is some string
>>>> processing happening outside of the analysis step which causes the
>>>> unexpected lucene query.
>>>> 
>>>> 
>>>> Solr Debug for "high tech":
>>>> 
>>>> parsedquery: "+(DisjunctionMaxQuery((Name_enUS:high)~0.4)
>>>> DisjunctionMaxQuery((Name_enUS:tech)~0.4))~2
>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~5)~0.4)
>>>> DisjunctionMaxQuery((Name_enUS:"high tech"~4)~0.4)",
>>>> parsedquery_toString: "+(((Name_enUS:high)~0.4
>>>> (Name_enUS:tech)~0.4)~2) (Name_enUS:"high tech"~5)~0.4
>>>> (Name_enUS:"high tech"~4)~0.4",
>>>> 
>>>> 
>>>> Solr Debug for "high-tech"
>>>> 
>>>> parsedquery: "+DisjunctionMaxQuery((((Name_enUS:high
>>>> Name_enUS:tech)~2))~0.4) DisjunctionMaxQuery((Name_enUS:"high
>>>> tech"~5)~0.4)",
>>>> parsedquery_toString: "+(((Name_enUS:high Name_enUS:tech)~2))~0.4
>>>> (Name_enUS:"high tech"~5)~0.4"
 
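>>>> SolrConfig:
>>>> 
>>>> true
>>>> true
>>>> json
>>>> 375%
>>>> Name_enUS
>>>> Name_enUS
>>>> 5
>>>> Name_enUS
>>>> 4
>>>> 3
>>>> 0.4
>>>> explicit
>>>> 100
>>>> false
>>>> edismax
>>>> 
>>>> Schema:
>>>> 
>>>> positionIncrementGap="100">
>>>> 
>>>> Using Solr 8.6.3
>>>> 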
>> 
> 
> -- 
> *The information contained in this message is the sole and exclusive 
> property of ***iHerb Inc.*** and may be privileged and confidential. It may 
> not be disseminated or distributed to persons or entities other than the 
> ones intended without the written authority of ***iHerb Inc.** *If you have 
> received this e-mail in error or are not the intended recipient, you may 
> not use, copy, disseminate or distribute it. Do not open any attachments. 
> Please delete it immediately from your system and notify the sender 
> promptly by e-mail that you have done so.*



Re: Atomic update wrongly deletes child documents

2020-11-25 Thread Andreas Hubold
Thank you, I've created https://issues.apache.org/jira/browse/SOLR-15018 now.


Regards,
Andreas

Erick Erickson wrote on 24.11.20 13:29:

Sure, raise a JIRA. Thanks for the update...


On Nov 24, 2020, at 4:12 AM, Andreas Hubold wrote:

Hi,

I was able to work around the issue. I'm now using a custom
UpdateRequestProcessor that removes undefined fields, so that I was able to
remove the catch-all dynamic field "ignored" from my schema. Of course, one
has to be careful to not remove fields that are used for nested documents in
the URP.
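
The idea behind such a processor can be sketched in plain Python. This is only a model of the logic, not the actual UpdateRequestProcessor: the function name, the dict-based document, and the explicit `child_fields` set are all assumptions made for illustration.

```python
def strip_undefined_fields(doc: dict, known_fields: set, child_fields: set) -> dict:
    """Drop fields the schema does not define, but keep nested-document fields.

    known_fields: field names assumed to be defined in the schema.
    child_fields: field names assumed to carry nested (child) documents.
    """
    return {
        name: value
        for name, value in doc.items()
        if name in known_fields or name in child_fields
    }


doc = {
    "id": "1",
    "title": "parent",
    "legacy_flag": "x",                           # not in the schema: removed
    "chapters": [{"id": "1.1", "title": "one"}],  # nested documents: kept
}
cleaned = strip_undefined_fields(doc, known_fields={"id", "title"}, child_fields={"chapters"})
print(cleaned)
```

Listing the child fields explicitly is one way to avoid removing them by accident. The sketch does not descend into the child documents themselves.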

I think it would still make sense to fix the original issue, or at least
document it as a caveat. I'm going to create a JIRA ticket for this soon, if
that's okay.

Regards,
Andreas








Re: security.json help

2020-11-25 Thread Jason Gerlowski
Hi Mark,

It looks like you're using the "path" wildcard as it's intended, but
some bug is causing the behavior you're seeing.  It should be working
as you expected, but evidently it's not.

One potential workaround might be to leave out the "path" property
entirely in your "custom-example" permission.  When I do that (on Solr
8.6.2), I get the behavior shown in the following pastebin, which looks
close to what you're after: https://paste.apache.org/ygndt
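
For illustration, here is a small Python sketch that produces the "custom-example" permission with the "path" property left out. The helper name `without_path` is hypothetical, and the rest of security.json is assumed to stay as in the original message.

```python
import json

# The original "custom-example" permission, copied from the message below.
custom_example = {
    "name": "custom-example",
    "collection": "example",
    "path": "*",
    "role": ["admin", "example"],
}


def without_path(permission: dict) -> dict:
    """Return a copy of the permission with the "path" property left out."""
    return {key: value for key, value in permission.items() if key != "path"}


workaround = without_path(custom_example)
print(json.dumps(workaround, indent=2))
```

The printed object keeps "name", "collection", and "role" unchanged, so it can replace the first entry of the "permissions" list.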

Hope that helps!

Jason

On Mon, Oct 19, 2020 at 3:49 PM Mark Dadisman wrote:
>
> Hey, I'm new to configuring Solr. I'm trying to configure Solr with Rule 
> Based Authorization. 
> https://lucene.apache.org/solr/guide/8_6/rule-based-authorization-plugin.html
>
> I have permissions working if I allow everything with "all", but I want to 
> limit access so that a site can only access its own collection, in addition 
> to a server ping path, so I'm trying to add the collection-specific 
> permission at the top:
>
> "permissions": [
>   {
>     "name": "custom-example",
>     "collection": "example",
>     "path": "*",
>     "role": [
>       "admin",
>       "example"
>     ]
>   },
>   {
>     "name": "custom-collection",
>     "collection": "*",
>     "path": [
>       "/admin/luke",
>       "/admin/mbeans",
>       "/admin/system"
>     ],
>     "role": "*"
>   },
>   {
>     "name": "custom-ping",
>     "collection": null,
>     "path": [
>       "/admin/info/system"
>     ],
>     "role": "*"
>   },
>   {
>     "name": "all",
>     "role": "admin"
>   }
> ]
>
> The rule "custom-ping" works, and "all" works. But when the above permissions 
> are used, access is denied to the "example" user-role for collection 
> "example" at the path "/solr/example/select". If I specify paths explicitly, 
> the permissions work, but I can't get permissions to work with path wildcards 
> for a specific collection.
>
> I also had to declare "custom-collection" with the specific paths needed to 
> get collection info in order for those paths to work. I would've expected 
> that these paths would be included in the collection-specific paths and be 
> covered by the first rule, but they aren't. For example, the call to 
> "/solr/example/admin/luke" will fail if the path is removed from this rule.
>
> I don't really want to specify every single path I might need to use. Am I 
> using the path wildcard wrong somehow? Is there a better way to do 
> collection-specific authorizations for a collection "example"?
>
> Thanks.
> - M
>


CDCR

2020-11-25 Thread Gell-Holleron, Daniel
Hello,

Does anybody have advice on why CDCR would say it's forwarding updates (with no
errors) even though the Solr servers it's replicating to aren't updating?

We have just under 50 million documents spread across 4 servers, with one
node per server.

One side is updating happily, so I would think that sharding isn't needed at
this point?

We are using Solr version 7.7.1.

Thanks,

Daniel