[jira] [Commented] (SOLR-16810) Under certain situations Solr produces managed schema XML that cannot be loaded

Shawn Heisey (Jira) Mon, 22 May 2023 14:07:26 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725123#comment-17725123
 ]


Shawn Heisey commented on SOLR-16810:
-------------------------------------

The info I pasted doesn't explicitly state it, but what it boils down to is 
ASCII letters, numbers, and the underscore, which are all characters with an 
ASCII code below 127 (0x7F).  So it would exclude characters like accented 
letters, ideographs, etc., as well as other punctuation, spaces, control 
characters, etc.  I am aware that this is very much a USA-centric restriction 
... no offense to other languages is intended, but supporting every character 
outside that specific range tends to be challenging from a programming 
perspective.
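
As a rough illustration of that restriction, here is a minimal sketch of a 
check that accepts only ASCII letters, digits, and the underscore.  The helper 
name is_safe_field_name is hypothetical; it is not part of Solr, and Solr 
itself may apply different or additional rules.

```python
import re

# Hypothetical helper, not a Solr API: accepts only ASCII letters,
# digits, and the underscore, as described in the comment above.
_SAFE_FIELD_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_safe_field_name(name: str) -> bool:
    """Return True if every character is an ASCII letter, digit, or '_'."""
    return _SAFE_FIELD_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["title_s", "caf\u00e9", "a b", "a\x14b"]:
        print(repr(candidate), is_safe_field_name(candidate))
```

Names with accented letters, spaces, or control characters fail this check, 
which matches the "not treated as a bug" category rather than a hard error.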

In truth, many printable characters outside that range do actually work; what 
that warning says is that we won't treat problems with such characters as bugs.

Control characters definitely fall into the "not supported at all" category.  I 
don't think it matters whether the problem is in Solr or XML ... those 
characters are not supported, and our stance is that they never have been 
supported.

That doesn't mean we're going to turn away a fix ... as long as it doesn't 
break anything else and cleanly fixes the problem, it's worth including.

> Under certain situations Solr produces managed schema XML that cannot be 
> loaded
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-16810
>                 URL: https://issues.apache.org/jira/browse/SOLR-16810
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Schema and Analysis
>    Affects Versions: 9.2.1
>            Reporter: Thiruvalluvan M. G.
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> While persisting the {{ManagedIndexSchema}} as XML, non-printable characters 
> in field names get escaped as {{#nn;}}, where {{nn}} is the decimal 
> representation of the non-printable character. For example, if the field name 
> has the byte {{0x14}}, it gets escaped as {{#20;}}. This is 
> indistinguishable from the literal {{#20;}} in the field name. If we have two 
> fields - one with the non-printable character and the other with the literal 
> string, two fields get generated with the same name. Loading the resulting 
> XML, naturally, causes an exception. To fix this, any occurrence of literal 
> {{#}} in the field name should be escaped, with say {{##}}.
> A second problem is that while escaping happens when generating XML, the 
> corresponding unescaping does not happen on loading it. This asymmetry should 
> be fixed as well.
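
The collision and the proposed fix in the description above can be sketched 
as follows.  This is only an illustration under stated assumptions, not Solr's 
actual code: the function names are hypothetical, "non-printable" is taken to 
mean code points below 0x20, and the escape syntax is the #nn; form quoted in 
the description.

```python
import re


def naive_escape(name: str) -> str:
    """Mimic the reported behaviour: control characters become #nn;
    and a literal '#' is left untouched (assumption: 'non-printable'
    means code points below 0x20)."""
    return "".join(f"#{ord(c)};" if ord(c) < 0x20 else c for c in name)


def escape(name: str) -> str:
    """Proposed scheme: also escape a literal '#' as '##'."""
    parts = []
    for c in name:
        if c == "#":
            parts.append("##")
        elif ord(c) < 0x20:
            parts.append(f"#{ord(c)};")
        else:
            parts.append(c)
    return "".join(parts)


# Matches either an escaped '#' or a decimal character escape.
_TOKEN = re.compile(r"##|#(\d+);")


def unescape(text: str) -> str:
    """Inverse of escape(), to be applied when loading the schema."""
    def replace(match: re.Match) -> str:
        if match.group(0) == "##":
            return "#"
        return chr(int(match.group(1)))
    return _TOKEN.sub(replace, text)


if __name__ == "__main__":
    with_control = "a\x14b"   # contains the byte 0x14
    with_literal = "a#20;b"   # contains the literal text #20;
    print(naive_escape(with_control) == naive_escape(with_literal))
    print(escape(with_control), escape(with_literal))
```

With the naive scheme both names serialize to the same string, while the 
doubled '#' keeps them apart and lets unescape() restore each original name.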



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org
