[ 
https://issues.apache.org/jira/browse/SOLR-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725145#comment-17725145
 ] 

Gus Heck commented on SOLR-16810:
---------------------------------

Yeah, I looked, and the code that does this encoding is literally some of the 
oldest code in Solr (circa 2006, with attribute escaping added in 2008). I 
think that XML creation code exists because, at the time, existing XML 
libraries were heavy and slow. Hard to say whether there is a better 
alternative in modern times; that would certainly need some performance 
testing and a much more comprehensive ticket. Frustratingly, there's no easy 
answer, but it's yet another "we didn't follow the spec for a data format and 
now we're stuck with it" problem.

One thing about this fix is that code reading back the escaped control 
characters will now actually recreate those control characters. Hard to say 
what effect that will have on folks unprepared for it. Could be fun to read 
logs trying to distinguish between a field named 'fool<backspace>' and one 
named 'foo', and when logs try to print the names of fields containing return 
characters, that might cause some fun with log parsing. The history seems to 
be that the escaping routines for the main body were co-opted to handle 
attributes. I don't know whether there was intention behind the decision not 
to decode, since it does have the effect of purging characters that can cause 
problems and shouldn't be there in the first place.

> Under certain situations Solr produces managed schema XML that cannot be 
> loaded
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-16810
>                 URL: https://issues.apache.org/jira/browse/SOLR-16810
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: Schema and Analysis
>    Affects Versions: 9.2.1
>            Reporter: Thiruvalluvan M. G.
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> While persisting the {{ManagedIndexSchema}} as XML, non-printable characters 
> in field names get escaped as {{#nn;}}, where {{nn}} is the decimal 
> representation of the non-printable character. For example, if the field name 
> has the byte {{0x14}}, it gets escaped as {{#20;}}. This is 
> indistinguishable from the literal {{#20;}} in the field name. If we have two 
> fields - one with the non-printable character and the other with the literal 
> string, two fields get generated with the same name. Loading the resulting 
> XML, naturally, causes an exception. To fix this, any occurrence of a literal 
> {{#}} in the field name should be escaped, with, say, {{##}}.
>
> A second problem is that while escaping happens when generating XML, the 
> corresponding unescaping does not happen on loading it. This asymmetry should 
> be fixed as well.
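
The collision and the proposed {{##}} fix described in the issue can be 
illustrated with a small, self-contained Java sketch. This is a simplified 
model for discussion only, not Solr's actual escaping code: the class and 
method names (HashEscapeSketch, escapeLossy, escape, unescape) are 
hypothetical, and it assumes every character below 0x20 is escaped, which may 
not match exactly what Solr does for tab, newline, and carriage return.

```java
final class HashEscapeSketch {

    // Simplified model of the behavior described in the issue: control
    // characters become #nn; while a literal '#' passes through unchanged.
    static String escapeLossy(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20) {
                out.append('#').append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Proposed variant: additionally escape a literal '#' as "##" so that
    // the encoding is unambiguous.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '#') {
                out.append("##");
            } else if (c < 0x20) {
                out.append('#').append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Inverse of escape(): "##" decodes to '#', and "#nn;" decodes to the
    // character with decimal code nn.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c != '#') {
                out.append(c);
                i++;
            } else if (i + 1 < s.length() && s.charAt(i + 1) == '#') {
                out.append('#');
                i += 2;
            } else {
                int end = s.indexOf(';', i + 1);
                if (end < 0) {
                    throw new IllegalArgumentException(
                        "Unterminated escape at index " + i);
                }
                // parseInt throws NumberFormatException on malformed digits
                out.append((char) Integer.parseInt(s.substring(i + 1, end)));
                i = end + 1;
            }
        }
        return out.toString();
    }
}
```

With the lossy variant, a field name containing the byte 0x14 and a field 
name containing the literal text #20; produce the same output, which is the 
duplicate-name situation reported here. With the {{##}} variant, the two names 
encode differently and unescape() restores each one, including the control 
character itself, which is the behavior change discussed in the comment above.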



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
