[ https://issues.apache.org/jira/browse/SOLR-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725145#comment-17725145 ]
Gus Heck commented on SOLR-16810:
---------------------------------

Yeah, I looked, and the code that does this encoding is literally some of the oldest code in Solr (circa 2006, with attribute escaping added in 2008). I think that XML creation code exists because, at the time, the existing XML libraries were heavy and slow. Hard to say if there is a better alternative in modern times; that would certainly need some performance testing and a much more comprehensive ticket. Frustratingly, there's no easy answer, but it's yet another "we didn't follow the spec for a data format and now we're stuck with it" problem.

One thing about this fix is that code reading back the escaped control characters will now actually recreate those control characters. Hard to say what effect that will have on folks unprepared for it. Could be fun to read logs trying to distinguish between a field named 'fool<backspace>' and one named 'foo', and when logs try to print the names of fields containing return characters, that might cause some fun with log parsing.

The history seems to be that the escaping routines for the main body were co-opted to handle attributes. I don't know if there was intention behind the decision not to decode, since it does have the effect of purging characters that can cause problems and shouldn't be there in the first place.

> Under certain situations Solr produces managed schema XML that cannot be loaded
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-16810
>                 URL: https://issues.apache.org/jira/browse/SOLR-16810
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>          Components: Schema and Analysis
>    Affects Versions: 9.2.1
>            Reporter: Thiruvalluvan M. G.
>            Assignee: Ishan Chattopadhyaya
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> While persisting the {{ManagedIndexSchema}} as XML, non-printable characters
> in field names get escaped as {{#nn;}}, where {{nn}} is the decimal
> representation of the non-printable character. For example, if the field name
> has the byte {{0x14}}, it gets escaped as {{#20;}}. This is
> indistinguishable from the literal {{#20;}} in the field name. If we have two
> fields, one with the non-printable character and the other with the literal
> string, two fields get generated with the same name. Loading the resulting
> XML, naturally, causes an exception. To fix this, any occurrence of a literal
> {{#}} in the field name should be escaped, say as {{##}}.
> A second problem is that while escaping happens when generating the XML, the
> corresponding unescaping does not happen on loading it. This asymmetry should
> be fixed as well.
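
For readers following along, here is a minimal sketch of the collision the description reports, together with one possible round-trip scheme along the lines it proposes ({{##}} for a literal {{#}}). This is not Solr's actual code: the class and method names (FieldNameEscaping, lossyEscape, escape, unescape) are hypothetical, treating every character below 0x20 as "non-printable" is an assumption made for illustration, and the usual XML escaping of <, >, & and quotes is left out.

```java
public class FieldNameEscaping {

    // Illustrative stand-in for the behaviour described in the issue:
    // control characters become "#nn;" and a literal '#' is left untouched.
    public static String lossyEscape(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < 0x20) {
                sb.append('#').append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Proposed variant: additionally escape a literal '#' as "##",
    // so that "#nn;" in the output can only come from a control character.
    public static String escape(String name) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c == '#') {
                sb.append("##");
            } else if (c < 0x20) {
                sb.append('#').append((int) c).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Inverse of escape(): "##" becomes '#', "#nn;" becomes the character
    // with decimal code nn. Malformed input raises IllegalArgumentException.
    public static String unescape(String escaped) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c != '#') {
                sb.append(c);
                i++;
            } else if (i + 1 < escaped.length() && escaped.charAt(i + 1) == '#') {
                sb.append('#');
                i += 2;
            } else {
                int end = escaped.indexOf(';', i);
                if (end < 0) {
                    throw new IllegalArgumentException("Unterminated escape at index " + i);
                }
                sb.append((char) Integer.parseInt(escaped.substring(i + 1, end)));
                i = end + 1;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String withControlChar = "f\u0014"; // 'f' followed by the byte 0x14
        String withLiteral = "f#20;";       // 'f' followed by the literal text "#20;"

        // Both names collapse to the same escaped string under the lossy scheme.
        System.out.println(lossyEscape(withControlChar).equals(lossyEscape(withLiteral)));

        // With '#' doubled, the two names stay distinct and decode back exactly.
        System.out.println(escape(withControlChar) + " vs " + escape(withLiteral));
        System.out.println(unescape(escape(withLiteral)).equals(withLiteral));
    }
}
```

Note that the unescape step is exactly what reintroduces real control characters when the schema is loaded, which is the behaviour change the comment above cautions about.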