[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799143#comment-17799143 ]

Julian Reschke commented on JCR-4935:

The JCR spec explicitly references XML 1.0. And, FWIW, XML 1.1 is not used in practice, so this wouldn't help in any case.

> session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
>
> Key: JCR-4935
> URL: https://issues.apache.org/jira/browse/JCR-4935
> Project: Jackrabbit Content Repository
> Issue Type: Bug
> Components: jackrabbit-jcr-commons
> Affects Versions: 2.21.18
> Reporter: Yegor Kozlov
> Assignee: Julian Reschke
> Priority: Major
> Attachments: image-2023-05-29-14-58-05-591.png
>
> I came across this issue in AEM, where user content can contain all kinds of special characters. In my case it was a 0x3 character (^C) in a node property which was written in the JCR XML as-is, and it resulted in an unparsable output.
> !image-2023-05-29-14-58-05-591.png|width=968,height=305!
> IMO control characters, non-characters and out-of-unicode-range characters should be skipped when writing XML. These can come from user data and can act as a "poison pill" breaking the export/import functionality.
>
> The PR is coming.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
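For readers who want to see concretely why the 0x3 character from the report cannot appear in an XML 1.0 document, the check below follows the "Char" production of XML 1.0. It is a minimal sketch only: the class and method names (Xml10Chars, isXml10Char, firstInvalidIndex) are hypothetical and are not part of Jackrabbit or of the PR discussed in this thread.

```java
/** Hypothetical helper, not part of Jackrabbit: checks the XML 1.0 "Char" production. */
final class Xml10Chars {
    private Xml10Chars() {}

    /** Returns true if the code point may appear in an XML 1.0 document. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the index of the first invalid code point in s, or -1 if there is none. */
    static int firstInvalidIndex(String s) {
        for (int i = 0; i < s.length(); ) {
            // An unpaired surrogate is returned as-is by codePointAt and fails the check.
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

A caller higher in the stack could use firstInvalidIndex to decide whether a property value needs special handling before it reaches the low-level XML writer.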
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799136#comment-17799136 ]

Konrad Windszus commented on JCR-4935:

With XML 1.1 most limitations to invalid characters were lifted: https://www.w3.org/TR/xml11/#sec-xml11. Not sure if the JCR spec explicitly specifies an XML spec version...
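To make "most limitations were lifted" concrete, the sketch below encodes the "Char" and "RestrictedChar" productions of XML 1.1. It is illustrative only; the class and method names are hypothetical. Note that even in XML 1.1 the NUL character stays excluded, and restricted characters such as 0x3 may appear only as character references, not as literals.

```java
/** Hypothetical helper, not part of Jackrabbit: XML 1.1 character classes. */
final class Xml11Chars {
    private Xml11Chars() {}

    /** XML 1.1 "Char": any code point except NUL, surrogates, U+FFFE and U+FFFF. */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** XML 1.1 "RestrictedChar": legal, but only when written as a character reference. */
    static boolean isXml11Restricted(int cp) {
        return (cp >= 0x1 && cp <= 0x8)
            || cp == 0xB || cp == 0xC
            || (cp >= 0xE && cp <= 0x1F)
            || (cp >= 0x7F && cp <= 0x84)
            || (cp >= 0x86 && cp <= 0x9F);
    }
}
```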
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727945#comment-17727945 ]

Julian Reschke commented on JCR-4935:

a) First we need to decide what we *want* it to do.

b) Whatever the fix is, it needs to happen higher in the stack; this is a low-level XML writing method; it's not supposed to modify the contents of the strings it writes (except for *XML*-escaping).

> Key: JCR-4935
> Affects Versions: 2.21.17
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727941#comment-17727941 ]

Yegor Kozlov commented on JCR-4935:

I changed the fix to escape illegal characters to unicode code points, similar to how FileVault does it. The XML would look something like

{code:xml}
{code}
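The example XML did not survive in this archived message, so the following is only a guess at the general idea of "escaping illegal characters to unicode code points", not the code from the PR. The sketch assumes a backslash-u notation with four hex digits and also escapes the backslash itself so that the output can be decoded unambiguously; both choices, and the class name UnicodeEscapeSketch, are assumptions of this sketch.

```java
/** Hypothetical sketch, not the PR's code: writes XML-illegal characters as \\uXXXX text. */
final class UnicodeEscapeSketch {
    private UnicodeEscapeSketch() {}

    /** XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Replaces each XML-illegal code point with a textual escape; doubles backslashes. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (cp == '\\') {
                // Assumption: the escape character itself is doubled to stay reversible.
                out.append("\\\\");
            } else if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every code point rejected above lies in the BMP, so four hex digits suffice.
                out.append(String.format("\\u%04x", cp));
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The result contains only characters that are legal in XML 1.0, so it can then go through ordinary XML escaping. An importer would have to apply the reverse mapping, which is why such a scheme changes the export format.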
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727557#comment-17727557 ]

Julian Reschke commented on JCR-4935:

OK, so that is a somewhat different format anyway (and I see it takes care of non-XML chars). Thanks.
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727554#comment-17727554 ]

Konrad Windszus commented on JCR-4935:

This is how FileVault Docview does it: https://jackrabbit.apache.org/filevault/docview.html#escaping
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727475#comment-17727475 ]

Julian Reschke commented on JCR-4935:

Document view is by definition a "best-effort" mapping to XML, so it's known to be lossy. The question here is whether it's better to fail early, or export incorrect data. There is no simple answer here.
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727472#comment-17727472 ]

Yegor Kozlov commented on JCR-4935:

> * fail on export (I believe this would be conformant with the JCR spec)

Does that mean that if a user wants to export to XML, they need to use only the good characters in the data? As an API consumer I'd expect a friendlier approach.
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727453#comment-17727453 ]

Julian Reschke commented on JCR-4935:

So basically the choices are:

* fail on export (I believe this would be conformant with the JCR spec)
* produce broken XML that cannot be imported
* produce valid XML after stripping some content
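Two of the choices listed above, failing on export and stripping content, can be contrasted in a few lines. This is a minimal sketch under stated assumptions: the Policy enum, the class name and the use of IllegalArgumentException are hypothetical and do not reflect how Jackrabbit signals export errors. The third choice, writing the value unchanged, is the behavior that the issue reports.

```java
/** Hypothetical sketch of two handling policies; not Jackrabbit code. */
final class InvalidCharPolicySketch {
    private InvalidCharPolicySketch() {}

    /** FAIL aborts on the first XML-illegal character; STRIP silently drops such characters. */
    enum Policy { FAIL, STRIP }

    /** XML 1.0 "Char" production. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Applies the chosen policy to a property value before it is handed to the XML writer. */
    static String apply(String value, Policy policy) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (isXml10Char(cp)) {
                out.appendCodePoint(cp);
            } else if (policy == Policy.FAIL) {
                throw new IllegalArgumentException(String.format(
                    "character U+%04X at index %d cannot be written as XML 1.0", cp, i));
            }
            // Policy.STRIP: the illegal code point is dropped, so the export is lossy.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With STRIP, the value "a", 0x3, "b" is exported as "ab" and the original cannot be restored on import; with FAIL, the caller learns about the problem at export time instead of at import time.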
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727431#comment-17727431 ]

Julian Reschke commented on JCR-4935:

FWIW, ToXmlContentHandler.java (https://github.com/apache/jackrabbit/pull/132/commits/772347431022120704153606883b9b1abcf489f1#diff-c815600021691abe44140c80f533e6dda87aa0a90c9147bf0346fdf8a6e0be38) works as defined. We *could* change it to check for invalid characters and throw an exception. Whatever the fix is, it needs to happen higher in the stack.
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727424#comment-17727424 ]

Julian Reschke commented on JCR-4935:

Good catch; I somehow assumed that both use the same escaping, but that's of course not true.

The problem with the definition for the document view export, however, is that it does not work. If a char is disallowed in XML, it can't be represented with an entity reference, either. (so "" is a parse error).
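The point that a disallowed character cannot be rescued by a reference can be checked with the JDK's own XML parser. The sketch below is illustrative only: the class name is hypothetical, and the reference "&#3;" is this sketch's own example (chosen to match the 0x3 character from the report), not a reconstruction of the text that the archive dropped from the comment above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

/** Hypothetical sketch: asks the JDK's XML parser whether a document is well-formed. */
final class CharRefCheck {
    private CharRefCheck() {}

    /** Returns true if the given XML text parses without errors. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXParseException {
                    throw e;
                }
                @Override public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // Covers SAXException, IOException and ParserConfigurationException.
            return false;
        }
    }
}
```

A reference to an allowed character such as "&#9;" parses, whereas both the literal 0x3 character and the reference "&#3;" are rejected, which is why entity escaping alone cannot fix the export.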
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727421#comment-17727421 ]

Marcel Reutegger commented on JCR-4935:

[~reschke], what you quoted applies to system view export. IIUC, this report is about document view export. I think '7.3 Document View' list item 10 applies:

> If P is a non-BINARY property its value is converted to string form according to the standard conversion (see §3.6.4 Property Type Conversion). Entity references are used to escape characters which cannot be included as literals within attribute values (see §7.5 Escaping of Values).
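The entity escaping that the quoted spec item describes covers the XML markup characters only. The sketch below is a generic attribute-value escaper written for illustration, under the assumption that the five predefined XML entities are the ones in play; it is not Jackrabbit's ToXmlContentHandler, and the class name is hypothetical. It shows that a control character such as 0x3 passes through this step unchanged.

```java
/** Hypothetical sketch of attribute-value escaping with the five predefined XML entities. */
final class AttributeEscapeSketch {
    private AttributeEscapeSketch() {}

    /** Escapes markup characters; every other character is copied unchanged. */
    static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);        // includes control characters such as 0x3
            }
        }
        return out.toString();
    }
}
```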
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727406#comment-17727406 ]

Julian Reschke commented on JCR-4935:

[~kwin] - FYI as this might impact Filevault as well.
[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character
[ https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727405#comment-17727405 ]

Julian Reschke commented on JCR-4935:

I agree that there is a problem here, but the JCR spec in fact defines the handling here (fortunately). See <https://developer.adobe.com/experience-manager/reference-materials/spec/jcr/2.0/7_Export.html>:

> * If, after conversion to string and entity escaping is performed, the string form of a value still contains characters which cannot appear in an XML document (neither as literals nor as character references [13]) then:
>   1. The string form is further encoded using Base64 encoding.
>   2. The attribute xsi:type=“xsd:base64Binary” is added to the element.
>   3. The namespace mappings for xsi and xsd are added to the exported XML document so that the xsi:type attribute is within their scope. The namespace declarations required are xmlns:xsd=“http://www.w3.org/2001/XMLSchema” and xmlns:xsi=“http://www.w3.org/2001/XMLSchema-instance”. Note that the prefixes representing these two namespaces need not be _literally_ “xsd” and “xsi”. Any two prefixes are permitted as long as the corresponding namespace declarations are changed accordingly.
>
> [13] https://developer.adobe.com/experience-manager/reference-materials/spec/jcr/2.0/7_Export.html#sdfootnote13sym
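The Base64 fallback quoted from the JCR spec in the comment above can be sketched with java.util.Base64. This is an illustration under stated assumptions, not Jackrabbit's exporter: the class name is hypothetical, the choice of UTF-8 for turning the string into bytes is an assumption of this sketch, and the sv:value element name is assumed from the system view format. The xsi and xsd namespace declarations required by the spec text must be in scope elsewhere in the document.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical sketch of the Base64 fallback; not Jackrabbit's exporter. */
final class Base64ValueSketch {
    private Base64ValueSketch() {}

    /** Encodes the string form of a value as Base64 (assumption: UTF-8 bytes). */
    static String encode(String value) {
        return Base64.getEncoder().encodeToString(value.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode(), as an importer would have to do. */
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    /** Builds a value element carrying the xsi:type marker named in the spec text. */
    static String toValueElement(String value) {
        // Assumption: sv:value is the element in question; namespaces are declared elsewhere.
        return "<sv:value xsi:type=\"xsd:base64Binary\">" + encode(value) + "</sv:value>";
    }
}
```

Unlike stripping, this keeps the export lossless for well-formed strings: the 0x3 character survives the round trip. Unpaired surrogates are a limitation of this sketch, because String.getBytes replaces them when encoding to UTF-8.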