[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-12-20 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799143#comment-17799143
 ] 

Julian Reschke commented on JCR-4935:
-

The JCR spec explicitly references XML 1.0.

And, FWIW, XML 1.1 is not used in practice, so this wouldn't help in any case.

> session.exportDocumentView() generates unparsable XML if a JCR Property 
> contains invalid XML character
> --
>
> Key: JCR-4935
> URL: https://issues.apache.org/jira/browse/JCR-4935
> Project: Jackrabbit Content Repository
>  Issue Type: Bug
>  Components: jackrabbit-jcr-commons
>Affects Versions: 2.21.18
>Reporter: Yegor Kozlov
>Assignee: Julian Reschke
>Priority: Major
> Attachments: image-2023-05-29-14-58-05-591.png
>
>
> I came across this issue in AEM, where user content can contain all kinds of 
> special characters. In my case it was a 0x3 character (^C) in a node property 
> which was written in the JCR XML as-is, and it resulted in an unparsable 
> output. 
> !image-2023-05-29-14-58-05-591.png|width=968,height=305!
> IMO control characters, non-characters and out-of-unicode-range characters 
> should be skipped when writing XML. These can come from user data and can act 
> as a "poison pill" breaking the export/import functionality. 
>  
> The PR is coming.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-12-20 Thread Konrad Windszus (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799136#comment-17799136
 ] 

Konrad Windszus commented on JCR-4935:
--

XML 1.1 lifted most of the restrictions on invalid characters: 
https://www.w3.org/TR/xml11/#sec-xml11. Not sure if the JCR spec explicitly 
specifies an XML spec version...



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-31 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727945#comment-17727945
 ] 

Julian Reschke commented on JCR-4935:
-

a) First we need to decide what we *want* it to do.

b) Whatever the fix is, it needs to happen higher in the stack; this is a 
low-level XML writing method; it's not supposed to modify the contents of the 
strings it writes (except for {*}XML{*}-escaping).
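
One way to picture a check "higher in the stack" is a SAX filter that sits in front of the low-level writer and inspects values before they reach it. The following is only a minimal sketch under that assumption: the class name `XmlCharCheckingFilter` is hypothetical and not part of Jackrabbit, and it relies solely on the JDK's `org.xml.sax.helpers.XMLFilterImpl`. It rejects values rather than modifying them, so the writer itself stays untouched.

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: validates values and then delegates to the real writer.
final class XmlCharCheckingFilter extends XMLFilterImpl {

    // XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        for (int i = 0; i < atts.getLength(); i++) {
            check(atts.getValue(i), "attribute " + atts.getQName(i));
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Simplification: assumes a surrogate pair is never split across two calls.
        check(new String(ch, start, length), "character data");
        super.characters(ch, start, length);
    }

    private static void check(String s, String where) throws SAXException {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXmlChar(cp)) {
                throw new SAXException(String.format(
                        "U+%04X is not an XML 1.0 character (%s, offset %d)", cp, where, i));
            }
            i += Character.charCount(cp);
        }
    }
}
```

A caller would install the actual serializer with `setContentHandler(...)` and send export events to the filter instead of to the serializer directly.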

> session.exportDocumentView() generates unparsable XML if a JCR Property 
> contains invalid XML character
> --
>
> Key: JCR-4935
> URL: https://issues.apache.org/jira/browse/JCR-4935
> Project: Jackrabbit Content Repository
>  Issue Type: Bug
>  Components: jackrabbit-jcr-commons
>Affects Versions: 2.21.17
>Reporter: Yegor Kozlov
>Assignee: Julian Reschke
>Priority: Major
> Attachments: image-2023-05-29-14-58-05-591.png
>
>
> I came across this issue in AEM, where user content can contain all kinds of 
> special characters. In my case it was a 0x3 character (^C) in a node property 
> which was written in the JCR XML as-is, and it resulted in an unparsable 
> output. 
> !image-2023-05-29-14-58-05-591.png|width=968,height=305!
> IMO control characters, non-characters and out-of-unicode-range characters 
> should be skipped when writing XML. These can come from user data and can act 
> as a "poison pill" breaking the export/import functionality. 
>  
> The PR is coming.





[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-31 Thread Yegor Kozlov (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727941#comment-17727941
 ] 

Yegor Kozlov commented on JCR-4935:
---

I changed the fix to escape illegal characters as Unicode code points, similar 
to how FileVault does it. The XML would look something like
{code:xml}

 {code}
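
The example in the comment above did not survive in this archive, so the exact output format is not known here. Purely as an illustration of the idea, the sketch below assumes a `\uXXXX` style escape and a doubled backslash for literal backslashes; both choices, and the class name `UnicodeEscapeDemo`, are assumptions and not taken from the PR or from FileVault.

```java
// Hypothetical escaping scheme: invalid XML 1.0 characters become \uXXXX.
final class UnicodeEscapeDemo {

    // XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                out.append("\\\\");           // keep the escape character unambiguous
            } else if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                // Every invalid code point is below U+10000, so four hex digits suffice.
                out.append(String.format("\\u%04x", cp));
            }
        }
        return out.toString();
    }

    // Inverse of escape(); only meant for strings produced by escape().
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(i + 1);
                if (next == '\\') {
                    out.append('\\');
                    i++;
                    continue;
                }
                if (next == 'u' && i + 5 < s.length()) {
                    out.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                    i += 5;
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

Unlike stripping, a scheme of this kind keeps the export parsable without losing data, but only importers that know the convention can restore the original value.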



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727557#comment-17727557
 ] 

Julian Reschke commented on JCR-4935:
-

OK, so that is a somewhat different format anyway (and I see it takes care of 
non-XML chars). Thanks.



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Konrad Windszus (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727554#comment-17727554
 ] 

Konrad Windszus commented on JCR-4935:
--

This is how FileVault Docview does it: 
https://jackrabbit.apache.org/filevault/docview.html#escaping



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727475#comment-17727475
 ] 

Julian Reschke commented on JCR-4935:
-

Document view is by definition a "best-effort" mapping to XML, so it's known to 
be lossy. The question here is whether it's better to fail early, or export 
incorrect data. There is no simple answer here.

 

 



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Yegor Kozlov (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727472#comment-17727472
 ] 

Yegor Kozlov commented on JCR-4935:
---

 
{quote} 
 * fail on export (I believe this would be conformant with the JCR spec){quote}
 

Does that mean that if a user wants to export to XML they need to use only the 
good characters in the data? As an API consumer I'd expect a friendlier 
approach.


 



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727453#comment-17727453
 ] 

Julian Reschke commented on JCR-4935:
-

So basically the choices are:
 * fail on export (I believe this would be conformant with the JCR spec)
 * produce broken XML that can not be imported
 * produce valid XML after stripping some content
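
The three choices above can be sketched as a policy applied to each value before it is written. This is only an illustration: the names `InvalidCharPolicyDemo` and `Policy` are hypothetical and nothing like this exists in Jackrabbit as far as this thread shows.

```java
// Hypothetical policy switch covering the three options listed above.
final class InvalidCharPolicyDemo {

    enum Policy {
        FAIL,          // fail on export
        PASS_THROUGH,  // write the value as-is (output may be unparsable)
        STRIP          // drop invalid characters (output is valid but lossy)
    }

    // XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String apply(String value, Policy policy) {
        if (policy == Policy.PASS_THROUGH) {
            return value;
        }
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            i += Character.charCount(cp);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else if (policy == Policy.FAIL) {
                throw new IllegalArgumentException(
                        String.format("U+%04X cannot appear in an XML 1.0 document", cp));
            }
            // Policy.STRIP: the invalid code point is silently skipped.
        }
        return out.toString();
    }
}
```

With the value `"a\u0003b"`, `FAIL` throws, `PASS_THROUGH` returns the input unchanged, and `STRIP` returns `"ab"`.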

 



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727431#comment-17727431
 ] 

Julian Reschke commented on JCR-4935:
-

FWIW, 
[ToXmlContentHandler.java|https://github.com/apache/jackrabbit/pull/132/commits/772347431022120704153606883b9b1abcf489f1#diff-c815600021691abe44140c80f533e6dda87aa0a90c9147bf0346fdf8a6e0be38]
 works as defined. We *could* change it to check for invalid characters and 
throw an exception.

Whatever the fix is, it needs to happen higher in the stack.



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727424#comment-17727424
 ] 

Julian Reschke commented on JCR-4935:
-

Good catch; I somehow assumed that both use the same escaping, but that's of 
course not true.

The problem with the definition for the document view export, however, is that it 
does not work. If a char is disallowed in XML, it can't be represented with an 
entity reference, either. (so "" is a parse error).
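
The point can be checked with the JDK's own parser. The sketch below uses a numeric character reference for U+0003 (`&#3;`) as the test case, which is my choice of example; the helper class `CharRefParseDemo` is hypothetical. Under XML 1.0 both the literal character and the reference to it are rejected, while a reference to an allowed character such as TAB (`&#9;`) parses.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class CharRefParseDemo {

    // Returns true if the JDK's default XML parser accepts the document.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("<node title=\"a&#9;b\"/>"));   // reference to TAB
        System.out.println(parses("<node title=\"a&#3;b\"/>"));   // reference to U+0003
        System.out.println(parses("<node title=\"a\u0003b\"/>")); // literal U+0003
    }
}
```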



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Marcel Reutegger (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727421#comment-17727421
 ] 

Marcel Reutegger commented on JCR-4935:
---

[~reschke], what you quoted applies to system view export. IIUC, this report is 
about document view export. I think '7.3 Document View' list item 10 applies:
bq. If P is a non-BINARY property its value is converted to string form 
according to the standard conversion (see §3.6.4 Property Type Conversion). 
Entity references are used to escape characters which cannot be included as 
literals within attribute values (see §7.5 Escaping of Values).



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727406#comment-17727406
 ] 

Julian Reschke commented on JCR-4935:
-

[~kwin] - FYI, as this might impact FileVault as well.



[jira] [Commented] (JCR-4935) session.exportDocumentView() generates unparsable XML if a JCR Property contains invalid XML character

2023-05-30 Thread Julian Reschke (Jira)


[ 
https://issues.apache.org/jira/browse/JCR-4935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17727405#comment-17727405
 ] 

Julian Reschke commented on JCR-4935:
-

I agree that there is a problem here, but the JCR spec in fact defines the 
handling here (fortunately). See 
<[https://developer.adobe.com/experience-manager/reference-materials/spec/jcr/2.0/7_Export.html]>:

 
{quote} * If, after conversion to string and entity escaping is performed, the 
string form of a value still contains characters which cannot appear in an XML 
document (neither as literals nor as character 
references{^}[13|https://developer.adobe.com/experience-manager/reference-materials/spec/jcr/2.0/7_Export.html#sdfootnote13sym]{^})
 then:

 ## The string form is further encoded using Base64 encoding.

 ## The attribute xsi:type=“xsd:base64Binary” is added to the  
element.

 ## The namespace mappings for xsi and xsd are added to the exported XML 
document so that the xsi:type attribute is within their scope. The namespace 
declarations required are xmlns:xsd=“http://www.w3.org/2001/XMLSchema” and 
xmlns:xsi=“http://www.w3.org/2001/XMLSchema-instance”. Note that the prefixes 
representing these two namespaces need not be _literally_ “xsd” and “xsi”. Any 
two prefixes are permitted as long as the corresponding namespace declarations 
are changed accordingly.
{quote}
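
The Base64 fallback quoted above can be sketched roughly as follows. This is an illustration only: the class `SystemViewValueDemo` is hypothetical, the choice of UTF-8 for turning the string into bytes is an assumption (the quoted text does not name a charset), and the sketch does not emit any XML, it only decides whether a value needs the `xsi:type` marker.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class SystemViewValueDemo {

    // Result of encoding one value: the text to write and whether the element
    // needs xsi:type="xsd:base64Binary".
    static final class Encoded {
        final String text;
        final boolean base64;

        Encoded(String text, boolean base64) {
            this.text = text;
            this.base64 = base64;
        }
    }

    // XML 1.0 "Char" production.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static boolean needsBase64(String value) {
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (!isXmlChar(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    static Encoded encode(String value) {
        if (!needsBase64(value)) {
            return new Encoded(value, false);
        }
        // Assumption: UTF-8. Unpaired surrogates cannot be represented in UTF-8
        // and would be replaced by getBytes(), so they do not round-trip here.
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        return new Encoded(Base64.getEncoder().encodeToString(bytes), true);
    }
}
```

An importer that sees the `xsi:type` marker would reverse the step with `Base64.getDecoder()` and the same charset.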
