[
https://issues.apache.org/jira/browse/STANBOL-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021552#comment-13021552
]
Olivier Grisel commented on STANBOL-176:
----------------------------------------
Just a note: "\x13" is written "\u0013" in Java. Such characters can show up
in text extracted by libraries such as Apache POI, for instance from the
content of a Word document.
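
A minimal sketch of the kind of filtering the issue asks for, assuming the
goal is to keep only characters allowed by the XML 1.0 "Char" production.
The class and method names (ControlCharFilter, isXml10Char,
stripInvalidXmlChars) are hypothetical and not part of the Stanbol API; the
actual fix in the NER engine may look different.

```java
// Hypothetical helper, not taken from the Stanbol code base.
final class ControlCharFilter {

    private ControlCharFilter() {
    }

    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text without the characters that cannot be
    // serialized as XML 1.0 (for instance "\u0013", "\u0014", "\u0015").
    static String stripInvalidXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            // Iterate by code point so that surrogate pairs stay intact.
            int cp = text.codePointAt(i);
            if (isXml10Char(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

With this sketch, stripInvalidXmlChars("a\u0013b\u0014c\u0015") gives "abc",
while tabs, line feeds and carriage returns are kept since XML 1.0 allows
them.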
> NER engine should not put control chars in text literals of the annotation
> graph
> --------------------------------------------------------------------------------
>
> Key: STANBOL-176
> URL: https://issues.apache.org/jira/browse/STANBOL-176
> Project: Stanbol
> Issue Type: Bug
> Reporter: Olivier Grisel
> Assignee: Olivier Grisel
>
> Some text to analyse might contain control chars such as "\x13", "\x14",
> "\x15"... Such characters cannot be serialized as XML and are generally
> worthless in the labels and context properties of enhancements.
> The NER engine should filter them out before writing its annotations to the
> content item graph.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira