[ 
https://issues.apache.org/jira/browse/STANBOL-176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021552#comment-13021552
 ] 

Olivier Grisel commented on STANBOL-176:
----------------------------------------

Just a note: "\x13" is written "\u0013" in Java. Such characters can show up 
in the output of libraries such as Apache POI, for instance when extracting 
the text content of a Word document.
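
For illustration only, here is a minimal sketch of the kind of filter this 
issue asks for. It keeps only the code points allowed by the XML 1.0 "Char" 
production and drops everything else, including "\u0013". The class and 
method names (ControlCharFilter, isXmlChar, stripInvalidXmlChars) are 
hypothetical and are not taken from the Stanbol code base; where the NER 
engine would actually call such a helper is an assumption.

```java
// Hypothetical helper, not part of Stanbol: removes characters that
// cannot be serialized in an XML 1.0 document.
final class ControlCharFilter {

    private ControlCharFilter() {
    }

    // True if the code point matches the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text without the code points rejected by isXmlChar.
    // Iterates by code point so that valid surrogate pairs are preserved
    // and unpaired surrogates are dropped.
    public static String stripInvalidXmlChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

An engine could apply stripInvalidXmlChars to a label or context string just 
before creating the corresponding text literal. Note that this sketch keeps 
tab, line feed and carriage return, since XML 1.0 allows them.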

> NER engine should not put control chars in text literals of the annotation 
> graph
> --------------------------------------------------------------------------------
>
>                 Key: STANBOL-176
>                 URL: https://issues.apache.org/jira/browse/STANBOL-176
>             Project: Stanbol
>          Issue Type: Bug
>            Reporter: Olivier Grisel
>            Assignee: Olivier Grisel
>
> Some text to analyse might contain control chars such as "\x13", "\x14", 
> "\x15"... Such characters cannot be serialized as XML and are generally 
> worthless in the labels and context properties of enhancements.
> The NER engine should filter them out before writing its annotations to the 
> content item graph.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
