Vadym Oliinyk created UIMA-4115:
-----------------------------------
Summary: TikaAnnotator: incorrect order of tags processing
Key: UIMA-4115
URL: https://issues.apache.org/jira/browse/UIMA-4115
Project: UIMA
Issue Type: Bug
Components: addons
Affects Versions: 2.3.1Addons
Reporter: Vadym Oliinyk
org.apache.uima.tika.MarkupAnnotator outputs incorrect content due to a bug in
org.apache.uima.tika.MarkupHandler. The problem is in the end-element event
handler: MarkupHandler#endElement should close open tags by removing them from
the stack in last-in, first-out order (the most recently added tag should be
removed first when its corresponding end tag is found). The current
implementation instead removes start elements beginning from the first open
element, which results in incorrect text spans being annotated by the
processor.
The fix is trivial:
in MarkupHandler#endElement replace startedAnnotations.iterator() with
startedAnnotations.descendingIterator().
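
The difference between the two iteration orders can be seen in a small
standalone sketch. This is not the MarkupHandler code: the class TagStackSketch,
the Open record type, and the close/run helpers are hypothetical, and a
LinkedList is only assumed as the collection type because it provides
descendingIterator(). The input simulates two nested elements with the same
name, <div>ab<div>cd</div>ef</div>, which is one case where the search order
changes the resulting spans.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

public class TagStackSketch {

    /** Hypothetical stand-in for an annotation opened by a start tag. */
    static final class Open {
        final String tag;
        final int begin;

        Open(String tag, int begin) {
            this.tag = tag;
            this.begin = begin;
        }
    }

    /**
     * Removes the first open element with the given tag name that the chosen
     * iterator encounters and returns its span as "tag[begin,end)".
     * Returns null if no open element matches.
     */
    static String close(LinkedList<Open> started, String tag, int end, boolean newestFirst) {
        Iterator<Open> it = newestFirst ? started.descendingIterator() : started.iterator();
        while (it.hasNext()) {
            Open open = it.next();
            if (open.tag.equals(tag)) {
                it.remove();
                return tag + "[" + open.begin + "," + end + ")";
            }
        }
        return null;
    }

    /** Simulates <div>ab<div>cd</div>ef</div> using character offsets of the text "abcdef". */
    static List<String> run(boolean newestFirst) {
        LinkedList<Open> started = new LinkedList<>();
        List<String> spans = new ArrayList<>();
        started.add(new Open("div", 0)); // outer <div> opens at offset 0
        started.add(new Open("div", 2)); // inner <div> opens at offset 2
        spans.add(close(started, "div", 4, newestFirst)); // inner </div> at offset 4
        spans.add(close(started, "div", 6, newestFirst)); // outer </div> at offset 6
        return spans;
    }

    public static void main(String[] args) {
        System.out.println("iterator():           " + run(false));
        System.out.println("descendingIterator(): " + run(true));
    }
}
```

With iterator(), the inner end tag is paired with the outer start tag, so the
sketch reports the spans div[0,4) and div[2,6). With descendingIterator(), the
spans are div[2,4) and div[0,6), which match the nesting of the input.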