Beletsky Andrey created TIKA-2508:
-------------------------------------

             Summary: ParsingReader uses hardcoded content handler
                 Key: TIKA-2508
                 URL: https://issues.apache.org/jira/browse/TIKA-2508
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 1.16
            Reporter: Beletsky Andrey


ParsingReader uses a hardcoded content handler, which makes it unusable in the
following case:
I want to parse an image with TesseractParser using the HOCR output format, but
I can't read the result through this reader, because its content handler is
hardcoded to BodyContentHandler, which uses WriteOutContentHandler, which in
turn uses ToTextContentHandler by default. This chain of content handlers
strips all HOCR tags and their attributes from the output.
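
To illustrate the effect, here is a minimal JDK-only sketch. It does not use
Tika itself: TextOnlyHandler and MarkupHandler are hypothetical stand-ins for a
text-only handler and a markup-preserving handler, and the HOCR fragment is
made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** JDK-only illustration; the handlers below are hypothetical, not Tika classes. */
final class HocrHandlerDemo {

    /** Keeps character data only, so tags and attributes are lost. */
    static final class TextOnlyHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Writes elements and attributes back out next to the text (no escaping, for brevity). */
    static final class MarkupHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Feeds the same SAX events to whichever handler the caller passes in. */
    static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up HOCR-like fragment: the bounding box lives in the title attribute.
        String hocr = "<span class=\"ocrx_word\" title=\"bbox 10 20 90 44\">Tika</span>";
        TextOnlyHandler text = new TextOnlyHandler();
        MarkupHandler markup = new MarkupHandler();
        parse(hocr, text);
        parse(hocr, markup);
        System.out.println(text.out);   // only the recognized word survives
        System.out.println(markup.out); // the element and its bbox attribute are kept
    }
}
```

The parser emits identical events in both runs; only the handler decides
whether the bounding-box information reaches the caller.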

*Expected Result:*
I would refactor this reader to make it more useful in cases like this. I
suppose that making the content handler configurable would solve the issue, but
you know better... there may be some pitfalls I don't know about.
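
A rough sketch of what I mean, assuming a reader that takes a handler factory
from the caller. ConfigurableParsingReader, markupHandler and parseToString are
hypothetical names, not existing Tika API, and a plain JDK XML parser stands in
for the real Tika parser so that the example is self-contained.

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.function.Function;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical reader whose content handler is chosen by the caller; not Tika's ParsingReader. */
final class ConfigurableParsingReader extends Reader {
    private final PipedReader pipe = new PipedReader();
    private volatile Exception failure;

    ConfigurableParsingReader(String xml, Function<Writer, ContentHandler> handlerFactory)
            throws IOException {
        PipedWriter sink = new PipedWriter(pipe);
        Thread worker = new Thread(() -> {
            try {
                // Assumption: a JDK XML parser stands in for the real document parser.
                XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
                parser.setContentHandler(handlerFactory.apply(sink));
                parser.parse(new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                failure = e; // reported to the consumer once the stream ends
            } finally {
                try { sink.close(); } catch (IOException ignored) { }
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = pipe.read(buf, off, len);
        if (n == -1 && failure != null) throw new IOException(failure);
        return n;
    }

    @Override
    public void close() throws IOException {
        pipe.close();
    }

    /** Example factory target: keeps tags and attributes (no escaping, for brevity). */
    static ContentHandler markupHandler(Writer out) {
        return new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts)
                    throws SAXException {
                StringBuilder tag = new StringBuilder("<").append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    tag.append(' ').append(atts.getQName(i))
                       .append("=\"").append(atts.getValue(i)).append('"');
                }
                write(tag.append('>').toString());
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                write("</" + qName + ">");
            }

            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                write(new String(ch, start, length));
            }

            private void write(String s) throws SAXException {
                try { out.write(s); } catch (IOException e) { throw new SAXException(e); }
            }
        };
    }

    /** Drains a reader built with the given factory into a string. */
    static String parseToString(String xml, Function<Writer, ContentHandler> factory) {
        try (Reader reader = new ConfigurableParsingReader(xml, factory)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            for (int n; (n = reader.read(buf)) != -1; ) sb.append(buf, 0, n);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would then pick the output form itself, for example
parseToString(input, ConfigurableParsingReader::markupHandler), while a default
factory could keep the current plain-text behaviour for existing users.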



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
