Beletsky Andrey created TIKA-2508:
-------------------------------------

             Summary: ParsingReader uses hardcoded content handler
                 Key: TIKA-2508
                 URL: https://issues.apache.org/jira/browse/TIKA-2508
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 1.16
            Reporter: Beletsky Andrey

ParsingReader uses a hardcoded content handler, which makes it unusable in the following case: I want to parse an image with TesseractParser using the HOCR output format, but I can't read the result through this reader because its content handler is hardcoded to BodyContentHandler, which uses WriteOutContentHandler, which in turn uses ToTextContentHandler by default. This chain of content handlers strips all HOCR tags and their attributes from the result.

*Expected Result:*
I would refactor this reader to make it more useful in cases like this. I suppose making the content handler configurable would solve issues like this one, but you know better; there may be bottlenecks I don't know about.
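
To illustrate the request, here is a minimal sketch that uses only the JDK's SAX classes, not Tika itself. Every name in it (HandlerSketch, TextHandler, MarkupHandler, emitSampleEvents, render) is hypothetical and exists only for this example. TextHandler stands in for the text-only handler chain described above, and render shows what a reader with a caller-supplied handler factory might look like. It is not the actual ParsingReader code and not a proposed patch.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.function.Function;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch; none of these names are part of Tika. */
final class HandlerSketch {

    /** Text-only handler: keeps character data, ignores tags and attributes. */
    static class TextHandler extends DefaultHandler {
        private final Writer out;
        TextHandler(Writer out) { this.out = out; }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            try { out.write(ch, start, length); } catch (IOException e) { throw new SAXException(e); }
        }
    }

    /** Markup-preserving handler: writes tags and attributes as well (no escaping, for brevity). */
    static class MarkupHandler extends DefaultHandler {
        private final Writer out;
        MarkupHandler(Writer out) { this.out = out; }

        private void write(String s) throws SAXException {
            try { out.write(s); } catch (IOException e) { throw new SAXException(e); }
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            StringBuilder tag = new StringBuilder("<").append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                tag.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            write(tag.append('>').toString());
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            write("</" + qName + ">");
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            write(new String(ch, start, length));
        }
    }

    /** Stand-in for a parser: emits one HOCR-like word element as SAX events. */
    static void emitSampleEvents(ContentHandler handler) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "class", "class", "CDATA", "ocrx_word");
        atts.addAttribute("", "title", "title", "CDATA", "bbox 10 20 50 40");
        char[] text = "Hello".toCharArray();
        handler.startDocument();
        handler.startElement("", "span", "span", atts);
        handler.characters(text, 0, text.length);
        handler.endElement("", "span", "span");
        handler.endDocument();
    }

    /** Hypothetical configurable entry point: the caller chooses the handler. */
    static String render(Function<Writer, ContentHandler> handlerFactory) throws SAXException {
        StringWriter writer = new StringWriter();
        emitSampleEvents(handlerFactory.apply(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws SAXException {
        // Text-only handler: the span tag and its attributes are lost.
        System.out.println(render(TextHandler::new));
        // Markup-preserving handler: the tag and the bbox attribute survive.
        System.out.println(render(MarkupHandler::new));
    }
}
```

With the text-only handler only the word itself reaches the writer, while the markup-preserving handler keeps the element and its bbox attribute, which is the information the HOCR format exists to carry. A factory parameter is just one possible shape for the configuration; a constructor overload that accepts a handler could serve the same purpose.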