[ https://issues.apache.org/jira/browse/TIKA-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2787.
-------------------------------
Fix Version/s: 2.0.0
Resolution: Fixed
> Make WriteLimitReachedException public and not subclass of SAXException
> -----------------------------------------------------------------------
>
> Key: TIKA-2787
> URL: https://issues.apache.org/jira/browse/TIKA-2787
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 1.19.1
> Reporter: Dmitry Goldenberg
> Priority: Major
> Fix For: 2.0.0
>
>
> The point of setting a limit on text extraction is to get back up to N
> extracted characters. We just got tripped up by the fact that Tika throws an
> exception once the limit has been reached.
> This, in and of itself, is not a major hindrance, especially since the error
> message itself clearly states that the extracted text is, "however,
> available".
> OK, but why is WriteLimitReachedException private? Why not make it public so
> it can be caught explicitly when the parse() method is called? And why not add
> it to the signature of the parse method? I don't think it should extend
> SAXException, either; just cleanly throw it as is.
> Right now, our code has to work around the condition in this cumbersome way:
> {code:java}
> ContentHandler handler = new BodyContentHandler(limit); // <-- e.g. set to 1000000
> try {
>     parser.parse(dataStream, handler, metadata, parseCtx);
> } catch (IOException | TikaException ex) {
>     throw ex;
> } catch (SAXException ex) {
>     String message = (ex.getMessage() == null) ? "" : ex.getMessage();
>     if (!message.contains("Your document contained more than")) {
>         throw new TikaException("Tika error has occurred.", ex);
>     } else {
>         log.warn("TE limit reached on file {}.", filePath);
>     }
> }
> // Keep the extracted text regardless of WriteLimitReachedException
> String text = handler.toString();
> {code}
>
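
The message check in the workaround above could be pulled into a small helper so the catch block stays readable. This is only a sketch: WriteLimitCheck, LIMIT_MARKER and MAX_DEPTH are hypothetical names, not part of Tika, and the message fragment is taken from the snippet in the issue, so the check would break if that wording ever changed. It relies only on org.xml.sax.SAXException from the JDK.

```java
import org.xml.sax.SAXException;

/**
 * Hypothetical helper (not part of Tika): decides whether a failure was
 * caused by the write limit, based on the message text quoted in the issue.
 */
final class WriteLimitCheck {

    // Message fragment copied from the workaround in the issue description.
    static final String LIMIT_MARKER = "Your document contained more than";

    // Upper bound on the cause chain walk, as a guard against cyclic causes.
    private static final int MAX_DEPTH = 100;

    private WriteLimitCheck() {
    }

    /**
     * Returns true if the throwable, or any of its causes, is a SAXException
     * whose message contains the write-limit marker.
     */
    static boolean isWriteLimitReached(Throwable t) {
        for (int depth = 0; t != null && depth < MAX_DEPTH; depth++) {
            if (t instanceof SAXException) {
                String message = t.getMessage();
                if (message != null && message.contains(LIMIT_MARKER)) {
                    return true;
                }
            }
            t = t.getCause();
        }
        return false;
    }
}
```

With such a helper, the SAXException branch of the workaround would reduce to a single test: rethrow as a TikaException when WriteLimitCheck.isWriteLimitReached(ex) is false, otherwise log the warning and keep the text from handler.toString(). Walking the cause chain also covers the case where the limit exception arrives wrapped in another exception.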