[ https://issues.apache.org/jira/browse/TIKA-1436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15214082#comment-15214082 ]
Stefano Fornari commented on TIKA-1436:
---------------------------------------

Sorry, this took much longer than I expected. Here is a new patch: I realized I had not created the previous one correctly, and I have added some more test code. The patch is made against the master HEAD from git.

Regarding your concern here:

"I'm looking at the raw patch now (not applied), and I'm a bit concerned that there is special handling for catching and swallowing a WriteLimitReached within the PDFParser. I may be misunderstanding your proposal, but the nice thing about the exception was that it put the burden/opportunity on the client to handle it, and we didn't have to add catch blocks to every parser (this point was already made by Jukka)."

There are two main reasons:

1. The limit is enforced in the ContentHandler, and the Parser is the client of that functionality, so it is the Parser that should handle the condition.
2. The condition is handled because it is expected: we want parsing to succeed when the limit is reached, so that the content read so far can still be processed.

That said, I am open to exploring a different approach if anyone can suggest a better way.

> improvement to PDFParser
> ------------------------
>
> Key: TIKA-1436
> URL: https://issues.apache.org/jira/browse/TIKA-1436
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Stefano Fornari
> Labels: parser, pdf
> Fix For: 1.13
>
> Attachments: ste-20140927.patch
>
>
> With regard to the thread "[PDFParser] - read limited number of characters"
> on Mar 29, I would like to propose the attached patch. I noticed that in Tika
> 1.6 there has been some work on better handling of the
> WriteLimitReachedException condition, but I believe it could be improved
> further.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)