One workaround to ignore parse exceptions (at least in the Tika parser): https://github.com/tballison/nutch/tree/ignore-parse-exception
Proposed fix for truncation checking: https://github.com/tballison/nutch/tree/okhttp-truncated

On 2023/10/18 14:28:45 Tim Allison wrote:
> I'm trying to configure Nutch to index pages/files that are truncated (in
> addition to the successful non-truncated files).
>
> I'm using the okhttp protocol, because I don't think the http protocol
> stores truncation information.
>
> I'm using parse-tika, and the "parser.skip.truncated" is set to
> default=true.
>
> The particular PDF that I'm experimenting with is returned chunked with gz
> compression. There is no length header in the response.
>
> For this PDF, okhttp correctly marks it as truncated, but then the file is
> sent to parsetika, which throws a parse exception. The file is then not
> sent to the index.
>
> If I understand correctly, ParseSegment is checking for truncation, but it
> requires a Content-Length header to work. In my case, there is no
> Content-Length header, so it assumes the file is not truncated.
>
> Should I open a ticket to have ParseSegment also check for okhttp's header
> (http.content.truncated=true)?
>
> Is there a way to index files even if they are truncated or if there is a
> parse exception?
>
> If indexing is a bridge too far, what's the most efficient way to dump a
> list of urls that are truncated and/or had a parse exception?
>
> Thank you!
>
> Best,
>
> Tim
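
For illustration only, here is a minimal sketch of the combined check discussed above: treat a document as truncated if the okhttp flag is set, and otherwise fall back to comparing against Content-Length when that header is present. This is not the code in the linked branch and does not use Nutch's classes; the class name `TruncationCheck`, the method `isTruncated`, and the use of a plain `Map<String, String>` for response metadata are all assumptions made for the example. Only the key `http.content.truncated` and the Content-Length header come from the thread.

```java
import java.util.Map;

public class TruncationCheck {
    // Metadata key named in the thread; assumed here to hold the string "true".
    public static final String TRUNCATED_FLAG = "http.content.truncated";
    public static final String CONTENT_LENGTH = "Content-Length";

    /**
     * Hypothetical helper: returns true if the metadata carries the truncation
     * flag, or if a Content-Length header exists and exceeds the bytes received.
     */
    public static boolean isTruncated(Map<String, String> metadata, int actualLength) {
        // 1. Trust the protocol-level flag first; it works without a length header.
        if ("true".equalsIgnoreCase(metadata.get(TRUNCATED_FLAG))) {
            return true;
        }
        // 2. Fall back to the length comparison when the header is available.
        String declared = metadata.get(CONTENT_LENGTH);
        if (declared == null) {
            return false; // no header and no flag: nothing indicates truncation
        }
        try {
            return actualLength < Integer.parseInt(declared.trim());
        } catch (NumberFormatException e) {
            return false; // unparseable header: treat as unknown, not truncated
        }
    }

    public static void main(String[] args) {
        // Chunked response without Content-Length, flagged by the protocol layer.
        System.out.println(isTruncated(Map.of(TRUNCATED_FLAG, "true"), 1024)); // true
        // No flag, but fewer bytes received than declared.
        System.out.println(isTruncated(Map.of(CONTENT_LENGTH, "2048"), 1024)); // true
        // Neither flag nor header.
        System.out.println(isTruncated(Map.of(), 1024)); // false
    }
}
```

Checking the flag before the header is what covers the chunked case from the original question, where no Content-Length is sent at all.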