Hi Tim,

>> I'm using the okhttp protocol, because I don't think the http protocol
>> stores truncation information.

However, protocol-http could mark truncations as well. Please also open an issue for this and the other protocol plugins.


>> Should I open a ticket to have ParseSegment also check for okhttp's header (
>> http.content.truncated=true)?

Yes, please.
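
Something along these lines might do it (just a sketch, not a patch against the actual ParseSegment code; it assumes the metadata key is literally "http.content.truncated" as in your mail and keeps the Content-Length comparison as a fallback):

  import org.apache.nutch.metadata.Metadata;
  import org.apache.nutch.protocol.Content;

  public class TruncationCheck {

    /** Sketch: treat content as truncated if protocol-okhttp flagged it,
     *  otherwise fall back to comparing fetched bytes to Content-Length. */
    public static boolean isTruncated(Content content) {
      Metadata meta = content.getMetadata();
      // flag written by protocol-okhttp (key name taken from this thread)
      if ("true".equalsIgnoreCase(meta.get("http.content.truncated"))) {
        return true;
      }
      String lengthStr = meta.get("Content-Length");
      if (lengthStr == null) {
        // no header and no flag: assume the content is complete
        return false;
      }
      try {
        return content.getContent().length < Long.parseLong(lengthStr.trim());
      } catch (NumberFormatException e) {
        return false;
      }
    }
  }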

> One workaround to ignore parse exceptions (at least in the Tika parser):
> https://github.com/tballison/nutch/tree/ignore-parse-exception

One potential improvement: we could still parse MIME types which remain parseable when truncated, most importantly HTML pages.
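
Roughly along these lines; note that the whitelist property "parser.skip.truncated.exempt.mimetypes" below is invented for illustration, it is not an existing Nutch option:

  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;

  /** Sketch: honor parser.skip.truncated, except for MIME types that
   *  can still be parsed when cut off (e.g. HTML). */
  public class TruncatedSkipPolicy {

    private final boolean skipTruncated;
    private final Set<String> exemptTypes;

    public TruncatedSkipPolicy(boolean skipTruncated, String exemptList) {
      this.skipTruncated = skipTruncated;
      // e.g. "text/html,application/xhtml+xml"
      this.exemptTypes =
          new HashSet<>(Arrays.asList(exemptList.split("\\s*,\\s*")));
    }

    /** Returns true if a truncated document should still be parsed. */
    public boolean parseAnyway(String mimeType) {
      return !skipTruncated || exemptTypes.contains(mimeType);
    }
  }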


>> If I understand correctly, ParseSegment is checking for truncation, but it
>> requires a Content-Length header to work. In my case, there is no
>> Content-Length header, so it assumes the file is not truncated.

With chunked Transfer-Encoding, there is usually no Content-Length header.

Even if there is a Content-Length header, it indicates the compressed length when the HTTP Content-Encoding is "gzip", "deflate", or "br" (brotli).
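
To illustrate the mismatch, a small self-contained example (plain java.util.zip, nothing Nutch-specific): the header reports the size on the wire, while the parser sees the decompressed payload.

  import java.io.ByteArrayOutputStream;
  import java.nio.charset.StandardCharsets;
  import java.util.zip.GZIPOutputStream;

  public class GzipLengthDemo {
    public static void main(String[] args) throws Exception {
      byte[] payload = "<html>".repeat(10_000).getBytes(StandardCharsets.UTF_8);
      ByteArrayOutputStream wire = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(wire)) {
        gz.write(payload);
      }
      // With Content-Encoding: gzip, Content-Length would be wire.size(),
      // while the decompressed content handed to the parser is
      // payload.length bytes, so the two numbers cannot be compared
      // to detect truncation.
      System.out.println("Content-Length (compressed): " + wire.size());
      System.out.println("decompressed payload:        " + payload.length);
    }
  }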


>> Is there a way to index files even if they are truncated or if there is a
>> parse exception?
>>
>> If indexing is a bridge too far, what's the most efficient way to dump a
>> list of urls that are truncated and/or had a parse exception?

Let me think about it...
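
Off the top of my head, one low-tech option: read the segment content directly and print every URL whose metadata carries the okhttp flag. A rough, untested sketch; it assumes the usual 1.x segment layout (content/part-*/data as a SequenceFile of Text keys and Content values) and the metadata key "http.content.truncated" from your mail:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.protocol.Content;

  public class ListTruncatedUrls {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // e.g. crawl/segments/2023.../content/part-00000/data
      Path data = new Path(args[0]);
      try (SequenceFile.Reader reader =
               new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
        Text url = new Text();
        Content content = new Content();
        while (reader.next(url, content)) {
          if ("true".equalsIgnoreCase(
              content.getMetadata().get("http.content.truncated"))) {
            System.out.println(url);
          }
        }
      }
    }
  }

Alternatively, bin/nutch readseg -dump on the segment and grepping the dump for the truncation flag should get you the same list, just with more I/O.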


Best,
Sebastian


On 10/18/23 18:16, Tim Allison wrote:
One workaround to ignore parse exceptions (at least in the Tika parser):
https://github.com/tballison/nutch/tree/ignore-parse-exception

Proposed fix for truncation checking:
https://github.com/tballison/nutch/tree/okhttp-truncated

On 2023/10/18 14:28:45 Tim Allison wrote:
I'm trying to configure Nutch to index pages/files that are truncated (in
addition to the successful non-truncated files).

I'm using the okhttp protocol, because I don't think the http protocol
stores truncation information.

I'm using parse-tika, and "parser.skip.truncated" is left at its default (true).

The particular PDF that I'm experimenting with is returned chunked with gzip compression. There is no Content-Length header in the response.

For this PDF, okhttp correctly marks it as truncated, but then the file is
sent to parse-tika, which throws a parse exception. The file is then not
sent to the index.

If I understand correctly, ParseSegment is checking for truncation, but it
requires a Content-Length header to work. In my case, there is no
Content-Length header, so it assumes the file is not truncated.

Should I open a ticket to have ParseSegment also check for okhttp's header (
http.content.truncated=true)?

Is there a way to index files even if they are truncated or if there is a
parse exception?

If indexing is a bridge too far, what's the most efficient way to dump a
list of urls that are truncated and/or had a parse exception?

Thank you!

Best,

       Tim
