Hi Markus,
>> And I do not agree with it. Almost all content is compressed now, so this
>> will never work. We need the headers and response code stored for WARC
>> export and do not care about an incorrect length header.
No, please don't do this. You need to rewrite the headers. There are many WARC
readers which simply fail in the following situations:
- the HTTP Content-Length header does not match the length of the WARC payload
- there is a Content-Encoding or Transfer-Encoding header but the payload is
  not actually stored with these encodings applied
A fully functional WARC parser would need to understand chunked transfer
encoding as well as the gzip, deflate and brotli content encodings. Many WARC
parsers do not: they either fail or pass the chunked or encoded content on to
the user unchanged.
That's why at Common Crawl we store the payload with all HTTP-level
encodings removed. With Nutch it would also be difficult to store the
HTTP stream completely unmodified.
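Roughly, this is what "encodings removed" means. A minimal sketch in plain
Java (not our actual code; it assumes the raw body bytes and the
Content-Encoding value are already available, that chunked transfer encoding
has been removed earlier, and brotli would need an external decoder):

  import java.io.ByteArrayInputStream;
  import java.io.IOException;
  import java.io.InputStream;
  import java.util.zip.GZIPInputStream;
  import java.util.zip.InflaterInputStream;

  class PayloadDecoder {
    // Remove the HTTP content encoding before the payload goes to the WARC writer.
    static byte[] decodePayload(byte[] raw, String contentEncoding) throws IOException {
      if (contentEncoding == null) {
        return raw;
      }
      InputStream in;
      switch (contentEncoding.trim().toLowerCase()) {
        case "gzip":
        case "x-gzip":
          in = new GZIPInputStream(new ByteArrayInputStream(raw));
          break;
        case "deflate":
          in = new InflaterInputStream(new ByteArrayInputStream(raw));
          break;
        default:
          // "identity", "br" (needs an external decoder) or unknown: store as-is
          return raw;
      }
      return in.readAllBytes();
    }
  }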
The original headers are preserved as "X-Crawler-Content-Length",
"X-Crawler-Content-Encoding" and "X-Crawler-Transfer-Encoding", and the
"Content-Length" header is rewritten to the length of the decoded
(and possibly truncated) payload.
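Again only a sketch, not the real code, with a plain map standing in for
Nutch's header/metadata object:

  import java.util.Map;

  class HeaderRewriter {
    // Keep the original values under X-Crawler-* names and make Content-Length
    // match the payload bytes that are actually stored in the WARC record.
    static void rewriteEncodingHeaders(Map<String, String> headers, int storedPayloadLength) {
      moveHeader(headers, "Content-Length", "X-Crawler-Content-Length");
      moveHeader(headers, "Content-Encoding", "X-Crawler-Content-Encoding");
      moveHeader(headers, "Transfer-Encoding", "X-Crawler-Transfer-Encoding");
      // Content-Length now reflects the decoded (and possibly truncated) payload
      headers.put("Content-Length", Integer.toString(storedPayloadLength));
    }

    static void moveHeader(Map<String, String> headers, String from, String to) {
      String value = headers.remove(from);
      if (value != null) {
        headers.put(to, value);
      }
    }
  }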
If you need more information and pointers about this, please ping me; I've
discussed it with other web archiving people. Also, WARC validators (each of
[1], [2] and [3] ships one) will complain about invalid HTTP header values if
the payload is decoded but the headers are not rewritten.
>> I don't see okhttp having the same condition.
We use protocol-okhttp: in current Nutch master it has much better support
for WARC writing and, more importantly, supports HTTP/2.
Common Crawl also uses a custom WARC writer [4,5].
Unfortunately, neither WarcExporter nor CommonCrawlDataDumper can be used,
because they
- cannot write gzip-compressed WARC files
  Note: writing WARCs with
    -Dmapreduce.output.fileoutputformat.compress=true
    -Dmapreduce.output.fileoutputformat.compress.codec=gzip
  results in invalid WARC files, because the output is compressed as a whole
  and not per record (see the sketch after this list)
- have issues with the headers, see above
- do not write WARC request records (NUTCH-2255)
- do not write WARC digests
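The point about per-record compression: a .warc.gz has to be a concatenation
of independent gzip members, one per WARC record. A minimal sketch with plain
java.util.zip (class and method names are made up, this is not WarcExporter
code):

  import java.io.ByteArrayOutputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.io.OutputStream;
  import java.util.zip.GZIPOutputStream;

  class PerRecordGzipWarcOutput implements AutoCloseable {
    private final OutputStream out;

    PerRecordGzipWarcOutput(String path) throws IOException {
      this.out = new FileOutputStream(path);
    }

    // serializedRecord = WARC header + payload + trailing CRLF CRLF,
    // compressed as one independent gzip member appended to the file
    void writeRecord(byte[] serializedRecord) throws IOException {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
        gz.write(serializedRecord);
      }
      out.write(buf.toByteArray());
    }

    @Override
    public void close() throws IOException {
      out.close();
    }
  }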
For a long time I've hoped to find the time to write a clean and lean WARC
writer based on jwarc [2], which writes perfect WARC files, combined with the
flexibility of WarcExporter.
Best,
Sebastian
[1] https://pypi.org/project/warcio/
[2] https://github.com/iipc/jwarc
[3] https://resiliparse.chatnoir.eu/en/latest/api/fastwarc.html
[4] https://github.com/commoncrawl/nutch/
[5] https://github.com/commoncrawl/nutch/tree/cc/src/java/org/commoncrawl/util
On 7/31/24 10:15, Markus Jelsma wrote:
Aah, thanks Lewis. We're still on 1.15, glad to see this was fixed already,
and that I would have patched it in exactly the same way.
Thanks!
On Tue, 30 Jul 2024 at 18:42, lewis john mcgibbney <[email protected]>
wrote:
Hi Markus,
Which version of Nutch are you referring to? I'm not seeing this exact
code in master branch.
Is this roughly the code you are referencing?
https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L304-L318
Thanks
lewismc
On Tue, Jul 30, 2024 at 8:14 AM <[email protected]> wrote:
---------- Forwarded message ----------
From: Markus Jelsma <[email protected]>
To: user <[email protected]>
Date: Tue, 30 Jul 2024 17:13:01 +0200
Subject: Protocol-http not storing response headers
Hi,
Protocol-http does this (not storing HTTP response headers if the response is
compressed):
    // store the headers verbatim only if the response was not compressed
    // as the content length reported does not match otherwise
    if (httpHeaders != null) {
      headers.add(Response.RESPONSE_HEADERS, httpHeaders.toString());
    }

    if (Http.LOG.isTraceEnabled()) {
      Http.LOG.trace("fetched " + content.length + " bytes from " + url);
    }
And I do not agree with it. Almost all content is compressed now, so this
will never work. We need the headers and response code stored for WARC
export and do not care about an incorrect length header.
Before patching this up and breaking that code out of the compression
condition, I ask myself: is that a good idea? I don't see okhttp having
the same condition.
Markus
--
http://people.apache.org/keys/committer/lewismc