Hi Markus,
>> And I do not agree with it. Almost all content is compressed now, so this
>> will never work. We need the headers and response code stored for WARC
>> export and do not care about an incorrect length header.
No, please don't do this. You need to rewrite the headers. There are many WARC
readers which simply fail in the following situations:
- the HTTP Content-Length header does not match the length of the WARC payload
- there is a Content-Encoding or Transfer-Encoding header but the payload is
  not actually stored with these encodings applied
A fully functional WARC parser would need to understand chunked transfer
encoding as well as the gzip, deflate and brotli content encodings. Many WARC
parsers do not: they either fail or pass the chunked or encoded content on to
the user unchanged.
That's why at Common Crawl we store the payload with all HTTP-level
encodings removed. With Nutch it would also be difficult to store the
HTTP stream completely unmodified.
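Roughly, this is what "encodings removed" means. A minimal sketch in plain
Java (not our actual code; it assumes the raw body bytes and the
Content-Encoding value are already available, that chunked transfer encoding
has been removed earlier, and brotli would need an external decoder):

  import java.io.ByteArrayInputStream;
  import java.io.IOException;
  import java.io.InputStream;
  import java.util.zip.GZIPInputStream;
  import java.util.zip.InflaterInputStream;

  class PayloadDecoder {
    // Remove the HTTP content encoding before the payload goes to the WARC writer.
    static byte[] decodePayload(byte[] raw, String contentEncoding) throws IOException {
      if (contentEncoding == null) {
        return raw;
      }
      InputStream in;
      switch (contentEncoding.trim().toLowerCase()) {
        case "gzip":
        case "x-gzip":
          in = new GZIPInputStream(new ByteArrayInputStream(raw));
          break;
        case "deflate":
          in = new InflaterInputStream(new ByteArrayInputStream(raw));
          break;
        default:
          // "identity", "br" (needs an external decoder) or unknown: store as-is
          return raw;
      }
      return in.readAllBytes();
    }
  }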
The original headers are preserved as "X-Crawler-Content-Length",
"X-Crawler-Content-Encoding" and "X-Crawler-Transfer-Encoding", and the
"Content-Length" header is rewritten to the length of the decoded
(and possibly truncated) payload.
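Again only a sketch, not the real code, with a plain map standing in for
Nutch's header/metadata object:

  import java.util.Map;

  class HeaderRewriter {
    // Keep the original values under X-Crawler-* names and make Content-Length
    // match the payload bytes that are actually stored in the WARC record.
    static void rewriteEncodingHeaders(Map<String, String> headers, int storedPayloadLength) {
      moveHeader(headers, "Content-Length", "X-Crawler-Content-Length");
      moveHeader(headers, "Content-Encoding", "X-Crawler-Content-Encoding");
      moveHeader(headers, "Transfer-Encoding", "X-Crawler-Transfer-Encoding");
      // Content-Length now reflects the decoded (and possibly truncated) payload
      headers.put("Content-Length", Integer.toString(storedPayloadLength));
    }

    static void moveHeader(Map<String, String> headers, String from, String to) {
      String value = headers.remove(from);
      if (value != null) {
        headers.put(to, value);
      }
    }
  }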
If you need more information and pointers about this, please ping me; I've
discussed it with other web archiving people. Also, WARC validators (each of
[1], [2] and [3] ships one) will complain about invalid HTTP header values if
the payload is decoded but the headers are not rewritten.
>> I don't see okhttp having the same condition.
We use protocol-okhttp: in current Nutch master it has much better support
for WARC writing and, more importantly, supports HTTP/2.
Common Crawl also uses a custom WARC writer [4,5].
Unfortunately, neither WarcExporter nor CommonCrawlDataDumper can be used,
because they
- cannot write gzip-compressed WARC files
  Note: writing WARCs with
    -Dmapreduce.output.fileoutputformat.compress=true
    -Dmapreduce.output.fileoutputformat.compress.codec=gzip
  results in invalid WARC files, because the output is compressed as a whole
  and not per record (see the sketch after this list)
- have issues with the headers, see above
- do not write WARC request records (NUTCH-2255)
- do not write WARC digests
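The point about per-record compression: a .warc.gz has to be a concatenation
of independent gzip members, one per WARC record. A minimal sketch with plain
java.util.zip (class and method names are made up, this is not WarcExporter
code):

  import java.io.ByteArrayOutputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.io.OutputStream;
  import java.util.zip.GZIPOutputStream;

  class PerRecordGzipWarcOutput implements AutoCloseable {
    private final OutputStream out;

    PerRecordGzipWarcOutput(String path) throws IOException {
      this.out = new FileOutputStream(path);
    }

    // serializedRecord = WARC header + payload + trailing CRLF CRLF,
    // compressed as one independent gzip member appended to the file
    void writeRecord(byte[] serializedRecord) throws IOException {
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
        gz.write(serializedRecord);
      }
      out.write(buf.toByteArray());
    }

    @Override
    public void close() throws IOException {
      out.close();
    }
  }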
For a long time I've hoped to find the time to write a clean and lean WARC
writer based on jwarc [2], which writes perfect WARC files, combined with the
flexibility of WarcExporter.
Best,
Sebastian
[1] https://pypi.org/project/warcio/
[2] https://github.com/iipc/jwarc
[3] https://resiliparse.chatnoir.eu/en/latest/api/fastwarc.html
[4] https://github.com/commoncrawl/nutch/
[5] https://github.com/commoncrawl/nutch/tree/cc/src/java/org/commoncrawl/util
On 7/31/24 10:15, Markus Jelsma wrote:
Aah, thanks Lewis. We're still on 1.15, glad to see this was fixed already,
and that I would have patched it in exactly the same way.
Thanks!
On Tue, 30 Jul 2024 at 18:42, lewis john mcgibbney <[email protected]>
wrote:
Hi Markus,
Which version of Nutch are you referring to? I'm not seeing this exact
code in master branch.
Is this roughly the code you are referencing?
https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L304-L318
Thanks
lewismc
On Tue, Jul 30, 2024 at 8:14 AM <[email protected]> wrote:
---------- Forwarded message ----------
From: Markus Jelsma <[email protected]>
To: user <[email protected]>
Date: Tue, 30 Jul 2024 17:13:01 +0200
Subject: Protocol-http not storing response headers
Hi,
Protocol-http does this (not storing HTTP response headers if the response is
compressed):
    // store the headers verbatim only if the response was not compressed
    // as the content length reported does not match otherwise
    if (httpHeaders != null) {
      headers.add(Response.RESPONSE_HEADERS, httpHeaders.toString());
    }

    if (Http.LOG.isTraceEnabled()) {
      Http.LOG.trace("fetched " + content.length + " bytes from " + url);
    }
And I do not agree with it. Almost all content is compressed now, so this
will never work. We need the headers and response code stored for WARC
export and do not care about an incorrect length header.
Before patching this up and breaking that code out of the compression
condition, I ask myself: is that a good idea? I don't see okhttp having
the same condition.
Markus
--
http://people.apache.org/keys/committer/lewismc