Hi Sawood, hi Andy,

> Should you use X-Crawler-Orig-*

Unfortunately, X-Crawler-* is already used and documented as the prefix in our latest monthly crawl (August). I do not want to change it again unless there is a common standard I could refer to.

Thanks for the pointer regarding HTTP/2. Our crawler now supports it (Nutch, using okhttp). It's not used in production, but I already plan to benchmark it and will try to implement the proposal.

Best,
Sebastian

On Wednesday, August 29, 2018 at 11:09:47 AM UTC+2, andrew.jackson wrote:
>
> Hi Sebastian,
>
> There's been some discussion of issues related to this recently,
> particularly in the context of supporting HTTP/2. See
> https://github.com/iipc/warc-specifications/issues/42 for details.
>
> That said, I'm pretty sure that various tools have been flattening
> HTTP/1.1-style chunked encodings to unchunked and/or removing compression
> for some time. It's permitted by the WARC spec. to do so without noting it
> was done, as these encodings are semantically equivalent. The use of an
> `X-Crawler-[Orig-]` prefix seems perfectly reasonable (indeed an improvement
> on current practice), at least until we come up with an official
> standardised way of noting these kinds of manipulations.
>
> Best,
> Andy Jackson
> UK Web Archive
>
> On Wednesday, 22 August 2018 10:22:05 UTC+1, Sebastian Nagel wrote:
>>
>> Common Crawl stores the payload uncompressed and unchunked to ease
>> processing of WARC files by users on various platforms and programming
>> languages. To avoid potential errors in WARC processors, this requires
>> removing or changing the HTTP headers "Content-Length",
>> "Content-Encoding" and "Transfer-Encoding".
>>
>> Now I want to preserve the original headers in a safe but transparent
>> way. Is the preservation of original and rewritten HTTP headers in WARC
>> files a valid use case for the X-Archive-Orig- header prefix, or is it
>> intended only for wayback machines when serving captures over HTTP?
>>
>> Thanks,
>> Sebastian
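
For readers following the thread, the header rewriting discussed above can be sketched roughly as follows. This is only an illustration, not Common Crawl's or Nutch's actual code: the function name `rewrite_headers` and the exact layout of the output are assumptions, and only the `X-Crawler-` prefix and the three affected header names come from the messages above.

```python
# Headers whose original values no longer describe the stored payload
# once it has been unchunked and decompressed (named in the thread).
REWRITTEN = {"content-length", "content-encoding", "transfer-encoding"}

# Prefix mentioned in the thread; how exactly it is applied is assumed here.
PREFIX = "X-Crawler-"


def rewrite_headers(headers, payload_length):
    """Return a new header list for a decoded payload (hypothetical helper).

    headers: list of (name, value) pairs as received from the server.
    payload_length: size in bytes of the payload as actually stored.
    """
    out = []
    for name, value in headers:
        if name.lower() in REWRITTEN:
            # Preserve the original value under the prefixed name.
            out.append((PREFIX + name, value))
        else:
            out.append((name, value))
    # Add a Content-Length that matches the stored (decoded) payload.
    out.append(("Content-Length", str(payload_length)))
    return out


if __name__ == "__main__":
    original = [
        ("Content-Type", "text/html"),
        ("Content-Encoding", "gzip"),
        ("Transfer-Encoding", "chunked"),
    ]
    for name, value in rewrite_headers(original, 5120):
        print(f"{name}: {value}")
```

With this approach a WARC reader sees only headers that match the stored bytes, while the values sent by the server remain recoverable from the prefixed copies.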
