Hi Sawood, hi Andy,

> Should you use X-Crawler-Orig-*

Unfortunately, X-Crawler-* is already used and documented as the prefix in
our latest monthly crawl (August). I would rather not change it again
unless there is a common standard I could refer to.
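
To illustrate the kind of rewriting under discussion, here is a minimal
sketch in Python. It is not the crawler's actual code: the function name
`rewrite_headers`, the constants, and the choice to append a recomputed
Content-Length at the end are assumptions made for the example. Only the
X-Crawler- prefix and the three affected header names come from this thread.

```python
# Hypothetical sketch, not taken from Nutch or Common Crawl code.
PREFIX = "X-Crawler-"  # prefix mentioned in this thread
REWRITTEN = {"content-length", "content-encoding", "transfer-encoding"}


def rewrite_headers(headers, payload_length):
    """Return a new header list for a payload stored unchunked and uncompressed.

    headers        -- list of (name, value) tuples as received from the server
    payload_length -- length in bytes of the decoded payload that is stored
    """
    out = []
    for name, value in headers:
        if name.lower() in REWRITTEN:
            # Keep the original value, but under the prefixed name so that
            # WARC processors do not try to dechunk or decompress again.
            out.append((PREFIX + name, value))
        else:
            out.append((name, value))
    # Assumption: the stored record gets a Content-Length matching the
    # decoded payload.
    out.append(("Content-Length", str(payload_length)))
    return out
```

With this sketch, a "Content-Encoding: gzip" header would be kept as
"X-Crawler-Content-Encoding: gzip", while headers that are not affected
pass through unchanged.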

Thanks for the pointer regarding HTTP/2. Our crawler now supports it 
(Nutch, using okhttp). It's not used in production, but I already plan to 
benchmark it and will try to implement the proposal.

Best,
Sebastian

On Wednesday, August 29, 2018 at 11:09:47 AM UTC+2, andrew.jackson wrote:
>
> Hi Sebastian,
>
> There's been some discussion of issues related to this recently, 
> particularly in the context of supporting HTTP/2. See 
> https://github.com/iipc/warc-specifications/issues/42 for details.
>
> That said, I'm pretty sure that various tools have been flattening 
> HTTP/1.1-style chunked encodings to unchunked and/or removing compression 
> for some time. It's permitted by the WARC spec. to do so without noting it 
> was done, as these encodings are semantically equivalent. The use of an 
> `X-Crawler-[Orig-]` prefix seems perfectly reasonable (indeed an improvement on 
> current practice), at least until we come up with an official standardised 
> way of noting these kinds of manipulations.
>
> Best,
> Andy Jackson
> UK Web Archive
>
> On Wednesday, 22 August 2018 10:22:05 UTC+1, Sebastian Nagel wrote:
>>
>> Common Crawl stores the payload uncompressed and unchunked to make 
>> processing of WARC files easier for users on various platforms and 
>> programming languages. To avoid potential errors in WARC processors, this 
>> requires removing or changing the HTTP headers "Content-Length", 
>> "Content-Encoding" and "Transfer-Encoding".
>>
>> Now I want to preserve the original headers in a safe but transparent 
>> way. Is the preservation of original and rewritten HTTP headers in WARC 
>> files a valid use case for the X-Archive-Orig- header prefix, or is it 
>> intended only for Wayback machines when serving captures over HTTP?
>>
>> Thanks,
>> Sebastian
>>
>>
