Sebastian,

> I've chosen X-Crawler- for now but am happy to change it to any commonly adopted prefix.

Should you use X-Crawler-Orig-* instead until the WARC community agrees on something? To those who do not know about it, X-Crawler-* alone might not be as obvious.

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529

On Fri, Aug 24, 2018 at 8:04 AM Sebastian Nagel <[email protected]> wrote:

> Hi Sawood,
>
> > X-Archive-Orig-X-Archive-Orig-*
>
> Thanks, that would be the telling argument why not to use "X-Archive-Orig-"!
>
> > We can perhaps think of a different and more descriptive naming for situations like this (for
> > example, X-Capture-Orig-* or X-WARC-Transform-Orig-*).
>
> I've chosen X-Crawler- for now but am happy to change it to any commonly adopted prefix.
>
> > WARC files allow a means to store variants of a record and associate them, so in theory it is
> > possible to have one record in the original form and another one that is canonicalized from it.
> > However, this will lose the deduplication benefit.
>
> It would also be quite costly to store the content twice.
>
> > IIPC's slack has a #warc channel where we discuss the WARC specifications. We had some
> > conversation around HTTP2 recently,
>
> Thanks for the pointer, I'll join it soon (but I'll be off the next two weeks).
>
> Best and thanks,
> Sebastian
>
> On 08/23/2018 01:08 AM, Sawood Alam wrote:
> > Hi Sebastian,
> >
> > In IPWB (https://github.com/oduwsdl/ipwb) we perform dechunking of the chunked responses before
> > pushing them to the IPFS (a content-addressable file system) to leverage deduplication benefits.
> > When we do that, we make necessary modifications in corresponding headers to make sure the archival
> > replay behaves properly. However, in our case we are throwing the original information away as we
> > are not using WARCs for replay, but ingesting WARC records to populate IPFS for replay. If we can
> > agree on a way to preserve this information, we will be happy to adopt it in IPWB.
> >
> > As far as I know X-Archive-Orig-* headers are replay specific. They do not exist in WARC records,
> > but are added later at replay time. While there is nothing stopping us from utilizing that style of
> > headers, I wonder whether current archival replay systems (such as OpenWayback and PyWB) will relay
> > them as X-Archive-Orig-X-Archive-Orig-* headers instead (I might be wrong here). We can perhaps
> > think of a different and more descriptive naming for situations like this (for example,
> > X-Capture-Orig-* or X-WARC-Transform-Orig-*).
> >
> > WARC files allow a means to store variants of a record and associate them, so in theory it is
> > possible to have one record in the original form and another one that is canonicalized from it.
> > However, this will lose the deduplication benefit.
> >
> > IIPC's slack has a #warc channel where we discuss the WARC specifications. We had some
> > conversation around HTTP2 recently: should we preserve the original binary bytes as they arrive or
> > the post-processed data? I think you might want to discuss this transformation matter there and see
> > what others have to say about it.
> >
> > Best,
> >
> > --
> > Sawood Alam
> > Department of Computer Science
> > Old Dominion University
> > Norfolk VA 23529
> >
> > On Wed, Aug 22, 2018 at 5:22 AM Sebastian Nagel <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     Common Crawl stores the payload uncompressed and unchunked to ease processing of WARC files
> >     by users on various platforms and programming languages. To avoid potential errors in WARC
> >     processors this requires removing or changing the HTTP headers "Content-Length",
> >     "Content-Encoding" and "Transfer-Encoding".
> >
> >     Now I want to preserve the original headers in a safe but transparent way.
> >     Is the preservation of original and rewritten HTTP headers in WARC files a valid use case
> >     for the X-Archive-Orig- header prefix, or is it intended only for wayback machines when
> >     serving captures over HTTP?
> >
> >     Thanks,
> >     Sebastian
> >
> >     --
> >     You received this message because you are subscribed to the Google Groups "openwayback-dev" group.
> >     To unsubscribe from this group and stop receiving emails from it, send an email to
> >     [email protected] <mailto:[email protected]>.
> >     Visit this group at https://groups.google.com/group/openwayback-dev.
> >     To view this discussion on the web visit
> >     https://groups.google.com/d/msgid/openwayback-dev/371f84db-0486-4f7d-860a-a37d7d3c06df%40googlegroups.com
> >     <https://groups.google.com/d/msgid/openwayback-dev/371f84db-0486-4f7d-860a-a37d7d3c06df%40googlegroups.com?utm_medium=email&utm_source=footer>.
> >     For more options, visit https://groups.google.com/d/optout.
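
For readers following the thread, the header rewriting under discussion can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not Common Crawl's or IPWB's actual code: the function name `preserve_original_headers`, its parameters, and the sample header values are hypothetical, and `X-Crawler-` is only the provisional prefix mentioned in the thread.

```python
# Provisional prefix from the thread; any agreed-upon prefix could be substituted.
ORIG_PREFIX = "X-Crawler-"

# Headers that no longer describe the payload once it is stored
# unchunked and uncompressed (the three named in the thread).
REWRITTEN = {"content-length", "content-encoding", "transfer-encoding"}


def preserve_original_headers(headers, payload_length, prefix=ORIG_PREFIX):
    """Return a new header list for a payload stored in decoded form.

    headers: list of (name, value) pairs as captured (hypothetical input shape).
    payload_length: length in bytes of the decoded payload.
    """
    result = []
    for name, value in headers:
        if name.lower() in REWRITTEN:
            # Keep the captured value under the prefixed name instead of dropping it.
            result.append((prefix + name, value))
        else:
            result.append((name, value))
    # Add a Content-Length that matches the stored (decoded) payload.
    result.append(("Content-Length", str(payload_length)))
    return result


if __name__ == "__main__":
    captured = [
        ("Content-Type", "text/html"),
        ("Content-Encoding", "gzip"),
        ("Transfer-Encoding", "chunked"),
    ]
    for name, value in preserve_original_headers(captured, payload_length=1234):
        print(f"{name}: {value}")
```

With a prefix such as `X-Crawler-Orig-`, as suggested in the reply, only the `prefix` argument would change; the original values stay recoverable by stripping the prefix.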
