Hi Sawood,

> X-Archive-Orig-X-Archive-Orig-*

Thanks, that would be the telling argument for not using "X-Archive-Orig-"!

> We can perhaps think of a different and more descriptive naming for
> situations like this (for example, X-Capture-Orig-* or
> X-WARC-Transform-Orig-*).

I've chosen X-Crawler- for now but am happy to change it to any commonly
adopted prefix.

> WARC files allow a means to store variants of a record and associate them,
> so in theory it is possible to have one record in the original form and
> another one that is canonicalized from it. However, this will lose the
> deduplication benefit.

It would also be quite costly to store the content twice.

> IIPC's Slack has a #warc channel where we discuss the WARC specifications.
> We had some conversation around HTTP2 recently,

Thanks for the pointer, I'll join it soon (but I'll be off for the next two
weeks).

Best and thanks,
Sebastian

On 08/23/2018 01:08 AM, Sawood Alam wrote:
> Hi Sebastian,
>
> In IPWB (https://github.com/oduwsdl/ipwb) we perform dechunking of the
> chunked responses before pushing them to IPFS (a content-addressable
> file system) to leverage deduplication benefits. When we do that, we make
> the necessary modifications to the corresponding headers to make sure the
> archival replay behaves properly. However, in our case we are throwing the
> original information away, as we are not using WARCs for replay but
> ingesting WARC records to populate IPFS for replay. If we can agree on a
> way to preserve this information, we will be happy to adopt it in IPWB.
>
> As far as I know, X-Archive-Orig-* headers are replay-specific. They do not
> exist in WARC records but are added later at replay time. While there is
> nothing stopping us from utilizing that style of headers, I wonder whether
> current archival replay systems (such as OpenWayback and PyWB) will relay
> them as X-Archive-Orig-X-Archive-Orig-* headers instead (I might be wrong
> here).
> We can perhaps think of a different and more descriptive naming for
> situations like this (for example, X-Capture-Orig-* or
> X-WARC-Transform-Orig-*).
>
> WARC files allow a means to store variants of a record and associate them,
> so in theory it is possible to have one record in the original form and
> another one that is canonicalized from it. However, this will lose the
> deduplication benefit.
>
> IIPC's Slack has a #warc channel where we discuss the WARC specifications.
> We had some conversation around HTTP2 recently: should we preserve the
> original binary bytes as they arrive or the post-processed data? I think
> you might want to discuss this transformation matter there and see what
> others have to say about it.
>
> Best,
>
> --
> Sawood Alam
> Department of Computer Science
> Old Dominion University
> Norfolk VA 23529
>
> On Wed, Aug 22, 2018 at 5:22 AM Sebastian Nagel <[email protected]> wrote:
>
> > Common Crawl stores the payload uncompressed and unchunked to ease
> > processing of WARC files by users on various platforms and programming
> > languages. To avoid potential errors in WARC processors, this requires
> > removing or changing the HTTP headers "Content-Length",
> > "Content-Encoding" and "Transfer-Encoding".
> >
> > Now I want to preserve the original headers in a safe but transparent
> > way. Is the preservation of original and rewritten HTTP headers in WARC
> > files a valid use case for the X-Archive-Orig- header prefix, or is it
> > intended only for wayback machines when serving captures over HTTP?
> >
> > Thanks,
> > Sebastian
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "openwayback-dev" group.
> > To unsubscribe from this group and stop receiving emails from it, send
> > an email to [email protected].
> > Visit this group at https://groups.google.com/group/openwayback-dev.
> > To view this discussion on the web visit
> > https://groups.google.com/d/msgid/openwayback-dev/371f84db-0486-4f7d-860a-a37d7d3c06df%40googlegroups.com.
> > For more options, visit https://groups.google.com/d/optout.
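
The header rewriting discussed in this thread could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `rewrite_headers`, its parameters, and the sample values are hypothetical and are not taken from the Common Crawl or IPWB code. The only things drawn from the messages above are the three affected headers and the provisional "X-Crawler-" prefix.

```python
# Hypothetical helper: names and behaviour are illustrative assumptions,
# not the actual Common Crawl or IPWB implementation.
REWRITTEN = {"content-length", "content-encoding", "transfer-encoding"}


def rewrite_headers(headers, payload_length, prefix="X-Crawler-"):
    """Return headers that describe a decoded (unchunked, uncompressed) payload.

    `headers` is a list of (name, value) pairs as received from the server.
    The original values of the rewritten headers are kept under `prefix`.
    """
    result = []
    for name, value in headers:
        if name.lower() in REWRITTEN:
            # Preserve the original value under a prefixed name so that
            # readers do not try to de-chunk or decompress the payload again.
            result.append((prefix + name, value))
        else:
            result.append((name, value))
    # The stored payload is already decoded, so state its actual length.
    result.append(("Content-Length", str(payload_length)))
    return result


original = [
    ("Content-Type", "text/html"),
    ("Content-Encoding", "gzip"),
    ("Transfer-Encoding", "chunked"),
]
rewritten = rewrite_headers(original, payload_length=1234)
```

With this sketch, the original "Content-Encoding" and "Transfer-Encoding" values remain visible in the record under the prefixed names, while the only unprefixed length or encoding header left is a "Content-Length" that matches the stored payload. Passing a different `prefix` would correspond to whichever naming the community settles on.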
