Hi Sawood,

> X-Archive-Orig-X-Archive-Orig-*

Thanks, that is a convincing argument against using "X-Archive-Orig-"!
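
To illustrate the concern, here is a minimal sketch. It assumes a replay tool that blindly prefixes every archived header; `replay_headers` is a made-up name and not OpenWayback or PyWB code, and real replay systems may well handle some headers differently.

```python
# Hypothetical sketch of a replay step that prefixes all archived headers.
# This is an assumption for illustration, not the behavior of any real tool.
REPLAY_PREFIX = "X-Archive-Orig-"


def replay_headers(archived_headers):
    """Return the (name, value) pairs with the replay prefix added to each name."""
    return [(REPLAY_PREFIX + name, value) for name, value in archived_headers]


# A header that was already stored with the prefix in the WARC record
# ends up with the prefix twice once it is replayed.
stored = [("X-Archive-Orig-Content-Encoding", "gzip")]
replayed = replay_headers(stored)
```

Under that assumption the replayed header name becomes "X-Archive-Orig-X-Archive-Orig-Content-Encoding", which is why a distinct prefix seems safer.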


> We can perhaps think of a different and more descriptive naming for situations like this
> (for example, X-Capture-Orig-* or X-WARC-Transform-Orig-*).

I've chosen X-Crawler- for now but am happy to change it to any commonly
adopted prefix.
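
A rough sketch of the header rewriting follows. It is not the actual crawler code: the function name `rewrite_headers`, the list-of-pairs representation, and the choice to append a fresh Content-Length are assumptions for illustration.

```python
# Hypothetical sketch: keep the original values of the headers that no longer
# match the stored (decoded, unchunked) payload under the X-Crawler- prefix.
PREFIX = "X-Crawler-"
REWRITTEN = {"content-length", "content-encoding", "transfer-encoding"}


def rewrite_headers(headers, payload_length):
    """headers: (name, value) pairs as received from the server.
    payload_length: size in bytes of the payload as stored in the WARC record.
    """
    result = []
    for name, value in headers:
        if name.lower() in REWRITTEN:
            # Preserve the original header under the prefixed name.
            result.append((PREFIX + name, value))
        else:
            result.append((name, value))
    # Add a Content-Length that matches the stored payload.
    result.append(("Content-Length", str(payload_length)))
    return result


original = [
    ("Content-Type", "text/html"),
    ("Content-Encoding", "gzip"),
    ("Transfer-Encoding", "chunked"),
]
stored = rewrite_headers(original, 1234)
```

With this input, the stored headers keep Content-Type unchanged, carry X-Crawler-Content-Encoding and X-Crawler-Transfer-Encoding with the original values, and end with a new Content-Length of 1234.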


> WARC files allow a means to store variants of a record and associate them, so in theory it is
> possible to have one record in the original form and another one that is canonicalized from it.
> However, this will lose the deduplication benefit.

It would also be quite costly to store the content twice.


> IIPC's Slack has a #warc channel where we discuss the WARC specifications. We had some
> conversation around HTTP/2 recently,

Thanks for the pointer, I'll join it soon (but I'll be off for the next two weeks).

Best and thanks,
Sebastian


On 08/23/2018 01:08 AM, Sawood Alam wrote:
> Hi Sebastian,
> 
> In IPWB (https://github.com/oduwsdl/ipwb) we perform dechunking of the chunked responses before
> pushing them to IPFS (a content-addressable file system) to leverage deduplication benefits.
> When we do that, we make the necessary modifications to the corresponding headers to make sure
> the archival replay behaves properly. However, in our case we are throwing the original
> information away as we are not using WARCs for replay, but ingesting WARC records to populate
> IPFS for replay. If we can agree on a way to preserve this information, we will be happy to
> adopt it in IPWB.
> 
> As far as I know, X-Archive-Orig-* headers are replay specific. They do not exist in WARC
> records, but are added later at replay time. While there is nothing stopping us from utilizing
> that style of headers, I wonder whether current archival replay systems (such as OpenWayback
> and PyWB) would relay them as X-Archive-Orig-X-Archive-Orig-* headers instead (I might be wrong
> here). We can perhaps think of a different and more descriptive naming for situations like this
> (for example, X-Capture-Orig-* or X-WARC-Transform-Orig-*).
> 
> WARC files allow a means to store variants of a record and associate them, so in theory it is
> possible to have one record in the original form and another one that is canonicalized from it.
> However, this will lose the deduplication benefit.
> 
> IIPC's Slack has a #warc channel where we discuss the WARC specifications. We had some
> conversation around HTTP/2 recently, about whether we should preserve the original binary bytes
> as they arrive or the post-processed data. I think you might want to discuss this
> transformation matter there and see what others have to say about it.
> 
> Best,
> 
> --
> Sawood Alam
> Department of Computer Science
> Old Dominion University
> Norfolk VA 23529
> 
> 
> 
> On Wed, Aug 22, 2018 at 5:22 AM Sebastian Nagel <[email protected]> wrote:
> 
>     Common Crawl stores the payload uncompressed and unchunked to ease processing of WARC files
>     by users on various platforms and in various programming languages. To avoid potential
>     errors in WARC processors, this requires removing or changing the HTTP headers
>     "Content-Length", "Content-Encoding" and "Transfer-Encoding".
> 
>     Now I want to preserve the original headers in a safe but transparent way. Is the
>     preservation of original and rewritten HTTP headers in WARC files a valid use case for the
>     X-Archive-Orig- header prefix, or is it intended only for wayback machines when serving
>     captures over HTTP?
> 
>     Thanks,
>     Sebastian
> 
>     -- 
>     You received this message because you are subscribed to the Google Groups 
> "openwayback-dev" group.
>     To unsubscribe from this group and stop receiving emails from it, send an 
> email to
>     [email protected] 
> <mailto:[email protected]>.
>     Visit this group at https://groups.google.com/group/openwayback-dev.
>     To view this discussion on the web visit
>     
> https://groups.google.com/d/msgid/openwayback-dev/371f84db-0486-4f7d-860a-a37d7d3c06df%40googlegroups.com
>     
> <https://groups.google.com/d/msgid/openwayback-dev/371f84db-0486-4f7d-860a-a37d7d3c06df%40googlegroups.com?utm_medium=email&utm_source=footer>.
>     For more options, visit https://groups.google.com/d/optout.
> 

