Sebastian,

> I've chosen X-Crawler- for now but am happy to change it to any commonly
> adopted prefix.

Should you use X-Crawler-Orig-* instead until the WARC community agrees on
something? To those who do not know about the convention, the purpose of
X-Crawler-* might not be as obvious.
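
To make the suggestion concrete, here is a rough sketch in Python of how the
rewritten headers could be preserved under such a prefix. It is only an
illustration under assumptions: preserve_original_headers is a hypothetical
helper (not part of any WARC library or of the Common Crawl code), the prefix
is just the one proposed above, and I assume the new Content-Length simply
reflects the payload as stored.

```python
# Hypothetical helper: names and prefix are illustrative assumptions only.
# Headers that no longer describe a payload stored decoded and unchunked.
REWRITTEN = {"content-length", "content-encoding", "transfer-encoding"}


def preserve_original_headers(headers, payload_length, prefix="X-Crawler-Orig-"):
    """Rename headers invalidated by the transformation and add a
    Content-Length that matches the stored payload."""
    result = []
    for name, value in headers:
        if name.lower() in REWRITTEN:
            # Keep the value as received, under the prefixed name.
            result.append((prefix + name, value))
        else:
            result.append((name, value))
    # Describe the payload as it is actually stored.
    result.append(("Content-Length", str(payload_length)))
    return result


original = [
    ("Content-Type", "text/html"),
    ("Content-Encoding", "gzip"),
    ("Transfer-Encoding", "chunked"),
]
rewritten = preserve_original_headers(original, payload_length=1234)
for name, value in rewritten:
    print(f"{name}: {value}")
```

With "-Orig-" in the prefix, a reader of the record can tell at a glance
that the value is what the server originally sent.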

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529



On Fri, Aug 24, 2018 at 8:04 AM Sebastian Nagel <[email protected]>
wrote:

> Hi Sawood,
>
> > X-Archive-Orig-X-Archive-Orig-*
>
> Thanks, that is a telling argument against using "X-Archive-Orig-"!
>
>
> > We can perhaps think of a different and more descriptive naming for
> > situations like this (for example, X-Capture-Orig-* or
> > X-WARC-Transform-Orig-*).
>
> I've chosen X-Crawler- for now but am happy to change it to any commonly
> adopted prefix.
>
>
> > WARC files allow a means to store variants of a record and associate
> > them, so in theory it is possible to have one record in the original
> > form and another one that is canonicalized from it. However, this will
> > lose the deduplication benefit.
>
> It would also be quite costly to store the content twice.
>
>
> > IIPC's Slack has a #warc channel where we discuss the WARC
> > specifications. We had some conversation around HTTP2 recently,
>
> Thanks for the pointer, I'll join it soon (but I'll be off for the next
> two weeks).
>
> Best and thanks,
> Sebastian
>
>
> On 08/23/2018 01:08 AM, Sawood Alam wrote:
> > Hi Sebastian,
> >
> > In IPWB (https://github.com/oduwsdl/ipwb) we perform dechunking of
> > chunked responses before pushing them to IPFS (a content-addressable
> > file system) to leverage deduplication benefits. When we do that, we
> > make the necessary modifications in the corresponding headers to make
> > sure the archival replay behaves properly. However, in our case we are
> > throwing the original information away, as we are not using WARCs for
> > replay but ingesting WARC records to populate IPFS for replay. If we
> > can agree on a way to preserve this information, we will be happy to
> > adopt it in IPWB.
> >
> > As far as I know, X-Archive-Orig-* headers are replay specific. They do
> > not exist in WARC records but are added later at replay time. While
> > there is nothing stopping us from utilizing that style of headers, I
> > wonder whether current archival replay systems (such as OpenWayback and
> > PyWB) will relay them as X-Archive-Orig-X-Archive-Orig-* headers
> > instead (I might be wrong here). We can perhaps think of a different
> > and more descriptive naming for situations like this (for example,
> > X-Capture-Orig-* or X-WARC-Transform-Orig-*).
> >
> > WARC files allow a means to store variants of a record and associate
> > them, so in theory it is possible to have one record in the original
> > form and another one that is canonicalized from it. However, this will
> > lose the deduplication benefit.
> >
> > IIPC's Slack has a #warc channel where we discuss the WARC
> > specifications. We had some conversation around HTTP2 recently about
> > whether we should preserve the original binary bytes as they arrive or
> > the post-processed data. I think you might want to discuss this
> > transformation matter there and see what others have to say about it.
> >
> > Best,
> >
> > --
> > Sawood Alam
> > Department of Computer Science
> > Old Dominion University
> > Norfolk VA 23529
> >
> >
> >
> > On Wed, Aug 22, 2018 at 5:22 AM Sebastian Nagel <[email protected]> wrote:
> >
> >     Common Crawl stores the payload uncompressed and unchunked to ease
> >     the processing of WARC files by users on various platforms and in
> >     various programming languages. To avoid potential errors in WARC
> >     processors, this requires removing or changing the HTTP headers
> >     "Content-Length", "Content-Encoding" and "Transfer-Encoding".
> >
> >     Now I want to preserve the original headers in a safe but
> >     transparent way. Is the preservation of original and rewritten HTTP
> >     headers in WARC files a valid use case for the X-Archive-Orig-
> >     header prefix, or is it intended only for wayback machines when
> >     serving captures over HTTP?
> >
> >     Thanks,
> >     Sebastian
> >
> >     --
> >     You received this message because you are subscribed to the Google
> Groups "openwayback-dev" group.
> >     To unsubscribe from this group and stop receiving emails from it,
> send an email to
> >     [email protected] <mailto:
> [email protected]>.
> >     Visit this group at https://groups.google.com/group/openwayback-dev.
> >     To view this discussion on the web visit
> >
> https://groups.google.com/d/msgid/openwayback-dev/371f84db-0486-4f7d-860a-a37d7d3c06df%40googlegroups.com
> >     <
> https://groups.google.com/d/msgid/openwayback-dev/371f84db-0486-4f7d-860a-a37d7d3c06df%40googlegroups.com?utm_medium=email&utm_source=footer
> >.
> >     For more options, visit https://groups.google.com/d/optout.
> >
>

