Hi Conor,

Just a quick note that path-index.txt files must also be sorted the same way
CDX files are, i.e., with the LC_ALL=C environment variable set.
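
For illustration, here is a minimal Python sketch of the plain byte-order
sort that LC_ALL=C gives you. It assumes the file fits in memory, which is
fine for a path-index but not for a 68 GB CDX (use `LC_ALL=C sort` for
that). The function names and file names are placeholders, not part of
OpenWayback.

```python
def c_locale_sorted(lines):
    """Sort byte strings by raw byte value, as `LC_ALL=C sort` does."""
    return sorted(lines)


def sort_file_bytewise(src, dst):
    """Write a byte-order-sorted copy of src to dst (whole file in memory)."""
    with open(src, "rb") as f:
        lines = f.read().split(b"\n")
    # A trailing newline leaves one empty element at the end; drop it.
    if lines and lines[-1] == b"":
        lines.pop()
    with open(dst, "wb") as f:
        for line in c_locale_sorted(lines):
            f.write(line + b"\n")


def is_bytewise_sorted(path):
    """Stream through a file and report whether it is already in byte order."""
    previous = None
    with open(path, "rb") as f:
        for raw in f:
            # Compare without the newline so "a" sorts before "a<TAB>b".
            line = raw.rstrip(b"\n")
            if previous is not None and line < previous:
                return False
            previous = line
    return True
```

In byte order, uppercase letters sort before lowercase ones, so
c_locale_sorted([b"b.warc.gz", b"B.warc.gz", b"a.warc.gz"]) returns
[b"B.warc.gz", b"a.warc.gz", b"b.warc.gz"]. A locale-aware sort can order
these differently, which is enough to break lookups.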

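The manual check suggested in the earlier message (quoted below) could be
sketched in Python roughly as follows. It rests on several assumptions:
the CDX file starts with a header line such as " CDX N b a m s k r M S V g",
where "V" is the compressed record offset and "g" is the file name; each
path-index.txt line is "filename<TAB>location"; and each record is its own
gzip member. All function names are made up for this sketch.

```python
import zlib


def cdx_columns(header_line):
    """Map CDX field letters to column indexes, e.g. ' CDX N b a ... V g'."""
    parts = header_line.split()
    if not parts or parts[0] != "CDX":
        raise ValueError("not a CDX header line")
    return {letter: i for i, letter in enumerate(parts[1:])}


def locate(cdx_line, columns):
    """Return (file name, offset) taken from one space-separated CDX line."""
    fields = cdx_line.split()
    return fields[columns["g"]], int(fields[columns["V"]])


def load_path_index(path):
    """Read 'filename<TAB>location' lines; the first location per name wins."""
    index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            name, sep, location = line.rstrip("\n").partition("\t")
            if sep:
                index.setdefault(name, location)
    return index


def read_record(warc_path, offset, chunk_size=64 * 1024):
    """Decompress the single gzip member that starts at the given offset."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = []
    with open(warc_path, "rb") as f:
        f.seek(offset)
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            out.append(decomp.decompress(chunk))
    return b"".join(out)
```

If read_record raises zlib.error, the bytes at that offset are not the start
of a gzip member, which points at a wrong offset or at a file that is not
compressed per record. A KeyError on the path-index lookup means the file
name in the CDX entry has no mapping.
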
Best,

On Tue, Sep 19, 2017 at 4:04 AM <[email protected]> wrote:

> Hi Sawood,
>
> Thanks for your reply. The screenshot that shows a working site is using
> BDB. I can't get any site to work when using CDX. The screenshots ending in
> _bdb use BDB and those ending in _cdx use CDX. I'm sure the WARC files
> aren't corrupt, as they work with BDB. I'd like to use CDX instead of BDB
> because it scales far better and I would like to host an entire domain crawl.
>
> The idea about path-index.txt seems promising. I'll have a look at making
> a fresh path-index file and possibly changing permissions on the warc files.
> Also going to look into PyWB.
>
> Thanks again,
> Conor
>
>
> On Monday, September 18, 2017 at 3:08:46 PM UTC+1, Sawood Alam wrote:
>
>> Hi Conor,
>>
>> You said that all of them return Resource Not Available. However, your
>> screenshots include an example that shows otherwise. That said, from the
>> kind of issue you are describing, it seems the CDX files are in place and
>> sorted as required, or else you would not be able to see the listing.
>> That leaves a few possibilities: there is some issue in your
>> path-index.txt file (or its configuration); the WARC files are not located
>> where path-index says they are; the file permissions of path-index and of
>> the WARC files need to be revisited; or the WARC files are corrupted in
>> some way (in the past I have seen WARC files whose first block was
>> uncompressed while the rest of the WARC was gzipped).
>>
>> I would perhaps chase a failing request to the end: find that URL and
>> timestamp in the CDX file manually, read the filename and offset from the
>> CDX entry, find the mapping for that file in the path-index file, seek to
>> the offset in the WARC file and read the record bytes, then decompress
>> them (if gzipped) to see the content. I would choose a text file
>> (HTML/CSS/JS) for this exercise.
>>
>> Alternatively, I would grab a few WARC files, copy them elsewhere, then
>> run them through a different replay system such as PyWB to locate
>> potential issues.
>>
>> Best,
>>
>>
>> --
>> Sawood Alam
>> Department of Computer Science
>> Old Dominion University
>> Norfolk VA 23529
>>
>>
>> On Mon, Sep 18, 2017 at 8:21 AM, <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>>
>>>
>>> *The problem:*
>>>
>>> I have a domain crawl of the .ie domain from 2007 which I'm trying to
>>> access. The overall size of the crawl is 3.7 TB.
>>>
>>> I followed the setup instructions here:
>>> https://github.com/iipc/openwayback/wiki/How-to-configure
>>>
>>> If I use the BDB option I can view the sites, but the index grows to
>>> about 1 TB quite quickly and I run out of space before everything is
>>> indexed.
>>>
>>> I already have a CDX file that was generated at the time of the crawl
>>> (2007).
>>>
>>> If I use the CDX option I can see the links to view sites, but all of
>>> them return Resource Not Available.
>>>
>>>
>>>
>>> *Further details:*
>>>
>>> I only have read and execution rights on the folder containing the
>>> webarchive.
>>>
>>> I tried generating a new cdx file and path-index for a single file but
>>> ran into the same issue.
>>>
>>> I had a look at this
>>> <https://groups.google.com/forum/#!searchin/openwayback-dev/cdx%7Csort:relevance/openwayback-dev/Fz4fdRrg9GQ/33I3bOp71acJ>
>>>  answer,
>>> but I don't think that's my issue since the CDX file was working in the
>>> past, although it was using wayback and nutch instead of openwayback.
>>>
>>>
>>>
>>> *Recreating the problem:*
>>>
>>> Unfortunately I can't attach any of the warc files I'm working with.
>>> I've also left out my cdx file since it's 68 GB in size.
>>>
>>> I've attached my configuration files and some screen shots of what
>>> happens using BDB vs CDX.
>>>
>>> To switch between configurations I copy wayback.xml.bdb or
>>> wayback.xml.cdx to /usr/share/tomcat/openwayback/WEB-INF/wayback.xml.
>>>
>>>
>>>
>>> If anyone can see what I'm doing wrong or point me in the direction of
>>> further documentation I'd really appreciate it.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Conor
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "openwayback-dev" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an
>>> email to [email protected].
>>> Visit this group at https://groups.google.com/group/openwayback-dev.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/openwayback-dev/a0543942-8511-4782-b815-8d04cbdb9195%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529
