Hi Sawood,

Thanks for your reply. The screenshot that shows a working site is using 
BDB. I can't get any site to work when using CDX. The screenshots ending in 
_bdb use BDB and those ending in _cdx use CDX. I'm sure the WARC files 
aren't corrupt, as they work with BDB. I'd like to use CDX instead of BDB 
because it scales far better and I would like to host an entire domain crawl. 

The idea about path-index.txt seems promising. I'll have a look at making a 
fresh path-index file and possibly changing permissions on the warc files.
Also going to look into PyWB.
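
For the fresh path-index, something along these lines might do as a first pass. It is only a sketch: it assumes path-index.txt holds one "filename<TAB>absolute path" line per archive file, sorted in plain byte order, which is my understanding of what OpenWayback expects. The function name build_path_index and the suffix list are made up for illustration. It also reports files the current user cannot read, so it should be run as the user Tomcat runs as.

```python
import os

def build_path_index(root, suffixes=(".warc", ".warc.gz", ".arc", ".arc.gz")):
    """Return (sorted path-index lines, unreadable files) for archives under root.

    Assumption: each path-index line is 'filename<TAB>absolute path'.
    """
    lines = []
    unreadable = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name.endswith(suffixes):
                full = os.path.abspath(os.path.join(dirpath, name))
                # os.access checks the user running this script, so run it
                # as the same user that the replay service uses.
                if not os.access(full, os.R_OK):
                    unreadable.append(full)
                lines.append(f"{name}\t{full}")
    # Code-point order matches byte order (LC_ALL=C) for ASCII filenames.
    lines.sort()
    return lines, unreadable
```

Writing "\n".join(lines) to path-index.txt and eyeballing the unreadable list should show quickly whether permissions are the problem.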

Thanks again,
Conor

On Monday, September 18, 2017 at 3:08:46 PM UTC+1, Sawood Alam wrote:
>
> Hi Conor,
>
> You said all of them return Resource Not Available. However, your 
> screenshots include an example that shows otherwise. 
> That said, from the kind of issue you are describing, it seems the CDX 
> files are in place and sorted as required, or else you would not be able 
> to see the listing. The likely causes are: there is some issue in your 
> path-index.txt file (or its configuration); the WARC files are not located 
> where path-index says they are; the file permissions on path-index or the 
> WARC files need to be revisited; or the WARC files are corrupted in some 
> way (in the past I have seen WARC files whose first block was uncompressed 
> while the rest of the WARC was gzipped).
>
> I would perhaps chase a failing request to the end by finding that URL 
> and timestamp in the CDX file manually, reading the filename and offset 
> from the CDX entry, finding that file's mapping in the path-index file, 
> seeking to the offset and reading bytes from the WARC file based on the 
> CDX entry, then decompressing them (if gzipped) to see the content. I 
> would choose a text file (HTML/CSS/JS) for this exercise.
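
The manual check described above could be scripted roughly as follows. This is a sketch under assumptions, not a tested tool for this crawl: it assumes the CDX file starts with a " CDX ..." header line whose letters name the columns (V for the compressed offset and g for the filename in the usual legend), and that path-index lines are "filename<TAB>path". The helper names parse_cdx_line, load_path_index and read_record are hypothetical.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"

def parse_cdx_line(header, line):
    """Map the field letters of a ' CDX N b a ...' header onto one index line."""
    letters = header.split()[1:]          # drop the leading 'CDX' token
    return dict(zip(letters, line.split()))

def load_path_index(lines):
    """Build {filename: path} from tab-separated path-index lines (assumed format)."""
    index = {}
    for raw in lines:
        raw = raw.rstrip("\n")
        if raw:
            name, path = raw.split("\t", 1)
            index[name] = path
    return index

def read_record(path, offset, chunk_size=64 * 1024):
    """Return (was_gzipped, data) for the record starting at offset."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        head = fh.read(2)
        fh.seek(offset)
        if head != GZIP_MAGIC:
            # Not a gzip member at this offset: return a raw preview only.
            return False, fh.read(chunk_size)
        inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip framing
        out = []
        # Stop at the end of this one gzip member, i.e. one record.
        while not inflater.eof:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            out.append(inflater.decompress(chunk))
        return True, b"".join(out)
```

For one failing capture, the flow would be: entry = parse_cdx_line(header, line), then path = load_path_index(open("path-index.txt"))[entry["g"]], then read_record(path, int(entry["V"])). If the returned data does not start with a WARC or ARC record header followed by the HTTP response, either the offset or the file mapping is the likely culprit.
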
>
> Alternatively, I would grab a few WARC files, copy them elsewhere, then 
> run them through a different replay system such as PyWB to locate the 
> potential issue.
>
> Best,
>
> --
> Sawood Alam
> Department of Computer Science
> Old Dominion University
> Norfolk VA 23529
>
>
> On Mon, Sep 18, 2017 at 8:21 AM, <[email protected]> wrote:
>
>> Hi all,
>>
>>  
>>
>> *The problem:*
>>
>> I have a domain crawl of the .ie domain from 2007 which I'm trying to 
>> access. The overall size of the crawl is 3.7 TB.
>>
>> I followed the setup instructions here: 
>> https://github.com/iipc/openwayback/wiki/How-to-configure
>>
>> If I use the BDB option I can view the sites, but the index grows to 
>> about 1 TB quite quickly and I run out of space before everything is 
>> indexed.
>>
>> I already have a CDX file that was generated at the time of the crawl 
>> (2007). 
>>
>> If I use the CDX option I can see the links to view sites, but all of 
>> them return Resource Not Available.
>>
>>  
>>
>> *Further details:*
>>
>> I only have read and execute permissions on the folder containing the 
>> web archive. 
>>
>> I tried generating a new CDX file and path-index for a single file but 
>> ran into the same issue.
>>
>> I had a look at this answer: 
>> https://groups.google.com/forum/#!searchin/openwayback-dev/cdx%7Csort:relevance/openwayback-dev/Fz4fdRrg9GQ/33I3bOp71acJ
>> but I don't think that's my issue, since the CDX file was working in the 
>> past, although it was using wayback and nutch instead of openwayback.
>>
>>  
>>
>> *Recreating the problem:*
>>
>> Unfortunately I can't attach any of the WARC files I'm working with. I've 
>> also left out my CDX file since it's 68 GB in size.
>>
>> I've attached my configuration files and some screenshots of what 
>> happens using BDB vs CDX.
>>
>> To switch between configurations I copy wayback.xml.bdb or 
>> wayback.xml.cdx to /usr/share/tomcat/openwayback/WEB_INF/wayback.xml.
>>
>>  
>>
>> If anyone can see what I'm doing wrong or point me in the direction of 
>> further documentation I'd really appreciate it.
>>
>>  
>>
>> Thanks,
>>
>> Conor
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "openwayback-dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> Visit this group at https://groups.google.com/group/openwayback-dev.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/openwayback-dev/a0543942-8511-4782-b815-8d04cbdb9195%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
