Hi Sawood,

Thanks for your reply. The screenshot that shows a working site is using BDB; I can't get any site to work when using CDX. The screenshots ending in _bdb use BDB and those ending in _cdx use CDX. I'm sure the WARC files aren't corrupt, as they work with BDB. I'd like to use CDX instead of BDB because it scales far better and I would like to host an entire domain crawl.

The idea about path-index.txt seems promising. I'll have a look at making a fresh path-index file and possibly changing permissions on the WARC files. I'm also going to look into PyWB.

Thanks again,
Conor

On Monday, September 18, 2017 at 3:08:46 PM UTC+1, Sawood Alam wrote:
>
> Hi Conor,
>
> You said all of them return Resource Not Available. However, your
> screenshots show an example that illustrates otherwise.
> That said, from the kind of issue you are describing, it seems the CDX
> files are in place and sorted as required, or else you would not be able
> to see the listing. However, either there is some issue in your
> path-index.txt file (or its configuration), the WARC files are not
> located where path-index says they are, the file permissions of
> path-index as well as the WARC files need to be revisited, or the WARC
> files are corrupted in some way (in the past I have seen WARC files whose
> first block was uncompressed while the rest of the WARC was gzipped).
>
> I would perhaps chase a failing request to the end: find that URL and
> timestamp in the CDX file manually, read the filename and offset from
> the CDX entry, find where that file maps to in the path-index file, seek
> to the offset and read bytes from the WARC file based on the CDX entry,
> then decompress them (if gzipped) to see the content. I would choose a
> text file (HTML/CSS/JS) for this exercise.
>
> Alternatively, I would grab a few WARC files, copy them elsewhere, then
> run them through a different replay system such as PyWB to locate the
> potential issue.
>
> Best,
>
> --
> Sawood Alam
> Department of Computer Science
> Old Dominion University
> Norfolk VA 23529
>
>
> On Mon, Sep 18, 2017 at 8:21 AM, <[email protected]> wrote:
>
>> Hi all,
>>
>> *The problem:*
>>
>> I have a domain crawl of the .ie domain from 2007 which I'm trying to
>> access.
>> The overall size of the crawl is 3.7 TB.
>>
>> I followed the setup instructions here:
>> https://github.com/iipc/openwayback/wiki/How-to-configure
>>
>> If I use the BDB option I can view the sites, but the index grows to
>> about 1 TB quite quickly and I run out of space before everything is
>> indexed.
>>
>> I already have a CDX file that was generated at the time of the crawl
>> (2007).
>>
>> If I use the CDX option I can see the links to view sites, but all of
>> them return Resource Not Available.
>>
>> *Further details:*
>>
>> I only have read and execution rights on the folder containing the
>> webarchive.
>>
>> I tried generating a new cdx file and path-index for a single file but
>> ran into the same issue.
>>
>> I had a look at this
>> <https://groups.google.com/forum/#!searchin/openwayback-dev/cdx%7Csort:relevance/openwayback-dev/Fz4fdRrg9GQ/33I3bOp71acJ>
>> answer, but I don't think that's my issue since the CDX file was working
>> in the past, although it was using wayback and nutch instead of
>> openwayback.
>>
>> *Recreating the problem:*
>>
>> Unfortunately I can't attach any of the warc files I'm working with. I've
>> also left out my cdx file since it's 68 GB in size.
>>
>> I've attached my configuration files and some screenshots of what
>> happens using BDB vs CDX.
>>
>> To switch between configurations I copy wayback.xml.bdb or
>> wayback.xml.cdx to /usr/share/tomcat/openwayback/WEB_INF/wayback.xml.
>>
>> If anyone can see what I'm doing wrong or point me in the direction of
>> further documentation I'd really appreciate it.
>>
>> Thanks,
>>
>> Conor
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "openwayback-dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> Visit this group at https://groups.google.com/group/openwayback-dev.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/openwayback-dev/a0543942-8511-4782-b815-8d04cbdb9195%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.

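For anyone wanting to try the manual check Sawood describes (CDX entry, then path-index lookup, then seek and decompress), here is a rough Python sketch. It rests on a few assumptions that may not match a given setup: the CDX file starts with a header line such as " CDX N b a m s k r M S V g", where V is the compressed offset and g is the filename; path-index.txt is tab-separated (filename, then location); and the locations are local file paths. The function names are made up for illustration and are not part of OpenWayback.

```python
import zlib


def parse_cdx_header(header_line):
    """Split a CDX header such as ' CDX N b a m s k r M S V g'.

    The first character of the header is the field delimiter; the
    letters after 'CDX' name the columns of every following line.
    """
    delimiter = header_line[0]
    parts = header_line.rstrip("\n").split(delimiter)
    if len(parts) < 3 or parts[1] != "CDX":
        raise ValueError("not a CDX header line")
    return delimiter, parts[2:]


def parse_cdx_line(line, delimiter, fields):
    """Map one CDX line onto the field letters taken from the header."""
    return dict(zip(fields, line.rstrip("\n").split(delimiter)))


def load_path_index(path):
    """Read a path-index file, assumed to be 'filename<TAB>location' per line."""
    index = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            name, _, location = line.rstrip("\n").partition("\t")
            if location:
                # Keep the first location listed for each filename.
                index.setdefault(name, location)
    return index


def read_record(warc_path, offset, max_bytes=1 << 20):
    """Return up to max_bytes of the record at offset, decompressed if gzipped."""
    with open(warc_path, "rb") as fh:
        fh.seek(offset)
        raw = fh.read(max_bytes)
    if raw[:2] == b"\x1f\x8b":  # gzip magic number
        # 31 = 16 + 15 selects gzip framing; the decompressor stops at the
        # end of the first gzip member, i.e. at the end of this record.
        return zlib.decompressobj(31).decompress(raw)
    # Not gzipped at this offset: return the raw bytes for inspection.
    return raw
```

With a CDX line for the failing URL in hand, the check would be something like `entry = parse_cdx_line(line, delimiter, fields)`, then `read_record(load_path_index("path-index.txt")[entry["g"]], int(entry["V"]))`. A KeyError on the filename points at path-index, a permission error points at file access, and output that does not start with a WARC or ARC record header suggests the offset or the file itself is off.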