Hi Conor,

You said that all of them return Resource Not Available. However, your screenshots show an example that illustrates otherwise. That said, given the kind of issue you are describing, it seems the CDX files are in place and sorted as required, or else you would not be able to see the listing. That leaves a few likely causes: there is some issue in your path-index.txt file (or its configuration); the WARC files are not located where the path-index says they are; the file permissions on the path-index or the WARC files need to be revisited; or the WARC files are corrupted in some way (in the past I have seen WARC files whose first block was uncompressed while the rest of the WARC was gzipped).
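To check for the last kind of corruption (an uncompressed block where a gzip member is expected), here is a minimal sketch that looks at the first few bytes at a given offset. The function name `describe_record_start` and its return labels are made up for illustration; the only facts it relies on are that a gzip member starts with the bytes `1f 8b` and that an uncompressed WARC record starts with `WARC/`.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def describe_record_start(path, offset=0):
    """Classify the bytes found at `offset` in an archive file.

    Hypothetical helper: returns "gzip", "uncompressed-warc" or "unknown".
    """
    with open(path, "rb") as f:
        f.seek(offset)
        head = f.read(5)
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    if head == b"WARC/":
        return "uncompressed-warc"
    return "unknown"
```

If this returns "uncompressed-warc" at offset 0 for a file named `*.warc.gz`, while offsets taken from the CDX return "gzip", the file is probably mixed in the way described above.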
I would chase a failing request to the end: find that URL and timestamp in the CDX file manually, read the filename and offsets from the CDX entry, look up that filename in the path-index file, seek to the offset in the WARC file it maps to, read the record, and then decompress it (if gzipped) to see the content. I would choose a text file (HTML/CSS/JS) for this exercise. Alternatively, I would grab a few WARC files, copy them elsewhere, and run them through a different replay system such as PyWB to locate the potential issue.

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529

On Mon, Sep 18, 2017 at 8:21 AM, <[email protected]> wrote:
> Hi all,
>
> *The problem:*
>
> I have a domain crawl of the .ie domain from 2007 which I'm trying to
> access. The overall size of the crawl is 3.7 TB
>
> I followed the setup instructions here:
> https://github.com/iipc/openwayback/wiki/How-to-configure
>
> If I use the BDB option I can view the sites, but the index grows to about
> 1TB quite quickly and I run out of space before everything is indexed.
>
> I already have a CDX file that was generated at the time of the crawl
> (2007).
>
> If I use the CDX option I can see the links to view sites, but all of them
> return Resource Not Available.
>
> *Further details:*
>
> I only have read and execution rights on the folder containing the
> webarchive.
>
> I tried generating a new cdx file and path-index for a single file but ran
> into the same issue.
>
> I had a look at this
> <https://groups.google.com/forum/#!searchin/openwayback-dev/cdx%7Csort:relevance/openwayback-dev/Fz4fdRrg9GQ/33I3bOp71acJ>
> answer,
> but I don't think that's my issue since the CDX file was working in the
> past, although it was using wayback and nutch instead of openwayback.
>
> *Recreating the problem:*
>
> Unfortunately I can't attach any of the warc files I'm working with. I've
> also left out my cdx file since it's 68 GB in size.
>
> I've attached my configuration files and some screen shots of what happens
> using BDB vs CDX.
>
> To switch between configurations I copy wayback.xml.bdb or wayback.xml.cdx
> to /usr/share/tomcat/openwayback/WEB_INF/wayback.xml.
>
> If anyone can see what I'm doing wrong or point me in the direction of
> further documentation I'd really appreciate it.
>
> Thanks,
>
> Conor
>
> --
> You received this message because you are subscribed to the Google Groups
> "openwayback-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> Visit this group at https://groups.google.com/group/openwayback-dev.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/openwayback-dev/a0543942-8511-4782-b815-8d04cbdb9195%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
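
P.S. The manual trace suggested in the reply (CDX entry, then path-index, then WARC offset) could be sketched roughly as follows. This is an illustration under assumptions, not what OpenWayback does internally: it assumes the CDX file starts with a legend line such as ` CDX N b a m s k r M S V g`, in which `V` marks the compressed offset column and `g` the filename column, and it assumes path-index.txt holds tab-separated `filename<TAB>location` lines. All function names are made up for this example.

```python
import zlib


def parse_cdx_header(header_line):
    """Map each legend letter of a CDX header line to its column position."""
    parts = header_line.split()
    if not parts or parts[0] != "CDX":
        raise ValueError("not a CDX header line")
    return {letter: i for i, letter in enumerate(parts[1:])}


def locate(cdx_line, columns):
    """Return (filename, offset) from one CDX line.

    Assumes 'g' is the filename column and 'V' the compressed offset column.
    """
    fields = cdx_line.split()
    return fields[columns["g"]], int(fields[columns["V"]])


def load_path_index(lines):
    """Build {filename: location} from tab-separated path-index lines (assumed layout)."""
    index = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            index[parts[0]] = parts[1]
    return index


def read_record(path, offset, chunk_size=64 * 1024):
    """Read the record starting at `offset`, decompressing it if it is gzipped."""
    with open(path, "rb") as f:
        f.seek(offset)
        head = f.read(2)
        f.seek(offset)
        if head != b"\x1f\x8b":
            # Not a gzip member: return the first chunk of raw bytes for inspection.
            return f.read(chunk_size)
        # wbits=31 tells zlib to expect a gzip header; the decompressor
        # reports eof at the end of this one member, so later records are ignored.
        decomp = zlib.decompressobj(31)
        out = bytearray()
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # truncated file
            out += decomp.decompress(chunk)
        return bytes(out)
```

With the header line and one failing CDX line in hand, `locate()` gives the filename and offset, `load_path_index()` gives the location to open, and `read_record()` should return bytes that begin with the record headers. If it raises `zlib.error`, or the location from the path-index does not exist or is not readable, that narrows down which of the causes listed earlier applies.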
