Hi Conor,

Just a quick note that path-index.txt files should also be sorted the same way CDX files are, i.e., with the LC_ALL=C environment variable set.
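
In case it is useful, here is a minimal Python sketch for checking that a path-index.txt (or CDX) file is in the plain byte order that sorting under LC_ALL=C produces. The function names are made up for illustration, and it assumes one entry per line; it only reports the first out-of-order line and does not re-sort anything.

```python
def first_unsorted_line(lines):
    """Return the 1-based number of the first line that sorts before its
    predecessor in raw byte order, or None if the input is sorted.

    `lines` is any iterable of bytes objects (e.g. a file opened in "rb" mode).
    """
    prev = None
    for number, line in enumerate(lines, start=1):
        current = line.rstrip(b"\r\n")
        # bytes objects compare by unsigned byte value, which matches the
        # ordering used by sort when LC_ALL=C is set.
        if prev is not None and current < prev:
            return number
        prev = current
    return None


def check_file(path):
    """Convenience wrapper (hypothetical helper): check a file on disk."""
    with open(path, "rb") as handle:
        return first_unsorted_line(handle)
```

For example, `first_unsorted_line([b"a.warc.gz\t/x\n", b"B.warc.gz\t/y\n"])` reports line 2, because in byte order uppercase letters sort before lowercase ones, while many locale-aware sorts would accept that order.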
Best,

On Tue, Sep 19, 2017 at 4:04 AM <[email protected]> wrote:

> Hi Sawood,
>
> Thanks for your reply. The screenshot which shows a working site is using
> BDB. I can't get any site to work when using CDX. The screenshots ending in
> _bdb use BDB and those ending in _cdx use CDX. I'm sure the WARC files
> aren't corrupt as they work with BDB. I'd like to use CDX instead of BDB
> because it scales far better and I would like to host an entire domain crawl.
>
> The idea about path-index.txt seems promising. I'll have a look at making
> a fresh path-index file and possibly changing permissions on the WARC files.
> I'm also going to look into PyWB.
>
> Thanks again,
> Conor
>
>
> On Monday, September 18, 2017 at 3:08:46 PM UTC+1, Sawood Alam wrote:
>
>> Hi Conor,
>>
>> You said all of them return Resource Not Available. However, your
>> screenshots show an example that illustrates otherwise. That said, from
>> the kind of issue you are describing, it seems the CDX files are in place
>> and sorted as required, or else you would not be able to see the listing.
>> So the likely causes are: there is some issue in your path-index.txt file
>> (or its configuration); the WARC files are not located where path-index
>> suggests they are; the file permissions of path-index or the WARC files
>> need to be revisited; or the WARC files are corrupted in some way (in the
>> past I have seen WARC files whose first block was uncompressed while the
>> rest of the WARC was gzipped).
>>
>> I would perhaps chase a failing request to the end: find that URL and
>> timestamp in the CDX file manually, read the filename and offset from the
>> CDX entry, find that file's mapping in the path-index file, seek to the
>> offset in the WARC file and read the record bytes based on the CDX entry,
>> then decompress them (if gzipped) to see the content. I would choose a
>> text resource (HTML/CSS/JS) for this exercise.
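
A rough Python sketch of the manual chase described in the quoted paragraph above. It rests on several assumptions that are not from this thread: the CDX file starts with a legend line such as ` CDX N b a m s k r M S V g` in which `g` is the file name and `V` the compressed offset, fields are space-separated, and path-index.txt holds tab-separated `filename<TAB>location` lines. All function names are hypothetical.

```python
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def parse_cdx_legend(header_line):
    """Map CDX field letters to column indexes, e.g. {'N': 0, 'b': 1, ...}."""
    parts = header_line.split()
    if not parts or parts[0] != "CDX":
        raise ValueError("not a CDX legend line")
    return {letter: index for index, letter in enumerate(parts[1:])}


def locate(cdx_line, legend):
    """Return (filename, offset) for one CDX entry (assumes letters g and V)."""
    fields = cdx_line.split()
    return fields[legend["g"]], int(fields[legend["V"]])


def resolve_path(path_index_lines, filename):
    """Return every location that the path-index lines map to filename."""
    locations = []
    for line in path_index_lines:
        name, tab, location = line.rstrip("\n").partition("\t")
        if tab and name == filename:
            locations.append(location)
    return locations


def read_record(handle, offset, max_bytes=1 << 20):
    """Read the record starting at offset from a binary file object.

    If the bytes at offset start with the gzip magic number, exactly one gzip
    member is decompressed; otherwise up to max_bytes raw bytes are returned,
    which makes an unexpectedly uncompressed block easy to spot.
    """
    handle.seek(offset)
    head = handle.read(2)
    handle.seek(offset)
    if head != GZIP_MAGIC:
        return handle.read(max_bytes)
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip header
    output = b""
    while not decompressor.eof and len(output) < max_bytes:
        chunk = handle.read(64 * 1024)
        if not chunk:
            break  # truncated file
        output += decompressor.decompress(chunk)
    return output
```

Typical use would be to take the legend from the first line of the CDX file, call `locate` on the entry for the failing URL, pass the filename to `resolve_path`, then open the resulting location in `"rb"` mode and print the start of `read_record(handle, offset)`. If the output does not begin with a WARC or ARC record header, the offset, the path mapping, or the compression of that file is the likely culprit.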
>>
>> Alternatively, I would grab a few WARC files, copy them elsewhere, then
>> run them through a different replay system such as PyWB to locate the
>> potential issue.
>>
>> Best,
>>
>> --
>> Sawood Alam
>> Department of Computer Science
>> Old Dominion University
>> Norfolk VA 23529
>>
>>
>> On Mon, Sep 18, 2017 at 8:21 AM, <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> *The problem:*
>>>
>>> I have a domain crawl of the .ie domain from 2007 which I'm trying to
>>> access. The overall size of the crawl is 3.7 TB.
>>>
>>> I followed the setup instructions here:
>>> https://github.com/iipc/openwayback/wiki/How-to-configure
>>>
>>> If I use the BDB option I can view the sites, but the index grows to
>>> about 1 TB quite quickly and I run out of space before everything is
>>> indexed.
>>>
>>> I already have a CDX file that was generated at the time of the crawl
>>> (2007).
>>>
>>> If I use the CDX option I can see the links to view sites, but all of
>>> them return Resource Not Available.
>>>
>>> *Further details:*
>>>
>>> I only have read and execute permissions on the folder containing the
>>> web archive.
>>>
>>> I tried generating a new CDX file and path-index for a single file but
>>> ran into the same issue.
>>>
>>> I had a look at this
>>> <https://groups.google.com/forum/#!searchin/openwayback-dev/cdx%7Csort:relevance/openwayback-dev/Fz4fdRrg9GQ/33I3bOp71acJ>
>>> answer, but I don't think that's my issue since the CDX file was working
>>> in the past, although that was with wayback and nutch instead of
>>> openwayback.
>>>
>>> *Recreating the problem:*
>>>
>>> Unfortunately I can't attach any of the WARC files I'm working with.
>>> I've also left out my CDX file since it's 68 GB in size.
>>>
>>> I've attached my configuration files and some screenshots of what
>>> happens using BDB vs CDX.
>>>
>>> To switch between configurations I copy wayback.xml.bdb or
>>> wayback.xml.cdx to /usr/share/tomcat/openwayback/WEB-INF/wayback.xml.
>>>
>>> If anyone can see what I'm doing wrong or point me in the direction of
>>> further documentation I'd really appreciate it.
>>>
>>> Thanks,
>>>
>>> Conor
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "openwayback-dev" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an
>>> email to [email protected].
>>> Visit this group at https://groups.google.com/group/openwayback-dev.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/openwayback-dev/a0543942-8511-4782-b815-8d04cbdb9195%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529
