Another quick update: I ran Luke on the index. part-00000 opens fine, whereas part-00001 is reported as corrupt or missing. From the file listings for both directories, we know that part-00001 holds no actual index data, so why does it get generated? And if it does, why does dedup not handle it gracefully?
I also ran a merge on the two indexes, and it worked fine. So that puts to rest the idea that both indexes are corrupted. This leads me to think that, since I only had two pages indexed and the index was small, part-00001 came out empty, and dedup does not handle that. Any thoughts?

-- Hetal Shah wrote: --

That's what I had read on another post as well, but I can't understand how it could be corrupted. It's not even a massive index, just a couple of URLs. I followed every step as per the tutorials on the wiki page.

Here's the listing under /indexes:

    drwxr-xr-x 2 root root 4096 Jan 31 16:21 part-00000
    drwxr-xr-x 2 root root 4096 Jan 31 16:21 part-00001

This is what's under part-00000:

    -rw-r--r-- 1 root root    2 Jan 31 16:21 _2.f0
    -rw-r--r-- 1 root root    2 Jan 31 16:21 _2.f1
    -rw-r--r-- 1 root root    2 Jan 31 16:21 _2.f2
    -rw-r--r-- 1 root root    2 Jan 31 16:21 _2.f3
    -rw-r--r-- 1 root root    2 Jan 31 16:21 _2.f4
    -rw-r--r-- 1 root root    2 Jan 31 16:21 _2.f5
    -rw-r--r-- 1 root root  399 Jan 31 16:21 _2.fdt
    -rw-r--r-- 1 root root   16 Jan 31 16:21 _2.fdx
    -rw-r--r-- 1 root root   74 Jan 31 16:21 _2.fnm
    -rw-r--r-- 1 root root  945 Jan 31 16:21 _2.frq
    -rw-r--r-- 1 root root 1790 Jan 31 16:21 _2.prx
    -rw-r--r-- 1 root root  105 Jan 31 16:21 _2.tii
    -rw-r--r-- 1 root root 6850 Jan 31 16:21 _2.tis
    -rw-r--r-- 1 root root    4 Jan 31 16:21 deletable
    -rw-r--r-- 1 root root    0 Jan 31 16:21 index.done
    -rw-r--r-- 1 root root   27 Jan 31 16:21 segments

This is what's under part-00001:

    -rw-r--r-- 1 root root    0 Jan 31 16:21 index.done
    -rw-r--r-- 1 root root   20 Jan 31 16:21 segments

By the way, I should also mention that I am running dedup on DFS. I haven't tried running it on the local file system yet, but does that matter?

Thanks for your help.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
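
For anyone who wants to check for the same situation before running dedup, here is a minimal sketch that flags partition directories holding nothing but bookkeeping files, as part-00001 does in the listing above. It is only an illustration: the function name `empty_partitions` and the `BOOKKEEPING` set are made up for this example, and treating `segments`, `deletable` and `index.done` as the only non-data files is an assumption drawn from that listing, not from Nutch or Lucene documentation.

```python
from pathlib import Path

# Assumption: these are the only files that carry no document data,
# judging purely from the part-00000 / part-00001 listing in this thread.
BOOKKEEPING = {"segments", "deletable", "index.done"}


def empty_partitions(indexes_dir):
    """Return the names of part-* directories with no segment data files."""
    empty = []
    for part in sorted(Path(indexes_dir).glob("part-*")):
        if not part.is_dir():
            continue
        # Anything other than the bookkeeping files counts as index data.
        data_files = [
            f for f in part.iterdir()
            if f.is_file() and f.name not in BOOKKEEPING
        ]
        if not data_files:
            empty.append(part.name)
    return empty


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass a local copy of the indexes directory.
    for name in empty_partitions(sys.argv[1] if len(sys.argv) > 1 else "indexes"):
        print("no index data in", name)
```

This reads the local file system only, so with the index on DFS it would have to be run against a local copy. It does not explain why the empty partition is generated, and whether dedup succeeds once such a partition is set aside is untested here.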
