Hey guys, there are definitely no duplicate URLs, because the filenames are in xxxx-timestamp.txt format.
I just tried again what yanky suggested, setting db.max.outlinks.per.page to -1. After the crawl there are 589 records in the index out of 594; I did see a couple of 404 errors, which might account for the 5 missing records. Not sure why it didn't work properly before - this time around I deleted the last crawl's indexes etc., did a clean build, and got the result above, which is quite acceptable. So thanks for the heads up.

Now, what's the impact of this setting (db.max.outlinks.per.page = -1) when I start crawling over the HTTP protocol? It seems I might end up with a very long crawl and a large index. Any suggestions? (See the config and crawl-bounding sketches after the quoted thread below.)

On Sun, 2009-04-12 at 22:44 -0500, Dennis Kubes wrote:
> Could it be you have duplicate urls when they are normalized?
> 
> Dennis
> 
> Fadzi Ushewokunze wrote:
> > yanky,
> > 
> > thanks for that;
> > 
> > I am running on Linux, so I definitely don't have spaces in my file
> > names.
> > 
> > Having said that, I changed db.max.outlinks.per.page from 100 to
> > 1000 and started getting exactly 500 documents instead of the 600 or
> > so. So I changed it to -1 and I am still getting 500 docs! Not sure
> > what's going on here.
> > 
> > On Mon, 2009-04-13 at 11:17 +0800, yanky young wrote:
> >> Hi:
> >> 
> >> I have encountered a similar problem with local Windows file system
> >> search with nutch 0.9. You can see my post here:
> >> http://www.nabble.com/nutch-0.9-protocol-file-plugin-break-with-windows-file-name-that--contains-space-td22903785.html
> >> Hope it helps.
> >> 
> >> good luck
> >> 
> >> yanky
> >> 
> >> 2009/4/13 Fadzi Ushewokunze <[email protected]>
> >> 
> >>> Hi,
> >>> 
> >>> I am having a problem with a file system crawl: I have about 600
> >>> URL text files in a folder, and only 100 of them are getting
> >>> fetched and indexed.
> >>> 
> >>> I have +^file://* in my regex-urlfilter.txt and crawl-urlfilter.txt,
> >>> so every file should be picked up. I have also created my own text
> >>> parser and IndexingFilter plugins. Not sure if this has something
> >>> to do with the problem; I don't think it does.
> >>> 
> >>> I can see that the QueueFeeder contains only 100 records and
> >>> doesn't replenish.
> >>> 
> >>> Any leads?
> >>> 
> >>> Thanks,
> >>> 
> >>> Fadzi
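For reference, the override for this property goes in conf/nutch-site.xml (which takes precedence over nutch-default.xml); something like:

    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>
      <description>Maximum number of outlinks processed per page;
      a negative value means all outlinks are processed.</description>
    </property>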

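On the HTTP question: one way to keep the outlink limit at -1 without an unbounded crawl might be to cap the crawl itself with the -depth and -topN options of the crawl command, which limit the number of fetch rounds and the number of top-scoring URLs fetched per round. A rough sketch only - the "urls" seed directory and "crawl" output directory below are just placeholder names:

    # 3 rounds, at most 1000 urls fetched per round
    bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

That way every outlink still makes it into the crawldb, but each segment stays a manageable size.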