Hi: I wonder whether you followed my suggestion and that is why it works now :-). I was talking about the file content size limit, but you are talking about setting db.max.outlinks.per.page. I am not sure which option did the trick for you :-)
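
For reference, both settings go in conf/nutch-site.xml (overriding nutch-default.xml). Something like the sketch below - the property names are the ones we discussed, the values are just examples:

  <configuration>
    <!-- -1 means no limit on the number of outlinks kept per page (default is 100) -->
    <property>
      <name>db.max.outlinks.per.page</name>
      <value>-1</value>
    </property>
    <!-- max bytes of content fetched per file; default is 65536 (64K) -->
    <property>
      <name>file.content.limit</name>
      <value>65536</value>
    </property>
  </configuration>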
As for http crawling: yes, db.max.outlinks.per.page and file.content.limit will both increase your crawl db size, but in different ways. db.max.outlinks.per.page will be more effective at increasing the crawl db size than file.content.limit, because most html pages are smaller than the default 64K content limit. That said, many factors affect the size of the crawldb, such as file types, url filters, the topN parameter, and so on.

yanky

2009/4/13 Fadzi Ushewokunze <[email protected]>

> Hey guys,
>
> Definitely no duplicate urls, because the filenames are in
> xxxx-timestamp.txt format.
>
> I just tried again what yanky suggested about setting -1 in
> db.max.outlinks.per.page. After the crawl there are 589 records in the
> index out of 594. I did see a couple of 404 errors, which might account
> for the 5 missing records though.
>
> Not sure why it didn't work properly before - this time around I deleted
> the last crawl indexes etc., did a clean build, and got the result
> above, which is quite acceptable. So thanks for the heads up.
>
> Now, what is the impact of this when I start crawling over the http
> protocol (i.e. db.max.outlinks.per.page = -1)? It seems I might end up
> with a very long crawl and a large index?!
>
> Any suggestions?
>
>
> On Sun, 2009-04-12 at 22:44 -0500, Dennis Kubes wrote:
> > Could it be you have duplicate urls when they are normalized?
> >
> > Dennis
> >
> > Fadzi Ushewokunze wrote:
> > > yanky,
> > >
> > > Thanks for that;
> > >
> > > I am running on linux, so I definitely don't have spaces in my file
> > > names.
> > >
> > > Having said that, I changed db.max.outlinks.per.page from 100 to 1000
> > > and started getting exactly 500 documents instead of the 600 or so.
> > > So I changed it to -1 and I am still getting 500 docs! Not sure what's
> > > going on here.
> > >
> > >
> > > On Mon, 2009-04-13 at 11:17 +0800, yanky young wrote:
> > >> Hi:
> > >>
> > >> I have encountered a similar problem with local windows file system
> > >> search with nutch 0.9. You can see my post here:
> > >> http://www.nabble.com/nutch-0.9-protocol-file-plugin-break-with-windows-file-name-that--contains-space-td22903785.html
> > >> Hope it helps.
> > >>
> > >> good luck
> > >>
> > >> yanky
> > >>
> > >>
> > >> 2009/4/13 Fadzi Ushewokunze <[email protected]>
> > >>
> > >>> Hi,
> > >>>
> > >>> I am having a problem with a file system crawl. I have about 600
> > >>> url text files in a folder, and only 100 of them are getting
> > >>> fetched and indexed.
> > >>>
> > >>> I have +^file://* in my regex-urlfilter.txt and crawl-urlfilter.txt,
> > >>> so every file should be picked up. I have created my own text parser
> > >>> and IndexingFilter plugins as well. Not sure if this could have
> > >>> something to do with the problem or not. I don't think it does.
> > >>>
> > >>> I can see that the QueueFeeder contains only 100 records but doesn't
> > >>> replenish;
> > >>>
> > >>> Any leads?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Fadzi
> > >>>
> > >>>
> > >>>
> >
>
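PS: about the url filters mentioned in the quoted mails: for a local file system crawl the filter files usually need something like the lines below. This is only a rough sketch based on the stock Nutch 0.9 conf - check your own regex-urlfilter.txt / crawl-urlfilter.txt, since the default ones skip file: urls:

  # the stock filter skips file:, ftp: and mailto: urls - drop file: from
  # that line if you want to crawl the local file system
  -^(ftp|mailto):

  # accept file system urls
  +^file://

  # reject everything else
  -.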
