Hey guys,

Definitely no duplicate URLs - the filenames are all in
xxxx-timestamp.txt format.

I just retried what yanky suggested - setting db.max.outlinks.per.page
to -1. After the crawl there are 589 records in the index out of 594. I
did see a couple of 404 errors, which might account for the 5 missing
records.
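
For reference, the override I'm using looks roughly like this in
conf/nutch-site.xml (the description text is just my own note):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
    <description>-1 means no limit on outlinks recorded per page.</description>
  </property>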

Not sure why it didn't work properly before - this time around I
deleted the previous crawl indexes etc., did a clean build, and got the
result above, which is quite acceptable. So thanks for the heads-up.

Now, what's the impact of this setting (i.e. db.max.outlinks.per.page =
-1) when I start crawling over the http protocol? It seems I might end
up with a very long crawl and a large index.
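
The kind of bound I had in mind was per round rather than per page,
something along these lines (directory names are just placeholders):

  # one-shot crawl: -topN caps the URLs generated per round, -depth caps the number of rounds
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000

But I'm not sure whether that is enough to keep the crawl and index
manageable with unlimited outlinks per page - hence the question.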

Any suggestions?


On Sun, 2009-04-12 at 22:44 -0500, Dennis Kubes wrote:
> Could it be that you have duplicate URLs once they are normalized?
> 
> Dennis
> 
> Fadzi Ushewokunze wrote:
> > yanky,
> > 
> > Thanks for that.
> > 
> > I am running on Linux, so I definitely don't have spaces in my file
> > names.
> > 
> > Having said that, I changed db.max.outlinks.per.page from 100 to 1000
> > and started getting exactly 500 documents instead of the 600 or so. Then
> > I changed it to -1 and am still getting 500 docs! Not sure what's going
> > on here.
> > 
> > 
> > 
> > On Mon, 2009-04-13 at 11:17 +0800, yanky young wrote:
> >> Hi:
> >>
> >> I have encountered a similar problem with local Windows file system search
> >> with Nutch 0.9. You can see my post here:
> >> http://www.nabble.com/nutch-0.9-protocol-file-plugin-break-with-windows-file-name-that--contains-space-td22903785.html
> >> Hope it helps.
> >>
> >> good luck
> >>
> >> yanky
> >>
> >>
> >> 2009/4/13 Fadzi Ushewokunze <[email protected]>
> >>
> >>> Hi,
> >>>
> >>> I am having a problem with a file system crawl: I have about 600
> >>> URL text files in a folder, and only 100 of them are getting fetched
> >>> and indexed.
> >>>
> >>> I have +^file://* in my regex.urlfilter.txt and crawl.urlfilter.txt, so
> >>> every file should be picked up. I have created my own text parser and
> >>> IndexingFilter plugins as well. Not sure whether this could have something
> >>> to do with the problem; I don't think it does.
> >>>
> >>> I can see that the QueueFeeder contains only 100 records and doesn't
> >>> replenish.
> >>>
> >>> Any leads?
> >>>
> >>> Thanks,
> >>>
> >>> Fadzi
> >>>
> >>>
> >>>
> > 
