Re: Nutch can't find all files

yanky young Wed, 08 Apr 2009 21:59:38 -0700

Hi:

Of course you can look into the code and add some debug lines for your case.
Start with the protocol-file plugin, which handles the file:// scheme.
You can find the plugin code in ${nutch_home}/src/plugin/protocol-file


As for the list of URLs Nutch considers for fetching, you can dump the crawldb
with the nutch readdb command (its -dump option writes the entries to an
output directory).
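
Before adding debug lines, it may also be worth checking which directories are large enough to hit the outlink limit mentioned in the quoted message below. Here is a minimal Python sketch for that. The function name oversized_dirs is made up for illustration, the default of 1000 is taken from the quoted db.max.outlinks.per.page default, and the idea that each directory entry counts as one outlink is an assumption, not something confirmed from the plugin code.

```python
import os
import sys


def oversized_dirs(root, limit=1000):
    """Return sorted (path, entry_count) pairs for directories under root
    that directly contain more than `limit` entries (files + subdirectories).

    Assumption: each directory entry corresponds to one outlink, so such
    directories are candidates for a truncated listing.
    """
    result = []
    for dirpath, dirnames, filenames in os.walk(root):
        count = len(dirnames) + len(filenames)
        if count > limit:
            result.append((dirpath, count))
    return sorted(result)


if __name__ == "__main__":
    # Usage: python oversized_dirs.py /path/to/crawled/tree [limit]
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    limit = int(sys.argv[2]) if len(sys.argv) > 2 else 1000
    for path, count in oversized_dirs(root, limit):
        print(f"{count:8d}  {path}")
```

If this still lists directories after you raised the limit, those are the first places I would compare against the crawldb dump.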

good luck


2009/4/9 Hannu Väisänen <[email protected]>

> On Wed, Apr 08, 2009 at 08:54:37AM +0200, Andrzej Bialecki wrote:
> > Most likely this is related to the setting db.max.outlinks.per.page. The
> > default is 1000. In case of file:// URLs this means that directory
> > listings with more than 1000 entries will be truncated. Solution: simply
> > increase the limit.
>
> That helped a little. Now Nutch is fetching more files but it is still
> skipping files.
>
> I have more questions.
>
> How does Nutch select the files it fetches?
>
> Is it reading every file name in a directory and then selecting what it
> fetches?
>
> Is it possible to output the file names Nutch considers for fetching?
>
> Where do I look in the code? (-:
>
