Just to let you know that I fixed my problem. I found the following
parameter in nutch-default.xml:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  </description>
</property>

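Since I have many links on each page, I need to increase the value of
this parameter. Outlinks beyond the limit are not processed, which would
explain why the last files never made it into the fetchlist.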
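
For anyone hitting the same limit, here is a rough sketch of what the
override could look like. I am assuming the usual convention that local
changes go in conf/nutch-site.xml rather than in nutch-default.xml, and
that the property sits inside that file's existing root element. The
value 1000 is only an illustration, not a recommendation; pick something
larger than the number of links on your biggest page.

```xml
<!-- Assumed location: conf/nutch-site.xml, inside its existing root element. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <!-- 1000 is an illustrative value; the default shown above is 100. -->
  <value>1000</value>
  <description>The maximum number of outlinks that we'll process for a page.
  </description>
</property>
```

After changing the value, the affected pages probably need to be fetched
and run through updatedb again before the missing links show up in a new
fetchlist.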

Francois

-----Original Message-----
From: Lacoursiere, Francois [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 19, 2005 9:50 AM
To: [email protected]
Subject: Missing files in fetchlist

Hello,
 
I have a small problem. I'm indexing the files of a web server (Apache)
on my intranet. One directory of the intranet contains 50 files. When I
run the generate and fetch commands, I see that the last 3 files are
never fetched.
 
The following 2 workarounds work:
-If I create an index.html file that links to all 50 files, then all
50 files are in the fetchlist and they are indexed.
-If I create a subdirectory, leave 47 files in the parent dir, and move
3 files into the subdirectory, then all 50 files are in the fetchlist
and they are indexed.
 
Do you have an idea what's going wrong?
 
thanks
Francois.
 
Here is the script I use to build the fetch list and index:
:
echo "** Nutch Index 1 iteration"
bin/nutch generate db segments
s1=`ls -d segments/2* | tail -1`
echo $s1
echo "** fetch list"
bin/nutch fetchlist db -local -dumpurls $s1
bin/nutch fetch -local -threads 1 $s1
bin/nutch updatedb db $s1
 
echo "** Nutch Index 2 iteration"
bin/nutch generate db segments
s2=`ls -d segments/2* | tail -1`
echo "** fetch list"
bin/nutch fetchlist db -local -dumpurls $s2
echo $s2
bin/nutch fetch -local -threads 1 $s2
bin/nutch updatedb db $s2
 
echo "** Nutch Index 3 iteration"
bin/nutch generate db segments
s3=`ls -d segments/2* | tail -1`
echo "** fetch list"
bin/nutch fetchlist db -local -dumpurls $s3
echo $s3
bin/nutch fetch -local -threads 1 $s3
bin/nutch updatedb db $s3
 
bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3

 

