Nutch appears to be indexing only the first 200 files in a given directory
(one of the directories listed in seeds.txt).  The evidence I have for this
is:

1.) I am not getting the search hits I'd expect.
2.) In the _0.fdt file, only the first 200 html files are referenced, in
alphabetical order.

By "the first 200 files" I mean the first 200 files in one of the
directories in seeds.txt.  The actual number of files referenced in _0.fdt
is 202.  seeds.txt contains 2 directories.  One of these directories
contains 2 files that I'd expect to be indexed, and the paths to both are
listed in _0.fdt.  The other directory contains 2900 files that I'd expect
to be indexed, but only the first 200 of them (in alphabetical order) are
in _0.fdt, which accounts for the total of 202.
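
For anyone who wants to reproduce the check, here is a rough sketch of how
the comparison could be scripted.  It is not part of Nutch or Lucene.  The
function name missing_from_index and its parameters are made up for
illustration, and it assumes that the stored paths appear in _0.fdt as
plain, uncompressed bytes with each file name preceded by "/".

```python
import os


def missing_from_index(directory, fdt_path, suffix=".html"):
    """List files in `directory` whose name is not found in the index file.

    Hypothetical helper: it does a raw byte search of the stored-fields
    file rather than parsing the Lucene format, so it only works if the
    paths are stored uncompressed (an assumption).
    """
    with open(fdt_path, "rb") as fh:
        blob = fh.read()

    # Sort so the result is in the same alphabetical order described above.
    names = sorted(n for n in os.listdir(directory) if n.endswith(suffix))

    # Prefix with "/" so that "a.html" does not match inside "aa.html".
    return [n for n in names if b"/" + n.encode("utf-8") not in blob]
```

Calling missing_from_index with the seed directory and the path to _0.fdt
should return the names that never made it into the index.  If the limit
described above is real, the result would start at the 201st file in
alphabetical order.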

These files are indexed directly from the filesystem.  In general, our
methods of indexing and searching function properly.  We only have a problem
when there are more than 200 files in a directory to be indexed.

I do not see any Nutch configuration that would impose such a limit.  Does
anyone know why this may be happening?  Thanks in advance!

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexed-Files-Limited-to-200-tp2825662p2825662.html
Sent from the Nutch - User mailing list archive at Nabble.com.