Hi:

Another possibility is that some of your files are larger than the default
file content size limit. The default limit is 65536 bytes, i.e. 64 KB. If
that's the case, you can remove the limit like this:

<property>
  <name>file.content.limit</name>
  <value>-1</value>
</property>
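
This override normally goes in conf/nutch-site.xml, which takes precedence
over the defaults in nutch-default.xml; a value of -1 means fetched content
is not truncated at all. Since you also changed db.max.outlinks.per.page, a
minimal nutch-site.xml with both limits relaxed might look roughly like this
(just a sketch, adjust the values to taste):

<?xml version="1.0"?>
<configuration>
  <!-- -1 = do not truncate fetched file content (default 65536 bytes) -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <!-- -1 = process every outlink found on a page (default 100) -->
  <property>
    <name>db.max.outlinks.per.page</name>
    <value>-1</value>
  </property>
</configuration>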

good luck

yanky

2009/4/13 Dennis Kubes <[email protected]>

> Could it be that you have duplicate URLs once they are normalized?
>
> Dennis
>
>
> Fadzi Ushewokunze wrote:
>
>> yanky,
>>
>> Thanks for that.
>> I am running on Linux, so I definitely don't have spaces in my file
>> names.
>> Having said that, I changed db.max.outlinks.per.page from 100 to 1000
>> and started getting exactly 500 documents instead of the 600 or so. So
>> I changed it to -1 and I am still getting 500 docs! Not sure what's
>> going on here.
>>
>>
>>
>> On Mon, 2009-04-13 at 11:17 +0800, yanky young wrote:
>>
>>> Hi:
>>>
>>> I have encountered a similar problem with local Windows file system
>>> search with Nutch 0.9. You can see my post here:
>>>
>>> http://www.nabble.com/nutch-0.9-protocol-file-plugin-break-with-windows-file-name-that--contains-space-td22903785.html
>>>
>>> Hope it helps.
>>>
>>> good luck
>>>
>>> yanky
>>>
>>>
>>> 2009/4/13 Fadzi Ushewokunze <[email protected]>
>>>
>>>> Hi,
>>>>
>>>> I am having a problem with a file system crawl: I have about 600 URLs
>>>> pointing to text files in a folder, and only 100 of them are getting
>>>> fetched and indexed.
>>>>
>>>> I have +^file://* in my regex-urlfilter.txt and crawl-urlfilter.txt,
>>>> so every file should be picked up. I have created my own text parser
>>>> and IndexingFilter plugins as well. Not sure whether this has
>>>> something to do with the problem; I don't think it does.
>>>>
>>>> I can see that the QueueFeeder contains only 100 records and doesn't
>>>> replenish them.
>>>>
>>>> Any leads?
>>>>
>>>> Thanks,
>>>>
>>>> Fadzi
>>>>
>>>>
>>>>
>>>>
>>
