This is what we have; I hope it clears up some of the confusion.  It
will show up in the log files of the sites that you crawl, as shown
below.  I don't know whether the configuration is what is causing your
problem, but I have talked to other people on the list with similar
problems whose configuration was incorrect.  I think the only thing
that is "required" is that http.agent.name not be blank, but I would
set all of the other options as well, just for politeness.

Dennis

The log file will record a crawler entry similar to this:
NameOfAgent/1.0_(Yourwebsite.com;_http://www.yoururl.com/bot.html;[EMAIL PROTECTED])
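
For what it's worth, the User-Agent string is assembled from the
properties below roughly like this (just a sketch of the format, not
Nutch's actual code; the email address is a placeholder):

public class AgentStringExample {
  public static void main(String[] args) {
    String name = "NameOfAgent";                      // http.agent.name
    String version = "1.0";                           // http.agent.version
    String description = "Yourwebsite.com";           // http.agent.description
    String url = "http://www.yoururl.com/bot.html";   // http.agent.url
    String email = "[email protected]";             // http.agent.email (placeholder)

    // Format: name/version (description; url; email)
    String userAgent = name + "/" + version
        + " (" + description + "; " + url + "; " + email + ")";
    System.out.println(userAgent);
  }
}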

<!-- HTTP properties -->
<property>
  <name>http.agent.name</name>
  <value>NameOfAgent</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchCVS,Nutch,NameOfAgent,*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence.</description>
</property>
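
To illustrate what "decreasing order of precedence" means in practice,
here is a rough sketch (my own illustration, assuming the robots.txt
has already been parsed into per-agent rule blocks; this is not Nutch's
parser):

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class RobotsAgentPick {
  // Sketch only: pick the robots.txt rule block that applies to us, given
  // the comma-separated precedence list from http.robots.agents.
  static List<String> pickRules(String robotsAgents,
                                Map<String, List<String>> rulesByAgent) {
    for (String agent : robotsAgents.split(",")) {
      List<String> rules = rulesByAgent.get(agent.trim().toLowerCase());
      if (rules != null) {
        return rules;               // first match in the list wins
      }
    }
    return Collections.emptyList(); // no matching block at all
  }

  public static void main(String[] args) {
    Map<String, List<String>> parsed =
        Map.of("nameofagent", List.of("Disallow: /private/"));
    System.out.println(pickRules("NutchCVS,Nutch,NameOfAgent,*", parsed));
  }
}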

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>
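
In other words, the flag only changes what a 403 on /robots.txt is
taken to mean; a sketch of that decision (my illustration, not Nutch's
code):

public class Robots403Example {
  // Sketch only: the decision that http.robots.403.allow controls once the
  // fetch of /robots.txt has returned an HTTP status code.
  static boolean allowedToCrawl(int robotsTxtStatus, boolean allow403) {
    if (robotsTxtStatus == 403) {
      return allow403;  // true: treat the 403 like a missing robots.txt
    }
    return true;        // 404 (missing) or 200 (defer to the parsed rules, not shown)
  }

  public static void main(String[] args) {
    System.out.println(allowedToCrawl(403, true));   // true  -> crawl the site
    System.out.println(allowedToCrawl(403, false));  // false -> site treated as forbidden
  }
}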

<property>
  <name>http.agent.description</name>
  <value>Yourwebsite.com</value>
  <description>Further description of our bot - this text is used in
  the User-Agent header.  It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://yoururl.com</value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value>[EMAIL PROTECTED]</value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header.</description>
</property>
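
If you want to see what "advertise in the HTTP 'From' request header"
looks like on the wire, here is a plain java.net sketch (my own
example, not Nutch code; the address and target URL are placeholders):

import java.net.HttpURLConnection;
import java.net.URL;

public class FromHeaderExample {
  public static void main(String[] args) throws Exception {
    // Sketch only: one request carrying the headers a polite crawler advertises.
    URL target = new URL("http://www.example.com/");  // any page you are allowed to fetch
    HttpURLConnection conn = (HttpURLConnection) target.openConnection();
    conn.setRequestProperty("User-Agent",
        "NameOfAgent/1.0 (Yourwebsite.com; http://www.yoururl.com/bot.html; [email protected])");
    conn.setRequestProperty("From", "[email protected]");  // http.agent.email (placeholder)
    System.out.println("HTTP " + conn.getResponseCode());
  }
}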

<property>
  <name>http.agent.version</name>
  <value>1.0</value>
  <description>A version string to advertise in the User-Agent
   header.</description>
</property>

carmmello wrote:
> Thanks for your answer, Dennis, but yes, I did.  The only thing I did
> not do (and I have some doubt about it) is that in http.agent.version
> I only used the Nutch-0.8.1 name, not the name I used in
> http.robots.agents, although in this configuration I have kept the *.
> Also, in the log file, I cannot find any error regarding this.
>
> ----- Original Message ----- From: "Dennis Kubes" 
> <[EMAIL PROTECTED]>
> To: <[email protected]>
> Sent: Wednesday, September 27, 2006 7:59 PM
> Subject: Re: no results in nutch 0.8.1
>
>
>> Did you setup the user agent name in the nutch-site.xml file or 
>> nutch-default.xml file?
>>
>> Dennis
>>
>> carmmello wrote:
>>> I have followed the steps in the 0.8.1 tutorial, and I have also
>>> been using Nutch for some time now without seeing the kind of
>>> problem I am encountering now.
>>> After I have finished the crawl process (intranet crawling), I go to
>>> localhost:8080, try a search, and get, no matter what, 0 results.
>>> Looking at the logs, everything seems OK.  Also, if I use the
>>> command bin/nutch readdb "crawl/crawldb"  I find more than 6000 URLs.
>>> So why can't I get any results?
>>> Thanks
>>>
>>
>>
>
