Hello, Dennis,

Thanks again for your response.  I am really amazed that things can't go 
right.  I have verified my configuration in nutch-site.xml and I have 
already filled in all the fields mentioned in your e-mail.  I have even 
copied the file nutch-site.xml to a sub-folder under the ROOT folder in 
Tomcat.  Still no results, although the log does not show any problems. 
Just for your information, I will reproduce two sections of the log:

The first one, just when starting the crawl:

2006-09-28 17:15:43,930 INFO  http.Http - http.agent = 
qualidade/0.8.1(qualidade e meio ambiente; http://www.qualidade.eng.br; 
[EMAIL PROTECTED])

and, the final section, after all the indexing and optimization:

2006-09-28 17:25:58,551 INFO  indexer.Indexer - Indexer: done
2006-09-28 17:25:58,556 INFO  indexer.DeleteDuplicates - Dedup: starting
2006-09-28 17:25:58,593 INFO  indexer.DeleteDuplicates - Dedup: adding 
indexes in: teste/indexes
2006-09-28 17:26:01,356 INFO  indexer.DeleteDuplicates - Dedup: done
2006-09-28 17:26:01,358 INFO  indexer.IndexMerger - Adding 
teste/indexes/part-00000
2006-09-28 17:26:02,377 INFO  crawl.Crawl - crawl finished: teste

Then I go to the "teste" folder and start Tomcat from there, as in Nutch 
0.7.2, get that nice search page, try something and... zero results!
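One thing I would double-check (this is my assumption about the cause, not something visible in the logs): in 0.8 the search webapp locates the index through the searcher.dir property, which defaults to "crawl" and is resolved relative to the directory Tomcat is started from.  Starting Tomcat from inside "teste" would make the webapp look for "teste/crawl", which does not exist.  A sketch of the property as it could be set in the webapp's nutch-site.xml (the path below is a placeholder for wherever the "teste" directory actually lives):

```xml
<property>
 <name>searcher.dir</name>
 <!-- absolute path to the crawl output: the directory that
      contains crawldb, segments, index and indexes -->
 <value>/path/to/teste</value>
 <description>Path to the crawl directory the search webapp
 should use.</description>
</property>
```

After setting it, restart Tomcat and retry the query.  If I recall the 0.7 behavior correctly, the old webapp effectively searched relative to the current directory, which is why the start-Tomcat-from-the-crawl-folder habit worked in 0.7.2 but silently finds nothing in 0.8.1.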

Any new ideas?

Thanks,
W. Melo



----- Original Message ----- 
From: "Dennis Kubes" <[EMAIL PROTECTED]>
To: <[email protected]>
Sent: Thursday, September 28, 2006 6:19 PM
Subject: Re: no results in nutch 0.8.1


> This is what we have; I hope this clears up some of the confusion.  It will 
> show up in the log files of the sites that you crawl like this.  I don't know if the 
> configuration is what is causing your problem but I have talked to other 
> people on the list with similar problems where their configuration was 
> incorrect.  I think the only thing that is "required" is for the 
> http.agent.name not to be blank but I would set all of the other options 
> as well, just for politeness.
>
> Dennis
>
> The log files of crawled sites will record a crawler string similar to this:
> NameOfAgent/1.0_(Yourwebsite.com;_http://www.yoururl.com/bot.html;[EMAIL 
> PROTECTED])
>
> <!-- HTTP properties -->
> <property>
>  <name>http.agent.name</name>
>  <value>NameOfAgent</value>
>  <description>Our HTTP 'User-Agent' request header.</description>
> </property>
>
> <property>
>  <name>http.robots.agents</name>
>  <value>NutchCVS,Nutch,NameOfAgent,*</value>
>  <description>The agent strings we'll look for in robots.txt files,
>  comma-separated, in decreasing order of precedence.</description>
> </property>
>
> <property>
>  <name>http.robots.403.allow</name>
>  <value>true</value>
>  <description>Some servers return HTTP status 403 (Forbidden) if
>  /robots.txt doesn't exist. This should probably mean that we are
>  allowed to crawl the site nonetheless. If this is set to false,
>  then such sites will be treated as forbidden.</description>
> </property>
>
> <property>
>  <name>http.agent.description</name>
>  <value>Yourwebsite.com</value>
>  <description>Further description of our bot- this text is used in
>  the User-Agent header.  It appears in parenthesis after the agent name.
>  </description>
> </property>
>
> <property>
>  <name>http.agent.url</name>
>  <value>http://yoururl.com</value>
>  <description>A URL to advertise in the User-Agent header.  This will
>   appear in parenthesis after the agent name.
>  </description>
> </property>
>
> <property>
>  <name>http.agent.email</name>
>  <value>[EMAIL PROTECTED]</value>
>  <description>An email address to advertise in the HTTP 'From' request
>   header and User-Agent header.</description>
> </property>
>
> <property>
>  <name>http.agent.version</name>
>  <value>1.0</value>
>  <description>A version string to advertise in the User-Agent
>   header.</description>
> </property>
>
> carmmello wrote:
>> Thanks for your answer, Dennis, but yes, I did.  The only thing I did not 
>> do (and I have some doubt about it) is that in http.agent.version I only 
>> used the name Nutch-0.8.1, not the name I used in http.robots.agents, 
>> although in this configuration I have kept the *.  Also, in the log 
>> file, I cannot find any error regarding this.
>>
>> ----- Original Message ----- From: "Dennis Kubes" 
>> <[EMAIL PROTECTED]>
>> To: <[email protected]>
>> Sent: Wednesday, September 27, 2006 7:59 PM
>> Subject: Re: no results in nutch 0.8.1
>>
>>
>>> Did you setup the user agent name in the nutch-site.xml file or 
>>> nutch-default.xml file?
>>>
>>> Dennis
>>>
>>> carmmello wrote:
>>>> I have followed the steps in the 0.8.1 tutorial and I have also been 
>>>> using Nutch for some time now, without seeing the kind of problem I am 
>>>> encountering now.
>>>> After I have finished the crawl process (intranet crawling), I go to 
>>>> localhost:8080, try a search and get, no matter what, 0 results.
>>>> Looking at the logs, everything seems OK.  Also, if I use the command 
>>>> bin/nutch readdb "crawl/crawldb", I find more than 6000 URLs.
>>>> So, why can't I get any results?
>>>> Thanks
>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
> 


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
