Hi,

I solved this problem.

For some of the URLs, the site owner has set up robots.txt protection.

The robots.txt file sits on the server side and tells crawlers and search
engines not to index the site.

Because of this, the crawler cannot fetch the content, so an exception is
thrown during indexing.
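
For reference, a minimal robots.txt like the one below (a generic example, not
the actual file on that server) is enough to make Nutch skip every page on the
site, because the fetcher honours the robots exclusion protocol:

    # example robots.txt -- blocks all crawlers from the whole site
    User-agent: *
    Disallow: /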

Thank you for your help.

regards,

Gong Zhao


2008/7/28 wuqi <[EMAIL PROTECTED]>

>  Try setting the log level for the dedup program to DEBUG in your
> log4j.properties file; that may reveal the cause.
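>
> For example, with the stock Nutch log4j.properties (an assumption about your
> setup), a single extra line such as
>
>     # log the dedup job at DEBUG level
>     log4j.logger.org.apache.nutch.indexer.DeleteDuplicates=DEBUG
>
> makes the DeleteDuplicates job log the details of the failure to the
> configured appender.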
>
> ----- Original Message -----
> *From:* 宫照 <[EMAIL PROTECTED]>
> *To:* nutch-user@lucene.apache.org ; [EMAIL PROTECTED]
> *Sent:* Monday, July 28, 2008 2:43 PM
> *Subject:* Re: nutch fetched but no indexed
>
> Hi,
>
> Thank you, wuqi, for your help.
>
> I checked with Luke and could not find the page in the index.
>
> Now I have imported the source into Eclipse and debugged it. I found that an
> exception is thrown here:
> org.apache.nutch.indexer.DeleteDuplicates.java, line 439:
> JobClient.runJob(job);
>
> The exception is:
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>     at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
> It is an exception from Hadoop.
>
> If I use other URLs it works fine; the exception only occurs for some
> specific URLs.
>
> Does anybody know the reason?
>
> regards,
>
> Gong Zhao
>
>
>
> 2008/7/25 wuqi <[EMAIL PROTECTED]>
>
>> This problem can't be figured out with a single simple command. A few
>> points that I hope are helpful:
>>
>> 1. Why do you think the page is not indexed? Just because it can't be
>> searched? You can use the Lucene index tool Luke to check whether the page
>> is actually in the index.
>> 2. If the page is not in the index, check the status of this page in the
>> crawldb; if it is db_fetched, then check whether it exists in the segment
>> files (see the example commands below).
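>>
>> For example, assuming a Nutch 0.9-style command line and a crawl directory
>> named "crawl" (both of these are assumptions about your setup), something
>> like the following shows the crawldb status of one URL and dumps a segment
>> so you can see whether the page was fetched and parsed:
>>
>>     # print the crawldb entry (status, fetch time, metadata) for one URL
>>     bin/nutch readdb crawl/crawldb -url http://compass.mot.com/go/247460034/mydoc.pdf
>>
>>     # dump one segment (placeholder timestamp) to a readable text file
>>     bin/nutch readseg -dump crawl/segments/20080728000000 segdump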
>>
>>
>>
>> ----- Original Message -----
>> From: "宫照" <[EMAIL PROTECTED]>
>> To: <nutch-user@lucene.apache.org>; <[EMAIL PROTECTED]>
>> Sent: Friday, July 25, 2008 9:53 AM
>> Subject: Re: nutch fetched but no indexed
>>
>>
>> > Hi Patrick,
>> >
>> > Thank you for your advice.
>> >
>> > My nutch-site.xml file is already set up as you said, and I can search
>> > PDF files under other URLs.
>> >
>> > Only the files under the URL I mentioned before cannot be indexed.
>> >
>> > I guess it may be related to the type of URL, because the log shows the
>> > pages were fetched but not indexed.
>> >
>> > anybody can help me?
>> >
>> > regards,
>> >
>> > Gong Zhao
>> >
>> >
>> >
>> > 2008/7/24 Patrick Markiewicz <[EMAIL PROTECTED]>:
>> >
>> >> Hi Gong Zhao,
>> >>        Make sure you have the parse-pdf plugin enabled in your
>> >> nutch-site.xml file.
>> >> For example:
>> >> <property>
>> >>  <name>plugin.includes</name>
>> >>  <value>...|parse-(xml|text|html|js|pdf)|...</value>
>> >>  <description>
>> >>  </description>
>> >> </property>
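>> >>
>> >> For reference, a complete value based on the default plugin.includes with
>> >> pdf added might look like the line below; treat it as an illustration
>> >> rather than your exact configuration:
>> >>
>> >> protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)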
>> >>
>> >> That's the only thing I can think of at first glance.
>> >>
>> >> Patrick
>> >> -----Original Message-----
>> >> From: 宫照 [mailto:[EMAIL PROTECTED]
>> >> Sent: Wednesday, July 23, 2008 11:27 PM
>> >> To: nutch-user@lucene.apache.org
>> >> Subject: nutch fetched but no indexed
>> >>
>> >> Hi everybody,
>> >>
>> >> I am facing a problem when using Nutch. I use Nutch to crawl an
>> >> intranet, and it worked well before. Recently I added some URLs to the
>> >> crawl that are different from the usual ones. The new URLs look like
>> >> this:
>> >> http://compass.mydomain.com/go/247460034
>> >>
>> >> There are many folders and documents under this URL, for example a
>> >> folder:
>> >> http://compass.mot.com/go/247460034/2354342276
>> >> and a document:
>> >> http://compass.mot.com/go/247460034/mydoc.pdf
>> >>
>> >> After the crawl, the documents under this kind of URL cannot be
>> >> searched. When I check the log, I see that these URLs were fetched but
>> >> not indexed.
>> >>
>> >> I don't know why. Can you tell me what to do?
>> >>
>> >> regards,
>> >>
>> >> Gong Zhao
>> >>
>> >
>>
>
>
