Hi,

I found the reason for that exception!
If you look carefully at my crawl.log you will notice these lines:

060104 213608 Parsing
[http://220.000.000.001/otd_04_Detailed_Design_Document.doc] with
[EMAIL PROTECTED]
060104 213609 Unable to successfully parse content
http://220.000.000.001/otd_04_Detailed_Design_Document.doc of type
application/msword
060104 213609 Error parsing:
http://220.000.000.001/otd_04_Detailed_Design_Document.doc:
notparsed(0,0)
060104 213609 Using Signature impl: org.apache.nutch.crawl.MD5Signature

I have one Word document which cannot be parsed properly, and this
causes the issue. If I remove this document then Nutch finishes
correctly. But if the list of files contains this Word document (or
even if it is the only document to be crawled) then I always receive
that exception.

Can anybody look at this issue?
If anybody is interested in that Word document I can send it (but I
really wouldn't like to see it becoming a regular part of the Nutch
test package; you know, it is some kind of internal document [though
it does not contain any useful information] :-)

But in general I think there should be some test in Nutch to ensure
that it can withstand such documents.
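Such a test would essentially check that a corrupt document makes the
parser report failure rather than throw. Below is a minimal,
self-contained sketch of that idea; the `Parser` interface and the
truncated-`.doc` byte payload are hypothetical stand-ins, not the
actual Nutch parse plugin API:

```java
// Sketch of a "parser must not throw on corrupt input" check.
// The Parser interface below is a hypothetical stand-in for a real
// parse plugin; a robust parser reports failure (e.g. null) instead
// of throwing and aborting the whole fetch.
public class ParseRobustnessTest {

    interface Parser {
        String parse(byte[] content); // null means "could not parse"
    }

    static boolean parsesWithoutException(Parser p, byte[] content) {
        try {
            p.parse(content);   // result may be null; that is fine
            return true;        // failure was reported, not thrown
        } catch (RuntimeException e) {
            return false;       // a crash here would kill the crawl
        }
    }

    public static void main(String[] args) {
        // A truncated OLE2 header stands in for the problem .doc file.
        byte[] corruptDoc = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, 0x00};

        // Fragile parser: throws on short input (the buggy behavior).
        Parser fragile = content -> {
            if (content.length < 512) throw new RuntimeException("unexpected EOF");
            return "text";
        };
        // Robust parser: signals failure by returning null.
        Parser robust = content -> content.length < 512 ? null : "text";

        System.out.println(parsesWithoutException(fragile, corruptDoc)); // false
        System.out.println(parsesWithoutException(robust, corruptDoc));  // true
    }
}
```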

Regards,
Lukas

On 1/5/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> Hi Andrzej,
>
> This is what sets Fetcher to parse to true or false, right?
>
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content.</description>
> </property>
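>
> So, to turn parsing during fetch off (and run it as a separate step
> instead), one would override it in nutch-site.xml, roughly like this
> (a sketch; only the property name above is from my config):
>
> <property>
>   <name>fetcher.parse</name>
>   <value>false</value>
> </property>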
>
> I don't have my nutch-default and nutch-site files with me right now,
> but I am fairly sure I didn't change this value in my nutch-site (and
> I didn't change nutch-default at all).
>
> So the answer is YES, Fetcher is in parsing mode (with ~95% confidence).
>
> I am running Nutch against my local Apache server (not visible to
> you). But you may have noticed that I used depth=2, so only a few
> pages (16 to be exact) are crawled. If you are interested I can send
> them all to you so that you can upload this content to any server you
> need for your tests.
>
> Look into the crawl.log file (attached to the previous email sent at
> 8:21am today) for details.
>
> I will try to simulate this issue with one or two arbitrary HTML
> pages. If that reproduces the issue then I can send them to you.
>
> Lukas
>
> On 1/5/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > Lukas Vlcek wrote:
> >
> > >How can I learn that?
> > >What I do is running regular one-step command [/bin/nutch crawl]
> > >
> > >
> >
> > In that case your nutch-default.xml / nutch-site.xml decides; there is a
> > boolean option there. If you didn't change it, then it defaults to
> > true (i.e. your fetcher is parsing the content).
> >
> > Would it be easy to reproduce this if I knew the seed URLs? If
> > that's the case, please send me the seed URLs (contact me off the
> > list if they're sensitive).
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> >
>


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
