Is there a plugin of some sort that I need in order to crawl the non-HTML files on a web site (one that serves up a collection of XML documents)?

I have tried to crawl an Apache server of mine that has a directory listing of several hundred XML files, but it failed with:

051108 113634 fetching http://www.example.com/example.xml.43779
051108 113640 fetch okay, but can't parse http://www.example.com/example.xml.43792, reason: failed(2,203): Content-Type not text/html: application/xml
051108 113640 fetching http://buildhost.kozoru.com/example.xml.43812

When I stripped out all the XML and left just raw text, I received the following error:

051108 113634 fetching http://www.example.com/example.xml.43779-txt
051108 113640 fetch okay, but can't parse http://www.example.com/example.xml.43792-txt, reason: failed(2,203): Content-Type not text/html: application/xml
051108 113640 fetching http://www.example.com/example.xml.43812-txt

So you can see that neither is parsing correctly, and I'm not entirely sure why. Is there any way I can parse a collection of non-HTML files and be able to search it?
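
In both runs the failure line reports the same Content-Type, application/xml, even for the text-only copies. A minimal sketch for tallying the content types named in such failure lines is shown here. The regular expression is derived only from the log lines quoted above, so treat the pattern as an assumption rather than a documented Nutch log format; the function name rejected_content_types is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the quoted log lines only (assumption):
# "... can't parse <url>, reason: ...: Content-Type not <expected>: <actual>"
PARSE_FAILURE = re.compile(
    r"can't parse (?P<url>\S+), reason: .*?"
    r"Content-Type not (?P<expected>\S+): (?P<actual>\S+)"
)


def rejected_content_types(log_lines):
    """Count the Content-Type values reported in parse-failure lines."""
    counts = Counter()
    for line in log_lines:
        match = PARSE_FAILURE.search(line)
        if match:
            counts[match.group("actual")] += 1
    return counts


# Two lines copied from the second log excerpt above.
sample = [
    "051108 113634 fetching http://www.example.com/example.xml.43779-txt",
    "051108 113640 fetch okay, but can't parse "
    "http://www.example.com/example.xml.43792-txt, reason: failed(2,203): "
    "Content-Type not text/html: application/xml",
]

print(rejected_content_types(sample))
```

Running this over a full crawl log would show whether every rejected document is being served as application/xml, which is the type the parser is refusing in the excerpts above.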

I guess I'm confused about the fundamentals of Nutch. If someone could point me in the right direction, I'd greatly appreciate it. Thanks.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
