[Nutch-general] Parsing XML files

Mike Reynols Tue, 08 Nov 2005 10:00:13 -0800

Is there a plugin of some sort that I need in order to take a web site (which serves up a collection of XML documents) and crawl its non-HTML files?

I have tried to crawl an Apache server of mine that has a directory listing of several hundred XML files, but it failed with:


051108 113634 fetching http://www.example.com/example.xml.43779

051108 113640 fetch okay, but can't parse http://www.example.com/example.xml.43792, reason: failed(2,203): Content-Type not text/html: application/xml

051108 113640 fetching http://buildhost.kozoru.com/example.xml.43812

Now when I stripped out all the XML and left just raw text, I received the following error:


051108 113634 fetching http://www.example.com/example.xml.43779-txt

051108 113640 fetch okay, but can't parse http://www.example.com/example.xml.43792-txt, reason: failed(2,203): Content-Type not text/html: application/xml

051108 113640 fetching http://www.example.com/example.xml.43812-txt
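
To see at a glance which URLs are being rejected and what Content-Type the server reported for each, log lines like the ones above can be picked apart with a short script. This is only a rough sketch: it assumes the log format is exactly what is pasted above (with or without a space after "parse"), and the name `parse_failures` is made up for illustration; it is not part of Nutch.

```python
import re

# Matches lines such as:
#   ... fetch okay, but can't parse <url>, reason: failed(2,203): Content-Type not text/html: application/xml
# Whitespace after "parse" and after "failed(...):" is optional, since the
# pasted log may have lost it. This pattern is an assumption based only on
# the lines shown above, not on any documented Nutch log format.
FAILURE_RE = re.compile(
    r"fetch okay, but can't parse\s*(?P<url>\S+?),\s*reason:\s*"
    r"failed\(\d+,\d+\):\s*"
    r"Content-Type not (?P<expected>\S+):\s*(?P<actual>\S+)"
)


def parse_failures(log_text):
    """Return a list of (url, expected_type, actual_type) for each parse failure."""
    failures = []
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            failures.append(
                (match.group("url"), match.group("expected"), match.group("actual"))
            )
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python parse_failures.py < crawl.log
    for url, expected, actual in parse_failures(sys.stdin.read()):
        print(f"{url}: wanted {expected}, got {actual}")
```

Lines that only say "fetching ..." are skipped, so the output lists just the documents the parser refused.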

So you can see that neither is parsing correctly, and I'm not entirely sure why. Is there any way I can parse a collection of non-HTML files and be able to search it?

I guess I'm confused about the fundamentals of Nutch. If someone could please point me in the right direction, that'd be greatly appreciated. Thanks.

