[Nutch-dev] RE: parse-rss fetch problems

Chris Mattmann Wed, 20 Apr 2005 21:05:38 -0700

Hi Marco,

  The issue that you are having is that the parse-html plugin is getting
called by default on the content that you are trying to parse. This may have
to do with the MIME type mappings, and the new improved way (that J. Charron
worked on) that Nutch is currently using. So, basically there needs to be an
entry in the mime types content file to detect that the file type is RSS,
and set the content type to "application/rss+xml", which will cause the
parse-rss content parser to be invoked. The problem right now for you is
that it is not being invoked.


  The bigger issue, however, is how you get the byte sequences (the
so-called "magic characters") in the mime types configuration file to
recognize that a file is in fact an RSS file. With so many different types
of valid feeds (RSS 2.0, 0.9, 1.0, Atom, and its many versions), how do you
reliably and accurately detect by magic character matchers that a file is
RSS? The first bytes of the file may be *completely* different in all
these valid feed types. The only thing you could probably detect is the fact
that the file is of type text/xml. You would then need a way to go one step
further and recognize that the XML file is also RSS.
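  One possible way around the magic-byte limitation, sketched here in Java
under stated assumptions: parse only far enough to read the name of the root
element, and classify the document from that. The class FeedSniffer and its
Kind enum are hypothetical names used for illustration and are not part of
Nutch or its MIME type system; the sketch relies only on the JDK's StAX API
(javax.xml.stream). Treating every rdf:RDF root as an RSS 0.9/1.0 feed is a
simplification, since other RDF documents share that root.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper (not Nutch code): classify content by its XML root element.
final class FeedSniffer {

    enum Kind { RSS, RDF, ATOM, OTHER_XML, NOT_XML }

    static Kind detect(byte[] content) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Avoid DTD processing and external entity resolution while sniffing.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader =
                factory.createXMLStreamReader(new ByteArrayInputStream(content));
            try {
                while (reader.hasNext()) {
                    // Skip the prolog (comments, processing instructions) and
                    // stop at the first element, which is the document root.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        switch (reader.getLocalName()) {
                            case "rss":  return Kind.RSS;   // RSS 0.91 / 2.0
                            case "RDF":  return Kind.RDF;   // assumed RSS 0.9 / 1.0
                            case "feed": return Kind.ATOM;  // Atom
                            default:     return Kind.OTHER_XML;
                        }
                    }
                }
                return Kind.NOT_XML; // no element found at all
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed up to the root element: treat as non-XML.
            return Kind.NOT_XML;
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>",
            "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"/>",
            "<feed xmlns=\"http://www.w3.org/2005/Atom\"/>",
            "<html><body/></html>",
            "plain text, not XML"
        };
        for (String sample : samples) {
            System.out.println(detect(sample.getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

  Reading the root element costs more than matching a fixed byte prefix, but
it does not depend on how the prolog, encoding declaration, or leading
comments happen to be written.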

  So, the long story short is, let me look into how this could be done with
J. Charron's new MIME type system. I'll try and think about how this could
be done. In the meantime, try and see if you can get the MIME type system
to recognize that the file is in fact XML. Because, if you do that, then a
quick and dirty solution for your problem would be to just edit the
parse-rss plugin.xml file, and change it to handle content type "text/xml"
instead of "application/rss+xml", which is what's currently in there. I've
coded the RSSParser to accept both "application/rss+xml" *and* "text/xml",
so once the plugin gets called, it would work fine from there.
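
  To illustrate what accepting both content types can look like, here is a
minimal Java sketch. It is not the actual RSSParser code: the class name
AcceptedTypes and the method accepts are hypothetical, and stripping
parameters such as "; charset=UTF-8" before comparing is an assumption about
how the content type string might arrive.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not the real RSSParser): decide whether a content type
// string is one of the two types the parser is willing to handle.
final class AcceptedTypes {

    private static final Set<String> ACCEPTED =
        Set.of("application/rss+xml", "text/xml");

    static boolean accepts(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop any parameters, e.g. "text/xml; charset=UTF-8" -> "text/xml".
        int semi = contentType.indexOf(';');
        String base = semi >= 0 ? contentType.substring(0, semi) : contentType;
        return ACCEPTED.contains(base.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(accepts("application/rss+xml"));
        System.out.println(accepts("text/xml; charset=UTF-8"));
        System.out.println(accepts("text/html"));
    }
}
```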


Does that make sense? If not, just let me know. I got your prior email with
the info about checking out your system. I have some free time tonight, so
I'll give it a look see and let you know if I can set that up for you.

Thanks,
  Chris Mattmann


-----Original Message-----
From: Marco PV [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 20, 2005 7:24 PM
To: [email protected]
Subject: parse-rss fetch problems

Hi,

I'm using /nutch-nightly from April 18th.
I've downloaded and uploaded the latest src/plugin/parse-rss (src) and
/plugin/parse-rss (bin).
I've also compiled it with "ant", with no errors.
I've edited nutch-default.xml and modified the "parse-(rss|text|html)".
Should I edit the new mime.type files?

But when trying to fetch, it can't parse either .xml or .rss files.
I get the error "indexed, but can't parse : content type not text/html; 
content type is "text/xml".
What should I do?

Please, help.

Thanks,
Marco
