[Nutch-dev] Re: parse-rss fetch problems

Jérôme Charron Thu, 21 Apr 2005 01:03:03 -0700

> The bigger issue, however, is how you deal with causing the byte sequence
> (or so-called "magic characters") in the mime types configuration file to
> recognize that a file is in fact an RSS file. With so many different types
> of valid feeds (RSS 2.0, 0.9, 1.0, ATOM, and its many versions), how do you
> reliably and accurately detect by magic character matchers that a file is
> RSS? The first bytes of the file may be *completely* different in all
> these valid feed types. The only thing you could probably detect is the fact
> that the file is of type text/xml. Then, you would need a way to then
> understand that it's an XML file, but it's also RSS.



That's right. I took a look at the Freedesktop MIME-type database, and it
doesn't have any magic detection for RSS.
In fact, there's no easy way to detect RSS content.
But the current mime-types definition in Nutch can detect XML content using
the magic sequence <?xml at the beginning of the file.
Then, the RSS parser module needs to check whether this XML file is RSS
content or not.
For now, that's the only solution.
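
To illustrate the two-step check described above, here is a rough sketch in Python. It is not Nutch code: the function names and the set of root elements are assumptions made up for the example. It first applies the <?xml magic test, then parses the document and looks at the root element to decide whether the XML is actually a feed.

```python
import xml.etree.ElementTree as ET

# Magic sequence expected at offset 0 of an XML file.
XML_MAGIC = b"<?xml"

# Assumed namespaces of the RDF-based feed formats (RSS 1.0 and RSS 0.9).
RDF_FEED_NAMESPACES = (
    "http://purl.org/rss/1.0/",
    "http://my.netscape.com/rdf/simple/0.9/",
)


def looks_like_xml(data: bytes) -> bool:
    """Step 1: magic check, the only thing a mime-types file can do."""
    return data.startswith(XML_MAGIC)


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def looks_like_feed(data: bytes) -> bool:
    """Step 2: the parser itself decides whether the XML is a feed."""
    if not looks_like_xml(data):
        return False
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False  # Not well-formed XML, so not a usable feed.

    name = local_name(root.tag)
    if name in ("rss", "feed"):
        # <rss> covers RSS 0.91/0.92/2.0, <feed> covers Atom.
        return True
    if name == "RDF":
        # RSS 1.0 and 0.9 use an rdf:RDF root; require a <channel> child
        # in a feed namespace so that other RDF documents are rejected.
        return any(
            root.find("{%s}channel" % ns) is not None
            for ns in RDF_FEED_NAMESPACES
        )
    return False
```

With this kind of check, a plain XML document that passes the magic test is still rejected by the second step, which is the behavior the RSS parser module would need.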

> parse-rss plugin.xml file, and change it to handle content type "text/xml"
> instead of "application/rss+xml", which is what's currently in there. Then,
> when the code gets called, I've coded the RSSParser to accept both
> "application/rss+xml", *and* "text/xml". So, it would work fine from
> there.
> Does that make sense?

Yes

Jerome


-- 
http://motrech.free.fr/
http://frutch.free.fr/
