[Nutch-general] Re: Crawling blogs and RSS

Miguel A Paraz Sun, 30 Oct 2005 11:50:58 -0800

On 10/19/05, Chris Mattmann <[EMAIL PROTECTED]> wrote:
>  Actually it's not out of priority, unfortunately because of the generic
> nature of the mime type "text/xml". Turns out that a lot of RSS comes back
> as configured by the web server with the content type "text/xml", even
> though it's recommended that "application/rss+xml" be used as the mime type
> for RSS. Most web server admins don't really spend the time configuring this
> mime type correctly in their web server. Further, if you go look at the IANA
> list of mime types, there really isn't a mime type specified for RSS
> (although RDF has application/rdf+xml, which is sometimes used to identify
> RSS as well).


Hi,
I just realized: we don't have to look inside the XML file. We can
pick it up from context.

1. We could look inside the <head/> for links like:

<link rel="alternate" type="application/rss+xml" title="RSS 2.0"
href="http://migs.paraz.com/w/feed/" />
<link rel="alternate" type="application/atom+xml" title="Atom 0.3"
href="http://migs.paraz.com/w/feed/atom/" />

Is it practical to add a parser type to the Outlink type, so that the
HTML parser could set it from context?
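
To make idea 1 concrete, here is a rough sketch of the detection step only, in Python rather than as a Nutch plugin. It does not touch Nutch's Outlink class; the names FeedLinkParser, find_feed_links and FEED_TYPES are made up for illustration, and I am assuming only the two MIME types shown above matter.

```python
from html.parser import HTMLParser

# Assumption: only these two advertised types are treated as feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects (href, type) pairs from <link rel="alternate"> tags inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head:
            # Attribute values can be None for bare attributes; normalise to "".
            a = {k: (v or "") for k, v in attrs}
            rels = a.get("rel", "").lower().split()
            mime = a.get("type", "").strip().lower()
            if "alternate" in rels and mime in FEED_TYPES and a.get("href"):
                self.feeds.append((a["href"], mime))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_feed_links(html):
    """Return the feed links a page advertises, in document order."""
    parser = FeedLinkParser()
    parser.feed(html)
    return parser.feeds
```

Each pair carries the advertised type next to the URL, which is the hint the HTML parser would pass along so that the fetched "text/xml" content could be routed to the right feed parser.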

2. We could add a new inject type: inject a list of feed URLs as the
starting point for the crawl. Technically, this isn't necessary, since
an external program could parse the feeds and then generate the URLs.
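
The external program from idea 2 could look something like the sketch below. It assumes RSS 2.0 or Atom input (both the Atom 1.0 and the older 0.3 namespace); RSS 1.0/RDF is not handled. The function name urls_from_feed is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Atom 1.0 and Atom 0.3 namespaces; RSS 1.0 (RDF) is deliberately left out.
ATOM_NAMESPACES = ("http://www.w3.org/2005/Atom", "http://purl.org/atom/ns#")


def urls_from_feed(xml_text):
    """Return the entry URLs found in an RSS 2.0 or Atom feed document."""
    root = ET.fromstring(xml_text)
    urls = []

    # RSS 2.0: <rss><channel><item><link>URL</link></item></channel></rss>
    # The channel's own <link> is skipped because only items are matched.
    for link in root.findall("./channel/item/link"):
        if link.text and link.text.strip():
            urls.append(link.text.strip())

    # Atom: <feed><entry><link rel="alternate" href="URL"/></entry></feed>
    # A missing rel attribute is treated as "alternate".
    for ns in ATOM_NAMESPACES:
        for entry in root.findall(f"{{{ns}}}entry"):
            for link in entry.findall(f"{{{ns}}}link"):
                if link.get("rel", "alternate") == "alternate" and link.get("href"):
                    urls.append(link.get("href"))

    return urls
```

Writing the result out one URL per line would give a plain-text list of seeds; I am assuming that is a form the injector can already consume, so no new inject type would be needed.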

