Hi - Nutch has a parser for RSS and ATOM on-board: https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/feed/FeedParser.html
You must configure it in your plugin.includes to use it.

Regards,
Markus

-----Original message-----
> From:Ankit Goel <ankitgoel2...@gmail.com>
> Sent: Thursday 2nd November 2017 10:11
> To: user@nutch.apache.org
> Subject: Re: sitemap and xml crawl
>
> Hi Yossi,
> I have 2 kinds of RSS links, both of the form domain.com/rss/feed.xml.
> One is the standard RSS feed that we see, which becomes the starting point
> for crawling further as we can pull links from it.
>
>   <item>
>     <title>
>       <![CDATA[
>       Article headline
>       ]]>
>     </title>
>     <link>
>       article url
>     </link>
>     <pubDate> date </pubDate>
>     <dc:creator>
>       <![CDATA[ author ]]>
>     </dc:creator>
>     <description>
>       <![CDATA[
>       One line descriptor tag line
>       ]]>
>     </description>
>   </item>
>   <item>
>     …
>   </item>
>
> The other one also includes the content within the XML itself, so it doesn’t
> need further crawling.
> I have standalone XML parsers in Java that I can use directly, but obviously,
> crawling is an important part, because it documents all the links traversed
> so far.
>
> What would you advise?
>
> Regards,
> Ankit Goel
>
> > On 02-Nov-2017, at 2:04 PM, Yossi Tamari <yossi.tam...@pipl.com> wrote:
> >
> > Hi Ankit,
> >
> > If you are looking for a Sitemap parser, I would suggest moving to 1.14
> > (trunk). I've been using it, and it is probably in better shape than 1.13.
> > If you need to parse your own format, the answer depends on the details. Do
> > you need to crawl pages in this format where each page contains links in XML
> > that you need to crawl? Or is this more like Sitemap where the XML is just
> > the initial starting point?
> > In the second case, maybe just write something outside of Nutch that will
> > parse the XML and produce a seed file?
> > In the first case, the link you sent is not relevant. You need to implement
> > a Parser:
> > http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/Parser.html
> > I haven't done that myself. My suggestion is that you take a look at
> > the built-in parser at
> > https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
> > Google found this article on developing a custom parser, which might be a
> > good starting point:
> > http://www.treselle.com/blog/apache-nutch-with-custom-parser/
> >
> > Yossi.
> >
> >> -----Original Message-----
> >> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> >> Sent: 02 November 2017 10:24
> >> To: user@nutch.apache.org
> >> Subject: Re: sitemap and xml crawl
> >>
> >> Hi Yossi,
> >> So I need to make a custom parser. Where do I start? I found this link:
> >> https://wiki.apache.org/nutch/HowToMakeCustomSearch
> >> Is this the right place, or should I be looking at creating a plugin
> >> page? Any advice would be helpful.
> >>
> >> Thank you,
> >> Ankit Goel
> >>
> >>> On 02-Nov-2017, at 1:14 PM, Yossi Tamari <yossi.tam...@pipl.com> wrote:
> >>>
> >>> Hi Ankit,
> >>>
> >>> According to this: https://issues.apache.org/jira/browse/NUTCH-1465,
> >>> sitemap is a 1.14 feature.
> >>> I just checked, and the command indeed exists in 1.14. I did not test
> >>> that it works.
> >>>
> >>> In general, Nutch supports crawling anything, but you might need to
> >>> write your own parser for custom protocols.
> >>>
> >>> Yossi.
> >>>
> >>>> -----Original Message-----
> >>>> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> >>>> Sent: 01 November 2017 18:55
> >>>> To: user@nutch.apache.org
> >>>> Subject: sitemap and xml crawl
> >>>>
> >>>> Hi,
> >>>> I need to crawl an XML feed, which includes the URL, title and content
> >>>> of the articles on the site.
> >>>>
> >>>> The documentation on the site says that bin/nutch sitemap exists, but
> >>>> on my Nutch 1.13 sitemap is not a command in bin/nutch. So does Nutch
> >>>> support crawling sitemaps? Or XML links?
> >>>>
> >>>> Regards,
> >>>> Ankit Goel
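
For reference, the option raised in the thread above of parsing the XML outside of Nutch and producing a seed file could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name FeedSeeds and the method extractLinks are hypothetical and not part of Nutch, the feed is assumed to follow the <item>/<link> layout quoted above, and the input file is assumed to be UTF-8. It uses only the JDK's built-in DOM parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of Nutch: collects the <link> of every
// <item> in an RSS document so the URLs can be written to a seed file.
class FeedSeeds {

    // Returns the trimmed text of each direct <link> child of each <item>.
    static List<String> extractLinks(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds are untrusted input, so refuse DOCTYPE declarations (XXE guard).
            factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));

            List<String> links = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                NodeList children = items.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.ELEMENT_NODE
                            && "link".equals(child.getNodeName())) {
                        // getTextContent also unwraps CDATA sections.
                        String url = child.getTextContent().trim();
                        if (!url.isEmpty()) {
                            links.add(url);
                        }
                    }
                }
            }
            return links;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    // Assumed usage: java FeedSeeds feed.xml > urls/seed.txt
    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: FeedSeeds <feed.xml>");
            System.exit(1);
        }
        // Assumption: the feed file is UTF-8 encoded.
        String xml = new String(
            Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String url : extractLinks(xml)) {
            System.out.println(url);
        }
    }
}
```

The output is one URL per line, which is the shape of a plain seed list; the file and directory names in the usage comment are placeholders. This only covers the case where the feed is a starting point for the crawl. For the feed that carries the article content itself, the on-board feed parser or a custom Parser plugin, as discussed above, is the more likely fit.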