RE: sitemap and xml crawl

Markus Jelsma Thu, 02 Nov 2017 02:29:18 -0700

Hi - Nutch ships with a parser for RSS and Atom feeds:
https://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/feed/FeedParser.html


You must enable it in your plugin.includes setting to use it.
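
As a rough illustration (not taken from a tested setup), enabling the parser would mean adding the feed plugin's id, which I understand to be "feed", to the plugin.includes property in conf/nutch-site.xml. The rest of the value below is an assumption modelled on a typical 1.x default; keep whatever your installation already lists and only add the feed entry:

```xml
<!-- conf/nutch-site.xml: sketch only. The plugin list is an assumed
     baseline; the only intended change is the added "feed" entry. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|feed|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

It may also be worth checking that conf/parse-plugins.xml maps the feed MIME types (for example application/rss+xml) to the feed plugin, since otherwise another parser might be picked first.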

Regards,
Markus

 
 
-----Original message-----
> From:Ankit Goel <ankitgoel2...@gmail.com>
> Sent: Thursday 2nd November 2017 10:11
> To: user@nutch.apache.org
> Subject: Re: sitemap and xml crawl
> 
> Hi Yossi,
> I have two kinds of RSS links, both of the form
> http://domain.com/rss/feed.xml. One is the standard RSS feed, which
> becomes the starting point for further crawling, as we can pull links
> from it.
> 
> 
> <item>
> <title>
> <![CDATA[
> Article headline
> ]]>
> </title>
> <link>
> article url
> </link>
> <pubDate> date </pubDate>
> <dc:creator>
> <![CDATA[ author ]]>
> </dc:creator>
> <description>
> <![CDATA[
> One line descriptor tag line
> ]]>
> </description>
> </item>
> <item>
> …
> </item>
> 
> The other one also includes the content within the XML itself, so it doesn’t
> need further crawling.
> I have standalone XML parsers in Java that I can use directly, but obviously
> crawling is an important part, because it documents all the links traversed
> so far.
> 
> What would you advise?
> 
> Regards,
> Ankit Goel
> 
> > On 02-Nov-2017, at 2:04 PM, Yossi Tamari <yossi.tam...@pipl.com> wrote:
> > 
> > Hi Ankit,
> > 
> > If you are looking for a Sitemap parser, I would suggest moving to 1.14
> > (trunk). I've been using it, and it is probably in better shape than 1.13.
> > If you need to parse your own format, the answer depends on the details. Do
> > you need to crawl pages in this format where each page contains links in XML
> > that you need to crawl? Or is this more like Sitemap where the XML is just
> > the initial starting point?
> > In the second case, maybe just write something outside of Nutch that will
> > parse the XML and produce a seed file?
> > In the first case, the link you sent is not relevant. You need to implement
> > a
> > http://nutch.apache.org/apidocs/apidocs-1.13/org/apache/nutch/parse/Parser.html.
> > I haven't done that myself. My suggestion is that you take a look at
> > the built-in parser at
> > https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java.
> > Google found this article on developing a custom parser, which might be a
> > good starting point:
> > http://www.treselle.com/blog/apache-nutch-with-custom-parser/.
> > 
> >     Yossi.
> > 
> > 
> >> -----Original Message-----
> >> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> >> Sent: 02 November 2017 10:24
> >> To: user@nutch.apache.org
> >> Subject: Re: sitemap and xml crawl
> >> 
> >> Hi Yossi,
> >> So I need to make a custom parser. Where do I start? I found this link:
> >> https://wiki.apache.org/nutch/HowToMakeCustomSearch. Is this the right
> >> place, or should I be looking at creating a plugin page? Any advice would
> >> be helpful.
> >> 
> >> Thank you,
> >> Ankit Goel
> >> 
> >>> On 02-Nov-2017, at 1:14 PM, Yossi Tamari <yossi.tam...@pipl.com> wrote:
> >>> 
> >>> Hi Ankit,
> >>> 
> >>> According to this: https://issues.apache.org/jira/browse/NUTCH-1465,
> >>> sitemap is a 1.14 feature.
> >>> I just checked, and the command indeed exists in 1.14. I did not test
> >>> that it works.
> >>> 
> >>> In general, Nutch supports crawling anything, but you might need to
> >>> write your own parser for custom protocols.
> >>> 
> >>>   Yossi.
> >>> 
> >>>> -----Original Message-----
> >>>> From: Ankit Goel [mailto:ankitgoel2...@gmail.com]
> >>>> Sent: 01 November 2017 18:55
> >>>> To: user@nutch.apache.org
> >>>> Subject: sitemap and xml crawl
> >>>> 
> >>>> Hi,
> >>>> I need to crawl an XML feed, which includes the URL, title and content
> >>>> of the articles on the site.
> >>>> 
> >>>> The documentation on the site says that bin/nutch sitemap exists, but
> >>>> on my Nutch 1.13, sitemap is not a command in bin/nutch. So does Nutch
> >>>> support crawling sitemaps, or XML links?
> >>>> 
> >>>> Regards,
> >>>> Ankit Goel
> >>> 
> >>> 
> > 
> > 
> 
>
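
Yossi's suggestion in the quoted thread, to parse the XML outside Nutch and produce a seed file, could look roughly like the sketch below. It is only an illustration under assumptions: the function names and the output path are made up, and it expects a well-formed, un-namespaced RSS 2.0 document shaped like the item sample quoted above (including a declared dc namespace on the root element).

```python
import xml.etree.ElementTree as ET


def extract_item_links(rss_xml):
    """Return the whitespace-stripped text of each <item>/<link> element.

    Hypothetical helper: assumes un-namespaced RSS 2.0 elements, as in
    the sample from the thread. Items without a link are skipped.
    """
    root = ET.fromstring(rss_xml)
    links = []
    # iter() walks the whole tree, so <item> is found under <channel> too.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            links.append(link.strip())
    return links


def write_seed_file(links, path):
    """Write one URL per line to a plain-text seed file."""
    with open(path, "w", encoding="utf-8") as fh:
        for url in links:
            fh.write(url + "\n")
```

A call such as write_seed_file(extract_item_links(feed_text), "urls/seed.txt") would then give a file that can sit in the seed directory passed to bin/nutch inject. For the second kind of feed, where the content is already inside the XML, this step alone would not index anything; it only collects the links.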
