[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1465:
-----------------------------------

    Attachment: NUTCH-1465-sitemapinjector-trunk-v1.patch

Hi Tejas,
attached you'll find a patch for a sitemap injector. Originally written by 
Hannes Schwarz, it has been in use by us for some time. The patch contains a 
revised and improved version which, however, still needs some work (see the 
TODOs in the code).
The use case is somewhat different from way B: the sitemap injector takes the 
URLs of sitemaps directly (not via robots.txt) and injects the URLs listed in 
them straight into the CrawlDb (no extra sitemapDB; do we really need an extra 
DB?). Robots.txt is not used as an intermediate step/hop because experience has 
shown that customers often prepare a special sitemap for the site search 
crawler, which differs from the sitemap advertised in robots.txt.
Btw., NUTCH-1622 would enable solution A: outlinks can now hold extra info.
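
To make the sitemap-injector idea above more concrete, here is a minimal 
sketch (not code from the attached patch) of the first half of such a job: 
reading a sitemap document and turning each <url> entry into a seed line of 
the form url<TAB>key=value. The class name SitemapSeedSketch, the method 
toSeedLines and the metadata key sitemap.priority are hypothetical and chosen 
only for illustration; the actual injection into the CrawlDb is left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration; not taken from the NUTCH-1465 patch. */
class SitemapSeedSketch {

  /**
   * Returns one line per <url> entry of the given sitemap: the <loc> value,
   * followed by a tab and "sitemap.priority=<value>" if a <priority> is present.
   * The key "sitemap.priority" is an assumption made for this sketch.
   */
  static List<String> toSeedLines(String sitemapXml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      // Sitemaps come from untrusted sites, so refuse DOCTYPE declarations.
      factory.setFeature(
          "http://apache.org/xml/features/disallow-doctype-decl", true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document doc = builder.parse(
          new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));

      List<String> lines = new ArrayList<>();
      // "*" matches any namespace, so namespaced and plain sitemaps both work.
      NodeList urls = doc.getElementsByTagNameNS("*", "url");
      for (int i = 0; i < urls.getLength(); i++) {
        Element url = (Element) urls.item(i);
        String loc = childText(url, "loc");
        if (loc == null || loc.isEmpty()) {
          continue; // skip entries without a location
        }
        String priority = childText(url, "priority");
        lines.add(priority == null
            ? loc
            : loc + "\t" + "sitemap.priority=" + priority);
      }
      return lines;
    } catch (Exception e) {
      throw new IllegalArgumentException("Cannot parse sitemap", e);
    }
  }

  /** Trimmed text of the first descendant with the given local name, or null. */
  private static String childText(Element parent, String localName) {
    NodeList nodes = parent.getElementsByTagNameNS("*", localName);
    return nodes.getLength() == 0
        ? null
        : nodes.item(0).getTextContent().trim();
  }
}
```

Calling SitemapSeedSketch.toSeedLines(xml) on the content of a fetched sitemap 
yields lines that could be written to a seed file or mapped to CrawlDb entries; 
how lastmod, changefreq and priority should influence score and fetch interval 
is a separate design question.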

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
