[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tejas Patil updated NUTCH-1465:
-------------------------------

    Attachment: NUTCH-1465-trunk.v2.patch

Attaching NUTCH-1465-trunk.v2.patch, which implements *option (B)*:
_Have separate job for the sitemap stuff and merge its output into the crawldb_

+I have covered both cases in this patch:+
1. Users running a targeted crawl who want sitemaps injected from a list of 
sitemap URLs - the use case which [~wastl-nagel] had pointed out.
2. Large open web crawls where users cannot afford to generate sitemap seeds 
for all the hosts and want Nutch to inject sitemaps automatically.
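
To make the idea behind option (B) a bit more concrete, here is a rough Python sketch of "parse a sitemap, then merge the result into the crawldb". It is not taken from the patch: the function names, the dict-based "crawldb" and the "unfetched"/"fetched" status strings are all made up for illustration. The only thing assumed from outside this thread is the standard sitemaps.org <urlset> format.

```python
import xml.etree.ElementTree as ET

# Namespace used by <urlset> documents in the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return {url: lastmod or None} for each <url> entry of a <urlset> sitemap."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url_el in root.findall(SITEMAP_NS + "url"):
        loc = url_el.findtext(SITEMAP_NS + "loc")
        if not loc:
            continue  # skip entries without a usable <loc>
        lastmod = url_el.findtext(SITEMAP_NS + "lastmod")
        entries[loc.strip()] = lastmod.strip() if lastmod else None
    return entries


def merge_into_crawldb(crawldb, sitemap_entries):
    """Merge sitemap entries into a toy crawldb (hypothetical dict of records).

    Known URLs keep their existing record and only gain a lastmod hint;
    URLs seen for the first time are added as "unfetched".
    The input crawldb is left unmodified.
    """
    merged = {url: dict(rec) for url, rec in crawldb.items()}
    for url, lastmod in sitemap_entries.items():
        rec = merged.setdefault(url, {"status": "unfetched"})
        if lastmod is not None:
            rec["lastmod"] = lastmod
    return merged


if __name__ == "__main__":
    sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://example.com/a</loc><lastmod>2014-01-20</lastmod></url>
      <url><loc>http://example.com/b</loc></url>
    </urlset>"""
    crawldb = {"http://example.com/a": {"status": "fetched"}}
    print(merge_into_crawldb(crawldb, parse_sitemap(sample)))
```

Keeping the parse step separate from the merge step mirrors the "separate job" wording of option (B): the sitemap output can be produced on its own and folded into the existing records afterwards.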

+To try out this patch:+
1. Apply the patch for the HostDb feature 
(https://issues.apache.org/jira/secure/attachment/12624178/NUTCH-1325-trunk-v4.patch)
2. Apply this patch (NUTCH-1465-trunk.v2.patch)
3. (optional) Add this to conf/log4j.properties at line 11:
{noformat}
log4j.logger.org.apache.nutch.util.SitemapProcessor=INFO,cmdstdout
{noformat}
4. Run using 
{noformat}
bin/nutch org.apache.nutch.util.SitemapProcessor
{noformat}

I have started working on a *wiki page* describing this feature: 
https://wiki.apache.org/nutch/SitemapFeature 

Any suggestions and comments are welcome.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
