[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072959#comment-16072959
 ] 

Lewis John McGibbney commented on NUTCH-1465:
---------------------------------------------

[~markus17] when I attempt to process the following sitemap, 
http://www.autotrader.com/sitemap.xml, the new processor does not appear to 
process anything. The crawldb data structures are produced, but no entries are 
added to them. Can you please rescope the patch and ensure it is the most 
up-to-date one you are working with? Thanks

{code}
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total records rejected by filters: 0
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total sitemaps from HostDb: 0
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total sitemaps from seed urls: 1
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total failed sitemap fetches: 0
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Total new sitemap entries added: 0
2017-07-03 15:32:09,213 INFO  util.SitemapProcessor - SitemapProcessor: Finished at 2017-07-03 15:32:09, elapsed: 00:00:19
{code}

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.14
>
>         Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, 
> NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, 
> NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



