[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16209547#comment-16209547
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---------------------------------------

marconett commented on issue #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189#issuecomment-337633586
 
 
   I'm running into the same problem and am unable to inject sitemap content 
into the db. Here are the commands I used (output omitted, it's the same as 
above):
   
   ```
   bin/nutch inject crawl/crawldb urls/
   bin/nutch sitemap crawl/crawldb -sitemapUrls sitemaps/ -noStrict -noFilter -noNormalize
   bin/nutch readdb crawl/crawldb -stats
   ```
   
   where `urls/seed.txt` contains "https://www.linux.com/" and 
`sitemaps/seed.txt` contains "https://www.linux.com/sitemap.xml".
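   
   For anyone trying to reproduce this, the two seed files can be recreated 
roughly like this (a minimal sketch; it assumes it is run from the Nutch 
runtime directory, so that the relative `urls/` and `sitemaps/` paths match 
the commands above):
   
   ```shell
   # Create the two seed directories referenced by the commands above.
   mkdir -p urls sitemaps
   # Seed URL passed to `bin/nutch inject`.
   printf '%s\n' "https://www.linux.com/" > urls/seed.txt
   # Sitemap URL passed to `bin/nutch sitemap -sitemapUrls`.
   printf '%s\n' "https://www.linux.com/sitemap.xml" > sitemaps/seed.txt
   ```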
   
   I can see (via tcpdump) that HTTPS connections to linux.com are 
established while `bin/nutch sitemap` is running, but nothing gets injected 
into the crawldb.
   
   Is there any info on this? Should this be fixed?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Markus Jelsma
>             Fix For: 1.14
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465.patch, NUTCH-1465.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
