[jira] [Comment Edited] (NUTCH-1465) Support sitemaps in Nutch

Tejas Patil (JIRA) Sun, 15 Dec 2013 16:11:14 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848723#comment-13848723 ]


Tejas Patil edited comment on NUTCH-1465 at 12/16/13 12:09 AM:
---------------------------------------------------------------

Hi [~wastl-nagel],

Nice share. My only objection to that approach is that users will have to 
collect the sitemap URLs for their hosts *manually* and feed them to the 
sitemap injector. It fits well where users are performing targeted crawling.
For a large-scale, open web crawl use case:
i) the number of initial hosts can be large: a one-time burden for users
ii) the crawler discovers new hosts over time: a constant pain for users, who 
must watch for newly discovered hosts and then get their sitemaps from 
robots.txt manually.
With HostDB from NUTCH-1325 and B, users won't suffer here.
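
To illustrate the manual step in (ii), here is a minimal sketch of pulling 
sitemap URLs out of a host's robots.txt. The class and method names are 
hypothetical and are not taken from Nutch or from either attached patch; the 
only assumption about the format is that sitemaps are declared on lines of the 
form "Sitemap: <url>". Treating everything after '#' as a comment is a 
simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not Nutch code): collects "Sitemap:" entries from robots.txt text. */
final class RobotsSitemapExtractor {

    private static final String DIRECTIVE = "sitemap:";

    /** Returns the sitemap URLs declared in the given robots.txt content, in order. */
    static List<String> extractSitemaps(String robotsTxt) {
        List<String> sitemaps = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Simplification: cut the line at the first '#', then trim whitespace.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            // Match the directive name case-insensitively at the start of the line.
            if (line.regionMatches(true, 0, DIRECTIVE, 0, DIRECTIVE.length())) {
                String url = line.substring(DIRECTIVE.length()).trim();
                if (!url.isEmpty()) {
                    sitemaps.add(url);
                }
            }
        }
        return sitemaps;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\n"
                + "Disallow: /private/\n"
                + "Sitemap: http://example.com/sitemap.xml\n";
        System.out.println(extractSitemaps(robots));
    }
}
```

Doing this by hand for every newly discovered host is the burden described 
above; an automated step would run something like it once per host.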

> do we really need an extra DB?
I should have been clearer in my explanation. "sitemapDB" is a temporary 
location where the crawl datums of all sitemap entries would be written. It 
can be deleted after it is merged with the main crawlDB. This is quite 
analogous to what the inject operation does.
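
The temporary-store-then-merge idea can be sketched in a few lines. This is 
only an illustration under stated assumptions: in-memory maps stand in for the 
on-disk databases, the status strings are placeholders, and the merge policy 
(an entry already in the crawlDB is kept as-is) is an assumption, not something 
decided in this issue. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch (not Nutch code): merge a temporary sitemap store into the main store. */
final class SitemapMergeSketch {

    /**
     * Copies entries from the temporary sitemapDb into crawlDb, then empties
     * the temporary store. Assumed policy: URLs already in crawlDb are untouched.
     */
    static void mergeAndDiscard(Map<String, String> crawlDb, Map<String, String> sitemapDb) {
        for (Map.Entry<String, String> entry : sitemapDb.entrySet()) {
            crawlDb.putIfAbsent(entry.getKey(), entry.getValue());
        }
        // The temporary store has served its purpose once the merge is done.
        sitemapDb.clear();
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new HashMap<>();
        crawlDb.put("http://example.com/", "fetched");

        Map<String, String> sitemapDb = new HashMap<>();
        sitemapDb.put("http://example.com/", "unfetched");
        sitemapDb.put("http://example.com/a.html", "unfetched");

        mergeAndDiscard(crawlDb, sitemapDb);
        System.out.println(crawlDb.size() + " URLs in crawlDb, "
                + sitemapDb.size() + " left in sitemapDb");
    }
}
```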

> NUTCH-1622 would enable solution A: outlinks now can hold extra info.
I didn't know that. Still, I would favor B, as it is clean, whereas A would 
involve touching the existing codebase in several places.



> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
