[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13848561#comment-13848561 ]

Tejas Patil commented on NUTCH-1465:
------------------------------------

Revisited this Jira after a long time and gave some thought to how this can be 
done cleanly. There are two ways to implement it:

*(A) Do the sitemap processing in the fetch phase of the nutch cycle.*
This was my original approach, which the (in-progress) patch follows. It would 
involve tweaking core nutch classes in several places.

Pros:
- Sitemaps are nothing but normal pages with several outlinks, so they fit 
well into the 'fetch' cycle.

Cons:
- Sitemaps can be very large. Fetching them needs bigger size and time 
limits, so the fetch code must special-case sitemap urls and apply custom 
limits => leads to a hacky coding style.
- The Outlink class cannot hold the extra information contained in sitemaps 
(like lastmod and changefreq). We could modify it to hold this information 
too, but that info is specific to sitemaps, yet we would end up making all 
outlinks carry it. Alternatively, we could create a special type of outlink, 
as sketched below.
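
For illustration, a minimal sketch of what such a special outlink could look 
like, assuming the Nutch 1.x Outlink API; the class and field names here are 
hypothetical, not part of the attached patch:

{code:java}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.net.MalformedURLException;

import org.apache.hadoop.io.Text;
import org.apache.nutch.parse.Outlink;

/**
 * Hypothetical "special type of outlink" that also carries the
 * per-url metadata found in sitemaps (lastmod, changefreq).
 */
public class SitemapOutlink extends Outlink {

  private long lastModified;      // <lastmod> as epoch millis, 0 if absent
  private String changeFrequency; // <changefreq>: always, hourly, daily, ...

  public SitemapOutlink() {
    super();
  }

  public SitemapOutlink(String toUrl, String anchor, long lastModified,
      String changeFrequency) throws MalformedURLException {
    super(toUrl, anchor);
    this.lastModified = lastModified;
    this.changeFrequency = changeFrequency == null ? "" : changeFrequency;
  }

  public long getLastModified() { return lastModified; }

  public String getChangeFrequency() { return changeFrequency; }

  @Override
  public void write(DataOutput out) throws IOException {
    super.write(out);                       // url + anchor, as before
    out.writeLong(lastModified);            // sitemap-specific extras
    Text.writeString(out, changeFrequency);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    super.readFields(in);
    lastModified = in.readLong();
    changeFrequency = Text.readString(in);
  }
}
{code}

Regular outlinks stay untouched; only the sitemap code path would emit this 
subtype, at the cost of instanceof checks wherever the extra fields are read.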

*(B) Have a separate job for the sitemap processing and merge its output into 
the crawldb.*
i. User populates a list of hosts (or uses the HostDB from NUTCH-1325). Now we 
have all the hosts to be processed.
ii. Run a map-reduce job: for each host,
          - get the robots page and extract the sitemap urls,
          - get the xml content of these sitemap pages,
          - create crawl datums with the required info and write them to a 
sitemapDB (see the sketch after step iii).

iii. Use the CrawlDbMerger utility to merge the sitemapDB and the crawldb.
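
To make step ii concrete, here is a rough sketch of the per-host work, 
assuming the crawler-commons robots and sitemap parsers. The class name 
SitemapProcessor, the naive fetch() helper and the fixed fetch interval are 
all illustrative, not from the patch; the real job would run this inside a map 
task and emit the datums keyed by url into the sitemapDB:

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.nutch.crawl.CrawlDatum;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;
import crawlercommons.sitemaps.AbstractSiteMap;
import crawlercommons.sitemaps.SiteMap;
import crawlercommons.sitemaps.SiteMapParser;
import crawlercommons.sitemaps.SiteMapURL;

public class SitemapProcessor {

  private final SiteMapParser sitemapParser = new SiteMapParser();
  private final SimpleRobotRulesParser robotsParser = new SimpleRobotRulesParser();

  /** Collects one CrawlDatum per sitemap entry advertised by the host. */
  public List<CrawlDatum> processHost(String host) throws Exception {
    List<CrawlDatum> datums = new ArrayList<CrawlDatum>();

    // (1) fetch robots.txt and pull out the "Sitemap:" directives
    URL robotsUrl = new URL("http://" + host + "/robots.txt");
    BaseRobotRules rules = robotsParser.parseContent(robotsUrl.toString(),
        fetch(robotsUrl), "text/plain", "Nutch");

    // (2) fetch and parse every advertised sitemap
    for (String sitemapUrl : rules.getSitemaps()) {
      URL url = new URL(sitemapUrl);
      AbstractSiteMap sm = sitemapParser.parseSiteMap("text/xml", fetch(url), url);
      if (sm.isIndex()) {
        continue; // a sitemap index would be expanded recursively here
      }
      // (3) one crawl datum per entry, carrying the sitemap metadata
      for (SiteMapURL entry : ((SiteMap) sm).getSiteMapUrls()) {
        CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_INJECTED,
            30 * 24 * 3600); // placeholder interval of 30 days
        if (entry.getLastModified() != null) {
          datum.setModifiedTime(entry.getLastModified().getTime());
        }
        datums.add(datum); // the real job would emit <entry.getUrl(), datum>
      }
    }
    return datums;
  }

  /** Naive fetch; the real job would go through Nutch's protocol plugins. */
  private static byte[] fetch(URL url) throws Exception {
    InputStream in = url.openStream();
    try {
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      for (int n; (n = in.read(buf)) != -1;) {
        out.write(buf, 0, n);
      }
      return out.toByteArray();
    } finally {
      in.close();
    }
  }
}
{code}

Step iii would then just be the existing merge tool, e.g. something along the 
lines of: bin/nutch mergedb merged_db crawl/crawldb sitemapdb (with -filter / 
-normalize as needed).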

Pros:
- Cleaner code. 
- Users have control over when to perform sitemap extraction. This is better 
than (A), wherein sitemap urls are sitting in the crawldb and get fetched along 
with normal pages (thus eating into the fetch time of every fetch phase). We 
can have a sitemap_frequency used inside the crawl script so that users can say 
that after 'x' nutch cycles, run sitemap processing.

Cons:
- Additional map-reduce jobs are needed. I think this is reasonable: running 
the sitemap job 1-5 times a month on a production-level crawl would work out 
well.

I am inclined towards implementing (B).

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.9
>
>         Attachments: NUTCH-1465-trunk.v1.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html


