[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888220#comment-13888220 ]

Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

??(1) fetch interval: ...??
+1, sounds plausible.

??(2) score: Never use value from sitemap. For new ones, use scoring filters. 
Keep the value of old entries as it is.??
That means using {{ScoringFilter.initialScore(...)}} for new ones?
Why not use the priority for newly found URLs? If the site owner takes it 
seriously, the score can be useful. We could make it configurable, e.g. by a 
factor {{sitemap.priority.factor}}: if it's 0.0, priority is not used. Usually, 
the factor should be low, to avoid that the total score in the web graph (cf. 
[FixingOpicScoring|http://wiki.apache.org/nutch/FixingOpicScoring]) gets too 
high when "injecting" 50,000 URLs from sitemaps, each with priority 1.0. 
Alternatively, we could just put the values from the sitemap into CrawlDatum's 
meta data and "delegate" any action to set the score to scoring filters or 
FetchSchedule implementations. Users can then more easily adapt any sitemap 
logic to their needs (cf. below).
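A minimal sketch of the factor idea above. The class and method names are hypothetical (not actual Nutch API); only the property name {{sitemap.priority.factor}} is the one proposed here:

```java
// Sketch: blend the sitemap <priority> into the score of a newly found URL.
// A factor of 0.0 disables priority use; a small factor keeps the total
// score mass injected from large sitemaps low.
public class SitemapScoreSketch {

  /**
   * @param initialScore score from ScoringFilter.initialScore(...)
   * @param priority     sitemap &lt;priority&gt; value, in [0.0, 1.0]
   * @param factor       value of sitemap.priority.factor (hypothetical property)
   */
  static float scoreForNewUrl(float initialScore, float priority, float factor) {
    if (factor == 0.0f) {
      return initialScore; // priority ignored entirely
    }
    return initialScore + factor * priority;
  }

  public static void main(String[] args) {
    // With factor 0.0 the sitemap priority has no effect on the score.
    System.out.println(scoreForNewUrl(1.0f, 1.0f, 0.0f));
    // With a small factor, even priority 1.0 adds only a little score.
    System.out.println(scoreForNewUrl(1.0f, 1.0f, 0.01f));
  }
}
```

Even 50,000 injected URLs with priority 1.0 then add only 50,000 * factor to the web graph's total score, which a low factor keeps negligible.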

??(3) modified time: Always use the value from sitemap, provided it's not a 
date in the future.??
Um, this approach seems conceptually wrong (and was also wrong in 
SitemapInjector). The modified time in CrawlDb must indicate the time of the 
last fetch, or the modified time sent by the server when the page was fetched. 
If we overwrite the modified time, the server may just answer "not modified" 
to an If-Modified-Since request, and we'll never get the current version of 
the page. So we must not touch the modified time, even for newly discovered 
pages, where it must be 0. If it's not zero, the If-Modified-Since header 
field is sent although the page has never been fetched, cf. HttpResponse.java. 
If we can trust the sitemap, the desired behaviour would be to set the fetch 
time (in CrawlDb = the time when the next fetch should happen) to now (or to 
the sitemap modified time) if (and only if) sitemap.modif > crawldb.modif. 
This would make sure that changed pages are fetched as soon as possible. If 
the sitemap is not 100% trustworthy, we should be more careful. 
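The "trusted sitemap" rule above could be sketched as follows. Class, method, and parameter names are illustrative only (not actual Nutch or FetchSchedule API); times are epoch milliseconds as in CrawlDatum:

```java
// Sketch: decide the next fetch time from a sitemap <lastmod> value.
// The CrawlDb modified time itself is never touched (see the argument
// above about If-Modified-Since); only the scheduled fetch time moves.
public class SitemapFetchTimeSketch {

  static long nextFetchTime(long crawlDbFetchTime, long crawlDbModifiedTime,
                            long sitemapModifiedTime, long now) {
    if (sitemapModifiedTime > now) {
      // Date in the future: sitemap value is bogus, keep the schedule.
      return crawlDbFetchTime;
    }
    if (sitemapModifiedTime > crawlDbModifiedTime) {
      // Page changed since the last fetch: pull the fetch forward to now.
      return now;
    }
    // Nothing new according to the sitemap: keep the existing schedule.
    return crawlDbFetchTime;
  }
}
```

Delegating this decision to a FetchSchedule implementation would then just mean moving such a method behind that interface, so per-project config decides how much the sitemap is trusted.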
Could we again delegate this decision (trustworthy or not) to scoring filter or 
FetchSchedule implementations? Whether we can trust a sitemap may depend on 
concrete crawler config/project and should be configurable. Would this require 
a new method in scoring/schedule interfaces?

More open questions than before!? Comments are welcome!

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
