[ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888220#comment-13888220 ]
Sebastian Nagel commented on NUTCH-1465:
----------------------------------------

??(1) fetch interval: ...??
+1, sounds plausible.

??(2) score: Never use value from sitemap. For new ones, use scoring filters. Keep the value of old entries as it is.??
That means use {{ScoringFilter.initialScore(...)}} for new ones? Why not use the priority for newly found URLs? If the site owner takes it seriously, the score can be useful. We could make it configurable, e.g. by a factor {{sitemap.priority.factor}}: if it's 0.0, priority is not used. Usually the factor should be low, to avoid the total score in the web graph (cf. [FixingOpicScoring|http://wiki.apache.org/nutch/FixingOpicScoring]) getting too high when "injecting" 50,000 URLs from sitemaps, each with 1.0 priority. Alternatively, we could just put the values from the sitemap into CrawlDatum's meta data and "delegate" any actions that set the score to scoring filters or FetchSchedule implementations. Users could then more easily adapt the sitemap logic to their needs (cf. below).

??(3) modified time: Always use the value from sitemap provided it's not a date in future.??
Hm, this seems conceptually wrong (and was also wrong in SitemapInjector). The modified time in CrawlDb must indicate the time of the last fetch, or the modified time sent by the server when a page was fetched. If we overwrite the modified time, the server may just answer "not modified" to an If-Modified-Since request, and we'll never get the current version of the page. So we must not touch the modified time, even for newly discovered pages, where it must be 0. If it's not zero, an If-Modified-Since header field is sent although the page has never been fetched, cf. HttpResponse.java. If we can trust the sitemap, the desired behaviour would be to set the fetch time (in CrawlDb: the time when the next fetch should happen) to now (or to the sitemap modified time) if (and only if) sitemap.modif > crawldb.modif. This would make sure that changed pages are fetched asap.
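A minimal sketch of the configurable priority factor proposed above. The property name {{sitemap.priority.factor}} and this helper are illustrations of the proposal, not existing Nutch code:

```java
// Hypothetical sketch: blending sitemap priority into the initial score.
// "priorityFactor" would come from a config property such as
// sitemap.priority.factor; nothing here is part of the Nutch API.
public class SitemapScoreSketch {

    static float initialScore(float sitemapPriority, float priorityFactor,
                              float filterScore) {
        // a factor of 0.0 disables sitemap priority entirely
        if (priorityFactor <= 0.0f) {
            return filterScore;
        }
        // keep the factor low so that injecting e.g. 50,000 URLs, each
        // with priority 1.0, does not distort the total score in the
        // web graph
        return filterScore + priorityFactor * sitemapPriority;
    }

    public static void main(String[] args) {
        // factor 0.0: priority ignored, score from scoring filters only
        System.out.println(initialScore(1.0f, 0.0f, 1.0f));
        // small factor: priority nudges the score (approximately 1.08)
        System.out.println(initialScore(0.8f, 0.1f, 1.0f));
    }
}
```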
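The fetch-time rule sketched above (never touch CrawlDb's modified time, only pull the next fetch forward when the sitemap reports a newer change) could look roughly like this. Method and parameter names are illustrative, not Nutch API:

```java
// Hypothetical sketch of the proposed fetch-time handling: the CrawlDb
// modified time is left untouched; only the next-fetch time is updated.
public class SitemapFetchTimeSketch {

    /** Returns the (possibly updated) time of the next scheduled fetch. */
    static long updateFetchTime(long crawldbModifiedTime,
                                long sitemapModifiedTime,
                                long currentFetchTime,
                                long now) {
        // a modified time in the future is not plausible: ignore it
        if (sitemapModifiedTime > now) {
            return currentFetchTime;
        }
        // re-fetch asap if (and only if) sitemap.modif > crawldb.modif,
        // so that changed pages are picked up quickly
        if (sitemapModifiedTime > crawldbModifiedTime) {
            return now;
        }
        // otherwise keep the existing schedule
        return currentFetchTime;
    }
}
```

If the sitemap is only partially trustworthy, this decision is exactly what could be delegated to a FetchSchedule or scoring filter implementation.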
If the sitemap is not 100% trustworthy, we should be more careful. Could we again delegate this decision (trustworthy or not) to scoring filter or FetchSchedule implementations? Whether we can trust a sitemap may depend on the concrete crawler config/project and should be configurable. Would this require a new method in the scoring/schedule interfaces? More open questions than before!? Comments are welcome!

> Support sitemaps in Nutch
> -------------------------
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Lewis John McGibbney
> Assignee: Tejas Patil
> Fix For: 1.8
>
> Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, NUTCH-1465-trunk.v5.patch
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and appears to have been used successfully to parse sitemaps, as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)