[ https://issues.apache.org/jira/browse/NUTCH-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892562#comment-15892562 ]
Sebastian Nagel commented on NUTCH-2363:
----------------------------------------

Hi Markus,

I'm a little bit concerned (and agree with Julien) about the complexity introduced to handle cookies. But I know that it may be the only way to crawl some sites. Nutch does not provide a way to persist per-domain information such as cookies, and on the protocol level cookies are persisted only for a single fetch task.

Some remarks about the patch:
* are you sure that copying arbitrary metadata from a link or redirect source to the target CrawlDatum can never do harm? Wouldn't it be better to use a restricted and configurable set, or do this only for the "Cookie" metadata key?
* handleCookies(datum, content, queueId) is only called from 2 places: success and redirect (no need for 404s, exceptions, etc.). Maybe call it directly there: it's one more line of code, but one indirection less and also a shorter argument list for output(...), which makes it easier to understand what's happening.
* CookieScoringFilter could be simplified by inheriting from AbstractScoringFilter

> Fetcher support for reading and setting cookies
> -----------------------------------------------
>
>                 Key: NUTCH-2363
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2363
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.13
>
>         Attachments: NUTCH-2363.patch
>
>
> Patch adds basic support for cookies in the fetcher, and a scoring plugin
> that passes cookies to its outlinks, within the domain. Sub-domain or path
> based is not supported.
> This is useful if you want to maintain sessions or need to get around a
> cookie wall.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)