[ https://issues.apache.org/jira/browse/NUTCH-706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473599#comment-13473599 ]
Sebastian Nagel commented on NUTCH-706:
---------------------------------------

The first commit erroneously used the wrong patch. The correct patch (NUTCH-706-2.patch) is now committed to trunk (revision 1396817) and 2.x (revision 1396822).

> Url regex normalizer: default pattern for session id removal not to match "newsId"
> ----------------------------------------------------------------------------------
>
>                 Key: NUTCH-706
>                 URL: https://issues.apache.org/jira/browse/NUTCH-706
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.0.0
>            Reporter: Meghna Kukreja
>            Priority: Minor
>             Fix For: 1.6, 2.2
>
>         Attachments: NUTCH-706-2.patch, NUTCH-706.patch
>
>
> Hey,
> I encountered the following problem while trying to crawl a site using
> nutch-trunk. In the file regex-normalize.xml, the following regex is
> used to remove session ids:
> <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)</pattern>
> This pattern also transforms a URL such as
> "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en" (since it
> matches 'sId' in 'newsId'). The result is incorrect, so the page does
> not get fetched. The expression needs to be changed to prevent this.
> Thanks,
> Meghna
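
For anyone who wants to reproduce the report outside of Nutch, here is a minimal Java sketch. It rests on several assumptions: the class name SessionIdPatternDemo and the normalize helper are made up for illustration, the substitution "$4" (keep only the terminating character) is assumed rather than quoted in the issue, and the TIGHTENED pattern is only one possible way to avoid the false match. It is not the pattern from NUTCH-706-2.patch.

```java
import java.util.regex.Pattern;

class SessionIdPatternDemo {
    // Pattern quoted in the issue, unchanged apart from Java string escaping.
    static final Pattern ORIGINAL = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical tightening (an assumption, not the committed fix):
    // after the optional ';' or '_', a negative lookbehind rejects a match
    // whose parameter name is preceded by a letter or digit, so the "sId"
    // inside "newsId" no longer qualifies.
    static final Pattern TIGHTENED = Pattern.compile(
        "([;_]?(?<![A-Za-z0-9])((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumed substitution: replace each match with group 4, the terminator.
    static String normalize(Pattern pattern, String url) {
        return pattern.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        String[] urls = {
            "&newsId=2000484784794&newsLang=en",            // example from the issue
            "http://example.com/page;jsessionid=ABC123?x=1" // made-up session id URL
        };
        for (String url : urls) {
            System.out.println("input:     " + url);
            System.out.println("original:  " + normalize(ORIGINAL, url));
            System.out.println("tightened: " + normalize(TIGHTENED, url));
        }
    }
}
```

With the original pattern the first input comes back as "&new&newsLang=en", as described in the report. The tightened variant leaves it untouched while still stripping the jsessionid from the second input.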