Url regex normalizer
--------------------

                 Key: NUTCH-706
                 URL: https://issues.apache.org/jira/browse/NUTCH-706
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.0.0
            Reporter: Meghna Kukreja
            Priority: Minor
             Fix For: 1.0.0

Hey,

I encountered the following problem while trying to crawl a site using nutch-trunk. In the file regex-normalize.xml, the following regex is used to remove session IDs:

<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)</pattern>

This pattern also transforms a URL containing "&newsId=2000484784794&newsLang=en" into "&new&newsLang=en", because the case-insensitive 'sid' alternative matches the 'sId' inside 'newsId'. The rewritten URL is incorrect, so the page does not get fetched. The expression needs to be changed to prevent this.

Thanks,
Meghna
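
The behaviour described in this report can be reproduced outside Nutch with a short Java sketch. The class name SessionIdRegexDemo, the example.com URL, and the CANDIDATE pattern are hypothetical and not taken from the Nutch source. The substitution "$4" (keep only the trailing delimiter) is an assumption inferred from the output quoted above. CANDIDATE is only one possible way to avoid the false match, by refusing to start the parameter name right after a letter or digit; it is not necessarily the change committed for this issue.

```java
import java.util.regex.Pattern;

class SessionIdRegexDemo {
    // Pattern exactly as quoted in the report.
    static final Pattern REPORTED = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical alternative (an assumption, not the committed fix):
    // the negative lookbehind rejects a parameter name that begins
    // immediately after a letter or digit, as 'sId' does in 'newsId'.
    // Group numbering is kept so that group 4 is still the delimiter.
    static final Pattern CANDIDATE = Pattern.compile(
        "(?i)([;_]?(?<![a-z0-9])(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumed substitution: drop the session parameter and keep only the
    // delimiter captured by group 4.
    static String normalize(Pattern pattern, String url) {
        return pattern.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        String news = "&newsId=2000484784794&newsLang=en";
        String session = "http://www.example.com/page.jsp;jsessionid=ABC123?x=1";

        System.out.println(normalize(REPORTED, news));     // &new&newsLang=en
        System.out.println(normalize(CANDIDATE, news));    // &newsId=2000484784794&newsLang=en
        System.out.println(normalize(REPORTED, session));  // http://www.example.com/page.jsp?x=1
        System.out.println(normalize(CANDIDATE, session)); // http://www.example.com/page.jsp?x=1
    }
}
```

Both patterns still strip a real ";jsessionid=..." segment; only the reported pattern damages the newsId parameter. Any replacement pattern would need to be checked against the other session-ID forms the normalizer is expected to handle.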