[ https://issues.apache.org/jira/browse/NUTCH-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-1328.
--------------------------------

    Resolution: Duplicate

Closed in favor of NUTCH-706

> a problem with regex-normalize.xml
> ----------------------------------
>
>                 Key: NUTCH-1328
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1328
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: behnam nikbakht
>              Labels: parse
>
> There is a regex pattern in regex-normalize.xml:
> <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)</pattern>
> that removes session IDs from URLs, but there are some sites, like:
> http://www.mehrnews.com/fa
> that have URLs like:
> http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539
> and with this pattern, this URL is converted to an invalid URL:
> http://www.mehrnews.com/fa/newsdetail.aspx?New
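
The reported behavior can be reproduced outside Nutch with plain java.util.regex. The sketch below is a minimal illustration, not Nutch's normalizer code: the class and method names are hypothetical, and the substitution "$4" (keep only the trailing delimiter) is an assumption about the rule's configuration, since the report quotes only the pattern. The pattern is case-insensitive and does not require a delimiter before the parameter name, so "sid" matches the "sID" inside "NewsID".

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class SessionIdRegexDemo {

    // Pattern copied from the report (regex-normalize.xml).
    static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumption: the rule replaces each match with group 4, the trailing delimiter.
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // "sID=1567539" is matched as if it were a session id and removed.
        System.out.println(
            normalize("http://www.mehrnews.com/fa/newsdetail.aspx?NewsID=1567539"));
        // prints http://www.mehrnews.com/fa/newsdetail.aspx?New

        // A genuine session id (example.com URL is made up) is stripped as intended.
        System.out.println(
            normalize("http://example.com/page;jsessionid=ABC123?x=1"));
        // prints http://example.com/page?x=1
    }
}
```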