Url regex normalizer
--------------------

                 Key: NUTCH-706
                 URL: https://issues.apache.org/jira/browse/NUTCH-706
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.0.0
            Reporter: Meghna Kukreja
            Priority: Minor
             Fix For: 1.0.0


Hey,

I encountered the following problem while trying to crawl a site using
nutch-trunk. In the file regex-normalize.xml, the following regex is
used to remove session ids:

<pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>

This pattern also rewrites a URL containing, for example,
"&newsId=2000484784794&newsLang=en" into "&new&newsLang=en", because the
case-insensitive 'sid' alternative matches the 'sId' inside 'newsId'. The
rewritten URL is incorrect, so the original page never gets fetched. The
expression needs to be tightened so that it matches only real session id
parameters.
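
The behaviour can be reproduced outside Nutch with a small standalone
sketch. It assumes the rule's substitution is "$4" (keep only the trailing
delimiter), which is consistent with the output quoted above, and it decodes
the XML entity &amp; to & as the regex engine would see it. The class name,
the TIGHTENED pattern, and the example.com URL are hypothetical. The
tightened pattern is one possible fix, not necessarily the one the project
should adopt: it requires a word boundary before the parameter name.

```java
import java.util.regex.Pattern;

class SessionIdPatternDemo {
    // Pattern from the report; &amp; in the XML file is a plain & here.
    static final Pattern ORIGINAL = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Hypothetical tightening (an assumption, not taken from the report):
    // \b demands a word boundary before the name, so "sId" inside "newsId"
    // no longer qualifies, while ";jsessionid=" and "?sid=" still do.
    static final Pattern TIGHTENED = Pattern.compile(
        "(?i)(;?\\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // Assumed substitution: replace each match with its trailing delimiter.
    static String strip(Pattern pattern, String url) {
        return pattern.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        String news = "&newsId=2000484784794&newsLang=en";
        String session = "http://example.com/page;jsessionid=ABC123?x=1";

        System.out.println(strip(ORIGINAL, news));     // &new&newsLang=en
        System.out.println(strip(TIGHTENED, news));    // &newsId=2000484784794&newsLang=en
        System.out.println(strip(TIGHTENED, session)); // http://example.com/page?x=1
    }
}
```

A word boundary is only a heuristic: a parameter such as "news-sid" would
still match, so any replacement pattern should be checked against URLs from
real crawls before it goes into regex-normalize.xml.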

Thanks,
Meghna

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
