Did you also modify regex-urlfilter.txt so that it no longer skips URLs containing certain characters as probable queries? If not, comment out the rule below in regex-urlfilter.txt by putting a # in front of it:

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

On Fri, Sep 12, 2014 at 4:03 AM, Krishnanand, Kartik <[email protected]> wrote:

> Hi, Nutch Gurus,
>
> I need to crawl two dynamic pages:
>
> 1. http://example.com and
> 2. http://example.com?request_locale=es_US
>
> The difference is that when the query parameter "request_locale" equals
> "es_US", Spanish content is loaded. We would like to be able to crawl both
> URLs if possible. I have put both URLs in my seed.txt, but the logs show
> that only the first URL is being crawled, not the second.
>
> I modified regex-normalize.xml so that it does not strip out query
> parameters; it is given below. How do I configure Nutch to crawl both URLs?
>
> Kartik
>
> <regex-normalize>
>
> <!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
> <regex>
>   <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&|#|$)</pattern>
>   <substitution>$4</substitution>
> </regex>
>
> <!-- changes default pages into standard for /index.html, etc. into /
> <regex>
>   <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&|#|$)</pattern>
>   <substitution>/$3</substitution>
> </regex> -->
>
> <!-- removes interpage href anchors such as site.com#location -->
> <regex>
>   <pattern>#.*?(\?|&|$)</pattern>
>   <substitution>$1</substitution>
> </regex>
>
> <!-- cleans ?&var=value into ?var=value -->
> <regex>
>   <pattern>\?&</pattern>
>   <substitution>\?</substitution>
> </regex>
>
> <!-- cleans multiple sequential ampersands into a single ampersand -->
> <regex>
>   <pattern>&{2,}</pattern>
>   <substitution>&</substitution>
> </regex>
>
> <!-- removes trailing ? -->
> <regex>
>   <pattern>[\?&\.]$</pattern>
>   <substitution></substitution>
> </regex>
>
> <!-- removes duplicate slashes -->
> <regex>
>   <pattern>(?<!:)/{2,}</pattern>
>   <substitution>/</substitution>
> </regex>
>
> </regex-normalize>
>
> ----------------------------------------------------------------------
> This message, and any attachments, is for the intended recipient(s) only,
> may contain information that is privileged, confidential and/or proprietary
> and subject to important terms and conditions available at
> http://www.bankofamerica.com/emaildisclaimer. If you are not the
> intended recipient, please delete this message.

--
Nima Falaki
Software Engineer
[email protected]
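
P.S. To illustrate why the second seed URL would be dropped, here is a rough Python sketch of how the rules in regex-urlfilter.txt appear to be applied. It is not Nutch code. It assumes that rules are tried top to bottom, that the first pattern found anywhere in the URL decides (a leading - rejects, a leading + accepts), that a URL matching no rule is rejected, and that the file ends with the usual catch-all +. rule. The names parse_rules and accepts are made up for this example, and your actual file may contain other rules.

```python
import re

# Assumed excerpt of regex-urlfilter.txt with the query-character rule active.
DEFAULT_RULES = """
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# accept anything else
+.
"""

# Same excerpt with the rule commented out, as suggested in the reply.
PATCHED_RULES = """
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]
# accept anything else
+.
"""


def parse_rules(text):
    """Turn filter lines into (accept, compiled_pattern) pairs (hypothetical helper)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL wins."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: treat the URL as rejected


SEEDS = [
    "http://example.com",
    "http://example.com?request_locale=es_US",
]

for label, text in (("default", DEFAULT_RULES), ("patched", PATCHED_RULES)):
    rules = parse_rules(text)
    for url in SEEDS:
        verdict = "accepted" if accepts(rules, url) else "rejected"
        print(label, url, verdict)
```

Under these assumptions, the ? and = in the second URL trigger the -[?*!@=] rule before the catch-all is reached, which would explain why only http://example.com shows up in the logs. With the rule commented out, both seeds fall through to +. and are kept.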

