Re: Crawl URL with varying query parameters values

Nima Falaki Mon, 15 Sep 2014 15:59:31 -0700

Did you also modify regex-urlfilter.txt so that it no longer skips URLs
containing characters that mark them as probable queries? Comment that rule
out by putting a # in front of it, as in this part of regex-urlfilter.txt:


# skip URLs containing certain characters as probable queries, etc.

#-[?*!@=]
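
To see why only the first seed gets through when that rule is active, here is
a small standalone Java check. It is a sketch, not Nutch code: the class and
method names are made up for illustration, and it assumes the rule is applied
as an unanchored regex search against the whole URL.

```java
import java.util.regex.Pattern;

public class UrlFilterCheck {
    // Character class taken from the default "-[?*!@=]" rule in regex-urlfilter.txt.
    private static final Pattern PROBABLE_QUERY = Pattern.compile("[?*!@=]");

    // Hypothetical helper: true if the exclusion rule would match (and so drop) the URL.
    static boolean isRejected(String url) {
        return PROBABLE_QUERY.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] seeds = {
            "http://example.com",
            "http://example.com?request_locale=es_US"
        };
        for (String seed : seeds) {
            System.out.println(seed + " -> " + (isRejected(seed) ? "rejected" : "kept"));
        }
    }
}
```

The second seed contains both "?" and "=", so the rule matches it; with the
rule commented out, neither seed is matched by it.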

On Fri, Sep 12, 2014 at 4:03 AM, Krishnanand, Kartik <
[email protected]> wrote:

> Hi, Nutch Gurus,
>
> I need to crawl two dynamically generated pages:
>
>
> 1.       http://example.com and
>
> 2.       http://example.com?request_locale=es_US
>
> The difference is that when the query parameter "request_locale" equals
> "es_US", Spanish content is loaded. We would like to crawl both URLs if
> possible. I have put both URLs in my seed.txt, but the logs show that only
> the first URL is being crawled, not the second.
>
> I modified regex-normalize.xml so that it does not strip out query
> parameters; it is given below. How do I configure Nutch to crawl both URLs?
>
> Kartik
>
> <regex-normalize>
>
> <!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
> <regex>
>   <pattern>(?i)(;?\b_?(l|j|bv_)?(sid|phpsessid|sessionid)=.*?)(\?|&amp;|#|$)</pattern>
>   <substitution>$4</substitution>
> </regex>
>
> <!-- changes default pages into standard for /index.html, etc. into /
> <regex>
>   <pattern>/((?i)index|default)\.((?i)js[pf]{1}?[afx]?|cgi|cfm|asp[x]?|[psx]?htm[l]?|php[3456]?)(\?|&amp;|#|$)</pattern>
>   <substitution>/$3</substitution>
> </regex> -->
>
> <!-- removes interpage href anchors such as site.com#location -->
> <regex>
>   <pattern>#.*?(\?|&amp;|$)</pattern>
>   <substitution>$1</substitution>
> </regex>
>
> <!-- cleans ?&amp;var=value into ?var=value -->
> <regex>
>   <pattern>\?&amp;</pattern>
>   <substitution>\?</substitution>
> </regex>
>
> <!-- cleans multiple sequential ampersands into a single ampersand -->
> <regex>
>   <pattern>&amp;{2,}</pattern>
>   <substitution>&amp;</substitution>
> </regex>
>
> <!-- removes trailing ? -->
> <regex>
>   <pattern>[\?&amp;\.]$</pattern>
>   <substitution></substitution>
> </regex>
>
> <!-- removes duplicate slashes -->
> <regex>
>   <pattern>(?&lt;!:)/{2,}</pattern>
>   <substitution>/</substitution>
> </regex>
>
> </regex-normalize>
>



-- 



Nima Falaki
Software Engineer
[email protected]
