I think what you have done so far is logical. Typically in Nutch
crawls people don't want URLs with query strings, but nowadays things have
changed. For instance, category #2 that you pointed out may capture some vital
pages. I once ran into a similar issue. A crawler can't be made intelligent
beyond a certain point, and I had to go through the crawl logs to check which
URLs were being fetched and then redefine my regex rules.
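
To make the log study less painful, something like the following Python sketch could tally which query parameters show up in the fetched URLs. It is only an illustration: it assumes each fetch appears in the log as "fetching <url>", and the sample log lines and the function name query_param_counts are made up for this example.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# Assumption: each fetch shows up in the log as "fetching <url>".
FETCH_LINE = re.compile(r"fetching\s+(\S+)")

def query_param_counts(log_lines):
    """Count how often each query parameter name appears in fetched URLs."""
    counts = Counter()
    for line in log_lines:
        match = FETCH_LINE.search(line)
        if not match:
            continue  # not a fetch line
        query = urlsplit(match.group(1)).query
        for name, _value in parse_qsl(query, keep_blank_values=True):
            counts[name] += 1
    return counts

# Hypothetical log lines built from the URLs in the original question.
sample_log = [
    "INFO fetcher.Fetcher - fetching http://site1.com/article1",
    "INFO fetcher.Fetcher - fetching http://site1.com/article1/?view=printerfriendly",
    "INFO fetcher.Fetcher - fetching http://site1.com/article1/?zoom=large",
    "INFO fetcher.Fetcher - fetching http://site1.com/article1/?zoom=extralarge",
    "INFO fetcher.Fetcher - fetching http://site2.com/?pid=123",
]

if __name__ == "__main__":
    for name, n in query_param_counts(sample_log).most_common():
        print(name, n)
```

Parameters that occur very often but never lead to new content (zoom, view and so on) are good candidates for reject rules, while ones like pid probably need an accept rule.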

Some things that I had considered doing:
1. Start off with rules which are less restrictive and observe the logs to see
which URLs are visited. This will give you an idea of the bad URLs and
the good ones. As you have already crawled for 10 days, you are (just !!)
left with studying the logs.
2. After #1 is done, launch crawls with accept rules for the good URLs and
put a "-." at the end to reject everything else, which takes care of the bad URLs.
3. Having a huge list of regexes is a bad thing because comparing URLs
against regexes is a costly operation and it is done for every URL. A URL that
gets a match early saves this time. So put the patterns which capture a huge set
of URLs at the top of the regex urlfilter file.
4. Sometimes you don't want the parser to extract URLs from certain areas of
the page, as you know that they are not going to yield anything good. Let's
say that the "print" or "zoom" URLs are coming from some specific tags of
the HTML source. It's better not to parse those areas and thus not have
those URLs in the first place. The benefit here is that fewer regex
rules need to be defined.
5. An improvement over #4 is that if you know the nature of the pages that are
being crawled, you can tweak the parsers to extract URLs from specific tags
only. This reduces noise and gives a much cleaner fetch list.
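
As a rough illustration of points 2 and 3, here is a small Python sketch that mimics how I understand the regex urlfilter to work: rules are tried top to bottom, the first pattern found in the URL decides accept ("+") or reject ("-"), and a URL matching no rule is dropped. The rule set and the helper names load_rules and accepts are made up for this example, and Python's regex engine is only standing in for Java's here.

```python
import re

# Hypothetical rule set: broad accept patterns first, catch-all reject last.
RULES = r"""
# accept plain article pages on site1
+^http://site1\.com/article[0-9]+$
# accept pid-style pages on site2
+^http://site2\.com/\?pid=[0-9]+$
# reject everything else
-.
"""

def load_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule wins; a URL with no matching rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

if __name__ == "__main__":
    rules = load_rules(RULES)
    for url in [
        "http://site1.com/article1",
        "http://site1.com/article1/?zoom=large",
        "http://site2.com/?pid=123",
        "http://site1.com/getcontentID?id=1&format=pdf",
    ]:
        print("accept" if accepts(rules, url) else "reject", url)
```

Because the loop stops at the first match, a URL accepted by the first rule never gets compared against the rest, which is why the broad patterns belong at the top.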

As far as I can tell, this problem won't have an automated solution like
modifying some config/setting. There is a decent amount of human
intervention required to get things right. Knowing the nature of the pages you
plan to crawl is vital in making smart decisions.
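
To show what I mean in points 4 and 5 about not extracting URLs from certain areas of a page, here is a standalone Python sketch using the standard html.parser module. It is not Nutch code (the Nutch parsers are Java plugins); the class name ContentLinkExtractor, the "tools" CSS class and the sample HTML are all assumptions made up for the illustration.

```python
from html.parser import HTMLParser

class ContentLinkExtractor(HTMLParser):
    """Collect hrefs from <a> tags, ignoring links inside a skipped <div>."""

    def __init__(self, skip_class):
        super().__init__()
        self.skip_class = skip_class  # class name of the <div> to ignore
        self.skip_depth = 0           # > 0 while inside a skipped <div>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Track nesting so inner <div>s do not end the skipped region early.
            if self.skip_depth or self.skip_class in classes:
                self.skip_depth += 1
        elif tag == "a" and not self.skip_depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.skip_depth:
            self.skip_depth -= 1

# Hypothetical page: "print" and "zoom" links live in a <div class="tools">.
SAMPLE_HTML = """
<div class="article"><a href="/article2">Next article</a></div>
<div class="tools">
  <a href="/article1/?view=printerfriendly">Print</a>
  <div><a href="/article1/?zoom=large">Zoom</a></div>
</div>
<a href="/article3">More</a>
"""

if __name__ == "__main__":
    extractor = ContentLinkExtractor(skip_class="tools")
    extractor.feed(SAMPLE_HTML)
    print(extractor.links)
```

The print and zoom URLs never reach the fetch list here, so no regex rule is needed for them.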

Thanks,
Tejas Patil


On Fri, Feb 22, 2013 at 5:52 PM, ytthet <[email protected]> wrote:

> Hi Folks,
>
> I have a question on crawling URLs with query strings. I am crawling about
> 10,000 sites. Some of the sites use query strings to serve content while
> others use simple URLs. For example, I have the following cases:
>
> Case 1:
>
> site1.com/article1
> site1.com/article2
>
> Case 2:
> site2.com/?pid=123
> site2.com/?pid=124
>
> The only way to crawl and fetch webpages/articles in case 2 is to fetch URLs
> with the query string "?". For case 1, I can set it NOT to fetch "?" in the
> URL. Thus, currently in my regex-urlfilter.txt, I have commented out the
> following lines so that my crawler fetches URLs with query strings.
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
>
> The above setting causes the crawler to fetch all URLs, including URLs with
> query strings, so pages such as download, login, comments, search query,
> printer friendly pages, zoom in view and other low-value pages are being
> fetched. Practically, the crawler is going into the deep web. The undesirable
> effects of this are as follows:
>
> 1. Duplicate pages are being fetched, causing the crawl DB to become bloated
> - Printer friendly view, zoom in view
> e.g. site1.com/article1
> e.g. site1.com/article1/?view=printerfriendly
> e.g. site1.com/article1/?zoom=large
> e.g. site1.com/article1/?zoom=extralarge
>
> 2. Download pages are being fetched, causing the segments to become too large
> e.g. site1.com/getcontentID?id=1&format=pdf
> e.g. site1.com/getcontentID?id=1&format=doc
>
> 3. Crawling takes a very long time (10 days for depth 5) since it is going
> into the deep web.
>
> My current solution to the problem is to add additional regexes to
> regex-urlfilter.txt to prevent the crawler from fetching undesired pages.
> Now I have other problems:
> 1. The regexes to exclude undesired URL patterns are not exhaustive, since
> there are many sites and many patterns. Thus the crawler is still going into
> the deep web.
> 2. The list of exclusion regexes is getting too long: so far 50 regexes to
> exclude URL patterns.
>
> I hope I am not the only one with this kind of problem and that someone knows
> a smarter way to solve it. Does anybody have a solution or suggestion
> on how to solve the problem? Some tips or direction would be very much
> appreciated.
>
> Btw, I am using Nutch 1.2, but I believe the crawler principle is pretty much
> the same.
>
> Warm Regards,
>
> Ye
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
