One correction in red below.

On Fri, Feb 22, 2013 at 8:20 PM, Tejas Patil <[email protected]> wrote:
> I think that what you have done till now is logical. Typically in Nutch crawls people don't want URLs with query strings, but nowadays things have changed. For instance, category #2 you pointed out may capture some vital pages. I once ran into a similar issue. A crawler can't be made intelligent beyond a certain point, and I had to go through the crawl logs to check which URLs were being fetched and later redefine my regex rules.
>
> Some things that I had considered doing:
> 1. Start off with rules which are less restrictive and observe the logs to see which URLs are visited. This will give you an idea of the bad URLs and the good ones. As you have already crawled for 10 days, you are (just!!) left with studying the logs.
> 2. After #1 is done, launch crawls with accept rules for the good URLs and put a "-." at the end to avoid the bad URLs.
> 3. Having a huge list of regexes is a bad thing, because comparing URLs against regexes is a costly operation and is done for every URL. A URL getting a match early saves this time, so put the patterns which capture a huge set of URLs at the top of the regex urlfilter file.
> 4. Sometimes you don't want the parser to extract URLs from certain areas of the page because you know they are not going to yield anything good for you. Let's say the "print" or "zoom" URLs are coming from some specific tags of the HTML source. It's better not to parse those parts and thus not have those URLs in the first place. The profit here is that the regex rules to be defined are now fewer.
> 5. An improvement over *#4* is that, if you know the nature of the pages being crawled, you can tweak the parsers to extract URLs from specific tags only. This reduces noise and gives a much cleaner fetch list.
>
> As far as I feel, this problem won't have an automated solution like modifying some config/setting. There is a decent amount of human intervention required to get things right. Knowing the nature of the pages you plan to crawl is vital for making smart decisions.
>
> Thanks,
> Tejas Patil
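To make points #2 and #3 concrete, a minimal regex-urlfilter.txt along those lines could look like the sketch below. The hosts and patterns are only placeholders built from the example sites in the quoted question further down, not rules from this thread. Because the filter stops at the first matching rule, the broad accept rules for the known-good URL shapes go first so most URLs match early, and the final "-." rejects everything else:

    # accept the known-good URL shapes first, so most URLs match early
    +^https?://(www\.)?site1\.com/article[0-9]+$
    +^https?://(www\.)?site2\.com/\?pid=[0-9]+$
    # anything that falls through to here is rejected
    -.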
> On Fri, Feb 22, 2013 at 5:52 PM, ytthet <[email protected]> wrote:
>
>> Hi Folks,
>>
>> I have a question on crawling URLs with query strings. I am crawling about 10,000 sites. Some of the sites use query strings to serve their content while some use simple URLs. For example, I have the following two cases:
>>
>> Case 1:
>> site1.com/article1
>> site1.com/article2
>>
>> Case 2:
>> site2.com/?pid=123
>> site2.com/?pid=124
>>
>> The only way to crawl and fetch webpages/articles in case 2 is to fetch URLs with the query string "?", while for case 1 I can set the crawler NOT to fetch "?" in URLs. Thus, in my regex-urlfilter.txt I have commented out the following lines so that my crawler fetches URLs with query strings:
>>
>> # skip URLs containing certain characters as probable queries, etc.
>> #-[?*!@=]
>>
>> This setting causes the crawler to fetch all URLs, including URLs with query strings, so pages such as download, login, comment, search-query, printer-friendly and zoom-view pages and other non-valuable pages are being fetched. Practically, the crawler is going into the deep web. The undesirable consequences are the following:
>>
>> 1. Duplicate pages are being fetched, causing the crawl DB to be bloated
>> - printer-friendly view, zoom view
>> e.g. site1.com/article1
>> e.g. site1.com/article1/?view=printerfriendly
>> e.g. site1.com/article1/?zoom=large
>> e.g. site1.com/article1/?zoom=extralarge
>>
>> 2. Download pages are being fetched, causing the segments to be too large
>> e.g. site1.com/getcontentID?id=1&format=pdf
>> e.g. site1.com/getcontentID?id=1&format=doc
>>
>> 3. Crawling takes a very long time (10 days for depth 5) since it is going into the deep web.
>>
>> My current solution is to add additional regexes to regex-urlfilter.txt to prevent the crawler from fetching the undesired pages. Now I have two other problems:
>> 1. The regexes to exclude undesired URL patterns are not exhaustive, for there are many sites and many patterns, so the crawler is still going into the deep web.
>> 2. The list of exclude regexes is getting too long; so far there are 50 regexes just to exclude URL patterns.
>>
>> I hope I am not the only one with this problem and that someone knows a smarter way to solve it. Does anybody have a solution or suggestion on how to solve this? Some tips or direction would be very much appreciated.
>>
>> Btw, I am using Nutch 1.2, but I believe the crawler principle is pretty much the same.
>>
>> Warm Regards,
>>
>> Ye
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
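For the specific noise listed in the quoted question (printer-friendly and zoom views, download formats, login/search/comment pages), an alternative to a long per-site blacklist is a short set of targeted reject rules placed before a catch-all accept. Because the regex urlfilter stops at the first matching rule, the order matters. The parameter and path names below are only the ones taken from the examples in the question and would need adjusting per site:

    # reject the known noise before the catch-all accept
    -[?&](view=printerfriendly|zoom=)
    -getcontentID\?.*format=(pdf|doc)
    -/(login|search|comment)
    # accept everything else, including the ?pid= style article URLs
    +.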

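On points *#4*/#5 (extracting outlinks only from specific parts of a page), the general idea, sketched here outside of Nutch's own parse plugins, looks roughly like the snippet below. It uses jsoup purely as an illustration, and the "article-body"/"toolbar" selectors are made-up placeholders for whatever containers hold the real content and the print/zoom widgets on a given site:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.util.ArrayList;
    import java.util.List;

    public class ArticleLinkExtractor {
        // Collect outlinks only from the main content area, skipping the
        // regions that typically hold "print", "zoom" or download links.
        public static List<String> extractLinks(String html, String baseUri) {
            Document doc = Jsoup.parse(html, baseUri);
            // Drop the noisy regions first (placeholder selectors).
            doc.select("div.toolbar, div.print-tools").remove();
            List<String> links = new ArrayList<String>();
            // Only follow anchors inside the main article container (placeholder selector).
            for (Element a : doc.select("div.article-body a[href]")) {
                links.add(a.attr("abs:href")); // resolve relative URLs against baseUri
            }
            return links;
        }
    }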
