Tejas,

Thanks for your pointers. They are really helpful. So far my approach
follows your directions 1, 2 and 3. Since I have about 10k sites, I hope
it will stay manageable for the near future.
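
For anyone reading along, here is a minimal Python sketch of how directions 2
and 3 (quoted below) play out. It assumes that regex-urlfilter applies the
first rule whose pattern is found anywhere in the URL and rejects URLs that
match no rule, which is my understanding of the filter. The rules, the
compile_rules and accepts helpers, and the URLs are all made up for
illustration; this is not nutch code.

```python
import re

# Hypothetical rules in the "+pattern" / "-pattern" style of regex-urlfilter.txt.
# Broad patterns come first so that most URLs are decided early.
RULES = [
    r"-[?&](view|zoom)=",               # printer-friendly and zoom duplicates
    r"-[?&]format=(pdf|doc)",           # download links
    r"+^http://site2\.com/\?pid=\d+$",  # query-string articles worth keeping
    r"+^http://site1\.com/[^?]+$",      # pretty URLs without a query string
    r"-.",                              # catch-all: reject everything else
]

def compile_rules(lines):
    """Turn each rule line into an (accept, compiled_pattern) pair."""
    return [(line[0] == "+", re.compile(line[1:])) for line in lines]

def accepts(url, rules):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL matching no rule is dropped

rules = compile_rules(RULES)
for url in ["http://site1.com/article1",
            "http://site1.com/article1/?zoom=large",
            "http://site2.com/?pid=123"]:
    print(url, accepts(url, rules))
```

The order matters: the first verdict wins, so a reject rule placed after a
broader accept rule would never be reached.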

I might need to apply your directions 4 and 5 in the future as well,
but I suspect that getting them right is out of my league.

Some extra information on my approach: most of my target sites use a CMS,
and quite a number of them DO NOT use pretty URLs. I have been grepping the
logs to identify patterns of redundant or unimportant URLs and adding
regex rules to regex-urlfilter. 2 million URLs is quite a lot for one man
to process though. Phew!
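
Part of the log grepping could be mechanised. A minimal Python sketch,
assuming the fetched URLs have already been pulled out of the crawl log into a
list: it counts how often each query-parameter name occurs, so that frequent
names such as "zoom" or "view" stand out as candidates for a reject rule. The
count_query_keys helper and the sample URLs are illustrative only.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def count_query_keys(urls):
    """Count how often each query-parameter name appears across the URLs."""
    counts = Counter()
    for url in urls:
        # Bare entries such as "site1.com/..." need a scheme so that
        # urlsplit separates host, path and query correctly.
        if "://" not in url:
            url = "http://" + url
        query = urlsplit(url).query
        # keep_blank_values so that "?print=" still counts the "print" key.
        for key, _ in parse_qsl(query, keep_blank_values=True):
            counts[key] += 1
    return counts

sample = [
    "site1.com/article1",
    "site1.com/article1/?view=printerfriendly",
    "site1.com/article1/?zoom=large",
    "site1.com/article1/?zoom=extralarge",
    "site2.com/?pid=123",
    "site2.com/?pid=124",
]
for key, n in count_query_keys(sample).most_common():
    print(key, n)
```

A human still has to decide which names are noise: here "pid" is as frequent
as "zoom", yet it identifies the real articles on site2.com.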

I will share it if I find an approach that could benefit us all.

Regards,

Ye

On Sat, Feb 23, 2013 at 12:22 PM, Tejas Patil <[email protected]> wrote:

> one correction in red below.
>
> On Fri, Feb 22, 2013 at 8:20 PM, Tejas Patil <[email protected]> wrote:
>
> > I think that what you have done till now is logical. Typically in nutch
> > crawls people don't want URLs with query strings, but nowadays things have
> > changed. For instance, category #2 you pointed out may capture some vital
> > pages. I once ran into a similar issue. A crawler can't be made intelligent
> > beyond a certain point, and I had to go through crawl logs to check which
> > URLs were being fetched and later redefine my regex rules.
> >
> > Some things that I had considered doing:
> > 1. Start off with rules which are less restrictive and observe the logs
> > to see which URLs are visited. This will give you an idea about the bad
> > URLs and the good ones. As you have already crawled for 10 days, you are
> > (just!!) left with studying the logs.
> > 2. After #1 is done, launch crawls with accept rules for the good URLs and
> > put a "-." at the end to reject the bad URLs.
> > 3. Having a huge list of regexes is a bad thing because comparing URLs
> > against regexes is a costly operation and is done for every URL. A URL
> > getting a match early saves this time. So put patterns which capture a
> > huge set of URLs at the top of the regex-urlfilter file.
> > 4. Sometimes you don't want the parser to extract URLs from certain areas
> > of the page as you know that it's not going to yield anything good for you.
> > Let's say that the "print" or "zoom" URLs are coming from some specific
> > tags of the HTML source. It's better not to parse those things and thus
> > not have those URLs in the first place. The benefit here is that fewer
> > regex rules need to be defined.
> > 5. An improvement over *#4* is that if you know the nature of the pages
> > that are being crawled, you can tweak parsers to extract URLs from specific
> > tags only. This reduces noise and gives a much cleaner fetch list.
> >
> > As far as I can tell, this problem won't have an automated solution like
> > modifying some config/setting. A decent amount of human intervention is
> > required to get things right. Knowing the nature of the pages you plan to
> > crawl is vital in making smart decisions.
> >
> > Thanks,
> > Tejas Patil
> >
> >
> > On Fri, Feb 22, 2013 at 5:52 PM, ytthet <[email protected]> wrote:
> >
> >> Hi Folks,
> >>
> >> I have a question on crawling URLs with query strings. I am crawling
> >> about 10,000 sites. Some of the sites use query strings to serve content
> >> while others use simple URLs. For example, I have the following cases:
> >>
> >> Case 1:
> >>
> >> site1.com/article1
> >> site1.com/article2
> >>
> >> Case 2:
> >> site2.com/?pid=123
> >> site2.com/?pid=124
> >>
> >> The only way to crawl and fetch web pages/articles in case 2 is to fetch
> >> URLs with a query string ("?"), while for case 1 I can choose NOT to fetch
> >> URLs containing "?". Thus, in my regex-urlfilter.txt, I have commented out
> >> the following line so that my crawler fetches URLs with query strings.
> >>
> >> # skip URLs containing certain characters as probable queries, etc.
> >> #-[?*!@=]
> >>
> >> The above setting causes the crawler to fetch all URLs, including URLs
> >> with query strings, so pages such as download, login, comments, search
> >> query, printer-friendly pages, zoom-in views and other low-value pages are
> >> being fetched. Practically, the crawler is going into the deep web. The
> >> undesirable effects of this are as follows:
> >>
> >> 1. Duplicate pages are being fetched, bloating the crawl DB
> >> - Printer-friendly view, zoom-in view
> >> e.g. site1.com/article1
> >> e.g. site1.com/article1/?view=printerfriendly
> >> e.g. site1.com/article1/?zoom=large
> >> e.g. site1.com/article1/?zoom=extralarge
> >>
> >> 2. Download pages are being fetched, making the segments too large
> >> e.g. site1.com/getcontentID?id=1&format=pdf
> >> e.g. site1.com/getcontentID?id=1&format=doc
> >>
> >> 3. Crawling takes a very long time (10 days for depth 5) since it is
> >> going into the deep web.
> >>
> >> My current solution to the problem is to add additional regexes to
> >> regex-urlfilter.txt to prevent the crawler from fetching undesired pages.
> >> Now I have other problems:
> >> 1. The regexes to exclude undesired URL patterns are not exhaustive, since
> >> there are many sites and many patterns. Thus the crawler is still going
> >> into the deep web.
> >> 2. The list of exclusion regexes is getting too long: so far 50 regexes to
> >> exclude URL patterns.
> >>
> >> I hope I am not the only one with this problem and that someone knows a
> >> smarter way to solve it. Does anybody have a solution or suggestion on
> >> how to solve the problem? Some tips or direction would be very much
> >> appreciated.
> >>
> >> Btw, I am using nutch 1.2 but I believe the crawler principle is pretty
> >> much the same.
> >>
> >> Warm Regards,
> >>
> >> Ye
> >>
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >
> >
>
