Re: Crawling URLs with query string while limiting only web pages

Tejas Patil Sun, 24 Feb 2013 11:32:20 -0800

@Ye, You need not look at each url. Random sampling will be better. It won't
be accurate, but it is the practical thing to do. Even while going through logs,
extract the urls, sort them so that all of those belonging to the same host
lie in the same group.
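
As a rough illustration of the sampling idea, here is a minimal Python sketch. It assumes the urls have already been pulled out of the logs into a list. The function name sample_by_host, the sample size and the fixed seed are made up for this example and are not part of nutch.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def sample_by_host(urls, k, seed=0):
    """Pick up to k urls at random and return them grouped by host."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    picked = rng.sample(urls, min(k, len(urls)))
    groups = defaultdict(list)
    for url in picked:
        # urlsplit only fills in netloc when the url carries a scheme
        host = urlsplit(url).netloc
        groups[host].append(url)
    # sort hosts and the urls under each host so similar urls sit together
    return {host: sorted(group) for host, group in sorted(groups.items())}


if __name__ == "__main__":
    urls = [
        "http://site1.com/article1",
        "http://site1.com/article1/?view=printerfriendly",
        "http://site2.com/?pid=123",
        "http://site2.com/?pid=124",
    ]
    for host, group in sample_by_host(urls, k=3).items():
        print(host)
        for url in group:
            print("   ", url)
```

Reading the sample host by host makes repeated query parameters easier to spot than reading the raw log in fetch order.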


@feng lu: +1. Good trick to remove the bad urls using normalization. The
main problem for the OP would still be coming up with such rules by
manually observing the logs.

Thanks,
Tejas Patil


On Sun, Feb 24, 2013 at 7:16 AM, feng lu <[email protected]> wrote:

> Hi Ye
>
> Can you add this pattern to the regex-normalize.xml configuration file for
> the RegexURLNormalizer class?
>
> <!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
> <regex>
>   <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid|view|zoom)=.*?)(\?|&amp;|#|$)</pattern>
>   <substitution>$4</substitution>
> </regex>
>
> It will remove session ids from urls, as well as parameters such as view
> and zoom.
>
> e.g. site1.com/article1/?view=printerfriendly
> e.g. site1.com/article1/?zoom=large
> e.g. site1.com/article1/?zoom=extralarge
>
> to
>
> e.g. site1.com/article1
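
To try the quoted rule on sample urls before putting it into regex-normalize.xml, a small Python sketch can help. This is a Python rendering of the pattern, not nutch code, and it makes two adjustments that are assumptions of the sketch: "&amp;" (the XML escape) is written as "&", and the inline (?i) flags are replaced by re.IGNORECASE for the whole pattern. The names RULE and normalize are made up for the example.

```python
import re

# Python rendering of the quoted rule (see the assumptions stated above).
RULE = re.compile(
    r"([;_]?(l|j|bv_)?(sid|phpsessid|sessionid|view|zoom)=.*?)(\?|&|#|$)",
    re.IGNORECASE,
)


def normalize(url):
    # "$4" in the rule keeps only the delimiter that ended the parameter.
    return RULE.sub(r"\4", url)


if __name__ == "__main__":
    for url in [
        "site1.com/article1/?view=printerfriendly",
        "site1.com/article1/?zoom=large",
        "site1.com/article1/?zoom=extralarge",
        "site2.com/?pid=123",
    ]:
        print(url, "->", normalize(url))
```

In this sketch the rule on its own leaves the "/?" in front of the removed parameter, so whether the final url matches the one shown above depends on the other rules in regex-normalize.xml. The pattern is also not tied to the start of a parameter name, so a name that merely ends in "view" (a hypothetical "preview=", for instance) would be cut as well.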
>
> On Sun, Feb 24, 2013 at 9:48 PM, Ye T Thet <[email protected]> wrote:
>
> > Tejas,
> >
> > Thanks for your pointers. They are really helpful. As of now my approach
> > follows your directions 1, 2 and 3. Since my sites number around 10k, I
> > hope it will be manageable for the near future.
> >
> > I might need to apply your directions 4 and 5 in the future as well, but
> > I believe it might be out of my league to get it right.
> >
> > Some extra information on my approach: most of my target sites use a CMS
> > and quite a number of them DO NOT use pretty URLs. I have been grepping
> > the log to identify patterns of redundant or unimportant URLs and adding
> > regex rules to regex-urlfilter. 2 million URLs is quite hard to process
> > for one man though. Phew!
> >
> > I will share if I find an approach that could benefit us all.
> >
> > Regards,
> >
> > Ye
> >
> > On Sat, Feb 23, 2013 at 12:22 PM, Tejas Patil <[email protected]
> > >wrote:
> >
> > > one correction in red below.
> > >
> > > On Fri, Feb 22, 2013 at 8:20 PM, Tejas Patil <[email protected]
> > > >wrote:
> > >
> > > > > I think that what you have done till now is logical. Typically in
> > > > > nutch crawls people don't want urls with query strings, but
> > > > > nowadays things have changed. For instance, category #2 you
> > > > > pointed out may capture some vital pages. I once ran into a
> > > > > similar issue. A crawler can't be made intelligent beyond a
> > > > > certain point, and I had to go through the crawl logs to check
> > > > > which urls were being fetched and later redefine my regex rules.
> > > >
> > > > > Some things that I had considered doing:
> > > > > 1. Start off with rules which are less restrictive and observe the
> > > > > logs to see which urls are visited. This will give you an idea
> > > > > about the bad urls and the good ones. As you have already crawled
> > > > > for 10 days, you are (just !!) left with studying the logs.
> > > > > 2. After #1 is done, launch crawls with accept rules for the good
> > > > > urls and put a "-." at the end to reject the bad urls.
> > > > > 3. Having a huge list of regexes is a bad thing because comparing
> > > > > urls against regexes is a costly operation and is done for every
> > > > > url. A url that gets a match early saves this time. So put the
> > > > > patterns which capture a huge set of urls at the top of the regex
> > > > > urlfilter file.
> > > > > 4. Sometimes you don't want the parser to extract urls from certain
> > > > > areas of the page, as you know that they are not going to yield
> > > > > anything good. Let's say that the "print" or "zoom" urls come from
> > > > > some specific tags of the html source. It's better not to parse
> > > > > those tags and thus not have those urls in the first place. The
> > > > > benefit is that fewer regex rules need to be defined.
> > > > > 5. An improvement over *#4* is that if you know the nature of the
> > > > > pages being crawled, you can tweak the parsers to extract urls from
> > > > > specific tags only. This reduces noise and gives a much cleaner
> > > > > fetch list.
> > > >
> > > > > As far as I can tell, this problem won't have an automated solution
> > > > > like modifying some config/setting. A decent amount of human
> > > > > intervention is required to get things right. Knowing the nature
> > > > > of the pages you plan to crawl is vital in making smart decisions.
> > > >
> > > > Thanks,
> > > > Tejas Patil
> > > >
> > > >
> > > > On Fri, Feb 22, 2013 at 5:52 PM, ytthet <[email protected]>
> > wrote:
> > > >
> > > >> Hi Folks,
> > > >>
> > > > >> I have a question on crawling URLs with query strings. I am
> > > > >> crawling about 10,000 sites. Some of the sites use query strings
> > > > >> to serve content while others use simple URLs. For example, I
> > > > >> have the following cases:
> > > >>
> > > >> Case 1:
> > > >>
> > > >> site1.com/article1
> > > >> site1.com/article2
> > > >>
> > > >> Case 2:
> > > >> site2.com/?pid=123
> > > >> site2.com/?pid=124
> > > >>
> > > > >> The only way to crawl and fetch webpages/articles in case 2 is to
> > > > >> fetch URLs with a query string "?", while for case 1 I can set
> > > > >> the crawler NOT to fetch URLs containing "?". Thus, in my
> > > > >> regex-urlfilter.txt, I have commented out the following lines so
> > > > >> that my crawler fetches URLs with query strings.
> > > >>
> > > >> # skip URLs containing certain characters as probable queries, etc.
> > > >> #-[?*!@=]
> > > >>
> > > > >> The above setting causes the crawler to fetch all URLs, including
> > > > >> URLs with query strings, so pages such as download, login,
> > > > >> comments, search query, printer friendly pages, zoom-in views and
> > > > >> other low-value pages are being fetched. Practically, the crawler
> > > > >> is going into the deep web. The undesirable effects are as
> > > > >> follows:
> > > > >>
> > > > >> 1. Duplicate pages are being fetched, bloating the crawl DB
> > > > >> - Printer friendly view, zoom in view
> > > > >> e.g. site1.com/article1
> > > > >> e.g. site1.com/article1/?view=printerfriendly
> > > > >> e.g. site1.com/article1/?zoom=large
> > > > >> e.g. site1.com/article1/?zoom=extralarge
> > > > >>
> > > > >> 2. Download pages are being fetched, making the segments too large
> > > > >> e.g. site1.com/getcontentID?id=1&format=pdf
> > > > >> e.g. site1.com/getcontentID?id=1&format=doc
> > > > >>
> > > > >> 3. Crawling takes a very long time (10 days for depth 5) since it
> > > > >> is going into the deep web.
> > > >>
> > > > >> My current solution to the problem is to add additional regexes
> > > > >> to regex-urlfilter.txt to prevent the crawler from fetching
> > > > >> undesired pages. Now I have other problems:
> > > > >> 1. The regexes to exclude undesired URL patterns are not
> > > > >> exhaustive, for there are many sites and many patterns. Thus the
> > > > >> crawler is still going into the deep web.
> > > > >> 2. The list of exclusion regexes is getting too long: so far 50
> > > > >> regexes to exclude URL patterns.
> > > >>
> > > > >> I hope I am not the only one with this problem and that someone
> > > > >> knows a smarter way to solve it. Does anybody have a solution or
> > > > >> suggestion? Some tips or direction would be very much
> > > > >> appreciated.
> > > > >>
> > > > >> Btw, I am using nutch 1.2, but I believe the crawler principle is
> > > > >> pretty much the same.
> > > >>
> > > >> Warm Regards,
> > > >>
> > > >> Ye
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> --
> > > > >> View this message in context:
> > > > >> http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
> > > >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > > >>
> > > >
> > > >
> > >
> >
>
>
>
> --
> Don't Grow Old, Grow Up... :-)
>
