Feng Lu: Thanks for the tip. I will definitely try the approach. Appreciate your help.
Tejas: I am using the grepping approach, filtering out some keywords from the fetched log. So far I have observed that about 20% of the fetched list is filled with not-so-important URLs. I hope an optimized filter will do some good for my crawler's performance. Thanks for your directions.

Cheers,
Ye

On Mon, Feb 25, 2013 at 3:31 AM, Tejas Patil <[email protected]> wrote:

@Ye: you need not look at each URL. Random sampling will be better. It won't be accurate, but it is the practical thing to do. Even while going through the logs, extract the URLs and sort them so that all of those belonging to the same host lie in the same group.

@feng lu: +1. Good trick to remove the bad URLs using normalization. The main problem for the OP will still be coming up with such rules by manually observing the logs.

Thanks,
Tejas Patil

On Sun, Feb 24, 2013 at 7:16 AM, feng lu <[email protected]> wrote:

Hi Ye,

Can you add this pattern to the regex-normalize.xml configuration file for the RegexURLNormalizer class?

    <!-- removes session ids from urls (such as jsessionid and PHPSESSID) -->
    <regex>
      <pattern>([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid|view|zoom)=.*?)(\?|&|#|$)</pattern>
      <substitution>$4</substitution>
    </regex>

It will remove session ids as well as parameters such as view and zoom from URLs, e.g.

    site1.com/article1/?view=printerfriendly
    site1.com/article1/?zoom=large
    site1.com/article1/?zoom=extralarge

all normalize to

    site1.com/article1

--
Don't Grow Old, Grow Up... :-)
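As a quick sanity check of that pattern outside Nutch, a minimal sketch with plain java.util.regex is shown below. This is not Nutch's RegexURLNormalizer itself: the second pattern only approximates the cleanup of stray "?" and "&" that the other default rules in regex-normalize.xml take care of.

    import java.util.regex.Pattern;

    // Standalone sanity check of the proposed regex-normalize.xml pattern.
    // Not Nutch itself: TRAILING_JUNK only approximates the cleanup of stray
    // "?"/"&" that Nutch's other default normalization rules handle.
    public class NormalizePatternTest {

        private static final Pattern SESSION_PARAM = Pattern.compile(
            "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid|view|zoom)=.*?)(\\?|&|#|$)");

        private static final Pattern TRAILING_JUNK = Pattern.compile("[?&]+$");

        static String normalize(String url) {
            String stripped = SESSION_PARAM.matcher(url).replaceAll("$4");
            return TRAILING_JUNK.matcher(stripped).replaceAll("");
        }

        public static void main(String[] args) {
            String[] samples = {
                "site1.com/article1/?view=printerfriendly",
                "site1.com/article1/?zoom=large",
                "site1.com/article1/?zoom=extralarge",
                "site2.com/?pid=123"   // a wanted query-string URL; must survive
            };
            for (String s : samples) {
                System.out.println(s + " -> " + normalize(s));
            }
        }
    }

Once the rule is actually in regex-normalize.xml, the more authoritative test is to run it through Nutch itself, e.g. with the URLNormalizerChecker utility if your Nutch version includes it.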
On Sun, Feb 24, 2013 at 9:48 PM, Ye T Thet <[email protected]> wrote:

Tejas,

Thanks for your pointers. They are really helpful. My current approach follows your directions 1, 2 and 3. Since my sites only number around 10k, I hope it will stay manageable for the near future.

I might need to apply directions 4 and 5 later as well, but I suspect getting those right may be out of my league.

Some extra information on my approach: most of my target sites run on a CMS, and quite a number of them do NOT use pretty URLs. I have been grepping the logs, identifying the patterns of redundant or unimportant URLs, and adding regex rules to regex-urlfilter. Two million URLs is quite hard to process for one man though. Phew!

I will share if I find an approach that could benefit us all.

Regards,
Ye

On Sat, Feb 23, 2013 at 12:22 PM, Tejas Patil <[email protected]> wrote:

One correction below, marked with asterisks.

On Fri, Feb 22, 2013 at 8:20 PM, Tejas Patil <[email protected]> wrote:

I think that what you have done so far is logical. Typically in Nutch crawls people don't want URLs with query strings, but nowadays things have changed. For instance, category #2 you pointed out may capture some vital pages. I once ran into a similar issue. A crawler can't be made intelligent beyond a certain point, and I had to go through the crawl logs to check which URLs were being fetched and then redefine my regex rules.

Some things that I had considered doing:

1. Start off with rules which are less restrictive and observe the logs to see which URLs are visited. This will give you an idea about the bad URLs and the good ones. As you have already crawled for 10 days, you are (just!) left with studying the logs.
2. After #1 is done, launch crawls with accept rules for the good URLs and put "-." at the end to reject the bad ones.
3. Having a huge list of regexes is a bad thing, because comparing URLs against regexes is a costly operation done for every URL, and a URL that matches early saves that time. So put the patterns that capture a huge set of URLs at the top of the regex urlfilter file.
4. Sometimes you don't want the parser to extract URLs from certain areas of a page, because you know they won't yield anything useful. Say the "print" or "zoom" URLs come from some specific tags of the HTML source: it is better not to parse those parts at all, so those URLs never enter the crawl in the first place. The profit here is that the regex rules you have to define are reduced.
5. An improvement over *#4*: if you know the nature of the pages being crawled, you can tweak the parser to extract URLs from specific tags only. This reduces noise and gives a much cleaner fetch list.

As far as I can tell, this problem won't have an automated solution like modifying some config/setting. A decent amount of human intervention is required to get things right. Knowing the nature of the pages you plan to crawl is vital to making smart decisions.

Thanks,
Tejas Patil
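To make point 1 (and the random-sampling suggestion in the Feb 25 reply above) concrete, a rough helper might look like the sketch below: pull fetched URLs out of a crawl log, group them by host, and print a few random samples per host so the junk-URL patterns stand out without reading every line. The log path and the "fetching <url>" line format are assumptions about the setup, so adjust them to whatever your fetcher actually writes.

    import java.io.IOException;
    import java.net.URI;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.TreeMap;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Rough helper (not part of Nutch): extract fetched URLs from a crawl log,
    // group them by host, and print a few random samples per host.
    public class FetchLogSampler {

        // Assumed log line format: "... fetching http://host/path ...".
        private static final Pattern FETCH_LINE = Pattern.compile("fetching\\s+(\\S+)");
        private static final int SAMPLE_PER_HOST = 5;

        public static void main(String[] args) throws IOException {
            String logFile = args.length > 0 ? args[0] : "logs/hadoop.log"; // assumed path

            Map<String, List<String>> byHost = new TreeMap<>();
            for (String line : Files.readAllLines(Paths.get(logFile))) {
                Matcher m = FETCH_LINE.matcher(line);
                if (!m.find()) continue;
                String url = m.group(1);
                String host;
                try {
                    host = URI.create(url).getHost();
                } catch (IllegalArgumentException e) {
                    continue;                              // skip malformed URLs
                }
                if (host == null) continue;
                byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(url);
            }

            Random rnd = new Random();
            for (Map.Entry<String, List<String>> e : byHost.entrySet()) {
                List<String> urls = e.getValue();
                Collections.shuffle(urls, rnd);
                System.out.println("== " + e.getKey() + " (" + urls.size() + " fetched)");
                for (String u : urls.subList(0, Math.min(SAMPLE_PER_HOST, urls.size()))) {
                    System.out.println("   " + u);
                }
            }
        }
    }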
On Fri, Feb 22, 2013 at 5:52 PM, ytthet <[email protected]> wrote:

Hi Folks,

I have a question on crawling URLs with query strings. I am crawling about 10,000 sites. Some of the sites use query strings to serve their content while some use simple URLs. For example, I have the following cases:

Case 1:

    site1.com/article1
    site1.com/article2

Case 2:

    site2.com/?pid=123
    site2.com/?pid=124

The only way to crawl and fetch webpages/articles in case 2 is to fetch URLs with a query string ("?"), while for case 1 I can choose NOT to fetch any "?" URL. So currently, in my regex-urlfilter.txt, I commented out the following lines so that my crawler fetches URLs with query strings:

    # skip URLs containing certain characters as probable queries, etc.
    #-[?*!@=]

This setting causes the crawler to fetch all URLs, including URLs with query strings, so pages such as download, login, comments, search-query, printer-friendly and zoomed views, and other not-so-valuable pages are being fetched. Practically, the crawler is going into the deep web. The undesirable consequences are the following:

1. Duplicate pages are being fetched, bloating the crawl DB (printer-friendly views, zoomed views):

    site1.com/article1
    site1.com/article1/?view=printerfriendly
    site1.com/article1/?zoom=large
    site1.com/article1/?zoom=extralarge

2. Download pages are being fetched, making the segments too large:

    site1.com/getcontentID?id=1&format=pdf
    site1.com/getcontentID?id=1&format=doc

3. Crawling takes a very long time (10 days for depth 5), since it is going into the deep web.

My current solution is to add additional regexes to regex-urlfilter.txt to prevent the crawler from fetching the undesired pages. Now I have two more problems:

1. The regexes to exclude undesired URL patterns are not exhaustive, since there are many sites and many patterns, so the crawler is still going into the deep web.
2. The list of exclusion filters is getting too long: so far 50 regexes to exclude URL patterns.

I hope I am not the only one with this problem and that someone knows a smarter way to solve it. Does anybody have a solution or a suggestion on how to approach it? Any tips or directions would be very much appreciated.

By the way, I am using Nutch 1.2, but I believe the crawler principle is pretty much the same.

Warm Regards,

Ye

--
View this message in context: http://lucene.472066.n3.nabble.com/Crawling-URLs-with-query-string-while-limiting-only-web-pages-tp4042381.html
Sent from the Nutch - User mailing list archive at Nabble.com.
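For what it is worth, the "accept the known-good patterns first, reject broadly afterwards" ordering discussed above might look roughly like this in regex-urlfilter.txt. The hosts and parameter names are just the examples from this thread, not a drop-in rule set; rules are tried top to bottom and the first match wins.

    # accept the wanted query-string content first (frequent matches belong near the top)
    +^http://site2\.com/\?pid=\d+$

    # reject the duplicate / printer / zoom / download variants explicitly
    -[?&](view|zoom|format|print)=

    # reject other probable query pages, logins, searches, etc.
    -[?*!@=]

    # accept everything else
    +.

Putting the broad reject rules after the explicit accepts keeps the wanted ?pid= pages while still cutting off the print/zoom/download variants early, in line with points 2 and 3 above.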

