László Török:

> I was wondering though how do you make sure two
> crawlers do not crawl the same URL twice if there is no global state? :)

By adding shared state within a single app instance, typically an atom. As for
separating different instances, it is not uncommon to hash seed URLs (or
domains) in such a way that two instances simply won't crawl the same site in
parallel.
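Both ideas can be sketched in a few lines of Clojure. Everything here is
illustrative (`visited`, `claim-url!`, and `mine?` are made-up names, not
Itsy's API):

```clojure
;; Sketch only, not Itsy's implementation.

(def visited (atom #{}))

(defn claim-url!
  "Atomically adds url to the visited set; returns true only for the
  first caller to claim it. swap! may retry on contention, so the flag
  is reset on every attempt and reflects the committed outcome."
  [url]
  (let [claimed? (atom false)]
    (swap! visited
           (fn [seen]
             (if (contains? seen url)
               (do (reset! claimed? false) seen)
               (do (reset! claimed? true) (conj seen url)))))
    @claimed?))

(defn mine?
  "Partitions work across n crawler instances by hashing the host, so
  two instances never end up crawling the same site in parallel."
  [url n-instances my-id]
  (= my-id (mod (hash (.getHost (java.net.URI. url))) n-instances)))
```

Each instance is started with its own `my-id` and simply drops any URL for
which `mine?` is false.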


> You may also consider using the sitemap as a source of urls per domain,
> although this depends on the crawling policy.

That does not work well in practice. One reason is that sitemaps are often
incomplete, out of date, or missing entirely. Another is that for most news
websites and blogs, you will discover site structure a lot faster by
frequently (within reason, of course) recrawling either first-level pages or
a seed of known "section" pages.
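As a hypothetical sketch of that recrawl-the-seeds approach (the names and
the shape of `crawl-fn` are assumptions, not anything from Itsy), one pass
over the section pages yields whatever URLs are new since the last pass:

```clojure
(require '[clojure.set :as set])

(defn recrawl-pass
  "One pass over a seed of known section pages. crawl-fn fetches a page
  and returns the URLs linked from it; we keep only URLs that are
  neither seeds themselves nor already known. Run this on an interval
  (within reason) instead of relying on a sitemap."
  [crawl-fn seeds known]
  (set/difference (set (mapcat crawl-fn seeds))
                  (set seeds)
                  known))
```

Running such a pass every few minutes against the front page and section
indexes tends to surface new articles long before they show up in a sitemap,
if they ever do.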

There is a really good workshop video on Web mining from Strata Santa Clara
2012; it highlights a couple dozen common problems you face when designing
Web crawlers:

http://my.safaribooksonline.com/video/-/9781449336172

Highly recommended for people who are interested or work in this area. (I
think it can be purchased separately; O'Reilly Safari subscribers have access
to the entire video set.)

I am by no means an expert (or even very experienced) in this area, but Itsy
0.1.0 has features that solve several very common problems out of the box.
Good job.

MK

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.