Matthias Samwald wrote:
> Dear SIOC community,
>
> At the moment, I am thinking about possible ways of turning existing
> bulletin boards (often based on the popular vBulletin software) into
> SIOC, by crawling them and extracting the content.
>
> Does any of you have experience with crawling bulletin boards? Is
> there any existing software that could be built upon?
>
> Cheers,
> Matthias
>
Back in '99 I wrote a web crawler in Java that I called 'Blackbird.'
It was, in a lot of ways, like the airplane of the same name. (Yes,
web crawling is a bit of a 'black art'.)
Although it wasn't distributed, it had fancy concurrency control
and queuing policies; it could get a lot of the performance that would
be possible with a reasonable Unix box and internet connection.
Then I went through a phase of creating simpler and simpler web
crawlers. I kind of thought I was devolving until I saw the crawling
strategy Nutch uses and realized it was pretty much the same.
These days I'm a big believer in breadth-first crawling. The web
crawler runs in stages: stage N outputs a list of URLs to stage N+1.
The crawler itself is pretty dumb: it grabs the URLs and writes the
contents into files, or stuffs them into DB blobs. Concurrency control
can be ~simple~, for instance, just divide the list of tasks to do
into M sublists, fork into M children, and let each child do 1/M of
the work. (That's not the best strategy, but you can even do it in
Perl or PHP.)
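That fork-into-M-children scheme can be sketched roughly like this (the post suggests Perl or PHP; this Python version is just an illustration of the same idea, assuming a Unix-like system, with a placeholder in place of the real fetch):

```python
import os, pathlib, tempfile

def crawl(url, outdir):
    # Placeholder fetch: a real crawler would download the page here
    # and write the body to a file or stuff it into a DB blob.
    name = url.replace("://", "_").replace("/", "_")
    pathlib.Path(outdir, name).write_text(f"contents of {url}")

def run_stage(urls, outdir, m=4):
    # Divide the task list into m sublists, fork m children,
    # and let each child do 1/m of the work.
    sublists = [urls[i::m] for i in range(m)]
    pids = []
    for sub in sublists:
        pid = os.fork()
        if pid == 0:                # child: crawl its share, then exit
            for url in sub:
                crawl(url, outdir)
            os._exit(0)
        pids.append(pid)
    for pid in pids:                # parent: wait for every child
        os.waitpid(pid, 0)

outdir = tempfile.mkdtemp()
run_stage([f"http://example.org/page{i}" for i in range(10)], outdir)
fetched = len(os.listdir(outdir))
```

As the post notes, a static 1/M split isn't the best load-balancing strategy (one slow host can stall a whole sublist), but it is about as simple as concurrency gets.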
Once a stage of the crawl is done, I run some scripts that extract
whatever data comes out of the stage. The nice thing about having this
decoupled from the crawler is that you can fix bugs in your extractor
without having to re-run the crawl. The extractor sends URLs on to
stage N+1; you can even move URLs that were temporary failures in
stage N into stage N+1.
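A minimal sketch of such a decoupled extractor (the regex-based link extraction and file layout are illustrative assumptions, not the author's actual scripts):

```python
import pathlib, re, tempfile

LINK_RE = re.compile(r'href="([^"]+)"')

def extract_stage(stage_dir, temp_fails=()):
    # Read the pages stage N fetched and emit the URL list for stage
    # N+1. Because this runs after the crawl, you can fix extractor
    # bugs and re-run it without re-crawling; temporary failures from
    # stage N are simply carried forward into the next stage.
    next_urls = set(temp_fails)
    for page in pathlib.Path(stage_dir).glob("*.html"):
        for url in LINK_RE.findall(page.read_text()):
            next_urls.add(url)
    return sorted(next_urls)

# Tiny demo: one fetched page with two links, plus one retried URL.
stage_dir = tempfile.mkdtemp()
pathlib.Path(stage_dir, "page1.html").write_text(
    '<a href="http://example.org/a">a</a> '
    '<a href="http://example.org/b">b</a>')
next_stage = extract_stage(stage_dir,
                           temp_fails=["http://example.org/retry"])
```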
You'll usually see a rapid increase in the size of the stages, then
a gentle plateau; then it falls off and you're left with some
stragglers, which are all web traps. Terminate the crawl at that
point. The real advantage of breadth-first is that it easily shakes
off common web traps.
In early development or for small jobs you can run the stages
manually and have a lot of control over what's happening. In a more
mature system you can have a higher-level process start and stop the
stages, run the extractor scripts, decide when to terminate a crawl,
etc.
My current web crawler has a centralized work queue: other scripts
submit jobs to the crawler, which works through them and runs
callback scripts when jobs are completed. It works pretty nicely.
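The author's queue-based crawler isn't shown; a minimal sketch of the submit/work/callback pattern might look like this (all names are illustrative, and the real fetch is again a placeholder):

```python
import queue

class CrawlerQueue:
    # Centralized work queue: other scripts submit jobs, the crawler
    # works through them, and a callback runs as each job completes.
    def __init__(self):
        self.jobs = queue.Queue()

    def submit(self, url, callback):
        self.jobs.put((url, callback))

    def run(self):
        while not self.jobs.empty():
            url, callback = self.jobs.get()
            content = f"fetched {url}"   # the real fetch goes here
            callback(url, content)       # e.g. kick off an extractor

results = []
q = CrawlerQueue()
q.submit("http://example.org/1", lambda u, c: results.append((u, c)))
q.submit("http://example.org/2", lambda u, c: results.append((u, c)))
q.run()
```

In a real system the callbacks would typically be separate scripts forked by the crawler rather than in-process functions, which keeps the crawler itself as dumb as the staged design above.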
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"SIOC-Dev" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/sioc-dev?hl=en
-~----------~----~----~----~------~----~------~--~---