Matthias Samwald wrote:
> Dear SIOC community,
>
> At the moment, I am thinking about possible ways of turning existing
> bulletin boards (often based on the popular vBulletin software) into
> SIOC, by crawling them and extracting the content.
>
> Does any of you have experience with crawling bulletin boards? Is
> there any existing software that could be built upon?
>
> Cheers,
> Matthias
>
Back in '99 I wrote a web crawler in Java that I called 'Blackbird.'
It was, in a lot of ways, like the airplane of the same name. (Yes,
web crawling is a bit of a 'black art'.)
Although it wasn't distributed, it had fancy concurrency control
and queuing policies; it could get a lot of the performance that would
be possible with a reasonable Unix box and internet connection.
Then I went through a phase of creating simpler and simpler web
crawlers. I kind of thought I was devolving until I saw the crawling
strategy Nutch uses and realized it was pretty much the same.
These days I'm a big believer in breadth-first crawling. The web
crawler runs in stages: stage N outputs a list of URLs to stage N+1.
The crawler itself is pretty dumb: it grabs the URLs and writes the
contents into files, or stuffs them into DB blobs. Concurrency control
can be ~simple~, for instance, just divide the list of tasks to do
into M sublists, fork into M children, and let each child do 1/M of
the work. (That's not the best strategy, but you can even do it in
Perl or PHP.)
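That fork-into-M-children scheme can be sketched roughly like this (the post suggests Perl or PHP; this Python version is just an illustration of the same idea, assuming a Unix-like system, with a placeholder in place of the real fetch):

```python
import os, pathlib, tempfile

def crawl(url, outdir):
    # Placeholder fetch: a real crawler would download the page here
    # and write the body to a file or stuff it into a DB blob.
    name = url.replace("://", "_").replace("/", "_")
    pathlib.Path(outdir, name).write_text(f"contents of {url}")

def run_stage(urls, outdir, m=4):
    # Divide the task list into m sublists, fork m children,
    # and let each child do 1/m of the work.
    sublists = [urls[i::m] for i in range(m)]
    pids = []
    for sub in sublists:
        pid = os.fork()
        if pid == 0:                # child: crawl its share, then exit
            for url in sub:
                crawl(url, outdir)
            os._exit(0)
        pids.append(pid)
    for pid in pids:                # parent: wait for every child
        os.waitpid(pid, 0)

outdir = tempfile.mkdtemp()
run_stage([f"http://example.org/page{i}" for i in range(10)], outdir)
fetched = len(os.listdir(outdir))
```

As the post notes, a static 1/M split isn't the best load-balancing strategy (one slow host can stall a whole sublist), but it is about as simple as concurrency gets.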
Once a stage of the crawl is done, I run some scripts that extract
whatever data comes out of the stage. The nice thing about having this
decoupled from the crawler is that you can fix bugs in your extractor
without having to re-run the crawl. The extractor sends URLs on to
stage N+1; you can even move URLs that were temporary failures in
stage N into stage N+1.
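A minimal sketch of such a decoupled extractor (the regex-based link extraction and file layout are illustrative assumptions, not the author's actual scripts):

```python
import pathlib, re, tempfile

LINK_RE = re.compile(r'href="([^"]+)"')

def extract_stage(stage_dir, temp_fails=()):
    # Read the pages stage N fetched and emit the URL list for stage
    # N+1. Because this runs after the crawl, you can fix extractor
    # bugs and re-run it without re-crawling; temporary failures from
    # stage N are simply carried forward into the next stage.
    next_urls = set(temp_fails)
    for page in pathlib.Path(stage_dir).glob("*.html"):
        for url in LINK_RE.findall(page.read_text()):
            next_urls.add(url)
    return sorted(next_urls)

# Tiny demo: one fetched page with two links, plus one retried URL.
stage_dir = tempfile.mkdtemp()
pathlib.Path(stage_dir, "page1.html").write_text(
    '<a href="http://example.org/a">a</a> '
    '<a href="http://example.org/b">b</a>')
next_stage = extract_stage(stage_dir,
                           temp_fails=["http://example.org/retry"])
```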
You'll usually see a rapid increase in the size of the stages, then
a gentle plateau; then it falls off and you're left with some
stragglers, which are all web traps. Terminate the crawl at that
point. The real advantage of breadth-first is that it easily shakes
off common web traps.
In early development or for small jobs you can run the stages
manually and have a lot of control over what's happening. In a more
mature system you can have a higher-level process start and stop the
stages, run the extractor scripts, decide when to terminate a crawl,
etc.
My current web crawler has a centralized work queue: other scripts
submit jobs to the crawler, which works through them and runs
callback scripts when jobs are completed. It works pretty nicely.
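The author's queue-based crawler isn't shown; a minimal sketch of the submit/work/callback pattern might look like this (all names are illustrative, and the real fetch is again a placeholder):

```python
import queue

class CrawlerQueue:
    # Centralized work queue: other scripts submit jobs, the crawler
    # works through them, and a callback runs as each job completes.
    def __init__(self):
        self.jobs = queue.Queue()

    def submit(self, url, callback):
        self.jobs.put((url, callback))

    def run(self):
        while not self.jobs.empty():
            url, callback = self.jobs.get()
            content = f"fetched {url}"   # the real fetch goes here
            callback(url, content)       # e.g. kick off an extractor

results = []
q = CrawlerQueue()
q.submit("http://example.org/1", lambda u, c: results.append((u, c)))
q.submit("http://example.org/2", lambda u, c: results.append((u, c)))
q.run()
```

In a real system the callbacks would typically be separate scripts forked by the crawler rather than in-process functions, which keeps the crawler itself as dumb as the staged design above.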
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"SIOC-Dev" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/sioc-dev?hl=en
-~----------~----~----~----~------~----~------~--~---