In my experience writing crawl-and-process bots like this [1], you want
an architecture where each step (e.g. get the list of pages, get each
page, parse the page, generate the output) is a separate job. That makes
it much easier to recover from the inevitable, unpredictable error
conditions. Generate a set of jobs, throw them in a queue, and have
forked agents pull them out, recording errors separately for review
later. You need a single process (a singleton) grabbing pages, so that
you can control how often you hit the server.
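
To make that layout concrete, here is a rough sketch in Python. It is
only an illustration under some assumptions of mine: threads stand in
for the forked agents, the step names ("list", "fetch", "parse") and all
function and class names are made up, and the fetch function is passed
in so you can plug in whatever HTTP client you use.

```python
import queue
import threading
import time


class RateLimitedFetcher:
    """The single object allowed to talk to the server.

    Every request goes through get(), which holds a lock and waits until
    at least min_interval seconds have passed since the previous request.
    """

    def __init__(self, fetch, min_interval):
        self._fetch = fetch              # hypothetical callable: url -> text
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._last = None

    def get(self, url):
        with self._lock:
            if self._last is not None:
                wait = self._min_interval - (time.monotonic() - self._last)
                if wait > 0:
                    time.sleep(wait)
            try:
                return self._fetch(url)
            finally:
                self._last = time.monotonic()


def build_handlers(fetcher, results):
    """One handler per step; each returns the follow-up jobs it creates."""

    def list_pages(index_url):
        # Assumption: the index is a whitespace-separated list of page URLs.
        return [("fetch", url) for url in fetcher.get(index_url).split()]

    def fetch_page(url):
        return [("parse", (url, fetcher.get(url)))]

    def parse_page(item):
        url, body = item
        results[url] = body.strip().upper()  # stand-in for real parsing
        return []

    return {"list": list_pages, "fetch": fetch_page, "parse": parse_page}


def run_pipeline(seed_jobs, handlers, num_workers=4):
    """Run (step, payload) jobs from a queue; return the recorded errors."""
    jobs = queue.Queue()
    errors = []
    errors_lock = threading.Lock()
    for job in seed_jobs:
        jobs.put(job)

    def worker():
        while True:
            job = jobs.get()
            if job is None:              # sentinel: no more work
                jobs.task_done()
                return
            step, payload = job
            try:
                for follow_up in handlers[step](payload):
                    jobs.put(follow_up)
            except Exception as exc:
                # A failed job is recorded for later review; the rest go on.
                with errors_lock:
                    errors.append((step, payload, repr(exc)))
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    jobs.join()                          # all jobs, including follow-ups, done
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
    return errors
```

Because each step is its own job, a page that fails to download or parse
only adds one entry to the error list, and you can re-queue just those
jobs later instead of restarting the whole crawl.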

Good luck!
James

[1]: Howison, J. and Crowston, K. (2004). The perils and pitfalls of
mining Sourceforge. In Proc. of the Workshop on Mining Software
Repositories at the International Conference on Software Engineering
(ICSE).
http://citeseer.ist.psu.edu/howison04perils.html

On Oct 8, 2009, at 09:18, Alexandre Passant wrote:

>
> Hi,
>
> On 8 Oct 2009, at 14:04, Matthias Samwald wrote:
>
>>
>> Hi Seb,
>>
>> Basically, help with the actual programming from an experienced
>> programmer. For example, I could write down how the vBulletin archive
>> HTML should be mapped to SIOC, and someone else could help with
>> writing the code. However, it might still be more efficient if I just
>> start hacking right away...
>>
>> Another possible form of help would be writing some of the necessary
>> code snippets for extracting the various attributes from the archive
>> HTML pages (e.g., PHP code that uses SimpleXML and regular expressions
>> to extract each post, its author, its content, thread title, date,
>> links to external resources, et cetera). Yes, I guess that would be
>> more efficient.
>
> I guess the current SIOC PHP API, available at [1], may help you
> write such a wrapping service.
> If you have any questions about this API, please ask us on the
> mailing list.
>
> Best,
>
> Alex.
>
> [1] http://wiki.sioc-project.org/index.php/PHPExportAPI
>
>>
>> -- Matthias
>>
>> On 8 Oct, 13:56, seb <[email protected]> wrote:
>>> Hi Matthias,
>>>
>>> What kind of help/contribution would you need more specifically?
>>>
>>> /seb
>>>
>>> On 8 oct, 11:06, Matthias Samwald <[email protected]> wrote:
>>>
>>>> Oh, and if someone feels motivated to help with this project, please
>>>> say so! :)
>>>
>
> --
> Dr. Alexandre Passant
> Digital Enterprise Research Institute
> National University of Ireland, Galway
> :me owl:sameAs <http://apassant.net/alex> .
>

