My experience with writing crawl and processing bots like this [1] is that you want an architecture in which each step (e.g. get the list of pages, get each page, parse the page, generate the output) is a separate job. That makes it much easier to recover from the inevitable, unpredictable error conditions. Generate a set of jobs, throw them in a queue, and have forked agents pull them out and record errors separately for later review. You also want a singleton grabbing pages, so that you can control how often you hit the server.
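To make that concrete, here is a rough sketch of the shape I mean, not a finished crawler. Everything named in it is an assumption for illustration: `fake_fetch` stands in for a real HTTP GET, the "parse" step just measures the page length, and I use threads rather than forked processes to keep it short.

```python
import queue
import threading
import time


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP GET; swap in a real client."""
    if "bad" in url:
        raise IOError("could not fetch " + url)
    return "<html>" + url + "</html>"


def run_pipeline(urls, fetch=fake_fetch, min_interval=0.01, n_workers=3):
    fetch_jobs = queue.Queue()  # one job per page to get
    parse_jobs = queue.Queue()  # one job per page to parse
    results, errors = [], []    # errors are kept apart for later review

    def fetcher():
        # The singleton: the only thread that touches the server,
        # so the sleep below bounds how often we hit it.
        while True:
            url = fetch_jobs.get()
            if url is None:  # sentinel: no more pages
                break
            try:
                parse_jobs.put((url, fetch(url)))
            except Exception as exc:
                errors.append(("fetch", url, repr(exc)))
            time.sleep(min_interval)

    def parser():
        # Any number of these can run side by side.
        while True:
            job = parse_jobs.get()
            if job is None:  # sentinel: fetching is finished
                break
            url, html = job
            try:
                results.append((url, len(html)))  # placeholder for real parsing
            except Exception as exc:
                errors.append(("parse", url, repr(exc)))

    for url in urls:
        fetch_jobs.put(url)
    fetch_jobs.put(None)

    fetch_thread = threading.Thread(target=fetcher)
    workers = [threading.Thread(target=parser) for _ in range(n_workers)]
    fetch_thread.start()
    for w in workers:
        w.start()
    fetch_thread.join()
    for _ in workers:  # one sentinel per parser, queued after all real jobs
        parse_jobs.put(None)
    for w in workers:
        w.join()
    return results, errors
```

Calling `run_pipeline(["page1", "bad-page", "page2"])` returns the parsed results plus a separate list of failures, so one broken page costs you one entry in the error list rather than the whole run. For a real crawl you would probably want the queues persisted somewhere (a database table, say) so that a crash does not lose the pending jobs.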
Good luck!
James

[1]: Howison, J. and Crowston, K. (2004). The perils and pitfalls of mining Sourceforge. In Proc. of Workshop on Mining Software Repositories at the International Conference on Software Engineering ICSE. http://citeseer.ist.psu.edu/howison04perils.html

On Oct 8, 2009, at 09:18, Alexandre Passant wrote:
>
> HI,
>
> On 8 Oct 2009, at 14:04, Matthias Samwald wrote:
>
>>
>> Hi Seb,
>>
>> Basically help with doing the actual programming by some experience
>> programmer. For example, I could write down how the vBulletin archive
>> HTML should be mapped to SIOC, and someone else could help with
>> writing the code. However, it might still be more efficient if I just
>> start hacking right away...
>>
>> Another possible help could be writing some of the necessary code
>> snippets for extracting the various attributes from the archive HTML
>> pages (e.g., as PHP code that uses SimpleXML and regular expressions
>> to extract each post, each author, each content, thread title, date,
>> links to external resources et cetera). Yes, I guess that would be
>> more efficient.
>
> I guess the current SIOC PHP API may help you to write such wrapping
> service, available at [1].
> If you have any question wrt this API, please ask us on the ML.
>
> Best,
>
> Alex.
>
> [1] http://wiki.sioc-project.org/index.php/PHPExportAPI
>
>>
>> -- Matthias
>>
>> On 8 Okt., 13:56, seb <[email protected]> wrote:
>>> Hi Matthias,
>>>
>>> What kind of help/contribution would you need more specifically?
>>>
>>> /seb
>>>
>>> On 8 oct, 11:06, Matthias Samwald <[email protected]> wrote:
>>>
>>>> Oh, and if someone feels motivated to help with this project,
>>>> please
>>>> say so! :)
>>>
>
> --
> Dr. Alexandre Passant
> Digital Enterprise Research Institute
> National University of Ireland, Galway
> :me owl:sameAs <http://apassant.net/alex> .
