php-general Digest 3 Oct 2010 10:00:36 -0000 Issue 6971

Topics (messages 308405 through 308407):

Scraping Multiple sites
        308405 by: Russell Dias
        308406 by: chris h

Vermis - new issue tracker in PHP
        308407 by: Lukasz Cepowski

Administrivia:

To subscribe to the digest, e-mail:
        [email protected]

To unsubscribe from the digest, e-mail:
        [email protected]

To post to the list, e-mail:
        [email protected]


----------------------------------------------------------------------
--- Begin Message ---
I'm currently stuck on a little problem. I'm using cURL in conjunction
with DOMDocument and XPath to scrape data from a couple of websites.
Please note that this is only for personal and educational purposes.
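
For reference, a minimal sketch of that cURL + DOMDocument + XPath
combination (the URL and the XPath expression below are placeholders,
not taken from any of the actual sites):

    <?php
    // Fetch a page with cURL; the target URL is a placeholder.
    $ch = curl_init('http://example.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // Parse it; real-world HTML is rarely valid, so silence libxml
    // warnings instead of letting them flood the output.
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Query with XPath; this expression is only an example.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//h2[@class="title"]') as $node) {
        echo trim($node->textContent), "\n";
    }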

Right now I have 5 independent scripts (one traversing each of 5
websites) that run via cron every 12 hours. However, as you may have
guessed, this is a scalability nightmare: if my list of websites to
scrape grows, I have to create yet another independent script and run
it via cron.

My knowledge of OOP is fairly basic, as I have only just gotten
started with it, but could anyone suggest a design pattern that would
suit my needs? My idea would be to create an abstract class for the
web crawler and then extend it for each website I add, as sketched
below. However, as I said, my experience with OOP is almost
non-existent, so I have no idea how well this would scale. I want this
'crawler' to be one application that runs via one cron entry, rather
than n separate scripts, one per website, each needing its own
manually created cron entry.
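
A rough sketch of that abstract-class idea, assuming made-up class and
method names (SiteScraper, startUrl() and extract() are illustrative
only, not an existing API):

    <?php
    // One abstract base class holds the shared fetch/parse plumbing;
    // each website gets its own small subclass.
    abstract class SiteScraper
    {
        // Where this site's crawl starts.
        abstract protected function startUrl();

        // How to pull this site's data out of the parsed document.
        abstract protected function extract(DOMXPath $xpath);

        public function run()
        {
            $doc = new DOMDocument();
            libxml_use_internal_errors(true);
            $doc->loadHTML(file_get_contents($this->startUrl()));
            libxml_clear_errors();
            return $this->extract(new DOMXPath($doc));
        }
    }

    // Adding a site then means adding one subclass, not one script.
    class ExampleComScraper extends SiteScraper
    {
        protected function startUrl()
        {
            return 'http://example.com/';
        }

        protected function extract(DOMXPath $xpath)
        {
            $titles = array();
            foreach ($xpath->query('//h2') as $node) {
                $titles[] = trim($node->textContent);
            }
            return $titles;
        }
    }

A single driver script could then loop over an array of these
subclasses, so one cron entry covers every site.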

Or does anyone have any experience with this sort of thing and could
maybe offer some advice?

I'm not limited to using PHP either; however, due to hosting
constraints, Python would most likely be my only other alternative.

Any help would be appreciated.

Cheers,
Russell

--- End Message ---
--- Begin Message ---
On Sat, Oct 2, 2010 at 9:03 PM, Russell Dias <[email protected]> wrote:

> [original message quoted in full; snipped]

Are the sites that you are crawling so different as to justify
maintaining separate chunks of code for each one?  I would try to
avoid having any code specific to a site; otherwise, scaling your
application to support even a hundred sites would mean maintaining
hundreds of overlapping pieces of site-specific functionality, a
logistical nightmare.  Unless you're simply wanting to do this for
educational reasons...

My suggestion would be to create an application that can crawl all
the sites without site-specific code. You could fire it with a single
cron job and give it a list of the URLs you want it to hit. It can
crawl one URL, record the findings, move to the next, and repeat, as
in the sketch below.
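
Something along these lines, assuming a plain text URL list
(sites.txt, findings.log and the XPath expression are made-up names,
just for illustration):

    <?php
    // One generic script, fed a list of URLs; a single cron entry
    // fires it. sites.txt holds one URL per line.
    $urls = file('sites.txt',
        FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

    foreach ($urls as $url) {
        $html = @file_get_contents($url);
        if ($html === false) {
            error_log("fetch failed: $url");
            continue;
        }

        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();

        // Record the findings; here just the page title, as an example.
        $xpath = new DOMXPath($doc);
        $title = $xpath->evaluate('string(//title)');
        file_put_contents('findings.log', "$url\t$title\n", FILE_APPEND);
    }

One crontab entry every 12 hours would then cover the whole list:

    0 */12 * * * php /path/to/crawler.php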


Chris.

--- End Message ---
--- Begin Message ---
Hello,

I would like to introduce the project that I have been working on for a few months :) The project is called Vermis (Latin for 'worm', as in a bug). It is an Open Source issue tracker and project management tool for software developers and project managers, created to improve code quality and the efficiency and speed of development. Designed as a standard web application written in PHP, it can be used on almost any platform and hosting service, including Windows, Linux and more.

The project is available here: http://vermis.diabloware.com
The online demo is here: http://vermis.diabloware.com/demo

The long-term goal is to compete with commercial products like Jira and with other open source software like Trac, Redmine, Mantis, Bugzilla, etc. Vermis is distributed under the terms of the GNU General Public License, so you can use it in both open and closed source projects.

Why does Vermis exist?
- Jira has a lot of features, but it is hard to use and, above all, it is commercial software
- Redmine needs Ruby on Rails, which is resource-hungry
- Trac needs Python
- Bugzilla needs Perl
- Mantis, hmm, I just didn't like it ;)

Why is Vermis better than the other products?
- Vermis is written in PHP and uses MySQL, which is probably the most
widespread and cheapest web platform nowadays
- It doesn't require any additional software on the hosting server (except mod_rewrite, which is also very common)
- It currently offers functionality similar to Jira's
- It is growing very fast :)

What does Vermis already have?
- Multiple projects in one place
- Web access from any place on Earth
- Public and private projects
- Many types of issues
- Components
- Milestones
- Versioning and the history of changes
- Dynamic grids (issue navigator)
- Many user accounts
- Online registration
- Notes
- File upload
- Comments
- Progress bars
- Email notifications

What will Vermis have?
- API via SOAP or REST
- Graphical reporting
- Burndown charts
- Agile support (Scrum)
- Custom issue types, priorities, statuses, etc.
- Dynamic access control lists
- Automatic collection of reports from external applications
- Wrappers for PHP, Java, C#
- Many more ;)

I invite you to watch, test and use Vermis.
Since version 1.0 RC3, Vermis has been its own bug tracker, available here: http://bugs.diabloware.com
The latest source code can be downloaded from: http://vermis.diabloware.com/download
Any questions can be posted on the official project forum: http://forum.diabloware.com

I'm looking forward to any feedback, comments and critique :)

Thanks,
Lukasz (cepa) Cepowski
DiabloWare :: Software from Hell!
www.diabloware.com | www.cepowski.pl
skype: lukasz.cepowski
cell:  +48 502 670 711

--- End Message ---
