Re: [symfony-users] Re: CLI sf Task, allowed memory issue

2010-09-01 Thread Jochen Daum
Hi,

We've implemented a number of crawlers ourselves. The benefit of
using PHP is that they are easier to maintain when they are built in a
language a larger range of people can use.

To solve your specific problems, build a queue system.

Create a list table that holds the URLs you want to scrape. It
may also include details on how to log in to each site and which method
(POST, GET) to use.

Then create a daemon by doing the following:

- create a cron job that runs every minute:
 - set_time_limit(58 + $time_of_curl_request + $a_bit);
 - get the next url from the list
 - get its contents
 - scrape out the data you need - this may include generating new list urls
 - remove the url from the list table
 - measure the time; if 58 seconds have passed, terminate, otherwise
   continue with the next url.
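
The steps above could be sketched roughly like this. It is only an illustration under a few assumptions: runWorker(), $fetch, $scrape and $clock are made-up names, an in-memory array stands in for the list table, and the fetch and scrape steps are passed in as callables so the loop can be tried without touching the network.

```php
<?php
// Sketch only: runWorker() and its parameters are hypothetical names,
// and the $queue array stands in for the list table.

function runWorker(array $queue, callable $fetch, callable $scrape, $budget = 58, $clock = null)
{
    if ($clock === null) {
        // Default clock: wall time in seconds
        $clock = function () { return microtime(true); };
    }
    $start = $clock();

    // Keep picking up URLs until the list is empty or the budget is spent
    while ($queue && ($clock() - $start) < $budget) {
        $url = array_shift($queue);           // get next url, removing it from the list
        $body = $fetch($url);                 // get contents
        foreach ($scrape($body) as $found) {  // scraping may generate new list urls
            if (!in_array($found, $queue, true)) {
                $queue[] = $found;            // only de-duplicated against pending urls
            }
        }
    }

    return $queue; // whatever is left waits for the next cron run
}
```

In the real cron script you would call set_time_limit(58 + $time_of_curl_request + $a_bit) before runWorker(), and read from and write to the list table instead of an array. Because the process ends after each run, any memory it leaked is given back every minute.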


If you have to manage load on the target site and have certain
turnaround times for scrape data, you may also need a scheduler, which
decides what url gets scheduled for scrape when.

On Thu, Sep 2, 2010 at 2:20 PM, Dennis wrote:
> Also, I listened to a conversation with Justin Wage and someone from
> Digg. They use daemons for various things (kind of what you need to
> do). They kept a counter for each daemon, and when the counter reached
> some magic number of tasks the daemon had been assigned and completed,
> it was killed and a new one was started to replace it. This is
> more of a 'top level' garbage collection scheme, and it is IN PRODUCTION
> right now.
>
> On Sep 1, 11:17 am, pghoratiu  wrote:
>> Hi!
>>
>> My suggestion is to use PHP 5.3.X; it has improved garbage collection
>> and should help with reclaiming unused memory. You should also
>> group the code that is leaking inside separate function(s), so that
>> the PHP runtime knows it can release the memory for variables
>> within that scope.
>>
>>     gabriel
>>
>> On Sep 1, 12:11 pm, "PieR."  wrote:
>>
>> > Hi,
>>
>> > I have an sfTask in the CLI which uses a lot of foreach and preg_match
>> > calls, and unfortunately PHP returns an "Allowed memory size" error
>> > within a few minutes.
>>
>> > I read that PHP clears the memory when a script ends, so I tried to run
>> > tasks inside the main task, but the problem still remains.
>>
>> > How can I manage this memory issue? Clear memory, or launch tasks in
>> > separate processes?
>>
>> > The final aim is to build a web crawler which runs many hours per
>> > day.
>>
>> > Thanks in advance for help,
>>
>> > Regards,
>>
>> > Pierre
>>
>>

-- 
If you want to report a vulnerability issue on symfony, please send it to 
security at symfony-project.com

You received this message because you are subscribed to the Google
Groups "symfony users" group.
To post to this group, send email to symfony-users@googlegroups.com
To unsubscribe from this group, send email to
symfony-users+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/symfony-users?hl=en


[symfony-users] Re: CLI sf Task, allowed memory issue

2010-09-01 Thread Dennis
Also, I listened to a conversation with Justin Wage and someone from
Digg. They use daemons for various things (kind of what you need to
do). They kept a counter for each daemon, and when the counter reached
some magic number of tasks the daemon had been assigned and completed,
it was killed and a new one was started to replace it. This is
more of a 'top level' garbage collection scheme, and it is IN PRODUCTION
right now.
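
A rough sketch of that counter idea in PHP could look like the following. The names runUntilRecycled(), $nextTask and $maxTasks are made up for illustration, and the default limit of 500 is arbitrary; the conversation did not say which number Digg uses.

```php
<?php
// Sketch only: runUntilRecycled(), $nextTask and $maxTasks are hypothetical
// names, and the default of 500 is an arbitrary placeholder.

function runUntilRecycled(callable $nextTask, $maxTasks = 500)
{
    $completed = 0;

    while ($completed < $maxTasks) {
        $task = $nextTask();   // ask for the next unit of work
        if ($task === null) {
            break;             // nothing left to do, stop early
        }
        $task();               // do the work
        $completed++;          // count only tasks that finished
    }

    // The caller should exit the process here so that a supervisor
    // (cron, a shell loop, etc.) can start a fresh daemon.
    return $completed;
}
```

The calling script would exit right after runUntilRecycled() returns and rely on something outside PHP to start the replacement, so leaked memory never accumulates past $maxTasks tasks.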




[symfony-users] Re: CLI sf Task, allowed memory issue

2010-09-01 Thread Dennis
PHP is the easiest, but not the best for that purpose. If you're
crawling web sites, why not use web crawling software? Take a look at
Apache's web crawler. The Apache Software Foundation has some amazing
software in its 'forge'.

If you're doing web scraping, try dapper.net. Follow a link on the
main page to get to the 'old site'.




[symfony-users] Re: CLI sf Task, allowed memory issue

2010-09-01 Thread pghoratiu
Hi!

My suggestion is to use PHP 5.3.X; it has improved garbage collection
and should help with reclaiming unused memory. You should also
group the code that is leaking inside separate function(s), so that
the PHP runtime knows it can release the memory for variables
within that scope.
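
As a minimal sketch of both points (extractLinks() and processPages() are made-up names, and the regex is only a stand-in for whatever the task really matches): the per-page work lives in its own function so its local variables go out of scope on return, and gc_collect_cycles(), which is available from PHP 5.3, asks the collector to free reference cycles.

```php
<?php
// Sketch only: extractLinks() and processPages() are hypothetical names.

function extractLinks($html)
{
    // $matches is local to this call and is released when the function returns
    preg_match_all('/href="([^"]+)"/i', $html, $matches);
    return $matches[1];
}

function processPages(array $pages)
{
    $total = 0;
    foreach ($pages as $html) {
        $total += count(extractLinks($html));
        // PHP 5.3+: collect reference cycles that plain refcounting cannot free.
        // Called on every iteration here for illustration only.
        gc_collect_cycles();
    }
    return $total;
}
```

In a long-running task it is probably enough to trigger the collection every few hundred iterations rather than on each one.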

gabriel

On Sep 1, 12:11 pm, "PieR."  wrote:
> Hi,
>
> I have an sfTask in the CLI which uses a lot of foreach and preg_match
> calls, and unfortunately PHP returns an "Allowed memory size" error
> within a few minutes.
>
> I read that PHP clears the memory when a script ends, so I tried to run
> tasks inside the main task, but the problem still remains.
>
> How can I manage this memory issue? Clear memory, or launch tasks in
> separate processes?
>
> The final aim is to build a web crawler which runs many hours per
> day.
>
> Thanks in advance for help,
>
> Regards,
>
> Pierre
