Hey all,

When I ran nutch-selenium, my setup was configured such that closing Firefox
windows left behind zombie processes, and those zombies couldn't be reaped
(again, due to the Docker configuration I had).

In a normal setup, it should not be an issue - but if you're running 20 threads
in Nutch, that's potentially 20 open Firefox windows, which isn't good on
512 MB of memory.
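
For context on "reaping": a zombie is just an exited child whose parent never
called wait() on it. A minimal Python sketch of the parent-side cleanup
(purely illustrative - the function name is mine, not anything in the plugin):

```python
import os

def spawn_and_reap():
    """Fork a short-lived child (think: a Firefox process) and reap it.

    Skipping the waitpid() call is exactly what leaves zombies behind:
    the child's exit status sits in the process table until the parent
    collects it.
    """
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately
    reaped, status = os.waitpid(pid, 0)  # parent collects the exit status
    return reaped == pid and os.waitstatus_to_exitcode(status) == 0
```

In a container without a proper init as PID 1, orphaned zombies have nothing
to reap them, which is roughly the corner my Docker config painted me into.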

Selenium Grid is much more efficient, in that browsers are opened once and
tabs are used to fetch sites - and only the tabs are closed.
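
Roughly what that pattern looks like from the client side - a hedged sketch,
not the plugin's actual code: the hub URL, capability dict, and helper names
are my own illustration, and the Remote API shown is the Selenium 2.x-era one:

```python
HUB_URL = "http://localhost:4444/wd/hub"  # assumed Grid hub address

def fetch_pages(driver, urls):
    """Reuse a single remote browser session across many sites:
    navigate, grab the page source, and never tear down a whole
    browser per page."""
    pages = {}
    for url in urls:
        driver.get(url)
        pages[url] = driver.page_source
    return pages

def crawl_via_grid(urls):
    """Connect to the Grid hub and fetch each URL through one session.

    Requires a running hub and `pip install selenium`; never called here.
    """
    from selenium import webdriver
    driver = webdriver.Remote(
        command_executor=HUB_URL,
        desired_capabilities={"browserName": "firefox"},
    )
    try:
        return fetch_pages(driver, urls)
    finally:
        driver.quit()  # tear down the one session only when fully done
```

The win is in crawl_via_grid's shape: one Remote session amortized over every
URL, instead of a browser launch and teardown per fetch thread.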

Additionally, ensure you're using Nutch 2.2.1.

Feel free to fork, patch, tinker, and PR as needed.

Chris, if you want to be added as a contributor on the GitHub project, that's
cool with me! I wish I could dedicate more time to this, but I don't foresee
using Nutch again in the near future, and am now working on projects that
require lots of reading and possibly patches to Caffe and OpenCL R-CNN
projects.

Tl;dr:
- No, this shouldn't be typical unless you're creating zombies like crazy and
they're not being reaped (too many open file descriptors), running out of
memory, or hitting a similar resource constraint.
- Selenium Grid is TONs more efficient, but a bit more difficult to set up. I
used it to crawl 100Ks of sites.
- Unfortunately, I can't commit more time to this, but if I can assist in any
admin way, let me know.

Thank you,

Mo

This message was drafted on a tiny touch screen; please forgive brevity & typos

> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)" 
> <chris.a.mattm...@jpl.nasa.gov> wrote:
> 
> Oh yes, please up your memory to at least 2 GB.
> 
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Shuo Li <sli...@usc.edu>
> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
> Date: Friday, February 13, 2015 at 10:38 AM
> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
> Cc: Mo Omer <beancinemat...@gmail.com>
> Subject: Re: Vagrant Crushed When using Nutch-Selenium
> 
>> Hey Mo and Prof Mattmann,
>> 
>> 
>> I will try to crawl the 3 websites in the homework tonight (NASA AMD, NSF
>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's going
>> on. 
>> 
>> 
>> Is memory an issue? My vagrant only has 512MB of memory.
>> 
>> 
>> Regards,
>> Shuo Li
>> 
>> 
>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>> 
>> Hi Shuo,
>> 
>> Thanks for your email. I wonder if using selenium grid would
>> help?
>> 
>> Please see this plugin:
>> 
>> https://github.com/momer/nutch-selenium-grid-plugin
>> 
>> 
>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>> this while running the original selenium plugin - Mo, did using
>> Selenium Grid help with the issue that Shuo is experiencing below?
>> 
>> Mo: are you cool with porting the grid plugin to trunk, or with
>> Lewis or me doing it (with full credit to you, of course)?
>> 
>> Cheers,
>> Chris
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> -----Original Message-----
>> From: Shuo Li <sli...@usc.edu>
>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>> Date: Friday, February 13, 2015 at 10:12 AM
>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>> Subject: Vagrant Crushed When using Nutch-Selenium
>> 
>>> Hey guys,
>>> 
>>> 
>>> I'm trying to use Nutch-Selenium to crawl
>>> nutch.apache.org <http://nutch.apache.org>. However, my Vagrant VM seems
>>> to have crashed after a few minutes. I forced it to shut down, and it turns
>>> out it only crawled 59 websites. My Nutch version is 1.10 and my OS is
>>> Ubuntu Trusty, 14.04.
>>> 
>>> 
>>> Is there anything I can provide to you guys? Or does anybody have the
>>> same issue? Or is 59 websites the complete crawl?
>>> 
>>> 
>>> Any suggestion would be appreciated.
>>> 
>>> 
>>> Regards,
>>> Shuo Li
> 
