Thank you so much!! I am going to try it out tonight.

On Tuesday, February 17, 2015, Mohammed Omer <beancinemat...@gmail.com> wrote:
> Jiaxin,
>
> Each page takes about 3 seconds to crawl due to this piece of code - we
> allow selenium 3 seconds to grab the page [0]. Due to what I was crawling,
> I didn't want to wait for a specific element/class/id to show up. However,
> you can change it up if you want. Selenium documentation [1] has more info
> on explicit/implicit waiting.
>
> Again, it's not the most efficient way to crawl; but, if you need JS to
> render, it's a backwards way that ensures it happens. Selenium Grid has the
> benefit of being able to handle more throughput, but at the end of the day
> we're waiting for a browser to go out and fetch the url.
>
> I've suggested that most items be configurable when merged into trunk [2],
> but I'll make a specific call-out to the wait time.
>
> Due to the way Selenium standalone works, it's way less efficient
> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
> set-up.
>
> Wish I could help out more, but 30 threads might be too much. 5 threads,
> at a total fetch/parse time of 4 seconds per url, would still theoretically
> churn out more than 100k urls per day. There are multiple tweaks that could
> be made to optimize for your system; I'd start with reducing thread count,
> as you might be saturating your system [4].
>
> Sorry I can't be of more help!
>
> Thank you,
>
> Mo
>
> [0]:
> https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
> [1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
> [3]: https://code.google.com/p/selenium/wiki/Grid2
> [4]: http://stackoverflow.com/a/4895271
>
> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>
>> I am using fetcher.threads.per.queue = 30 by the way.
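
The estimate above (5 threads at roughly 4 seconds per URL giving over 100k URLs per day) can be checked with a quick back-of-the-envelope calculation. The sketch below is only an illustration: `urls_per_day` is a hypothetical helper, not part of Nutch or Selenium, and it assumes every thread stays busy with a constant per-URL time, which a real crawl is unlikely to achieve.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def urls_per_day(threads: int, seconds_per_url: float) -> int:
    """Theoretical upper bound on URLs fetched in one day.

    Assumption (not from the thread): each thread handles one URL at a
    time and the combined fetch/parse time per URL is constant.
    """
    if threads <= 0 or seconds_per_url <= 0:
        raise ValueError("threads and seconds_per_url must be positive")
    return int(threads * SECONDS_PER_DAY / seconds_per_url)


if __name__ == "__main__":
    # Figures quoted in the thread: 5 threads, about 4 seconds per URL.
    print(urls_per_day(threads=5, seconds_per_url=4))
```

With the figures quoted in the thread this comes to 108,000 URLs per day, which is consistent with the estimate of over 100k. It is an upper bound only; memory pressure or a saturated machine would lower it.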
>>
>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>>
>>> Hi Mo,
>>>
>>> I have a question about the selenium plugin on Mac. I think I
>>> successfully set it up on Mac, but I have a question about the performance.
>>> I am using a Mac with an Intel Core i5 processor and 8GB RAM, but I found
>>> that each url fetched takes about 1 second to open and close
>>> the Firefox window. Is that a normal speed, or is something wrong? And is
>>> it possible to install the selenium grid plugin on Mac? I will cry if you
>>> ask me to change machine now......
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <beancinemat...@gmail.com> wrote:
>>>
>>>> No worries man, glad everything works! Glad, since I was having
>>>> hostname issues with nutch/hbase just now as I quickly tried to get it
>>>> working/fixed for ya, ha.
>>>>
>>>> Mo
>>>>
>>>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <sli...@usc.edu> wrote:
>>>>
>>>>> Hey guys,
>>>>>
>>>>> After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>>>>> your help.
>>>>>
>>>>> Regards,
>>>>> Shuo Li
>>>>>
>>>>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
>>>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>>
>>>>>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>>>>>
>>>>>> I will work to get your nutch selenium grid plugin contributed
>>>>>> to work with Nutch 1.x.
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>>
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Chris Mattmann, Ph.D.
>>>>>> Chief Architect
>>>>>> Instrument Software and Science Data Systems Section (398)
>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>> Email: chris.a.mattm...@nasa.gov
>>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mo Omer <beancinemat...@gmail.com>
>>>>>> Date: Friday, February 13, 2015 at 11:10 AM
>>>>>> To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>>>>>> Cc: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>>>
>>>>>> >Hey all,
>>>>>> >
>>>>>> >When I had run nutch-selenium, it was in a config such that zombies were
>>>>>> >created from closing Firefox windows and they couldn't be reaped (again,
>>>>>> >due to the docker configuration I had).
>>>>>> >
>>>>>> >In a normal setup, it should not be an issue - if you're running 20
>>>>>> >threads in nutch that's potentially 20 open FF windows which isn't good
>>>>>> >for 512mb.
>>>>>> >
>>>>>> >Selenium grid is much more efficient, in that browsers are opened, but
>>>>>> >tabs are used to fetch sites - and only those are closed.
>>>>>> >
>>>>>> >Additionally, ensure you're using Nutch 2.2.1.
>>>>>> >
>>>>>> >Feel free to fork patch and tinker and PR as needed.
>>>>>> >
>>>>>> >Chris, if you want to be added to contribs on the GitHub project, that's
>>>>>> >cool with me! Wish I could dedicate more time to this, but I don't
>>>>>> >foresee using Nutch again in the near future, and am now working on
>>>>>> >projects that require lots of reading and possibly patches to Caffe and
>>>>>> >opencl r-CNN projects.
>>>>>> >
>>>>>> >Tl;dr:
>>>>>> >- no, this shouldn't be typical unless you're creating zombies like crazy
>>>>>> >and they're not being reaped (too many open file descriptors), running
>>>>>> >out of memory, or similar resource constraint.
>>>>>> >- selenium grid is TONs more efficient, but a bit more difficult to set
>>>>>> >up. I used it to crawl 100ks of sites.
>>>>>> >- unfortunately I can't commit more time to this, but if I can assist in
>>>>>> >any admin way, let me know.
>>>>>> >
>>>>>> >Thank you,
>>>>>> >
>>>>>> >Mo
>>>>>> >
>>>>>> >This message was drafted on a tiny touch screen; please forgive brevity &
>>>>>> >tpyos
>>>>>> >
>>>>>> >> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>>>>> >> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>>> >>
>>>>>> >> Oh yes, please up your memory to like at least 2Gb..
>>>>>> >>
>>>>>> >> -----Original Message-----
>>>>>> >> From: Shuo Li <sli...@usc.edu>
>>>>>> >> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >> Date: Friday, February 13, 2015 at 10:38 AM
>>>>>> >> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >> Cc: Mo Omer <beancinemat...@gmail.com>
>>>>>> >> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>>> >>
>>>>>> >>> Hey Mo and Prof Mattmann,
>>>>>> >>>
>>>>>> >>> I will try to crawl the 3 websites in the homework tonight (NASA AMD, NSF
>>>>>> >>> ACADIS and NSIDC Arctic Data Explorer). I will let you know what's going
>>>>>> >>> on.
>>>>>> >>>
>>>>>> >>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>>> >>>
>>>>>> >>> Regards,
>>>>>> >>> Shuo Li
>>>>>> >>>
>>>>>> >>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>>>> >>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>>> >>>
>>>>>> >>> Hi Shuo,
>>>>>> >>>
>>>>>> >>> Thanks for your email. I wonder if using selenium grid would
>>>>>> >>> help?
>>>>>> >>>
>>>>>> >>> Please see this plugin:
>>>>>> >>>
>>>>>> >>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>>> >>>
>>>>>> >>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>>>>> >>> this while running the original selenium plugin - Mo, did using
>>>>>> >>> selenium grid help the issue that Shuo is experiencing below?
>>>>>> >>>
>>>>>> >>> Mo: are you cool with porting the grid plugin, or if Lewis or
>>>>>> >>> I do it to trunk (with full credit to you of course)?
>>>>>> >>>
>>>>>> >>> Cheers,
>>>>>> >>> Chris
>>>>>> >>>
>>>>>> >>> -----Original Message-----
>>>>>> >>> From: Shuo Li <sli...@usc.edu>
>>>>>> >>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >>> Date: Friday, February 13, 2015 at 10:12 AM
>>>>>> >>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>>> >>>
>>>>>> >>>> Hey guys,
>>>>>> >>>>
>>>>>> >>>> I'm trying to use Nutch-Selenium to crawl
>>>>>> >>>> nutch.apache.org <http://nutch.apache.org>.
>>>>>> >>>> However, my vagrant seems to have
>>>>>> >>>> crashed after a few minutes. I forced it to shut down and it turns out it
>>>>>> >>>> only crawled 59 websites. My nutch version is 1.10 and my OS is Ubuntu
>>>>>> >>>> Trusty, 14.04.
>>>>>> >>>>
>>>>>> >>>> Is there anything I can provide to you guys? Or does anybody have the
>>>>>> >>>> same issue?
>>>>>> >>>> Or are 59 websites the complete crawl?
>>>>>> >>>>
>>>>>> >>>> Any suggestion would be appreciated.
>>>>>> >>>>
>>>>>> >>>> Regards,
>>>>>> >>>> Shuo Li