Thank you so much!! I am going to try it out tonight.

On Tuesday, February 17, 2015, Mohammed Omer <beancinemat...@gmail.com>
wrote:

> Jiaxin,
>
> Each page takes about 3 seconds to crawl because of this piece of code: we
> allow Selenium 3 seconds to grab the page [0]. Due to what I was crawling,
> I didn't want to wait for a specific element/class/id to show up. However,
> you can change it up if you want. The Selenium documentation [1] has more
> info on explicit/implicit waits.
>
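The trade-off between the fixed 3-second pause and waiting for a specific condition can be sketched in plain Python. This is only an illustration of the idea, not Selenium's API and not the plugin's code: `fixed_wait`, `wait_until`, and their parameters are hypothetical names, and in a real crawl the condition would be a check for the element/class/id you care about.

```python
import time
from typing import Callable


def fixed_wait(seconds: float = 3.0) -> None:
    """Always pause for the full duration, whether or not the page is ready."""
    time.sleep(seconds)


def wait_until(condition: Callable[[], bool],
               timeout: float = 3.0,
               poll_interval: float = 0.1) -> bool:
    """Poll `condition` until it is true or `timeout` seconds pass.

    Returns True as soon as the condition holds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

With a condition-based wait, a page that is ready early costs less than the full timeout, while a page that never satisfies the condition still costs the whole timeout. The fixed pause is simpler when there is no single element to wait for.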
> Again, it's not the most efficient way to crawl; but, if you need JS to
> render, it's a backwards way that ensures it happens. Selenium Grid has the
> benefit of being able to handle more throughput, but at the end of the day
> we're waiting for a browser to go out and fetch the url.
>
> I've suggested that most items be configurable when merged into trunk [2],
> but I'll make a specific call-out to the wait time.
>
> Due to the way Selenium standalone works, it's wayyyyyy less efficient
> than a 'Grid' set-up (hub + nodes) [3], which is why I switched to that
> set-up.
>
> Wish I could help out more, but 30 threads might be too much. 5 threads,
> at a total fetch/parse time of 4 seconds per url, would still theoretically
> churn out > 100k urls per day. There are multiple tweaks that could be made
> to optimize for your system; I'd start with reducing the thread count, as
> you might be saturating your system [4].
>
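As a quick check of the arithmetic above: 5 threads, each finishing one url every 4 seconds around the clock, gives a theoretical ceiling of 108,000 urls per day. A small Python sketch, assuming the threads run fully in parallel with no politeness delays or failures (the function name is made up for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def urls_per_day(threads: int, seconds_per_url: float) -> int:
    """Theoretical upper bound: every thread fetches back-to-back all day."""
    return int(threads * SECONDS_PER_DAY / seconds_per_url)


print(urls_per_day(5, 4))  # 108000
```

Real throughput will be lower, and adding threads only raises this ceiling if the machine can actually sustain that many browsers at once.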
> Sorry I can't be of more help!
>
> Thank you,
>
> Mo
>
> [0]:
> https://github.com/momer/nutch-selenium/blob/master/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java#L48-L49
> [1]: http://docs.seleniumhq.org/docs/04_webdriver_advanced.jsp
> [2]: https://issues.apache.org/jira/browse/NUTCH-1933
> [3]: https://code.google.com/p/selenium/wiki/Grid2
> [4]: http://stackoverflow.com/a/4895271
>
> On Mon, Feb 16, 2015 at 2:13 AM, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>
>> I am using fetcher.threads.per.queue = 30 by the way.
>>
>> On Mon, Feb 16, 2015 at 12:08 AM, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>>
>>> Hi Mo,
>>>
>>> I have a question about the selenium plugin on Mac. I think I
>>> successfully set it up, but I am concerned about the performance.
>>> I am using a Mac with an Intel Core i5 processor and 8GB RAM, but I found
>>> that each url fetched takes about 1 second to open and close
>>> the Firefox window. Is that a normal speed, or is something wrong? And is
>>> it possible to install the selenium grid plugin on Mac? I will cry if you
>>> ask me to change machines now......
>>>
>>> Best,
>>> Jiaxin
>>>
>>> On Fri, Feb 13, 2015 at 2:09 PM, Mohammed Omer <beancinemat...@gmail.com> wrote:
>>>
>>>> No worries man, glad everything works! Glad, since I was having
>>>> hostname issues with nutch/hbase just now as I quickly tried to get it
>>>> working/fixed for ya, ha.
>>>>
>>>> Mo
>>>>
>>>> On Fri, Feb 13, 2015 at 2:57 PM, Shuo Li <sli...@usc.edu> wrote:
>>>>
>>>>> Hey guys,
>>>>>
>>>>> After changing my RAM to 2GB, everything works fine. My bad. Thanks for
>>>>> your help.
>>>>>
>>>>> Regards,
>>>>> Shuo Li
>>>>>
>>>>> On Fri, Feb 13, 2015 at 11:34 AM, Mattmann, Chris A (3980)
>>>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>>
>>>>>> Thank you Mo. I sincerely appreciate your guidance and contribution.
>>>>>>
>>>>>> I will work to get your nutch selenium grid plugin contributed
>>>>>> to work with Nutch 1.x.
>>>>>>
>>>>>> Cheers,
>>>>>> Chris
>>>>>>
>>>>>>
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Chris Mattmann, Ph.D.
>>>>>> Chief Architect
>>>>>> Instrument Software and Science Data Systems Section (398)
>>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>>> Office: 168-519, Mailstop: 168-527
>>>>>> Email: chris.a.mattm...@nasa.gov
>>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> Adjunct Associate Professor, Computer Science Department
>>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Mo Omer <beancinemat...@gmail.com>
>>>>>> Date: Friday, February 13, 2015 at 11:10 AM
>>>>>> To: Chris Mattmann <chris.a.mattm...@jpl.nasa.gov>
>>>>>> Cc: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>>>
>>>>>> >Hey all,
>>>>>> >
>>>>>> >When I had run nutch-selenium, it was in a config such that zombies
>>>>>> >were created from closing Firefox windows and they couldn't be reaped
>>>>>> >(again, due to the docker configuration I had).
>>>>>> >
>>>>>> >In a normal setup, it should not be an issue - if you're running 20
>>>>>> >threads in nutch that's potentially 20 open FF windows, which isn't
>>>>>> >good for 512 MB.
>>>>>> >
>>>>>> >Selenium grid is much more efficient, in that browsers are opened,
>>>>>> >but tabs are used to fetch sites - and only those are closed.
>>>>>> >
>>>>>> >Additionally, ensure you're using Nutch 2.2.1.
>>>>>> >
>>>>>> >Feel free to fork, patch, tinker, and PR as needed.
>>>>>> >
>>>>>> >Chris, if you want to be added to contribs on the GitHub project,
>>>>>> >that's cool with me! Wish I could dedicate more time to this, but I
>>>>>> >don't foresee using Nutch again in the near future, and am now working
>>>>>> >on projects that require lots of reading and possibly patches to Caffe
>>>>>> >and OpenCL R-CNN projects.
>>>>>> >
>>>>>> >Tl;dr:
>>>>>> >- no, this shouldn't be typical unless you're creating zombies like
>>>>>> >crazy and they're not being reaped (too many open file descriptors),
>>>>>> >running out of memory, or hitting a similar resource constraint.
>>>>>> >- selenium grid is TONs more efficient, but a bit more difficult to
>>>>>> >set up. I used it to crawl 100ks of sites.
>>>>>> >- unfortunately I can't commit more time to this, but if I can
>>>>>> >assist in any admin way, let me know.
>>>>>> >
>>>>>> >Thank you,
>>>>>> >
>>>>>> >Mo
>>>>>> >
>>>>>> >This message was drafted on a tiny touch screen; please forgive
>>>>>> >brevity & tpyos
>>>>>> >
>>>>>> >> On Feb 13, 2015, at 12:41 PM, "Mattmann, Chris A (3980)"
>>>>>> >> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>>> >>
>>>>>> >> Oh yes, please up your memory to at least 2 GB.
>>>>>> >>
>>>>>> >> -----Original Message-----
>>>>>> >> From: Shuo Li <sli...@usc.edu>
>>>>>> >> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >> Date: Friday, February 13, 2015 at 10:38 AM
>>>>>> >> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >> Cc: Mo Omer <beancinemat...@gmail.com>
>>>>>> >> Subject: Re: Vagrant Crushed When using Nutch-Selenium
>>>>>> >>
>>>>>> >>> Hey Mo and Prof Mattmann,
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> I will try to crawl the 3 websites in the homework tonight (NASA AMD,
>>>>>> >>> NSF ACADIS and NSIDC Arctic Data Explorer). I will let you know
>>>>>> >>> what's going on.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Is memory an issue? My vagrant only has 512MB of memory.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> Regards,
>>>>>> >>> Shuo Li
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Fri, Feb 13, 2015 at 10:25 AM, Mattmann, Chris A (3980)
>>>>>> >>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>>>>> >>>
>>>>>> >>> Hi Shuo,
>>>>>> >>>
>>>>>> >>> Thanks for your email. I wonder if using selenium grid would
>>>>>> >>> help?
>>>>>> >>>
>>>>>> >>> Please see this plugin:
>>>>>> >>>
>>>>>> >>> https://github.com/momer/nutch-selenium-grid-plugin
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> I’m CC’ing Mo, the author of the plugin, to see if he experienced
>>>>>> >>> this while running the original selenium plugin. Mo, did using
>>>>>> >>> selenium grid help with the issue that Shuo is experiencing below?
>>>>>> >>>
>>>>>> >>> Mo: are you cool with porting the grid plugin, or with Lewis or
>>>>>> >>> me porting it to trunk (with full credit to you, of course)?
>>>>>> >>>
>>>>>> >>> Cheers,
>>>>>> >>> Chris
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> -----Original Message-----
>>>>>> >>> From: Shuo Li <sli...@usc.edu>
>>>>>> >>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >>> Date: Friday, February 13, 2015 at 10:12 AM
>>>>>> >>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>>>>> >>> Subject: Vagrant Crushed When using Nutch-Selenium
>>>>>> >>>
>>>>>> >>>> Hey guys,
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> I'm trying to use Nutch-Selenium to crawl
>>>>>> >>>> nutch.apache.org <http://nutch.apache.org>.
>>>>>> >>>> However, my vagrant seems to have
>>>>>> >>>> crashed after a few minutes. I forced it to shut down and it turns
>>>>>> >>>> out it only crawled 59 websites. My nutch version is 1.10 and my OS
>>>>>> >>>> is Ubuntu Trusty, 14.04.
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> Is there anything I can provide to you guys? Has anybody had the
>>>>>> >>>> same issue? Or is 59 websites the complete crawl?
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> Any suggestion would be appreciated.
>>>>>> >>>>
>>>>>> >>>>
>>>>>> >>>> Regards,
>>>>>> >>>> Shuo Li
