Update:
The latest Selenium version (2.44.0) doesn’t seem to work with the latest
Firefox (version 35), so I installed Firefox 29 and it’s crawling properly now.
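
One way to catch this kind of mismatch before starting a crawl is to look at the Firefox major version first. The sketch below is only illustrative: the helper names firefox_major and check_firefox are made up, it assumes `firefox --version` prints something like "Mozilla Firefox 35.0.1", and the threshold of 29 is simply the version reported as working in this message, not an official compatibility limit.

```shell
#!/bin/sh
# Hypothetical helper: pull the major version out of `firefox --version`
# output, e.g. "Mozilla Firefox 35.0.1" -> 35. Prints nothing if unparsable.
firefox_major() {
    printf '%s\n' "$1" | sed -n 's/^Mozilla Firefox \([0-9][0-9]*\).*/\1/p'
}

# Assumption: 29 is the version reported working with Selenium 2.44.0 in this
# thread; newer versions are flagged as untested, not rejected.
KNOWN_GOOD_MAJOR=29

check_firefox() {
    major=$(firefox_major "$1")
    if [ -z "$major" ]; then
        echo "unknown"
    elif [ "$major" -le "$KNOWN_GOOD_MAJOR" ]; then
        echo "ok"
    else
        echo "untested"
    fi
}

# Usage (the path to the firefox binary depends on your installation):
#   check_firefox "$(firefox --version)"
```

If this prints "untested", downgrading Firefox as described above is the first thing to try.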
> On Feb 18, 2015, at 2:56 PM, Jaydeep Bagrecha <bagre...@usc.edu> wrote:
>
> Thanks, Jiaxin!
>
> I repeated the entire installation procedure again and I think I have
> installed it correctly (it said BUILD SUCCESSFUL after the ant runtime
> command, and the selenium jar files are in the runtime/local/lib folder).
>
> When I started crawling, the Mozilla browser popped up twice, but when I
> checked the crawl statistics, it had fetched no URLs. (Did anyone else have
> this problem?)
>
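
The jar check mentioned above can be scripted. This is a minimal sketch, assuming the jars follow a selenium*.jar naming pattern (an assumption, the message does not list the file names); count_selenium_jars is a made-up helper name.

```shell
#!/bin/sh
# Count Selenium jars directly inside a lib directory.
# Default directory is runtime/local/lib, the folder named in the message.
# The 'selenium*.jar' pattern is an assumption about how the jars are named.
count_selenium_jars() {
    lib_dir=${1:-runtime/local/lib}
    find "$lib_dir" -maxdepth 1 -type f -name 'selenium*.jar' 2>/dev/null |
        wc -l | tr -d ' '
}

# Usage, from the Nutch project directory:
#   count_selenium_jars            # prints the number of matching jars
```

A count of 0 after a successful build would suggest the Selenium dependency was never resolved.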
> I got the following error while crawling:
>
> org.openqa.selenium.firefox.NotConnectedException: Unable to connect to host
> 127.0.0.1 on port 7055 after 45000 ms. Firefox console output:
> h changes to installed add-ons
> 1424295898279 addons.xpi-utils DEBUG Updating add-on states
> 1424295898281 addons.xpi-utils DEBUG Writing add-ons list
> 1424295898291 addons.manager DEBUG Registering shutdown blocker for
> XPIProvider
> 1424295898292 addons.manager DEBUG Registering shutdown blocker for
> LightweightThemeManager
> 1424295898295 addons.manager DEBUG Registering shutdown blocker for
> OpenH264Provider
> 1424295898296 addons.manager DEBUG Registering shutdown blocker for
> PluginProvider
> 1424295898775 DeferredSave.extensions.json DEBUG Starting timer
> 1424295898800 DeferredSave.extensions.json DEBUG Starting write
> 1424295898858 addons.manager DEBUG shutdown
> 1424295898859 addons.manager DEBUG Calling shutdown blocker for XPIProvider
> 1424295898859 addons.xpi DEBUG shutdown
> 1424295898860 addons.xpi-utils DEBUG shutdown
> 1424295898861 addons.manager DEBUG Calling shutdown blocker for
> LightweightThemeManager
> 1424295898862 addons.manager DEBUG Calling shutdown blocker for
> OpenH264Provider
> 1424295898864 addons.manager DEBUG Calling shutdown blocker for
> PluginProvider
> 1424295899016 DeferredSave.extensions.json DEBUG Write succeeded
> 1424295899016 addons.xpi-utils DEBUG XPI Database saved, setting
> schema version preference to 16
> 1424295899017 addons.xpi DEBUG Notifying XPI shutdown observers
> 1424295899025 addons.manager DEBUG Async provider shutdown done
> 1424295900455 addons.manager DEBUG Loaded provider scope for
> resource://gre/modules/addons/XPIProvider.jsm: ["XPIProvider"]
> 1424295900459 addons.manager DEBUG Loaded provider scope for
> resource://gre/modules/LightweightThemeManager.jsm:
> ["LightweightThemeManager"]
> 1424295900468 addons.xpi DEBUG startup
> 1424295900470 addons.xpi INFO Mapping fxdri...@googlecode.com to
> /var/folders/np/stzpy0s56v719zgrt_gsgzf40000gn/T/anonymous3766188187771514178webdriver-profile/extensions/fxdri...@googlecode.com
> 1424295900471 addons.xpi DEBUG Ignoring file entry whose name is not a
> valid add-on ID:
> /var/folders/np/stzpy0s56v719zgrt_gsgzf40000gn/T/anonymous3766188187771514178webdriver-profile/extensions/webdriver-staging
> 1424295900472 addons.xpi INFO Mapping
> {972ce4c6-7e08-4474-a285-3208198ce6fd} to
> /Applications/Firefox.app/Contents/Resources/browser/extensions/{972ce4c6-7e08-4474-a285-3208198ce6fd}
> 1424295900473 addons.xpi DEBUG Skipping unavailable install location
> app-system-share
> 1424295900475 addons.xpi DEBUG checkForChanges
> 1424295900476 addons.xpi DEBUG Loaded add-on state from prefs:
> {"app-profile":{"fxdri...@googlecode.com":{"d":"/var/folders/np/stzpy0s56v719zgrt_gsgzf40000gn/T/anonymous3766188187771514178webdriver-profile/extensions/fxdri...@googlecode.com","e":false,"v":"2.42.2","st":1424295897000,"mt":1424295897000}},"app-global":{"{972ce4c6-7e08-4474-a285-3208198ce6fd}":{"d":"/Applications/Firefox.app/Contents/Resources/browser/extensions/{972ce4c6-7e08-4474-a285-3208198ce6fd}","e":true,"v":"35.0.1","st":1423704245000,"mt":1423704244000}}}
> 1424295900480 addons.xpi DEBUG getModTime: Recursive scan of
> {972ce4c6-7e08-4474-a285-3208198ce6fd}
> 1424295900483 addons.xpi DEBUG getInstallState changed: false, state:
> {"app-profile":{"fxdri...@googlecode.com":{"d":"/var/folders/np/stzpy0s56v719zgrt_gsgzf40000gn/T/anonymous3766188187771514178webdriver-profile/extensions/fxdri...@googlecode.com","e":false,"v":"2.42.2","st":1424295897000,"mt":1424295897000}},"app-global":{"{972ce4c6-7e08-4474-a285-3208198ce6fd}":{"d":"/Applications/Firefox.app/Contents/Resources/browser/extensions/{972ce4c6-7e08-4474-a285-3208198ce6fd}","e":true,"v":"35.0.1","st":1423704245000,"mt":1423704244000}}}
> 1424295900488 addons.xpi DEBUG No changes found
> 1424295900502 addons.manager DEBUG Registering shutdown blocker for
> XPIProvider
> 1424295900504 addons.manager DEBUG Registering shutdown blocker for
> LightweightThemeManager
> 1424295900507 addons.manager DEBUG Registering shutdown blocker for
> OpenH264Provider
> 1424295900508 addons.manager DEBUG Registering shutdown blocker for
> PluginProvider
> *** Blocklist::_preloadBlocklistFile: blocklist is disabled
> 1424295903113 addons.manager DEBUG Registering shutdown blocker for
> <unnamed-provider>
>
> at
> org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:118)
> at
> org.openqa.selenium.firefox.FirefoxDriver.startClient(FirefoxDriver.java:246)
> at
> org.openqa.selenium.remote.RemoteWebDriver.<init>(RemoteWebDriver.java:114)
> at
> org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:191)
> at
> org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:186)
> at
> org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:182)
> at
> org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:95)
> at
> org.apache.nutch.protocol.selenium.HttpWebClient.getHtmlPage(HttpWebClient.java:53)
> at
> org.apache.nutch.protocol.selenium.HttpResponse.readPlainContent(HttpResponse.java:199)
> at
> org.apache.nutch.protocol.selenium.HttpResponse.<init>(HttpResponse.java:161)
> at org.apache.nutch.protocol.selenium.Http.getResponse(Http.java:56)
> at
> org.apache.nutch.protocol.http.api.HttpRobotRulesParser.getRobotRulesSet(HttpRobotRulesParser.java:101)
> at
> org.apache.nutch.protocol.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:151)
> at
> org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:492)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:722)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0,
> fetchQueues.getQueueCount=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0,
> fetchQueues.getQueueCount=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0,
> fetchQueues.getQueueCount=1
>
>> On Feb 17, 2015, at 11:21 PM, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>>
>> Hi,
>>
>> When you applied the patch, did you see any failures? No failure can be
>> tolerated. I am guessing there is something wrong with ivy.xml. I suggest
>> that you check out ALL files in Nutch and then try again.
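
Since no failure can be tolerated, it may help to do a dry run of the patch before touching the tree. The following is a sketch rather than anything from the NUTCH-1933 instructions: apply_patch_checked is a made-up helper, and it assumes a patch program that supports --dry-run and -i (GNU patch does).

```shell
#!/bin/sh
# Apply a patch only if a dry run reports no failed hunks.
#   $1 = patch file
#   $2 = strip level (default 0, matching `patch -p0` used in this thread)
apply_patch_checked() {
    patch_file=$1
    strip=${2:-0}
    if patch -p"$strip" --dry-run -i "$patch_file" >/dev/null 2>&1; then
        # Dry run was clean, so apply for real.
        patch -p"$strip" -i "$patch_file"
    else
        echo "dry run failed; not applying $patch_file" >&2
        return 1
    fi
}

# Usage, from the Nutch project directory on a clean checkout:
#   apply_patch_checked THE_PATCH_NAME
```

If the dry run fails, the working tree is left untouched, so you can check out the files again and retry.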
>>
>> Best,
>> Jiaxin
>>
>> On Tuesday, February 17, 2015, Jaydeep Bagrecha <bagre...@usc.edu> wrote:
>> Hi all,
>> I am trying to install and build Selenium with Nutch 1.10 on Mac OS X
>> Yosemite.
>>
>> I get the following errors after downloading the Selenium patch
>> (https://issues.apache.org/jira/browse/NUTCH-1933) and running the “ant
>> runtime” command (as described by Jiaxin below). Any suggestions for
>> avoiding them?
>>
>> error: package org.openqa.selenium does not exist
>> [javac] import org.openqa.selenium.By;
>> [javac] ^
>> error: package org.openqa.selenium does not exist
>> [javac] import org.openqa.selenium.WebDriver;
>> [javac] ^
>> error: package org.openqa.selenium.firefox does not exist
>> [javac] import org.openqa.selenium.firefox.FirefoxDriver;
>> [javac] ^
>> error: package org.openqa.selenium.firefox does not exist
>> [javac] import org.openqa.selenium.firefox.FirefoxProfile;
>> error: cannot find symbol
>> [javac] public static ThreadLocal<WebDriver> threadWebDriver = new
>> ThreadLocal<WebDriver>() {
>> [javac] ^
>> [javac] symbol: class WebDriver
>> [javac] location: class HttpWebClient
>> error: cannot find symbol
>> [javac] protected WebDriver initialValue()
>> [javac] ^
>> [javac] symbol: class WebDriver
>> error: cannot find symbol
>> [javac] FirefoxProfile profile = new FirefoxProfile();
>> [javac] ^
>> [javac] symbol: class FirefoxProfile
>> error: cannot find symbol
>> [javac] WebDriver driver = new FirefoxDriver(profile);
>> [javac] ^
>> [javac] symbol: class FirefoxDriver
>> error: cannot find symbol
>> [javac] driver = new FirefoxDriver();
>> [javac] ^
>> [javac] symbol: class FirefoxDriver
>> [javac] location: class HttpWebClient
>>
>> error: cannot find symbol
>> [javac] new WebDriverWait(driver, 3);
>> [javac] ^
>> [javac] symbol: class WebDriverWait
>> [javac] location: class HttpWebClient
>>
>> error: cannot find symbol
>> [javac] String innerHtml =
>> driver.findElement(By.tagName("body")).getAttribute("innerHTML");
>> [javac] ^
>> [javac] symbol: variable By
>> [javac] location: class HttpWebClient
>>
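
Errors like "package org.openqa.selenium does not exist" mean the Selenium classes were not on the compile classpath. One quick thing to check is whether any ivy.xml in the tree declares a Selenium dependency at all. This is a sketch only: ivy_files_with_selenium is a made-up helper, and which ivy.xml the patch is supposed to modify is not stated in this thread.

```shell
#!/bin/sh
# List every ivy.xml under a directory (default: current directory) that
# mentions "selenium", case-insensitively. An empty result suggests the
# patch did not add the dependency anywhere.
ivy_files_with_selenium() {
    root=${1:-.}
    find "$root" -type f -name 'ivy.xml' -exec grep -l -i 'selenium' {} +
}

# Usage, from the Nutch project directory:
#   ivy_files_with_selenium .
```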
>> Thanks,
>> Jaydeep
>>
>>> On Feb 12, 2015, at 11:37 PM, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>>>
>>> Sure. I will do it once I confirm it works...
>>>
>>> On Thursday, February 12, 2015, Mattmann, Chris A (3980)
>>> <chris.a.mattm...@jpl.nasa.gov> wrote:
>>> This is great, Jiaxin, can you please make a wiki page on the Nutch
>>> wiki that has this information?
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattm...@nasa.gov
>>> WWW: http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>> -----Original Message-----
>>> From: Jiaxin Ye <jiaxi...@usc.edu>
>>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Date: Thursday, February 12, 2015 at 9:39 PM
>>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Subject: Nutch-Selenium in Nutch 1.10
>>>
>>> >Hi Li, Shuo. You are so right. I finished installing and successfully ran
>>> >Nutch with Selenium and Firefox. I have a question though: does your
>>> >Firefox window pop up for every URL we crawl?
>>> >
>>> >
>>> >Hi Prof. Mattmann. I think this is how to install Selenium on a Mac
>>> >running an OS X version higher than 10.6:
>>> >
>>> >
>>> >1. Download XQuartz (it's a .dmg file) and install it directly.
>>> >2. Download Nutch 1.10.
>>> >3. Download the patch and put it in the Nutch project directory.
>>> >4. Run patch -p0 < THE_PATCH_NAME
>>> >5. DO NOT update build.xml and ivy.xml as the Selenium tutorial on
>>> >GitHub tells you to. The patch already updates those .xml files for us,
>>> >and it also installs lib-selenium and protocol-selenium for us (correct me
>>> >if I am wrong).
>>> >6. Update the Tika dependency if needed.
>>> >7. Go to the Nutch project directory and run ant runtime.
>>> >8. Download Firefox.
>>> >9. Open a new terminal and type
>>> >   Xvfb -screen 0 1024x768x24 (I think you can set the screen smaller if
>>> >you want...)
>>> >   There may be some errors after entering the command (there were for me,
>>> >at least). If so, manually create a /tmp/.X11-unix folder with sudo, set
>>> >its mode to 1777, and rerun the command. Xvfb should then be working.
>>> >10. Go to nutch > runtime > local and run the crawling command
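
Step 9 above can be wrapped in a small script. This is a rough sketch under several assumptions: the display number :11 and the helper names run and start_xvfb are made up here, the geometry uses the usual WIDTHxHEIGHTxDEPTH form, and whether the crawler needs DISPLAY exported is not stated in this thread. Setting DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Prepare the X11 socket directory (mode 1777, as in step 9) and start Xvfb.
#   $1 = display (assumed default :11, not taken from the thread)
#   $2 = screen geometry, WIDTHxHEIGHTxDEPTH
start_xvfb() {
    display=${1:-:11}
    geometry=${2:-1024x768x24}
    run sudo mkdir -p /tmp/.X11-unix
    run sudo chmod 1777 /tmp/.X11-unix
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ Xvfb $display -screen 0 $geometry &"
    else
        Xvfb "$display" -screen 0 "$geometry" &
    fi
    # Printed as a hint only; run it yourself in the shell that starts the crawl.
    echo "export DISPLAY=$display"
}

# Usage:
#   DRY_RUN=1; start_xvfb        # preview the commands
#   start_xvfb :11 800x600x24    # smaller screen, as suggested in step 9
```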
>>> >
>>> >
>>> >Hope it helps. :)
>>> >
>>> >
>>> >Best,
>>> >Jiaxin
>>> >
>>> >On Thu, Feb 12, 2015 at 1:08 PM, Shuo Li <sli...@usc.edu> wrote:
>>> >
>>> >I think I have possibly finished installing.
>>> >
>>> >
>>> >What you need to do:
>>> >0. Run git status and check out (revert) whatever you have modified.
>>> >1. patch -p0 < YOUR_PATCH_FILE
>>> >2. ant clean jar
>>> >3. ant runtime
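
The steps above can be chained so that the build stops at the first failure. This is only a sketch: run and rebuild_with_patch are made-up helper names, the revert in step 0 is left as a manual action because it discards local changes, and `patch -i FILE` is used as an equivalent of `patch < FILE`. Set DRY_RUN=1 to print the commands without running them.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 0-3 in order; && stops the chain as soon as one step fails.
#   $1 = patch file
rebuild_with_patch() {
    patch_file=$1
    # Step 0: only shows what is modified. Reverting is left to you.
    run git status --short &&
    run patch -p0 -i "$patch_file" &&
    run ant clean jar &&
    run ant runtime
}

# Usage, from the Nutch project directory:
#   rebuild_with_patch YOUR_PATCH_FILE
```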
>>> >
>>> >
>>> >Will try crawling using selenium later on. Hope this helped. >_<
>>> >
>>> >
>>> >On Thu, Feb 12, 2015 at 9:20 AM, Mattmann, Chris A (3980)
>>> ><chris.a.mattm...@jpl.nasa.gov> wrote:
>>> >
>>> >Yes, I believe you need to install X11. Why don't you try it and report
>>> >back what you find? Thanks.
>>> >
>>> >Sent from my iPhone
>>> >
>>> >On Feb 12, 2015, at 8:28 AM, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>>> >
>>> >
>>> >
>>> >Hi professor, but can we use Selenium on Mac?
>>> >
>>> >On Thursday, February 12, 2015, Mattmann, Chris A (3980)
>>> ><chris.a.mattm...@jpl.nasa.gov> wrote:
>>> >
>>> >You need Selenium, Jiaxin, in order to crawl dynamic pages in the
>>> >polar dataset you have been assigned in my CSCI 572 search engines class.
>>> >
>>> >The instructions for integrating Selenium with Nutch 1.10-trunk
>>> >are here:
>>> >
>>> >https://issues.apache.org/jira/browse/NUTCH-1933
>>> >
>>> >
>>> >Cheers,
>>> >Chris
>>> >
>>> >
>>> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >Chris Mattmann, Ph.D.
>>> >Chief Architect
>>> >Instrument Software and Science Data Systems Section (398)
>>> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> >Office: 168-519, Mailstop: 168-527
>>> >Email: chris.a.mattm...@nasa.gov
>>> >WWW: http://sunset.usc.edu/~mattmann/
>>> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >Adjunct Associate Professor, Computer Science Department
>>> >University of Southern California, Los Angeles, CA 90089 USA
>>> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> >
>>> >-----Original Message-----
>>> >From: Jiaxin Ye <jiaxi...@usc.edu>
>>> >Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> >Date: Thursday, February 12, 2015 at 12:46 AM
>>> >To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> >Subject: Re: Nutch-Selenium in Nutch 1.10
>>> >
>>> >>Well, good choice. I am thinking of switching to Ubuntu now. The thing
>>> >>is, why do we need Selenium anyway? Just to make crawling easier?
>>> >>
>>> >>On Thu, Feb 12, 2015 at 12:25 AM, Shuo Li <sli...@usc.edu> wrote:
>>> >>
>>> >>Interestingly, I'm a Mac user, but I don't want to mess up my laptop, so
>>> >>I'm using Vagrant with Ubuntu Trusty. It doesn't have a GUI, but Xvfb can
>>> >>still be installed properly. The issue is that I don't know how to
>>> >>integrate Selenium with Nutch 1.10.
>>> >>
>>> >>On Thu, Feb 12, 2015 at 12:04 AM, Jiaxin Ye <jiaxi...@usc.edu> wrote:
>>> >>
>>> >>Hi all,
>>> >>
>>> >>
>>> >>Does anyone here know where to find the setup tutorial for Selenium on
>>> >>Mac? I find it difficult to install Xvfb on Mac.
>>> >>
>>> >>
>>> >>Best,
>>> >>Jiaxin
>>> >>
>>> >>
>>> >>On Tue, Feb 10, 2015 at 9:42 PM, Sapnashri Suresh <sapna...@usc.edu> wrote:
>>> >>
>>> >>Hi Shuo Li,
>>> >>
>>> >>
>>> >>We were facing a similar issue. Prof. Mattmann suggested we look into this
>>> >>patch for Selenium on Nutch 1.10:
>>> >>https://issues.apache.org/jira/browse/NUTCH-1933
>>> >>
>>> >>
>>> >>Hope this helps!
>>> >>
>>> >>
>>> >>Thanks,
>>> >>Sapna
>>> >>
>>> >>On Tue, Feb 10, 2015 at 9:36 PM, Shuo Li <sli...@usc.edu> wrote:
>>> >>
>>> >>Yop,
>>> >>
>>> >>
>>> >>I'm trying to install Selenium in Nutch 1.10. However, this error pops
>>> >>up:
>>> >>
>>> >>
>>> >>error: package org.apache.nutch.storage does not exist
>>> >>
>>> >>
>>> >>
>>> >>I can only find this package in Nutch 2.x. Is there a way to use Selenium
>>> >>in 1.10?
>>> >>
>>> >>
>>> >>Any advice would be appreciated.
>>> >>
>>> >>
>>> >>Regards,
>>> >>Shuo Li
>>> >>
>>> >>--
>>> >>Graduate Student
>>> >>MS in CS (Data Science)
>>> >>Viterbi School of Engineering
>>> >>University of Southern California
>>> >>
>>> >>
>>> >>Phone:
>>> >>+1 650-307-9848
>>>
>>
>