Author: lewismc
Date: Wed Apr 22 16:35:25 2015
New Revision: 1675408

URL: http://svn.apache.org/r1675408
Log:
NUTCH-1996 Make protocol-selenium README part of plugin

Added:
    nutch/trunk/src/plugin/protocol-selenium/README.md
Modified:
    nutch/trunk/CHANGES.txt

Modified: nutch/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1675408&r1=1675407&r2=1675408&view=diff
==============================================================================
--- nutch/trunk/CHANGES.txt (original)
+++ nutch/trunk/CHANGES.txt Wed Apr 22 16:35:25 2015
@@ -2,6 +2,8 @@ Nutch Change Log
  
 Nutch Current Development 1.10-SNAPSHOT
 
+* NUTCH-1996 Make protocol-selenium README part of plugin (lewismc)
+
 * NUTCH-1990 Use URI.normalise() in BasicURLNormalizer (snagel, jnioche)
 
 * NUTCH-1973 Job Administration end point for the REST service (Sujen Shah via mattmann)

Added: nutch/trunk/src/plugin/protocol-selenium/README.md
URL: http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-selenium/README.md?rev=1675408&view=auto
==============================================================================
--- nutch/trunk/src/plugin/protocol-selenium/README.md (added)
+++ nutch/trunk/src/plugin/protocol-selenium/README.md Wed Apr 22 16:35:25 2015
@@ -0,0 +1,64 @@
+Nutch Selenium
+==============
+
+This plugin allows you to fetch JavaScript pages using [Selenium](http://www.seleniumhq.org/), while relying on the rest of the awesome Nutch stack!
+
+The underlying code is based on the nutch-htmlunit plugin, which was in turn based on nutch-httpclient.
+
+# IMPORTANT NOTES:
+
+ * A version of this plugin which relies on the Selenium Hub/Node system can be found here: [nutch-selenium-grid-plugin](https://github.com/momer/nutch-selenium-grid-plugin)
+
+# Installation (tested on Ubuntu 14.0x)
+
+## Part 1: Setting up Selenium
+
+ * Ensure that you have Firefox installed
+```
+# More info about the package @ [launchpad](https://launchpad.net/ubuntu/trusty/+source/firefox)
+
+sudo apt-get install firefox
+```
+ * Install Xvfb and its associates
+```
+sudo apt-get install xorg synaptic xvfb gtk2-engines-pixbuf xfonts-cyrillic xfonts-100dpi \
+    xfonts-75dpi xfonts-base xfonts-scalable freeglut3-dev dbus-x11 openbox x11-xserver-utils \
+    libxrender1 cabextract
+```
+ * Set a display for Xvfb, so that Firefox believes a display is connected
+```
+sudo /usr/bin/Xvfb :11 -screen 0 1024x768x24 &
+export DISPLAY=:11
+```
+## Part 2: Installing plugin for Nutch (where NUTCH_HOME is the root of your Nutch install)
+
+ * Ensure that the plugin will be used as the protocol parser in your config
+```
+<!-- NUTCH_HOME/conf/nutch-site.xml -->
+
+<configuration>
+  ...
+  <property>
+    <name>plugin.includes</name>
+    <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
+    <description>Regular expression naming plugin directory names to
+    include.  Any plugin not matching this expression is excluded.
+    In any case you need at least include the nutch-extensionpoints plugin. By
+    default Nutch includes crawling just HTML and plain text via HTTP,
+    and basic indexing and search plugins. In order to use HTTPS please enable 
+    protocol-httpclient, but be aware of possible intermittent problems with the
+    underlying commons-httpclient library.
+    </description>
+  </property>
+```
+ * Compile nutch
+```
+ant runtime
+```
+
+ * Start your web crawl (Ensure that you followed the above steps and have started your Xvfb display as shown above)
+```
+NUTCH_HOME/runtime/local/bin/crawl [-i|--index] [-D \"key=value\"] <Seed Dir> <Crawl Dir> <Num Rounds>
+```
+
+
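As a quick sanity check on the `plugin.includes` step in the README above, the following is a minimal sketch of a shell helper that reports whether a given `nutch-site.xml` enables `protocol-selenium`. The function name `has_selenium_plugin` is hypothetical and is not part of Nutch or of this commit. It assumes the `<name>` and `<value>` elements sit on separate lines inside one `<property>` block, as in the README example.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): returns 0 if the plugin.includes
# property in the given nutch-site.xml mentions protocol-selenium.
has_selenium_plugin() {
  conf_file="$1"
  # Print only the lines from the plugin.includes <name> element up to the
  # closing </property>, then search that range for the plugin id.
  sed -n '/<name>plugin\.includes<\/name>/,/<\/property>/p' "$conf_file" \
    | grep -q 'protocol-selenium'
}
```

It could be called before starting a crawl, for example `has_selenium_plugin "$NUTCH_HOME/conf/nutch-site.xml" || echo "protocol-selenium is not enabled"`, where `NUTCH_HOME` is assumed to be set to the root of the Nutch install.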

