Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "QuickStartparseChecker" page has been changed by MaziyarBoustani:
https://wiki.apache.org/nutch/QuickStartparseChecker

Comment:
Quick Start Guide to scraping a website with Nutch

New page:
== Requirements ==

 * install Java
 * set JAVA_HOME
 * install Apache Ant ({{{brew install ant}}} on Mac OS X, {{{apt-get install ant}}} on Ubuntu/Linux)
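
Before starting, it may help to confirm that the requirements above are in place. The following is a minimal sketch, not part of Nutch: the helper name {{{missing_tools}}} is hypothetical, and checking for {{{svn}}} is an assumption based on the checkout step in the next section.

```shell
#!/bin/sh
# Hypothetical helper: print the name of each command in the argument
# list that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# java and ant come from the requirements list; svn is assumed because
# the steps use "svn co".
missing=$(missing_tools java ant svn)
if [ -n "$missing" ]; then
  echo "missing tools:"
  echo "$missing"
fi

# JAVA_HOME must be set before building.
if [ -z "${JAVA_HOME:-}" ]; then
  echo "JAVA_HOME is not set"
fi
```

If the script prints nothing, all three tools were found and JAVA_HOME is set.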

== Steps ==

 * create a new directory
 * cd into that directory
 * run {{{svn co https://svn.apache.org/repos/asf/nutch/trunk/}}}
 * cd into the trunk folder
 * run {{{ant runtime}}}
 * cd runtime/local/
 * edit conf/nutch-site.xml
 * add the property below inside the <configuration> element, replacing "Value_name" with the desired agent name
{{{
<property>
  <name>http.agent.name</name>
  <value>Value_name</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.

  </description>
</property>
}}}

 * run parsechecker, for example against the NASA JPL website:
{{{
./bin/nutch parsechecker -dumpText http://www.jpl.nasa.gov > jpl_out.txt
}}}
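
The checkout and build steps above can be sketched as a single script. This is only an illustration under stated assumptions: the {{{run}}} and {{{build_nutch}}} helpers and the directory name {{{nutch-quickstart}}} are hypothetical, and by default the script only prints the commands (set {{{DRY_RUN=0}}} to actually execute them). Editing conf/nutch-site.xml is still a manual step.

```shell
#!/bin/sh
# Hypothetical wrapper: echo each command, and execute it only when
# DRY_RUN is explicitly set to 0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Check out Nutch trunk into the given directory and build the local runtime.
build_nutch() {
  workdir=$1                      # hypothetical working directory name
  run mkdir -p "$workdir"
  run cd "$workdir"
  run svn co https://svn.apache.org/repos/asf/nutch/trunk/
  run cd trunk
  run ant runtime
  run cd runtime/local
}

build_nutch nutch-quickstart
```

Because {{{run}}} is a shell function rather than a subprocess, the {{{cd}}} calls take effect in the script itself when DRY_RUN=0, so the script ends in runtime/local, ready for the nutch-site.xml edit and the parsechecker command.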
