Update of /cvsroot/nutch/playground/src/web/pages/en
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv10313/src/web/pages/en

Added Files:
        press.xml help.xml about.xml tutorial.xml donate.xml org.xml 
        faq.xml credits.xml i18n.xml bot.xml policies.xml 
        developers.xml status.xml search.xml 
Log Message:
initial commit

--- NEW FILE: press.xml ---
<page>

<title>press</title>

<body>

<h3>Articles about Nutch</h3>

<h4>08/08/03, Business 2.0 - <a
href="http://www.business2.com/articles/mag/0,1640,51462,00.html?cnn=yes">
Watch Out, Google</a></h4>

<h4>08/13/03, DaveNet - <a
href="http://davenet.userland.com/2003/08/13/nutchAnOpenSourceSearchEngine">
Nutch, an open source search engine</a></h4>

<h4>08/13/03, Slashdot - <a
href="http://slashdot.org/articles/03/08/13/191225.shtml"> Nutch: An
Open Source Search Engine</a></h4>

<h4>08/14/03, CIOL, India - <a
href="http://www.ciol.com/content/news/2003/103081408.asp"> All search
engines are biased</a></h4>
<h4>08/14/03, The Inquirer, UK - <a href="http://www.theinquirer.net/?article=11034">
Developers work on open source web search engine</a></h4>

<h4>08/14/03, 01net., France - <a href="http://www.01net.com/article/215086.html">
Nutch: an open source search engine sponsored by... Overture</a></h4>

<h4>08/18/03, C|Net News.Com - <a
href="http://news.com.com/2100-1032_3-5064913.html"> Project searches
for open-source niche</a></h4>

<h4>09/11/03, SearchEngineWatch - <a
href="http://searchenginewatch.com/searchday/print.php/34711_3071971">An
Open Source Search Engine</a></h4>

<h4>09/24/03, TechNewsWorld - <a
href="http://www.technewsworld.com/perl/story/31653.html">
An Open-Source Search Engine Takes Shape</a></h4>

</body>
</page>

--- NEW FILE: help.xml ---
<page>

<title>search help</title>

<body>

<h3>Queries</h3>
To search with Nutch, just type in a few words.
<ul>
  <li>Results only include pages that contain <span
 style="font-style: italic;">all</span> of the query words.</li>
  <li>Use quotes around words that must occur adjacently, as a phrase,
e.g., <span style="font-weight: bold;">"New Zealand"</span>.</li>
  <li>Punctuation between words also triggers phrase matching.  So
searching for <span style="font-weight: bold;">http://www.nutch.org/</span>
is the same as searching for <span style="font-weight: bold;">"http www
nutch org"</span>.</li>
  <li>Searches are not case-sensitive, so searching for <span
 style="font-weight: bold;">NuTcH</span> is the same as searching for <span
 style="font-weight: bold;">nUtCh</span>.</li>
  <li>You can prohibit a term from resulting pages by putting a minus
before it, e.g., searching for <span style="font-weight: bold;">football
-nfl</span> will find pages that discuss football, but don't use the
word "nfl".</li>
  <li>That's it!</li>
</ul>
<h3>Results</h3>
Each matching page in the results has the following links:
<ul>
  <li>(<span style="color: rgb(51, 51, 255);">cached</span>) displays
the version of the page that Nutch downloaded.</li>
  <li>(<span style="color: rgb(51, 51, 255);">explain</span>) displays
an explanation of how this page scored.</li>
  <li>(<span style="color: rgb(51, 51, 255);">anchors</span>) shows the
list of incoming anchors indexed for this page.</li>
</ul>

</body>
</page>

--- NEW FILE: about.xml ---
<page>

<title>about</title>

<menu>
 <item><a href="org.html">Organization</a></item>
 <item><a href="credits.html">Credits</a></item>
 <item><a href="press.html">Press</a></item>
 <item><a href="status.html">Status</a></item>
</menu>

<body>

<p>Nutch is a nascent effort to implement an open-source web search
engine.</p>

<p>Web search is a basic requirement for internet navigation, yet the
number of web search engines is decreasing. Today's oligopoly could
soon be a monopoly, with a single company controlling nearly all web
search for its commercial gain.  That would not be good for users of
the internet.</p>

<p>Nutch provides a transparent alternative to commercial web search
engines.  Only open source search results can be fully trusted to be
without bias.  (Or at least their bias is public.)  All existing major
search engines have proprietary ranking formulas, and will not explain
why a given page ranks as it does.  Additionally, some search engines
determine which sites to index based on payments, rather than on the
merits of the sites themselves.  Nutch, on the other hand, has nothing
to hide and no motive to bias its results or its crawler in any way
other than to try to give each user the best results possible.</p>

<p>Nutch aims to enable anyone to easily and cost-effectively deploy a
world-class web search engine.  This is a substantial challenge.  To
succeed, Nutch software must be able to:</p>
<ul>
  <li>fetch several billion pages per month</li>
  <li>maintain an index of these pages</li>
  <li>search that index up to 1000 times per second</li>
  <li>provide very high quality search results</li>
  <li>operate at minimal cost</li>
</ul>

<p>This is a challenging proposition.  If you believe in the merits of
this project, please help out, either as a <a
href="developers.html">developer</a> or with a <a
href="donate.html">donation</a>.
</p>

</body>
</page>

--- NEW FILE: tutorial.xml ---
<page>

<title>tutorial</title>

<body>

<h3>Requirements</h3>
<ol>
  <li>Java 1.4.x, either from <a
 href="http://java.sun.com/j2se/downloads.html">Sun</a> or <a
 href="http://www-106.ibm.com/developerworks/java/jdk/">IBM</a> on
 Linux is preferred.  Set <code>NUTCH_JAVA_HOME</code> to the root
 of your JVM installation.
  </li>
  <li>Apache's <a href="http://jakarta.apache.org/tomcat/">Tomcat</a>
4.x.</li>
  <li>On Win32, <a href="http://www.cygwin.com/">cygwin</a>, for
shell support.  (If you plan to use CVS on Win32, be sure to select the cvs and
openssh packages when you install, in the "Devel" and "Net"
categories, respectively.)</li>
  <li>Up to a gigabyte of free disk space, a high-speed connection, and
an hour or so.
  </li>
</ol>
<h3>Getting Started</h3> <p>First, you need to get a copy of the Nutch
code.  You can download a release from <a
href="http://www.nutch.org/release/">http://www.nutch.org/release/</a>.
Unpack the release and connect to its top-level directory.  Or, check
out the latest source code from <a
href="http://sourceforge.net/cvs/?group_id=59548">CVS</a> and build it
with <a href="http://ant.apache.org/">Ant</a>.</p>

<p>Try the following command:</p>
<pre style="margin-left: 40px;">bin/nutch</pre>
This will display the documentation for the Nutch command script.

<h3>Concepts</h3>
Nutch data is of two types:
<ol>
  <li>The web database.  This contains information about every
page known to Nutch, and about links between those pages.</li>
  <li>A set of segments.  Each segment is a set of pages that are
fetched and indexed as a unit.  Segment data consists of the
following types:</li>
  <ul>
    <li>a <i>fetchlist</i> is a file
that names a set of pages to be fetched</li>
    <li>the <i>fetcher output</i> is a
set of files containing the fetched pages</li>
    <li>the <i>index</i> is a
Lucene-format index of the fetcher output.</li>
  </ul>
</ol>
In the following examples we will keep our web database in a directory
named <tt>db</tt> and our segments
in a directory named <tt>segments</tt>:
<pre style="margin-left: 40px;">mkdir db
mkdir segments</pre>

<h3>Bootstrapping the Web Database</h3>
The admin tool is used to create a new, empty database:
<pre style="margin-left: 40px;">bin/nutch admin db -create</pre>
The <i>injector</i> adds URLs into
the database.  Let's inject URLs from the <a
 href="http://dmoz.org/">DMOZ</a> Open Directory. First we must download
and uncompress the file listing all of the DMOZ pages.  (This is a
200+MB file, so downloading it will take a few minutes.)
<pre style="margin-left: 40px;">wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz</pre>
Next we inject a random subset of these pages into the web database.
 (We use a random subset so that everyone who runs this tutorial
doesn't hammer the same sites.)  DMOZ contains around three million
URLs.  We inject one out of every 3000, so that we end up with
around 1000 URLs:
<pre style="margin-left: 40px;">bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000</pre>
This also takes a few minutes, as it must parse the full file.

<p>Now we have a web database with around 1000 as-yet unfetched URLs in it.</p>

<h3>Fetching</h3>
To fetch, we first generate a fetchlist from the database:
<pre style="margin-left: 40px;">bin/nutch generate db segments
</pre>
This generates a fetchlist for all of the pages due to be fetched.
 The fetchlist is placed in a newly created segment directory.
 The segment directory is named by the time it's created.  We
save the name of this segment in the shell variable <tt>s1</tt>:
<pre style="margin-left: 40px;">s1=`ls -d segments/2* | tail -1`
echo $s1
</pre>
Now we run the fetcher on this segment with:
<pre style="margin-left: 40px;">bin/nutch fetch $s1</pre>
When this is complete, we update the database with the results of the
fetch:
<pre style="margin-left: 40px;">bin/nutch updatedb db $s1</pre>
Now the database has entries for all of the pages referenced by the
initial set.

<p>Next we run five iterations of link analysis on the database in order
to prioritize which pages to next fetch:</p>
<pre style="margin-left: 40px;">bin/nutch analyze db 5
</pre>
Now we fetch a new segment with the top-scoring 1000 pages:
<pre style="margin-left: 40px;">bin/nutch generate db segments -topN 1000
s2=`ls -d segments/2* | tail -1`
echo $s2

bin/nutch fetch $s2
bin/nutch updatedb db $s2
bin/nutch analyze db 2
</pre>
Let's fetch one more round:
<pre style="margin-left: 40px;">
bin/nutch generate db segments -topN 1000
s3=`ls -d segments/2* | tail -1`
echo $s3

bin/nutch fetch $s3
bin/nutch updatedb db $s3
bin/nutch analyze db 2</pre>
By this point we've fetched a few thousand pages.  Let's index
them!

<h3>Indexing</h3>
To index each segment we use the <tt>index</tt>
command, as follows:
<pre style="margin-left: 40px;">bin/nutch index $s1
bin/nutch index $s2
bin/nutch index $s3</pre>
Then, before we can search a set of segments, we need to delete
duplicate pages.  This is done with:
<pre style="margin-left: 40px;">bin/nutch dedup segments dedup.tmp</pre>
Now we're ready to search!

<h3>Searching</h3>

<p>To search you need to put the Nutch WAR file into your servlet
container.  (If, instead of downloading a Nutch release, you checked the
sources out of CVS, then you'll first need to build the WAR file with
the command <tt>ant war</tt>.)</p>

Assuming you've unpacked Tomcat as ~/local/tomcat, the Nutch WAR
file may be installed with the commands:
<pre style="margin-left: 40px;">rm -rf ~/local/tomcat/webapps/ROOT*
cp nutch*.war ~/local/tomcat/webapps/ROOT.war
</pre>
The webapp finds its indexes in <tt>./segments</tt>,
relative to the directory where you start Tomcat, so don't change
directories; just give the command:
<pre style="margin-left: 40px;">~/local/tomcat/bin/catalina.sh start
</pre>
Then visit <a href="http://localhost:8080/">http://localhost:8080/</a>
and have fun!

</body>
</page>

--- NEW FILE: donate.xml ---
<page>

<title>donate</title>

<body>

<p>Monetary donations to Nutch are gratefully accepted.</p>

<p>Our current goal is to create a good-sized public demo that can
handle moderate traffic.  Even this takes a fair amount of hardware
and bandwidth.  Fortunately, the Internet Archive has donated
bandwidth, so all that we need now is hardware.  We estimate that a
two-hundred-million page demo system that can handle moderate traffic
will require less than $200,000 in hardware.</p>

<p>If you believe in the cause and think a more compelling Nutch demo
would help it, please donate.</p>

<h3>U.S. Tax Exempt Status</h3>

<p>The Nutch <a href="org.html">Organization</a> is applying for
federal tax-exempt status as a 501(c)(3) corporation.  Once that
application is accepted, donations by United States residents will be
tax deductible.</p>

<h3>By Check</h3>

<p>Donations by check may be sent to:</p>

<p style="margin-left: 40px;">
The Nutch Organization<br/>
PO Box 5633<br/>
Petaluma, CA 94955-5633<br/>
USA
</p>

<h3>By PayPal</h3>

<p>If you would like to make an easy cash donation you can send it to
our PayPal account:</p>

<center>
<form action="https://www.paypal.com/cgi-bin/webscr" method="post">
<input type="hidden" name="cmd" value="_xclick"/>
<input type="hidden" name="business"
       value="[EMAIL PROTECTED]"/>
<input type="hidden" name="item_name" value="donation"/>
<input type="hidden" name="no_note" value="1"/>
<input type="hidden" name="currency_code" value="USD"/>
<input type="hidden" name="tax" value="0"/>
<input type="submit" value="Donate via PayPal" name="submit"/>
</form>
</center>

<h3>Equipment and Services</h3>

<p>Donations of equipment and services are also very much appreciated.</p>

<p>In particular, we need lots of rack-mountable CPUs with 4GB (or
more) of RAM each.  We also need lots of 200GB (or larger) hard
drives.  Note that, in order to minimize our operational costs, we
need to keep our hardware base as uniform as possible, so we may be
picky about what hardware donations we'll accept.  Unless the donation
is a large number of identical devices, we'd rather have cash.</p>

<p>Currently our hosting is provided by the Internet Archive, so we do
not at present need donations of bandwidth or machine room space.</p>

<p>Please send inquiries to <a href="mailto:[EMAIL PROTECTED]">[EMAIL PROTECTED]</a>.</p>

</body>
</page>

--- NEW FILE: org.xml ---
<page>

<title>organization</title>

<body>

<p>The Nutch Organization is a California-based public-benefit
non-profit corporation.</p>

<h3>Purpose</h3>

<p>The specific purposes for which this corporation is organized are
scientific and educational in nature: namely, to promote public access
to search technology without commercial bias by:

<ul>

<li>Providing free high-quality search software and its source code to
the public; and</li>

<li>Facilitating ongoing research and development of search technology
in a public forum.</li> </ul>
</p>

<h3>Bylaws</h3>

TBD

<h3>Board of Directors</h3>
<ul>
  <li>Mitch Kapor</li>
  <li>Tim O'Reilly</li>
  <li>Peter Savich (<a href="http://research.overture.com/">Overture Research</a>)</li>
  <li>Raymie Stata (UCSC)</li>
  <li>Graham Spencer (<a href="http://www.digitalconsumer.org/">Digital Consumer</a>)</li>
  <li>Doug Cutting</li>
</ul>

<h3>Officers</h3>
<ul>
  <li>Doug Cutting (President)</li>
  <li>Anne Cottrell (Secretary/Treasurer)</li>
</ul>


</body>
</page>

--- NEW FILE: faq.xml ---
<page>

<title>faq</title>

<body>

<h3>Why does the world need Nutch, when search engines are free?</h3>

<p>Search engines are free to use like television is free to watch,
but, like television programming, search results are subject to
manipulation by the interests that control them.  The only way one can
be certain that search results are unbiased is if the technology which
computes them is public.  Nutch seeks to make high-quality search
technology freely available.</p>
    
<h3>How can I help?</h3>

<p>If you're interested in donating funds, please visit our <a
href="donate.html">donations</a> page.</p>

<p>If you're a developer, please visit our <a
href="developers.html">developer</a> page.</p>

<p>If you have other suggestions, questions or comments, please send a
message to <a
href="mailto:[EMAIL PROTECTED]">[EMAIL PROTECTED]</a>.</p>

<h3>How can a non-profit afford to run a search engine?</h3>

<p>Nutch is primarily a software project, not a service.  Large-scale
deployments of Nutch will probably be run by commercial interests
separate from Nutch, funded by advertising or some such.  If the Nutch
software is good enough, perhaps existing major search engines will
use it in place of their current closed source code.</p>

<p>The Nutch project itself may choose to host a small-scale demo
system, so that folks can see that it really works.  This will require
only moderate funding, perhaps a few hundred thousand dollars.  The
Nutch project may never host a full-scale deployment for folks to use
as their everyday search engine.  We'll leave that to commercial
ventures who can afford it.</p>

<h3>Will Nutch ever be as good as other search engines?</h3>

<p>We hope it will be better.  With developers and researchers from
around the world helping out, we hope to be able to surpass the
quality of what any single company can do.</p>

<h3>How can I stop Nutch from crawling my site?</h3>

<p>Please visit our <a href="bot.html">webmaster info page</a>.</p>

<h3>How can I make sure that Nutch crawls my site?</h3>

<p>Nutch uses the <a href="http://www.dmoz.org/">DMOZ Open
Directory</a> to bootstrap its crawling.  So the best way to get your
site crawled by Nutch is to make sure that it is listed in the Open
Directory.</p>

<h3>Will Nutch be a distributed, P2P-based search engine?</h3>

<p>We don't think it is presently possible to build a peer-to-peer
search engine that is competitive with existing search engines.  It
would just be too slow.  Returning results in less than a second is
important: it lets people rapidly reformulate their queries so that
they can more often find what they're looking for.  In short, a fast
search engine is a better search engine.  We don't think many people
would want to use a search engine that takes ten or more seconds to
return results.</p>

<p>That said, if someone wishes to start a sub-project of Nutch
exploring distributed searching, we'd love to host it.  We don't think
these techniques are likely to solve the hard problems Nutch needs to
solve, but we'd be happy to be proven wrong.</p>

<h3>Will Nutch use a distributed crawler, like <a
href="http://www.grub.org/">Grub</a>?</h3>

<p>Distributed crawling can save download bandwidth, but, in the long
run, the savings is not significant.  A successful search engine
requires more bandwidth to upload query result pages than its crawler
needs to download pages, so making the crawler use less bandwidth does
not reduce overall bandwidth requirements.  The dominant expense of
operating a search engine is not crawling, but searching.</p>

<h3>Won't open source just make it easier for sites to manipulate
rankings?</h3>

<p>Search engines work hard to construct ranking algorithms that are
immune to manipulation.  Search engine optimizers still manage to
reverse-engineer the ranking algorithms used by search engines, and
improve the ranking of their pages.  For example, many sites use link
farms to manipulate search engines' link-based ranking algorithms, and
search engines retaliate by improving their link-based algorithms to
neutralize the effect of link farms.</p>

<p>With an open-source search engine, this will still happen, just out
in the open.  This is analogous to encryption and virus protection
software.  In the long term, making such algorithms open source makes
them stronger, as more people can examine the source code to find
flaws and suggest improvements.  Thus we believe that an open source
search engine has the potential to better resist manipulation of its
rankings.</p>

<h3>When will Nutch search images, PDF files, etc.?</h3>

<p>Soon, we hope.</p>

</body>
</page>

--- NEW FILE: credits.xml ---
<page>

<title>credits</title>

<body>

<h3>Developers</h3>
<ul>
  <li>Mike Cafarella (Search Quality)</li>
  <li>Doug Cutting</li>
  <li>Ben Lutch (Operations)</li>
  <li>Tom Pierce</li>
</ul>

<h3>Friends</h3>
<ul>
  <li>Dan Fain (<a href="http://research.overture.com/">Overture Research</a>)</li>
  <li>Brewster Kahle (<a href="http://www.archive.org/">Internet Archive</a>)</li>
  <li>Michele Kimpton (<a href="http://www.archive.org/">Internet Archive</a>)</li>
  <li>Joe Kraus (<a href="http://www.digitalconsumer.org/">Digital Consumer</a>)</li>
  <li>Brett Bullington</li>
  <li>Neil &amp; Danny Rimer (<a href="http://www.indexventures.com/">Index Ventures</a>)</li>
  <li>R.J. Pittman (<a href="http://www.groxis.com/">Groxis</a>)</li>
</ul>

<h3>Sponsors</h3>
<ul>

  <li><a href="http://research.overture.com/">Overture Research</a> has
  donated hardware and helped to fund development.</li>

  <li><a href="http://www.archive.org/">The Internet Archive</a>
  hosts Nutch.</li>
</ul>


</body>
</page>

--- NEW FILE: i18n.xml ---
<page>

<title>i18n</title>

<body>

<p>The Nutch website, including the search pages, is easy to
internationalize.</p>

<p>For each language, there are three kinds of things which must be
translated:</p>

<ol>

<li><b>page header</b>: This is a list of anchors included at the top of
every page.</li>

<li><b>static page content</b>: This forms the bulk of the Nutch
website, and also serves as downloadable documentation, like this page.</li>

<li><b>dynamic page text</b>: This is used when constructing search
result pages.</li>

</ol>

<p>Each of the above is described in more detail below.</p>

<h3>Getting Started</h3>

<p>Not all of the Nutch documentation needs to be
translated.  The most important things to translate are:</p>

<ol>
<li>the page header</li>
<li>the "about" page (<tt>src/web/pages/<i>lang</i>/about.xml</tt>)</li>
<li>the "search" page (<tt>src/web/pages/<i>lang</i>/search.xml</tt>)</li>
<li>the "help" page (<tt>src/web/pages/<i>lang</i>/help.xml</tt>)</li>
<li>text for search results 
(<tt>src/web/locale/org/nutch/jsp/search_<i>lang</i>.properties</tt>)</li>
</ol>

<p>If you'd like to provide a translation, simply post translations of
these five files to <a
href="mailto:[EMAIL PROTECTED]">[EMAIL PROTECTED]</a>
as an attachment.  If you're able to translate more, we'd love that
too!  For pages that you don't translate, provide links to the English
version.  Look at the other translations for examples of this.</p>

<h3>Page Header</h3>

<p>The Nutch page header is included at the top of every page.</p>

<p>The header is filed as
<tt>src/web/include/<i>language</i>/header.xml</tt> where
<i>language</i> is the <a
href="http://ftp.ics.uci.edu/pub/ietf/http/related/iso639.txt">ISO 639</a>
language code.</p>

<p>The format of the header file is:</p>

<pre>
  &lt;header-menu&gt;
    &lt;item&gt; ... &lt;/item&gt;
    &lt;item&gt; ... &lt;/item&gt;
  &lt;/header-menu&gt;
</pre>

<p>Each item typically includes an HTML anchor, one for each of the
top-level pages in the translation.</p>

<p>For example, the header file for an English translation is filed
as <a
href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/nutch/nutch/src/web/include/en/header.xml?rev=HEAD"><tt>src/web/include/en/header.xml</tt></a>.</p>


<h3>Static Page Content</h3>

<p>Static pages compose most of the Nutch website, and are also used
for project documentation.  These are HTML generated from XML files by
XSLT.  This process is used to include a standard header and footer,
and optionally a menu of sub-pages.</p>

<p>Static page content is filed as
<tt>src/web/pages/<i>language</i>/<i>page</i>.xml</tt> where
<i>language</i> is the ISO 639 language code, as above, and <i>page</i>
determines the name of the page generated:
<tt>docs/<i>page</i>.html</tt>.</p>

<p>The format of a static page xml file is:</p>

<pre>
  &lt;page&gt;
    &lt;title&gt; ... &lt;/title&gt;
    &lt;menu&gt;
      &lt;item&gt; ... &lt;/item&gt;
      &lt;item&gt; ... &lt;/item&gt;
    &lt;/menu&gt;
    &lt;body&gt; ... &lt;/body&gt;
  &lt;/page&gt;
</pre>

The <tt>&lt;menu&gt;</tt> item is optional.

<p>Note that if you use an encoding other than UTF-8 (the default for
XML data) then you need to declare that.  Also, if you use HTML
entities in your data, you'll need to declare these too.  Look at
existing translations for examples of this.</p>
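<p>For example, a page authored in ISO-8859-1 (the encoding name here is
just an illustration; use whatever encoding your editor actually
produces) would begin with an explicit XML declaration:</p>

<pre>
  &lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
</pre>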

<p>For example, the English language "about" page is filed
as <a
href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/nutch/nutch/src/web/pages/en/about.xml?rev=HEAD"><tt>src/web/pages/en/about.xml</tt></a>.</p>

<h3>Dynamic Page Content</h3>

<p>Java Server Pages (JSP) is used to generate Nutch search results, and
a few other dynamic pages (cached content, score explanations, etc.).</p>

<p>These use Java's <a
href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/Locale.html">Locale</a>
mechanism for internationalization.  For each page/language pair,
there is a Java property file containing the translated text of that
page.</p>
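<p>The standard way to consume such per-locale property files is Java's
<tt>ResourceBundle</tt> class.  The following self-contained sketch (the
file name pattern matches the convention above, but the contents and the
temp-directory location are invented for illustration) shows how a French
locale selects a <tt>search_fr.properties</tt> file:</p>

```java
import java.io.File;
import java.io.FileWriter;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Locale;
import java.util.ResourceBundle;

public class LocaleLookupDemo {
  public static void main(String[] args) throws Exception {
    // Write a throwaway property file to a temp directory (the file name
    // matters; its contents here are invented, not Nutch's real French text).
    File dir = new File(System.getProperty("java.io.tmpdir"), "nutch-i18n-demo");
    dir.mkdirs();
    FileWriter w = new FileWriter(new File(dir, "search_fr.properties"));
    w.write("search = Rechercher\n");
    w.close();

    // ResourceBundle appends the locale's language code to the base name,
    // so asking for "search" under a French locale finds search_fr.properties.
    URLClassLoader loader = new URLClassLoader(new URL[] { dir.toURI().toURL() });
    ResourceBundle bundle = ResourceBundle.getBundle("search", Locale.FRENCH, loader);
    System.out.println(bundle.getString("search"));  // prints: Rechercher
  }
}
```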

<p>These property files are filed as
<tt>src/web/locale/org/nutch/jsp/<i>page</i>_<i>language</i>.properties</tt>
where <i>page</i> is the name of the JSP page in <a
href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/nutch/nutch/src/web/jsp/"><tt>src/web/jsp/</tt></a>
and <i>language</i> is the ISO 639 language code, as above.</p>

<p>For example, text for the English language search results page is filed
as <a
href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/nutch/nutch/src/web/locale/org/nutch/jsp/search_en.properties?rev=HEAD"><tt>src/web/locale/org/nutch/jsp/search_en.properties</tt></a>.
 This contains something like:</p>

<pre>
  title = search results
  search = Search
  hits = Hits &lt;b&gt;{0}-{1}&lt;/b&gt; (out of {2} total matching documents):
  cached = cached
  explain = explain
  anchors = anchors
  next = Next
</pre>

<p>Each entry corresponds to a text fragment on the search results
page. The "hits" entry uses Java's <a
href="http://java.sun.com/j2se/1.4.2/docs/api/java/text/MessageFormat.html">MessageFormat</a>.</p>
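<p>To see how the "hits" entry is expanded, here is a minimal,
self-contained sketch (the argument values 1, 10 and 250 are invented;
Nutch's JSP performs the equivalent substitution at page-render time):</p>

```java
import java.text.MessageFormat;

public class HitsFormatDemo {
  public static void main(String[] args) {
    // The pattern is the "hits" entry from search_en.properties;
    // {0}, {1} and {2} are filled in by MessageFormat.
    String pattern =
        "Hits <b>{0}-{1}</b> (out of {2} total matching documents):";
    String line = MessageFormat.format(pattern, 1, 10, 250);
    System.out.println(line);
    // prints: Hits <b>1-10</b> (out of 250 total matching documents):
  }
}
```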

<p>Note that property files must use the ISO 8859-1 encoding with
Unicode escapes.  If you author them in a different encoding, please
use Java's <tt>native2ascii</tt> tool to convert them to this
encoding.</p>
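<p>The reason for this requirement can be seen with Java's own
<tt>Properties</tt> class, which, like <tt>native2ascii</tt>, writes
non-ASCII characters as <tt>\uXXXX</tt> escapes when saving (a small
illustration; the Greek value is invented, not a real Nutch
translation):</p>

```java
import java.io.ByteArrayOutputStream;
import java.util.Properties;

public class EscapeDemo {
  public static void main(String[] args) throws Exception {
    Properties p = new Properties();
    // Greek for "Search", written with source-level escapes for portability.
    p.setProperty("search", "\u0391\u03bd\u03b1\u03b6\u03ae\u03c4\u03b7\u03c3\u03b7");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    p.store(out, null);  // store() escapes characters outside printable ASCII
    System.out.print(out.toString("ISO-8859-1"));
    // The Greek letters appear in the output as \u0391\u03bd... escapes,
    // which is the form a translated property file must be in.
  }
}
```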

<h3>Generating Static Pages</h3>

<p>To generate the static pages you must have <a
 href="http://java.sun.com/j2se/downloads.html">Java</a>, <a
 href="http://ant.apache.org/">Ant</a> and Nutch installed.  To
 install Nutch, either download and unpack the latest <a
 href="http://www.nutch.org/release/nightly/">release</a>, or check it
 out from <a
 href="http://sourceforge.net/cvs/?group_id=59548">CVS</a>.</p>

<p>Then give the command:</p>

<pre>
  ant generate-docs
</pre>

<i>This documentation needs more detail.  Could someone
please submit a list of the actual steps required here?</i>


<p>Once this is working, try adding directories and files to make your
own translation of the header and a few of the static pages.</p>

<h3>Testing Dynamic Pages</h3>

<p>To test the dynamic pages you must also have <a
href="http://jakarta.apache.org/tomcat/">Tomcat</a> installed.</p>

<p>An index is also required.  You can either download a <a
href="http://sourceforge.net/project/showfiles.php?group_id=59548">sample
index</a>, or collect your own by working through the <a
href="tutorial.html">tutorial</a>.  Once you have an index, follow the
steps outlined at the end of the tutorial for searching.</p>

<i>This documentation needs more detail.  Could someone
please submit a list of the actual steps required here?</i>

</body>
</page>

--- NEW FILE: bot.xml ---
<page>

<title>robot</title>

<menu>
<item><a href="#sysadmin">Sysadmins</a></item>
<item><a href="#webmaster">Webmasters</a></item>
<item><a href="mailto:[EMAIL PROTECTED]">Contact us</a></item>
</menu>

<body>

<p> If you're reading this, chances are you've seen our robot visiting
your site while looking through your server logs.  When we crawl to
populate our index, we advertise the "User-agent" string "NutchOrg".
If you see the agent "Nutch" or "NutchCVS", that's probably a
developer testing a new version of our robot, or someone running their
own instance. </p>
<p> We are open-source developers, trying to build something useful for
the world to use.  It comes naturally to us to want to be good
netizens.  If you notice our bot misbehaving, please drop us a line at <a
 href="mailto:[EMAIL PROTECTED]">
[EMAIL PROTECTED]</a> and we will investigate the
problem. </p>
<p> Our bot does retrieve and parse robots.txt files, and it looks for 
robots META tags in HTML.  These are the standard mechanisms for 
webmasters to tell web robots which portions of a site a robot is 
welcome to access. </p>

<h3><a name="sysadmin">Sysadmins/robots.txt</a></h3>
<p>We're an open source project, so please
understand that a misbehaving bot appearing with our Agent string may
not have been run by us.  Our code is out there for anyone to tinker
with.  However, whether or not we ran the bot, we'd appreciate hearing
about any bad behavior; please let us know about it!  If possible,
please include the name of the domain and some representative log
entries.  We can be reached at <a
 href="mailto:[EMAIL PROTECTED]">
[EMAIL PROTECTED]</a> </p>
<p> Our bot follows the robots.txt exclusion standard, which is
described at <a
 href="http://www.robotstxt.org/wc/exclusion.html#robotstxt">
http://www.robotstxt.org/wc/exclusion.html#robotstxt</a>.  Depending on
the configuration, our robot may obey different rules.  To make it
simple to send our bot away, we'll always obey rules for "Nutch".
Here are the different cases. </p>
<ul>
  <li> When we're running to populate our index, we'll advertise the
agent "NutchOrg", and obey rules for "NutchOrg" if they exist, or "Nutch",
or "*". </li>
  <li> When anyone is running an unmodified CVS version of our bot
(including when we're running our bot to test it) it will advertise
"NutchCVS", and obey rules for "NutchCVS" if they exist, or "Nutch",
or "*". </li>
  <li> Release versions of our bot will advertise "Nutch", and obey
rules for "Nutch" or "*". </li>
</ul>
<p> To ban all bots from your site, place the following in your 
robots.txt file: </p>
<blockquote>
  <pre>User-agent: *<br/>Disallow: /<br/> </pre>
</blockquote>
<p> To ban Nutch bots from your site <b>unless</b> they're building the 
Nutch.Org demo index, place the following in your robots.txt  file: </p>
<blockquote>
  <pre>User-agent: Nutch<br/>Disallow: /<br/><br/>User-agent: NutchOrg<br/>Disallow:<br/> </pre>
</blockquote>
<p> To ban all Nutch bots from your site: </p>
<blockquote>
  <pre>User-agent: Nutch<br/>Disallow: /<br/> </pre>
</blockquote>

<h3><a name="webmaster">Webmasters/Robots META</a></h3>
<p>If you do not have permission to edit the
/robots.txt file on your server, you can still tell robots not to
index your pages or follow your links.  The standard mechanism for
this is the robots META tag, as described at <a
 href="http://www.robotstxt.org/wc/meta-user.html">
http://www.robotstxt.org/wc/meta-user.html</a>. </p>
<p> To tell Nutch, and other robots, not to index your page or follow
your links, insert this META tag into the HEAD section of your HTML
document: </p>
<blockquote>
  <pre>&lt;meta name="robots" content="noindex,nofollow"&gt;<br/> </pre>
</blockquote>
<p> Of course, you can control the "index" and "follow" directives
independently.  The keywords "all" or "none" are also allowed,
meaning "index,follow" or "noindex,nofollow", respectively.  Some
examples are: </p>
<blockquote>
  <pre>&lt;meta name="robots" content="all"&gt;<br/>&lt;meta
  name="robots" content="index,follow"&gt;<br/>&lt;meta name="robots"
  content="index,nofollow"&gt;<br/>&lt;meta name="robots"
  content="noindex,follow"&gt;<br/>&lt;meta name="robots"
  content="none"&gt;<br/>  </pre>
</blockquote>
<p> If there are no robots META tags, or if an action is not
specifically prohibited (i.e., neither "nofollow" nor "none" appears),
Nutch will assume it is allowed to index or follow links. </p>

</body>
</page>

--- NEW FILE: policies.xml ---
<page>

<title>developer policies</title>

<body>

<h3>Definitions</h3>
<ul>
  <li>code - all Nutch software and documentation.<br/>
  </li>
  <li>developer - a member of the small group who may directly make
changes to the code.</li>
  <li>contributor - someone who contributes code to Nutch indirectly
via a developer.</li>
  <li>license - the license in <a href="../LICENSE.txt">LICENSE.txt</a>.</li>
  <li>organization - the copyright owner of Nutch code.<br/>
  </li>
</ul>
<h3>Decision Process</h3>
All Nutch decisions are made by a simple majority vote of all
developers.  Votes are made by email on the <a
 href="mailto:[EMAIL PROTECTED]">[EMAIL PROTECTED]</a>
mailing list.<br/>
<br/>
In general, Nutch operates with just a handful of developers who
trust one another.  Thus most changes by developers may be made
unilaterally, without explicit authorization.  <br/>
<br/>
In particular:<br/>
<ul>
  <li>The decision process itself may be changed by a simple majority
vote.</li>
  <li>Developers are added and removed by a simple majority vote of
existing developers.</li>
  <li>Disputes about code changes are resolved by a simple majority of
developers.</li>
</ul>
<h3>Change Process</h3>
Developers should always perform a clean recompilation against the
latest version of the sources.  Compilation should succeed without
warnings.  Javadoc should always build without warnings.  All
unit tests must complete successfully before code is committed.<br/>
<br/>
In other words, the following steps must be performed prior to each cvs
commit:<br/>
<pre style="margin-left: 40px;">cvs update -d<br/>ant clean test javadoc<br/></pre>
Eventually we should have an automated nightly build process which
sends email to developers if any of these fail.<br/>
<h3>Contributions</h3>
Nutch welcomes contributions from non-developers.<br/>
<br/>
Contributions must:<br/>
<ul>
  <li>be submitted by email to the mailing list <a
 href="mailto:[EMAIL PROTECTED]">[EMAIL PROTECTED]</a></li>
  <li>be in patch-file format.</li>
  <li>conform to the coding conventions of the project.<br/>
  </li>
  <li>use the license.</li>
  <li>assign copyright to the organization.<br/>
  </li>
</ul>
<h3>Coding Conventions</h3>
Java code should conform to the conventions described in:<br/>
   <a
href="http://java.sun.com/docs/codeconv/html/CodeConvTOC.doc.html">http://java.sun.com/docs/codeconv/html/CodeConvTOC.doc.html<br/>
</a><br/>
All Java code modules should be accompanied by <a
 href="http://www.junit.org/index.htm">JUnit</a> tests.<br/>
<br/>
Every public or protected class, method and field must have an
informative Javadoc comment.<br/>
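<br/>
<p>For illustration, here is a minimal sketch of the Javadoc convention
above.  The class and method names are hypothetical, invented for this
example; an accompanying test for a real contribution would be written
as a JUnit TestCase:</p>
<pre style="margin-left: 40px;">
/** Hypothetical utility class illustrating the Javadoc convention. */
public class StringUtil {

  /**
   * Returns &lt;code&gt;true&lt;/code&gt; if &lt;code&gt;s&lt;/code&gt; is null or
   * contains only whitespace.
   */
  public static boolean isBlank(String s) {
    return s == null || s.trim().length() == 0;
  }
}
</pre>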
<br/>
</body>
</page>

--- NEW FILE: developers.xml ---
<page>

<title>developer information</title>

<menu>
 <item><a href="http://sourceforge.net/mail/?group_id=59548">Mailing Lists</a></item>
 <item><a
href="http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/nutch/nutch/">CVS</a></item>
 <item><a href="tutorial.html">Tutorial</a></item>
 <item><a href="i18n.html">i18n</a></item>
 <item><a href="http://www.nutch.org/release/nightly/">Download</a></item>
 <item><a href="../api/index.html">Javadoc</a></item>
 <item><a
href="http://sourceforge.net/tracker/?atid=491356&amp;group_id=59548&amp;func=browse">Bugs</a></item>
 <item><a href="policies.html">Policies</a></item>
 <item><a href="bot.html">Webmasters</a></item>
</menu>

<body>

<h3>How to Contribute</h3>

<p>Contributions are merit-based.  Other developers must see
contributions in order to evaluate them, suggest improvements, and
integrate them into the source base.</p>

<p>Contributors should follow these steps:</p>

<ol>
  <li>Check the nutch-developers mailing list to see if anyone is
already working on what you are interested in working on.  If so, you
might want to contact that person to see how far along the work has
come.</li>

  <li> If it looks like you are not duplicating effort, send
a short message saying you are about to do the work.  Future
contributors will then see your message during step (1) above.</li>

  <li> Once you've done some work, submit the diffs to the nutch-developers
mailing list.  We can all then examine the work for quality,
relevance, etc.  Details like formatting, documentation, and coding
conventions are important.</li>

  <li> We hope everyone will try to provide good feedback on your work, 
but honestly everyone's time is very limited.  Make it easy for
people to examine your work by making it:
   <ul>
    <li>high-quality;</li>
    <li>easy-to-read;</li>
    <li>easy-to-integrate; and</li>
    <li>relevant to Nutch's stated goals.</li>
   </ul>
  </li>

  <li> If everything seems right, we'll accept it into the source 
base and it will become part of Nutch.</li>

  <li> Collect glory and good karma.  Goto step 1.</li>
</ol>

<p>Please also read the developer <a href="policies.html">policies</a>
page.</p>

<h3>Needed contributions</h3>

Nutch needs contributions in the following areas (among others).  If
you think you can help with these, or with something else, please send
a message to <a
href="mailto:[EMAIL PROTECTED]">[EMAIL PROTECTED]</a>.

<h4>Internationalization</h4>

<p>Nutch intends to be international.  At present, we believe that
our indexing works well for western languages.  But we need:
<ul>
<li>Translations of the basic Nutch pages (at least <tt>search.xml</tt>,
<tt>help.xml</tt> and <tt>about.xml</tt>) into other languages.</li>
<li>Testing and development work on improved Asian language support</li>
</ul>
More information about how to internationalize can be found on the <a
href="i18n.html">i18n</a> page.
</p>


<h4>Search Parameter Tuning</h4>

<p>Nutch has not yet been tuned for quality.  There are ten or twenty
knobs that we can twiddle to adjust the ranking formula.  We have
started developing software to do this tuning automatically, but the
current code just contains guesses.  With a little tuning we should be
able to get results that are competitive with those of major search
engines.</p>

<h4>Alternate Content Types</h4>

<p>Nutch currently only supports HTML content accessed by HTTP.  It
would be great to add support for PDF files, image search, etc.</p>

</body>
</page>

--- NEW FILE: status.xml ---
<page>

<title>status</title>

<body>

<p>Currently we're just a handful of developers working part-time.  At
this point Nutch is coded entirely in Java, however persistent data is
written in language-independent formats so that, if needed, modules
may be re-written in other languages (e.g., C++) as the project
progresses.</p>

<p>Nutch has not yet been tuned for quality.  There are ten or twenty
knobs that we can twiddle to adjust the ranking formula.  We are
developing software to do this tuning automatically, but the current
code just contains guesses.  With a little tuning we should be able to
get results that are competitive with those of major search
engines.</p>

<p>As of June, 2003, we have successfully built a 100 million page
demo system.  Unfortunately, we do not yet have enough hardware to
support a public demo.  Hopefully we will be able to add that in the
next few months.  Stay tuned.</p>

</body>
</page>

--- NEW FILE: search.xml ---
<page>
<body>
<center>
<form name="search" action="/search.jsp" method="get"> <input
 name="query" size="44"/>&#160;<input type="submit" value="Search"/>
<a href="help.html">help</a>
</form>
</center>
</body>
</page>


