[Nutch Wiki] Update of "FAQ" by Ankit Dangi

Apache Wiki Mon, 22 Mar 2010 02:42:24 -0700

The "FAQ" page has been changed by Ankit Dangi.
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=115&rev2=116

--------------------------------------------------

  <<TableOfContents>>
  
  == Nutch FAQ ==
- 
  === General ===
- 
  ==== Are there any mailing lists available? ====
- 
  There are user, developer, commits and agents lists, all available at 
http://lucene.apache.org/nutch/mailing_lists.html.
  
  ==== How can I stop Nutch from crawling my site? ====
- 
  Please visit our [[http://lucene.apache.org/nutch/bot.html|"webmaster info 
page"]]
  
  ==== Will Nutch be a distributed, P2P-based search engine? ====
- 
  We don't think it is presently possible to build a peer-to-peer search engine 
that is competitive with existing search engines. It would just be too slow. 
Returning results in less than a second is important: it lets people rapidly 
reformulate their queries so that they can more often find what they're looking 
for. In short, a fast search engine is a better search engine. I don't think 
many people would want to use a search engine that takes ten or more seconds to 
return results.
  
  That said, if someone wishes to start a sub-project of Nutch exploring 
distributed searching, we'd love to host it. We don't think these techniques 
are likely to solve the hard problems Nutch needs to solve, but we'd be happy 
to be proven wrong.
  
- 
  ==== Will Nutch use a distributed crawler, like Grub? ====
- 
  Distributed crawling can save download bandwidth, but, in the long run, the 
savings are not significant. A successful search engine requires more bandwidth 
to upload query result pages than its crawler needs to download pages, so 
making the crawler use less bandwidth does not reduce overall bandwidth 
requirements. The dominant expense of operating a large search engine is not 
crawling, but searching.
  
  ==== Won't open source just make it easier for sites to manipulate rankings? 
====
- 
  Search engines work hard to construct ranking algorithms that are immune to 
manipulation. Search engine optimizers still manage to reverse-engineer the 
ranking algorithms used by search engines, and improve the ranking of their 
pages. For example, many sites use link farms to manipulate search engines' 
link-based ranking algorithms, and search engines retaliate by improving their 
link-based algorithms to neutralize the effect of link farms.
  
  With an open-source search engine, this will still happen, just out in the 
open. This is analogous to encryption and virus protection software. In the 
long term, making such algorithms open source makes them stronger, as more 
people can examine the source code to find flaws and suggest improvements. Thus 
we believe that an open source search engine has the potential to better resist 
manipulation of its rankings.
  
  ==== What Java version is required to run Nutch? ====
- 
  Nutch 0.7 will run with Java 1.4 and up.
  
  ==== Exception: java.net.SocketException: Invalid argument or cannot assign 
requested address on Fedora Core 3 or 4 ====
- 
  It seems IPv6 is enabled on your machine.
  
  To solve this problem, add the following java param to the java instantiation 
in bin/nutch:
  
  JAVA_IPV4=-Djava.net.preferIPv4Stack=true
  
- # run it
- exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath "$CLASSPATH" 
$CLASS "$@"
+ # run it
+ exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath 
"$CLASSPATH" $CLASS "$@"
  
  ==== I have two XML files, nutch-default.xml and nutch-site.xml, why? ====
+ nutch-default.xml is the out-of-the-box configuration for Nutch. Most 
configuration can (and should, unless you know what you're doing) stay as it is. 
nutch-site.xml is where you make the changes that override the default 
settings. The same goes for the servlet container application.
- 
- nutch-default.xml is the out of the box configuration for nutch. Most 
configuration can (and should unless you know what your doing) stay as it is.
- nutch-site.xml is where you make the changes that override the default 
settings.
- The same goes to the servlet container application.
  
  ==== My system does not find the segments folder. Why? Or: How do I tell the 
''Nutch Servlet'' where the index files are located? ====
- 
  There are at least two choices to do that:
  
-   First you need to copy the .WAR file to the servlet container webapps 
folder.
+  . First you need to copy the .WAR file to the servlet container webapps 
folder.
+ 
  {{{
     % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
  }}}
- 
-   1) After building your first index, start Tomcat from the index folder.
+  . 1) After building your first index, start Tomcat from the index folder.
-     Assuming your index is located at /index :
+   . Assuming your index is located at /index :
+ 
  {{{
  % cd /index/
  % $CATALINA_HOME/bin/startup.sh
  }}}
-     '''Now you can search.'''
+  . '''Now you can search.'''
  
-   2) After building your first index, start and stop Tomcat which will make 
Tomcat extrat the Nutch webapp. Than you need to edit the nutch-site.xml and 
put in it the location of the index folder.
+  . 2) After building your first index, start and stop Tomcat, which will make 
Tomcat extract the Nutch webapp. Then you need to edit nutch-site.xml and 
put the location of the index folder in it.
+ 
  {{{
  % $CATALINA_HOME/bin/startup.sh
  % $CATALINA_HOME/bin/shutdown.sh
  }}}
- 
  {{{
  % vi $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml
  <?xml version="1.0"?>
@@ -93, +79 @@

  
  % $CATALINA_HOME/bin/startup.sh
  }}}
- 
  === Injecting ===
- 
  ==== What happens if I inject urls several times? ====
- 
  URLs that are already in the database won't be injected.
  
  === Fetching ===
- 
  ==== Is it possible to fetch only pages from some specific domains? ====
- 
- Please have a look on PrefixURLFilter.
- Adding some regular expressions to the urlfilter.regex.file might work, but 
adding a list with thousands of regular expressions would slow down your system 
excessively.
+ Please have a look at PrefixURLFilter. Adding some regular expressions to the 
urlfilter.regex.file might work, but adding a list with thousands of regular 
expressions would slow down your system excessively.
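
A minimal sketch of preparing input for the prefix filter, assuming the urlfilter-prefix plugin is enabled in plugin.includes and reads one URL prefix per line from its configured file. The helper name `make_prefixes` and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn a list of host names (one per line) into URL
# prefixes, one per line, the format assumed for the prefix filter file.
make_prefixes() {
    # $1: file with one host name per line; blank lines and # comments are skipped
    while IFS= read -r host || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        printf 'http://%s/\n' "$host"
    done < "$1"
}
```

For example, `make_prefixes domains.txt > conf/prefix-urlfilter.txt` would write one prefix per domain; a plain prefix comparison avoids evaluating thousands of regular expressions per URL.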
  
  Alternatively, you can set db.ignore.external.links to "true", and inject 
seeds from the domains you wish to crawl (these seeds must link to all pages 
you wish to crawl, directly or indirectly).  Doing this will let the crawl go 
through only these domains without leaving to start crawling external links.  
Unfortunately there is no way to record external links encountered for future 
processing, although a very small patch to the generator code can allow you to 
log these links to hadoop.log.
  
  ==== How can I recover an aborted fetch process? ====
- 
  Well, you can not. However, you have two choices to proceed:
  
-   1) Recover the pages already fetched and than restart the fetcher.
+  . 1) Recover the pages already fetched and then restart the fetcher.
- 
-       You'll need to create a file fetcher.done in the segment directory an 
than: [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]], 
[[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and 
[[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]] .
+   . You'll need to create a file fetcher.done in the segment directory and 
then run [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]], 
[[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and 
[[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]]. Assuming your index is 
at /index:
-       Assuming your index is at /index
+ 
- {{{ 
+ {{{
  % touch /index/segments/2005somesegment/fetcher.done
  
  % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
@@ -126, +104 @@

  
  % bin/nutch fetch /index/segments/2005somesegment
  }}}
- 
-       All the pages that were not crawled will be re-generated for fetch. If 
you fetched lots of pages, and don't want to have to re-fetch them again, this 
is the best way.
+  . All the pages that were not crawled will be re-generated for fetch. If you 
fetched lots of pages, and don't want to have to re-fetch them again, this is 
the best way.
  
-   2) Discard the aborted output.
+  . 2) Discard the aborted output.
- 
-       Delete all folders from the segment folder except the fetchlist folder 
and restart the fetcher.
+   . Delete all folders from the segment folder except the fetchlist folder 
and restart the fetcher.
  
  ==== Who changes the next fetch date? ====
- 
-   * After injecting a new url the next fetch date is set to the current time.
+  * After injecting a new url the next fetch date is set to the current time.
-   * Generating a fetchlist enhances the date by 7 days.
+  * Generating a fetchlist advances the date by 7 days.
-   * Updating the db sets the date to the current time + 
db.default.fetch.interval - 7 days.
+  * Updating the db sets the date to the current time + 
db.default.fetch.interval - 7 days.
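
The date arithmetic in the list above can be sketched with epoch seconds. This is only an illustration of the three rules as stated here, not Nutch code; the function names are hypothetical, and `is_due` encodes the assumption that generate selects a URL once its next fetch date is no later than the current time plus the -adddays value.

```shell
#!/bin/sh
DAY=86400  # seconds per day

# Next fetch time after "generate": the stored date moves forward by 7 days.
after_generate() {  # $1: current next-fetch time (epoch seconds)
    echo $(( $1 + 7 * DAY ))
}

# Next fetch time after "updatedb": now + fetch interval - 7 days.
after_updatedb() {  # $1: now (epoch seconds), $2: db.default.fetch.interval in days
    echo $(( $1 + ($2 - 7) * DAY ))
}

# Assumption: a URL is due when its next-fetch time is not later than
# now + adddays.
is_due() {  # $1: next-fetch time, $2: now, $3: -adddays value in days
    [ "$1" -le $(( $2 + $3 * DAY )) ]
}
```

With a 30-day interval, `after_updatedb 0 30` gives a next fetch time 23 days after the update, so the page is not due immediately but would be with an -adddays value of 30.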
  
  ==== I have a big fetchlist in my segments folder. How can I fetch only some 
sites at a time? ====
- 
-   * You have to decide how many pages you want to crawl before generating 
segments and use the options of bin/nutch generate.
+  * You have to decide how many pages you want to crawl before generating 
segments and use the options of bin/nutch generate.
-   * Use -topN to limit the amount of pages all together.
+  * Use -topN to limit the total number of pages.
-   * Use -numFetchers to generate multiple small segments.
+  * Use -numFetchers to generate multiple small segments.
-   * Now you could either generate new segments. Maybe you whould use -adddays 
to allow bin/nutch generate to put all the urls in the new fetchlist again. Add 
more then 7 days if you did not make a updatedb.
+  * Now you could either generate new segments (you may want to use -adddays 
to allow bin/nutch generate to put all the urls in the new fetchlist again; add 
more than 7 days if you did not run updatedb).
-   * Or send the process a unix STOP signal. You should be able to index the 
part of the segment for crawling which is allready fetched. Then later send a 
CONT signal to the process. Do not turn off your computer between! :)
+  * Or send the process a unix STOP signal. You should be able to index the 
part of the segment that is already fetched. Then later send a 
CONT signal to the process. Do not turn off your computer in between! :)
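
The STOP/CONT approach from the last bullet can be sketched as two small wrappers around `kill`. The function names are hypothetical, and the process ID of the running fetcher has to be looked up separately.

```shell
#!/bin/sh
# Pause a running fetcher without killing it; $1 is the fetcher's process ID.
pause_fetch() {
    kill -s STOP "$1"
}

# Let a paused fetcher continue where it stopped.
resume_fetch() {
    kill -s CONT "$1"
}
```

A stopped process keeps its memory and open files but only while the machine stays up, which is why the state is lost on a reboot.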
  
  ==== How many concurrent threads should I use? ====
- 
  This is dependent on your particular setup, but the following works for me:
  
  If you are using a slow internet connection (e.g. DSL), you might be limited 
to 40 or fewer concurrent fetches.
  
- If you have a fast internet connection (> 10Mb/sec) your bottleneck will 
definitely be in the machine itself (in fact you will need multiple machines to 
saturate the data pipe).  Empirically I have found that the machine works well 
up to about 1000-1500 threads.  
+ If you have a fast internet connection (> 10Mb/sec) your bottleneck will 
definitely be in the machine itself (in fact you will need multiple machines to 
saturate the data pipe).  Empirically I have found that the machine works well 
up to about 1000-1500 threads.
  
  To get this to work on my Linux box I needed to set the ulimit to 65535 
(ulimit -n 65535), and I had to make sure that the DNS server could handle the 
load (we had to speak with our colo to get them to shut off an artificial cap 
on the DNS servers).  Also, in order to get the speed up to a reasonable value, 
we needed to set the maximum fetches per host to 100 (otherwise we get a quick 
start followed by a very long slow tail of fetching).
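
A small pre-flight check for the open-file limit mentioned above, as a sketch; the function name is hypothetical and 65535 is simply the value quoted in this answer.

```shell
#!/bin/sh
# Succeeds when the current soft limit on open files is at least $1.
check_fd_limit() {
    limit=$(ulimit -n)
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$1" ]
}

if ! check_fd_limit 65535; then
    echo "open-file limit is $(ulimit -n); consider 'ulimit -n 65535' before fetching" >&2
fi
```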
  
  To other users: please add to this with your own experiences, my own 
experience may be atypical.
  
- 
- 
  ==== How can I force fetcher to use custom nutch-config? ====
- 
-   * Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
+  * Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
-   * Copy these files from $NUTCH_HOME/conf to the new directory: 
common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, 
regex-normalize.xml, regex-urlfilter.txt
+  * Copy these files from $NUTCH_HOME/conf to the new directory: 
common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, 
regex-normalize.xml, regex-urlfilter.txt
-   * Modify the nutch-default.xml to suite your needs
+  * Modify the nutch-default.xml to suit your needs
-   * Set NUTCH_CONF_DIR environment variable to point into the directory you 
created
+  * Set the NUTCH_CONF_DIR environment variable to point to the directory you 
created
-   * run $NUTCH_HOME/bin/nutch so that it gets the NUTCH_CONF_DIR environment 
variable. You should check the command outputs for lines where the configs are 
loaded, that they are really loaded from your custom dir.
+  * Run $NUTCH_HOME/bin/nutch so that it gets the NUTCH_CONF_DIR environment 
variable. Check the command output for the lines where the configs are 
loaded, to confirm that they are really loaded from your custom dir.
-   * Happy using.
+  * Happy using.
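
The copy steps above can be sketched as a shell function. The function name is hypothetical; the file list is the one given in the steps, and names that do not exist in a particular installation are skipped.

```shell
#!/bin/sh
# Copy the configuration files named above into a new sub-directory of
# $1/conf and print the directory to use as NUTCH_CONF_DIR.
make_custom_conf() {  # $1: Nutch home directory, $2: name of the new sub-directory
    src="$1/conf"
    dst="$src/$2"
    mkdir -p "$dst" || return 1
    for f in "$src"/common-terms.utf8 "$src"/mime-types.* "$src"/nutch-conf.xsl \
             "$src"/nutch-default.xml "$src"/regex-normalize.xml "$src"/regex-urlfilter.txt; do
        # Skip names that do not exist in this installation.
        if [ -f "$f" ]; then
            cp "$f" "$dst/"
        fi
    done
    echo "$dst"
}
```

One way to use it is `NUTCH_CONF_DIR=$(make_custom_conf "$NUTCH_HOME" myconfig)` followed by `export NUTCH_CONF_DIR`, then editing the copied nutch-default.xml before running bin/nutch.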
  
  ==== bin/nutch generate generates empty fetchlist, what can I do? ====
- 
- The reason for that is that when a page is fetched, it is timestamped in the 
webdb. So basiclly if its time is not up it will not be included in a 
fetchlist. So for example if you generated a fetchlist and you deleted the 
segment dir created. calling generate again will generate an empty fetchlist.
+ The reason is that when a page is fetched, it is timestamped in the 
webdb. So basically, if its time is not up, it will not be included in a 
fetchlist. For example, if you generated a fetchlist and then deleted the 
segment dir it created, calling generate again will produce an empty fetchlist. 
So, two choices:
- So, two choices:
-   1) Change your system date to be 30 days from today (if you haven't changed 
the default settings) and re-run bin/nutch generate...
  
+  . 1) Change your system date to be 30 days from today (if you haven't 
changed the default settings) and re-run bin/nutch generate...
+  . 2) Call bin/nutch generate with the -adddays 30 (if you haven't changed 
the default settings) to make generate think the time has come...
+  . After generate you can call bin/nutch fetch.
-   2) Call bin/nutch generate with the -adddays 30 (if you haven't changed the 
default settings) to make generate think the time has come...
- 
-   After generate you can call bin/nutch fetch.
  
  ==== While fetching I get UnknownHostException for known hosts ====
- 
  Make sure your DNS server is working and/or it can handle the load of 
requests.
  
  ==== How can I fetch pages that require Authentication? ====
- 
  See HttpAuthenticationSchemes.
  
  === Updating ===
- 
- 
  === Indexing ===
- 
  ==== Is it possible to change the list of common words without crawling 
everything again? ====
- 
  Yes. The list of common words is used only when indexing and searching, and 
not during other steps. So, if you change the list of common words, there is no 
need to re-fetch the content, you just need to re-create segment indexes to 
reflect the changes.
  
  ==== How do I index my local file system? ====
- 
  The tricky thing about Nutch is that out of the box it has most plugins 
disabled and is tuned for a crawl of a "remote" web server - you '''have''' to 
change config files to get it to crawl your local disk.
  
-   1) crawl-urlfilter.txt needs a change to allow file: URLs while not 
following http: ones, otherwise it either won't index anything, or it'll jump 
off your disk onto web sites.
+  . 1) crawl-urlfilter.txt needs a change to allow file: URLs while not 
following http: ones, otherwise it either won't index anything, or it'll jump 
off your disk onto web sites.
+   . Change this line: -^(file|ftp|mailto|https): to this: 
-^(http|ftp|mailto|https):
- 
-     Change this line:
- 
-     -^(file|ftp|mailto|https):
- 
-     to this:
- 
-     -^(http|ftp|mailto|https):
- 
-   2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If 
it has this fragment it's probably ok:
+  2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If 
it has this fragment it's probably ok:
- 
-     # accept anything else
+   . # accept anything else +.*
-     +.*
- 
-   3) By default the 
[[http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html|"file
 plugin"]] is disabled. nutch-site.xml needs to be modified to allow this 
plugin. Add an entry like this:
+  3) By default the 
[[http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html|"file
 plugin"]] is disabled. nutch-site.xml needs to be modified to allow this 
plugin. Add an entry like this:
+ 
  {{{
      <property>
        <name>plugin.includes</name>
        <value>protocol-file|...copy original values from nutch-default 
here...</value>
      </property>
  }}}
- 
  where you should copy and paste all values from nutch-default.xml in the 
plugin.includes setting provided there. This will ensure that all plug-ins 
normally enabled will be enabled, plus the protocol-file plugin. Make sure to 
include parse-pdf if you want to parse PDF files. Make sure that 
urlfilter-regexp is included, or else '''the *urlfilter files will be 
ignored''', leading nutch to accept all URLs. You need to enable crawl URL 
filters to prevent nutch from crawling up the parent directory, see below.
  
  Now you can invoke the crawler and index all or part of your disk. The only 
remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs 
from a web page fetched with http, so if you test with the Nutch web container 
running in Tomcat, annoyingly, as you click on results nothing will happen as 
Mozilla by default does not load file URLs. This is mentioned 
[[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and 
this behavior may be disabled by a 
[[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] 
(see security.checkloaduri). IE5 does not have this problem.
  
  ==== Nutch crawling parent directories for file protocol ====
- 
  If you find nutch crawling parent directories when using the file protocol, 
the following kludge may help:
  
- [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex 
you could put the following in regex-urlfilter.txt :
+ http://issues.apache.org/jira/browse/NUTCH-407 E.g. for urlfilter-regex you 
could put the following in regex-urlfilter.txt :
+ 
  {{{
  +^file:///c:/top/directory/
  -.
  }}}
- 
  Alternatively, you could apply the patch described 
[[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on 
this page]], which would avoid the hardwiring of the site-specific 
/top/directory in your configuration file.
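
To try out rules like the two above before crawling, the first-match behaviour of the regex URL filter can be approximated in shell. This is a sketch only: the function name is hypothetical, it uses `grep -E` rather than Java regular expressions (so some patterns may behave differently), and it assumes that the first matching rule decides and that a URL matching no rule is rejected.

```shell
#!/bin/sh
# Print "accept" or "reject" for URL $2 under the rules in file $1.
# Each rule is "+regex" or "-regex"; the first rule whose regex matches wins.
url_filter() {
    while IFS= read -r rule || [ -n "$rule" ]; do
        case "$rule" in
            '+'*) verdict=accept ;;
            '-'*) verdict=reject ;;
            *) continue ;;                # blank lines, comments, anything else
        esac
        pattern=${rule#?}                 # the rule without its leading sign
        if printf '%s\n' "$2" | grep -Eq -- "$pattern"; then
            echo "$verdict"
            return 0
        fi
    done < "$1"
    echo reject                           # no rule matched: the URL is ignored
}
```

With the two rules shown above, a URL under file:///c:/top/directory/ is accepted by the first rule, while the parent directory and every http URL fall through to the catch-all `-.` rule.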
  
  ==== How do I index remote file shares? ====
- 
  At the current time, Nutch does not have built in support for accessing files 
over SMB (Windows) shares.  This means the only available method is to mount 
the shares yourself, then index the contents as though they were local 
directories (see above).
  
  Note that the share mounting method suffers from the following drawbacks:
  
+  . 1) The links generated by Nutch will not work except for queries from 
localhost (end users typically won't have the exact same shares mounted in the 
exact same way).
+  . 2) You are limited to the number of mounted shares your operating system 
supports.  In *nix environments, this is effectively unlimited, but in Windows 
you may mount 26 (one share or drive per letter in the English alphabet).
+  . 3) Documents with links to shares are unlikely to work since they won't 
link to the share on your machine, but rather to the SMB version.
-   1) The links generated by Nutch will not work except for queries from 
localhost (end users typically won't have the exact same shares mounted in the 
exact same way).
-   
-   2) You are limited to the number of mounted shares your operating system 
supports.  In *nix environments, this is effectively unlimited, but in Windows 
you may mount 26 (one share or drive per letter in the English alphabet)
-   
-   3) Documents with links to shares are unlikely to work since they won't 
link to the share on your machine, but rather to the SMB version.
  
  ==== While indexing documents, I get the following error: ====
- 
  ''050529 011245 fetch okay, but can't parse myfile, reason: Content truncated 
at 65536 bytes. Parser can't handle incomplete msword file.''
  
  '''What is happening?'''
  
-   By default, the size of the documents downloaded by Nutch is limited (to 
65536 bytes). To allow Nutch to download larger files (via HTTP), modify 
nutch-site.xml and add an entry like this:
+  . By default, the size of the documents downloaded by Nutch is limited (to 
65536 bytes). To allow Nutch to download larger files (via HTTP), modify 
nutch-site.xml and add an entry like this:
+ 
  {{{
      <property>
        <name>http.content.limit</name>
        <value>150000</value>
      </property>
  }}}
-   If you do not want to limit the size of downloaded documents, set 
http.content.limit to a negative value:
+  . If you do not want to limit the size of downloaded documents, set 
http.content.limit to a negative value:
+ 
  {{{
      <property>
        <name>http.content.limit</name>
        <value>-1</value>
      </property>
  }}}
- 
  === Segment Handling ===
- 
  ==== Do I have to delete old segments after some time? ====
- 
  If you're fetching regularly, segments older than the 
db.default.fetch.interval can be deleted, as their pages should have been 
refetched. This is 30 days by default.
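
A sketch for finding candidates to delete, assuming each segment is a directory directly under the segments folder and that the directory's modification time is a usable stand-in for the segment's age (both are assumptions, as is the function name). It only lists directories; review the output before removing anything.

```shell
#!/bin/sh
# List segment directories under $1 not modified for more than $2 days.
list_old_segments() {
    find "$1" -mindepth 1 -maxdepth 1 -type d -mtime +"$2"
}
```

For example, `list_old_segments /index/segments 30` would print segments older than the default fetch interval.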
  
  === MapReduce ===
- 
  ==== What is MapReduce? ====
- 
  MapReduce
  
  ==== How to start working with MapReduce? ====
- 
-   edit conf/nutch-site.xml
+  . edit conf/nutch-site.xml
- 
-   <property>
+  <property>
-     <name>fs.default.name</name>
-     <value>localhost:9000</value>
-     <description>The name of the default file system. Either the literal 
string "local" or a host:port for NDFS.</description>
+   . <name>fs.default.name</name> <value>localhost:9000</value> 
<description>The name of the default file system. Either the literal string 
"local" or a host:port for NDFS.</description>
-   </property>
+  </property>
- 
-   <property>
+  <property>
-     <name>mapred.job.tracker</name>
-     <value>localhost:9001</value>
-     <description>The host and port that the MapReduce job tracker runs at. If 
"local", then jobs are run in-process as a single map and reduce 
task.</description>
+   . <name>mapred.job.tracker</name> <value>localhost:9001</value> 
<description>The host and port that the MapReduce job tracker runs at. If 
"local", then jobs are run in-process as a single map and reduce 
task.</description>
+  </property> edit conf/mapred-default.xml
+  <property>
+   . <name>mapred.map.tasks</name> <value>4</value> <description>define 
mapred.map.tasks to be multiple of number of slave hosts </description>
-   </property>
+  </property>
- 
-   edit conf/mapred-default.xml
-   <property>
+  <property>
-     <name>mapred.map.tasks</name>
-     <value>4</value>
-     <description>define mapred.map.tasks to be multiple of number of slave 
hosts
-     </description>
-   </property>
- 
-   <property>
-     <name>mapred.reduce.tasks</name>
-     <value>2</value>
-     <description>define mapred.reduce tasks to be number of slave 
hosts</description>
+   . <name>mapred.reduce.tasks</name> <value>2</value> <description>define 
mapred.reduce tasks to be number of slave hosts</description>
-   </property>
- 
-   create a file with slave host names
+  </property> create a file with slave host names
+ 
  {{{
    % echo localhost >> ~/.slaves
-   % echo somemachin >> ~/.slaves}}}
+   % echo somemachin >> ~/.slaves
- 
+ }}}
-   start all ndfs & mapred daemons
+  . start all ndfs & mapred daemons
+ 
  {{{
    % bin/start-all.sh
-   }}}
+ }}}
- 
-   create a directory with seed list file
+  . create a directory with seed list file
+ 
  {{{
    % mkdir seeds
    % echo http://www.cnn.com/ > seeds/urls
-   }}}
+ }}}
- 
-   copy the seed directory to ndfs
+  . copy the seed directory to ndfs
+ 
  {{{
    % bin/nutch ndfs -put seeds seeds
-   }}}
+ }}}
- 
-   crawl a bit
+  . crawl a bit
+ 
  {{{
    % bin/nutch crawl seeds -depth 3
-   }}}
+ }}}
+  . monitor things from the administrative interface: open a browser and enter 
your masterHost:7845
- 
-   monitor things from adminstrative interface
-   open browser and enter your masterHost : 7845
  
  === NDFS ===
- 
  ==== What is it? ====
- 
  NutchDistributedFileSystem
  
  ==== How to send commands to NDFS? ====
- 
-   list files in the root of NDFS
+  . list files in the root of NDFS
+ 
  {{{
    [r...@xxxxxx mapred]# bin/nutch ndfs -ls /
    050927 160948 parsing file:/mapred/conf/nutch-default.xml
@@ -363, +281 @@

    /user/root/crawl-20050927142856 <dir>
    /user/root/crawl-20050927144626 <dir>
    /user/root/seeds        <dir>
-   }}}
+ }}}
- 
-   remove a directory from NDFS
+  . remove a directory from NDFS
-   {{{
+  {{{
    [r...@xxxxxx mapred]# bin/nutch ndfs -rm /user/root/crawl-20050927144626
    050927 161025 parsing file:/mapred/conf/nutch-default.xml
    050927 161025 parsing file:/mapred/conf/nutch-site.xml
    050927 161025 No FS indicated, using default:localhost:8009
    050927 161025 Client connection to 127.0.0.1:8009: starting
    Deleted /user/root/crawl-20050927144626
-   }}}
+ }}}
  
  === Searching ===
- 
  ==== Common words are saturating my search results. ====
- 
  You can tweak your conf/common-terms.utf8 file after creating an index 
through the following command:
+ 
-   bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index
+  . bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index
  
  ==== How is scoring done in Nutch? (Or, explain the "explain" page?) ====
- 
- Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does 
it. The formula Lucene uses scoring can be found at the head of the Lucene 
Similarity class in the 
[[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html|Lucene
 Similarity Javadoc]]. Roughly, the score for a particular document in a set of 
query results, "score(q,d)", is the sum of the score for each term of a query 
("t in q"). A terms score in a document is itself the sum of the term run 
against each field that comprises a document ("title" is one field, "url" 
another. A "document" is a set of "fields"). Per field, the score is the 
product of the following factors: Its "tf" (term freqency in the document), a 
score factor "idf" (usually a factor made up of frequency of term relative to 
amount of docs in index), an index-time boost, a normalization of count of 
terms found relative to size of document ("lengthNorm"), a similar 
normalization is done for the term in the query itself ("queryNorm"), and 
finally, a factor with a weight for how many instances of the total amount of 
terms a particular document contains. Study the lucene javadoc to get more 
detail on each of the equation components and how they effect overall score.
+ Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does 
it. The formula Lucene uses for scoring can be found at the head of the Lucene 
Similarity class in the 
[[http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html|Lucene Similarity Javadoc]]. 
Roughly, the score for a particular document in a set of 
query results, "score(q,d)", is the sum of the score for each term of a query 
("t in q"). A term's score in a document is itself the sum of the term run 
against each field that comprises a document ("title" is one field, "url" 
another. A "document" is a set of "fields"). Per field, the score is the 
product of the following factors: its "tf" (term frequency in the document), a 
score factor "idf" (usually a factor made up of frequency of term relative to 
amount of docs in index), an index-time boost, a normalization of count of 
terms found relative to size of document ("lengthNorm"), a similar 
normalization for the term in the query itself ("queryNorm"), and 
finally, a factor with a weight for how many instances of the total amount of 
terms a particular document contains. Study the Lucene Javadoc to get more 
detail on each of the equation components and how they affect the overall score.
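
For reference, the formula at the head of the Lucene 2.4 Similarity Javadoc can be written roughly as follows. This is a transcription for orientation, not a Nutch-specific derivation; the linked Javadoc remains the authority on each factor.

```latex
\[
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Bigl(\mathrm{tf}(t \in d)\cdot\mathrm{idf}(t)^{2}\cdot
\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Bigr)
\]
```

Here boost(t) is the query-time boost of term t, norm(t,d) bundles the index-time boosts with "lengthNorm", and coord(q,d) is the factor that rewards documents containing more of the query's terms.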
  
  To interpret the Nutch "explain.jsp" page, you need to have the above-cited 
Lucene scoring equation in mind. First, notice how we move right as we move from 
"score total", to "score per query term", to "score per query document field" 
(a document field is not shown if a term was not found in that field). 
Next, looking at a particular field's score, it comprises a query component and 
a field component. The query component includes the query-time (as opposed 
to index-time) boost, an "idf" that is the same for the query and field 
components, and then a "queryNorm". The field component is similar 
("fieldNorm" is an aggregation of certain of the Lucene equation components).
  ==== How can I influence Nutch scoring? ====
- 
  Scoring is implemented as a filter plugin, i.e. an implementation of the 
!ScoringFilter class. By default, 
[[http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/scoring/opic/OPICScoringFilter.html|OPICScoringFilter]]
 is used.
  
- However, the easiest way to influence scoring is to change query time boosts 
(Will require edit of nutch-site.xml and redeploy of the WAR file). Query-time 
boost by default looks like this:{{{
+ However, the easiest way to influence scoring is to change query time boosts 
(Will require edit of nutch-site.xml and redeploy of the WAR file). Query-time 
boost by default looks like this:
+ 
+ {{{
    query.url.boost, 4.0f
    query.anchor.boost, 2.0f
    query.title.boost, 1.5f
    query.host.boost, 2.0f
-   query.phrase.boost, 1.0f}}}
+   query.phrase.boost, 1.0f
- 
+ }}}
  From the list above, you can see that terms found in a document URL get the 
highest boost with anchor text next, etc.
  
  Anchor text makes a large contribution to document score (you can see the 
anchor text for a page by browsing to "explain" and then editing the URL to put 
"anchors.jsp" in place of "explain.jsp").
  
  ==== What is the RSS symbol in search results all about? ====
- Clicking on the RSS symbol sends the current query back to Nutch to a servlet 
named 
[[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html|OpenSearchServlet]].
  
[[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html|OpenSearchServlet]]
 reruns the query and returns the results formatted instead as RSS (XML).  The 
RSS format is based on [[http://a9.com/-/spec/opensearchrss/1.0/|OpenSearch RSS 
1.0]] from [[http://www.a9.com|a9.com]]: 
"[[http://a9.com/-/spec/opensearchrss/1.0/|OpenSearch]] RSS 1.0 is an extension 
to the RSS 2.0 standard, conforming to the guidelines for RSS extensibility as 
outlined by the RSS 2.0 specification" (See also 
[[http://opensearch.a9.com/|opensearch]]). Nutch in turn  makes extension to 
[[http://a9.com/-/spec/opensearchrss/1.0/|OpenSearch]].  The Nutch extensions 
are identified by the 'nutch' namespace prefix and add to 
[[http://a9.com/-/spec/opensearchrss/1.0/|OpenSearch]] navigation information, 
the original query, and all fields that are available at search result time 
including the Nutch page boost, the name of the segment the page resides in, 
etc. 
+ Clicking on the RSS symbol sends the current query back to Nutch to a servlet 
named 
[[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html|OpenSearchServlet]].
  
[[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html|OpenSearchServlet]]
 reruns the query and returns the results formatted instead as RSS (XML).  The 
RSS format is based on [[http://a9.com/-/spec/opensearchrss/1.0/|OpenSearch RSS 
1.0]] from [[http://www.a9.com|a9.com]]: 
"[[http://a9.com/-/spec/opensearchrss/1.0/|OpenSearch]] RSS 1.0 is an extension 
to the RSS 2.0 standard, conforming to the guidelines for RSS extensibility as 
outlined by the RSS 2.0 specification" (See also 
[[http://opensearch.a9.com/|opensearch]]). Nutch in turn makes extensions to 
[[http://a9.com/-/spec/opensearchrss/1.0/|OpenSearch]].  The Nutch extensions 
are identified by the 'nutch' namespace prefix and add to 
[[http://a9.com/-/spec/opensearchrss/1.0/|OpenSearch]] navigation information, 
the original query, and all fields that are available at search result time 
including the Nutch page boost, the name of the segment the page resides in, 
etc.
  
  Results as RSS (XML) rather than HTML are easier for programmatic clients to 
parse: such clients will query against 
[[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html|OpenSearchServlet]]
 rather than search.jsp.  Results as XML can also be transformed using XSL 
stylesheets, the likely direction of UI development going forward.
  
  ==== How can I find out/display the size and mime type of the hits that a 
search returns? ====
  In order to be able to find this information you have to modify the standard 
{{{plugin.includes}}} property of the nutch configuration file and add the 
{{{index-more}}} filter.
+ 
  {{{
  <property>
    <name>plugin.includes</name>
@@ -418, +335 @@

  </property>
  }}}
  After that, __don't forget to crawl again__ and you should be able to 
retrieve the mime-type and content-length through the class HitDetails (via the 
fields "primaryType", "subType" and "contentLength") as you normally do for the 
title and URL of the hits.
+ 
-       (Note by DanielLopez) Thanks to Dogacan Güney for the tip.
+  . (Note by DanielLopez) Thanks to Dogacan Güney for the tip.
  
  === Crawling ===
- 
  ==== java.io.IOException: No input directories specified in: NutchConf: 
nutch-default.xml , mapred-default.xml ====
- 
  The crawl tool expects as its first parameter the folder name where the 
seeding urls file is located so for example if your urls.txt is located in 
/nutch/seeds the crawl command would look like: crawl seed -dir 
/user/nutchuser...
  
  ==== Nutch doesn't crawl relative URLs? Some pages are not indexed but my 
regex file and everything else is okay - what is going on? ====
+ By default, the crawl tool processes at most 100 outlinks per fetched page. 
To overcome this limitation, change the 
'''db.max.outlinks.per.page''' property to a higher value or simply -1 
(unlimited).
- The crawl tool has a default limitation of 100 outlinks of one page that are 
being fetched.
- To overcome this limitation change the '''db.max.outlinks.per.page''' 
property to a higher value or simply -1 (unlimited).
  
  file: conf/nutch-default.xml
  
@@ -440, +355 @@

     If this value is nonnegative (>=0), at most db.max.outlinks.per.page 
outlinks
     will be processed for a page; otherwise, all outlinks will be processed.
     </description>
-  </property> 
+  </property>
  }}}
- see also: 
http://www.mail-archive.com/[email protected]/msg08665.html
+ see also: 
http://www.mail-archive.com/[email protected]/msg08665.html (tested 
under nutch 0.9)
- (tested under nutch 0.9)
- 
- 
- 
  
  === Discussion ===
- 
  [[http://grub.org/|Grub]] has some interesting ideas about building a search 
engine using distributed computing. ''And how is that relevant to nutch?''
+ 
  ----
  CategoryHomepage
