Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchTutorial" page has been changed by WayneBurke:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=72&rev2=73

Comment:
Updated 6. Integrate Solr with Nutch to reflect changes in the expected 
schema.xml and its new location in the Solr example directory.

  == 6. Integrate Solr with Nutch ==
  Both Nutch and Solr are now installed and set up correctly, and Nutch has 
already created crawl data from the seed URL(s). The steps below delegate 
searching to Solr so that the crawled links become searchable:
  
+  * Backup the original Solr example schema.xml:<<BR>>
+  {{{
-  * mv ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml.org
+ mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
+ }}}
+ 
+  * Copy the Nutch specific schema.xml to replace it:
+  {{{
-  * `cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/`
+ cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
+ }}}
+ 
+  * Open the Nutch schema.xml file for editing:<<BR>>
+  {{{
-  * vi ${APACHE_SOLR_HOME}/example/solr/conf/schema.xml
+ vi ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
+ }}}
-  * Copy exactly in 351 line: <field name="_version_" type="long" 
indexed="true" stored="true"/>
-  * restart Solr with the command “`java -jar start.jar`” under 
`${APACHE_SOLR_HOME}/example`
-  * run the Solr Index command:
  
+  * Comment out the following lines (53-54) in the file by changing this:
- {{{
+  {{{
+ <filter class="solr.
+ EnglishPorterFilterFactory" protected="protwords.txt"/>
+ }}}
+  to this
+  {{{
+ <!--   <filter class="solr.
+ EnglishPorterFilterFactory" protected="protwords.txt"/> -->
+ }}}
+ 
+  * Add the following line right after the line <field name="id" ... /> 
(probably at line 69-70)
+  {{{
+ <field name="_version_" type="long" indexed="true" stored="true"/>
+ }}}
+ 
+  * If you want to see the raw HTML indexed by Solr, change the content field 
definition (line 80) to:
+  {{{
+ <field name="content" type="text" stored="true" indexed="true"/>
+ }}}
+  * Save the file and restart Solr under `${APACHE_SOLR_HOME}/example`:
+  {{{
+ java -jar start.jar
+ }}}
+ 
+  * Run the Solr Index command from ${NUTCH_RUNTIME_HOME}:<<BR>>
+  {{{
- bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
+ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
  }}}
+ 
- The call signature for running the solrindex has changed. The linkdb is now 
optional, so you need to denote it with a "-linkdb" flag on the command line.
+ * '' Note: If you are familiar with past versions of solrindex, the call 
signature for running it has changed. The linkdb is now optional, so you need 
to denote it with a "-linkdb" flag on the command line. ''
  
  This will send all crawl data to Solr for indexing. For more information 
please see [[bin/nutch solrindex]]
  
- If all has gone to plan, we are now ready to search with 
http://localhost:8983/solr/admin/.  If you want to see the raw HTML indexed by 
Solr, change the content field definition in `schema.xml` to:
+ If all has gone to plan, you are now ready to search with 
http://localhost:8983/solr/admin/.
  
- {{{
- <field name="content" type="text" stored="true" indexed="true"/>
- }}}
- 
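
The schema steps in the change above (back up the Solr example schema, copy in 
the Nutch one, add the _version_ field) could be scripted roughly as follows. 
This is only a sketch: the function names install_nutch_schema and 
add_version_field are hypothetical helpers, not part of Nutch or Solr, and 
add_version_field assumes the <field name="id" ... /> definition sits on a 
single line of schema.xml.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the schema steps; not shipped with Nutch or Solr.

# install_nutch_schema SOLR_CONF_DIR NUTCH_CONF_DIR
# Backs up the Solr example schema.xml and replaces it with the Nutch one.
install_nutch_schema() {
    solr_conf=$1    # e.g. ${APACHE_SOLR_HOME}/example/solr/collection1/conf
    nutch_conf=$2   # e.g. ${NUTCH_RUNTIME_HOME}/conf

    # Keep the original schema, but never overwrite an existing backup.
    if [ -f "$solr_conf/schema.xml" ] && [ ! -e "$solr_conf/schema.xml.org" ]; then
        mv "$solr_conf/schema.xml" "$solr_conf/schema.xml.org" || return 1
    fi

    # Put the Nutch-specific schema in its place.
    cp "$nutch_conf/schema.xml" "$solr_conf/schema.xml"
}

# add_version_field SCHEMA_FILE
# Inserts the _version_ field right after the first <field name="id" line.
add_version_field() {
    schema=$1

    # Do nothing if the field is already present, so the step can be re-run.
    if grep -q 'name="_version_"' "$schema"; then
        return 0
    fi

    awk '
        { print }
        /<field name="id"/ && !added {
            print "<field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>"
            added = 1
        }
    ' "$schema" > "$schema.tmp" && mv "$schema.tmp" "$schema"
}
```

One possible use is to call install_nutch_schema with 
"${APACHE_SOLR_HOME}/example/solr/collection1/conf" and 
"${NUTCH_RUNTIME_HOME}/conf", then add_version_field on the copied schema.xml. 
Commenting out the EnglishPorterFilterFactory filter and restarting Solr remain 
manual steps in this sketch.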
