Three new plugins that parse, index and query meta tags defined in the 
configuration
------------------------------------------------------------------------------------

         Key: NUTCH-260
         URL: http://issues.apache.org/jira/browse/NUTCH-260
     Project: Nutch
        Type: New Feature

  Components: indexer, searcher  
    Versions: 0.7.2    
 Environment: Built and tested on Linux so far.
    Reporter: Jake Vanderdray
    Priority: Minor


These plugins allow you to define meta tags in your nutch-site.xml file that 
you want to include in parsing, indexing and searching.  The query plugin must 
replace query-basic.  The format for adding the meta tags to nutch-site.xml is:

<property>
  <name>meta.names</name>
  <value>keywords,recommended</value>
  <description>This is a comma-separated list of meta tag names that will
  be parsed, indexed and searched against when parse-meta, index-meta and
  query-meta are used.</description>
</property>

<property>
  <name>meta.boosts</name>
  <value>1.0,5.0</value>
  <description>Comma-separated list of boost values when searching using
  query-meta.  The order of the values should match the order of meta.names.
  </description>
</property>
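
To illustrate how the two properties line up, here is a minimal Java sketch 
that pairs each entry of meta.names with the boost at the same position in 
meta.boosts.  The class MetaBoosts and its pair method are hypothetical names 
made up for this example and are not taken from the plugin source.  It also 
assumes a newer Java than the 0.7.2 branch targets (generics, diamond 
operator), and that a length mismatch should be treated as an error.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Nutch or the plugins.
final class MetaBoosts {

    // Pairs the i-th name in metaNames with the i-th boost in metaBoosts.
    // Both arguments are the raw comma-separated property values.
    static Map<String, Float> pair(String metaNames, String metaBoosts) {
        String[] names = metaNames.split(",");
        String[] boosts = metaBoosts.split(",");
        if (names.length != boosts.length) {
            // Assumption: mismatched lists are rejected rather than padded.
            throw new IllegalArgumentException(
                "meta.names has " + names.length
                + " entries but meta.boosts has " + boosts.length);
        }
        // LinkedHashMap keeps the order in which the names were configured.
        Map<String, Float> result = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            result.put(names[i].trim(), Float.parseFloat(boosts[i].trim()));
        }
        return result;
    }
}
```

With the values shown above, MetaBoosts.pair("keywords,recommended", 
"1.0,5.0") maps "keywords" to 1.0 and "recommended" to 5.0.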

Meta tags found are assumed to contain either a single value or a 
comma-separated list of values.  The values found are added to the index as 
Lucene keywords.  For example, <meta name="keywords" content="First Thing, 
Second Thing"> would result in two keyword fields named "keywords": the first 
would contain "First Thing" and the second would contain "Second Thing".
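
The splitting described above can be sketched in Java roughly as follows.  
MetaValues and split are hypothetical names for this example, not the parse 
plugin's actual code.  Trimming whitespace around each value and dropping 
empty entries are assumptions made here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Nutch or the plugins.
final class MetaValues {

    // Splits a meta tag's value on commas; each returned entry would become
    // its own keyword field in the index.
    static List<String> split(String content) {
        List<String> values = new ArrayList<>();
        for (String part : content.split(",")) {
            String value = part.trim();
            // Assumption: empty entries (e.g. from "a,,b") are skipped.
            if (!value.isEmpty()) {
                values.add(value);
            }
        }
        return values;
    }
}
```

MetaValues.split("First Thing, Second Thing") returns a list holding "First 
Thing" and "Second Thing", one entry per keyword field.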

I had to replace the query-basic plugin in order to allow matches in the meta 
fields to return hits even if there were no matches in any of the default 
fields.  The query-basic plugin only returns hits when every search term is 
found in at least one default field.  I needed hits returned if matches were 
found in at least one field for every term, and/or the entire search phrase 
appeared in a meta index field.

One known bug is that common terms are not stripped out of the fields' 
values before they are indexed, so "The Next Big Thing" cannot be matched 
because the query engine strips "the" out of all queries.  I intend to 
fix this by stripping common terms out of meta fields before indexing them.
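
One way the proposed fix could look is sketched below in Java.  This is not 
the planned implementation: CommonTermStripper is a hypothetical name, and the 
word list is a small stand-in chosen for the example, not the list of common 
terms that Nutch itself uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of Nutch or the plugins.
final class CommonTermStripper {

    // Illustrative stand-in list; the real set of common terms is an
    // assumption here and would have to match what the query engine strips.
    private static final Set<String> COMMON =
        new HashSet<>(Arrays.asList("the", "a", "an", "of", "and"));

    // Removes common terms from a meta value before it is indexed,
    // comparing case-insensitively and keeping the remaining word order.
    static String strip(String value) {
        List<String> kept = new ArrayList<>();
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()
                    && !COMMON.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }
}
```

With this list, CommonTermStripper.strip("The Next Big Thing") yields "Next 
Big Thing", which is the form a query would take once "the" is stripped.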

Another issue is that searching for "Next Big Thing" would not match meta index 
values for "Next", "Big" or "Thing".  You can consider that a bug or a feature 
depending on how you look at it.

These plugins were written for and only work on the 0.7.2 branch.

I'm going to attach a tarball of the source of these three plugins after I 
create the issue.  To use the plugins, you'll need to untar them in your 
src/plugins directory and add them to the ant build.xml directive (and of 
course add them in your nutch-site.xml file).  If these end up getting added to 
the project, I'll write up documentation on the wiki.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
