[ https://issues.apache.org/jira/browse/NUTCH-260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma closed NUTCH-260.
-------------------------------

    Resolution: Won't Fix

> Three new plugins that parse, index and query meta tags defined in the configuration
> ------------------------------------------------------------------------------------
>
>                 Key: NUTCH-260
>                 URL: https://issues.apache.org/jira/browse/NUTCH-260
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer, searcher
>    Affects Versions: 0.7.2
>         Environment: Built and tested on Linux so far.
>            Reporter: Jake Vanderdray
>            Priority: Minor
>         Attachments: nutch_customizations.tar
>
>
> These plugins allow you to define meta tags in your nutch-site file that
> you want to include in parsing, indexing and searching. The query plugin
> must replace query-basic. The format for adding query terms to
> nutch-site.xml is:
> <property>
>   <name>meta.names</name>
>   <value>keywords,recommended</value>
>   <description>This is a comma separated list of meta tag names that will
>   be parsed, indexed and searched against when parse-meta, index-meta and
>   query-meta are used.</description>
> </property>
> <property>
>   <name>meta.boosts</name>
>   <value>1.0,5.0</value>
>   <description>Comma separated list of boost values when searching using
>   query-meta. The order of the values should match the order of meta.names.
>   </description>
> </property>
> Meta tags found are assumed to have either a single value or be a comma
> separated list of values. The values found are added to the index as Lucene
> keywords (e.g. <meta name="keywords" content="First Thing, Second Thing">
> would result in two keyword fields named "keywords". The first would contain
> "First Thing" and the second would contain "Second Thing").
> I had to replace the query-basic plugin in order to allow matches in the meta
> fields to return hits even if there were no matches in any of the default
> fields. The query-basic plugin only returns hits when every search term is
> found in at least one default field.
> I needed hits returned if matches were
> found in at least one field for every term, and/or the entire search phrase
> appeared in a meta index field.
> One known bug is that common terms are not getting stripped out of the
> fields' values before they get indexed, so "The Next Big Thing" could not be
> matched because the query engine will strip out "the" from all queries. I
> intend to fix this by stripping out common terms from meta fields before
> indexing them.
> Another issue is that searching for "Next Big Thing" would not match meta
> index values for "Next", "Big" or "Thing". You can consider that a bug or a
> feature depending on how you look at it.
> These plugins were written for and only work on the 0.7.2 branch.
> I'm going to attach a tarball of the source of these three plugins after I
> create the issue. To use the plugins, you'll need to untar them in your
> src/plugins directory and add them to the ant build.xml directive (and of
> course add them in your nutch-site.xml file). If these end up getting added
> to the project, I'll write up documentation on the wiki.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
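
The configuration and value-splitting behaviour described in the issue can be illustrated with a short sketch. This is not the plugins' code (that is Java and lives in the attached tarball); it is a minimal Python illustration. The function names `parse_meta_config` and `split_meta_values` are hypothetical. Trimming whitespace around each item, and rejecting lists of unequal length, are assumptions inferred from the "First Thing" / "Second Thing" example and from the note that the order of meta.boosts should match meta.names.

```python
def parse_meta_config(names_value, boosts_value):
    """Pair each meta tag name with its boost, by position.

    Assumption: a length mismatch between the two lists is an error.
    The issue only says the order of the values should match.
    """
    names = [n.strip() for n in names_value.split(",") if n.strip()]
    boosts = [float(b) for b in boosts_value.split(",") if b.strip()]
    if len(names) != len(boosts):
        raise ValueError("meta.names and meta.boosts must have the same length")
    return dict(zip(names, boosts))


def split_meta_values(content):
    """Split a meta tag's content into one value per comma-separated item.

    Assumption: surrounding whitespace is trimmed and empty items are dropped.
    """
    return [v.strip() for v in content.split(",") if v.strip()]


# Values taken from the nutch-site.xml example in the issue.
boosts = parse_meta_config("keywords,recommended", "1.0,5.0")

# Each item becomes its own field under the same name, as the issue describes.
fields = [("keywords", v) for v in split_meta_values("First Thing, Second Thing")]

print(boosts)  # {'keywords': 1.0, 'recommended': 5.0}
print(fields)  # [('keywords', 'First Thing'), ('keywords', 'Second Thing')]
```

Because each item is stored whole, a value such as "The Next Big Thing" stays a single untokenized entry, which is consistent with the two limitations the reporter lists.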