Hi Markus,

Sorry for the delayed response, and thank you very much for the
information and pointers. I was able to implement what I wanted using
the Nutch plugin-writing tutorial and by reading the source code of the
plugins you mentioned.
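In case it's useful to anyone searching the archives later, here is
roughly what I ended up with, following the pattern you described: a
parse filter that writes a key/value pair into the parse metadata, and
an indexing filter that reads it back and adds it to the NutchDocument.
The package, class names, and the "docSize" key are my own choices, not
anything from the shipped plugins, and the code is a minimal sketch
against the Nutch 1.x extension-point interfaces as I understand them.

  package org.example.nutch;  // hypothetical package, my own

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.parse.HTMLMetaTags;
  import org.apache.nutch.parse.HtmlParseFilter;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.parse.ParseResult;
  import org.apache.nutch.protocol.Content;
  import org.w3c.dom.DocumentFragment;

  // Runs at parse time: computes the raw document size and stores it
  // as a key/value pair in the parse metadata, where the indexing
  // filter below can pick it up later.
  public class DocSizeParseFilter implements HtmlParseFilter {

    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult,
        HTMLMetaTags metaTags, DocumentFragment doc) {
      Parse parse = parseResult.get(content.getUrl());
      parse.getData().getParseMeta()
          .set("docSize", String.valueOf(content.getContent().length));
      return parseResult;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }

And the indexing filter that turns the metadata entry into a field on
the NutchDocument:

  package org.example.nutch;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.crawl.CrawlDatum;
  import org.apache.nutch.crawl.Inlinks;
  import org.apache.nutch.indexer.IndexingFilter;
  import org.apache.nutch.indexer.NutchDocument;
  import org.apache.nutch.parse.Parse;

  // Runs at index time: reads the value back out of the parse metadata
  // and adds it to the NutchDocument as a field/value pair.
  public class DocSizeIndexingFilter implements IndexingFilter {

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
        CrawlDatum datum, Inlinks inlinks) {
      String size = parse.getData().getParseMeta().get("docSize");
      if (size != null) {
        doc.add("docSize", size);
      }
      return doc;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }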
I just have one question that I couldn't find an answer to: what is the
benefit of parsing plus indexing (content -> parsing -> metadata ->
indexing -> Lucene fields) rather than indexing alone (content ->
indexing -> Lucene fields)? For instance, if I wanted to add document
size as a new token, I could calculate the size either in a parser or
in the indexer. Is this separation connected to optimization for
Hadoop?

Thanks!
Jakub

On Mon, Jan 21, 2013 at 4:23 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> Hi,
>
> In Nutch a `synthetic token` maps to a field/value pair. You need an
> indexing filter to read the key/value pair from the parsed metadata
> and add it as a field/value pair to the NutchDocument. You may also
> need a custom parser filter to extract the data from somewhere and
> store it in the parsed metadata as a key/value pair, which you then
> process further in your indexing filter.
>
> Check out the index-basic and index-more plugins for examples.
>
> Cheers,
>
> -----Original message-----
> > From: Jakub Moskal <jakub.mos...@gmail.com>
> > Sent: Mon 21-Jan-2013 04:58
> > To: user@nutch.apache.org
> > Subject: Synthetic Tokens
> >
> > Hi,
> >
> > I would like to develop a plugin that creates synthetic tokens for
> > some documents that are crawled by Nutch (as described here:
> > http://www.ideaeng.com/synthetic-tokens-need-p2-0604). How can this
> > be done in Nutch? Should I create a new field for every new
> > synthetic token, or should I add them to the metadata? I'm not
> > quite sure how fields/metadata relate to the tokens described in
> > the article.
> >
> > Thanks!
> > Jakub
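P.S. For anyone finding this thread in the archives: the two classes
also have to be declared in the plugin's plugin.xml against the
HtmlParseFilter and IndexingFilter extension points, and the plugin id
has to be added to plugin.includes in nutch-site.xml. Mine looks
roughly like the following; the plugin id, jar name, and class names
are my own choices, not anything shipped with Nutch.

  <?xml version="1.0" encoding="UTF-8"?>
  <plugin id="index-docsize" name="Document Size Plugin"
          version="1.0.0" provider-name="example.org">
     <runtime>
        <library name="index-docsize.jar">
           <export name="*"/>
        </library>
     </runtime>
     <requires>
        <import plugin="nutch-extensionpoints"/>
     </requires>
     <!-- parse-time half: writes docSize into the parse metadata -->
     <extension id="org.example.nutch.docsize.parse"
                name="Doc Size Parse Filter"
                point="org.apache.nutch.parse.HtmlParseFilter">
        <implementation id="DocSizeParseFilter"
                        class="org.example.nutch.DocSizeParseFilter"/>
     </extension>
     <!-- index-time half: adds docSize as a field on the document -->
     <extension id="org.example.nutch.docsize.index"
                name="Doc Size Indexing Filter"
                point="org.apache.nutch.indexer.IndexingFilter">
        <implementation id="DocSizeIndexingFilter"
                        class="org.example.nutch.DocSizeIndexingFilter"/>
     </extension>
  </plugin>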