Hi Markus,

Sorry for the delayed response, and thank you very much for the
information and pointers. I was able to implement what I wanted by
following the Nutch plugin-writing tutorial and by studying the source
code of the plugins you mentioned.

I just have one question that I couldn't find an answer to: what is the
benefit of the two-stage parsing+indexing pipeline (content -> parsing ->
metadata -> indexing -> Lucene fields) over indexing alone (content ->
indexing -> Lucene fields)? For instance, if I wanted to add document
size as a new token, I could calculate the size either in a parse filter
or in an indexing filter. Is this separation an optimization for Hadoop?

Thanks!
Jakub


On Mon, Jan 21, 2013 at 4:23 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> Hi,
>
> In Nutch, a `synthetic token` maps to a field/value pair. You need an
> indexing filter to read the key/value pair from the parse metadata and
> add it as a field/value pair to the NutchDocument. You may also need a
> custom parse filter to extract the data from somewhere and store it in
> the parse metadata as a key/value pair, which you then process further
> in your indexing filter.
>
> Check out the index-basic and index-more plugins for examples.
>
> Cheers,
>
> -----Original message-----
> > From: Jakub Moskal <jakub.mos...@gmail.com>
> > Sent: Mon 21-Jan-2013 04:58
> > To: user@nutch.apache.org
> > Subject: Synthetic Tokens
> >
> > Hi,
> >
> > I would like to develop a plugin that creates synthetic tokens for
> > some documents that are crawled by Nutch (as described here:
> > http://www.ideaeng.com/synthetic-tokens-need-p2-0604). How can this be
> > done in Nutch? Should I create a new field for every new synthetic
> > token, or should I add them to metadata? I'm not quite sure how
> > fields/metadata relate to the tokens described in the article.
> >
> > Thanks!
> > Jakub
> >
>
