Re: Top 5 high freq words - UpdateProcessorChain or DIH Script?

Erick Erickson Mon, 09 Jul 2012 06:49:44 -0700

I think the second way is probably the most robust, and it's surprisingly
uncomplicated. You wouldn't really be using copyField in that case;
you'd just add the top terms to the proper field in the document.


Anything you do outside of the update chain would have to be kept in
sync with the stopwords, etc., which would be a pain to maintain,
whereas putting your own element in the chain would let Solr/Lucene
do a lot of that work for you...
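
As a rough illustration of the counting step only, here is a minimal sketch
in plain Java. The class name TopTerms and the method topTerms are
hypothetical, and the sketch assumes the tokens have already been produced
by the field's analysis chain and that the stopword set is passed in
explicitly. In a real setup this logic would presumably be called from a
custom UpdateRequestProcessor's processAdd(AddUpdateCommand) method; that
wiring is not shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Solr: picks the n most frequent
// non-stopword tokens from an already-analyzed token list.
public class TopTerms {

    public static List<String> topTerms(List<String> tokens, Set<String> stopwords, int n) {
        // LinkedHashMap keeps first-seen order, which makes tie-breaking predictable.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            if (!stopwords.contains(token)) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so tokens with equal counts stay in first-seen order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

The returned list could then be added as values of a multi-valued field on
the document before it is passed further down the chain.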

Best
Erick

On Sun, Jul 8, 2012 at 4:01 PM, Pranav Prakash <pra...@gmail.com> wrote:
> Hi,
>
> I want to store the top 5 high-frequency non-stopword words. I use DIH to
> import data. Now I have two approaches -
>
>    1. Use DIH JavaScript to find top 5 frequency words and put them in a
>    copy field. The copy field will then stem it and remove stop words based on
>    appropriate tokenizers.
>    2. Write a custom function for the same and add it to
>    UpdateRequestProcessor Chain.
>
> Which of the two would be better suited? I find the first approach rather
> simple, but the issue is that I won't have access to stop words,
> synonyms, etc. at DIH time.
>
> In the second approach, if I add it to the UpdateRequestProcessor Chain and
> insert the function after StopWordsFilterFactory and
> DuplicateRemoveFilterFactory, would that be a good way of doing this?
>
> --
> *Pranav Prakash*
>
> "temet nosce"
