Steve, Thank you thank you so much. You guys are awesome.
Steve, how can I learn about the Lucene indexing process in more detail? E.g., after we send documents for indexing, which functions are called until the doc is actually stored in the index files? I would be thankful if you could guide me here.

With Regards
Aman Tandon

On Fri, Jun 19, 2015 at 10:48 AM, Steve Rowe <sar...@gmail.com> wrote:

> Aman,
>
> Solr uses the same token filter instances over and over, calling reset()
> before sending each document through. Your code sets "exhausted" to true
> and then never sets it back to false, so the next time the token filter
> instance is used, its "exhausted" value is still true, so no input stream
> tokens are ever concatenated again.
>
> Does that make sense?
>
> Steve
> www.lucidworks.com
>
>> On Jun 19, 2015, at 1:10 AM, Aman Tandon <amantandon...@gmail.com> wrote:
>>
>> Hi Steve,
>>
>>> you never set exhausted to false, and when the filter got reused, it
>>> incorrectly carried state from the previous document.
>>
>> Thanks for replying, but I am not able to understand this.
>>
>> With Regards
>> Aman Tandon
>>
>> On Fri, Jun 19, 2015 at 10:25 AM, Steve Rowe <sar...@gmail.com> wrote:
>>
>>> Hi Aman,
>>>
>>> The admin UI screenshot you linked to is from an older version of Solr.
>>> What version are you using?
>>>
>>> Lots of extraneous angle brackets and asterisks got into your email and
>>> made for a bunch of cleanup work before I could read or edit it. In the
>>> future, please put your code somewhere people can easily read it and
>>> copy/paste it into an editor: a github gist, a paste service, etc.
>>>
>>> Looks to me like your use of "exhausted" is unnecessary, and is likely
>>> the cause of the problem you saw (only one document getting processed):
>>> you never set exhausted to false, and when the filter got reused, it
>>> incorrectly carried state from the previous document.
>>>
>>> Here's a simpler version that's hopefully more correct and more efficient
>>> (2 fewer copies from the StringBuilder to the final token). Note: I
>>> didn't test it:
>>>
>>> https://gist.github.com/sarowe/9b9a52b683869ced3a17
>>>
>>> Steve
>>> www.lucidworks.com
>>>
>>>> On Jun 18, 2015, at 11:33 AM, Aman Tandon <amantandon...@gmail.com> wrote:
>>>>
>>>> Please help, what am I doing wrong here? Please guide me.
>>>>
>>>> With Regards
>>>> Aman Tandon
>>>>
>>>> On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon <amantandon...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I created a token concat filter to concatenate all the tokens from the
>>>>> token stream. It creates the concatenated token as expected.
>>>>>
>>>>> But when I am posting an xml containing more than 30,000 documents,
>>>>> then only the first document has the data for that field.
>>>>>
>>>>> Schema:
>>>>>
>>>>> <field name="titlex" type="text" indexed="true" stored="false"
>>>>>   required="false" omitNorms="false" multiValued="false" />
>>>>>
>>>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>>>>   <analyzer type="index">
>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>>       generateNumberParts="1" catenateWords="0" catenateNumbers="1"
>>>>>       catenateAll="0" splitOnCaseChange="1"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
>>>>>       outputUnigrams="true" tokenSeparator=""/>
>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>       protected="protwords.txt"/>
>>>>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>>>>     <filter class="solr.SynonymFilterFactory"
>>>>>       synonyms="stemmed_synonyms_text_prime_ex_index.txt"
>>>>>       ignoreCase="true" expand="true"/>
>>>>>   </analyzer>
>>>>>   <analyzer type="query">
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>       ignoreCase="true" expand="true"/>
>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>       words="stopwords_text_prime_search.txt"
>>>>>       enablePositionIncrements="true" />
>>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>>       generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>>>>       catenateAll="0" splitOnCaseChange="1"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>       protected="protwords.txt"/>
>>>>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>>>>   </analyzer>
>>>>> </fieldType>
>>>>>
>>>>> Please help me. The code for the filter is as follows, please take a look.
>>>>>
>>>>> Here is a picture of what the filter is doing:
>>>>> <http://i.imgur.com/THCsYtG.png?1>
>>>>>
>>>>> The code of the concat filter is:
>>>>>
>>>>> package com.xyz.analysis.concat;
>>>>>
>>>>> import java.io.IOException;
>>>>>
>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>>> import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
>>>>>
>>>>> public class ConcatenateWordsFilter extends TokenFilter {
>>>>>
>>>>>   private CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
>>>>>   private OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
>>>>>   PositionIncrementAttribute posIncr = addAttribute(PositionIncrementAttribute.class);
>>>>>   TypeAttribute typeAtrr = addAttribute(TypeAttribute.class);
>>>>>
>>>>>   private StringBuilder stringBuilder = new StringBuilder();
>>>>>   private boolean exhausted = false;
>>>>>
>>>>>   /**
>>>>>    * Creates a new ConcatenateWordsFilter
>>>>>    * @param input TokenStream that will be filtered
>>>>>    */
>>>>>   public ConcatenateWordsFilter(TokenStream input) {
>>>>>     super(input);
>>>>>   }
>>>>>
>>>>>   /**
>>>>>    * {@inheritDoc}
>>>>>    */
>>>>>   @Override
>>>>>   public final boolean incrementToken() throws IOException {
>>>>>     while (!exhausted && input.incrementToken()) {
>>>>>       char terms[] = charTermAttribute.buffer();
>>>>>       int termLength = charTermAttribute.length();
>>>>>       if (typeAtrr.type().equals("<ALPHANUM>")) {
>>>>>         stringBuilder.append(terms, 0, termLength);
>>>>>       }
>>>>>       charTermAttribute.copyBuffer(terms, 0, termLength);
>>>>>       return true;
>>>>>     }
>>>>>
>>>>>     if (!exhausted) {
>>>>>       exhausted = true;
>>>>>       String sb = stringBuilder.toString();
>>>>>       System.err.println("The Data got is " + sb);
>>>>>       int sbLength = sb.length();
>>>>>       //posIncr.setPositionIncrement(0);
>>>>>       charTermAttribute.copyBuffer(sb.toCharArray(), 0, sbLength);
>>>>>       offsetAttribute.setOffset(offsetAttribute.startOffset(),
>>>>>           offsetAttribute.startOffset() + sbLength);
>>>>>       stringBuilder.setLength(0);
>>>>>       //typeAtrr.setType("CONCATENATED");
>>>>>       return true;
>>>>>     }
>>>>>     return false;
>>>>>   }
>>>>> }
>>>>>
>>>>> With Regards
>>>>> Aman Tandon
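
The fix Steve describes amounts to clearing per-document state whenever the filter instance is reused. Below is a minimal, self-contained sketch of that reuse lifecycle. Note that `ConcatFilter`, `reset(List)`, and `next()` here are hypothetical stand-ins, not the real Lucene `TokenFilter` API, so this runs without Lucene on the classpath; it only illustrates why state must be cleared on reset.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in for a reusable token filter that passes tokens through
// and emits one extra concatenated token at the end of the stream.
class ConcatFilter {
    private Iterator<String> input;
    private final StringBuilder sb = new StringBuilder();
    private boolean exhausted = false;

    // Solr/Lucene reuse the same filter instance per document;
    // reset() is the hook where per-document state must be cleared.
    void reset(List<String> tokens) {
        this.input = tokens.iterator();
        this.exhausted = false;   // the reset missing from the original filter
        this.sb.setLength(0);
    }

    // Returns the next token, the concatenated token once the input is
    // consumed, and null when fully done.
    String next() {
        while (!exhausted && input.hasNext()) {
            String t = input.next();
            sb.append(t);
            return t;
        }
        if (!exhausted) {
            exhausted = true;
            String cat = sb.toString();
            sb.setLength(0);
            return cat;
        }
        return null;
    }

    public static void main(String[] args) {
        ConcatFilter f = new ConcatFilter();
        // Document 1
        f.reset(Arrays.asList("hello", "world"));
        for (String t = f.next(); t != null; t = f.next()) System.out.println(t);
        // Document 2: works only because reset() cleared "exhausted"
        f.reset(Arrays.asList("foo", "bar"));
        for (String t = f.next(); t != null; t = f.next()) System.out.println(t);
    }
}
```

In the real filter, the same idea would presumably take the shape of an overridden `reset()` on `ConcatenateWordsFilter` that calls `super.reset()` and then sets `exhausted = false` and clears the `StringBuilder`, so each new document starts from a clean state.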