Re: Help: Problem in customized token filter

Aman Tandon Thu, 18 Jun 2015 22:12:01 -0700

Hi Steve,


>  you never set exhausted to false, and when the filter got reused, *it
> incorrectly carried state from the previous document.*


Thanks for replying, but I am not able to understand this.

With Regards
Aman Tandon

On Fri, Jun 19, 2015 at 10:25 AM, Steve Rowe <sar...@gmail.com> wrote:

> Hi Aman,
>
> The admin UI screenshot you linked to is from an older version of Solr -
> what version are you using?
>
> Lots of extraneous angle brackets and asterisks got into your email and
> made for a bunch of cleanup work before I could read or edit it.  In the
> future, please put your code somewhere people can easily read it and
> copy/paste it into an editor: into a github gist or on a paste service, etc.
>
> Looks to me like your use of “exhausted” is unnecessary, and is likely the
> cause of the problem you saw (only one document getting processed): you
> never set exhausted to false, and when the filter got reused, it
> incorrectly carried state from the previous document.
>
> Here’s a simpler version that’s hopefully more correct and more efficient
> (2 fewer copies from the StringBuilder to the final token).  Note: I didn’t
> test it:
>
>     https://gist.github.com/sarowe/9b9a52b683869ced3a17
>
> Steve
> www.lucidworks.com
>
> > On Jun 18, 2015, at 11:33 AM, Aman Tandon <amantandon...@gmail.com> wrote:
> >
> > Please help: what am I doing wrong here? Please guide me.
> >
> > With Regards
> > Aman Tandon
> >
> > On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon <amantandon...@gmail.com>
> > wrote:
> >
> >> Hi,
> >>
> >> I created a *token concat filter* to concatenate all the tokens from the
> >> token stream. It creates the concatenated token as expected.
> >>
> >> But when I post an XML file containing more than 30,000 documents, only
> >> the first document has data in that field.
> >>
> >> *Schema:*
> >>
> >> <field name="titlex" type="text" indexed="true" stored="false"
> >>        required="false" omitNorms="false" multiValued="false" />
> >>
> >> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >>   <analyzer type="index">
> >>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
> >>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>     <filter class="solr.WordDelimiterFilterFactory"
> >>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
> >>             outputUnigrams="true" tokenSeparator=""/>
> >>     <filter class="solr.SnowballPorterFilterFactory"
> >>             language="English" protected="protwords.txt"/>
> >>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
> >>     <filter class="solr.SynonymFilterFactory"
> >>             synonyms="stemmed_synonyms_text_prime_ex_index.txt"
> >>             ignoreCase="true" expand="true"/>
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>     <filter class="solr.SynonymFilterFactory"
> >>             synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>             words="stopwords_text_prime_search.txt"
> >>             enablePositionIncrements="true" />
> >>     <filter class="solr.WordDelimiterFilterFactory"
> >>             generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >>             catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>     <filter class="solr.LowerCaseFilterFactory"/>
> >>     <filter class="solr.SnowballPorterFilterFactory"
> >>             language="English" protected="protwords.txt"/>
> >>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
> >>   </analyzer>
> >> </fieldType>
> >>
> >>
> >> Please help me. The code for the filter is below; please take a look.
> >>
> >> Here is a picture of what the filter is doing:
> >> <http://i.imgur.com/THCsYtG.png?1>
> >>
> >> The code of the concat filter is:
> >>
> >> package com.xyz.analysis.concat;
> >>
> >> import java.io.IOException;
> >>
> >> import org.apache.lucene.analysis.TokenFilter;
> >> import org.apache.lucene.analysis.TokenStream;
> >> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> >> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> >> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
> >> import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
> >>
> >> public class ConcatenateWordsFilter extends TokenFilter {
> >>
> >>   private CharTermAttribute charTermAttribute =
> >>       addAttribute(CharTermAttribute.class);
> >>   private OffsetAttribute offsetAttribute =
> >>       addAttribute(OffsetAttribute.class);
> >>   PositionIncrementAttribute posIncr =
> >>       addAttribute(PositionIncrementAttribute.class);
> >>   TypeAttribute typeAtrr = addAttribute(TypeAttribute.class);
> >>
> >>   private StringBuilder stringBuilder = new StringBuilder();
> >>   private boolean exhausted = false;
> >>
> >>   /**
> >>    * Creates a new ConcatenateWordsFilter
> >>    * @param input TokenStream that will be filtered
> >>    */
> >>   public ConcatenateWordsFilter(TokenStream input) {
> >>     super(input);
> >>   }
> >>
> >>   /**
> >>    * {@inheritDoc}
> >>    */
> >>   @Override
> >>   public final boolean incrementToken() throws IOException {
> >>     while (!exhausted && input.incrementToken()) {
> >>       char terms[] = charTermAttribute.buffer();
> >>       int termLength = charTermAttribute.length();
> >>       if (typeAtrr.type().equals("<ALPHANUM>")) {
> >>         stringBuilder.append(terms, 0, termLength);
> >>       }
> >>       charTermAttribute.copyBuffer(terms, 0, termLength);
> >>       return true;
> >>     }
> >>
> >>     if (!exhausted) {
> >>       exhausted = true;
> >>       String sb = stringBuilder.toString();
> >>       System.err.println("The Data got is " + sb);
> >>       int sbLength = sb.length();
> >>       //posIncr.setPositionIncrement(0);
> >>       charTermAttribute.copyBuffer(sb.toCharArray(), 0, sbLength);
> >>       offsetAttribute.setOffset(offsetAttribute.startOffset(),
> >>           offsetAttribute.startOffset() + sbLength);
> >>       stringBuilder.setLength(0);
> >>       //typeAtrr.setType("CONCATENATED");
> >>       return true;
> >>     }
> >>     return false;
> >>   }
> >> }
> >>
> >> With Regards
> >> Aman Tandon
> >>
>
>
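
The state-reuse problem Steve describes can be reproduced without Lucene. What follows is a minimal sketch, not the code from the gist: ConcatSketch, its resetClearsState flag, and analyze() are hypothetical names invented for illustration, and plain string lists stand in for a TokenStream. It assumes that one filter instance is reused across documents and reset before each one. In a real Lucene TokenFilter, the usual place to clear such per-document state is an overridden reset() that also calls super.reset().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Lucene-free stand-in for a reusable concatenating filter (all names hypothetical). */
final class ConcatSketch {
    private final StringBuilder buffer = new StringBuilder();
    private final boolean resetClearsState;
    private Iterator<String> input;
    private boolean exhausted = false;

    ConcatSketch(boolean resetClearsState) {
        this.resetClearsState = resetClearsState;
    }

    /** Called once before each document, the way a reused filter chain is reset. */
    void reset(List<String> tokens) {
        input = tokens.iterator();
        if (resetClearsState) {
            exhausted = false;   // the step the original filter never performs
            buffer.setLength(0);
        }
    }

    /** Returns the next token, or null when the stream is finished. */
    String next() {
        if (!exhausted && input.hasNext()) {
            String token = input.next();
            buffer.append(token);
            return token;        // pass the original token through
        }
        if (!exhausted) {
            exhausted = true;    // emit the concatenated token exactly once
            String joined = buffer.toString();
            buffer.setLength(0);
            return joined;
        }
        return null;
    }

    /** Runs one document through the (reused) filter and collects its tokens. */
    static List<String> analyze(ConcatSketch filter, List<String> tokens) {
        filter.reset(tokens);
        List<String> out = new ArrayList<>();
        for (String t = filter.next(); t != null; t = filter.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        ConcatSketch buggy = new ConcatSketch(false);
        System.out.println(analyze(buggy, Arrays.asList("red", "car")));   // [red, car, redcar]
        System.out.println(analyze(buggy, Arrays.asList("blue", "bike"))); // []

        ConcatSketch fixed = new ConcatSketch(true);
        System.out.println(analyze(fixed, Arrays.asList("red", "car")));   // [red, car, redcar]
        System.out.println(analyze(fixed, Arrays.asList("blue", "bike"))); // [blue, bike, bluebike]
    }
}
```

With the flag off, the second document yields no tokens at all, because exhausted is still true from the first document. That matches the symptom of only the first document having data in the field.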
