Aman,

My version won’t produce anything at all, since incrementToken() always returns 
false…

I updated the gist (at the same URL) to fix the problem by returning true from 
incrementToken() once and then false until reset() is called.  It also handles 
the case when the concatenated token is zero length by not emitting a token.
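
In case it helps while you read the gist, the emit-once logic looks roughly
like this (an untested sketch using the attribute names from the code quoted
below; the gist is the real version, and offset/position handling is omitted
here):

    @Override
    public final boolean incrementToken() throws IOException {
      if (exhausted) {
        return false;                   // nothing more until reset() clears the flag
      }
      // Drain the input, collecting the terms to concatenate.
      while (input.incrementToken()) {
        if (typeAtrr.type().equals("<ALPHANUM>")) {
          stringBuilder.append(charTermAttribute.buffer(), 0, charTermAttribute.length());
        }
      }
      exhausted = true;                 // return true at most once per stream
      if (stringBuilder.length() == 0) {
        return false;                   // zero-length concatenation: emit no token
      }
      charTermAttribute.setEmpty().append(stringBuilder);
      stringBuilder.setLength(0);
      return true;
    }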

Steve
www.lucidworks.com

> On Jun 19, 2015, at 12:55 AM, Steve Rowe <sar...@gmail.com> wrote:
> 
> Hi Aman,
> 
> The admin UI screenshot you linked to is from an older version of Solr - what 
> version are you using?
> 
> Lots of extraneous angle brackets and asterisks got into your email and made 
> for a bunch of cleanup work before I could read or edit it.  In the future, 
> please put your code somewhere people can easily read it and copy/paste it 
> into an editor: into a github gist or on a paste service, etc.
> 
> Looks to me like your use of “exhausted” is unnecessary, and is likely the 
> cause of the problem you saw (only one document getting processed): you never 
> set exhausted to false, and when the filter got reused, it incorrectly 
> carried state from the previous document.
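> For example (untested sketch, reusing your field names), overriding reset() 
> to clear that per-document state would look something like:
> 
>     @Override
>     public void reset() throws IOException {
>       super.reset();
>       exhausted = false;            // clear the flag so the filter can be reused
>       stringBuilder.setLength(0);   // drop leftover text from the last document
>     }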
> 
> Here’s a simpler version that’s hopefully more correct and more efficient (2 
> fewer copies from the StringBuilder to the final token).  Note: I didn’t test 
> it:
> 
>    https://gist.github.com/sarowe/9b9a52b683869ced3a17
> 
> Steve
> www.lucidworks.com
> 
>> On Jun 18, 2015, at 11:33 AM, Aman Tandon <amantandon...@gmail.com> wrote:
>> 
>> Please help, what am I doing wrong here? Please guide me.
>> 
>> With Regards
>> Aman Tandon
>> 
>> On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon <amantandon...@gmail.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> I created a *token concat filter* to concatenate all the tokens from the
>>> token stream. It creates the concatenated token as expected.
>>> 
>>> But when I post an XML file containing more than 30,000 documents, only
>>> the first document ends up with data in that field.
>>> 
>>> Schema:
>>> 
>>>> <field name="titlex" type="text" indexed="true" stored="false"
>>>>        required="false" omitNorms="false" multiValued="false" />
>>>> 
>>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>>>   <analyzer type="index">
>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>             generateNumberParts="1" catenateWords="0" catenateNumbers="1"
>>>>             catenateAll="0" splitOnCaseChange="1"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
>>>>             outputUnigrams="true" tokenSeparator=""/>
>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>>>     <filter class="solr.SynonymFilterFactory"
>>>>             synonyms="stemmed_synonyms_text_prime_ex_index.txt"
>>>>             ignoreCase="true" expand="true"/>
>>>>   </analyzer>
>>>>   <analyzer type="query">
>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>             ignoreCase="true" expand="true"/>
>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>             words="stopwords_text_prime_search.txt"
>>>>             enablePositionIncrements="true"/>
>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>>>             catenateAll="0" splitOnCaseChange="1"/>
>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>             protected="protwords.txt"/>
>>>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>>>   </analyzer>
>>>> </fieldType>
>>> 
>>> 
>>> Please help me. The code for the filter is as follows; please take a look.
>>> 
>>> Here is a picture of what the filter is doing:
>>> <http://i.imgur.com/THCsYtG.png?1>
>>> 
>>> The code of the concat filter is:
>>> 
>>>> package com.xyz.analysis.concat;
>>>> 
>>>> import java.io.IOException;
>>>> 
>>>> import org.apache.lucene.analysis.TokenFilter;
>>>> import org.apache.lucene.analysis.TokenStream;
>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>> import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
>>>> 
>>>> public class ConcatenateWordsFilter extends TokenFilter {
>>>> 
>>>>   private CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
>>>>   private OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
>>>>   PositionIncrementAttribute posIncr = addAttribute(PositionIncrementAttribute.class);
>>>>   TypeAttribute typeAtrr = addAttribute(TypeAttribute.class);
>>>> 
>>>>   private StringBuilder stringBuilder = new StringBuilder();
>>>>   private boolean exhausted = false;
>>>> 
>>>>   /**
>>>>    * Creates a new ConcatenateWordsFilter
>>>>    * @param input TokenStream that will be filtered
>>>>    */
>>>>   public ConcatenateWordsFilter(TokenStream input) {
>>>>     super(input);
>>>>   }
>>>> 
>>>>   /**
>>>>    * {@inheritDoc}
>>>>    */
>>>>   @Override
>>>>   public final boolean incrementToken() throws IOException {
>>>>     while (!exhausted && input.incrementToken()) {
>>>>       char terms[] = charTermAttribute.buffer();
>>>>       int termLength = charTermAttribute.length();
>>>>       if (typeAtrr.type().equals("<ALPHANUM>")) {
>>>>         stringBuilder.append(terms, 0, termLength);
>>>>       }
>>>>       charTermAttribute.copyBuffer(terms, 0, termLength);
>>>>       return true;
>>>>     }
>>>> 
>>>>     if (!exhausted) {
>>>>       exhausted = true;
>>>>       String sb = stringBuilder.toString();
>>>>       System.err.println("The Data got is " + sb);
>>>>       int sbLength = sb.length();
>>>>       //posIncr.setPositionIncrement(0);
>>>>       charTermAttribute.copyBuffer(sb.toCharArray(), 0, sbLength);
>>>>       offsetAttribute.setOffset(offsetAttribute.startOffset(),
>>>>           offsetAttribute.startOffset() + sbLength);
>>>>       stringBuilder.setLength(0);
>>>>       //typeAtrr.setType("CONCATENATED");
>>>>       return true;
>>>>     }
>>>>     return false;
>>>>   }
>>>> }
>>> 
>>> With Regards
>>> Aman Tandon
>>> 
> 
