Steve, Thank you thank you so much. You guys are awesome.
Steve, how can I learn about the Lucene indexing process in more detail? E.g., after we send documents for indexing, which functions are called until the doc is actually stored in the index files? I would be thankful if you could guide me here.

With Regards
Aman Tandon

On Fri, Jun 19, 2015 at 10:48 AM, Steve Rowe <sar...@gmail.com> wrote:

> Aman,
>
> Solr uses the same token filter instances over and over, calling reset()
> before sending each document through. Your code sets "exhausted" to true
> and then never sets it back to false, so the next time the token filter
> instance is used, its "exhausted" value is still true, so no input stream
> tokens are ever concatenated again.
>
> Does that make sense?
>
> Steve
> www.lucidworks.com
>
>> On Jun 19, 2015, at 1:10 AM, Aman Tandon <amantandon...@gmail.com> wrote:
>>
>> Hi Steve,
>>
>>> you never set exhausted to false, and when the filter got reused, it
>>> incorrectly carried state from the previous document.
>>
>> Thanks for replying, but I am not able to understand this.
>>
>> With Regards
>> Aman Tandon
>>
>> On Fri, Jun 19, 2015 at 10:25 AM, Steve Rowe <sar...@gmail.com> wrote:
>>
>>> Hi Aman,
>>>
>>> The admin UI screenshot you linked to is from an older version of Solr.
>>> What version are you using?
>>>
>>> Lots of extraneous angle brackets and asterisks got into your email and
>>> made for a bunch of cleanup work before I could read or edit it. In the
>>> future, please put your code somewhere people can easily read it and
>>> copy/paste it into an editor: a github gist, a paste service, etc.
>>>
>>> Looks to me like your use of "exhausted" is unnecessary, and is likely
>>> the cause of the problem you saw (only one document getting processed):
>>> you never set exhausted to false, and when the filter got reused, it
>>> incorrectly carried state from the previous document.
>>>
>>> Here's a simpler version that's hopefully more correct and more efficient
>>> (2 fewer copies from the StringBuilder to the final token). Note: I
>>> didn't test it:
>>>
>>> https://gist.github.com/sarowe/9b9a52b683869ced3a17
>>>
>>> Steve
>>> www.lucidworks.com
>>>
>>>> On Jun 18, 2015, at 11:33 AM, Aman Tandon <amantandon...@gmail.com> wrote:
>>>>
>>>> Please help, what am I doing wrong here? Please guide me.
>>>>
>>>> With Regards
>>>> Aman Tandon
>>>>
>>>> On Thu, Jun 18, 2015 at 4:51 PM, Aman Tandon <amantandon...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I created a token concat filter to concatenate all the tokens from the
>>>>> token stream. It creates the concatenated token as expected.
>>>>>
>>>>> But when I am posting an xml containing more than 30,000 documents,
>>>>> then only the first document has the data for that field.
>>>>>
>>>>> Schema:
>>>>>
>>>>> <field name="titlex" type="text" indexed="true" stored="false"
>>>>>   required="false" omitNorms="false" multiValued="false" />
>>>>>
>>>>> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>>>>   <analyzer type="index">
>>>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>>       generateNumberParts="1" catenateWords="0" catenateNumbers="1"
>>>>>       catenateAll="0" splitOnCaseChange="1"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="3"
>>>>>       outputUnigrams="true" tokenSeparator=""/>
>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>       protected="protwords.txt"/>
>>>>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>>>>     <filter class="solr.SynonymFilterFactory"
>>>>>       synonyms="stemmed_synonyms_text_prime_ex_index.txt"
>>>>>       ignoreCase="true" expand="true"/>
>>>>>   </analyzer>
>>>>>   <analyzer type="query">
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>       ignoreCase="true" expand="true"/>
>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>>>>>       words="stopwords_text_prime_search.txt"
>>>>>       enablePositionIncrements="true" />
>>>>>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>>>>>       generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>>>>>       catenateAll="0" splitOnCaseChange="1"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>>>>>       protected="protwords.txt"/>
>>>>>     <filter class="com.xyz.analysis.concat.ConcatenateWordsFilterFactory"/>
>>>>>   </analyzer>
>>>>> </fieldType>
>>>>>
>>>>> Please help me. The code for the filter is as follows, please take a look.
>>>>>
>>>>> Here is a picture of what the filter is doing:
>>>>> <http://i.imgur.com/THCsYtG.png?1>
>>>>>
>>>>> The code of the concat filter is:
>>>>>
>>>>> package com.xyz.analysis.concat;
>>>>>
>>>>> import java.io.IOException;
>>>>>
>>>>> import org.apache.lucene.analysis.TokenFilter;
>>>>> import org.apache.lucene.analysis.TokenStream;
>>>>> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>>>> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
>>>>> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>>>>> import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
>>>>>
>>>>> public class ConcatenateWordsFilter extends TokenFilter {
>>>>>
>>>>>   private CharTermAttribute charTermAttribute = addAttribute(CharTermAttribute.class);
>>>>>   private OffsetAttribute offsetAttribute = addAttribute(OffsetAttribute.class);
>>>>>   PositionIncrementAttribute posIncr = addAttribute(PositionIncrementAttribute.class);
>>>>>   TypeAttribute typeAtrr = addAttribute(TypeAttribute.class);
>>>>>
>>>>>   private StringBuilder stringBuilder = new StringBuilder();
>>>>>   private boolean exhausted = false;
>>>>>
>>>>>   /**
>>>>>    * Creates a new ConcatenateWordsFilter
>>>>>    * @param input TokenStream that will be filtered
>>>>>    */
>>>>>   public ConcatenateWordsFilter(TokenStream input) {
>>>>>     super(input);
>>>>>   }
>>>>>
>>>>>   /**
>>>>>    * {@inheritDoc}
>>>>>    */
>>>>>   @Override
>>>>>   public final boolean incrementToken() throws IOException {
>>>>>     while (!exhausted && input.incrementToken()) {
>>>>>       char terms[] = charTermAttribute.buffer();
>>>>>       int termLength = charTermAttribute.length();
>>>>>       if (typeAtrr.type().equals("<ALPHANUM>")) {
>>>>>         stringBuilder.append(terms, 0, termLength);
>>>>>       }
>>>>>       charTermAttribute.copyBuffer(terms, 0, termLength);
>>>>>       return true;
>>>>>     }
>>>>>
>>>>>     if (!exhausted) {
>>>>>       exhausted = true;
>>>>>       String sb = stringBuilder.toString();
>>>>>       System.err.println("The Data got is " + sb);
>>>>>       int sbLength = sb.length();
>>>>>       //posIncr.setPositionIncrement(0);
>>>>>       charTermAttribute.copyBuffer(sb.toCharArray(), 0, sbLength);
>>>>>       offsetAttribute.setOffset(offsetAttribute.startOffset(),
>>>>>           offsetAttribute.startOffset() + sbLength);
>>>>>       stringBuilder.setLength(0);
>>>>>       //typeAtrr.setType("CONCATENATED");
>>>>>       return true;
>>>>>     }
>>>>>     return false;
>>>>>   }
>>>>> }
>>>>>
>>>>> With Regards
>>>>> Aman Tandon
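
The fix Steve describes amounts to clearing per-document state whenever the filter instance is reused. Below is a minimal, self-contained sketch of that reuse lifecycle. Note that `ConcatFilter`, `reset(List)`, and `next()` here are hypothetical stand-ins, not the real Lucene `TokenFilter` API, so this runs without Lucene on the classpath; it only illustrates why state must be cleared on reset.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in for a reusable token filter that passes tokens through
// and emits one extra concatenated token at the end of the stream.
class ConcatFilter {
    private Iterator<String> input;
    private final StringBuilder sb = new StringBuilder();
    private boolean exhausted = false;

    // Solr/Lucene reuse the same filter instance per document;
    // reset() is the hook where per-document state must be cleared.
    void reset(List<String> tokens) {
        this.input = tokens.iterator();
        this.exhausted = false;   // the reset missing from the original filter
        this.sb.setLength(0);
    }

    // Returns the next token, the concatenated token once the input is
    // consumed, and null when fully done.
    String next() {
        while (!exhausted && input.hasNext()) {
            String t = input.next();
            sb.append(t);
            return t;
        }
        if (!exhausted) {
            exhausted = true;
            String cat = sb.toString();
            sb.setLength(0);
            return cat;
        }
        return null;
    }

    public static void main(String[] args) {
        ConcatFilter f = new ConcatFilter();
        // Document 1
        f.reset(Arrays.asList("hello", "world"));
        for (String t = f.next(); t != null; t = f.next()) System.out.println(t);
        // Document 2: works only because reset() cleared "exhausted"
        f.reset(Arrays.asList("foo", "bar"));
        for (String t = f.next(); t != null; t = f.next()) System.out.println(t);
    }
}
```

In the real filter, the same idea would presumably take the shape of an overridden `reset()` on `ConcatenateWordsFilter` that calls `super.reset()` and then sets `exhausted = false` and clears the `StringBuilder`, so each new document starts from a clean state.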