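Hi,

Understood. Write a custom Reader (for example, a wrapper around FileReader) that filters out the text you do not want before it reaches the analyzer. Because it filters while reading, the file is processed as a stream and is never held in memory all at once.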
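A minimal sketch of such a filtering Reader, in case it helps. The class name DataBlockSkippingReader is made up for illustration, and it assumes that DATA_BEGIN and DATA_END each sit on a line of their own, as in your sample. It uses only java.io, so it should not depend on the Lucene version:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper: passes text through line by line, but drops everything
// from a DATA_BEGIN line up to and including the matching DATA_END line.
// Only one line is buffered at a time, so large files are never fully loaded.
class DataBlockSkippingReader extends Reader {
    private final BufferedReader in;
    private String current = ""; // the kept line currently being handed out
    private int pos = 0;         // read position inside 'current'
    private boolean eof = false;

    DataBlockSkippingReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    // Loads the next line that lies outside a data block.
    // Returns false when the underlying input is exhausted.
    private boolean fill() throws IOException {
        boolean skipping = false;
        String line;
        while ((line = in.readLine()) != null) {
            String trimmed = line.trim();
            if (skipping) {
                // An unterminated DATA_BEGIN skips to the end of the input.
                if (trimmed.equals("DATA_END")) {
                    skipping = false;
                }
                continue;
            }
            if (trimmed.equals("DATA_BEGIN")) {
                skipping = true;
                continue;
            }
            // readLine() strips the terminator; put a newline back so that
            // tokens on adjacent lines stay separated.
            current = line + "\n";
            pos = 0;
            return true;
        }
        eof = true;
        return false;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (pos >= current.length()) {
            if (eof || !fill()) {
                return -1;
            }
        }
        int n = Math.min(len, current.length() - pos);
        current.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

You would then wrap your existing FileReader in it and pass the result to the Field(String name, Reader reader) constructor you are already using, so StandardAnalyzer only ever sees the text outside the data blocks.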
Glen

On Mon, Feb 27, 2012 at 12:46 PM, Prakash Reddy Bande <praka...@altair.com> wrote:
> Hi,
>
> Description is multiline, and in addition there is other text as well. So,
> essentially what I need is to jump to DATA_END as soon as I hit DATA_BEGIN.
>
> I am creating the field using the constructor Field(String name, Reader
> reader) and using StandardAnalyzer. Right now I am using FileReader, which is
> causing all the text to be indexed/tokenized.
>
> The amount of text I am interested in is also pretty large; description is just
> one such example. So, I really want some stream-based implementation to avoid
> keeping large amounts of text in memory. Maybe a custom TokenStream, but I
> don't know what to implement in TokenStream. The only abstract method is
> incrementToken, and I have no idea what to do in it.
>
> Regards,
>
> Prakash Bande
> Director - Hyperworks Enterprise Software
> Altair Eng. Inc.
> Troy MI
> Ph: 248-614-2400 ext 489
> Cell: 248-404-0292
>
> -----Original Message-----
> From: Glen Newton [mailto:glen.new...@gmail.com]
> Sent: Monday, February 27, 2012 12:05 PM
> To: java-user@lucene.apache.org
> Subject: Re: Customizing indexing of large files
>
> I'd suggest writing a perl script or
> insert-favourite-scripting-language-here script to pre-filter this
> content out of the files before it gets to Lucene/Solr.
> Or you could just grep for "Data" and "Description" (or is
> 'Description' multi-line)?
>
> -Glen Newton
>
> On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
> <praka...@altair.com> wrote:
>> Hi,
>>
>> I want to customize the indexing of some specific kind of files I have. I am
>> using 2.9.3 but upgrading is possible.
>> This is how my file's data looks:
>>
>> *****************************
>> Data for 2010
>> Description: This section has a general description of the data.
>> DATA_BEGIN
>> Month P1 P2 P3
>> 01 3243.433 43534.324 45345.2443
>> 02 3242.324 234234.24 323.2343
>> ...
>> ...
>> ...
>> ...
>> DATA_END
>> Data for 2011
>> Description: This section has a general description of the data.
>> DATA_BEGIN
>> Month P1 P2 P3
>> 01 3243.433 43534.324 45345.2443
>> 02 3242.324 234234.24 323.2343
>> ...
>> ...
>> ...
>> ...
>> DATA_END
>> *****************************
>>
>> I would like to use a StandardAnalyzer, but do not want to index the data of
>> the columns, i.e. skip all those numbers. Basically, as soon as I hit the
>> keyword DATA_BEGIN, I want to jump to DATA_END.
>> So, what is the best approach? Using a custom Reader, a custom tokenizer or
>> some other mechanism?
>> Regards,
>>
>> Prakash Bande
>> Altair Eng. Inc.
>> Troy MI
>> Ph: 248-614-2400 ext 489
>> Cell: 248-404-0292
>>
>
> --
> -
> http://zzzoot.blogspot.com/
> -

--
-
http://zzzoot.blogspot.com/
-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org