Hi,

The description is multiline, and there is other text as well. So essentially 
what I need is to jump to DATA_END as soon as I hit DATA_BEGIN.

I am creating the field using the constructor Field(String name, Reader reader) 
and a StandardAnalyzer. Right now I am using a FileReader, which causes 
all the text to be indexed/tokenized.

The amount of text I am dealing with is also pretty large; the description is just 
one such example. So I really want a stream-based implementation to avoid 
keeping a large amount of text in memory. Maybe a custom TokenStream, but I 
don't know what to implement in a TokenStream. The only abstract method is 
incrementToken, and I have no idea what to do in it.
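For the record, here is a rough, untested sketch of the kind of stream-based approach I have in mind: a plain java.io.Reader wrapper that drops everything between DATA_BEGIN and DATA_END line by line before the analyzer ever sees it. SkippingReader is a made-up name, and it assumes the markers appear on lines of their own; it is a sketch, not a verified implementation.

```java
import java.io.*;

/**
 * Sketch: a Reader that streams its source line by line and drops
 * everything between DATA_BEGIN and DATA_END (markers included).
 * Only one line is buffered at a time, so memory stays small.
 */
class SkippingReader extends Reader {
    private final BufferedReader in;
    private String line = null;   // current filtered line, newline restored
    private int pos = 0;          // read position within 'line'
    private boolean skipping = false;

    SkippingReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    /** Advances to the next line that should be passed through. */
    private boolean fill() throws IOException {
        while (line == null || pos >= line.length()) {
            String next = in.readLine();
            if (next == null) return false;          // end of stream
            String trimmed = next.trim();
            if (trimmed.equals("DATA_BEGIN")) { skipping = true;  continue; }
            if (trimmed.equals("DATA_END"))   { skipping = false; continue; }
            if (skipping) continue;                  // drop table rows
            line = next + "\n";  // restore the newline readLine() strips
            pos = 0;
        }
        return true;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (!fill()) return -1;
        int n = Math.min(len, line.length() - pos);
        line.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

The idea would then be to hand this to the same field constructor as before, something like new Field("contents", new SkippingReader(new FileReader(path))), so the analyzer only tokenizes the filtered stream.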

Regards,
 
Prakash Bande
Director - Hyperworks Enterprise Software 
Altair Eng. Inc. 
Troy MI
Ph: 248-614-2400 ext 489
Cell: 248-404-0292

-----Original Message-----
From: Glen Newton [mailto:glen.new...@gmail.com] 
Sent: Monday, February 27, 2012 12:05 PM
To: java-user@lucene.apache.org
Subject: Re: Customizing indexing of large files

I'd suggest writing a perl script or
insert-favourite-scripting-language-here script to pre-filter this
content out of the files before it gets to Lucene/Solr.
Or you could just grep for "Data" and "Description" (or is
'Description' multi-line)?

-Glen Newton

On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
<praka...@altair.com> wrote:
> Hi,
>
> I want to customize the indexing of some specific kinds of files I have. I am 
> using Lucene 2.9.3, but upgrading is possible.
> This is what my file's data looks like:
>
> *****************************
> Data for 2010
> Description: This section has a general description of the data.
> DATA_BEGIN
> Month       P1          P2          P3
> 01          3243.433    43534.324   45345.2443
> 02          3242.324    234234.24   323.2343
> ...
> ...
> ...
> ...
> DATA_END
> Data for 2011
> Description: This section has a general description of the data.
> DATA_BEGIN
> Month       P1          P2          P3
> 01          3243.433    43534.324   45345.2443
> 02          3242.324    234234.24   323.2343
> ...
> ...
> ...
> ...
> DATA_END
> *****************************
>
> I would like to use a StandardAnalyzer, but I do not want to index the data in 
> the columns, i.e. I want to skip all those numbers. Basically, as soon as I hit 
> the keyword DATA_BEGIN, I want to jump to DATA_END.
> So, what is the best approach? A custom Reader, a custom tokenizer, or 
> some other mechanism?
> Regards,
>
> Prakash Bande
> Altair Eng. Inc.
> Troy MI
> Ph: 248-614-2400 ext 489
> Cell: 248-404-0292
>



-- 
-
http://zzzoot.blogspot.com/
-

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

