Re: Customizing indexing of large files

Glen Newton Mon, 27 Feb 2012 09:58:23 -0800

Hi,

Understood.
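Write a custom Reader (for example, one that wraps your FileReader) that
filters out the text you do not want.
It filters as it reads, so the file is processed as a stream and is never
held in memory all at once.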
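
A minimal sketch of such a filtering Reader, assuming DATA_BEGIN and
DATA_END each sit on a line of their own, as in your sample. The class name
SectionSkippingReader is made up for illustration, and line endings are
normalized to "\n". It uses only java.io, so it is not tied to any Lucene
version:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/**
 * Hypothetical helper: streams text line by line and drops everything from
 * a DATA_BEGIN line through the next DATA_END line (markers included).
 * Only one line is buffered at a time.
 */
class SectionSkippingReader extends Reader {
    private final BufferedReader in;
    private String pending = ""; // current kept line, not yet fully handed out
    private int pos = 0;         // read position inside pending
    private boolean eof = false;

    SectionSkippingReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    /** Loads the next kept line into pending; returns false at end of input. */
    private boolean fill() throws IOException {
        while (!eof) {
            String line = in.readLine();
            if (line == null) {
                eof = true;
                break;
            }
            if (line.trim().equals("DATA_BEGIN")) {
                // Discard lines up to and including DATA_END (or end of input).
                while ((line = in.readLine()) != null
                        && !line.trim().equals("DATA_END")) {
                    // skipped
                }
                if (line == null) {
                    eof = true;
                    break;
                }
                continue;
            }
            pending = line + "\n";
            pos = 0;
            return true;
        }
        return false;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (pos >= pending.length() && !fill()) {
            return -1;
        }
        int n = Math.min(len, pending.length() - pos);
        pending.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    /** Small demo on an in-memory sample; prints the text that survives. */
    public static void main(String[] args) throws IOException {
        String sample = "Data for 2010\nDescription: general description\n"
                + "DATA_BEGIN\n01 3243.433 43534.324\nDATA_END\nData for 2011\n";
        try (BufferedReader r = new BufferedReader(
                new SectionSkippingReader(new StringReader(sample)))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

You would then pass new SectionSkippingReader(new FileReader(file)) to the
Field(String name, Reader reader) constructor in place of the plain
FileReader, and keep your analyzer unchanged.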


Glen

On Mon, Feb 27, 2012 at 12:46 PM, Prakash Reddy Bande
<[email protected]> wrote:
> Hi,
>
> Description is multiline, and there is other text as well. So,
> essentially what I need is to jump to DATA_END as soon as I hit DATA_BEGIN.
>
> I am creating the field using the constructor Field(String name, Reader
> reader) and using StandardAnalyzer. Right now I am using FileReader, which
> causes all the text to be tokenized and indexed.
>
> The amount of text I am interested in is also pretty large; the description
> is just one example. So, I really want a stream-based implementation to avoid
> keeping a large amount of text in memory. Maybe a custom TokenStream, but I
> don't know what to implement in it. The only abstract method is
> incrementToken, and I have no idea what to do in it.
>
> Regards,
>
> Prakash Bande
> Director - Hyperworks Enterprise Software
> Altair Eng. Inc.
> Troy MI
> Ph: 248-614-2400 ext 489
> Cell: 248-404-0292
>
> -----Original Message-----
> From: Glen Newton [mailto:[email protected]]
> Sent: Monday, February 27, 2012 12:05 PM
> To: [email protected]
> Subject: Re: Customizing indexing of large files
>
> I'd suggest writing a Perl script or
> insert-favourite-scripting-language-here script to pre-filter this
> content out of the files before it gets to Lucene/Solr.
> Or you could just grep for "Data" and "Description" (or is
> "Description" multi-line?).
>
> -Glen Newton
>
> On Mon, Feb 27, 2012 at 11:55 AM, Prakash Reddy Bande
> <[email protected]> wrote:
>> Hi,
>>
>> I want to customize the indexing of a specific kind of file I have. I am
>> using 2.9.3, but upgrading is possible.
>> This is how my file's data looks:
>>
>> *****************************
>> Data for 2010
>> Description: This section has a general description of the data.
>> DATA_BEGIN
>> Month       P1          P2          P3
>> 01          3243.433    43534.324   45345.2443
>> 02          3242.324    234234.24   323.2343
>> ...
>> ...
>> ...
>> ...
>> DATA_END
>> Data for 2011
>> Description: This section has a general description of the data.
>> DATA_BEGIN
>> Month       P1          P2          P3
>> 01          3243.433    43534.324   45345.2443
>> 02          3242.324    234234.24   323.2343
>> ...
>> ...
>> ...
>> ...
>> DATA_END
>> *****************************
>>
>> I would like to use StandardAnalyzer, but do not want to index the data in
>> the columns, i.e. I want to skip all those numbers. Basically, as soon as I
>> hit the keyword DATA_BEGIN, I want to jump to DATA_END.
>> So, what is the best approach: a custom Reader, a custom tokenizer, or
>> some other mechanism?
>> Regards,
>>
>> Prakash Bande
>> Altair Eng. Inc.
>> Troy MI
>> Ph: 248-614-2400 ext 489
>> Cell: 248-404-0292
>>
>
>
>
> --
> -
> http://zzzoot.blogspot.com/
> -
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>



-- 
-
http://zzzoot.blogspot.com/
-

