I was thinking that if I make a wrapper class that aggregates another IndexReader 
and filters out the terms I no longer want, it might work.  I could then pass that 
wrapper into SegmentMerger.  I think filtering the unwanted fields out of 
GetFieldNames(...) and Terms(...) would be enough.

Something like:

HashSet<string> ignoredFields=...;  // names of the fields to drop

FilteringIndexReader wrapper=new FilteringIndexReader(reader, ignoredFields);

SegmentMerger merger=new SegmentMerger(writer);

merger.Add(wrapper);

merger.Merge();
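
In Java Lucene 3.x terms (adjust the naming if you're on Lucene.NET), the
wrapper might look roughly like the sketch below.  FilteringIndexReader and
ignoredFields are my own names, and this only filters the field-name and
terms-enumeration paths; a complete version would probably also need to cover
terms(Term), norms and term vectors the same way:

import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Sketch: presents a view of another IndexReader with some fields hidden.
public class FilteringIndexReader extends FilterIndexReader {

    private final Set<String> ignoredFields;

    public FilteringIndexReader(IndexReader in, Set<String> ignoredFields) {
        super(in);
        this.ignoredFields = ignoredFields;
    }

    // Hide the ignored fields from field-name enumeration.
    @Override
    public Collection<String> getFieldNames(FieldOption fieldOption) {
        Set<String> names = new HashSet<String>(in.getFieldNames(fieldOption));
        names.removeAll(ignoredFields);
        return names;
    }

    // Skip terms belonging to ignored fields during term enumeration.
    @Override
    public TermEnum terms() throws IOException {
        return new FilterTermEnum(in.terms()) {
            @Override
            public boolean next() throws IOException {
                while (super.next()) {
                    Term t = term();
                    if (t == null || !ignoredFields.contains(t.field())) {
                        return true;
                    }
                }
                return false;
            }
        };
    }
}

Note that, as far as I can tell, SegmentMerger is package-private in the Java
version, so the public way to drive the same merge machinery would be
IndexWriter.addIndexes(IndexReader...), e.g.
writer.addIndexes(new FilteringIndexReader(reader, ignoredFields)).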

On Feb 14, 2012, at 1:49 AM, Li Li wrote:

> For method 2, deleting is wrong: we can't delete terms that way.  You
> would also have to hack the .tii and .tis files.
> 
> On Tue, Feb 14, 2012 at 2:46 PM, Li Li <fancye...@gmail.com> wrote:
> 
>> Method 1: dumping data.
>> For stored fields, you can traverse the whole index and save them
>> somewhere else.
>> For indexed-but-not-stored fields it is harder.
>>    If an indexed but not stored field is not analyzed (fields such as
>> id), its values are easy to get from FieldCache.StringIndex (see the
>> sketch below).
>>    But for analyzed fields, though in theory they could be restored from
>> term vectors and term positions, in practice they are hard to recover
>> from the index.
>> 
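>> As a rough sketch of method 1 (Java Lucene 3.x; the "id" field name and
>> the output step are placeholders):
>> 
>> import java.io.IOException;
>> 
>> import org.apache.lucene.document.Document;
>> import org.apache.lucene.index.IndexReader;
>> import org.apache.lucene.search.FieldCache;
>> import org.apache.lucene.store.Directory;
>> 
>> public class DumpIndex {
>>     // Dump every live document's stored fields plus one non-analyzed,
>>     // indexed-but-not-stored field ("id") recovered via FieldCache.
>>     public static void dump(Directory dir) throws IOException {
>>         IndexReader reader = IndexReader.open(dir);
>>         try {
>>             FieldCache.StringIndex idx =
>>                 FieldCache.DEFAULT.getStringIndex(reader, "id");
>>             for (int docId = 0; docId < reader.maxDoc(); docId++) {
>>                 if (reader.isDeleted(docId)) continue;    // skip deletions
>>                 Document stored = reader.document(docId); // stored fields
>>                 String id = idx.lookup[idx.order[docId]]; // indexed-only value
>>                 // ... write `stored` and `id` out for re-indexing ...
>>             }
>>         } finally {
>>             reader.close();
>>         }
>>     }
>> }
>> 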
>> Method 2: hacking the metadata.
>> 1. indexed fields
>>       delete by query, e.g. field:*
>> 2. stored fields
>>       Because all fields of a document are stored sequentially, it's not
>> easy to delete some of them.  This doesn't affect search speed, but if
>> you retrieve stored fields and the useless fields are very long, it will
>> slow retrieval down.
>>       It's also possible to hack these, but it takes more effort to
>> understand the index file format and to traverse the .fdt/.fdx files.
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>> 
>> That page should give you some insight.
>> 
>> 
>> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <bstewart...@gmail.com>wrote:
>> 
>>> Let's say I have a large index (100M docs, 1TB, split between 10
>>> indexes), and a bunch of the "stored" and "indexed" fields are not used
>>> in search at all.  To save memory and disk, I'd like to rebuild that
>>> index *without* those fields, but I no longer have the original
>>> documents to rebuild the entire index with (no full text anymore, etc.).
>>> Is there some way to rebuild or optimize an existing index with only a
>>> subset of the existing indexed fields?  Or, alternatively, is there a
>>> way to avoid loading some indexed fields at all (to avoid loading the
>>> term infos and the terms index)?
>>> 
>>> Thanks
>>> Bob
>> 
