I was thinking that a wrapper class that aggregates another IndexReader and filters out the terms I no longer want might work. I would then pass that wrapper into SegmentMerger. I think filtering in GetFieldNames(...) and Terms(...) may be enough.
Something like:

    HashSet<string> ignoredTerms=...;
    FilteringIndexReader wrapper=new FilterIndexReader(reader);
    SegmentMerger merger=new SegmentMerger(writer);
    merger.add(wrapper);
    merger.Merge();

On Feb 14, 2012, at 1:49 AM, Li Li wrote:

> for method 2, delete is wrong. we can't delete terms.
> you also should hack with the tii and tis file.
>
> On Tue, Feb 14, 2012 at 2:46 PM, Li Li <fancye...@gmail.com> wrote:
>
>> method1, dumping data
>> for stored fields, you can traverse the whole index and save it to
>> somewhere else.
>> for indexed but not stored fields, it may be more difficult.
>> if the indexed and not stored field is not analyzed(fields such as
>> id), it's easy to get from FieldCache.StringIndex.
>> But for analyzed fields, though theoretically it can be restored from
>> term vector and term position, it's hard to recover from index.
>>
>> method 2, hack with metadata
>> 1. indexed fields
>> delete by query, e.g. field:*
>> 2. stored fields
>> because all fields are stored sequentially. it's not easy to delete
>> some fields. this will not affect search speed. but if you want to get
>> stored fields, and the useless fields are very long, then it will slow
>> down.
>> also it's possible to hack with it. but need more effort to
>> understand the index file format and traverse the fdt/fdx file.
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>>
>> this will give you some insight.
>>
>>
>> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <bstewart...@gmail.com> wrote:
>>
>>> Lets say I have a large index (100M docs, 1TB, split up between 10
>>> indexes). And a bunch of the "stored" and "indexed" fields are not used in
>>> search at all. In order to save memory and disk, I'd like to rebuild that
>>> index *without* those fields, but I don't have original documents to
>>> rebuild entire index with (don't have the full-text anymore, etc.). Is
>>> there some way to rebuild or optimize an existing index with only a sub-set
>>> of the existing indexed fields? Or alternatively is there a way to avoid
>>> loading some indexed fields at all ( to avoid loading term infos and terms
>>> index ) ?
>>>
>>> Thanks
>>> Bob
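
To make the filtering-wrapper idea from the top of this message concrete, here is a rough, self-contained Java sketch of its shape only. It does not use the Lucene or Lucene.NET API: TermSource, MapTermSource, FilteringTermSource and ToyMerger are hypothetical stand-ins for IndexReader, a real segment, the wrapper, and SegmentMerger. It also assumes the ignored set holds field names, which is a guess at what the ignoredTerms set would contain. Whether a real SegmentMerger only consults the field-name and term-enumeration calls is not verified here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical stand-in for an index reader: exposes field names and the
// terms of each field. This is NOT the Lucene API.
interface TermSource {
    Set<String> fieldNames();
    List<String> terms(String field);
}

// Toy in-memory "segment" used as the wrapped reader.
final class MapTermSource implements TermSource {
    private final SortedMap<String, List<String>> termsByField = new TreeMap<>();

    MapTermSource put(String field, String... terms) {
        termsByField.put(field, new ArrayList<>(Arrays.asList(terms)));
        return this;
    }

    @Override
    public Set<String> fieldNames() {
        return termsByField.keySet();
    }

    @Override
    public List<String> terms(String field) {
        List<String> found = termsByField.get(field);
        return found != null ? found : Collections.<String>emptyList();
    }
}

// The wrapper: delegates to another source but hides the ignored fields
// both from the field-name listing and from term enumeration.
final class FilteringTermSource implements TermSource {
    private final TermSource in;
    private final Set<String> ignoredFields;

    FilteringTermSource(TermSource in, Set<String> ignoredFields) {
        this.in = in;
        this.ignoredFields = ignoredFields;
    }

    @Override
    public Set<String> fieldNames() {
        Set<String> kept = new TreeSet<>(in.fieldNames());
        kept.removeAll(ignoredFields);
        return kept;
    }

    @Override
    public List<String> terms(String field) {
        if (ignoredFields.contains(field)) {
            return Collections.<String>emptyList();
        }
        return in.terms(field);
    }
}

// Toy merger: copies whatever the sources expose into a new structure,
// so anything the wrapper hides never reaches the merged output.
final class ToyMerger {
    static SortedMap<String, SortedSet<String>> merge(TermSource... sources) {
        SortedMap<String, SortedSet<String>> merged = new TreeMap<>();
        for (TermSource source : sources) {
            for (String field : source.fieldNames()) {
                merged.computeIfAbsent(field, f -> new TreeSet<>())
                      .addAll(source.terms(field));
            }
        }
        return merged;
    }
}
```

In this toy model, merging a FilteringTermSource that ignores "body" yields output containing only the remaining fields, while merging the unwrapped source keeps everything. A real index also has stored fields, norms, and term vectors, which this sketch does not model.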