Great, I think you could make it a public tool. Maybe others also need
such functionality.
On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart <bstewart...@gmail.com> wrote:
> I implemented an index shrinker and it works. I reduced my test index
> from 6.6 GB to 3.6 GB by removing a single shingled field I did not
> need anymore. I'm actually using Lucene.Net for this project, so the
> code is C# using the Lucene.Net 2.9.2 API. But the basic idea is:
>
> Create an IndexReader wrapper that only enumerates the terms you want
> to keep, and that removes the unwanted fields from documents when
> returning them.
>
> Use the SegmentMerger to re-write each segment (where each segment is
> wrapped by the wrapper class), writing the new segments to a new
> directory. Collect the SegmentInfos and do a commit in order to
> create a new segments file in the new index directory.
>
> Done - you now have a shrunk index with the specified terms removed.
>
> The implementation uses a separate thread per segment, so it
> re-writes them in parallel. It took about 15 minutes to do a
> 770,000-doc index on my MacBook.
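
For anyone who wants to try the same thing on the Java side, here is a
rough, untested sketch of such a wrapper against the Lucene 3.x API.
The class name and the ignoredFields set are invented for illustration,
and a complete tool would also have to filter terms(Term), norms and
term vectors in the same way:

    import java.io.IOException;
    import java.util.Collection;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.index.FilterIndexReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermEnum;

    public class FieldStrippingReader extends FilterIndexReader {

      private final Set<String> ignoredFields;

      public FieldStrippingReader(IndexReader in, Set<String> ignoredFields) {
        super(in);
        this.ignoredFields = ignoredFields;
      }

      // The FieldInfos of the new segment are built from this, so the
      // dropped fields never show up in the merged segment.
      @Override
      public Collection<String> getFieldNames(FieldOption fieldOption) {
        Collection<String> names =
            new HashSet<String>(in.getFieldNames(fieldOption));
        names.removeAll(ignoredFields);
        return names;
      }

      // Skip every term that belongs to an ignored field; postings of
      // terms the enum never returns are never copied to the new segment.
      @Override
      public TermEnum terms() throws IOException {
        return new FilterTermEnum(in.terms()) {
          @Override
          public boolean next() throws IOException {
            while (in.next()) { // "in" is the wrapped TermEnum here
              if (!ignoredFields.contains(in.term().field())) {
                return true;
              }
            }
            return false;
          }
        };
      }

      // Drop the stored values of the ignored fields as well.
      @Override
      public Document document(int n, FieldSelector selector)
          throws IOException {
        Document doc = in.document(n, selector);
        for (String f : ignoredFields) {
          doc.removeFields(f);
        }
        return doc;
      }
    }

SegmentMerger isn't public in the Java API (you can evidently reach it
from Lucene.Net), so on the Java side the simplest way to drive the
rewrite is probably IndexWriter.addIndexes(IndexReader...) on a fresh
Directory, passing in the wrapped readers.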
> On Tue, Feb 14, 2012 at 10:12 PM, Li Li <fancye...@gmail.com> wrote:
> > I have roughly read the code of the 4.0 trunk. Maybe it's feasible.
> > SegmentMerger.add(IndexReader) will add the readers to be merged,
> > and merge() will call
> >
> >   mergeTerms(segmentWriteState);
> >   mergePerDoc(segmentWriteState);
> >
> > mergeTerms() will construct the fields from the IndexReaders:
> >
> >   for (int readerIndex = 0; readerIndex < mergeState.readers.size();
> >        readerIndex++) {
> >     final MergeState.IndexReaderAndLiveDocs r =
> >         mergeState.readers.get(readerIndex);
> >     final Fields f = r.reader.fields();
> >     final int maxDoc = r.reader.maxDoc();
> >     if (f != null) {
> >       slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
> >       fields.add(f);
> >     }
> >     docBase += maxDoc;
> >   }
> >
> > So if you wrap your IndexReader and override its fields() method,
> > maybe it will work for merging terms.
> >
> > For DocValues, the wrapper can also override AtomicReader.docValues():
> > just return null for the fields you want to remove. Maybe it should
> > traverse the CompositeReader's getSequentialSubReaders() and wrap
> > each AtomicReader.
> >
> > Other things like term vectors and norms are similar.
> >
> > On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart <bstewart...@gmail.com> wrote:
> > > I was thinking that if I make a wrapper class that aggregates
> > > another IndexReader and filters out the terms I don't want anymore,
> > > it might work. And then pass that wrapper into SegmentMerger. I
> > > think if I filter out terms on GetFieldNames(...) and Terms(...)
> > > it might work.
> > >
> > > Something like:
> > >
> > >   HashSet<string> ignoredTerms = ...;
> > >
> > >   FilteringIndexReader wrapper = new FilteringIndexReader(reader);
> > >
> > >   SegmentMerger merger = new SegmentMerger(writer);
> > >
> > >   merger.add(wrapper);
> > >
> > >   merger.Merge();
> > >
> > > On Feb 14, 2012, at 1:49 AM, Li Li wrote:
> > >
> > > > For method 2, delete is wrong - we can't delete terms that way.
> > > > You would also have to hack the tii and tis files.
> > > >
> > > > On Tue, Feb 14, 2012 at 2:46 PM, Li Li <fancye...@gmail.com> wrote:
> > > >
> > > > > Method 1, dumping data:
> > > > > For stored fields, you can traverse the whole index and save
> > > > > them somewhere else.
> > > > > For indexed but not stored fields, it may be more difficult.
> > > > > If an indexed-but-not-stored field is not analyzed (fields such
> > > > > as id), it's easy to get from FieldCache.StringIndex. But for
> > > > > analyzed fields, though theoretically they could be restored
> > > > > from term vectors and term positions, it's hard to recover them
> > > > > from the index.
> > > > >
> > > > > Method 2, hacking the metadata:
> > > > > 1. indexed fields
> > > > >    Delete by query, e.g. field:*
> > > > > 2. stored fields
> > > > >    Because all fields are stored sequentially, it's not easy to
> > > > >    delete some of them. This will not affect search speed, but
> > > > >    if you want to get stored fields and the useless fields are
> > > > >    very long, it will slow things down. It's also possible to
> > > > >    hack this, but it needs more effort to understand the index
> > > > >    file format and traverse the fdt/fdx files.
> > > > >
> > > > > http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
> > > > > will give you some insight.
> > > > >
> > > > > On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <bstewart...@gmail.com> wrote:
> > > > >
> > > > > > Let's say I have a large index (100M docs, 1TB, split up
> > > > > > between 10 indexes), and a bunch of the "stored" and
> > > > > > "indexed" fields are not used in search at all. In order to
> > > > > > save memory and disk, I'd like to rebuild that index
> > > > > > *without* those fields, but I don't have the original
> > > > > > documents to rebuild the entire index with (don't have the
> > > > > > full text anymore, etc.). Is there some way to rebuild or
> > > > > > optimize an existing index with only a sub-set of the
> > > > > > existing indexed fields? Or alternatively, is there a way to
> > > > > > avoid loading some indexed fields at all (to avoid loading
> > > > > > term infos and the terms index)?
> > > > > >
> > > > > > Thanks
> > > > > > Bob
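
P.S. For the 4.0 trunk approach discussed above, the AtomicReader
wrapper could look roughly like this. Again an untested sketch, written
against trunk as of this writing: the class name and the removedFields
set are invented, and since trunk moves fast, FilterAtomicReader /
FilterFields and the exact signatures may already have changed:

    import java.io.IOException;
    import java.util.Set;

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.DocValues;
    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.FilterAtomicReader;
    import org.apache.lucene.index.Terms;

    public class FieldSkippingReader extends FilterAtomicReader {

      private final Set<String> removedFields;

      public FieldSkippingReader(AtomicReader in, Set<String> removedFields) {
        super(in);
        this.removedFields = removedFields;
      }

      // SegmentMerger pulls terms through fields(), so hiding a field
      // here keeps its terms out of the merged segment.
      @Override
      public Fields fields() throws IOException {
        final Fields f = in.fields();
        if (f == null) {
          return null;
        }
        return new FilterFields(f) {
          @Override
          public Terms terms(String field) throws IOException {
            return removedFields.contains(field) ? null : in.terms(field);
          }
          // A complete version would filter the field iterator too, so
          // the merger never even sees the removed field names.
        };
      }

      // Pretend the removed fields never had doc values.
      @Override
      public DocValues docValues(String field) throws IOException {
        return removedFields.contains(field) ? null : in.docValues(field);
      }
    }

For a composite index you would walk
CompositeReader.getSequentialSubReaders(), wrap each AtomicReader like
this, and feed the wrapped readers to the merge; term vectors and norms
need the same kind of filtering.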