I will test it with my big production indexes first; if it works, I'll port it to Java and add it to contrib, I think.
On Wed, Feb 15, 2012 at 10:03 PM, Li Li <fancye...@gmail.com> wrote:
> great. I think you could make it a public tool. maybe others also need
> such functionality.
>
> On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart <bstewart...@gmail.com>
> wrote:
>
>> I implemented an index shrinker and it works. I reduced my test index
>> from 6.6 GB to 3.6 GB by removing a single shingled field I did not
>> need anymore. I'm actually using Lucene.Net for this project, so the
>> code is C# against the Lucene.Net 2.9.2 API. But the basic idea is:
>>
>> Create an IndexReader wrapper that only enumerates the terms you want
>> to keep, and that removes those terms from documents when returning
>> documents.
>>
>> Use the SegmentMerger to re-write each segment (where each segment is
>> wrapped by the wrapper class), writing the new segments to a new
>> directory. Collect the SegmentInfos and do a commit in order to create
>> a new segments file in the new index directory.
>>
>> Done - you now have a shrunk index with the specified terms removed.
>>
>> The implementation uses a separate thread for each segment, so it
>> re-writes them in parallel. It took about 15 minutes to do a
>> 770,000-doc index on my MacBook.
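A rough Java sketch of that wrapper idea, for reference, assuming the Lucene 4.x API that grew out of the trunk code discussed below (FilterAtomicReader and its FilterFields helper). FieldPruningReader and fieldsToDrop are illustrative names, not anything from the thread, and only the postings are filtered; stored fields, term vectors, norms and DocValues would need similar overrides:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.Set;

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.FilterAtomicReader;
    import org.apache.lucene.index.Terms;

    // A reader wrapper that hides a set of fields from anything that
    // consumes its postings (e.g. a merge). Only fields() is overridden
    // here; stored fields, term vectors, norms and DocValues would need
    // similar treatment, as noted in the thread.
    public class FieldPruningReader extends FilterAtomicReader {

      private final Set<String> fieldsToDrop; // illustrative name

      public FieldPruningReader(AtomicReader in, Set<String> fieldsToDrop) {
        super(in);
        this.fieldsToDrop = fieldsToDrop;
      }

      @Override
      public Fields fields() throws IOException {
        final Fields f = super.fields();
        if (f == null) {
          return null;
        }
        return new FilterFields(f) {
          @Override
          public Iterator<String> iterator() {
            // Enumerate only the fields we keep.
            List<String> kept = new ArrayList<String>();
            for (Iterator<String> it = super.iterator(); it.hasNext();) {
              String field = it.next();
              if (!fieldsToDrop.contains(field)) {
                kept.add(field);
              }
            }
            return kept.iterator();
          }

          @Override
          public Terms terms(String field) throws IOException {
            // Pretend dropped fields have no terms at all.
            return fieldsToDrop.contains(field) ? null : super.terms(field);
          }

          @Override
          public int size() {
            return -1; // field count unknown after filtering
          }
        };
      }
    }

Because the merge consumes each reader through fields(), hiding a field here keeps its terms out of the rewritten segment.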
>> On Tue, Feb 14, 2012 at 10:12 PM, Li Li <fancye...@gmail.com> wrote:
>> > I have roughly read the code of the 4.0 trunk. maybe it's feasible.
>> > SegmentMerger.add(IndexReader) adds the readers to be merged, and
>> > merge() will call
>> >
>> >     mergeTerms(segmentWriteState);
>> >     mergePerDoc(segmentWriteState);
>> >
>> > mergeTerms() constructs the fields from the IndexReaders:
>> >
>> >     for (int readerIndex = 0; readerIndex < mergeState.readers.size();
>> >          readerIndex++) {
>> >       final MergeState.IndexReaderAndLiveDocs r =
>> >           mergeState.readers.get(readerIndex);
>> >       final Fields f = r.reader.fields();
>> >       final int maxDoc = r.reader.maxDoc();
>> >       if (f != null) {
>> >         slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
>> >         fields.add(f);
>> >       }
>> >       docBase += maxDoc;
>> >     }
>> >
>> > So if you wrap your IndexReader and override its fields() method,
>> > maybe it will work for merging terms.
>> >
>> > For DocValues, the wrapper can also override AtomicReader.docValues()
>> > and just return null for the fields you want to remove. maybe it
>> > should traverse the CompositeReader's getSequentialSubReaders() and
>> > wrap each AtomicReader.
>> >
>> > other things like term vectors and norms are similar.
>> >
>> > On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart
>> > <bstewart...@gmail.com> wrote:
>> >
>> >> I was thinking if I make a wrapper class that aggregates another
>> >> IndexReader and filters out the terms I don't want anymore, it might
>> >> work. And then pass that wrapper into SegmentMerger. I think if I
>> >> filter out terms on GetFieldNames(...) and Terms(...) it might work.
>> >>
>> >> Something like:
>> >>
>> >>     HashSet<string> ignoredTerms = ...;
>> >>     FilteringIndexReader wrapper = new FilteringIndexReader(reader, ignoredTerms);
>> >>     SegmentMerger merger = new SegmentMerger(writer);
>> >>     merger.add(wrapper);
>> >>     merger.Merge();
>> >>
>> >> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
>> >>
>> >> > for method 2, delete is wrong. we can't delete terms.
>> >> > you would also have to hack the tii and tis files.
>> >> >
>> >> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li <fancye...@gmail.com> wrote:
>> >> >
>> >> >> method 1, dumping data:
>> >> >> for stored fields, you can traverse the whole index and save them
>> >> >> somewhere else.
>> >> >> for indexed but not stored fields, it may be more difficult.
>> >> >> if an indexed but not stored field is not analyzed (fields such
>> >> >> as id), it's easy to get from FieldCache.StringIndex.
>> >> >> But for analyzed fields, though theoretically they could be
>> >> >> restored from term vectors and term positions, it's hard to
>> >> >> recover them from the index.
>> >> >>
>> >> >> method 2, hacking the metadata:
>> >> >> 1. indexed fields
>> >> >>    delete by query, e.g. field:*
>> >> >> 2. stored fields
>> >> >>    because all fields are stored sequentially, it's not easy to
>> >> >> delete some fields. this will not affect search speed, but if you
>> >> >> want to get stored fields and the useless fields are very long,
>> >> >> then it will slow things down.
>> >> >> it's also possible to hack with it, but it needs more effort to
>> >> >> understand the index file format and to traverse the fdt/fdx
>> >> >> files.
>> >> >> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>> >> >> will give you some insight.
>> >> >>
>> >> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart
>> >> >> <bstewart...@gmail.com> wrote:
>> >> >>
>> >> >>> Let's say I have a large index (100M docs, 1TB, split up between
>> >> >>> 10 indexes), and a bunch of the "stored" and "indexed" fields
>> >> >>> are not used in search at all. In order to save memory and disk,
>> >> >>> I'd like to rebuild that index *without* those fields, but I
>> >> >>> don't have the original documents to rebuild the entire index
>> >> >>> with (don't have the full text anymore, etc.). Is there some way
>> >> >>> to rebuild or optimize an existing index with only a subset of
>> >> >>> the existing indexed fields? Or alternatively, is there a way to
>> >> >>> avoid loading some indexed fields at all (to avoid loading the
>> >> >>> term infos and the terms index)?
>> >> >>>
>> >> >>> Thanks
>> >> >>> Bob
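For completeness, a sketch of the end-to-end rewrite. The thread drives SegmentMerger directly and commits the collected SegmentInfos; the variant below, under the same Lucene 4.x assumptions, lets the public IndexWriter.addIndexes(IndexReader...) do the merging instead. FieldPruningReader is the hypothetical wrapper sketched earlier, and "shingled_body" is a made-up field name:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class IndexShrinker {
      public static void main(String[] args) throws Exception {
        // Illustrative field name; list the fields you want removed.
        Set<String> drop = new HashSet<String>(Arrays.asList("shingled_body"));

        Directory src = FSDirectory.open(new File(args[0]));
        Directory dst = FSDirectory.open(new File(args[1]));

        DirectoryReader reader = DirectoryReader.open(src);
        IndexWriter writer = new IndexWriter(dst,
            new IndexWriterConfig(Version.LUCENE_40, new KeywordAnalyzer()));

        // Wrap every segment reader so the dropped fields are invisible,
        // then let addIndexes() merge the wrapped segments into the new index.
        List<IndexReader> pruned = new ArrayList<IndexReader>();
        for (AtomicReaderContext ctx : reader.leaves()) {
          pruned.add(new FieldPruningReader(ctx.reader(), drop));
        }
        writer.addIndexes(pruned.toArray(new IndexReader[pruned.size()]));

        writer.close();
        reader.close();
      }
    }

Since nothing is re-analyzed during the merge, the analyzer passed to IndexWriterConfig does not matter here; addIndexes simply copies whatever the wrapped readers expose, so the hidden fields never reach the new index.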