I will test it with my big production indexes first; if it works, I
think I will port it to Java and add it to contrib.

On Wed, Feb 15, 2012 at 10:03 PM, Li Li <fancye...@gmail.com> wrote:
> Great. I think you could make it a public tool; maybe others also need such
> functionality.
>
> On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart <bstewart...@gmail.com>wrote:
>
>> I implemented an index shrinker and it works.  I reduced my test index
>> from 6.6 GB to 3.6 GB by removing a single shingled field I did not
>> need anymore.  I'm actually using Lucene.Net for this project, so the
>> code is C# against the Lucene.Net 2.9.2 API.  But the basic idea is:
>>
>> Create an IndexReader wrapper that only enumerates the terms you want
>> to keep, and that removes terms from documents when returning
>> documents.
>>
>> Use the SegmentMerger to re-write each segment (where each segment is
>> wrapped by the wrapper class), writing the new segment to a new
>> directory.  Collect the SegmentInfos and do a commit in order to create
>> a new segments file in the new index directory.
>>
>> Done - you now have a shrunk index with specified terms removed.
>>
>> The implementation uses a separate thread for each segment, so it
>> re-writes them in parallel.  It took about 15 minutes to process a
>> 770,000-doc index on my MacBook.
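>>
>> Roughly, the top-level flow is something like the sketch below (hedged and
>> untested, written in Java 3.x terms rather than my C#; FilteringIndexReader
>> is a made-up name for the wrapper sketched in my earlier mail below, and I
>> use the public IndexWriter.addIndexes(IndexReader[]) here instead of
>> driving SegmentMerger directly, since addIndexes performs the same segment
>> rewrite internally):
>>
>> import java.io.File;
>> import java.util.Arrays;
>> import java.util.HashSet;
>> import java.util.Set;
>>
>> import org.apache.lucene.analysis.KeywordAnalyzer;
>> import org.apache.lucene.index.IndexReader;
>> import org.apache.lucene.index.IndexWriter;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.FSDirectory;
>>
>> public class ShrinkIndex {
>>   public static void main(String[] args) throws Exception {
>>     // Paths and the field name are placeholders.
>>     Directory oldDir = FSDirectory.open(new File("/path/to/old-index"));
>>     Directory newDir = FSDirectory.open(new File("/path/to/new-index"));
>>     Set<String> drop = new HashSet<String>(Arrays.asList("shingled_body"));
>>
>>     IndexReader reader = IndexReader.open(oldDir, true);          // read-only
>>     IndexReader wrapped = new FilteringIndexReader(reader, drop); // hides the field
>>
>>     // addIndexes(IndexReader[]) rewrites the wrapped view into new segments,
>>     // so the dropped field never reaches the new index.  The analyzer is
>>     // not used by addIndexes; KeywordAnalyzer is just a placeholder.
>>     IndexWriter writer = new IndexWriter(newDir, new KeywordAnalyzer(),
>>         true, IndexWriter.MaxFieldLength.UNLIMITED);
>>     writer.addIndexes(new IndexReader[] { wrapped });
>>     writer.close();
>>     reader.close();
>>   }
>> }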
>>
>>
>> On Tue, Feb 14, 2012 at 10:12 PM, Li Li <fancye...@gmail.com> wrote:
>> > I have roughly read the code of the 4.0 trunk; maybe it's feasible.
>> >    SegmentMerger.add(IndexReader) adds the readers to be merged.
>> >    merge() will call
>> >      mergeTerms(segmentWriteState);
>> >      mergePerDoc(segmentWriteState);
>> >
>> >   mergeTerms() constructs the Fields from the IndexReaders:
>> >    for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
>> >      final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
>> >      final Fields f = r.reader.fields();
>> >      final int maxDoc = r.reader.maxDoc();
>> >      if (f != null) {
>> >        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
>> >        fields.add(f);
>> >      }
>> >      docBase += maxDoc;
>> >    }
>> >    So if you wrap your IndexReader and override its fields() method,
>> > maybe it will work for merging terms.
>> >
>> >    For DocValues, the wrapper can also override AtomicReader.docValues()
>> > and just return null for the fields you want to remove. It should
>> > probably traverse the CompositeReader's getSequentialSubReaders() and
>> > wrap each AtomicReader.
>> >
>> >    Other things like term vectors and norms are similar.
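>> >
>> >    A very rough, untested sketch of such a wrapper (assuming a
>> > FilterAtomicReader-style delegating reader with a FilterFields helper;
>> > exact trunk class names may differ, and FieldPruningReader is just a
>> > made-up name):
>> >
>> > import java.io.IOException;
>> > import java.util.ArrayList;
>> > import java.util.Iterator;
>> > import java.util.List;
>> > import java.util.Set;
>> >
>> > import org.apache.lucene.index.AtomicReader;
>> > import org.apache.lucene.index.Fields;
>> > import org.apache.lucene.index.FilterAtomicReader;
>> > import org.apache.lucene.index.Terms;
>> >
>> > class FieldPruningReader extends FilterAtomicReader {
>> >   private final Set<String> drop;   // field names to remove
>> >
>> >   FieldPruningReader(AtomicReader in, Set<String> drop) {
>> >     super(in);
>> >     this.drop = drop;
>> >   }
>> >
>> >   @Override
>> >   public Fields fields() throws IOException {
>> >     final Fields f = super.fields();
>> >     if (f == null) return null;
>> >     return new FilterFields(f) {
>> >       @Override
>> >       public Iterator<String> iterator() {
>> >         // hide the pruned field names so the merger never visits their terms
>> >         List<String> kept = new ArrayList<String>();
>> >         for (Iterator<String> it = super.iterator(); it.hasNext();) {
>> >           String name = it.next();
>> >           if (!drop.contains(name)) kept.add(name);
>> >         }
>> >         return kept.iterator();
>> >       }
>> >
>> >       @Override
>> >       public Terms terms(String field) throws IOException {
>> >         return drop.contains(field) ? null : super.terms(field);
>> >       }
>> >     };
>> >   }
>> >   // docValues()/norms could be handled the same way, returning null
>> >   // for the pruned fields, as described above.
>> > }
>> >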
>> > On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart <bstewart...@gmail.com>wrote:
>> >
>> >> I was thinking that if I make a wrapper class that aggregates another
>> >> IndexReader and filters out the terms I don't want anymore, it might
>> >> work.  Then I'd pass that wrapper into SegmentMerger.  I think if I
>> >> filter out terms on GetFieldNames(...) and Terms(...) it might work.
>> >>
>> >> Something like:
>> >>
>> >> HashSet<string> ignoredTerms=...;
>> >>
>> >> FilteringIndexReader wrapper=new FilteringIndexReader(reader, ignoredTerms);
>> >>
>> >> SegmentMerger merger=new SegmentMerger(writer);
>> >>
>> >> merger.add(wrapper);
>> >>
>> >> merger.Merge();
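>> >>
>> >> A bare-bones, untested sketch of what that wrapper might look like (in
>> >> Java / 3.x API terms; the FilteringIndexReader name and the
>> >> term-skipping logic are only illustrative):
>> >>
>> >> import java.io.IOException;
>> >> import java.util.Collection;
>> >> import java.util.HashSet;
>> >> import java.util.Set;
>> >>
>> >> import org.apache.lucene.index.FilterIndexReader;
>> >> import org.apache.lucene.index.IndexReader;
>> >> import org.apache.lucene.index.TermEnum;
>> >>
>> >> class FilteringIndexReader extends FilterIndexReader {
>> >>   private final Set<String> ignoredFields;   // fields whose terms are dropped
>> >>
>> >>   FilteringIndexReader(IndexReader in, Set<String> ignoredFields) {
>> >>     super(in);
>> >>     this.ignoredFields = ignoredFields;
>> >>   }
>> >>
>> >>   @Override
>> >>   public Collection<String> getFieldNames(FieldOption option) {
>> >>     Collection<String> names = new HashSet<String>(in.getFieldNames(option));
>> >>     names.removeAll(ignoredFields);
>> >>     return names;
>> >>   }
>> >>
>> >>   @Override
>> >>   public TermEnum terms() throws IOException {
>> >>     // skip every term that belongs to an ignored field
>> >>     return new FilterTermEnum(in.terms()) {
>> >>       @Override
>> >>       public boolean next() throws IOException {
>> >>         while (super.next()) {
>> >>           if (!ignoredFields.contains(term().field())) {
>> >>             return true;
>> >>           }
>> >>         }
>> >>         return false;
>> >>       }
>> >>     };
>> >>   }
>> >>   // terms(Term), termDocs()/termPositions(), norms and term vectors
>> >>   // would need the same treatment in a complete implementation.
>> >> }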
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
>> >>
>> >> > For method 2, the delete-by-query idea is wrong; we can't delete terms
>> >> > that way.  You would also have to hack the .tii and .tis files.
>> >> >
>> >> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li <fancye...@gmail.com> wrote:
>> >> >
>> >> >> Method 1: dumping the data
>> >> >> For stored fields, you can traverse the whole index and save them
>> >> >> somewhere else.
>> >> >> For indexed but not stored fields, it may be more difficult.
>> >> >>    If an indexed, not-stored field is not analyzed (fields such as
>> >> >> id), it's easy to get back from FieldCache.StringIndex.
>> >> >>    But for analyzed fields, though they can theoretically be restored
>> >> >> from term vectors and term positions, they are hard to recover from
>> >> >> the index.
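>> >> >>
>> >> >>    A minimal sketch of that dump-and-rewrite loop for stored fields
>> >> >> (untested; the paths, the field name and the 3.x-style API calls are
>> >> >> assumptions):
>> >> >>
>> >> >> import java.io.File;
>> >> >> import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> >> >> import org.apache.lucene.document.Document;
>> >> >> import org.apache.lucene.index.IndexReader;
>> >> >> import org.apache.lucene.index.IndexWriter;
>> >> >> import org.apache.lucene.store.FSDirectory;
>> >> >> import org.apache.lucene.util.Version;
>> >> >>
>> >> >> public class DumpStoredFields {
>> >> >>   public static void main(String[] args) throws Exception {
>> >> >>     IndexReader reader = IndexReader.open(FSDirectory.open(new File("old-index")));
>> >> >>     IndexWriter writer = new IndexWriter(FSDirectory.open(new File("new-index")),
>> >> >>         new StandardAnalyzer(Version.LUCENE_30), true,
>> >> >>         IndexWriter.MaxFieldLength.UNLIMITED);
>> >> >>     for (int i = 0; i < reader.maxDoc(); i++) {
>> >> >>       if (reader.isDeleted(i)) continue;      // skip deleted docs
>> >> >>       Document doc = reader.document(i);      // only stored fields come back
>> >> >>       doc.removeFields("useless_field");      // placeholder field name
>> >> >>       writer.addDocument(doc);                // indexed-but-not-stored data does
>> >> >>                                               // not survive this round trip
>> >> >>     }
>> >> >>     writer.close();
>> >> >>     reader.close();
>> >> >>   }
>> >> >> }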
>> >> >>
>> >> >> Method 2: hack the index metadata
>> >> >> 1. indexed fields
>> >> >>      delete by query, e.g. field:*
>> >> >> 2. stored fields
>> >> >>       Because all fields of a document are stored sequentially, it's
>> >> >> not easy to delete some of them.  Leaving them in place will not
>> >> >> affect search speed, but if you retrieve stored fields and the
>> >> >> useless fields are very long, it will slow retrieval down.
>> >> >>       It's also possible to hack this too, but it takes more effort to
>> >> >> understand the index file format and to traverse the fdt/fdx files.
>> >> >>
>> >>
>> >> >> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>> >> >>
>> >> >> this will give you some insight.
>> >> >>
>> >> >>
>> >> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <bstewart...@gmail.com>wrote:
>> >> >>
>> >> >>> Let's say I have a large index (100M docs, 1TB, split up between 10
>> >> >>> indexes), and a bunch of the "stored" and "indexed" fields are not
>> >> >>> used in search at all.  In order to save memory and disk, I'd like
>> >> >>> to rebuild that index *without* those fields, but I don't have the
>> >> >>> original documents to rebuild the entire index with (don't have the
>> >> >>> full-text anymore, etc.).  Is there some way to rebuild or optimize
>> >> >>> an existing index with only a sub-set of the existing indexed
>> >> >>> fields?  Or alternatively, is there a way to avoid loading some
>> >> >>> indexed fields at all (to avoid loading term infos and the terms
>> >> >>> index)?
>> >> >>>
>> >> >>> Thanks
>> >> >>> Bob
>> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>>
