I have roughly read through the code of the 4.0 trunk. It may be feasible.
    SegmentMerger.add(IndexReader) adds the readers to be merged.
    merge() then calls:
      mergeTerms(segmentWriteState);
      mergePerDoc(segmentWriteState);

   mergeTerms() builds the Fields to be merged from the IndexReaders:

     for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
       final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
       final Fields f = r.reader.fields();
       final int maxDoc = r.reader.maxDoc();
       if (f != null) {
         slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
         fields.add(f);
       }
       docBase += maxDoc;
     }
    So if you wrap your IndexReader and override its fields() method, it
may work for merging terms.
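
    A rough sketch of such a wrapper (written against the 4.0 trunk API,
which is still in flux, so FilterAtomicReader and the Fields methods may
differ in your checkout; FieldRemovingReader and ignoredFields are made-up
names, and this is untested):

      import java.io.IOException;
      import java.util.Iterator;
      import java.util.Set;

      import org.apache.lucene.index.AtomicReader;
      import org.apache.lucene.index.Fields;
      import org.apache.lucene.index.FilterAtomicReader;
      import org.apache.lucene.index.Terms;

      class FieldRemovingReader extends FilterAtomicReader {
        private final Set<String> ignoredFields;

        FieldRemovingReader(AtomicReader in, Set<String> ignoredFields) {
          super(in);
          this.ignoredFields = ignoredFields;
        }

        @Override
        public Fields fields() throws IOException {
          final Fields f = super.fields();
          return f == null ? null : filterFields(f);
        }

        // Wraps a Fields so the ignored fields become invisible.
        private Fields filterFields(Fields f) {
          return new FilterFields(f) {
            @Override
            public Terms terms(String field) throws IOException {
              // Pretend the ignored fields have no terms at all.
              return ignoredFields.contains(field) ? null : super.terms(field);
            }

            @Override
            public Iterator<String> iterator() {
              // Hide the ignored fields from enumeration too, so the
              // merger never visits them. (size() is left delegating, so
              // it may over-count; that looks harmless for merging.)
              final Iterator<String> names = in.iterator();
              return new Iterator<String>() {
                private String next = advance();

                private String advance() {
                  while (names.hasNext()) {
                    final String name = names.next();
                    if (!ignoredFields.contains(name)) {
                      return name;
                    }
                  }
                  return null;
                }

                public boolean hasNext() { return next != null; }

                public String next() {
                  final String cur = next;
                  next = advance();
                  return cur;
                }

                public void remove() {
                  throw new UnsupportedOperationException();
                }
              };
            }
          };
        }
      }

    Then pass the wrapped reader to SegmentMerger.add() in place of the
original one.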

    For DocValues, the wrapper can likewise override
AtomicReader.docValues() and just return null for the fields you want to
remove. You may also need to traverse the CompositeReader's
getSequentialSubReaders() and wrap each AtomicReader.
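
    Something like this, in the same wrapper (again only a sketch; the
exact DocValues signature on trunk is an assumption):

      @Override
      public DocValues docValues(String field) throws IOException {
        // Report no doc values for the fields being removed.
        return ignoredFields.contains(field) ? null : super.docValues(field);
      }

    and for a composite reader, wrap each sub-reader before adding it to
the merger (assuming the sub-readers are atomic):

      for (IndexReader sub : compositeReader.getSequentialSubReaders()) {
        merger.add(new FieldRemovingReader((AtomicReader) sub, ignoredFields));
      }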

    Other parts, such as term vectors and norms, can be handled similarly.
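
    For instance, per-document term vectors on trunk also come back as a
Fields, so the filtering helper from the sketch above could be reused
(again an assumption about the current API, not tested):

      @Override
      public Fields getTermVectors(int docID) throws IOException {
        final Fields vectors = super.getTermVectors(docID);
        // Hide the vectors of the removed fields as well.
        return vectors == null ? null : filterFields(vectors);
      }

    For norms, returning null from the norms accessor for the ignored
fields should work the same way.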
On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart <bstewart...@gmail.com> wrote:

> I was thinking that if I make a wrapper class that aggregates another
> IndexReader and filters out the terms I don't want anymore, it might work.
> And then pass that wrapper into SegmentMerger.  I think if I filter out
> terms on GetFieldNames(...) and Terms(...) it might work.
>
> Something like:
>
> HashSet<string> ignoredTerms = ...;
>
> FilteringIndexReader wrapper = new FilteringIndexReader(reader);
>
> SegmentMerger merger = new SegmentMerger(writer);
>
> merger.add(wrapper);
>
> merger.Merge();
>
>
>
>
>
> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
>
> > For method 2, the delete is wrong: we can't delete terms that way.
> >   You would also have to hack the .tii and .tis files.
> >
> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li <fancye...@gmail.com> wrote:
> >
> >> method 1, dumping data
> >> For stored fields, you can traverse the whole index and save them
> >> somewhere else.
> >> For indexed-but-not-stored fields, it may be more difficult.
> >>    If an indexed-but-not-stored field is not analyzed (fields such as
> >> id), its values are easy to get from FieldCache.StringIndex.
> >>    But for analyzed fields, although they could in theory be restored
> >> from term vectors and positions, they are hard to recover from the index.
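> >>
> >> A rough sketch of the dump (3.x API; document() returns only the
> >> stored fields, and keepFields is a made-up name):
> >>
> >>   IndexReader reader = IndexReader.open(dir);
> >>   for (int i = 0; i < reader.maxDoc(); i++) {
> >>     if (reader.isDeleted(i)) continue;    // skip deleted docs
> >>     Document doc = reader.document(i);    // stored fields only
> >>     for (Fieldable f : doc.getFields()) {
> >>       if (keepFields.contains(f.name())) {
> >>         // save f.stringValue() or f.getBinaryValue() somewhere else
> >>       }
> >>     }
> >>   }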
> >>
> >> method 2, hacking the metadata
> >> 1. indexed fields
> >>      delete by query, e.g. field:*
> >> 2. stored fields
> >>      Because all fields of a document are stored sequentially, it's not
> >> easy to delete some of them. This does not affect search speed, but if
> >> you retrieve stored fields and the useless fields are very long, it
> >> will slow retrieval down.
> >>      It's also possible to hack this, but it needs more effort to
> >> understand the index file format and to traverse the fdt/fdx files.
> >>
> >> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
> >>
> >> This will give you some insight.
> >>
> >>
> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <bstewart...@gmail.com> wrote:
> >>
> >>> Let's say I have a large index (100M docs, 1TB, split up between 10
> >>> indexes), and a bunch of the "stored" and "indexed" fields are not
> >>> used in search at all.  In order to save memory and disk, I'd like to
> >>> rebuild that index *without* those fields, but I don't have the
> >>> original documents to rebuild the entire index with (don't have the
> >>> full text anymore, etc.).  Is there some way to rebuild or optimize an
> >>> existing index with only a subset of the existing indexed fields?  Or
> >>> alternatively, is there a way to avoid loading some indexed fields at
> >>> all (to avoid loading term infos and the terms index)?
> >>>
> >>> Thanks
> >>> Bob
> >>
> >>
> >>
>
>
