Re: Can I rebuild an index and remove some fields?

2012-02-16 Thread Robert Stewart
I will test it with my big production indexes first; if it works, I will
port it to Java and add it to contrib, I think.

Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Robert Stewart
I implemented an index shrinker and it works. I reduced my test index
from 6.6 GB to 3.6 GB by removing a single shingled field I did not
need anymore. I'm actually using Lucene.Net for this project, so the code
is C# against the Lucene.Net 2.9.2 API. But the basic idea is:

Create an IndexReader wrapper that enumerates only the terms you want to
keep, and that removes the dropped terms from documents when returning them.

Use SegmentMerger to re-write each segment (with each segment wrapped by the
wrapper class), writing the new segment to a new directory. Collect the
SegmentInfos and do a commit to create a new segments file in the new index
directory.

Done - you now have a shrunk index with the specified terms removed.

The implementation uses a separate thread for each segment, so it re-writes
them in parallel. It took about 15 minutes for a 770,000-document index on my
MacBook.
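
Robert's code is C#, but a minimal sketch of the same wrapper in Java, against
the Lucene 3.x API, might look like the following (the class name, the
removed-field set, and all details below are an illustration of the idea, not
his actual code):

import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

public class FieldRemovingReader extends FilterIndexReader {
  private final Set<String> removed;  // names of fields to drop

  public FieldRemovingReader(IndexReader in, Set<String> removed) {
    super(in);
    this.removed = removed;
  }

  @Override
  public Collection<String> getFieldNames(FieldOption option) {
    // Hide the dropped fields from the merger.
    Collection<String> names = new HashSet<String>(in.getFieldNames(option));
    names.removeAll(removed);
    return names;
  }

  @Override
  public TermEnum terms() throws IOException {
    // Enumerate only terms whose field we keep; the merger then writes a
    // segment without the dropped fields' postings.
    return new FilterTermEnum(in.terms()) {
      @Override
      public boolean next() throws IOException {
        while (in.next()) {
          if (!removed.contains(in.term().field())) {
            return true;
          }
        }
        return false;
      }
    };
  }
}

A complete version would also filter terms(Term), the TermDocs/TermPositions
enumerations, and the stored fields returned by document(int).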


Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Li Li
Great. I think you could make it a public tool; maybe others also need such
functionality.

Re: Can I rebuild an index and remove some fields?

2012-02-14 Thread Robert Stewart
I was thinking that if I make a wrapper class that aggregates another
IndexReader and filters out the terms I don't want anymore, it might work.
Then I'd pass that wrapper into SegmentMerger. I think if I filter out terms
in GetFieldNames(...) and Terms(...) it might work.

Something like:

HashSet<string> ignoredTerms = ...;

FilteringIndexReader wrapper = new FilteringIndexReader(reader);

SegmentMerger merger = new SegmentMerger(writer);

merger.add(wrapper);

merger.Merge();
Re: Can I rebuild an index and remove some fields?

2012-02-14 Thread Li Li
I have roughly read the code of the 4.0 trunk; it's probably feasible.
SegmentMerger.add(IndexReader) adds the readers to be merged, and merge()
calls:

  mergeTerms(segmentWriteState);
  mergePerDoc(segmentWriteState);

mergeTerms() constructs the fields from the IndexReaders:
for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
  final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
  final Fields f = r.reader.fields();
  final int maxDoc = r.reader.maxDoc();
  if (f != null) {
    slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
    fields.add(f);
  }
  docBase += maxDoc;
}
So if you wrap your IndexReader and override its fields() method, it may work
for merging terms.

For DocValues, the wrapper can also override AtomicReader.docValues() and just
return null for the fields you want to remove. It probably also needs to
traverse a CompositeReader's getSequentialSubReaders() and wrap each
AtomicReader.

Other things, like term vectors and norms, are similar.
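
A rough sketch of such a wrapper, written against the AtomicReader API as it
later shipped in 4.0 (the base class FilterAtomicReader, the class name, and
the dropped-field set are assumptions here; the trunk API was still moving
when this was posted):

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.Terms;

class FieldDroppingReader extends FilterAtomicReader {
  private final Set<String> dropped;  // fields to remove

  FieldDroppingReader(AtomicReader in, Set<String> dropped) {
    super(in);
    this.dropped = dropped;
  }

  @Override
  public Fields fields() throws IOException {
    final Fields f = super.fields();
    if (f == null) return null;
    return new FilterFields(f) {
      @Override
      public Terms terms(String field) throws IOException {
        // Pretend dropped fields have no postings.
        return dropped.contains(field) ? null : super.terms(field);
      }
      // A full version would also filter iterator() and size().
    };
  }

  @Override
  public DocValues docValues(String field) throws IOException {
    // Return null for removed fields, as suggested above.
    return dropped.contains(field) ? null : super.docValues(field);
  }
}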


Can I rebuild an index and remove some fields?

2012-02-13 Thread Robert Stewart
Let's say I have a large index (100M docs, 1 TB, split across 10 indexes),
and a bunch of the stored and indexed fields are not used in search at all.
In order to save memory and disk, I'd like to rebuild that index *without*
those fields, but I don't have the original documents to rebuild the entire
index with (I don't have the full text anymore, etc.). Is there some way to
rebuild or optimize an existing index with only a sub-set of the existing
indexed fields? Or, alternatively, is there a way to avoid loading some
indexed fields at all (to avoid loading the term infos and terms index)?

Thanks
Bob

Re: Can I rebuild an index and remove some fields?

2012-02-13 Thread Li Li
method 1, dumping the data
For stored fields, you can traverse the whole index and save them somewhere
else (see the sketch below). For indexed-but-not-stored fields it is more
difficult: if such a field is not analyzed (fields such as an id), its values
are easy to get from FieldCache.StringIndex; but for analyzed fields, though
the text could theoretically be restored from term vectors and term positions,
it's hard to recover it from the index.
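
A minimal sketch of method 1 for the stored fields, using the public Lucene
3.x Java API (the paths, analyzer choice, and dropped-field name are
illustrative). Note the caveat above: this preserves only what was stored;
indexed-but-not-stored data cannot be rebuilt this way.

import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class StoredFieldDumper {
  public static void main(String[] args) throws IOException {
    Set<String> dropped = new HashSet<String>();
    dropped.add("uselessField");  // hypothetical field name

    IndexReader reader = IndexReader.open(FSDirectory.open(new File("old-index")));
    // Use whatever analyzer the original index used for any re-analyzed fields.
    IndexWriter writer = new IndexWriter(
        FSDirectory.open(new File("new-index")),
        new IndexWriterConfig(Version.LUCENE_35,
            new StandardAnalyzer(Version.LUCENE_35)));

    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;      // skip deleted docs
      Document src = reader.document(i);
      Document dst = new Document();
      for (Fieldable f : src.getFields()) {
        if (!dropped.contains(f.name())) {
          dst.add(f);                         // copy kept fields as-is
        }
      }
      writer.addDocument(dst);
    }
    writer.close();
    reader.close();
  }
}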

method 2, hacking the index metadata
1. indexed fields
   delete by query, e.g. field:*
2. stored fields
   Because all of a document's fields are stored sequentially, it's not easy
to delete some of them. Leftover stored data will not affect search speed,
but if you retrieve stored fields and the useless fields are very long,
retrieval will slow down.
   It's also possible to hack this directly, but it takes more effort to
understand the index file format and to traverse the fdt/fdx files.
http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html

this will give you some insight.


Re: Can I rebuild an index and remove some fields?

2012-02-13 Thread Li Li
For method 2, the delete-by-query idea is wrong: we can't delete terms that
way. You would also have to hack the tii and tis files.
