Re: Can I rebuild an index and remove some fields?
I will test it with my big production indexes first; if it works, I will port it to Java and add it to contrib.

On Wed, Feb 15, 2012 at 10:03 PM, Li Li fancye...@gmail.com wrote:
Great. I think you could make it a public tool; maybe others also need such functionality. [...]
Re: Can I rebuild an index and remove some fields?
I implemented an index shrinker and it works. I reduced my test index from 6.6 GB to 3.6 GB by removing a single shingled field I did not need anymore. I'm actually using Lucene.Net for this project, so the code is C# against the Lucene.Net 2.9.2 API, but the basic idea is:

1. Create an IndexReader wrapper that only enumerates the terms you want to keep, and that removes those terms from documents when returning them.
2. Use SegmentMerger to re-write each segment (with each segment wrapped by the wrapper class), writing the new segment to a new directory.
3. Collect the SegmentInfos and do a commit in order to create a new segments file in the new index directory.

Done - you now have a shrunk index with the specified terms removed. The implementation uses a separate thread for each segment, so it re-writes them in parallel. It took about 15 minutes to do a 770,000-doc index on my MacBook.

On Tue, Feb 14, 2012 at 10:12 PM, Li Li fancye...@gmail.com wrote:
I have roughly read the code of the 4.0 trunk; maybe it's feasible. [...]
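The shrinking pipeline described in this thread (wrap each segment's reader so it hides unwanted fields, then re-write every segment in parallel) can be sketched as a toy model. To keep it self-contained, a "segment" here is just a map from field name to term-to-postings map; the class and method names are invented for illustration and are not the Lucene or Lucene.Net API:

```java
import java.util.*;
import java.util.concurrent.*;

// Toy model of the shrink pipeline: filter fields out of each segment,
// rewriting all segments in parallel (one task per segment), as in the
// Lucene.Net implementation described above.
public class IndexShrinker {

    // Return a copy of one segment with the unwanted fields removed.
    // This plays the role of "SegmentMerger re-writing a wrapped reader".
    static Map<String, Map<String, List<Integer>>> rewriteSegment(
            Map<String, Map<String, List<Integer>>> segment,
            Set<String> fieldsToDrop) {
        Map<String, Map<String, List<Integer>>> out = new HashMap<>();
        for (Map.Entry<String, Map<String, List<Integer>>> e : segment.entrySet()) {
            if (!fieldsToDrop.contains(e.getKey())) {
                out.put(e.getKey(), new HashMap<>(e.getValue()));
            }
        }
        return out;
    }

    // Rewrite all segments in parallel, one thread per segment.
    static List<Map<String, Map<String, List<Integer>>>> shrink(
            List<Map<String, Map<String, List<Integer>>>> segments,
            Set<String> fieldsToDrop) {
        ExecutorService pool = Executors.newFixedThreadPool(segments.size());
        List<Future<Map<String, Map<String, List<Integer>>>>> futures = new ArrayList<>();
        for (Map<String, Map<String, List<Integer>>> seg : segments) {
            futures.add(pool.submit(() -> rewriteSegment(seg, fieldsToDrop)));
        }
        List<Map<String, Map<String, List<Integer>>>> out = new ArrayList<>();
        for (Future<Map<String, Map<String, List<Integer>>>> f : futures) {
            try {
                out.add(f.get());
            } catch (Exception ex) {
                throw new RuntimeException(ex);
            }
        }
        pool.shutdown();
        return out;
    }
}
```

In the real implementation the per-segment rewrite goes through SegmentMerger and writes postings to disk; the point of the sketch is only the shape of the parallel per-segment rewrite.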
Re: Can I rebuild an index and remove some fields?
Great. I think you could make it a public tool; maybe others also need such functionality.

On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart bstewart...@gmail.com wrote:
I implemented an index shrinker and it works. I reduced my test index from 6.6 GB to 3.6 GB by removing a single shingled field I did not need anymore. [...]
Re: Can I rebuild an index and remove some fields?
I was thinking that if I make a wrapper class that aggregates another IndexReader and filters out the terms I don't want anymore, it might work. Then I'd pass that wrapper into SegmentMerger. I think if I filter out terms in GetFieldNames(...) and Terms(...) it might work. Something like:

HashSet<string> ignoredTerms = ...;
FilteringIndexReader wrapper = new FilteringIndexReader(reader);
SegmentMerger merger = new SegmentMerger(writer);
merger.add(wrapper);
merger.Merge();

On Feb 14, 2012, at 1:49 AM, Li Li wrote:
For method 2, delete is wrong; we can't delete terms. [...]
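The delegating-wrapper idea above can be shown as a small self-contained sketch: a wrapper holds another "reader" and filters the field names and terms it exposes, so a merger that only sees the wrapper never copies the ignored fields. The Reader interface, FilteringReader, and MapReader names here are invented for illustration; they are not the real Lucene FilterIndexReader API:

```java
import java.util.*;

// Toy version of the filtering-wrapper pattern: everything is delegated to
// the wrapped reader, except that ignored fields are hidden from callers.
public class FilteringReaderDemo {

    interface Reader {
        Set<String> fieldNames();
        Map<String, List<String>> terms(); // field -> terms in that field
    }

    // Trivial in-memory reader used as the thing being wrapped.
    static class MapReader implements Reader {
        final Map<String, List<String>> data;
        MapReader(Map<String, List<String>> data) { this.data = data; }
        public Set<String> fieldNames() { return data.keySet(); }
        public Map<String, List<String>> terms() { return data; }
    }

    // The wrapper: same interface, but ignored fields are filtered out of
    // every enumeration, which is all a merger would ever see.
    static class FilteringReader implements Reader {
        private final Reader in;
        private final Set<String> ignoredFields;

        FilteringReader(Reader in, Set<String> ignoredFields) {
            this.in = in;
            this.ignoredFields = ignoredFields;
        }

        public Set<String> fieldNames() {
            Set<String> names = new HashSet<>(in.fieldNames());
            names.removeAll(ignoredFields);
            return names;
        }

        public Map<String, List<String>> terms() {
            Map<String, List<String>> out = new HashMap<>(in.terms());
            out.keySet().removeAll(ignoredFields);
            return out;
        }
    }
}
```

The design point is that the merger needs no changes at all: it already rewrites whatever the reader exposes, so filtering the reader's view is enough to drop fields from the rewritten index.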
Re: Can I rebuild an index and remove some fields?
I have roughly read the code of the 4.0 trunk; maybe it's feasible. SegmentMerger.add(IndexReader) will add the to-be-merged readers. merge() will call mergeTerms(segmentWriteState) and mergePerDoc(segmentWriteState). mergeTerms() will construct fields from the IndexReaders:

for (int readerIndex = 0; readerIndex < mergeState.readers.size(); readerIndex++) {
  final MergeState.IndexReaderAndLiveDocs r = mergeState.readers.get(readerIndex);
  final Fields f = r.reader.fields();
  final int maxDoc = r.reader.maxDoc();
  if (f != null) {
    slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
    fields.add(f);
  }
  docBase += maxDoc;
}

So if you wrap your IndexReader and override its fields() method, maybe it will work for merging terms. For DocValues, the wrapper can also override AtomicReader.docValues() and just return null for the fields you want to remove. Maybe it should traverse the CompositeReader's getSequentialSubReaders() and wrap each AtomicReader. Other things like term vectors and norms are similar.

On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart bstewart...@gmail.com wrote:
I was thinking if I make a wrapper class that aggregates another IndexReader and filters out terms I don't want anymore, it might work. [...]
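The docBase bookkeeping in the merge loop quoted above is worth making concrete: when sub-readers are concatenated, each reader's local doc IDs are offset by the number of documents in the readers before it. A minimal standalone sketch of that arithmetic (toy code, not Lucene):

```java
// Illustrates the docBase accumulation from the mergeTerms() loop: each
// reader's slice starts where the previous reader's documents ended.
public class DocBaseDemo {

    // Given per-reader maxDoc counts, compute each reader's docBase
    // (the same running sum the merge loop maintains).
    static int[] docBases(int[] maxDocs) {
        int[] bases = new int[maxDocs.length];
        int docBase = 0;
        for (int i = 0; i < maxDocs.length; i++) {
            bases[i] = docBase;
            docBase += maxDocs[i];
        }
        return bases;
    }

    // Map a (readerIndex, localDocId) pair to a doc ID in the merged index.
    static int toGlobal(int[] bases, int readerIndex, int localDocId) {
        return bases[readerIndex] + localDocId;
    }
}
```

For example, with readers of 10, 5, and 20 documents, the third reader's slice starts at docBase 15, so its local doc 3 becomes doc 18 in the merged index.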
Can I rebuild an index and remove some fields?
Let's say I have a large index (100M docs, 1 TB, split up between 10 indexes), and a bunch of the stored and indexed fields are not used in search at all. In order to save memory and disk, I'd like to rebuild that index *without* those fields, but I don't have the original documents to rebuild the entire index with (I don't have the full text anymore, etc.). Is there some way to rebuild or optimize an existing index with only a subset of the existing indexed fields? Or, alternatively, is there a way to avoid loading some indexed fields at all (to avoid loading term infos and the terms index)? Thanks Bob
Re: Can I rebuild an index and remove some fields?
Method 1: dumping data. For stored fields, you can traverse the whole index and save them somewhere else. For indexed-but-not-stored fields it may be more difficult: if the field is not analyzed (fields such as id), its values are easy to get from FieldCache.StringIndex; but for analyzed fields, though they can theoretically be restored from term vectors and term positions, it's hard to recover them from the index.

Method 2: hack the metadata.
1. Indexed fields: delete by query, e.g. field:*
2. Stored fields: because all fields are stored sequentially, it's not easy to delete some fields. This will not affect search speed, but if you retrieve stored fields and the useless fields are very long, it will slow retrieval down. It's also possible to hack with it, but that needs more effort to understand the index file format and traverse the fdt/fdx files. http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html will give you some insight.

On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart bstewart...@gmail.com wrote:
Lets say I have a large index (100M docs, 1TB, split up between 10 indexes). [...]
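Method 1 above (traverse the stored documents, keep only the wanted fields, and feed the result to a fresh index build) can be sketched as follows. Documents are modeled as field-to-value maps for self-containment; the class and method names are illustrative, not a Lucene API:

```java
import java.util.*;

// Sketch of "method 1": walk every stored document, copy only the fields
// you still want, and hand the trimmed documents to a new index build.
public class StoredFieldDump {

    static List<Map<String, String>> dump(List<Map<String, String>> docs,
                                          Set<String> keepFields) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            Map<String, String> copy = new HashMap<>();
            for (String f : keepFields) {
                // Not every document necessarily has every field.
                if (doc.containsKey(f)) {
                    copy.put(f, doc.get(f));
                }
            }
            out.add(copy);
        }
        return out;
    }
}
```

As the thread notes, this only works cleanly for stored fields: indexed-but-not-stored analyzed fields cannot be recovered this way, which is what makes the reader-wrapper approach (method used by the shrinker) more attractive.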
Re: Can I rebuild an index and remove some fields?
For method 2, delete is wrong; we can't delete terms. You would also have to hack the tii and tis files (the term dictionary and its index).

On Tue, Feb 14, 2012 at 2:46 PM, Li Li fancye...@gmail.com wrote:
method1, dumping data: for stored fields, you can traverse the whole index and save them somewhere else. [...]