[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615611#action_12615611 ]

Jason Rutherglen commented on LUCENE-1278:

In order for the proposal I mentioned to work, DocumentsWriter.appendPostings needs to be changed to store the docs in an IntArrayList or something of the sort, and then decide where to store the postings. I started working on LUCENE-1292 to address this problem outside of reworking core Lucene. LUCENE-1278 only addresses half of my problem. I also want realtime updates to an in-memory term index. The most efficient way to achieve this is what is outlined in LUCENE-1292.

Add optional storing of document numbers in term dictionary

Key: LUCENE-1278
URL: https://issues.apache.org/jira/browse/LUCENE-1278
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Priority: Minor
Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, TestTermEnumDocs.java

Add optional storing of document numbers in term dictionary. String index field cache and range filter creation will be faster.

Example read code:
{noformat}
TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
do {
  Term term = termEnum.term();
  if (term == null || term.field() != field) break;
  int[] docs = termEnum.docs();
} while (termEnum.next());
{noformat}
Example write code:
{noformat}
Document document = new Document();
document.add(new Field("tag", "dog", Field.Store.YES, Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
indexWriter.addDocument(document);
{noformat}

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Re: [jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
This also reminds me of the pulsing technique described in:

http://citeseer.ist.psu.edu/cutting90optimizations.html

Doug

eks dev wrote:

It seems someone else had the same idea: inline very short postings into the term dictionary (even for an in-memory index) and save one pointer (and one seek, in a disk setup)... nice reading:

http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615077#action_12615077 ]

Eks Dev commented on LUCENE-1278:

In light of Mike's comments here (Michael McCandless - 05/May/08 05:33 AM), I think it is worth mentioning that I am working on LUCENE-1340, that is, storing postings without additional frq info. Correct me if I am wrong: the only difference is that this approach with *.frq needs one more seek... At the same time, this could potentially increase the term dict size, so we lose some locality.

Your last proposal sounds interesting: inline short postings into the term dict, so for short postings (about the size of the offset pointer into *.frq) with tf==1 (that is always the case if you use omitTf(true) from LUCENE-1340) we spare one seek()... this could be a lot. Also, there is no need to store postings into *.frq (this complicates maintenance, I guess).
Re: [jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
It seems someone else had the same idea: inline very short postings into the term dictionary (even for an in-memory index) and save one pointer (and one seek, in a disk setup)... nice reading:

http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf

- Original Message -
From: Eks Dev (JIRA) [EMAIL PROTECTED]
To: java-dev@lucene.apache.org
Sent: Sunday, 20 July, 2008 1:02:31 PM
Subject: [jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615077#action_12615077 ]

Eks Dev commented on LUCENE-1278:

In light of Mike's comments here (Michael McCandless - 05/May/08 05:33 AM), I think it is worth mentioning that I am working on LUCENE-1340, that is, storing postings without additional frq info. Correct me if I am wrong: the only difference is that this approach with *.frq needs one more seek... At the same time, this could potentially increase the term dict size, so we lose some locality.

Your last proposal sounds interesting: inline short postings into the term dict, so for short postings (about the size of the offset pointer into *.frq) with tf==1 (that is always the case if you use omitTf(true) from LUCENE-1340) we spare one seek()... this could be a lot. Also, there is no need to store postings into *.frq (this complicates maintenance, I guess).
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598770#action_12598770 ]

Jason Rutherglen commented on LUCENE-1278:

Have a new patch that handles deleted docs but realized that returning DocIdSetIterator is not needed. This implementation can integrate with TermDocs transparently. The issue is then whether to keep the Fieldable.isStoreTermDocs or make the implementation a default for untokenized fields. For untokenized fields, this would mean not having to store the docs in the segment.frq file.
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598793#action_12598793 ]

Jason Rutherglen commented on LUCENE-1278:

Thought of some simple logic for this that will make it work automatically with no user interaction and no API additions. If the number of docs containing the term is less than or equal to the TermDocs skip interval, and the term frequency in each of those docs is 1, then the docs should be stored in segment.tis. Otherwise they should be stored as usual in segment.frq. The problem is knowing, inside the DocumentsWriter.appendPostings method, whether this condition holds.
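The rule described in the comment above can be sketched as a small predicate. This is only an illustration under stated assumptions, not code from the attached patches: the class and method names are hypothetical, and the per-document term frequencies are assumed to have already been collected into an array (which is exactly the part the comment says is hard to know inside appendPostings).

```java
/** Hypothetical helper; not part of Lucene or of the LUCENE-1278 patches. */
final class InlinePostingsPolicy {
  private InlinePostingsPolicy() {}

  /**
   * Decides whether a term's docs would go into the term dictionary (segment.tis)
   * rather than segment.frq, following the rule from the comment.
   *
   * @param termFreqs    within-document frequency of the term, one entry per
   *                     document that contains it (so length == doc frequency)
   * @param skipInterval the skip interval used for the postings
   */
  static boolean shouldInline(int[] termFreqs, int skipInterval) {
    // Too many docs: keep the regular postings in segment.frq.
    if (termFreqs.length > skipInterval) {
      return false;
    }
    // Any frequency other than 1 carries information that only .frq stores.
    for (int tf : termFreqs) {
      if (tf != 1) {
        return false;
      }
    }
    return true;
  }
}
```

A call such as `InlinePostingsPolicy.shouldInline(new int[] {1, 1, 1}, 16)` would select the term dictionary; the value 16 is used here only as an illustrative skip interval.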
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12596762#action_12596762 ]

Paul Elschot commented on LUCENE-1278:

Some comments on the 5.7.2008 patch:

The test with 7.6 times speedup for very few docs per term makes me wonder why this never showed up as a performance problem before. It certainly shows an advantage of flexible indexing for the case in which the within document term frequencies are not needed (for example primary/foreign keys, which normally end up in a keyword field.)

In the patch, DocIdSetIterator is used in the org.apache.lucene.index package, so it would be a good idea to move it from o.a.l.search to o.a.l.index or to o.a.l.util to avoid a circular dependency involving the index and search packages. As DocIdSetIterator is not yet released, this move should be no problem.

The DocIdSetReader class in the patch has so much code in common with SortedVIntList that it might be better to merge the two into a single one, and try and refactor common code into new methods there. That would also be an easy way to get rid of the unsupported skipTo() operation.
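On the unsupported skipTo() mentioned above: over an already decoded, ascending doc list, a simple linear skipTo() takes very little code. The following is a minimal sketch, assuming the docs are held in a sorted int[]. The class name is hypothetical, it does not extend Lucene's DocIdSetIterator, and it is not how DocIdSetReader or SortedVIntList are implemented; only the next()/doc()/skipTo() shape is borrowed.

```java
/** Hypothetical stand-alone iterator over a sorted array of doc numbers. */
final class IntArrayDocIterator {
  private final int[] docs; // ascending doc numbers
  private int index = -1;   // position of the current doc, -1 before the first next()

  IntArrayDocIterator(int[] sortedDocs) {
    this.docs = sortedDocs;
  }

  /** Moves to the next doc; returns false once the list is exhausted. */
  boolean next() {
    if (index < docs.length) {
      index++;
    }
    return index < docs.length;
  }

  /** Current doc number; only valid after next() or skipTo() returned true. */
  int doc() {
    return docs[index];
  }

  /** Moves forward to the first doc >= target that lies after the current position. */
  boolean skipTo(int target) {
    do {
      if (!next()) {
        return false;
      }
    } while (docs[index] < target);
    return true;
  }
}
```

For long lists a binary search from the current position would likely be preferable; the linear loop is kept here because the lists under discussion are short.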
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594744#action_12594744 ]

Jason Rutherglen commented on LUCENE-1278:

Implemented returning DocIdSetIterator; however, when running org.apache.lucene.search.TestSort, remote search fails. Reading the docs from a DocIdSetIterator directly from the file is troublesome due to the way TermEnum is designed with the other parts of Lucene. My own basic unit test works; TestSort does not, probably because the file pointer is not at the correct position during enumeration. Perhaps there is a way for the int array to work? Or is it best to create a separate file that is very similar to the term dictionary file but only stores terms and docs?
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594761#action_12594761 ]

Jason Rutherglen commented on LUCENE-1278:

What if the int array is saved in TermInfo only if the docfreq was below a certain threshold? Otherwise on int[] docs = TermEnum.docs() the docs are loaded from the file. This solves the main issue with the int array, the potential for high numbers of docs being stored in RAM.
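The threshold idea from the comment above could look roughly like this. It is a sketch only: `ThresholdedDocs` is a hypothetical stand-in for the relevant part of TermInfo, and the `Supplier` stands in for whatever code would read the docs from the index file; neither is taken from the patches.

```java
import java.util.function.Supplier;

/** Hypothetical holder: caches the doc list only for low-docFreq terms. */
final class ThresholdedDocs {
  private final int[] cached;           // non-null only when docFreq < threshold
  private final Supplier<int[]> loader; // stands in for reading docs from the file

  ThresholdedDocs(int docFreq, int threshold, Supplier<int[]> loader) {
    this.loader = loader;
    // Rare terms: load once and keep the array in RAM.
    // Common terms: keep nothing resident, to bound memory use.
    this.cached = docFreq < threshold ? loader.get() : null;
  }

  boolean isCached() {
    return cached != null;
  }

  /** Returns the docs, from RAM when cached, otherwise re-read on every call. */
  int[] docs() {
    return cached != null ? cached : loader.get();
  }
}
```

The trade-off is the one named in the comment: terms above the threshold pay a read on each docs() call, in exchange for never holding a large int[] in RAM.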
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594206#action_12594206 ]

Paul Elschot commented on LUCENE-1278:

Would there be any performance measurements for this? It might be quite good for terms that occur in very many documents, an area in which some improvement is possible I think. Btw, for this case it might also be good to use a SortedVIntList instead of an IntArrayList.

I had a look at today's patch, but I stopped at DocumentsWriter because it contains a lot of layout changes, so it's hard to focus on the functional differences.

Are there any index format changes involved in this?
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594219#action_12594219 ]

Michael McCandless commented on LUCENE-1278:

{quote}
Is there a way to know the number of documents for a term in DocumentsWriter.appendPostings before running through all of them?
{quote}

I don't think so. You have to run through the list.
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594220#action_12594220 ]

Michael McCandless commented on LUCENE-1278:

{quote}
I had a look at today's patch, but I stopped at DocumentsWriter because it contains a lot of layout changes, so it's hard to focus on the functional differences.
{quote}

I also stopped at DocumentsWriter: it seems like nearly all the changes are cosmetic. SegmentTermEnum is also hard to read. In general it's best to not make cosmetic changes (moving around import lines, changing whitespace, re-justifying whole paragraphs in javadocs, etc.) at the same time as a real change, when possible. I do admit there is a strong temptation ;)

Also, indentation should be two spaces, not tab. A number of sources were changed to tab in the patch.
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594225#action_12594225 ]

Michael McCandless commented on LUCENE-1278:

It looks like the .tii file is also storing the int[] docIDs (as an inlined byte blob)? I think that shouldn't be necessary?

This change adds a posting list like the frq file, except that it stores only docIDs (no freq information), is stored inline in the term dict, and includes a reader that materializes the full doc list as an int[] instead of offering only an iterator-like (nextDoc()) interface.

I think these changes would fit cleanly into what's been proposed for flexible indexing. EG, case 1a talks about storing only docID in a posting list, here:

http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

And recent discussions on the dev list around how to be flexible as to which index file(s) (one or many) things are stored in, eg:

http://www.mail-archive.com/java-dev@lucene.apache.org/msg15681.html

should allow you to store this data inlined into the terms dict, or as a separate file.

Some other initial comments/questions:

* I think this would bloat the index because the docIDs are being double stored (in the terms dict, and in the frq file). Would you propose changing the frq file to not store the docID when the term dict is doing so?

* Why store the byte blob in the term dict, and not in a separate (new) index file? We lose locality for cases where one wants to iterate through terms and not load these docs (eg RangeQuery).

* Could you, instead, make a reader that reads in the full byte blob from the frq file for a term, and then processes that into the int[]? This would require no change to indexing or to the index format, and wouldn't waste space double-storing the docIDs.

* I'm worried about how well this scales up. For very common terms, allocating, then decoding, then holding the full list of docIDs entirely in RAM can become extremely costly.

Also, I don't have a clear sense of how apps would use the returned int[]. For example, would the int[] for many terms need to remain resident at the same time? (Eg when running a RangeQuery). If so, that compounds the scale challenge.
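To make the "docID-only byte blob materialized as an int[]" idea concrete, here is a minimal sketch of one possible encoding: ascending docIDs stored as deltas, each delta written as a variable-length int (7 data bits per byte, high bit as continuation flag). This is an assumption made for illustration; the actual blob layout used by the patch is not described in this thread, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical docID-only posting blob: delta-encoded variable-length ints. */
final class DocIdBlob {
  private DocIdBlob() {}

  /** Encodes ascending, non-negative docIDs as variable-length deltas. */
  static byte[] encode(int[] sortedDocIds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int previous = 0;
    for (int docId : sortedDocIds) {
      int delta = docId - previous;
      // Emit 7 bits at a time, low bits first; the high bit marks "more bytes follow".
      while ((delta & ~0x7F) != 0) {
        out.write((delta & 0x7F) | 0x80);
        delta >>>= 7;
      }
      out.write(delta);
      previous = docId;
    }
    return out.toByteArray();
  }

  /** Materializes the full doc list; docCount is the term's doc frequency. */
  static int[] decode(byte[] blob, int docCount) {
    int[] docIds = new int[docCount];
    int pos = 0;
    int previous = 0;
    for (int i = 0; i < docCount; i++) {
      int b = blob[pos++];
      int delta = b & 0x7F;
      for (int shift = 7; (b & 0x80) != 0; shift += 7) {
        b = blob[pos++];
        delta |= (b & 0x7F) << shift;
      }
      previous += delta; // undo the delta encoding
      docIds[i] = previous;
    }
    return docIds;
  }
}
```

Note that decode() allocates the whole int[] up front, which is the scaling concern raised in the last bullet: for a very common term the array is as large as the doc frequency.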
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594231#action_12594231 ]

Jason Rutherglen commented on LUCENE-1278:

Storing the docs is off by default and will add index size only if the user wishes. The byte blob allows not reading the docs when loaddocs is false. Field cache and range query loading is very slow because of the dual seeks per term (for termenum then termdocs). If the docs were stored in a separate file, the terms would be stored redundantly.

A field cache example:

  protected Object createValue(IndexReader reader, Object entryKey) throws IOException {
    Entry entry = (Entry) entryKey;
    String field = entry.field;
    IntParser parser = (IntParser) entry.custom;
    final int[] retArray = new int[reader.maxDoc()];
    // Old code: a separate TermDocs, which costs a second seek per term.
    // TermDocs termDocs = reader.termDocs();
    // TermEnum termEnum = reader.terms(new Term(field, ""));
    // New code: the second argument asks the enum to load the docs as well.
    TermEnum termEnum = reader.terms(new Term(field, ""), true);
    try {
      do {
        Term term = termEnum.term();
        // Field names are interned, so a reference comparison is intended here.
        if (term == null || term.field() != field) break;
        int termval = parser.parseInt(term.text());
        int[] docs = termEnum.docs();
        for (int x = 0; x < docs.length; x++) {
          retArray[docs[x]] = termval;
        }
        // termDocs.seek(termEnum);
        // while (termDocs.next()) {
        //   retArray[termDocs.doc()] = termval;
        // }
      } while (termEnum.next());
    } finally {
      // termDocs.close();
      termEnum.close();
    }
    return retArray;
  }
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594317#action_12594317 ]

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Returning a DocIdSetIterator from TermEnum is a good idea; I will implement it so that it decodes the bytes directly from the file. Flexible indexing is also a good idea; I will implement on top of it once it is completed.

Add optional storing of document numbers in term dictionary
-----------------------------------------------------------

Key: LUCENE-1278
URL: https://issues.apache.org/jira/browse/LUCENE-1278
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Priority: Minor
Attachments: lucene.1278.5.4.2008.patch, lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, TestTermEnumDocs.java
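As a rough illustration of decoding doc numbers lazily from bytes instead of materializing an int[], the sketch below iterates over a delta-encoded, variable-length byte sequence. The encoding (7 bits per byte, high bit as a continuation flag, deltas between sorted doc numbers) is an assumption made for the example, and the class ByteDocIterator is hypothetical: it only mimics the next()/doc() shape of an iterator and does not extend Lucene's DocIdSetIterator. It reads from a byte[] rather than a file to stay self-contained.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical iterator over delta/variable-length encoded doc numbers; not Lucene API. */
final class ByteDocIterator {
    private final byte[] bytes;
    private int pos = 0;  // read position in the byte blob
    private int doc = 0;  // running sum of deltas; only meaningful after next() returns true

    ByteDocIterator(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Decodes the next delta and advances; returns false when the blob is exhausted. */
    boolean next() {
        if (pos >= bytes.length) {
            return false;
        }
        int b = bytes[pos++];
        int delta = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos++];
            delta |= (b & 0x7F) << shift;
        }
        doc += delta;
        return true;
    }

    /** Current doc number. */
    int doc() {
        return doc;
    }

    /** Encodes ascending, non-negative doc numbers as variable-length deltas (assumed format). */
    static byte[] encode(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int d : sortedDocs) {
            int delta = d - prev;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // low 7 bits plus continuation flag
                delta >>>= 7;
            }
            out.write(delta);
            prev = d;
        }
        return out.toByteArray();
    }
}
```

A caller would loop with `while (it.next()) { use(it.doc()); }`, which avoids allocating an array per term when the docs are only consumed once.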
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594127#action_12594127 ]

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Is there a way to know the number of documents for a term in DocumentsWriter.appendPostings before running through all of them? The patch currently uses a non-optimal linked list. If the count cannot be known up front, I will implement a growable int array instead.

Add optional storing of document numbers in term dictionary
-----------------------------------------------------------

Key: LUCENE-1278
URL: https://issues.apache.org/jira/browse/LUCENE-1278
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Priority: Minor
Attachments: lucene.1278.5.4.2008.patch
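For reference, a growable int array of the kind mentioned in this comment could look roughly like the following. This is a minimal sketch, not the code from the patch; the class name GrowableIntArray and the initial capacity of 8 are arbitrary choices made for the example.

```java
import java.util.Arrays;

/** Minimal growable int buffer; hypothetical, not part of Lucene or the patch. */
final class GrowableIntArray {
    private int[] values = new int[8]; // arbitrary initial capacity
    private int size = 0;

    /** Appends one value, doubling the backing array when it is full. */
    void add(int value) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = value;
    }

    /** Number of values added so far. */
    int size() {
        return size;
    }

    /** Returns a right-sized copy of the collected values. */
    int[] toArray() {
        return Arrays.copyOf(values, size);
    }

    public static void main(String[] args) {
        GrowableIntArray docs = new GrowableIntArray();
        for (int doc = 0; doc < 20; doc += 2) {
            docs.add(doc);
        }
        System.out.println(docs.size() + " docs: " + Arrays.toString(docs.toArray()));
    }
}
```

Compared with a linked list of boxed integers, this keeps the doc numbers in one contiguous primitive array, and doubling the capacity keeps appends amortized constant time when the final count is not known in advance.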
[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary
[ https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594128#action_12594128 ]

Jason Rutherglen commented on LUCENE-1278:
------------------------------------------

Test cases are being worked on.

Add optional storing of document numbers in term dictionary
-----------------------------------------------------------

Key: LUCENE-1278
URL: https://issues.apache.org/jira/browse/LUCENE-1278
Project: Lucene - Java
Issue Type: New Feature
Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Priority: Minor
Attachments: lucene.1278.5.4.2008.patch