[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-07-22 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12615611#action_12615611
 ] 

Jason Rutherglen commented on LUCENE-1278:
--

In order for the proposal I mentioned to work, DocumentsWriter.appendPostings 
needs to be changed to store the docs in an IntArrayList or something of the 
sort, then decide where to store the postings.  

I started working on LUCENE-1292 to address this problem outside of reworking 
core Lucene.  LUCENE-1278 only addresses half of my problem.  I also want 
realtime updates to an in memory term index.  The most efficient way to achieve 
this is what is outlined in LUCENE-1292.

 Add optional storing of document numbers in term dictionary
 ---

 Key: LUCENE-1278
 URL: https://issues.apache.org/jira/browse/LUCENE-1278
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Priority: Minor
 Attachments: lucene.1278.5.4.2008.patch, 
 lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, 
 lucene.1278.5.7.2008.patch, lucene.1278.5.7.2008.test.patch, 
 TestTermEnumDocs.java


 Add optional storing of document numbers in term dictionary.  String index 
 field cache and range filter creation will be faster.  
 Example read code:
 {noformat}
 TermEnum termEnum = indexReader.terms(TermEnum.LOAD_DOCS);
 do {
   Term term = termEnum.term();
   if (term == null || term.field() != field) break;
   int[] docs = termEnum.docs();
 } while (termEnum.next());
 {noformat}
 Example write code:
 {noformat}
 Document document = new Document();
 document.add(new Field("tag", "dog", Field.Store.YES, 
 Field.Index.UN_TOKENIZED, Field.Term.STORE_DOCS));
 indexWriter.addDocument(document);
 {noformat}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-07-21 Thread Doug Cutting

This also reminds me of the pulsing technique described in:

http://citeseer.ist.psu.edu/cutting90optimizations.html

Doug

eks dev wrote:

It seems someone else had the same idea to inline very short postings into 
the term dictionary (even for an in-memory index) and save one pointer (and a 
seek, in a disk setup)... nice reading

http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf







[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-07-20 Thread Eks Dev (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12615077#action_12615077
 ] 

Eks Dev commented on LUCENE-1278:
-

In light of Mike's comments here (Michael McCandless - 05/May/08 05:33 AM), I 
think it is worth mentioning that I am working on LUCENE-1340, that is, storing 
postings without additional frq info. 

Correct me if I am wrong, the only difference is that the approach with *.frq 
needs one more seek... at the same time, storing docs in the term dict could 
potentially increase term dict size, so we lose some locality.

Your last proposal sounds interesting: inline short postings into the term 
dict, so for short postings (about the size of the offset pointer into *.frq) 
with tf==1 (that is always the case if you use omitTf(true) from LUCENE-1340) 
we spare one seek()... this could be a lot. Also, there is no need to store 
postings into *.frq (this complicates maintenance I guess).




Re: [jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-07-20 Thread eks dev
It seems someone else had the same idea to inline very short postings into 
the term dictionary (even for an in-memory index) and save one pointer (and a 
seek, in a disk setup)... nice reading

http://www.siam.org/proceedings/alenex/2008/alx08_01transierf.pdf







[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598770#action_12598770
 ] 

Jason Rutherglen commented on LUCENE-1278:
--

Have a new patch that handles deleted docs but realized that returning 
DocIdSetIterator is not needed.  This implementation can integrate with 
TermDocs transparently.  The issue is then whether to keep the 
Fieldable.isStoreTermDocs or make the implementation a default for untokenized 
fields.  For untokenized fields, this would mean not having to store the docs 
in the segment.frq file.  




[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-21 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12598793#action_12598793
 ] 

Jason Rutherglen commented on LUCENE-1278:
--

Thought of some simple logic for this that will make it work automatically with 
no user interaction and no API additions.

If the number of docs containing the term is less than or equal to the TermDocs 
skip interval, and the term frequency for each doc is 1, then the docs should be 
stored in segment.tis.  Otherwise they should be stored as usual in segment.frq.  

The problem is knowing whether the logic is true in the 
DocumentsWriter.appendPostings method.  
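
The rule above can be sketched in isolation. This is only an illustration of the decision, not code from any of the attached patches: the class and method names (InlinePostingsSketch, BufferedPostings, chooseTarget) are made up for the example, and it assumes the postings for a term have already been buffered, which is exactly the part that appendPostings does not do today.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: buffer one term's postings, then choose where to write them. */
final class InlinePostingsSketch {

  /** Where the doc list would go; the names mirror the index files discussed above. */
  enum Target { TERM_DICT_TIS, FREQ_FILE_FRQ }

  /** Collects (doc, freq) pairs for a single term before anything is written. */
  static final class BufferedPostings {
    private final List<Integer> docs = new ArrayList<>();
    private boolean allFreqsAreOne = true;

    void add(int doc, int freq) {
      docs.add(doc);
      if (freq != 1) {
        allFreqsAreOne = false;
      }
    }

    int docFreq() {
      return docs.size();
    }

    /** Inline only when the list is short and carries no frequency information. */
    Target chooseTarget(int skipInterval) {
      boolean shortEnough = docFreq() <= skipInterval;
      return (shortEnough && allFreqsAreOne) ? Target.TERM_DICT_TIS : Target.FREQ_FILE_FRQ;
    }
  }

  public static void main(String[] args) {
    BufferedPostings primaryKey = new BufferedPostings();
    primaryKey.add(42, 1);
    // 16 is only an illustrative skip interval for this sketch.
    System.out.println(primaryKey.chooseTarget(16));
  }
}
```

The cost of this approach is the buffering itself: the decision can only be made after the whole list for the term has been seen.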




[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-14 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12596762#action_12596762
 ] 

Paul Elschot commented on LUCENE-1278:
--

Some comments on the 5.7.2008 patch:

The test with 7.6 times speedup for very few docs per term makes me wonder why 
this never showed up as a performance problem before. It certainly shows an 
advantage of flexible indexing for the case in which the within document term 
frequencies are not needed (for example primary/foreign keys, which normally 
end up in a keyword field.)

In the patch, DocIdSetIterator is used in the org.apache.lucene.index package, 
so it would be a good idea to move it from o.a.l.search to o.a.l.index or to 
o.a.l.util to avoid a circular dependency involving the index and search 
packages. As DocIdSetIterator is not yet released, this move should be no 
problem.

The DocIdSetReader class in the patch has so much code in common with 
SortedVIntList that it might be better to merge the two into a single one, and 
try and refactor common code into new methods there.
That would also be an easy way to get rid of the unsupported skipTo() operation.






[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594744#action_12594744
 ] 

Jason Rutherglen commented on LUCENE-1278:
--

Implemented returning DocIdSetIterator; however, when running 
org.apache.lucene.search.TestSort, remote search fails.  Reading the docs from a 
DocIdSetIterator directly from the file is troublesome due to the way TermEnum 
is designed with the other parts of Lucene.  My own basic unit test works; 
however, TestSort does not, and it is probably due to the file pointer not being 
at the correct position during enumeration.  

Perhaps there is a way for the int array to work?  

Or is it best to create a separate file that is very similar to the term 
dictionary file but only stores terms and docs?




[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-06 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594761#action_12594761
 ] 

Jason Rutherglen commented on LUCENE-1278:
--

What if the int array is saved in TermInfo only if the docfreq was below a 
certain threshold?  Otherwise on int[] docs = TermEnum.docs() the docs are 
loaded from the file.  This solves the main issue with the int array, the 
potential for high numbers of docs being stored in ram.
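
A rough sketch of that idea, under stated assumptions: ThresholdedTermDocs is a made-up name standing in for the relevant part of TermInfo, and the Supplier stands in for whatever code reads the doc list from the index file. Neither exists in Lucene or in the patches.

```java
import java.util.function.Supplier;

/** Hypothetical sketch: keep short doc lists resident, load long ones on demand. */
final class ThresholdedTermDocs {
  private final int[] resident;         // non-null only when docFreq is below the threshold
  private final Supplier<int[]> loader; // stand-in for reading the docs from the file

  ThresholdedTermDocs(int docFreq, int threshold, Supplier<int[]> loader) {
    this.loader = loader;
    // Short lists are read once and kept; long lists are never held here.
    this.resident = docFreq < threshold ? loader.get() : null;
  }

  boolean isResident() {
    return resident != null;
  }

  /** Returns the cached array for short lists, otherwise reloads from the file. */
  int[] docs() {
    return resident != null ? resident : loader.get();
  }
}
```

With this shape the RAM held per term is bounded by the threshold, at the price of a file read on every docs() call for frequent terms.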




[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-05 Thread Paul Elschot (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594206#action_12594206
 ] 

Paul Elschot commented on LUCENE-1278:
--

Would there be any performance measurements for this? It might be quite good 
for terms that occur in very many documents, an area in which some improvement 
is possible I think.
Btw, for this case it might also be good to use a SortedVIntList instead of an 
IntArrayList.
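
To illustrate why a VInt-coded sorted list can be smaller than a plain int list, here is a minimal standalone sketch of gap encoding with variable-length ints. It is not the SortedVIntList source; DeltaVIntSketch and its methods are names invented for this example, and it assumes the doc IDs are non-negative and already sorted ascending.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of storing sorted doc IDs as VInt-coded gaps. */
final class DeltaVIntSketch {

  /** Encodes ascending doc IDs as gaps, 7 data bits per byte, high bit set when more bytes follow. */
  static byte[] encode(int[] sortedDocs) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int previous = 0;
    for (int doc : sortedDocs) {
      int gap = doc - previous; // non-negative because the input is sorted
      while ((gap & ~0x7F) != 0) {
        out.write((gap & 0x7F) | 0x80);
        gap >>>= 7;
      }
      out.write(gap);
      previous = doc;
    }
    return out.toByteArray();
  }

  /** Decodes count doc IDs by summing the gaps back up. */
  static int[] decode(byte[] bytes, int count) {
    int[] docs = new int[count];
    int pos = 0;
    int previous = 0;
    for (int i = 0; i < count; i++) {
      int gap = 0;
      int shift = 0;
      byte b;
      do {
        b = bytes[pos++];
        gap |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      previous += gap;
      docs[i] = previous;
    }
    return docs;
  }
}
```

For example, the four doc IDs 5, 130, 131 and 100000 take 6 bytes in this encoding instead of 16 bytes as an int[].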

I had a look at today's patch, but I stopped at DocumentsWriter because it 
contains a lot of layout changes, so it's hard to focus on the functional 
differences.

Are there any index format changes involved in this?




[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594219#action_12594219
 ] 

Michael McCandless commented on LUCENE-1278:


{quote}
Is there a way to know the number of documents for a term in 
DocumentsWriter.appendPostings before running through all of them?
{quote}

I don't think so.  You have to run through the list.






[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594220#action_12594220
 ] 

Michael McCandless commented on LUCENE-1278:


{quote}
I had a look at today's patch, but I stopped at DocumentsWriter because it 
contains a lot of layout changes, so it's hard to focus on the functional 
differences.
{quote}

I also stopped at DocumentsWriter: it seems like nearly all the
changes are cosmetic.  SegmentTermEnum is also hard to read.

In general it's best to not make cosmetic changes (moving around
import lines, changing whitespace, re-justifying whole paragraphs in
javadocs, etc.) at the same time as a real change, when possible.  I
do admit there is a strong temptation ;)

Also, indentation should be two spaces, not tab.  A number of sources
were changed to tab in the patch.






[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-05 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594225#action_12594225
 ] 

Michael McCandless commented on LUCENE-1278:



It looks like the .tii file is also storing the int[] docIDs (as
inlined byte blob)?  I think that shouldn't be necessary?

This change adds a posting list like the frq file, except that it
stores only docIDs (no freq information), is stored inline in the term
dict, and includes a reader that materializes the full doc list as an
int[] instead of offering only an iterator-like interface (nextDoc()).

I think these changes would fit cleanly into what's been proposed for
flexible indexing.  EG, case 1a talks about storing only docID in a
posting list, here:

http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

And recent discussions on the dev list around how to be flexible as to
which index file(s) (one or many) things are stored in, eg:

   http://www.mail-archive.com/java-dev@lucene.apache.org/msg15681.html

should allow you to store this data inlined into the terms dict, or as
a separate file.

Some other initial comments/questions:

  * I think this would bloat the index because the docIDs are being
    double stored (in the terms dict, and in the frq file).  Would
    you propose changing the frq file to not store the docID when the
    term dict is doing so?

  * Why store the byte blob in the term dict, and not a separate (new)
    index file?  We lose locality for cases where one wants to iterate
    through terms and not load these docs (eg RangeQuery).

  * Could you, instead, make a reader that reads in the full byte blob
    from the frq file for a term, and then processes that into the
    int[]?  This would require no change to indexing or the index
    format, and wouldn't waste space double-storing the docIDs.

  * I'm worried how well this scales up.  For very common terms,
    allocating, then decoding and holding entirely in RAM the full list
    of docIDs can become extremely costly.  Also, I don't have a clear
    sense of how apps would use the returned int[].  For example,
    would the int[] for many terms need to remain resident at the same
    time?  (Eg when running a RangeQuery).  If so, that compounds the
    scale challenge.
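
The third point (read a term's byte blob from the frq file and turn it into an int[]) could look roughly like the sketch below. It assumes the frq layout as I understand it from the file formats documentation: each entry is a VInt whose value shifted right by one is the doc delta, with the low bit set when the frequency is 1 and an explicit frequency VInt following otherwise. FrqDocListSketch and readDocs are names made up for the example, and the blob is passed in as a plain byte[] rather than read through IndexInput.

```java
/** Hypothetical sketch: materialize the doc IDs of one term from its frq bytes. */
final class FrqDocListSketch {

  /** Decodes docFreq entries, keeping the doc IDs and discarding the frequencies. */
  static int[] readDocs(byte[] frqBlob, int docFreq) {
    int[] docs = new int[docFreq];
    int[] pos = {0}; // shared read position into frqBlob
    int doc = 0;
    for (int i = 0; i < docFreq; i++) {
      int docCode = readVInt(frqBlob, pos);
      doc += docCode >>> 1;      // upper bits hold the delta from the previous doc
      if ((docCode & 1) == 0) {
        readVInt(frqBlob, pos);  // low bit clear: an explicit frequency follows, skip it
      }
      docs[i] = doc;
    }
    return docs;
  }

  /** Reads one variable-length int: 7 data bits per byte, high bit means "more". */
  private static int readVInt(byte[] bytes, int[] pos) {
    int value = 0;
    int shift = 0;
    byte b;
    do {
      b = bytes[pos[0]++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }
}
```

This keeps the index format unchanged, but it still allocates the whole int[] per term, so the scaling concern in the last point applies to it as well.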






[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-05 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594231#action_12594231
 ] 

Jason Rutherglen commented on LUCENE-1278:
--

Storing the docs is off by default, so the index only grows if the user opts 
in.  The byte blob allows skipping the docs when loaddocs is false.  Field 
cache and range query loading are very slow because of the two seeks per term 
(one for the TermEnum, then one for the TermDocs).  If the docs were stored in 
a separate file, the terms would have to be stored redundantly.  

A field cache example:

protected Object createValue(IndexReader reader, Object entryKey)
    throws IOException {
  Entry entry = (Entry) entryKey;
  String field = entry.field;
  IntParser parser = (IntParser) entry.custom;
  final int[] retArray = new int[reader.maxDoc()];
  // Previous approach: a separate TermDocs, costing a second seek per term.
  // TermDocs termDocs = reader.termDocs();
  // TermEnum termEnum = reader.terms(new Term(field, ""));
  // Proposed approach: 'true' asks the TermEnum to load the stored docs.
  TermEnum termEnum = reader.terms(new Term(field, ""), true);
  try {
    do {
      Term term = termEnum.term();
      if (term == null || term.field() != field) break;
      int termval = parser.parseInt(term.text());
      // Doc numbers come straight from the term dictionary.
      int[] docs = termEnum.docs();
      for (int x = 0; x < docs.length; x++) {
        retArray[docs[x]] = termval;
      }
      // termDocs.seek(termEnum);
      // while (termDocs.next()) {
      //   retArray[termDocs.doc()] = termval;
      // }
    } while (termEnum.next());
  } finally {
    // termDocs.close();
    termEnum.close();
  }
  return retArray;
}

 Add optional storing of document numbers in term dictionary
 ---

 Key: LUCENE-1278
 URL: https://issues.apache.org/jira/browse/LUCENE-1278
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Priority: Minor
 Attachments: lucene.1278.5.4.2008.patch, 
 lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch





[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-05 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594317#action_12594317
 ] 

Jason Rutherglen commented on LUCENE-1278:
--

Returning a DocIdSetIterator from TermEnum is a good idea; I will implement 
decoding the bytes directly from the file.

Flexible indexing is also a good idea; I will build on it once it is completed.
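
As a rough illustration of decoding doc IDs lazily from bytes instead of 
materializing an int[], here is a minimal sketch.  It assumes the doc numbers 
are stored as ascending gaps in Lucene's VInt style (7 data bits per byte, 
high bit set when more bytes follow).  That layout, the class name 
DeltaVIntDocIterator, and its methods are assumptions made for the example; 
this is not Lucene's DocIdSetIterator and not the code in the attached patches.

```java
import java.io.ByteArrayOutputStream;
import java.util.NoSuchElementException;

// Hypothetical sketch: iterates doc IDs stored as VInt-encoded gaps.
final class DeltaVIntDocIterator {
  private final byte[] blob; // assumed layout: one VInt gap per doc
  private int pos = 0;       // read position in the blob
  private int doc = 0;       // last doc ID returned

  DeltaVIntDocIterator(byte[] blob) {
    this.blob = blob;
  }

  boolean hasNext() {
    return pos < blob.length;
  }

  // Decodes the next gap and returns the next doc ID.
  int nextDoc() {
    if (!hasNext()) throw new NoSuchElementException();
    int b = blob[pos++];
    int gap = b & 0x7F;
    // The high bit marks a continuation byte.
    for (int shift = 7; (b & 0x80) != 0; shift += 7) {
      b = blob[pos++];
      gap |= (b & 0x7F) << shift;
    }
    doc += gap;
    return doc;
  }

  // Encodes ascending doc IDs as VInt gaps (only here to make the demo complete).
  static byte[] encode(int[] sortedDocs) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int d : sortedDocs) {
      int gap = d - prev;
      while ((gap & ~0x7F) != 0) {
        out.write((gap & 0x7F) | 0x80);
        gap >>>= 7;
      }
      out.write(gap);
      prev = d;
    }
    return out.toByteArray();
  }
}
```

A caller would loop with while (it.hasNext()) { int doc = it.nextDoc(); ... }, 
so only the encoded bytes stay resident rather than a full int[] per term.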

 Add optional storing of document numbers in term dictionary
 ---

 Key: LUCENE-1278
 URL: https://issues.apache.org/jira/browse/LUCENE-1278
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Priority: Minor
 Attachments: lucene.1278.5.4.2008.patch, 
 lucene.1278.5.5.2008.2.patch, lucene.1278.5.5.2008.patch, 
 TestTermEnumDocs.java





[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-04 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594127#action_12594127
 ] 

Jason Rutherglen commented on LUCENE-1278:
--

Is there a way to know the number of documents for a term in 
DocumentsWriter.appendPostings before iterating through all of them?  The patch 
currently collects them in a LinkedList, which is not optimal.  If the count is 
not available up front, I will implement a growable int array instead.
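
For illustration, a growable int array of the kind mentioned above could look 
roughly like this.  The class name GrowableIntArray and its methods are 
hypothetical; it is a sketch of the general technique (doubling a backing 
int[] when it fills up), not code from Lucene or from the attached patch.

```java
import java.util.Arrays;

// Hypothetical helper: collects doc numbers without knowing the count up front.
final class GrowableIntArray {
  private int[] values = new int[8];
  private int size = 0;

  // Appends a value, doubling the backing array when it is full.
  void add(int value) {
    if (size == values.length) {
      values = Arrays.copyOf(values, values.length * 2);
    }
    values[size++] = value;
  }

  int size() {
    return size;
  }

  int get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + size);
    }
    return values[index];
  }

  // Returns a copy trimmed to the number of values actually added.
  int[] toArray() {
    return Arrays.copyOf(values, size);
  }
}
```

Compared with a linked list of boxed Integers, this keeps the doc numbers in 
one contiguous int[] and avoids a per-element node object, at the cost of an 
occasional array copy when the capacity doubles.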

 Add optional storing of document numbers in term dictionary
 ---

 Key: LUCENE-1278
 URL: https://issues.apache.org/jira/browse/LUCENE-1278
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Affects Versions: 2.3.1
Reporter: Jason Rutherglen
Priority: Minor
 Attachments: lucene.1278.5.4.2008.patch





[jira] Commented: (LUCENE-1278) Add optional storing of document numbers in term dictionary

2008-05-04 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12594128#action_12594128
 ] 

Jason Rutherglen commented on LUCENE-1278:
--

Test cases are being worked on.
