[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2012-06-05 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289401#comment-13289401
 ] 

Simon Willnauer commented on LUCENE-3108:
-

{quote}Hi, Simon. Can doc values be optional? I am looking into 
org.apache.lucene.codecs.DocValuesConsumer#merge and see that the logic assumes 
that for every docId we have an existing value. Or do we use the default value 
instead?
{quote}

Hey, DocValues are dense and assume a value for each document. However, if you 
don't enable DocValues on a field, nothing is stored for it, so you only store 
them for certain fields. If you have just a small set of repeated values, 
DocValues can store them efficiently and deduplicate them, if you are concerned 
about that.
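
To illustrate the deduplication idea mentioned above, here is a minimal sketch, assuming a layout where each distinct value is stored once and every document only keeps a small ordinal pointing at it. The class and method names (DedupValuesSketch, add, get) are hypothetical and are not the Lucene API; the real deref/sorted variants may be organized differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one ordinal per document, each distinct value stored once. */
final class DedupValuesSketch {
    private final Map<String, Integer> ordByValue = new HashMap<>();
    private final List<String> uniqueValues = new ArrayList<>();
    private final List<Integer> ordPerDoc = new ArrayList<>();

    /** Adds the value for the next document; documents are added in docID order. */
    void add(String value) {
        Integer ord = ordByValue.get(value);
        if (ord == null) {
            // First time we see this value: store it once and assign a new ordinal.
            ord = uniqueValues.size();
            ordByValue.put(value, ord);
            uniqueValues.add(value);
        }
        ordPerDoc.add(ord);
    }

    /** Resolves the value of a document through its ordinal. */
    String get(int docID) {
        return uniqueValues.get(ordPerDoc.get(docID));
    }

    int uniqueCount() {
        return uniqueValues.size();
    }

    public static void main(String[] args) {
        DedupValuesSketch values = new DedupValuesSketch();
        values.add("red");
        values.add("blue");
        values.add("red");
        System.out.println(values.uniqueCount() + " unique values for 3 documents");
    }
}
```

With few distinct values, the per-document cost is only the ordinal, which is why a dense representation can still be compact.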

In general, you should rather ask these kinds of questions on the main dev 
mailing list.

simon

 Land DocValues on trunk
 ---

 Key: LUCENE-3108
 URL: https://issues.apache.org/jira/browse/LUCENE-3108
 Project: Lucene - Java
  Issue Type: Task
  Components: core/index, core/search, core/store
Affects Versions: CSF branch, 4.0
Reporter: Simon Willnauer
Assignee: Simon Willnauer
 Fix For: 4.0

 Attachments: LUCENE-3108.patch, LUCENE-3108.patch, LUCENE-3108.patch, 
 LUCENE-3108.patch, LUCENE-3108.patch, LUCENE-3108_CHANGES.patch


 It's time to move another feature from branch to trunk. I want to start this 
 process now while a couple of issues still remain on the branch. Currently I 
 am down to a single nocommit (javadocs on DocValues.java) and a couple of 
 testing TODOs (explicit multithreaded tests and unoptimized with deletions) 
 but I think those are not worth separate issues so we can resolve them as we 
 go. 
 The already created issues (LUCENE-3075 and LUCENE-3074) should not block 
 this process here IMO, we can fix them once we are on trunk. 
 Here is a quick feature overview of what has been implemented:
  * DocValues implementations for Ints (based on PackedInts), Float 32 / 64, 
 Bytes (fixed / variable size each in sorted, straight and deref variations)
  * Integration into Flex-API, Codec provides a 
 PerDocConsumer->DocValuesConsumer (write) / PerDocValues->DocValues (read) 
  * Enabled by default in all codecs except PreFlex
  * Follows other flex-API patterns, e.g. non-segment readers throw UOE, forcing 
 MultiPerDocValues if on DirReader etc.
  * Integration into IndexWriter, FieldInfos etc.
  * Random-testing enabled via RandomIW - injecting random DocValues into 
 documents
  * Basic checks in CheckIndex (which runs after each test)
  * FieldComparator for int and float variants (Sorting, currently directly 
 integrated into SortField, this might go into a separate DocValuesSortField 
 eventually)
  * Extended TestSort for DocValues
  * RAM-Resident random access API plus on-disk DocValuesEnum (currently only 
 sequential access) - Source.java / DocValuesEnum.java
  * Extensible Cache implementation for RAM-Resident DocValues (by-default 
 loaded into RAM only once and freed once IR is closed) - SourceCache.java
  
 PS: Currently the RAM resident API is named Source (Source.java) which seems 
 too generic. I think we should rename it into RamDocValues or something like 
 that, suggestion welcome!   
 Any comments, questions (rants :)) are very much appreciated.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2012-06-05 Thread Aliaksandr Zhuhrou (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13289667#comment-13289667
 ] 

Aliaksandr Zhuhrou commented on LUCENE-3108:


Thank you. What made me ask this question is that in 
org.apache.lucene.codecs.lucene40.values.FixedStraightBytesImpl.FixedBytesWriterBase#add 
we have logic that handles the case when (lastDocID+1 < docID), so I assumed 
that docIds may have gaps greater than 1.
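
As a rough illustration of how a dense writer can still accept docIDs with gaps, here is a minimal sketch, assuming that skipped documents are simply padded with a default value (all-zero bytes here). DenseFixedBytesSketch and its methods are hypothetical names and this is not the code in FixedStraightBytesImpl; it only shows the general technique of filling the range between lastDocID and docID.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch: fixed-size values, one slot per document, gaps zero-filled. */
final class DenseFixedBytesSketch {
    private final int size;
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int lastDocID = -1;

    DenseFixedBytesSketch(int size) {
        this.size = size;
    }

    /** Adds a value; docIDs must increase but may skip documents. */
    void add(int docID, byte[] value) {
        if (value.length != size) {
            throw new IllegalArgumentException("expected " + size + " bytes");
        }
        if (docID <= lastDocID) {
            throw new IllegalArgumentException("docIDs must be added in increasing order");
        }
        fillUpTo(docID);
        out.write(value, 0, size);
        lastDocID = docID;
    }

    /** Pads trailing documents without a value and returns the dense bytes. */
    byte[] finish(int docCount) {
        fillUpTo(docCount);
        lastDocID = Math.max(lastDocID, docCount - 1);
        return out.toByteArray();
    }

    /** Writes the default (zero) value for every document in (lastDocID, docID). */
    private void fillUpTo(int docID) {
        byte[] defaultValue = new byte[size];
        for (int doc = lastDocID + 1; doc < docID; doc++) {
            out.write(defaultValue, 0, size);
        }
    }

    public static void main(String[] args) {
        DenseFixedBytesSketch writer = new DenseFixedBytesSketch(2);
        writer.add(0, new byte[] {1, 2});
        writer.add(3, new byte[] {5, 6});
        System.out.println(writer.finish(5).length + " bytes written");
    }
}
```

The result stays dense (every document has a slot), while the caller is free to skip documents that have no value.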

bq. in general you should rather ask these kind of questions on the main dev 
mailing list.

Sure. 




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-06-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046451#comment-13046451
 ] 

Michael McCandless commented on LUCENE-3108:


I did another review here -- I think it's ready to land on trunk!  Nice work 
Simon!




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-06-09 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046473#comment-13046473
 ] 

Uwe Schindler commented on LUCENE-3108:
---

One small issue:

There seems to be a merge missing in file TestIndexSplitter, the changes in 
there are unrelated, so this reverts a commit on trunk for improving tests.

The problem with the README.txt is already fixed.

...still digging




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-06-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046474#comment-13046474
 ] 

Simon Willnauer commented on LUCENE-3108:
-

bq. There seems to be a merge missing in file TestIndexSplitter, the changes in 
there are unrelated, so this reverts a commit on trunk for improving tests.
Fixed in revision 1133794.

thanks uwe!




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-06-09 Thread Ryan McKinley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13046537#comment-13046537
 ] 

Ryan McKinley commented on LUCENE-3108:
---

+1   This looks great.  

To avoid more svn work, I think committing sooner is better than later.




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-06-03 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13043446#comment-13043446
 ] 

Simon Willnauer commented on LUCENE-3108:
-

Hey folks,
we are ready for the final review rounds here, I resolved the naming conflict 
by renaming DocValues to IndexDocValues, fixed all the outstanding 
documentation issues and added a fixed ints impl that automatically switches 
over to fixed int/long if packed ints can not handle the range of the values in 
a field.
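
A minimal sketch of what such a switch could look like, assuming the decision is based on whether the span between the smallest and largest value of a field still fits into a non-negative long. The names (IntEncodingSketch, Encoding, choose, bitsRequired) are hypothetical and the actual criterion used on the branch may differ.

```java
/** Hypothetical sketch: pick packed storage when the value range allows it. */
final class IntEncodingSketch {
    enum Encoding { PACKED, FIXED_64 }

    /** Chooses an encoding for a field whose values lie in [min, max]. */
    static Encoding choose(long min, long max) {
        if (min > max) {
            throw new IllegalArgumentException("min must be <= max");
        }
        // max - min overflows to a negative number when the span needs 64 bits;
        // in that case values cannot be rebased to a non-negative packed range.
        long delta = max - min;
        return delta < 0 ? Encoding.FIXED_64 : Encoding.PACKED;
    }

    /** Bits needed to store a non-negative delta (at least 1 bit). */
    static int bitsRequired(long delta) {
        if (delta < 0) {
            throw new IllegalArgumentException("delta must be non-negative");
        }
        return delta == 0 ? 1 : 64 - Long.numberOfLeadingZeros(delta);
    }

    public static void main(String[] args) {
        System.out.println(choose(0, 100) + " with " + bitsRequired(100) + " bits per value");
        System.out.println(choose(Long.MIN_VALUE, Long.MAX_VALUE));
    }
}
```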

I am preparing a review patch now.




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-20 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036798#comment-13036798
 ] 

Simon Willnauer commented on LUCENE-3108:
-

bq. It is tricky... but, eg, when someone does SortField(title, 
SortField.STRING), which cache (DV or FC) should we populate?

I think we should have a specialized sort field eventually. FCSortField / 
DVSortField?

bq. Both ValueSource and DocValues have long been used by function queries.

Suggestions welcome - nothing is fixed yet so we should find non-conflicting 
names. Maybe we can call it o.a.l.index.columns.Columns and 
o.a.l.index.columns.ColumnsEnum / ColumnsArray (instead of source) 


bq. OK, but I think if we make a straight longs impl (ie no packed ints at 
all) then we can handle all long values? But in that case we'd require the app 
to pick a sentinel to mean unset?

yes, I will open an issue.




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-19 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036284#comment-13036284
 ] 

Michael McCandless commented on LUCENE-3108:


{quote}
bq. How come codecID changed from String to int on the branch?

due to DocValues I need to compare the ID to certain fields to see for
which fields I stored DocValues and need to open them. I always had to parse
the given string, which is kind of odd. I think it's more natural to
have the same datatype on FieldInfo, SegmentCodecs and eventually in
the Codec#files() method. Making a string out of it is way simpler /
less risky than parsing IMO.
{quote}

OK that sounds great.

{quote}
bq. Can SortField somehow detect whether the needed field was stored in FC vs DV

This is tricky though. You can have a DV field that is indexed too, so it's hard 
to tell if we can reliably do it. If we can't make it reliable I think we 
should not do it at all.
{quote}

It is tricky... but, eg, when someone does SortField(title,
SortField.STRING), which cache (DV or FC) should we populate?

{quote}
bq. Should we rename oal.index.values.Type -> .ValueType?

agreed. I also think we should rename Source but I don't have a good name yet. 
Any idea?
{quote}

ValueSource?  (conflicts w/ FQs though) Though, maybe we can just
refer to it as DocValues.Source, then it's clear?

{quote}
bq. Since we dynamically reserve a value to mean unset, does that mean there 
are some datasets we cannot index?

Again, tricky! The quick answer is yes, but we can't do that anyway since I 
have not normalized the range to be 0 based since PackedInts doesn't allow 
negative values. So the range we can store is (2^63)-1. So essentially with 
the current impl we can store (2^63)-2 and the max value is Long#MAX_VALUE-1. 
Currently there is no assert for this, which is needed I think, but to get around 
this we need a different impl I think, or am I missing something?
{quote}

OK, but I think if we make a straight longs impl (ie no packed ints at all) 
then we can handle all long values?  But in that case we'd require the app to 
pick a sentinel to mean unset?
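
A minimal sketch of that idea, assuming a plain long[] with one slot per document and a sentinel chosen by the application to mean unset. StraightLongsSketch and its methods are hypothetical names for illustration only; nothing like this exists on the branch yet.

```java
import java.util.Arrays;

/** Hypothetical sketch: straight longs, the application picks the unset sentinel. */
final class StraightLongsSketch {
    private final long[] values;
    private final long unset;

    StraightLongsSketch(int docCount, long unsetSentinel) {
        this.values = new long[docCount];
        this.unset = unsetSentinel;
        // Every document starts out unset.
        Arrays.fill(values, unsetSentinel);
    }

    /** Stores a value; the sentinel itself cannot be stored as a real value. */
    void set(int docID, long value) {
        if (value == unset) {
            throw new IllegalArgumentException("value collides with the unset sentinel");
        }
        values[docID] = value;
    }

    boolean hasValue(int docID) {
        return values[docID] != unset;
    }

    long get(int docID) {
        return values[docID];
    }

    public static void main(String[] args) {
        StraightLongsSketch longs = new StraightLongsSketch(3, Long.MIN_VALUE);
        longs.set(1, Long.MAX_VALUE);
        System.out.println(longs.hasValue(0) + " " + longs.hasValue(1));
    }
}
```

The trade-off is that no value is reserved by the index format itself, but the application has to give up exactly one long value of its own choosing.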





[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-19 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13036290#comment-13036290
 ] 

Yonik Seeley commented on LUCENE-3108:
--

bq. ValueSource? (conflicts w/ FQs though) Though, maybe we can just refer to 
it as DocValues.Source, then it's clear?

Both ValueSource and DocValues have long been used by function queries.




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-18 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035234#comment-13035234
 ] 

Simon Willnauer commented on LUCENE-3108:
-

Mike thanks for the review!

bq. Phew been a long time since I looked at this branch!

It's been changing :)

bq. We have some stale jdocs that reference .setIntValue methods (they
are now .setInt)
True, thanks. I will fix.

bq. Hmm do we have byte ordering problems? Ie, if I write index on
machine with little-endian but then try to load values on
big-endian...? I think we're OK (we seem to always use
IndexOutput.writeInt, and we convert float-to-raw-int-bits using
java's APIs)?

We are ok here since we write big-endian (enforced by DataOutput) and read it 
back in as plain bytes. The created ByteBuffer will always use BIG_ENDIAN as 
the default order. I added a comment for this.
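
A minimal sketch of the byte-order point above, using only the JDK. Here 
java.io.DataOutputStream stands in for Lucene's DataOutput (an assumption made 
so the example is self-contained), and the class and method names are 
hypothetical. It shows a float written as raw int bits, most significant byte 
first, and read back through a ByteBuffer whose initial order is BIG_ENDIAN 
regardless of the platform:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianSketch {
    // Writes a float as its raw int bits; writeInt emits the high byte first.
    static byte[] writeFloat(float value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(Float.floatToRawIntBits(value));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the value back from plain bytes. A newly wrapped ByteBuffer
    // is BIG_ENDIAN by default, independent of the native byte order.
    static float readFloat(byte[] data) {
        ByteBuffer buffer = ByteBuffer.wrap(data);
        return Float.intBitsToFloat(buffer.getInt());
    }

    public static void main(String[] args) {
        byte[] data = writeFloat(1.5f);
        System.out.println(ByteBuffer.wrap(data).order() == ByteOrder.BIG_ENDIAN);
        System.out.println(readFloat(data));
    }
}
```

Because neither side consults the native byte order, the bytes written on one 
machine decode to the same value on any other.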

bq. How come codecID changed from String to int on the branch?
Due to DocValues I need to compare the ID against certain fields to see for 
which fields I stored, and need to open, DocValues. I always had to parse the 
given string, which is kind of odd. I think it's more natural to have the same 
datatype on FieldInfo, SegmentCodecs and eventually in the Codec#files() 
method. Making a string out of it is way simpler / less risky than parsing IMO.

bq. What are oal.util.Pair and ParallelArray for?
Legacy, I will remove them.

bq. FloatsRef should state in the jdocs that it's really slicing a
double[]?

yep done!

bq. Can SortField somehow detect whether the needed field was stored
in FC vs DV and pick the right comparator accordingly...? Kind of
like how NumericField can detect whether the ints are encoded as
plain text or as NF? We can open a new issue for this,
post-landing...

This is tricky though. You can have a DV field that is indexed too, so it's 
hard to tell if we can reliably do it. If we can't make it reliable I think we 
should not do it at all.


bq. It looks like we can sort by int/long/float/double pulled from DV,
but not by terms? This is fine for landing... but I think we
should open a post-landing issue to also make FieldComparators for
the Terms cases?

Yeah true. I didn't add a FieldComparator for bytes yet. I think this is post 
landing!

bq. Should we rename oal.index.values.Type -> .ValueType? Just
because... it looks so generic when it's imported & used as Type
somewhere?

agreed. I also think we should rename Source but I don't have a good name yet. 
Any idea?

bq. Since we dynamically reserve a value to mean unset, does that
mean there are some datasets we cannot index? Or... do we tap
into the unused bit of a long, ie the sentinel value can be
negative? But if the data set spans Long.MIN_VALUE to
Long.MAX_VALUE, what do we do...?

This is tricky though. The quick answer is yes, but we can't do that anyway 
since I have not normalized the range to be 0-based and PackedInts doesn't 
allow negative values. So the range we can store is (2^63)-1. Essentially, 
with the current impl, which reserves one value for unset, we can store 
(2^63)-2 and the max value is Long#MAX_VALUE-1. Currently there is no assert 
for this, which is needed I think, but to get around this we need to have a 
different impl I think. Or am I missing something?
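
The arithmetic above can be sketched as a small check. This is only an 
illustration under stated assumptions, not the branch's code: it assumes a 
hypothetical encoding that stores each value as a non-negative offset from the 
minimum and keeps one extra slot free for "unset". RangeSketch and 
fitsWithSentinel are made-up names.

```java
class RangeSketch {
    // Returns true if every value in [min, max] (with min <= max) can be
    // stored as a non-negative offset (value - min) while one additional
    // offset stays free to mark "unset".
    static boolean fitsWithSentinel(long min, long max) {
        try {
            // Throws ArithmeticException if the span does not fit in a long.
            long span = Math.subtractExact(max, min);
            // Offset span + 1 is reserved for unset, so it must not overflow.
            return span < Long.MAX_VALUE;
        } catch (ArithmeticException overflow) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsWithSentinel(0L, Long.MAX_VALUE - 1));
        System.out.println(fitsWithSentinel(0L, Long.MAX_VALUE));
        System.out.println(fitsWithSentinel(Long.MIN_VALUE, Long.MAX_VALUE));
    }
}
```

With a minimum of 0 the largest storable value is Long.MAX_VALUE - 1, and a 
data set spanning Long.MIN_VALUE to Long.MAX_VALUE does not fit at all, which 
is the case raised in the question.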

I will make the changes once SVN is writeable again.




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-18 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035386#comment-13035386
 ] 

Simon Willnauer commented on LUCENE-3108:
-

FYI, I ran indexing benchmarks, trunk vs. branch, and they are super close 
together. It's about a 3 sec. difference, with the branch being faster, so 
it's in the noise. I also indexed one docvalues field (floats), which was 
about the same: 2 sec. slower, including merges etc. So we are on the safe 
side that this feature does not influence indexing performance. I didn't 
expect anything else really, since the only difference is a single condition 
in DocFieldProcessor.




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13034645#comment-13034645
 ] 

Michael McCandless commented on LUCENE-3108:


+1, excellent!




[jira] [Commented] (LUCENE-3108) Land DocValues on trunk

2011-05-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-3108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13035017#comment-13035017
 ] 

Michael McCandless commented on LUCENE-3108:


This is an awesome change!

Phew been a long time since I looked at this branch!

Some questions on a quick pass -- still need to iterate/dig deeper:

  * We have some stale jdocs that reference .setIntValue methods (they
are now .setInt)

  * Hmm do we have byte ordering problems?  Ie, if I write index on
machine with little-endian but then try to load values on
big-endian...?  I think we're OK (we seem to always use
IndexOutput.writeInt, and we convert float-to-raw-int-bits using
java's APIs)?

  * Since we dynamically reserve a value to mean unset, does that
mean there are some datasets we cannot index?  Or... do we tap
into the unused bit of a long, ie the sentinel value can be
negative?  But if the data set spans Long.MIN_VALUE to
Long.MAX_VALUE, what do we do...?

  * How come codecID changed from String to int on the branch?

  * What are oal.util.Pair and ParallelArray for?

  * FloatsRef should state in the jdocs that it's really slicing a
double[]?

  * Can SortField somehow detect whether the needed field was stored
in FC vs DV and pick the right comparator accordingly...?  Kind of
like how NumericField can detect whether the ints are encoded as
plain text or as NF?  We can open a new issue for this,
post-landing...

  * It looks like we can sort by int/long/float/double pulled from DV,
but not by terms?  This is fine for landing... but I think we
should open a post-landing issue to also make FieldComparators for
the Terms cases?

  * Should we rename oal.index.values.Type -> .ValueType?  Just
because... it looks so generic when it's imported & used as Type
somewhere?

