[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-04-25 Thread Lance Norskog (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025033#comment-13025033
 ] 

Lance Norskog commented on LUCENE-2186:
---

What's the current status on this? 

 First cut at column-stride fields (index values storage)
 

 Key: LUCENE-2186
 URL: https://issues.apache.org/jira/browse/LUCENE-2186
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: CSF branch, 4.0

 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, 
 LUCENE-2186.patch, LUCENE-2186.patch, mem.py


 I created an initial basic impl for storing index values (ie
 column-stride value storage).  This is still a work in progress... but
 the approach looks compelling.  I'm posting my current status/patch
 here to get feedback/iterate, etc.
 The code is standalone now, and lives under new package
 oal.index.values (plus some util changes, refactorings) -- I have yet
 to integrate into Lucene so eg you can mark that a given Field's value
 should be stored into the index values, sorting will use these values
 instead of field cache, etc.
 It handles 3 types of values:
   * Six variants of byte[] per doc, all combinations of fixed vs
 variable length, and stored either straight (good for eg a
 title field), deref (good when many docs share the same value,
 but you won't do any sorting) or sorted.
   * Integers (variable bit precision used as necessary, ie this can
 store byte/short/int/long, and all precisions in between)
   * Floats (4 or 8 byte precision)
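
 To illustrate the variable-bit-precision idea for integers, here is a minimal
 sketch. It is not the code from the patch and not Lucene's PackedInts; the
 class name PackedLongsSketch and its methods are hypothetical, and it assumes
 non-negative values to stay short.

```java
/**
 * Hypothetical sketch of variable-bit-precision storage: every value uses the
 * same number of bits, chosen from the largest value, inside one shared long[].
 * Not Lucene's PackedInts; non-negative values only, to keep the sketch short.
 */
final class PackedLongsSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedLongsSketch(long[] values) {
        long seen = 0;
        for (long v : values) {
            if (v < 0) throw new IllegalArgumentException("non-negative values only");
            seen |= v; // the OR has the same highest set bit as the maximum
        }
        bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(seen));
        mask = (1L << bitsPerValue) - 1; // safe: bitsPerValue is at most 63 here
        long totalBits = (long) values.length * bitsPerValue;
        blocks = new long[(int) ((totalBits + 63) >>> 6)];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            blocks[block] |= values[i] << shift;
            if (shift + bitsPerValue > 64) {
                // the value straddles two longs: spill the high bits over
                blocks[block + 1] |= values[i] >>> (64 - shift);
            }
        }
    }

    int bitsPerValue() { return bitsPerValue; }

    int blockCount() { return blocks.length; }

    /** Random access by docID, as with the get() reading API described here. */
    long get(int docID) {
        long bitPos = (long) docID * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            v |= blocks[block + 1] << (64 - shift);
        }
        return v & mask;
    }
}
```

 With 7-bit values, for example, eleven documents fit in two longs instead of
 eleven.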
 String fields are stored as the UTF8 byte[].  This patch adds a
 BytesRef, which does the same thing as flex's TermRef (we should merge
 them).
 This patch also adds basic initial impl of PackedInts (LUCENE-1990);
 we can swap that out if/when we get a better impl.
 This storage is dense (like field cache), so it's appropriate when the
 field occurs in all/most docs.  It's just like field cache, except the
 reading API is a get() method invocation, per document.
 Next step is to do basic integration with Lucene, and then compare
 sort performance of this vs field cache.
 For the sort by String value case, I think RAM usage and GC load of
 this index values API should be much better than field cache, since
 it does not create an object per document (instead shares big long[] and
 byte[] across all docs), and because the values are stored in RAM as
 their UTF8 bytes.
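
 A rough sketch of the "shared byte[] across all docs" layout for
 variable-length values follows. The class StraightBytesSketch, its int
 offsets array, and its methods are assumptions made for illustration, not
 the patch's classes; it only shows why no per-document object is needed for
 comparisons.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch: all per-document strings live as UTF-8 bytes in one
 * shared byte[], addressed through an offsets array.
 */
final class StraightBytesSketch {
    private final byte[] pool;    // every value, concatenated
    private final int[] offsets;  // doc i occupies pool[offsets[i] .. offsets[i + 1])

    StraightBytesSketch(String[] valuesByDoc) {
        offsets = new int[valuesByDoc.length + 1];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int doc = 0; doc < valuesByDoc.length; doc++) {
            byte[] utf8 = valuesByDoc[doc].getBytes(StandardCharsets.UTF_8);
            out.write(utf8, 0, utf8.length);
            offsets[doc + 1] = out.size();
        }
        pool = out.toByteArray();
    }

    /** Compares two docs by unsigned UTF-8 bytes without allocating anything. */
    int compare(int docA, int docB) {
        int aStart = offsets[docA], aLen = offsets[docA + 1] - aStart;
        int bStart = offsets[docB], bLen = offsets[docB + 1] - bStart;
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int diff = (pool[aStart + i] & 0xFF) - (pool[bStart + i] & 0xFF);
            if (diff != 0) return diff;
        }
        return aLen - bLen;
    }

    /** Materializes a String only when the caller actually needs one. */
    String get(int docID) {
        int start = offsets[docID];
        return new String(pool, start, offsets[docID + 1] - start, StandardCharsets.UTF_8);
    }
}
```

 Comparing unsigned UTF-8 bytes orders strings by Unicode code point, so a
 sort can run on the shared array directly.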
 There are abstract Writer/Reader classes.  The current reader impls
 are entirely RAM resident (like field cache), but the API is (I think)
 agnostic, ie, one could make an MMAP impl instead.
 I think this is the first baby step towards LUCENE-1231.  Ie, it
 cannot yet update values, and the reading API is fully random-access
 by docID (like field cache), not like a posting list, though I
 do think we should add an iterator() api (to return flex's DocsEnum)
 -- eg I think this would be a good way to track avg doc/field length
 for BM25/lnu.ltc scoring.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-04-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017822#comment-13017822
 ] 

Simon Willnauer commented on LUCENE-2186:
-

bq. I'm wondering if there is a limitation on whether or not we can randomly 
access the doc values from the underlying Directory implementation, rather than 
need to load all the values directly into the main heap space. This seems 
doable, and if so let me know if I can provide a patch.

The current implementation for accessing docValues that are not loaded into 
memory uses DocIdSetIterator as its parent interface, so it currently works in 
only one direction. Changing this to a random-access, seekable API should not 
be too hard. 
Look at 
http://svn.apache.org/repos/asf/lucene/dev/branches/docvalues/lucene/src/java/org/apache/lucene/index/values/DocValuesEnum.java

simon






[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-04-09 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017886#comment-13017886
 ] 

Jason Rutherglen commented on LUCENE-2186:
--

bq. changing this to a random access seekable API should be not too hard

I think we can offer the option of MMap'ing the field caches, which I think 
will help alleviate OOMs?
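
A minimal sketch of what memory-mapped random access by docID could look like, 
assuming fixed-width 8-byte values in a plain file. The class 
MappedLongValuesSketch and its file layout are hypothetical; this is neither 
the docvalues branch format nor Lucene's Directory API, only java.nio.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical sketch: one big-endian long per document, read through a
 * memory mapping so the values stay outside the Java heap.
 */
final class MappedLongValuesSketch {
    private final MappedByteBuffer buffer;
    private final int docCount;

    private MappedLongValuesSketch(MappedByteBuffer buffer, int docCount) {
        this.buffer = buffer;
        this.docCount = docCount;
    }

    /** Writes the values back to back; doc i starts at byte i * 8. */
    static void write(Path file, long[] values) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Long.BYTES);
        for (long v : values) {
            buf.putLong(v);
        }
        Files.write(file, buf.array());
    }

    /** Maps the whole file read-only; a single mapping is limited to 2 GB. */
    static MappedLongValuesSketch open(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            // The mapping stays valid after the channel is closed.
            return new MappedLongValuesSketch(mapped, (int) (size / Long.BYTES));
        }
    }

    int docCount() { return docCount; }

    /** Random access by docID using an absolute read; no shared position state. */
    long get(int docID) {
        return buffer.getLong(docID * Long.BYTES);
    }
}
```

The operating system pages the file in on demand, so heap usage stays flat 
regardless of how many documents have values.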




[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-04-08 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13017679#comment-13017679
 ] 

Jason Rutherglen commented on LUCENE-2186:
--

I'm wondering if there is a limitation on whether or not we can randomly access 
the doc values from the underlying Directory implementation, rather than need 
to load all the values directly into the main heap space.  This seems doable, 
and if so let me know if I can provide a patch.




[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-01-09 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979395#action_12979395
 ] 

Jason Rutherglen commented on LUCENE-2186:
--

Out of curiosity, re: LUCENE-2312, are we planning on putting CSF into Lucene 
4.x?  What's left to be done?




[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-01-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979404#action_12979404
 ] 

Simon Willnauer commented on LUCENE-2186:
-

bq. Out of curiosity, re: LUCENE-2312, are we planning on putting CSF into 
Lucene 4.x? What's left to be done?
We are very close: about an evening of work is left before this can land on 
trunk. Javadoc is missing here and there, plus some tests for FieldComparators. 
That's it!




[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2011-01-09 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979407#action_12979407
 ] 

Jason Rutherglen commented on LUCENE-2186:
--

bq. we are very close - to land on trunk there is about an evening of work 
left. JDoc is missing here and there plus some tests for FieldComparators - 
that's it!

Nice!  Once it's in I'll try to get started on the RT field cache/doc values, 
which can likely be implemented and tested somewhat independent of the RT 
inverted index.




[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-12-06 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968416#action_12968416
 ] 

Simon Willnauer commented on LUCENE-2186:
-

bq. Whew... this interface is more expansive than I thought it would be (but I 
guess it's really many issues rolled into one... like sorting, caching, etc).
Sorry about that :)

bq. So it seems like DocValuesEnum is the traditional lowest level read the 
index, and Source is a cached version of that?
Not quite. DocValuesEnum provides iterator-based access to the DocValues and 
does not load everything into memory, while Source is entirely RAM-resident and 
offers random access to values, similar to field cache. You can also obtain a 
DocValuesEnum from a Source, since it is already in memory. 
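
To make the two access patterns concrete, here is a small sketch. The names 
ValuesCursorSketch and RamSourceSketch are hypothetical stand-ins, not the 
branch's DocValuesEnum and Source classes, and the real signatures may differ.

```java
/** Hypothetical stand-in for an iterator-style view: it only moves forward. */
interface ValuesCursorSketch {
    int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Moves to target and returns it, or NO_MORE_DOCS if past the end. */
    int advance(int target);

    /** Value of the document the cursor is currently positioned on. */
    long value();
}

/** Hypothetical stand-in for a RAM-resident source with random access. */
final class RamSourceSketch {
    private final long[] values;

    RamSourceSketch(long[] values) {
        this.values = values.clone();
    }

    /** Any docID, in any order, like field cache. */
    long get(int docID) {
        return values[docID];
    }

    /** A forward-only cursor is cheap to offer once everything is in memory. */
    ValuesCursorSketch cursor() {
        return new ValuesCursorSketch() {
            private int doc = -1;

            @Override
            public int advance(int target) {
                if (target <= doc) {
                    throw new IllegalArgumentException("cursor only moves forward");
                }
                doc = target < values.length ? target : NO_MORE_DOCS;
                return doc;
            }

            @Override
            public long value() {
                if (doc < 0 || doc == NO_MORE_DOCS) {
                    throw new IllegalStateException("cursor is not positioned on a document");
                }
                return values[doc];
            }
        };
    }
}
```

A disk-backed implementation could offer the cursor alone, which is why the 
iterator view does not require loading all values.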

bq. A higher level question I have is why we're not reusing the FieldCache for 
caching/sorting?
You mean as a replacement for Source? For caching, what we did here is leave 
it to the user to do the caching, or to cache based on the Source instance. How 
would that relate to FieldCache, in your opinion?





[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-11-30 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965242#action_12965242
 ] 

Simon Willnauer commented on LUCENE-2186:
-

bq. Is there any test cases that cover the new FieldComparators that use the 
doc values?
Not yet; I added it to my internal roadmap for landing on trunk. I just 
committed my latest changes, including a simple test case showing how to use 
the API, plus used-bytes tracking. 

Here is a list of what is missing:

{code}
  /*
   * TODO:
   * Roadmap to land on trunk
   *   - Cut over to a direct API on ValuesEnum vs. ValuesAttribute
   *   - Add documentation for:
   *      - Source and ValuesEnum
   *      - DocValues
   *      - ValuesField
   *      - ValuesAttribute
   *      - Values
   *   - Add @lucene.experimental to all necessary classes
   *   - Try to make ValuesField more lightweight - AttributeSource
   *   - add test for unoptimized case with deletes
   *   - add a test for addIndexes
   *   - split up existing testcases and give them meaningful names
   *   - use consistent naming throughout DocValues
   *      - Values - DocValueType
   *      - PackedIntsImpl - Ints
   *   - run RAT
   *   - add tests for FieldComparator FloatIndexValuesComparator vs.
   *     FloatValuesComparator etc.
   */
{code}

Once I am through with it, I will create a new issue and attach the final patch 
so we can iterate on it if needed.

simon

 First cut at column-stride fields (index values storage)
 

 Key: LUCENE-2186
 URL: https://issues.apache.org/jira/browse/LUCENE-2186
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael McCandless
Assignee: Simon Willnauer
 Fix For: CSF branch, 4.0

 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, 
 LUCENE-2186.patch, LUCENE-2186.patch, mem.py


 I created an initial basic impl for storing index values (ie
 column-stride value storage).  This is still a work in progress... but
 the approach looks compelling.  I'm posting my current status/patch
 here to get feedback/iterate, etc.
 The code is standalone now, and lives under new package
 oal.index.values (plus some util changes, refactorings) -- I have yet
 to integrate into Lucene so eg you can mark that a given Field's value
 should be stored into the index values, sorting will use these values
 instead of field cache, etc.
 It handles 3 types of values:
   * Six variants of byte[] per doc, all combinations of fixed vs
 variable length, and stored either straight (good for eg a
 title field), deref (good when many docs share the same value,
 but you won't do any sorting) or sorted.
   * Integers (variable bit precision used as necessary, ie this can
 store byte/short/int/long, and all precisions in between)
   * Floats (4 or 8 byte precision)
 String fields are stored as the UTF8 byte[].  This patch adds a
 BytesRef, which does the same thing as flex's TermRef (we should merge
 them).
 This patch also adds basic initial impl of PackedInts (LUCENE-1990);
 we can swap that out if/when we get a better impl.
 This storage is dense (like field cache), so it's appropriate when the
 field occurs in all/most docs.  It's just like field cache, except the
 reading API is a get() method invocation, per document.
 Next step is to do basic integration with Lucene, and then compare
 sort performance of this vs field cache.
 For the sort by String value case, I think RAM usage & GC load of
 this index values API should be much better than field cache, since
 it does not create object per document (instead shares big long[] and
 byte[] across all docs), and because the values are stored in RAM as
 their UTF8 bytes.
 There are abstract Writer/Reader classes.  The current reader impls
 are entirely RAM resident (like field cache), but the API is (I think)
 agnostic, ie, one could make an MMAP impl instead.
 I think this is the first baby step towards LUCENE-1231.  Ie, it
 cannot yet update values, and the reading API is fully random-access
 by docID (like field cache), not like a posting list, though I
 do think we should add an iterator() api (to return flex's DocsEnum)
 -- eg I think this would be a good way to track avg doc/field length
 for BM25/lnu.ltc scoring.
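
As a rough illustration of the storage layout described above (values shared in one big byte[] with an offsets array, instead of one object per document), here is a minimal Java sketch. The class name DenseUtf8Column and its methods are hypothetical and are not taken from the patch. It also uses an int[] for offsets where the description mentions a long[], purely to keep the example short, and it builds temporary per-doc arrays at construction time that a real writer would avoid.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the oal.index.values API.
final class DenseUtf8Column {
    private final byte[] bytes;  // all values, concatenated as UTF-8
    private final int[] offsets; // value of doc d is bytes[offsets[d] .. offsets[d + 1])

    DenseUtf8Column(String[] valuesByDoc) {
        int n = valuesByDoc.length;
        offsets = new int[n + 1];
        byte[][] encoded = new byte[n][]; // temporary, only needed while building
        int total = 0;
        for (int doc = 0; doc < n; doc++) {
            encoded[doc] = valuesByDoc[doc].getBytes(StandardCharsets.UTF_8);
            offsets[doc] = total;
            total += encoded[doc].length;
        }
        offsets[n] = total;
        bytes = new byte[total];
        for (int doc = 0; doc < n; doc++) {
            System.arraycopy(encoded[doc], 0, bytes, offsets[doc], encoded[doc].length);
        }
    }

    // Length in bytes of the value stored for docId.
    int length(int docId) {
        return offsets[docId + 1] - offsets[docId];
    }

    // Compares the values of two docs as unsigned bytes, without allocating.
    int compare(int docA, int docB) {
        int a = offsets[docA], aEnd = offsets[docA + 1];
        int b = offsets[docB], bEnd = offsets[docB + 1];
        while (a < aEnd && b < bEnd) {
            int diff = (bytes[a++] & 0xFF) - (bytes[b++] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return (aEnd - a) - (bEnd - b); // the shorter value sorts first
    }

    // Decodes one value; only this call creates a new object.
    String get(int docId) {
        return new String(bytes, offsets[docId], length(docId), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        DenseUtf8Column col = new DenseUtf8Column(new String[] {"banana", "apple", "cherry"});
        System.out.println(col.get(1) + " vs " + col.get(0) + ": " + col.compare(1, 0));
    }
}
```

The point of the sketch is that sorting only needs compare(), which walks the shared array directly, so no String or byte[] is created per document while comparing.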

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-11-26 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935977#action_12935977
 ] 

Michael McCandless commented on LUCENE-2186:


bq. BTW. it is ok to have the same name as a existing field.

It is, usually... but we should add a test to assert this is still the
case for other field + ValuesField?

{quote}
bq. I'm thinking it's really important now to carry over the same FieldInfos 
from the last segment when opening the writer (LUCENE-1737)... because hitting 
that IllegalStateExc during merge is a trap.

I think that should not block us from moving forward and landing on trunk ey?
{quote}

It makes me mighty nervous though... I'll try to get that issue done
soon.

{quote}
Well it is a nice way of extending field but I am not sure if we
should keep it since it is heavy weight. 
{quote}

The ValuesAttr for ValuesField is actually really heavyweight.  Not
only must it fire up an AttrSource, but then ValuesAttrImpl itself has
a field for each type.  Worse, for the type you do actually use, it's
then another object eg FloatsRef, which in turn holds
array/offset/len, a new length 1 array, etc.

Maybe we shouldn't use attrs here?  And instead somehow let
ValuesField store a single value as its own private member?
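
To make the "single value as a private member" idea concrete, here is a minimal Java sketch. SimpleValuesField and ValueKind are hypothetical names and do not exist in the branch; the sketch only shows that one reusable instance can carry an int, float or bytes value without an attribute source or per-type ref objects.

```java
// Hypothetical sketch; SimpleValuesField and ValueKind are not Lucene classes.
final class SimpleValuesField {
    enum ValueKind { INT, FLOAT, BYTES }

    private final String name;
    private ValueKind kind; // null until a value has been set
    private long bits;      // the long value, or the raw bits of a double
    private byte[] bytes;   // only used for BYTES

    SimpleValuesField(String name) {
        this.name = name;
    }

    String name() { return name; }

    ValueKind kind() { return kind; }

    SimpleValuesField setInt(long value) {
        kind = ValueKind.INT;
        bits = value;
        bytes = null;
        return this;
    }

    SimpleValuesField setFloat(double value) {
        kind = ValueKind.FLOAT;
        bits = Double.doubleToRawLongBits(value);
        bytes = null;
        return this;
    }

    SimpleValuesField setBytes(byte[] value) {
        kind = ValueKind.BYTES;
        bytes = value;
        return this;
    }

    long getInt() {
        require(ValueKind.INT);
        return bits;
    }

    double getFloat() {
        require(ValueKind.FLOAT);
        return Double.longBitsToDouble(bits);
    }

    byte[] getBytes() {
        require(ValueKind.BYTES);
        return bytes;
    }

    private void require(ValueKind expected) {
        if (kind != expected) {
            throw new IllegalStateException(
                "field " + name + " holds " + kind + ", not " + expected);
        }
    }

    public static void main(String[] args) {
        // Reuse one instance across documents instead of allocating per doc.
        SimpleValuesField price = new SimpleValuesField("price");
        for (double p : new double[] {1.5, 2.25}) {
            System.out.println(price.setFloat(p).getFloat());
        }
    }
}
```

Compared with the attribute-based approach, the cost per field instance here is one object with three members, and re-using the instance across documents needs no extra allocation for int and float values.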

FloatsRef, LongsRef are missing the ASL header.  Maybe it's time to
run RAT :)





[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-11-26 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935994#action_12935994
 ] 

Simon Willnauer commented on LUCENE-2186:
-

{quote}
It is, usually... but we should add a test to assert this is still the
case for other field + ValuesField?
{quote}

I already implemented a simple test case that shows that this works, as an 
example (I will commit it soon), but I think we need another test that 
ensures this works with all types of DocValues. I am working on making the 
tests more atomic anyway, so that each one tests only a single thing, so I 
will add that too.

bq. It makes me mighty nervous though... I'll try to get that issue done soon.

Well, until then I will just carry on and get the remaining stuff done here.

bq. Maybe we shouldn't use attrs here? And instead somehow let ValuesField 
store a single value as its own private member?

More and more I think we can nuke ValuesAttribute completely, since its other 
purpose on ValuesEnum is somewhat obsolete too. It is a leftover from earlier 
days, when I was experimenting with using DocsEnum to serve CSF too. It would 
have made sense there, but now we can provide a dedicated API. It still bugs 
me that Field is so hard to extend. We really need to fix that soon!

I think what we should do is extend AbstractField, simply use a 
long/double/BytesRef, and force folks to add another field instance if they 
want to have it indexed and stored.

bq. Maybe it's time to run RAT

+1 :)






[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-11-25 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935713#action_12935713
 ] 

Simon Willnauer commented on LUCENE-2186:
-

bq. I think this is very close!!
Heh, I strongly agree!

{quote}
Using attr source as the way to specify the docValue is nice in
that we get full extensibility, but, it's also heavyweight
compared to a dedicated API (ie, .setIntValue, etc.).  So I think
this means apps that use doc values really must re-use their Field
instances (if they are using doc values) else indexing performance
will likely take a good hit.
{quote}
Well, it is a nice way of extending Field, but I am not sure we
should keep it since it is heavyweight. We could get rid of
ValuesAttribute for landing on trunk and work on making Field
extensible, which is desperately needed anyway. I was also thinking
that ValuesEnum doesn't need the ValuesAttribute per se; it would
be more intuitive to have getters on ValuesEnum too. I just really hate
those instanceof checks on fields.

{quote}
   ValuesField is nice sugar on top (of the attr) :) Can you add some
jdocs to ValuesField? EG it's not stored/indexed.  It's OK to have
same field name as existing field (hmm... is it)?  Etc.
{quote}
Yeah - so far I haven't written much javadoc, but that is at the top of
the list. I will start adding javadoc to the main classes of the API, and
ValuesField is 100% one of them.
BTW, it is OK to have the same name as an existing field.

bq. Did you want to make FieldsConsumer.addValuesField abstract?
That is a leftover - I will remove it.

bq. The javadoc above DocValues.Source is wrong -- Source is not just for ints.
True - as mentioned above, that class had a different purpose back in the
days when this was still a patch :)

{quote}
You can change jdocs like "This feature is experimental and the
API is free to change in non-backwards-compatible ways." to
@lucene.experimental :)  (eg in Values.java)
{quote}

Yeah - it's good to have stuff like that left! :) yay!
{quote}
 So, you're not allowed to change the DocValues type for a field
 once you've set it the first time... and, also, segments cannot be
merged if the same field has different value types.  I'm thinking
it's really important now to carry over the same FieldInfos from
the last segment when opening the writer (LUCENE-1737)... because
hitting that IllegalStateExc during merge is a trap.  This would
let us change that IllegalStateExc into an assert (in
SegmentMerger) and also turn the assert back on in FieldsConsumer.
{quote}

I think that should not block us from moving forward and landing on trunk ey?

{quote}
Should we rename MissingValues to MissingValue? Ie it holds the single
value for your type that represents missing?
{quote}

True, I was also thinking of renaming some of the classes, like
Values -> DocValueType
PackedIntsImpl -> Ints


bq. We need better names than PagedBytes.fillUsingLengthPrefix,2,3,4

Hehe, yeah - let me change the one I added, and let's fix the rest on
trunk. I will open an issue once I have a reliable internet connection
again.

{quote}
 It'd be nice to have a more approachable test case that shows the
simple way to index doc values, ie using ValuesField instead of
getting the attr, getting the intsRef, setting it, etc.  I think
such an example should be very compact right?
{quote}

done on my checkout!

So on my list there are the following topics until landing:

 * missing test case for addIndexes and a simple one to show how to use the API
 * split up existing tests into smaller tests - they test too much and
they are hard to understand
 * Add JavaDoc to main classes like DocValues, Source, ValuesEnum, ValuesField
 * Document the different types
 * Consistent class naming - see above
 * enable RAM usage tracking for all DocValuesProducer to support
flush by RAM usage

That seems very, very close to me. Let's see how much I get done on my
flight to Boston :)


[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-11-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935428#action_12935428
 ] 

Michael McCandless commented on LUCENE-2186:


I think this is very close!!

  * Using attr source as the way to specify the docValue is nice in
that we get full extensibility, but, it's also heavyweight
compared to a dedicated API (ie, .setIntValue, etc.).  So I think
this means apps that use doc values really must re-use their Field
instances (if they are using doc values) else indexing performance
will likely take a good hit.

  * ValuesField is nice sugar on top (of the attr) :) Can you add some
jdocs to ValuesField? EG it's not stored/indexed.  It's OK to have
same field name as existing field (hmm... is it)?  Etc.

  * Did you want to make FieldsConsumer.addValuesField abstract?

  * The javadoc above DocValues.Source is wrong -- Source is not just
for ints.

  * You can change jdocs like "This feature is experimental and the
API is free to change in non-backwards-compatible ways." to
@lucene.experimental :)  (eg in Values.java)

  * So, you're not allowed to change the DocValues type for a field
once you've set it the first time... and, also, segments cannot be
merged if the same field has different value types.  I'm thinking
it's really important now to carry over the same FieldInfos from
the last segment when opening the writer (LUCENE-1737)... because
hitting that IllegalStateExc during merge is a trap.  This would
let us change that IllegalStateExc into an assert (in
SegmentMerger) and also turn the assert back on in FieldsConsumer.

  * Should we rename MissingValues to MissingValue? Ie it holds the
single value for your type that represents missing?

  * We need better names than PagedBytes.fillUsingLengthPrefix,2,3,4
heh.

  * It'd be nice to have a more approachable test case that shows the
simple way to index doc values, ie using ValuesField instead of
getting the attr, getting the intsRef, setting it, etc.  I think
such an example should be very compact right?
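
The point above about the DocValues type being fixed per field can be sketched in plain Java. FieldValueTypes and ValueType are hypothetical names, not classes from the branch, and the constructor that takes a map only stands in for "carry over the FieldInfos from the last segment". Under those assumptions, a conflicting type is rejected when the field is registered rather than surfacing later during a merge.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; FieldValueTypes and ValueType are illustrative names.
final class FieldValueTypes {
    enum ValueType { INTS, FLOAT_32, FLOAT_64, BYTES_FIXED_STRAIGHT, BYTES_VAR_SORTED }

    private final Map<String, ValueType> typeByField = new HashMap<>();

    FieldValueTypes() {
    }

    // Seed with the types recorded by the previous segment, so that a mismatch
    // is caught at registration time and not during a later merge.
    FieldValueTypes(Map<String, ValueType> carriedOver) {
        typeByField.putAll(carriedOver);
    }

    // Records the type for a field, or fails if the field already has another type.
    void register(String field, ValueType type) {
        ValueType previous = typeByField.putIfAbsent(field, type);
        if (previous != null && previous != type) {
            throw new IllegalStateException("field \"" + field + "\" already uses "
                + previous + "; cannot switch to " + type);
        }
    }

    ValueType typeOf(String field) {
        return typeByField.get(field);
    }

    public static void main(String[] args) {
        Map<String, ValueType> lastSegment = new HashMap<>();
        lastSegment.put("price", ValueType.FLOAT_32);
        FieldValueTypes types = new FieldValueTypes(lastSegment);
        types.register("price", ValueType.FLOAT_32); // same type: accepted
        try {
            types.register("price", ValueType.INTS);  // different type: rejected
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the types known up front like this, the check in the merger could become an assert, which is the change suggested in the list above.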



[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-10-12 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920110#action_12920110
 ] 

Simon Willnauer commented on LUCENE-2186:
-

Created a branch at 
[docvalues|http://svn.apache.org/repos/asf/lucene/dev/branches/docvalues/]
 and committed the last patch at r1021636. I think the next steps are adding a 
"docvalues" fix version to JIRA and creating new issues according to the 
roadmap above. Once we are through with the mandatory stuff and documentation, 
we can land this on trunk. Thoughts?

I'm not sure whether we should continue on this issue or close it, create a 
new top-level one, and spawn issues from there.

simon




[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-10-12 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920151#action_12920151
 ] 

Michael McCandless commented on LUCENE-2186:


bq. Implement bulk copies for merging where possible. 

I don't think this should block landing on trunk?  (Even in the non-deletes 
case).

But, yes, searching for the next del doc is a linear op, though with a very 
small constant in front (at least with OpenBitSet.nextSetBit; del docs are 
currently a BitVector). It is still very much worth it once we get the bulk 
copying in, since presumably big chunks of docs can be bulk copied.
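
A minimal Java sketch of that idea: find the next deleted doc, bulk copy the run of live docs in front of it, skip the deleted doc, and repeat. LiveRunCopier is a hypothetical name, and java.util.BitSet is used here only as a stand-in for OpenBitSet / BitVector; the values are a plain long[] rather than any real docvalues format.

```java
import java.util.BitSet;

// Hypothetical sketch; java.util.BitSet stands in for Lucene's deleted-docs bits.
final class LiveRunCopier {

    // Copies the values of all non-deleted docs from src into dest, using one
    // System.arraycopy per run of live docs. Returns the number of values copied.
    static int copyLive(long[] src, BitSet deleted, long[] dest) {
        int out = 0;
        int doc = 0;
        while (doc < src.length) {
            int nextDel = deleted.nextSetBit(doc); // -1 if no deleted doc remains
            int runEnd = (nextDel == -1 || nextDel >= src.length) ? src.length : nextDel;
            int runLen = runEnd - doc;
            System.arraycopy(src, doc, dest, out, runLen);
            out += runLen;
            doc = runEnd + 1; // step over the deleted doc (or past the end)
        }
        return out;
    }

    public static void main(String[] args) {
        long[] src = {10, 11, 12, 13, 14, 15};
        BitSet deleted = new BitSet();
        deleted.set(1);
        deleted.set(4);
        long[] dest = new long[src.length - deleted.cardinality()];
        int copied = copyLive(src, deleted, dest);
        System.out.println(copied + " values copied: " + java.util.Arrays.toString(dest));
    }
}
```

With few deletes the loop runs only a handful of times and almost all work happens inside the array copies, which is why the linear search for the next deleted doc is cheap in comparison.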

bq. Exposing the API via Fields / IndexReader. I think we should expose the 
Iterator API via Fields just like Terms is today. Currently it doesn't feel 
very natural to get the ValuesEnum via IR.

Ahh that does sound like the right place.

bq. Maybe if we wanna populate FieldsCache from it. 

We should be careful here -- it's best if things consume the docvalues instead 
of double-copying into the FC.




[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-10-11 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12919864#action_12919864
 ] 

Simon Willnauer commented on LUCENE-2186:
-

{quote}
There are still many nocommits but most look like they could become TODOs?
Do you have a high level sense of what's missing before we can commit to trunk?
{quote}

Yes and no :) Here is my roadmap for this issue. We have 47 nocommits pending; 
about half of them can become TODOs, while the other half are rather easy 
tasks that should be fixed before we go to trunk.
These are the major steps I would like to finish before we land this on trunk:

* Implement bulk copies for merging where possible. Currently there are still 
some value types not bulk copied and the ones which are only do if there are no 
deletes. Yet, the deletes thing I would make a TODO for now - we can still make 
that more efficient once we are on trunk. If I recall correctly figuring out 
the next deleted document is still a linear problem (I need to iterate through 
deletes), right? I guess that would be easier if I could figure out the next 
one so see if bulks are reasonable - maybe an invalid concern though.

* Expose the API via Fields / IndexReader. I think we should expose the
iterator API via Fields, just like Terms is today. Currently it doesn't feel
very natural to get the ValuesEnum via IR.
* Rethink the Source API - I get the feeling that we don't really need the
Source class but could rather use a random-access enum like Terms, where we can
seek back and forth depending on how we loaded the field's values. We could
actually unify the iterator API and random access, which would kill two birds
with one stone. Internally we simply use the *Refs to set the actual values,
default values would not be needed anymore (which would save some code / branches
internally), and the user would not have to access two different APIs.
Additionally we could expose bulk reads, just like BulkReadResult in DocsEnum, to
obtain all values in an array, maybe if we want to populate FieldCache from it.
I think we won't have perf. losses due to that since there is not really any
overhead compared to the get() call on Source. Reminds me that I need to think
about how we use that with sorted values. If we keep Source we should at
least make it implement ValuesEnum so we can use it as an enumeration if the
values are already in memory.

* To do merging for byte values correctly we need to figure out how to specify
the comparator for each field. I don't have a concrete idea for this, but I
think this should somehow go into IndexWriterConfig in a per-field map.
Thoughts?
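
The bulk-copy item above hinges on finding the next deleted document cheaply. As a minimal sketch, assuming the deletes are available as a java.util.BitSet (the structure actually used for deleted docs in the patch may differ, and LiveRuns is a hypothetical name), nextSetBit and nextClearBit give the runs of live documents that could each be copied as one block:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical helper, not part of the patch: finds runs of live docs between deletes. */
final class LiveRuns {
  /** Returns [start, endExclusive) pairs of consecutive non-deleted docIDs. */
  static List<int[]> liveRuns(BitSet deleted, int maxDoc) {
    List<int[]> runs = new ArrayList<>();
    int start = deleted.nextClearBit(0);       // first live doc
    while (start < maxDoc) {
      int end = deleted.nextSetBit(start);     // next deleted doc, or -1 if none
      if (end == -1 || end > maxDoc) {
        end = maxDoc;                          // the run extends to the end of the segment
      }
      runs.add(new int[] {start, end});        // docs start..end-1 can be copied in one block
      start = deleted.nextClearBit(end);       // skip over the deleted docs
    }
    return runs;
  }

  public static void main(String[] args) {
    BitSet deleted = new BitSet();
    deleted.set(3);
    deleted.set(4);
    deleted.set(9);
    for (int[] run : liveRuns(deleted, 12)) {
      System.out.println(run[0] + ".." + (run[1] - 1));
    }
  }
}
```

Whether a run is long enough to be worth a bulk copy is a separate decision; the sketch only shows that the run boundaries can be found without visiting every document.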


The remaining nocommits could be converted into TODOs - I think we can do so with
the following:

* Evaluate if we can decide whether a Bytes payload should be stored as straight or
as fixed, which would make it easier for the user to use the byte variants.
* Evaluate if we need String variants or if they can simply be covered by the
byte ones.
* We should have some kind of compatibility notion so that slightly different
segments can be merged, like fixed vs. var bytes or float32 vs. float64.
* For a cleaner transition we should create a separate SortField that always uses
index values.
* Explore a better way to obtain all dat / idx files in SegmentInfo to do
segment merges for index values.
* BytesValueProcessor should be thread private, but I will leave that as a TODO
since this code might change anyway once realtime lands on trunk. Not
super urgent for now.
* Fix some exception handling issues, especially in MultiSource and MultiValuesEnum.
* Fix the signed / unsigned limitations in the Ints implementation.
* Explore ways to prevent the Ints impl from doing two method calls; maybe we can expose
PackedInts directly somehow.
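
For the signed / unsigned item above, one common way to keep negative values out of a packed representation is to store each value as a non-negative offset from the segment minimum. This is only a sketch of that general technique, not the patch's Ints implementation; OffsetInts and its fields are hypothetical names, and the offsets are kept in a plain long[] for brevity rather than actually bit-packed:

```java
/** Hypothetical sketch, not the patch's Ints implementation. */
final class OffsetInts {
  final long min;          // smallest value; every stored entry is an offset from it
  final int bitsPerValue;  // bits a packed encoding would need for the largest offset
  final long[] offsets;    // non-negative (value - min), one entry per document

  OffsetInts(long[] values) {
    if (values.length == 0) {
      throw new IllegalArgumentException("need at least one value");
    }
    long lo = Long.MAX_VALUE;
    long hi = Long.MIN_VALUE;
    for (long v : values) {
      lo = Math.min(lo, v);
      hi = Math.max(hi, v);
    }
    min = lo;
    long range = hi - lo;  // assumption: the range itself fits in a signed long
    bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(range));
    offsets = new long[values.length];
    for (int i = 0; i < values.length; i++) {
      offsets[i] = values[i] - lo;
    }
  }

  /** Reconstructs the original signed value for a document. */
  long get(int docID) {
    return min + offsets[docID];
  }

  public static void main(String[] args) {
    OffsetInts ints = new OffsetInts(new long[] {-5, 0, 10});
    System.out.println(ints.bitsPerValue + " bits, doc 0 = " + ints.get(0));
  }
}
```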

bq. How about docvalues? You don't need the _branch part since it'll be at 
http://svn.apache.org.../branches/docvalues.
OK


[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-09-24 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914490#action_12914490
 ] 

Simon Willnauer commented on LUCENE-2186:
-

I would like to move this to a branch for further development. If nobody
objects I'm gonna move forward within the next few days.

simon


[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914494#action_12914494
 ] 

Robert Muir commented on LUCENE-2186:
-

bq. I would like to move this to a branch for further development. If nobody
objects I'm gonna move forward within the next few days.

+1

In my opinion, if it's helpful to use a branch for a feature like this, we
should not hesitate!
With a lot of development on trunk, big patches make it difficult for anyone to
get involved.

Additionally it's extremely difficult to iterate, because it's hard to see the
differences between iterations.
But with the flexible indexing branch, say, this history is preserved since it
was done in a branch.
So I am able to just click 'view merged revisions' in my IDE and see all that
history.



[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-08-09 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896614#action_12896614
 ] 

Simon Willnauer commented on LUCENE-2186:
-

{quote}
It should be more extensible, ie, you can make your own attrs to
store whatever you want. EG we should be able to use this to
store the flex scoring stats (LUCENE-2392).
{quote}

This is actually the first real use-case, together with the norms, which are kind
of part of LUCENE-2392 anyway.

{quote}
The end-user API is rather cumbersome now (ie, that the user must
interact directly w/ attrs). It seems like we should have a sugar
layer on top, eg an IntField(Type) and I can do IntField.set/get.
{quote}
Yeah, I guess lots of users would have a rather hard time with that. I remember
Grant saying that he has been explaining Document and Fields in his trainings
since forever, so with users in mind this should be done with the least amount of
changes. Nevertheless this is something which should be fixed outside of this
particular issue; LUCENE-2310 would be one I could think of. Guess I need to
talk to chrismale on Friday about that.


{quote}

Also... maybe we should use Attrs the way NumericField does. Ie, for
CSF we'd have a TokenStream (single valued, for now anyway), and then
attrs could be added to it. If we can get attr serialization
(LUCENE-2125) online, then we can refactor all the read/write code in
this issue as the default attr serializers? And, then, indexer would
have no special code for CSF in particular. It just asks attrs to
serialize themselves...
{quote}
LUCENE-2125 is something which would be nice to have together with CSF. Yet I
don't think they depend on each other, but they should use the same or very closely
related APIs eventually. LUCENE-2125 has different problems to tackle first I
guess - but I am closely following that! I will update the patch to make use
of the {NumericField} - let's call it - work-around to make this patch less
hairy. Still hairy, but I like the idea of using TokenStream to attach the
ValuesAttribute.

{quote}
Shouldn't FloatsRef be FloatRef (same for IntsRef)? It's ref'ing a
single value right?
{quote}

Yes and no. I was too lazy to add all the capabilities {BytesRef} has, but I
could imagine that this can benefit from being able to hold more values - maybe
an entire page when paging is used. If it only holds a single value we don't
need offset and length either. I will leave it like that for now; we can still change
it later if it turns out that we don't need this flexibility.
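
To illustrate the offset / length point, here is a minimal sketch of a ref that can cover either a single value or a whole page of a shared array. LongsSlice is a hypothetical name and layout chosen for illustration; it is not the IntsRef or FloatsRef class from the patch:

```java
/** Hypothetical ref type, not the patch's IntsRef: a window into a shared long[] page. */
final class LongsSlice {
  final long[] values; // shared backing array, e.g. one page of values
  final int offset;    // index of the first value belonging to this ref
  final int length;    // number of values this ref covers

  LongsSlice(long[] values, int offset, int length) {
    if (offset < 0 || length < 0 || offset + length > values.length) {
      throw new IllegalArgumentException("slice out of bounds");
    }
    this.values = values;
    this.offset = offset;
    this.length = length;
  }

  /** Returns the index-th value of the slice, relative to offset. */
  long get(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index: " + index);
    }
    return values[offset + index];
  }

  public static void main(String[] args) {
    long[] page = {10, 20, 30, 40, 50};
    LongsSlice single = new LongsSlice(page, 2, 1); // behaves like a single-value ref
    LongsSlice window = new LongsSlice(page, 1, 3); // covers 20, 30, 40 without copying
    System.out.println(single.get(0));
    System.out.println(window.get(2));
  }
}
```

With length fixed at 1 the offset and length fields carry no information, which is the case where a plain single-value ref would do.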

I guess I will move the ValuesEnum down to Fields and FieldsEnum soon. I don't
think we should confuse this with a DocsEnum, since DocsEnum is so closely
related to Terms and has explicit getters such as freq(). DocIdSetIterator
seems to be fine for that purpose - while the AttributeSource could be pulled
up.



[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-08-06 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896112#action_12896112
 ] 

Yonik Seeley commented on LUCENE-2186:
--

bq. I'd really appreciate any comments especially on the API as this is most
important to me right now.

Could you show some examples of the most efficient way to use this API?
i.e. an example that shows both how to index a document with a CSF, and then 
how to iterate over all values of a CSF (or get the value for a specific set of 
documents).


[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-08-06 Thread Simon Willnauer (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896122#action_12896122
 ] 

Simon Willnauer commented on LUCENE-2186:
-

Hey Yonik,
bq. Could you show some examples of the most efficient way to use this API?

Sure! While it's already late over here I am happy to provide you those two 
examples. This is how you can index CSF with this Attribute approach:

{code}
RAMDirectory dir = new RAMDirectory();
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_40, new SimpleAnalyzer(Version.LUCENE_40)));
Document doc = new Document();
Fieldable fieldable = new AttributeField(myIntField);
// attach the value to the field via a ValuesAttribute
ValuesAttribute valuesAttribute =
    fieldable.attributes().addAttribute(ValuesAttribute.class);
valuesAttribute.setType(Values.PACKED_INTS);
valuesAttribute.ints().set(100);
doc.add(fieldable);
writer.addDocument(doc);
writer.close();
{code}
  
This is how you get the values back via Source or the ValuesEnum:

{code}
IndexReader reader = IndexReader.open(dir);
// this might be integrated into Fields eventually;
// can get cached version too via reader.getIndexValuesCache();
Reader indexValues = reader.getIndexValues(myIntField);
Source load = indexValues.load();
long value = load.ints(0);
System.out.println(value);

// or get it from the enum
ValuesEnum intEnum = indexValues.getEnum();
ValuesAttribute attr = intEnum.getAttribute(ValuesAttribute.class);
while (intEnum.nextDoc() != ValuesEnum.NO_MORE_DOCS) {
  System.out.println(attr.ints().get());
}
{code}

I guess this should make it at least easier to get started.


[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-06-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883872#action_12883872
 ] 

Michael McCandless commented on LUCENE-2186:


Great -- thanks for pushing this forward Simon!

bq. Mike do you mind if I take this?

Please do!


[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-01-04 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796200#action_12796200
 ] 

Michael McCandless commented on LUCENE-2186:


bq. Is this patch for flex, as it contains CodecUtils and so on?

Actually it's intended for trunk; I was thinking this should land
before flex (it's a much smaller change, and it's isolated from
flex), and so I wrote the CodecUtil/BytesRef basic infrastructure,
thinking flex would then cutover to them.

{quote}
Hmm, so random-access would obviously be the preferred approach for SSDs, but
with conventional disks I think the performance would be poor? In 1231
I implemented the var-sized CSF with a skip list, similar to a posting
list. I think we should add that here too and we can still keep the
additional index that stores the pointers? We could have two readers:
one that allows random-access and loads the pointers into RAM (or uses
MMAP as you mentioned), and a second one that doesn't load anything
into RAM, uses the skip lists and only allows iterator-based access?
{quote}

The intention here is for this (index values) to replace field
cache, but not aim (initially at least) to do much more.  Ie, it's
meant to be RAM resident (either via explicit slurping-into-RAM or
via MMAP).  So the SSD or spinning magnets should not be hit on
retrieval.

If we add an iterator API, I think it should be simpler than the
postings API (ie, no seeking, dense (every doc is visited,
sequentially) iteration).
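
A minimal sketch of what such a dense, no-seek iterator could look like, assuming the values are already RAM resident in a long[]. DenseValuesIterator is a hypothetical class, not an API from the patch; the average() helper mirrors the avg field length use case mentioned in the issue description:

```java
/** Hypothetical dense iterator: visits every doc in order, with no seeking. */
final class DenseValuesIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final long[] values; // one value per document, RAM resident
  private int doc = -1;        // current docID, -1 before the first nextDoc()

  DenseValuesIterator(long[] values) {
    this.values = values;
  }

  /** Advances to the next document; returns its docID or NO_MORE_DOCS when exhausted. */
  int nextDoc() {
    if (doc == NO_MORE_DOCS) {
      return doc;              // stay exhausted on repeated calls
    }
    doc++;
    if (doc >= values.length) {
      doc = NO_MORE_DOCS;
    }
    return doc;
  }

  /** Value of the current document; only valid after a successful nextDoc(). */
  long value() {
    return values[doc];
  }

  /** Example consumer: average of all values, e.g. an average field length. */
  static double average(long[] values) {
    DenseValuesIterator it = new DenseValuesIterator(values);
    long sum = 0;
    int count = 0;
    while (it.nextDoc() != NO_MORE_DOCS) {
      sum += it.value();
      count++;
    }
    return count == 0 ? 0.0 : (double) sum / count;
  }

  public static void main(String[] args) {
    System.out.println(average(new long[] {3, 5, 4}));
  }
}
```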

{quote}
It looks like ByteRef is very similar to Payload? Could you use that instead 
and extend it with the new String constructor and compare methods?
{quote}

Good point!  I agree.  Also, we should use BytesRef when reading the
payload from TermsEnum.  Actually I think Payload, BytesRef, TermRef
(in flex) should all eventually be merged; of the three names, I think
I like BytesRef the best.  With *Enum in flex we can switch to
BytesRef.  For analysis we should switch PayloadAttribute to BytesRef,
and deprecate the methods using Payload?  Hmmm... but PayloadAttribute
is an interface.

{quote}
So it looks like with your approach you want to support certain
primitive types out of the box, such as byte[], float, int, String?
{quote}

Actually, all primitive types (ie, byte/short/int/long are
included under int, as well as arbitrary bit precision between
those primitive types).  Because the API uses a method invocation (eg
IntSource.get) instead of direct array access, we can hide how many
bits are actually used, under the impl.  Same is true for float/double
(except we can't [easily] do arbitrary bit precision here... just 4 or
8 bytes).
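
To make the "hide how many bits are actually used" point concrete, here is a stand-alone sketch of a packed reader behind a get() method. It is not the PackedInts API from LUCENE-1990; PackedLongs, pack and blockCount are hypothetical names, and the sketch assumes non-negative values that fit in bitsPerValue bits:

```java
/** Hypothetical packed storage: values of a fixed bit width laid out back to back in long[]. */
final class PackedLongs {
  private final long[] blocks;
  private final int bitsPerValue;
  private final long mask;

  private PackedLongs(long[] blocks, int bitsPerValue) {
    this.blocks = blocks;
    this.bitsPerValue = bitsPerValue;
    this.mask = (1L << bitsPerValue) - 1;
  }

  /** Packs non-negative values using bitsPerValue (1..63) bits each. */
  static PackedLongs pack(long[] values, int bitsPerValue) {
    if (bitsPerValue < 1 || bitsPerValue > 63) {
      throw new IllegalArgumentException("bitsPerValue must be in 1..63");
    }
    long mask = (1L << bitsPerValue) - 1;
    long[] blocks = new long[(int) (((long) values.length * bitsPerValue + 63) / 64)];
    for (int i = 0; i < values.length; i++) {
      if ((values[i] & ~mask) != 0) {
        throw new IllegalArgumentException("value does not fit: " + values[i]);
      }
      long bitPos = (long) i * bitsPerValue;
      int block = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      blocks[block] |= values[i] << shift;
      if (shift + bitsPerValue > 64) {
        // the value straddles two blocks: spill the high bits into the next one
        blocks[block + 1] |= values[i] >>> (64 - shift);
      }
    }
    return new PackedLongs(blocks, bitsPerValue);
  }

  /** The caller sees a plain long regardless of the width used internally. */
  long get(int index) {
    long bitPos = (long) index * bitsPerValue;
    int block = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long v = blocks[block] >>> shift;
    if (shift + bitsPerValue > 64) {
      v |= blocks[block + 1] << (64 - shift);
    }
    return v & mask;
  }

  int blockCount() {
    return blocks.length;
  }

  public static void main(String[] args) {
    long[] values = {3, 31, 0, 17};
    PackedLongs packed = PackedLongs.pack(values, 5);
    for (int i = 0; i < values.length; i++) {
      System.out.println(packed.get(i));
    }
  }
}
```

Twenty 5-bit values occupy 100 bits, so two long blocks instead of twenty; the caller of get() never sees the difference.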

{quote}
If someone has custom data types, then they have, similar as with
payloads today, the byte[] indirection?
{quote}

Right, byte[] is for String, but also for arbitrary (opaque to Lucene)
extensibility.  The six anonymous (separate package private classes)
concrete impls should give good efficiency to fit the different use
cases.

{quote}
The code I initially wrote for 1231 exposed IndexOutput, so that one
can call write*() directly, without having to convert to byte[]
first. I think we will also want to do that for 2125 (store attributes
in the index). So I'm wondering if this and 2125 should work
similarly?
{quote}

This is compelling (letting Attrs read/write directly), but, I have
some questions:

  * How would the random-access API work?  (Attrs are designed for
iteration).  Eg, just providing IndexInput/Output to the Attr
isn't quite enough -- the encoding is sometimes context dependent
(like frq writes the delta between docIDs, the symbol table needed
when reading/writing deref/sorted).  How would I build a random
access API on top of that?  captureState-per-doc is too costly.
What API would be used to write the shared state, ie, to tell the
Attr we now are writing the segment, so you need to dump the
symbol table.

  * How would the packed ints work?  EG say my ints only need 5 bits.
(Attrs are sort of designed for one-value-at-once).

  * How would the symbol table based encodings (deref, sorted) work?
I guess the attr would need to have some state associated with
it, and when I first create the attr I need to pass it segment
name, Directory, etc, so it opens the right files?

  * I'm thinking we should still directly support native types, ie,
Attrs are there for extensibility beyond native types?

  * Exposing single attr across a multi reader sounds tricky --
LUCENE-2154 (and, we need this for flex, which is worrying me!).
But it sounds like you and Uwe are making some progress on that
(using some under-the-hood Java reflection magic)... and this
doesn't directly affect this issue, assuming we don't expose this
API at the MultiReader level.
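
For readers unfamiliar with the symbol-table encodings referred to in the questions above, this is a rough sketch of the deref idea: each distinct value is stored once, and every document holds only an ordinal into that table. DerefBytes is a hypothetical in-memory illustration that takes String input for brevity; the patch's on-disk format and class names are not shown in this thread:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical deref-style storage: a shared table of values plus per-doc ordinals. */
final class DerefBytes {
  private final List<byte[]> table = new ArrayList<>(); // each distinct value, stored once
  private final int[] ords;                             // per-document index into table

  DerefBytes(String[] perDocValues) {
    Map<String, Integer> seen = new HashMap<>();
    ords = new int[perDocValues.length];
    for (int doc = 0; doc < perDocValues.length; doc++) {
      String value = perDocValues[doc];
      Integer ord = seen.get(value);
      if (ord == null) {
        ord = table.size();
        table.add(value.getBytes(StandardCharsets.UTF_8)); // strings are kept as UTF-8 bytes
        seen.put(value, ord);
      }
      ords[doc] = ord;
    }
  }

  /** Dereferences the document's ordinal to the shared bytes. */
  byte[] get(int docID) {
    return table.get(ords[docID]);
  }

  int uniqueCount() {
    return table.size();
  }

  public static void main(String[] args) {
    DerefBytes values = new DerefBytes(new String[] {"red", "blue", "red", "red"});
    System.out.println(values.uniqueCount() + " unique values");
    System.out.println(new String(values.get(2), StandardCharsets.UTF_8));
  }
}
```

The table is the shared, segment-wide state that a per-value attribute would somehow need to write and read, which is what makes the Attr-based approach harder for these encodings.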

{quote}
Thinking out loud: Could we have then attributes with
serialize/deserialize methods for 

[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-01-03 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795963#action_12795963
 ] 

Michael Busch commented on LUCENE-2186:
---

Great to see progress here, Mike!

{quote}
String fields are stored as the UTF8 byte[]. This patch adds a
BytesRef, which does the same thing as flex's TermRef (we should merge
them).
{quote}

It looks like BytesRef is very similar to Payload? Could you use that instead
and extend it with the new String constructor and compare methods?

{quote}
It handles 3 types of values:
{quote}

So it looks like with your approach you want to support certain
primitive types out of the box, such as byte[], float, int, String?
If someone has custom data types, then they have, as with
payloads today, the byte[] indirection?

The code I initially wrote for 1231 exposed IndexOutput, so that one
can call write*() directly, without having to convert to byte[]
first. I think we will also want to do that for 2125 (store attributes
in the index). So I'm wondering if this and 2125 should work
similarly? 
Thinking out loud: Could we have then attributes with
serialize/deserialize methods for primitive types, such as float?
Could we efficiently use such an approach all the way up to
FieldCache? It would be compelling if you could store an attribute as
CSF, or in the posting list, retrieve it from the flex APIs, and also
from the FieldCache. All would be the same API and there would only be
one place that needs to know about the encoding (the attribute).
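
A minimal sketch of what such an attribute could look like, assuming a pair of
serialize/deserialize methods.  The interface and class names here are
hypothetical, and java.io.DataOutput/DataInput only stand in for Lucene's
IndexOutput/IndexInput; this is not an existing Lucene API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical interface: the attribute is the only place that knows its encoding.
interface SerializableAttributeSketch {
    void serialize(DataOutput out) throws IOException;
    void deserialize(DataInput in) throws IOException;
}

// Hypothetical attribute wrapping a single primitive float.
final class FloatAttributeSketch implements SerializableAttributeSketch {
    float value;

    @Override
    public void serialize(DataOutput out) throws IOException {
        out.writeFloat(value);  // written directly, no intermediate byte[] built by the caller
    }

    @Override
    public void deserialize(DataInput in) throws IOException {
        value = in.readFloat();
    }

    // Writes a value through the attribute and reads it back, to show the round trip.
    static float roundTrip(float v) {
        try {
            FloatAttributeSketch writer = new FloatAttributeSketch();
            writer.value = v;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            writer.serialize(new DataOutputStream(bytes));

            FloatAttributeSketch reader = new FloatAttributeSketch();
            reader.deserialize(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return reader.value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The same serialize/deserialize pair could in principle be called by a CSF
writer, a posting list writer, or a FieldCache loader, which is the "one
place that knows the encoding" idea.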

{quote}
Next step is to do basic integration with Lucene, and then compare
sort performance of this vs field cache.
{quote}

Yeah, that's where I got kind of stuck with 1231: We need to figure
out what the public API should look like, with which a user can add CSF
values to the index and retrieve them. The easiest and fastest way
would be to add a dedicated new API. The cleaner one would be to make the whole
Document/Field/FieldInfos API more flexible. LUCENE-1597 was a first attempt.

{quote}
There are abstract Writer/Reader classes. The current reader impls
are entirely RAM resident (like field cache), but the API is (I think)
agnostic, ie, one could make an MMAP impl instead.

I think this is the first baby step towards LUCENE-1231. Ie, it
cannot yet update values, and the reading API is fully random-access
by docID (like field cache), not like a posting list, though I
do think we should add an iterator() api (to return flex's DocsEnum)
{quote}

Hmm, so random-access would obviously be the preferred approach for SSDs, but
with conventional disks I think the performance would be poor? In 1231
I implemented the var-sized CSF with a skip list, similar to a posting
list. I think we should add that here too and we can still keep the
additional index that stores the pointers? We could have two readers:
one that allows random-access and loads the pointers into RAM (or uses
MMAP as you mentioned), and a second one that doesn't load anything
into RAM, uses the skip lists and only allows iterator-based access?
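
A rough sketch of the two-reader idea over the same var-sized data.  All names
are hypothetical and this is neither the 1231 code nor the 2186 patch.  For
brevity the forward-only reader below scans length prefixes linearly; a real
impl would use skip lists to jump ahead, and would read from an IndexInput
rather than a byte[].

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: one on-disk layout, two ways to read it.
final class VarBytesSketch {

    // Layout: per doc, one length byte followed by the UTF8 bytes (max 255 in this sketch).
    static byte[] encode(String[] valuePerDoc) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String v : valuePerDoc) {
            byte[] utf8 = v.getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 255) {
                throw new IllegalArgumentException("value too long for this sketch");
            }
            out.write(utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    // Reader 1: loads one pointer per doc into RAM, then allows random access by docID.
    static final class RandomAccessReader {
        private final byte[] data;
        private final int[] offsets;

        RandomAccessReader(byte[] data, int docCount) {
            this.data = data;
            this.offsets = new int[docCount];
            int pos = 0;
            for (int doc = 0; doc < docCount; doc++) {
                offsets[doc] = pos;
                pos += 1 + (data[pos] & 0xFF);
            }
        }

        String get(int docID) {
            int pos = offsets[docID];
            return new String(data, pos + 1, data[pos] & 0xFF, StandardCharsets.UTF_8);
        }
    }

    // Reader 2: keeps no per-doc state; docIDs must be requested in increasing order.
    static final class ForwardReader {
        private final byte[] data;
        private int pos = 0;   // start of the record for doc + 1
        private int doc = -1;  // last docID returned

        ForwardReader(byte[] data) {
            this.data = data;
        }

        String advance(int target) {
            if (target <= doc) {
                throw new IllegalArgumentException("forward-only reader");
            }
            for (int skipped = doc + 1; skipped < target; skipped++) {
                pos += 1 + (data[pos] & 0xFF);  // hop over a record using its length prefix
            }
            int len = data[pos] & 0xFF;
            String value = new String(data, pos + 1, len, StandardCharsets.UTF_8);
            pos += 1 + len;
            doc = target;
            return value;
        }
    }
}
```

The trade-off is visible in the fields: the first reader pays one int per doc
in RAM, the second pays nothing up front but cannot go backwards.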

About updating CSF: I hope we can use parallel indexing for that. In
other words: It should be possible for users to use parallel indexes
to update certain fields, and Lucene should use the same approach
internally to store different generations of things like norms and CSFs.


[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)

2010-01-02 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795897#action_12795897
 ] 

Uwe Schindler commented on LUCENE-2186:
---

Is this patch for flex, as it contains CodecUtils and so on?

If it is so we should use affects version: flex.

 First cut at column-stride fields (index values storage)
 

 Key: LUCENE-2186
 URL: https://issues.apache.org/jira/browse/LUCENE-2186
 Project: Lucene - Java
  Issue Type: New Feature
  Components: Index
Reporter: Michael McCandless
Assignee: Michael McCandless
 Fix For: 3.1

 Attachments: LUCENE-2186.patch


 I created an initial basic impl for storing index values (ie
 column-stride value storage).  This is still a work in progress... but
 the approach looks compelling.  I'm posting my current status/patch
 here to get feedback/iterate, etc.
 The code is standalone now, and lives under new package
 oal.index.values (plus some util changes, refactorings) -- I have yet
 to integrate into Lucene so eg you can mark that a given Field's value
 should be stored into the index values, sorting will use these values
 instead of field cache, etc.
 It handles 3 types of values:
   * Six variants of byte[] per doc, all combinations of fixed vs
 variable length, and stored either straight (good for eg a
 title field), deref (good when many docs share the same value,
 but you won't do any sorting) or sorted.
   * Integers (variable bit precision used as necessary, ie this can
 store byte/short/int/long, and all precisions in between)
   * Floats (4 or 8 byte precision)
 String fields are stored as the UTF8 byte[].  This patch adds a
 BytesRef, which does the same thing as flex's TermRef (we should merge
 them).
 This patch also adds basic initial impl of PackedInts (LUCENE-1990);
 we can swap that out if/when we get a better impl.
 This storage is dense (like field cache), so it's appropriate when the
 field occurs in all/most docs.  It's just like field cache, except the
 reading API is a get() method invocation, per document.
 Next step is to do basic integration with Lucene, and then compare
 sort performance of this vs field cache.
 For the sort by String value case, I think RAM usage and GC load of
 this index values API should be much better than field cache, since
 it does not create object per document (instead shares big long[] and
 byte[] across all docs), and because the values are stored in RAM as
 their UTF8 bytes.
 There are abstract Writer/Reader classes.  The current reader impls
 are entirely RAM resident (like field cache), but the API is (I think)
 agnostic, ie, one could make an MMAP impl instead.
 I think this is the first baby step towards LUCENE-1231.  Ie, it
 cannot yet update values, and the reading API is fully random-access
 by docID (like field cache), not like a posting list, though I
 do think we should add an iterator() api (to return flex's DocsEnum)
 -- eg I think this would be a good way to track avg doc/field length
 for BM25/lnu.ltc scoring.
