[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13025033#comment-13025033 ] Lance Norskog commented on LUCENE-2186: --- What's the current status on this? First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017822#comment-13017822 ] Simon Willnauer commented on LUCENE-2186: - bq. I'm wondering if there is a limitation on whether or not we can randomly access the doc values from the underlying Directory implementation, rather than need to load all the values directly into the main heap space. This seems doable, and if so let me know if I can provide a patch. the current implementation to access docValues not loaded into memory uses DocIdSetIterator as its parent interface so it works only in one direction currently. changing this to a random access seekable API should be not too hard. Look at http://svn.apache.org/repos/asf/lucene/dev/branches/docvalues/lucene/src/java/org/apache/lucene/index/values/DocValuesEnum.java simon First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017886#comment-13017886 ] Jason Rutherglen commented on LUCENE-2186: -- bq. changing this to a random access seekable API should be not too hard I think we can offer the option of MMap'ing the field caches, which I think will help alleviate OOMs? First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13017679#comment-13017679 ] Jason Rutherglen commented on LUCENE-2186: -- I'm wondering if there is a limitation on whether or not we can randomly access the doc values from the underlying Directory implementation, rather than need to load all the values directly into the main heap space. This seems doable, and if so let me know if I can provide a patch. First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979395#action_12979395 ] Jason Rutherglen commented on LUCENE-2186: -- Out of curiosity, re: LUCENE-2312, are we planning on putting CSF into Lucene 4.x? What's left to be done? First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979404#action_12979404 ] Simon Willnauer commented on LUCENE-2186: - bq. Out of curiosity, re: LUCENE-2312, are we planning on putting CSF into Lucene 4.x? What's left to be done? we are very close - to land on trunk there is about an evening of work left. JDoc is missing here and there plus some tests for FieldComparators - thats it! First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12979407#action_12979407 ] Jason Rutherglen commented on LUCENE-2186: -- bq. we are very close - to land on trunk there is about an evening of work left. JDoc is missing here and there plus some tests for FieldComparators - thats it! Nice! Once it's in I'll try to get started on the RT field cache/doc values, which can likely be implemented and tested somewhat independent of the RT inverted index. First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12968416#action_12968416 ] Simon Willnauer commented on LUCENE-2186: - bq. Whew... this interface is more expansive than I thought it would be (but I guess it's really many issues rolled into one... like sorting, caching, etc). sorry about that :) bq. So it seems like DocValuesEnum is the traditional lowest level read the index, and Source is a cached version of that? Not quiet DocValuesEnum is an iterator based access to the DocValues which does not load everything to memory while Source is a entirely Ram-Resident offering random access to values similar to field cache. Yet, you can also obtain a DocValuesEnum from a Source since its already in memory. bq. A higher level question I have is why we're not reusing the FieldCache for caching/sorting? You mean as a replacement for Source? - For caching what we did in here is to leave it to the user to do the caching or cache based on Source instance how would that relate to FieldCache in your opinion? First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12965242#action_12965242 ] Simon Willnauer commented on LUCENE-2186: - bq. Is there any test cases that cover the new FieldComparators that use the doc values? not yet, I added it to my internal roadmap to land on trunk. I just committed my latest changes including a simple testcase to show how to use the API and used bytes tracking. here is a list of what is missing: {code} /* * TODO: * Roadmap to land on trunk * - Cut over to a direct API on ValuesEnum vs. ValuesAttribute * - Add documentation for: * - Source and ValuesEnum * - DocValues * - ValuesField * - ValuesAttribute * - Values * - Add @lucene.experimental to all necessary classes * - Try to make ValuesField more lightweight - AttributeSource * - add test for unoptimized case with deletes * - add a test for addIndexes * - split up existing testcases and give them meaningfull names * - use consistent naming throughout DocValues * - Values - DocValueType * - PackedIntsImpl - Ints * - run RAT * - add tests for FieldComparator FloatIndexValuesComparator vs. FloatValuesComparator etc. */ {code} once I am through with it I will create a new issue and create the final patch so we can iterate over it if needed. simon First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935977#action_12935977 ] Michael McCandless commented on LUCENE-2186: bq. BTW. it is ok to have the same name as a existing field. It is, usually... but we should add a test to assert this is still the case for other field + ValuesField? {quote} bq. I'm thinking it's really important now to carry over the same FieldInfos from the last segment when opening the writer (LUCENE-1737)... because hitting that IllegalStateExc during merge is a trap. I think that should not block us from moving forward and landing on trunk ey? {quote} It makes me mighty nervous though... I'll try to get that issue done soon. {quote} Well it is a nice way of extending field but I am not sure if we should keep it since it is heavy weight. {quote} The ValuesAttr for ValuesField is actually really heavyweight. Not only must it fire up an AttrSource, but then ValuesAttrImpl itself has a field for each type. Worse, for the type you do actually use, it's then another object eg FloatsRef, which in turn holds array/offset/len, a new length 1 array, etc. Maybe we shouldn't use attrs here? And instead somehow let ValuesField store a single value as it's own private member? FloatsRef, LongsRef are missing the ASL header. Maybe it's time to run RAT :) First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935994#action_12935994 ] Simon Willnauer commented on LUCENE-2186: - {quote} It is, usually... but we should add a test to assert this is still the case for other field + ValuesField? {quote} I already implemented a simple testcase that shows that this works as an example (I will commit that soon though) but I think we need to add another test that ensures that this works with all types of DocValues though. Yet, I work on making the test more atomic and test only a single thing anyway so i will add that too. bq. It makes me mighty nervous though... I'll try to get that issue done soon. Well until then I just go on and get the remaining stuff done here. bq. Maybe we shouldn't use attrs here? And instead somehow let ValuesField store a single value as it's own private member? I more and more think we can nuke ValuesAttribute completely since its other purpose on ValuesEnum is somewhat obsolete too. It is actually a leftover from earlier days where I was experimenting with using DocEnum to serve CSF too. There it would have made sense though but now we can provide a dedicated API. It still bugs me that Field is so hard to extend. We really need to fix that soon! I think what we should do is extend AbstractField and simply use a long/double/BytesRef and force folks to add another field instance if they want to have it indexed and stored. bq. Maybe it's time to run RAT +1 :) First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935713#action_12935713 ] Simon Willnauer commented on LUCENE-2186: - bq. I think this is very close!! Heh, I strongly agree! {quote} Using attr source as the way to specify the docValue is nice in that we get full extensibility, but, it's also heavyweight compared to a dedicated API (ie, .setIntValue, etc.). So I think this means apps that use doc values really must re-use their Field instances (if they are using doc values) else indexing performance will likely take a good hit. {quote} Well it is a nice way of extending field but I am not sure if we should keep it since it is heavy weight. We could get rid of ValuesAttribute for landing on trunk and work on making field extendible - which is desperately needed anyway. I was also thinking that the ValuesEnum doesn't need the ValuesAttribute per se. it would be more intuitive to have getter on ValuesEnum too. I just really hate those instanceof checks on fields. {quote} ValuesField is nice sugar on top (of the attr) :) Can you add some jdocs to ValuesField? EG it's not stored/indexed. It's OK to have same field name as existing field (hmm... is it)? Etc. {quote} Yeah - until here I haven't done much javadoc but that is on top of the list. I will start adding JavaDoc to main classes of the API and ValuesField is 100% a main class of it. BTW. it is ok to have the same name as a existing field. bq. Did you want to make FieldsConsumer.addValuesField abstract? That is a leftover - I will remove it. bq. The javadoc above DocValues.Source is wrong -- Source is not just for ints. True - see above that class had a different purpose back in the days where it was a patch :) {quote} You can change jdocs like This feature is experimental and the API is free to change in non-backwards-compatible ways. to @lucene.experimental :) (eg in Values.java) {quote} yeah - its good to have stuff like that left! :) yay! {quote} So, you're not allowed to change the DocValues type for a field once you've set it the first time... and, also, segments cannot be merged if the same field has different value types. I'm thinking it's really important now to carry over the same FieldInfos from the last segment when opening the writer (LUCENE-1737)... because hitting that IllegalStateExc during merge is a trap. This would let us change that IllegalStateExc into an assert (in SegmentMerger) and also turn the assert back on in FieldsConsumer. {quote} I think that should not block us from moving forward and landing on trunk ey? {quote} Should we rename MissingValues to MissingValue? Ie it holds the single value for your type that represents missing? {quote} True, I was also thinking to rename some of the classes like Values - DocValueType PackedIntsImpl - Ints bq. We need better names than PagedBytes.fillUsingLengthPrefix,2,3,4 hehe yeah - lemme change the one I added and lets fix the rest on trunk. I will open an issue once I have a reliable inet connection again. {quote} It'd be nice to have a more approachable test case that shows the simple way to index doc values, ie using ValuesField instead of getting the attr, getting the intsRef, setting it, etc. I think such an example should be very compact right? {quote} done on my checkout! so on my list there are the following topics until landing: * missing testcase for addIndexes and a simple one to show how to use the api * split up exiting tests in smaller tests - they test too much and they are hard to understand * Add JavaDoc to main classes like DocValues, Source, ValuesEnum, ValuesField * Document the different types * Consistent class naming - see above * enable ram usage tracking for all DocValuesProducer to support flush by RAM usage That seems very very close to me. Lets see how much I get done on my flight to boston :) First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12935428#action_12935428 ] Michael McCandless commented on LUCENE-2186: I think this is very close!! * Using attr source as the way to specify the docValue is nice in that we get full extensibility, but, it's also heavyweight compared to a dedicated API (ie, .setIntValue, etc.). So I think this means apps that use doc values really must re-use their Field instances (if they are using doc values) else indexing performance will likely take a good hit. * ValuesField is nice sugar on top (of the attr) :) Can you add some jdocs to ValuesField? EG it's not stored/indexed. It's OK to have same field name as existing field (hmm... is it)? Etc. * Did you want to make FieldsConsumer.addValuesField abstract? * The javadoc above DocValues.Source is wrong -- Source is not just for ints. * You can change jdocs like This feature is experimental and the API is free to change in non-backwards-compatible ways. to @lucene.experimental :) (eg in Values.java) * So, you're not allowed to change the DocValues type for a field once you've set it the first time... and, also, segments cannot be merged if the same field has different value types. I'm thinking it's really important now to carry over the same FieldInfos from the last segment when opening the writer (LUCENE-1737)... because hitting that IllegalStateExc during merge is a trap. This would let us change that IllegalStateExc into an assert (in SegmentMerger) and also turn the assert back on in FieldsConsumer. * Should we rename MissingValues to MissingValue? Ie it holds the single value for your type that represents missing? * We need better names than PagedBytes.fillUsingLengthPrefix,2,3,4 heh. * It'd be nice to have a more approachable test case that shows the simple way to index doc values, ie using ValuesField instead of getting the attr, getting the intsRef, setting it, etc. I think such an example should be very compact right? First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: CSF branch, 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache),
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920110#action_12920110 ] Simon Willnauer commented on LUCENE-2186: - created branch at [http://http://svn.apache.org/repos/asf/lucene/dev/branches/docvalues/|docvalues] and committed the last patch at r1021636. I think the next steps are adding a fix version docvalues to JIRA and create new issues according to the roadmap above. Once we are through with the mandatory stuff and documentation we can land this on trunk. Thoughts? I'm not sure if we should continue on this issue or close it and create a new top level one and spawn issues from there. simon First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12920151#action_12920151 ] Michael McCandless commented on LUCENE-2186: bq. Implement bulk copies for merging where possible. I don't think this should block landing on trunk? (Even in the non-deletes case). But, yes, searching for next del doc is a linear op, but a very small constant in front (at least OpenBitSet.nextSetBit, though del docs are currently a BitVector), yet is very much worth it once we get the bulk copying in since presumably big chunks of docs can be bulk copied. bq. Exposing the API via Fields / IndexReader. I think we should expose the Iterator API via Fields just like Terms is today. Currently it doesn't feel very natural to get the ValuesEnum via IR. Ahh that does sound like the right place. bq. Maybe if we wanna populate FieldsCache from it. We should be careful here -- it's best if things consume the docvalues instead of double-copying into the FC. First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12919864#action_12919864 ] Simon Willnauer commented on LUCENE-2186: - {quote} There are still many nocommits but most look like they could become TODOs? Do you have a high level sense of what's missing before we can commit to trunk? {quote} Yes and No :), here is my roadmap for this issue. We have 47 nocommit pending where about the half of it can be TODOs while the other half of it are rather easy task and should be fixed before we go to trunk. These are the major steps I would like to finish until we land this on trunk * Implement bulk copies for merging where possible. Currently there are still some value types not bulk copied and the ones which are only do if there are no deletes. Yet, the deletes thing I would make a TODO for now - we can still make that more efficient once we are on trunk. If I recall correctly figuring out the next deleted document is still a linear problem (I need to iterate through deletes), right? I guess that would be easier if I could figure out the next one so see if bulks are reasonable - maybe an invalid concern though. * Exposing the API via Fields / IndexReader. I think we should expose the Iterator API via Fields just like Terms is today. Currently it doesn't feel very natural to get the ValuesEnum via IR. * Rethink the Source API - I get the feeling that we don't really need the Source class but could rather use a Random Access Enum like Terms where we can see back and forth depending on how we loaded the fields values. We could actually unify the iterator API and random access which would catch two birds with one stone. internally we simply use the *Refs to set the actual values, default values would no be needed anymore (would save some code / branches internally) and the user would not have to access two different APIs. Additionally we could expose bulk reads just like BulkReadResult in DocsEnum to obtain all values in an array. Maybe if we wanna populate FieldsCache from it. I think we won't have perf. losts due to that since there is not really an overhead compared to the get() call on Source. - Reminds me I need to think about how we use that with sorted values If we keep Source we should at least make it implement ValuesEnum so we can use it as enumeration if they are in mem already. * To do merging for byte values correctly we need to figure out how to specify the comparator for each field. I don't have a concrete idea for this but I think this should somehow go into IndexWriterConfig in a per field map. Thougths? Remaining nocommits could be converted into TODOs - I think we can do so with the following * Evaluate if we can decide if a Bytes Payload should be stored as straight or as fixed which would make it easier for the user to use the byte variants. * Evaluate if we need String variants or if they can simple be solved with the byte ones * We should have some king of compatibility notion so that slightly different segments can be merged like fixed vs. var bytes float32 vs. float64. * For a cleaner transition we should create a sep. SortField that always uses index values. * explore a better way to obtain all dat / idx fiels in SegmentInfo to do segment merges for index values. *BytesValueProcessor should be thread private but I will leave that as a todo since this code might change anyway once realtime lands on trunk though. Not super urgent for now. * Fix some exception handling issues especially in MultiSource MultiValuesEnum * Fix the singed / unsigned limitations in Ints implementation * Explore ways to preven Ints impl do two method calls maybe we can expose PackedInts directly somehow bq. How about docvalues? You don't need the _branch part since it'll be at http://svn.apache.org.../branches/docvalues. OK First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914490#action_12914490 ] Simon Willnauer commented on LUCENE-2186: - I would want to move this to a branch for further development. If nobody objects I'm gonna move forward within the next days. simon First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12914494#action_12914494 ] Robert Muir commented on LUCENE-2186: - bq. I would want to move this to a branch for further development. If nobody objects I'm gonna move forward within the next days. +1 In my opinion, if its helpful to use a branch for a feature like this, we should not hesitate! With a lot of development on trunk, big patches make it difficult for anyone to get involved. Additionally its extremely difficult to iterate, because its hard to see the differences between iterations. But say, with the flexible indexing branch, this history is preserved since it was done in a branch. So I am able to just click 'view merged revisions' in my IDE and see all that history. First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896614#action_12896614 ] Simon Willnauer commented on LUCENE-2186: - {quote} It should be more extensible, ie, you can make your own attrs to store whatever you want. EG we should be able to use this to store the flex scoring stats (LUCENE-2392). {quote} This is actually the first real use-case together with the norms which is kind of part of LUCENE-2392 anyway {quote} The end-user API is rather cumbersome now (ie, that the user must interact directly w/ attrs). It seems like we should have a sugar layer on top, eg an IntField(Type) and I can do IntField.set/get. {quote} Yeah I guess lots of users would have a rather hard time with that. I remember Grant saying that he tries to explain Document and Fields since every in his trainings and with users in mind this should be done with least amount of changes. Nevertheless this is something which should be fixed outside of this particular issue, LUCENE-2310 would be one I could think of. Guess I need to talk to chrismale on Friday about that. {quote} Also... maybe we should use Attrs the way NumericField does. Ie, for CSF we'd have a TokenStream (single valued, for now anyway), and then attrs could be added to it. If we can get attr serialization (LUCENE-2125) online, then we can refactor all the read/write code in this issue as the default attr serializers? And, then, indexer would have no special code for CSF in particular. It just asks attrs to serialize themselves... {quote} LUCENE-2125 is something which would be nice to have together with CSF. Yet I don't think it depends on each other but it should use the same or very closely related APIs eventually. LUCENE-2125 has different problems to tackle first I guess - but I am closely following that! I will update that patch to make use of the {NumericField} - lets call it - work-around to make this patch less hairy. Still hairy but I like the idea of using TokenStream to attach the ValuesAttribute. {quote} Shouldn't FloatsRef be FloatRef (same for IntsRef)? It's ref'ing a single value right? {quote} Yes and no. I was too lazy to add all the capabilities {BytesRef} has but I could imagine that this can benefit from being able to hold more values - maybe a entire page when paging is used. If it only holds a single value we don't need offset and length too. I will leaf it like that for now, can still change it later if it turns out that we don't need this flexibility. I guess I will move the ValuesEnum down to Fields and FieldsEnum soon. I don't think we should confuse this with an DocsEnum since DocsEnum is so closely related to Terms and has explicit getters for freq() though. DocIdSetIterator seems to be fine for that purpose - while the AttributeSource could be pulled up. First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896112#action_12896112 ] Yonik Seeley commented on LUCENE-2186: -- bq. I'd really appreciate any comments especially on the API as this most important to me right now. Could you show some examples of the most efficient way to use this API? i.e. an example that shows both how to index a document with a CSF, and then how to iterate over all values of a CSF (or get the value for a specific set of documents). First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12896122#action_12896122 ] Simon Willnauer commented on LUCENE-2186: - Hey Yonik, bq. Could you show some examples of the most efficient way to use this API? Sure! While it's already late over here I am happy to provide you those two examples. This is how you can index CSF with this Attribute approach: {code} RAMDirectory dir = new RAMDirectory(); IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_40, new SimpleAnalyzer(Version.LUCENE_40))); Document doc = new Document(); Fieldable fieldable = new AttributeField(myIntField); ValuesAttribute valuesAttribute = fieldable.attributes().addAttribute(ValuesAttribute.class); valuesAttribute.setType(Values.PACKED_INTS); valuesAttribute.ints().set(100); doc.add(fieldable); writer.addDocument(doc); writer.close(); {code} This is how get the values back via source or the ValuesEnum: {code} IndexReader reader = IndexReader.open(dir); // this might be integrated into Fields eventually Reader indexValues = reader.getIndexValues(myIntField); // can get cached version too via reader.getIndexValuesCache(); Source load = indexValues.load(); long value = load.ints(0); System.out.println(value); // or get it from the enum ValuesEnum intEnum = indexValues.getEnum(); ValuesAttribute attr = intEnum.getAttribute(ValuesAttribute.class); while(intEnum.nextDoc() != ValuesEnum.NO_MORE_DOCS) { System.out.println(attr.ints().get()); } {code} I guess this should make it at least easier to get started. First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Simon Willnauer Fix For: 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands,
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12883872#action_12883872 ] Michael McCandless commented on LUCENE-2186: Great -- thanks for pushing this forward Simon! bq. Mike do you mind if I take this? Please do! First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 4.0 Attachments: LUCENE-2186.patch, LUCENE-2186.patch, LUCENE-2186.patch, mem.py I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12796200#action_12796200 ] Michael McCandless commented on LUCENE-2186: bq. Is this patch for flex, as it contains CodecUtils and so on? Actually it's intended for trunk; I was thinking this should land before flex (it's a much smaller change, and it's isolated from flex), and so I wrote the CodecUtil/BytesRef basic infrastructure, thinking flex would then cutover to them. {quote} Hmm, so random-access would obviously be the preferred approach for SSDs, but with conventional disks I think the performance would be poor? In 1231 I implemented the var-sized CSF with a skip list, similar to a posting list. I think we should add that here too and we can still keep the additional index that stores the pointers? We could have two readers: one that allows random-access and loads the pointers into RAM (or uses MMAP as you mentioned), and a second one that doesn't load anything into RAM, uses the skip lists and only allows iterator-based access? {quote} The intention here is for this (index values) to replace field cache, but not aim (initially at least) to do much more. Ie, it's meant to be a RAM resident (either via explicit slurping-into-RAM or via MMAP). So the SSD or spinning magnets should not be hit on retrieval. If we add an iterator API, I think it should be simpler than the postings API (ie, no seeking, dense (every doc is visited, sequentially) iteration). {quote} It looks like ByteRef is very similar to Payload? Could you use that instead and extend it with the new String constructor and compare methods? {quote} Good point! I agree. Also, we should use BytesRef when reading the payload from TermsEnum. Actually I think Payload, BytesRef, TermRef (in flex) should all eventually be merged; of the three names, I think I like BytesRef the best. With *Enum in flex we can switch to BytesRef. For analysis we should switch PayloadAttribute to BytesRef, and deprecate the methods using Payload? Hmmm... but PayloadAttribute is an interface. {quote} So it looks like with your approach you want to support certain primitive types out of the box, such as byte[], float, int, String? {quote} Actually, all primitive types (ie, byte/short/int/long are included under int, as well as arbitrary bit precision between those primitive types). Because the API uses a method invocation (eg IntSource.get) instead of direct array access, we can hide how many bits are actually used, under the impl. Same is true for float/double (except we can't [easily] do arbitrary bit precision here... just 4 or 8 bytes). {quote} If someone has custom data types, then they have, similar as with payloads today, the byte[] indirection? {quote} Right, byte[] is for String, but also for arbitrary (opaque to Lucene) extensibility. The six anonymous (separate package private classes) concrete impls should give good efficiency to fit the different use cases. {quote} The code I initially wrote for 1231 exposed IndexOutput, so that one can call write*() directly, without having to convert to byte[] first. I think we will also want to do that for 2125 (store attributes in the index). So I'm wondering if this and 2125 should work similarly? {quote} This is compelling (letting Attrs read/write directly), but, I have some questions: * How would the random-access API work? (Attrs are designed for iteration). Eg, just providing IndexInput/Output to the Attr isn't quite enough -- the encoding is sometimes context dependent (like frq writes the delta between docIDs, the symbol table needed when reading/writing deref/sorted). How would I build a random access API on top of that? captureState-per-doc is too costly. What API would be used to write the shared state, ie, to tell the Attr we now are writing the segment, so you need to dump the symbol table. * How would the packed ints work? EG say my ints only need 5 bits. (Attrs are sort of designed for one-value-at-once). * How would the symbol table based encodings (deref, sorted) work? I guess the attr would need to have some state associated with it, and when I first create the attr I need to pass it segment name, Directory, etc, so it opens the right files? * I'm thinking we should still directly support native types, ie, Attrs are there for extensibility beyond native types? * Exposing single attr across a multi reader sounds tricky -- LUCENE-2154 (and, we need this for flex, which is worrying me!). But it sounds like you and Uwe are making some progress on that (using some under-the-hood Java reflection magic)... and this doesn't directly affect this issue, assuming we don't expose this API at the MultiReader level. {quote} Thinking out loud: Could we have then attributes with serialize/deserialize methods for
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795963#action_12795963 ] Michael Busch commented on LUCENE-2186: --- Great to see progress here, Mike! {quote} String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). {quote} It looks like ByteRef is very similar to Payload? Could you use that instead and extend it with the new String constructor and compare methods? {quote} It handles 3 types of values: {quote} So it looks like with your approach you want to support certain primitive types out of the box, such as byte[], float, int, String? If someone has custom data types, then they have, similar as with payloads today, the byte[] indirection? The code I initially wrote for 1231 exposed IndexOutput, so that one can call write*() directly, without having to convert to byte[] first. I think we will also want to do that for 2125 (store attributes in the index). So I'm wondering if this and 2125 should work similarly? Thinking out loud: Could we have then attributes with serialize/deserialize methods for primitive types, such as float? Could we efficiently use such an approach all the way up to FieldCache? It would be compelling if you could store an attribute as CSF, or in the postinglist, retrieve it from the flex APIs, and also from the FieldCache. All would be the same API and there would only be one place that needs to know about the encoding (the attribute). {quote} Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. {quote} Yeah, that's where I got kind of stuck with 1231: We need to figure out how the public API should look like, with which a user can add CSF values to the index and retrieve them. The easiest and fastest way would be to add a dedicated new API. The cleaner one would be to make the whole Document/Field/FieldInfos API more flexible. LUCENE-1597 was a first attempt. {quote} There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) {quote} Hmm, so random-access would obviously be the preferred approach for SSDs, but with conventional disks I think the performance would be poor? In 1231 I implemented the var-sized CSF with a skip list, similar to a posting list. I think we should add that here too and we can still keep the additional index that stores the pointers? We could have two readers: one that allows random-access and loads the pointers into RAM (or uses MMAP as you mentioned), and a second one that doesn't load anything into RAM, uses the skip lists and only allows iterator-based access? About updating CSF: I hope we can use parallel indexing for that. In other words: It should be possible for users to use parallel indexes to update certain fields, and Lucene should use the same approach internally to store different generations of things like norms and CSFs. First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2186.patch I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef
[jira] Commented: (LUCENE-2186) First cut at column-stride fields (index values storage)
[ https://issues.apache.org/jira/browse/LUCENE-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12795897#action_12795897 ] Uwe Schindler commented on LUCENE-2186: --- Is this patch for flex, as it contains CodecUtils and so on? If it is so we should use affects version: flex. First cut at column-stride fields (index values storage) Key: LUCENE-2186 URL: https://issues.apache.org/jira/browse/LUCENE-2186 Project: Lucene - Java Issue Type: New Feature Components: Index Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 3.1 Attachments: LUCENE-2186.patch I created an initial basic impl for storing index values (ie column-stride value storage). This is still a work in progress... but the approach looks compelling. I'm posting my current status/patch here to get feedback/iterate, etc. The code is standalone now, and lives under new package oal.index.values (plus some util changes, refactorings) -- I have yet to integrate into Lucene so eg you can mark that a given Field's value should be stored into the index values, sorting will use these values instead of field cache, etc. It handles 3 types of values: * Six variants of byte[] per doc, all combinations of fixed vs variable length, and stored either straight (good for eg a title field), deref (good when many docs share the same value, but you won't do any sorting) or sorted. * Integers (variable bit precision used as necessary, ie this can store byte/short/int/long, and all precisions in between) * Floats (4 or 8 byte precision) String fields are stored as the UTF8 byte[]. This patch adds a BytesRef, which does the same thing as flex's TermRef (we should merge them). This patch also adds basic initial impl of PackedInts (LUCENE-1990); we can swap that out if/when we get a better impl. This storage is dense (like field cache), so it's appropriate when the field occurs in all/most docs. It's just like field cache, except the reading API is a get() method invocation, per document. Next step is to do basic integration with Lucene, and then compare sort performance of this vs field cache. For the sort by String value case, I think RAM usage GC load of this index values API should be much better than field caache, since it does not create object per document (instead shares big long[] and byte[] across all docs), and because the values are stored in RAM as their UTF8 bytes. There are abstract Writer/Reader classes. The current reader impls are entirely RAM resident (like field cache), but the API is (I think) agnostic, ie, one could make an MMAP impl instead. I think this is the first baby step towards LUCENE-1231. Ie, it cannot yet update values, and the reading API is fully random-access by docID (like field cache), not like a posting list, though I do think we should add an iterator() api (to return flex's DocsEnum) -- eg I think this would be a good way to track avg doc/field length for BM25/lnu.ltc scoring. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org