[jira] [Updated] (LUCENE-7299) BytesRefHash.sort() should use radix sort?
[ https://issues.apache.org/jira/browse/LUCENE-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-7299: - Attachment: LUCENE-7299.patch Updated patch. It makes the impl extend Sorter like David suggested and removes the cache (so it is a bit slower than the previous patch) to keep things simpler/safer for a first iteration. > BytesRefHash.sort() should use radix sort? > -- > > Key: LUCENE-7299 > URL: https://issues.apache.org/jira/browse/LUCENE-7299 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Attachments: ByteBlockListSorter.java, LUCENE-7299.patch, > LUCENE-7299.patch > > > Switching DocIdSetBuilder to radix sort helped make things significantly > faster. We should be able to do the same with BytesRefHash.sort()? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7299) BytesRefHash.sort() should use radix sort?
[ https://issues.apache.org/jira/browse/LUCENE-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-7299: Attachment: ByteBlockListSorter.java We have an internal radix sort for sorting (largeish) key buffers (I attach it and hereby put it in the public domain). It was always a gain over anything else we tried. As for common prefixes -- the nice part about it is that when you're descending into same prefix blocks you can disregard those prefixes in comparisons (knowing they have to be equal). We thus "pushed down" the comparison routine to byte block buffers -- see reference to compareAssumingEqualPrefix inside the code. There is also a hook inside byte block list to allow you to retrieve a single byte at a given offset so there's no need to copy keys over and over again (.byteAt). A simple insertion sort at the "leaf" level was always better than trying to stop radix sort earlier. These was for realistic keys, not random distributions. All these optimizations yielded significant speed boost for us, perhaps they'll be an inspiration of some sort, Adrien. > BytesRefHash.sort() should use radix sort? > -- > > Key: LUCENE-7299 > URL: https://issues.apache.org/jira/browse/LUCENE-7299 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Attachments: ByteBlockListSorter.java, LUCENE-7299.patch > > > Switching DocIdSetBuilder to radix sort helped make things significantly > faster. We should be able to do the same with BytesRefHash.sort()? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7299) BytesRefHash.sort() should use radix sort?
[ https://issues.apache.org/jira/browse/LUCENE-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-7299: - Attachment: LUCENE-7299.patch Here is a patch. I ran a simple benchmark with 10M docs, a single thread, a single indexed field and a largish ram buffer of 500 MB. On random binary keys of length 16, indexing speed improved by 37%. On random ascii keys whose length is between 0 and 16, indexing speed improved by 34%. This is not a realistic speedup since there is no merging involved (the ram buffer can hold everything), no stored fields, no doc values, etc. - but I think this is still interesting to be able to generate the inverted index more quickly for flushed segments. There will probably be a noticeable speedup with some setups that make heavy use of the inverted index of high-cardinality fields with large ram buffers. > BytesRefHash.sort() should use radix sort? > -- > > Key: LUCENE-7299 > URL: https://issues.apache.org/jira/browse/LUCENE-7299 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Assignee: Adrien Grand >Priority: Minor > Attachments: LUCENE-7299.patch > > > Switching DocIdSetBuilder to radix sort helped make things significantly > faster. We should be able to do the same with BytesRefHash.sort()? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org