[jira] [Updated] (LUCENE-7299) BytesRefHash.sort() should use radix sort?

2016-05-25 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7299:
-
Attachment: LUCENE-7299.patch

Updated patch. It makes the impl extend Sorter like David suggested and removes 
the cache (so it is a bit slower than the previous patch) to keep things 
simpler/safer for a first iteration.

> BytesRefHash.sort() should use radix sort?
> --
>
> Key: LUCENE-7299
> URL: https://issues.apache.org/jira/browse/LUCENE-7299
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: ByteBlockListSorter.java, LUCENE-7299.patch, 
> LUCENE-7299.patch
>
>
> Switching DocIdSetBuilder to radix sort helped make things significantly 
> faster. We should be able to do the same with BytesRefHash.sort()?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7299) BytesRefHash.sort() should use radix sort?

2016-05-25 Thread Dawid Weiss (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-7299:

Attachment: ByteBlockListSorter.java

We have an internal radix sort for sorting (largeish) key buffers (I attach it 
and hereby put it in the public domain). It was always a gain over anything 
else we tried.

As for common prefixes -- the nice part about it is that when you're descending 
into same prefix blocks you can disregard those prefixes in comparisons 
(knowing they have to be equal). We thus "pushed down" the comparison routine 
to byte block buffers -- see reference to compareAssumingEqualPrefix inside the 
code. There is also a hook inside byte block list to allow you to retrieve a 
single byte at a given offset so there's no need to copy keys over and over 
again (.byteAt).

A simple insertion sort at the "leaf" level was always better than trying to 
stop radix sort earlier. These was for realistic keys, not random distributions.

All these optimizations yielded significant speed boost for us, perhaps they'll 
be an inspiration of some sort, Adrien.


> BytesRefHash.sort() should use radix sort?
> --
>
> Key: LUCENE-7299
> URL: https://issues.apache.org/jira/browse/LUCENE-7299
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: ByteBlockListSorter.java, LUCENE-7299.patch
>
>
> Switching DocIdSetBuilder to radix sort helped make things significantly 
> faster. We should be able to do the same with BytesRefHash.sort()?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7299) BytesRefHash.sort() should use radix sort?

2016-05-24 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7299:
-
Attachment: LUCENE-7299.patch

Here is a patch. I ran a simple benchmark with 10M docs, a single thread, a 
single indexed field and a largish ram buffer of 500 MB. On random binary keys 
of length 16, indexing speed improved by 37%. On random ascii keys whose length 
is between 0 and 16, indexing speed improved by 34%.

This is not a realistic speedup since there is no merging involved (the ram 
buffer can hold everything), no stored fields, no doc values, etc. - but I 
think this is still interesting to be able to generate the inverted index more 
quickly for flushed segments. There will probably be a noticeable speedup with 
some setups that make heavy use of the inverted index of high-cardinality 
fields with large ram buffers.

> BytesRefHash.sort() should use radix sort?
> --
>
> Key: LUCENE-7299
> URL: https://issues.apache.org/jira/browse/LUCENE-7299
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Assignee: Adrien Grand
>Priority: Minor
> Attachments: LUCENE-7299.patch
>
>
> Switching DocIdSetBuilder to radix sort helped make things significantly 
> faster. We should be able to do the same with BytesRefHash.sort()?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org