[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

Robert Muir (JIRA) Tue, 31 Aug 2010 14:38:39 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904780#action_12904780
 ]


Robert Muir commented on LUCENE-2369:
-------------------------------------

bq. ICU collator keys makes sorting very fast at the cost of some extra disk 
space, as one will probably want to store the original Term together with the 
key. It requires a non-trivial memory overhead, in the ideal case as many bytes 
as there are characters in the terms. Works extremely well with reopening.

This doesn't make sense: why do you need the original term as well?

What 'memory overhead'? Indexed collation keys, even at tertiary strength (the 
largest size), are in general less than 2 bytes per character. This is actually 
less than the cost of a term in RAM in Lucene 3.1, so I don't understand this.

bq. The two approaches are not in conflict and combining them would indeed seem 
to give many benefits

If you are using collation keys, then binary order gives you collated results. 
So that's what I am hinting at here: is there a more general improvement that 
you can apply to sorting bytes? If this issue has some ideas that can improve 
the more general case, I think we should look at factoring those improvements 
out, and leave the locale stuff as an indexing-time thing.
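
To illustrate the point about binary order, here is a minimal sketch. It uses 
the JDK's java.text.Collator as a stand-in for ICU (an assumption made only to 
keep the example dependency-free; ICU keys are encoded differently and are 
typically more compact). The class and method names are hypothetical. Keys are 
computed once, and the sort itself only compares unsigned bytes.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollationKeyBytes {
    /** Compares two byte arrays as unsigned bytes; a proper prefix sorts first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Returns the terms in collated order, using only byte comparisons to sort. */
    static List<String> sortByKeyBytes(Collator collator, List<String> terms) {
        // "Index time": the collator is invoked exactly once per term.
        List<byte[]> keys = new ArrayList<>();
        for (String term : terms) {
            keys.add(collator.getCollationKey(term).toByteArray());
        }
        // "Search time": sort positions by comparing key bytes, no collator involved.
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < terms.size(); i++) {
            order.add(i);
        }
        order.sort((x, y) -> compareUnsigned(keys.get(x), keys.get(y)));
        List<String> sorted = new ArrayList<>();
        for (int i : order) {
            sorted.add(terms.get(i));
        }
        return sorted;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        List<String> terms = Arrays.asList("banana", "Zebra", "apple", "cherry");
        System.out.println(sortByKeyBytes(collator, terms));
    }
}
```

Any improvement to sorting plain byte arrays would apply to such keys 
unchanged, whatever locale produced them.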

bq. I agree that the sort-fields as well as sort-locale is well known at index 
time in most cases.

In all cases, really. I don't see this issue helping much if you don't know the 
locale at index time: by invoking the collator over all the terms at startup, 
you are essentially reindexing in RAM.

If one doesn't know the necessary locales at index time, I suggest indexing a 
'catch-all' field with a generic UCA collator (ULocale.ROOT) for all other 
locales.


> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache 
> which keeps all sort terms in memory. Beside the huge memory overhead, 
> searching requires comparison of terms with collator.compare every time, 
> making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of 
> pre-sorted ordinals for the sort terms and a map from document-IDs to entries 
> in the sorted ordinals list. This results in very low memory overhead and 
> faster sorted searches, at the cost of increased startup-time. As the 
> ordinals can be resolved to terms after the sorting has been performed, this 
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 
> which contain previous discussions on the subject.
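
The ordinal scheme in the issue description quoted above can be sketched 
roughly as follows. This is an illustration under assumptions, not the code 
attached to the issue: plain int arrays stand in for the packed structures, 
java.text.Collator stands in for whichever collator is used, and all names are 
hypothetical. The collator runs once over the unique terms at startup; sorting 
hits afterwards only compares integers.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class SortedOrdinals {
    /**
     * Startup cost: collate the unique terms once.
     * Returns rank[i] = position of terms[i] in collated order.
     */
    static int[] collatedRanks(String[] terms, Collator collator) {
        Integer[] order = new Integer[terms.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (x, y) -> collator.compare(terms[x], terms[y]));
        int[] rank = new int[terms.length];
        for (int pos = 0; pos < order.length; pos++) {
            rank[order[pos]] = pos;
        }
        return rank;
    }

    /**
     * Search time: sort hit document IDs by the rank of each document's term.
     * docToTerm[doc] is the index into terms of that document's sort term.
     */
    static int[] sortHits(int[] hits, int[] docToTerm, int[] rank) {
        return Arrays.stream(hits)
                .boxed()
                .sorted((a, b) -> Integer.compare(rank[docToTerm[a]], rank[docToTerm[b]]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    public static void main(String[] args) {
        String[] terms = {"Zebra", "apple", "mango"};
        int[] rank = collatedRanks(terms, Collator.getInstance(Locale.ENGLISH));
        int[] docToTerm = {0, 1, 2, 1};
        int[] sorted = sortHits(new int[] {0, 2, 3, 1}, docToTerm, rank);
        System.out.println(Arrays.toString(sorted));
    }
}
```

Because a rank can be mapped back to its term after sorting, the original 
terms only need to be resolved for the hits that are actually returned.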

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


