[jira] Commented: (LUCENE-2380) Add FieldCache.getTermBytes, to load term data as byte[]

Michael McCandless (JIRA) Tue, 25 May 2010 12:26:48 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871308#action_12871308
 ]


Michael McCandless commented on LUCENE-2380:
--------------------------------------------

OK I ran some sort perf tests.  I picked the worst case -- trivial
query (TermQuery) matching all docs, sorting by either a highly unique
string field (random string) or enumerated field (country ~ a couple
hundred values), from benchmark's SortableSingleDocSource.

Index has 5M docs.  Each run is best of 3.

Results:

||Sort||Trunk QPS||Patch QPS||Change %||
|random|7.75|5.64|{color:red}-27.2%{color}|
|country|8.05|7.62|{color:red}-5.3%{color}|

So... the packed ints lookups are more costly than trunk's lookups
today (but in exchange for a large reduction in RAM used).

Then I tried another test, asking packed ints to upgrade to an array
of the nearest native type (ie byte[], short[], int[], long[]) for the
doc -> ord map.  This is faster since lookups don't require
shift/mask, but, wastes some space since you have unused bits:

||Sort||Trunk QPS||Patch QPS||Change %||
|random|7.75|7.89|{color:green}+1.8%{color}|
|country|8.05|7.64|{color:red}-5.1%{color}|

The country case didn't get any better (noise) because it happened to
already be using 8 bits (byte[]) for doc->ord map.
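
To illustrate where the extra lookup cost comes from, here is a
minimal sketch contrasting a bit-packed doc -> ord lookup with a
native-array one.  It is not the PackedInts code from the patch; the
class and method names are made up for illustration, and it assumes
non-negative values that fit in bitsPerValue (1 to 31) bits.

```java
// Hypothetical sketch, not Lucene's PackedInts implementation.
final class PackedLookupSketch {

    // Packs each value into bitsPerValue bits of a shared long[].
    static long[] pack(int[] values, int bitsPerValue) {
        long totalBits = (long) values.length * bitsPerValue;
        long[] blocks = new long[(int) ((totalBits + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);   // which long holds the first bit
            int shift = (int) (bitPos & 63);    // bit offset inside that long
            blocks[block] |= ((long) values[i]) << shift;
            if (shift + bitsPerValue > 64) {    // value spills into the next long
                blocks[block + 1] |= ((long) values[i]) >>> (64 - shift);
            }
        }
        return blocks;
    }

    // Packed lookup: index arithmetic, a shift, a mask, and sometimes
    // a second array read when the value straddles two longs.
    static int get(long[] blocks, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long mask = (1L << bitsPerValue) - 1;
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {
            value |= blocks[block + 1] << (64 - shift);
        }
        return (int) (value & mask);
    }

    // Native lookup: one array read, but every entry occupies a full
    // 16 bits even if fewer are needed.
    static int get(short[] ords, int index) {
        return ords[index] & 0xFFFF;
    }
}
```

With 9-bit ords, for example, the packed form uses 9 bits per doc
and the short[] form uses 16, which is the unused-bits cost mentioned
above.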

Remember this is a worst case test -- if your query matches fewer
results than your entire index, or your query is more costly to
evaluate than the simple single TermQuery, this FieldCache lookup cost
will be relatively smaller.

So... I think we should expose in the new FieldCache methods an
optional param to control time/space tradeoff; I'll add this,
defaulting to upgrading to nearest native type.  I think the 5.3%
slowdown on the country field is acceptable given the large reduction
in RAM used...
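
As a rough sketch of what such a time/space switch could decide
(the actual parameter is not specified in this comment; the boolean
preferSpeed and all names below are hypothetical), the bit width for
the doc -> ord map might be picked like this:

```java
// Hypothetical sketch of the time/space choice; not the patch's API.
final class BitWidthSketch {

    // Minimum number of bits needed to store values in [0, maxValue].
    static int bitsRequired(long maxValue) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxValue));
    }

    // Rounds up to the width of byte, short, int or long.
    static int roundUpToNative(int bits) {
        if (bits <= 8) return 8;
        if (bits <= 16) return 16;
        if (bits <= 32) return 32;
        return 64;
    }

    // preferSpeed = true: upgrade to the nearest native width (faster
    // lookups, some wasted bits).  false: keep the exact packed width.
    static int chooseBits(long maxValue, boolean preferSpeed) {
        int bits = bitsRequired(maxValue);
        return preferSpeed ? roundUpToNative(bits) : bits;
    }
}
```

For a field whose largest ord is 199, both settings give 8 bits, so
the switch changes nothing; for a largest ord of 4,999,999 it is the
difference between 23 packed bits and a 32-bit int per doc.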


> Add FieldCache.getTermBytes, to load term data as byte[]
> --------------------------------------------------------
>
>                 Key: LUCENE-2380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2380.patch, LUCENE-2380.patch, LUCENE-2380.patch
>
>
> With flex, a term is now an opaque byte[] (typically, utf8 encoded unicode 
> string, but not necessarily), so we need to push this up the search stack.
> FieldCache now has getStrings and getStringIndex; we need corresponding 
> methods to load terms as native byte[], since in general they may not be 
> representable as String.  This should be quite a bit more RAM efficient too, 
> for US ascii content since each character would then use 1 byte not 2.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
