[ https://issues.apache.org/jira/browse/LUCENE-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871308#action_12871308 ]
Michael McCandless commented on LUCENE-2380:
--------------------------------------------

OK I ran some sort perf tests.  I picked the worst case -- a trivial query (TermQuery) matching all docs, sorting by either a highly unique string field (random string) or an enumerated field (country ~ a couple hundred values), from benchmark's SortableSingleDocSource.

Index has 5M docs.  Each run is best of 3.  Results:

||Sort||Trunk QPS||Patch QPS||Change %||
|random|7.75|5.64|{color:red}-27.2%{color}|
|country|8.05|7.62|{color:red}-5.3%{color}|

So.... the packed ints lookups are more costly than trunk today (but, at a large reduction in RAM used).

Then I tried another test, asking packed ints to upgrade to an array of the nearest native type (i.e. byte[], short[], int[], long[]) for the doc -> ord map.  This is faster since lookups don't require shift/mask, but wastes some space since you have unused bits:

||Sort||Trunk QPS||Patch QPS||Change %||
|random|7.75|7.89|{color:green}1.8%{color}|
|country|8.05|7.64|{color:red}-5.1%{color}|

The country case didn't get any better (the difference is noise) because it happened to already be using 8 bits (byte[]) for the doc -> ord map.

Remember this is a worst case test -- if your query matches fewer results than your entire index, or your query is more costly to evaluate than the simple single TermQuery, this FieldCache lookup cost will be relatively smaller.

So... I think we should expose in the new FieldCache methods an optional param to control the time/space tradeoff; I'll add this, defaulting to upgrading to the nearest native type.

I think the 5.3% slowdown on the country field is acceptable given the large reduction in RAM used...
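
To make the shift/mask cost concrete, here is a rough sketch of what a bit-packed doc -> ord lookup involves compared with a plain native array read.  This is not the code from the patch and not Lucene's packed ints implementation; the class name PackedOrdsSketch and its methods are made up for illustration, and it assumes values are packed back to back into a long[].

```java
// Hypothetical sketch only: NOT the LUCENE-2380 patch and NOT Lucene's packed ints code.
// Values of a fixed bit width are stored back to back inside a long[].
final class PackedOrdsSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedOrdsSketch(int bitsPerValue, int valueCount) {
        if (bitsPerValue < 1 || bitsPerValue > 63) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..63");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) bitsPerValue * valueCount;
        this.blocks = new long[(int) ((totalBits + 63) >>> 6)];
    }

    /** Stores the low bitsPerValue bits of value at the given index. */
    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);   // which long holds the first bit
        int shift = (int) (bitPos & 63);    // offset of the first bit inside that long
        long v = value & mask;
        blocks[block] = (blocks[block] & ~(mask << shift)) | (v << shift);
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // The value straddles two longs: write its high bits into the next one.
            long highMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~highMask) | (v >>> (bitsPerValue - spill));
        }
    }

    /** Packed lookup: index arithmetic, one or two array reads, shift and mask. */
    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long v = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            v |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return v & mask;
    }

    /** Width of the smallest native array type (byte, short, int, long) that fits. */
    static int nativeBits(int bitsPerValue) {
        if (bitsPerValue <= 8) return 8;
        if (bitsPerValue <= 16) return 16;
        if (bitsPerValue <= 32) return 32;
        return 64;
    }
}
```

With a native array the lookup is just ords[docID], which is why the second test is faster.  The space cost is the unused bits: if, say, a field needed 23 bits per doc (enough to number 5M distinct values), rounding up to int[] would spend 32 bits per doc instead.  When the width is already 8 bits, as for the country field, both layouts take the same space.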

> Add FieldCache.getTermBytes, to load term data as byte[]
> --------------------------------------------------------
>
>                 Key: LUCENE-2380
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2380
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2380.patch, LUCENE-2380.patch, LUCENE-2380.patch
>
>
> With flex, a term is now an opaque byte[] (typically, utf8 encoded unicode
> string, but not necessarily), so we need to push this up the search stack.
> FieldCache now has getStrings and getStringIndex; we need corresponding
> methods to load terms as native byte[], since in general they may not be
> representable as String.  This should be quite a bit more RAM efficient too,
> for US ascii content since each character would then use 1 byte not 2.