[ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049674#comment-13049674 ]

Martin Grotzke commented on SOLR-2583:
--------------------------------------

The test that produced this output can be found in my lucene-solr fork on
github: https://github.com/magro/lucene-solr/commit/b9af87b1
The test method that was executed was testCompareMemoryUsage. For measuring
memory usage I used http://code.google.com/p/memory-measurer/ and ran the
test JVM with "-Xmx1G -javaagent:solr/lib/object-explorer.jar" (just from
Eclipse).

I just added another test that uses a fixed size and an increasing number of
puts (testCompareMemoryUsageWithFixSizeAndIncreasingNumPuts,
https://github.com/magro/lucene-solr/blob/trunk/solr/src/test/org/apache/solr/search/function/FileFloatSourceMemoryTest.java#L56),
with the following results:

{noformat}
Size: 1000000
NumPuts     1.000 (  0,1%),   CompactFloatArray    918.616,   float[] 4.000.016,   HashMap     72.128
NumPuts    10.000 (  1,0%),   CompactFloatArray  3.738.712,   float[] 4.000.016,   HashMap    701.696
NumPuts    50.000 (  5,0%),   CompactFloatArray  4.016.472,   float[] 4.000.016,   HashMap  3.383.104
NumPuts    55.000 (  5,5%),   CompactFloatArray  4.016.472,   float[] 4.000.016,   HashMap  3.949.120
NumPuts    60.000 (  6,0%),   CompactFloatArray  4.016.472,   float[] 4.000.016,   HashMap  4.254.848
NumPuts   100.000 ( 10,0%),   CompactFloatArray  4.016.472,   float[] 4.000.016,   HashMap  6.622.272
NumPuts   500.000 ( 50,0%),   CompactFloatArray  4.016.472,   float[] 4.000.016,   HashMap 27.262.976
NumPuts 1.000.000 (100,0%),   CompactFloatArray  4.016.472,   float[] 4.000.016,   HashMap 44.649.664
{noformat}

It seems that the HashMap is the most memory-efficient solution up to a fill
ratio of ~5.5%. Above this threshold CompactFloatArray and float[] use less
memory, and CompactFloatArray has no advantage over float[] once more than
~5% of the docs have scores.
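A rough back-of-envelope check of that threshold, using only the measured numbers from the table above: the dense float[] costs a flat 4 bytes per doc, while each HashMap<Integer,Float> entry carries boxed key/value objects, an Entry object and a table slot. Dividing the measured map size at 55.000 puts by the entry count gives the per-entry cost and, from that, the expected break-even fill ratio (this is just my arithmetic, not part of the test):

```java
// Break-even estimate from the measured sizes above.
public class BreakEven {
    public static void main(String[] args) {
        int numDocs = 1_000_000;
        long denseBytes = 4_000_016;      // measured float[] size
        long mapBytesAt55k = 3_949_120;   // measured HashMap size at 55.000 puts

        // Per-entry cost of the HashMap (boxed Integer + Float, Entry, slot).
        double bytesPerEntry = (double) mapBytesAt55k / 55_000; // ~71.8 bytes

        // Fill ratio at which the map costs as much as the full float[].
        double breakEven = ((double) denseBytes / numDocs) / bytesPerEntry;
        System.out.printf("~%.1f bytes/entry, break-even at ~%.1f%%%n",
                bytesPerEntry, breakEven * 100);
    }
}
```

That lands at roughly 72 bytes per entry and a break-even of about 5.6% of numDocs, which fits the measured ~5.5% threshold nicely.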

Therefore I'd suggest an adaptive strategy: use a HashMap while the number of
scored docs is below ~5.5% of numDocs, and fall back to the original float[]
approach once that threshold is reached.
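A minimal sketch of what I mean by that adaptive choice (the class and method names here are hypothetical, not from the attached patch; the real implementation would live in FileFloatSource):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: pick sparse vs. dense storage based on the
// measured ~5.5% break-even between HashMap and float[maxDoc].
public class AdaptiveScoreStore {
    private static final double DENSE_THRESHOLD = 0.055;

    private final Map<Integer, Float> sparse; // non-null when few docs scored
    private final float[] dense;              // non-null when many docs scored

    public AdaptiveScoreStore(int maxDoc, int expectedEntries) {
        if ((double) expectedEntries / maxDoc < DENSE_THRESHOLD) {
            this.sparse = new HashMap<>(expectedEntries * 2);
            this.dense = null;
        } else {
            this.sparse = null;
            this.dense = new float[maxDoc]; // original FileFloatSource approach
        }
    }

    public void put(int doc, float score) {
        if (sparse != null) sparse.put(doc, score);
        else dense[doc] = score;
    }

    public float get(int doc, float defaultScore) {
        if (sparse != null) {
            Float f = sparse.get(doc);
            return f != null ? f : defaultScore;
        }
        return dense[doc];
    }
}
```

The open question would be how to know expectedEntries up front; counting the lines of the external file on load (or growing the map and converting to float[] when it crosses the threshold) are two obvious options.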

What do you say?

> Make external scoring more efficient (ExternalFileField, FileFloatSource)
> -------------------------------------------------------------------------
>
>                 Key: SOLR-2583
>                 URL: https://issues.apache.org/jira/browse/SOLR-2583
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Martin Grotzke
>            Priority: Minor
>         Attachments: FileFloatSource.java.patch, patch.txt
>
>
> External scoring uses a lot of memory, depending on the number of documents
> in the index. The ExternalFileField (used for external scoring) uses
> FileFloatSource, where one FileFloatSource is created per external scoring
> file. FileFloatSource creates a float array with the size of the number of
> docs (this is also done if the file to load is not found). If there are far
> fewer entries in the scoring file than there are docs in total, the big
> float array wastes a lot of memory.
> This could be optimized by using a map of doc -> score, so that the map 
> contains as many entries as there are scoring entries in the external file, 
> but not more.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
