[
https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049674#comment-13049674
]
Martin Grotzke commented on SOLR-2583:
--------------------------------------
The test that produced this output is in my lucene-solr fork on
github: https://github.com/magro/lucene-solr/commit/b9af87b1
The test method I ran was testCompareMemoryUsage. To measure memory usage I
used http://code.google.com/p/memory-measurer/ and ran the test JVM with
"-Xmx1G -javaagent:solr/lib/object-explorer.jar" (launched from Eclipse).
I just added another test that uses a fixed size and an increasing number of
puts (testCompareMemoryUsageWithFixSizeAndIncreasingNumPuts,
https://github.com/magro/lucene-solr/blob/trunk/solr/src/test/org/apache/solr/search/function/FileFloatSourceMemoryTest.java#L56),
with the following results:
{noformat}
Size: 1000000
NumPuts 1.000 (0,1%), CompactFloatArray 918.616, float[] 4.000.016, HashMap 72.128
NumPuts 10.000 (1,0%), CompactFloatArray 3.738.712, float[] 4.000.016, HashMap 701.696
NumPuts 50.000 (5,0%), CompactFloatArray 4.016.472, float[] 4.000.016, HashMap 3.383.104
NumPuts 55.000 (5,5%), CompactFloatArray 4.016.472, float[] 4.000.016, HashMap 3.949.120
NumPuts 60.000 (6,0%), CompactFloatArray 4.016.472, float[] 4.000.016, HashMap 4.254.848
NumPuts 100.000 (10,0%), CompactFloatArray 4.016.472, float[] 4.000.016, HashMap 6.622.272
NumPuts 500.000 (50,0%), CompactFloatArray 4.016.472, float[] 4.000.016, HashMap 27.262.976
NumPuts 1.000.000 (100,0%), CompactFloatArray 4.016.472, float[] 4.000.016, HashMap 44.649.664
{noformat}
It seems that the HashMap is the most memory-efficient solution up to ~5.5%
(number of puts relative to size). Above this threshold both CompactFloatArray
and float[] use less memory than the HashMap, while CompactFloatArray has no
advantage over float[] once puts reach 5%.
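To narrow down where the HashMap crosses the float[] line, here is a rough sketch that linearly interpolates between the two measurements that bracket it (55.000 and 60.000 puts). The class and method names are made up for illustration, and the linear growth between the two points is an assumption: a HashMap grows its table in steps, so the real curve is not a straight line.

```java
import java.util.Locale;

public class BreakEvenEstimate {

    /**
     * Linearly interpolates the number of puts at which the HashMap footprint
     * reaches targetBytes, given two measured (puts, bytes) points.
     */
    static double breakEvenPuts(long putsLo, long bytesLo,
                                long putsHi, long bytesHi,
                                long targetBytes) {
        double bytesPerPut = (double) (bytesHi - bytesLo) / (putsHi - putsLo);
        return putsLo + (targetBytes - bytesLo) / bytesPerPut;
    }

    public static void main(String[] args) {
        long size = 1_000_000L;            // "Size" from the results above
        long floatArrayBytes = 4_000_016L; // measured float[] footprint

        // HashMap measurements at 55.000 and 60.000 puts from the results above.
        double puts = breakEvenPuts(55_000, 3_949_120, 60_000, 4_254_848, floatArrayBytes);

        System.out.printf(Locale.ROOT, "break-even at ~%.0f puts (%.2f%% of %d)%n",
                puts, 100.0 * puts / size, size);
    }
}
```

With these numbers the estimate lands at roughly 55.800 puts, i.e. about 5.6% of the size, which is in line with the ~5.5% threshold read off the table.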
Therefore I'd suggest an adaptive strategy: use a HashMap as long as the
number of scores is at most 5.5% of the number of docs, and use the original
float[] approach above that threshold.
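A minimal sketch of how such an adaptive strategy could look. This is not the code from the attached patch or from FileFloatSource; the class AdaptiveFloatStore, its methods and the SPARSE_RATIO constant are hypothetical names. It assumes the number of scores is not known up front, so it starts with a HashMap and switches to a float[] once the 5.5% ratio is exceeded.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: HashMap while sparse, float[] once the ratio is exceeded. */
public class AdaptiveFloatStore {

    /** Assumed cut-over ratio, taken from the measurements above (5.5%). */
    static final double SPARSE_RATIO = 0.055;

    private final int numDocs;
    private final float defaultValue;
    private Map<Integer, Float> sparse = new HashMap<>(); // used while sparse
    private float[] dense;                                 // null while sparse

    public AdaptiveFloatStore(int numDocs, float defaultValue) {
        this.numDocs = numDocs;
        this.defaultValue = defaultValue;
    }

    public void put(int doc, float score) {
        if (dense != null) {
            dense[doc] = score;
            return;
        }
        sparse.put(doc, score);
        if (sparse.size() > SPARSE_RATIO * numDocs) {
            switchToDense();
        }
    }

    public float get(int doc) {
        if (dense != null) {
            return dense[doc];
        }
        Float v = sparse.get(doc);
        return v != null ? v : defaultValue;
    }

    public boolean isDense() {
        return dense != null;
    }

    /** Copies all map entries into a float[] of size numDocs and drops the map. */
    private void switchToDense() {
        float[] arr = new float[numDocs];
        Arrays.fill(arr, defaultValue);
        for (Map.Entry<Integer, Float> e : sparse.entrySet()) {
            arr[e.getKey()] = e.getValue();
        }
        dense = arr;
        sparse = null;
    }
}
```

One trade-off of this variant: during the switch both the map and the array are alive, so peak memory is briefly higher than either representation alone. If the number of entries in the external file were known before loading, the representation could be chosen once up front instead.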
What do you say?
> Make external scoring more efficient (ExternalFileField, FileFloatSource)
> -------------------------------------------------------------------------
>
> Key: SOLR-2583
> URL: https://issues.apache.org/jira/browse/SOLR-2583
> Project: Solr
> Issue Type: Improvement
> Components: search
> Reporter: Martin Grotzke
> Priority: Minor
> Attachments: FileFloatSource.java.patch, patch.txt
>
>
> External scoring uses a lot of memory, depending on the number of documents in
> the index. The ExternalFileField (used for external scoring) uses
> FileFloatSource, where one FileFloatSource is created per external scoring
> file. FileFloatSource creates a float array sized to the number of docs
> (this is also done if the file to load is not found). If the scoring file has
> far fewer entries than the index has docs, the big float array wastes a lot
> of memory.
> This could be optimized by using a map of doc -> score, so that the map
> contains as many entries as there are scoring entries in the external file,
> but not more.