[jira] [Updated] (SOLR-2583) Make external scoring more efficient (ExternalFileField, FileFloatSource)

Robert Muir (JIRA) Thu, 09 Jun 2011 18:38:52 -0700

     [ https://issues.apache.org/jira/browse/SOLR-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir updated SOLR-2583:
------------------------------

    Attachment: patch.txt

{quote}
What do you mean by a two-stage table? Can you clarify this please?
{quote}

See: http://www.strchr.com/multi-stage_tables

I attached a patch with a (not great) implementation that I was trying to 
clean up for other reasons... maybe you can use it.

In the sparse case, blocks that contain only the default value are folded into 
one shared block (in this patch blocksize=256, but it should probably be 
configurable).
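
A minimal sketch of that folding, to make the idea concrete. This is not the 
attached patch: the class and method names are made up for illustration, and 
it stores plain floats rather than SmallFloat bytes to keep the code short.

```java
import java.util.Arrays;

/** Hypothetical two-stage table: doc id -> block offset -> value inside the block. */
final class TwoStageFloatTable {
    static final int BLOCK_SHIFT = 8;                 // 256 entries per block
    static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
    static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private final int[] index;    // stage 1: start offset in `values` for each block of doc ids
    private final float[] values; // stage 2: offset 0 holds the shared all-default block

    private TwoStageFloatTable(int[] index, float[] values) {
        this.index = index;
        this.values = values;
    }

    /** Builds the table from a dense array, folding every all-default block into block 0. */
    static TwoStageFloatTable compact(float[] dense, float defaultValue) {
        int numBlocks = (dense.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        int[] index = new int[numBlocks];
        float[] values = new float[BLOCK_SIZE];
        Arrays.fill(values, defaultValue);            // the shared default block
        int used = BLOCK_SIZE;
        for (int b = 0; b < numBlocks; b++) {
            int from = b * BLOCK_SIZE;
            int to = Math.min(from + BLOCK_SIZE, dense.length);
            if (allDefault(dense, from, to, defaultValue)) {
                index[b] = 0;                         // point at the shared block
                continue;
            }
            if (used + BLOCK_SIZE > values.length) {
                values = Arrays.copyOf(values, values.length * 2);
            }
            Arrays.fill(values, used, used + BLOCK_SIZE, defaultValue); // pads a short last block
            System.arraycopy(dense, from, values, used, to - from);
            index[b] = used;
            used += BLOCK_SIZE;
        }
        return new TwoStageFloatTable(index, Arrays.copyOf(values, used));
    }

    private static boolean allDefault(float[] a, int from, int to, float defaultValue) {
        for (int i = from; i < to; i++) {
            if (Float.compare(a[i], defaultValue) != 0) return false;
        }
        return true;
    }

    /** Lookup is two array reads, with no hashing or boxing. */
    float get(int doc) {
        return values[index[doc >>> BLOCK_SHIFT] + (doc & BLOCK_MASK)];
    }

    /** Number of blocks actually stored, including the shared default block. */
    int storedBlocks() {
        return values.length >>> BLOCK_SHIFT;
    }
}
```

With 1000 docs and a single non-default score, only two blocks are stored: 
the shared default block and the one block that holds the score.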

For example, in your 4GB case (1 billion floats), if you use this with 
SmallFloat the absolute worst case (no sharing) is 1GB + 16MB or so, and the 
best case (all default values) is about 16MB. Lookups should also be a lot 
faster than with hashtables, since everything is primitive types, and it 
could definitely be improved further.
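
The arithmetic behind those figures can be checked with a few lines. This is 
a sketch that assumes the block index holds one 4-byte int per block of 256 
docs and that SmallFloat packs each value into one byte; the class and 
method names are hypothetical.

```java
/** Back-of-the-envelope size estimate for a two-stage table. */
final class TwoStageSizeEstimate {
    /** Number of blocks needed to cover numDocs doc ids. */
    static long numBlocks(long numDocs, int blockSize) {
        return (numDocs + blockSize - 1) / blockSize;
    }

    /** Stage-1 index: one int (4 bytes, an assumption) per block. */
    static long indexBytes(long numDocs, int blockSize) {
        return numBlocks(numDocs, blockSize) * Integer.BYTES;
    }

    /** No sharing at all: every block is stored in full. */
    static long worstCaseBytes(long numDocs, int blockSize, int bytesPerValue) {
        return indexBytes(numDocs, blockSize)
                + numBlocks(numDocs, blockSize) * blockSize * bytesPerValue;
    }

    /** Every doc has the default value: the index plus one shared block. */
    static long bestCaseBytes(long numDocs, int blockSize, int bytesPerValue) {
        return indexBytes(numDocs, blockSize) + (long) blockSize * bytesPerValue;
    }

    public static void main(String[] args) {
        long docs = 1_000_000_000L;
        System.out.println("index: " + indexBytes(docs, 256) + " bytes");
        System.out.println("worst: " + worstCaseBytes(docs, 256, 1) + " bytes");
        System.out.println("best:  " + bestCaseBytes(docs, 256, 1) + " bytes");
    }
}
```

For 1 billion docs the index comes to 15,625,000 bytes (the "16MB or so"), 
the worst case to 1,015,625,000 bytes, and the best case to 15,625,256 bytes.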

Really, this is still probably overkill: the data structure is designed to 
share any blocks that have identical values, when in reality it's probably 
enough to share only the blocks that contain nothing but the default value...

I didn't look at the Solr side to see whether it's possible to build it 
incrementally (this would be better than building and then compact()ing, but 
I wonder if this is possible due to Lucene docid/Solr id, etc.)
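
On the data-structure side alone, an incremental build could look roughly 
like the sketch below: every block starts out pointing at the shared default 
block, and a private block is allocated only on the first non-default write. 
The names are hypothetical, and it assumes the Lucene doc id is already known 
when each score is set, which is exactly the open question on the Solr side.

```java
import java.util.Arrays;

/** Hypothetical incremental variant: no dense array and no compact() pass. */
final class IncrementalTwoStageTable {
    static final int BLOCK_SHIFT = 8;
    static final int BLOCK_SIZE = 1 << BLOCK_SHIFT;
    static final int BLOCK_MASK = BLOCK_SIZE - 1;

    private final int[] index;     // all zeros: every block starts as the shared default block
    private final float defaultValue;
    private float[] values;
    private int used;

    IncrementalTwoStageTable(int maxDoc, float defaultValue) {
        this.index = new int[(maxDoc + BLOCK_SIZE - 1) >>> BLOCK_SHIFT];
        this.defaultValue = defaultValue;
        this.values = new float[BLOCK_SIZE];
        Arrays.fill(values, defaultValue);
        this.used = BLOCK_SIZE;
    }

    void set(int doc, float value) {
        int b = doc >>> BLOCK_SHIFT;
        if (index[b] == 0) {
            // Writing the default into the shared block changes nothing.
            if (Float.compare(value, defaultValue) == 0) return;
            // First non-default value in this block: give it private storage.
            if (used + BLOCK_SIZE > values.length) {
                values = Arrays.copyOf(values, values.length * 2);
            }
            Arrays.fill(values, used, used + BLOCK_SIZE, defaultValue);
            index[b] = used;
            used += BLOCK_SIZE;
        }
        values[index[b] + (doc & BLOCK_MASK)] = value;
    }

    float get(int doc) {
        return values[index[doc >>> BLOCK_SHIFT] + (doc & BLOCK_MASK)];
    }

    /** Blocks in use, including the shared default block. */
    int storedBlocks() {
        return used >>> BLOCK_SHIFT;
    }
}
```

One limitation of this sketch: a block that is later overwritten back to all 
defaults keeps its private storage, so it shares only untouched blocks.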


> Make external scoring more efficient (ExternalFileField, FileFloatSource)
> -------------------------------------------------------------------------
>
>                 Key: SOLR-2583
>                 URL: https://issues.apache.org/jira/browse/SOLR-2583
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Martin Grotzke
>            Priority: Minor
>         Attachments: FileFloatSource.java.patch, patch.txt
>
>
> External scoring uses a lot of memory, depending on the number of documents 
> in the index. The ExternalFileField (used for external scoring) uses 
> FileFloatSource, where one FileFloatSource is created per external scoring 
> file. FileFloatSource creates a float array whose size is the number of docs 
> (this is done even if the file to load is not found). If there are far fewer 
> entries in the scoring file than there are docs in total, the big float 
> array wastes a lot of memory.
> This could be optimized by using a map of doc -> score, so that the map 
> contains only as many entries as there are scoring entries in the external 
> file.

