Great, I was hoping for that. In my case I will have to deal with the
worst-case scenario, i.e. all documents matching the query, because
the only criterion is the fingerprint and the result of the
distance/similarity function, which will have to be executed for every
document. However, I am dealing with a scenario where there will not
be many concurrent users.
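
To make the worst case concrete, here is a rough Python sketch of what I expect to happen; it is not Solr code. It assumes Hamming distance as a placeholder for customDistance (the real function is different), and the names custom_distance and top_n are made up for illustration. Every fingerprint is visited once, but only the N best candidates are held in a bounded heap, in the spirit of the PriorityQueue mentioned below.

```python
import heapq

def custom_distance(a: bytes, b: bytes) -> int:
    """Hamming distance in bits; a placeholder for the real distance function."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def top_n(reference: bytes, fingerprints: dict, n: int) -> list:
    """Return the n closest (distance, doc_id) pairs, nearest first."""
    heap = []  # max-heap via negated distance, never holds more than n entries
    for doc_id, fp in fingerprints.items():  # worst case: every document is visited
        d = custom_distance(reference, fp)
        if len(heap) < n:
            heapq.heappush(heap, (-d, doc_id))
        elif -d > heap[0][0]:  # closer than the current worst candidate
            heapq.heapreplace(heap, (-d, doc_id))
    return sorted((-neg, doc_id) for neg, doc_id in heap)
```

The scan itself is linear in the number of matching documents, while the memory used for sorting stays proportional to N rather than to the size of the index.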

Thank you.

On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein <joels...@gmail.com> wrote:
> You only need to have fast access to the fingerprint field so only that
> field needs to be in memory. You'll want to review how Lucene DocValues and
> FieldCache work. Sorting is done with a PriorityQueue so only the top N
> docs are kept in memory.
>
> You'll only need to access the fingerprint field values for documents that
> match the query, so it won't be a full table scan unless all the docs match
> the query.
>
> Sounds like an interesting project. Please keep us posted.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger <krue...@lesspain.de> wrote:
>
>> Hi,
>>
>> let's say I have an index that contains a field of type BinaryField
>> called "fingerprint" that stores a few (let's say 100) bytes that are
>> some kind of digital fingerprint-like thing.
>>
>> Let's say I want to perform queries on that field to achieve sorting
>> or filtering based on a kind of custom distance function
>> "customDistance", i.e. I input a reference "fingerprint" and Solr
>> either returns all documents sorted by
>> customDistance(<referenceFingerprint>,<documentFingerprint>) or uses
>> that value in an frange expression for filtering.
>>
>> I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
>> do understand that using function queries with a custom function is
>> definitely an expensive thing as it will result in what is called a
>> "full table scan" in the SQL world, i.e. data from all documents needs
>> to be touched to select the correct documents or sort by a function's
>> result.
>>
>> Given all that, and provided that I have to use a custom function for
>> my needs, I would like to know a few more details about Solr's
>> architecture to understand what I have to look out for.
>>
>> I will have potentially millions of records. When it comes to RAM
>> usage, does the data contained in other index fields play a role if I
>> only use the "fingerprint" field for sorting and searching? I am
>> hoping that my RAM only needs to accommodate the fingerprint data of
>> all documents for the queries to be fast, not the fingerprint data
>> plus all other indexed or stored data.
>>
>> Example: My fingerprint data needs 100 bytes per document, my other
>> indexed field data needs 900 bytes per document. Will I need 100 MB or
>> 1 GB to fit all data that is needed to process one query in memory?
>>
>> Are there other things to be aware of?
>>
>> Thanks,
>>
>> Robert
>>
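
Going by Joel's answer that only the fingerprint field needs to be in memory, the back-of-the-envelope figure for the example above would look roughly like this. It is only a sketch: it assumes one million documents (the count implied by the two figures in the example) and ignores any per-value overhead that DocValues or FieldCache add, which has not been measured here. The helper name raw_field_bytes is made up for illustration.

```python
def raw_field_bytes(num_docs: int, bytes_per_doc: int) -> int:
    """Raw payload size of field data across all documents, without overhead."""
    return num_docs * bytes_per_doc

NUM_DOCS = 1_000_000  # assumption: implied by the example, not a measured value

fingerprint_only = raw_field_bytes(NUM_DOCS, 100)   # fingerprint field alone
all_fields = raw_field_bytes(NUM_DOCS, 100 + 900)   # fingerprint plus other fields

print(f"fingerprint only: {fingerprint_only / 1e6:.0f} MB")
print(f"all field data:   {all_fields / 1e6:.0f} MB")
```

So the raw fingerprint payload is about a tenth of the total, and the requirement scales linearly with the number of documents.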



-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com
