You only need fast access to the fingerprint field, so only that field
needs to be in memory. You'll want to review how Lucene DocValues and
FieldCache work. Sorting is done with a PriorityQueue, so only the top N
docs are kept in memory.
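
To illustrate the top-N idea, here is a minimal, self-contained sketch in plain Java. It is not Solr or Lucene code: it uses java.util.PriorityQueue rather than Lucene's own queue class, the in-memory byte[][] array stands in for whatever DocValues or FieldCache would provide, and hammingDistance is only a hypothetical placeholder for your customDistance. All class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopNByDistance {

    // Hypothetical stand-in for customDistance: number of differing bits.
    // Assumes both fingerprints have the same length.
    static int hammingDistance(byte[] a, byte[] b) {
        int bits = 0;
        for (int i = 0; i < a.length; i++) {
            bits += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return bits;
    }

    // Returns the ids of the n matching docs closest to the reference,
    // nearest first. The heap never holds more than n entries.
    static List<Integer> topN(byte[][] fingerprints, int[] matchingDocs,
                              byte[] reference, int n) {
        List<Integer> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Max-heap on distance: the head is the worst of the current best n.
        // Each entry is {docId, distance}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((x, y) -> Integer.compare(y[1], x[1]));
        for (int doc : matchingDocs) {
            int d = hammingDistance(reference, fingerprints[doc]);
            if (heap.size() < n) {
                heap.add(new int[] {doc, d});
            } else if (d < heap.peek()[1]) {
                heap.poll();                  // evict the current worst
                heap.add(new int[] {doc, d});
            }
        }
        // Polling yields worst-first, so reverse to get nearest-first.
        while (!heap.isEmpty()) {
            result.add(heap.poll()[0]);
        }
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        byte[][] fingerprints = { {0x00}, {0x0F}, {0x01}, {(byte) 0xFF} };
        int[] matchingDocs = {0, 1, 2, 3};
        System.out.println(topN(fingerprints, matchingDocs, new byte[] {0x00}, 2));
    }
}
```

The point of the bounded heap is that memory for the sort grows with N, not with the number of matching documents; the fingerprints themselves are a separate cost.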

You'll only need to access the fingerprint field values for documents that
match the query, so it won't be a full table scan unless all the docs match
the query.
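
As a rough sketch of that point, the plain-Java example below filters by a distance range, loosely mirroring the lower and upper bounds you would give an frange filter, and counts how many fingerprints it reads. Again this is not Solr code: the class, the counter, and the Hamming distance standing in for customDistance are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class DistanceRangeFilter {

    // Counts fingerprint reads, to show that the cost follows the number
    // of matching docs rather than the size of the index.
    static int fingerprintsRead = 0;

    // Hypothetical stand-in for customDistance: number of differing bits.
    // Assumes both fingerprints have the same length.
    static int hammingDistance(byte[] a, byte[] b) {
        int bits = 0;
        for (int i = 0; i < a.length; i++) {
            bits += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return bits;
    }

    // Keeps the matching docs whose distance to the reference lies in
    // [lower, upper], both bounds inclusive.
    static List<Integer> filter(byte[][] fingerprints, int[] matchingDocs,
                                byte[] reference, int lower, int upper) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : matchingDocs) {
            fingerprintsRead++;   // only docs matched by the query get here
            int d = hammingDistance(reference, fingerprints[doc]);
            if (d >= lower && d <= upper) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        byte[][] fingerprints = { {0x00}, {0x0F}, {0x01}, {(byte) 0xFF} };
        int[] matchingDocs = {1, 2};  // pretend the main query matched two docs
        System.out.println(filter(fingerprints, matchingDocs, new byte[] {0x00}, 0, 3));
        System.out.println("fingerprints read: " + fingerprintsRead);
    }
}
```

With four documents in the array and two matching the query, only two fingerprints are read. If the main query matches everything, the loop degenerates into the full scan you described.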

Sounds like an interesting project. Please keep us posted.

Joel Bernstein
Search Engineer at Heliosearch


On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger <krue...@lesspain.de> wrote:

> Hi,
>
> let's say I have an index that contains a field of type BinaryField
> called "fingerprint" that stores a few (let's say 100) bytes that are
> some kind of digital fingerprint-like thing.
>
> Let's say I want to perform queries on that field to achieve sorting
> or filtering based on a kind of custom distance function
> "customDistance", i.e. I input a reference "fingerprint" and Solr
> returns either all documents sorted by
> customDistance(<referenceFingerprint>,<documentFingerprint>) or use
> that in an frange expression for filtering.
>
> I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
> do understand that using function queries with a custom function is
> definitely an expensive thing as it will result in what is called a
> "full table scan" in the SQL world, i.e. data from all documents needs
> to be touched to select the correct documents or sort by a function's
> result.
>
> Given all that, and provided I have to use a custom function for my
> needs, I would like to know a few more details about Solr architecture
> to understand what I have to look out for.
>
> I will have potentially millions of records. Does the data contained
> in other index fields play a role when I only use the "fingerprint"
> field for sorting and searching when it comes to RAM usage? I am
> hoping to calculate that my RAM should be able to accommodate the
> fingerprint data of all available documents for the queries to be fast
> but not fingerprint data and all other indexed or stored data.
>
> Example: My fingerprint data needs 100 bytes per document, my other
> indexed field data needs 900 bytes per document. Will I need 100MB or
> 1GB to fit all data that is needed to process one query in memory?
>
> Are there other things to be aware of?
>
> Thanks,
>
> Robert
>
