Great, I was hoping for that. In my case I will have to deal with the worst-case scenario, i.e. all documents matching the query, because the only criterion is the fingerprint, and the distance/similarity function will have to be executed for every document. However, there will not be many concurrent users in my scenario.
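For illustration, this worst case can be sketched in plain Java roughly as follows. This is not Solr or Lucene API: the class `FingerprintTopN`, the `Hit` type, and the use of Hamming distance as a stand-in for `customDistance` are assumptions made only for this example. It shows the distance function running once per document while a PriorityQueue keeps only the top N hits.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names are Solr or Lucene API.
final class FingerprintTopN {

    /** A document id paired with its distance to the reference fingerprint. */
    static final class Hit {
        final int docId;
        final int distance;

        Hit(int docId, int distance) {
            this.docId = docId;
            this.distance = distance;
        }
    }

    /** Assumed stand-in for customDistance: number of differing bits (Hamming distance). */
    static int customDistance(byte[] a, byte[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("fingerprints must have equal length");
        }
        int bits = 0;
        for (int i = 0; i < a.length; i++) {
            bits += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
        }
        return bits;
    }

    /** Visits every fingerprint once, but never holds more than n hits on the heap. */
    static List<Hit> topN(byte[] reference, byte[][] fingerprints, int n) {
        List<Hit> result = new ArrayList<>();
        if (n <= 0) {
            return result;
        }
        // Max-heap on distance: the head is the worst hit currently kept.
        PriorityQueue<Hit> heap = new PriorityQueue<>(
                Comparator.comparingInt((Hit h) -> h.distance).reversed());
        for (int docId = 0; docId < fingerprints.length; docId++) {
            int d = customDistance(reference, fingerprints[docId]);
            if (heap.size() < n) {
                heap.add(new Hit(docId, d));
            } else if (d < heap.peek().distance) {
                heap.poll();               // drop the current worst hit
                heap.add(new Hit(docId, d));
            }
        }
        result.addAll(heap);
        result.sort(Comparator.comparingInt((Hit h) -> h.distance)); // closest first
        return result;
    }
}
```

In this sketch the scan reads only the fingerprint bytes, so the working set grows with the fingerprint size per document rather than with the size of the other fields. How close this is to what Solr actually does depends on how the field is exposed through DocValues or the FieldCache.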
Thank you.

On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein <joels...@gmail.com> wrote:

> You only need to have fast access to the fingerprint field so only that
> field needs to be in memory. You'll want to review how Lucene DocValues and
> FieldCache work. Sorting is done with a PriorityQueue so only the top N
> docs are kept in memory.
>
> You'll only need to access the fingerprint field values for documents that
> match the query, so it won't be a full table scan unless all the docs match
> the query.
>
> Sounds like an interesting project. Please keep us posted.
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger <krue...@lesspain.de> wrote:
>
>> Hi,
>>
>> let's say I have an index that contains a field of type BinaryField
>> called "fingerprint" that stores a few (let's say 100) bytes that are
>> some kind of digital fingerprint-like thing.
>>
>> Let's say I want to perform queries on that field to achieve sorting
>> or filtering based on a kind of custom distance function
>> "customDistance", i.e. I input a reference "fingerprint" and Solr
>> returns either all documents sorted by
>> customDistance(<referenceFingerprint>,<documentFingerprint>) or use
>> that in an frange expression for filtering.
>>
>> I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
>> do understand that using function queries with a custom function is
>> definitely an expensive thing as it will result in what is called a
>> "full table scan" in the sql world, i.e. data from all documents needs
>> to be touched to select the correct documents or sort by a function's
>> result.
>>
>> Given all that and provided, I have to use a custom function for my
>> needs, I would like to know a few more details about solr architecture
>> to understand what I have to look out for.
>>
>> I will have potentially millions of records. Does the data contained
>> in other index fields play a role when I only use the "fingerprint"
>> field for sorting and searching when it comes to RAM usage? I am
>> hoping to calculate that my RAM should be able to accommodate the
>> fingerprint data of all available documents for the queries to be fast
>> but not fingerprint data and all other indexed or stored data.
>>
>> Example: My fingerprint data needs 100bytes per document, my other
>> indexed field data needs 900 bytes per document. Will I need 100MB or
>> 1GB to fit all data that is needed to process one query in memory?
>>
>> Are there other things to be aware of?
>>
>> Thanks,
>>
>> Robert
>>

--
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com