Re: Performance/scaling with custom function queries
Thanks for the info. I will look at that.

On Wed, Jun 11, 2014 at 3:47 PM, Joel Bernstein joels...@gmail.com wrote:

In Solr 4.9 there is a feature called RankQueries that allows you to plug in your own ranking collector. So, if you wanted to write a ranking/sorting collector that used a thread per segment, you could cleanly plug it in.

-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG
www.lesspain-software.com
Re: Performance/scaling with custom function queries
Would Solr use multithreading to process the records of a function query as described above? In my scenario concurrent searches are not the issue; rather, the speed of a single query is the optimization target. Or will I have to set up distributed search to achieve that?

Thanks,
Robert

On Tue, Jun 10, 2014 at 10:11 AM, Robert Krüger krue...@lesspain.de wrote:

Great, I was hoping for that. In my case I will have to deal with the worst-case scenario, i.e. all documents matching the query, because the only criterion is the fingerprint and the result of the distance/similarity function, which will have to be executed for every document.

-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG
www.lesspain-software.com
Re: Performance/scaling with custom function queries
On Wed, Jun 11, 2014 at 7:46 AM, Robert Krüger krue...@lesspain.de wrote:

Or will I have to set up distributed search to achieve that?

Yes, you have to shard it to achieve that. The shards could be on the same node. There were some discussions this year in JIRA about being able to do thread-per-segment, but it's not quite there yet. FWIW, I think it would be a nice option for some use cases (like yours).

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley
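To illustrate what sharding buys here, below is a minimal sketch in plain Python (hypothetical doc ids and distances, not Solr's actual merge code) of the merge step a distributed search performs: each shard independently computes its own sorted top-N by distance, and the coordinator merges those lists into a global top-N.

```python
import heapq

# Hypothetical per-shard results: each shard independently returns its own
# top-3 (doc_id, distance) pairs, already sorted ascending by distance.
shard_results = [
    [("s1-d4", 0.10), ("s1-d9", 0.35), ("s1-d2", 0.80)],
    [("s2-d7", 0.05), ("s2-d1", 0.40), ("s2-d3", 0.60)],
    [("s3-d8", 0.20), ("s3-d5", 0.25), ("s3-d6", 0.90)],
]

def merge_top_n(shard_results, n):
    # Each shard's list is already sorted, so a k-way merge yields the
    # global ordering; only the first n entries are kept.
    merged = heapq.merge(*shard_results, key=lambda pair: pair[1])
    return [doc for doc, _ in list(merged)[:n]]

print(merge_top_n(shard_results, 3))  # ['s2-d7', 's1-d4', 's3-d8']
```

Because each shard does its scoring independently, the expensive per-document distance computation runs in parallel across shards, even when they live on the same node.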
Re: Performance/scaling with custom function queries
In Solr 4.9 there is a feature called RankQueries that allows you to plug in your own ranking collector. So, if you wanted to write a ranking/sorting collector that used a thread per segment, you could cleanly plug it in.

Joel Bernstein
Search Engineer at Heliosearch

On Wed, Jun 11, 2014 at 9:39 AM, david.w.smi...@gmail.com wrote:

Yes, you have to shard it to achieve that. The shards could be on the same node. There were some discussions this year in JIRA about being able to do thread-per-segment, but it's not quite there yet.
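The thread-per-segment idea Joel describes can be sketched outside Solr in plain Python (hypothetical data and helper names; a real implementation would be a custom RankQuery/collector in Java): score each segment's documents in a separate thread, keep a per-segment top-N, then merge.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "segments": lists of (doc_id, fingerprint_value) pairs.
# A single integer stands in for the real multi-byte fingerprint.
segments = [
    [(0, 5), (1, 200), (2, 17)],
    [(3, 9), (4, 250), (5, 3)],
]

REFERENCE = 7  # reference fingerprint value to compare against

def distance(value):
    # Stand-in for the real fingerprint distance function.
    return abs(value - REFERENCE)

def top_n_for_segment(segment, n):
    # Score every doc in this segment; keep the n closest, sorted ascending.
    return heapq.nsmallest(n, ((distance(v), doc) for doc, v in segment))

def parallel_top_n(segments, n):
    # One task per segment, like a thread-per-segment collector; the
    # per-segment sorted lists are then merged into a global top-n.
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        per_segment = pool.map(lambda seg: top_n_for_segment(seg, n), segments)
        return [doc for _, doc in heapq.nsmallest(n, heapq.merge(*per_segment))]

print(parallel_top_n(segments, 2))  # [0, 3]
```

The merge step is cheap (N items per segment), so the wall-clock time is dominated by the largest segment rather than by the whole index.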
Re: Performance/scaling with custom function queries
Great, I was hoping for that. In my case I will have to deal with the worst-case scenario, i.e. all documents matching the query, because the only criterion is the fingerprint and the result of the distance/similarity function, which will have to be executed for every document. However, I am dealing with a scenario where there will not be many concurrent users.

Thank you.

On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein joels...@gmail.com wrote:

You only need fast access to the fingerprint field, so only that field needs to be in memory. You'll want to review how Lucene DocValues and the FieldCache work. Sorting is done with a PriorityQueue, so only the top N docs are kept in memory. You'll only need to access the fingerprint field values for documents that match the query, so it won't be a full table scan unless all the docs match the query.

-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG
www.lesspain-software.com
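The 100 MB vs. 1 GB question from the original post can be worked through with quick arithmetic, under the assumption Joel states (only the fingerprint field needs to be memory-resident; the per-document sizes are the hypothetical ones from the question):

```python
num_docs = 1_000_000          # "potentially millions of records"
fingerprint_bytes = 100       # fingerprint field, per document
other_indexed_bytes = 900     # all other indexed fields, per document

# If only the fingerprint field must be hot in memory (DocValues-style
# column access), the working set is roughly:
fingerprint_only_mb = num_docs * fingerprint_bytes / 1_000_000
# If all indexed data had to be resident, it would instead be:
everything_mb = num_docs * (fingerprint_bytes + other_indexed_bytes) / 1_000_000

print(fingerprint_only_mb, everything_mb)  # 100.0 1000.0
```

So at one million documents the answer under that assumption is the ~100 MB figure, scaling linearly with document count.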
Performance/scaling with custom function queries
Hi,

let's say I have an index that contains a field of type BinaryField called "fingerprint" that stores a few (let's say 100) bytes that are some kind of digital fingerprint. Let's say I want to perform queries on that field to achieve sorting or filtering based on a custom distance function customDistance, i.e. I input a reference fingerprint and Solr returns either all documents sorted by customDistance(referenceFingerprint, documentFingerprint), or uses that in a frange expression for filtering.

I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I do understand that using function queries with a custom function is definitely expensive, as it results in what the SQL world calls a full table scan, i.e. data from all documents needs to be touched to select the matching documents or sort by the function's result.

Given all that, and provided I have to use a custom function for my needs, I would like to know a few more details about Solr's architecture to understand what to look out for. I will have potentially millions of records. Does the data contained in other index fields play a role for RAM usage when I only use the fingerprint field for sorting and searching? I am hoping that my RAM only has to accommodate the fingerprint data of all documents for queries to be fast, not the fingerprint data plus all other indexed or stored data. Example: my fingerprint data needs 100 bytes per document, my other indexed field data needs 900 bytes per document. Will I need 100 MB or 1 GB to fit all the data needed to process one query in memory? Are there other things to be aware of?

Thanks,
Robert
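For concreteness, a minimal sketch of the kind of customDistance described above, in plain Python rather than as a Solr ValueSource plugin (the Hamming distance used here is only an assumed example of a fingerprint distance; the real function and field contents may differ):

```python
# Hamming distance between two equal-length byte fingerprints:
# the number of differing bits, computed byte by byte.
def custom_distance(a: bytes, b: bytes) -> int:
    assert len(a) == len(b)
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Hypothetical mini-index: doc id -> 4-byte fingerprint
# (the real field would hold ~100 bytes per document).
docs = {
    "doc1": bytes([0x00, 0x00, 0x00, 0x00]),
    "doc2": bytes([0xFF, 0x00, 0x00, 0x00]),
    "doc3": bytes([0xFF, 0xFF, 0x0F, 0x00]),
}

reference = bytes([0x00, 0x00, 0x00, 0x00])

# "Return all documents sorted by customDistance(reference, fingerprint)":
ranked = sorted(docs, key=lambda d: custom_distance(reference, docs[d]))
print(ranked)  # ['doc1', 'doc2', 'doc3']
```

Filtering with a distance threshold (what a frange expression would do over the function's values) is the same loop with a comparison instead of a sort.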
Re: Performance/scaling with custom function queries
You only need fast access to the fingerprint field, so only that field needs to be in memory. You'll want to review how Lucene DocValues and the FieldCache work. Sorting is done with a PriorityQueue, so only the top N docs are kept in memory. You'll only need to access the fingerprint field values for documents that match the query, so it won't be a full table scan unless all the docs match the query.

Sounds like an interesting project. Please keep us posted.

Joel Bernstein
Search Engineer at Heliosearch

On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger krue...@lesspain.de wrote:

Let's say I have an index that contains a field of type BinaryField called "fingerprint" that stores a few (let's say 100) bytes, and I want to sort or filter on a custom distance function customDistance(referenceFingerprint, documentFingerprint). Will I need 100 MB or 1 GB to fit all the data needed to process one query in memory? Are there other things to be aware of?
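Joel's two points, that only matching documents are scored and that a bounded priority queue keeps just the top N in memory, can be sketched in plain Python (hypothetical data; the dict stands in for DocValues-style column access to the fingerprint field, and the size-bounded heap plays the role of Lucene's PriorityQueue):

```python
import heapq

# Column-store view of the fingerprint field (doc id -> value),
# standing in for Lucene DocValues; an int stands in for the real bytes.
fingerprints = {0: 12, 1: 99, 2: 15, 3: 42, 4: 7}

def top_n(matching_doc_ids, reference, n):
    # Only docs that matched the query are scored; non-matching docs are
    # never touched, so this is not a full scan unless everything matches.
    heap = []  # max-heap of size n, via negated distances
    for doc in matching_doc_ids:
        dist = abs(fingerprints[doc] - reference)
        if len(heap) < n:
            heapq.heappush(heap, (-dist, doc))
        elif -dist > heap[0][0]:
            # Closer than the current worst of the top n: replace it.
            heapq.heapreplace(heap, (-dist, doc))
    # Return doc ids ordered nearest-first.
    return [doc for _, doc in sorted(heap, reverse=True)]

# Suppose the query matched only docs 0, 2 and 4:
print(top_n([0, 2, 4], reference=10, n=2))  # [0, 4]
```

Memory stays proportional to N (the heap) plus the fingerprint column itself, never to the full result set, which is what makes the worst case (all docs matching) CPU-bound rather than memory-bound.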