Re: Performance/scaling with custom function queries

2014-06-12 Thread Robert Krüger
Thanks for the info. I will look at that.

On Wed, Jun 11, 2014 at 3:47 PM, Joel Bernstein joels...@gmail.com wrote:
 In Solr 4.9 there is a feature called RankQueries that allows you to
 plug in your own ranking collector. So, if you wanted to write a
 ranking/sorting collector that used a thread per segment, you could cleanly
 plug it in.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Wed, Jun 11, 2014 at 9:39 AM, david.w.smi...@gmail.com 
 david.w.smi...@gmail.com wrote:

 On Wed, Jun 11, 2014 at 7:46 AM, Robert Krüger krue...@lesspain.de
 wrote:

  Or will I have to set up distributed search to achieve that?


 Yes — you have to shard it to achieve that.  The shards could be on the
 same node.

 There were some discussions this year in JIRA about being able to do
 thread-per-segment but it’s not quite there yet.  FWIW I think it would be
 a nice option for some use-cases (like yours).

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Performance/scaling with custom function queries

2014-06-11 Thread Robert Krüger
Would Solr use multithreading to process the records of a function
query as described above? In my scenario concurrent searches are not
the issue; rather, the speed of a single query is the optimization
target. Or will I have to set up distributed search to achieve that?

Thanks,

Robert

On Tue, Jun 10, 2014 at 10:11 AM, Robert Krüger krue...@lesspain.de wrote:
 Great, I was hoping for that. In my case I will have to deal with the
 worst-case scenario, i.e. all documents matching the query, because
 the only criterion is the fingerprint and the result of the
 distance/similarity function, which will have to be executed for every
 document. However, I am dealing with a scenario where there will not
 be many concurrent users.

 Thank you.

 On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein joels...@gmail.com wrote:
 You only need fast access to the fingerprint field, so only that
 field needs to be in memory. You'll want to review how Lucene DocValues
 and the FieldCache work. Sorting is done with a PriorityQueue, so only
 the top N docs are kept in memory.

 You'll only need to access the fingerprint field values for documents that
 match the query, so it won't be a full table scan unless all the docs match
 the query.

 Sounds like an interesting project. Please keep us posted.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger krue...@lesspain.de wrote:

 Hi,

 let's say I have an index that contains a field of type BinaryField
 called fingerprint that stores a few (let's say 100) bytes that are
 some kind of digital fingerprint-like thing.

 Let's say I want to perform queries on that field to achieve sorting
 or filtering based on a custom distance function customDistance,
 i.e. I input a reference fingerprint and Solr either returns all
 documents sorted by customDistance(referenceFingerprint,documentFingerprint)
 or uses that value in a frange expression for filtering.

 I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
 understand that using function queries with a custom function is
 definitely expensive, as it results in what the SQL world calls a
 full table scan, i.e. data from all documents needs to be touched to
 select the matching documents or to sort by a function's result.

 Given all that, and provided I have to use a custom function for my
 needs, I would like to know a few more details about Solr's
 architecture to understand what I have to look out for.

 I will have potentially millions of records. When it comes to RAM
 usage, does the data contained in other index fields play a role if I
 only use the fingerprint field for sorting and searching? I am hoping
 that, for queries to be fast, my RAM only needs to accommodate the
 fingerprint data of all documents, not the fingerprint data plus all
 other indexed or stored data.

 Example: My fingerprint data needs 100 bytes per document, and my other
 indexed field data needs 900 bytes per document. Will I need 100 MB or
 1 GB to fit all the data needed to process one query in memory?

 Are there other things to be aware of?

 Thanks,

 Robert




 --
 Robert Krüger
 Managing Partner
 Lesspain GmbH & Co. KG

 www.lesspain-software.com



-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Re: Performance/scaling with custom function queries

2014-06-11 Thread david.w.smi...@gmail.com
On Wed, Jun 11, 2014 at 7:46 AM, Robert Krüger krue...@lesspain.de wrote:

 Or will I have to set up distributed search to achieve that?


Yes — you have to shard it to achieve that.  The shards could be on the
same node.
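
For illustration, a minimal SolrJ (4.x) sketch of querying two same-node
shards as one logical index; the core names and URLs here are made up:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// Hypothetical sketch: two cores on one node searched as shards of
// one index, so each shard is searched in its own server-side thread.
public class ShardedQueryExample {
  public static void main(String[] args) throws SolrServerException {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/shard1");
    SolrQuery query = new SolrQuery("*:*");
    query.set("shards", "localhost:8983/solr/shard1,localhost:8983/solr/shard2");
    QueryResponse rsp = server.query(query);
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}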

There were some discussions this year in JIRA about being able to do
thread-per-segment but it’s not quite there yet.  FWIW I think it would be
a nice option for some use-cases (like yours).

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley


Re: Performance/scaling with custom function queries

2014-06-11 Thread Joel Bernstein
In Solr 4.9 there is a feature called RankQueries that allows you to
plug in your own ranking collector. So, if you wanted to write a
ranking/sorting collector that used a thread per segment, you could cleanly
plug it in.
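
A rough, untested skeleton of such a plugin (class name invented; check the
exact RankQuery signatures in your Solr version, and note that a stock Lucene
collector stands in below for the custom thread-per-segment one):

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocsCollector;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.solr.handler.component.MergeStrategy;
import org.apache.solr.search.RankQuery;
import org.apache.solr.search.SolrIndexSearcher;

// Untested sketch of a RankQuery that plugs in a custom ranking collector.
// A production version should also override equals/hashCode properly.
public class FingerprintRankQuery extends RankQuery {
  private Query mainQuery;

  @Override
  public TopDocsCollector getTopDocsCollector(int len,
      SolrIndexSearcher.QueryCommand cmd, IndexSearcher searcher)
      throws IOException {
    // Replace this with a collector that ranks each index segment in its
    // own thread and merges the per-segment priority queues.
    return TopScoreDocCollector.create(len, false);
  }

  @Override
  public RankQuery wrap(Query q) {
    this.mainQuery = q; // the main query this RankQuery decorates
    return this;
  }

  @Override
  public MergeStrategy getMergeStrategy() {
    return null; // only needed for distributed result merging
  }

  @Override
  public String toString(String field) {
    return "fingerprintRank("
        + (mainQuery == null ? "" : mainQuery.toString(field)) + ")";
  }
}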

Joel Bernstein
Search Engineer at Heliosearch


On Wed, Jun 11, 2014 at 9:39 AM, david.w.smi...@gmail.com 
david.w.smi...@gmail.com wrote:

 On Wed, Jun 11, 2014 at 7:46 AM, Robert Krüger krue...@lesspain.de
 wrote:

  Or will I have to set up distributed search to achieve that?


 Yes — you have to shard it to achieve that.  The shards could be on the
 same node.

 There were some discussions this year in JIRA about being able to do
 thread-per-segment but it’s not quite there yet.  FWIW I think it would be
 a nice option for some use-cases (like yours).

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley



Re: Performance/scaling with custom function queries

2014-06-10 Thread Robert Krüger
Great, I was hoping for that. In my case I will have to deal with the
worst-case scenario, i.e. all documents matching the query, because
the only criterion is the fingerprint and the result of the
distance/similarity function, which will have to be executed for every
document. However, I am dealing with a scenario where there will not
be many concurrent users.

Thank you.

On Mon, Jun 9, 2014 at 1:57 AM, Joel Bernstein joels...@gmail.com wrote:
 You only need fast access to the fingerprint field, so only that
 field needs to be in memory. You'll want to review how Lucene DocValues
 and the FieldCache work. Sorting is done with a PriorityQueue, so only
 the top N docs are kept in memory.

 You'll only need to access the fingerprint field values for documents that
 match the query, so it won't be a full table scan unless all the docs match
 the query.

 Sounds like an interesting project. Please keep us posted.

 Joel Bernstein
 Search Engineer at Heliosearch


 On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger krue...@lesspain.de wrote:

 Hi,

 let's say I have an index that contains a field of type BinaryField
 called fingerprint that stores a few (let's say 100) bytes that are
 some kind of digital fingerprint-like thing.

 Let's say I want to perform queries on that field to achieve sorting
 or filtering based on a custom distance function customDistance,
 i.e. I input a reference fingerprint and Solr either returns all
 documents sorted by customDistance(referenceFingerprint,documentFingerprint)
 or uses that value in a frange expression for filtering.

 I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
 understand that using function queries with a custom function is
 definitely expensive, as it results in what the SQL world calls a
 full table scan, i.e. data from all documents needs to be touched to
 select the matching documents or to sort by a function's result.

 Given all that, and provided I have to use a custom function for my
 needs, I would like to know a few more details about Solr's
 architecture to understand what I have to look out for.

 I will have potentially millions of records. When it comes to RAM
 usage, does the data contained in other index fields play a role if I
 only use the fingerprint field for sorting and searching? I am hoping
 that, for queries to be fast, my RAM only needs to accommodate the
 fingerprint data of all documents, not the fingerprint data plus all
 other indexed or stored data.

 Example: My fingerprint data needs 100 bytes per document, and my other
 indexed field data needs 900 bytes per document. Will I need 100 MB or
 1 GB to fit all the data needed to process one query in memory?

 Are there other things to be aware of?

 Thanks,

 Robert




-- 
Robert Krüger
Managing Partner
Lesspain GmbH & Co. KG

www.lesspain-software.com


Performance/scaling with custom function queries

2014-06-08 Thread Robert Krüger
Hi,

let's say I have an index that contains a field of type BinaryField
called fingerprint that stores a few (let's say 100) bytes that are
some kind of digital fingerprint-like thing.
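
(As a hypothetical aside: such a field can be fed from SolrJ as a plain
byte[]; the field and core names below are made up.)

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical sketch: indexing a 100-byte fingerprint into a BinaryField.
public class IndexFingerprint {
  public static void main(String[] args) throws Exception {
    HttpSolrServer server =
        new HttpSolrServer("http://localhost:8983/solr/collection1");
    byte[] fingerprint = new byte[100]; // computed elsewhere
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc-1");
    doc.addField("fingerprint", fingerprint); // SolrJ base64-encodes byte[]
    server.add(doc);
    server.commit();
  }
}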

Let's say I want to perform queries on that field to achieve sorting
or filtering based on a custom distance function customDistance,
i.e. I input a reference fingerprint and Solr either returns all
documents sorted by customDistance(referenceFingerprint,documentFingerprint)
or uses that value in a frange expression for filtering.
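
For concreteness, the two variants might be issued via SolrJ like this
(a hedged sketch: customDistance is the custom function described above,
not a stock Solr function, and how the reference fingerprint is encoded
as an argument would depend on the custom parser):

import org.apache.solr.client.solrj.SolrQuery;

// Hypothetical sketch of the two query variants described above.
public class FingerprintQueries {

  // Sort all documents by distance to the reference fingerprint.
  public static SolrQuery sortByDistance(String ref) {
    SolrQuery q = new SolrQuery("*:*");
    q.setSort("customDistance(fingerprint," + ref + ")", SolrQuery.ORDER.asc);
    return q;
  }

  // Keep only documents whose distance is at most maxDist.
  public static SolrQuery filterByDistance(String ref, double maxDist) {
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("{!frange u=" + maxDist + "}customDistance(fingerprint," + ref + ")");
    return q;
  }
}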

I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
understand that using function queries with a custom function is
definitely expensive, as it results in what the SQL world calls a
full table scan, i.e. data from all documents needs to be touched to
select the matching documents or to sort by a function's result.

Given all that, and provided I have to use a custom function for my
needs, I would like to know a few more details about Solr's
architecture to understand what I have to look out for.

I will have potentially millions of records. When it comes to RAM
usage, does the data contained in other index fields play a role if I
only use the fingerprint field for sorting and searching? I am hoping
that, for queries to be fast, my RAM only needs to accommodate the
fingerprint data of all documents, not the fingerprint data plus all
other indexed or stored data.

Example: My fingerprint data needs 100 bytes per document, and my other
indexed field data needs 900 bytes per document. Will I need 100 MB or
1 GB to fit all the data needed to process one query in memory?

Are there other things to be aware of?

Thanks,

Robert


Re: Performance/scaling with custom function queries

2014-06-08 Thread Joel Bernstein
You only need fast access to the fingerprint field, so only that
field needs to be in memory. You'll want to review how Lucene DocValues
and the FieldCache work. Sorting is done with a PriorityQueue, so only
the top N docs are kept in memory.
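
To make that concrete, here is an untested sketch of what a customDistance
ValueSource reading per-segment binary DocValues could look like (Lucene/Solr
4.x APIs; the Hamming distance is only an example metric, and null checks for
segments without the field are omitted). Such a ValueSource would be hooked
up through a custom ValueSourceParser declared in solrconfig.xml:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.queries.function.FunctionValues;
import org.apache.lucene.queries.function.ValueSource;
import org.apache.lucene.queries.function.docvalues.FloatDocValues;
import org.apache.lucene.util.BytesRef;

// Untested sketch: computes a distance between a reference fingerprint
// and each document's fingerprint, read per segment from DocValues.
public class CustomDistanceValueSource extends ValueSource {
  private final String field;
  private final byte[] reference;

  public CustomDistanceValueSource(String field, byte[] reference) {
    this.field = field;
    this.reference = reference;
  }

  @Override
  public FunctionValues getValues(Map context, AtomicReaderContext readerContext)
      throws IOException {
    // Only the fingerprint DocValues of this segment are touched.
    final BinaryDocValues dv = readerContext.reader().getBinaryDocValues(field);
    return new FloatDocValues(this) {
      private final BytesRef scratch = new BytesRef();

      @Override
      public float floatVal(int doc) {
        dv.get(doc, scratch); // Lucene 4.x signature; later versions differ
        int dist = 0;
        for (int i = 0; i < reference.length; i++) {
          // Example metric: bitwise Hamming distance over the raw bytes.
          dist += Integer.bitCount(
              (scratch.bytes[scratch.offset + i] ^ reference[i]) & 0xFF);
        }
        return dist;
      }
    };
  }

  @Override
  public String description() {
    return "customDistance(" + field + ",ref)";
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof CustomDistanceValueSource
        && field.equals(((CustomDistanceValueSource) o).field);
  }

  @Override
  public int hashCode() {
    return field.hashCode();
  }
}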

You'll only need to access the fingerprint field values for documents that
match the query, so it won't be a full table scan unless all the docs match
the query.

Sounds like an interesting project. Please keep us posted.

Joel Bernstein
Search Engineer at Heliosearch


On Sun, Jun 8, 2014 at 6:17 AM, Robert Krüger krue...@lesspain.de wrote:

 Hi,

 let's say I have an index that contains a field of type BinaryField
 called fingerprint that stores a few (let's say 100) bytes that are
 some kind of digital fingerprint-like thing.

 Let's say I want to perform queries on that field to achieve sorting
 or filtering based on a custom distance function customDistance,
 i.e. I input a reference fingerprint and Solr either returns all
 documents sorted by customDistance(referenceFingerprint,documentFingerprint)
 or uses that value in a frange expression for filtering.

 I have read http://wiki.apache.org/solr/SolrPerformanceFactors and I
 understand that using function queries with a custom function is
 definitely expensive, as it results in what the SQL world calls a
 full table scan, i.e. data from all documents needs to be touched to
 select the matching documents or to sort by a function's result.

 Given all that, and provided I have to use a custom function for my
 needs, I would like to know a few more details about Solr's
 architecture to understand what I have to look out for.

 I will have potentially millions of records. When it comes to RAM
 usage, does the data contained in other index fields play a role if I
 only use the fingerprint field for sorting and searching? I am hoping
 that, for queries to be fast, my RAM only needs to accommodate the
 fingerprint data of all documents, not the fingerprint data plus all
 other indexed or stored data.

 Example: My fingerprint data needs 100 bytes per document, and my other
 indexed field data needs 900 bytes per document. Will I need 100 MB or
 1 GB to fit all the data needed to process one query in memory?

 Are there other things to be aware of?

 Thanks,

 Robert