Re: When is too many fields in qf is too many?
Steven, what does being your hero entail, besides a salute? :-)

Approach 1: Tinker with the your-app / Solr relationship. Approach 2: Gauge what's really used and limit the customization. Approach 3: Offer what's wanted (which might be different from what you're trying to achieve).

From your write-up I'm unsure exactly what is being demanded of Solr, but I assume you're after searchability no matter the view / ... / field combination. Every single field can be searchable, and I assume you're looking for a way to provide just that - search on every field must be possible if the user wants it.

Ad 3: with customize-everything apps, folks usually settle quickly into patterns where they have what they want, and they don't use other patterns unless they really need to. Offer relevant search for their favourite patterns and much weaker search for the rest. Your current approach may be over-engineering: you may be answering a problem your product folks posed you while the real need lies elsewhere. So I'm asking you to question the problem, to make sure your work won't be in vain, or not as good for end users as it could be because they actually have a different problem (like: which field is which in my so-and-so customized view nr 57, which changed again this month).

Ad 2: find out which fields are most important for searches and offer only these to Solr. Real usage is usually much less than the capability offered - so if you have a view with 200 fields, I doubt folks even want to query all 200; perhaps only 5 matter. Find a way to know those 5 (via user prefs per view, perhaps, or a default view config) and search only them. This ties nicely with #3, as folks most likely don't even WANT to query all fields. Humans like limits: we don't want too many elements on screen, we like simple UIs, we don't want to type overly long search queries, and we often don't want too many choices.

Ad 1: EXPERIMENT.
First, create a way to manage your configs automatically and keep them in version control; you'll need fast generation and even faster revert / regeneration when something is NOT OK. Set up more than one way to achieve your search-them-all-as-the-user-pleases approach, then test and compare them. Your case is quite unique (3.5k qf fields, anyone? collections changing monthly?) and I don't think you will get good results without experimenting; you need to compare a number of options.

Can you offer Solr servers per user group? Do you have similarities in views across user groups, even informal ones? Say 30% of your user base uses only 20% of all the views you have - then it makes sense to have a dedicated Solr for those 20% of views. You'll need routing here, and rules per user group in your app. How many customizations do you have, and how can you use that? Are there any patterns in customizing views that you can predict / observe / use? This is kind of a synthesis of all the approaches, but at your customization level I don't think one Solr for all cases will be of any use, even if you do manage to get it working by some tinkering with settings.

As I looked at the problem not in terms of Solr settings, this is somewhat off-topic, so if you wish to ask something it might be better off the group, unless others want the thread to continue here out of curiosity how it ends.

pozdrawiam,
LAFK

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Thursday, May 28, 2015 5:59 PM
To: solr-user@lucene.apache.org
Subject: Re: When is too many fields in qf is too many?
RE: When is too many fields in qf is too many?
Before giving up, I might try a copyField per field group and see how that works. Won't that get you down to 10-20 fields per query, and be stable with respect to view changes? But Solr is column oriented, in that the core query logic is a scatter/gather over the qf list. Perhaps there is a reason qf does not support wildcards. Not sure, but it seems likely. That said, having thousands of columns is not weird at all in some applications. You might be better served by a product oriented to that type of usage. Maybe HBase?

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Thursday, May 28, 2015 5:59 PM
To: solr-user@lucene.apache.org
Subject: Re: When is too many fields in qf is too many?

Hi Folks,

First, thanks for taking the time to read and reply on this subject; it is much appreciated. I have yet to come up with a final solution that optimizes Solr. To give you more context, let me give you the big picture of how the application and the database are structured, for which I'm trying to enable Solr search.

Application: Has the concept of views. A view contains one or more object types. An object type may exist in any view. An object type has one or more field groups. A field group has a set of fields. A field group can be used with any object type of any view. Notice how field groups are free standing, so that they can be linked to an object type of any view? Here is a diagram of the above:

FieldGroup-#1 == Field-1, Field-2, Field-5, etc.
FieldGroup-#2 == Field-1, Field-5, Field-6, Field-7, Field-8, etc.
FieldGroup-#3 == Field-2, Field-5, Field-8, etc.

View-#1 == ObjType-#2 (using FieldGroup-#1, #3) + ObjType-#4 (using FieldGroup-#1) + ObjType-#5 (using FieldGroup-#1, #2, #3, etc.)
View-#2 == ObjType-#1 (using FieldGroup-#3, #15, #16, #19, etc.) + ObjType-#4 (using FieldGroup-#1, #4, #19, etc.) + etc.
View-#3 == ObjType-#1 (using FieldGroup-#1, #8) + etc.

Do you see where this is heading?
To make it even a bit more interesting: ObjType-#4 (which is in View-#1 and #2 per the above) uses FieldGroup-#1 in both views, but in one view it can be configured to have its own fields off FieldGroup-#1. With the above setting, a user is assigned a view and can be moved around views, but cannot be in multiple views at the same time. Based on which view that user is in, that user will see different fields of ObjType-#1 (the example I gave for FieldGroup-#1), or may not even see an object type that he was able to see in another view. If I have not lost you with the above, you can see that per view there can be many fields.

To make it yet more interesting, a field in FieldGroup-#1 may have the exact same name as a field in another FieldGroup, and the two could be of different types (one is a date, the other a string). Thus when I build my Solr doc object (and create the list of Solr fields), those fields must be prefixed with the FieldGroup name, otherwise I could end up overwriting the type of another field. Are you still with me? :-)

Now you see how a view can end up with many fields (over 3500 in my case), but a doc I post to Solr for indexing will have on average 50 fields, worst case maybe 200 fields. This is fine and is not my issue, but I want to call it out to get it out of our way.

Another thing I need to mention, in case it is not clear from the above: users create and edit records in the DB as instances of ObjType-#N. Those object type instances do NOT belong to a view; in fact they do NOT have any view concept in them. They simply have the concept of which fields the user can see / edit based on which view that user is in. In effect, in the DB, we have instances of object type data.

One last thing I should point out is that views and field groups are dynamic. This month, View-#3 may have ObjType-#1, but next month it may not, or a new object type may be added to it. Still with me? If so, you are my hero!!
:-) So, I set up my Solr schema.xml to include all fields off each field group that exists in the database, like so:

<field name="FieldGroup-1.Headline" type="text" multiValued="true" indexed="true" stored="false" required="false"/>
<field name="FieldGroup-1.Summary" type="text" multiValued="true" indexed="true" stored="false" required="false"/>
<field name="FieldGroup-1. ... ... ... ... />
<field name="FieldGroup-2.Headline" type="text" multiValued="true" indexed="true" stored="false" required="false"/>
<field name="FieldGroup-2.Summary" type="text" multiValued="true" indexed="true" stored="false" required="false"/>
<field name="FieldGroup-2.Date" type="text" multiValued="true" indexed="true" stored="false" required="false"/>
<field name="FieldGroup-2. ... ... ... ... />
<field name="FieldGroup-3. ... ... ... ... />
<field name="FieldGroup-4. ... ... ... ... />

You get the idea. Each record of an object type I index contains ALL the fields of that object type, REGARDLESS of which view that object type is set
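As a minimal sketch of the copyField-per-field-group idea suggested in the reply above: each group's member fields could feed one aggregate field, so a view's qf lists one field per group instead of every member field. The "all_FieldGroup-N" names are hypothetical, not from Steven's schema:

```xml
<!-- Hypothetical aggregate field per field group; names are illustrative. -->
<field name="all_FieldGroup-1" type="text" multiValued="true" indexed="true" stored="false"/>
<field name="all_FieldGroup-3" type="text" multiValued="true" indexed="true" stored="false"/>

<!-- copyField accepts a trailing-asterisk glob in source, so each group's
     member fields feed its aggregate without listing them one by one. -->
<copyField source="FieldGroup-1.*" dest="all_FieldGroup-1"/>
<copyField source="FieldGroup-3.*" dest="all_FieldGroup-3"/>
```

A view's request handler could then set qf to just the aggregates of the field groups that view uses (e.g. qf=all_FieldGroup-1 all_FieldGroup-3), at the cost of extra index size and the reindex-on-view-change concern raised elsewhere in the thread.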
Re: When is too many fields in qf is too many?
I would reconsider the strategy of mashing so many different record types into one Solr collection. Sure, you get some advantage from denormalizing data, but if the downside cost gets too high, it may not make so much sense. I'd consider a collection per record type, or at least group similar record types, and then query as many collections - in parallel - as needed for a given user. That should also ensure that a query for a given record type is much faster as well. Surely you should be able to examine the query in the app and determine which record types it might apply to. When in doubt, make your schema as clean and simple as possible. Simplicity over complexity.

-- Jack Krupansky

On Thu, May 28, 2015 at 12:06 PM, Erick Erickson erickerick...@gmail.com wrote:

Gotta agree with Jack here. This is an insane number of fields; query performance on any significant corpus will be fraught, etc. The very first thing I'd look at is having that many fields. You have 3,500 different fields! Whatever the motivation for having that many fields is the place I'd start.

Best,
Erick

On Thu, May 28, 2015 at 5:50 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

This does not even pass a basic smell test for reasonability of matching the capabilities of Solr with the needs of your application. I'd like to hear from others, but I personally would be -1 on this approach to misusing qf. I'd simply say that you need to go back to the drawing board, and that your primary focus should be on working with your application product manager to revise your application requirements to more closely match the capabilities of Solr. To put it simply, if you have more than a dozen fields in qf, you're probably doing something wrong. In this case, horribly wrong. Focus on designing your app to exploit the capabilities of Solr, not to misuse them. In short, to answer the original question, more than a couple dozen fields in qf is indeed too many. More than a dozen raises a yellow flag for me.
-- Jack Krupansky
Re: When is too many fields in qf is too many?
to reindex my entire database to reflect a view change, even when the actual data has not changed. 2) My Solr index size will now be larger, as I have to create a pseudo Solr field per view in my database and copyField into it. I have also considered creating multiple cores, one per view, but that still doesn't solve the above two issues of requiring a reindex and increasing the index size.

Now that you see what my backend application is like, let me know if you have any ideas on how you would solve this puzzle. And if you have read this all the way to the end, I salute you!!

Steve
RE: When is too many fields in qf is too many?
Still, it seems like the right direction. Does it smell OK to have a few hundred request handlers? Again, my logic is that if any given view requires no more than 50 fields, one request handler per view would work. This is different from a request handler per user category (which requires access to any number of views and, thus, many more fields). This does require a design change for Steven's application ...

Steven, do you have tables of the many-to-many relationships between fields and views, and between users and views? If so, you should be able to programmatically generate the request handlers. If these relationships change frequently, then some custom plugin will be required to access these tables at query time. See what I mean?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, May 28, 2015 12:07 PM
To: solr-user@lucene.apache.org
Subject: Re: When is too many fields in qf is too many?
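Charles's suggestion of generating one request handler per view from the field/view tables might look like the sketch below. The handler name and qf field list are hypothetical placeholders that a generator would fill in from those tables; the edismax/tie settings follow the ones discussed elsewhere in this thread:

```xml
<!-- Hypothetical generated handler for a single view; the name and the qf
     list would be emitted per view from the view/field-group tables. -->
<requestHandler name="/select-view1" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">FieldGroup-1.Headline FieldGroup-1.Summary FieldGroup-3.Date</str>
    <float name="tie">1.0</float>
    <str name="fl">_UNIQUE_FIELD_,score</str>
  </lst>
</requestHandler>
```

Since views change monthly, regenerating these handlers and reloading the core would need to be part of the automated, version-controlled config workflow LAFK describes at the top of the thread.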
Re: When is too many fields in qf is too many?
Hi Charles,

That is what I have done. At the moment, I have 22 request handlers; some have 3490 field items in qf (that's the most, and that qf line spans over 95,000 characters in the solrconfig.xml file) and the smallest has 1341 fields. I'm working on seeing if I can use copyField to copy the data of each view's fields into a single pseudo-view-field and use that pseudo field in the qf of that view's request handler. The issue I still have outstanding with using copyField this way is that it could lead to a complete reindexing of all the data in a view when a field is added / removed from that view.

Thanks

Steve

On Wed, May 27, 2015 at 6:02 PM, Reitzel, Charles charles.reit...@tiaa-cref.org wrote:

One request handler per view? I think if you are able to make the actual view in use for the current request a single value (vs. all views that the user could use over time), it would keep the qf list down to a manageable size (e.g. specified within the request handler XML). Not sure if this is feasible for you, but it seems like a reasonable approach given the use case you describe. Just a thought ...

-Original Message-
From: Steven White [mailto:swhite4...@gmail.com]
Sent: Tuesday, May 26, 2015 4:48 PM
To: solr-user@lucene.apache.org
Subject: Re: When is too many fields in qf is too many?

Thanks Doug. I might have to take you up on the hangout offer. Let me refine the requirement further, and if I still see the need, I will let you know.

Steve

On Tue, May 26, 2015 at 2:01 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote:

How you have tie is fine. Setting tie to 1 might give you reasonable results. You could easily still have scores that are just always an order of magnitude or two higher, but try it out! BTW anything you put in the URL can also be put into a request handler. If you ever just want to have a 15-minute conversation via hangout, happy to chat with you :) Might be fun to think through your prob together.
-Doug

On Tue, May 26, 2015 at 1:42 PM, Steven White swhite4...@gmail.com wrote:

Hi Doug,

I'm back to this topic. Unfortunately, due to my DB structure and business need, I will not be able to search against a single field (i.e., using copyField). Thus, I have to use a list of fields via qf. Given this, I see you said above to use tie=1.0; will that, more or less, address this scoring issue? Should tie=1.0 be set on the request handler like so:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">20</int>
    <str name="defType">edismax</str>
    <str name="qf">F1 F2 F3 F4 ... ... ...</str>
    <float name="tie">1.0</float>
    <str name="fl">_UNIQUE_FIELD_,score</str>
    <str name="wt">xml</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>

Or must tie be passed as part of the URL?

Thanks

Steve

On Wed, May 20, 2015 at 2:58 PM, Doug Turnbull dturnb...@opensourceconnections.com wrote:

Yeah, a copyField into one could be a good space/time tradeoff. It can be more manageable to use an all field for both relevancy and performance, if you can handle the duplication of data. You could set tie=1.0, which effectively sums all the matches instead of picking the best match. You'll still have cases where one field's score might just happen to be far off of another, and thus dominate the summation. But it's something easy to try if you want to keep playing with dismax.

-Doug

On Wed, May 20, 2015 at 2:56 PM, Steven White swhite4...@gmail.com wrote:

Hi Doug,

Your blog write-up on relevancy is very interesting; I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based fields' data into a single field using copyField.

Thanks

Steve

On Wed, May 20, 2015 at 11:17 AM, Doug Turnbull dturnb...@opensourceconnections.com wrote:

Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner-takes-all point of view to search.
Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here:
http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/

I'm about to win the blasphemer merit badge, but ad-hoc all-field-like searching over many fields is actually a good use case for Elasticsearch's cross-field queries:
https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html
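For reference, the way the tie parameter discussed in this exchange combines per-field scores can be written out - this is the standard disjunction-max formula, not anything specific to Steven's setup:

```latex
% DisMax score of a query term q over the qf fields F:
% the best-matching field wins, the other matching fields
% contribute scaled by tie.
\mathrm{score}(q) \;=\; \max_{f \in F} s_f(q) \;+\; \mathrm{tie} \cdot \sum_{\substack{f \in F \\ f \neq f_{\max}}} s_f(q)
```

With tie=0 only the best field counts (pure winner-takes-all); with tie=1.0 the field scores are effectively summed, which is what Doug describes above.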
Re: When is too many fields in qf is too many?
Gotta agree with Jack here. This is an insane number of fields, query performance on any significant corpus will be fraught etc. The very first thing I'd look at is having that many fields. You have 3,500 different fields! Whatever the motivation for having that many fields is the place I'd start. Best, Erick
RE: When is too many fields in qf is too many?
One request handler per view? I think if you are able to make the actual view in use for the current request a single value (vs. all views that the user could use over time), it would keep the qf list down to a manageable size (e.g. specified within the request handler XML). Not sure if this is feasible for you, but it seems like a reasonable approach given the use case you describe. Just a thought ...
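Charles's per-view handler idea could look roughly like this in solrconfig.xml. This is only a sketch: the handler name and field names below are made up for illustration, not taken from Steven's schema.

```xml
<!-- solrconfig.xml sketch: one handler per view, with that view's qf baked in.
     Handler name and field names are hypothetical. -->
<requestHandler name="/select-viewA" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title_t body_t author_s</str>
    <float name="tie">1.0</float>
  </lst>
</requestHandler>
```

The application would then route each user's query to the handler matching their current view, instead of sending thousands of field names on every request.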
Re: When is too many fields in qf is too many?
How you have tie is fine. Setting tie to 1 might give you reasonable results. You could easily still have scores that are just always an order of magnitude or two higher, but try it out! BTW Anything you put in the URL can also be put into a request handler. If you ever just want to have a 15 minute conversation via hangout, happy to chat with you :) Might be fun to think through your prob together. -Doug
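Doug's tie advice comes down to how Lucene's DisjunctionMaxQuery combines per-field scores: the best field's score plus tie times the rest. A minimal Python sketch of that arithmetic (the function name is mine, not a Solr or Lucene API):

```python
def dismax_score(field_scores, tie):
    """Combine per-field scores the way dismax does: the best field's
    score, plus `tie` times the sum of the remaining field scores."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

# tie=0.0 is pure winner-takes-all; tie=1.0 sums across all fields.
print(dismax_score([3.0, 1.0, 0.5], 0.0))  # 3.0
print(dismax_score([3.0, 1.0, 0.5], 1.0))  # 4.5
```

With tie=1.0 a document matching weakly in many of the thousands of qf fields can still outscore one matching strongly in a single field, which is why Doug suggests trying it rather than promising it fixes the scoring issue.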
Re: When is too many fields in qf is too many?
Thanks Doug. I might have to take you up on the hangout offer. Let me refine the requirement further and if I still see the need, I will let you know. Steve
Re: When is too many fields in qf is too many?
Hi Doug, I'm back to this topic. Unfortunately, due to my DB structure and business needs, I will not be able to search against a single field (i.e., using copyField). Thus, I have to use a list of fields via qf. Given this, I see you said above to use tie=1.0. Will that, more or less, address this scoring issue? Should tie=1.0 be set on the request handler like so:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">20</int>
    <str name="defType">edismax</str>
    <str name="qf">F1 F2 F3 F4 ... ... ...</str>
    <float name="tie">1.0</float>
    <str name="fl">_UNIQUE_FIELD_,score</str>
    <str name="wt">xml</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>

Or must tie be passed as part of the URL? Thanks Steve
Re: When is too many fields in qf is too many?
Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document? Answer: This is a large number of different record types, each with a relatively small number of fields in a particular document. Some documents will have 5 fields, others may have 50 (that's the average). Could you try to point to a real-world example of where your use case might apply, so we can relate to it? I'm indexing data off a DB, and all the fields of each record are indexed. The application is complex such that it has views and users belong to 1 or more views. Users can move between views and views can change over time. A user in view-A can see certain fields, while a user in view-B can see some other fields. So, when a user issues a search, I have to limit which fields that search is executed against. And like I said, because users can move between views, and views can change over time, the list of fields isn't static. This is why I have to pass the list of fields for each search based on the user's current view. I hope this gives context to the problem I'm trying to solve and describes why I'm using qf and why the list of fields may be long, because there is a case in which a user may belong to N - 1 views. Steve On Wed, May 20, 2015 at 11:14 AM, Jack Krupansky jack.krupan...@gmail.com wrote: The uf parameter is used to specify which fields a user may query against - the qf parameter specifies the set of fields that an unfielded query term must be queried against. The user is free to specify fielded query terms, like field1:term1 OR field2:term2. So, which use case are you really talking about? Could you try to point to a real-world example of where your use case might apply, so we can relate to it? Generally, I would say that a Solr document/collection should have no more than low hundreds of fields. It's not that you absolutely can't have more or absolutely can't have 5,000 or more, but simply that you will be asking for trouble, for example, with the cost of comprehending, maintaining, and communicating your solution with others, including this mailing list for support. What specifically pushed you to have documents with 1500 fields? Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document? -- Jack Krupansky On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters. Given the above, besides the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then using copyField into that group. During search, I then can use qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field. I'm using qf with edismax and my Solr version is 5.1. Thanks Steve
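The pseudo-field-per-group idea Steve describes would look something like this in schema.xml. All names here are illustrative only, not taken from his actual schema.

```xml
<!-- schema.xml sketch: one catch-all field per group; names are hypothetical. -->
<field name="group_a_all" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="F1" dest="group_a_all"/>
<copyField source="F2" dest="group_a_all"/>
<!-- Group-A searches then use qf=group_a_all instead of a long field list. -->
```

The drawback he notes still applies: copyField runs at index time, so changing which source fields feed a group requires re-indexing the affected documents.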
Re: When is too many fields in qf is too many?
Thanks Shawn. I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET; it's about Solr and Lucene having to deal with such a long list of fields. Here is the text of my question reposted: Given the above, besides the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc. Steve On Wed, May 20, 2015 at 10:52 AM, Shawn Heisey apa...@elyograg.org wrote: You have two choices when queries become that large. One is to increase the max HTTP header size in the servlet container. In most containers, webservers, and proxy servers, this defaults to 8192 bytes. This is an approach that works very well, but will not scale to extremely large sizes. I have done this on my indexes, because I regularly have queries in the 20K range, but I do not expect them to get very much larger than this. The other option is to switch to sending a POST instead of a GET. The default max POST size that Solr sets is 2MB, which is plenty for just about any query, and can be increased easily to much larger sizes. If you are using SolrJ, switching to POST is very easy ... you'd need to research to figure out how if you're using another framework. Thanks, Shawn
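For reference, the 2MB POST limit Shawn mentions is set in solrconfig.xml. A sketch for Solr of this era; the raised value is illustrative, and the exact attributes should be double-checked against your version's reference guide:

```xml
<!-- solrconfig.xml sketch: raise the form-data POST limit (values in KB).
     2048 KB is the 2MB default; 10240 here is purely illustrative. -->
<requestDispatcher>
  <requestParsers formdataUploadLimitInKB="10240"
                  multipartUploadLimitInKB="2048000"/>
</requestDispatcher>
```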
Re: When is too many fields in qf is too many?
You may need to increase maxBooleanClauses beyond the default of 1024. There will be a message in the log if that is required. Note that such an increase must happen on EVERY config you have, or one of them may set it back to the 1024 default -- it's a global JVM-wide config. Large complex queries are usually slow, requiring more memory and CPU than simple queries, but if you have the resources, Solr will handle it just fine. Thanks, Shawn
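The maxBooleanClauses setting Shawn refers to lives in the <query> section of each core's solrconfig.xml; a sketch, with an illustrative value:

```xml
<query>
  <!-- Raise the Lucene BooleanQuery clause limit above the 1024 default.
       Because it is effectively JVM-wide in this Solr version, set it in
       EVERY core's config, as Shawn warns. 8192 is illustrative. -->
  <maxBooleanClauses>8192</maxBooleanClauses>
</query>
```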
Re: When is too many fields in qf is too many?
Yeah a copyField into one could be a good space/time tradeoff. It can be more manageable to use an all field for both relevancy and performance, if you can handle the duplication of data. You could set tie=1.0, which effectively sums all the matches instead of picking the best match. You'll still have cases where one field's score might just happen to be far off of another, and thus dominating the summation. But something easy to try if you want to keep playing with dismax. -Doug On Wed, May 20, 2015 at 2:56 PM, Steven White swhite4...@gmail.com wrote: Hi Doug, Your blog write up on relevancy is very interesting, I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based-fields data into a single field using copyField. Thanks Steve On Wed, May 20, 2015 at 11:17 AM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner takes all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ I'm about to win the blashphemer merit badge, but ad-hoc all-field like searching over many fields is actually a good use case for Elasticsearch's cross field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/ It wouldn't be hard (and actually a great feature for the project) to get the Lucene query associated with cross field search into Solr. 
You could easily write a plugin to integrate it into a query parser: https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java Hope that helps -Doug

-- Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com | Author: Relevant Search (http://manning.com/turnbull) from Manning Publications. This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.

On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then using copyField into that group. During search, I then can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field.
I'm using qf with edismax and my Solr version is 5.1. Thanks, Steve
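[Editor's note] For reference, the copyField approach discussed above might look like this in the schema, with tie set as a query default on the dismax side. All field, type, and group names here are made up for illustration:

```xml
<!-- schema.xml: funnel one group's fields into a single catch-all
     field. "all_group_a" and the source names are hypothetical. -->
<field name="all_group_a" type="text_general" indexed="true"
       stored="false" multiValued="true"/>
<copyField source="title" dest="all_group_a"/>
<copyField source="body"  dest="all_group_a"/>

<!-- solrconfig.xml (inside a requestHandler): alternatively keep qf
     and soften dismax's winner-takes-all scoring; tie=1.0 sums the
     per-field match scores instead of taking only the best one. -->
<lst name="defaults">
  <str name="defType">edismax</str>
  <str name="tie">1.0</str>
</lst>
```

Note the stored="false": a catch-all field used only for matching does not need to be stored, which limits the space cost of the duplication.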
Re: When is too many fields in qf is too many?
Hi Doug, Your blog write-up on relevancy is very interesting; I didn't know this. Looks like I have to go back to my drawing board and figure out an alternative solution: somehow get those group-based fields' data into a single field using copyField. Thanks, Steve

On Wed, May 20, 2015 at 11:17 AM, Doug Turnbull dturnb...@opensourceconnections.com wrote: Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner-takes-all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here: http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/ I'm about to win the blasphemer merit badge, but ad-hoc all-field-like searching over many fields is actually a good use case for Elasticsearch's cross-field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/ It wouldn't be hard (and actually a great feature for the project) to get the Lucene query associated with cross-field search into Solr. You could easily write a plugin to integrate it into a query parser: https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java Hope that helps -Doug
On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters.

Given the above, beside the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then using copyField into that group. During search, I then can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field.

I'm using qf with edismax and my Solr version is 5.1. Thanks, Steve
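[Editor's note] Steve's handler-per-group alternative would look roughly like this in solrconfig.xml. The handler and field names are hypothetical:

```xml
<!-- solrconfig.xml: one handler per group with its qf list baked in,
     so clients send only a short request. Names are examples. -->
<requestHandler name="/select-groupA" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">fieldA1 fieldA2 fieldA3</str>
  </lst>
</requestHandler>
```

One property worth noting: when a group's field list changes monthly, only this config needs regenerating and the core reloading; unlike the copyField variant, changing qf does not require a re-index.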
Re: When is too many fields in qf is too many?
Thanks for calling out maxBooleanClauses. The current default of 1024 has not caused me any issues (so far) in my testing. However, you probably saw Doug Turnbull's reply; it looks like my relevance will suffer. Steve

On Wed, May 20, 2015 at 11:42 AM, Shawn Heisey apa...@elyograg.org wrote: On 5/20/2015 9:24 AM, Steven White wrote: I have already switched to using POST because I need to send a long list of data in qf. My question isn't about POST / GET, it's about Solr and Lucene having to deal with such a long list of fields. Here is the text of my question reposted: Given the above, beside the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

You may need to increase maxBooleanClauses beyond the default of 1024. There will be a message in the log if that is required. Note that such an increase must happen on EVERY config you have, or one of them may set it back to the 1024 default -- it's a global JVM-wide config. Large complex queries are usually slow, requiring more memory and CPU than simple queries, but if you have the resources, Solr will handle it just fine. Thanks, Shawn
Re: When is too many fields in qf is too many?
The uf parameter is used to specify which fields a user may query against; the qf parameter specifies the set of fields that an unfielded query term must be queried against. The user is free to specify fielded query terms, like field1:term1 OR field2:term2. So, which use case are you really talking about? Could you try to point to a real-world example of where your use case might apply, so we can relate to it?

Generally, I would say that a Solr document/collection should have no more than low hundreds of fields. It's not that you absolutely can't have more or absolutely can't have 5,000 or more, but simply that you will be asking for trouble, for example, with the cost of comprehending, maintaining, and communicating your solution with others, including this mailing list for support. What specifically pushed you to have documents with 1500 fields? Also, is this 1500 fields that are always populated, or are there really a larger number of different record types, each with a relatively small number of fields populated in a particular document? -- Jack Krupansky

On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.
If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then using copyField into that group. During search, I then can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field.

I'm using qf with edismax and my Solr version is 5.1. Thanks, Steve
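[Editor's note] Jack's uf/qf distinction can be enforced server-side in the same per-group-handler style Steven describes. A sketch with hypothetical handler and field names:

```xml
<!-- solrconfig.xml: pin uf as an invariant so fielded query terms
     such as field1:term1 cannot reach fields outside the group's
     allowed set. Handler and field names are examples. -->
<requestHandler name="/select-groupB" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="uf">fieldB1 fieldB2</str>
  </lst>
</requestHandler>
```

Because invariants cannot be overridden by request parameters, this keeps the per-group field restriction out of the client's hands entirely.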
Re: When is too many fields in qf is too many?
Steven, I'd be concerned about your relevance with that many qf fields. Dismax takes a winner-takes-all point of view to search. Field scores can vary by an order of magnitude (or even two) despite the attempts of query normalization. You can read more here: http://opensourceconnections.com/blog/2013/07/02/getting-dissed-by-dismax-why-your-incorrect-assumptions-about-dismax-are-hurting-search-relevancy/

I'm about to win the blasphemer merit badge, but ad-hoc all-field-like searching over many fields is actually a good use case for Elasticsearch's cross-field queries. https://www.elastic.co/guide/en/elasticsearch/guide/master/_cross_fields_queries.html http://opensourceconnections.com/blog/2015/03/19/elasticsearch-cross-field-search-is-a-lie/ It wouldn't be hard (and actually a great feature for the project) to get the Lucene query associated with cross-field search into Solr. You could easily write a plugin to integrate it into a query parser: https://github.com/elastic/elasticsearch/blob/master/src/main/java/org/apache/lucene/queries/BlendedTermQuery.java Hope that helps -Doug

-- Doug Turnbull | Search Relevance Consultant | OpenSource Connections, LLC | 240.476.9983 | http://www.opensourceconnections.com | Author: Relevant Search (http://manning.com/turnbull) from Manning Publications

On Wed, May 20, 2015 at 8:27 AM, Steven White swhite4...@gmail.com wrote: Hi everyone, My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf.
What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

If the network traffic becomes an issue, my alternative solution is to create a /select handler for each group and in that handler list the fields under qf. I have considered creating pseudo-fields for each group and then using copyField into that group. During search, I then can qf against that one field. Unfortunately, this is not ideal for my solution because the fields that go into each group dynamically change (at least once a month) and when they do change, I have to re-index everything (this I have to avoid) to sync that group-field.

I'm using qf with edismax and my Solr version is 5.1. Thanks, Steve
Re: When is too many fields in qf is too many?
On 5/20/2015 6:27 AM, Steven White wrote: My solution requires that users in group-A can only search against a set of fields-A and users in group-B can only search against a set of fields-B, etc. There can be several groups, as many as 100 or even more. To meet this need, I build my search by passing in the list of fields via qf. What goes into qf can be large: as many as 1500 fields, and each field name averages 15 characters long; in effect the data passed via qf will be over 20K characters. Given the above, beside the fact that a search for apple translates to 20K characters passing over the network, what else within Solr and Lucene should I be worried about, if any? Will I hit some kind of a limit? Will each search now require more CPU cycles? Memory? Etc.

You have two choices when queries become that large. One is to increase the max HTTP header size in the servlet container. In most containers, webservers, and proxy servers, this defaults to 8192 bytes. This is an approach that works very well, but will not scale to extremely large sizes. I have done this on my indexes, because I regularly have queries in the 20K range, but I do not expect them to get very much larger than this.

The other option is to switch to sending a POST instead of a GET. The default max POST size that Solr sets is 2MB, which is plenty for just about any query, and can be increased easily to much larger sizes. If you are using SolrJ, switching to POST is very easy ... you'd need to research to figure out how if you're using another framework.

Thanks, Shawn
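[Editor's note] Shawn's SolrJ suggestion amounts to one extra argument at query time. A sketch, assuming SolrJ 5.x on the classpath; the core URL and field names are illustrative, and this has not been run against a live server:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrRequest;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PostQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; adjust to your deployment.
        HttpSolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/mycore");
        SolrQuery q = new SolrQuery("apple");
        q.set("defType", "edismax");
        q.set("qf", "field1 field2 field3"); // imagine 1500 field names here
        // METHOD.POST puts the parameters in the request body, so a 20K
        // qf list never hits the HTTP header-size limit.
        QueryResponse rsp = client.query(q, SolrRequest.METHOD.POST);
        System.out.println(rsp.getResults().getNumFound());
        client.close();
    }
}
```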