Re: edismax parsing confusion
Try declaring your mm as 1 then and see if that assumption is correct. Default 'mm' values are complicated to describe and depend on a variety of factors. Generally if you want it to be a certain value, just declare it. On 5 April 2017 at 02:07, Abhishek Mishra <solrmis...@gmail.com> wrote: > Hello guys > sorry for late response. @steve I am using solr 5.2 . > @greg i am using default mm from config file(According to me it is default > mm is 1). > > Regards, > Abhishek > > On Tue, Apr 4, 2017 at 5:27 AM, Greg Pendlebury <greg.pendleb...@gmail.com > > > wrote: > > > eDismax uses 'mm', so knowing what that has been set to is important, or > if > > it has been left unset/default you would need to consider whether 'q.op' > > has been set. Or the default operator from the config file. > > > > Ta, > > Greg > > > > > > On 3 April 2017 at 23:56, Steve Rowe <sar...@gmail.com> wrote: > > > > > Hi Abhishek, > > > > > > Which version of Solr are you using? > > > > > > I can see that the parsed queries are different, but they’re also very > > > similar, and there’s a lot of detail there - can you be more specific > > about > > > what the problem is? 
> > > > > > -- > > > Steve > > > www.lucidworks.com > > > > > > > On Apr 3, 2017, at 4:54 AM, Abhishek Mishra <solrmis...@gmail.com> > > > wrote: > > > > > > > > Hi all > > > > i am running solr query with these parameter > > > > > > > > bf: "sum(product(new_popularity,100),if(exists(third_price),50,0))" > > > > qf: "test_product^5 category_path_tf^4 product_id gender" > > > > q: "handbags between rs150 and rs 400" > > > > defType: "edismax" > > > > > > > > parsed query is like below one > > > > > > > > for q:- > > > > (+(DisjunctionMaxQuery((category_path_tf:handbags^4.0 | > > gender:handbag | > > > > test_product:handbag^5.0 | product_id:handbags)) > > > > DisjunctionMaxQuery((category_path_tf:between^4.0 | gender:between | > > > > test_product:between^5.0 | product_id:between)) > > > > +DisjunctionMaxQuery((category_path_tf:rs150^4.0 | gender:rs150 | > > > > test_product:rs150^5.0 | product_id:rs150)) > > > > +DisjunctionMaxQuery((category_path_tf:rs^4.0 | gender:rs | > > > > test_product:rs^5.0 | product_id:rs)) > > > > DisjunctionMaxQuery((category_path_tf:400^4.0 | gender:400 | > > > > test_product:400^5.0 | product_id:400))) DisjunctionMaxQuery(("":" > > > handbags > > > > between rs150 ? rs 400")) (DisjunctionMaxQuery(("":"handbags > > between")) > > > > DisjunctionMaxQuery(("":"between rs150")) > DisjunctionMaxQuery(("":"rs > > > > 400"))) (DisjunctionMaxQuery(("":"handbags between rs150")) > > > > DisjunctionMaxQuery(("":"between rs150")) > > > DisjunctionMaxQuery(("":"rs150 ? > > > > rs")) DisjunctionMaxQuery(("":"? 
rs 400"))) > > > > FunctionQuery(sum(product(float(new_popularity),const( > > > 100)),if(exists(float(third_price)),const(50),const(0)/no_coord > > > > > > > > but for dismax parser it is working perfect: > > > > > > > > (+(DisjunctionMaxQuery((category_path_tf:handbags^4.0 | > > gender:handbag | > > > > test_product:handbag^5.0 | product_id:handbags)) > > > > DisjunctionMaxQuery((category_path_tf:between^4.0 | gender:between | > > > > test_product:between^5.0 | product_id:between)) > > > > DisjunctionMaxQuery((category_path_tf:rs150^4.0 | gender:rs150 | > > > > test_product:rs150^5.0 | product_id:rs150)) > > > > DisjunctionMaxQuery((product_id:and)) > > > > DisjunctionMaxQuery((category_path_tf:rs^4.0 | gender:rs | > > > > test_product:rs^5.0 | product_id:rs)) > > > > DisjunctionMaxQuery((category_path_tf:400^4.0 | gender:400 | > > > > test_product:400^5.0 | product_id:400))) DisjunctionMaxQuery(("":" > > > handbags > > > > between rs150 ? rs 400")) > > > > FunctionQuery(sum(product(float(new_popularity),const( > > > 100)),if(exists(float(third_price)),const(50),const(0)/no_coord > > > > > > > > > > > > *according to me difference between dismax and edismax is based on > some > > > > extra features plus working of boosting fucntions.* > > > > > > > > > > > > > > > > Regards, > > > > Abhishek > > > > > > > > >
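Greg's advice at the top of this thread is to declare 'mm' explicitly rather than rely on a derived default. A minimal sketch of what that looks like at request time (Python; the host, core name and values are illustrative, while defType, q, qf, bf and mm are standard Solr request parameters):

```python
from urllib.parse import urlencode

# Illustrative values taken from Abhishek's mail; only 'mm' is new.
# Declaring mm explicitly avoids depending on q.op or the schema's
# defaultOperator to pick a default minimum-should-match for you.
params = {
    "defType": "edismax",
    "q": "handbags between rs150 and rs 400",
    "qf": "test_product^5 category_path_tf^4 product_id gender",
    "bf": "sum(product(new_popularity,100),if(exists(third_price),50,0))",
    "mm": "1",  # at least one SHOULD clause must match
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/mycore/select?" + query_string
```

With mm pinned to 1, the '+' (MUST) flags visible in the edismax parse above should no longer be forced by a default derived from other settings.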
Re: edismax parsing confusion
eDismax uses 'mm', so knowing what that has been set to is important, or if it has been left unset/default you would need to consider whether 'q.op' has been set. Or the default operator from the config file. Ta, Greg On 3 April 2017 at 23:56, Steve Rowe wrote: > Hi Abhishek, > > Which version of Solr are you using? > > I can see that the parsed queries are different, but they’re also very > similar, and there’s a lot of detail there - can you be more specific about > what the problem is? > > -- > Steve > www.lucidworks.com > > > On Apr 3, 2017, at 4:54 AM, Abhishek Mishra > wrote: > > > > Hi all > > i am running solr query with these parameter > > > > bf: "sum(product(new_popularity,100),if(exists(third_price),50,0))" > > qf: "test_product^5 category_path_tf^4 product_id gender" > > q: "handbags between rs150 and rs 400" > > defType: "edismax" > > > > parsed query is like below one > > > > for q:- > > (+(DisjunctionMaxQuery((category_path_tf:handbags^4.0 | gender:handbag | > > test_product:handbag^5.0 | product_id:handbags)) > > DisjunctionMaxQuery((category_path_tf:between^4.0 | gender:between | > > test_product:between^5.0 | product_id:between)) > > +DisjunctionMaxQuery((category_path_tf:rs150^4.0 | gender:rs150 | > > test_product:rs150^5.0 | product_id:rs150)) > > +DisjunctionMaxQuery((category_path_tf:rs^4.0 | gender:rs | > > test_product:rs^5.0 | product_id:rs)) > > DisjunctionMaxQuery((category_path_tf:400^4.0 | gender:400 | > > test_product:400^5.0 | product_id:400))) DisjunctionMaxQuery(("":" > handbags > > between rs150 ? rs 400")) (DisjunctionMaxQuery(("":"handbags between")) > > DisjunctionMaxQuery(("":"between rs150")) DisjunctionMaxQuery(("":"rs > > 400"))) (DisjunctionMaxQuery(("":"handbags between rs150")) > > DisjunctionMaxQuery(("":"between rs150")) > DisjunctionMaxQuery(("":"rs150 ? > > rs")) DisjunctionMaxQuery(("":"? 
rs 400"))) > > FunctionQuery(sum(product(float(new_popularity),const( > 100)),if(exists(float(third_price)),const(50),const(0)/no_coord > > > > but for dismax parser it is working perfect: > > > > (+(DisjunctionMaxQuery((category_path_tf:handbags^4.0 | gender:handbag | > > test_product:handbag^5.0 | product_id:handbags)) > > DisjunctionMaxQuery((category_path_tf:between^4.0 | gender:between | > > test_product:between^5.0 | product_id:between)) > > DisjunctionMaxQuery((category_path_tf:rs150^4.0 | gender:rs150 | > > test_product:rs150^5.0 | product_id:rs150)) > > DisjunctionMaxQuery((product_id:and)) > > DisjunctionMaxQuery((category_path_tf:rs^4.0 | gender:rs | > > test_product:rs^5.0 | product_id:rs)) > > DisjunctionMaxQuery((category_path_tf:400^4.0 | gender:400 | > > test_product:400^5.0 | product_id:400))) DisjunctionMaxQuery(("":" > handbags > > between rs150 ? rs 400")) > > FunctionQuery(sum(product(float(new_popularity),const( > 100)),if(exists(float(third_price)),const(50),const(0)/no_coord > > > > > > *according to me difference between dismax and edismax is based on some > > extra features plus working of boosting fucntions.* > > > > > > > > Regards, > > Abhishek > >
Re: Edismax query parsing in Solr 4 vs Solr 6
This has come up a lot on the lists lately. Keep in mind that edismax parses your query using additional parameters such as 'mm' and 'q.op'. It is the handling of these parameters (and the selection of default values) which has changed between versions to address a few functionality gaps. The most common issue I've seen is where users were not setting those values and relying on the defaults. You might now need to set them explicitly to return to the desired behaviour. I can't see all of your configuration, but I'm guessing the important one here is 'q.op', which was previously hard coded to 'OR', irrespective of request parameters or solrconfig. Try setting that to 'OR' explicitly... maybe you have your default operator set to 'AND' in solrconfig and that is now being applied? The other option is 'mm', which I suspect should be set to '0' unless you have some reason to want it. If it was set to '100%' it might insert the additional '+' flags, but it can also show up as a '~' operator on the end. Ta, Greg On 8 November 2016 at 22:13, Max Bridgewater wrote: > I am migrating a solr based app from Solr 4 to Solr 6. One of the > discrepancies I am noticing is around edismax query parsing. My code makes > the following call: > > > userQuery="+(title:shirts isbn:shirts) +(id:20446 id:82876)" > Query query=QParser.getParser(userQuery, "edismax", req).getQuery(); > > > With Solr 4, query becomes: > > +(+(title:shirt isbn:shirts) +(id:20446 id:82876)) > > With Solr 6 it however becomes: > > +(+(+title:shirt +isbn:shirts) +(+id:20446 +id:82876)) > > Digging deeper, it appears that parseOriginalQuery() in > ExtendedDismaxQParser is adding those additional + signs. > > > Is there a way to prevent this altering of queries? > > Thanks, > Max. >
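The advice above (pin q.op to 'OR' and mm to '0') can be written out as explicit request parameters. A hedged sketch using Max's query; this illustrates the parameters involved, not a guaranteed drop-in fix:

```python
from urllib.parse import urlencode

# In Solr 4, edismax effectively hard-coded q.op to OR; setting both
# q.op and mm explicitly removes any reliance on defaults that changed
# between versions. The query string is Max's example from this thread.
params = {
    "defType": "edismax",
    "q": "+(title:shirts isbn:shirts) +(id:20446 id:82876)",
    "q.op": "OR",
    "mm": "0",  # don't let minimum-should-match insert MUST (+) flags
}
legacy_style = urlencode(params)
```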
Re: changed query parsing between 4.10.4 and 5.5.3?
Hi Bernd, I was referring to assessing 5.5's behaviour based on a comparison to 4.10 when giving it the same inputs and configuration. Maybe I am wrong, and I apologise if so. I am only seeing fragments of the situation each time, so it is hard to be sure. Certainly it looks like a case of 'mm' set to 100% in this example, but I am basing that off previous emails about your config. Since you seem to be comfortable moving the code around, might I suggest you try looking in the TestExtendedDismaxParser class? It is a nice, portable way of demonstrating the behaviour you believe is wrong. You can put some fake documents at the top in the index() method, then add a new test method (copy one of the existing ones like testDefaultOperatorWithMm() ) to show the config that is behaving strangely with a query. If there is something strange going on we should be able to get to the bottom of it with some reproduction steps. Ta, Greg On 15 September 2016 at 16:28, Bernd Fehling <bernd.fehl...@uni-bielefeld.de > wrote: > Your statement "using the old behaviour as a baseline for checking the > correctness of 5.5 behaviour" might be a point of view. > > Let me give an example, my query: > q=(text:(star AND trek AND wars)^200 OR text:("star trek wars")^350) > results to 159 hits from 99 million records in the index (version 4.10.4). > I checked all 159 hits, they are correct. > > The same query to the same indexed content build with 5.5.3 and also > having 99 million records results in 0 (zero) hits. > > What do you think about this result? > > By the way, after copying ExtendedDismaxQParser from 4.10.4 to 5.5.3 I get > now 137 hits. I really don't care about the difference, but at least > I get some hits out of 99 million records and they are correct. > > Regards, > Bernd > > > Am 15.09.2016 um 01:41 schrieb Greg Pendlebury: > > I'm sorry that's been your experience Bernd. If you do manage to find > some > > time it would be good to see some details on these bugs. 
It looks at the > > moment as though this is a matter of perception when using the old > > behaviour as a baseline for checking the correctness of 5.5 behaviour. > > > > Ta, > > Greg > > > > > > On 15 September 2016 at 01:27, Erick Erickson <erickerick...@gmail.com> > > wrote: > > > >> Perhaps https://issues.apache.org/jira/browse/SOLR-8812 and related? > >> > >> Best, > >> Erick > >> > >> On Tue, Sep 13, 2016 at 11:37 PM, Bernd Fehling > >> <bernd.fehl...@uni-bielefeld.de> wrote: > >>> Hi Greg, > >>> > >>> after trying several hours with all combinations of parameters and not > >>> getting any useful search result with complex search terms and edismax > >>> I finally copied o.a.s.s.ExtendedDismaxQParser.java from version > 4.10.4 > >>> to 5.5.3 and did a little modification in o.a.s.u.SolrPluginUtils.java. > >>> > >>> Now it is searching correct and getting logical and valid search > results > >>> with any kind of complex search. > >>> Problem solved. > >>> > >>> But still, the edismax, at least of 5.5.3, has some bugs. > >>> If I get time I will look into this but right now my problem is solved > >>> and the customers and users are happy. > >>> > >>> I hope that this buggy edismax version is not used in solr 6.x > otherwise > >> you > >>> have the same problems there. > >>> > >>> Regards > >>> Bernd > >>> > >>> > >>> Am 12.09.2016 um 05:10 schrieb Greg Pendlebury: > >>>> Hi Bernd, > >>>> > >>>> "From my point of view the old parsing behavior was correct. > >>>> If searching for a term without operator it is always OR, otherwise > >>>> you can add "+" or "-" to modify that. Now with q.op AND it is > >>>> modified to "+" as a MUST." > >>>> > >>>> It is correct in both cases. q.op dictates (for that query) what > default > >>>> operator to use when none is provided, and it is used as a priority > over > >>>> the system whole 'defaultOperator'. In either case, if you ask it to > use > >>>> OR, it uses it; if you ask it to use AND, it uses it. 
The behaviour > from > >>>> 4.10 that was changed (arguably fixed, although I know that is a > >> debatable > >>>> point) was that you asked it to use AND, and it ignored you > >> (irrespective > >>
Re: changed query parsing between 4.10.4 and 5.5.3?
I'm sorry that's been your experience Bernd. If you do manage to find some time it would be good to see some details on these bugs. It looks at the moment as though this is a matter of perception when using the old behaviour as a baseline for checking the correctness of 5.5 behaviour. Ta, Greg On 15 September 2016 at 01:27, Erick Erickson <erickerick...@gmail.com> wrote: > Perhaps https://issues.apache.org/jira/browse/SOLR-8812 and related? > > Best, > Erick > > On Tue, Sep 13, 2016 at 11:37 PM, Bernd Fehling > <bernd.fehl...@uni-bielefeld.de> wrote: > > Hi Greg, > > > > after trying several hours with all combinations of parameters and not > > getting any useful search result with complex search terms and edismax > > I finally copied o.a.s.s.ExtendedDismaxQParser.java from version 4.10.4 > > to 5.5.3 and did a little modification in o.a.s.u.SolrPluginUtils.java. > > > > Now it is searching correct and getting logical and valid search results > > with any kind of complex search. > > Problem solved. > > > > But still, the edismax, at least of 5.5.3, has some bugs. > > If I get time I will look into this but right now my problem is solved > > and the customers and users are happy. > > > > I hope that this buggy edismax version is not used in solr 6.x otherwise > you > > have the same problems there. > > > > Regards > > Bernd > > > > > > Am 12.09.2016 um 05:10 schrieb Greg Pendlebury: > >> Hi Bernd, > >> > >> "From my point of view the old parsing behavior was correct. > >> If searching for a term without operator it is always OR, otherwise > >> you can add "+" or "-" to modify that. Now with q.op AND it is > >> modified to "+" as a MUST." > >> > >> It is correct in both cases. q.op dictates (for that query) what default > >> operator to use when none is provided, and it is used as a priority over > >> the system whole 'defaultOperator'. In either case, if you ask it to use > >> OR, it uses it; if you ask it to use AND, it uses it. 
The behaviour from > >> 4.10 that was changed (arguably fixed, although I know that is a > debatable > >> point) was that you asked it to use AND, and it ignored you > (irrespective > >> of whether you used defaultOperator or q.op). The are a few subtle > >> distinctions that are being missed (like the difference between the > boolean > >> operators and the OCCURS flags that your are talking about), but they > are > >> not going to change the outcome. > >> > >> 8812 related to users who had been historically setting the q.op > parameter > >> to influence the downstream default selection of 'mm' (If you don't > provide > >> 'mm' it is set for you based on 'q.op') instead of directly setting the > >> 'mm' value themselves. But again in this case, you're setting 'mm' > anyway, > >> so it shouldn't be relevant. > >> > >> Ta, > >> Greg > >> > >> On 9 September 2016 at 16:44, Bernd Fehling < > bernd.fehl...@uni-bielefeld.de> > >> wrote: > >> > >>> Hi Greg, > >>> > >>> thanks a lot, thats it. > >>> After setting q.op to OR it works _nearly_ as before with 4.10.4. > >>> > >>> But how stupid this? > >>> I have in my schema > >>> and also had q.op to AND to make sure my default _is_ AND, > >>> meant as conjunction between terms. > >>> But now I have q.op to OR and defaultOperator in schema to AND > >>> to just get _nearly_ my old behavior back. > >>> > >>> schema has following comment: > >>> "... The default is OR, which is generally assumed so it is > >>> not a good idea to change it globally here. The "q.op" request > >>> parameter takes precedence over this. ..." > >>> > >>> What I don't understand is why they change some major internals > >>> and don't give any notice about how to keep old parsing behavior. > >>> > >>> From my point of view the old parsing behavior was correct. > >>> If searching for a term without operator it is always OR, otherwise > >>> you can add "+" or "-" to modify that. Now with q.op AND it is > >>> modified to "+" as a MUST. 
> >>> > >>> I still get some differences in search results between 4.10.4 and > 5.5.3. > >>> What other side effects has this change of q.op from AND to
Re: changed query parsing between 4.10.4 and 5.5.3?
I'm not certain what is going on with your boost. It doesn't seem related to those tickets as far as I can see, but I note it comes back in the 'parsedquery_toString' step below that. Perhaps the debug output has a display bug? The fact that 4.10 was not applying 'mm' in this context relates to the other part of 2649. Because you provided an explicit OR operator inside this particular search string the 'mm' parameter was ignored. This confusing(?) behaviour was the primary reason 2649 was originally opened. Under 5.5 it was applied, so you get the '~2' operator. Because you explicitly set the 'mm' parameter to 100% it required both of your 'should' OCCUR terms to be present. Are you setting mm to 100% consciously because you want every term to always apply, or was it just a leftover setting? I can see that if you were relying on this behaviour it might appear disruptive, but what I would hope is that you can see that 5.5 did everything you asked it to, following clear and consistent rules for your parameters to influence the output. But 4.10 was following some internal, rarely/poorly documented behaviours that people had just learned to live with. Some parameters did nothing, other parameters influenced yet more parameters in confusing ways. Those old behaviours had various pitfalls that created use cases edismax could not support so it got cleaned up. If you want edismax to behave (mostly) the old way, set q.op to 'OR' and 'mm' to whatever you would like. I say 'mostly' because 'mm' will now be paid attention to if you add your own operators. But if you really, really want it to ignore that you can always wrap your search in parentheses to group all the terms into a single clause. 'mm' only applies to top level clauses and always has. If you want to use edismax for simpler boolean search logic, set 'q.op' to whatever you would like and 'mm' to something like 0 or 1 so that it doesn't screw with your boolean ORs. 
Ta, Greg On 9 September 2016 at 20:00, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote: > After some more testing it feels like the parsing in 5.5.3 is _really_ > messed up. > > Query version 4.10.4: > > > (text:(star AND trek AND wars)^200 OR text:("star trek wars")^350) > > > (text:(star AND trek AND wars)^200 OR text:("star trek wars")^350) > > > (+(((+text:star +text:trek +text:war)^200.0) PhraseQuery(text:"star trek > war"^350.0)))/no_coord > > > +(((+text:star +text:trek +text:war)^200.0) text:"star trek war"^350.0) > > > > Same query version 5.5.3: > > > (text:(star AND trek AND wars)^200 OR text:("star trek wars")^350) > > > (text:(star AND trek AND wars)^200 OR text:("star trek wars")^350) > > > (+((+text:star +text:trek +text:war^200.0 PhraseQuery(text:"star trek > war"))~2))/no_coord > > > +(((+text:star +text:trek +text:war)^200.0 text:"star trek war"^350.0)~2) > > > As you can see version 5.5.3 "parsedquery" is different to version 4.10.4. > > And why is parsedquery different to parsedquery_toString in version 5.5.3? > > Where is my second boost in "parsedquery" of 5.5.3? > > > Bernd > > > > Am 09.09.2016 um 08:44 schrieb Bernd Fehling: > > Hi Greg, > > > > thanks a lot, thats it. > > After setting q.op to OR it works _nearly_ as before with 4.10.4. > > > > But how stupid this? > > I have in my schema > > and also had q.op to AND to make sure my default _is_ AND, > > meant as conjunction between terms. > > But now I have q.op to OR and defaultOperator in schema to AND > > to just get _nearly_ my old behavior back. > > > > schema has following comment: > > "... The default is OR, which is generally assumed so it is > > not a good idea to change it globally here. The "q.op" request > > parameter takes precedence over this. ..." > > > > What I don't understand is why they change some major internals > > and don't give any notice about how to keep old parsing behavior. > > > > From my point of view the old parsing behavior was correct. 
> > If searching for a term without operator it is always OR, otherwise > > you can add "+" or "-" to modify that. Now with q.op AND it is > > modified to "+" as a MUST. > > > > I still get some differences in search results between 4.10.4 and 5.5.3. > > What other side effects has this change of q.op from AND to OR in > > other parts of query handling, parsing and searching? > > > > Regards > > Bernd > > > > Am 09.09.2016 um 05:43 schrieb Greg Pendlebury: > >> I forgot to mention the tickets
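Greg's last two points in this message, that 'mm' counts only top-level SHOULD clauses and that wrapping the whole search in parentheses collapses it into a single clause, can be sketched roughly as follows. This is a simplification (the real minimum-should-match spec supports more forms than plain integers and percentages):

```python
def required_should_matches(num_top_level_should, mm):
    # Simplified sketch of minimum-should-match: an integer asks for that
    # many top-level SHOULD clauses, 'N%' asks for a proportion of them.
    if isinstance(mm, str) and mm.endswith("%"):
        return num_top_level_should * int(mm[:-1]) // 100
    return min(int(mm), num_top_level_should)

# Five loose terms with mm=100% means all five must match...
full = required_should_matches(5, "100%")
# ...but '(t1 t2 t3 t4 t5)' is ONE top-level clause, so the same
# mm=100% now only requires that single grouped clause to match.
grouped = required_should_matches(1, "100%")
```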
Re: changed query parsing between 4.10.4 and 5.5.3?
Hi Bernd, "From my point of view the old parsing behavior was correct. If searching for a term without operator it is always OR, otherwise you can add "+" or "-" to modify that. Now with q.op AND it is modified to "+" as a MUST." It is correct in both cases. q.op dictates (for that query) what default operator to use when none is provided, and it is used as a priority over the system-wide 'defaultOperator'. In either case, if you ask it to use OR, it uses it; if you ask it to use AND, it uses it. The behaviour from 4.10 that was changed (arguably fixed, although I know that is a debatable point) was that you asked it to use AND, and it ignored you (irrespective of whether you used defaultOperator or q.op). There are a few subtle distinctions that are being missed (like the difference between the boolean operators and the OCCURS flags that you are talking about), but they are not going to change the outcome. 8812 related to users who had been historically setting the q.op parameter to influence the downstream default selection of 'mm' (If you don't provide 'mm' it is set for you based on 'q.op') instead of directly setting the 'mm' value themselves. But again in this case, you're setting 'mm' anyway, so it shouldn't be relevant. Ta, Greg On 9 September 2016 at 16:44, Bernd Fehling <bernd.fehl...@uni-bielefeld.de> wrote: > Hi Greg, > > thanks a lot, thats it. > After setting q.op to OR it works _nearly_ as before with 4.10.4. > > But how stupid this? > I have in my schema > and also had q.op to AND to make sure my default _is_ AND, > meant as conjunction between terms. > But now I have q.op to OR and defaultOperator in schema to AND > to just get _nearly_ my old behavior back. > > schema has following comment: > "... The default is OR, which is generally assumed so it is > not a good idea to change it globally here. The "q.op" request > parameter takes precedence over this. ..." 
> > What I don't understand is why they change some major internals > and don't give any notice about how to keep old parsing behavior. > > From my point of view the old parsing behavior was correct. > If searching for a term without operator it is always OR, otherwise > you can add "+" or "-" to modify that. Now with q.op AND it is > modified to "+" as a MUST. > > I still get some differences in search results between 4.10.4 and 5.5.3. > What other side effects has this change of q.op from AND to OR in > other parts of query handling, parsing and searching? > > Regards > Bernd > > Am 09.09.2016 um 05:43 schrieb Greg Pendlebury: > > I forgot to mention the tickets: > > SOLR-2649 and SOLR-8812 > > > > On 9 September 2016 at 13:38, Greg Pendlebury <greg.pendleb...@gmail.com > > > > wrote: > > > >> Under 4.10 q.op was ignored by the edismax parser and always forced to > OR. > >> 5.5 is looking at the q.op=AND you requested. > >> > >> There are also some changes to the default values selected for mm, but I > >> doubt those apply here since you are setting it explicitly. > >> > >> On 8 September 2016 at 00:35, Mikhail Khludnev <m...@apache.org> wrote: > >> > >>> I suppose > >>>+((text:star text:trek)~2) > >>> and > >>> +(+text:star +text:trek) > >>> are equal. mm=2 is equal to +foo +bar > >>> > >>> On Wed, Sep 7, 2016 at 10:52 AM, Bernd Fehling < > >>> bernd.fehl...@uni-bielefeld.de> wrote: > >>> > >>>> Hi list, > >>>> > >>>> while going from SOLR 4.10.4 to 5.5.3 I noticed a change in query > >>> parsing. > >>>> 4.10.4 > >>>> text:star text:trek > >>>> text:star text:trek > >>>> (+((text:star text:trek)~2))/no_coord > >>>> +((text:star text:trek)~2) > >>>> > >>>> 5.5.3 > >>>> text:star text:trek > >>>> text:star text:trek > >>>> (+(+text:star +text:trek))/no_coord > >>>> +(+text:star +text:trek) > >>>> > >>>> There are very many new features and changes between this two > versions. > >>>> It looks like a change in query parsing. 
> >>>> Can someone point me to the solr or lucene jira about the changes? > >>>> Or even give a hint how to get my "old" query parsing back? > >>>> > >>>> Regards > >>>> Bernd > >>>> > >>> > >>> > >>> > >>> -- > >>> Sincerely yours > >>> Mikhail Khludnev > >>> >
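Mikhail's note in this thread, that '+((text:star text:trek)~2)' and '+(+text:star +text:trek)' are equal, can be checked by brute force over every possible document (a small Python sketch treating documents as term sets):

```python
from itertools import combinations

def matches_mm(doc, should_terms, mm):
    # ((a b)~mm): at least mm of the SHOULD terms must be present.
    return sum(t in doc for t in should_terms) >= mm

def matches_must(doc, must_terms):
    # (+a +b): every MUST term must be present.
    return all(t in doc for t in must_terms)

terms = ("star", "trek")
all_docs = [set(c) for n in range(len(terms) + 1)
            for c in combinations(terms, n)]
# With two SHOULD clauses, mm=2 accepts exactly the same documents as
# two MUST flags, which is Mikhail's "mm=2 is equal to +foo +bar".
equivalent = all(matches_mm(d, terms, 2) == matches_must(d, terms)
                 for d in all_docs)
```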
Re: changed query parsing between 4.10.4 and 5.5.3?
I forgot to mention the tickets: SOLR-2649 and SOLR-8812 On 9 September 2016 at 13:38, Greg Pendlebury <greg.pendleb...@gmail.com> wrote: > Under 4.10 q.op was ignored by the edismax parser and always forced to OR. > 5.5 is looking at the q.op=AND you requested. > > There are also some changes to the default values selected for mm, but I > doubt those apply here since you are setting it explicitly. > > On 8 September 2016 at 00:35, Mikhail Khludnev <m...@apache.org> wrote: > >> I suppose >>+((text:star text:trek)~2) >> and >> +(+text:star +text:trek) >> are equal. mm=2 is equal to +foo +bar >> >> On Wed, Sep 7, 2016 at 10:52 AM, Bernd Fehling < >> bernd.fehl...@uni-bielefeld.de> wrote: >> >> > Hi list, >> > >> > while going from SOLR 4.10.4 to 5.5.3 I noticed a change in query >> parsing. >> > 4.10.4 >> > text:star text:trek >> > text:star text:trek >> > (+((text:star text:trek)~2))/no_coord >> > +((text:star text:trek)~2) >> > >> > 5.5.3 >> > text:star text:trek >> > text:star text:trek >> > (+(+text:star +text:trek))/no_coord >> > +(+text:star +text:trek) >> > >> > There are very many new features and changes between this two versions. >> > It looks like a change in query parsing. >> > Can someone point me to the solr or lucene jira about the changes? >> > Or even give a hint how to get my "old" query parsing back? >> > >> > Regards >> > Bernd >> > >> >> >> >> -- >> Sincerely yours >> Mikhail Khludnev >> > >
Re: changed query parsing between 4.10.4 and 5.5.3?
Under 4.10 q.op was ignored by the edismax parser and always forced to OR. 5.5 is looking at the q.op=AND you requested. There are also some changes to the default values selected for mm, but I doubt those apply here since you are setting it explicitly. On 8 September 2016 at 00:35, Mikhail Khludnev wrote: > I suppose >+((text:star text:trek)~2) > and > +(+text:star +text:trek) > are equal. mm=2 is equal to +foo +bar > > On Wed, Sep 7, 2016 at 10:52 AM, Bernd Fehling < > bernd.fehl...@uni-bielefeld.de> wrote: > > > Hi list, > > > > while going from SOLR 4.10.4 to 5.5.3 I noticed a change in query > parsing. > > 4.10.4 > > text:star text:trek > > text:star text:trek > > (+((text:star text:trek)~2))/no_coord > > +((text:star text:trek)~2) > > > > 5.5.3 > > text:star text:trek > > text:star text:trek > > (+(+text:star +text:trek))/no_coord > > +(+text:star +text:trek) > > > > There are very many new features and changes between this two versions. > > It looks like a change in query parsing. > > Can someone point me to the solr or lucene jira about the changes? > > Or even give a hint how to get my "old" query parsing back? > > > > Regards > > Bernd > > > > > > -- > Sincerely yours > Mikhail Khludnev >
Re: After Solr 5.5, mm parameter doesn't work properly
I think the confusion stems from the legacy implementation partially conflating q.op with mm for users, when they are very different things. q.op tells Solr how to insert boolean operators before they are converted into occurs flags, and then downstream, mm applies on _only_ the SHOULD occurs flags, not MUST or NOT flags. So if the user is setting mm=2, they are asking for a minimum of 2 of the SHOULD clauses to be found, not 2 of ALL clauses. mm has absolutely nothing to do with q.op other than (because of the implementation) q.op is used to derive a default value when it is not explicitly set. The legacy implementation had situations where it was not possible to generate the search you wanted because of the conflation, hence why SOLR-2649 was so popular. I fully acknowledge that there are cases where the change is disrupting users that (for whatever reason) are/were not necessarily aware of what the parameters they are using actually do, or users that were very aware, but forced to rely on non-intuitive settings to work around the behaviour eDismax had. SOLR-8812 (although not relevant to the OP) goes part way towards helping the former users, but the latter will want to adjust their parameters to be explicit now instead of leveraging a workaround. I haven't yet seen a use case where the final solution we put in for SOLR-2649 does not work, but I have seen lots of user parameters used that Solr handles perfectly... just in a way that the user did not expect. I suspect this is mainly because the topic and the implementation are fairly technically dense (from q.op, then to boolean to occurs conversion, then finally to mm) and difficult to explain and document accurately for an end user. I am writing this in a rush, sorry, as I need to go collect a child from school. Ta, Greg On 2 June 2016 at 19:08, Jan Høydahl <jan@cominvent.com> wrote: > [Aside] Your quote style is confusing, leaving my lines unquoted and your > new lines quoted?? 
[/Aside] > > > So in relation to the OP's sample queries I was pointing out that > 'q.op=OR > > + mm=2' and 'q,op=AND + mm=2' are treated as identical queries by Solr > 5.4, > > but 5.5+ will manipulate the occurs flags differently before it applies > mm > > afterwards... because that is what q.op does. > > If a user explicitly says mm=2, then the users intent is that he should > neither have pure OR (no clauses required) nor pure AND (all clauses > required), > but exactly two clauses required. > > So I think we need to go back to a solution where q.op technically > stays as OR for custom mm. How that would affect queries with explicit > operators > I don’t know... > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > > > 2. jun. 2016 kl. 05.12 skrev Greg Pendlebury <greg.pendleb...@gmail.com > >: > > > > I would describe that subtly differently, and I think it is where the > > difference lies: > > > > "Then from 4.x it did not care about q.op if mm was set explicitly" > >>> I agree. q.op was not actually used in the query, but rather as a way > of > > inferred the default mm value. eDismax still ignored whatever q.op was > set > > and built your query operators (ie. the occurs flags) using q.op=OR. > > > > "And from 5.5 it seems as q.op does something even if mm is set..." > >>> Yes, although I think it is the words 'even if' drawing too strong a > > relationship between the two parameters. q.op has a function of its own, > > and that now functions as it 'should' (opinionated, I know) in the query > > construction, and continues to influence the default value of mm if it > has > > not been explicitly set. SOLR-8812 further evolves that influence by > trying > > to improve backwards compatibility for users who were not explicitly > > setting mm, and only ever changed 'q.op' despite it being a step removed > > from the actual parameter they were trying to manipulate. 
> > > > So in relation to the OP's sample queries I was pointing out that > 'q.op=OR > > + mm=2' and 'q,op=AND + mm=2' are treated as identical queries by Solr > 5.4, > > but 5.5+ will manipulate the occurs flags differently before it applies > mm > > afterwards... because that is what q.op does. > > > > > > On 2 June 2016 at 07:13, Jan Høydahl <jan@cominvent.com> wrote: > > > >> Edismax used to default to mm=100% and not care about q.op at all > >> > >> Then from 4.x it did not care about q.op if mm was set explicitly, > >> but if mm was not set, then q.op=OR —> mm=0%, q.op=AND —> mm=100% > >> > >> And from 5.5 it seems as q.op does something even if mm is set... > >> > >&g
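The mm-versus-occurs distinction discussed above can be sketched in a few lines. This is a hypothetical toy model for illustration only, not Solr's actual code; the clause lists stand in for the occurs flags the parser produces.

```python
# Hypothetical model (not Solr's code): mm constrains only the SHOULD occurs
# flags; MUST and MUST_NOT clauses are unaffected by it.

def matches(doc_terms, must, should, must_not, mm):
    """True if a doc satisfies the clause lists under a given mm."""
    if any(t in doc_terms for t in must_not):
        return False
    if not all(t in doc_terms for t in must):
        return False
    # mm applies to the SHOULD clauses only
    return sum(1 for t in should if t in doc_terms) >= mm

doc = {"solr", "cloud"}
# q.op=OR leaves the clauses as SHOULD, so mm=2 means "at least 2 of these 3"
print(matches(doc, must=[], should=["solr", "search", "cloud"], must_not=[], mm=2))   # True
# q.op=AND turns them into MUST clauses, leaving nothing for mm to apply to
print(matches(doc, must=["solr", "search", "cloud"], should=[], must_not=[], mm=2))   # False
```

With q.op=OR the doc matches (2 of 3 SHOULD clauses found); with q.op=AND the same mm=2 is irrelevant because every clause is already MUST.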
Re: After Solr 5.5, mm parameter doesn't work properly
I would describe that subtly differently, and I think it is where the difference lies:

"Then from 4.x it did not care about q.op if mm was set explicitly"
>> I agree. q.op was not actually used in the query, but rather as a way of inferring the default mm value. eDismax still ignored whatever q.op was set to and built your query operators (i.e. the occurs flags) using q.op=OR.

"And from 5.5 it seems as q.op does something even if mm is set..."
>> Yes, although I think the words 'even if' draw too strong a relationship between the two parameters. q.op has a function of its own, and that now functions as it 'should' (opinionated, I know) in the query construction, and continues to influence the default value of mm if it has not been explicitly set. SOLR-8812 further evolves that influence by trying to improve backwards compatibility for users who were not explicitly setting mm, and only ever changed 'q.op' despite it being a step removed from the actual parameter they were trying to manipulate.

So in relation to the OP's sample queries I was pointing out that 'q.op=OR + mm=2' and 'q.op=AND + mm=2' are treated as identical queries by Solr 5.4, but 5.5+ will manipulate the occurs flags differently before it applies mm afterwards... because that is what q.op does.

On 2 June 2016 at 07:13, Jan Høydahl <jan@cominvent.com> wrote: > Edismax used to default to mm=100% and not care about q.op at all > > Then from 4.x it did not care about q.op if mm was set explicitly, > but if mm was not set, then q.op=OR —> mm=0%, q.op=AND —> mm=100% > > And from 5.5 it seems as q.op does something even if mm is set... > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > > > 1. jun. 2016 kl. 23.05 skrev Greg Pendlebury <greg.pendleb...@gmail.com > >: > > > > But isn't that the default value? In this case the OP is setting mm > > explicitly to 2. > > > > Will have to look at those code links more thoroughly at work this > morning.
> > Apologies if I am wrong. > > > > Ta, > > Greg > > > > On Wednesday, 1 June 2016, Jan Høydahl <jan@cominvent.com> wrote: > > > >>> 1. jun. 2016 kl. 03.47 skrev Greg Pendlebury < > greg.pendleb...@gmail.com > >> <javascript:;>>: > >> > >>> I don't think it is 8812. q.op was completely ignored by edismax prior > to > >>> 5.5, so it is not mm that changed. > >> > >> That is not the case. Prior to 5.5, mm would be automatically set to > 100% > >> if q.op==AND > >> See https://issues.apache.org/jira/browse/SOLR-1889 and > >> https://svn.apache.org/viewvc?view=revision=950710 > >> > >> Jan > >
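The pre-5.5 defaulting described in the quoted thread (explicit mm always wins; otherwise q.op=AND implies mm=100% and q.op=OR implies mm=0%, per SOLR-1889) can be sketched as follows. This is an illustration of the described rule, not Solr's source:

```python
# Sketch of the pre-5.5 behaviour discussed above (see SOLR-1889): when mm is
# not set explicitly, edismax derives a default from q.op. Parameter names
# mirror the Solr request params; the logic is illustrative only.

def effective_mm(mm=None, q_op="OR"):
    if mm is not None:
        return mm                       # explicit mm always wins
    return "100%" if q_op == "AND" else "0%"

print(effective_mm(q_op="AND"))          # '100%'
print(effective_mm(q_op="OR"))           # '0%'
print(effective_mm(mm="2", q_op="AND"))  # '2' -- q.op ignored pre-5.5
```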
Re: After Solr 5.5, mm parameter doesn't work properly
But isn't that the default value? In this case the OP is setting mm explicitly to 2. Will have to look at those code links more thoroughly at work this morning. Apologies if I am wrong. Ta, Greg On Wednesday, 1 June 2016, Jan Høydahl <jan@cominvent.com> wrote: > > 1. jun. 2016 kl. 03.47 skrev Greg Pendlebury <greg.pendleb...@gmail.com > <javascript:;>>: > > > I don't think it is 8812. q.op was completely ignored by edismax prior to > > 5.5, so it is not mm that changed. > > That is not the case. Prior to 5.5, mm would be automatically set to 100% > if q.op==AND > See https://issues.apache.org/jira/browse/SOLR-1889 and > https://svn.apache.org/viewvc?view=revision=950710 > > Jan
Re: After Solr 5.5, mm parameter doesn't work properly
I don't think it is 8812. q.op was completely ignored by edismax prior to 5.5, so it is not mm that changed. If you do the same 5.4 query with q.op=OR I suspect it will not change the debug query at all. On 30 May 2016 at 21:07, Jan Høydahl wrote: > Hi, > > This may be related to SOLR-8812, but still different. Please file a JIRA > issue for this. > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > > > 29. mai 2016 kl. 18.20 skrev Issei Nishigata : > > > > Hi, > > > > The “mm" parameter does not work properly when I set "q.op=AND” after Solr > 5.5. > > In Solr 5.4, the mm parameter works as expected with the following setting. > > > > --- > > [schema] > > > > > > maxGramSize="2"/> > > > > > > > > > > [request] > > > http://localhost:8983/solr/collection1/select?defType=edismax&q.op=AND&mm=2&q=solar > > — > > > > After Solr 5.5, the result is not the same as in Solr 5.4. > > Has the spec of the mm parameter, or the meaning of the config settings, > changed? > > > > > > [Solr 5.4] > > > > ... > > > > 2 > > solar > > edismax > > AND > > > > ... > > > > > > 0 > > > > solr > > > > > > > > > > solar > > solar > > > > (+DisjunctionMaxQuery(((text:so text:ol text:la > text:ar)~2)))/no_coord > > > > +(((text:so text:ol text:la > text:ar)~2)) > > ... > > > > > > > > > > [Solr 6.0.1] > > > > > > ... > > > > 2 > > solar > > edismax > > AND > > > > ... > > > > > > solar > > solar > > > > (+DisjunctionMaxQuery(((+text:so +text:ol +text:la > +text:ar))))/no_coord > > > > +((+text:so +text:ol +text:la > +text:ar)) > > ... > > > > > > As shown above, the parsedquery also differs between Solr 5.4 and Solr > 6.0.1 (after Solr 5.5). > > > > > > — > > Thanks > > Issei Nishigata > >
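Given the bigram analysis in Issei's schema, "solar" analyses to four 2-grams, and the two parsed queries treat them differently: 5.4 applies mm=2 across SHOULD clauses, while 5.5+ with q.op=AND makes every gram a MUST clause. A toy model of the difference (my own illustration, not Solr code):

```python
# Illustrative model of why the 5.4 and 6.0.1 parsed queries above behave
# differently. A 2-gram tokenizer turns "solar" into four grams; under 5.4,
# mm=2 meant "any 2 of the SHOULD grams"; under 5.5+ with q.op=AND each gram
# becomes a MUST clause and mm has nothing left to act on.

def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

grams = bigrams("solar")
print(grams)  # ['so', 'ol', 'la', 'ar']

doc = set(bigrams("sole"))  # shares 'so' and 'ol' with "solar"

# 5.4 semantics: (text:so text:ol text:la text:ar)~2 -- any 2 of 4 suffice
print(sum(1 for g in grams if g in doc) >= 2)   # True

# 5.5+ with q.op=AND: (+text:so +text:ol +text:la +text:ar) -- all required
print(all(g in doc for g in grams))             # False
```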
Phrase Slop relevance tuning
I've received a request from our business area to take a look at emphasising ~0 phrase matches over ~1 (and greater) more than they are already. I can't see any doco on the subject, and I'd like to ask if anyone else has played in this area? Or at least is willing to sanity check my reasoning before I rush in and code a solution, when I may be reinventing the wheel?

Looking through the codebase, I can only find hardcoded weightings in a couple of places, using the formula:

return 1.0f / (distance + 1);

which results in ~0 getting a weight of 1, and ~1 getting a weight of 0.5. There are a number of ways I've already considered, but the most flexible seems to be to expose those two numbers via configuration. We are considering adjusting them in sync with each other (using 1/3 instead of 1 in both places), which has the impact of altering the overall distribution of the weightings graph, but retaining the scale between 1 and 0. Additionally, we are considering increasing the numerator to raise the upper scale above 1. Not sure if this is a dumb idea though. Our hope was to use something like

return 2.0f / (distance + 0.33f);

to give ~0 matches a real (^2) boost in comparison to other weighting factors, and retain the ~1 (and greater) matches at around their current weight. This remains a completely untested theory though, since I may be misunderstanding how the output gets combined outside this method. The real technical change though would be to simply get those two numbers from config.

Any advice or suggestions about other ideas we haven't even considered? The larger picture here is that we are using edismax and the pf fields are all covered by ps=5.

Ta, Greg
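To sanity check the numbers, here are the two weighting curves side by side. The configurable numerator/offset parameters are part of our proposal, not an existing Solr or Lucene setting:

```python
# Quick comparison of the hardcoded Lucene slop-weight formula with the
# proposed replacement discussed above. The numerator/offset knobs are
# hypothetical config values, not real Solr settings.

def current_weight(distance):
    return 1.0 / (distance + 1)

def proposed_weight(distance, numerator=2.0, offset=0.33):
    return numerator / (distance + offset)

for d in range(4):
    print(d, round(current_weight(d), 3), round(proposed_weight(d), 3))
# distance 0: 1.0 vs ~6.06 -- exact phrase matches get a strong boost
# distance 1: 0.5 vs ~1.5  -- near matches stay roughly in their current range
```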
Re: SolrCloud leaders using more disk space
Thanks for the reply Tim.

"Can you diff the listings of the index data directories on a leader vs. replica?"

It was a good tip, and mirrors some stuff we have been exploring in house as well. The leaders all have additional 'index.' directories on disk, but we have come to the conclusion that this is a coincidence and not related to the fact that they are leaders. The current theory is that they are the result of an upgrade rehearsal that was performed before launch, where the cluster was split into two on different versions of Solr and different ZK paths. I suspect that whilst the ops team were doing the deployment there were a number of server restarts that triggered leader elections and recovery events that weren't allowed to complete gracefully, leaving the old data on disk. The coincidence is simply that the ops team did all their initial practice stuff on the same 3 hosts, which later became our leaders. I've found a few small similar issues on hosts 4-6, and none at all on hosts 7-9. I'm hoping we get a chance to test all this soon, but we need to re-jig our test systems first, since they don't have any redundancy depth to them right now.

Ta, Greg

On 28 June 2014 02:59, Timothy Potter thelabd...@gmail.com wrote: Hi Greg, Sorry for the slow response. The general thinking is that you shouldn't worry about which nodes host leaders vs. replicas because A) that can change, and B) as you say, the additional responsibilities for leader nodes are quite minimal (mainly per-doc version management and then distributing updates to replicas). The segment merging all happens at the Lucene level, which has no knowledge of SolrCloud leaders / replicas. Since this is SolrCloud, all nodes pull the config from ZooKeeper so should be running the same settings. Can you diff the listings of the index data directories on a leader vs. replica? Might give us some insights into what files the leader has that the replicas don't have.
Cheers, Tim On Tue, Jun 3, 2014 at 8:32 PM, Greg Pendlebury greg.pendleb...@gmail.com wrote: Hi all, We launched our new production instance of SolrCloud last week and since then have noticed a trend with regards to disk usage. The non-leader replicas all seem to be self-optimizing their index segments as expected, but the leaders have (on average) around 33% more data on disk. My assumption is that leaders are not self-optimising (or not to the same extent)... but it is still early days of course. If it helps, there are 45 JVMs in the cloud, with 15 shards and 3 replicas per shard. Each non-leader shard is sitting at between 59GB and 87GB on their SSD, but the leaders are between 84GB and 116GB. We have pretty much constant read and write traffic 24x7, with just 'slow' periods overnight when write traffic is 1 document per second and searches are between 1 and 2 per second. Is this light level of traffic still too much for the leaders to self-optimise? I'd also be curious to hear about what others are doing in terms of operating procedures. Before launch we load tested what would happen if we turned off JVMs and forced recovery events. I know that these things all work, just that customers will experience slower search responses whilst they occur. For example, a restore from a leader to a replica under load testing for us takes around 30 minutes, and response times drop from around 200-300ms average to 1.5s average. The bottleneck appears to be network I/O on the servers. We haven't explored whether this is specific to the servers replicating, or saturation of the infrastructure that all the servers share, because... This performance is acceptable for us, but I'm not sure if I'd like to force that event to occur unless required... this is following the line of reasoning proposed internally that we should periodically rotate leaders by turning them off briefly. We aren't going to do that unless we have a strong reason though.
Does anyone try to manipulate production instances that way? Vaguely related to this is leader distribution. We have 9 physical servers and 5 JVMs running on each server. By virtue of the deployment procedures, the first 3 servers to come online are all running 5 leaders each. Is there any merit in 'moving' these around (by reboots)? Our planning up to launch was based on lots of mailing list responses we'd seen indicating that leaders had no significant performance difference from normal replicas, and all of our testing has agreed with that. The disk size 'issue' (which we aren't worried about... yet. It hasn't been in prod long enough to know for certain) may be the only thing we've seen so far. Ta, Greg
SolrCloud leaders using more disk space
Hi all,

We launched our new production instance of SolrCloud last week and since then have noticed a trend with regards to disk usage. The non-leader replicas all seem to be self-optimizing their index segments as expected, but the leaders have (on average) around 33% more data on disk. My assumption is that leaders are not self-optimising (or not to the same extent)... but it is still early days of course. If it helps, there are 45 JVMs in the cloud, with 15 shards and 3 replicas per shard. Each non-leader shard is sitting at between 59GB and 87GB on their SSD, but the leaders are between 84GB and 116GB. We have pretty much constant read and write traffic 24x7, with just 'slow' periods overnight when write traffic is 1 document per second and searches are between 1 and 2 per second. Is this light level of traffic still too much for the leaders to self-optimise?

I'd also be curious to hear about what others are doing in terms of operating procedures. Before launch we load tested what would happen if we turned off JVMs and forced recovery events. I know that these things all work, just that customers will experience slower search responses whilst they occur. For example, a restore from a leader to a replica under load testing for us takes around 30 minutes, and response times drop from around 200-300ms average to 1.5s average. The bottleneck appears to be network I/O on the servers. We haven't explored whether this is specific to the servers replicating, or saturation of the infrastructure that all the servers share, because... This performance is acceptable for us, but I'm not sure if I'd like to force that event to occur unless required... this is following the line of reasoning proposed internally that we should periodically rotate leaders by turning them off briefly. We aren't going to do that unless we have a strong reason though. Does anyone try to manipulate production instances that way?

Vaguely related to this is leader distribution.
We have 9 physical servers and 5 JVMs running on each server. By virtue of the deployment procedures, the first 3 servers to come online are all running 5 leaders each. Is there any merit in 'moving' these around (by reboots)? Our planning up to launch was based on lots of mailing list responses we'd seen indicating that leaders had no significant performance difference from normal replicas, and all of our testing has agreed with that. The disk size 'issue' (which we aren't worried about... yet. It hasn't been in prod long enough to know for certain) may be the only thing we've seen so far. Ta, Greg
Re: Deep paging in parallel with solr cloud - OutOfMemory
Shouldn't all deep pagination against a cluster use the new cursor mark feature instead of 'start' and 'rows'? 4 or 5 requests still seems a very low limit to be running into an OOM issue though, so perhaps it is both issues combined? Ta, Greg On 18 March 2014 07:49, Mike Hugo m...@piragua.com wrote: Thanks! On Mon, Mar 17, 2014 at 3:47 PM, Steve Rowe sar...@gmail.com wrote: Mike, Days. I plan on making a 4.7.1 release candidate a week from today, and assuming nobody finds any problems with the RC, it will be released roughly four days thereafter (three days for voting + one day for release propagation to the Apache mirrors): i.e., next Friday-ish. Steve On Mar 17, 2014, at 4:40 PM, Mike Hugo m...@piragua.com wrote: Thanks Steve, That certainly looks like it could be the culprit. Any word on a release date for 4.7.1? Days? Weeks? Months? Mike On Mon, Mar 17, 2014 at 3:31 PM, Steve Rowe sar...@gmail.com wrote: Hi Mike, The OOM you're seeing is likely a result of the bug described in (and fixed by a commit under) SOLR-5875: https://issues.apache.org/jira/browse/SOLR-5875. If you can build from source, it would be great if you could confirm the fix addresses the issue you're facing. This fix will be part of a to-be-released Solr 4.7.1. Steve On Mar 17, 2014, at 4:14 PM, Mike Hugo m...@piragua.com wrote: Hello, We recently upgraded to Solr Cloud 4.7 (went from a single node Solr 4.0 instance to a 3 node Solr 4.7 cluster). Part of our application does an automated traversal of all documents that match a specific query. It does this by iterating through results by setting the start and rows parameters, starting with start=0 and rows=1000, then start=1000, rows=1000, start=2000, rows=1000, etc. We do this in parallel fashion with multiple workers on multiple nodes.
It's easy to chunk up the work to be done by figuring out how many total results there are and then creating 'chunks' (0-1000, 1000-2000, 2000-3000) and sending each chunk to a worker in a pool of multi-threaded workers. This worked well for us with a single server. However upon upgrading to solr cloud, we've found that this quickly (within the first 4 or 5 requests) causes an OutOfMemory error on the coordinating node that receives the query. I don't fully understand what's going on here, but it looks like the coordinating node receives the query and sends it to the shard requested. For example, given: shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000 The coordinating node sends this query to shard3: NOW=1395086719189&shard.url=http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000 Notice the rows parameter is 5000 (start + rows). If the coordinator node is able to process the result set (which works for the first few pages, after that it will quickly run out of memory), it eventually issues this request back to shard3: NOW=1395086719189&shard.url=http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000 and then finally returns the response to the client. One possible workaround: We've found that if we issue non-distributed requests to specific shards, we get performance along the same lines that we did before. E.g. issue a query with shards=shard3&distrib=false directly to the url of the shard3 instance, rather than going through the cloud solr server solrj API. The other workaround is to adapt to use the new cursorMark functionality. I've manually tried a few requests and it is pretty efficient, and doesn't result in the OOM errors on the coordinating node. However, I've only done this in single threaded manner.
I'm wondering if there would be a way to get cursor marks for an entire result set at a given page interval, so that they could then be fed to the pool of parallel workers to get the results in parallel rather than single threaded. Is there a way to do this so we could process the results in parallel? Any other possible solutions? Thanks in advance. Mike
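As a rough model of the blow-up Mike describes: to serve page N of a distributed start/rows query, the coordinating node must ask each shard for its top (start + rows) documents and merge them, so coordinator memory grows linearly with page depth, while cursorMark keeps the per-request cost constant. The arithmetic below is illustrative only; actual buffering depends on Solr internals:

```python
# Illustrative arithmetic (not Solr internals): docs the coordinating node
# must buffer for one distributed start/rows request vs. a cursorMark request.

def coordinator_fetch(start, rows, num_shards):
    """Docs buffered by the coordinator for a distributed start/rows query."""
    return (start + rows) * num_shards

print(coordinator_fetch(start=0, rows=1000, num_shards=3))       # 3000
print(coordinator_fetch(start=100000, rows=1000, num_shards=3))  # 303000

# With cursorMark, each shard returns only `rows` docs regardless of depth:
print(1000 * 3)  # 3000 per request, at any page depth
```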
Re: Deep paging in parallel with solr cloud - OutOfMemory
My suspicion is that it won't work in parallel, but we've only just asked the ops team to start our upgrade to look into it, so I don't have a server yet to test. The bug identified in SOLR-5875 has put them off though :( If things pan out as I think they will I suspect we are going to end up with two implementations here. One for our GUI applications that uses traditional paging and is capped at some arbitrarily low limit (say 1000 records like Google does). And another for our API users that harvest full datasets, which will use cursor marks and support only serial harvests that cannot skip content. Ta, Greg On 18 March 2014 09:44, Mike Hugo m...@piragua.com wrote: Cursor mark definitely seems like the way to go. If I can get it to work in parallel then that's additional bonus On Mon, Mar 17, 2014 at 5:41 PM, Greg Pendlebury greg.pendleb...@gmail.comwrote: Shouldn't all deep pagination against a cluster use the new cursor mark feature instead of 'start' and 'rows'? 4 or 5 requests still seems a very low limit to be running into an OOM issues though, so perhaps it is both issues combined? Ta, Greg On 18 March 2014 07:49, Mike Hugo m...@piragua.com wrote: Thanks! On Mon, Mar 17, 2014 at 3:47 PM, Steve Rowe sar...@gmail.com wrote: Mike, Days. I plan on making a 4.7.1 release candidate a week from today, and assuming nobody finds any problems with the RC, it will be released roughly four days thereafter (three days for voting + one day for release propogation to the Apache mirrors): i.e., next Friday-ish. Steve On Mar 17, 2014, at 4:40 PM, Mike Hugo m...@piragua.com wrote: Thanks Steve, That certainly looks like it could be the culprit. Any word on a release date for 4.7.1? Days? Weeks? Months? Mike On Mon, Mar 17, 2014 at 3:31 PM, Steve Rowe sar...@gmail.com wrote: Hi Mike, The OOM you're seeing is likely a result of the bug described in (and fixed by a commit under) SOLR-5875: https://issues.apache.org/jira/browse/SOLR-5875. 
If you can build from source, it would be great if you could confirm the fix addresses the issue you're facing. This fix will be part of a to-be-released Solr 4.7.1. Steve On Mar 17, 2014, at 4:14 PM, Mike Hugo m...@piragua.com wrote: Hello, We recently upgraded to Solr Cloud 4.7 (went from a single node Solr 4.0 instance to 3 node Solr 4.7 cluster). Part of out application does an automated traversal of all documents that match a specific query. It does this by iterating through results by setting the start and rows parameters, starting with start=0 and rows=1000, then start=1000, rows=1000, start = 2000, rows=1000, etc etc. We do this in parallel fashion with multiple workers on multiple nodes. It's easy to chunk up the work to be done by figuring out how many total results there are and then creating 'chunks' (0-1000, 1000-2000, 2000-3000) and sending each chunk to a worker in a pool of multi-threaded workers. This worked well for us with a single server. However upon upgrading to solr cloud, we've found that this quickly (within the first 4 or 5 requests) causes an OutOfMemory error on the coordinating node that receives the query. I don't fully understand what's going on here, but it looks like the coordinating node receives the query and sends it to the shard requested. For example, given: shards=shard3sort=id+ascstart=4000q=*:*rows=1000 The coordinating node sends this query to shard3: NOW=1395086719189shard.url= http://shard3_url_goes_here:8080/solr/collection1/fl=idsort=id+ascstart=0q=*:*distrib=falsewt=javabinisShard=truefsv=trueversion=2rows=5000 Notice the rows parameter is 5000 (start + rows). 
If the coordinator node is able to process the result set (which works for the first few pages, after that it will quickly run out of memory), it eventually issues this request back to shard3: NOW=1395086719189shard.url= http://10.128.215.226:8080/extera-search/gemindex/start=4000ids=a..bunch...(1000)..of..doc..ids..go..hereq=*:*distrib=falsewt=javabinisShard=trueversion=2rows=1000 and then finally returns the response to the client. One possible workaround: We've found that if we issue non-distributed requests to specific shards, that we get performance along the same lines that we did before. E.g. issue a query with shards=shard3distrib=false directly to the url of the shard3 instance, rather than going through the cloud solr server solrj API. The other workaround is to adapt to use
Re: Deep paging in parallel with solr cloud - OutOfMemory
Sorry, I meant one thread requesting records 1 - 1000, whilst the next thread requests 1001 - 2000 from the same ordered result set. We've observed several of our customers trying to harvest our data with multi-threaded scripts that work like this. I thought it would not work using cursor marks... but: A) I could be wrong, and B) I could be talking about parallel in a different way to Mike. Ta, Greg On 18 March 2014 10:24, Yonik Seeley yo...@heliosearch.com wrote: On Mon, Mar 17, 2014 at 7:14 PM, Greg Pendlebury greg.pendleb...@gmail.com wrote: My suspicion is that it won't work in parallel Deep paging with cursorMark does work with distributed search (assuming that's what you meant by parallel... querying sub-shards in parallel?). -Yonik http://heliosearch.org - solve Solr GC pauses with off-heap filters and fieldcache
Re: Solr metrics in Codahale metrics and Graphite?
In the codahale metrics library there are 1, 5 and 15 minute moving averages just like you would see in a tool like 'top'. However in Solr I can only see 5 and 15 minute values, plus 'avgRequestsPerSecond'. I assumed this was the 1 minute value initially, but it seems to be something like the average since startup. I haven't looked thoroughly, but it is around 1% of the other two in a normally idle test cluster after load tests have been running for long enough that the 5 and 15 minute numbers match the load testing throughput. Is this difference deliberate? or an accident? or am I wrong entirely? I can compute the overall average anyway, given that the stats also include the start time of the search handler and the total search count, so I thought it might be an accident. Ta, Greg On 4 May 2013 01:19, Furkan KAMACI furkankam...@gmail.com wrote: Does anybody tested Ganglia with JMXTrans at production environment for SolrCloud? 2013/4/26 Dmitry Kan solrexp...@gmail.com Alan, Shawn, If backporting to 3.x is hard, no worries, we don't necessarily require the patch as we are heading to 4.x eventually. It is just much easier within our organization to test on the existing solr 3.4 as there are a few of internal dependencies and custom code on top of solr. Also solr upgrades on production systems are usually pushed forward by a month or so starting the upgrade on development systems (requires lots of testing and verifications). Nevertheless, it is good effort to make #solr #graphite friendly, so keep it up! :) Dmitry On Thu, Apr 25, 2013 at 9:29 PM, Shawn Heisey s...@elyograg.org wrote: On 4/25/2013 6:30 AM, Dmitry Kan wrote: We are very much interested in 3.4. On Thu, Apr 25, 2013 at 12:55 PM, Alan Woodward a...@flax.co.uk wrote: This is on top of trunk at the moment, but would be back ported to 4.4 if there was interest. This will be bad news, I'm sorry: All remaining work on 3.x versions happens in the 3.6 branch. This branch is in maintenance mode. 
It will only get fixes for serious bugs with no workaround. Improvements and new features won't be considered at all. You're welcome to try backporting patches from newer issues. Due to the major differences in the 3x and 4x codebases, the best case scenario is that you'll be facing a very manual task. Some changes can't be backported because they rely on other features only found in 4.x code. Thanks, Shawn
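For reference, the 1-, 5- and 15-minute rates in the codahale library are exponentially weighted moving averages, ticked at a fixed interval with a window-dependent smoothing constant. A simplified sketch of the mechanism (my own illustration, not the library's actual code, which also handles thread safety and tick scheduling):

```python
# Simplified sketch of an exponentially weighted moving average of the kind
# backing codahale's 1/5/15-minute rates. Illustrative only.
import math

TICK_SECONDS = 5

class EWMA:
    def __init__(self, window_minutes):
        # Smoothing constant: larger windows react more slowly
        self.alpha = 1 - math.exp(-TICK_SECONDS / (60.0 * window_minutes))
        self.rate = None       # events per second
        self.uncounted = 0

    def mark(self, n=1):
        self.uncounted += n

    def tick(self):            # called every TICK_SECONDS
        instant = self.uncounted / TICK_SECONDS
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant
        else:
            self.rate += self.alpha * (instant - self.rate)

one_min = EWMA(1)
for _ in range(24):            # two minutes of a steady 10 requests/sec
    one_min.mark(50)           # 50 events per 5-second tick
    one_min.tick()
print(round(one_min.rate, 2))  # 10.0 -- converged to the steady rate
```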
Re: Solr metrics in Codahale metrics and Graphite?
Oh my bad. I thought it was already in. Thanks for the correction. Ta, Greg On 17 March 2014 15:55, Shalin Shekhar Mangar shalinman...@gmail.comwrote: Greg, SOLR-4735 (using the codahale metrics lib) hasn't been committed yet. It is still work in progress. Actually the internal Solr Metrics class has a method to return 1 minute stats but it is not used. On Mon, Mar 17, 2014 at 10:06 AM, Greg Pendlebury greg.pendleb...@gmail.com wrote: In the codahale metrics library there are 1, 5 and 15 minute moving averages just like you would see in a tool like 'top'. However in Solr I can only see 5 and 15 minute values, plus 'avgRequestsPerSecond'. I assumed this was the 1 minute value initially, but it seems to be something like the average since startup. I haven't looked thoroughly, but it is around 1% of the other two in a normally idle test cluster after load tests have been running for long enough that the 5 and 15 minute numbers match the load testing throughput. Is this difference deliberate? or an accident? or am I wrong entirely? I can compute the overall average anyway, given that the stats also include the start time of the search handler and the total search count, so I thought it might be an accident. Ta, Greg On 4 May 2013 01:19, Furkan KAMACI furkankam...@gmail.com wrote: Does anybody tested Ganglia with JMXTrans at production environment for SolrCloud? 2013/4/26 Dmitry Kan solrexp...@gmail.com Alan, Shawn, If backporting to 3.x is hard, no worries, we don't necessarily require the patch as we are heading to 4.x eventually. It is just much easier within our organization to test on the existing solr 3.4 as there are a few of internal dependencies and custom code on top of solr. Also solr upgrades on production systems are usually pushed forward by a month or so starting the upgrade on development systems (requires lots of testing and verifications). Nevertheless, it is good effort to make #solr #graphite friendly, so keep it up! 
:) Dmitry On Thu, Apr 25, 2013 at 9:29 PM, Shawn Heisey s...@elyograg.org wrote: On 4/25/2013 6:30 AM, Dmitry Kan wrote: We are very much interested in 3.4. On Thu, Apr 25, 2013 at 12:55 PM, Alan Woodward a...@flax.co.uk wrote: This is on top of trunk at the moment, but would be back ported to 4.4 if there was interest. This will be bad news, I'm sorry: All remaining work on 3.x versions happens in the 3.6 branch. This branch is in maintenance mode. It will only get fixes for serious bugs with no workaround. Improvements and new features won't be considered at all. You're welcome to try backporting patches from newer issues. Due to the major differences in the 3x and 4x codebases, the best case scenario is that you'll be facing a very manual task. Some changes can't be backported because they rely on other features only found in 4.x code. Thanks, Shawn -- Regards, Shalin Shekhar Mangar.
Re: Solr 4.7.0 - cursorMark question
That was really clear; I just had another read through of the documentation with that explanation in mind and I can see where I went off the rails. Sorry for any confusion on my part, and thanks for the details. Ta, Greg On 8 March 2014 08:36, Chris Hostetter hossman_luc...@fucit.org wrote: : Thank-you, that all sounds great. My assumption about documents being : missed was something like this: ... : In that situation D would always be missed, whether the cursorMark 'C or : greater' or 'greater than B' (I'm not sure which it is in practice), simply : because the cursorMark is the unique ID and the unique ID is not your first : sort mechanism. First off: nothing about your example would result in the cursorMark being the unique ID ... let's clear that misconception up right away: using cursors requires a deterministic sort without any ties that can result in ambiguity. For this reason (eliminating the ambiguity) it is necessary that the uniqueKey always be included in a sort -- but the cursorMark values that get computed are determined by *all* of the sort criteria used. So let's revisit your example, but let's make sure we are explicit about everything involved: * A,B,C,D are all uniqueKey values in the id field * 1,2,3 are all time values in a timestamp field * we're going to use a sort=timestamp asc, id asc param in this example * when we say X(123) we mean the document with id 'X' which currently has value '123' in the timestamp field Let's suppose that at the start of the example, all of the docs in your example, in sorted order, look like this... A(1), B(3), C(14), D(32) A client uses our sort, along with cursorMark=* rows=2.
That client will get back A(1) and B(3), as well as some nextCursorMark value of $%^ (deliberately not using any letters or numbers so as not to mislead you into thinking the cursorMark value is an id or a timestamp -- it's neither; it's an encoded binary value that has no meaning to the client other than as a mark to send back to the server). Now let's suppose that B & C are edited as you mention -- their new timestamp values must -- by definition -- be greater than D's existing timestamp value of 32 (otherwise it's not really a timestamp field). So let's assume now that the total ordering of all our docs, using our sort, is: A(1), D(32), B(56), C(57) After B & C are modified, the client makes a follow-up request using the same sort, rows=2, and cursorMark=$%^ (the nextCursorMark returned from the previous request). The two documents the client will get this time are D(32) and B(56). - D will never be skipped. - B will be returned twice, because its timestamp value was updated after it was fetched. Does that make sense? You can try this out manually if you want to see it for yourself -- either using a real auto-assigned timestamp field, or just using a simple numeric field you set yourself when updating docs. -Hoss http://www.lucidworks.com/
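Hoss's walkthrough can be simulated in a few lines. This is my own toy model of the cursor semantics (in real Solr the cursorMark is an opaque encoded value, not a sort-key tuple):

```python
# Toy model of cursorMark paging over a (timestamp asc, id asc) sort. The
# "mark" here is the sort key of the last doc returned; real cursorMarks are
# opaque encoded values, but the comparison semantics are the same idea.

def page(docs, after, rows):
    """Return up to `rows` docs whose (timestamp, id) sort key is > `after`."""
    ordered = sorted(docs.items(), key=lambda kv: (kv[1], kv[0]))
    hits = [(doc_id, ts) for doc_id, ts in ordered if (ts, doc_id) > after]
    batch = hits[:rows]
    mark = (batch[-1][1], batch[-1][0]) if batch else after
    return batch, mark

docs = {"A": 1, "B": 3, "C": 14, "D": 32}
batch1, mark = page(docs, after=(float("-inf"), ""), rows=2)
print(batch1)  # [('A', 1), ('B', 3)]

# B and C are updated; their timestamps move past D's
docs["B"], docs["C"] = 56, 57
batch2, mark = page(docs, after=mark, rows=2)
print(batch2)  # [('D', 32), ('B', 56)] -- D is not skipped, B appears twice
```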
Re:Solr 4.7.0 - cursorMark question
* New 'cursorMark' request param for efficient deep paging of sorted result sets. See http://s.apache.org/cursorpagination; At the end of the linked doco there is an example that doesn't make sense to me, because it mentions sort=timestamp asc and is then followed by pseudo-code that sorts by id only. I understand that cursorMark requires that sort clauses must include the uniqueKey field, but is it really just 'include', or is it the only field that sort can be performed on? i.e. can sort be specified as 'sort=timestamp asc, id asc'? I am assuming that if the index is changed between requests then we can still 'miss' or duplicate documents by not sorting on the id as the only sort parameter, but I can live with that scenario. cursorMark is still attractive to us since it will prevent the SolrCloud cluster from crashing when deep pagination requests are sent to it... I'm just trying to explore all the edge cases our business area are likely to consider. Ta, Greg On 27 February 2014 02:15, Simon Willnauer sim...@apache.org wrote: February 2014, Apache Solr(tm) 4.7 available The Lucene PMC is pleased to announce the release of Apache Solr 4.7 Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites. Solr 4.7 is available for immediate download at: http://lucene.apache.org/solr/mirrors-solr-latest-redir.html See the CHANGES.txt file included with the release for a full list of details. Solr 4.7 Release Highlights: * A new 'migrate' collection API to split all documents with a route key into another collection. * Added support for tri-level compositeId routing.
* Admin UI - Added a new Files conf directory browser/file viewer. * Add a QParserPlugin for Lucene's SimpleQueryParser. * Suggest improvements: a new SuggestComponent that fully utilizes the Lucene suggester module; queries can now use multiple suggesters; Lucene's FreeTextSuggester and BlendedInfixSuggester are now supported. * New 'cursorMark' request param for efficient deep paging of sorted result sets. See http://s.apache.org/cursorpagination * Add a Solr contrib that allows for building Solr indexes via Hadoop's MapReduce. * Upgrade to Spatial4j 0.4. Various new options are now exposed automatically for an RPT field type. See Spatial4j CHANGES javadocs. https://github.com/spatial4j/spatial4j/blob/master/CHANGES.md * SSL support for SolrCloud. Solr 4.7 also includes many other new features as well as numerous optimizations and bugfixes. Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
Re: Solr 4.7.0 - cursorMark question
Thank-you, that all sounds great. My assumption about documents being missed was something like this: A,B,C,D where they are sorted by timestamp first and ID second. Say the first 'page' of results is 'A,B', and before the second page is requested both documents B + C receive update events and the new order (by timestamp) is: A,D,B,C In that situation D would always be missed, whether the cursorMark is 'C or greater' or 'greater than B' (I'm not sure which it is in practice), simply because the cursorMark is the unique ID and the unique ID is not your first sort mechanism. However, I'm not really concerned about that anyway since it is not a use case we consider important, and in an information science sense of things I think it is a non-trivial problem to solve without brute force caching of all result sets. I'm just happy that we don't have to get our users to replace existing sort options; we just need to add a unique ID field at the end and change the parameters we send into the cluster. Thanks, Greg On 7 March 2014 11:05, Chris Hostetter hossman_luc...@fucit.org wrote: : At the end of the linked doco there is an example that doesn't make sense : to me, because it mentions sort=timestamp asc and is then followed by : pseudo-code that sorts by id only. I understand that cursorMark requires Ok ... 2 things contributing to the confusion. 1) the para that refers to sort=timestamp asc should be fixed to include id as well. 2) the pseudo-code you're referring to that uses sort = 'id asc' isn't meant to give an example of specifically tailing by timestamp -- it's an extension on the earlier example (of fetching all docs sorting on id) to show tailing new docs with new (increasing) ids ... I'll try to fix the wording to better elaborate : that sort clauses must include the uniqueKey field, but is it really just : 'include', or is it the only field that sort can be performed on? : : i.e. can sort be specified as 'sort=timestamp asc, id asc'? That will absolutely work ...
I'll update the doc to include more examples with multi-clause sort criteria. : I am assuming that if the index is changed between requests then we can : still 'miss' or duplicate documents by not sorting on the id as the only : sort parameter, but I can live with that scenario. cursorMark is still If you are using a timestamp field, you should never miss a document (assuming every doc gets a timestamp), but yes: you can absolutely get the same doc twice if it's updated after the first time you fetch it -- that's one of the advantages of sorting on a timestamp field like that. -Hoss http://www.lucidworks.com/
Re: Cluster state ranges are all null after reboot
Thanks again for the info. Hopefully we find some more clues if it continues to occur. The ops team are looking at alternative deployment methods as well, so we might end up avoiding the issue altogether. Ta, Greg On 28 February 2014 02:42, Shalin Shekhar Mangar shalinman...@gmail.com wrote: I think it is just a side-effect of the current implementation that the ranges are assigned linearly. You can also verify this by choosing a document from each shard and running its uniqueKey against the CompositeIdRouter's sliceHash method and verifying that it is included in the range. I couldn't reproduce this but I didn't try too hard either. If you are able to isolate a reproducible example then please do report back. I'll spend some time to review the related code again to see if I can spot the problem. On Thu, Feb 27, 2014 at 2:19 AM, Greg Pendlebury greg.pendleb...@gmail.com wrote: Thanks Shalin, that code might be helpful... do you know if there is a reliable way to line up the ranges with the shard numbers? When the problem occurred we had 80 million documents already in the index, and could not issue even a basic 'deleteById' call. I'm tempted to assume they are just assigned linearly since our Test and Prod clusters both look to work that way now, but I can't be sure whether that is by design or just happenstance of boot order. And no, unfortunately we have not been able to reproduce this issue consistently despite trying a number of different things such as graceless stop/start and screwing with the underlying WAR file (which is what we thought puppet might be doing). The problem has occurred twice since, but always in our Test environment. The fact that Test has only a single replica per shard is the most likely culprit for me, but as mentioned, even gracelessly killing the last replica in the cluster seems to leave the range set correctly in clusterstate when we test it in isolation.
In production (45 JVMs, 15 shards with 3 replicas each) we've never seen the problem, despite a similar number of rollouts for version changes etc. Ta, Greg On 26 February 2014 23:46, Shalin Shekhar Mangar shalinman...@gmail.com wrote: If you have 15 shards and assuming that you've never used shard splitting, you can calculate the shard ranges by using new CompositeIdRouter().partitionRange(15, new CompositeIdRouter().fullRange()) This gives me: [8000-9110, 9111-a221, a222-b332, b333-c443, c444-d554, d555-e665, e666-f776, f777-887, 888-1998, 1999-2aa9, 2aaa-3bba, 3bbb-4ccb, 4ccc-5ddc, 5ddd-6eed, 6eee-7fff] Have you done any more investigation into why this happened? Anything strange in the logs? Are you able to reproduce this in a test environment? On Wed, Feb 19, 2014 at 5:16 AM, Greg Pendlebury greg.pendleb...@gmail.com wrote: We've got a 15 shard cluster spread across 3 hosts. This morning our puppet software rebooted them all and afterwards the 'range' for each shard has become null in zookeeper. Is there any way to restore this value short of rebuilding a fresh index? I've read various questions from people with a similar problem, although in those cases it is usually a single shard that has become null allowing them to infer what the value should be and manually fix it in ZK. In this case I have no idea what the ranges should be. This is our test cluster, and checking production I can see that the ranges don't appear to be predictable based on the shard number. I'm also not certain why it even occurred. Our test cluster only has a single replica per shard, so when a JVM is rebooted the cluster is unavailable... would that cause this? Production has 3 replicas so we can do rolling reboots. -- Regards, Shalin Shekhar Mangar. -- Regards, Shalin Shekhar Mangar.
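For anyone wondering how Shalin's range list falls out of the arithmetic, here is a simplified sketch of the partitioning: split a signed hash range into n contiguous slices, with the final slice absorbing the remainder. It deliberately uses a 16-bit range so the hex values stay short and line up with the list quoted above; the real CompositeIdRouter partitions the full 32-bit hash range, so treat this as an illustration only, not Solr's code.

```java
import java.util.*;

public class ShardRanges {
    // Simplified take on CompositeIdRouter.partitionRange(): split the
    // signed hash space into n contiguous slices, giving any remainder
    // to the last slice. 16-bit here purely for readable hex values.
    static List<String> partition(int n) {
        int min = -0x8000, max = 0x7fff;
        long span = (long) max - min + 1;   // 65536 values
        List<String> ranges = new ArrayList<>();
        long start = min;
        for (int i = 0; i < n; i++) {
            long end = (i == n - 1) ? max : start + span / n - 1;
            ranges.add(hex(start) + "-" + hex(end));
            start = end + 1;
        }
        return ranges;
    }

    // Render a signed value as unsigned 16-bit hex, like Solr's range strings.
    static String hex(long v) {
        return Integer.toHexString((int) v & 0xffff);
    }

    public static void main(String[] args) {
        System.out.println(partition(15));
        // First slice starts at 8000 (the most negative hash) and each
        // slice is 0x1111 wide, which is why the list wraps past 7fff/8000.
    }
}
```

For 15 slices this prints the same boundaries Shalin quoted: 8000-9110, 9111-a221, ... 6eee-7fff.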
Re: Cluster state ranges are all null after reboot
Thanks Shalin, that code might be helpful... do you know if there is a reliable way to line up the ranges with the shard numbers? When the problem occurred we had 80 million documents already in the index, and could not issue even a basic 'deleteById' call. I'm tempted to assume they are just assigned linearly since our Test and Prod clusters both look to work that way now, but I can't be sure whether that is by design or just happenstance of boot order. And no, unfortunately we have not been able to reproduce this issue consistently despite trying a number of different things such as graceless stop/start and screwing with the underlying WAR file (which is what we thought puppet might be doing). The problem has occurred twice since, but always in our Test environment. The fact that Test has only a single replica per shard is the most likely culprit for me, but as mentioned, even gracelessly killing the last replica in the cluster seems to leave the range set correctly in clusterstate when we test it in isolation. In production (45 JVMs, 15 shards with 3 replicas each) we've never seen the problem, despite a similar number of rollouts for version changes etc. Ta, Greg On 26 February 2014 23:46, Shalin Shekhar Mangar shalinman...@gmail.com wrote: If you have 15 shards and assuming that you've never used shard splitting, you can calculate the shard ranges by using new CompositeIdRouter().partitionRange(15, new CompositeIdRouter().fullRange()) This gives me: [8000-9110, 9111-a221, a222-b332, b333-c443, c444-d554, d555-e665, e666-f776, f777-887, 888-1998, 1999-2aa9, 2aaa-3bba, 3bbb-4ccb, 4ccc-5ddc, 5ddd-6eed, 6eee-7fff] Have you done any more investigation into why this happened? Anything strange in the logs? Are you able to reproduce this in a test environment? On Wed, Feb 19, 2014 at 5:16 AM, Greg Pendlebury greg.pendleb...@gmail.com wrote: We've got a 15 shard cluster spread across 3 hosts.
This morning our puppet software rebooted them all and afterwards the 'range' for each shard has become null in zookeeper. Is there any way to restore this value short of rebuilding a fresh index? I've read various questions from people with a similar problem, although in those cases it is usually a single shard that has become null allowing them to infer what the value should be and manually fix it in ZK. In this case I have no idea what the ranges should be. This is our test cluster, and checking production I can see that the ranges don't appear to be predictable based on the shard number. I'm also not certain why it even occurred. Our test cluster only has a single replica per shard, so when a JVM is rebooted the cluster is unavailable... would that cause this? Production has 3 replicas so we can do rolling reboots. -- Regards, Shalin Shekhar Mangar.
Cluster state ranges are all null after reboot
We've got a 15 shard cluster spread across 3 hosts. This morning our puppet software rebooted them all and afterwards the 'range' for each shard has become null in zookeeper. Is there any way to restore this value short of rebuilding a fresh index? I've read various questions from people with a similar problem, although in those cases it is usually a single shard that has become null allowing them to infer what the value should be and manually fix it in ZK. In this case I have no idea what the ranges should be. This is our test cluster, and checking production I can see that the ranges don't appear to be predictable based on the shard number. I'm also not certain why it even occurred. Our test cluster only has a single replica per shard, so when a JVM is rebooted the cluster is unavailable... would that cause this? Production has 3 replicas so we can do rolling reboots.
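For context on what actually goes missing: in Solr 4.x the value lives on each shard's entry in clusterstate.json in ZooKeeper, so a healthy shard looks roughly like the sketch below. The field names follow the 4.x clusterstate layout; the hex range and host details are hypothetical and would need to match whatever your router originally assigned (a null "range" here is the symptom described above).

```json
"shard1": {
  "range": "80000000-8887ffff",
  "state": "active",
  "replicas": {
    "core_node1": {
      "state": "active",
      "base_url": "http://host1:8983/solr",
      "core": "collection1_shard1_replica1",
      "leader": "true"
    }
  }
}
```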
SolrCloud Archecture recommendations + related questions
Hi All, TL;DR version: We think we want to explore Lucene/Solr 4.0 and SolrCloud, but I’m not sure if there are any good doco/articles on how to make architecture choices for how to chop up big indexes… and what other general considerations are part of the equation? I’m throwing this post out to the public to see if any kind and knowledgeable individuals could provide some educated feedback on the options our team is currently considering for the future architecture of our Solr indexes. We have a loose collection of Solr indexes, each with a specific purpose and differing schemas and document makeup, containing just over 300 million documents with varying degrees of full-text. Our existing architecture is showing its age, as it is really just the setup used for small/medium indexes scaled upwards. The biggest individual index is around 140 million documents and currently exists as a Master/Slave setup with the Master receiving all writes in the background and the 3 load balanced slaves updating with a 5 minute poll interval. The master index is 451gb on disk and the 3 slaves are running JVMs with RAM allocations of 21gb (right now anyway). We are struggling under the traffic load and/or scale of our indexes (mainly the latter I think). We know this isn’t the best way to run things, but the index in question is a fairly new addition and each time we run into issues we tend to make small changes to improve things in the short term… like bumping the RAM allocation up, toying with poll intervals, garbage collection config etc. We’ve historically run into issues with facet queries generating a lot of bloat on some types of fields. These had to be solved through internal modifications, but I expect we’ll have to review this with the new version anyway. Related to that, there are some question marks on generating good facet data from a sharded approach.
In particular though, we are really struggling with garbage collection on the slave machines around the time that the slave/master sync occurs because of multiple copies of the index being held in memory until all searchers have de-referenced the old index. The machines typically either crash from OOM when we occasionally have a third and/or fourth copy of the index appear because of really old searchers not ‘letting go’ (hence we play with widening poll intervals), or they rarely seem to become perpetually locked in GC and have to be restarted (not 100% sure why, but large heap allocations aren’t helping, and cache warming may be a culprit). The team has lots of ideas for improving things, but given the scale of the systems it is very hard to just try things out without considerable resourcing implications. The entire ecosystem is spread across 7 machines that are resourced in the 64gb-100gb of RAM range (this is just me poking around our servers… not a thorough assessment). Each machine is running several JVMs so that for each ‘type’ of index there are typically 2-4 load balanced slaves available at any given time. One of those machines is exclusively used as the Master for all indexes and receives no search traffic… just lots of write traffic. I believe the answers to some of these are going to be very much dependent on schemas and documents, so I don’t imagine anyone can answer the questions better than we can after testing and benchmarking… but right now we are still trying to choose where to start, so broad ideas are very welcome. The kind of things we are currently thinking about: - Moving to v4.0 (currently just completed our v3.5 upgrade) to take advantage of the reduced RAM consumption: https://issues.apache.org/jira/browse/LUCENE-2380 We are hoping that this has the double-whammy impact of improving garbage collection as well. Lots of full-text data should equal lots of Strings, and thus lots of savings from this change.
- Moving to a basic sharded approach. We’ve only just started testing this, and I’m not involved, so I’m not sure what early results we’ve got… But: - Given that we’d like to move to v4.0, I believe this opens up the option of a SolrCloud implementation… my suspicion is that this is where the money is at… but I’d be happy to hear feedback (good or bad) from people that are using it in production. - Hardware; we are not certain that the current approach of a few colossal machines is any better than lots of smaller clustered machines… and it is prohibitively expensive to experiment here. We don’t think that our current setup using SSDs and fibre-channel connections would be creating too many bottlenecks on I/O, and rarely see other hardware related issues, but I’d again be curious if people have observed contradictory evidence. My suspicion is that with the changes above though, our current hardware would handle the load far better than it currently does. - Are there any sort of pros and cons documented out there for making decisions on sharding
Re: Embedded Solr Optimize under Windows
Ahh, thanks. I might try a basic commit() then and see, although it's not a huge deal for me. It occurred to me that two optimize() calls would probably leave exactly the same problem behind. On 20 May 2011 09:52, Chris Hostetter hossman_luc...@fucit.org wrote: : Thanks for the reply. I'm at home right now, or I'd try this myself, but is : the suggestion that two optimize() calls in a row would resolve the issue? It might ... I think the situations in which it happens have evolved a bit over the years as IndexWriter has gotten smarter about knowing when it really needs to touch the disk to reduce IO. There's a relatively new explicit method (IndexWriter.deleteUnusedFiles) that can force this... https://issues.apache.org/jira/browse/LUCENE-2259 ...but it's only on trunk, and there isn't any user level hook for it in Solr yet (I opened SOLR-2532 to consider adding it) -Hoss
Re: Embedded Solr Optimize under Windows
Thanks for the reply. I'm at home right now, or I'd try this myself, but is the suggestion that two optimize() calls in a row would resolve the issue? The process in question is a JVM devoted entirely to harvesting, calls optimize() then shuts down. The least processor intensive way of triggering this behaviour is desirable... perhaps a commit()? But I wouldn't have expected that to trigger a write. On 17 May 2011 10:20, Chris Hostetter hossman_luc...@fucit.org wrote: : http://code.google.com/p/solr-geonames/wiki/DeveloperInstall : It's worth noting that the build has also been run on Mac and Solaris now, : and the Solr index is about half the size. We suspect the optimize() call in : Embedded Solr is not working correctly under Windows. : : We've observed that Windows leaves lots of segments on disk and takes up : twice the volume as the other OSs. Perhaps file locking or something The problem isn't that optimize doesn't work on windows, the problem is that windows file semantics won't let files be deleted while there are open file handles -- so Lucene's Directory behavior is to leave the files on disk, and try to clean them up later. (on the next write, or next optimize call) -Hoss
Embedded Solr Optimize under Windows
Hi All, Just quick query of no particular importance to me, but we did observe this problem: http://code.google.com/p/solr-geonames/wiki/DeveloperInstall It's worth noting that the build has also been run on Mac and Solaris now, and the Solr index is about half the size. We suspect the optimize() call in Embedded Solr is not working correctly under Windows. We've observed that Windows leaves lots of segments on disk and takes up twice the volume as the other OSs. Perhaps file locking or something prevents the optimize() call from functioning. This wasn't particularly important to us since we don't run Windows for any prod systems. For that reason we haven't looked too closely, but thought it might be of interest to others... if we are even right of course :) Ta, Greg
Re: Embedded Solr constructor not returning
Sounds good. Please go ahead and make this change yourself. Done. Ta, Greg On 6 April 2011 22:52, Steven A Rowe sar...@syr.edu wrote: Hi Greg, I need the servlet API in my app for it to work, despite being command line. So adding this to the maven POM fixed everything: <dependency> <groupId>javax.servlet</groupId> <artifactId>servlet-api</artifactId> <version>2.5</version> </dependency> Perhaps this dependency could be listed on the wiki? Alongside the sample code for using embedded solr? http://wiki.apache.org/solr/Solrj Sounds good. Please go ahead and make this change yourself. FYI, the Solr 3.1 POM has a servlet-api dependency, but the scope is "provided", because the servlet container includes this dependency. When *you* are the container, you have to provide it. Steve
Embedded Solr constructor not returning
Hi All, I'm hoping this is a reasonably trivial issue, but it's frustrating me to no end. I'm putting together a tiny command line app to write data into an index. It has no web based Solr running against it; the index will be moved at a later time to have a proper server instance start for responding to queries. My problem however is I seem to have stalled on instantiating the embedded server: private SolrServer startSolr(String home) throws Exception { try { System.setProperty("solr.solr.home", home); CoreContainer.Initializer initializer = new CoreContainer.Initializer(); solrCore = initializer.initialize(); return new EmbeddedSolrServer(solrCore, ""); } catch (Exception ex) { log.error("\n===\nFailed to start Solr server\n"); throw ex; } } The constructor for the embedded server just never comes back. I've seen three or four different ways of starting the server with varying levels of complexity, and they all APPEAR to work, but still do not return. STDOUT shows the output I have largely come to expect from watching Solr start 'correctly': === Starting Solr: JNDI not configured for solr (NoInitialContextEx) using system property solr.solr.home: C:\test\harvester\solr looking for solr.xml: C:\test\harvester\solr\solr.xml Solr home set to 'C:\test\harvester\solr\' Loaded SolrConfig: solrconfig.xml Opening new SolrCore at C:\test\harvester\solr\, dataDir=C:\tf2\geonames\harvester\solr\.\data\ Reading Solr Schema Schema name=test created string: org.apache.solr.schema.StrField created date: org.apache.solr.schema.TrieDateField created sint: org.apache.solr.schema.SortableIntField created sfloat: org.apache.solr.schema.SortableFloatField created null: org.apache.solr.analysis.WhitespaceTokenizerFactory created null: org.apache.solr.analysis.LowerCaseFilterFactory created null: org.apache.solr.analysis.WhitespaceTokenizerFactory created null: org.apache.solr.analysis.LowerCaseFilterFactory created text: org.apache.solr.schema.TextField default search field is basic_name
query parser default operator is AND unique key field: id No JMX servers found, not exposing Solr information with JMX. created /update: solr.XmlUpdateRequestHandler adding lazy requestHandler: solr.CSVRequestHandler created /update/csv: solr.CSVRequestHandler Opening Searcher@11b86c7 main AutoCommit: disabled registering core: [] Registered new searcher Searcher@11b86c7 main Terminate batch job (Y/N)? y At this stage I'm grasping at straws. It appears as though the embedded instance is behaving like a proper server, waiting for a request or something. I've scrubbed the solrconfig.xml (from the Solr example download) file back to remove most entries, but perhaps I'm using the incorrect handlers/listeners for an embedded server? I'm a tad confused though, because every other time I've done this (admittedly in a servlet, not a command line app) the constructor simply returns straight away and execution of my app code continues. Any advice or suggestions would be greatly appreciated. Ta, Greg
Re: Embedded Solr constructor not returning
Hmmm, after being stuck on this for hours, I find the answer myself 15 minutes after asking for help... as usual. :) For anyone interested, and no doubt this will not be a revelation for some, I need the servlet API in my app for it to work, despite being command line. So adding this to the maven POM fixed everything: <dependency> <groupId>javax.servlet</groupId> <artifactId>servlet-api</artifactId> <version>2.5</version> </dependency> Perhaps this dependency could be listed on the wiki? Alongside the sample code for using embedded solr? http://wiki.apache.org/solr/Solrj Logback is passing along all of my logging but I suspect I'd have to add some Solr logging config before it would tell me this itself. I only stumbled on it by accident: http://osdir.com/ml/solr-user.lucene.apache.org/2009-11/msg00831.html On 6 April 2011 14:48, Greg Pendlebury greg.pendleb...@gmail.com wrote: Hi All, I'm hoping this is a reasonably trivial issue, but it's frustrating me to no end. I'm putting together a tiny command line app to write data into an index. It has no web based Solr running against it; the index will be moved at a later time to have a proper server instance start for responding to queries. My problem however is I seem to have stalled on instantiating the embedded server: private SolrServer startSolr(String home) throws Exception { try { System.setProperty("solr.solr.home", home); CoreContainer.Initializer initializer = new CoreContainer.Initializer(); solrCore = initializer.initialize(); return new EmbeddedSolrServer(solrCore, ""); } catch (Exception ex) { log.error("\n===\nFailed to start Solr server\n"); throw ex; } } The constructor for the embedded server just never comes back. I've seen three or four different ways of starting the server with varying levels of complexity, and they all APPEAR to work, but still do not return.
STDOUT shows the output I have largely come to expect from watching Solr start 'correctly': === Starting Solr: JNDI not configured for solr (NoInitialContextEx) using system property solr.solr.home: C:\test\harvester\solr looking for solr.xml: C:\test\harvester\solr\solr.xml Solr home set to 'C:\test\harvester\solr\' Loaded SolrConfig: solrconfig.xml Opening new SolrCore at C:\test\harvester\solr\, dataDir=C:\tf2\geonames\harvester\solr\.\data\ Reading Solr Schema Schema name=test created string: org.apache.solr.schema.StrField created date: org.apache.solr.schema.TrieDateField created sint: org.apache.solr.schema.SortableIntField created sfloat: org.apache.solr.schema.SortableFloatField created null: org.apache.solr.analysis.WhitespaceTokenizerFactory created null: org.apache.solr.analysis.LowerCaseFilterFactory created null: org.apache.solr.analysis.WhitespaceTokenizerFactory created null: org.apache.solr.analysis.LowerCaseFilterFactory created text: org.apache.solr.schema.TextField default search field is basic_name query parser default operator is AND unique key field: id No JMX servers found, not exposing Solr information with JMX. created /update: solr.XmlUpdateRequestHandler adding lazy requestHandler: solr.CSVRequestHandler created /update/csv: solr.CSVRequestHandler Opening Searcher@11b86c7 main AutoCommit: disabled registering core: [] Registered new searcher Searcher@11b86c7 main Terminate batch job (Y/N)? y At this stage I'm grasping at straws. It appears as though the embedded instance is behaving like a proper server, waiting for a request or something. I've scrubbed the solrconfig.xml (from the Solr example download) file back to remove most entries, but perhaps I'm using the incorrect handlers/listeners for an embedded server? I'm a tad confused though, because every other time I've done this (admittedly in a servlet, not a command line app) the constructor simply returns straight away and execution of my app code continues.
Any advice or suggestions would be greatly appreciated. Ta, Greg
Re: Batch update, order of evaluation
I can't reproduce reliably, so I'm suspecting there are issues in our code. I'm refactoring to avoid the problem entirely. Thanks for the response though Erick. Greg On 8 September 2010 21:51, Greg Pendlebury greg.pendleb...@gmail.com wrote: Thanks, I'll create a deliberate test tomorrow and feed some random data through it several times to see what happens. I'm also working on simply improving the buffer to handle the situation internally, but a few hours of testing isn't a big deal. Ta, Greg On 8 September 2010 21:41, Erick Erickson erickerick...@gmail.com wrote: This would be surprising behavior; if you can reliably reproduce this it's worth a JIRA. But (and I'm stretching a bit here) are you sure you're committing at the end of the batch AND are you sure you're looking after the commit? Here's the scenario: Your updated document is at positions 1 and 100 in your batch. Somewhere around SOLR processing document 50, an autocommit occurs, and you're looking at your results before SOLR gets around to committing document 100. Like I said, it's a stretch. To test this, you need to be absolutely sure of two things before you search: 1) the batch is finished processing 2) you've issued a commit after the last document in the batch. If you're sure of the above and still see the problem, please let us know... HTH Erick On Tue, Sep 7, 2010 at 10:32 PM, Greg Pendlebury greg.pendleb...@gmail.com wrote: Does anyone know with certainty how (or even if) order is evaluated when updates are performed by batch? Our application internally buffers solr documents for speed of ingest before sending them to the server in chunks. The XML documents sent to the solr server contain all documents in the order they arrived without any settings changed from the defaults (so overwrite = true). We are careful to avoid things like HashMaps on our side since they'd lose the order, but I can't be certain what occurs inside Solr.
Sometimes if an object has been indexed twice for various reasons it could appear twice in the buffer but the most up-to-date version is always last. I have however observed instances where the first copy of the document is indexed and differences in the second copy are missing. Does this sound likely? And if so are there any obvious settings I can play with to get the behavior I desire? I looked at: http://wiki.apache.org/solr/UpdateXmlMessages but there is no mention of order, just the overwrite flag (which I'm unsure how it is applied internally to an update message) and the deprecated duplicates flag (which I have no idea about). Would switching to SolrInputDocuments on a CommonsHttpSolrServer help? as per http://wiki.apache.org/solr/Solrj. There is no mention of order there either, however. Thanks to anyone who took the time to read this. Ta, Greg
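Since overwrite=true means a later add with the same uniqueKey replaces the earlier document, one way to sidestep the ordering question entirely is to collapse duplicates on the client before sending the batch. This is a hypothetical sketch (plain maps standing in for SolrInputDocument; nothing here is SolrJ API): a LinkedHashMap keyed on id keeps arrival order while guaranteeing the last version of a re-added document wins.

```java
import java.util.*;

public class DocBuffer {
    // Hypothetical client-side buffer: keeps documents in arrival order,
    // but re-adding an id replaces the earlier copy, so only the newest
    // version of each document ever reaches the server.
    private final Map<String, Map<String, ?>> buffer = new LinkedHashMap<>();

    void add(Map<String, ?> doc) {
        String id = (String) doc.get("id");
        buffer.remove(id);    // drop the stale copy, if any
        buffer.put(id, doc);  // re-insert at the tail: last write wins
    }

    List<Map<String, ?>> flush() {
        List<Map<String, ?>> batch = new ArrayList<>(buffer.values());
        buffer.clear();
        return batch;
    }

    public static void main(String[] args) {
        DocBuffer b = new DocBuffer();
        b.add(Map.of("id", "1", "title", "first draft"));
        b.add(Map.of("id", "2", "title", "other doc"));
        b.add(Map.of("id", "1", "title", "updated"));  // supersedes the draft
        System.out.println(b.flush());  // two docs; id 1 carries "updated"
    }
}
```

With this in place the batch never contains two versions of the same document, so whatever order Solr applies the adds in, the result is the same.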
Re: Batch update, order of evaluation
Thanks, I'll create a deliberate test tomorrow and feed some random data through it several times to see what happens. I'm also working on simply improving the buffer to handle the situation internally, but a few hours of testing isn't a big deal.

Ta,
Greg

On 8 September 2010 21:41, Erick Erickson erickerick...@gmail.com wrote:

> This would be surprising behavior; if you can reliably reproduce it, it's worth a JIRA. But (and I'm stretching a bit here) are you sure you're committing at the end of the batch AND are you sure you're looking after the commit? Here's the scenario: your updated document is at positions 1 and 100 in your batch. Somewhere around Solr processing document 50, an autocommit occurs, and you're looking at your results before Solr gets around to committing document 100. Like I said, it's a stretch.
>
> To test this, you need to be absolutely sure of two things before you search:
> 1. the batch is finished processing
> 2. you've issued a commit after the last document in the batch.
>
> If you're sure of the above and still see the problem, please let us know...
>
> HTH
> Erick
Batch update, order of evaluation
Does anyone know with certainty how (or even if) order is evaluated when updates are performed by batch? Our application internally buffers Solr documents for speed of ingest before sending them to the server in chunks. The XML documents sent to the Solr server contain all documents in the order they arrived, with no settings changed from the defaults (so overwrite = true). We are careful to avoid things like HashMaps on our side, since they'd lose the order, but I can't be certain what occurs inside Solr.

Sometimes, if an object has been indexed twice for various reasons, it can appear twice in the buffer, but the most up-to-date version is always last. I have, however, observed instances where the first copy of the document is indexed and the differences in the second copy are missing. Does this sound likely? And if so, are there any obvious settings I can play with to get the behaviour I desire?

I looked at http://wiki.apache.org/solr/UpdateXmlMessages but there is no mention of order, just the overwrite flag (which I'm unsure how it is applied internally to an update message) and the deprecated duplicates flag (which I have no idea about). Would switching to SolrInputDocuments on a CommonsHttpSolrServer help, as per http://wiki.apache.org/solr/Solrj? There is no mention of order there either, however.

Thanks to anyone who took the time to read this.

Ta,
Greg
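For reference, the XML update message described on the UpdateXmlMessages wiki page carries the overwrite flag on the &lt;add&gt; element; a batch containing a duplicated id, as described above, would look something like the following sketch (the field names are illustrative):

```xml
<add overwrite="true">
  <doc>
    <field name="id">doc1</field>
    <field name="title">first copy</field>
  </doc>
  <doc>
    <field name="id">doc2</field>
    <field name="title">some other document</field>
  </doc>
  <doc>
    <field name="id">doc1</field>
    <field name="title">updated copy, deliberately sent last</field>
  </doc>
</add>
```

The question in this thread is whether the second doc1 is guaranteed to win when both copies arrive in the same message.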
Always spellcheck (suggest)
Hi All,

If I understand correctly, the flag 'onlyMorePopular' encapsulates two independent behaviours:

1) It runs spell checking across queries that returned hits. Without the flag, spell checking is not run when results are found.
2) It limits suggestions to terms with higher frequencies.

Is there any way to get behaviour (1) without behaviour (2)? Such as another flag I'm not seeing in the doco?

The usage context is spelling suggestions for international usage. Eg. the user searches 'behaviour', and we want it to suggest the US spelling 'behavior' and vice versa. At the moment, the suggestion only works one way.

Ta,
Greg

This email (including any attached files) is confidential and is for the intended recipient(s) only. If you received this email by mistake, please, as a courtesy, tell the sender, then delete this email. The views and opinions are the originator's and do not necessarily reflect those of the University of Southern Queensland. Although all reasonable precautions were taken to ensure that this email contained no viruses at the time it was sent, we accept no liability for any losses arising from its receipt. The University of Southern Queensland is a registered provider of education with the Australian Government (CRICOS Institution Code No's. QLD 00244B / NSW 02225M)
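For context, the two behaviours being discussed sit behind the SpellCheckComponent's request parameters. A request along these lines shows where the flag lives (the handler path and field setup are assumptions about the local config; `spellcheck` and `spellcheck.onlyMorePopular` are the documented parameter names):

```
http://localhost:8983/solr/select
  ?q=behaviour
  &spellcheck=true
  &spellcheck.count=5
  &spellcheck.onlyMorePopular=true
```

With onlyMorePopular=true, 'behaviour' can suggest 'behavior' only when the suggestion has a higher document frequency than the query term, which is the one-way behaviour described in this thread.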
RE: Always spellcheck (suggest)
Thanks for the response Christian. I'll modify my original point (1) then. Is 'onlyMorePopular' the only way to return suggestions when all of the search terms are present in the dictionary (ie. correct)? Is there any way to force behaviour (1) without behaviour (2) (filtering on frequency)?

Ta,
Greg

-----Original Message-----
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Monday, 5 October 2009 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

I believe your understanding is incorrect. The first behavior you described is produced by adding the parameter spellcheck=true. Suggestions will be returned regardless of whether there are results. The only time I believe spelling suggestions might not be included is when all of the words are spelled correctly.
RE: Always spellcheck (suggest)
Thanks. I'll have to look into modifications then (I was hoping to avoid that). For clarity, though, I believe this point is slightly off:

> Adding the parameter onlyMorePopular limits the suggestions that solr can give you (to ones that return more hits than the existing query), nothing more.

The flag is definitely returning suggestions, even for 'correct' terms; they just have to be more popular 'correct' terms. Eg. 'behaviour' suggests 'behavior' because it has four times as many hits, but they are both 'correct', and the suggestion does not occur without the 'onlyMorePopular' flag set. 'behavior' will not suggest 'behaviour', however, because it is less popular.

Greg

-----Original Message-----
From: Christian Zambrano [mailto:czamb...@gmail.com]
Sent: Monday, 5 October 2009 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

Greg,

I apologize if I misunderstood your original post. I don't think there is a way you can force solr to return suggestions when all of the words are correctly spelled. Adding the parameter onlyMorePopular limits the suggestions that solr can give you (to ones that return more hits than the existing query), nothing more. In short, I believe the answer is No.