Re: edismax parsing confusion

2017-04-04 Thread Greg Pendlebury
Try declaring your mm as 1 then and see if that assumption is correct.
Default 'mm' values are complicated to describe and depend on a variety of
factors. Generally if you want it to be a certain value, just declare it.
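
For example (a SolrJ sketch only; the qf and query text are just the ones from your
earlier mail, and 'client' is assumed to be whatever SolrClient you already have):

    SolrQuery q = new SolrQuery("handbags between rs150 and rs 400");
    q.set("defType", "edismax");
    q.set("qf", "test_product^5 category_path_tf^4 product_id gender");
    q.set("mm", "1");             // state the value you want rather than inheriting a default
    q.set("debugQuery", "true");  // lets you compare the parsed query with and without mm
    QueryResponse rsp = client.query(q);
    System.out.println(rsp.getDebugMap().get("parsedquery"));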

On 5 April 2017 at 02:07, Abhishek Mishra <solrmis...@gmail.com> wrote:

> Hello guys
> sorry for the late response. @steve I am using Solr 5.2.
> @greg I am using the default mm from the config file (according to me the
> default mm is 1).
>
> Regards,
> Abhishek
>
> On Tue, Apr 4, 2017 at 5:27 AM, Greg Pendlebury <greg.pendleb...@gmail.com
> >
> wrote:
>
> > eDismax uses 'mm', so knowing what that has been set to is important, or
> if
> > it has been left unset/default you would need to consider whether 'q.op'
> > has been set. Or the default operator from the config file.
> >
> > Ta,
> > Greg
> >
> >
> > On 3 April 2017 at 23:56, Steve Rowe <sar...@gmail.com> wrote:
> >
> > > Hi Abhishek,
> > >
> > > Which version of Solr are you using?
> > >
> > > I can see that the parsed queries are different, but they’re also very
> > > similar, and there’s a lot of detail there - can you be more specific
> > about
> > > what the problem is?
> > >
> > > --
> > > Steve
> > > www.lucidworks.com
> > >
> > > > On Apr 3, 2017, at 4:54 AM, Abhishek Mishra <solrmis...@gmail.com>
> > > wrote:
> > > >
> > > > Hi all
> > > > i am running solr query with these parameter
> > > >
> > > > bf: "sum(product(new_popularity,100),if(exists(third_price),50,0))"
> > > > qf: "test_product^5 category_path_tf^4 product_id gender"
> > > > q: "handbags between rs150 and rs 400"
> > > > defType: "edismax"
> > > >
> > > > parsed query is like below one
> > > >
> > > > for q:-
> > > > (+(DisjunctionMaxQuery((category_path_tf:handbags^4.0 |
> > gender:handbag |
> > > > test_product:handbag^5.0 | product_id:handbags))
> > > > DisjunctionMaxQuery((category_path_tf:between^4.0 | gender:between |
> > > > test_product:between^5.0 | product_id:between))
> > > > +DisjunctionMaxQuery((category_path_tf:rs150^4.0 | gender:rs150 |
> > > > test_product:rs150^5.0 | product_id:rs150))
> > > > +DisjunctionMaxQuery((category_path_tf:rs^4.0 | gender:rs |
> > > > test_product:rs^5.0 | product_id:rs))
> > > > DisjunctionMaxQuery((category_path_tf:400^4.0 | gender:400 |
> > > > test_product:400^5.0 | product_id:400))) DisjunctionMaxQuery(("":"
> > > handbags
> > > > between rs150 ? rs 400")) (DisjunctionMaxQuery(("":"handbags
> > between"))
> > > > DisjunctionMaxQuery(("":"between rs150"))
> DisjunctionMaxQuery(("":"rs
> > > > 400"))) (DisjunctionMaxQuery(("":"handbags between rs150"))
> > > > DisjunctionMaxQuery(("":"between rs150"))
> > > DisjunctionMaxQuery(("":"rs150 ?
> > > > rs")) DisjunctionMaxQuery(("":"? rs 400")))
> > > > FunctionQuery(sum(product(float(new_popularity),const(
> > > 100)),if(exists(float(third_price)),const(50),const(0)/no_coord
> > > >
> > > > but for dismax parser it is working perfect:
> > > >
> > > > (+(DisjunctionMaxQuery((category_path_tf:handbags^4.0 |
> > gender:handbag |
> > > > test_product:handbag^5.0 | product_id:handbags))
> > > > DisjunctionMaxQuery((category_path_tf:between^4.0 | gender:between |
> > > > test_product:between^5.0 | product_id:between))
> > > > DisjunctionMaxQuery((category_path_tf:rs150^4.0 | gender:rs150 |
> > > > test_product:rs150^5.0 | product_id:rs150))
> > > > DisjunctionMaxQuery((product_id:and))
> > > > DisjunctionMaxQuery((category_path_tf:rs^4.0 | gender:rs |
> > > > test_product:rs^5.0 | product_id:rs))
> > > > DisjunctionMaxQuery((category_path_tf:400^4.0 | gender:400 |
> > > > test_product:400^5.0 | product_id:400))) DisjunctionMaxQuery(("":"
> > > handbags
> > > > between rs150 ? rs 400"))
> > > > FunctionQuery(sum(product(float(new_popularity),const(
> > > 100)),if(exists(float(third_price)),const(50),const(0)/no_coord
> > > >
> > > >
> > > > *according to me the difference between dismax and edismax is based on
> some
> > > > extra features plus the working of boosting functions.*
> > > >
> > > >
> > > >
> > > > Regards,
> > > > Abhishek
> > >
> > >
> >
>


Re: edismax parsing confusion

2017-04-03 Thread Greg Pendlebury
eDismax uses 'mm', so knowing what that has been set to is important, or if
it has been left unset/default you would need to consider whether 'q.op'
has been set. Or the default operator from the config file.

Ta,
Greg


On 3 April 2017 at 23:56, Steve Rowe  wrote:

> Hi Abhishek,
>
> Which version of Solr are you using?
>
> I can see that the parsed queries are different, but they’re also very
> similar, and there’s a lot of detail there - can you be more specific about
> what the problem is?
>
> --
> Steve
> www.lucidworks.com
>
> > On Apr 3, 2017, at 4:54 AM, Abhishek Mishra 
> wrote:
> >
> > Hi all
> > i am running solr query with these parameter
> >
> > bf: "sum(product(new_popularity,100),if(exists(third_price),50,0))"
> > qf: "test_product^5 category_path_tf^4 product_id gender"
> > q: "handbags between rs150 and rs 400"
> > defType: "edismax"
> >
> > parsed query is like below one
> >
> > for q:-
> > (+(DisjunctionMaxQuery((category_path_tf:handbags^4.0 | gender:handbag |
> > test_product:handbag^5.0 | product_id:handbags))
> > DisjunctionMaxQuery((category_path_tf:between^4.0 | gender:between |
> > test_product:between^5.0 | product_id:between))
> > +DisjunctionMaxQuery((category_path_tf:rs150^4.0 | gender:rs150 |
> > test_product:rs150^5.0 | product_id:rs150))
> > +DisjunctionMaxQuery((category_path_tf:rs^4.0 | gender:rs |
> > test_product:rs^5.0 | product_id:rs))
> > DisjunctionMaxQuery((category_path_tf:400^4.0 | gender:400 |
> > test_product:400^5.0 | product_id:400))) DisjunctionMaxQuery(("":"
> handbags
> > between rs150 ? rs 400")) (DisjunctionMaxQuery(("":"handbags between"))
> > DisjunctionMaxQuery(("":"between rs150")) DisjunctionMaxQuery(("":"rs
> > 400"))) (DisjunctionMaxQuery(("":"handbags between rs150"))
> > DisjunctionMaxQuery(("":"between rs150"))
> DisjunctionMaxQuery(("":"rs150 ?
> > rs")) DisjunctionMaxQuery(("":"? rs 400")))
> > FunctionQuery(sum(product(float(new_popularity),const(
> 100)),if(exists(float(third_price)),const(50),const(0)/no_coord
> >
> > but for dismax parser it is working perfect:
> >
> > (+(DisjunctionMaxQuery((category_path_tf:handbags^4.0 | gender:handbag |
> > test_product:handbag^5.0 | product_id:handbags))
> > DisjunctionMaxQuery((category_path_tf:between^4.0 | gender:between |
> > test_product:between^5.0 | product_id:between))
> > DisjunctionMaxQuery((category_path_tf:rs150^4.0 | gender:rs150 |
> > test_product:rs150^5.0 | product_id:rs150))
> > DisjunctionMaxQuery((product_id:and))
> > DisjunctionMaxQuery((category_path_tf:rs^4.0 | gender:rs |
> > test_product:rs^5.0 | product_id:rs))
> > DisjunctionMaxQuery((category_path_tf:400^4.0 | gender:400 |
> > test_product:400^5.0 | product_id:400))) DisjunctionMaxQuery(("":"
> handbags
> > between rs150 ? rs 400"))
> > FunctionQuery(sum(product(float(new_popularity),const(
> 100)),if(exists(float(third_price)),const(50),const(0)/no_coord
> >
> >
> > *according to me the difference between dismax and edismax is based on some
> > extra features plus the working of boosting functions.*
> >
> >
> >
> > Regards,
> > Abhishek
>
>


Re: Edismax query parsing in Solr 4 vs Solr 6

2016-11-12 Thread Greg Pendlebury
This has come up a lot on the lists lately. Keep in mind that edismax
parses your query using additional parameters such as 'mm' and 'q.op'. It is
the handling of these parameters (and the selection of default values)
which has changed between versions to address a few functionality gaps.

The most common issue I've seen is where users were not setting those
values and relying on the defaults. You might now need to set them
explicitly to return to the desired behaviour.

I can't see all of your configuration, but I'm guessing the important one
here is 'q.op', which was previously hard coded to 'OR', irrespective of
either parameters or solrconfig. Try setting that to 'OR' explicitly...
maybe you have your default operator set to 'AND' in solrconfig and that is
now being applied? The other option is 'mm', which I suspect should be set
to '0' unless you have some reason to want it. If it was set to '100%' it
might insert the additional '+' flags, but it can also show up as a '~'
operator on the end.
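
If it helps, a rough sketch of what I mean against the snippet quoted below (the
param values are only a guess at what you want; they could equally be set as
defaults in solrconfig):

    // Force the operator-related params instead of relying on schema/solrconfig defaults.
    ModifiableSolrParams params = new ModifiableSolrParams(req.getParams());
    params.set("q.op", "OR");  // edismax before 5.5 effectively always behaved as q.op=OR
    params.set("mm", "0");     // and don't let a default mm derived from q.op require every clause
    req.setParams(params);

    Query query = QParser.getParser(userQuery, "edismax", req).getQuery();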

Ta,
Greg

On 8 November 2016 at 22:13, Max Bridgewater 
wrote:

> I am migrating a solr based app from Solr 4 to Solr 6.  One of the
> discrepancies I am noticing is around edismax query parsing. My code makes
> the following call:
>
>
>  userQuery="+(title:shirts isbn:shirts) +(id:20446 id:82876)"
>   Query query=QParser.getParser(userQuery, "edismax", req).getQuery();
>
>
> With Solr 4, query becomes:
>
> +(+(title:shirt isbn:shirts) +(id:20446 id:82876))
>
> With Solr 6 it however becomes:
>
> +(+(+title:shirt +isbn:shirts) +(+id:20446 +id:82876))
>
> Digging deeper, it appears that parseOriginalQuery() in
> ExtendedDismaxQParser is adding those additional + signs.
>
>
> Is there a way to prevent this altering of queries?
>
> Thanks,
> Max.
>


Re: changed query parsing between 4.10.4 and 5.5.3?

2016-09-18 Thread Greg Pendlebury
Hi Bernd,

I was referring to assessing 5.5's behaviour based on a comparison to 4.10
when giving it the same inputs and configuration. Maybe I am wrong, and I
apologise if so. I am only seeing fragments of the situation each time, so
it is hard to be sure. Certainly it looks like a case of 'mm' set to 100% in
this example, but I am basing that on previous emails about your config.

Since you seem to be comfortable moving the code around, might I suggest you
try looking in the TestExtendedDismaxParser class? It is a nice, portable
way of demonstrating the behaviour you believe is wrong. You can put some
fake documents at the top in the index() method, then add a new test method
(copy one of the existing ones like testDefaultOperatorWithMm()) to show
the config that is behaving strangely with a query.
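
Something along these lines, for example (a rough sketch only; the field name,
the indexed documents and the expected count would all need to line up with
whatever you put in index()):

    @Test
    public void testBoostedBooleanWithPhrase() throws Exception {
      assertQ("boolean + phrase query should still match",
          req("q", "(text:(star AND trek AND wars)^200 OR text:(\"star trek wars\")^350)",
              "defType", "edismax",
              "mm", "100%",
              "debugQuery", "true"),
          "//*[@numFound='1']");
    }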

If there is something strange going on we should be able to get to the
bottom of it with some reproduction steps.

Ta,
Greg



On 15 September 2016 at 16:28, Bernd Fehling <bernd.fehl...@uni-bielefeld.de
> wrote:

> Your statement "using the old behaviour as a baseline for checking the
> correctness of 5.5 behaviour" might be a point of view.
>
> Let me give an example, my query:
> q=(text:(star AND trek AND wars)^200 OR text:("star trek wars")^350)
> results to 159 hits from 99 million records in the index (version 4.10.4).
> I checked all 159 hits, they are correct.
>
> The same query to the same indexed content build with 5.5.3 and also
> having 99 million records results in 0 (zero) hits.
>
> What do you think about this result?
>
> By the way, after copying ExtendedDismaxQParser from 4.10.4 to 5.5.3 I get
> now 137 hits. I really don't care about the difference, but at least
> I get some hits out of 99 million records and they are correct.
>
> Regards,
> Bernd
>
>
> Am 15.09.2016 um 01:41 schrieb Greg Pendlebury:
> > I'm sorry that's been your experience Bernd. If you do manage to find
> some
> > time it would be good to see some details on these bugs. It looks at the
> > moment as though this is a matter of perception when using the old
> > behaviour as a baseline for checking the correctness of 5.5 behaviour.
> >
> > Ta,
> > Greg
> >
> >
> > On 15 September 2016 at 01:27, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> Perhaps https://issues.apache.org/jira/browse/SOLR-8812 and related?
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Sep 13, 2016 at 11:37 PM, Bernd Fehling
> >> <bernd.fehl...@uni-bielefeld.de> wrote:
> >>> Hi Greg,
> >>>
> >>> after trying several hours with all combinations of parameters and not
> >>> getting any useful search result with complex search terms and edismax
> >>> I finally copied o.a.s.s.ExtendedDismaxQParser.java from version
> 4.10.4
> >>> to 5.5.3 and did a little modification in o.a.s.u.SolrPluginUtils.java.
> >>>
> >>> Now it is searching correct and getting logical and valid search
> results
> >>> with any kind of complex search.
> >>> Problem solved.
> >>>
> >>> But still, the edismax, at least of 5.5.3, has some bugs.
> >>> If I get time I will look into this but right now my problem is solved
> >>> and the customers and users are happy.
> >>>
> >>> I hope that this buggy edismax version is not used in solr 6.x
> otherwise
> >> you
> >>> have the same problems there.
> >>>
> >>> Regards
> >>> Bernd
> >>>
> >>>
> >>> Am 12.09.2016 um 05:10 schrieb Greg Pendlebury:
> >>>> Hi Bernd,
> >>>>
> >>>> "From my point of view the old parsing behavior was correct.
> >>>> If searching for a term without operator it is always OR, otherwise
> >>>> you can add "+" or "-" to modify that. Now with q.op AND it is
> >>>> modified to "+" as a MUST."
> >>>>
> >>>> It is correct in both cases. q.op dictates (for that query) what
> default
> >>>> operator to use when none is provided, and it is used as a priority
> over
> >>>> the system whole 'defaultOperator'. In either case, if you ask it to
> use
> >>>> OR, it uses it; if you ask it to use AND, it uses it. The behaviour
> from
> >>>> 4.10 that was changed (arguably fixed, although I know that is a
> >> debatable
> >>>> point) was that you asked it to use AND, and it ignored you
> >> (irrespective
> >>

Re: changed query parsing between 4.10.4 and 5.5.3?

2016-09-14 Thread Greg Pendlebury
I'm sorry that's been your experience Bernd. If you do manage to find some
time it would be good to see some details on these bugs. It looks at the
moment as though this is a matter of perception when using the old
behaviour as a baseline for checking the correctness of 5.5 behaviour.

Ta,
Greg


On 15 September 2016 at 01:27, Erick Erickson <erickerick...@gmail.com>
wrote:

> Perhaps https://issues.apache.org/jira/browse/SOLR-8812 and related?
>
> Best,
> Erick
>
> On Tue, Sep 13, 2016 at 11:37 PM, Bernd Fehling
> <bernd.fehl...@uni-bielefeld.de> wrote:
> > Hi Greg,
> >
> > after trying several hours with all combinations of parameters and not
> > getting any useful search result with complex search terms and edismax
> > I finally copied o.a.s.s.ExtendedDismaxQParser.java from version 4.10.4
> > to 5.5.3 and did a little modification in o.a.s.u.SolrPluginUtils.java.
> >
> > Now it is searching correct and getting logical and valid search results
> > with any kind of complex search.
> > Problem solved.
> >
> > But still, the edismax, at least of 5.5.3, has some bugs.
> > If I get time I will look into this but right now my problem is solved
> > and the customers and users are happy.
> >
> > I hope that this buggy edismax version is not used in solr 6.x otherwise
> you
> > have the same problems there.
> >
> > Regards
> > Bernd
> >
> >
> > Am 12.09.2016 um 05:10 schrieb Greg Pendlebury:
> >> Hi Bernd,
> >>
> >> "From my point of view the old parsing behavior was correct.
> >> If searching for a term without operator it is always OR, otherwise
> >> you can add "+" or "-" to modify that. Now with q.op AND it is
> >> modified to "+" as a MUST."
> >>
> >> It is correct in both cases. q.op dictates (for that query) what default
> >> operator to use when none is provided, and it is used as a priority over
> >> the system whole 'defaultOperator'. In either case, if you ask it to use
> >> OR, it uses it; if you ask it to use AND, it uses it. The behaviour from
> >> 4.10 that was changed (arguably fixed, although I know that is a
> debatable
> >> point) was that you asked it to use AND, and it ignored you
> (irrespective
> >> of whether you used defaultOperator or q.op). There are a few subtle
> >> distinctions that are being missed (like the difference between the
> boolean
> >> operators and the OCCURS flags that you are talking about), but they
> are
> >> not going to change the outcome.
> >>
> >> 8812 related to users who had been historically setting the q.op
> parameter
> >> to influence the downstream default selection of 'mm' (If you don't
> provide
> >> 'mm' it is set for you based on 'q.op') instead of directly setting the
> >> 'mm' value themselves. But again in this case, you're setting 'mm'
> anyway,
> >> so it shouldn't be relevant.
> >>
> >> Ta,
> >> Greg
> >>
> >> On 9 September 2016 at 16:44, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de>
> >> wrote:
> >>
> >>> Hi Greg,
> >>>
> >>> thanks a lot, thats it.
> >>> After setting q.op to OR it works _nearly_ as before with 4.10.4.
> >>>
> >>> But how stupid this?
> >>> I have in my schema <solrQueryParser defaultOperator="AND"/>
> >>> and also had q.op to AND to make sure my default _is_ AND,
> >>> meant as conjunction between terms.
> >>> But now I have q.op to OR and defaultOperator in schema to AND
> >>> to just get _nearly_ my old behavior back.
> >>>
> >>> schema has following comment:
> >>> "... The default is OR, which is generally assumed so it is
> >>> not a good idea to change it globally here.  The "q.op" request
> >>> parameter takes precedence over this. ..."
> >>>
> >>> What I don't understand is why they change some major internals
> >>> and don't give any notice about how to keep old parsing behavior.
> >>>
> >>> From my point of view the old parsing behavior was correct.
> >>> If searching for a term without operator it is always OR, otherwise
> >>> you can add "+" or "-" to modify that. Now with q.op AND it is
> >>> modified to "+" as a MUST.
> >>>
> >>> I still get some differences in search results between 4.10.4 and
> 5.5.3.
> >>> What other side effects has this change of q.op from AND to OR in
> >>> other parts of query handling, parsing and searching?

Re: changed query parsing between 4.10.4 and 5.5.3?

2016-09-11 Thread Greg Pendlebury
I'm not certain what is going on with your boost. It doesn't seem related
to those tickets as far as I can see, but I note it comes back in the
'parsedquery_toString' step below that. Perhaps the debug output has a
display bug?

The fact that 4.10 was not applying 'mm' in this context relates the other
part of 2649. Because you provided an explicit OR operator inside this
particular search string the 'mm' parameter was ignored. This confusing(?)
behaviour was the primary reason 2649 was originally opened. Under 5.5 it
was applied, so you get the '~2' operator. Because you explicitly set the
'mm' parameter to 100% it required both of your 'should' OCCUR terms to be
present.

Are you setting mm to 100% consciously because you want every term to
always apply, or was it just a leftover setting?

I can see that if you were relying on this behaviour it might appear
disruptive, but what I would hope is that you can see that 5.5 did
everything you asked it to, following clear and consistent rules for your
parameters to influence the output. But 4.10 was following some internal,
rarely/poorly documented behaviours that people had just learned to live
with. Some parameters did nothing, other parameters influenced yet more
parameters in confusing ways. Those old behaviours had various pitfalls
that created use cases edismax could not support so it got cleaned up.

If you want edismax to behave (mostly) the old way, set q.op to 'OR' and
'mm' to whatever you would like. I say 'mostly' because 'mm' will now be
paid attention to if you add your own operators. But if you really, really
want it to ignore that you can always wrap your search in parentheses to
group all the terms into a single clause. 'mm' only applies to top level
clauses and always has.

If you want to use edismax for simpler boolean search logic, set 'q.op' to
whatever you would like and 'mm' to something like 0 or 1 so that is
doesn't screw with your boolean ORs.

Ta,
Greg



On 9 September 2016 at 20:00, Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
wrote:

> After some more testing it feels like the parsing in 5.5.3 is _really_
> messed up.
>
> Query version 4.10.4:
>
> <str name="rawquerystring">
>   (text:(star AND trek AND wars)^200 OR text:("star trek wars")^350)
> </str>
> <str name="querystring">
>   (text:(star AND trek AND wars)^200 OR text:("star trek wars")^350)
> </str>
> <str name="parsedquery">
>   (+(((+text:star +text:trek +text:war)^200.0) PhraseQuery(text:"star trek
> war"^350.0)))/no_coord
> </str>
> <str name="parsedquery_toString">
>   +(((+text:star +text:trek +text:war)^200.0) text:"star trek war"^350.0)
> </str>
>
>
> Same query version 5.5.3:
>
> <str name="rawquerystring">
>   (text:(star AND trek AND wars)^200 OR text:("star trek wars")^350)
> </str>
> <str name="querystring">
>   (text:(star AND trek AND wars)^200 OR text:("star trek wars")^350)
> </str>
> <str name="parsedquery">
>   (+((+text:star +text:trek +text:war^200.0 PhraseQuery(text:"star trek
> war"))~2))/no_coord
> </str>
> <str name="parsedquery_toString">
>   +(((+text:star +text:trek +text:war)^200.0 text:"star trek war"^350.0)~2)
> </str>
>
> As you can see version 5.5.3 "parsedquery" is different to version 4.10.4.
>
> And why is parsedquery different to parsedquery_toString in version 5.5.3?
>
> Where is my second boost in "parsedquery" of 5.5.3?
>
>
> Bernd
>
>
>
> Am 09.09.2016 um 08:44 schrieb Bernd Fehling:
> > Hi Greg,
> >
> > thanks a lot, thats it.
> > After setting q.op to OR it works _nearly_ as before with 4.10.4.
> >
> > But how stupid this?
> > I have in my schema <solrQueryParser defaultOperator="AND"/>
> > and also had q.op to AND to make sure my default _is_ AND,
> > meant as conjunction between terms.
> > But now I have q.op to OR and defaultOperator in schema to AND
> > to just get _nearly_ my old behavior back.
> >
> > schema has following comment:
> > "... The default is OR, which is generally assumed so it is
> > not a good idea to change it globally here.  The "q.op" request
> > parameter takes precedence over this. ..."
> >
> > What I don't understand is why they change some major internals
> > and don't give any notice about how to keep old parsing behavior.
> >
> > From my point of view the old parsing behavior was correct.
> > If searching for a term without operator it is always OR, otherwise
> > you can add "+" or "-" to modify that. Now with q.op AND it is
> > modified to "+" as a MUST.
> >
> > I still get some differences in search results between 4.10.4 and 5.5.3.
> > What other side effects has this change of q.op from AND to OR in
> > other parts of query handling, parsing and searching?
> >
> > Regards
> > Bernd
> >
> > Am 09.09.2016 um 05:43 schrieb Greg Pendlebury:
> >> I forgot to mention the tickets

Re: changed query parsing between 4.10.4 and 5.5.3?

2016-09-11 Thread Greg Pendlebury
Hi Bernd,

"From my point of view the old parsing behavior was correct.
If searching for a term without operator it is always OR, otherwise
you can add "+" or "-" to modify that. Now with q.op AND it is
modified to "+" as a MUST."

It is correct in both cases. q.op dictates (for that query) what default
operator to use when none is provided, and it is used as a priority over
the system whole 'defaultOperator'. In either case, if you ask it to use
OR, it uses it; if you ask it to use AND, it uses it. The behaviour from
4.10 that was changed (arguably fixed, although I know that is a debatable
point) was that you asked it to use AND, and it ignored you (irrespective
of whether you used defaultOperator or q.op). The are a few subtle
distinctions that are being missed (like the difference between the boolean
operators and the OCCURS flags that your are talking about), but they are
not going to change the outcome.

8812 related to users who had been historically setting the q.op parameter
to influence the downstream default selection of 'mm' (If you don't provide
'mm' it is set for you based on 'q.op') instead of directly setting the
'mm' value themselves. But again in this case, you're setting 'mm' anyway,
so it shouldn't be relevant.

Ta,
Greg

On 9 September 2016 at 16:44, Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
wrote:

> Hi Greg,
>
> thanks a lot, thats it.
> After setting q.op to OR it works _nearly_ as before with 4.10.4.
>
> But how stupid this?
> I have in my schema <solrQueryParser defaultOperator="AND"/>
> and also had q.op to AND to make sure my default _is_ AND,
> meant as conjunction between terms.
> But now I have q.op to OR and defaultOperator in schema to AND
> to just get _nearly_ my old behavior back.
>
> schema has following comment:
> "... The default is OR, which is generally assumed so it is
> not a good idea to change it globally here.  The "q.op" request
> parameter takes precedence over this. ..."
>
> What I don't understand is why they change some major internals
> and don't give any notice about how to keep old parsing behavior.
>
> From my point of view the old parsing behavior was correct.
> If searching for a term without operator it is always OR, otherwise
> you can add "+" or "-" to modify that. Now with q.op AND it is
> modified to "+" as a MUST.
>
> I still get some differences in search results between 4.10.4 and 5.5.3.
> What other side effects has this change of q.op from AND to OR in
> other parts of query handling, parsing and searching?
>
> Regards
> Bernd
>
> Am 09.09.2016 um 05:43 schrieb Greg Pendlebury:
> > I forgot to mention the tickets:
> > SOLR-2649 and SOLR-8812
> >
> > On 9 September 2016 at 13:38, Greg Pendlebury <greg.pendleb...@gmail.com
> >
> > wrote:
> >
> >> Under 4.10 q.op was ignored by the edismax parser and always forced to
> OR.
> >> 5.5 is looking at the q.op=AND you requested.
> >>
> >> There are also some changes to the default values selected for mm, but I
> >> doubt those apply here since you are setting it explicitly.
> >>
> >> On 8 September 2016 at 00:35, Mikhail Khludnev <m...@apache.org> wrote:
> >>
> >>> I suppose
> >>>+((text:star text:trek)~2)
> >>> and
> >>>   +(+text:star +text:trek)
> >>> are equal. mm=2 is equal to +foo +bar
> >>>
> >>> On Wed, Sep 7, 2016 at 10:52 AM, Bernd Fehling <
> >>> bernd.fehl...@uni-bielefeld.de> wrote:
> >>>
> >>>> Hi list,
> >>>>
> >>>> while going from SOLR 4.10.4 to 5.5.3 I noticed a change in query
> >>> parsing.
> >>>> 4.10.4
> >>>> <str name="rawquerystring">text:star text:trek</str>
> >>>>   <str name="querystring">text:star text:trek</str>
> >>>>   <str name="parsedquery">(+((text:star text:trek)~2))/no_coord</str>
> >>>>   <str name="parsedquery_toString">+((text:star text:trek)~2)</str>
> >>>>
> >>>> 5.5.3
> >>>> <str name="rawquerystring">text:star text:trek</str>
> >>>>   <str name="querystring">text:star text:trek</str>
> >>>>   <str name="parsedquery">(+(+text:star +text:trek))/no_coord</str>
> >>>>   <str name="parsedquery_toString">+(+text:star +text:trek)</str>
> >>>>
> >>>> There are very many new features and changes between this two
> versions.
> >>>> It looks like a change in query parsing.
> >>>> Can someone point me to the solr or lucene jira about the changes?
> >>>> Or even give a hint how to get my "old" query parsing back?
> >>>>
> >>>> Regards
> >>>> Bernd
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Sincerely yours
> >>> Mikhail Khludnev
> >>>
>


Re: changed query parsing between 4.10.4 and 5.5.3?

2016-09-08 Thread Greg Pendlebury
I forgot to mention the tickets:
SOLR-2649 and SOLR-8812

On 9 September 2016 at 13:38, Greg Pendlebury <greg.pendleb...@gmail.com>
wrote:

> Under 4.10 q.op was ignored by the edismax parser and always forced to OR.
> 5.5 is looking at the q.op=AND you requested.
>
> There are also some changes to the default values selected for mm, but I
> doubt those apply here since you are setting it explicitly.
>
> On 8 September 2016 at 00:35, Mikhail Khludnev <m...@apache.org> wrote:
>
>> I suppose
>>+((text:star text:trek)~2)
>> and
>>   +(+text:star +text:trek)
>> are equal. mm=2 is equal to +foo +bar
>>
>> On Wed, Sep 7, 2016 at 10:52 AM, Bernd Fehling <
>> bernd.fehl...@uni-bielefeld.de> wrote:
>>
>> > Hi list,
>> >
>> > while going from SOLR 4.10.4 to 5.5.3 I noticed a change in query
>> parsing.
>> > 4.10.4
>> > <str name="rawquerystring">text:star text:trek</str>
>> >   <str name="querystring">text:star text:trek</str>
>> >   <str name="parsedquery">(+((text:star text:trek)~2))/no_coord</str>
>> >   <str name="parsedquery_toString">+((text:star text:trek)~2)</str>
>> >
>> > 5.5.3
>> > <str name="rawquerystring">text:star text:trek</str>
>> >   <str name="querystring">text:star text:trek</str>
>> >   <str name="parsedquery">(+(+text:star +text:trek))/no_coord</str>
>> >   <str name="parsedquery_toString">+(+text:star +text:trek)</str>
>> >
>> > There are very many new features and changes between this two versions.
>> > It looks like a change in query parsing.
>> > Can someone point me to the solr or lucene jira about the changes?
>> > Or even give a hint how to get my "old" query parsing back?
>> >
>> > Regards
>> > Bernd
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>
>


Re: changed query parsing between 4.10.4 and 5.5.3?

2016-09-08 Thread Greg Pendlebury
Under 4.10 q.op was ignored by the edismax parser and always forced to OR.
5.5 is looking at the q.op=AND you requested.

There are also some changes to the default values selected for mm, but I
doubt those apply here since you are setting it explicitly.

On 8 September 2016 at 00:35, Mikhail Khludnev  wrote:

> I suppose
>+((text:star text:trek)~2)
> and
>   +(+text:star +text:trek)
> are equal. mm=2 is equal to +foo +bar
>
> On Wed, Sep 7, 2016 at 10:52 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
> > Hi list,
> >
> > while going from SOLR 4.10.4 to 5.5.3 I noticed a change in query
> parsing.
> > 4.10.4
> > <str name="rawquerystring">text:star text:trek</str>
> >   <str name="querystring">text:star text:trek</str>
> >   <str name="parsedquery">(+((text:star text:trek)~2))/no_coord</str>
> >   <str name="parsedquery_toString">+((text:star text:trek)~2)</str>
> >
> > 5.5.3
> > <str name="rawquerystring">text:star text:trek</str>
> >   <str name="querystring">text:star text:trek</str>
> >   <str name="parsedquery">(+(+text:star +text:trek))/no_coord</str>
> >   <str name="parsedquery_toString">+(+text:star +text:trek)</str>
> >
> > There are very many new features and changes between this two versions.
> > It looks like a change in query parsing.
> > Can someone point me to the solr or lucene jira about the changes?
> > Or even give a hint how to get my "old" query parsing back?
> >
> > Regards
> > Bernd
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
>


Re: After Solr 5.5, mm parameter doesn't work properly

2016-06-02 Thread Greg Pendlebury
I think the confusion stems from the legacy implementation partially
conflating q.op with mm for users, when they are very different things.
q.op tells Solr how to insert boolean operators before they are converted
into occurs flags, and then downstream, mm applies on _only_ the SHOULD
occurs flags, not MUST or NOT flags.

So if the user is setting mm=2, they are asking for a minimum of 2 of the
SHOULD clauses to be found, not 2 of ALL clauses. mm has absolutely nothing
to do with q.op other than that (because of the implementation) q.op is used
to derive a default mm value when mm is not explicitly set.
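
As a tiny illustration (SolrJ, with an illustrative field and terms and an existing
'client'; the comments describe roughly what I would expect, not exact output):

    SolrQuery q = new SolrQuery("star trek wars");
    q.set("defType", "edismax");
    q.set("qf", "text");
    q.set("mm", "2");
    q.set("debugQuery", "true");

    q.set("q.op", "OR");   // three SHOULD clauses, so mm=2 has something to count, e.g. ((text:star text:trek text:wars)~2)
    System.out.println(client.query(q).getDebugMap().get("parsedquery"));

    q.set("q.op", "AND");  // three MUST clauses, so there are no SHOULD clauses left for mm to count
    System.out.println(client.query(q).getDebugMap().get("parsedquery"));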

The legacy implementation has situations where it was not possible to
generate the search you wanted because of the conflation, hence why
SOLR-2649 was so popular. I fully acknowledge that there are cases where
the change is disrupting users that (for whatever reason) are/were not
necessarily aware of what the parameters they are using actually do, or
users that were very aware, but forced to rely on non-intuitive settings
to work around the behaviour eDismax had. SOLR-8812 (although not relevant
to the OP) goes part way towards helping the former users, but the latter
will want to adjust their parameters to be explicit now instead of
leveraging a workaround.

I haven't yet seen a use case where the final solution we put in for
SOLR-2649 does not work, but I have seen lots of user parameters used that
Solr handles perfectly... just in a way that the user did not expect. I
suspect this is mainly because the topic and the implementation are fairly
technically dense (from q.op, then to boolean to occurs conversion, then
finally to mm) and difficult to explain and document accurately for an end
user.

I am writing this in a rush sorry, to go collect a child from school.

Ta,
Greg


On 2 June 2016 at 19:08, Jan Høydahl <jan@cominvent.com> wrote:

> [Aside] Your quote style is confusing, leaving my lines unquoted and your
> new lines quoted?? [/Aside]
>
> > So in relation to the OP's sample queries I was pointing out that
> 'q.op=OR
> > + mm=2' and 'q,op=AND + mm=2' are treated as identical queries by Solr
> 5.4,
> > but 5.5+ will manipulate the occurs flags differently before it applies
> mm
> > afterwards... because that is what q.op does.
>
> If a user explicitly says mm=2, then the users intent is that he should
> neither have pure OR (no clauses required) nor pure AND (all clauses
> required),
> but exactly two clauses required.
>
> So I think we need to go back to a solution where q.op technically
> stays as OR for custom mm. How that would affect queries with explicit
> operators
> I don’t know...
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 2. jun. 2016 kl. 05.12 skrev Greg Pendlebury <greg.pendleb...@gmail.com
> >:
> >
> > I would describe that subtly differently, and I think it is where the
> > difference lies:
> >
> > "Then from 4.x it did not care about q.op if mm was set explicitly"
> >>> I agree. q.op was not actually used in the query, but rather as a way
> of
> > inferring the default mm value. eDismax still ignored whatever q.op was
> set
> > and built your query operators (ie. the occurs flags) using q.op=OR.
> >
> > "And from 5.5 it seems as q.op does something even if mm is set..."
> >>> Yes, although I think it is the words 'even if' drawing too strong a
> > relationship between the two parameters. q.op has a function of its own,
> > and that now functions as it 'should' (opinionated, I know) in the query
> > construction, and continues to influence the default value of mm if it
> has
> > not been explicitly set. SOLR-8812 further evolves that influence by
> trying
> > to improve backwards compatibility for users who were not explicitly
> > setting mm, and only ever changed 'q.op' despite it being a step removed
> > from the actual parameter they were trying to manipulate.
> >
> > So in relation to the OP's sample queries I was pointing out that
> 'q.op=OR
> > + mm=2' and 'q,op=AND + mm=2' are treated as identical queries by Solr
> 5.4,
> > but 5.5+ will manipulate the occurs flags differently before it applies
> mm
> > afterwards... because that is what q.op does.
> >
> >
> > On 2 June 2016 at 07:13, Jan Høydahl <jan@cominvent.com> wrote:
> >
> >> Edismax used to default to mm=100% and not care about q.op at all
> >>
> >> Then from 4.x it did not care about q.op if mm was set explicitly,
> >> but if mm was not set, then q.op=OR —> mm=0%, q.op=AND —> mm=100%
> >>
> >> And from 5.5 it seems as q.op does something even if mm is set...
> >>
> >

Re: After Solr 5.5, mm parameter doesn't work properly

2016-06-01 Thread Greg Pendlebury
I would describe that subtly differently, and I think it is where the
difference lies:

"Then from 4.x it did not care about q.op if mm was set explicitly"
>> I agree. q.op was not actually used in the query, but rather as a way of
inferring the default mm value. eDismax still ignored whatever q.op was set
and built your query operators (ie. the occurs flags) using q.op=OR.

"And from 5.5 it seems as q.op does something even if mm is set..."
>> Yes, although I think it is the words 'even if' drawing too strong a
relationship between the two parameters. q.op has a function of its own,
and that now functions as it 'should' (opinionated, I know) in the query
construction, and continues to influence the default value of mm if it has
not been explicitly set. SOLR-8812 further evolves that influence by trying
to improve backwards compatibility for users who were not explicitly
setting mm, and only ever changed 'q.op' despite it being a step removed
from the actual parameter they were trying to manipulate.

So in relation to the OP's sample queries I was pointing out that 'q.op=OR
+ mm=2' and 'q,op=AND + mm=2' are treated as identical queries by Solr 5.4,
but 5.5+ will manipulate the occurs flags differently before it applies mm
afterwards... because that is what q.op does.


On 2 June 2016 at 07:13, Jan Høydahl <jan@cominvent.com> wrote:

> Edismax used to default to mm=100% and not care about q.op at all
>
> Then from 4.x it did not care about q.op if mm was set explicitly,
> but if mm was not set, then q.op=OR —> mm=0%, q.op=AND —> mm=100%
>
> And from 5.5 it seems as q.op does something even if mm is set...
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 1. jun. 2016 kl. 23.05 skrev Greg Pendlebury <greg.pendleb...@gmail.com
> >:
> >
> > But isn't that the default value? In this case the OP is setting mm
> > explicitly to 2.
> >
> > Will have to look at those code links more thoroughly at work this
> morning.
> > Apologies if I am wrong.
> >
> > Ta,
> > Greg
> >
> > On Wednesday, 1 June 2016, Jan Høydahl <jan@cominvent.com> wrote:
> >
> >>> 1. jun. 2016 kl. 03.47 skrev Greg Pendlebury <
> greg.pendleb...@gmail.com
> >> <javascript:;>>:
> >>
> >>> I don't think it is 8812. q.op was completely ignored by edismax prior
> to
> >>> 5.5, so it is not mm that changed.
> >>
> >> That is not the case. Prior to 5.5, mm would be automatically set to
> 100%
> >> if q.op==AND
> >> See https://issues.apache.org/jira/browse/SOLR-1889 and
> >> https://svn.apache.org/viewvc?view=revision&revision=950710
> >>
> >> Jan
>
>


Re: After Solr 5.5, mm parameter doesn't work properly

2016-06-01 Thread Greg Pendlebury
But isn't that the default value? In this case the OP is setting mm
explicitly to 2.

Will have to look at those code links more thoroughly at work this morning.
Apologies if I am wrong.

Ta,
Greg

On Wednesday, 1 June 2016, Jan Høydahl <jan@cominvent.com> wrote:

> > 1. jun. 2016 kl. 03.47 skrev Greg Pendlebury <greg.pendleb...@gmail.com
> <javascript:;>>:
>
> > I don't think it is 8812. q.op was completely ignored by edismax prior to
> > 5.5, so it is not mm that changed.
>
> That is not the case. Prior to 5.5, mm would be automatically set to 100%
> if q.op==AND
> See https://issues.apache.org/jira/browse/SOLR-1889 and
> https://svn.apache.org/viewvc?view=revision&revision=950710
>
> Jan


Re: After Solr 5.5, mm parameter doesn't work properly

2016-05-31 Thread Greg Pendlebury
I don't think it is 8812. q.op was completely ignored by edismax prior to
5.5, so it is not mm that changed.

If you do the same 5.4 query with q.op=OR I suspect it will not change the
debug query at all.

On 30 May 2016 at 21:07, Jan Høydahl  wrote:

> Hi,
>
> This may be related to SOLR-8812, but still different. Please file a JIRA
> issue for this.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 29. mai 2016 kl. 18.20 skrev Issei Nishigata :
> >
> > Hi,
> >
> > “mm" parameter does not work properly, when I set "q.op=AND” after Solr
> 5.5.
> > In Solr 5.4, mm parameter works expectedly with the following setting.
> >
> > ---
> > [schema]
> > 
> >   
> >  maxGramSize="2"/>
> >   
> > 
> >
> >
> > [request]
> >
> http://localhost:8983/solr/collection1/select?defType=edismax&q.op=AND&mm=2&q=solar
> > —
> >
> > After Solr 5.5, the result will not be the same as Solr 5.4.
> > Has the setting of mm parameter specs, or description of file setting
> changed?
> >
> >
> > [Solr 5.4]
> > 
> > ...
> >   <lst name="params">
> >     <str name="mm">2</str>
> >     <str name="q">solar</str>
> >     <str name="defType">edismax</str>
> >     <str name="q.op">AND</str>
> >   </lst>
> > ...
> > 
> >   
> > 0
> > 
> >   solr
> > 
> >   
> > 
> > <lst name="debug">
> >   <str name="rawquerystring">solar</str>
> >   <str name="querystring">solar</str>
> >   <str name="parsedquery">
> >   (+DisjunctionMaxQuery(((text:so text:ol text:la
> text:ar)~2)))/no_coord
> >   </str>
> >   <str name="parsedquery_toString">+(((text:so text:ol text:la
> text:ar)~2))</str>
> >   ...
> > </lst>
> >
> >
> >
> > [Solr 6.0.1]
> >
> > 
> > ...
> >   <lst name="params">
> >     <str name="mm">2</str>
> >     <str name="q">solar</str>
> >     <str name="defType">edismax</str>
> >     <str name="q.op">AND</str>
> >   </lst>
> > ...
> > <lst name="debug">
> >   <str name="rawquerystring">solar</str>
> >   <str name="querystring">solar</str>
> >   <str name="parsedquery">
> > (+DisjunctionMaxQuery(((+text:so +text:ol +text:la
> +text:ar))))/no_coord
> >   </str>
> >   <str name="parsedquery_toString">
> > +((+text:so +text:ol +text:la
> +text:ar))
> >   </str>
> > ...
> >
> >
> > As shown above, parsedquery also differs between Solr 5.4 and Solr
> 6.0.1 (after Solr 5.5).
> >
> >
> > —
> > Thanks
> > Issei Nishigata
>
>


Phrase Slop relevance tuning

2014-07-09 Thread Greg Pendlebury
I've received a request from our business area to take a look at
emphasising ~0 phrase matches over ~1 (and greater) more that they are
already. I can't see any doco on the subject, and I'd like to ask if anyone
else has played in this area? Or at least is willing to sanity check my
reasoning before I rush in and code a solution, when I may be reinventing
the wheel?

Looking through the codebase, I can only find hardcoded weightings in a
couple of places, using the formula: return 1.0f / (distance + 1); which
results in ~0 getting a weight of 1, and ~1 getting a weight of 0.5.

There are a number of ways I've already considered, but the most flexible
seems to be to expose those two numbers via configuration.

We are considering adjusting them in sync with each other (using 1/3
instead of 1 in both places), which has the impact of altering the overall
distribution of the weightings graph, but retaining the scale between 1 and
0.

Additionally, we are considering increasing the numerator to increase the
upper scale above 1. Not sure if this is dumb idea though. Our hope was to
use something like return 2.0f / (distance + 0.33f); to give ~0 matches a
real (^2) boost in comparison to other weighting factors, and retain the ~1
(and greater) matches at around their current weight. This remains a
completely untested theory though, since I may be misunderstanding how the
output gets combined outside this method.

The real technical change though would be to simply get those two numbers
from config. Any advice or suggestions about other ideas we haven't even
considered? The larger picture here is that we are using edismax and the pf
fields are all covered by ps=5.
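
For what it's worth, the rough shape of what I have in mind (assuming the hardcoded
weighting I found is DefaultSimilarity.sloppyFreq(); the class and parameter names
below are made up):

    import org.apache.lucene.search.similarities.DefaultSimilarity;
    import org.apache.lucene.search.similarities.Similarity;
    import org.apache.solr.schema.SimilarityFactory;

    public class ConfigurableSlopSimilarityFactory extends SimilarityFactory {

      @Override
      public Similarity getSimilarity() {
        // the two numbers from the stock formula 1.0f / (distance + 1), now configurable
        float numerator = params.getFloat("numerator", 1.0f);
        float offset = params.getFloat("offset", 1.0f);
        return new ConfigurableSlopSimilarity(numerator, offset);
      }

      static class ConfigurableSlopSimilarity extends DefaultSimilarity {
        private final float numerator;
        private final float offset;

        ConfigurableSlopSimilarity(float numerator, float offset) {
          this.numerator = numerator;
          this.offset = offset;
        }

        @Override
        public float sloppyFreq(int distance) {
          return numerator / (distance + offset);
        }
      }
    }

The factory would then be wired up through the schema's similarity element with the
two floats as init params.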

Ta,
Greg


Re: SolrCloud leaders using more disk space

2014-06-30 Thread Greg Pendlebury
Thanks for the reply Tim.

 Can you diff the listings of the index data directories on a leader vs.
replica?

It was a good tip, and mirrors some stuff we have been exploring in house
as well. The leaders all have additional 'index.' directories on disk,
but we have come to the conclusion that this is a coincidence and not
related to the fact that they are leaders.

Current theory is that they are the result of an upgrade rehearsal that was
performed before launch where the cluster was split into two on different
versions of Solr and different ZK paths. I suspect that whilst the ops team
were doing the deployment there were a number of server restarts that
triggered leader elections and recovery events that weren't allowed to
complete gracefully, leaving the old data on disk.

The coincidence is simply that the ops team did all their initial practice
stuff on the same 3 hosts, which later became our leaders. I've found a few
small similar issue on hosts 4-6, and none at all on hosts 7-9.

I'm hoping we get a chance to test all this soon, but we need to re-jig our
test systems first, since they don't have any redundancy depth to them
right now.

Ta,
Greg


On 28 June 2014 02:59, Timothy Potter thelabd...@gmail.com wrote:

 Hi Greg,

 Sorry for the slow response. The general thinking is that you
 shouldn't worry about which nodes host leaders vs. replicas because A)
 that can change, and B) as you say, the additional responsibilities
 for leader nodes is quite minimal (mainly per-doc version management
 and then distributing updates to replicas). The segment merging all
 happens at the Lucene level, which has no knowledge of SolrCloud
 leaders / replicas. Since this is SolrCloud, all nodes pull the config
 from ZooKeeper so should be running the same settings. Can you diff
 the listings of the index data directories on a leader vs. replica?
 Might give us some insights to what files the leader has that the
 replicas don't have.

 Cheers,
 Tim

 On Tue, Jun 3, 2014 at 8:32 PM, Greg Pendlebury
 greg.pendleb...@gmail.com wrote:
  Hi all,
 
  We launched our new production instance of SolrCloud last week and since
  then have noticed a trend with regards to disk usage. The non-leader
  replicas all seem to be self-optimizing their index segments as expected,
  but the leaders have (on average) around 33% more data on disk. My
  assumption is that leader's are not self-optimising (or not to the same
  extent)... but it is still early days of course.
 
  If it helps, there are 45 JVMs in the cloud, with 15 shards and 3
 replicas
  per shard. Each non-leader shard is sitting at between 59GB and 87GB on
  their SSD, but the leaders are between 84GB and 116GB.
 
  We have pretty much constant read and write traffic 24x7, with just
 'slow'
  periods overnight when write traffic is < 1 document per second and
  searches are between 1 and 2 per second. Is this light level of traffic
  still too much for the leaders to self-optimise?
 
  I'd also be curious to hear about what others are doing in terms of
  operating procedures. We load test before launch what would happen if we
  turned off JVMs and forced recovery events. I know that these things all
  work, just that customers will experience slower search responses whilst
  they occur. For example, a restore from a leader to a replica under load
  testing for us takes around 30 minutes and response times drop from
 around
  200-300ms average to 1.5s average.
 
  Bottleneck appears to be network I/O on the servers. We haven't explored
  whether this is specific to the servers replicating, or saturation of the
  of the infrastructure that all the servers share, because...
 
  This performance is acceptable for us, but I'm not sure if I'd like to
  force that event to occur unless required... this is following the line
 of
  reasoning proposed internally that we should periodically rotate leaders
 by
  turning them off briefly. We aren't going to do that unless we have a
  strong reason though. Does anyone try to manipulate production instances
  that way?
 
  Vaguely related to this is leader distribution. We have 9 physical
 servers
  and 5 JVMs running on each server. By virtue of the deployment procedures
  the first 3 servers to come online are all running 5 leaders each. Is
 there
  any merit in 'moving' these around (by reboots)?
 
  Our planning up to launch was based on lots of mailing list response we'd
  seen that indicated leaders had no significant performance difference to
  normal replicas, and all of our testing has agreed with that. The disk
 size
  'issue' (which we aren't worried about... yet. It hasn't been in prod
 long
  enough to know for certain) may be the only thing we've seen so far.
 
  Ta,
  Greg



SolrCloud leaders using more disk space

2014-06-03 Thread Greg Pendlebury
Hi all,

We launched our new production instance of SolrCloud last week and since
then have noticed a trend with regards to disk usage. The non-leader
replicas all seem to be self-optimizing their index segments as expected,
but the leaders have (on average) around 33% more data on disk. My
assumption is that leader's are not self-optimising (or not to the same
extent)... but it is still early days of course.

If it helps, there are 45 JVMs in the cloud, with 15 shards and 3 replicas
per shard. Each non-leader shard is sitting at between 59GB and 87GB on
their SSD, but the leaders are between 84GB and 116GB.

We have pretty much constant read and write traffic 24x7, with just 'slow'
periods overnight when write traffic is < 1 document per second and
searches are between 1 and 2 per second. Is this light level of traffic
still too much for the leaders to self-optimise?

I'd also be curious to hear about what others are doing in terms of
operating procedures. Before launch we load tested what would happen if we
turned off JVMs and forced recovery events. I know that these things all
work, just that customers will experience slower search responses whilst
they occur. For example, a restore from a leader to a replica under load
testing for us takes around 30 minutes and response times drop from around
200-300ms average to 1.5s average.

Bottleneck appears to be network I/O on the servers. We haven't explored
whether this is specific to the servers replicating, or saturation of the
of the infrastructure that all the servers share, because...

This performance is acceptable for us, but I'm not sure if I'd like to
force that event to occur unless required... this is following the line of
reasoning proposed internally that we should periodically rotate leaders by
turning them off briefly. We aren't going to do that unless we have a
strong reason though. Does anyone try to manipulate production instances
that way?

Vaguely related to this is leader distribution. We have 9 physical servers
and 5 JVMs running on each server. By virtue of the deployment procedures
the first 3 servers to come online are all running 5 leaders each. Is there
any merit in 'moving' these around (by reboots)?

Our planning up to launch was based on lots of mailing list response we'd
seen that indicated leaders had no significant performance difference to
normal replicas, and all of our testing has agreed with that. The disk size
'issue' (which we aren't worried about... yet. It hasn't been in prod long
enough to know for certain) may be the only thing we've seen so far.

Ta,
Greg


Re: Deep paging in parallel with solr cloud - OutOfMemory

2014-03-17 Thread Greg Pendlebury
Shouldn't all deep pagination against a cluster use the new cursor mark
feature instead of 'start' and 'rows'?

4 or 5 requests still seems a very low limit to be running into OOM
issues though, so perhaps it is both issues combined?

Ta,
Greg



On 18 March 2014 07:49, Mike Hugo m...@piragua.com wrote:

 Thanks!


 On Mon, Mar 17, 2014 at 3:47 PM, Steve Rowe sar...@gmail.com wrote:

  Mike,
 
  Days.  I plan on making a 4.7.1 release candidate a week from today, and
  assuming nobody finds any problems with the RC, it will be released
 roughly
  four days thereafter (three days for voting + one day for release
  propogation to the Apache mirrors): i.e., next Friday-ish.
 
  Steve
 
  On Mar 17, 2014, at 4:40 PM, Mike Hugo m...@piragua.com wrote:
 
   Thanks Steve,
  
   That certainly looks like it could be the culprit.  Any word on a
 release
   date for 4.7.1?  Days?  Weeks?  Months?
  
   Mike
  
  
   On Mon, Mar 17, 2014 at 3:31 PM, Steve Rowe sar...@gmail.com wrote:
  
   Hi Mike,
  
   The OOM you're seeing is likely a result of the bug described in (and
   fixed by a commit under) SOLR-5875: 
   https://issues.apache.org/jira/browse/SOLR-5875.
  
   If you can build from source, it would be great if you could confirm
 the
   fix addresses the issue you're facing.
  
   This fix will be part of a to-be-released Solr 4.7.1.
  
   Steve
  
   On Mar 17, 2014, at 4:14 PM, Mike Hugo m...@piragua.com wrote:
  
   Hello,
  
   We recently upgraded to Solr Cloud 4.7 (went from a single node Solr
  4.0
   instance to 3 node Solr 4.7 cluster).
  
   Part of our application does an automated traversal of all documents
  that
   match a specific query.  It does this by iterating through results by
   setting the start and rows parameters, starting with start=0 and
   rows=1000,
   then start=1000, rows=1000, start = 2000, rows=1000, etc etc.
  
   We do this in parallel fashion with multiple workers on multiple
 nodes.
   It's easy to chunk up the work to be done by figuring out how many
  total
   results there are and then creating 'chunks' (0-1000, 1000-2000,
   2000-3000)
   and sending each chunk to a worker in a pool of multi-threaded
 workers.
  
   This worked well for us with a single server.  However upon upgrading
  to
   solr cloud, we've found that this quickly (within the first 4 or 5
   requests) causes an OutOfMemory error on the coordinating node that
   receives the query.   I don't fully understand what's going on here,
  but
   it
   looks like the coordinating node receives the query and sends it to
 the
   shard requested.  For example, given:
  
    shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000
  
   The coordinating node sends this query to shard3:
  
    NOW=1395086719189&shard.url=
  
  
 
  http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000
  
   Notice the rows parameter is 5000 (start + rows).  If the coordinator
   node
   is able to process the result set (which works for the first few
 pages,
   after that it will quickly run out of memory), it eventually issues
  this
   request back to shard3:
  
    NOW=1395086719189&shard.url=
  
  
 
  http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000
  
   and then finally returns the response to the client.
  
   One possible workaround:  We've found that if we issue
 non-distributed
   requests to specific shards, that we get performance along the same
  lines
   that we did before.  E.g. issue a query with
   shards=shard3&distrib=false
   directly to the url of the shard3 instance, rather than going through
  the
   cloud solr server solrj API.
  
    The other workaround is to adapt to use the new cursorMark
   functionality.  I've manually tried a few requests and it is pretty
   efficient, and doesn't result in the OOM errors on the coordinating
  node.
   However, i've only done this in single threaded manner.  I'm
 wondering
  if
   there would be a way to get cursor marks for an entire result set at
 a
   given page interval, so that they could then be fed to the pool of
   parallel
   workers to get the results in parallel rather than single threaded.
  Is
   there a way to do this so we could process the results in parallel?
  
   Any other possible solutions?  Thanks in advance.
  
   Mike
  
  
 
 



Re: Deep paging in parallel with solr cloud - OutOfMemory

2014-03-17 Thread Greg Pendlebury
My suspicion is that it won't work in parallel, but we've only just asked
the ops team to start our upgrade to look into it, so I don't have a server
yet to test. The bug identified in SOLR-5875 has put them off though :(

If things pan out as I think they will I suspect we are going to end up
with two implementations here. One for our GUI applications that uses
traditional paging and is capped at some arbitrarily low limit (say 1000
records like Google does). And another for our API users that harvest full
datasets, which will use cursor marks and support only serial harvests that
cannot skip content.
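
For the harvest side, the serial loop would look roughly like this (SolrJ 4.7+, with
'client' an already-built client and the sort field illustrative):

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(1000);
    q.addSort("id", SolrQuery.ORDER.asc);  // cursorMark needs a sort ending on the uniqueKey
    String cursor = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = client.query(q);
      // ... hand rsp.getResults() to the harvester ...
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) {
        break;  // an unchanged cursor means the result set is exhausted
      }
      cursor = next;
    }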

Ta,
Greg



On 18 March 2014 09:44, Mike Hugo m...@piragua.com wrote:

 Cursor mark definitely seems like the way to go.  If I can get it to work
 in parallel then that's additional bonus


 On Mon, Mar 17, 2014 at 5:41 PM, Greg Pendlebury
 greg.pendleb...@gmail.comwrote:

  Shouldn't all deep pagination against a cluster use the new cursor mark
  feature instead of 'start' and 'rows'?
 
  4 or 5 requests still seems a very low limit to be running into an OOM
  issues though, so perhaps it is both issues combined?
 
  Ta,
  Greg
 
 
 
  On 18 March 2014 07:49, Mike Hugo m...@piragua.com wrote:
 
   Thanks!
  
  
   On Mon, Mar 17, 2014 at 3:47 PM, Steve Rowe sar...@gmail.com wrote:
  
Mike,
   
Days.  I plan on making a 4.7.1 release candidate a week from today,
  and
assuming nobody finds any problems with the RC, it will be released
   roughly
four days thereafter (three days for voting + one day for release
propogation to the Apache mirrors): i.e., next Friday-ish.
   
Steve
   
On Mar 17, 2014, at 4:40 PM, Mike Hugo m...@piragua.com wrote:
   
 Thanks Steve,

 That certainly looks like it could be the culprit.  Any word on a
   release
 date for 4.7.1?  Days?  Weeks?  Months?

 Mike


 On Mon, Mar 17, 2014 at 3:31 PM, Steve Rowe sar...@gmail.com
  wrote:

 Hi Mike,

 The OOM you're seeing is likely a result of the bug described in
  (and
 fixed by a commit under) SOLR-5875: 
 https://issues.apache.org/jira/browse/SOLR-5875.

 If you can build from source, it would be great if you could
 confirm
   the
 fix addresses the issue you're facing.

 This fix will be part of a to-be-released Solr 4.7.1.

 Steve

 On Mar 17, 2014, at 4:14 PM, Mike Hugo m...@piragua.com wrote:

 Hello,

 We recently upgraded to Solr Cloud 4.7 (went from a single node
  Solr
4.0
 instance to 3 node Solr 4.7 cluster).

 Part of our application does an automated traversal of all
  documents
that
 match a specific query.  It does this by iterating through
 results
  by
 setting the start and rows parameters, starting with start=0 and
 rows=1000,
 then start=1000, rows=1000, start = 2000, rows=1000, etc etc.

 We do this in parallel fashion with multiple workers on multiple
   nodes.
 It's easy to chunk up the work to be done by figuring out how
 many
total
 results there are and then creating 'chunks' (0-1000, 1000-2000,
 2000-3000)
 and sending each chunk to a worker in a pool of multi-threaded
   workers.

 This worked well for us with a single server.  However upon
  upgrading
to
 solr cloud, we've found that this quickly (within the first 4 or
 5
 requests) causes an OutOfMemory error on the coordinating node
 that
 receives the query.   I don't fully understand what's going on
  here,
but
 it
 looks like the coordinating node receives the query and sends it
 to
   the
 shard requested.  For example, given:

  shards=shard3&sort=id+asc&start=4000&q=*:*&rows=1000

 The coordinating node sends this query to shard3:

  NOW=1395086719189&shard.url=


   
  
 
  http://shard3_url_goes_here:8080/solr/collection1/&fl=id&sort=id+asc&start=0&q=*:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=5000

 Notice the rows parameter is 5000 (start + rows).  If the
  coordinator
 node
 is able to process the result set (which works for the first few
   pages,
 after that it will quickly run out of memory), it eventually
 issues
this
 request back to shard3:

 NOW=1395086719189&shard.url=
 http://10.128.215.226:8080/extera-search/gemindex/&start=4000&ids=a..bunch...(1000)..of..doc..ids..go..here&q=*:*&distrib=false&wt=javabin&isShard=true&version=2&rows=1000

 and then finally returns the response to the client.

 One possible workaround:  We've found that if we issue
   non-distributed
 requests to specific shards, that we get performance along the
 same
lines
 that we did before.  E.g. issue a query with
 shards=shard3&distrib=false
 directly to the url of the shard3 instance, rather than going
  through
the
 cloud solr server solrj API.

 The other workaround is to adapt to use
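
For reference, the per-shard workaround described above (a non-distributed
query sent straight to one shard) looks roughly like this in SolrJ; the host,
core name and sort field here are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardLocalPage {
    public static void main(String[] args) throws Exception {
        // Talk to the node hosting shard3 directly instead of via CloudSolrServer
        HttpSolrServer shard3 = new HttpSolrServer("http://shard3-host:8080/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        q.addSort("id", SolrQuery.ORDER.asc);
        q.setStart(4000);
        q.setRows(1000);
        q.set("distrib", "false"); // keep the query on this core only
        QueryResponse rsp = shard3.query(q);
        System.out.println("docs found on this shard: " + rsp.getResults().getNumFound());
        shard3.shutdown();
    }
}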

Re: Deep paging in parallel with solr cloud - OutOfMemory

2014-03-17 Thread Greg Pendlebury
Sorry, I meant one thread requesting records 1 - 1000, whilst the next
thread requests 1001 - 2000 from the same ordered result set. We've
observed several of our customers trying to harvest our data with
multi-threaded scripts that work like this. I thought it would not work
using cursor marks... but:

A) I could be wrong, and
B) I could be talking about parallel in a different way to Mike.

Ta,
Greg



On 18 March 2014 10:24, Yonik Seeley yo...@heliosearch.com wrote:

 On Mon, Mar 17, 2014 at 7:14 PM, Greg Pendlebury
 greg.pendleb...@gmail.com wrote:
  My suspicion is that it won't work in parallel

 Deep paging with cursorMark does work with distributed search
 (assuming that's what you meant by parallel... querying sub-shards
 in parallel?).

 -Yonik
 http://heliosearch.org - solve Solr GC pauses with off-heap filters
 and fieldcache



Re: Solr metrics in Codahale metrics and Graphite?

2014-03-16 Thread Greg Pendlebury
In the codahale metrics library there are 1, 5 and 15 minute moving
averages just like you would see in a tool like 'top'. However in Solr I
can only see 5 and 15 minute values, plus 'avgRequestsPerSecond'. I assumed
this was the 1 minute value initially, but it seems to be something like
the average since startup. I haven't looked thoroughly, but it is around 1%
of the other two in a normally idle test cluster after load tests have been
running for long enough that the 5 and 15 minute numbers match the load
testing throughput.

Is this difference deliberate? or an accident? or am I wrong entirely? I
can compute the overall average anyway, given that the stats also include
the start time of the search handler and the total search count, so I
thought it might be an accident.
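
For comparison, this is the behaviour I'd expect from the library itself; a
small standalone sketch against metrics-core (nothing Solr-specific, and the
load here is obviously synthetic):

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

public class RateDemo {
    public static void main(String[] args) throws InterruptedException {
        MetricRegistry registry = new MetricRegistry();
        Meter requests = registry.meter("requests");
        for (int i = 0; i < 1000; i++) {
            requests.mark();   // record one request
            Thread.sleep(10);
        }
        // The mean rate is averaged over the meter's whole lifetime, which is
        // what 'avgRequestsPerSecond' appears to be reporting.
        System.out.println("mean rate   : " + requests.getMeanRate());
        // The decaying moving averages are the 'top'-style numbers.
        System.out.println("1m  average : " + requests.getOneMinuteRate());
        System.out.println("5m  average : " + requests.getFiveMinuteRate());
        System.out.println("15m average : " + requests.getFifteenMinuteRate());
    }
}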

Ta,
Greg





On 4 May 2013 01:19, Furkan KAMACI furkankam...@gmail.com wrote:

 Has anybody tested Ganglia with JMXTrans in a production environment for
 SolrCloud?

 2013/4/26 Dmitry Kan solrexp...@gmail.com

  Alan, Shawn,
 
  If backporting to 3.x is hard, no worries, we don't necessarily require
 the
  patch as we are heading to 4.x eventually. It is just much easier within
  our organization to test on the existing solr 3.4 as there are a few of
  internal dependencies and custom code on top of solr. Also solr upgrades
 on
  production systems are usually pushed forward by a month or so starting
 the
  upgrade on development systems (requires lots of testing and
  verifications).
 
  Nevertheless, it is good effort to make #solr #graphite friendly, so keep
  it up! :)
 
  Dmitry
 
 
 
 
  On Thu, Apr 25, 2013 at 9:29 PM, Shawn Heisey s...@elyograg.org wrote:
 
   On 4/25/2013 6:30 AM, Dmitry Kan wrote:
We are very much interested in 3.4.
   
On Thu, Apr 25, 2013 at 12:55 PM, Alan Woodward a...@flax.co.uk
  wrote:
This is on top of trunk at the moment, but would be back ported to
 4.4
   if
there was interest.
  
   This will be bad news, I'm sorry:
  
   All remaining work on 3.x versions happens in the 3.6 branch. This
   branch is in maintenance mode.  It will only get fixes for serious bugs
   with no workaround.  Improvements and new features won't be considered
   at all.
  
   You're welcome to try backporting patches from newer issues.  Due to
 the
   major differences in the 3x and 4x codebases, the best case scenario is
   that you'll be facing a very manual task.  Some changes can't be
   backported because they rely on other features only found in 4.x code.
  
   Thanks,
   Shawn
  
  
 



Re: Solr metrics in Codahale metrics and Graphite?

2014-03-16 Thread Greg Pendlebury
Oh my bad. I thought it was already in. Thanks for the correction.

Ta,
Greg


On 17 March 2014 15:55, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 Greg, SOLR-4735 (using the codahale metrics lib) hasn't been committed
 yet. It is still work in progress.

 Actually the internal Solr Metrics class has a method to return 1
 minute stats but it is not used.

 On Mon, Mar 17, 2014 at 10:06 AM, Greg Pendlebury
 greg.pendleb...@gmail.com wrote:
  In the codahale metrics library there are 1, 5 and 15 minute moving
  averages just like you would see in a tool like 'top'. However in Solr I
  can only see 5 and 15 minute values, plus 'avgRequestsPerSecond'. I
 assumed
  this was the 1 minute value initially, but it seems to be something like
  the average since startup. I haven't looked thoroughly, but it is around
 1%
  of the other two in a normally idle test cluster after load tests have
 been
  running for long enough that the 5 and 15 minute numbers match the load
  testing throughput.
 
  Is this difference deliberate? or an accident? or am I wrong entirely? I
  can compute the overall average anyway, given that the stats also include
  the start time of the search handler and the total search count, so I
  thought it might be an accident.
 
  Ta,
  Greg
 
 
 
 
 
  On 4 May 2013 01:19, Furkan KAMACI furkankam...@gmail.com wrote:
 
  Has anybody tested Ganglia with JMXTrans in a production environment for
  SolrCloud?
 
  2013/4/26 Dmitry Kan solrexp...@gmail.com
 
   Alan, Shawn,
  
   If backporting to 3.x is hard, no worries, we don't necessarily
 require
  the
   patch as we are heading to 4.x eventually. It is just much easier
 within
   our organization to test on the existing solr 3.4 as there are a few
 of
   internal dependencies and custom code on top of solr. Also solr
 upgrades
  on
   production systems are usually pushed forward by a month or so
 starting
  the
   upgrade on development systems (requires lots of testing and
   verifications).
  
   Nevertheless, it is good effort to make #solr #graphite friendly, so
 keep
   it up! :)
  
   Dmitry
  
  
  
  
   On Thu, Apr 25, 2013 at 9:29 PM, Shawn Heisey s...@elyograg.org
 wrote:
  
On 4/25/2013 6:30 AM, Dmitry Kan wrote:
 We are very much interested in 3.4.

 On Thu, Apr 25, 2013 at 12:55 PM, Alan Woodward a...@flax.co.uk
   wrote:
 This is on top of trunk at the moment, but would be back ported
 to
  4.4
if
 there was interest.
   
This will be bad news, I'm sorry:
   
All remaining work on 3.x versions happens in the 3.6 branch. This
branch is in maintenance mode.  It will only get fixes for serious
 bugs
with no workaround.  Improvements and new features won't be
 considered
at all.
   
You're welcome to try backporting patches from newer issues.  Due to
  the
major differences in the 3x and 4x codebases, the best case
 scenario is
that you'll be facing a very manual task.  Some changes can't be
backported because they rely on other features only found in 4.x
 code.
   
Thanks,
Shawn
   
   
  
 



 --
 Regards,
 Shalin Shekhar Mangar.



Re: Solr 4.7.0 - cursorMark question

2014-03-09 Thread Greg Pendlebury
That was really clear; I just had another read through of the documentation
with that explanation in mind and I can see I went off the rails.

Sorry for any confusion on my part, and thanks for the details.

Ta,
Greg


On 8 March 2014 08:36, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Thank-you, that all sounds great. My assumption about documents being
 : missed was something like this:
 ...
 : In that situation D would always be missed, whether the cursorMark 'C or
 : greater' or 'greater than B' (I'm not sure which it is in practice),
 simply
 : because the cursorMark is the unique ID and the unique ID is not your
 first
 : sort mechanism.

 First off: nothing about your example would result in "the cursorMark is
 the unique ID" ... let's clear that misconception up right away:

 Using Cursors requires a deterministic sort w/o any ties that can result
 in ambiguity.  For this reason (eliminating the ambiguity) it is necessary
 that the uniqueKey always be included in a sort -- but the cursorMark
 values that get computed are determined by *all* of the sort criteria used.

 So let's revisit your example, but let's make sure we are explicit about
 everything involved:

  * A,B,C,D are all uniqueKey values in the id field
  * 1,2,3 are all time values in a timestamp field.
  * we're going to use a "sort=timestamp asc, id asc" param in this example
  * when we say X(123) we mean Document with id 'X' which currently has
value '123' in the timestamp field

 Let's suppose that at the start of the example, all of the docs in your
 example, in sorted order, look like this...

   A(1), B(3), C(14), D(32)

 A client uses our sort, along with cursorMark=* & rows=2.  That client
 will get back A(1) and B(3) as well as some nextCursorMark value of $%^
 (deliberately not using any letters or numbers so as not to mislead you
 into thinking the cursorMark value is an id or a timestamp -- it's
 neither, it's an encoded binary value that has no meaning to the client
 other than as a mark to send back to the server)

 Now let's suppose that B & C are edited as you mention -- their new
 timestamp values must -- by definition -- be greater than D's existing
 timestamp value of 32 (otherwise it's not really a timestamp field).  So
 let's assume now, that the total ordering of all our docs, using our sort
 is:

   A(1), D(32), B(56), C(57)

 After B & C are modified, the client makes a followup request using
 the same sort, rows=2, and cursorMark=$%^ (the nextCursorMark returned
 from the previous request).  The two documents the client will get this
 time are D(32) and B(56).

  - D will never be skipped.
  - B will be returned twice, because its timestamp
value was updated after it was fetched

 Does that make sense?

 You can try this out manually if you want to see it for yourself --
 either using a real auto-assigned timestamp field, or just using a
 simple numeric field you set yourself when updating docs.



 -Hoss
 http://www.lucidworks.com/



Re:Solr 4.7.0 - cursorMark question

2014-03-06 Thread Greg Pendlebury
* New 'cursorMark' request param for efficient deep paging of sorted
  result sets. See http://s.apache.org/cursorpagination

At the end of the linked doco there is an example that doesn't make sense
to me, because it mentions sort=timestamp asc and is then followed by
pseudo code that sorts by id only. I understand that cursorMark requires
that sort clauses must include the uniqueKey field, but is it really just
'include', or is it the only field that sort can be performed on?

ie. can sort be specified as 'sort=timestamp asc, id asc'?

I am assuming that if the index is changed between requests then we can
still 'miss' or duplicate documents by not sorting on the id as the only
sort parameter, but I can live with that scenario. cursorMark is still
attractive to us since it will prevent the SolrCloud cluster from crashing
when deep pagination requests are sent to it... I'm just trying to explore
all the edge cases our business area are likely to consider.

Ta,
Greg

On 27 February 2014 02:15, Simon Willnauer sim...@apache.org wrote:

 February 2014, Apache Solr(tm) 4.7 available

 The Lucene PMC is pleased to announce the release of Apache Solr 4.7

 Solr is the popular, blazing fast, open source NoSQL search platform
 from the Apache Lucene project. Its major features include powerful
 full-text search, hit highlighting, faceted search, dynamic
 clustering, database integration, rich document (e.g., Word, PDF)
 handling, and geospatial search.  Solr is highly scalable, providing
 fault tolerant distributed search and indexing, and powers the search
 and navigation features of many of the world's largest internet sites.

 Solr 4.7 is available for immediate download at:
   http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

 See the CHANGES.txt file included with the release for a full list of
 details.

 Solr 4.7 Release Highlights:

 * A new 'migrate' collection API to split all documents with a route key
   into another collection.

 * Added support for tri-level compositeId routing.

 * Admin UI - Added a new Files conf directory browser/file viewer.

 * Add a QParserPlugin for Lucene's SimpleQueryParser.

 * Suggest improvements: a new SuggestComponent that fully utilizes the
   Lucene suggester module; queries can now use multiple suggesters;
   Lucene's FreeTextSuggester and BlendedInfixSuggester are now supported.

 * New 'cursorMark' request param for efficient deep paging of sorted
   result sets. See http://s.apache.org/cursorpagination

 * Add a Solr contrib that allows for building Solr indexes via Hadoop's
   MapReduce.

 * Upgrade to Spatial4j 0.4. Various new options are now exposed
   automatically for an RPT field type.  See Spatial4j CHANGES & javadocs.
   https://github.com/spatial4j/spatial4j/blob/master/CHANGES.md

 * SSL support for SolrCloud.

 Solr 4.7 also includes many other new features as well as numerous
 optimizations and bugfixes.

 Please report any feedback to the mailing lists
 (http://lucene.apache.org/solr/discussion.html)

 Note: The Apache Software Foundation uses an extensive mirroring network
 for distributing releases.  It is possible that the mirror you are using
 may not have replicated the release yet.  If that is the case, please
 try another mirror.  This also goes for Maven access.



Re: Solr 4.7.0 - cursorMark question

2014-03-06 Thread Greg Pendlebury
Thank-you, that all sounds great. My assumption about documents being
missed was something like this:

A,B,C,D

where they are sorted by timestamp first and ID second. Say the first
'page' of results is 'A,B', and before the second page is requested both
documents B + C receive update events and the new order (by timestamp) is:

A,D,B,C

In that situation D would always be missed, whether the cursorMark 'C or
greater' or 'greater than B' (I'm not sure which it is in practice), simply
because the cursorMark is the unique ID and the unique ID is not your first
sort mechanism.

However, I'm not really concerned about that anyway since it is not a use
case we consider important, and in an information science sense of things I
think it is a non-trivial problem to solve without brute force caching of
all result sets. I'm just happy that we don't have to get our users to
replace existing sort options; we just need to add a unique ID field at the
end and change the parameters we send into the cluster.

Thanks,
Greg


On 7 March 2014 11:05, Chris Hostetter hossman_luc...@fucit.org wrote:


 : At the end of the linked doco there is an example that doesn't make sense
 : to me, because it mentions sort=timestamp asc and is then followed by
 : pseudo code that sorts by id only. I understand that cursorMark requires

 Ok ... 2 things contributing to the confusion.

 1) the para that refers to sort=timestamp asc should be fixed to include
 id as well.

 2) the pseudo-code you're referring to that uses sort = 'id asc' isn't meant
 to give an example of specifically tailing by timestamp -- it's an
 extension on the earlier example (of fetching all docs sorting on id) to
 show tailing new docs with new (increasing) ids ... i'll try to fix the
 wording to better elaborate

 : that sort clauses must include the uniqueKey field, but is it really
 just
 : 'include', or is it the only field that sort can be performed on?
 :
 : ie. can sort be specified as 'sort=timestamp asc, id asc'?

 That will absolutely work ... i'll update the doc to include more examples
 with multi-clause sort criteria.

 : I am assuming that if the index is changed between requests then we can
 : still 'miss' or duplicate documents by not sorting on the id as the only
 : sort parameter, but I can live with that scenario. cursorMark is still

 If you are using a timestamp param, you should never miss a document
 (assuming every doc gets a timestamp) but yes: you can absolutely get the
 same doc twice if it's updated after the first time you fetch it -- that's
 one of the advantages of sorting on a timestamp field like that.



 -Hoss
 http://www.lucidworks.com/



Re: Cluster state ranges are all null after reboot

2014-03-02 Thread Greg Pendlebury
Thanks again for the info. Hopefully we find some more clues if it
continues to occur. The ops team are looking at alternative deployment
methods as well, so we might end up avoiding the issue altogether.

Ta,
Greg


On 28 February 2014 02:42, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 I think it is just a side-effect of the current implementation that
 the ranges are assigned linearly. You can also verify this by choosing
  a document from each shard and running its uniqueKey against the
 CompositeIdRouter's sliceHash method and verifying that it is included
 in the range.

 I couldn't reproduce this but I didn't try too hard either. If you are
 able to isolate a reproducible example then please do report back.
 I'll spend some time to review the related code again to see if I can
 spot the problem.

 On Thu, Feb 27, 2014 at 2:19 AM, Greg Pendlebury
 greg.pendleb...@gmail.com wrote:
  Thanks Shalin, that code might be helpful... do you know if there is a
  reliable way to line up the ranges with the shard numbers? When the
 problem
  occurred we had 80 million documents already in the index, and could not
  issue even a basic 'deleteById' call. I'm tempted to assume they are just
  assigned linearly since our Test and Prod clusters both look to work that
  way now, but I can't be sure whether that is by design or just
 happenstance
  of boot order.
 
  And no, unfortunately we have not been able to reproduce this issue
  consistently despite trying a number of different things such as
 graceless
  stop/start and screwing with the underlying WAR file (which is what we
  thought puppet might be doing). The problem has occurred twice since, but
  always in our Test environment. The fact that Test has only a single
  replica per shard is the most likely culprit for me, but as mentioned,
 even
  gracelessly killing the last replica in the cluster seems to leave the
  range set correctly in clusterstate when we test it in isolation.
 
  In production (45 JVMs, 15 shards with 3 replicas each) we've never seen
  the problem, despite a similar number of rollouts for version changes
 etc.
 
  Ta,
  Greg
 
 
 
 
  On 26 February 2014 23:46, Shalin Shekhar Mangar shalinman...@gmail.com
 wrote:
 
  If you have 15 shards and assuming that you've never used shard
  splitting, you can calculate the shard ranges by using new
  CompositeIdRouter().partitionRange(15, new
  CompositeIdRouter().fullRange())
 
  This gives me:
  [8000-9110, 9111-a221, a222-b332,
  b333-c443, c444-d554, d555-e665,
  e666-f776, f777-887, 888-1998,
  1999-2aa9, 2aaa-3bba, 3bbb-4ccb,
  4ccc-5ddc, 5ddd-6eed, 6eee-7fff]
 
  Have you done any more investigation into why this happened? Anything
  strange in the logs? Are you able to reproduce this in a test
  environment?
 
  On Wed, Feb 19, 2014 at 5:16 AM, Greg Pendlebury
  greg.pendleb...@gmail.com wrote:
   We've got a 15 shard cluster spread across 3 hosts. This morning our
  puppet
   software rebooted them all and afterwards the 'range' for each shard
 has
   become null in zookeeper. Is there any way to restore this value
 short of
   rebuilding a fresh index?
  
   I've read various questions from people with a similar problem,
 although
  in
   those cases it is usually a single shard that has become null allowing
  them
   to infer what the value should be and manually fix it in ZK. In this
  case I
   have no idea what the ranges should be. This is our test cluster, and
   checking production I can see that the ranges don't appear to be
   predictable based on the shard number.
  
   I'm also not certain why it even occurred. Our test cluster only has a
   single replica per shard, so when a JVM is rebooted the cluster is
   unavailable... would that cause this? Production has 3 replicas so we
 can
   do rolling reboots.
 
 
 
  --
  Regards,
  Shalin Shekhar Mangar.
 



 --
 Regards,
 Shalin Shekhar Mangar.
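
A runnable version of the snippet quoted above, for anyone who wants to print
the expected ranges; CompositeIdRouter lives in org.apache.solr.common.cloud
(solr-solrj), the shard count of 15 is just this cluster's value, and whether
slice N really lines up with shardN is exactly the open question above:

import org.apache.solr.common.cloud.CompositeIdRouter;
import org.apache.solr.common.cloud.DocRouter;
import java.util.List;

public class PrintShardRanges {
    public static void main(String[] args) {
        CompositeIdRouter router = new CompositeIdRouter();
        // Split the full 32-bit hash space into 15 equal slices
        List<DocRouter.Range> ranges = router.partitionRange(15, router.fullRange());
        for (int i = 0; i < ranges.size(); i++) {
            System.out.println("slice " + (i + 1) + " -> " + ranges.get(i));
        }
    }
}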



Re: Cluster state ranges are all null after reboot

2014-02-26 Thread Greg Pendlebury
Thanks Shalin, that code might be helpful... do you know if there is a
reliable way to line up the ranges with the shard numbers? When the problem
occurred we had 80 million documents already in the index, and could not
issue even a basic 'deleteById' call. I'm tempted to assume they are just
assigned linearly since our Test and Prod clusters both look to work that
way now, but I can't be sure whether that is by design or just happenstance
of boot order.

And no, unfortunately we have not been able to reproduce this issue
consistently despite trying a number of different things such as graceless
stop/start and screwing with the underlying WAR file (which is what we
thought puppet might be doing). The problem has occurred twice since, but
always in our Test environment. The fact that Test has only a single
replica per shard is the most likely culprit for me, but as mentioned, even
gracelessly killing the last replica in the cluster seems to leave the
range set correctly in clusterstate when we test it in isolation.

In production (45 JVMs, 15 shards with 3 replicas each) we've never seen
the problem, despite a similar number of rollouts for version changes etc.

Ta,
Greg




On 26 February 2014 23:46, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

 If you have 15 shards and assuming that you've never used shard
 splitting, you can calculate the shard ranges by using new
 CompositeIdRouter().partitionRange(15, new
 CompositeIdRouter().fullRange())

 This gives me:
 [8000-9110, 9111-a221, a222-b332,
 b333-c443, c444-d554, d555-e665,
 e666-f776, f777-887, 888-1998,
 1999-2aa9, 2aaa-3bba, 3bbb-4ccb,
 4ccc-5ddc, 5ddd-6eed, 6eee-7fff]

 Have you done any more investigation into why this happened? Anything
 strange in the logs? Are you able to reproduce this in a test
 environment?

 On Wed, Feb 19, 2014 at 5:16 AM, Greg Pendlebury
 greg.pendleb...@gmail.com wrote:
  We've got a 15 shard cluster spread across 3 hosts. This morning our
 puppet
  software rebooted them all and afterwards the 'range' for each shard has
  become null in zookeeper. Is there any way to restore this value short of
  rebuilding a fresh index?
 
  I've read various questions from people with a similar problem, although
 in
  those cases it is usually a single shard that has become null allowing
 them
  to infer what the value should be and manually fix it in ZK. In this
 case I
  have no idea what the ranges should be. This is our test cluster, and
  checking production I can see that the ranges don't appear to be
  predictable based on the shard number.
 
  I'm also not certain why it even occurred. Our test cluster only has a
  single replica per shard, so when a JVM is rebooted the cluster is
  unavailable... would that cause this? Production has 3 replicas so we can
  do rolling reboots.



 --
 Regards,
 Shalin Shekhar Mangar.



Cluster state ranges are all null after reboot

2014-02-18 Thread Greg Pendlebury
We've got a 15 shard cluster spread across 3 hosts. This morning our puppet
software rebooted them all and afterwards the 'range' for each shard has
become null in zookeeper. Is there any way to restore this value short of
rebuilding a fresh index?

I've read various questions from people with a similar problem, although in
those cases it is usually a single shard that has become null allowing them
to infer what the value should be and manually fix it in ZK. In this case I
have no idea what the ranges should be. This is our test cluster, and
checking production I can see that the ranges don't appear to be
predictable based on the shard number.

I'm also not certain why it even occurred. Our test cluster only has a
single replica per shard, so when a JVM is rebooted the cluster is
unavailable... would that cause this? Production has 3 replicas so we can
do rolling reboots.


SolrCloud Archecture recommendations + related questions

2012-08-06 Thread Greg Pendlebury
Hi All,

TL;DR version: We think we want to explore Lucene/Solr 4.0 and SolrCloud,
but I’m not sure if there is any good doco/articles on how to make
architecture choices for how to chop up big indexes… and what other general
considerations are part of the equation?



I’m throwing this post out to the public to see if any kind and
knowledgeable individuals could provide some educated feedback on the
options our team is currently considering for the future architecture of
our Solr indexes. We have a loose collection of Solr indexes, each with a
specific purpose and differing schemas and document makeup, containing just
over 300 million documents with varying degrees of full-text. Our existing
architecture is showing its age, as it is really just the setup used for
small/medium indexes scaled upwards.

The biggest individual index is around 140 million documents and currently
exists as a Master/Slave setup with the Master receiving all writes in the
background and the 3 load balanced slaves updating with a 5 minute poll
interval. The master index is 451gb on disk and the 3 slaves are running
JVMs with RAM allocations of 21gb (right now anyway).

We are struggling under the traffic load and/or scale of our indexes
(mainly the latter I think). We know this isn't the best way to run things,
but the index in question is a fairly new addition and each time we run
into issues we tend to make small changes to improve things in the short
term… like bumping the RAM allocation up, toying with poll intervals,
garbage collection config etc.

We’ve historically run into issues with facet queries generating a lot of
bloat on some types of fields. These had to be solved through internal
modifications, but I expect we’ll have to review this with the new version
anyway. Related to that, there are some question marks on generating good
facet data from a sharded approach. In particular though, we are really
struggling with garbage collection on the slave machines around the time
that the slave/master sync occurs because of multiple copies of the index
being held in memory until all searchers have de-referenced the old index.
The machines typically either crash from OOM when we occasionally have a
third and/or fourth copy of the index appear because of really old searchers
not ‘letting go’ (hence we play with widening poll intervals), or they seem
to rarely become perpetually locked in GC and have to be restarted (not
100% sure why, but large heap allocations aren't helping, and cache warming may
be a culprit).

The team has lots of things we want to try to improve things, but given the
scale of the systems it is very hard to just try things out without
considerable resourcing implications. The entire ecosystem is spread across
7 machines that are resourced in the 64gb-100gb of RAM range (this is just
me poking around our servers… not a thorough assessment). Each machine is
running several JVMs so that for each ‘type’ of index there are typically
2-4 load balanced slaves available at any given time. One of those machines
is exclusively used as the Master for all indexes and receives no search
traffic… just lots of write traffic.

I believe the answers to some of these are going to be very much dependent
on schemas and documents, so I don’t imagine anyone can answer the
questions better than we can after testing and benchmarking… but right now
we are still trying to choose where to start, so broad ideas are very
welcome.

The kind of things we are currently thinking about:

   - Moving to v4.0 (currently just completed our v3.5 upgrade) to take
   advantage of the reduced RAM consumption:
   https://issues.apache.org/jira/browse/LUCENE-2380 We are hoping that
   this has the double-whammy impact of improving garbage collection as well.
   Lots of full-text data should equal lots of Strings, and thus lots of
   savings from this change.
   - Moving to a basic sharded approach. We’ve only just started testing
   this, and I’m not involved, so I’m not sure on what early results we’ve
   got…. But:
   - Given that we’d like to move to v4.0, I believe this opens up the
   option of a SolrCloud implementation… my suspicion is that this is where
   the money is at… but I’d be happy to hear feedback (good or bad) from
   people that are using it in production.
   - Hardware; we are not certain that the current approach of a few
    colossal machines is any better than lots of smaller clustered machines…
   and it is prohibitively expensive to experiment here. We don’t think that
   our current setup using SSDs and fibre-channel connections would be
   creating too many bottlenecks on I/O, and rarely see other hardware related
   issues, but I’d again be curious if people have observed contradictory
   evidence. My suspicion is that with the changes above though, our current
   hardware would handle the load far better than it currently is.
   - Are there any sort of pros and cons documented out there for making
   decisions on sharding 

Re: Embedded Solr Optimize under Windows

2011-05-19 Thread Greg Pendlebury
Ahh, thanks. I might try a basic commit() then and see, although it's not a
huge deal for me. It occurred to me that two optimize() calls would probably
leave exactly the same problem behind.

On 20 May 2011 09:52, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Thanks for the reply. I'm at home right now, or I'd try this myself, but
 is
 : the suggestion that two optimize() calls in a row would resolve the
 issue?

 it might ... I think the situations in which it happens have evolved a bit
 over the years as IndexWriter has gotten smarter about knowing when it
 really needs to touch the disk to reduce IO.

 there's a relatively new explicit method (IndexWriter.deleteUnusedFiles)
 that can force this...

 https://issues.apache.org/jira/browse/LUCENE-2259

 ...but it's only on trunk, and there isn't any user level hook for it in
 Solr yet (i opened SOLR-2532 to consider adding it)


 -Hoss
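
For reference, a rough sketch of calling that method straight from Lucene;
this is written against a current Lucene where the method is public, not the
3.x-era code discussed here, and it is untested on Windows:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class CleanupUnusedFiles {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get(args[0]));
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Ask Lucene to drop index files that are no longer referenced;
            // on Windows these can linger while old readers still hold handles.
            writer.deleteUnusedFiles();
            writer.commit();
        }
    }
}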



Re: Embedded Solr Optimize under Windows

2011-05-16 Thread Greg Pendlebury
Thanks for the reply. I'm at home right now, or I'd try this myself, but is
the suggestion that two optimize() calls in a row would resolve the issue?
The process in question is a JVM devoted entirely to harvesting, calls
optimize() then shuts down.

The least processor intensive way of triggering this behaviour is
desirable... perhaps a commit()? But I wouldn't have expected that to
trigger a write.

On 17 May 2011 10:20, Chris Hostetter hossman_luc...@fucit.org wrote:


 : http://code.google.com/p/solr-geonames/wiki/DeveloperInstall
 : It's worth noting that the build has also been run on Mac and Solaris
 now,
 : and the Solr index is about half the size. We suspect the optimize() call
 in
 : Embedded Solr is not working correctly under Windows.
 :
 : We've observed that Windows leaves lots of segments on disk and takes up
 : twice the volume as the other OSs. Perhaps file locking or something

 The problem isn't that optimize doesn't work on windows, the problem is
 that windows file semantics won't let files be deleted while there are
 open file handles -- so Lucene's Directory behavior is to leave the files
 on disk, and try to clean them up later.  (on the next write, or next
 optimize call)


 -Hoss



Embedded Solr Optimize under Windows

2011-04-27 Thread Greg Pendlebury
Hi All,

Just quick query of no particular importance to me, but we did observe this
problem:

http://code.google.com/p/solr-geonames/wiki/DeveloperInstall
It's worth noting that the build has also been run on Mac and Solaris now,
and the Solr index is about half the size. We suspect the optimize() call in
Embedded Solr is not working correctly under Windows.

We've observed that Windows leaves lots of segments on disk and takes up
twice the volume as the other OSs. Perhaps file locking or something
prevents the optimize() call from functioning. This wasn't particularly
important to us since we don't run Windows for any prod systems. For that
reason we haven't looked too closely, but thought it might be of interest to
others... if we are even right of course :)

Ta,
Greg


Re: Embedded Solr constructor not returning

2011-04-06 Thread Greg Pendlebury
 Sounds good.  Please go ahead and make this change yourself.

Done.

Ta,
Greg

On 6 April 2011 22:52, Steven A Rowe sar...@syr.edu wrote:

 Hi Greg,

  I need the servlet API in my app for it to work, despite being command
  line.
  So adding this to the maven POM fixed everything:
  <dependency>
    <groupId>javax.servlet</groupId>
    <artifactId>servlet-api</artifactId>
    <version>2.5</version>
  </dependency>
 
  Perhaps this dependency could be listed on the wiki? Alongside the sample
  code for using embedded solr?
  http://wiki.apache.org/solr/Solrj

 Sounds good.  Please go ahead and make this change yourself.

 FYI, the Solr 3.1 POM has a servlet-api dependency, but the scope is
 provided, because the servlet container includes this dependency.  When
 *you* are the container, you have to provide it.

 Steve



Embedded Solr constructor not returning

2011-04-05 Thread Greg Pendlebury
Hi All,

I'm hoping this is a reasonably trivial issue, but it's frustrating me to no
end. I'm putting together a tiny command line app to write data into an
index. It has no web based Solr running against it; the index will be moved
at a later time to have a proper server instance start for responding to
queries. My problem however is I seem to have stalled on instantiating the
embedded server:

private SolrServer startSolr(String home) throws Exception {
    try {
        System.setProperty("solr.solr.home", home);
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        solrCore = initializer.initialize();
        return new EmbeddedSolrServer(solrCore, "");
    } catch (Exception ex) {
        log.error("\n===\nFailed to start Solr server\n");
        throw ex;
    }
}

The constructor for the embedded server just never comes back. I've seen
three or four different ways of starting the server with varying levels of
complexity, and they all APPEAR to work, but still do not return. STDOUT
show the output I have largely come to expect from watching Solr start
'correctly':

===
Starting Solr:

JNDI not configured for solr (NoInitialContextEx)
using system property solr.solr.home: C:\test\harvester\solr
looking for solr.xml: C:\test\harvester\solr\solr.xml
Solr home set to 'C:\test\harvester\solr\'
Loaded SolrConfig: solrconfig.xml
Opening new SolrCore at C:\test\harvester\solr\,
dataDir=C:\tf2\geonames\harvester\solr\.\data\
Reading Solr Schema
Schema name=test
created string: org.apache.solr.schema.StrField
created date: org.apache.solr.schema.TrieDateField
created sint: org.apache.solr.schema.SortableIntField
created sfloat: org.apache.solr.schema.SortableFloatField
created null: org.apache.solr.analysis.WhitespaceTokenizerFactory
created null: org.apache.solr.analysis.LowerCaseFilterFactory
created null: org.apache.solr.analysis.WhitespaceTokenizerFactory
created null: org.apache.solr.analysis.LowerCaseFilterFactory
created text: org.apache.solr.schema.TextField
default search field is basic_name
query parser default operator is AND
unique key field: id
No JMX servers found, not exposing Solr information with JMX.
created /update: solr.XmlUpdateRequestHandler
adding lazy requestHandler: solr.CSVRequestHandler
created /update/csv: solr.CSVRequestHandler
Opening Searcher@11b86c7 main
AutoCommit: disabled
registering core:
[] Registered new searcher Searcher@11b86c7 main
Terminate batch job (Y/N)? y


At this stage I'm grasping at straws. It appears as though the embedded
instance is behaving like a proper server, waiting for a request or
something. I've scrubbed the solrconfig.xml (from the Solr example
download) file back to remove most entries, but perhaps I'm using the
incorrect handlers/listeners for an embedded server?

I'm a tad confused though, because every other time I've done this
(admittedly in a servlet, not a command line app) the constructor simply
returns straight away and execution of my app code continues.

Any advice or suggestions would be greatly appreciated.

Ta,
Greg


Re: Embedded Solr constructor not returning

2011-04-05 Thread Greg Pendlebury
Hmmm, after being stuck on this for hours, I find the answer myself
15 minutes after asking for help... as usual. :)

For anyone interested, and no doubt this will not be a revelation for some,
I need the servlet API in my app for it to work, despite being command line.
So adding this to the maven POM fixed everything:
<dependency>
  <groupId>javax.servlet</groupId>
  <artifactId>servlet-api</artifactId>
  <version>2.5</version>
</dependency>

Perhaps this dependency could be listed on the wiki? Alongside the sample
code for using embedded solr?
http://wiki.apache.org/solr/Solrj

Logback is passing along all of my logging but I suspect I'd have to add
some Solr logging config before it would tell me this itself. I only
stumbled on it by accident:
http://osdir.com/ml/solr-user.lucene.apache.org/2009-11/msg00831.html



On 6 April 2011 14:48, Greg Pendlebury greg.pendleb...@gmail.com wrote:

 Hi All,

 I'm hoping this is a reasonably trivial issue, but it's frustrating me to
 no end. I'm putting together a tiny command line app to write data into an
 index. It has no web based Solr running against it; the index will be moved
 at a later time to have a proper server instance start for responding to
 queries. My problem however is I seem to have stalled on instantiating the
 embedded server:

 private SolrServer startSolr(String home) throws Exception {
     try {
         System.setProperty("solr.solr.home", home);
         CoreContainer.Initializer initializer = new CoreContainer.Initializer();
         solrCore = initializer.initialize();
         return new EmbeddedSolrServer(solrCore, "");
     } catch (Exception ex) {
         log.error("\n===\nFailed to start Solr server\n");
         throw ex;
     }
 }

 The constructor for the embedded server just never comes back. I've seen
 three or four different ways of starting the server with varying levels of
 complexity, and they all APPEAR to work, but still do not return. STDOUT
 show the output I have largely come to expect from watching Solr start
 'correctly':

 ===
 Starting Solr:

 JNDI not configured for solr (NoInitialContextEx)
 using system property solr.solr.home: C:\test\harvester\solr
 looking for solr.xml: C:\test\harvester\solr\solr.xml
 Solr home set to 'C:\test\harvester\solr\'
 Loaded SolrConfig: solrconfig.xml
 Opening new SolrCore at C:\test\harvester\solr\,
 dataDir=C:\tf2\geonames\harvester\solr\.\data\
 Reading Solr Schema
 Schema name=test
 created string: org.apache.solr.schema.StrField
 created date: org.apache.solr.schema.TrieDateField
 created sint: org.apache.solr.schema.SortableIntField
 created sfloat: org.apache.solr.schema.SortableFloatField
 created null: org.apache.solr.analysis.WhitespaceTokenizerFactory
 created null: org.apache.solr.analysis.LowerCaseFilterFactory
 created null: org.apache.solr.analysis.WhitespaceTokenizerFactory
 created null: org.apache.solr.analysis.LowerCaseFilterFactory
 created text: org.apache.solr.schema.TextField
 default search field is basic_name
 query parser default operator is AND
 unique key field: id
 No JMX servers found, not exposing Solr information with JMX.
 created /update: solr.XmlUpdateRequestHandler
 adding lazy requestHandler: solr.CSVRequestHandler
 created /update/csv: solr.CSVRequestHandler
 Opening Searcher@11b86c7 main
 AutoCommit: disabled
 registering core:
 [] Registered new searcher Searcher@11b86c7 main
 Terminate batch job (Y/N)? y


 At this stage I'm grasping at straws. It appears as though the embedded
 instance is behaving like a proper server, waiting for a request or
  something. I've scrubbed the solrconfig.xml (from the Solr example
 download) file back to remove most entries, but perhaps I'm using the
 incorrect handlers/listeners for an embedded server?

 I'm a tad confused though, because every other time I've done this
 (admittedly in a servlet, not a command line app) the constructor simply
 returns straight away and execution of my app code continues.

 Any advice or suggestions would be greatly appreciated.

 Ta,
 Greg





Re: Batch update, order of evaluation

2010-09-09 Thread Greg Pendlebury
I can't reproduce reliably, so I'm suspecting there are issues in our code.
I'm refactoring to avoid the problem entirely.
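
In case it helps anyone else, the refactor is roughly this shape (a
hypothetical sketch; our real buffer keys on our own uniqueKey field):

import org.apache.solr.common.SolrInputDocument;

import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

public class UpdateBuffer {
    // Keyed by uniqueKey: re-adding a document replaces the earlier copy
    // (keeping its original position), so only the newest version is sent.
    private final Map<String, SolrInputDocument> buffer = new LinkedHashMap<String, SolrInputDocument>();

    public void add(SolrInputDocument doc) {
        buffer.put((String) doc.getFieldValue("id"), doc);
    }

    public Collection<SolrInputDocument> drain() {
        Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>(buffer.values());
        buffer.clear();
        return docs;
    }
}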

Thanks for the response though Erick.

Greg

On 8 September 2010 21:51, Greg Pendlebury greg.pendleb...@gmail.com wrote:

 Thanks,

 I'll create a deliberate test tomorrow feed some random data through it
 several times to see what happens.

 I'm also working on simply improving the buffer to handle the situation
 internally, but a few hours of testing isn't a big deal.

 Ta,
 Greg


 On 8 September 2010 21:41, Erick Erickson erickerick...@gmail.com wrote:

 This would be surprising behavior, if you can reliably reproduce this
 it's worth a JIRA.

 But (and I'm stretching a bit here) are you sure you're committing at the
 end of the batch AND are you sure you're looking after the commit? Here's
  the scenario: Your updated document is at positions 1 and 100 in your batch.
 Somewhere around SOLR processing document 50, an autocommit occurs,
 and you're looking at your results before SOLR gets around to committing
 document 100. Like I said, it's a stretch.

 To test this, you need to be absolutely sure of two things before you
 search:
 1 the batch is finished processing
 2 you've issued a commit after the last document in the batch.

 If you're sure of the above and still see the problem, please let us
 know...

 HTH
 Erick

 On Tue, Sep 7, 2010 at 10:32 PM, Greg Pendlebury
  greg.pendleb...@gmail.com wrote:

  Does anyone know with certainty how (or even if) order is evaluated when
  updates are performed by batch?
 
  Our application internally buffers solr documents for speed of ingest
  before
  sending them to the server in chunks. The XML documents sent to the solr
  server contain all documents in the order they arrived without any
 settings
  changed from the defaults (so overwrite = true). We are careful to avoid
  things like HashMaps on our side since they'd lose the order, but I
 can't
  be
  certain what occurs inside Solr.
 
  Sometimes if an object has been indexed twice for various reasons it
 could
  appear twice in the buffer but the most up-to-date version is always
 last.
  I
  have however observed instances where the first copy of the document is
  indexed and differences in the second copy are missing. Does this sound
  likely? And if so are there any obvious settings I can play with to get
 the
  behavior I desire?
 
  I looked at:
  http://wiki.apache.org/solr/UpdateXmlMessages
 
  but there is no mention of order, just the overwrite flag (which I'm
 unsure
  how it is applied internally to an update message) and the deprecated
  duplicates flag (which I have no idea about).
 
  Would switching to SolrInputDocuments on a CommonsHttpSolrServer help?
 as
  per http://wiki.apache.org/solr/Solrj. This is no mention of order
 there
  either however.
 
  Thanks to anyone who took the time to read this.
 
  Ta,
  Greg
 





Re: Batch update, order of evaluation

2010-09-08 Thread Greg Pendlebury
Thanks,

I'll create a deliberate test tomorrow feed some random data through it
several times to see what happens.

I'm also working on simply improving the buffer to handle the situation
internally, but a few hours of testing isn't a big deal.

Ta,
Greg

On 8 September 2010 21:41, Erick Erickson erickerick...@gmail.com wrote:

 This would be surprising behavior, if you can reliably reproduce this
 it's worth a JIRA.

 But (and I'm stretching a bit here) are you sure you're committing at the
 end of the batch AND are you sure you're looking after the commit? Here's
  the scenario: Your updated document is at positions 1 and 100 in your batch.
 Somewhere around SOLR processing document 50, an autocommit occurs,
 and you're looking at your results before SOLR gets around to committing
 document 100. Like I said, it's a stretch.

 To test this, you need to be absolutely sure of two things before you
 search:
 1 the batch is finished processing
 2 you've issued a commit after the last document in the batch.

 If you're sure of the above and still see the problem, please let us
 know...

 HTH
 Erick

 On Tue, Sep 7, 2010 at 10:32 PM, Greg Pendlebury
  greg.pendleb...@gmail.com wrote:

  Does anyone know with certainty how (or even if) order is evaluated when
  updates are performed by batch?
 
  Our application internally buffers solr documents for speed of ingest
  before
  sending them to the server in chunks. The XML documents sent to the solr
  server contain all documents in the order they arrived without any
 settings
  changed from the defaults (so overwrite = true). We are careful to avoid
  things like HashMaps on our side since they'd lose the order, but I can't
  be
  certain what occurs inside Solr.
 
  Sometimes if an object has been indexed twice for various reasons it
 could
  appear twice in the buffer but the most up-to-date version is always
 last.
  I
  have however observed instances where the first copy of the document is
  indexed and differences in the second copy are missing. Does this sound
  likely? And if so are there any obvious settings I can play with to get
 the
  behavior I desire?
 
  I looked at:
  http://wiki.apache.org/solr/UpdateXmlMessages
 
  but there is no mention of order, just the overwrite flag (which I'm
 unsure
  how it is applied internally to an update message) and the deprecated
  duplicates flag (which I have no idea about).
 
  Would switching to SolrInputDocuments on a CommonsHttpSolrServer help? as
  per http://wiki.apache.org/solr/Solrj. This is no mention of order there
  either however.
 
  Thanks to anyone who took the time to read this.
 
  Ta,
  Greg
 



Batch update, order of evaluation

2010-09-07 Thread Greg Pendlebury
Does anyone know with certainty how (or even if) order is evaluated when
updates are performed by batch?

Our application internally buffers solr documents for speed of ingest before
sending them to the server in chunks. The XML documents sent to the solr
server contain all documents in the order they arrived without any settings
changed from the defaults (so overwrite = true). We are careful to avoid
things like HashMaps on our side since they'd lose the order, but I can't be
certain what occurs inside Solr.

Sometimes if an object has been indexed twice for various reasons it could
appear twice in the buffer but the most up-to-date version is always last. I
have however observed instances where the first copy of the document is
indexed and differences in the second copy are missing. Does this sound
likely? And if so are there any obvious settings I can play with to get the
behavior I desire?

I looked at:
http://wiki.apache.org/solr/UpdateXmlMessages

but there is no mention of order, just the overwrite flag (which I'm unsure
how it is applied internally to an update message) and the deprecated
duplicates flag (which I have no idea about).

Would switching to SolrInputDocuments on a CommonsHttpSolrServer help? as
per http://wiki.apache.org/solr/Solrj. There is no mention of order there
either, however.

Thanks to anyone who took the time to read this.

Ta,
Greg


Always spellcheck (suggest)

2009-10-04 Thread Greg Pendlebury
Hi All,

If I understand correctly the flag 'onlyMorePopular' encapsulates two 
independent behaviours. 1) It runs spell checking across queries that returned 
hits. Without the flag spell checking is not run when results are found. 2) It 
limits suggestions to terms with higher frequencies.

Is there any way to get behaviour (1) without behaviour (2)? Such as another 
flag I'm not seeing in the doco? The usage context is spelling suggestions for 
international usage. Eg. The user searches 'behaviour', we want it to suggest 
US spelling 'behavior' and vice versa. At the moment, the suggestion only works 
one way.

Ta,
Greg






RE: Always spellcheck (suggest)

2009-10-04 Thread Greg Pendlebury
Thanks for the response Christian. I'll modify my original point (1) then. Is 
'onlyMorePopular' the only way to return suggestions when all of the search 
terms are present in the dictionary (ie. correct)? Is there any way to force 
behaviour (1) without behaviour (2) (filtering on frequency).

Ta,
Greg

-Original Message-
From: Christian Zambrano [mailto:czamb...@gmail.com] 
Sent: Monday, 5 October 2009 11:59 AM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

I believe your understanding is incorrect. The first behavior you
described is produced by adding the paremeter spellcheck=true. 
Suggestions will be returned regardless of whether there are results. 
The only time I believe spelling suggestions might not be included is 
when all of the words are spelled correctly.

On 10/04/2009 07:55 PM, Greg Pendlebury wrote:
 Hi All,

 If I understand correctly the flag 'onlyMorePopular' encapsulates two 
 independent behaviours. 1) It runs spell checking across queries that 
 returned hits. Without the flag spell checking is not run when results are 
 found. 2) It limits suggestions to terms with higher frequencies.

 Is there any way to get behaviour (1) without behaviour (2)? Such as another 
 flag I'm not seeing in the doco? The usage context is spelling suggestions 
 for international usage. Eg. The user searches 'behaviour', we want it to 
 suggest US spelling 'behavior' and vice versa. At the moment, the suggestion 
 only works one way.

 Ta,
 Greg











RE: Always spellcheck (suggest)

2009-10-04 Thread Greg Pendlebury
Thanks. I'll have to look into modifications then (was hoping to avoid that).

For clarity though I believe this point is slightly off:

 Adding the parameter onlyMorePopular limits the suggestions that solr can 
 give you(to ones that return more hits than the existing query), nothing 
 more.

The flag is definitely returning suggestions, even for 'correct' terms; they
just have to be more popular 'correct' terms.

Eg. 'behaviour' suggests 'behavior' because it has four times as many hits, but 
they are both 'correct' and the suggestion does not occur without the 
'onlyMorePopular' flag set. 'behavior' will not suggest 'behaviour' however 
because it is less popular.

Greg 

-Original Message-
From: Christian Zambrano [mailto:czamb...@gmail.com] 
Sent: Monday, 5 October 2009 12:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Always spellcheck (suggest)

Greg,

I apologize if I misunderstood your original post. I don't think there 
is a way you can force solr to return suggestions when all of the words 
are correctly spelled. Adding the parameter onlyMorePopular limits the 
suggestions that solr can give you(to ones that return more hits than 
the existing query), nothing more.

In short, I believe the answer is No.

On 10/04/2009 09:19 PM, Greg Pendlebury wrote:
 Thanks for the response Christian. I'll modify my original point (1) then. Is 
 'onlyMorePopular' the only way to return suggestions when all of the search 
 terms are present in the dictionary (ie. correct)? Is there any way to force 
 behaviour (1) without behaviour (2) (filtering on frequency).

 Ta,
 Greg

 -Original Message-
 From: Christian Zambrano [mailto:czamb...@gmail.com]
 Sent: Monday, 5 October 2009 11:59 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Always spellcheck (suggest)

 I believe your understanding is incorrect. The first behavior you
 described is produced by adding the paremeter spellcheck=true.
 Suggestions will be returned regardless of whether there are results.
 The only time I believe spelling suggestions might not be included is
 when all of the words are spelled correctly.

 On 10/04/2009 07:55 PM, Greg Pendlebury wrote:

 Hi All,

 If I understand correctly the flag 'onlyMorePopular' encapsulates two 
 independent behaviours. 1) It runs spell checking across queries that 
 returned hits. Without the flag spell checking is not run when results are 
 found. 2) It limits suggestions to terms with higher frequencies.

 Is there any way to get behaviour (1) without behaviour (2)? Such as another 
 flag I'm not seeing in the doco? The usage context is spelling suggestions 
 for international usage. Eg. The user searches 'behaviour', we want it to 
 suggest US spelling 'behavior' and vice versa. At the moment, the suggestion 
 only works one way.

 Ta,
 Greg






  



