Re: Faceting a multi valued field

2011-11-07 Thread Steve Fatula
From: Chris Hostetter 
>To: Steve Fatula 
>Cc: "solr-user@lucene.apache.org" 
>Sent: Monday, November 7, 2011 7:17 PM
>Subject: Re: Faceting a multi valued field
> 
>: A > B > C > D > E
>: Z > C > D > E
>: Z > C > F > G > H > E
>: Y > G > H > E
>: 
>: Now, I want to get a count of the products in the children of C, AND, 
>: each of their children (so, 2 levels, i.e. D, D > E, F, F > G). Note, 
>
>Are these letters just "labels" for categories and the individual labels 
>are frequently re-used to describe different concrete categories, or are 
>you genuinely saying that a single category (labeled "C") has multiple 
>parent categories (B and Z) and, depending on *which* parent you are 
>considering at any given time, it has different child categories (ie: C 
>has direct children D and F when viewed from parent Z, but when viewed 
>from parent B, C's only direct child is D)?
>
>Each letter represents a category. So, I am browsing category C, and it is 100% 
>irrelevant which parent of C, as that is not considered at all. I am on C; that 
>is all that matters. Think of a product with multiple uses. A product might 
>be used in RVs. But it's also used in trucks, it's also used with solar 
>panels, and it's also used in consumer electronics. It can be in many 
>different places within a tree. That is perfectly normal, and almost all of the 
>large websites do such a thing. So, just an example. It's not messed up. You 
>are correct that I missed the other children in parts of the tree example; 
>that IS messed up, sorry! So, C ALWAYS has the same children. But it also may 
>have many, many parents at many different tree levels. So, this is much 
>different than the examples in the wiki, as "level" doesn't work.


>If it's the former (just an issue of reusing labels) then you can probably 
>make your life a lot simpler by choosing unique identifiers for every 
>category in the hierarchy (regardless of the label) and indexing those.
>
>Category C is category C, period. It is the same identifier, as it is the same 
>category with the same product memberships. It would make little sense to add 
>products to many more categories and duplicate all that data. That's not how a 
>tree is typically organized. But still, this is all tangential. I assume you 
>mean category C would have a different id depending on its parent, which is not 
>something we can do. Unless I misunderstand.


>: The reality is products are in C. It is meaningless what parent category 
>: they have, and thus what level. So, what is a good way to tackle this 
>: using Solr?
>
>from the standpoint of a single product document, you may not care what 
>the "parent" categories are for each category the product is in, but if your 
>goal is to get facet counts for every "child" category of a specified 
>"parent" then it absolutely matters what the parent categories are -- the 
>easiest way i know of to do that is to have a field containing each 
>of the categories the document is in expressed as a "path", and then use 
>facet.prefix to limit the constraints considered to terms that match the 
>"path" of the parent category you are interested in.  
>
>I don't follow that. The parent of C is, well, many categories, possibly 4 in 
>the power inverter example, in some cases many more. I cannot therefore use a 
>full path up to that point, as I do not know all of the full paths getting to 
>C; I just know I am on C. It would come out abysmally slow to calculate all 
>full paths to a given category for every product and index that. Just use one, 
>you say? Well, that then messes up higher-level queries, which would not be aware 
>of products in other parts of the tree via different parents. I also already 
>know the children of the category I am in, with or without Solr. So, since I 
>know that, I was hoping for a simple single query using some sort of Solr index 
>field that would allow a query to get the data I need.


>
>since you also said 
>you only want the categories that are immediate children of the current 
>category, encoding the "level" of the category at the beginning of its 
>path makes this possible using facet.prefix as well 
>
>Not when a given category exists at many levels. That's the simplistic Solr 
>wiki example. 


>-- if you *only* ever 
>want constraint counts for the immediate child categories, you can drop the 
>level and most of the path and just index the "${parent_cat_id}:${cat_id}" 
>tuples for every $cat_id the product is in, and use 
>"${cat_id}:" as your facet.prefix.
>
>
>Yes, and if you go back to the original message, which didn't explain the 
>structure as you had mentioned, that's what I am doing. But I also said I 
>needed the children of the current category, and their children as well. So, 
>getting the children is one Solr call, as we do now. Now, that may return, say, 
>25 subcategories. Now, for EACH of those, I need their children and counts 
>(so, this is 2 levels of subcategories and counts, no more, no less). I cannot 
>find a way to do

Re: can't determine sort order with desc provided

2011-11-07 Thread Greg Pelly
Thanks again

On Tue, Nov 8, 2011 at 2:56 PM, Chris Hostetter wrote:

>
> : I'm having an issue with sorting: because the PHP plugin converts the + to
> : %2B, I get the error "Can't determine Sort Order: 'name+desc'".
>
> then it sounds like the PHP library you are using is URL escaping
> things properly, and you should just be passing a simple space
> character to it.
>
> the canonical form of a sort is "fieldname desc" or "fieldname asc" ...
> when you see examples that look like "fieldname+desc" that's just because
> the example is showing you what it looks like when it's been URL escaped
> and put into the URL...
>
> https://wiki.apache.org/solr/CommonQueryParameters#sort
>
> :
> : Thanks in advance for any assistance.
> :
> : Cheers
> :
> : [...]
>
> -Hoss
>


Re: when using group=true facet numbers are "incorrect"

2011-11-07 Thread Greg Pelly
That works well, thanks very much.

On Tue, Nov 8, 2011 at 12:55 PM, Chris Hostetter
wrote:

>
> : I understand that's a valid thing for faceting to do, I was just
> wondering
> : if there's any way to get it to do the faceting on the groups returned.
> : Otherwise I guess I'll need to convince the UI people to just show the
> : facets without the numbers.
>
> what you are asking about is generally referred to as "post-group faceting"
> and can be activated using "group.truncate"...
>
> https://wiki.apache.org/solr/FieldCollapsing#Request_Parameters
>
>
>
>
> -Hoss
>


Re: can't determine sort order with desc provided

2011-11-07 Thread Chris Hostetter

: I'm having an issue with sorting: because the PHP plugin converts the + to
: %2B, I get the error "Can't determine Sort Order: 'name+desc'".

then it sounds like the PHP library you are using is URL escaping 
things properly, and you should just be passing a simple space 
character to it.

the canonical form of a sort is "fieldname desc" or "fieldname asc" ... 
when you see examples that look like "fieldname+desc" that's just because 
the example is showing you what it looks like when it's been URL escaped 
and put into the URL...

https://wiki.apache.org/solr/CommonQueryParameters#sort
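The escaping Hoss describes can be demonstrated outside of PHP; a small Python sketch (the parameter handling here is illustrative, not the PHP library's API) shows why you pass the raw "name desc" string and let the URL layer escape it exactly once:

```python
from urllib.parse import urlencode

# Correct: pass the raw sort value; the space is escaped to "+" once.
print(urlencode({"sort": "name desc"}))   # sort=name+desc

# Wrong: passing an already-escaped value gets it escaped a second time,
# so the server decodes it back to the literal string "name+desc" and
# fails with "Can't determine Sort Order: 'name+desc'".
print(urlencode({"sort": "name+desc"}))   # sort=name%2Bdesc
```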

: 
: Thanks in advance for any assistance.
: 
: Cheers
: 
: [...]

-Hoss


can't determine sort order with desc provided

2011-11-07 Thread Greg Pelly
Hi,

I'm having an issue with sorting: because the PHP plugin converts the + to
%2B, I get the error "Can't determine Sort Order: 'name+desc'".

Thanks in advance for any assistance.

Cheers

Nov 8, 2011 1:53:00 PM org.apache.solr.core.SolrCore execute
INFO: [pending] webapp=/solr path=/select/
params={facet=true&sort=name+desc&indent=on&start=0&q=*:*&group.field=resourceid&group=true&facet.field=sport&facet.field=learningmode&rows=10&version=2.2}
hits=3 status=0 QTime=2
Nov 8, 2011 1:53:58 PM org.apache.solr.core.SolrCore execute
INFO: [pending] webapp=/solr path=/select
params={facet=true&start=0&q=Just*&group.field=resourceid&group=true&facet.field=sport&facet.field=learningmode&facet.field=resourceid&
json.nl=map&wt=json&rows=10} hits=3 status=0 QTime=2
Nov 8, 2011 1:54:01 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: Can't determine Sort Order:
'name+desc', pos=9
at org.apache.solr.search.QueryParsing.parseSort(QueryParsing.java:358)
at org.apache.solr.search.QParser.getSort(QParser.java:251)
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:82)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)

Nov 8, 2011 1:54:01 PM org.apache.solr.core.SolrCore execute
INFO: [pending] webapp=/solr path=/select
params={facet=true&sort=name%2Bdesc&start=0&q=Just*&group.field=resourceid&group=true&facet.field=sport&facet.field=learningmode&facet.field=resourceid&
json.nl=map&wt=json&rows=10} status=400 QTime=2


Weird: Solr Search result and Analysis Result not match?

2011-11-07 Thread Ellery Leung
Hi all.

 

I am using Solr 3.4 under Win 7.

 

In schema there is a multivalue field indexed in this way:

==

Schema:

==

==

Actual index: 

==

2284e2
2284e4
2284e5
1911e2

==

Question:

==

Now when I do a search like this:

myEvent:1911e2

This should match the 4th item. But on the "Full Interface" it does not return
any result, while on "Analysis" the matches are highlighted.

With debugQuery on, the parsedquery is:

MultiPhraseQuery(myEvent:"(1911e2 1911) (A e) 2")

parsedquery_toString:

myEvent:"(1911e2 1911) (A e) 2"

Can anyone please help me on this?
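The schema didn't survive the list archive, but the parsed query myEvent:"(1911e2 1911) (A e) 2" looks like a word-delimiter style filter splitting tokens on letter/digit boundaries at query time. A rough guess at that split, in Python:

```python
import re

def split_alnum(token):
    """Guess at a WordDelimiterFilter-style split: break the token
    wherever letters and digits alternate."""
    return re.findall(r"[0-9]+|[A-Za-z]+", token)

print(split_alnum("1911e2"))  # ['1911', 'e', '2']
```

If the index-time analyzer produced different tokens than the query-time analyzer (or the documents were indexed before an analyzer change), the Analysis page can highlight a match that the real search misses; re-indexing after any analyzer change is the usual first thing to check.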



Re: when using group=true facet numbers are "incorrect"

2011-11-07 Thread Yonik Seeley
On Mon, Nov 7, 2011 at 8:55 PM, Chris Hostetter
 wrote:
>
> : I understand that's a valid thing for faceting to do, I was just wondering
> : if there's any way to get it to do the faceting on the groups returned.
> : Otherwise I guess I'll need to convince the UI people to just show the
> : facets without the numbers.
>
> what you are asking about is generally refered to as "post-group faceting"
> and can be activated using "group.truncate"...

We don't have true "post group faceting" currently (i.e. where the
units for facet counts would be numbers of groups, not numbers of
documents).
group.truncate just truncates the list of documents in each group, and
faceting still returns numbers of documents, not numbers of groups.
This is why I advocated the name group.truncate instead of
group.after, and have avoided any mention of "post grouping" on the
wiki page.

-Yonik
http://www.lucidimagination.com


Re: when using group=true facet numbers are "incorrect"

2011-11-07 Thread Chris Hostetter

: I understand that's a valid thing for faceting to do, I was just wondering
: if there's any way to get it to do the faceting on the groups returned.
: Otherwise I guess I'll need to convince the UI people to just show the
: facets without the numbers.

what you are asking about is generally referred to as "post-group faceting" 
and can be activated using "group.truncate"...

https://wiki.apache.org/solr/FieldCollapsing#Request_Parameters




-Hoss


Re: Faceting a multi valued field

2011-11-07 Thread Chris Hostetter

: Someone always wants to understand the full use case. :-) I do 
: understand why, but, sometimes said use case is extremely complex with 
: dozens and dozens of search requirements. I was trying to limit the 
: explanation and was hoping someone could just answer the question as is. 

well -- i gave you one answer to the question as is: "sort the facet.query 
counts on the client" ... my question about how you were modeling the 
taxonomy in field values is kind of crucial to any discussion about how to 
filter the facet response based on the taxonomy -- we have to know what 
the terms look like in order to give you suggestions on how to limit the 
terms being faceted.

: A > B > C > D > E
: Z > C > D > E
: Z > C > F > G > H > E
: Y > G > H > E
: 
: Now, I want to get a count of the products in the children of C, AND, 
: each of their children (so, 2 levels, i.e. D, D > E, F, F > G). Note, 

Are these letters just "labels" for categories and the individual labels 
are frequently re-used to describe different concrete categories, or are 
you genuinely saying that a single category (labeled "C") has multiple 
parent categories (B and Z) and, depending on *which* parent you are 
considering at any given time, it has different child categories (ie: C 
has direct children D and F when viewed from parent Z, but when viewed 
from parent B, C's only direct child is D)? ... because if it's the latter, 
that's the most fucked up "taxonomy" i've ever encountered.

If it's the former (just an issue of reusing labels) then you can probably 
make your life a lot simpler by choosing unique identifiers for every 
category in the hierarchy (regardless of the label) and indexing those.

: The reality is products are in C. It is meaningless what parent category 
: they have, and thus what level. So, what is a good way to tackle this 
: using Solr?

from the standpoint of a single product document, you may not care what 
the "parent" categories are for each category the product is in, but if your 
goal is to get facet counts for every "child" category of a specified 
"parent" then it absolutely matters what the parent categories are -- the 
easiest way i know of to do that is to have a field containing each 
of the categories the document is in expressed as a "path", and then use 
facet.prefix to limit the constraints considered to terms that match the 
"path" of the parent category you are interested in.  since you also said 
you only want the categories that are immediate children of the current 
category, encoding the "level" of the category at the beginning of its 
path makes this possible using facet.prefix as well -- if you *only* ever 
want constraint counts for the immediate child categories, you can drop the 
level and most of the path and just index the "${parent_cat_id}:${cat_id}" 
tuples for every $cat_id the product is in, and use 
"${cat_id}:" as your facet.prefix.

-Hoss


Re: to prevent number-of-matching-terms in contributing score

2011-11-07 Thread Chris Hostetter

: You can write your custom similarity implementation, and override the
: /lengthNorm()/ method to return a constant value.

The poster already said (twice!) that they have already set 
omitNorms=true, so lengthNorm won't even be used.

omitting norms (or mucking with norms by modifying the lengthNorm function) 
only affects the norms portion of the scoring -- the problem being 
described here is when a document matches the input term more than once: 
that is an issue of the "term frequency".

Setting omitTermFreqAndPositions="true" on your field type will eliminate 
the term frequency from the equation, and it will become a simple "match 
or not" factor in your scoring.

From the "more than one way to do it" standpoint, another option is to 
wrap the query in a function that flattens the scores (more fine-grained 
control, and doesn't require re-indexing, but probably less efficient):

q={!boost b=$cat_boost v=$main_query}
main_query=...
cat_boost={!func}map(map(query({!field f=cat v=$cat},-1),0,1,5)-1,-1,1)
cat=...

(note: used nested maps so that non-matches would result in a 1x 
multiplier, while matches result in a 5x multiplier)

-Hoss


Re: Faceting a multi valued field

2011-11-07 Thread Steve Fatula
From: Chris Hostetter 
>To: "solr-user@lucene.apache.org" ; Steve Fatula 
>
>Sent: Monday, November 7, 2011 5:42 PM
>Subject: Re: Faceting a multi valued field
>
>
>how are you modeling the tree nature of your category taxonomy when you 
>index the terms?  if you index each category id as the breadcrumb of all 
>its ancestor categories and the "depth" of the category in the tree, you 
>can use facet.prefix to only see the children of a specified category.  
>
>
Someone always wants to understand the full use case. :-) I do understand why, 
but sometimes said use case is extremely complex, with dozens and dozens of 
search requirements. I was trying to limit the explanation and was hoping 
someone could just answer the question as is. However, I will improve upon my 
question... To provide slightly more detail, the examples given on the Solr 
hierarchical wiki page are extremely simplistic. Consider the following 
taxonomy:

A > B > C > D > E
Z > C > D > E
Z > C > F > G > H > E
Y > G > H > E

Now, I want to get a count of the products in the children of C, AND, each of 
their children (so, 2 levels, i.e. D, D > E, F, F > G). Note, unlike the wiki 
examples, C exists at multiple levels, possibly lots of them. Each 
product (indexed document) can be a member of multiple categories. So, if we 
search for products of category C, no matter how it is indexed, I believe, if a 
given product is a member of C and G, you get G data as well, which is not what 
we want. Now, scale this to millions of documents: each document may be a 
member of half a dozen categories, and any given query could return thousands of 
categories, which are all meaningless except for the few we want.

The reality is products are in C. It is meaningless what parent category they 
have, and thus what level. So, what is a good way to tackle this using Solr?

So, what I WAS asking is - is there a way to filter the facets. This is one 
example of around a dozen use cases we have for the technique. It is likely 
that one technique will resolve all of those. The technique I described that we 
were using works very fast. However, we have added the new requirement of 
getting the children of the children counts. For that, that's many more queries 
to obtain just to show one screen of data. Not sure it will scale.

The only thing I have come up with is for each product, index ALL taxonomies to 
get to it, so, perhaps a product in E might index:

E
D > E
C > D > E
B > C > D > E
A > B > C > D > E
Z > C > D > E
H > E
G > H > E
F > G > H > E
C > F > G > H > E
Z > C > F > G > H > E
Y > G > H > E

By doing that, one could use prefix, since we index any possible starting point; 
so, prefix "C >" would in fact count all products at any level below C, which 
means it won't work, since I wanted products at each level. Probably I'd have to 
index more data than this, yuck. The problem of course is the volume of data 
for each product, and that the data can easily change drastically with tree 
changes, which happen all the time. The indexing time will grow quite a bit. 
Still, trying to figure out a good structure for this that would enable the 
queries to be done with Solr.
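The suffix expansion described above can be generated mechanically; for the four example taxonomies it yields exactly the twelve entries listed for a product in E (a sketch, with " > " as a hypothetical path separator):

```python
def suffix_paths(full_paths, sep=" > "):
    """Expand each full category path into every suffix, so that any
    category can serve as a facet.prefix starting point."""
    out = set()
    for path in full_paths:
        parts = path.split(sep)
        # Emit the path starting at every ancestor, down to the leaf alone.
        for i in range(len(parts)):
            out.add(sep.join(parts[i:]))
    return sorted(out, key=len)

paths = ["A > B > C > D > E", "Z > C > D > E",
         "Z > C > F > G > H > E", "Y > G > H > E"]
print(len(suffix_paths(paths)))  # 12 distinct index terms for this product
```

This also makes the data-volume worry concrete: the term count grows with the number of distinct ancestor chains, and every tree change invalidates many terms.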

Any other thoughts? Hopefully, this more fully explains the requirement.

Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread Chris Hostetter

: We've successfully setup Solr 3.4.0 to parse and import multiple news 
: RSS feeds (based on the slashdot example on 
: http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.

: The objective is for Solr to index ALL news items published on this feed 
: (ever) - not just the current contents of the feed. I've read that the 
: delta import is not supported for XML imports. I've therefore tried to 
: use "command=full-impor&clean=false". 

1) note your typo, should be "full-import"

: But still the number of Documents Processed seems to be stuck at a fixed 
: number of items looking at the Stats and the 'numFound' result for a 
: generic '*:*' search. New items are being added to the feeds all the 
: time (and old ones dropping off).

"Documents Processed" after each full import should be whatever the number 
of items in the current feed is -- it's the number processed in that 
import, no total number processed in all time.

if you specify clean=false no documents should be deleted.  I just tested 
this using the slashdot example with Solr 3.4 and could not reproduce the 
problem you described.  I loaded the following URL...

http://localhost:8983/solr/rss/dataimport?clean=false&command=full-import

...then waited a while for the feed to change, and then loaded that URL 
again.  The number of documents (returned by a *:* query) increased after 
the second run.


-Hoss

Re: Faceting a multi valued field

2011-11-07 Thread Chris Hostetter

: Now, instead we want to browse the products by category. This also works 
: since we can simply find all products for category A. So, we show them. 
: Now, we also want to show a list of categories underneath (in the tree 
: structure) that category, and, a count of items in each. Just the 
: subcategory level, not levels below it.

how are you modeling the tree nature of your category taxonomy when you 
index the terms?  if you index each category id as the breadcrumb of all 
its ancestor categories and the "depth" of the category in the tree, you 
can use facet.prefix to only see the children of a specified category.  
See slides #32-35...

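A hedged sketch of that "depth plus breadcrumb" encoding (the "depth/A/B/C" term format here is an assumption, not the slides' exact scheme): each ancestor prefix of a category path becomes one indexed term, and facet.prefix picks out exactly the children of a given category at a given depth.

```python
def breadcrumb_terms(path):
    """Encode every ancestor prefix of a category path as
    "<depth>/<id>/<id>/..." (hypothetical term format)."""
    return [f"{i}/" + "/".join(path[:i]) for i in range(1, len(path) + 1)]

print(breadcrumb_terms(["A", "B", "C"]))
# ['1/A', '2/A/B', '3/A/B/C']

# Immediate children of A would then be faceted with facet.prefix="2/A/".
```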

: So, right now, we do this by doing a solr query with multiple 
: facet.query=eachsubcategory, and, q=eachsubcategory with space between 
: each one. This does exactly what we want, i.e., the resulting facets 
: have a count for only the specific subcategories we want counts for.

: The results though are a list of the counts in the same order as the 
: facet.query parms. I want them ordered by count. I understand it's 
: ordered that way intentionally. So, instead, I want to find another 
: syntax to do the same thing, except, return in count order.

Why don't you just sort them on the client side? There's no efficiency 
gained by having the facet.query results sorted on the server side 
(because they all come back; there's no cut-off like there is with 
facet.field)
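Client-side sorting is a one-liner; a sketch against a hypothetical facet_queries section of a Solr JSON response:

```python
# Hypothetical "facet_queries" section of a parsed Solr JSON response.
facet_queries = {"cat:D": 42, "cat:F": 7, "cat:G": 19}

# Sort the constraints by descending count on the client.
ranked = sorted(facet_queries.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('cat:D', 42), ('cat:G', 19), ('cat:F', 7)]
```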

-Hoss


when using group=true facet numbers are "incorrect"

2011-11-07 Thread Greg Pelly
Hi,

I've noticed that when field collapsing and faceting are both used in the
one query the facet numbers ignore the grouping. In my example I have three
documents (I have a small index for testing) and if I group on a certain
field I get two groups in the results but the facet numbers show that there
were three hits.

I understand that's a valid thing for faceting to do, I was just wondering
if there's any way to get it to do the faceting on the groups returned.
Otherwise I guess I'll need to convince the UI people to just show the
facets without the numbers.

Cheers,
Greg


Replication fails in SolrCloud

2011-11-07 Thread prakash chandrasekaran

hi all,

i followed the steps in the link
http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble
and created a "Two shard cluster with shard replicas and zookeeper ensemble",
and then for Solr replication i followed the steps in the link
http://wiki.apache.org/solr/SolrReplication ..

now after server start, when the slave tries to pull data from the master, i am
seeing the below error messages:

org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not
support getConfigDir() - likely, what you are trying to do is not supported in
ZooKeeper mode
at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
at org.apache.solr.handler.ReplicationHandler.getConfFileInfoFromCache(ReplicationHandler.java:378)
at org.apache.solr.handler.ReplicationHandler.getFileList(ReplicationHandler.java:364)

i have a few questions regarding this:
1) Does SolrCloud support replication?
2) or do we need to follow different steps to achieve replication in SolrCloud?

Thanks,
prakash

> From: prakashchandraseka...@live.com
> To: solr-user@lucene.apache.org
> Subject: Zookeeper aware Replication in SolrCloud
> Date: Fri, 4 Nov 2011 03:36:27 +
> 
> 
> 
> hi,
> i m using SolrCloud and i wanted to add Replication feature to it .. 
> i followed the steps in Solr Wiki .. but when the client tried to poll for 
> data from server i got below Error Message ..
> in Master LogNov 3, 2011 8:34:00 PM 
> 
> in Slave logNov 3, 2011 8:34:00 PM org.apache.solr.handler.ReplicationHandler 
> doFetchSEVERE: SnapPull failed org.apache.solr.common.SolrException: Request 
> failed for the url org.apache.commons.httpclient.methods.PostMethod@18eabf6   
>  at 
> org.apache.solr.handler.SnapPuller.getNamedListResponse(SnapPuller.java:197) 
> at org.apache.solr.handler.SnapPuller.fetchFileList(SnapPuller.java:219)  
>   at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:281) 
> at 
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:284)
> But I could see the slave pointing to the correct master from the link: 
> http://localhost:7574/solr/replication?command=details
> I am also seeing these values in the replication details link 
> (http://localhost:7574/solr/replication?command=details):
> Thu Nov 03 20:28:00 PDT 2011, Thu Nov 03 20:27:00 PDT 2011,
> Thu Nov 03 20:26:00 PDT 2011, Thu Nov 03 20:25:00 PDT 2011
> <arr name="replicationFailedAtList">
>   Thu Nov 03 20:28:00 PDT 2011
>   Thu Nov 03 20:27:00 PDT 2011
>   Thu Nov 03 20:26:00 PDT 2011
>   Thu Nov 03 20:25:00 PDT 2011
> </arr>
> 
> 
> Thanks,
> Prakash
  

Re: changing omitNorms on an already built index

2011-11-07 Thread Jonathan Rochkind

On 10/27/2011 9:14 PM, Erick Erickson wrote:

Well, this could be explained if your fields are very short. Norms
are encoded into (part of?) a byte, so your ranking may be unaffected.

Try adding debugQuery=on and looking at the explanation. If you've
really omitted norms, I think you should see clauses like:

1.0 = fieldNorm(field=features, doc=1)
in the output, never something like


Thanks, this was very helpful. Indeed, with debugQuery on, I get "1.0 = 
fieldNorm" on my index with omitNorms for the relevant field, and on my 
index without omitNorms for the relevant field, I get a non-unit value 
"= fieldNorm". Thanks for giving me a way to reassure myself that 
omitNorms really is doing its thing.


Now to dive into my debugQuery and figure out why it doesn't seem to be 
having as much effect as I anticipated on relevance!





Faceting a multi valued field

2011-11-07 Thread Steve Fatula
So, I have a bunch of products indexed in Solr. Each product may exist in any 
number of product categories. The product category field is therefore 
multivalued in Solr. This allows us to show the categories a product exists in.

Now, instead we want to browse the products by category. This also works since 
we can simply find all products for category A. So, we show them. Now, we also 
want to show a list of categories underneath (in the tree structure) that 
category, and, a count of items in each. Just the subcategory level, not levels 
below it.

So, right now, we do this with a Solr query containing multiple 
facet.query=<subcategory> parameters, and a q listing each subcategory 
separated by spaces. This does exactly what we want, i.e., the resulting 
facets have a count only for the specific subcategories we want counts for.

So, for example:

q=cata catb&facet.query=facetfield:cata&facet.query=facetfield:catb

The results, though, are a list of the counts in the same order as the 
facet.query params. I want them ordered by count. I understand it's ordered that 
way intentionally. So, instead, I want to find another syntax to do the same 
thing, except return the counts in count order.

So, something like:

q=cata catb&facet.limitto=facetfield:cata or catb

i.e., a facet.query that contains a list of facet queries to do so that way, I 
presume they would be ordered by count. Or, perhaps facet.prefix where I can 
specify the LIST of categories I want.

Just looking for a better query syntax to allow the multivalued category field 
to only return counts for the categories I want in count order. Any way to do 
this?
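As an illustration, the request shape described above can be built programmatically; a minimal sketch (the category values and facetfield name come from the example query; the class and everything else is hypothetical):

```java
import java.net.URLEncoder;
import java.util.List;

public class SubcategoryFacetParams {
    // Builds: q=cata catb&facet=true&facet.query=facetfield:cata&facet.query=facetfield:catb
    static String build(List<String> subcats) throws Exception {
        StringBuilder sb = new StringBuilder("q=")
                .append(URLEncoder.encode(String.join(" ", subcats), "UTF-8"))
                .append("&facet=true");
        // one facet.query per subcategory we want a count for
        for (String cat : subcats) {
            sb.append("&facet.query=")
              .append(URLEncoder.encode("facetfield:" + cat, "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(build(List.of("cata", "catb")));
        // q=cata+catb&facet=true&facet.query=facetfield%3Acata&facet.query=facetfield%3Acatb
    }
}
```

Note the counts in the response come back keyed by each facet.query string, in request order rather than count order, which is exactly the limitation described above.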

Re: Term frequency question

2011-11-07 Thread Chris Hostetter

: ./NoLengthNormAndTfSimilarity.java:7: error: lengthNorm(String,int) in
: NoLengthNormAndTfSimilarity cannot override lengthNorm(String,int) in
: Similarity
:  public float lengthNorm(String fieldName, int numTerms) {
:   ^
:  overridden method is final
: 1 error
: -
: What am I doing wrong, is there a better way or newer way to do this?

did you look at the javadocs for the method?

https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/Similarity.html#lengthNorm%28java.lang.String,%20int%29

"Deprecated. Please override computeNorm instead"
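In Lucene 3.x the replacement override would look something like this sketch (class name hypothetical; it compiles against the Lucene 3.4 jars, so treat it as illustrative rather than tested):

```java
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.DefaultSimilarity;

// Sketch: drop length normalization by overriding computeNorm,
// rather than the final, deprecated lengthNorm(String, int).
public class NoLengthNormSimilarity extends DefaultSimilarity {
    @Override
    public float computeNorm(String field, FieldInvertState state) {
        // Keep only the field boost; ignore the number of terms entirely.
        return state.getBoost();
    }
}
```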


-Hoss


Re: overwrite=false support with SolrJ client

2011-11-07 Thread Chris Hostetter

: I see that https://issues.apache.org/jira/browse/SOLR-653 removed this 
: support from SolrJ, because it was deemed too dangerous for mere 
: mortals.

I believe the concern was that the "novice level" API was very in-your-face 
about asking if you wanted to "overwrite" and made it too easy to 
hurt yourself.

It should still be fairly trivial to specify overwrite=false in a SolrJ 
request -- just not using the convenience methods.  Something like...

UpdateRequest req = new UpdateRequest();
req.add(myBigCollectionOfDocuments);
req.setParam(UpdateParams.OVERWRITE, "false");
req.process(mySolrServer);

: For Hadoop-based workflows, it's straightforward to ensure that the 
: unique key field is really unique, thus if the performance gain is 
: significant, I might look into figuring out some way (with a trigger 
: lock) of re-enabling this support in SolrJ.

it's not just an issue of knowing that the key is unique -- it's an issue 
of being certain that your index does not contain any documents with the 
same key as a document you are about to add.  If you are generating a 
completely new Solr index from data that you are certain is unique -- then 
you will probably see some perf gains.  But if you are adding to an 
existing index, I would avoid it. 


-Hoss


Re: InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

2011-11-07 Thread Chris Hostetter

: finally I want to use Solr highlighting. But there seems to be a problem 
: if I combine the char filter and the compound word filter in combination 
: with highlighting (an 
: org.apache.lucene.search.highlight.InvalidTokenOffsetsException is 
: raised).

Definitely sounds like a bug somewhere in dealing with the offsets.

Can you please file a Jira and include all of the data you have provided 
here?  It would also be helpful to know what the analysis tool says about 
the various attributes of your tokens at each stage of the analysis.

: SEVERE: org.apache.solr.common.SolrException: 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall 
exceeds length of provided text sized 12
:   at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:469)
:   at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
:   at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
:   at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
:   at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
:   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
:   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
:   at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
:   at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
:   at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
:   at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
:   at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
:   at 
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
:   at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
:   at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
:   at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:851)
:   at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
:   at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
:   at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:278)
:   at 
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
:   at 
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
:   at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
:   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
:   at java.lang.Thread.run(Thread.java:680)
: Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: 
Token fall exceeds length of provided text sized 12
:   at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
:   at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
:   ... 23 more


-Hoss


Re: Solr's JMX domain names

2011-11-07 Thread Chris Hostetter

: depending on the Solr version and on the servlet container.  In some 
: cases the domain name is "solr", while in others it is "solr/".  But we 
: also saw further inconsistencies.  For example, we have 2 Solr 1.4.0 
: instances on the same version of the servlet container, and one has 
: "solr", while the other has "solr/" domain name.

I believe the default naming is "solr/${corename}" -- but if you are 
seeing "solr" by itself in some cases, that may be an edge case when using 
the legacy single-core mode (not certain ... could maybe be a bug).

There is also the "rootName" attribute on the <jmx/> setting in solrconfig.xml 
which can override the default -- maybe some of your indexes are 
explicitly setting this to "solr"?

https://issues.apache.org/jira/browse/SOLR-1843

-Hoss

Re: question from a beginner

2011-11-07 Thread Chris Hostetter

: So for example, if searching on "Santa Clara"  I would like to display all
: sections/paragraphs where "Santa Clara" occurs in the document. 

can you clarify what you mean by "display" and how you intend to use that 
info?

it may be obvious to you what you mean by "display", but depending on the 
answer there are different approaches to take.

for example: you may be able to tune highlighting to show you all of the 
"snippets" of each document where the search matches, but the results will 
still just contain one "document" for your file, so things like facet 
counts would still only return "1" doc per file -- and if you want a UI 
that lets the user "click" on each match, you would still only have one 
file to return to them for all of the snippets in each document.

alternately, you could split the word file up into multiple files per 
section (or paragraph, or page -- whatever you want), and index them 
independently, and then each "matching document" would correspond to the 
section of the file that you've split up -- so facet counts and stuff 
like that would correspond to your individual little files, and users 
could click on each result and your UI could return that micro-file, 
etc...

bottom line: think through your entire use case, and how you want users to 
interact with the results, and then model your documents based on that.  
then figure out how to feed the data into solr to match your model.
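The second approach above can be sketched as follows; a minimal illustration (the id, parent_id, and body field names are hypothetical) of splitting one file's text into per-paragraph units that would each be indexed as their own Solr document:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SplitIntoSections {
    // One map per paragraph; each map would become its own Solr document.
    static List<Map<String, String>> split(String fileId, String text) {
        List<Map<String, String>> docs = new ArrayList<>();
        int i = 1;
        for (String p : text.split("\\n\\s*\\n")) {  // paragraphs = blank-line-separated blocks
            if (p.trim().isEmpty()) continue;
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("id", fileId + "_p" + i++);
            doc.put("parent_id", fileId);  // lets the UI link back to the whole file
            doc.put("body", p.trim());
            docs.add(doc);
        }
        return docs;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs =
                split("file42", "Santa Clara is mentioned here.\n\nAnd again here.");
        System.out.println(docs.size());            // 2
        System.out.println(docs.get(0).get("id"));  // file42_p1
    }
}
```

Each map would then be posted as one document, so a search for "Santa Clara" matches (and counts) each section independently.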


-Hoss


Solr's JMX domain names

2011-11-07 Thread Otis Gospodnetic
Hello,

While working on our Performance Monitoring SaaS for Solr [1] we've noticed 
Solr MBeans are registered under a different JMX domain name, depending on the 
Solr version and on the servlet container.  In some cases the domain name is 
"solr", while in others it is "solr/".  But we also saw further 
inconsistencies.  For example, we have 2 Solr 1.4.0 instances on the same 
version of the servlet container, and one has "solr", while the other has 
"solr/" domain name.

Does anyone know what controls this?

Thanks,

Otis

[1] http://sematext.com/spm/solr-performance-monitoring/index.html (currently 
free)

Re: How to use an External Database for Fields?

2011-11-07 Thread Draconissa

Chris Hostetter-3 wrote:
> 
> Agreed.  In 3.x and below this type of logic is expected to live in the 
> QueryResponseWriters.  
> 

Forgive my ignorance, but where do QueryResponseWriters live? And where do
they fit into the flow? 

I know how the different components fit into a distributed search, and how
all the stages flow, but I haven't seen anything about response writers...

Would it be something like this: use distributed search and components to
retrieve the doc ids and facets, and once the entire distributed search is
complete, a ResponseWriter takes the results (which are just doc ids,
since we didn't use get_fields or any database component), queries our
SQL database, and creates the final response that gets sent to the client? 


--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-use-an-External-Database-for-Fields-tp3468308p3487972.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread sbarriba
Thanks Nagendra, I'll take a look.

So, a question for you et al.: does Solr in its default installation ALWAYS
delete content for an entity prior to doing a full import? 
Can you not simply build up an index incrementally from multiple imports
(from XML)? I read elsewhere that the 'clean' parameter was intended to
control this.

Regards,
Shaun

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Aggregated-indexing-of-updating-RSS-feeds-tp3485335p3487969.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Return the ranks of selected documents

2011-11-07 Thread Chris Hostetter

: Ideally this means that for a given query, I would like Solr just to return
: the ranks of selected unique keys within the results.

If i understand you correctly, given a query MY_QUERY and a set of IDs 
(ID1, ID2, ID3, etc...) you would like to know the score of those IDs 
against that query?

that's fairly straight forward...

?q=MY_QUERY&fq=id:(ID1 ID2 ID3 ...)


-Hoss


Re: i don't get why this says non-match

2011-11-07 Thread Chris Hostetter

: It looks to me like everything matches down the line but top level says
: otherQuery is a non-match... I don't get it?

Note the parsed query...

:   +moreWords:syncmaster
: -moreWords:"sync (master syncmaster)"

...and the top level explanation message...

: 0.0 = (NON-MATCH) Failure to meet condition(s) of
: required/prohibited clause(s) 1.4043131 = (MATCH)

...so either this document *doesn't* match the mandatory clause 
"+moreWords:syncmaster" or it *does* match the prohibited clause 
"-moreWords:"sync (master syncmaster)"

Looking at the details...

: match on prohibited clause (moreWords:"sync (master syncmaster)")
: 9.393997 = (MATCH) weight(moreWords:"sync (master syncmaster)" in

...so there's your answer.



-Hoss


Re: Why Jboss server is stopped due to SOLR

2011-11-07 Thread Chris Hostetter

: I am trying to connect the SOLR with Java code using URLConnection, i have
: deployed solr war file in jboss server(assuming server machine in some other
: location or remote) its working fine if no exception raises... but if any
: exception raises in server like connection failure its stopping the jboss
: client(assuming client machine) where my Java code resides.
: 
: 
: 11:49:38,345 INFO  [STDOUT] [2011-10-27 11:49:38.345] class =
: com.dstsystems.adc.efs.rs.util.SimplePost,method = fatal(),level = SEVERE:
: ,message = Connection error (is Solr running at
: http://xx.yy.zzz:8080/solr/update ?): java.net.ConnectException: Connection
: refused: connect

By the looks of that exception, I'm guessing that "SimplePost" class is 
your own renamed/repackaged version of the "SimplePostTool" class from 
Solr, which has a "fatal" method that calls System.exit -- because 
SimplePostTool is a simple little command-line tool for posting files.  
It is not remotely intended/suggested for embedding in other java 
applications.

if you have a java app that you'd like to use to talk to Solr, you should 
use SolrJ.

: 11:49:38,361 INFO  [Server] Runtime shutdown hook called, forceHalt: true
: 11:49:38,376 INFO  [Server] JBoss SHUTDOWN: Undeploying all packages
: 11:49:48,018 INFO  [TransactionManagerService] Stopping recovery manager
: 11:49:48,128 INFO  [Server] Shutdown complete
: Shutdown complete..
: 
: --
: View this message in context: 
http://lucene.472066.n3.nabble.com/Why-Jboss-server-is-stopped-due-to-SOLR-tp3456903p3456903.html
: Sent from the Solr - User mailing list archive at Nabble.com.
: 

-Hoss


Re: how to achieve google.com like results for phrase queries

2011-11-07 Thread alxsss
Solr can also query link (url) text and rank such pages higher if we specify url 
in the qf field. The only problem is why it does not rank pages with both words 
higher when mm is set as 
1<-1. It seems to me that this is a bug.

Thanks.
Alex.

 
 

 

-Original Message-
From: Ted Dunning 
To: solr-user 
Sent: Sat, Nov 5, 2011 8:59 pm
Subject: Re: how to achieve google.com like results for phrase queries


Google achieves their results by using data not found in the web pages
themselves.  This additional data critically includes link text, but it is
also derived from behavioral information.



On Sat, Nov 5, 2011 at 5:07 PM,  wrote:

> Hi Erick,
>
> The term  "newspaper latimes" is not found in latimes.com. However,
> google places it in the first place. My guess is that mm parameter must
>  not be set as 2<-1 in order to achieve google.com like ranking for
> two word phrase queries.
>
> My goal is to set mm parameter in such a way that latimes.com will be
> ranked in 1-3rd places and sites with both words will be placed after them.
> As I wrote in my previous letter
> setting mm as 1<-1 solves this issue partially. Problem in this case is
> that sites with both words are placed at the bottom or are not in the
> search results at all.
>
> Thanks.
> Alex.
>
>
>
>
>
>
> -Original Message-
> From: Erick Erickson 
> To: solr-user 
> Sent: Sat, Nov 5, 2011 9:01 am
> Subject: Re: how to achieve google.com like results for phrase queries
>
>
> First, the default query operator is ignored by edismax, so that's
> not doing anything.
>
> Why would you expect "newspaper latimes" to be found at all in
> "latimes.com"? What
> proof do you have that the two terms are even in the "latimes.com"
> document?
>
> You can look at the Query Elevation Component to force certain known
> documents to the top of the results based on the search terms, but that's
> not a very elegant solution.
>
> What business requirement are you trying to accomplish here? Because as
> asked, there's really not enough information to provide a meaningful
> suggestion.
>
> Best
> Erick
>
> On Thu, Nov 3, 2011 at 7:30 PM,   wrote:
> > Hello,
> >
> > I use nutch-1.3 crawled results in solr-3.4. I noticed that for two word
> phrases like newspaper latimes, latimes.com is not in results at all.
> > This may be due to the dismax def type that I use in  request handler
> >
> > dismax
> > url^1.5 id^1.5 content^ title^1.2
> > url^1.5 id^1.5 content^0.5 title^1.2
> >
> >
> >  with mm as
> > 2<-1 5<-2 6<90%
> >
> > However, changing it to
> > 1<-1 2<-1 5<-2 6<90%
> >
> > and q.op to OR or AND
> >
> > do not solve the problem. In this case latimes.com is ranked higher,
> but still
> is not in the first place.
> > Also in this case results with both words are ranked very low, almost at
> the
> end.
> >
> > We need to be able to achieve the case when latimes.com is placed in
> the first
> place then results with both words and etc.
> >
> > Any ideas how to modify config to this end?
> >
> > Thanks in advance.
> > Alex.
> >
> >
>
>
>
>

 


Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread Fred Zimmerman
Any options that do not require adding new software?

On Mon, Nov 7, 2011 at 11:11 AM, Nagendra Nagarajayya <
nnagaraja...@transaxtions.com> wrote:

> Shaun:
>
> You should try NRT available with Solr with RankingAlgorithm here. You
> should be able to add docs in real time and also query them in real time.
>  If DIH does not retain the old index, you may be able to convert the rss
> fields to a XML format as needed by Solr and update the docs (make sure
> there is a unique id)
>
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
>
> You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
> http://solr-ra.tgels.org
>
> Regards,
>
> - Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
>
> On 11/6/2011 1:22 PM, Shaun Barriball wrote:
>
>> Hi all,
>>
>> We've successfully setup Solr 3.4.0 to parse and import multiple news RSS
>> feeds (based on the slashdot example on
>> http://wiki.apache.org/solr/DataImportHandler) using
>> the HttpDataSource.
>> The objective is for Solr to index ALL news items published on this feed
>> (ever) - not just the current contents of the feed. I've read that the
>> delta import is not supported for XML imports. I've therefore tried to use
>> "command=full-import&clean=false".
>> But still the number of Documents Processed seems to be stuck at a fixed
>> number of items looking at the Stats and the 'numFound' result for a
>> generic '*:*' search. New items are being added to the feeds all the time
>> (and old ones dropping off).
>>
>> Is it possible for Solr to incrementally build an index of a live RSS
>> feed which is changing but retain the index of its archive?
>>
>> All help appreciated.
>> Shaun
>>
>
>


SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException:

2011-11-07 Thread OldSkoolMark
Having some trouble clustering my data ... These symptoms are similar to some
problems that were fixed last year. Possible regression? Suggestions on how
to proceed? Thanks in advance!

https://issues.apache.org/jira/browse/SOLR-1883
https://issues.apache.org/jira/browse/SOLR-1404

Nov 7, 2011 8:15:35 AM
org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine cluster
SEVERE: Carrot2 clustering failed
org.apache.solr.common.SolrException:
org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token
exhilar exceeds length of provided text sized 3801
at
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:475)
at
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:379)
at
org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.getDocuments(CarrotClusteringEngine.java:303)
at
org.apache.solr.handler.clustering.carrot2.CarrotClusteringEngine.cluster(CarrotClusteringEngine.java:124)
at
org.apache.solr.handler.clustering.ClusteringComponent.process(ClusteringComponent.java:91)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:241)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException:
Token exhilar exceeds length of provided text sized 3801
at
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
at
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:468)
... 27 more

Here is the relevant portion of my solrconfig:

  <requestHandler name="/clustering" class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="clustering">true</bool>
      <str name="clustering.engine">default</str>
      <bool name="clustering.results">true</bool>
      <!-- fields used by Carrot2 -->
      <str name="carrot.title">title</str>
      <str name="carrot.url">url</str>
      <!-- the field to cluster on -->
      <str name="carrot.snippet">description</str>
      <!-- produce summaries -->
      <bool name="carrot.produceSummary">true</bool>
      <!-- don't produce sub-clusters -->
      <bool name="carrot.outputSubClusters">false</bool>
      <str name="defType">edismax</str>
      <str name="qf">
        text^0.5 features^1.0 name^1.2 sku^1.5 id^10.0 manu^1.1 cat^1.4
      </str>
      <str name="q.alt">*:*</str>
      <str name="rows">10</str>
      <str name="fl">*,score</str>
    </lst>
    <arr name="last-components">
      <str>clustering</str>
    </arr>
  </requestHandler>

Also my data-config.xml, as my data is in an sqlite3 DB.


  
 

  
  



  

  


schema.xml has the standard description and title fields. 
  
   


--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrException-org-apache-lucene-search-highlight-InvalidTokenOffsetsException-tp3487517p3487517.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: size of data replicated

2011-11-07 Thread Otis Gospodnetic
Hi Xin,

I don't know if you can see this information anywhere in Solr's UI...
... but I know you could see this information using SPM for Solr [1].  I don't 
have a screenshot handy to show this visually, but it's easy to explain.  One 
of the SPM for Solr reports shows the index size (in terms of size on disk, 
number of files, number of segments, number of documents, etc.).  You can see 
how this size changes over time [2], so if you look at this report before and 
after replication, you will see how much your index size changed after 
replication and will know how much data was replicated.
And since there is also a report that shows query response time, you will be 
able to visually see how much faster or slower query performance is after 
replication.

I hope this helps.
 
[1] http://sematext.com/spm/solr-performance-monitoring/index.html (free now)
[2] sidenote: it's quite informative to see how Lucene merges segments and how 
index size varies over time


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



>
>From: Xin Li 
>To: solr-user 
>Sent: Monday, November 7, 2011 10:45 AM
>Subject: size of data replicated
>
>Hi, there,
>
>I am trying to look into the performance impact of data replication on
>query response time. To get a clear picture, I would like to know how
>to get the size of data being replicated for each commit. Through the
>admin UI, you may read a x of y G data is being replicated; however,
>"y" is the total index size, instead of data being copied over. I
>couldn't find the info in the solr logs either. Any idea?
>
>Thanks,
>Xin
>
>
>

Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread Nagendra Nagarajayya

Shaun:

You should try NRT available with Solr with RankingAlgorithm here. You 
should be able to add docs in real time and also query them in real 
time.  If DIH does not retain the old index, you may be able to convert 
the rss fields to a XML format as needed by Solr and update the docs 
(make sure there is a unique id)


http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
http://solr-ra.tgels.org

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 11/6/2011 1:22 PM, Shaun Barriball wrote:

Hi all,

We've successfully setup Solr 3.4.0 to parse and import multiple news RSS feeds 
(based on the slashdot example on 
http://wiki.apache.org/solr/DataImportHandler) using the HttpDataSource.
The objective is for Solr to index ALL news items published on this feed (ever) - 
not just the current contents of the feed. I've read that the delta import is not 
supported for XML imports. I've therefore tried to use "command=full-import&clean=false". 
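For reference, a full import that keeps the existing index would be requested like this (host, port, and handler path are assumptions, not taken from the setup described):

```
curl 'http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true'
```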


But still the number of Documents Processed seems to be stuck at a fixed number 
of items looking at the Stats and the 'numFound' result for a generic '*:*' 
search. New items are being added to the feeds all the time (and old ones 
dropping off).

Is it possible for Solr to incrementally build an index of a live RSS feed 
which is changing but retain the index of its archive?

All help appreciated.
Shaun




size of data replicated

2011-11-07 Thread Xin Li
Hi, there,

I am trying to look into the performance impact of data replication on
query response time. To get a clear picture, I would like to know how
to get the size of data being replicated for each commit. Through the
admin UI, you may read a x of y G data is being replicated; however,
"y" is the total index size, instead of data being copied over. I
couldn't find the info in the solr logs either. Any idea?

Thanks,
Xin


Re: TikaEntityProcessor not working?

2011-11-07 Thread Erick Erickson
You have to provide a lot more information about what you're doing. Are
you trying to use DIH? the extracting update request handler? What
do your config files look like?

Please review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Mon, Nov 7, 2011 at 8:18 AM, kumar8anuj  wrote:
> I tried to do the same but the problem still persists and my document is not
> getting indexed. I am using Solr 3.4.0, which came with Tika 0.8; I replaced
> the core and parser jars with the 0.6 versions but the document is not getting
> indexed. Please help; nothing related to this is showing up in my logs.
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-tp856965p3486898.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: best way for sum of fields

2011-11-07 Thread Tanguy Moal

Hi again,
Since you have a custom high availability solution over your solr 
instances, I can't help much I guess... :-)


I usually rely on master/slave replication to separate index build and 
index search processes.


The fact is that resource consumption at build time and at search time are 
not necessarily the same, and therefore hardware dimensioning can be 
adapted as required.
I like to have the service-related processes isolated and easy to deploy 
wherever needed, just in case things go wrong or hardware failures occur.
Build services, on the other hand, don't have the same availability 
constraints and can be off for a while with no issue (unless near-realtime 
indexing comes into the party -- that's another thing).


In a slave configuration, the index doesn't need to commit. It simply 
replicates its data from its associated master whenever the master 
changes, and performs a reopen of the searcher. "Change" events can be 
triggered at commit, startup and/or optimize. (see 
http://wiki.apache.org/solr/SolrReplication , although you seemed to be 
not interested in this feature :) )
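For anyone who does want the master/slave setup described here, a minimal configuration sketch per the SolrReplication wiki (the file names, host, and poll interval below are illustrative, not prescriptive):

```xml
<!-- master solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- slave solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```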


Having search and build on the same host is not a bad thing in itself.
It simply depends on available resources and build vs. service load 
requirements.
For example, with a big core such as the one you have, segment merging 
can occur from time to time, which is an IO-bound operation 
(i.e. its duration depends on disk performance). Under high IO load, a 
server can become less responsive, and therefore having the service 
separated from the build could become handy at that time.


As you see, I can't tell you what makes sense and what doesn't.
It's all about what you're doing, at which frequency, etc. :-)

Regards,

Tanguy

Le 07/11/2011 12:12, stockii a écrit :

hi thanks for the big reply ;)

i had the idea of several small 5M shards too,
and i think that's the next step i have to take, because our biggest index
grows each day by avg. 50K documents.
but does it make sense to keep searcher AND updater cores on one big server? i
don't want to use replication, because with our own high-availability solution
this is not possible.

my system is split into searcher and updater cores, each with his own index.
some search requests are over all this 8 cores with distributed search.



-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores,
1 Core with 45 Million Documents other Cores<  200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486652.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: TikaEntityProcessor not working?

2011-11-07 Thread kumar8anuj
I tried to do the same, but the problem still persists and my document is not
getting indexed. I am using Solr 3.4.0, which ships with Tika 0.8; I replaced
the core and parser jars with the 0.6 versions, but the document is still not
getting indexed, and nothing related shows up in my logs. Please help.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/TikaEntityProcessor-not-working-tp856965p3486898.html
Sent from the Solr - User mailing list archive at Nabble.com.


Similar documents and advantages / disadvantages of MLT / Deduplication

2011-11-07 Thread Vadim Kisselmann
Hello folks,

I have questions about MLT and Deduplication and which would be the best
choice in my case.

Case:

I index 1000 docs, 5 of which are 95% the same (for example: copy-pasted
blog articles from different sources, with slight changes (author name,
etc.)). But they do have differences.
*Now I'd like to see 1 doc in my result set, with the other 4 marked
as similar.*

With *MLT*:
text
  5
  50
  3
  5000
  true
  text
   

With this config I get about 500 similar docs for this 1 doc, which is
unfortunately far too many.


*Deduplication*:
I now index these docs with a signature, using TextProfileSignature.


   
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature_t</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">text</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

How can I compare the created signatures?


I only want to see the 5 similar docs, nothing else.
Which of these two approaches fits my case? Can I tune MLT for my
requirement, or should I use dedupe?

Thanks and Regards
Vadim


Re: Solr, MultiValues and links...

2011-11-07 Thread Tiernan OToole
That looks promising... Will look into that a bit more.

--Tiernan

On Sat, Nov 5, 2011 at 4:07 PM, Erick Erickson wrote:

> Hmmm, MultiValues are guaranteed to be returned in the order they were
> inserted, so you might be able to do the linking yourself given the
> results.
>
> But have you considered grouping (aka field collapsing) on the ISBN number?
> If you indexed each record uniquely, that might do what you need.
>
> Best
> Erick
>
> On Fri, Nov 4, 2011 at 5:41 AM, Tiernan OToole 
> wrote:
> > Right, I'm not sure how to ask this question or what the terminology is,
> > but hopefully my explanation will help...
> >
> > We are chucking data into Solr for queries. I can't mention the exact
> > data, but the closest thing I can think of is as follows:
> >
> >
> >   - A unique ID for the Solr record (the DB ID in this case)
> >   - A not-so-unique ID (say an ISBN number)
> >   - The name of the book (multiple names for this; each has an ISBN and a
> >   unique DB ID)
> >   - A status of the book ("books" aren't the correct term, but say a book
> >   has 4 editions but keeps the ISBN; we would have 4 names in Solr, each
> >   queryable, so searching for the first edition's title will return the
> >   correct ISBN).
> >
> > Anyway, what I want to be able to do is search for a single title (say
> > Solr for Dummies) and find all instances of that title in the index. But
> > for each name, I want to be able to link the status of each title with
> > each one; there are other "statuses" for each item too...
> >
> > Anyway, I had 2 ways of doing this:
> >
> >
> >   - the first way was using multi-values, storing the names in a
> >   multi-valued list and also the statuses, but the linking doesn't seem
> >   to be working correctly...
> >   - the second way is storing each record uniquely, but with this I would
> >   need to run a second query to get all records for the ISBN...
> >
> > Any ideas which one I should be using? Any tips on this?
> >
> > Thanks.
> >
> >
> > --
> > Tiernan O'Toole
> > blog.lotas-smartman.net
> > www.geekphotographer.com
> > www.tiernanotoole.ie
> >
>
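For the archive: Erick's grouping suggestion above maps to request parameters roughly like these (the field and query values are illustrative):

```
q=title:"Solr for Dummies"&group=true&group.field=isbn&group.limit=10
```

Each group then contains all editions sharing one ISBN, so the statuses can be read per edition from a single query.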



-- 
Tiernan O'Toole
blog.lotas-smartman.net
www.geekphotographer.com
www.tiernanotoole.ie


Re: best way for sum of fields

2011-11-07 Thread stockii
hi, thanks for the big reply ;)

I had the idea of several small 5M shards too,
and I think that's the next step I have to take, because our biggest index
grows by ~50K documents per day on average.
But does it make sense to keep searcher AND updater cores on one big server? I
don't want to use replication, because it isn't possible with our own
high-availability solution.

My system is split into searcher and updater cores, each with its own index.
Some search requests go over all these 8 cores with distributed search.



-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents, other Cores < 200,000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486652.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: best way for sum of fields

2011-11-07 Thread Tanguy Moal

Hi,

If you only need to sum over "displayed" results, go with the 
post-processing-of-hits solution; that's fast and easy.
If you sum over the whole data set (i.e. your sum is not 
query-dependent), have it computed at indexing time, depending on your 
indexing workflow.


Otherwise (sum over the whole result set, query-dependent but independent 
of the displayed results), you should give sharding a try.
You generally want that when your index is too large to be searched 
quickly (see http://wiki.apache.org/solr/DistributedSearch); here the 
sum operation is part of a search query.


Basically what you need is:
- On the master host: n master instances (each being a shard)
- On the slave host: n slave instances (each being a replica of its 
master-side counterpart)


Only the slave instances will need a comfortable amount of RAM in order 
to serve queries rapidly. Slave instances can be deployed over several 
hosts if the total amount of RAM required is high.


Your main effort here might be in finding the 'n' value.
You have 45M documents in a single shard and that may be the cause of 
your issue, especially for queries returning a high number of results.

You may need to split it into more shards to achieve your goal.

This should let you reduce the time needed to perform the sum operation 
at search time (but it adds complexity at indexing time: you need to 
define a way to send documents to shard #1, #2, ..., or #n).
If you keep getting more and more documents over time, maybe you'll want 
a fixed maximum shard size (say 5M docs, if performing the sum 
on 5M docs is fast enough) and to simply add shards as required when more 
documents are to be indexed/searched. This also addresses the import issue, 
because you simply switch to a new target shard every 5M documents.
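The fixed-maximum-shard-size routing described above can be sketched in a few lines (the 5M cap and the shard naming are assumptions for illustration):

```python
MAX_SHARD_SIZE = 5_000_000  # fixed cap per shard, as suggested above

def target_shard(doc_seq: int) -> str:
    """Route the doc_seq-th document (0-based, in indexing order) to a
    shard; a new shard is opened every MAX_SHARD_SIZE documents, so only
    the last shard ever grows."""
    return "shard%d" % (doc_seq // MAX_SHARD_SIZE + 1)

print(target_shard(0))           # the first 5M documents land in shard1
print(target_shard(12_345_678))  # later documents go to newer shards
```

At search time all existing shards are listed in the shards parameter; at indexing time only the routing function decides where a document goes.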

The last shard is always the smallest.

Such sharding can involve a little overhead at search time: make sure 
you don't allow retrieval of far-away documents (start=k, where k is high 
-- see 
http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations).
-> When using the stats component, set the start and rows parameters to 0 
if you don't need the documents themselves.
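Put together, a query-dependent sum over all shards via the stats component could then look like the following request (host names, core paths and the amount field are placeholders):

```
http://searcher:8983/solr/select?q=day:20111107
  &shards=host1:8983/solr,host2:8983/solr,host3:8983/solr
  &stats=true&stats.field=amount
  &start=0&rows=0
```

The sum appears in the stats section of the response without any documents being fetched.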



After that, if you face high search load issues, you could still 
duplicate the slave host to match your load requirements, and 
load-balance your search traffic over slaves as required.


Hope this helps,

Tanguy

On 07/11/2011 09:49, stockii wrote:

sry.

I need the sum of the values of the found documents, e.g. the total amount of
one day. Each doc in the index has its own amount.

I tried something with StatsComponent, but with 48 million docs in the index
it's too slow.

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores,
1 Core with 45 Million Documents, other Cores < 200,000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486406.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: best way for sum of fields

2011-11-07 Thread stockii
Yes, I'm already using that approach in another part of my application. I had
hoped there was another way, to avoid the round trip through PHP.

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents, other Cores < 200,000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486593.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Test Framework

2011-11-07 Thread Ronak Patel


Hi,


I am trying to write JUnit test code for my Solr interaction, and while 
executing it I keep getting the following errors.
Mind you, I am using JUnit 4.7 and I have been calling super.setUp().

Here is some sample code...

    /* (non-Javadoc)
     * @see org.apache.solr.util.AbstractSolrTestCase#getSchemaFile()
     */
    @Override
    public String getSchemaFile() {
        return "classpath:schema.xml";
    }

    /* (non-Javadoc)
     * @see org.apache.solr.util.AbstractSolrTestCase#getSolrConfigFile()
     */
    @Override
    public String getSolrConfigFile() {
        return "classpath:solrconfig.xml";
    }

    @Before
    @Override
    public void setUp() throws Exception {
        super.setUp();

        this.server = new EmbeddedSolrServer(this.h.getCoreContainer(),
                this.h.getCore().getName());
    }
}

java.lang.AssertionError: ensure your setUp() calls super.setUp()!!!
at org.junit.Assert.fail(Assert.java:91)
at org.junit.Assert.assertTrue(Assert.java:43)
at org.apache.lucene.util.LuceneTestCase$1.starting(LuceneTestCase.java:413)
at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:46)
at 
org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:72)
at 
org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:231)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at 
org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:31)
at 
org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:70)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at 
org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:174)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Re: best way for sum of fields

2011-11-07 Thread pravesh
I guess this has nothing to do with the search part. You can post-process the
search results (I mean iterate through your results and sum them).
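A sketch of that post-processing against a wt=json response (the amount field name is an assumption):

```python
import json

# A trimmed example of a Solr wt=json response; in practice this JSON
# would come from an HTTP request like /select?q=...&fl=amount&wt=json
raw = """{
  "response": {
    "numFound": 3,
    "docs": [
      {"id": "1", "amount": 10.5},
      {"id": "2", "amount": 4.0},
      {"id": "3", "amount": 7.25}
    ]
  }
}"""

docs = json.loads(raw)["response"]["docs"]
total = sum(doc.get("amount", 0.0) for doc in docs)
print(total)  # -> 21.75
```

Note this only works well when the result set is small or paged; for query-dependent sums over millions of hits, the stats component or sharding (discussed elsewhere in this thread) is the better fit.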

Regds
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486536.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: to prevent number-of-matching-terms in contributing score

2011-11-07 Thread pravesh
Hi Samar,

You can write a custom Similarity implementation and override the
lengthNorm() method to return a constant value.

Then, in your schema.xml, specify your custom implementation as the default
similarity class.

But you need to rebuild your index from scratch for this to come into
effect (also set omitNorms="true" for the fields where you need this
feature).
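A sketch of such an implementation against the Lucene 3.x Similarity API (the class and package names are made up for illustration):

```java
package com.example.search;

import org.apache.lucene.search.DefaultSimilarity;

// Returns a constant length norm so the field's length no longer
// contributes to the score, as described above.
public class ConstantLengthNormSimilarity extends DefaultSimilarity {
    @Override
    public float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
}
```

Then in schema.xml: <similarity class="com.example.search.ConstantLengthNormSimilarity"/>, followed by a full reindex.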

Regds
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/to-prevent-number-of-matching-terms-in-contributing-score-tp3486373p3486512.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: to prevent number-of-matching-terms in contributing score

2011-11-07 Thread Samarendra Pratap
Hi Pravesh, thanks for your reply, but I am not asking about "omitNorms"
(an index-time parameter). I am asking how to treat multiple matches
of a term in a single field as "one" at query time.

Thanks
Thanks


On Mon, Nov 7, 2011 at 2:48 PM, pravesh  wrote:

> Did you rebuild the index from scratch? Since this is an index-time
> factor, you need to rebuild the complete index from scratch.
>
> Regds
> Pravesh
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/to-prevent-number-of-matching-terms-in-contributing-score-tp3486373p3486447.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,
Samar


Re: to prevent number-of-matching-terms in contributing score

2011-11-07 Thread pravesh
Did you rebuild the index from scratch? Since this is an index-time factor,
you need to rebuild the complete index from scratch.

Regds
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/to-prevent-number-of-matching-terms-in-contributing-score-tp3486373p3486447.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrJ - threading, http clients, connection managers

2011-11-07 Thread pravesh
>1) Is it safe to reuse a single _mgr and _client
across all 28 cores?

Both are thread-safe APIs as per the HttpClient specs. You should go ahead
with this.

Regds
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-threading-http-clients-connection-managers-tp3485012p3486436.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: best way for sum of fields

2011-11-07 Thread stockii
sry.

I need the sum of the values of the found documents, e.g. the total amount of
one day. Each doc in the index has its own amount.

I tried something with StatsComponent, but with 48 million docs in the index
it's too slow.

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents, other Cores < 200,000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3486406.html
Sent from the Solr - User mailing list archive at Nabble.com.


to prevent number-of-matching-terms in contributing score

2011-11-07 Thread Samarendra Pratap
Hi everyone!
 We are working with Solr 3.4.

 In short: if my query term matches more than one word, I want it to be
considered as one match (in a particular field).

 Details:
  Our index has a multi-valued field "category" which contains the possible
category names of a company, entered by the company's employees.
There are two companies in the index:

  1. The first company falls in category -> "wooden chairs"
  2. The second company falls in the following categories -> "chairs", "plastic
chairs", "wooden chairs"

 Now when I search for "chair" in the "category" field (along with other fields
in the "qf" parameter), the second company gets a higher score due to
multiple matches against the word "chair". As per the business logic, in the
"category" field it should be a match or no-match for score calculation,
because this field is not filled in by the end user and the length/number of
matches of its text does not add to relevance.

 We are already using "omitNorms=true", so we have prevented the length of the
field from contributing to the score, but we have been unable to prevent the
number of matching terms from contributing.
 We cannot use filters (fq) because there are other fields I am matching
on. We want "category" field matching such that "n" matches in
"category" are equivalent to one match in the "title" field.

 Can someone give me some pointers on how to achieve this? Or is there any
better way of doing it?


-- 
Regards,
Samar