Re: search result not correct in solr

2014-04-30 Thread Alexandre Rafalovitch
I can't figure out the exact question; we need a more specific example.
However, if you look in the Solr 4 Admin panel, there is an analysis screen
that shows you how the text is analyzed during indexing and during
search. Putting your words there will show you the effect of the various
components in your type definition.

If that does not help, show us the type you have now (in schema.xml)
and try to explain the problem more precisely.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Wed, Apr 30, 2014 at 11:56 AM, neha sinha nehasinha...@gmail.com wrote:
 Hi, I am trying to search with the word Ribbing and I am also getting those
 results which have R-B or RB letters in their description, but when I try to
 search with Ribbin I get the correct results... not getting any clue what to
 use in my Solr schema.xml.


 Any guidance will be helpful.



 Thanks




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Issue with solr searching : words with - not able to search

2014-04-30 Thread neha sinha
I have the same issue with my search results, and I have used solr.TextField
for this field.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Issue-with-solr-searching-words-with-not-able-to-search-tp4128549p4133845.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: timeAllowed is not being honored

2014-04-30 Thread Salman Akram
I had this issue too. timeAllowed only works for a certain phase of the
query; I think that's the 'process' part. However, if the query is taking
time in the 'prepare' phase (e.g., I think for wildcards, getting all the
possible expansions before running the query), it won't have any impact on
that. You can debug your query and confirm that.
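For anyone wanting to check this, a minimal sketch of such a debug request; the host, core name and field are assumptions, not from this thread:

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical Solr instance and core; adjust to your own setup.
params = {
    "q": "title:rib*",        # an example wildcard query
    "timeAllowed": 2000,      # ms; only caps certain phases of the search
    "debugQuery": "true",     # adds the parsed query and per-component timings
    "wt": "json",
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
print(url)
```

The timing section of the debug output then shows how long the prepare and process phases of each component took.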


On Wed, Apr 30, 2014 at 10:43 AM, Aman Tandon amantandon...@gmail.com wrote:

 Shawn, this is the first time I have raised this problem.

 My heap size is 14GB and I am not using SolrCloud currently; the 40GB index
 is replicated from the master to two slaves.

 I read somewhere that it returns the partial results computed by the query
 within the amount of time defined by this timeAllowed parameter, but that
 doesn't seem to happen.

 Here is the link :
 http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed

  *The time allowed for a search to finish. This value only applies to the
 search and not to requests in general. Time is in milliseconds. Values <= 0
 mean no time restriction. Partial results may be returned (if there are
 any). *



 With Regards
 Aman Tandon


 On Wed, Apr 30, 2014 at 10:05 AM, Shawn Heisey s...@elyograg.org wrote:

  On 4/29/2014 10:05 PM, Aman Tandon wrote:
   I am using Solr 4.2 with an index size of 40GB; while querying my index
   there are some queries which take a significant amount of time, about
   22 seconds *in the case of minmatch of 50%*. So I added the parameter
   timeAllowed=2000 to my query, but this doesn't seem to work. Please
   help me out.
 
  I remember reading that timeAllowed has some limitations about which
  stages of a query it can limit, particularly in the distributed case.
  These limitations mean that it cannot always limit the total time for a
  query.  I do not remember precisely what those limitations are, and I
  cannot find whatever it was that I was reading.
 
  When I looked through my local list archive to see if you had ever
  mentioned how much RAM you have and what the size of your Solr heap is,
  there didn't seem to be anything.  There's not enough information for me
  to know whether that 40GB is the amount of index data on a single
  SolrCloud server, or whether it's the total size of the index across all
  servers.
 
  If we leave timeAllowed alone for a moment and treat this purely as a
  performance problem, usually my questions revolve around figuring out
  whether you have enough RAM.  Here's where that conversation ends up:
 
  http://wiki.apache.org/solr/SolrPerformanceProblems
 
  I think I've probably mentioned this to you before on another thread.
 
  Thanks,
  Shawn
 
 




-- 
Regards,

Salman Akram


Re: search result not correct in solr

2014-04-30 Thread neha sinha
Thanks Alexandre, but that still doesn't help me.

I am doing a keyword search for the word Ribbing and I am also getting those
products which have the R-B or RB word in some other field. But when I do a
search for Ribbin I get the correct search results.

My field type is solr.TextField. Please find my schema.xml below:

<fieldType name="wc_text" class="solr.TextField" positionIncrementGap="100"
    sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
        maxGramSize="15" side="front"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
        protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
        maxGramSize="15" side="front"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"
        protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
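For reference, here is a rough Python sketch of what the index-time chain above does to a few inputs. It approximates only the KeywordTokenizer + LowerCaseFilter + PatternReplaceFilter + EdgeNGramFilter steps; the stop-word, trim and Snowball stemming steps are omitted, so the Admin analysis screen remains the authoritative answer:

```python
import re

def wc_text_prefix_tokens(text, min_gram=3, max_gram=15):
    # KeywordTokenizerFactory: the whole input is one token
    token = text.lower()                  # LowerCaseFilterFactory
    token = re.sub(r"[^a-z]", "", token)  # PatternReplaceFilterFactory ([^a-z] -> "")
    # EdgeNGramFilterFactory side="front": front prefixes of length 3..15
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

print(wc_text_prefix_tokens("Ribbing"))  # ['rib', 'ribb', 'ribbi', 'ribbin', 'ribbing']
print(wc_text_prefix_tokens("Ribbin"))   # ['rib', 'ribb', 'ribbi', 'ribbin']
print(wc_text_prefix_tokens("R-B"))      # [] -- "rb" is shorter than minGramSize
```

Note how Ribbing and Ribbin share almost all of their tokens, which is why they match nearly the same documents. Whether short leftovers like "rb" survive n-gramming depends on the EdgeNGram implementation in your Solr version, so verify the real token output in the analysis screen.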



--
View this message in context: 
http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841p4133848.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: search result not correct in solr

2014-04-30 Thread Alexandre Rafalovitch
On Wed, Apr 30, 2014 at 1:29 PM, neha sinha nehasinha...@gmail.com wrote:
 <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
     maxGramSize="15" side="front"/>
 <filter class="solr.SnowballPorterFilterFactory" language="English"
     protected="protwords.txt"/>


I think combining NGrams with Porter filters, especially in that order,
will do really weird things.

Have you tried using the Admin console? You really want to see what
happens to different words when you run them through your pipelines,
probably with debug mode enabled to see what effect the NGram filter has
on positions as well.

Oh, and if you modified your index chain, did you reindex completely?
You must; otherwise you have old processed tokens lying around. On the
other hand, you can experiment with the filter definition and not reindex
(only reload the core) until you see the text flowing through and being
indexed/queried correctly.

You are quite far from the normal scenario with your setup, so you are
unlikely to get a magic answer; more likely pointers towards the
tools that solve the problem.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


Sorting is not correct in autosuggest

2014-04-30 Thread neha sinha
Hi All

In my autosuggest page, sorting is not correct for the suggestions I am
getting.
However, the suggestions themselves are all correct.





Any guidance will be helpful



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sorting-is-not-correct-in-autosuggest-tp4133859.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: search result not correct in solr

2014-04-30 Thread neha sinha
Hello Alex,


Yes, I reindexed completely.

I am new to Solr, so I do not have much idea of all the filters. Can you
suggest some filters which I can try?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841p4133861.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: search result not correct in solr

2014-04-30 Thread Anshum Gupta
Hi Neha,

There are a bunch of filters available, and it wouldn't make sense to
suggest anything unless we know what the intention is. As they say, if you
don't know where you're going, any road will take you there.

If you want the most basic case of being able to search for standard terms
in your documents, I'd recommend you start fresh and look at the example
schema. Using the basic field types for your field should do the job for
you, but again, I don't really know what the intended behavior is.

Also, you should look at the official reference guide:
https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

Be sure to look up the guide for the version of Solr you're using.



On Wed, Apr 30, 2014 at 12:43 AM, neha sinha nehasinha...@gmail.com wrote:

 Hello Alex


 Yes I reindex completely.

 I am new to solr so donot have much idea of all the filters.Can u suggest
 some filters which i can try?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841p4133861.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 

Anshum Gupta
http://www.anshumgupta.net


Problem indexing subentities from a multivalued field

2014-04-30 Thread Jordi Martin
Hi there

I have a problem trying to create subentities during the data import.

I have defined the following data-config

<entity name="efl" processor="FileListEntityProcessor" baseDir="/path/"
    fileName=".*.xml$" recursive="false" rootEntity="false" dataSource="null">
  <entity name="subefl" dataSource="ds-3" pk="id"
      processor="XPathEntityProcessor" forEach="/export/doc_debur"
      transformer="DateFormatTransformer,RegexTransformer"
      url="${efl.fileAbsolutePath}" stream="true" onError="skip">
...
...
...
    <field column="thk" xpath="/export/doc_debur/thematization_keys"/>
    <entity dataSource="ds-1" name="thematization_keys"
        query="select tmid as thematization_keys from thematization
               where tmid='${subefl.thk}'"/>

...
...
  </entity>
</entity>

Thk is a multivalued string field,
and thematization_keys is also defined as a multivalued string field.

What I want is to run the query for each one of the values of thk and store all
the results in the thematization_keys field.

Could anyone help me?

Thanks in advance
Jordi



Re: timeAllowed is not being honored

2014-04-30 Thread Aman Tandon
Hi Salman,

Here is my debug query dump, please help! I am unable to find the
wildcards in it.

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <bool name="partialResults">true</bool>
    <int name="status">0</int>
    <int name="QTime">10080</int>
  </lst>
  <result name="response" numFound="976303" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="city">
        <int name="delhi ncr">884159</int>
        <int name="delhi">629472</int>
        <int name="mumbai">491426</int>
        <int name="ahmedabad">259356</int>
        <int name="chennai">259029</int>
        <int name="bengaluru">257193</int>
        <int name="kolkata">195077</int>
        <int name="pune">193569</int>
        <int name="hyderabad">179369</int>
        <int name="jaipur">115356</int>
        <int name="coimbatore">111644</int>
        <int name="noida">86794</int>
        <int name="surat">80621</int>
        <int name="gurgaon">72815</int>
        <int name="rajkot">68982</int>
        <int name="vadodara">65082</int>
        <int name="ludhiana">63244</int>
        <int name="thane">55091</int>
        <int name="indore">50225</int>
        <int name="ghaziabad">49756</int>
        <int name="faridabad">45322</int>
        <int name="navi mumbai">40127</int>
        <int name="tiruppur">37639</int>
        <int name="nagpur">37126</int>
        <int name="kochi">32874</int>
      </lst>
      <lst name="datatype">
        <int name="product">966816</int>
        <int name="offer">6003</int>
        <int name="company">3484</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
  <lst name="debug">
    <str name="rawquerystring">misc items</str>
    <str name="querystring">misc items</str>
    <str name="parsedquery">BoostedQuery(boost(+(((titlex:misc^1.5 |
smalldesc:misc | titlews:misc^0.5 | city:misc | usrpcatname:misc |
mcatnametext:misc^0.2)~0.3 (titlex:item^1.5 | smalldesc:item |
titlews:items^0.5 | city:items | usrpcatname:item |
mcatnametext:item^0.2)~0.3)~1) (mcatnametext:"misc item"^0.5)~0.3
(titlews:"misc items")~0.3 (titlex:"misc item"^3.0)~0.3
(smalldesc:"misc item"^2.0)~0.3 (usrpcatname:"misc item")~0.3
(),product(map(query(+(titlex:"item imsw")~0.3
(),def=0.0),0.0,0.0,1.0),map(query(+(titlex:"misc item imsw")~0.3
(),def=0.0),0.0,0.0,1.0),map(int(sdesclen),0.0,150.0,1.0),map(int(sdesclen),0.0,0.0,0.1),map(int(CustTypeWt),699.0,699.0,1.2),map(int(CustTypeWt),199.0,199.0,1.3),map(int(CustTypeWt),0.0,179.0,1.35),1.0/(3.16E-11*float(ms(const(1398852652419),date(lastactiondatet)))+1.0),map(ms(const(1398852652419),date(blpurchasedate)),0.0,2.6E9,1.15),map(query(+(attribs:hot)~0.3
(titlex:hot^3.0 | smalldesc:hot^2.0 | titlews:hot | city:hot |
usrpcatname:hot |
mcatnametext:hot^0.5)~0.3,def=0.0),0.0,0.0,1.0),map(query(+(attribs:dupimg)~0.3
(titlex:dupimg^3.0 | smalldesc:dupimg^2.0 | titlews:dupimg |
city:dupimg | usrpcatname:dupimg |
mcatnametext:dupimg^0.5)~0.3,def=0.0),0.0,0.0,1.0),map(query(+(isphoto:T)~0.3
(),def=0.0),0.0,0.0,0.1</str>
    <str name="parsedquery_toString">boost(+(((titlex:misc^1.5 | smalldesc:misc
| titlews:misc^0.5 | city:misc | usrpcatname:misc |
mcatnametext:misc^0.2)~0.3 (titlex:item^1.5 | smalldesc:item |
titlews:items^0.5 | city:items | usrpcatname:item |
mcatnametext:item^0.2)~0.3)~1) (mcatnametext:"misc item"^0.5)~0.3
(titlews:"misc items")~0.3 (titlex:"misc item"^3.0)~0.3
(smalldesc:"misc item"^2.0)~0.3 (usrpcatname:"misc item")~0.3
(),product(map(query(+(titlex:"item imsw")~0.3
(),def=0.0),0.0,0.0,1.0),map(query(+(titlex:"misc item imsw")~0.3
(),def=0.0),0.0,0.0,1.0),map(int(sdesclen),0.0,150.0,1.0),map(int(sdesclen),0.0,0.0,0.1),map(int(CustTypeWt),699.0,699.0,1.2),map(int(CustTypeWt),199.0,199.0,1.3),map(int(CustTypeWt),0.0,179.0,1.35),1.0/(3.16E-11*float(ms(const(1398852652419),date(lastactiondatet)))+1.0),map(ms(const(1398852652419),date(blpurchasedate)),0.0,2.6E9,1.15),map(query(+(attribs:hot)~0.3
(titlex:hot^3.0 | smalldesc:hot^2.0 | titlews:hot | city:hot |
usrpcatname:hot |
mcatnametext:hot^0.5)~0.3,def=0.0),0.0,0.0,1.0),map(query(+(attribs:dupimg)~0.3
(titlex:dupimg^3.0 | smalldesc:dupimg^2.0 | titlews:dupimg |
city:dupimg | usrpcatname:dupimg |
mcatnametext:dupimg^0.5)~0.3,def=0.0),0.0,0.0,1.0),map(query(+(isphoto:T)~0.3
(),def=0.0),0.0,0.0,0.1)))</str>
    <lst name="explain"/>
    <str name="QParser">SynonymExpandingExtendedDismaxQParser</str>
    <null name="altquerystring"/>
    <null name="boost_queries"/>
    <arr name="parsed_boost_queries"/>
    <null name="boostfuncs"/>
    <arr name="filter_queries">
      <str>{!tag=cityf}latlong:"Intersects(Circle(28.63576,77.22445
d=2.248))"</str>
      <str>attribs:(locprefglobal locprefnational
locprefcity)</str>
      <str>+((+datatype:product +attribs:(aprstatus20
aprstatus40 aprstatus50) +aggregate:true -attribs:liststatusnfl
+((+countryiso:IN +isfcp:true) CustTypeWt:[149 TO 1499]))
(+datatype:offer +iildisplayflag:true) (+datatype:company
-attribs:liststatusnfl +((+countryiso:IN +isfcp:true) CustTypeWt:[149
TO 1499]))) -attribs:liststatusdnf</str>
    </arr>
    <arr name="parsed_filter_queries">
      <str>ConstantScore(org.apache.lucene.spatial.prefix.IntersectsPrefixTreeFilter@414cd6c2)</str>
      <str>attribs:locprefglobal attribs:locprefnational
attribs:locprefcity</str>
      <str>+((+datatype:product
+(attribs:aprstatus20 attribs:aprstatus40 attribs:aprstatus50)
+aggregate:true -attribs:liststatusnfl +((+countryiso:IN +isfcp:true)



Re: Problem indexing subentities from a multivalued field

2014-04-30 Thread Alexandre Rafalovitch
This is a little complicated. What are you getting now with this
setup? Is everything else actually working? I would have thought that
even --dataSource=null-- would cause issues.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Wed, Apr 30, 2014 at 4:48 PM, Jordi Martin jordi.mar...@indicator.es wrote:
 Hi there

 I have a problem trying to create subentities during the data import.

 I have defined the following data-config

 <entity name="efl" processor="FileListEntityProcessor" baseDir="/path/"
     fileName=".*.xml$" recursive="false" rootEntity="false" dataSource="null">
   <entity name="subefl" dataSource="ds-3" pk="id"
       processor="XPathEntityProcessor" forEach="/export/doc_debur"
       transformer="DateFormatTransformer,RegexTransformer"
       url="${efl.fileAbsolutePath}" stream="true" onError="skip">
 ...
 ...
 ...
     <field column="thk" xpath="/export/doc_debur/thematization_keys"/>
     <entity dataSource="ds-1" name="thematization_keys"
         query="select tmid as thematization_keys from thematization
                where tmid='${subefl.thk}'"/>

 ...
 ...
   </entity>
 </entity>

 Thk is a multivalued string field
 And thematization_keys is also defined as a multivalued string field

 What I want is to make a query for each one of the values of thk and store 
 all the results in the thematizations_keys field

 Could anyone help me?

 Thanks in advance
 Jordi



RE: Problem indexing subentities from a multivalued field

2014-04-30 Thread Jordi Martin
Playing a bit with this data-config I get two different results:

If thk is defined in the schema.xml, I get all the values for it indexed but
the subentity thematization_keys is not processed.

On the other hand, if I do not define thk in the schema.xml file, only the last
value for thk is stored and then the thematization_keys is processed for that
value.



-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Wednesday, 30 April 2014 12:35
To: solr-user@lucene.apache.org
Subject: Re: Problem indexing subentities from a multivalued field

This is a little complicated. What are you getting now with this setup? Is 
everything else actually working? I would have thought that even 
--dataSource=null-- would cause issues.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency


On Wed, Apr 30, 2014 at 4:48 PM, Jordi Martin jordi.mar...@indicator.es wrote:
 Hi there

 I have a problem trying to create subentities during the data import.

 I have defined the following data-config

 <entity name="efl" processor="FileListEntityProcessor" baseDir="/path/"
     fileName=".*.xml$" recursive="false" rootEntity="false" dataSource="null">
   <entity name="subefl" dataSource="ds-3" pk="id"
       processor="XPathEntityProcessor" forEach="/export/doc_debur"
       transformer="DateFormatTransformer,RegexTransformer"
       url="${efl.fileAbsolutePath}" stream="true" onError="skip">
 ...
 ...
 ...
     <field column="thk" xpath="/export/doc_debur/thematization_keys"/>
     <entity dataSource="ds-1" name="thematization_keys"
         query="select tmid as thematization_keys from thematization
                where tmid='${subefl.thk}'"/>

 ...
 ...
   </entity>
 </entity>

 Thk is a multivalued string field
 And thematization_keys is also defined as a multivalued string field

 What I want is to make a query for each one of the values of thk and 
 store all the results in the thematizations_keys field

 Could anyone help me?

 Thanks in advance
 Jordi



Re: timeAllowed is not being honored

2014-04-30 Thread Mikhail Khludnev
On Wed, Apr 30, 2014 at 2:16 PM, Aman Tandon amantandon...@gmail.com wrote:

  <lst name="query"><double name="time">3337.0</double></lst>
  <lst name="facet"><double name="time">6739.0</double></lst>


Most time is spent in facet counting. FacetComponent doesn't check
timeAllowed right now. You can try to experiment with facet.method=enum, or
even with https://issues.apache.org/jira/browse/SOLR-5725, or try to
distribute the search with SolrCloud. AFAIK, you can't employ threads to
speed up multivalued facets.
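A sketch of a request trying that suggestion; the facet field "city" is taken from the debug dump earlier in this thread, everything else is a placeholder:

```python
from urllib.parse import urlencode

# Hypothetical request parameters illustrating facet.method=enum.
params = [
    ("q", "misc items"),
    ("facet", "true"),
    ("facet.field", "city"),
    ("facet.method", "enum"),  # enumerate terms instead of using the field cache
]
query_string = urlencode(params)
print(query_string)
```

facet.method=enum tends to help when a field has relatively few distinct values; for a high-cardinality field like city it needs measuring either way.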

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Error initializing QueryElevationComponent

2014-04-30 Thread Geepalem
Hi Team,

I am getting the error null:org.apache.solr.common.SolrException: SolrCore
'master' is not available due to init failure: Error initializing
QueryElevationComponent.

Please check the configurations below.

elevate.xml
--

<elevate>
 <query text="analog">
  <doc
id="sitecore://master/{137f5eb3-eb84-4165-bef0-5be1fbbc3201}?lang=en&ver=1"/>
 </query>
</elevate>


Schema.xml
---
 <field name="_uniqueid" type="string" indexed="true" stored="true"
    required="true" />


SolrConfig.xml
---

 <arr name="last-components">
  <str>spellcheck1</str>
  <str>elevator</str>
 </arr>

I am adding the elevator to the default request handler. This handler also
uses the spellcheck1 component.

  <searchComponent name="elevator"
      class="org.apache.solr.handler.component.QueryElevationComponent">
    <str name="queryFieldType">string</str>
    <str name="config-file">elevate.xml</str>
  </searchComponent>



So now I am getting the below error and the core itself is not loading.
If I change the id in elevate.xml to <doc id="bce22a40d2be4cd791ed6bf4b88d0450"/>
instead of <doc
id="sitecore://master/{137f5eb3-eb84-4165-bef0-5be1fbbc3201}?lang=en&ver=1"/>
then the error does not occur, but the results are not as expected.

What is wrong with the value
sitecore://master/{137f5eb3-eb84-4165-bef0-5be1fbbc3201}?lang=en&ver=1 ?

Please suggest or guide me on how to make it work.
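Judging from the SAXParseException in the trace below, the problem is most likely the raw `&` in `?lang=en&ver=1`: inside an XML attribute value, `&` must be escaped as `&amp;`. A small stdlib demonstration (the surrounding elevate.xml structure is reproduced from the message above):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

doc_id = "sitecore://master/{137f5eb3-eb84-4165-bef0-5be1fbbc3201}?lang=en&ver=1"
template = '<elevate><query text="analog"><doc id="%s"/></query></elevate>'

# A raw "&" starts an entity reference, hence "entity ver must end with ';'":
try:
    ET.fromstring(template % doc_id)
except ET.ParseError as e:
    print("not well-formed:", e)

# Escaping the attribute value ("&" -> "&amp;") makes the file parse;
# Solr sees the original, unescaped id when it reads the value back.
root = ET.fromstring(template % escape(doc_id))
print(root.find("query/doc").get("id"))  # ends with ?lang=en&ver=1
```

So the id value itself is fine; it just has to be XML-escaped in elevate.xml.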

Complete Error details



null:org.apache.solr.common.SolrException: SolrCore 'master' is not
available due to init failure: Error initializing QueryElevationComponent.
at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:783)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:287)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at
org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1041)
at
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:603)
at
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: Error initializing
QueryElevationComponent.
at org.apache.solr.core.SolrCore.init(SolrCore.java:834)
at org.apache.solr.core.SolrCore.init(SolrCore.java:625)
at
org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:522)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:557)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
... 3 more
Caused by: org.apache.solr.common.SolrException: Error initializing
QueryElevationComponent.
at
org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:241)
at
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:601)
at org.apache.solr.core.SolrCore.init(SolrCore.java:829)
... 11 more
Caused by: org.apache.solr.common.SolrException:
org.xml.sax.SAXParseException; systemId: solrres:/elevate.xml; lineNumber:
28; columnNumber: 80; The reference to entity ver must end with the ';'
delimiter.
at org.apache.solr.core.Config.init(Config.java:148)
at org.apache.solr.core.Config.init(Config.java:86)
at org.apache.solr.core.Config.init(Config.java:81)
at
org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:223)
... 13 more
Caused by: org.xml.sax.SAXParseException; systemId: solrres:/elevate.xml;
lineNumber: 28; columnNumber: 80; The reference to entity ver must end
with the ';' delimiter.
at

Re: timeAllowed is not being honored

2014-04-30 Thread Shawn Heisey
On 4/29/2014 11:43 PM, Aman Tandon wrote:
 My heap size is 14GB and  i am not using solr cloud currently, 40GB index
 is replicated from master to two slaves.
 
 I read somewhere that it return the partial results which is computed by
 the query in that specified amount of time which is defined by this
 timeAllowed parameter, but it doesn't seems to happen.

Mikhail Khludnev has replied and explained why timeAllowed isn't
stopping the query and returning partial results.

A 14GB heap is quite large.  If you aren't starting Solr with garbage
collection tuning parameters, long GC pauses *will* be happening, and
that will make some of your queries take a really long time.  The wiki
page I sent has a section about garbage collection and a link showing
the GC tuning parameters that I use.

You didn't indicate how much total RAM you have.  If your total RAM is
16GB, that's definitely not enough for a 14GB heap and a 40GB index.
32GB of total RAM might be enough, but it also might not be.  A perfect
world RAM size for this setup would be at least 54GB -- the total of
heap plus index size, not counting the small number of megabytes that
the OS and its basic services take.
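Shawn's rule of thumb above can be written down as a trivial helper (numbers from this thread; it deliberately ignores OS overhead and other processes on the box):

```python
def ideal_ram_gb(heap_gb, index_gb):
    """Rule of thumb from this thread: RAM for the Java heap plus enough
    OS page cache to hold the entire on-disk index."""
    return heap_gb + index_gb

# 14GB heap + 40GB index, as described by Aman:
print(ideal_ram_gb(heap_gb=14, index_gb=40))  # 54
```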

Thanks,
Shawn



Re: timeAllowed is not being honored

2014-04-30 Thread Jeff Wartes

It's not just FacetComponent, here's the original feature ticket for
timeAllowed:
https://issues.apache.org/jira/browse/SOLR-502


As I read it, timeAllowed only limits the time spent actually getting
documents, not the time spent figuring out what data to get or how. I
think that means the primary use-case is serving as a guard against
excessive paging.



On 4/30/14, 4:49 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:

On Wed, Apr 30, 2014 at 2:16 PM, Aman Tandon amantandon...@gmail.com wrote:

  <lst name="query"><double name="time">3337.0</double></lst>
  <lst name="facet"><double name="time">6739.0</double></lst>


Most time is spent in facet counting. FacetComponent doesn't check
timeAllowed right now. You can try to experiment with facet.method=enum, or
even with https://issues.apache.org/jira/browse/SOLR-5725, or try to
distribute the search with SolrCloud. AFAIK, you can't employ threads to
speed up multivalued facets.

-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com



Re: merge shards indexes

2014-04-30 Thread Erick Erickson
Is this SolrCloud? If so, you have to be quite careful to get the
expected results; in fact, I'm not at all sure you can and still have
a consistent index.

Best,
Erick

On Mon, Apr 28, 2014 at 5:33 AM, Dmitry Kan solrexp...@gmail.com wrote:
 Yes, according to this documentation:
 https://wiki.apache.org/solr/MergingSolrIndexes


 On Mon, Apr 28, 2014 at 12:14 PM, Gastone Penzo gastone.pe...@gmail.com wrote:

 Hi,
 it's possible to merge 2 shards indexes into one?

 Thank you

 --
 *Gastone Penzo*




 --
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
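For reference, a merge via the CoreAdmin API described on that wiki page looks roughly like the request below; the host, target core name and index directories are hypothetical:

```python
from urllib.parse import urlencode

# Hypothetical paths and core name -- substitute your own.
params = [
    ("action", "mergeindexes"),
    ("core", "core0"),                       # target core receiving the merge
    ("indexDir", "/opt/solr/core1/data/index"),
    ("indexDir", "/opt/solr/core2/data/index"),
]
url = "http://localhost:8983/solr/admin/cores?" + urlencode(params)
print(url)
```

Per the wiki page, a commit on the target core is needed afterwards so the merged documents become visible.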


Re: Stemming not working with wildcard search

2014-04-30 Thread Erick Erickson
Did you re-index? And what do you get when adding debug=query? That
should show you the parsed query. Have you looked at the results of
the admin/analysis page? That tool is invaluable for seeing what the
actual transformations are.

Best,
Erick

On Mon, Apr 28, 2014 at 11:41 AM, Geepalem naresh.geepa...@yahoo.com wrote:
 Hi Ahmet,

 Thanks for your prompt response!

 I have added the filters which you specified but it is still not working.
 Below is the field's query analyzer:

  <analyzer type="query">
     <tokenizer class="solr.StandardTokenizerFactory" />
     <filter class="solr.LowerCaseFilterFactory" />

     <filter class="solr.KeywordRepeatFilterFactory"/>
     <filter class="solr.PorterStemFilterFactory"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

 http://localhost:8080/solr/master/select?q=page_title_t:*products*
 http://localhost:8080/solr/master/select?q=page_title_t:*product*


 Please let me know if I am doing anything wrong.

 Thanks,
 G. Naresh Kumar



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Stemming-not-working-with-wildcard-search-tp4133382p4133556.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Server Infrastructure Config

2014-04-30 Thread Erick Erickson
Impossible to answer even if you gave much more detailed information;
you need to prototype and push one of your machines until it falls
over, then extrapolate. See:
http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Tue, Apr 29, 2014 at 7:41 AM, EXTERNAL Taminidi Ravi (ETI,
Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com
wrote:
 Hi, can someone share or point me to information on the Solr server
 environment for production?

 Approximately, we have 40 collections, with sizes from 300MB to 8GB (for each
 collection) and a total of about 100GB. The average growth of the total size
 may be 2-5GB / year.

 We want the best performance for at least 1000-1 concurrent users.

 Thanks

 Ravi


Re: Sorting is not correct in autosuggest

2014-04-30 Thread Erick Erickson
Please review:

http://wiki.apache.org/solr/UsingMailingLists

You've given us virtually no information here.

Best,
Erick

On Wed, Apr 30, 2014 at 12:35 AM, neha sinha nehasinha...@gmail.com wrote:
 Hi All

 In my auto suggest page sorting is not correct for the suggestions i am
 getting.
 However suggestions are all correct.





 Any guidance will be helpful



 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Sorting-is-not-correct-in-autosuggest-tp4133859.html
 Sent from the Solr - User mailing list archive at Nabble.com.


Re: search result not correct in solr

2014-04-30 Thread Erick Erickson
Neha:

You _really_ need to get familiar with the admin/analysis page in the
Solr admin UI. It shows you, step-by-step, what each tokenizer and
filter in your analysis chain does. It'll save you a world of pain :).

Best,
Erick

P.S. unless you care about a bunch of really gory detail, un-check the
verbose checkbox!

On Wed, Apr 30, 2014 at 12:55 AM, Anshum Gupta ans...@anshumgupta.net wrote:
 Hi Neha,

 There are a bunch of filters available and it wouldn't make sense to
 suggest anything unless we know what's the intention. As they say, if you
 don't know where you're going, any road will take you there.

 If you want the most basic cases of being able to search for standard terms
 in your documents, I'd recommend you start fresh and look up the example
 schema. Using the basic fields types for your field should do the job for
 you, but again, I don't really know what's the intended behavior.

 Also, you should look at the official reference guide:
 https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

 Be sure to look up the guide for the version of Solr you're using.



 On Wed, Apr 30, 2014 at 12:43 AM, neha sinha nehasinha...@gmail.com wrote:

 Hello Alex


 Yes I reindex completely.

 I am new to solr so donot have much idea of all the filters.Can u suggest
 some filters which i can try?



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841p4133861.html
 Sent from the Solr - User mailing list archive at Nabble.com.




 --

 Anshum Gupta
 http://www.anshumgupta.net


Re: When not to use NRTCachingDirectory and what to use instead.

2014-04-30 Thread Jeff Wartes


On 4/19/14, 6:51 AM, Ken Krugler kkrugler_li...@transpac.com wrote:

The code I see seems to be using an FSDirectory, or is there another
layer of wrapping going on here?

return new NRTCachingDirectory(FSDirectory.open(new File(path)),
maxMergeSizeMB, maxCachedMB);


I was also curious about this subject. Not enough to test anything, but
enough to look at the code too.

FSDirectory.open picks one of MMapDirectory, SimpleFSDirectory and
NIOFSDirectory in that order of preference based on what it thinks your
system will support.
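
Jeff's reading of the selection logic can be sketched in a few lines — a rough Python model of FSDirectory.open()'s preference order, based on reading the Lucene 4.x source, not an authoritative restatement of it:

```python
import platform
import struct

def pick_directory_impl(os_name=None, is_64bit=None):
    """Rough model of Lucene FSDirectory.open()'s preference order.

    Sketch only: the real check also considers whether the JRE
    supports unmapping memory-mapped buffers.
    """
    os_name = os_name if os_name is not None else platform.system()
    if is_64bit is None:
        is_64bit = struct.calcsize("P") == 8  # pointer width in bytes
    if is_64bit:
        # 64-bit address space: memory-mapping the index is safe and fast
        return "MMapDirectory"
    if os_name == "Windows":
        # 32-bit Windows: NIO positional reads suffer from a known JVM bug
        return "SimpleFSDirectory"
    return "NIOFSDirectory"
```

So on a typical 64-bit server you end up with an NRTCachingDirectory wrapping an MMapDirectory.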

There's still the possibility that the added caching functionality slows
down bulk index operations, but setting that aside, it does look like
NRTCachingDirectoryFactory is almost always the best choice.



Re: saving user actions on item in solr for later retrieval

2014-04-30 Thread nolim
Thank you, we will check it out.
 On Apr 29, 2014 9:28 PM, iorixxx [via Lucene] 
ml-node+s472066n4133796...@n3.nabble.com wrote:

 Hi Nolim,

 Actually EFF is searchable. See my comments at the end of the page


 https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes

 Ahmet



 On Tuesday, April 29, 2014 9:07 PM, nolim [hidden email] wrote:
 Thank you, it was interesting and I have learned some new things in solr
 :)

 But the External File Field isn't a good option because the field is
 unsearchable which it very important to us.
 We think about the first option (updating documents in Solr) but performing
 commit only every 10 minutes - if we would like to retrieve the value in
 real time we can use RealTimeGet.

 Maybe you have other suggestion?




 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/saving-user-actions-on-item-in-solr-for-later-retrieval-tp4133558p4133793.html

 Sent from the Solr - User mailing list archive at Nabble.com.








--
View this message in context: 
http://lucene.472066.n3.nabble.com/saving-user-actions-on-item-in-solr-for-later-retrieval-tp4133558p4133955.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: saving user actions on item in solr for later retrieval

2014-04-30 Thread Mikhail Khludnev
is there somebody from LucidWorks who can refer to Click Score Relevance
Framework in LucidWorks Search?


On Mon, Apr 28, 2014 at 10:48 PM, nolim alony...@gmail.com wrote:

 Hi,
 We are using solr in production system for around ~500 users and we have
 around ~1 queries per day.
 Our users' search topics are mostly static and repeat themselves over
 time.

 We have in our system an option to specify specific search subject (we
 also call it specific information need) and most of our users are using
 this option.
 We keep in our system logs each query and document retrieved from each
 information need
 and the user can also give feedback if the document is relevant for his
 information need.

 We also have a special query expansion technique and a diversity algorithm
 based on MMR.

 We want to use this information from logs as a data set for training our
 ranking system
 and performing Learning To Rank for each information need or cluster of
 information needs.
 We also want to give the user the option filter by relevant and read
 based on his actions\friends actions in the same topic.
 When he runs a query again or similar one he can skip already read
 documents. That's an important requirement to our users.

 We think about 2 possibilities to implement it:
 1. Updating each item in solr and creating 2 fields named: read,
 relevant.
 Each field is multivalue field with the corresponding label of the
 information need.
 When the user reads a document an update is sent to solr and the field
 read gets a label with
 the information need the user is working on...
 Will cause update when each item is read by user (still nothing compare to
 new items coming in each day).
 We are saving information that belongs to the application in solr which
 may be wrong architecture.

 2. Save the information in a DB, and then perform filtering on the
 retrieved results.
 This option is much more complicated (we now have fields that aren't in Solr
 and the user uses them for search). We won't get facets, autocomplete and
 other nice stuff that a regular field in Solr can have.
 It costs in performance; we can't easily express: give me the top 10 documents
 that answer the query and are unread for the information need, and it means
 more complicated code to maintain.

 3. Do you have more ideas?

 Which of those options is the better?

 Thanks in advance!



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/saving-user-actions-on-item-in-solr-for-later-retrieval-tp4133558.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
 mkhlud...@griddynamics.com


Denormalize or use multivalued field for nested data?

2014-04-30 Thread Utkarsh Sengar
I have to modify a schema where I can attach nested pricing per store
information for a product. For example:

10010137332: {
   title: "iPad 64gb",
   description: "iPad 64gb with retina",
   pricing: {
      merchantid64354: {
         locationid643: "USD|600",
         locationid6436: "USD|600"
      },
      merchantid343: {
         locationid1345: "USD|600",
         locationid4353: "USD|600"
      }
   }
}


This is what is suggested all over the internet:
Denormalize it: In my case, I will end up with total number of columns =
total locations with a price which is about 100k. I don't think having 100k
columns for 60M products is a good idea.

Are there any better ways of handling this?
I am trying to figure out multivalue field but as far as I understand it,
it can only be used as a flag but cannot be used to get a value
associated to a key.

Based on this answer, solr 4.5+ supports nested documents:
http://stackoverflow.com/a/5585891/231917 but I am currently on 4.4.



-- 
Thanks,
-Utkarsh


Shards don't return documents in same order

2014-04-30 Thread Francois Perron
Hi guys,

  I have a small SolrCloud setup (3 servers, 1 collection with 1 shard and 3 
replicas).  In my schema, I have an alphaOnlySort field with a copyField.

This is a part of my managed-schema :

<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_uid" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="event_id" type="string" indexed="true" stored="true"/>
<field name="event_name" type="text_general" indexed="true" stored="true"/>
<field name="event_name_sort" type="alphaOnlySort"/>

with the copyfield

  <copyField source="event_name" dest="event_name_sort"/>


The problem is : I query my collection with a sort on my alphasort field but on 
one of my servers, the sort order is not the same.

On server 1 and 2, I have this result :

<doc>
  <str name="event_name">MB20140410A</str>
</doc>
<doc>
  <str name="event_name">MB20140410A-New</str>
</doc>
<doc>
  <str name="event_name">MB20140411A</str>
</doc>



and on the third one, this :

<doc>
  <str name="event_name">MB20140410A</str>
</doc>
<doc>
  <str name="event_name">MB20140411A</str>
</doc>
<doc>
  <str name="event_name">MB20140410A-New</str>
</doc>


The doc named MB20140411A should be at the end ...

Any idea ?

Regards


Re: Denormalize or use multivalued field for nested data?

2014-04-30 Thread Erick Erickson
I think you are misunderstanding denormalize in this context. It
still may not be what you want to do for other reasons, but the usual
idea is to replicate the parent info in each of the children, so you'd
have something like:


doc1 = title:iPad 64gb description: iPad 64gb with retina
merchantid:343 locationid: 1345 cost: USD|600

doc2 = title:iPad 64gb description: iPad 64gb with retina
merchantid:343 locationid: 4353 cost: USD|600

And so on.
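
As a sketch, the flattening Erick describes turns the nested pricing map into one flat document per (merchant, location) pair, with the parent fields repeated in each child (field names are illustrative):

```python
def denormalize(product_id, parent_fields, pricing):
    """Flatten a nested per-merchant/per-location pricing map into one
    flat document per (merchant, location) pair, repeating the parent
    fields (title, description, ...) in each child document."""
    docs = []
    for merchant_id, locations in pricing.items():
        for location_id, price in locations.items():
            doc = dict(parent_fields)  # copy the parent fields
            doc.update({
                "id": "%s-%s-%s" % (product_id, merchant_id, location_id),
                "merchantid": merchant_id,
                "locationid": location_id,
                "cost": price,
            })
            docs.append(doc)
    return docs

flat = denormalize(
    "10010137332",
    {"title": "iPad 64gb", "description": "iPad 64gb with retina"},
    {"343": {"1345": "USD|600", "4353": "USD|600"}},
)
# flat holds two documents, one per location, each carrying the title
```

Note the synthetic `id` per child: each (merchant, location) row needs its own unique key once it becomes a standalone document.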

Best,
Erick

On Wed, Apr 30, 2014 at 12:24 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote:
 I have to modify a schema where I can attach nested pricing per store
 information for a product. For example:

 10010137332:{
title:iPad 64gb
description: iPad 64gb with retina
pricing:{
 merchantid64354:{
   locationid643:{
  USD|600
   }
   locationid6436:{
  USD|600
   }
 }
 merchantid343:{
   locationid1345:{
  USD|600
   }
   locationid4353:{
  USD|600
   }
 }
}
 }


 This is what is suggested all over the internet:
 Denormalize it: In my case, I will end up with total number of columns =
 total locations with a price which is about 100k. I don't think having 100k
 columns for 60M products is a good idea.

 Are there any better ways of handling this?
 I am trying to figure out multivalue field but as far as I understand it,
 it can only be used as a flag but cannot be used to get a value
 associated to a key.

 Based on this answer, solr 4.5+ supports nested documents:
 http://stackoverflow.com/a/5585891/231917 but I am currently on 4.4.



 --
 Thanks,
 -Utkarsh


Re: Denormalize or use multivalued field for nested data?

2014-04-30 Thread Anshum Gupta
Block joins could be what you're looking for if you can upgrade to 4.5+ [
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
]

I'd recommend an upgrade but if that's not possible, replicating the parent
information is the way to go.




On Wed, Apr 30, 2014 at 12:24 PM, Utkarsh Sengar utkarsh2...@gmail.comwrote:

 I have to modify a schema where I can attach nested pricing per store
 information for a product. For example:

 10010137332:{
title:iPad 64gb
description: iPad 64gb with retina
pricing:{
 merchantid64354:{
   locationid643:{
  USD|600
   }
   locationid6436:{
  USD|600
   }
 }
 merchantid343:{
   locationid1345:{
  USD|600
   }
   locationid4353:{
  USD|600
   }
 }
}
 }


 This is what is suggested all over the internet:
 Denormalize it: In my case, I will end up with total number of columns =
 total locations with a price which is about 100k. I don't think having 100k
 columns for 60M products is a good idea.

 Are there any better ways of handling this?
 I am trying to figure out multivalue field but as far as I understand it,
 it can only be used as a flag but cannot be used to get a value
 associated to a key.

 Based on this answer, solr 4.5+ supports nested documents:
 http://stackoverflow.com/a/5585891/231917 but I am currently on 4.4.



 --
 Thanks,
 -Utkarsh




-- 

Anshum Gupta
http://www.anshumgupta.net


Re: Shards don't return documents in same order

2014-04-30 Thread Erick Erickson
Hmmm, take a look at the admin/analysis page for these inputs for
alphaOnlySort. If you're using the stock Solr distro, you're probably
not considering the effects of PatternReplaceFilterFactory, which is
removing all non-letters. So these three terms reduce to

mba
mba
mbanew
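
That reduction is easy to reproduce — a rough model of the chain, assuming the stock example schema's strip-everything-outside-[a-z] pattern:

```python
import re

def alpha_only_sort_key(raw):
    """Approximation of the stock alphaOnlySort analysis chain:
    KeywordTokenizer + LowerCaseFilter + TrimFilter, then a
    PatternReplaceFilter removing everything outside [a-z].
    (Assumption: mirrors the example schema shipped with Solr.)"""
    return re.sub(r"[^a-z]", "", raw.strip().lower())

keys = [alpha_only_sort_key(s)
        for s in ("MB20140410A", "MB20140410A-New", "MB20140411A")]
# keys == ["mba", "mbanew", "mba"]
```

Because two of the three inputs collapse to the identical key "mba", their relative order is decided by the tie-break, not the sort field.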

You can look at the actual indexed terms by the admin/schema-browser as well.

That said, unless you transposed the order because you were
concentrating on the numeric part, the doc with MB20140410A-New should
always be sorting last.

All of which is irrelevant if you're doing something else with
alphaOnlySort, so please paste in the fieldType definition if you've
changed it.

What gets returned in the doc for _stored_ data is a verbatim copy,
NOT the output of the analysis chain, which can be confusing.

Oh, and Solr uses the internal lucene doc ID to break ties, and docs
on different replicas can have different internal Lucene doc IDs
relative to each other as a result of merging so that's something else
to watch out for.

Best,
Erick

On Wed, Apr 30, 2014 at 1:06 PM, Francois Perron
francois.per...@ticketmaster.com wrote:
 Hi guys,

   I have a small SolrCloud setup (3 servers, 1 collection with 1 shard and 3 
 replicas).  In my schema, I have an alphaOnlySort field with a copyField.

 This is a part of my managed-schema :

 <field name="_root_" type="string" indexed="true" stored="false"/>
 <field name="_uid" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
 <field name="_version_" type="long" indexed="true" stored="true"/>
 <field name="event_id" type="string" indexed="true" stored="true"/>
 <field name="event_name" type="text_general" indexed="true" stored="true"/>
 <field name="event_name_sort" type="alphaOnlySort"/>

 with the copyfield

   <copyField source="event_name" dest="event_name_sort"/>


 The problem is : I query my collection with a sort on my alphasort field but 
 on one of my servers, the sort order is not the same.

 On server 1 and 2, I have this result :

 <doc>
   <str name="event_name">MB20140410A</str>
 </doc>
 <doc>
   <str name="event_name">MB20140410A-New</str>
 </doc>
 <doc>
   <str name="event_name">MB20140411A</str>
 </doc>



 and on the third one, this :

 <doc>
   <str name="event_name">MB20140410A</str>
 </doc>
 <doc>
   <str name="event_name">MB20140411A</str>
 </doc>
 <doc>
   <str name="event_name">MB20140410A-New</str>
 </doc>


 The doc named MB20140411A should be at the end ...

 Any idea ?

 Regards


Which Lucene search syntax is faster

2014-04-30 Thread johnmunir

Hi,


Given the following Lucene document that I'm adding to my index (and I expect to 
have over 10 million of them, each with various sizes from 1 KB to 50 KB):


<add>
  <doc>
    <field name="doc_type">PDF</field>
    <field name="title">Some name</field>
    <field name="summary">Some summary</field>
    <field name="owner">Who owns this</field>
    <field name="price">10</field>
    <field name="isbn">1234567890</field>
  </doc>
  <doc>
    <field name="doc_type">DOC</field>
    <field name="title">Some name</field>
    <field name="summary">Some summary</field>
    <field name="owner">Who owns this</field>
    <field name="price">10</field>
    <field name="isbn">0987654321</field>
  </doc>
  <!-- and more docs -->
</add>



My question is this: what Lucene search syntax will give me back results the 
fastest?  If my user is interested in finding data within the "title" and "owner" 
fields, only for "doc_type" "DOC", should I build my Lucene search syntax as:
 
1) skyfall ian fleming AND doc_type:DOC
2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND 
doc_type:DOC
3) Something else I don't know about.


Of the 10 million documents I will be indexing, 80% will be of doc_type PDF, 
and about 10% of type DOC, so please keep that in mind as a factor (if that 
will mean anything in terms of which syntax I should use).


Thanks in advanced,
 
- MJ 


Re: Which Lucene search syntax is faster

2014-04-30 Thread Shawn Heisey
On 4/30/2014 2:29 PM, johnmu...@aol.com wrote:
 My question is this: what Lucene search syntax will give me back results the 
 fastest?  If my user is interested in finding data within the "title" and "owner" 
 fields, only for "doc_type" "DOC", should I build my Lucene search syntax as:
  
 1) skyfall ian fleming AND doc_type:DOC

If your default field is text, I'm fairly sure this will become
equivalent to the following which is probably NOT what you want. 
Parentheses can be very important.

text:skyfall OR text:ian OR (text:fleming AND doc_type:DOC)

 2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND 
 doc_type:DOC

This kind of query syntax is probably what you should shoot for.  Not
from a performance perspective -- just from the perspective of making
your queries completely correct.  Note that the +/- syntax combined with
parentheses is far more precise than using AND/OR/NOT.

 3) Something else I don't know about.

The edismax query parser is very powerful.  That might be something
you're interested in.

https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser


 Of the 10 million documents I will be indexing, 80% will be of doc_type 
 PDF, and about 10% of type DOC, so please keep that in mind as a factor (if 
 that will mean anything in terms of which syntax I should use).

For the most part, whatever general query format you choose to use will
not matter very much.  There are exceptions, but mostly Solr (Lucene) is
smart enough to convert your query to an efficient final parsed format. 
Turn on the debugQuery parameter to see what it does with each query.

Regardless of whether you use the standard lucene query parser or
edismax, incorporate filter queries into your query constructing logic. 
Your second example above would be better to express like this, with the
default operator set to OR.  This uses both q and fq parameters:

q=title:(skyfall ian fleming) owner:(skyfall ian fleming)&fq=doc_type:DOC

https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter
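
Assembled programmatically, the same request might look like this (the host, core name, and field list are placeholders for your deployment):

```python
from urllib.parse import urlencode

# Hypothetical host and core name -- adjust for your deployment.
params = {
    "q": "title:(skyfall ian fleming) owner:(skyfall ian fleming)",
    "fq": "doc_type:DOC",       # cached filter; does not affect scoring
    "fl": "title,owner,score",  # stored fields (plus score) to return
    "wt": "json",
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
```

Keeping the fq clause separate from q is what lets Solr reuse the doc_type filter from the filterCache across queries.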

Thanks,
Shawn



Re: Which Lucene search syntax is faster

2014-04-30 Thread Erick Erickson
I'd add that I think you're worrying about the wrong thing. 10M
documents is not very many by modern Solr standards. I rather suspect
that you won't notice much difference in performance due to how you
construct the query.

Shawn's suggestion to use fq clauses is spot on, though. fq clauses
are re-used (see filterCache in solrconfig.xml). My rule of thumb is
to use fq clauses for most everything that does NOT contribute to
scoring...

Best,
Erick

On Wed, Apr 30, 2014 at 2:18 PM, Shawn Heisey s...@elyograg.org wrote:
 On 4/30/2014 2:29 PM, johnmu...@aol.com wrote:
 My question is this: what Lucene search syntax will give me back results the 
 fastest?  If my user is interested in finding data within the "title" and "owner" 
 fields, only for "doc_type" "DOC", should I build my Lucene search syntax as:

 1) skyfall ian fleming AND doc_type:DOC

 If your default field is text, I'm fairly sure this will become
 equivalent to the following which is probably NOT what you want.
 Parentheses can be very important.

 text:skyfall OR text:ian OR (text:fleming AND doc_type:DOC)

 2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND 
 doc_type:DOC

 This kind of query syntax is probably what you should shoot for.  Not
 from a performance perspective -- just from the perspective of making
 your queries completely correct.  Note that the +/- syntax combined with
 parentheses is far more precise than using AND/OR/NOT.

 3) Something else I don't know about.

 The edismax query parser is very powerful.  That might be something
 you're interested in.

 https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser


 Of the 10 million documents I will be indexing, 80% will be of doc_type 
 PDF, and about 10% of type DOC, so please keep that in mind as a factor (if 
 that will mean anything in terms of which syntax I should use).

 For the most part, whatever general query format you choose to use will
 not matter very much.  There are exceptions, but mostly Solr (Lucene) is
 smart enough to convert your query to an efficient final parsed format.
  Turn on the debugQuery parameter to see what it does with each query.

 Regardless of whether you use the standard lucene query parser or
 edismax, incorporate filter queries into your query constructing logic.
 Your second example above would be better to express like this, with the
 default operator set to OR.  This uses both q and fq parameters:

  q=title:(skyfall ian fleming) owner:(skyfall ian fleming)&fq=doc_type:DOC

 https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter

 Thanks,
 Shawn



Re: Which Lucene search syntax is faster

2014-04-30 Thread johnmunir

Thank you Shawn and Erick for the quick response.


A follow up question.


Based on 
https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter,
I see the fl (field list) parameter.  Does this mean I can build my Lucene 
search syntax as follows:


q=skyfall OR ian OR fleming&fl=title&fl=owner&fq=doc_type:DOC


And get the same result as (per Shawn's example, changed a bit to add OR):


q=title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR 
fleming)&fq=doc_type:DOC


Btw, my default search operator is set to AND.  My need is to find whatever the 
user types in both of those two fields (or maybe some other fields, which is 
controlled by the UI).  For example, the user types skyfall ian fleming, 
selects 3 fields, and wants to narrow down to doc_type DOC.


- MJ




-Original Message-
From: Erick Erickson erickerick...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Wed, Apr 30, 2014 5:33 pm
Subject: Re: Which Lucene search syntax is faster


I'd add that I think you're worrying about the wrong thing. 10M
documents is not very many by modern Solr standards. I rather suspect
that you won't notice much difference in performance due to how you
construct the query.

Shawn's suggestion to use fq clauses is spot on, though. fq clauses
are re-used (see filterCache in solrconfig.xml). My rule of thumb is
to use fq clauses for most everything that does NOT contribute to
scoring...

Best,
Erick

On Wed, Apr 30, 2014 at 2:18 PM, Shawn Heisey s...@elyograg.org wrote:
 On 4/30/2014 2:29 PM, johnmu...@aol.com wrote:
 My question is this: what Lucene search syntax will give me back results the 
fastest?  If my user is interested in finding data within the "title" and "owner" 
fields, only for "doc_type" "DOC", should I build my Lucene search syntax as:

 1) skyfall ian fleming AND doc_type:DOC

 If your default field is text, I'm fairly sure this will become
 equivalent to the following which is probably NOT what you want.
 Parentheses can be very important.

 text:skyfall OR text:ian OR (text:fleming AND doc_type:DOC)

 2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND 
doc_type:DOC

 This kind of query syntax is probably what you should shoot for.  Not
 from a performance perspective -- just from the perspective of making
 your queries completely correct.  Note that the +/- syntax combined with
 parentheses is far more precise than using AND/OR/NOT.

 3) Something else I don't know about.

 The edismax query parser is very powerful.  That might be something
 you're interested in.

 https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser


 Of the 10 million documents I will be indexing, 80% will be of doc_type 
PDF, and about 10% of type DOC, so please keep that in mind as a factor (if 
that 
will mean anything in terms of which syntax I should use).

 For the most part, whatever general query format you choose to use will
 not matter very much.  There are exceptions, but mostly Solr (Lucene) is
 smart enough to convert your query to an efficient final parsed format.
  Turn on the debugQuery parameter to see what it does with each query.

 Regardless of whether you use the standard lucene query parser or
 edismax, incorporate filter queries into your query constructing logic.
 Your second example above would be better to express like this, with the
 default operator set to OR.  This uses both q and fq parameters:

  q=title:(skyfall ian fleming) owner:(skyfall ian fleming)&fq=doc_type:DOC

 https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter

 Thanks,
 Shawn


 


What are the best practices on Multiple Language support in Solr Cloud ?

2014-04-30 Thread Shamik Bandopadhyay
Hi,

  I'm  trying to implement multiple language support in Solr Cloud (4.7).
Although we've different languages in index, we were only supporting
english in terms of index and query. To provide some context, our current
index size is 35 GB with close to 15 million documents. We've two shards
with two replicas per shard. I'm using composite id to support
de-duplication, which puts the documents having the same field (dedup)
value to a specific shard.
Language is known prior to indexing for every document. That saves the
need for runtime language detection. Similarly, during query, the language
will be known as well. To extend the point, there's no need for mixed-language
support within a single document.

Based on my understanding so far, there are three approaches which are
widely adopted: multi-field indexing, multi-core indexing, and multiple
languages in one field (based on Solr in Action).

First option seems easy to implement. But then, I've around 40 fields which
are getting indexed currently, though a majority of them are type=string
and not being analyzed. I'm planning to support around 10 languages, which
translates to 400 field definitions in the same schema. And this is poised
to grow with addition of languages and fields. My apprehension is whether
this approach becomes a maintenance nightmare? Does it affect overall
scalability? Does it affect any existing features like Suggester,
Spellcheck, etc.? I was thinking of including language as part of the id
key. It'll look like Language!Dedup_id!url so that documents are spread
across the two shards.

Second option of a dedicated core sounds easy in terms of maintaining
config files. Also, routing requests will be fairly easy as the language
will always be known up-front, both during indexing and query time. But, as
I looked into the documents, 60% of our total index will be in English,
while rest 40% will constitute remaining 10-14 languages. Some language
content is in the few thousands, which perhaps doesn't merit a dedicated core.
On top of that, this approach has the potential of getting into a complex
infrastructure, which might be hard to maintain.

I read about the use of multiple language in a single field in Trey
Grainger's book. It looks like a great approach but not sure if it is meant
to address my scenario. My first impression is that it's more geared
towards supporting multi-lingual, but I maybe completely wrong. Also, this
is not supported by Solr / Lucene out of the box.

I know there's a lot of people in this group who have excelled as far as
supporting multiple language in Solr is concerned. I'm trying to gather
their inputs / experience on the best practice to help me decide the right
approach. Any pointer on this will be highly appreciated.

Thanks,
Shamik


Re: Which Lucene search syntax is faster

2014-04-30 Thread Shawn Heisey
On 4/30/2014 3:47 PM, johnmu...@aol.com wrote:
 Thank you Shawn and Erick for the quick response.


 A follow up question.


  Based on 
  https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter,
  I see the fl (field list) parameter.  Does this mean I can build my Lucene 
  search syntax as follows:

The fl parameter determines which stored fields show up in the results. 
By default, all fields that are stored will be returned.  If you want
relevancy scores, you'd include the pseudofield named score --
fl=*,score is something we see a lot.  The fl parameter does not affect
the *search* at all.

 q=skyfall OR ian OR fleming&fl=title&fl=owner&fq=doc_type:DOC


 And get the same result as (per Shawn's example, changed a bit to add OR):


 q=title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR 
 fleming)&fq=doc_type:DOC

Exactly right.

 Btw, my default search operator is set to AND.  My need is to find whatever 
 the user types in both of those two fields (or maybe some other fields, which 
 is controlled by the UI).  For example, the user types skyfall ian fleming, 
 selects 3 fields, and wants to narrow down to doc_type DOC.

With the standard parser, you'd have to do the following.  Assume that
USERQUERY is a very basic query, perhaps a few terms, like your example
of skyfall ian fleming.

q=field1:(USERQUERY) OR field2:(USERQUERY) OR
field3:(USERQUERY)&fq=doc_type:DOC
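
That expansion can be generated mechanically; a small sketch (the helper name is made up, and real code should also escape query special characters):

```python
def expand_across_fields(user_query, fields):
    """Qualify a bare user query against each selected field and OR the
    clauses together, as in the standard-parser example above."""
    clause = "(%s)" % user_query
    return " OR ".join("%s:%s" % (field, clause) for field in fields)

q = expand_across_fields("skyfall ian fleming", ["field1", "field2", "field3"])
```

With edismax the same intent collapses to a qf parameter, which is why it is usually the easier route.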

With edismax, you'd do:

q=USERQUERY&qf=field1 field2 field3&fq=doc_type:DOC

You might also add pf=field1 field2 field3 ... and there are a great
many other edismax/dismax query parameters too.  The edismax parser does
some truly amazing stuff.

Echoing what both Erick and I said ... worrying about the exact syntax
is premature optimization.  10 million docs is something that Solr can
handle easily, as long as there's enough RAM.

Thanks,
Shawn



Re: timeAllowed in not honoring

2014-04-30 Thread Aman Tandon
Jeff - Thanks Jeff this discussion on jira is really quite helpful. Thanks
for this.

Shawn - Yes we have some plans to move to SolrCloud, Our total index size
is 40GB with 11M of Docs, Available RAM 32GB, Allowed heap space for solr
is 14GB, the GC tuning parameters using in our server
is -XX:+UseConcMarkSweepGC -XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps.

Mikhail Khludnev - Thanks i will try to use facet.method=enum this will
definitely help us in improving some time.

With Regards
Aman Tandon


On Wed, Apr 30, 2014 at 8:30 PM, Jeff Wartes jwar...@whitepages.com wrote:


 It's not just FacetComponent; here's the original feature ticket for
 timeAllowed:
 https://issues.apache.org/jira/browse/SOLR-502


 As I read it, timeAllowed only limits the time spent actually getting
 documents, not the time spent figuring out what data to get or how. I
 think that means the primary use-case is serving as a guard against
 excessive paging.
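
A toy illustration of that reading — the budget only constrains the collect loop, not anything that happens before collection starts (this is an illustration, not Solr's actual collector code):

```python
import time

def collect_with_time_allowed(doc_ids, score_fn, time_allowed_ms):
    """Toy model: the time budget is checked per document *while
    collecting*, so work done before collection begins (query
    preparation, wildcard expansion, ...) is never counted against it."""
    deadline = time.monotonic() + time_allowed_ms / 1000.0
    hits, partial = [], False
    for doc_id in doc_ids:
        if time.monotonic() > deadline:
            partial = True  # Solr reports this as partialResults=true
            break
        hits.append((doc_id, score_fn(doc_id)))
    return hits, partial

hits, partial = collect_with_time_allowed(range(5), lambda d: 1.0, 1000)
```

A query that spends all its time in preparation therefore sails past timeAllowed untouched.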



 On 4/30/14, 4:49 AM, Mikhail Khludnev mkhlud...@griddynamics.com
 wrote:

 On Wed, Apr 30, 2014 at 2:16 PM, Aman Tandon
 amantandon...@gmail.comwrote:
 
   <lst name="query"><double name="time">3337.0</double></lst>
   <lst name="facet"><double name="time">6739.0</double></lst>
 
 
  Most time is spent in facet counting. FacetComponent doesn't check
 timeAllowed right now. You can try to experiment with facet.method=enum or
 even with https://issues.apache.org/jira/browse/SOLR-5725 or try to
 distribute search with SolrCloud. AFAIK, you can't employ threads to speed
 up multivalue facets.
 
 --
 Sincerely yours
 Mikhail Khludnev
 Principal Engineer,
 Grid Dynamics
 
 http://www.griddynamics.com
  mkhlud...@griddynamics.com