Re: search result not correct in solr
Can't figure out the exact question; we need a more specific example. However, if you look in the Solr 4 Admin panel, there is an Analysis screen that shows you how text is analyzed during indexing and during search. Putting your words there will show you the effect of the various components in your type definition. If that does not help, show us the type you have now (in schema.xml) and try to explain the problem more precisely.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Wed, Apr 30, 2014 at 11:56 AM, neha sinha nehasinha...@gmail.com wrote:
Hi, I am trying to search with the word "Ribbing" and I am also getting results which have "R-B" or "RB" in their description, but when I try to search with "Ribbin" I get correct results. I have no clue what to use in my Solr schema.xml. Any guidance will be helpful. Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841.html
Sent from the Solr - User mailing list archive at Nabble.com.
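The Analysis screen Alex mentions is also backed by a request handler, so the same check can be scripted. A rough sketch of building such a request; the host, port, core name, and field type name here are assumptions, not details from the thread:

```python
# Hypothetical: build the URL for Solr's field-analysis handler, which
# backs the Admin UI's Analysis screen. Adjust host/core to your setup.
from urllib.parse import urlencode

base = "http://localhost:8983/solr/collection1/analysis/field"
params = {
    "analysis.fieldtype": "wc_text",   # the field type under suspicion
    "analysis.fieldvalue": "Ribbing",  # text as it would be indexed
    "analysis.query": "Ribbin",        # text as it would be queried
    "wt": "json",
}
url = base + "?" + urlencode(params)
print(url)  # fetch this (e.g. with curl) against a live core
```

The JSON response lists the token stream after each tokenizer and filter, for both the index and query chains, which is exactly what the Admin screen renders.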
Re: Issue with solr searching : words with - not able to search
Same issue with my search results also, and I have used solr.TextField for this. -- View this message in context: http://lucene.472066.n3.nabble.com/Issue-with-solr-searching-words-with-not-able-to-search-tp4128549p4133845.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: timeAllowed is not honored
I had this issue too. timeAllowed only works for a certain phase of the query; I think that's the 'process' part. However, if the query spends its time in the 'prepare' phase (e.g., I think wildcards are expanded to all possible terms before the query runs), timeAllowed has no effect on that. You can debug your query and confirm that.

On Wed, Apr 30, 2014 at 10:43 AM, Aman Tandon amantandon...@gmail.com wrote:
Shawn, this is the first time I have raised this problem. My heap size is 14GB and I am not using SolrCloud currently; the 40GB index is replicated from a master to two slaves. I read somewhere that it returns the partial results computed by the query within the amount of time defined by the timeAllowed parameter, but that doesn't seem to happen. Here is the link: http://wiki.apache.org/solr/CommonQueryParameters#timeAllowed

*The time allowed for a search to finish. This value only applies to the search and not to requests in general. Time is in milliseconds. Values <= 0 mean no time restriction. Partial results may be returned (if there are any).*

With Regards
Aman Tandon

On Wed, Apr 30, 2014 at 10:05 AM, Shawn Heisey s...@elyograg.org wrote:
On 4/29/2014 10:05 PM, Aman Tandon wrote:
I am using Solr 4.2 with an index size of 40GB. Some queries against my index take a significant amount of time, about 22 seconds, *in the case of minmatch of 50%*. So I added the parameter timeAllowed=2000 to my query, but it doesn't seem to work. Please help me out.

I remember reading that timeAllowed has some limitations about which stages of a query it can limit, particularly in the distributed case. These limitations mean that it cannot always limit the total time for a query. I do not remember precisely what those limitations are, and I cannot find whatever it was that I was reading.
When I looked through my local list archive to see if you had ever mentioned how much RAM you have and what the size of your Solr heap is, there didn't seem to be anything. There's not enough information for me to know whether that 40GB is the amount of index data on a single SolrCloud server, or whether it's the total size of the index across all servers. If we leave timeAllowed alone for a moment and treat this purely as a performance problem, usually my questions revolve around figuring out whether you have enough RAM. Here's where that conversation ends up: http://wiki.apache.org/solr/SolrPerformanceProblems I think I've probably mentioned this to you before on another thread. Thanks, Shawn -- Regards, Salman Akram
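One way to confirm where the time actually goes, per the discussion above, is to rerun the slow query with timing debug enabled alongside timeAllowed. A sketch of building that request; the host, core, and the query itself are placeholders:

```python
# Hypothetical request: rerun the slow query with per-component timing
# debug so the prepare/process split is visible. timeAllowed is in ms
# and, as discussed in this thread, only limits certain query phases.
from urllib.parse import urlencode

base = "http://localhost:8983/solr/collection1/select"
params = {
    "q": "some slow query",  # placeholder for the real query
    "timeAllowed": 2000,
    "debug": "timing",       # shows prepare/process time per component
    "wt": "json",
}
url = base + "?" + urlencode(params)
print(url)
```

The debug section of the response then breaks QTime down per search component, which shows whether the time is spent where timeAllowed can actually interrupt it.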
Re: search result not correct in solr
Thanks Alexandre, but that still doesn't help me. I am doing a keyword search for the word "Ribbing" and I am also getting products which have "R-B" or "RB" in some other field, but when I search for "Ribbin" I get correct search results. My field type is solr.TextField. Please find it below, from my schema.xml:

<fieldType name="wc_text" class="solr.TextField" positionIncrementGap="100" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

--
View this message in context: http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841p4133848.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: search result not correct in solr
On Wed, Apr 30, 2014 at 1:29 PM, neha sinha nehasinha...@gmail.com wrote:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>

I think combining NGrams with Porter filters, especially in that order, will do really weird things. Have you tried using the Admin console? You really want to see what happens to different words when you run them through your pipelines, probably with debug mode enabled to see what effect the NGram filter has on positions as well.

Oh, and if you modified your index chain, did you reindex completely? You must, otherwise you have old processed tokens lying around. On the other hand, you can experiment with the filter definition and not reindex (only reload the core) until you see the text flowing through and being indexed/queried correctly.

You are quite far from the normal scenario with your setup, so you are unlikely to get a magic answer; more likely pointers towards the tools that solve the problem.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
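To make the point about weird chains concrete, here is a rough Python sketch of what the wc_text index chain from earlier in this thread does to a field value (stopwords, trim, and stemming omitted for brevity; the sample text is invented):

```python
# Rough simulation of: KeywordTokenizer -> LowerCase -> PatternReplace
# ([^a-z] deleted) -> EdgeNGram(3..15, front). Illustration only, not
# Solr code; the sample field value is made up.
import re

def wc_text_index_tokens(text, min_gram=3, max_gram=15):
    token = text.lower()                  # LowerCaseFilterFactory
    # PatternReplaceFilterFactory deletes every non-letter, spaces
    # included, fusing all the words into one string.
    token = re.sub(r"[^a-z]", "", token)
    # KeywordTokenizerFactory made the whole field one token, so the
    # front-side EdgeNGram filter emits only prefixes of that fused string.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(wc_text_index_tokens("Ribbing Tape R-B"))
```

The output is only prefixes of "ribbingtaperb": "tape" and "rb" can never be matched on their own, while any query sharing a three-letter prefix with the first word (e.g. "ribbin") matches. That is the kind of surprise the Analysis screen would have shown directly.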
Sorting is not correct in autosuggest
Hi All, in my autosuggest page the sorting is not correct for the suggestions I am getting; however, the suggestions themselves are all correct. Any guidance will be helpful. -- View this message in context: http://lucene.472066.n3.nabble.com/Sorting-is-not-correct-in-autosuggest-tp4133859.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: search result not correct in solr
Hello Alex, yes, I reindexed completely. I am new to Solr so I don't have much idea of all the filters. Can you suggest some filters which I can try? -- View this message in context: http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841p4133861.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: search result not correct in solr
Hi Neha,

There are a bunch of filters available, and it wouldn't make sense to suggest anything unless we know the intention. As they say, if you don't know where you're going, any road will take you there.

If you want the most basic case of being able to search for standard terms in your documents, I'd recommend you start fresh and look up the example schema. Using the basic field types for your field should do the job, but again, I don't really know what the intended behavior is.

Also, you should look at the official reference guide: https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide
Be sure to look up the guide for the version of Solr you're using.

On Wed, Apr 30, 2014 at 12:43 AM, neha sinha nehasinha...@gmail.com wrote:
Hello Alex, yes, I reindexed completely. I am new to Solr so I don't have much idea of all the filters. Can you suggest some filters which I can try? -- View this message in context: http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841p4133861.html Sent from the Solr - User mailing list archive at Nabble.com.

--
Anshum Gupta
http://www.anshumgupta.net
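For reference, a field type along the lines of the example schema's text_general, shown here as a simplified sketch rather than a copy of the shipped file, looks like this:

```xml
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

This tokenizes on word boundaries and lowercases, which covers plain term search; stemming or n-grams are worth adding only once a concrete need for them is known, with a full reindex after every change.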
Problem indexing subentities from a multivalued field
Hi there,

I have a problem trying to create subentities during the data import. I have defined the following data-config:

<entity name="efl" processor="FileListEntityProcessor" baseDir="/path/"
        fileName=".*.xml$" recursive="false" rootEntity="false" dataSource="null">
  <entity name="subefl" dataSource="ds-3" pk="id" processor="XPathEntityProcessor"
          forEach="/export/doc_debur" transformer="DateFormatTransformer,RegexTransformer"
          url="${efl.fileAbsolutePath}" stream="true" onError="skip">
    ...
    <field column="thk" xpath="/export/doc_debur/thematization_keys"/>
    <entity dataSource="ds-1" name="thematization_keys"
            query="select tmid as thematization_keys from thematization where tmid='${subefl.thk}'"/>
    ...
  </entity>
</entity>

thk is a multivalued string field, and thematization_keys is also defined as a multivalued string field. What I want is to run the query for each one of the values of thk and store all the results in the thematization_keys field. Could anyone help me?

Thanks in advance,
Jordi
Re: timeAllowed is not honored
Hi Salman, here is my debug query dump, please help! I am unable to find any wildcards in it.

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <bool name="partialResults">true</bool>
  <int name="status">0</int>
  <int name="QTime">10080</int>
</lst>
<result name="response" numFound="976303" start="0"/>
<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="city">
      <int name="delhi ncr">884159</int>
      <int name="delhi">629472</int>
      <int name="mumbai">491426</int>
      <int name="ahmedabad">259356</int>
      <int name="chennai">259029</int>
      <int name="bengaluru">257193</int>
      <int name="kolkata">195077</int>
      <int name="pune">193569</int>
      <int name="hyderabad">179369</int>
      <int name="jaipur">115356</int>
      <int name="coimbatore">111644</int>
      <int name="noida">86794</int>
      <int name="surat">80621</int>
      <int name="gurgaon">72815</int>
      <int name="rajkot">68982</int>
      <int name="vadodara">65082</int>
      <int name="ludhiana">63244</int>
      <int name="thane">55091</int>
      <int name="indore">50225</int>
      <int name="ghaziabad">49756</int>
      <int name="faridabad">45322</int>
      <int name="navi mumbai">40127</int>
      <int name="tiruppur">37639</int>
      <int name="nagpur">37126</int>
      <int name="kochi">32874</int>
    </lst>
    <lst name="datatype">
      <int name="product">966816</int>
      <int name="offer">6003</int>
      <int name="company">3484</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
</lst>
<lst name="debug">
  <str name="rawquerystring">misc items</str>
  <str name="querystring">misc items</str>
  <str name="parsedquery">BoostedQuery(boost(+(((titlex:misc^1.5 | smalldesc:misc | titlews:misc^0.5 | city:misc | usrpcatname:misc | mcatnametext:misc^0.2)~0.3 (titlex:item^1.5 | smalldesc:item | titlews:items^0.5 | city:items | usrpcatname:item | mcatnametext:item^0.2)~0.3)~1) (mcatnametext:misc item^0.5)~0.3 (titlews:misc items)~0.3 (titlex:misc item^3.0)~0.3 (smalldesc:misc item^2.0)~0.3 (usrpcatname:misc item)~0.3 (),product(map(query(+(titlex:item imsw)~0.3 (),def=0.0),0.0,0.0,1.0),map(query(+(titlex:misc item imsw)~0.3 (),def=0.0),0.0,0.0,1.0),map(int(sdesclen),0.0,150.0,1.0),map(int(sdesclen),0.0,0.0,0.1),map(int(CustTypeWt),699.0,699.0,1.2),map(int(CustTypeWt),199.0,199.0,1.3),map(int(CustTypeWt),0.0,179.0,1.35),1.0/(3.16E-11*float(ms(const(1398852652419),date(lastactiondatet)))+1.0),map(ms(const(1398852652419),date(blpurchasedate)),0.0,2.6E9,1.15),map(query(+(attribs:hot)~0.3 (titlex:hot^3.0 | smalldesc:hot^2.0 | titlews:hot | city:hot | usrpcatname:hot | mcatnametext:hot^0.5)~0.3,def=0.0),0.0,0.0,1.0),map(query(+(attribs:dupimg)~0.3 (titlex:dupimg^3.0 | smalldesc:dupimg^2.0 | titlews:dupimg | city:dupimg | usrpcatname:dupimg | mcatnametext:dupimg^0.5)~0.3,def=0.0),0.0,0.0,1.0),map(query(+(isphoto:T)~0.3 (),def=0.0),0.0,0.0,0.1</str>
  <str name="parsedquery_toString">boost(+(((titlex:misc^1.5 | smalldesc:misc | titlews:misc^0.5 | city:misc | usrpcatname:misc | mcatnametext:misc^0.2)~0.3 (titlex:item^1.5 | smalldesc:item | titlews:items^0.5 | city:items | usrpcatname:item | mcatnametext:item^0.2)~0.3)~1) (mcatnametext:misc item^0.5)~0.3 (titlews:misc items)~0.3 (titlex:misc item^3.0)~0.3 (smalldesc:misc item^2.0)~0.3 (usrpcatname:misc item)~0.3 (),product(map(query(+(titlex:item imsw)~0.3 (),def=0.0),0.0,0.0,1.0),map(query(+(titlex:misc item imsw)~0.3 (),def=0.0),0.0,0.0,1.0),map(int(sdesclen),0.0,150.0,1.0),map(int(sdesclen),0.0,0.0,0.1),map(int(CustTypeWt),699.0,699.0,1.2),map(int(CustTypeWt),199.0,199.0,1.3),map(int(CustTypeWt),0.0,179.0,1.35),1.0/(3.16E-11*float(ms(const(1398852652419),date(lastactiondatet)))+1.0),map(ms(const(1398852652419),date(blpurchasedate)),0.0,2.6E9,1.15),map(query(+(attribs:hot)~0.3 (titlex:hot^3.0 | smalldesc:hot^2.0 | titlews:hot | city:hot | usrpcatname:hot | mcatnametext:hot^0.5)~0.3,def=0.0),0.0,0.0,1.0),map(query(+(attribs:dupimg)~0.3 (titlex:dupimg^3.0 | smalldesc:dupimg^2.0 | titlews:dupimg | city:dupimg | usrpcatname:dupimg | mcatnametext:dupimg^0.5)~0.3,def=0.0),0.0,0.0,1.0),map(query(+(isphoto:T)~0.3 (),def=0.0),0.0,0.0,0.1)))</str>
  <lst name="explain"/>
  <str name="QParser">SynonymExpandingExtendedDismaxQParser</str>
  <null name="altquerystring"/>
  <null name="boost_queries"/>
  <arr name="parsed_boost_queries"/>
  <null name="boostfuncs"/>
  <arr name="filter_queries">
    <str>{!tag=cityf}latlong:Intersects(Circle(28.63576,77.22445 d=2.248))</str>
    <str>attribs:(locprefglobal locprefnational locprefcity)</str>
    <str>+((+datatype:product +attribs:(aprstatus20 aprstatus40 aprstatus50) +aggregate:true -attribs:liststatusnfl +((+countryiso:IN +isfcp:true) CustTypeWt:[149 TO 1499])) (+datatype:offer +iildisplayflag:true) (+datatype:company -attribs:liststatusnfl +((+countryiso:IN +isfcp:true) CustTypeWt:[149 TO 1499]))) -attribs:liststatusdnf</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>ConstantScore(org.apache.lucene.spatial.prefix.IntersectsPrefixTreeFilter@414cd6c2)</str>
    <str>attribs:locprefglobal attribs:locprefnational attribs:locprefcity</str>
    <str>+((+datatype:product +(attribs:aprstatus20 attribs:aprstatus40 attribs:aprstatus50) +aggregate:true -attribs:liststatusnfl +((+countryiso:IN +isfcp:true)
Re: Problem indexing subentities from a multivalued field
This is a little complicated. What are you getting now with this setup? Is everything else actually working? I would have thought that even dataSource="null" would cause issues.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Wed, Apr 30, 2014 at 4:48 PM, Jordi Martin jordi.mar...@indicator.es wrote:
Hi there, I have a problem trying to create subentities during the data import. I have defined the following data-config:

<entity name="efl" processor="FileListEntityProcessor" baseDir="/path/"
        fileName=".*.xml$" recursive="false" rootEntity="false" dataSource="null">
  <entity name="subefl" dataSource="ds-3" pk="id" processor="XPathEntityProcessor"
          forEach="/export/doc_debur" transformer="DateFormatTransformer,RegexTransformer"
          url="${efl.fileAbsolutePath}" stream="true" onError="skip">
    ...
    <field column="thk" xpath="/export/doc_debur/thematization_keys"/>
    <entity dataSource="ds-1" name="thematization_keys"
            query="select tmid as thematization_keys from thematization where tmid='${subefl.thk}'"/>
    ...
  </entity>
</entity>

thk is a multivalued string field, and thematization_keys is also defined as a multivalued string field. What I want is to run the query for each one of the values of thk and store all the results in the thematization_keys field. Could anyone help me? Thanks in advance, Jordi
RE: Problem indexing subentities from a multivalued field
Playing a bit with this data-config I get two different results. If thk is defined in schema.xml, I get all of its values indexed but the thematization_keys subentity is not processed. On the other hand, if I do not define thk in schema.xml, only the last value of thk is stored and then thematization_keys is processed for that value.

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Wednesday, 30 April 2014 12:35
To: solr-user@lucene.apache.org
Subject: Re: Problem indexing subentities from a multivalued field

This is a little complicated. What are you getting now with this setup? Is everything else actually working? I would have thought that even dataSource="null" would cause issues.

Regards, Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Wed, Apr 30, 2014 at 4:48 PM, Jordi Martin jordi.mar...@indicator.es wrote:
Hi there, I have a problem trying to create subentities during the data import. I have defined the data-config quoted earlier in this thread, where thk is a multivalued string field and thematization_keys is also defined as a multivalued string field. What I want is to run the query for each one of the values of thk and store all the results in the thematization_keys field. Could anyone help me? Thanks in advance, Jordi
Re: timeAllowed is not honored
On Wed, Apr 30, 2014 at 2:16 PM, Aman Tandon amantandon...@gmail.com wrote:
<lst name="query"><double name="time">3337.0</double></lst>
<lst name="facet"><double name="time">6739.0</double></lst>

Most time is spent in facet counting, and FacetComponent doesn't check timeAllowed right now. You can try to experiment with facet.method=enum, or even with https://issues.apache.org/jira/browse/SOLR-5725, or try to distribute the search with SolrCloud. AFAIK, you can't employ threads to speed up multivalued facets.

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
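Switching the city facet to the term-enumeration method, as suggested above, is just a request (or solrconfig defaults) change. The parameter names below are standard; the minDf threshold value is only illustrative:

```
facet=true
facet.field=city
facet.method=enum
facet.enum.cache.minDf=100
```

With facet.method=enum, terms whose document frequency is below facet.enum.cache.minDf are counted without populating the filter cache, which keeps the cache from being flooded by rare city values.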
Error initializing QueryElevationComponent
Hi Team,

I am getting the error null:org.apache.solr.common.SolrException: SolrCore 'master' is not available due to init failure: Error initializing QueryElevationComponent. Please check the configurations below.

elevate.xml:
<elevate>
  <query text="analog">
    <doc id="sitecore://master/{137f5eb3-eb84-4165-bef0-5be1fbbc3201}?lang=en&ver=1"/>
  </query>
</elevate>

schema.xml:
<field name="_uniqueid" type="string" indexed="true" stored="true" required="true"/>

solrconfig.xml:
<arr name="last-components">
  <str>spellcheck1</str>
  <str>elevator</str>
</arr>

I am adding elevator to the default request handler; this handler also uses the spellcheck1 component.

<searchComponent name="elevator" class="org.apache.solr.handler.component.QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

With this I get the error below and the core itself does not load. If I change the id in elevate.xml to <doc id="bce22a40d2be4cd791ed6bf4b88d0450"/> instead of <doc id="sitecore://master/{137f5eb3-eb84-4165-bef0-5be1fbbc3201}?lang=en&ver=1"/> then the error goes away, but the results are not as expected. What is wrong with the value sitecore://master/{137f5eb3-eb84-4165-bef0-5be1fbbc3201}?lang=en&ver=1? Please suggest or guide how to make it work.

Complete error details:

null:org.apache.solr.common.SolrException: SolrCore 'master' is not available due to init failure: Error initializing QueryElevationComponent.
at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:783)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:287)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:953)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:408)
at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1041)
at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:603)
at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: Error initializing QueryElevationComponent.
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:834)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:625)
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:522)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:557)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:247)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:239)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
... 3 more
Caused by: org.apache.solr.common.SolrException: Error initializing QueryElevationComponent.
at org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:241)
at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:601)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:829)
... 11 more
Caused by: org.apache.solr.common.SolrException: org.xml.sax.SAXParseException; systemId: solrres:/elevate.xml; lineNumber: 28; columnNumber: 80; The reference to entity "ver" must end with the ';' delimiter.
at org.apache.solr.core.Config.<init>(Config.java:148)
at org.apache.solr.core.Config.<init>(Config.java:86)
at org.apache.solr.core.Config.<init>(Config.java:81)
at org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:223)
... 13 more
Caused by: org.xml.sax.SAXParseException; systemId: solrres:/elevate.xml; lineNumber: 28; columnNumber: 80; The reference to entity "ver" must end with the ';' delimiter.
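The SAXParseException at the bottom of the trace points at the actual problem: elevate.xml is XML, so a literal & in an attribute value must be escaped as &amp;. A sketch of the corrected entry (the id itself is taken from the original message):

```xml
<elevate>
  <query text="analog">
    <doc id="sitecore://master/{137f5eb3-eb84-4165-bef0-5be1fbbc3201}?lang=en&amp;ver=1"/>
  </query>
</elevate>
```

The XML parser unescapes &amp; back to & when the file is read, so the id Solr compares against the _uniqueid field is still the original value with a plain &; the stored id has to match it exactly.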
Re: timeAllowed is not honored
On 4/29/2014 11:43 PM, Aman Tandon wrote:
My heap size is 14GB and I am not using SolrCloud currently; the 40GB index is replicated from a master to two slaves. I read somewhere that it returns the partial results computed by the query within the amount of time defined by the timeAllowed parameter, but that doesn't seem to happen.

Mikhail Khludnev has replied and explained why timeAllowed isn't stopping the query and returning partial results.

A 14GB heap is quite large. If you aren't starting Solr with garbage collection tuning parameters, long GC pauses *will* be happening, and that will make some of your queries take a really long time. The wiki page I sent has a section about garbage collection and a link showing the GC tuning parameters that I use.

You didn't indicate how much total RAM you have. If your total RAM is 16GB, that's definitely not enough for a 14GB heap and a 40GB index. 32GB of total RAM might be enough, but it also might not be. A perfect-world RAM size for this setup would be at least 54GB -- the total of heap plus index size, not counting the small number of megabytes that the OS and its basic services take.

Thanks,
Shawn
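As one illustration of the kind of GC tuning being discussed (these are not Shawn's exact settings; see his wiki link for those), a CMS-based starting point for a large heap on the Java 7 JVMs of that era looked something like:

```
-Xms14g -Xmx14g
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
```

Pinning -Xms to -Xmx avoids heap resizing pauses, and starting concurrent collection at 75% occupancy reduces the chance of a stop-the-world full GC; the right values depend entirely on the measured pause behavior of the specific installation.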
Re: timeAllowed is not honored
It's not just FacetComponent; here's the original feature ticket for timeAllowed: https://issues.apache.org/jira/browse/SOLR-502

As I read it, timeAllowed only limits the time spent actually getting documents, not the time spent figuring out what data to get or how. I think that means the primary use-case is serving as a guard against excessive paging.

On 4/30/14, 4:49 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote:
On Wed, Apr 30, 2014 at 2:16 PM, Aman Tandon amantandon...@gmail.com wrote:
<lst name="query"><double name="time">3337.0</double></lst>
<lst name="facet"><double name="time">6739.0</double></lst>

Most time is spent in facet counting, and FacetComponent doesn't check timeAllowed right now. You can try to experiment with facet.method=enum, or even with https://issues.apache.org/jira/browse/SOLR-5725, or try to distribute the search with SolrCloud. AFAIK, you can't employ threads to speed up multivalued facets.

--
Sincerely yours
Mikhail Khludnev
Principal Engineer, Grid Dynamics
http://www.griddynamics.com
mkhlud...@griddynamics.com
Re: merge shards indexes
Is this SolrCloud? If so, you have to be quite careful to get the expected results; in fact, I'm not at all sure you can and still have a consistent index.

Best, Erick

On Mon, Apr 28, 2014 at 5:33 AM, Dmitry Kan solrexp...@gmail.com wrote:
Yes, according to this documentation: https://wiki.apache.org/solr/MergingSolrIndexes

On Mon, Apr 28, 2014 at 12:14 PM, Gastone Penzo gastone.pe...@gmail.com wrote:
Hi, is it possible to merge two shards' indexes into one? Thank you
--
*Gastone Penzo*

--
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
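On a non-cloud setup, the CoreAdmin mergeindexes action described on that wiki page can be scripted. A sketch of building the request, with made-up core names and index paths:

```python
# Hypothetical: build a CoreAdmin mergeindexes request that merges two
# shard index directories into core0. Host, core name, and paths are
# placeholders; indexDir may be repeated once per source index.
from urllib.parse import urlencode

params = [
    ("action", "mergeindexes"),
    ("core", "core0"),                            # target core
    ("indexDir", "/var/solr/shard1/data/index"),  # source index 1
    ("indexDir", "/var/solr/shard2/data/index"),  # source index 2
]
url = "http://localhost:8983/solr/admin/cores?" + urlencode(params)
print(url)
```

The source indexes must share a compatible schema, and neither the target core nor the sources should be receiving updates while the merge runs, which is part of why this is hard to do safely under SolrCloud.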
Re: Stemming not working with wildcard search
Did you re-index? And what do you get when adding debug=query? That should show you the parsed query. Have you looked at the results of the admin/analysis page? That tool is invaluable for seeing what the actual transformations are.

Best, Erick

On Mon, Apr 28, 2014 at 11:41 AM, Geepalem naresh.geepa...@yahoo.com wrote:
Hi Ahmet, thanks for your prompt response! I have added the filters which you specified but it's still not working. Below is the field's query analyzer:

<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

http://localhost:8080/solr/master/select?q=page_title_t:*products*
http://localhost:8080/solr/master/select?q=page_title_t:*product*

Please let me know if I am doing anything wrong.
Thanks, G. Naresh Kumar

--
View this message in context: http://lucene.472066.n3.nabble.com/Stemming-not-working-with-wildcard-search-tp4133382p4133556.html
Sent from the Solr - User mailing list archive at Nabble.com.
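The usual culprit in this situation, which the debug=query output Erick asks for would reveal, is that wildcard terms are not run through most analysis filters: the pattern has to match the stemmed term stored in the index. A toy illustration (a naive plural stripper standing in for the real Porter algorithm, which this is not):

```python
# Illustration: stemming stores "product" in the index, while a wildcard
# pattern like *products* is taken literally at query time, so it looks
# for the substring "products", which no longer exists in the index term.
import fnmatch

def naive_stem(word):
    # Toy plural stripper standing in for PorterStemFilter (simplified).
    return word[:-1] if word.endswith("s") else word

indexed = naive_stem("products".lower())       # index stores "product"
print(fnmatch.fnmatch(indexed, "*products*"))  # the wildcard keeps the "s"
print(fnmatch.fnmatch(indexed, "*product*"))
```

This is why `*product*` matches while `*products*` does not: the "s" was stemmed away at index time but survives inside the unanalyzed wildcard pattern.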
Re: Solr Server Infrastructure Config
Impossible to answer; even if you gave much more detailed information, you would need to prototype and push one of your machines until it falls over, then extrapolate. See: http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best, Erick

On Tue, Apr 29, 2014 at 7:41 AM, EXTERNAL Taminidi Ravi (ETI, Automotive-Service-Solutions) external.ravi.tamin...@us.bosch.com wrote:
Hi, can someone share, or refer me to, information on a Solr server environment for production? We have approx. 40 collections, with sizes from 300MB to 8GB each and about 100GB in total; the total may grow by 2-5GB per year. We want the best performance for at least 1000-1 concurrent users. Thanks, Ravi
Re: Sorting is not correct in autosuggest
Please review: http://wiki.apache.org/solr/UsingMailingLists You've given us virtually no information here. Best, Erick On Wed, Apr 30, 2014 at 12:35 AM, neha sinha nehasinha...@gmail.com wrote: Hi All In my auto suggest page sorting is not correct for the suggestions i am getting. However suggestions are all correct. Any guidance will be helpful -- View this message in context: http://lucene.472066.n3.nabble.com/Sorting-is-not-correct-in-autosuggest-tp4133859.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: search result not correct in solr
Neha: You _really_ need to get familiar with the admin/analysis page in the Solr admin UI. It shows you, step by step, what each tokenizer and filter in your analysis chain does. It'll save you a world of pain :).

Best, Erick

P.S. Unless you care about a bunch of really gory detail, un-check the verbose checkbox!

On Wed, Apr 30, 2014 at 12:55 AM, Anshum Gupta ans...@anshumgupta.net wrote:
Hi Neha, there are a bunch of filters available, and it wouldn't make sense to suggest anything unless we know the intention. As they say, if you don't know where you're going, any road will take you there. If you want the most basic case of being able to search for standard terms in your documents, I'd recommend you start fresh and look up the example schema. Using the basic field types for your field should do the job, but again, I don't really know what the intended behavior is. Also, you should look at the official reference guide: https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide and be sure to look up the guide for the version of Solr you're using.

On Wed, Apr 30, 2014 at 12:43 AM, neha sinha nehasinha...@gmail.com wrote:
Hello Alex, yes, I reindexed completely. I am new to Solr so I don't have much idea of all the filters. Can you suggest some filters which I can try? -- View this message in context: http://lucene.472066.n3.nabble.com/search-result-not-correct-in-solr-tp4133841p4133861.html Sent from the Solr - User mailing list archive at Nabble.com.

--
Anshum Gupta
http://www.anshumgupta.net
Re: When not to use NRTCachingDirectory and what to use instead.
On 4/19/14, 6:51 AM, Ken Krugler kkrugler_li...@transpac.com wrote:
The code I see seems to be using an FSDirectory, or is there another layer of wrapping going on here?
return new NRTCachingDirectory(FSDirectory.open(new File(path)), maxMergeSizeMB, maxCachedMB);

I was also curious about this subject. Not enough to test anything, but enough to look at the code too. FSDirectory.open picks one of MMapDirectory, SimpleFSDirectory, and NIOFSDirectory, in that order of preference, based on what it thinks your system will support. There's still the possibility that the added caching functionality slows down bulk index operations, but setting that aside, it does look like NRTCachingDirectoryFactory is almost always the best choice.
Re: saving user actions on item in solr for later retrieval
Thank you, we will check it out. On Apr 29, 2014 9:28 PM, iorixxx [via Lucene] ml-node+s472066n4133796...@n3.nabble.com wrote: Hi Nolim, Actually EFF is searchable. See my comments at the end of the page https://cwiki.apache.org/confluence/display/solr/Working+with+External+Files+and+Processes Ahmet On Tuesday, April 29, 2014 9:07 PM, nolim [hidden email] wrote: Thank you, it was interesting and I have learned some new things in Solr :) But the External File Field isn't a good option because the field is unsearchable, which is very important to us. We are thinking about the first option (updating the document in Solr) but performing a commit only every 10 minutes; if we want to retrieve the value in real time we can use RealTimeGet. Maybe you have another suggestion? -- View this message in context: http://lucene.472066.n3.nabble.com/saving-user-actions-on-item-in-solr-for-later-retrieval-tp4133558p4133793.html
-- View this message in context: http://lucene.472066.n3.nabble.com/saving-user-actions-on-item-in-solr-for-later-retrieval-tp4133558p4133955.html
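As background to Ahmet's point: an ExternalFileField keeps its per-document values in a file named external_&lt;fieldname&gt; in the index data directory rather than in the index itself, which is why it can't be queried like a normal field; it can, however, be used in function queries and, via {!frange}, in filters, which is presumably the sense in which it is "searchable". A minimal sketch, with the popularity field name being illustrative:

```xml
<!-- schema.xml: values live in data/external_popularity,
     reloaded when a new searcher opens -->
<fieldType name="externalPopularity" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"/>
<field name="popularity" type="externalPopularity"/>
```

It can then drive sorting (sort=field(popularity) desc) or filtering (fq={!frange l=5}popularity).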
Re: saving user actions on item in solr for later retrieval
Is there somebody from LucidWorks who can comment on the Click Score Relevance Framework in LucidWorks Search? On Mon, Apr 28, 2014 at 10:48 PM, nolim alony...@gmail.com wrote: Hi, We are using Solr in a production system for around ~500 users and we have around ~1 queries per day. Our users' search topics are mostly static and repeat themselves over time. We have in our system an option to specify a specific search subject (we also call it a specific information need) and most of our users are using this option. We keep in our system logs each query and document retrieved for each information need, and the user can also give feedback on whether a document is relevant for his information need. We also have a special query expansion technique and a diversity algorithm based on MMR. We want to use this information from the logs as a data set for training our ranking system, performing Learning To Rank for each information need or cluster of information needs. We also want to give the user the option to filter by relevant and read, based on his actions\friends' actions on the same topic. When he runs a query again, or a similar one, he can skip already-read documents. That's an important requirement for our users. We are thinking about 2 possibilities to implement it: 1. Updating each item in Solr and creating 2 fields named read and relevant. Each field is a multivalued field holding the corresponding label of the information need. When the user reads a document, an update is sent to Solr and the read field gets a label with the information need the user is working on... This causes an update whenever an item is read by a user (still nothing compared to the new items coming in each day). We are also saving information that belongs to the application in Solr, which may be the wrong architecture. 2. Save the information in a DB, and then perform the filtering on the retrieved results. This option is much more complicated (we now have fields that aren't in Solr but that the user uses for search).
We won't get facets, autocomplete and other nice stuff that a regular field in Solr can have. There is a cost in performance; we can't easily retrieve "give me the top 10 documents that answer the query and are unread for the information need"; and there is more complicated code to maintain. 3. Do you have more ideas? Which of these options is better? Thanks in advance! -- View this message in context: http://lucene.472066.n3.nabble.com/saving-user-actions-on-item-in-solr-for-later-retrieval-tp4133558.html -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
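Option 1 maps naturally onto Solr's atomic updates (available since 4.0): you send just the key and the changed field, and Solr rebuilds the rest of the document from its stored fields (which therefore must all be stored). A sketch, where the document id and the infoneed_42 label are illustrative:

```xml
<!-- POST to /update: append one label to the multivalued "read" field -->
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="read" update="add">infoneed_42</field>
  </doc>
</add>
```

Combined with the 10-minute commit idea, /get (RealTimeGet) still returns the updated value immediately, because it reads from the update log ahead of the commit.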
Denormalize or use multivalued field for nested data?
I have to modify a schema where I can attach nested per-store pricing information to a product. For example:

10010137332: {
  title: "iPad 64gb",
  description: "iPad 64gb with retina",
  pricing: {
    merchantid64354: {
      locationid643: "USD|600",
      locationid6436: "USD|600"
    },
    merchantid343: {
      locationid1345: "USD|600",
      locationid4353: "USD|600"
    }
  }
}

This is what is suggested all over the internet: denormalize it. In my case, I will end up with total number of columns = total locations with a price, which is about 100k. I don't think having 100k columns for 60M products is a good idea. Are there any better ways of handling this? I am trying to figure out multivalued fields, but as far as I understand it, a multivalued field can only be used as a flag and cannot be used to get a value associated with a key. Based on this answer, Solr 4.5+ supports nested documents: http://stackoverflow.com/a/5585891/231917 but I am currently on 4.4. -- Thanks, -Utkarsh
Shards don't return documents in same order
Hi guys, I have a small SolrCloud setup (3 servers, 1 collection with 1 shard and 3 replicas). In my schema, I have an alphaOnlySort field with a copyField. This is a part of my managed-schema:

<field name="_root_" type="string" indexed="true" stored="false"/>
<field name="_uid" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="_version_" type="long" indexed="true" stored="true"/>
<field name="event_id" type="string" indexed="true" stored="true"/>
<field name="event_name" type="text_general" indexed="true" stored="true"/>
<field name="event_name_sort" type="alphaOnlySort"/>

with the copyField

<copyField source="event_name" dest="event_name_sort"/>

The problem is: I query my collection with a sort on my alphaOnlySort field, but on one of my servers the sort order is not the same. On servers 1 and 2, I have this result:

<doc><str name="event_name">MB20140410A</str></doc>
<doc><str name="event_name">MB20140410A-New</str></doc>
<doc><str name="event_name">MB20140411A</str></doc>

and on the third one, this:

<doc><str name="event_name">MB20140410A</str></doc>
<doc><str name="event_name">MB20140411A</str></doc>
<doc><str name="event_name">MB20140410A-New</str></doc>

The doc named MB20140411A should be at the end... Any idea? Regards
Re: Denormalize or use multivalued field for nested data?
I think you are misunderstanding denormalize in this context. It still may not be what you want to do for other reasons, but the usual idea is to replicate the parent info in each of the children, so you'd have something like:

doc1 = title: "iPad 64gb", description: "iPad 64gb with retina", merchantid: 343, locationid: 1345, cost: "USD|600"
doc2 = title: "iPad 64gb", description: "iPad 64gb with retina", merchantid: 343, locationid: 4353, cost: "USD|600"

And so on. Best, Erick On Wed, Apr 30, 2014 at 12:24 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: <snip>
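In Solr's XML update format, the denormalized docs Erick sketches would be posted something like this (the field names mirror the example and are assumptions, not a schema recommendation):

```xml
<add>
  <!-- one doc per (product, merchant, location) combination -->
  <doc>
    <field name="title">iPad 64gb</field>
    <field name="description">iPad 64gb with retina</field>
    <field name="merchantid">343</field>
    <field name="locationid">1345</field>
    <field name="cost">USD|600</field>
  </doc>
  <doc>
    <field name="title">iPad 64gb</field>
    <field name="description">iPad 64gb with retina</field>
    <field name="merchantid">343</field>
    <field name="locationid">4353</field>
    <field name="cost">USD|600</field>
  </doc>
</add>
```

Price lookups then become flat queries, e.g. q=title:"ipad 64gb"&fq=merchantid:343&fq=locationid:1345.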
Re: Denormalize or use multivalued field for nested data?
Block joins could be what you're looking for, if you can upgrade to 4.5+ [ https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers ]. I'd recommend an upgrade, but if that's not possible, replicating the parent information is the way to go. On Wed, Apr 30, 2014 at 12:24 PM, Utkarsh Sengar utkarsh2...@gmail.com wrote: <snip> -- Anshum Gupta http://www.anshumgupta.net
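For reference, the block-join route on 4.5+ indexes parent and children together as one block and queries across them with the parent parser; the doc_type discriminator and the field names here are assumptions:

```xml
<!-- children are nested inside the parent; the block must be sent as a unit -->
<add>
  <doc>
    <field name="id">10010137332</field>
    <field name="doc_type">product</field>
    <field name="title">iPad 64gb</field>
    <doc>
      <field name="id">10010137332-343-1345</field>
      <field name="merchantid">343</field>
      <field name="locationid">1345</field>
      <field name="cost">USD|600</field>
    </doc>
  </doc>
</add>
```

Products priced at a given location could then be found with q={!parent which="doc_type:product"}locationid:1345. Note that a block must be re-indexed as a whole whenever any price in it changes.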
Re: Shards don't return documents in same order
Hmmm, take a look at the admin/analysis page for these inputs for alphaOnlySort. If you're using the stock Solr distro, you're probably not considering the effects of PatternReplaceFilterFactory, which is removing all non-letters. So these three terms reduce to

mba
mba
mbanew

You can look at the actual indexed terms on the admin/schema-browser page as well. That said, unless you transposed the order because you were concentrating on the numeric part, the doc with MB20140410A-New should always be sorting last. All of which is irrelevant if you're doing something else with alphaOnlySort, so please paste in the fieldType definition if you've changed it. What gets returned in the doc for _stored_ data is a verbatim copy, NOT the output of the analysis chain, which can be confusing. Oh, and Solr uses the internal Lucene doc ID to break ties, and docs on different replicas can have different internal Lucene doc IDs relative to each other as a result of merging, so that's something else to watch out for. Best, Erick On Wed, Apr 30, 2014 at 1:06 PM, Francois Perron francois.per...@ticketmaster.com wrote: <snip>
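For reference, the stock alphaOnlySort type in the 4.x example schema.xml looks roughly like this (quoted from memory, so check your own copy); the PatternReplaceFilterFactory at the end is what strips the digits and the hyphen:

```xml
<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- the whole field value becomes a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- anything that is not a lowercase letter is removed -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```

So MB20140410A and MB20140411A both index as mba, and only the internal doc-ID tiebreak separates them, which can differ per replica.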
Which Lucene search syntax is faster
Hi, Given the following Lucene documents that I'm adding to my index (and I expect to have over 10 million of them, each with various sizes from 1 Kb to 50 Kb):

<add>
  <doc>
    <field name="doc_type">PDF</field>
    <field name="title">Some name</field>
    <field name="summary">Some summary</field>
    <field name="owner">Who owns this</field>
    <field name="price">10</field>
    <field name="isbn">1234567890</field>
  </doc>
  <doc>
    <field name="doc_type">DOC</field>
    <field name="title">Some name</field>
    <field name="summary">Some summary</field>
    <field name="owner">Who owns this</field>
    <field name="price">10</field>
    <field name="isbn">0987654321</field>
  </doc>
  <!-- and more docs -->
</add>

My question is this: what Lucene search syntax will give me back results the fastest? If my user is interested in finding data within the "title" and "owner" fields only, for "doc_type" "DOC", should I build my Lucene search syntax as:

1) skyfall ian fleming AND doc_type:DOC
2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND doc_type:DOC
3) Something else I don't know about.

Of the 10 million documents I will be indexing, 80% will be of doc_type PDF, and about 10% of type DOC, so please keep that in mind as a factor (if that will mean anything in terms of which syntax I should use). Thanks in advance, - MJ
Re: Which Lucene search syntax is faster
On 4/30/2014 2:29 PM, johnmu...@aol.com wrote: My question is this: what Lucene search syntax will give me back results the fastest? If my user is interested in finding data within the "title" and "owner" fields only, for "doc_type" "DOC", should I build my Lucene search syntax as: 1) skyfall ian fleming AND doc_type:DOC If your default field is text, I'm fairly sure this will become equivalent to the following, which is probably NOT what you want. Parentheses can be very important. text:skyfall OR text:ian OR (text:fleming AND doc_type:DOC) 2) title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming) AND doc_type:DOC This kind of query syntax is probably what you should shoot for. Not from a performance perspective -- just from the perspective of making your queries completely correct. Note that the +/- syntax combined with parentheses is far more precise than using AND/OR/NOT. 3) Something else I don't know about. The edismax query parser is very powerful. That might be something you're interested in. https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser Of the 10 million documents I will be indexing, 80% will be of doc_type PDF, and about 10% of type DOC, so please keep that in mind as a factor (if that will mean anything in terms of which syntax I should use). For the most part, whatever general query format you choose to use will not matter very much. There are exceptions, but mostly Solr (Lucene) is smart enough to convert your query to an efficient final parsed format. Turn on the debugQuery parameter to see what it does with each query. Regardless of whether you use the standard lucene query parser or edismax, incorporate filter queries into your query-constructing logic. Your second example above would be better expressed like this, with the default operator set to OR.
This uses both q and fq parameters: q=title:(skyfall ian fleming) owner:(skyfall ian fleming)&fq=doc_type:DOC https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter Thanks, Shawn
Re: Which Lucene search syntax is faster
I'd add that I think you're worrying about the wrong thing. 10M documents is not very many by modern Solr standards. I rather suspect that you won't notice much difference in performance due to how you construct the query. Shawn's suggestion to use fq clauses is spot on, though. fq clauses are re-used (see filterCache in solrconfig.xml). My rule of thumb is to use fq clauses for most everything that does NOT contribute to scoring... Best, Erick On Wed, Apr 30, 2014 at 2:18 PM, Shawn Heisey s...@elyograg.org wrote: <snip>
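Since fq reuse depends on the filterCache, it is worth knowing where that lives in solrconfig.xml; the sizes below are illustrative, not tuning advice:

```xml
<!-- solrconfig.xml: each distinct fq string caches its matching doc set -->
<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
```

A repeated fq=doc_type:DOC is then answered from the cache instead of being re-evaluated against the index.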
Re: Which Lucene search syntax is faster
Thank you Shawn and Erick for the quick response. A follow-up question. Based on https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter, I see the fl (field list) parameter. Does this mean I can build my Lucene search syntax as follows: q=skyfall OR ian OR fleming&fl=title&fl=owner&fq=doc_type:DOC And get the same result as (per Shawn's example, changed a bit to add OR): q=title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming)&fq=doc_type:DOC Btw, my default search operator is set to AND. My need is to find whatever the user types in both of those two fields (or maybe some other fields, which is controlled by the UI). For example, the user types skyfall ian fleming, selected 3 fields, and wants to narrow down to doc_type DOC. - MJ -----Original Message----- From: Erick Erickson erickerick...@gmail.com To: solr-user solr-user@lucene.apache.org Sent: Wed, Apr 30, 2014 5:33 pm Subject: Re: Which Lucene search syntax is faster <snip>
What are the best practices on Multiple Language support in Solr Cloud ?
Hi, I'm trying to implement multiple language support in SolrCloud (4.7). Although we have different languages in the index, we were only supporting English in terms of indexing and query. To provide some context: our current index size is 35 GB with close to 15 million documents. We have two shards with two replicas per shard. I'm using composite-id routing to support de-duplication, which puts documents having the same field (dedup) value on a specific shard. The language is known up-front for every document being indexed, which saves the need for runtime language detection. Similarly, during query, the language will be known as well. In other words, there's no need for mixed multi-lingual support. Based on my understanding so far, there are three widely adopted approaches: multi-field indexing, multi-core indexing, and multiple languages in one field (the last based on Solr in Action). The first option seems easy to implement. But then, I have around 40 fields being indexed currently, though a majority of them are type=string and not analyzed. I'm planning to support around 10 languages, which translates to 400 field definitions in the same schema. And this is poised to grow with the addition of languages and fields. My apprehension is whether this approach becomes a maintenance nightmare. Does it affect overall scalability? Does it affect any existing features like Suggester, Spellcheck, etc.? I was thinking of including language as part of the id key. It'll look like Language!Dedup_id!url so that documents are spread across the two shards. The second option of a dedicated core sounds easy in terms of maintaining config files. Also, routing requests will be fairly easy as the language will always be known up-front, both at indexing and query time. But, as I looked into the documents, 60% of our total index will be in English, while the remaining 40% will constitute the remaining 10-14 languages. Some languages' content is in the few thousands of documents, which perhaps doesn't merit a dedicated core.
On top of that, this approach has the potential of growing into a complex infrastructure, which might be hard to maintain. I read about the use of multiple languages in a single field in Trey Grainger's book. It looks like a great approach, but I'm not sure it is meant to address my scenario. My first impression is that it's more geared towards supporting multi-lingual content, but I may be completely wrong. Also, this is not supported by Solr / Lucene out of the box. I know there are a lot of people in this group who have excelled as far as supporting multiple languages in Solr is concerned. I'm trying to gather their input / experience on best practices to help me decide on the right approach. Any pointer on this will be highly appreciated. Thanks, Shamik
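For the multi-field option, per-language dynamic fields can keep the schema from ballooning into 400 hand-written definitions: one dynamicField per language covers every text field. A sketch, assuming the text_en/text_fr/text_de types from the example schema and a *_txt_&lt;lang&gt; suffix convention of my own choosing:

```xml
<!-- schema.xml: one dynamic field per language instead of fields x languages -->
<dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
<dynamicField name="*_txt_fr" type="text_fr" indexed="true" stored="true"/>
<dynamicField name="*_txt_de" type="text_de" indexed="true" stored="true"/>
```

Since the language is known at both index and query time, a document carries e.g. title_txt_fr, and the query layer targets the matching suffix.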
Re: Which Lucene search syntax is faster
On 4/30/2014 3:47 PM, johnmu...@aol.com wrote: Thank you Shawn and Erick for the quick response. A follow-up question. Based on https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-Thefq%28FilterQuery%29Parameter, I see the fl (field list) parameter. Does this mean I can build my Lucene search syntax as follows: The fl parameter determines which stored fields show up in the results. By default, all fields that are stored will be returned. If you want relevancy scores, you'd include the pseudo-field named score -- fl=*,score is something we see a lot. The fl parameter does not affect the *search* at all. q=skyfall OR ian OR fleming&fl=title&fl=owner&fq=doc_type:DOC And get the same result as (per Shawn's example, changed a bit to add OR): q=title:(skyfall OR ian OR fleming) owner:(skyfall OR ian OR fleming)&fq=doc_type:DOC Exactly right. Btw, my default search operator is set to AND. My need is to find whatever the user types in both of those two fields (or maybe some other fields, which is controlled by the UI). For example, the user types skyfall ian fleming, selected 3 fields, and wants to narrow down to doc_type DOC. With the standard parser, you'd have to do the following. Assume that USERQUERY is a very basic query, perhaps a few terms, like your example of skyfall ian fleming. q=field1:(USERQUERY) OR field2:(USERQUERY) OR field3:(USERQUERY)&fq=doc_type:DOC With edismax, you'd do: q=USERQUERY&qf=field1 field2 field3&fq=doc_type:DOC You might also add pf=field1 field2 field3 ... and there are a great many other edismax/dismax query parameters too. The edismax parser does some truly amazing stuff. Echoing what both Erick and I said ... worrying about the exact syntax is premature optimization. 10 million docs is something that Solr can handle easily, as long as there's enough RAM. Thanks, Shawn
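Those edismax parameters can also be baked into a request handler in solrconfig.xml so the client only sends q and fq; the handler name and field names below are illustrative:

```xml
<!-- solrconfig.xml: edismax defaults for searching title and owner -->
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title owner</str>
    <str name="pf">title owner</str>
    <str name="fl">title,owner,score</str>
  </lst>
</requestHandler>
```

A request then reduces to /search?q=skyfall ian fleming&fq=doc_type:DOC.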
Re: timeAllowed in not honoring
Jeff - Thanks, this discussion on JIRA is really quite helpful. Shawn - Yes, we have some plans to move to SolrCloud. Our total index size is 40GB with 11M docs, available RAM is 32GB, the allowed heap space for Solr is 14GB, and the GC tuning parameters used on our server are -XX:+UseConcMarkSweepGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. Mikhail Khludnev - Thanks, I will try to use facet.method=enum; this will definitely help us save some time. With Regards Aman Tandon On Wed, Apr 30, 2014 at 8:30 PM, Jeff Wartes jwar...@whitepages.com wrote: It's not just FacetComponent, here's the original feature ticket for timeAllowed: https://issues.apache.org/jira/browse/SOLR-502 As I read it, timeAllowed only limits the time spent actually getting documents, not the time spent figuring out what data to get or how. I think that means the primary use-case is serving as a guard against excessive paging. On 4/30/14, 4:49 AM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: On Wed, Apr 30, 2014 at 2:16 PM, Aman Tandon amantandon...@gmail.com wrote: <lst name="query"><double name="time">3337.0</double></lst> <lst name="facet"><double name="time">6739.0</double></lst> Most time is spent in facet counting. FacetComponent doesn't check timeAllowed right now. You can try to experiment with facet.method=enum, or even with https://issues.apache.org/jira/browse/SOLR-5725, or try distributed search with SolrCloud. AFAIK, you can't employ threads to speed up multivalued facets. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
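For the record, the experiment Mikhail suggests is purely a query-side change (category is an illustrative field name); per Jeff's reading, the timeAllowed bound still won't cover the facet-counting phase:

```
/select?q=*:*&timeAllowed=2000&facet=true&facet.field=category&facet.method=enum
```

facet.method=enum enumerates the field's terms and intersects their doc sets (via the filterCache) with the result set, which can beat the default fc method on low-cardinality fields.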