Re: Solr commit taking too long
Some questions: What version of Solr? Has the number of documents in your index changed in the meantime? How many before, how many now? How does maxDocs compare to numDocs? Has this system ever been upgraded from an older Solr? Is it the commit itself that is taking that long, or opening a searcher once the commit is done? Maybe answers to these might help unpick your issue. Upayavira

On Thu, Jan 17, 2013, at 06:22 AM, Cool Techi wrote: Hi, We have an index of approximately 400GB in size; indexing 5000 documents was taking 20 seconds. But lately, the indexing is taking very long; committing the same number of documents is taking 5-20 minutes. On checking the logs I can see that there are frequent merges happening, which I am guessing is the reason for this. How can this be improved? My configurations are given below:

    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>30</mergeFactor>
    <ramBufferSizeMB>64</ramBufferSizeMB>

regards, Ayush
Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?
Hi Mark, one entry in my long list of self-made problems is: doing the commit before the ConcurrentUpdateSolrServer was finished. Since the ConcurrentUpdateSolrServer is asynchronous, it's very easy to create a race condition. Make sure that your program waits (with blockUntilFinished()) before it does the commit:

    if (solrserver instanceof ConcurrentUpdateSolrServer) {
        ((ConcurrentUpdateSolrServer) solrserver).blockUntilFinished();
    }

Uwe
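A minimal end-to-end sketch of the ordering Uwe describes, assuming SolrJ 4.x (the URL and the queue/thread sizes are placeholders):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Updates are queued and sent to Solr by background threads.
    SolrServer solrserver =
        new ConcurrentUpdateSolrServer("http://localhost:8983/solr", 100, 4);
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "1");
    solrserver.add(doc);
    // Drain the async queue first, otherwise the commit can reach Solr
    // before the last documents do.
    if (solrserver instanceof ConcurrentUpdateSolrServer) {
        ((ConcurrentUpdateSolrServer) solrserver).blockUntilFinished();
    }
    solrserver.commit();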
URL encoding problems
Hi, I have some problems related to URL encoding. I'm using Solr 3.6.1 on a Windows (32 bit) system. Apache Tomcat is version 6.0.36. I'm accessing Solr through solrj-3.3.0. When using the Solr admin and specifying my request, the URL looks like this (${SOLR} is there for the sake of brevity):

    ${SOLR}/select?q=rapporteur_name%3A%28John+%2BSmith+%2B%5C%28FOO%5C%29%29

But when my app launches the query, the URL looks like this:

    ${SOLR}/select?q=rapporteur_name%3A%28John%5C+Smith%5C+%5C%28FOO%5C%29%29

My decoded query, as entered in the admin interface, is:

    rapporteur_name:(John +Smith +\(FOO\))

Both requests return results, but only the first returns the correct ones. The code that escapes the query is:

    SolrQuery query = new SolrQuery();
    query.setQuery("rapporteur_name:(" + ClientUtils.escapeQueryChars("John Smith (FOO)") + ")");

I don't know if it's the right way to encode the query. Any ideas or directions? Regards. -- Bruno Dusausoy Software Engineer YP5 Software -- Think of the environment: limit printing of this e-mail. Please don't print this e-mail unless you really need to.
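A hedged sketch of one way to get the admin-style query from SolrJ: escape each term separately, so the spaces and the + operators stay query syntax instead of being escaped (escapeQueryChars also escapes spaces, which is what produced the %5C+ sequences in the second URL above):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.util.ClientUtils;

    StringBuilder q = new StringBuilder("rapporteur_name:(");
    q.append(ClientUtils.escapeQueryChars("John"));
    q.append(" +").append(ClientUtils.escapeQueryChars("Smith"));
    q.append(" +").append(ClientUtils.escapeQueryChars("(FOO)"));
    q.append(")");
    SolrQuery query = new SolrQuery();
    query.setQuery(q.toString());  // rapporteur_name:(John +Smith +\(FOO\))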
Re: Response time in client was much longer than QTime in tomcat
Hello, QTime counts only searching and filtering, but not writing the response, which includes retrieving the stored fields (fl=...). So, it's quite reasonable.

On Thu, Jan 17, 2013 at 7:09 AM, 张浓飞 zhangnong...@vancl.cn wrote: I have a solr website with about 500 docs (30 fields defined in the schema), and a C# client on the same machine which sends HTTP GET requests to that solr website. These logs were recorded by my C# client:

    01-16 23:54:49,301 [107] INFO LogHelper - requst time too long: 1054, solr time: 1003
    01-16 23:54:49,847 [63] INFO LogHelper - requst time too long: 1068, solr time: 1021
    01-16 23:57:17,813 [108] INFO LogHelper - requst time too long: 1051, solr time: 1027
    01-16 23:57:18,313 [111] INFO LogHelper - requst time too long: 1031, solr time: 1007

and so on… You can see, the query times from solr were long and very similar (between 1000ms and 1050ms). At the same time, the corresponding logs in tomcat:

    2013-1-16 23:54:49 org.apache.solr.core.SolrCore execute
    Info: [suit1] webapp=/vanclsearchV2 path=/select/ params={fl=id,typeid,createtime,vprice,sprice,price,totalassesscount,totalsalescount,productcode,productname,stylecode,tag,vpricesku,spricesku,pricesku,userrate,assesscount,lstphotos,mainphotos,salesflag,isduanma,detailsalescount,productplusstyleinfo&sort=createtime+desc&start=0&q=*:*&wt=json&fq=ancestorsid:(28976+OR+28978)&fq=typeid:(1)&rows=30} hits=43 status=0 QTime=0
    2013-1-16 23:54:49 org.apache.solr.core.SolrCore execute
    Info: [suit1] webapp=/vanclsearchV2 path=/select/ params={fl=id,typeid,createtime,vprice,sprice,price,totalassesscount,totalsalescount,productcode,productname,stylecode,tag,vpricesku,spricesku,pricesku,userrate,assesscount,lstphotos,mainphotos,salesflag,isduanma,detailsalescount,productplusstyleinfo&sort=createtime+desc&start=0&q=*:*&wt=json&fq=ancestorsid:(28976+OR+28978)&fq=typeid:(1)&rows=30} hits=43 status=0 QTime=0
    2013-1-16 23:57:17 org.apache.solr.core.SolrCore execute
    Info: [suit1] webapp=/vanclsearchV2 path=/select/ params={fl=id,typeid,createtime,vprice,sprice,price,totalassesscount,totalsalescount,productcode,productname,stylecode,tag,vpricesku,spricesku,pricesku,userrate,assesscount,lstphotos,mainphotos,salesflag,isduanma,detailsalescount,productplusstyleinfo&sort=createtime+desc&start=0&q=*:*&wt=json&fq=ancestorsid:(27547+OR+27614)&rows=30} hits=9 status=0 QTime=0
    2013-1-16 23:57:18 org.apache.solr.core.SolrCore execute
    Info: [suit1] webapp=/vanclsearchV2 path=/select/ params={fl=id,typeid,createtime,vprice,sprice,price,totalassesscount,totalsalescount,productcode,productname,stylecode,tag,vpricesku,spricesku,pricesku,userrate,assesscount,lstphotos,mainphotos,salesflag,isduanma,detailsalescount,productplusstyleinfo&sort=createtime+desc&start=0&q=*:*&wt=json&fq=ancestorsid:(27547+OR+27614)&rows=30} hits=9 status=0 QTime=0

Very strange, all the QTime were zero! Can anyone explain this circumstance, and how to solve the problem? -- Domi.N.Zhang | Dev Center Email: zhangnong...@vancl.cn Tel: 86-028-65528402 I'm the coming days… -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
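To see the same split that Mikhail describes from the client side, SolrJ exposes both numbers; a small sketch (the original poster's client is C#, but the idea is identical, and the URL here is a placeholder):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    HttpSolrServer server = new HttpSolrServer("http://localhost:8080/vanclsearchV2/suit1");
    QueryResponse rsp = server.query(new SolrQuery("*:*"));
    System.out.println("QTime=" + rsp.getQTime()              // search + filter only
        + "ms, elapsed=" + rsp.getElapsedTime() + "ms");      // includes response writing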
how to get abortOnConfigurationError=false working
I will explain the scenario just to avoid all the potential replies asking why. We run ColdFusion servers (Windows) which have SOLR built in (running on Jetty). A customer creates a collection which is stored within their own webspace; they only have read/write access to their own webspace, so cannot put them anywhere else. The default value for abortOnConfigurationError is true. This causes endless problems when customers make changes to their websites or cancel their hosting: the collection gets deleted, and SOLR then crashes because it cannot find the config files for that collection. We then have to find out which collection is causing the problem, and manually remove its entry from solr.xml. Obviously this is a PITA. In the error output it says:

    If you want solr to continue after configuration errors, change:
    <abortOnConfigurationError>false</abortOnConfigurationError>
    in solr.xml

I have tried this, but it has no effect. I have also tried putting it in all the solrconfig.xml files. I tried this

    <abortOnConfigurationError>${solr.abortOnConfigurationError:false}</abortOnConfigurationError>

and this

    <abortOnConfigurationError>false</abortOnConfigurationError>

Neither had any effect. How do you get this to work?
Re: Suggestion that preserve original phrase case
You could write a custom Filter (or perhaps Tokenizer), but I usually just do it on the input side before things get sent to Solr. I don't think PatternReplaceCharFilterFactory will help; you could easily turn the input into original:original, but then you'd need to write a custom filter that normalized the left-hand side but not the right-hand side. Best, Erick

On Tue, Jan 15, 2013 at 11:27 AM, Selvam s.selvams...@gmail.com wrote: Thanks Erick, can you tell me how to do the appending (lowercaseversion:LowerCaseVersion) before indexing? I tried pattern factory filters, but I could not get it right.

On Sun, Jan 13, 2013 at 8:49 PM, Erick Erickson erickerick...@gmail.com wrote: One way I've seen this done is to index pairs like lowercaseversion:LowerCaseVersion. You can't push this whole thing through your field as defined since it'll all be lowercased; you have to produce the left-hand side of the above yourself and just use KeywordTokenizer without LowercaseFilter. Then, your application displays the right-hand side of the returned token. Simple solution, not very elegant, but sometimes the easiest... Best, Erick

On Fri, Jan 11, 2013 at 1:30 AM, Selvam s.selvams...@gmail.com wrote: Hi, I have been trying to figure out a way for case-insensitive suggestion which should return the original phrase as the result. I am using solr 3.5. For example: if I index 'Hello world' and search for 'hello', it needs to return 'Hello world', not 'hello world'. My configurations are as follows.

New field type:

    <fieldType class="solr.TextField" name="text_auto">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

Field values:

    <field name="label" type="text" indexed="true" stored="true" termVectors="true" omitNorms="true"/>
    <field name="label_autocomplete" type="text_auto" indexed="true" stored="true" multiValued="false"/>
    <copyField source="label" dest="label_autocomplete"/>

Spellcheck component:

    <searchComponent name="suggest" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">text_auto</str>
      <lst name="spellchecker">
        <str name="name">suggest</str>
        <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
        <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
        <str name="buildOnOptimize">true</str>
        <str name="buildOnCommit">true</str>
        <str name="field">label_autocomplete</str>
      </lst>
    </searchComponent>

Kindly share your suggestions to implement this behavior. -- Regards, Selvam KnackForge http://knackforge.com Acquia Service Partner No. 1, 12th Line, K.K. Road, Venkatapuram, Ambattur, Chennai, Tamil Nadu, India. PIN - 600 053. -- Regards, Selvam KnackForge http://knackforge.com Acquia Service Partner No. 1, 12th Line, K.K. Road, Venkatapuram, Ambattur, Chennai, Tamil Nadu, India. PIN - 600 053.
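A small sketch of the input-side appending Erick describes, with field names taken from the thread (SolrJ assumed; the exact pair format is whatever your display code expects when it splits on the colon):

    import org.apache.solr.common.SolrInputDocument;

    String label = "Hello world";
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("label", label);
    // Left-hand side lowercased for matching, right-hand side verbatim for
    // display; the field uses KeywordTokenizer with no LowerCaseFilter.
    doc.addField("label_autocomplete", label.toLowerCase() + ":" + label);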
Re: SOlr 3.5 and sharding
You're still confusing shards (or at least mixing up the terminology) with simple replication. Shards are when you split up the index into several sub-indexes and configure the sub-indexes to know about each other. Say you have 1M docs in 2 shards. 500K of them would go on one shard and 500K on the other. But logically you have a single index of 1M docs. So the two shards have to know about each other, and when you send a request to one of them, it automatically queries the other (as well as itself), collects the responses and combines them, returning the top N to the requester. This is totally different from replication. In replication (master/slave), each node has all 1M documents. Each node can work totally in isolation. An incoming request is handled by the slave without contacting any other node. If you're copying around indexes AND configuring them as though they were shards, each request will be distributed to all shards and the results collated, giving you the same doc repeatedly in your result set. If you have no access to the indexing code, you really can't go to a sharded setup. Polling is when the slaves periodically ask the master "has anything changed?" If so, then the slave pulls down the changes. The polling interval is configured in solrconfig.xml _on the slave_. So let's say you index docs to the master. For some interval, until the slaves poll the master and get an updated index, the number of searchable docs on the master will be different than for the slaves. Additionally, you may have the issue of the polling intervals for the slaves being offset from one another, so for some brief interval the counts on the slaves may be different as well. Best, Erick

On Tue, Jan 15, 2013 at 10:18 AM, Jean-Sebastien Vachon jean-sebastien.vac...@wantedanalytics.com wrote: Ok, I see what Erick meant now. Thanks. The original index I'm working on contains about 120k documents. Since I have no access to the code that pushes documents into the index, I made four copies of the same index. The master node contains no data at all; it simply uses the data available in its four shards. Knowing that I have 1000 documents matching the keyword java on each shard, I was expecting to receive 4000 documents out of my sharded setup. There are only a few documents that are not accounted for (the result count is about 3996, which is pretty close but not accurate). Right now the index is static, so there is no need for any replication and the polling interval has no effect. Later this week, I will configure the replication and have the indexation modified to distribute the documents to each shard using a simple ID modulo 4 rule. Were my expectations wrong about the number of documents?

-----Original Message----- From: Upayavira [mailto:u...@odoko.co.uk] Sent: January-15-13 9:21 AM To: solr-user@lucene.apache.org Subject: Re: SOlr 3.5 and sharding He was referring to a master/slave setup, where a slave will poll the master periodically asking for index updates. That frequency is configured in solrconfig.xml on the slave. So, you are saying that you have, say, 1m documents in your master index. You then copy your index to four other boxes. At that point you have 1m documents on each of those four. Eventually you'll delete some docs, so you'd have 250k on each. You're wondering why, before the deletes, you're not seeing 1m docs on each of your instances. Or are you wondering why you're not seeing 1m docs when you do a distributed query across all four of these boxes? Is that correct?
Upayavira On Tue, Jan 15, 2013, at 02:11 PM, Jean-Sebastien Vachon wrote: Hi Erick, Thanks for your comments, but I am migrating an existing index (single instance) to a sharded setup and currently I have no access to the code involved in the indexation process. That's why I made a simple copy of the index on each shard. In the end, the data will be distributed among all shards. I was just curious to know why I did not have the expected number of documents with my four shards. Can you elaborate on this polling interval thing? I am pretty sure I never heard about this... Regards

-----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: January-15-13 8:00 AM To: solr-user@lucene.apache.org Subject: Re: SOlr 3.5 and sharding You're confusing shards and slaves here. Shards are splitting a logical index amongst N machines, where each machine contains a portion of the index. In that setup, you have to configure the slaves to know about the other shards, and the incoming query has to be distributed amongst all the shards to find all the docs. In your case, since you're really replicating (rather than sharding), you only have to query _one_ slave; the query doesn't need to be distributed. So pull all the sharding stuff out of your config files, put a load balancer in front of your slaves and only send the request to one of them would be the
Field Collapsing - Anything in the works for multi-valued fields?
I want to configure Field Collapsing, but my target field is multi-valued (e.g. the field I want to group on has a variable # of entries per document, 1-N entries). I read on the wiki (http://wiki.apache.org/solr/FieldCollapsing) that grouping doesn't support multi-valued fields yet. Anything in the works on that front by chance? Any common work-arounds?
Re: Large data importing getting rollback with solr
Hi, It looks like this is the cause: JBC0016E: Remote call failed (return code=-2,220). SDK9019E: internal error SDK9019X: Interestingly, Google gives just 1 hit for the above as a query - your post. But it seems you should look up what the above codes mean first... Otis -- Solr & ElasticSearch Support http://sematext.com/

On Thu, Jan 17, 2013 at 2:43 AM, ashimbose ashimb...@gmail.com wrote: I am trying to index large data (not rich documents), about 5GB, but it's not getting indexed. In the case of small data it indexes perfectly. For the large data import, the XML response is:

    0 0 data-config.xml full-import busy A command is still running... 0:9:12.738 169 1810790 2013-01-17 12:50:13 Indexing failed. Rolled back all changes. 2013-01-17 12:50:30 This response format is experimental. It is likely to change in the future.

BUT for the small data index, the XML response is perfectly OK, as below:

    0 0 data-config.xml full-import busy A command is still running... 0:0:12.436 11 382090 2013-01-17 12:56:57 Indexing completed. Added/Updated: 38209 documents. Deleted 0 documents. This response format is experimental. It is likely to change in the future.

For the large data import, the error log response is as below... it's getting rolled back:

    INFO: Time taken for getConnection(): 1343
    Jan 17, 2013 12:36:21 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOBCODE_HAZ_BRA with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:23 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1341
    Jan 17, 2013 12:36:23 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOBCODE_HAZ_TBL with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:24 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1357
    Jan 17, 2013 12:36:24 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOBCODE_LANG with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:26 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1392
    Jan 17, 2013 12:36:26 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOBCODE_TBL with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:27 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1535
    Jan 17, 2013 12:36:41 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOBCODE_TBL_ARG with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:43 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1467
    Jan 17, 2013 12:36:43 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOBCODE_TBL_BRA with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:44 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1373
    Jan 17, 2013 12:36:44 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOBCOMP_TMP_MC with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:45 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1404
    Jan 17, 2013 12:36:45 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOBFUNCTION_LNG with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:47 PM org.apache.solr.core.SolrCore execute
    INFO: [core1] webapp=/solr path=/dataimport params={command=full-import} status=0 QTime=0
    Jan 17, 2013 12:36:47 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1357
    Jan 17, 2013 12:36:47 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOBFUNCTION_TBL with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:48 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1310
    Jan 17, 2013 12:36:48 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Creating a connection for entity PS_JOB_APPROVALS with URL: jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB
    Jan 17, 2013 12:36:50 PM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
    INFO: Time taken for getConnection(): 1342
    Jan 17, 2013 12:36:50 PM
Re: Solr commit taking too long
Hi, That's a juicy index. Is this on a single server? Have you considered sharding it and thus spreading the indexing work over multiple servers, disks, etc.? You could increase ramBufferSizeMB, which will help a bit with indexing speed, but not with actual merging. Otis -- Solr & ElasticSearch Support http://sematext.com/

On Thu, Jan 17, 2013 at 1:22 AM, Cool Techi cooltec...@outlook.com wrote: Hi, We have an index of approximately 400GB in size; indexing 5000 documents was taking 20 seconds. But lately, the indexing is taking very long; committing the same number of documents is taking 5-20 minutes. On checking the logs I can see that there are frequent merges happening, which I am guessing is the reason for this. How can this be improved? My configurations are given below:

    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>30</mergeFactor>
    <ramBufferSizeMB>64</ramBufferSizeMB>

regards, Ayush
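A hedged illustration of that change for the indexConfig section of solrconfig.xml; 256 is an arbitrary example value, not a recommendation:

    <ramBufferSizeMB>256</ramBufferSizeMB>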
Re: how to get abortOnConfigurationError=false working
here is what it says in the SOLR info page:

    Solr Specification Version: 1.4.0.2009.11.18.10.19.05
    Solr Implementation Version: 1.4.1-dev exported - kvinu - 2009-11-18 10:19:05
    Lucene Specification Version: 2.9.1
    Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

On Thu, Jan 17, 2013 at 1:33 PM, Alexandre Rafalovitch [via Lucene] ml-node+s472066n4034156...@n3.nabble.com wrote: Which version of Solr is it for? I had a situation on Solr4 where I basically did not have a directory that solr.xml was pointing at for one of the cores. And Solr continued working, but the Admin interface was showing big red banners about a configuration problem. So maybe it was a bug that was fixed for Solr 4? Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Jan 17, 2013 at 8:03 AM, snake [hidden email] wrote: I will explain the scenario just to avoid all the potential replies asking why. We run ColdFusion servers (Windows) which have SOLR built in (running on Jetty). A customer creates a collection which is stored within their own webspace; they only have read/write access to their own webspace, so cannot put them anywhere else. The default value for abortOnConfigurationError is true. This causes endless problems when customers make changes to their websites or cancel their hosting: the collection gets deleted, and SOLR then crashes because it cannot find the config files for that collection. We then have to find out which collection is causing the problem, and manually remove its entry from solr.xml. Obviously this is a PITA. In the error output it says:

    If you want solr to continue after configuration errors, change:
    <abortOnConfigurationError>false</abortOnConfigurationError>
    in solr.xml

I have tried this, but it has no effect. I have also tried putting it in all the solrconfig.xml files. I tried this

    <abortOnConfigurationError>${solr.abortOnConfigurationError:false}</abortOnConfigurationError>

and this

    <abortOnConfigurationError>false</abortOnConfigurationError>

Neither had any effect. How do you get this to work?
-- Russ Michaels www.bluethunderinternet.com : Business hosting services & solutions www.cfmldeveloper.com : ColdFusion developer community www.michaels.me.uk : my blog www.cfsearch.com : ColdFusion search engine skype me: russmichaels
Re: URL encoding problems
Similar thoughts: I used unit tests to explore that issue with SolrJ, originally encoding with ClientUtils. The returned results had | in many places in the text, with no clear way to un-encode. I eventually ran some tests with no encoding at all, including strings like <tag>hello goodbye</tag>; such strings were served and fetched without errors. In queries at the admin console, they show up in the JSON results correctly. What's left? I share the confusion about what is really going on. Jack

On Thu, Jan 17, 2013 at 2:44 AM, Bruno Dusausoy bdusau...@yp5.be wrote: Hi, I have some problems related to URL encoding. I'm using Solr 3.6.1 on a Windows (32 bit) system. Apache Tomcat is version 6.0.36. I'm accessing Solr through solrj-3.3.0. When using the Solr admin and specifying my request, the URL looks like this (${SOLR} is there for the sake of brevity):

    ${SOLR}/select?q=rapporteur_name%3A%28John+%2BSmith+%2B%5C%28FOO%5C%29%29

But when my app launches the query, the URL looks like this:

    ${SOLR}/select?q=rapporteur_name%3A%28John%5C+Smith%5C+%5C%28FOO%5C%29%29

My decoded query, as entered in the admin interface, is:

    rapporteur_name:(John +Smith +\(FOO\))

Both requests return results, but only the first returns the correct ones. The code that escapes the query is:

    SolrQuery query = new SolrQuery();
    query.setQuery("rapporteur_name:(" + ClientUtils.escapeQueryChars("John Smith (FOO)") + ")");

I don't know if it's the right way to encode the query. Any ideas or directions? Regards. -- Bruno Dusausoy Software Engineer YP5 Software -- Think of the environment: limit printing of this e-mail. Please don't print this e-mail unless you really need to.
Re: group.ngroups behavior in response
There's a parameter to enable that. :D In SolrJ:

    solrQuery.setParam("group.ngroups", true);

http://wiki.apache.org/solr/FieldCollapsing
Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?
On 1/17/2013 3:32 AM, Uwe Reh wrote: one entry in my long list of self-made problems is: doing the commit before the ConcurrentUpdateSolrServer was finished. Since the ConcurrentUpdateSolrServer is asynchronous, it's very easy to create a race condition. Make sure that your program waits before it does the commit: if (solrserver instanceof ConcurrentUpdateSolrServer) { ((ConcurrentUpdateSolrServer) solrserver).blockUntilFinished(); }

If you are using the same ConcurrentUpdateSolrServer object for all update interaction with Solr (including commits) and you still have to do the blockUntilFinished() in your own code before you issue an explicit commit, that sounds like a bug, and you should put all the details in a Jira issue. The following code is part of the request method in CUSS:

    // this happens for commit...
    if (req.getDocuments() == null || req.getDocuments().isEmpty()) {
        blockUntilFinished();
        return server.request(request);
    }

This means that if you use the same CUSS object for update interaction with Solr (including commits), the object will do the waiting for you when you make an explicit commit() call. If you issue a commit with a different object (either another instance of CUSS or HttpSolrServer), then this won't work and you'd have to handle it yourself. For error handling, I filed SOLR-3284 and provided a patch. It hasn't been committed, I think mostly because it doesn't give any specific information about what failed. I have an idea for how to improve the patch to address committer concerns, but until I have some time to actually look at it, I won't know if it's viable. When I have a moment, I'll update the issue with details about my idea. Thanks, Shawn
Re: Search strategy - improving search quality for short search terms such as doll
Hi David, I think this is where search analytics can help. If your intuition is right and people who search for doll are not actually searching for doll face... CD, then search analytics will confirm that. The analytics I'm talking about involves search and click tracking and analysis. Once you have this data you can play with boosting queries, altering queries, etc. based on this historical knowledge about what people who searched for X tend to do after the search. Otis -- Solr & ElasticSearch Support http://sematext.com/

On Wed, Jan 16, 2013 at 9:51 PM, David Parks davidpark...@yahoo.com wrote: My issue is more that the search term doll shows up both in documents about CDs and in documents about toys. But I have 10 CD documents for every toy document, so my searches for doll tend to show the CDs most prominently. But that's not the way a user thinks. If they want the CD documents they'll search for doll face, or doll face song, more specific queries (which work fine); but if they want the toy they might just search for doll. If I run the searches doll and doll song on Google image search, you'll clearly see that Google has solved this problem perfectly: doll returns toy dolls, and doll song returns music and anime results. I'm striving for this type of result.

-----Original Message----- From: Amit Jha [mailto:shanuu@gmail.com] Sent: Wednesday, January 16, 2013 11:41 PM To: solr-user@lucene.apache.org Subject: Re: Search strategy - improving search quality for short search terms such as doll It's all about the data set; here I mean the index. If you have documents containing toy and doll, it will return them in the result set. What I understood is that you are talking about the context of the query. For example, if you search books on MK Gandhi and books by MK Gandhi, the two queries have different contexts. Context-based search is at some level achieved by natural language processing; this is one thing you can look at for better search. Look at the Solr wiki; the mailing list would be a great source of learning. Rgds AJ

On 16-Jan-2013, at 15:10, David Parks davidpark...@yahoo.com wrote: I'm a beginner-intermediate solr admin; I've set up the basics for our application and it runs well. Now it's time for me to dig in and start tuning and improving queries. My next target is searches on simple terms such as doll which, in Google, would return documents about, well, toy dolls, because that's the most common usage of the simple term doll. But in my index it predominantly returns documents about CDs with the song Doll Face and My baby doll in them. I'm not directly asking how to solve this as much as I'm asking what direction I should be looking in to learn what I need to know to tackle the general issue myself. Left on my own, I would start looking at categorizing the CDs into a facet called music, reasonably doable in my dataset. Then I need to reduce the boost value of the entire facet/category of music unless certain pre-defined query terms exist, such as [music, cd, song, listen, dvd; analyze actual user queries to come up with a more exhaustive list, etc.]. I don't yet know how to do all of this, but after a couple more good books I should be dangerous. So the question to this list: am I on the right track here? If not, can you point me in a direction to go?
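For illustration, one hedged way to act on such analytics with the (e)dismax parser is a boost query; the category field and weights here are hypothetical, not from the thread:

    bq=(*:* -category:music)^1.5

This lifts every non-music document; the decision about when to apply it (for example, only when the query lacks music-related terms) would live in the application layer. The inverse, a positive boost such as bq=category:toys^2.0, works the same way.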
Re: Large data importing getting rollback with solr
ashimbose, It is possible that this is happening because Solr reaches a point where it is doing so many simultaneous merges that ongoing indexing is stopped until a huge merge finishes. This causes the JDBC driver to time out and disconnect, and there is no viable generic way to recover from that problem. I used to run into this with large MySQL imports. If this is what's happening, the following change/addition in the mergeScheduler section of indexConfig in solrconfig.xml will fix it:

    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxThreadCount">1</int>
      <int name="maxMergeCount">6</int>
    </mergeScheduler>

If that doesn't fix it, then I would look for a problem with either your JDBC driver or your DB server. Thanks, Shawn

On 1/17/2013 7:19 AM, Otis Gospodnetic wrote: Hi, It looks like this is the cause: JBC0016E: Remote call failed (return code=-2,220). SDK9019E: internal error SDK9019X: Interestingly, Google gives just 1 hit for the above as a query - your post. But it seems you should look up what the above codes mean first... Otis -- Solr & ElasticSearch Support http://sematext.com/ On Thu, Jan 17, 2013 at 2:43 AM, ashimbose ashimb...@gmail.com wrote: I am trying to index large data (not rich documents), about 5GB, but it's not getting indexed. In the case of small data it indexes perfectly. For the large data import, the XML response...
Re: group.ngroups behavior in response
But Amit is right: when you use group.main, the number of groups is not displayed, even if you set group.ngroups. I think in this case numFound should display the number of groups instead of the number of docs matching. Another option would be to keep numFound as the number of docs matching and add another attribute to the response that shows the number of groups.

On Thu, Jan 17, 2013 at 11:51 AM, denl0 david.vandendriess...@gmail.com wrote: There's a parameter to enable that. :D In SolrJ: solrQuery.setParam("group.ngroups", true); http://wiki.apache.org/solr/FieldCollapsing
Re: group.ngroups behavior in response
I'd think adding a new response attribute would be more flexible and powerful, thinking about clients, UIs, etc. Otis -- Solr & ElasticSearch Support http://sematext.com/

On Thu, Jan 17, 2013 at 10:15 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: But Amit is right: when you use group.main, the number of groups is not displayed, even if you set group.ngroups. I think in this case numFound should display the number of groups instead of the number of docs matching. Another option would be to keep numFound as the number of docs matching and add another attribute to the response that shows the number of groups.

On Thu, Jan 17, 2013 at 11:51 AM, denl0 david.vandendriess...@gmail.com wrote: There's a parameter to enable that. :D In SolrJ: solrQuery.setParam("group.ngroups", true); http://wiki.apache.org/solr/FieldCollapsing
Re: Solr commit taking too long
On 1/16/2013 11:22 PM, Cool Techi wrote: We have an index of approximately 400GB in size; indexing 5000 documents was taking 20 seconds. But lately, the indexing is taking very long; committing the same number of documents is taking 5-20 minutes. On checking the logs I can see that there are frequent merges happening, which I am guessing is the reason for this. How can this be improved? My configurations are given below:

    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>30</mergeFactor>
    <ramBufferSizeMB>64</ramBufferSizeMB>

What version of Solr? Version 4 will finish merges in the background even after indexing and commits are complete, although you do have to have a high enough maxMergeCount so that indexing stays in the foreground. I use a maxMergeCount of 6, which seems to work for all situations. Another thing that makes commits take an extremely long time is high autowarmCount values on Solr caches, especially filterCache. Thanks, Shawn
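For reference, a hedged sketch of the cache side of that advice in solrconfig.xml; the cache class and sizes are placeholders, the relevant knob is autowarmCount:

    <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>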
Solr multicore aborts with socket timeout exceptions
I'm currently running Solr 4.0 final on Tomcat v7.0.34, with ManifoldCF v1.2 dev running on Jetty. I have Solr multicore set up with 10 cores (is this too much?), so I also have at least 10 connectors set up in ManifoldCF (1 per core, 10 JVMs per connection). From the look of it, Solr couldn't handle all the data that ManifoldCF was sending it and the connection would abort with socket timeout exceptions. I tried increasing maxThreads to 200 on Tomcat and it didn't work. In the ManifoldCF throttling section, I decreased the number of JVMs per connection from 10 down to 1, and not only did the crawl speed up significantly, the socket exceptions went away (for the most part). Here's the ticket for this issue: https://issues.apache.org/jira/browse/CONNECTORS-608 My question is this: how do I increase the number of connections on the Solr side so I can run multiple ManifoldCF jobs concurrently without aborts or timeouts? The ManifoldCF team did mention that there was a committer who had socket timeout exceptions in a newer version of Solr and he fixed it by increasing the timeout window. I'm looking for that patch if available. Thanks,
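For reference, the knobs involved on the Tomcat side live on the HTTP connector in server.xml; a hedged sketch with illustrative values only (the right numbers depend on the crawl load):

    <Connector port="8080" protocol="HTTP/1.1"
               maxThreads="400"
               acceptCount="100"
               connectionTimeout="60000"/>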
RE: SOlr 3.5 and sharding
Hi Erick, It looks like we are saying the exact same thing but with different terms ;) I looked at the Solr glossary and you might be right... maybe I should talk about partitions instead of shards. Since my last message, I've configured the replication between the master and slave and everything is working fine, except for my original question about the number of documents not matching my expectations. I'll try to clarify a few things and come back to this question... Machine A (which I called the master node) is where the indexation takes place. It consists of four Solr instances that will (eventually) each contain 1/4 of the entire collection. It's just that, at this moment, since I have no control over which partition a given document is sent to, I made copies of the same index for all partitions. Each Solr instance has a replication handler configured. I will eventually get to the point of changing the indexation code to distribute documents evenly over all partitions, but the person who can give me access to this portion is not available right now, so I can do nothing about it. Machine B has the same four shards set up to be replicas of the corresponding shards on machine A. Machine B also contains another Solr instance with the default handler configured to use the four local partitions. This instance receives clients' requests, collects the results from each partition and then selects the best matches to form the final response. We intend to add new slaves that are exact copies of Machine B and load-balance clients' requests over all slaves. My original question was that if each partition has 1000 documents matching a certain keyword, and I know all partitions have the same content, then I was expecting to receive 4*1000 documents for the same keyword. But that is not the case. The replication is not an issue here, since the same request on the master node will give me the same result. Each shard, when called individually, will give 1000 documents. But when I call them using the shards=xxx parameter (see the example after this message), I am getting a little less than 4000 documents. I was just curious to know why this was happening... Is this a bug? Or something I am misunderstanding... Thanks for your time and contribution to Solr!

-----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: January-17-13 8:46 AM To: solr-user@lucene.apache.org Subject: Re: SOlr 3.5 and sharding You're still confusing shards (or at least mixing up the terminology) with simple replication. Shards are when you split up the index into several sub-indexes and configure the sub-indexes to know about each other. Say you have 1M docs in 2 shards. 500K of them would go on one shard and 500K on the other. But logically you have a single index of 1M docs. So the two shards have to know about each other, and when you send a request to one of them, it automatically queries the other (as well as itself), collects the responses and combines them, returning the top N to the requester. This is totally different from replication. In replication (master/slave), each node has all 1M documents. Each node can work totally in isolation. An incoming request is handled by the slave without contacting any other node. If you're copying around indexes AND configuring them as though they were shards, each request will be distributed to all shards and the results collated, giving you the same doc repeatedly in your result set. If you have no access to the indexing code, you really can't go to a sharded setup.
Polling is when the slaves periodically ask the master "has anything changed?" If so, then the slave pulls down the changes. The polling interval is configured in solrconfig.xml _on the slave_. So let's say you index docs to the master. For some interval, until the slaves poll the master and get an updated index, the number of searchable docs on the master will be different than for the slaves. Additionally, you may have the issue of the polling intervals for the slaves being offset from one another, so for some brief interval the counts on the slaves may be different as well. Best, Erick

On Tue, Jan 15, 2013 at 10:18 AM, Jean-Sebastien Vachon jean-sebastien.vac...@wantedanalytics.com wrote: Ok, I see what Erick meant now. Thanks. The original index I'm working on contains about 120k documents. Since I have no access to the code that pushes documents into the index, I made four copies of the same index. The master node contains no data at all; it simply uses the data available in its four shards. Knowing that I have 1000 documents matching the keyword java on each shard, I was expecting to receive 4000 documents out of my sharded setup. There are only a few documents that are not accounted for (the result count is about 3996, which is pretty close but not accurate). Right now the index is static, so there is no need for any replication and the polling interval has no effect. Later this week, I
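For reference, a distributed request of the kind described above looks like this; hosts and ports are placeholders, and each entry in shards points at one partition:

    http://machineB:8983/solr/select?q=java&shards=machineB:8985/solr,machineB:8986/solr,machineB:8987/solr,machineB:8988/solr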
Re: Solr 4 slower than Solr 3.x?
Hello, Here is another one from the other day: http://search-lucene.com/m/tqmNjXO51B/SolrCloud+Performance+for+High+Query+Volume Am I the only one seeing people reporting this? :) Otis -- Solr ElasticSearch Support http://sematext.com/ On Mon, Jan 14, 2013 at 10:55 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, I've seen this mentioned on the ML a few times now with the most recent one being: http://search-lucene.com/m/mbT4g1fQPr91/?subj=Solr+4+0+upgrade+reduced+performance Are there any known, good Solr 3.x vs. Solr 4.x benchmarks? Thanks, Otis -- Solr ElasticSearch Support http://sematext.com/
Function Query vs. Analyzing results
Hi, Is there any performance boost when using FunctionQuery over getting all the documents and analyzing their result fields? As far as I understand, a Function Query does exactly that: for each matched document it fetches the fields you're interested in, and then it calculates whatever score mechanism you need. Are there some special configurations that I can use to make FunctionQueries faster? Cheers, John
Re: Function Query vs. Analyzing results
Hello John, getting all the documents and analyzing their result fields is almost never feasible; Lucene stored fields are usually really slow. When a FunctionQuery is backed by field values, it uses the Lucene FieldCache, which is an array of field values; that's damn faster. You are welcome.

On Thu, Jan 17, 2013 at 8:20 PM, John fatmanc...@gmail.com wrote: Hi, Is there any performance boost when using FunctionQuery over getting all the documents and analyzing their result fields? As far as I understand, a Function Query does exactly that: for each matched document it fetches the fields you're interested in, and then it calculates whatever score mechanism you need. Are there some special configurations that I can use to make FunctionQueries faster? Cheers, John -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Function Query vs. Analyzing results
Hi Mikhail, Thanks for the info. If my FunctionQuery accesses stored fields like that:

    public float floatVal(int docNum) {
        Document doc = null;
        try {
            doc = reader.document(docNum);
        } catch (Exception e) {}
        return getSimilarityScore(doc);
    }

Is it still the same case? Is there a faster way to access document info? Cheers, John

On Thu, Jan 17, 2013 at 6:40 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello John, getting all the documents and analyzing their result fields is almost never feasible; Lucene stored fields are usually really slow. When a FunctionQuery is backed by field values, it uses the Lucene FieldCache, which is an array of field values; that's damn faster. You are welcome. On Thu, Jan 17, 2013 at 8:20 PM, John fatmanc...@gmail.com wrote: Hi, Is there any performance boost when using FunctionQuery over getting all the documents and analyzing their result fields? As far as I understand, a Function Query does exactly that: for each matched document it fetches the fields you're interested in, and then it calculates whatever score mechanism you need. Are there some special configurations that I can use to make FunctionQueries faster? Cheers, John -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
MultiValue
my json file looks like:

    [ { "last_name": "jain", "training_skill": ["c", "c++", "php,java,.net"] } ]

can u please suggest how i should declare the field in the schema for the training_skill field. please reply, urgent
searching for q terms that start with a dash/hyphen being interpreted as prohibited clauses
hello. environment: solr 3.5. problem statement: i have a requirement to search for part numbers that start with a dash/hyphen. example q= term: -0004A-0436. example query:

    http://some_url:some_port/some_core/select?facet=false&sort=score+desc%2C+rankNo+asc%2C+partCnt+desc&start=0&q=-0004A-0436+itemType%3A1&wt=xml&qt=itemModelNoProductTypeBrandSearch&rows=4

what is happening: the query is returning a huge result set. in reality there is one (1) and only one record in the database with this part number. i believe this is happening because the dash is being interpreted by the query parser as a prohibited clause, and the effective result is "give me everything that does NOT have this part number". how is this handled so that the search is conducted for the actual part: -0004A-0436. thx mark

more information: request handler in solrconfig.xml:

    <requestHandler name="itemModelNoProductTypeBrandSearch" class="solr.SearchHandler" default="false">
      <lst name="defaults">
        <str name="defType">edismax</str>
        <str name="echoParams">all</str>
        <int name="rows">10</int>
        <str name="qf">itemModelNoExactMatchStr^30 itemModelNo^.9 divProductTypeDesc^.8 plsBrandDesc^.5</str>
        <str name="q.alt">*:*</str>
        <str name="sort">score desc, rankNo desc, partCnt desc</str>
        <str name="facet">true</str>
        <str name="facet.field">itemModelDescFacet</str>
        <str name="facet.field">plsBrandDescFacet</str>
        <str name="facet.field">divProductTypeIdFacet</str>
      </lst>
      <lst name="appends"/>
      <lst name="invariants"/>
    </requestHandler>

field information from schema.xml (if helpful):

    <field name="itemModelNoExactMatchStr" type="text_general_trim" indexed="true" stored="true"/>
    <field name="itemModelNo" type="text_en_splitting" indexed="true" stored="true" omitNorms="true"/>
    <field name="divProductTypeDesc" type="text_general_edge_ngram" indexed="true" stored="true" multiValued="true"/>
    <field name="plsBrandDesc" type="text_general_edge_ngram" indexed="true" stored="true" multiValued="true"/>

    <fieldType name="text_general_trim" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\." replacement="" replace="all"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_SHC.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
Using Solr Spatial in conjunction with HBASE/Hadoop
Hello, I have point data (lat/lon) stored in HBase/Hadoop and would like to query the data spatially with polygons (if I pass in a few polygons, find me all the records that exist within those polygons; I need it to support polygons, not just box queries). Hadoop doesn't really have much support that I could find for these types of queries. I was wondering if I could leverage Solr spatial 4 and create spatial indexes on the HBase data that could be used to query this data? I need near real-time answers (within a couple of seconds). If anyone has any thoughts on this I would greatly appreciate them. Thank you
Re: MultiValue
you just need to make the field multivalued:

    <field name="last_name" type="string" indexed="true" stored="true"/>
    <field name="training_skill" type="string" indexed="true" stored="true" multiValued="true"/>

type should be set based on your search requirements.

On Thu, Jan 17, 2013 at 11:27 PM, anurag.jain anurag.k...@gmail.com wrote: my json file looks like: [ { "last_name": "jain", "training_skill": ["c", "c++", "php,java,.net"] } ] can u please suggest how i should declare the field in the schema for the training_skill field. please reply, urgent
Re: MultiValue
[ { "last_name": "jain", "training_skill": ["c", "c++", "php,java,.net"] } ]

actually i want to tokenize that into c, c++, php, java, .net, so that i can make them facets. but the problem is in the list: training_skill: ["c", "c++", "php,java,.net"] (the last element is a single string containing commas).
Re: MultiValue
You mean to say that the problem is with the JSON which is being ingested. What you are trying to achieve is to split the values on the basis of commas and index them as multiple values. What problem are you facing in indexing JSON in the format Solr expects? If you don't have control over it, you can probably try playing with custom processors.

On Fri, Jan 18, 2013 at 12:31 AM, anurag.jain anurag.k...@gmail.com wrote: [ { "last_name": "jain", "training_skill": ["c", "c++", "php,java,.net"] } ] actually i want to tokenize that into c, c++, php, java, .net, so that i can make them facets. but the problem is in the list: training_skill: ["c", "c++", "php,java,.net"]
Re: Function Query vs. Analyzing results
no-no-no. your implementation is as slow as result processing, due to using stored fields. The fast way is something like org.apache.solr.schema.IntField.getValueSource(SchemaField, QParser). It's worth checking how the standard functions are built; check the static {} block in org.apache.solr.search.ValueSourceParser. I just googled this tutorial and found it rather useful for you. Feel free to check: http://www.solrtutorial.com/custom-solr-functionquery.html

On Thu, Jan 17, 2013 at 8:53 PM, John fatmanc...@gmail.com wrote: Hi Mikhail, Thanks for the info. If my FunctionQuery accesses stored fields like that: public float floatVal(int docNum) { Document doc = null; try { doc = reader.document(docNum); } catch (Exception e) {} return getSimilarityScore(doc); } Is it still the same case? Is there a faster way to access document info? Cheers, John

On Thu, Jan 17, 2013 at 6:40 PM, Mikhail Khludnev mkhlud...@griddynamics.com wrote: Hello John, getting all the documents and analyzing their result fields is almost never feasible; Lucene stored fields are usually really slow. When a FunctionQuery is backed by field values, it uses the Lucene FieldCache, which is an array of field values; that's damn faster. You are welcome. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
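A hedged sketch of what that change looks like in John's snippet (Lucene 3.x-era API, as used by Solr 3.5): pull the values from the FieldCache once per reader instead of loading a stored document on every floatVal() call. "price" and getSimilarityScore() are placeholders standing in for the thread's example:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    class CachedValues {
        private final float[] prices;  // one entry per docNum, cached once

        CachedValues(IndexReader reader) throws IOException {
            prices = FieldCache.DEFAULT.getFloats(reader, "price");
        }

        public float floatVal(int docNum) {
            // an array lookup instead of a stored-field read
            return getSimilarityScore(prices[docNum]);
        }

        private float getSimilarityScore(float v) { return v; }  // placeholder
    }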
Re: MultiValue
actually

    [ { "last_name": "jain", "training_skill": ["c", "c++", "php,java,.net"] } ]

training_skill is a list, and if i want to store it in a string field type then it will include the [ and , as well. so how do i avoid that? or will it not? or do you have any other field type definition through which my work will be easy.
Re: how to get abortOnConfigurationError=false working
Snake, It was killed in 4.0/trunk more than two years ago: https://issues.apache.org/jira/browse/SOLR-1846 ("Setting abortOnConfigurationError=false has not worked for some time, and based on a poll of existing users, no one seems to need/want it.") You might be in that rare case where it didn't work even before.

On Thu, Jan 17, 2013 at 6:21 PM, snake r...@michaels.me.uk wrote: here is what it says in the SOLR info page:

    Solr Specification Version: 1.4.0.2009.11.18.10.19.05
    Solr Implementation Version: 1.4.1-dev exported - kvinu - 2009-11-18 10:19:05
    Lucene Specification Version: 2.9.1
    Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

On Thu, Jan 17, 2013 at 1:33 PM, Alexandre Rafalovitch [via Lucene] ml-node+s472066n4034156...@n3.nabble.com wrote: Which version of Solr is it for? I had a situation on Solr4 where I basically did not have a directory that solr.xml was pointing at for one of the cores. And Solr continued working, but the Admin interface was showing big red banners about a configuration problem. So maybe it was a bug that was fixed for Solr 4? Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Jan 17, 2013 at 8:03 AM, snake [hidden email] wrote: I will explain the scenario just to avoid all the potential replies asking why. We run ColdFusion servers (Windows) which have SOLR built in (running on Jetty). A customer creates a collection which is stored within their own webspace; they only have read/write access to their own webspace, so cannot put them anywhere else. The default value for abortOnConfigurationError is true. This causes endless problems when customers make changes to their websites or cancel their hosting: the collection gets deleted, and SOLR then crashes because it cannot find the config files for that collection. We then have to find out which collection is causing the problem, and manually remove its entry from solr.xml. Obviously this is a PITA. In the error output it says:

    If you want solr to continue after configuration errors, change:
    <abortOnConfigurationError>false</abortOnConfigurationError>
    in solr.xml

I have tried this, but it has no effect. I have also tried putting it in all the solrconfig.xml files. I tried this

    <abortOnConfigurationError>${solr.abortOnConfigurationError:false}</abortOnConfigurationError>

and this

    <abortOnConfigurationError>false</abortOnConfigurationError>

Neither had any effect. How do you get this to work?
-- Russ Michaels www.bluethunderinternet.com : Business hosting services & solutions www.cfmldeveloper.com : ColdFusion developer community www.michaels.me.uk : my blog www.cfsearch.com : ColdFusion search engine *skype me*: russmichaels

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: Field Collapsing - Anything in the works for multi-valued fields?
David, What are the documents and the field? That would help in suggesting a workaround.

On Thu, Jan 17, 2013 at 5:51 PM, David Parks davidpark...@yahoo.com wrote: I want to configure Field Collapsing, but my target field is multi-valued (e.g. the field I want to group on has a variable # of entries per document, 1-N entries). I read on the wiki (http://wiki.apache.org/solr/FieldCollapsing) that grouping doesn't support multi-valued fields yet. Anything in the works on that front by chance? Any common work-arounds?

-- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Re: MultiValue
On 18 January 2013 00:31, anurag.jain anurag.k...@gmail.com wrote: [ { "last_name": "jain", "training_skill": ["c", "c++", "php,java,.net"] } ] actually i want to tokenize in c c++ php java .net

What do you mean by "tokenize" in this case? It has been a while since I had occasion to use JSON input, and I also do not remember which Solr version introduced this, but with a JSON array mapped to a multi-valued Solr field, you should get one value per entry in the array. http://wiki.apache.org/solr/UpdateJSON#Update_Commands seems to be in agreement.

so through this i can make them as facet. but problem is in list training_skill: ["c", "c++", "php,java,.net"]

Faceting should be straightforward. Are you not seeing the behaviour described above? Could you describe the issues that you are facing in more detail? Regards, Gora
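(For reference, a multi-valued field fed through the JSON update handler looks something like the example below, assuming training_skill is declared multiValued="true" in schema.xml; each array entry becomes one value of the field, and hence one facet bucket:)

  curl 'http://localhost:8983/solr/update/json?commit=true' -H 'Content-type:application/json' -d '
  [ { "id": "1",
      "last_name": "jain",
      "training_skill": ["c", "c++", "php", "java", ".net"] } ]'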
Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?
: You're not only giving up the ability to monitor things, you're also giving up : the ability to detect errors. All exceptions that get thrown by the internals : of ConcurrentUpdateSolrServer are swallowed, your code will never know they : happened. The client log (slf4j with whatever binding config you chose) may : have such errors logged, but they are completely undetectable by the code.

This isn't the first time i've seen someone make this claim, but i really don't understand it -- ConcurrentUpdateSolrServer has a handleError() method that gets called when an error happens during the async processing. By default it just logs the exception; if you want to do something more interesting with it in your code, just subclass ConcurrentUpdateSolrServer and override that method -- that's the entire point of that method. The bigger issue is whether your client code could reasonably do anything if/when that method is called -- because it's all async, you probably can't do much more than log/report it in your own custom way instead of just using org.slf4j.Logger. -Hoss
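For example, a minimal subclass along these lines (a sketch against the 4.x SolrJ API; the class name and the error-tracking field are inventions here) lets the calling code check for failures after blockUntilFinished():

  import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
  import java.util.concurrent.atomic.AtomicReference;

  // Invented name: remembers the first async failure so callers can inspect it later.
  public class ErrorTrackingSolrServer extends ConcurrentUpdateSolrServer {
    private final AtomicReference<Throwable> firstError = new AtomicReference<Throwable>();

    public ErrorTrackingSolrServer(String url, int queueSize, int threadCount) {
      super(url, queueSize, threadCount);
    }

    @Override
    public void handleError(Throwable ex) {
      firstError.compareAndSet(null, ex); // keep only the first failure
      super.handleError(ex);              // preserve the default slf4j logging
    }

    // Call after blockUntilFinished(): null means no async error was seen.
    public Throwable getFirstError() {
      return firstError.get();
    }
  }

The caller would blockUntilFinished(), then consult getFirstError() before deciding whether to commit or to mark the batch for replay.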
Re: MultiValue
I think the problem here is that the list has 3 values, but the last one is actually a set of several values itself. Anurag seems to want them split into separate values whether they came as individual array items or as part of a joined list. So, we have a mix of multiValued submission and a desire to split values out. The correct solution, I suspect, would be to normalize everything to just be training_skill: ["c", "c++", "php", "java", ".net"] before this hits Solr. However, since he wants this for facets and as a training exercise, one could remember that facet values come from the tokens, not the stored value. So, it might be possible to do this:

<field name="test" type="comaSplit" indexed="true" stored="true" multiValued="true"/>
<fieldType name="comaSplit" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=", "/>
  </analyzer>
</fieldType>

I think the faceting code will probably just aggregate all tokens despite the fact that they are spread over multiple values. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Jan 17, 2013 at 2:33 PM, Gora Mohanty g...@mimirtech.com wrote: What do you mean by "tokenize" in this case? With a JSON array mapped to a multi-valued Solr field, you should get one value per entry in the array. [rest of the quoted message snipped -- see the previous message]
Re: MultiValue
@Alexandre Rafalovitch Thanks. Yeah, you got my point. training_skill: ["c", "c++", "php", "java", ".net"] -- but it is not possible for me to split "php,java,.net", because the data can vary and the data is very large. I mean I have to perform this on 5 line data. It might come as [c++, php,java,.net, c#,ruby, python java] or the like, so I have to handle whatever is in the list. I just want to ignore the [ and , characters.
Re: MultiValue
Try my suggested field definition and see if it helps with faceting. It should. Try it on a small example or a fake schema. But I would still recommend escalating the problem up the chain to an architect or similar, because I bet that data is stored in multiple places (e.g. in the database), and you will hit a real problem later when you try to match a particular data/configuration set back to the original sources. Otherwise, as suggested elsewhere in the thread, you can also look at update.chain and request processors (a sketch of one follows below), but you will have to write one yourself for this situation. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Jan 17, 2013 at 2:50 PM, anurag.jain anurag.k...@gmail.com wrote: @Alexandre Rafalovitch Thanks. Yeah, you got my point. training_skill: ["c", "c++", "php", "java", ".net"] -- but it is not possible for me to split "php,java,.net", because the data can vary and the data is very large. It might come as [c++, php,java,.net, c#,ruby, python java] or the like, so I have to handle whatever is in the list. I just want to ignore the [ and , characters.
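If you do go the update processor route, the shape of it would be something like this -- an untested sketch against the 4.x API with an invented class name; you would also need the matching UpdateRequestProcessorFactory and an update.chain entry in solrconfig.xml, which are left out here:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.solr.common.SolrInputField;
  import org.apache.solr.update.AddUpdateCommand;
  import org.apache.solr.update.processor.UpdateRequestProcessor;

  // Splits any comma-joined entries of training_skill into separate values.
  public class SplitSkillsProcessor extends UpdateRequestProcessor {
    public SplitSkillsProcessor(UpdateRequestProcessor next) {
      super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
      SolrInputDocument doc = cmd.getSolrInputDocument();
      SolrInputField field = doc.getField("training_skill");
      if (field != null) {
        List<String> split = new ArrayList<String>();
        for (Object value : field.getValues()) {
          for (String part : value.toString().split(",")) {
            split.add(part.trim()); // "php,java,.net" becomes three clean values
          }
        }
        doc.setField("training_skill", split);
      }
      super.processAdd(cmd); // hand off to the rest of the chain
    }
  }

The nice part of doing it in the chain is that both the indexed tokens and the stored values come out normalized, so facet values and displayed values agree.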
Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?
Hi Shawn, don't panic. For 'historical' reasons, like comparing the different subclasses of SolrServer, I have an HttpSolrServer for queries and commits. I've never tried to use the CUSS for anything else than adding documents. As I wrote, it was a home-made problem and not a bug. Sometimes I hope not to be the only dumbass, and that others may be caught in the same trap. Uwe

On 17.01.2013 15:52, Shawn Heisey wrote: If you are using the same ConcurrentUpdateSolrServer object for all update interaction with Solr (including commits) and you still have to do the blockUntilFinished() in your own code before you issue an explicit commit, that sounds like a bug, and you should put all the details in a Jira issue.
Re: Missing documents with ConcurrentUpdateSolrServer (vs. HttpSolrServer) ?
On 1/17/2013 12:38 PM, Chris Hostetter wrote: This isn't the first time i've seen someone make this claim, but i really don't understand it -- ConcurrentUpdateSolrServer has a handleError() method that gets called when an error happens during the async processing. By default it just logs the exception; if you want to do something more interesting with it in your code, just subclass ConcurrentUpdateSolrServer and override that method -- that's the entire point of that method. The bigger issue is whether your client code could reasonably do anything if/when that method is called -- because it's all async, you probably can't do much more than log/report it in your own custom way instead of just using org.slf4j.Logger.

I have my update process (using HttpSolrServer) encapsulated in a method that has several parts -- deletes, reinserts, a specific kind of partial reindex, and inserting new content. It ends with a commit(). Any exceptions that happen down inside this method are either rethrown or allowed to propagate. When the method is called, update position information is only updated if it returns without throwing an exception. For my use case, it is enough to know that an error happened; exactly where it happened is not critical unless the problem turns out to be in the data -- a scenario that has not happened so far. All failures so far have been due to the server or Solr being down. I understand that many people would want to know which update failed. I hope to come up with a way to make this possible with CUSS out of the box. Do you have an example of how to override handleError that would make error detection easy? IMHO, either that information should be easily accessible to someone who's looking at the javadoc for CUSS, or the class should provide an out-of-the-box way to detect errors. I will work on this problem, not just complain about the current state. Thanks, Shawn
Re: how to get abortOnConfigurationError=false working
Ok, so is there any other way to stop this problem I am having, where any site can break Solr by deleting their collection? Seems odd everyone would vote to remove a feature that would make Solr more stable.
Re: how to get abortOnConfigurationError=false working
Or a different design. You can mark collections for deletion, then delete them in an organized, safe manner later. wunder

On Jan 17, 2013, at 12:40 PM, snake wrote: Ok, so is there any other way to stop this problem I am having, where any site can break Solr by deleting their collection? Seems odd everyone would vote to remove a feature that would make Solr more stable.
Why do I keep seeing org.apache.solr.core.SolrCore execute in the tomcat logs
I keep seeing these in the tomcat logs: Jan 17, 2013 3:57:33 PM org.apache.solr.core.SolrCore execute INFO: [Lisa] webapp=/solr path=/admin/logging params={since=1358453312320&wt=json} status=0 QTime=0 I'm just curious: what is getting executed here? I'm not running any queries against this core or using it in any way currently.
Re: how to get abortOnConfigurationError=false working
I think you're not understanding the issue. Imagine www.acme.com has created a collection. This resides in d:\acme.com\wwwroot\collections. Then they decide to redo their website, or they get a new developer who decides not to use collections, or they simply move hosts, so they delete the old one. The collection is now gone. Solr now cannot find the config files for that collection since they are gone, so Solr crashes and breaks every other website on the entire server that is using Solr. The customers have no idea this will happen and no knowledge about having to get collections removed properly etc, so saying they should do this and that simply won't happen, so it is not a solution. I need a way to avoid the above scenarios. Is it possible?

On Jan 17, 2013 8:43 PM, Walter Underwood wrote: Or a different design. You can mark collections for deletion, then delete them in an organized, safe manner later. wunder On Jan 17, 2013, at 12:40 PM, snake wrote: Ok, so is there any other way to stop this problem I am having, where any site can break Solr by deleting their collection? Seems odd everyone would vote to remove a feature that would make Solr more stable.
Re: how to get abortOnConfigurationError=false working
On Thu, Jan 17, 2013 at 3:40 PM, snake r...@michaels.me.uk wrote: Ok, so is there any other way to stop this problem I am having, where any site can break Solr by deleting their collection? Seems odd everyone would vote to remove a feature that would make Solr more stable.

I agree. abortOnConfigurationError was more about a single core -- whether the core would still be loaded if there were config errors. There *should* be a way to still load other cores if one core has an error and is not loaded. If there's not currently, then we should implement it. -Yonik http://lucidworks.com
Questions about boosting
I've been trying to figure this out on my own, but I've come up empty so far. I need to boost documents from a certain provider. The idea is that if any documents in a result match a separate query (like provider:bigbucks), I need to multiply the score by X. It's important that the result set of the actual query is not changed, just the order. I've tried a few things from the relevancy page on the wiki but so far I can't seem to get anything to work. What syntax should I be using? Is it possible to do this at query time? Thanks, Shawn
Re: Why do I keep seeing org.apache.solr.core.SolrCore execute in the tomcat logs
You must have an Admin UI open and pointing at the Logging section. So, it sends a ping to see if any new log entries were added. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Jan 17, 2013 at 4:00 PM, eShard zim...@yahoo.com wrote: I keep seeing these in the tomcat logs: Jan 17, 2013 3:57:33 PM org.apache.solr.core.SolrCore execute INFO: [Lisa] webapp=/solr path=/admin/logging params={since=1358453312320&wt=json} status=0 QTime=0 I'm just curious: what is getting executed here? I'm not running any queries against this core or using it in any way currently.
Re: how to get abortOnConfigurationError=false working
My knowledge of Solr is pretty limited; I have only been investigating this in the last couple of days due to this issue. The way SOLR is implemented in ColdFusion is with a single core, so all sites run under the same core. I presume a core is like multiple instances?

On Thu, Jan 17, 2013 at 9:03 PM, Yonik Seeley wrote: I agree. abortOnConfigurationError was more about a single core -- whether the core would still be loaded if there were config errors. There *should* be a way to still load other cores if one core has an error and is not loaded. If there's not currently, then we should implement it. -Yonik http://lucidworks.com

-- Russ Michaels www.bluethunderinternet.com : Business hosting services & solutions www.cfmldeveloper.com : ColdFusion developer community www.michaels.me.uk : my blog www.cfsearch.com : ColdFusion search engine *skype me*: russmichaels
Re: how to get abortOnConfigurationError=false working
Solr 4 most definitely ignores missing cores (I just ran into that accidentally again myself). So, if you start Solr and a directory is missing, it will survive (but complain). The other problem is what happens when a customer deletes the account and the core directory disappears in the middle of an open searcher. I would suggest some sort of pre-delete trigger that hits the Solr admin interface and unloads that core first. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Thu, Jan 17, 2013 at 4:03 PM, Yonik Seeley yo...@lucidworks.com wrote: I agree. abortOnConfigurationError was more about a single core -- whether the core would still be loaded if there were config errors. There *should* be a way to still load other cores if one core has an error and is not loaded. If there's not currently, then we should implement it. -Yonik http://lucidworks.com
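(That pre-delete trigger can be a single CoreAdmin call made before the files are removed; host, port and core name below are placeholders:)

  http://localhost:8983/solr/admin/cores?action=UNLOAD&core=acmecollection

Once the core is unloaded, deleting its directory no longer affects the rest of the server.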
Re: how to get abortOnConfigurationError=false working
On 1/17/2013 2:01 PM, snake wrote: I think you're not understanding the issue. Imagine www.acme.com has created a collection. This resides in d:\acme.com\wwwroot\collections. Then they decide to redo their website, or they get a new developer who decides not to use collections, or they simply move hosts, so they delete the old one. The collection is now gone. Solr now cannot find the config files for that collection since they are gone, so Solr crashes and breaks every other website on the entire server that is using Solr. The customers have no idea this will happen and no knowledge about having to get collections removed properly etc, so saying they should do this and that simply won't happen, so it is not a solution.

Solr has no security measures. If you are giving customers direct access to one or more directories on your Solr server, there are a LOT of ways that they can cause you problems, intentionally or not. By adding a jar to their data directory and referencing it in their config, they can do just about anything. Custom Solr components could be written that do one or more of the following: - Tie up all of Solr's memory and cause it to crash. - Grant general access to the server as the user that runs Solr. - Utilize a security vulnerability and gain admin access. Changes need to be checked before implementation. If a customer wants to use custom components, that would require extra scrutiny. I can't think of any way to fully protect your server without requiring human intervention for all changes. Thanks, Shawn
Re: Using Solr Spatial in conjunction with HBASE/Hadoop
Hi, You certainly can do that, but you'll need to suck all data out of HBase and index it in Solr first. And then presumably you'll want to keep the 2 more or less in sync via incremental indexing. Maybe Lily project can help? If not, you'll have to write something that scans HBase and indexes, say via SolrJ. Otis -- Solr ElasticSearch Support http://sematext.com/ On Thu, Jan 17, 2013 at 1:26 PM, oakstream mike.oa...@oakstreamsystems.comwrote: Hello, I have point data (lat/lon) stored in hbase/hadoop and would like to query the data spatially with polygons. (If I pass in a few polygons find me all the records that exist within these polygons. I need it to support polygons not just box queries). Hadoop doesn't really have much support that I could find for these types of queries. I was wondering if I could leverage SOLR spatial 4 and create spatial indexes on the hbase data that could be used to query this data?? I need near real-time answers (within a couple seconds). If anyone has any thoughts on this I would greatly appreciate them. Thank you -- View this message in context: http://lucene.472066.n3.nabble.com/Using-Solr-Spatial-in-conjunction-with-HBASE-Hadoop-tp4034307.html Sent from the Solr - User mailing list archive at Nabble.com.
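A bare-bones version of the scan-and-index pass Otis describes might look like this (a sketch only: the table name, column family/qualifiers, core name and field names are all assumptions, and the SolrJ calls are 4.x):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class HBaseToSolrIndexer {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTable table = new HTable(conf, "points");   // table name assumed
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/geo");
      ResultScanner scanner = table.getScanner(new Scan());
      try {
        for (Result row : scanner) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", Bytes.toString(row.getRow()));
          // "d" family with "lat"/"lon" qualifiers is assumed; adjust to your schema
          String lat = Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("lat")));
          String lon = Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("lon")));
          doc.addField("location", lat + "," + lon); // "lat,lon" form for a Solr spatial field
          solr.add(doc);
        }
      } finally {
        scanner.close();
        table.close();
      }
      solr.commit();
    }
  }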
Re: SOlr 3.5 and sharding
Hmmm, maybe I'm finally getting it. Right, that does seem odd. I would expect you to get 4x the number of docs on any particular shard/replica in this situation. What happens when you look at the Solr logs for each partition? You should be able to glean the num results from the logs. I guess there are a couple of possibilities: 1) each machine actually returns N documents, but the aggregator does something weird and gives you 4X, indicating something's peculiar with the Solr aggregation; 2) you find that, for some reason, you aren't getting the same count _at the server level_, indicating your assertion that all the indexes are identical isn't valid. All of which means I'm pretty much out of ideas; it's hunt-and-seek time. Erick

On Thu, Jan 17, 2013 at 10:53 AM, Jean-Sebastien Vachon jean-sebastien.vac...@wantedanalytics.com wrote: Hi Erick, It looks like we are saying the exact same thing but with different terms ;) I looked at the Solr glossary and you might be right: maybe I should talk about partitions instead of shards. Since my last message, I've configured the replication between the master and slave, and everything is working fine except for my original question about the number of documents not matching my expectations. I'll try to clarify a few things and come back to this question... Machine A (which I called the master node) is where the indexation takes place. It consists of four Solr instances that will (eventually) contain 1/4 of the entire collection. It's just that, at this moment, since I have no control over which partition a given document is sent to, I made copies of the same index for all partitions. Each Solr instance has a replication handler configured. I will eventually get to the point of changing the indexation code to distribute documents evenly on all partitions, but the person who can give me access to this portion is not available right now, so I can do nothing about it. Machine B has the same four shards set up to be replicas of the corresponding shard on machine A. Machine B also contains another Solr instance with the default handler configured to use the four local partitions. This instance receives clients' requests, collects the results from each partition and then selects the best matches to form the final response. We intend to add new slaves, being exact copies of Machine B, and load balance clients' requests on all slaves. My original question was that if each partition has 1000 documents matching a certain keyword, and I know all partitions have the same content, then I was expecting to receive 4*1000 documents for the same keyword. But that is not the case. The replication is not an issue here, since the same request on the master node gives me the same result. Each shard, when called individually, will give 1000 documents. But when I call them using the shards=xxx parameter, I am getting a little less than 4000 documents. I was just curious to know why this is happening... Is this a bug? Or something I am misunderstanding... Thanks for your time and contribution to Solr!

-----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: January-17-13 8:46 AM To: solr-user@lucene.apache.org Subject: Re: SOlr 3.5 and sharding You're still confusing shards (or at least mixing up the terminology) with simple replication. Shards are when you split up the index into several sub-indexes and configure the sub-indexes to know about each other. Say you have 1M docs in 2 shards: 500K of them would go on one shard and 500K on the other.
But logically you have a single index of 1M docs, so the two shards have to know about each other, and when you send a request to one of them, it automatically queries the other (as well as itself), collects the responses and combines them, returning the top N to the requester. This is totally different from replication. In replication (master/slave), each node has all 1M documents. Each node can work totally in isolation; an incoming request is handled by the slave without contacting any other node. If you're copying around indexes AND configuring them as though they were shards, each request will be distributed to all shards and the results collated, giving you the same doc repeatedly in your result set. If you have no access to the indexing code, you really can't go to a sharded setup. Polling is when the slaves periodically ask the master "has anything changed?" If so, then the slave pulls down the changes. The polling interval is configured in solrconfig.xml _on the slave_. So let's say you index docs to the master. For some interval, until the slaves poll the master and get an updated index, the number of searchable docs on the master will be different than for the slaves. Additionally, you may have the issue of the polling intervals for the slaves being offset from one another, so for some brief
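(For reference, the kind of aggregator request being described looks like this in 3.5; hosts and core names are placeholders:)

  http://hostB:8983/solr/aggregator/select?q=foo&shards=hostB:8983/solr/part1,hostB:8983/solr/part2,hostB:8983/solr/part3,hostB:8983/solr/part4

Each partition also logs its own sub-request, so the per-partition hit counts can be compared against the combined count the aggregator returns.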
Re: Solr cache considerations
filterCache: this is bounded by (maxDoc/8) bytes * (num filters in cache). Notice the /8: this reflects the fact that the filters are represented by a bitset on the _internal_ Lucene ID. UniqueId has no bearing here whatsoever. This is, in a nutshell, why warming is required: the internal Lucene IDs may change. Note also that it's maxDoc; the internal arrays have holes for deleted documents. Note this is an _upper_ bound; if there are only a few docs that match, the size will be (num of matching docs) * sizeof(int).

fieldValueCache: I don't think so, although I'm a bit fuzzy on this. It depends on whether these are per-segment caches or not. Any per-segment cache is still valid.

Think of documentCache as intended to hold the stored fields while various components operate on them, thus avoiding repeatedly fetching the data from disk. It's _usually_ not too big a worry.

About hard commits once a day: that's _extremely_ long. Think instead of committing more frequently with openSearcher=false. If nothing else, your transaction log will grow lots and lots and lots. I'm thinking on the order of 15 minutes, or possibly even much less, with softCommits happening more often, maybe every 15 seconds. In fact, I'd start out with soft commits every 15 seconds and hard commits (openSearcher=false) every 5 minutes. The problem with hard commits being once a day is that, if for any reason the server is interrupted, on startup Solr will try to replay the entire transaction log to assure index integrity. Not to mention that your tlog will be huge. Not to mention that there is some memory usage for each document in the tlog. Hard commits roll over the tlog, flush the in-memory tlog pointers, close index segments, etc. Best Erick

On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, I am going to build a big Solr (4.0?) index, which holds some dozens of millions of documents. Each document has some dozens of fields, and one big textual field. The queries on the index are non-trivial and a little bit long (might be hundreds of terms). No query is identical to another. Now, I want to analyze the cache performance (before setting up the whole environment), in order to estimate how much RAM I will need. filterCache: in my scenario, every query has some filters. Let's say that each filter matches 1M documents out of 10M. Should the estimated memory usage be 1M * sizeof(uniqueId) * num-of-filters-in-cache? fieldValueCache: due to the difference between queries, I guess that fieldValueCache is the most important factor in query performance. Here comes a generic question: I'm indexing new documents to the index constantly. Soft commits will be performed every 10 mins. Does that mean the cache is meaningless after every 10 minutes? documentCache: enableLazyFieldLoading will be enabled, and fl contains a very small set of fields. BUT, I need to return highlighting on about (possibly) 20 fields. Does the highlighting component use the documentCache? I guess that highlighting requires the whole field to be loaded into the documentCache. Will it happen only for fields that matched a term from the query? And one more question: I'm planning to hard-commit once a day. Should I prepare for significant RAM usage growth between hard commits? (consider a lot of new documents in this period...) Does this RAM come from the same pool as the caches? Can an OutOfMemory exception happen in this scenario? Thanks a lot.
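(In solrconfig.xml terms, Erick's suggestion comes out roughly as the 4.x update handler settings below; tune the times to your needs:)

  <autoCommit>
    <maxTime>300000</maxTime>           <!-- hard commit every 5 minutes -->
    <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commit -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>15000</maxTime>            <!-- soft commit (visibility) every 15 seconds -->
  </autoSoftCommit>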
Re: searching for q terms that start with a dash/hyphen being interpreted as prohibited clauses
I think all you need to do is escape the hyphen, or have you tried that already? Best Erick

On Thu, Jan 17, 2013 at 1:38 PM, geeky2 gee...@hotmail.com wrote: hello environment: solr 3.5 problem statement: i have a requirement to search for part numbers that start with a dash / hyphen. example q term: -0004A-0436 example query: http://some_url:some_port/some_core/select?facet=false&sort=score+desc%2C+rankNo+asc%2C+partCnt+desc&start=0&q=-0004A-0436+itemType%3A1&wt=xml&qt=itemModelNoProductTypeBrandSearch&rows=4 what is happening: the query is returning a huge result set. in reality there is one (1) and only one record in the database with this part number. i believe this is happening because the dash is being interpreted by the query parser as a prohibited clause, and the effective result is "give me everything that does NOT have this part number". how is this handled so that the search is conducted for the actual part: -0004A-0436 thx mark

more information: request handler in solrconfig.xml:

<requestHandler name="itemModelNoProductTypeBrandSearch" class="solr.SearchHandler" default="false">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">all</str>
    <int name="rows">10</int>
    <str name="qf">itemModelNoExactMatchStr^30 itemModelNo^.9 divProductTypeDesc^.8 plsBrandDesc^.5</str>
    <str name="q.alt">*:*</str>
    <str name="sort">score desc, rankNo desc, partCnt desc</str>
    <str name="facet">true</str>
    <str name="facet.field">itemModelDescFacet</str>
    <str name="facet.field">plsBrandDescFacet</str>
    <str name="facet.field">divProductTypeIdFacet</str>
  </lst>
  <lst name="appends"/>
  <lst name="invariants"/>
</requestHandler>

field information from schema.xml (if helpful):

<field name="itemModelNoExactMatchStr" type="text_general_trim" indexed="true" stored="true"/>
<field name="itemModelNo" type="text_en_splitting" indexed="true" stored="true" omitNorms="true"/>
<field name="divProductTypeDesc" type="text_general_edge_ngram" indexed="true" stored="true" multiValued="true"/>
<field name="plsBrandDesc" type="text_general_edge_ngram" indexed="true" stored="true" multiValued="true"/>

<fieldType name="text_general_trim" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\." replacement="" replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_SHC.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
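(Concretely, either of these forms should stop the leading hyphen from being parsed as a prohibited clause -- shown decoded, with the URL-encoded equivalent:)

  q=\-0004A\-0436        (encoded: q=%5C-0004A%5C-0436)
  q="-0004A-0436"        (encoded: q=%22-0004A-0436%22)

Note the field's analysis chain still applies, so it is worth checking the term in the analysis page as well; the tokenizers above may strip or split on the hyphen at index time.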
Re: Solr cache considerations
I think fieldValueCache is not per segment, only fieldCache is. However, unless I'm missing something, this cache is only used for faceting on multivalued fields.

On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson erickerick...@gmail.com wrote: filterCache: this is bounded by (maxDoc/8) bytes * (num filters in cache). Notice the /8: this reflects the fact that the filters are represented by a bitset on the _internal_ Lucene ID. UniqueId has no bearing here whatsoever. This is, in a nutshell, why warming is required: the internal Lucene IDs may change. Note also that it's maxDoc; the internal arrays have holes for deleted documents. Note this is an _upper_ bound; if there are only a few docs that match, the size will be (num of matching docs) * sizeof(int). fieldValueCache: I don't think so, although I'm a bit fuzzy on this. It depends on whether these are per-segment caches or not. Any per-segment cache is still valid. Think of documentCache as intended to hold the stored fields while various components operate on them, thus avoiding repeatedly fetching the data from disk. It's _usually_ not too big a worry. About hard commits once a day: that's _extremely_ long. Think instead of committing more frequently with openSearcher=false. If nothing else, your transaction log will grow lots and lots and lots. I'm thinking on the order of 15 minutes, or possibly even much less, with softCommits happening more often, maybe every 15 seconds. In fact, I'd start out with soft commits every 15 seconds and hard commits (openSearcher=false) every 5 minutes. The problem with hard commits being once a day is that, if for any reason the server is interrupted, on startup Solr will try to replay the entire transaction log to assure index integrity. Not to mention that your tlog will be huge. Not to mention that there is some memory usage for each document in the tlog. Hard commits roll over the tlog, flush the in-memory tlog pointers, close index segments, etc. Best Erick [Isaac's original question, quoted in full in the earlier message, snipped]
Re: Using Solr Spatial in conjunction with HBASE/Hadoop
Thanks for your response! I appreciate it. There will be cases where I want to AND or OR the query between HBASE and Lucene. Would it make sense to custom code querying both repositories at the same time, or sequentially? Or are there any tools out there to do this? Basically I'm thinking that HBASE will keep the majority of my data columns, and Lucene will keep the index and a unique pointer to the HBASE record. Like: HBASE: UID = 12345, COL1, COL2, COL3, COL4, COL5, COL6 / LUCENE: ID = 999, UID = 12345, INDEX columns (LAT/LON). My query would be something like: where lat/lon in (Polygon) AND COL3 = 'ABC'. Would this kind of setup make sense? Is there a better way? I'll be working with terabytes of data. Thanks
RE: Field Collapsing - Anything in the works for multi-valued fields?
The documents are individual products which come from 1 or more vendors. Example: a 'toy spiderman doll' is sold by 2 vendors; that is 1 document. Most fields are multi-valued (short_description from each of the 2 vendors, long_description, product_name, vendor, etc. the same). I'd like to collapse on the vendor in an attempt to ensure that vast collections of books, music, and movies, by just a few vendors, don't overwhelm the results simply because they contain every search term imaginable due to the sheer volume of books, CDs, and DVDs, relative to other product items. But in this case there are clearly 1..N vendors per document, solidly a multi-valued field. And it's hard to put a maximum on the number of possible vendors. Thanks, Dave

-----Original Message----- From: Mikhail Khludnev [mailto:mkhlud...@griddynamics.com] Sent: Friday, January 18, 2013 2:32 AM To: solr-user Subject: Re: Field Collapsing - Anything in the works for multi-valued fields? David, What are the documents and the field? That would help in suggesting a workaround. On Thu, Jan 17, 2013 at 5:51 PM, David Parks davidpark...@yahoo.com wrote: I want to configure Field Collapsing, but my target field is multi-valued (e.g. the field I want to group on has a variable # of entries per document, 1-N entries). I read on the wiki (http://wiki.apache.org/solr/FieldCollapsing) that grouping doesn't support multi-valued fields yet. Anything in the works on that front by chance? Any common work-arounds? -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
Need 'stupid beginner' help with SolrCloud
I'm trying to get a 2-node SolrCloud install off the ground with the 4.1 branch. This is a new project for a different system than my existing Solr 3.5.0 setup. It will have one shard and two replicas. I have part of the example in /opt/mbsolr4 -- jetty, the war file, logs, etc. This is the CWD. I want all my config and data to live in /index/mbsolr4, so I am using -Dsolr.solr.home=/index/mbsolr4. This setup mirrors what I am doing for upgrading the other system from 3.5.0 to 4.1, which is not using SolrCloud. There is also a separate 3-node zookeeper ensemble, with two of those nodes living on the two Solr servers. What do I need in the solr home (/index/mbsolr4) before I start Solr? If I was not using SolrCloud, I would put solr.xml in there, pointing at directories relative to that location. I'm going to have multiple collections. Some of those collections will use the same config/schema, others will use slightly different versions. I have worked out the zkHost value that I will need: -DzkHost=mbzoo1:2181,mbzoo2:2181,mbzoo3:2181/mbsolr1 I have both Solr servers started and talking to zookeeper, but there are no collections so the UI doesn't work. Are the following options enough for me to get my first config collection into zookeeper/solrcloud -- assuming the config is right? Do I need numShards and the replica count at this phase? -Dbootstrap_confdir=/index/mbsolr4/bootstrapconf -Dcollection.configName=mbbasecfg Thanks, Shawn
Re: Field Collapsing - Anything in the works for multi-valued fields?
Hi, Instead of the multi-valued fields, would a parent-child setup work for you here? See http://search-lucene.com/?q=solr+join&fc_type=wiki Otis -- Solr & ElasticSearch Support http://sematext.com/

On Thu, Jan 17, 2013 at 8:04 PM, David Parks davidpark...@yahoo.com wrote: The documents are individual products which come from 1 or more vendors. Example: a 'toy spiderman doll' is sold by 2 vendors; that is 1 document. Most fields are multi-valued (short_description from each of the 2 vendors, long_description, product_name, vendor, etc. the same). I'd like to collapse on the vendor in an attempt to ensure that vast collections of books, music, and movies, by just a few vendors, don't overwhelm the results simply because they contain every search term imaginable due to the sheer volume of books, CDs, and DVDs, relative to other product items. But in this case there are clearly 1..N vendors per document, solidly a multi-valued field. And it's hard to put a maximum on the number of possible vendors. Thanks, Dave
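(If the index were restructured that way -- one document per product plus one document per vendor offer -- the query side could use the 4.0 join parser, along these lines; the from/to field names are invented:)

  q={!join from=product_id to=id}vendor:somevendor

That returns each product document once, no matter how many vendor-offer documents matched, which sidesteps the multi-valued grouping limitation.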
Re: Using Solr Spatial in conjunction with HBASE/Hadoop
You'd want to do your Solr spatial query, get IDs from the index, and then *after* that do a multi-get against your HBase table with the top N IDs from Solr's response, and thus get the data back to the caller. I don't know how fast multi-gets are, what the limitations are, etc. Maybe somebody else can address that. Alternatively, I suppose you could implement a custom collector that does gets as matching documents are being collected by Solr. I don't recall the class/interface you'd need to implement off the top of my head. Otis -- Solr & ElasticSearch Support http://sematext.com/

On Thu, Jan 17, 2013 at 8:01 PM, oakstream mike.oa...@oakstreamsystems.com wrote: Thanks for your response! I appreciate it. There will be cases where I want to AND or OR the query between HBASE and Lucene. Would it make sense to custom code querying both repositories at the same time, or sequentially? Or are there any tools out there to do this? Basically I'm thinking that HBASE will keep the majority of my data columns, and Lucene will keep the index and a unique pointer to the HBASE record. Like: HBASE: UID = 12345, COL1, COL2, COL3, COL4, COL5, COL6 / LUCENE: ID = 999, UID = 12345, INDEX columns (LAT/LON). My query would be something like: where lat/lon in (Polygon) AND COL3 = 'ABC'. Would this kind of setup make sense? Is there a better way? I'll be working with terabytes of data. Thanks
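Sketching that flow (spatial query first, then one HBase multi-get for the top N row keys; the table, core and field names are assumptions, and the polygon is elided):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrDocument;

  public class SpatialThenMultiGet {
    public static Result[] topNRows() throws Exception {
      HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/geo");
      SolrQuery q = new SolrQuery("location:\"IsWithin(POLYGON((...)))\""); // polygon elided
      q.setFields("uid"); // uid = pointer back to the HBase row key
      q.setRows(100);     // top N
      List<Get> gets = new ArrayList<Get>();
      for (SolrDocument d : solr.query(q).getResults()) {
        gets.add(new Get(Bytes.toBytes((String) d.getFieldValue("uid"))));
      }
      HTable table = new HTable(HBaseConfiguration.create(), "records"); // table name assumed
      try {
        return table.get(gets); // one multi-get for all matching row keys
      } finally {
        table.close();
      }
    }
  }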
Re: Need 'stupid beginner' help with SolrCloud
There are a couple of ways you can proceed. You can preconfigure some SolrCores in solr.xml. Even if you don't, you want a solr.xml, because that is where a lot of cloud properties are defined. Or you can use the collections API or the core admin API. I guess I'd recommend the collections API. You have a couple of options for getting your config in. I'd recommend using the ZkCli tool to upload each of your config sets: http://wiki.apache.org/solr/SolrCloud#Getting_your_Configuration_Files_into_ZooKeeper After that, use the collections API to create the necessary cores on each node. Another option is to set up solr.xml like you would locally, then start with -Dbootstrap_conf=true and it will duplicate your local config and collection setup in ZooKeeper. - Mark

On Jan 17, 2013, at 9:10 PM, Shawn Heisey s...@elyograg.org wrote: I'm trying to get a 2-node SolrCloud install off the ground with the 4.1 branch. [rest of the message, quoted verbatim above, snipped]
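(With the values from Shawn's message, the upload step would look roughly like this; zkcli.sh ships in the 4.x cloud-scripts directory:)

  zkcli.sh -zkhost mbzoo1:2181,mbzoo2:2181,mbzoo3:2181/mbsolr1 -cmd upconfig -confdir /index/mbsolr4/bootstrapconf -confname mbbasecfg

(followed by a collection create through the collections API, with the shard/replica counts from the one-shard, two-replica plan; parameter names per the 4.1-era wiki, and host/collection name are placeholders:)

  http://mbsolr1:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=2&collection.configName=mbbasecfg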
build CMIS compatible Solr
hi, I am new to Solr and I would like to use Solr as my document server, plus search engine. But Solr is not CMIS compatible (and it need not be, as it is not built as a pure document management server). Given that, I would build another layer on top of Solr so that the exposed interface is CMIS compatible. I did some investigation, and it looks like OpenCMIS is one of the choices. My next step would be to build this CMIS bridge layer, which can accept a CMIS request, translate it into a Solr-compatible request and send it to Solr, and finally translate the Solr response into a CMIS-compatible response. Is my logic right? And is there any other library besides OpenCMIS to do this job? cheers. Nick
Re: Questions about boosting
Start with Query Elevation and see if that helps: http://wiki.apache.org/solr/QueryElevationComponent Index-time document boost is a possibility. Maybe an ExternalFileField where every document could have a dynamic boost value that you add with a boost function. -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Thursday, January 17, 2013 4:11 PM To: solr-user@lucene.apache.org Subject: Questions about boosting I've been trying to figure this out on my own, but I've come up empty so far. I need to boost documents from a certain provider. The idea is that if any documents in a result match a separate query (like provider:bigbucks), I need to multiply the score by X. It's important that the result set of the actual query is not changed, just the order. I've tried a few things from the relevancy page on the wiki but so far I can't seem to get anything to work. What syntax should I be using? Is it possible to do this at query time? Thanks, Shawn
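(For the purely multiplicative, query-time variant, the edismax boost parameter takes a function query -- a sketch against the 4.x function set, with X=2 and the provider clause from the original question:)

  q=whatever+the+user+typed&defType=edismax&boost=if(exists(query({!v='provider:bigbucks'})),2,1)

Because boost multiplies the score rather than adding a clause, the matching set is unchanged and only the ordering moves.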
Re: build CMIS compatible Solr
On 18 January 2013 10:36, Nicholas Li nicholas...@yarris.com wrote: hi I am new to solr and I would like to use Solr as my document server, plus search engine. But solr is not CMIS compatible( While it shoud not be, as it is not build as a pure document management server). In that sense, I would build another layer beyond Solr so that the exposed interface would be CMIS compatible. [...] May I ask why? Solr is designed to be a search engine, which is a very different beast from a document repository. In the open-source world, Alfresco ( http://www.alfresco.com/ ) already exists, can index into Solr, and supports CMIS-based access. Regards, Gora
Re: searching for q terms that start with a dash/hyphen being interpreted as prohibited clauses
Or put the term in quotes. -- Jack Krupansky

-----Original Message----- From: Erick Erickson Sent: Thursday, January 17, 2013 6:59 PM To: solr-user@lucene.apache.org Subject: Re: searching for q terms that start with a dash/hyphen being interpreted as prohibited clauses I think all you need to do is escape the hyphen, or have you tried that already? Best Erick

On Thu, Jan 17, 2013 at 1:38 PM, geeky2 gee...@hotmail.com wrote: hello environment: solr 3.5 problem statement: i have a requirement to search for part numbers that start with a dash / hyphen. [rest of the original message, including the request handler and schema config, snipped -- see the earlier message in this thread]
Re: What is the difference in defining multiValued on field and or fieldtype?
Specifying an attribute on the field type makes it the default for any field of that type. Setting multiValued=true on 'ignored' simply allows it to be used for any field, whether it is single- or multi-valued, and any source data, whether it has one or multiple values for that ignored field. Otherwise, you would get an error if multiple values were given for an ignored field which had no multiValued attribute, while the stated goal is simply to ignore the field and its incoming values. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Thursday, January 17, 2013 6:20 PM To: solr-user@lucene.apache.org Subject: What is the difference in defining multiValued on field and or fieldtype? Hello, I was looking at the 'ignored' field in the example's schema.xml and suddenly noticed that its field type has multiValued=true in the definition. The Wiki confirms that it is possible, but does not explain. What's the difference between defining it on the type and on the field itself? Because the example has it defined on both. I am suddenly confused, because we now have a permutation of 9 different values (true/false/missing ^ 2) and I am not sure what the exact semantics are. I am mostly interested in the impact of fieldType/@multiValued=true, but curious about the other permutations. Thanks, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
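For illustration, the relevant entries in the stock example schema.xml look roughly like this (quoted from memory, so exact attributes may vary by version):

<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField"/>
<dynamicField name="*" type="ignored" multiValued="true"/>

With multiValued="true" on the type, every field of that type defaults to accepting multiple values, so repeating the attribute on the dynamicField is redundant but harmless.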
Re: Solr cache considerations
Unfortunately, it seems (http://lucene.472066.n3.nabble.com/Nrt-and-caching-td3993612.html) that these caches are not per-segment. In this case, I want to (soft) commit less frequently. Am I right? Tomás, as the fieldValueCache is very similar to Lucene's FieldCache, I guess it has a big contribution to standard (not only faceted) query time. The SolrWiki claims that it is primarily used by faceting. What does that say about complex textual queries? documentCache: Erick, after query processing is finished, don't some documents stay in the documentCache? Can't I use it to accelerate queries that retrieve stored fields of documents? In this case, a big documentCache can hold more documents. About commit frequency: HardCommit: openSearcher=false seems like a nice solution. Where can I read about this? (I found nothing but one unexplained sentence in the SolrWiki.) SoftCommit: In my case, the required index freshness is 10 minutes. The plan to soft commit every 10 minutes is similar to storing all of the documents in a queue (outside of Solr) and indexing a bulk every 10 minutes. Thanks. On Fri, Jan 18, 2013 at 2:15 AM, Tomás Fernández Löbbe tomasflo...@gmail.com wrote: I think fieldValueCache is not per segment, only fieldCache is. However, unless I'm missing something, this cache is only used for faceting on multivalued fields. On Thu, Jan 17, 2013 at 8:58 PM, Erick Erickson erickerick...@gmail.com wrote: filterCache: This is bounded by (maxDoc / 8) * (num filters in cache) bytes, not 1M * sizeof(uniqueId) * num-of-filters-in-cache. Notice the /8. This reflects the fact that the filters are represented by a bitset on the _internal_ Lucene ID. UniqueId has no bearing here whatsoever. This is, in a nutshell, why warming is required: the internal Lucene IDs may change. Note also that it's maxDoc; the internal arrays have holes for deleted documents. Note this is an _upper_ bound; if there are only a few docs that match, the size will be (num of matching docs) * sizeof(int). fieldValueCache: I don't think so, although I'm a bit fuzzy on this. It depends on whether these are per-segment caches or not. Any per-segment cache is still valid. Think of documentCache as intended to hold the stored fields while various components operate on them, thus avoiding repeatedly fetching the data from disk. It's _usually_ not too big a worry. About hard commits once a day: that's _extremely_ long. Think instead of committing more frequently with openSearcher=false. If nothing else, your transaction log will grow lots and lots and lots. I'm thinking on the order of 15 minutes, or possibly even much less. With soft commits happening more often, maybe every 15 seconds. In fact, I'd start out with soft commits every 15 seconds and hard commits (openSearcher=false) every 5 minutes. The problem with hard commits being once a day is that, if for any reason the server is interrupted, on startup Solr will try to replay the entire transaction log to assure index integrity. Not to mention that your tlog will be huge. Not to mention that there is some memory usage for each document in the tlog. Hard commits roll over the tlog, flush the in-memory tlog pointers, close index segments, etc. Best Erick On Thu, Jan 17, 2013 at 1:29 PM, Isaac Hebsh isaac.he...@gmail.com wrote: Hi, I am going to build a big Solr (4.0?) index, which holds some dozens of millions of documents. Each document has some dozens of fields, and one big textual field. The queries on the index are non-trivial, and a little bit long (might be hundreds of terms). No query is identical to another.
Now, I want to analyze the cache performance (before setting up the whole environment), in order to estimate how much RAM I will need. filterCache: In my scenario, every query has some filters. Let's say that each filter matches 1M documents, out of 10M. Should the estimated memory usage be 1M * sizeof(uniqueId) * num-of-filters-in-cache? fieldValueCache: Due to the difference between queries, I guess that fieldValueCache is the most important factor in query performance. Here comes a generic question: I'm indexing new documents to the index constantly. Soft commits will be performed every 10 mins. Does that mean the cache is meaningless after every 10 minutes? documentCache: enableLazyFieldLoading will be enabled, and fl contains a very small set of fields. BUT, I need to return highlighting on about (possibly) 20 fields. Does the highlighting component use the documentCache? I guess that highlighting requires the whole field to be loaded into the documentCache. Will it happen only for fields that matched a term from the query? And one more question: I'm planning to hard-commit once a day. Should I prepare for significant RAM usage growth between hard commits? (consider a lot of new documents in this period...) Does this RAM
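On the openSearcher=false question: it is configured under updateHandler in solrconfig.xml. A minimal sketch of the intervals Erick suggests (assuming the Solr 4.x autoCommit/autoSoftCommit syntax; tune the times to your own freshness needs):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>300000</maxTime>            <!-- hard commit every 5 minutes -->
    <openSearcher>false</openSearcher>   <!-- roll the tlog without opening a new searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>15000</maxTime>             <!-- soft commit every 15 seconds for visibility -->
  </autoSoftCommit>
</updateHandler>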
Re: What is the difference in defining multiValued on field and or fieldtype?
Thank you Jack, I just realized that perhaps 'ignored' was a bad example. But if I understood correctly, then I can specify multiValued on the type and not do so on the field itself and I still get multiValued entries. That's good to know. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: What is the difference in defining multiValued on field and or fieldtype?
Yes. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Friday, January 18, 2013 12:26 AM To: solr-user@lucene.apache.org Subject: Re: What is the difference in defining multiValued on field and or fieldtype?
Re: build CMIS compatible Solr
I want to make something like Alfresco, but with far fewer features, and I'd like to use the searching ability of Solr. On Fri, Jan 18, 2013 at 4:11 PM, Gora Mohanty g...@mimirtech.com wrote: On 18 January 2013 10:36, Nicholas Li nicholas...@yarris.com wrote: hi I am new to Solr and I would like to use Solr as my document server, plus search engine. But Solr is not CMIS compatible (though it need not be, as it is not built as a pure document management server). In that sense, I would build another layer on top of Solr so that the exposed interface would be CMIS compatible. [...] May I ask why? Solr is designed to be a search engine, which is a very different beast from a document repository. In the open-source world, Alfresco ( http://www.alfresco.com/ ) already exists, can index into Solr, and supports CMIS-based access. Regards, Gora
Re: Is required=true useless in dynamicField?
Solr will ignore required for dynamic fields. It will be parsed and preserved, but will not affect the check for required fields in an input document. Ditto for default value for a dynamic field. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Friday, January 18, 2013 12:08 AM To: solr-user@lucene.apache.org Subject: Is required=true useless in dynamicField? Hello, Given the definition:

<dynamicField name="addr_*" type="email" multiValued="true" indexed="true" stored="true" required="true"/>

Does it actually matter whether I specify required? I guess there is no way to have it enforced, right? Looking at the Wiki, dynamicField does not actually say what parameters it cares about, so it probably does not even read it from the definition. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Re: Questions about boosting
Have you tried boost query? bq=provider:fred wunder On Jan 17, 2013, at 9:08 PM, Jack Krupansky wrote: Start with Query Elevation and see if that helps: http://wiki.apache.org/solr/QueryElevationComponent Index-time document boost is a possibility. Maybe an ExternalFileField where every document could have a dynamic boost value that you add with a boost function. -- Jack Krupansky -Original Message- From: Shawn Heisey Sent: Thursday, January 17, 2013 4:11 PM To: solr-user@lucene.apache.org Subject: Questions about boosting I've been trying to figure this out on my own, but I've come up empty so far. I need to boost documents from a certain provider. The idea is that if any documents in a result match a separate query (like provider:bigbucks), I need to multiply the score by X. It's important that the result set of the actual query is not changed, just the order. I've tried a few things from the relevancy page on the wiki but so far I can't seem to get anything to work. What syntax should I be using? Is it possible to do this at query time? Thanks, Shawn
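A rough sketch of the ExternalFileField idea Jack mentions (the names here are hypothetical; keyField must be the uniqueKey, and the values come from a file named external_<fieldname> in the index data directory, reloaded when a new searcher opens):

<fieldType name="extBoost" class="solr.ExternalFileField" keyField="id" defVal="1" valType="pfloat"/>
<field name="providerBoost" type="extBoost"/>

The field is then usable from a function query, for example q={!boost b=providerBoost}your query, which multiplies each document's score without changing the result set, and the boost values can be updated without reindexing.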
Re: Suggestion that preserve original phrase case
Thanks again Erick. This time I got it working :). In fact your first response itself had a clear explanation; somehow I did not understand it completely! On Thu, Jan 17, 2013 at 6:59 PM, Erick Erickson erickerick...@gmail.com wrote: You could write a custom Filter (or perhaps Tokenizer), but I usually just do it on the input side before things get sent to Solr. I don't think PatternReplaceCharFilterFactory will help; you could easily turn the input into original:original, but then you'd need to write a custom filter that normalized the left-hand side but not the right-hand side. Best Erick On Tue, Jan 15, 2013 at 11:27 AM, Selvam s.selvams...@gmail.com wrote: Thanks Erick, can you tell me how to do the appending (lowercaseversion:LowerCaseVersion) before indexing? I tried pattern factory filters, but I could not get it right. On Sun, Jan 13, 2013 at 8:49 PM, Erick Erickson erickerick...@gmail.com wrote: One way I've seen this done is to index pairs like lowercaseversion:LowerCaseVersion. You can't push this whole thing through your field as defined, since it'll all be lowercased; you have to produce the left-hand side of the above yourself and just use KeywordTokenizer without LowercaseFilter. Then, your application displays the right-hand side of the returned token. Simple solution, not very elegant, but sometimes the easiest... Best Erick On Fri, Jan 11, 2013 at 1:30 AM, Selvam s.selvams...@gmail.com wrote: Hi, I have been trying to figure out a way to do case-insensitive suggestion which returns the original phrase as the result. I am using Solr 3.5. For example, if I index 'Hello world' and search for 'hello', it needs to return 'Hello world', not 'hello world'. My configurations are as follows.

New field type:

<fieldType class="solr.TextField" name="text_auto">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Field values:

<field name="label" type="text" indexed="true" stored="true" termVectors="true" omitNorms="true"/>
<field name="label_autocomplete" type="text_auto" indexed="true" stored="true" multiValued="false"/>
<copyField source="label" dest="label_autocomplete"/>

Spellcheck component:

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_auto</str>
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="buildOnOptimize">true</str>
    <str name="buildOnCommit">true</str>
    <str name="field">label_autocomplete</str>
  </lst>
</searchComponent>

Kindly share your suggestions to implement this behavior. -- Regards, Selvam KnackForge http://knackforge.com Acquia Service Partner No. 1, 12th Line, K.K. Road, Venkatapuram, Ambattur, Chennai, Tamil Nadu, India. PIN - 600 053.
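Since the question of how to do the appending came up, here is a minimal SolrJ sketch of Erick's pairing idea (field names from the schema above, an id field is assumed; the colon-pair convention is Erick's suggestion, not a Solr feature, and populating label_autocomplete directly like this replaces the copyField, which would otherwise add a second value to a single-valued field):

import java.util.Locale;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SuggestPairIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        String label = "Hello world";
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("label", label);
        // left-hand side lowercased for matching, right-hand side untouched for display
        doc.addField("label_autocomplete", label.toLowerCase(Locale.ROOT) + ":" + label);
        server.add(doc);
        server.commit();
    }
}

At display time, the application splits each returned suggestion on the first ':' and shows only the right-hand side.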
Re: group.ngroups behavior in response
A new response attribute would be better, but it also complicates the patch in that it would require a new way to serialize DocSlices, I think (especially when group.main=true). I was looking to set group.main=true so that my existing clients don't have to change to parse the grouped result set format. Secondly, while a new response attribute makes sense, the question is whether numFound should be numGroups or numTotal. To me it should be the number of groups, because logically that is what the result set shows, and the new attribute should report the total. Thanks Amit
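For readers without the thread's context, the parameter combination under discussion looks like this (the field name is hypothetical):

    q=*:*&group=true&group.field=provider&group.ngroups=true&group.main=true

With group.main=true the grouped results are flattened back into an ordinary result list, so the open question is whether its numFound should report the group count or the total match count.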
Re: Using Solr Spatial in conjunction with HBASE/Hadoop
Hi Oakstream, Coincidentally I've been thinking of porting the geohash prefix tree intersection algorithm in Lucene 4 spatial to Accumulo (another big-table system like HBase). There's a decent chance it'll happen this year, I think. That doesn't help your need right now, of course, so go with Otis's advice. ~ David Smiley oakstream wrote: Hello, I have point data (lat/lon) stored in HBase/Hadoop and would like to query the data spatially with polygons (if I pass in a few polygons, find me all the records that exist within those polygons; I need it to support polygons, not just box queries). Hadoop doesn't really have much support that I could find for these types of queries. I was wondering if I could leverage Solr 4 spatial and create spatial indexes on the HBase data that could be used to query this data? I need near-real-time answers (within a couple of seconds). If anyone has any thoughts on this I would greatly appreciate them. Thank you - Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book -- View this message in context: http://lucene.472066.n3.nabble.com/Using-Solr-Spatial-in-conjunction-with-HBASE-Hadoop-tp4034307p403.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Questions about boosting
I did try the bq parameter. Either I'm not using it correctly, or it's not making a noticeable difference. I was not able to find any good docs, either. Can you give me complete instructions on its use? Can I control the boost factor? Is the boost additive or multiplicative? For query elevation, don't you have to know in advance the query that a user will send? There's no way for me to know this - we want to be able to apply the boost to arbitrary queries. The source data comes from MySQL, and this is a seven-shard distributed index with 74075200 documents as of a few minutes ago. Although ExternalFileField probably wouldn't be impossible, it is rather impractical. Thanks, Shawn
Re: Questions about boosting
As I understand it, the bq parameter is a full Lucene query, but only used for ranking, not for selection. This is the complement of fq. You can use weighting: provider:fred^8 This will be affected by idf, so providers with fewer matches will have higher weight than those with more matches. This is a bother, but the idf-free approach requires Solr 4.0. wunder
Re: Questions about boosting
On 1/17/2013 11:41 PM, Walter Underwood wrote: This is a bother, but the idf-free approach requires Solr 4.0. I am doing my testing on Solr 4.1, so if you can give me the syntax for that, I would appreciate it. My production indexes are 3.5, but once we are confident with the 4.1 dev system, we'll upgrade. The provider field has omitTermFreqAndPositions=true defined, but the fields that typically get searched don't omit anything, so IDF probably still applies in the aggregate. On a related note, I have rather extreme length variation in my fields, so I see quite a lot of weird results due to very short metadata. Is there any way to lessen the impact of lengthNorm without eliminating it entirely? If not, is there any way to eliminate lengthNorm without also disabling index-time boosts? At this moment I am not doing index-time boosting, but business requirements may change that in the future. Thanks, Shawn
Re: Questions about boosting
On 1/17/2013 11:41 PM, Walter Underwood wrote: You can use weighting: provider:fred^8 I tried bq=ip:sc^1000 and it doesn't seem to be making any difference. Even if I add fq=ip:sc, I don't see any mention of bq, ip, sc, or 1000 in the debugQuery output. This is the case on both 3.5 and 4.1. In case it was caused by omitting termfreq and positions on the field I'm using in the bq, I tried a couple of other fields that don't omit anything and bq seems to be having no effect at all. Thanks, Shawn
Re: Large data importing getting rollback with solr
On 18 January 2013 12:49, ashimbose ashimb...@gmail.com wrote: Hi Otis, Thank you for your reply. But I am unable to get any search result related to the error code. It does not respond for more than 168 data sources; I have tested it. If you have any other solution please let me know. Not sure about the limit of 168 data sources in DIH, but I am curious as to why you need that many? Do you have that many different MySQL databases that you are indexing from? Regards, Gora
Re: Questions about boosting
Colleagues, FWIW, bq is a DisMax parser feature. Shawn, to approach the boosting syntax with the standard parser you need something like q=foo:bar ip:sc^1000. Specifying ^1000 in bq makes no sense ever. If you show your query params and debugQuery output, it would be much easier for us to help you. PS: omitting termfreqs and positions doesn't impact query-time boosting, ever. The closest caveat is that disabling norms indexing kills _index_-time boosting. -- Sincerely yours Mikhail Khludnev Principal Engineer, Grid Dynamics http://www.griddynamics.com mkhlud...@griddynamics.com
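To tie the thread together, a sketch of both boosting styles for this case (parameter values are illustrative; the multiplicative form assumes Solr 4.x edismax and function queries):

    q=some user query&defType=edismax&bq=ip:sc^1000
        (additive: matching documents get extra score, the result set is unchanged)
    q=some user query&defType=edismax&boost=if(termfreq(ip,'sc'),10,1)
        (multiplicative: scores of documents where ip contains 'sc' are multiplied by 10)

Both leave selection to q and only affect ranking, and since bq is honored only by the dismax/edismax parsers, defType matters here, which matches Mikhail's point.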
Re: Large data importing getting rollback with solr
Hi Gora, Thank you for your quick reply. I have only one data source, but more than 300 tables. Each table I have put in an individual entity in data-config.xml. But when I try a full import, it shows that many entries, as <str name="Total Requests made to DataSource">169</str>. This 169 means I took 169 tables from my data source, and each of the 169 tables has an individual entity in my data-config.xml file. I am not sure if I did something wrong. Please let me know. My sample data-config.xml is below:

<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
  <dataSource type="JdbcDataSource" name="sampleDB" driver="com.ibm.optim.connect.jdbc.NvDriver" url="jdbc:attconnect://192.168.1.29:2551/NAVIGATOR;DefTdpName=sampleDB" user="" password=""/>
  <document name="headwords">
    <entity name="CUSTOMER" dataSource="sampleDB" query="SELECT * FROM CUSTOMER" transformer="RegexTransformer">
      <field column="ID" name="ID"/>
      <field column="ADDRESS" name="ADDRESS"/>
      <field column="SIGNON_TYPE" name="SIGNON_TYPE"/>
      <field column="NAME" name="NAME"/>
    </entity>
    .
    .
    .
    .
  </document>
</dataConfig>

Thank you Regards, Ashim -- View this message in context: http://lucene.472066.n3.nabble.com/Large-data-importing-getting-rollback-with-solr-tp4034075p4034466.html Sent from the Solr - User mailing list archive at Nabble.com.