Re: Unsubscribe me
Please follow instructions here: http://lucene.apache.org/solr/resources.html F. On Jun 8, 2015, at 1:06 AM, Dylan dylan.h...@gmail.com wrote: On 30 May 2015 12:08, Lalit Kumar 4 lkum...@sapient.com wrote: Please unsubscribe me as well On May 30, 2015 15:23, Neha Jatav neha.ja...@gmail.com wrote: Unsubscribe me
Re: Unsubscribe me
Quoting Erik from two days ago: Please follow the instructions here: http://lucene.apache.org/solr/resources.html. Be sure to use the exact same e-mail you used to subscribe. On May 30, 2015, at 6:07 AM, Lalit Kumar 4 lkum...@sapient.com wrote: Please unsubscribe me as well On May 30, 2015 15:23, Neha Jatav neha.ja...@gmail.com wrote: Unsubscribe me
Re: YAJar
Run whatever tests you want with 14.0.1, replace it with 18.0, rerun the tests and compare.

François

On May 26, 2015, at 10:25 AM, Robust Links pey...@robustlinks.com wrote:

by dumping you mean recompiling solr with guava 18?

On Tue, May 26, 2015 at 10:22 AM, François Schiettecatte fschietteca...@gmail.com wrote:

Have you tried dumping guava 14.0.1 and using 18.0 with Solr? I did a while ago and it worked fine for me.

François

On May 26, 2015, at 10:11 AM, Robust Links pey...@robustlinks.com wrote:

I have minhash logic that uses a guava 18.0 method that is not in guava 14.0.1. The minhash logic is a separate maven project; I'm including it in my project via maven. The code is used as a search component on the set of results: it goes through the search results and deletes duplicates. Here is the solrconfig.xml:

<requestHandler name="/select" class="solr.SearchHandler" default="true">
  <arr name="last-components">
    <str>tvComponent</str>
    <str>terms</str>
    <str>minHashDedup</str>
  </arr>
</requestHandler>

<searchComponent name="minHashDedup" class="com.xyz.DedupSearchHits">
  <str name="MAX_COMPARISONS">5</str>
</searchComponent>

The DedupSearchHits class is the one implementing the minhash (hence using guava 18). I start solr via the solr.in.sh script. The error I am getting is:

Caused by: java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashUnencodedChars(Ljava/lang/CharSequence;)Lcom/google/common/hash/HashCode;
at com.xyz.incrementToken(MinHashTokenFilter.java:54)
at com.xyz.MinHash.calculate(MinHash.java:131)
at com.xyz.Algorithms.minhash.MinHasher.compare(MinHasher.java:89)
at com.xyz.Algorithms.minhash.DedupSearchHits.init(DedupSearchHits.java:74)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:619)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2311)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2305)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2338)
at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:1297)
at org.apache.solr.core.SolrCore.init(SolrCore.java:813)

What is the best design to solve this problem? I understand the point of modularity, but how can I include logic in solr that does result processing without loading that jar into solr? Thank you.

On Tue, May 26, 2015 at 8:00 AM, Daniel Collins danwcoll...@gmail.com wrote:

I guess this is one reason why the whole WAR approach is being removed! Solr should be a black box that you talk to and get responses from; what it depends on and how it is deployed should be irrelevant to you. If you want to override the version of guava that Solr uses, then you'd have to rebuild Solr (it can be done with maven) and manually update the pom.xml to use guava 18.0, but why would you? You would need to test Solr completely (in case any guava bugs affect Solr), deal with any build issues that arise (if guava changes any APIs), and cause yourself a world of pain, for what gain?

On 26 May 2015 at 11:29, Robust Links pey...@robustlinks.com wrote:

i have custom search components.

On Tue, May 26, 2015 at 4:34 AM, Upayavira u...@odoko.co.uk wrote:

Why is your app tied that closely to Solr? I can understand if you are talking about SolrJ, but in normal usage you run a different application in a different JVM from Solr.

Upayavira

On Tue, May 26, 2015, at 05:14 AM, Robust Links wrote:

I am stuck in Yet Another Jarmagedon of SOLR. This is a basic question. I noticed solr 5.0 is using guava 14.0.1; my app needs guava 18.0. What is the pattern to override a jar version uploaded into jetty? I am using maven, and solr is being started the old way:

java -jar start.jar -Dsolr.solr.home=... -Djetty.home=...

I tried to edit jetty's start.config (then run java -DSTART=/my/dir/start.config -jar start.jar) but got nowhere. Any help would be much appreciated.

Peyman
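A minimal sketch of the jar-swap test suggested in this thread, assuming a Solr 5.x layout where the bundled libraries live under server/solr-webapp/webapp/WEB-INF/lib; the exact path, port and jar locations vary by version and are illustrative:

# stop solr, swap the bundled guava jar for the newer one, restart, re-run your tests
bin/solr stop -p 8983
cd server/solr-webapp/webapp/WEB-INF/lib
mv guava-14.0.1.jar /tmp/guava-14.0.1.jar.bak    # keep a backup
cp /path/to/guava-18.0.jar .
cd - && bin/solr start -p 8983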
Re: YAJar
What I am suggesting is that you set up a standalone version of solr with 14.0.1 and run some sort of test suite similar to what you would normally use solr for in your app. Then replace the guava jar and re-run the tests. If all works well, and I suspect it will because it did for me, then you can use 18.0. Simple really.

François

On May 26, 2015, at 10:30 AM, Robust Links pey...@robustlinks.com wrote:

i can't run 14.0.1. that is the problem. 14 does not have the interfaces i need

On Tue, May 26, 2015 at 10:28 AM, François Schiettecatte fschietteca...@gmail.com wrote:

Run whatever tests you want with 14.0.1, replace it with 18.0, rerun the tests and compare.

François
Re: YAJar
Have you tried dumping guava 14.0.1 and using 18.0 with Solr? I did a while ago and it worked fine for me.

François

On May 26, 2015, at 10:11 AM, Robust Links pey...@robustlinks.com wrote:

i have a minhash logic that uses guava 18.0 method that is not in guava 14.0.1.
Re: how to debug solr performance degradation
Rebecca

You don't want to give all the memory to the JVM. You want to give it just enough for it to work optimally and leave the rest of the memory for the OS to use for caching data. Giving the JVM too much memory can result in worse performance because of GC. There is no magic formula for figuring out the memory allocation for the JVM; it is very dependent on the workload. In your case I would start with 5GB, and increment by 5GB with each run. I also use these settings for the JVM:

-XX:+UseG1GC -Xms1G -Xmx1G -XX:+AggressiveOpts -XX:+OptimizeStringConcat -XX:+ParallelRefProcEnabled -XX:MaxGCPauseMillis=200

I got them from this list so can't take credit for them, but they work for me.

Cheers

François

On Feb 24, 2015, at 7:45 PM, Tang, Rebecca rebecca.t...@ucsf.edu wrote:

We gave the machine 180G mem to see if it improves performance. However, after we increased the memory, Solr started using only 5% of the physical memory. It has always used 90-something%. What could be causing solr to grab so little of the physical memory instead of all of it?

Rebecca Tang
Applications Developer, UCSF CKM
Industry Documents Digital Libraries
E: rebecca.t...@ucsf.edu

On 2/24/15 12:44 PM, Shawn Heisey apa...@elyograg.org wrote:

On 2/24/2015 1:09 PM, Tang, Rebecca wrote:

Our solr index used to perform OK on our beta production box (anywhere between 0-3 seconds to complete any query), but today I noticed that the performance is very bad (queries take between 12-15 seconds). I haven't updated the solr index configuration (schema.xml/solrconfig.xml) lately. All that's changed is the data: every month, I rebuild the solr index from scratch and deploy it to the box. We will eventually go to incremental builds, but for now, all indexes are built from scratch. Here are the stats:

Solr index size: 183G
Documents in index: 14,364,201
We just have a single solr box; it has 100G memory, a 500G hard drive and 16 cpus.

The bottom line on this problem, and I'm sure it's not something you're going to want to hear: you don't have enough memory available to cache your index. I'd plan on at least 192GB of RAM for an index this size, and 256GB would be better. Depending on the exact index schema, the nature of your queries, and how large your Java heap for Solr is, 100GB of RAM could be enough for good performance on an index that size ... or it might be nowhere near enough. I would imagine that one of two things is true here, possibly both: 1) Your queries are very complex and involve accessing a very large percentage of the index data. 2) Your Java heap is enormous, leaving very little RAM for the OS to automatically cache the index. Adding more memory to the machine, if that's possible, might fix some of the problems. You can find a discussion of the problem here:

http://wiki.apache.org/solr/SolrPerformanceProblems

If you have any questions after reading that wiki article, feel free to ask them.

Thanks,
Shawn
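To see where the memory actually goes, it helps to compare the JVM heap with what the OS keeps in its page cache. A hedged example on Linux, following François's advice above; the heap size and start command are illustrative:

# fixed, modest heap; the rest of RAM is left to the OS for caching index files
java -Xms5g -Xmx5g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -jar start.jar

# the 'cached' column is the OS file-system cache holding the index data
free -g

# resident memory of the solr JVM itself, for comparison
ps -o pid,rss,args -C java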
Re: American British Dictionary for Solr
Dinesh See this: http://wordlist.aspell.net/varcon/ You will need to do some work to convert to a SOLR friendly format though. Cheers François On Feb 12, 2015, at 12:22 AM, dinesh naik dineshkumarn...@gmail.com wrote: Hi , We are looking for a dictionary to support American/British English synonym. Could you please let us know what all dictionaries are available ? -- Best Regards, Dinesh Naik
Re: Solr: How to delete a document
How about adding 'expungeDeletes=true' as well as 'commit=true'?

François

On Sep 13, 2014, at 4:09 PM, FiMka maximfil...@gmail.com wrote:

Hi guys, could you say how to delete a document in Solr? After I delete a document it still persists in the search results. For example, there is the following document saved in Solr: After I POST the following data to localhost:8983/solr/update/?commit=true: Solr each time says 200 OK and responds with the following: If I try to search localhost:8983/solr/lexikos/select?q=phrase%3A+%22qwerty%22&wt=json&indent=true for the document once again, it is still shown in the results. So how do I remove the document from the Solr index as well, or what else should I do? Thanks in advance for any assistance!
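A hedged example of the combination suggested above, using the XML update handler; the document id is illustrative:

curl 'http://localhost:8983/solr/update?commit=true&expungeDeletes=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<delete><id>SOME_DOC_ID</id></delete>'

Note also that the delete in the question goes to the default core (/solr/update) while the query goes to the lexikos core (/solr/lexikos/select); deleting against the wrong core could itself explain the stale results.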
Re: Date field related query
How about : datefield:[NOW-1DAY/DAY TO *] François On Sep 2, 2014, at 6:54 AM, Aman Tandon amantandon...@gmail.com wrote: Hi, I did it using this, fq=datefield:[2014-09-01T23:59:59Z TO 2014-09-02T23:59:59Z]. Correct me if i am wrong. Is there any way to find this using the NOW? With Regards Aman Tandon On Tue, Sep 2, 2014 at 4:08 PM, Aman Tandon amantandon...@gmail.com wrote: Hi, I am working on date and i want to find all those records which are indexed today. With Regards Aman Tandon
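A hedged curl version of the NOW-based form; the URL is illustrative. NOW/DAY rounds down to the start of today, NOW-1DAY/DAY to the start of yesterday:

# everything indexed since the start of today
curl 'http://localhost:8983/solr/select' \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq=datefield:[NOW/DAY TO *]'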
Re: Random OOM Exceptions
I would also get some metrics when SOLR is doing nothing; the JVM does do work in the background, and looking at the memory graph in VisualVM will show a nice sawtooth.

François

On Aug 14, 2014, at 1:16 PM, Erick Erickson erickerick...@gmail.com wrote:

bq: I just don't know why Solr is suddenly going nuts.

Hmmm, as Shawn says, hard to say at this remove. But I've personally doubled the memory requirements for Solr on the _same_ index by altering the query to a pathological one. Something like q=*:*&facet.field=whatever, where the field whatever contains a billion unique strings, is an example of a pathological query. So you may have to do the ugly work of correlating memory spikes with the queries just prior to the spike, which you should be able to do from the Solr logs. Sorry I can't be more help...

Erick

On Thu, Aug 14, 2014 at 9:45 AM, Shawn Heisey s...@elyograg.org wrote:

On 8/14/2014 10:06 AM, Scott Rankin wrote:

My question was actually more about what in Solr might cause the server to suddenly go from a very consistent heap size of 300-400 MB to over 2 GB in a matter of minutes with no changes in traffic. I get why the VM is crashing, I just don't know why Solr is suddenly going nuts.

That's nearly impossible to answer. Chances are that something has changed about the requests that Solr is receiving and now it's required to do something that it wasn't before, something that uses a lot of heap memory. The other likely possibilities are: * There's a bug in your solr version or in some software component that you are using with Solr. That can include the Java virtual machine, the servlet container, and/or any third-party Solr components. * You were running on the hairy edge of heap usage already, and something (a traffic increase, a slight change to your requests) pushed you over the edge into OutOfMemory.

Thanks,
Shawn
Re: Character encoding problems
Hi

If you are seeing "appelé au téléphone" in the browser, I would guess that the data is being rendered in UTF-8 by your server and the content type of the html is set to iso-8859-1, or is not being set and your browser is defaulting to iso-8859-1. You can force the encoding to utf-8 in the browser; usually this is a menu item (in Chrome/Safari/Firefox). FWIW, having messed around with this kind of stuff in the past, I always generate utf-8 and always set the HTML content type to utf-8 with:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Cheers

François

On Jul 29, 2014, at 3:59 PM, Gulliver Smith gulliver.m.sm...@gmail.com wrote:

Thanks for the information about URIEncoding="UTF-8" in the tomcat conf file, but that doesn't answer my main concerns: what is the character encoding of the text in the title_fr field, and is there any way to force it to be UTF-8?

On Tue, Jul 29, 2014 at 8:35 AM, aurelien.mazo...@francelabs.com wrote:

Hi, if you use solr 4.8.1, you don't have to add URIEncoding="UTF-8" in the tomcat conf file anymore: https://wiki.apache.org/solr/SolrTomcat

Regards,
Aurélien MAZOYER

On 29.07.2014 14:22, Gulliver Smith wrote:

I have solr 4.8.1 under Tomcat 7 on Debian Linux. The connector in Tomcat's server.xml has been changed to include character encoding UTF-8:

<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8" connectionTimeout="2" redirectPort="8443" />

I am posting to the server from PHP 5.5 curl. The extract POST was intercepted and confirmed that everything is being encoded in UTF-8. However, the responses to query commands, whether XML or JSON, are returning field values such as title_fr in something that looks like latin1 or iso-8859-1 when displayed in a browser or editor. E.g.:

title_fr:[ appelé au téléphone]

The highlights in the query response do have correctly displaying character codes. E.g.:

text_fr:[ \n \n \n \n \n \n \n \n \n \n \nappelé au téléphone\nappelé au téléphone\n

PHP's utf8_decode doesn't make sense of the title_fr. Is there something to configure to fix this and get proper UTF8 results for everything?

Thanks
Gulliver
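One hedged way to check which charset the container actually declares, before touching any PHP; the URL is illustrative:

# dump only the response headers and look at the Content-Type charset
curl -s -D - -o /dev/null 'http://localhost:8080/solr/collection1/select?q=*:*&wt=json'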
Re: Java heap space error
A default garbage collector will be chosen for you by the VM; it might help to get the stack trace to look at.

François

On Jul 24, 2014, at 10:06 AM, Ameya Aware ameya.aw...@gmail.com wrote:

ooh ok. So you want to say that since i am using a large heap but didn't set my garbage collection, that's why i am getting the java heap space error?

On Thu, Jul 24, 2014 at 9:58 AM, Marcello Lorenzi mlore...@sorint.it wrote:

I think that on a large heap it is suggested to monitor the garbage collection behavior and try to add a strategy adapted to your performance. On my production environment with a heap of 6 GB I set these parameters (server with 8 cores):

-server -Xms6144m -Xmx6144m -XX:MaxPermSize=512m -Dcom.sun.management.jmxremote -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+CMSParallelRemarkEnabled -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:ConcGCThreads=6 -XX:ParallelGCThreads=6

Marcello

On 07/24/2014 03:53 PM, Ameya Aware wrote:

I did not make any other change than this; the rest of the settings are default. Do i need to set a garbage collection strategy?

On Thu, Jul 24, 2014 at 9:49 AM, Marcello Lorenzi mlore...@sorint.it wrote:

Hi, did you set a garbage collection strategy on your JVM?

Marcello

On 07/24/2014 03:32 PM, Ameya Aware wrote:

Hi

I am in the process of indexing around 2,00,000 documents. I have increased the java heap space to 4 GB using the command below:

java -Xmx4096M -Xms4096M -jar start.jar

Still, after indexing around 15000 documents it gives a java heap space error again. Any fix for this?

Thanks,
Ameya
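To confirm whether the collector is actually the problem, GC logging is cheap; a sketch for the Java 7 era JVMs in this thread (these flags differ on newer JVMs):

java -Xms4096M -Xmx4096M \
     -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log \
     -jar start.jar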
Re: Garbage collection issue and RELOADing cores
Hi

Just following up on my previous post about a memory leak when RELOADing cores. I narrowed it down to the SuggestComponent, specifically '<searchComponent name="suggest" class="solr.SuggestComponent">...</searchComponent>' in solrconfig.xml. Comment that out and the leak goes away. The leak occurs in 4.7, 4.8 and 4.9. It occurs when a core is RELOADed, but not if it is UNLOADed and then LOADed. It occurs whether G1, CMS or ParallelGC is used for garbage collection. I used JDK 1.7.0_60 and Tomcat 7.0.54 for the underlying layers. Not sure where to take it from here?

Cheers

François

On Jun 16, 2014, at 4:50 PM, François Schiettecatte fschietteca...@gmail.com wrote:

Hi

I am running into an interesting garbage collection issue and am looking for suggestions/thoughts. Because some word lists such as synonyms, plurals and protected words need to be updated on a regular basis, I have to RELOAD a number of cores in order to 'pick up' the new lists. What I have found is that I get a memory leak when I do a RELOAD rather than an UNLOAD/CREATE with core admin. This is most pronounced with the G1 GC and much less so with the CMS GC. The former will cause the VM to run out of memory after 5/6 RELOADs, while the latter does so after 30/35 RELOADs. We are not talking about large indices here; the files footprint totals 470MB. I am using SOLR 4.8.1, Tomcat 7.0.53, jdk1.7.0_60, on Fedora Core 20. I am not using any fancy GC parameters, I cut everything back to basics, just:

-Xmx1G -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

and

-Xmx1G -XX:+UseG1GC

I was curious if anyone else had run into this issue and managed to fix it?

Thanks

François
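For anyone hitting the same leak, the UNLOAD/CREATE cycle that does not leak can be scripted against the core admin API; a sketch with illustrative host, core name and path:

curl 'http://localhost:8080/solr/admin/cores?action=UNLOAD&core=core1'
curl 'http://localhost:8080/solr/admin/cores?action=CREATE&name=core1&instanceDir=/path/to/core1'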
Garbage collection issue and RELOADing cores
Hi I am running into an interesting garbage collection issue and am looking for suggestions/thoughts. Because some word lists such as synonyms, plurals, protected words need to be updated on a regular basis I have to RELOAD a number of cores in order to 'pick up' the new lists. What I have found is that I get a memory leak when I do a RELOAD rather than an UNLOAD/CREATE with core admin. This is most pronounced with the G1 GC and much less so with the CMS GC. The former will cause the VM to run out of memory after 5/6 RELOADs, while the latter does so after 30/35 RELOADs. We are not talking about large indices here, the files footprint totals 470MB. I am using SOLR 4.8.1, Tomcat 7.0.53, jdk1.7.0_60, on Fedora Core 20. I am not using any fancy GC parameters, I cut everything back to basics, just: -Xmx1G -XX:+UseConcMarkSweepGC -XX:+UseParNewGC and -Xmx1G -XX:+UseG1GC I was curious if anyone else had run into this issue and managed to fix it? Thanks François
Re: Any way to view lucene files
Just click the 'Releases' link: https://github.com/DmitryKey/luke/releases François On Jun 9, 2014, at 10:43 AM, Aman Tandon amantandon...@gmail.com wrote: No, Anyways thanks Alex, but where is the luke jar? With Regards Aman Tandon On Mon, Jun 9, 2014 at 6:54 AM, Alexandre Rafalovitch arafa...@gmail.com wrote: Have you looked at: https://github.com/DmitryKey/luke Regards, Alex. Personal website: http://www.outerthoughts.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Mon, Jun 9, 2014 at 8:12 AM, Aman Tandon amantandon...@gmail.com wrote: I guess this is not available now. I am trying to download from the google, please take a look https://code.google.com/p/luke/downloads/list If you have any link please share With Regards Aman Tandon On Sat, Jun 7, 2014 at 10:32 PM, Summer Shire shiresum...@gmail.com wrote: Did u try luke 47 On Jun 6, 2014, at 11:59 PM, Aman Tandon amantandon...@gmail.com wrote: I also tried with solr 4.2 and with luke version Luke 4.0.0-ALPHA but got this error: java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene42' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath.The current classpath supports the following names: [Lucene40, Lucene3x, SimpleText, Appending] With Regards Aman Tandon On Sat, Jun 7, 2014 at 12:22 PM, Aman Tandon amantandon...@gmail.com wrote: My solr version is 4.8.1 and luke is 3.5 With Regards Aman Tandon On Sat, Jun 7, 2014 at 12:21 PM, Chris Collins ch...@geekychris.com wrote: What version of Solr / Lucene are you using? You have to match the Luke version to the same version of Lucene. C On Jun 6, 2014, at 11:42 PM, Aman Tandon amantandon...@gmail.com wrote: Yes tried, but it not working at all every time i choose my index directory it shows me EOF past With Regards Aman Tandon On Sat, Jun 7, 2014 at 12:01 PM, Chris Collins ch...@geekychris.com wrote: Have you tried: https://code.google.com/p/luke/ Best Chris On Jun 6, 2014, at 11:24 PM, Aman Tandon amantandon...@gmail.com wrote: Hi, Is there any way so that i can view what information and which is there in my _e.fnm, etc files. may be with the help of any application or any viewer tool. With Regards Aman Tandon
Re: OutOfMemoryError while merging large indexes
Have you tried using: -XX:-UseGCOverheadLimit

François

On Apr 8, 2014, at 6:06 PM, Haiying Wang haiyingwa...@yahoo.com wrote:

Hi,

We were trying to merge a large index (9GB, 21 million docs) into the current index (only 13MB), using the mergeindexes command of CoreAdminHandler, but always run into an OOM error. We currently set the max heap size to 4GB for the Solr server. We are using 4.6.0, and did not change the original solrconfig.xml. Is there any setting/configuration that could help complete the mergeindexes process without running into an OOM error? I can increase the max jvm heap size, but am afraid that may not scale in case a larger index needs to be merged in the future, and am hoping the index merge can be performed with a limited memory footprint. Please help. Thanks!

The jvm heap setting: -Xmx4096M -Xms512M

Command used:

curl "http://dev101:8983/solr/admin/cores?action=mergeindexes&core=collection1&indexDir=/solr/tmp/data/snapshot.20140407194442777"

OOM error stack trace:

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:133)
at java.lang.StringCoding.decode(StringCoding.java:179)
at java.lang.String.<init>(String.java:483)
at java.lang.String.<init>(String.java:539)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.readField(CompressingStoredFieldsReader.java:187)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:351)
at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:276)
at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:345)
at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:316)
at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:94)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2555)
at org.apache.solr.update.DirectUpdateHandler2.mergeIndexes(DirectUpdateHandler2.java:449)
at org.apache.solr.update.processor.RunUpdateProcessor.processMergeIndexes(RunUpdateProcessorFactory.java:88)
at org.apache.solr.update.processor.UpdateRequestProcessor.processMergeIndexes(UpdateRequestProcessor.java:59)
at org.apache.solr.update.processor.LogUpdateProcessor.processMergeIndexes(LogUpdateProcessorFactory.java:149)
at org.apache.solr.handler.admin.CoreAdminHandler.handleMergeAction(CoreAdminHandler.java:384)
at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:662)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)

Regards,
Haiying
Re: Reading Solr index
Maybe you should try a more recent release of Luke: https://github.com/DmitryKey/luke/releases

François

On Apr 7, 2014, at 12:27 PM, azhar2007 azhar2...@outlook.com wrote:

Hi All,

I have a solr index which was indexed in Solr 4.7.0. I've attempted to open the index with Luke 4.0.0 and also other versions with no luck; it gives me an error message. Is there a way of reading the data? I would like to convert the file to a readable format where I can see the terms it holds from the documents etc. Please help!
Re: The word no in a query
Have you looked at the debugging output? http://wiki.apache.org/solr/CommonQueryParameters#Debugging

François

On Apr 2, 2014, at 1:37 AM, Bob Laferriere spongeb...@icloud.com wrote:

I have built a commerce search engine. I am struggling with the word "no" in queries. We have products that are "No Smoking Sign." When the query is "Smoking AND Sign" the product is found. If I query as "No AND Sign" I get no results. I do not have "no" as a stop word. Any ideas why I would get zero results back?

Regards,
Bob
Re: AND not as a boolean operator in Phrase
Better to use '+A +B' rather than AND/OR, see: http://searchhub.org/2011/12/28/why-not-and-or-and-not/

François

On Mar 25, 2014, at 10:21 PM, Koji Sekiguchi k...@r.email.ne.jp wrote:

(2014/03/26 2:29), abhishek jain wrote:

hi friends, when i search for "A and B" it gives me results for A, B, and i am not sure why? Please guide me on how to get an exact match when it is within a phrase/quotes.

Generally speaking (w/ LuceneQParser), if you want phrase match results, use quotes, i.e. q="A B". If you want results which contain both terms A and B, do not use quotes but the boolean operator AND, i.e. q=A AND B.

koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
Re: Solr cores across multiple machines
Hi

Why not copy the core directory instead of the data directory? The conf directory is very small and that would ensure that you don't get schema mismatch issues. If you are stuck with copying the data directory, then I would replace the data directory in the target core and reload that core, though I would guess that YMMV given that this is probably not supported.

François

On Dec 17, 2013, at 1:35 AM, sivaprasad sivaprasa...@echidnainc.com wrote:

Hi,

In my project, we are doing a full index on a dedicated machine and the index is copied to the search serving machine. For this, we are copying the data folder from the indexing machine to the serving machine manually. Now we want to use Solr's SWAP configuration to do this job, but it looks like SWAP only works between cores. Based on our setup, does anyone have an idea how to move the data from the indexing machine to the serving machine? Are there any other alternatives?

Regards,
Siva
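A hedged sketch of the copy-the-core-directory approach; host names and paths are illustrative:

# copy the whole core (conf + data) from the indexing box, then reload it
rsync -av --delete /var/solr/cores/core1/ serving-host:/var/solr/cores/core1/
curl 'http://serving-host:8983/solr/admin/cores?action=RELOAD&core=core1'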
Re: Stop/Restart Solr
If you are on linux/unix, use the kill command. François On Oct 22, 2013, at 12:42 PM, Raheel Hasan raheelhasan@gmail.com wrote: Hi, is there a way to stop/restart java? I lost control over it via SSH and connection was closed. But the Solr (start.jar) is still running. thanks. -- Regards, Raheel Hasan
Re: Stop/Restart Solr
A few more specifics about the environment would help, Windows/Linux/...? Jetty/Tomcat/...? François On Oct 22, 2013, at 12:50 PM, Yago Riveiro yago.rive...@gmail.com wrote: If you are asking about if solr has a way to restart himself, I think that the answer is no. If you lost control of the remote machine someone will need to go and restart the machine ... You can try use a kvm or other remote control system -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Tuesday, October 22, 2013 at 5:46 PM, François Schiettecatte wrote: If you are on linux/unix, use the kill command. François On Oct 22, 2013, at 12:42 PM, Raheel Hasan raheelhasan@gmail.com (mailto:raheelhasan@gmail.com) wrote: Hi, is there a way to stop/restart java? I lost control over it via SSH and connection was closed. But the Solr (start.jar) is still running. thanks. -- Regards, Raheel Hasan
Re: Stop/Restart Solr
Yago has the right command to search for the process, that will get you the process ID specifically the first number on the output line, then do 'kill ###', if that fails 'kill -9 ###'. François On Oct 22, 2013, at 12:56 PM, Raheel Hasan raheelhasan@gmail.com wrote: its CentOS... and using jetty with solr here.. On Tue, Oct 22, 2013 at 9:54 PM, François Schiettecatte fschietteca...@gmail.com wrote: A few more specifics about the environment would help, Windows/Linux/...? Jetty/Tomcat/...? François On Oct 22, 2013, at 12:50 PM, Yago Riveiro yago.rive...@gmail.com wrote: If you are asking about if solr has a way to restart himself, I think that the answer is no. If you lost control of the remote machine someone will need to go and restart the machine ... You can try use a kvm or other remote control system -- Yago Riveiro Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Tuesday, October 22, 2013 at 5:46 PM, François Schiettecatte wrote: If you are on linux/unix, use the kill command. François On Oct 22, 2013, at 12:42 PM, Raheel Hasan raheelhasan@gmail.com(mailto: raheelhasan@gmail.com) wrote: Hi, is there a way to stop/restart java? I lost control over it via SSH and connection was closed. But the Solr (start.jar) is still running. thanks. -- Regards, Raheel Hasan -- Regards, Raheel Hasan
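Putting the steps together, a minimal sketch (the pid 12345 is illustrative):

ps aux | grep start.jar    # the process id is in the second column
kill 12345                 # polite shutdown first
kill -9 12345              # only if the process ignores the first kill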
Re: Solr timeout after reboot
To put the file data into the file system cache, which would make for faster access.

François

On Oct 21, 2013, at 8:33 AM, michael.boom my_sky...@yahoo.com wrote:

Hmm, no, I haven't... What would be the effect of this?

-
Thanks,
Michael
Re: Exact Match Results
Kumar

You might want to look into the 'pf' parameter: https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

François

On Oct 21, 2013, at 9:24 AM, kumar pavan2...@gmail.com wrote:

I am querying solr for exact match results, but it is showing some other results also.

Example:

User query string: Okkadu telugu movie

Results:
1. Okkadu telugu movie
2. Okkadunnadu telugu movie
3. YuganikiOkkadu telugu movie
4. Okkadu telugu movie stills

How can we order these results so that the 4th result comes second? Can anyone give me an idea?
Re: Solr timeout after reboot
Well no, the OS is smarter than that; it manages the file system cache along with other memory requirements. If applications need more memory then the file system cache will likely be reduced. The command is a cheap trick to get the OS to fill the file system cache as quickly as possible, though I am not sure how much it will help with a 100GB index on a 15GB machine. This might work if you 'cat' the index files other than the '.fdx' and '.fdt' files.

François

On Oct 21, 2013, at 10:03 AM, michael.boom my_sky...@yahoo.com wrote:

I'm using the m3.xlarge server with 15G RAM, but my index size is over 100G, so I guess running the above command would bite all available memory.

-
Thanks,
Michael
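A sketch of that selective warming, assuming the index lives in /var/solr/data/index; the .fdt/.fdx files hold the stored fields, which are often the bulk of a large index:

find /var/solr/data/index -type f ! -name '*.fdt' ! -name '*.fdx' \
  -exec cat {} + > /dev/null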
Re: Can I use app specific document id as the document id that Solr uses for internal purposes?
Hi

The approach I take is to store enough data in the SOLR index to render the results page, and go to the database if the user wants to view a document.

Cheers

François

On Oct 6, 2013, at 9:45 AM, user 01 user...@gmail.com wrote:

@Gora: you understood the schema correctly, and I don't think it is strange; I think it is actually the recommended way: you index your data in a search engine but don't store it there, and you store your actual data in a DB, which is the right place for it. Data in the SE should just be used for indexing. Isn't it?

@maephisto: ok, thanks!

On Sun, Oct 6, 2013 at 6:07 PM, Gora Mohanty g...@mimirtech.com wrote:

On 6 October 2013 16:36, Ertio Lew ertio...@gmail.com wrote:

I meant that solr should not be thinking that it has to retrieve anything further (as in any stored document data) after it gets the doc ID, so that a further lookup for doc data is prevented. [...]

If I understood your setup correctly, the doc ID is the only field in the Solr schema, and the only data stored in the Solr index. So there is no question of recovering any other data. Having said that, this is a strange setup and seems to defeat the whole purpose of a search engine. Maybe you could explain further as to what you are trying to achieve: what does storing only doc IDs in Solr gain you? You could as well get these from a database lookup, which it seems that you would be doing anyway.

Regards,
Gora
Re: setQuery in SolrJ
Shouldn't the search be more like this if you are searching in the 'descricaoRoteiro' field:

descricaoRoteiro:(BPS 8D BEACH*)

In your example you have a space between 'descricaoRoteiro:' and 'BPS':

descricaoRoteiro: BPS 8D BEACH*

François

On Sep 2, 2013, at 8:08 AM, Dmitry Kan solrexp...@gmail.com wrote:

Hi,

What's your default query field in solrconfig.xml?

<requestHandler name="/select" class="solr.SearchHandler">
  <str name="df">[WHAT IS IN HERE?]</str>

I think what's happening is that the query (descricaoRoteiro: BPS 8D BEACH*) gets interpreted as descricaoRoteiro:BPS plus (8D BEACH*), and then a default field name is applied to the (8D BEACH*) part. You can use the debugQuery parameter to see how the query was parsed.

HTH,
Dmitry

On Mon, Sep 2, 2013 at 2:53 PM, Sergio Stateri stat...@gmail.com wrote:

hi,

How can I look for an exact phrase with the query.setQuery method (SolrJ)? Like this:

SolrQuery query = new SolrQuery();
query.setQuery("(descricaoRoteiro: BPS 8D BEACH*)");
query.set("start", 200);
query.set("rows", 10);
query.addField("descricaoRoteiro");
QueryResponse rsp = server.query(query);

When I run this code, the following exception is thrown:

Exception in thread main org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: no field name specified in query and no default specified via 'df' param
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
at com.teste.SearchRoteirosFromCollection.extrairEApresentarResultados(SearchRoteirosFromCollection.java:65)

But if I search for one word, or put * between two words, the search works fine.

Thanks in advance,

--
Sergio Stateri Jr.
stat...@gmail.com
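A hedged curl equivalent that supplies the default field explicitly and shows the parse; the host is illustrative, the field name is from the thread. Note that a trailing * inside a quoted phrase is treated literally by the standard parser, not expanded as a wildcard:

curl 'http://localhost:8983/solr/select' \
  --data-urlencode 'q=descricaoRoteiro:(BPS 8D BEACH*)' \
  --data-urlencode 'df=descricaoRoteiro' \
  --data-urlencode 'debugQuery=true'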
Re: Mandatory words search in SOLR
Kamal

You could also use the 'mm' parameter to require a minimum match, or you could prepend '+' to each required term.

Cheers

François

On May 13, 2013, at 7:57 AM, Kamal Palei palei.ka...@gmail.com wrote:

Hi Rafał Kuć

I added q.op=AND as you suggested. Though the first documents returned contain both keywords (*java* and *mysql*), towards the end I see there are still a number of documents that have only one keyword, either *java* or *mysql*. Is this the SOLR behaviour, or can I ask for a strict search that fetches a record only if all my keywords are present?

BR,
Kamal

On Mon, May 13, 2013 at 4:02 PM, Rafał Kuć r@solr.pl wrote:

Hello!

Change the default query operator. For example, add q.op=AND to your query.

--
Regards,
Rafał Kuć
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

Hi SOLR Experts

When I search documents with the keywords *java, mysql* I get the documents containing either *java* or *mysql* or both. Is it possible to get only the documents that contain both *java* and *mysql*? In that case, how would the query look?

Thanks a lot
Kamal
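A sketch of the three options; the URLs are illustrative:

# default operator AND
curl 'http://localhost:8983/solr/select' --data-urlencode 'q=java mysql' --data-urlencode 'q.op=AND'

# every term marked as required
curl 'http://localhost:8983/solr/select' --data-urlencode 'q=+java +mysql'

# minimum-match with the edismax parser
curl 'http://localhost:8983/solr/select' --data-urlencode 'q=java mysql' \
  --data-urlencode 'defType=edismax' --data-urlencode 'mm=100%'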
Re: Indexing only on change
I would create a hash of the document content and store that in SOLR along with any document info you wish to store. When a document is presented for indexing, hash it and compare that to the hash of the stored document; index if they are different and skip if they are not.

François

On Nov 24, 2012, at 3:30 PM, Pratyul Kapoor praty...@gmail.com wrote:

Hi,

I just discovered that solr, while editing a particular field of a document, removes the entire document and recreates it. I have a list of 1000s of documents to be indexed, but I am aware that only some of those documents will have changed and the rest will already be there. Is there any way I can check whether the incoming and already existing document are the same, so there is no need to index it again?

Pratyul
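A minimal sketch of that check in shell; the contentHash field name, the CSV response handling and the re-posting step are all assumptions:

HASH=$(sha1sum doc.xml | cut -d' ' -f1)
# fetch the stored hash for this id; wt=csv returns a header line then the value
STORED=$(curl -s 'http://localhost:8983/solr/select?q=id:doc42&fl=contentHash&wt=csv' | tail -n 1)
if [ "$HASH" != "$STORED" ]; then
  java -jar post.jar doc.xml   # assumes doc.xml already carries its contentHash field
fi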
Re: Is leading wildcard search turned on by default in Solr 3.6.1?
John

You can still use leading wildcards even if you don't have the ReversedWildcardFilterFactory in your analysis, but it means you will be scanning the entire dictionary when the search is run, which can be a performance issue. If you do use ReversedWildcardFilterFactory you won't have that performance issue, but you will increase the overall size of your index. It's a tradeoff. When I looked into it for a site I built, I decided (after benchmarking) that the tradeoff was not worth it given how few leading wildcard searches it was getting.

Best regards

François

On Nov 12, 2012, at 5:33 PM, johnmu...@aol.com wrote:

Hi,

I'm migrating from Solr 1.2 to 3.6.1. I used the same analyzer as before, and re-indexed my data. I did not add solr.ReversedWildcardFilterFactory to my index analyzer, but yet leading wildcards are working!! Does this mean it's turned on by default? If so, how do I turn it off, and what are the implications of leaving it ON? Won't my searches be slower and consume more memory?

Thanks,
--MJ
Re: Is leading wildcard search turned on by default in Solr 3.6.1?
I suspect it is just part of the wildcard handling; maybe someone can chime in here. You may need to catch this before it gets to SOLR.

François

On Nov 12, 2012, at 5:44 PM, johnmu...@aol.com wrote:

Thanks for the quick response. So, I do not want to use ReversedWildcardFilterFactory, but leading wildcard is working and thus is ON by default. How do I disable it to prevent the use of it and the issues that come with it?

-- MJ
Re: MMapDirectory, demand paging, lazy evaluation, ramfs and the much maligned RAMDirectory (oh my!)
Aaron

The best way to make sure the index is cached by the OS is to just cat it on startup:

cat `find /path/to/solr/index` > /dev/null

Just make sure your index is smaller than RAM, otherwise data will be rotated out. Memory mapping is built on the virtual memory system, and I suspect that ramfs is too, so I doubt very much that copying your index to ramfs will help at all. Sidebar: a while ago I did a bunch of testing copying indices to shared memory (/dev/shm in this case) and there was no advantage compared to just accessing indices on disc when using memory mapping, once the system got to a steady state. There has been a lot written about this topic on the list. Basically it comes down to using MMapDirectory (which is the default), making sure your index is smaller than your RAM, and allocating just enough memory to the Java VM. That last part requires some benchmarking because it is so workload dependent.

Best regards

François

On Oct 24, 2012, at 8:29 PM, Aaron Daubman daub...@gmail.com wrote:

Greetings,

Most times I've seen the topic of storing one's index in memory, it seems the asker was referring (or was understood to be referring) to the (in)famous "not intended to work with huge indexes" Solr RAMDirectory. Let me be clear that I am not interested in RAMDirectory. However, I would like to better understand the oft-recommended and currently-default MMapDirectory, and what the tradeoffs would be, when using a 64-bit linux server dedicated to this single solr instance, with plenty (more than 2x index size) of RAM, of storing the index files on SSDs versus on a ramfs mount.

I understand that using the default MMapDirectory will allow caching of the index in-memory; however, my understanding is that mmapped files are demand-paged (lazily evaluated), meaning that only after a block is read from disk will it be paged into memory. Is this correct? Is it actually block-by-block (page size by page size)? Any pointers to decent documentation on this, regardless of the effectiveness of the approach, would be appreciated.

My concern with using MMapDirectory for an index stored on disk (even SSDs), if my understanding is correct, is that there is still a large startup cost to MMapDirectory, as it may take many queries before even most of a 20G index has been loaded into memory, and there may yet still be dark corners that only come up in edge-case queries that cause QTime spikes should these queries ever occur. I would like to ensure that, at startup, no query will incur disk-seek/read penalties.

Is the right way to achieve this to copy the index to a ramfs (NOT ramdisk) mount and then continue to use MMapDirectory in Solr to read the index? I am under the impression that when using ramfs (rather than ramdisk, for which this would not work) a file mmapped on a ramfs mount will actually share the same address space, and so would not incur the typical double-RAM overhead of mmapping a file in memory just to have yet another copy of the file created in a second memory location. Is this correct? If not, would you please point me to documentation stating otherwise (I haven't found much documentation either way).

Finally, given the desire to be quick at startup with a large index that will still easily fit within a system's memory, am I thinking about this wrong or are there other better approaches?

Thanks, as always,
Aaron
Re: The way to customize ranking?
I would create two indices, one with your content and one with your ads. This approach would allow you to precisely control how many ads you pull back and how you merge them into the results, and you would be able to control schemas, boosting, default fields, etc. for each index independently.

Best regards

François

On Aug 23, 2012, at 11:45 AM, Nicholas Ding nicholas...@gmail.com wrote:

Thank you, but I don't want to filter those ads. For example, when a user makes a search like q=Car, the result list is:

1. Ford Automobile (score 10)
2. Honda Civic (score 9)
...
99. Paid Ads (score 1; the ad has its own field to identify it as an ad)

What I want is a way to make the score of "Paid Ads" higher than "Ford Automobile". Basically, the result structure will look like:

[Paid Ads Section]
[Most valuable Ads 1]
[Most valuable Ads 2]
[Less valuable Ads 1]
[Less valuable Ads 2]
[Relevant Results Section]

On Thu, Aug 23, 2012 at 11:33 AM, Karthick Duraisamy Soundararaj karthick.soundara...@gmail.com wrote:

Hi

You might add an int field "Search Rule" that identifies the type of search, for example:

Search Rule | Description
0 | Unpaid Search
1 | Paid Search - Rule 1
2 | Paid Search - Rule 2

You can use filter queries (http://wiki.apache.org/solr/CommonQueryParameters) like fq=SearchRule:[1 TO *]. Alternatively, you can even use a boolean field to identify whether or not a search is paid, and then an additional field that identifies the type of paid search.

--
karthick

On Thu, Aug 23, 2012 at 11:16 AM, Nicholas Ding nicholas...@gmail.com wrote:

Hi

I'm working on Solr to build a local business search in China. We have a special requirement from advertisers: when a user makes a search, if the results contain paid advertisements, those ads need to be moved to the top of the results. For different ads, they have detailed rules about which comes first. Could anyone offer me some suggestions on how to customize the ranking based on this requirement?

Thanks
Nicholas
Re: recommended SSD
You should check this at pcper.com: http://pcper.com/ssd-decoder http://pcper.com/content/SSD-Decoder-popup Specs for a wide range of SSDs. Best regards François On Aug 23, 2012, at 5:35 PM, Peyman Faratin pey...@robustlinks.com wrote: Hi Is there a SSD brand and spec that the community recommends for an index of size 56G with mostly reads? We are evaluating this one http://www.newegg.com/Product/Product.aspx?Item=N82E16820227706 thank you Peyman
Re: Can't find solr.xml
On Jul 11, 2012, at 2:52 PM, Shawn Heisey wrote:

On 7/2/2012 2:33 AM, Nabeel Sulieman wrote:

Argh! (and hooray!) I started from scratch again, following the wiki instructions. I did only one thing differently: I put my data directory in /opt instead of /home/dev. And now it works! I'm glad it's working now. I just wish I knew exactly what the difference is. The directory in /opt has exactly the same permissions as the one in /home/dev (chown -R tomcat solr).

This could be selinux. I tend to disable it, as configuring it for proper operation with custom software can be tricky. If this is the problem, there will hopefully be a record of the denial in one of the files in /var/log. CentOS has selinux enabled by default. In case you don't know how to turn it off: in /etc/selinux/config, set SELINUX=disabled and reboot. There may be a way to disable it without rebooting, but I've found that to be the path of least resistance.

Thanks,
Shawn

You can temporarily disable selinux until the next reboot with this:

echo 0 > /selinux/enforce

Cheers

François
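A hedged set of commands for checking whether SELinux is the culprit on CentOS:

getenforce                               # Enforcing / Permissive / Disabled
grep -i denied /var/log/audit/audit.log  # the denial record Shawn mentions
setenforce 0                             # permissive until the next reboot
echo 0 > /selinux/enforce                # equivalent, as noted above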
Re: difference between stored=false and stored=true ?
Giovanni

stored=true means the data is stored in the index and can be returned with the search results (see the 'fl' parameter). This is independent of indexed, which means that you can store but not index a field: indexed="false" stored="true".

Best regards

François

On Jun 30, 2012, at 9:57 AM, Giovanni Gherdovich wrote:

Hi all,

when declaring a field in the schema.xml file you can set the attributes 'indexed' and 'stored' to true or false. What is the difference between indexed="true" stored="false" and indexed="true" stored="true"? I guess understanding this would require me to have a closer look at lucene's index data structures; what's the pointer to some doc I can read?

Cheers,
GGhh
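A small illustration of the difference; the field names are made up, and only stored fields come back in fl:

# 'body' can be searched (indexed=true) but will not appear in the results
# if it is stored=false; 'title' (stored=true) will be returned
curl 'http://localhost:8983/solr/select?q=body:lucene&fl=id,title&wt=json'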
Re: Indexation Speed?
Just a suggestion: you might want to monitor CPU usage and disk I/O, there might be a bottleneck.

Cheers

François

On Jun 19, 2012, at 7:07 AM, Bruno Mannina wrote:

Actually -Xmx512m and no effect. Concerning maxFieldLength, no problem, it's commented out.

On 19/06/2012 13:02, Erick Erickson wrote:

Then try -Xmx600M, next try -Xmx900M, etc. The idea is to bump things on separate runs. But be a little cautious here. Look in your solrconfig.xml file; you'll see a commented-out line:

<maxFieldLength>10000</maxFieldLength>

The default behavior for Solr/Lucene is to index the first 10,000 tokens (not characters; think of tokens as words for now) in each document and throw the rest on the floor. At the sizes you're talking about, that's probably not a problem, but do be aware of it.

Best
Erick

On Tue, Jun 19, 2012 at 5:44 AM, Bruno Mannina bmann...@free.fr wrote:

Like that? java -Xmx300m -jar post.jar myfile.xml

On 19/06/2012 11:11, Lance Norskog wrote:

Ah! Java memory size is a java command line option: http://javahowto.blogspot.com/2006/06/6-common-errors-in-setting-java-heap.html You would try increasing the memory size in stages up to maybe 300m.

On Tue, Jun 19, 2012 at 2:04 AM, Bruno Mannina bmann...@free.fr wrote:

On 19/06/2012 10:51, Lance Norskog wrote:

675 doc/s is respectable for that server. You might move the memory allocated to Java up and down; there is a balance between the amount of memory in Java v.s. the OS disk buffer.

How can I do that? Is there an option for my command line or in a config file? Sorry for this newbie question :(

And, of course, use the latest trunk. Solr 3.6

On Tue, Jun 19, 2012 at 12:10 AM, Bruno Mannina bmann...@free.fr wrote:

Correction: file size is 40 MB !!!

On 19/06/2012 09:09, Bruno Mannina wrote:

Dear All,

I would like to know if the indexation speed is right. I have a 40GB file size with around 27 000 docs inside, and I index around 20 fields. My (old) test server is a DualCore 3.06GHz Intel Xeon with only 1GB RAM. The file takes 40 seconds with the command line:

java -jar post.jar myfile.xml

Could I increase this speed or reduce this time?

Thanks a lot,
PS: Newbie user
Re: Indexation Speed?
Well that depends on the platform you are on, you did not mention that. If you are using linux, you could use atop ( http://www.atoptool.nl/ ), or top, or iostat or stat, or all four. Cheers François On Jun 19, 2012, at 8:55 AM, Bruno Mannina wrote: The CPU is not fully used, just 50-60% sometimes during the process, but how can I check HDD I/O? On 19/06/2012 14:13, François Schiettecatte wrote: Just a suggestion, you might want to monitor CPU usage and disk I/O, there might be a bottleneck. Cheers François On Jun 19, 2012, at 7:07 AM, Bruno Mannina wrote: Actually -Xmx512m and no effect Concerning maxFieldLength, no problem, it's commented out. On 19/06/2012 13:02, Erick Erickson wrote: Then try -Xmx600M, next try -Xmx900M, etc. The idea is to bump things on separate runs. But be a little cautious here. Look in your solrconfig.xml file, you'll see a commented-out line <maxFieldLength>10000</maxFieldLength> The default behavior for Solr/Lucene is to index the first 10,000 tokens (not characters, think of tokens as words for now) in each document and throw the rest on the floor. At the sizes you're talking about, that's probably not a problem, but do be aware of it. Best Erick On Tue, Jun 19, 2012 at 5:44 AM, Bruno Mannina bmann...@free.fr wrote: Like that? java -Xmx300m -jar post.jar myfile.xml On 19/06/2012 11:11, Lance Norskog wrote: Ah! Java memory size is a java command line option: http://javahowto.blogspot.com/2006/06/6-common-errors-in-setting-java-heap.html You would try increasing the memory size in stages up to maybe 300m. On Tue, Jun 19, 2012 at 2:04 AM, Bruno Mannina bmann...@free.fr wrote: On 19/06/2012 10:51, Lance Norskog wrote: 675 doc/s is respectable for that server. You might move the memory allocated to Java up and down - there is a balance between the amount of memory in Java vs. the OS disk buffer. How can I do that? Is there an option for my command line or in a config file? Sorry for this newbie question :( And, of course, use the latest trunk. Solr 3.6 On Tue, Jun 19, 2012 at 12:10 AM, Bruno Mannina bmann...@free.fr wrote: Correction: file size is 40 Mo !!! On 19/06/2012 09:09, Bruno Mannina wrote: Dear All, I would like to know if the indexation speed is right. I have a 40Go file size with around 27 000 docs inside. I index around 20 fields. My (old) test server is a DualCore 3.06GHz Intel Xeon with only 1Go RAM. The file takes 40 seconds with the command line: java -jar post.jar myfile.xml Could I increase this speed or reduce this time? Thanks a lot, PS: Newbie user
Re: Indexation Speed?
There is a lot of good information about that on the web, just google for 'ubuntu performance monitor'. Also the Ubuntu website has a pretty good help section: https://help.ubuntu.com/ and a community wiki: https://help.ubuntu.com/community Cheers François On Jun 19, 2012, at 9:03 AM, Bruno Mannina wrote: Linux Ubuntu :) since 2 months! So I'm new to this world :) On 19/06/2012 15:01, François Schiettecatte wrote: Well that depends on the platform you are on, you did not mention that. If you are using linux, you could use atop ( http://www.atoptool.nl/ ), or top, or iostat or stat, or all four. Cheers François On Jun 19, 2012, at 8:55 AM, Bruno Mannina wrote: The CPU is not fully used, just 50-60% sometimes during the process, but how can I check HDD I/O? On 19/06/2012 14:13, François Schiettecatte wrote: Just a suggestion, you might want to monitor CPU usage and disk I/O, there might be a bottleneck. Cheers François On Jun 19, 2012, at 7:07 AM, Bruno Mannina wrote: Actually -Xmx512m and no effect Concerning maxFieldLength, no problem, it's commented out. On 19/06/2012 13:02, Erick Erickson wrote: Then try -Xmx600M, next try -Xmx900M, etc. The idea is to bump things on separate runs. But be a little cautious here. Look in your solrconfig.xml file, you'll see a commented-out line <maxFieldLength>10000</maxFieldLength> The default behavior for Solr/Lucene is to index the first 10,000 tokens (not characters, think of tokens as words for now) in each document and throw the rest on the floor. At the sizes you're talking about, that's probably not a problem, but do be aware of it. Best Erick On Tue, Jun 19, 2012 at 5:44 AM, Bruno Mannina bmann...@free.fr wrote: Like that? java -Xmx300m -jar post.jar myfile.xml On 19/06/2012 11:11, Lance Norskog wrote: Ah! Java memory size is a java command line option: http://javahowto.blogspot.com/2006/06/6-common-errors-in-setting-java-heap.html You would try increasing the memory size in stages up to maybe 300m. On Tue, Jun 19, 2012 at 2:04 AM, Bruno Mannina bmann...@free.fr wrote: On 19/06/2012 10:51, Lance Norskog wrote: 675 doc/s is respectable for that server. You might move the memory allocated to Java up and down - there is a balance between the amount of memory in Java vs. the OS disk buffer. How can I do that? Is there an option for my command line or in a config file? Sorry for this newbie question :( And, of course, use the latest trunk. Solr 3.6 On Tue, Jun 19, 2012 at 12:10 AM, Bruno Mannina bmann...@free.fr wrote: Correction: file size is 40 Mo !!! On 19/06/2012 09:09, Bruno Mannina wrote: Dear All, I would like to know if the indexation speed is right. I have a 40Go file size with around 27 000 docs inside. I index around 20 fields. My (old) test server is a DualCore 3.06GHz Intel Xeon with only 1Go RAM. The file takes 40 seconds with the command line: java -jar post.jar myfile.xml Could I increase this speed or reduce this time? Thanks a lot, PS: Newbie user
Re: Solr out of memory exception
FWIW it looks like this feature has been enabled by default since JDK 6 Update 23: http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/ François On Mar 15, 2012, at 6:39 AM, Husain, Yavar wrote: Thanks a ton. From: Li Li [fancye...@gmail.com] Sent: Thursday, March 15, 2012 12:11 PM To: Husain, Yavar Cc: solr-user@lucene.apache.org Subject: Re: Solr out of memory exception it seems you are using a 64bit jvm (a 32bit jvm can only allocate about 1.5GB). you should enable pointer compression with -XX:+UseCompressedOops On Thu, Mar 15, 2012 at 1:58 PM, Husain, Yavar yhus...@firstam.com wrote: Thanks for helping me out. I have allocated Xms=2.0GB and Xmx=2.0GB. However I see Tomcat is still using much less memory and not 2.0G. Total memory on my Windows machine = 4GB. With a smaller index size it is working perfectly fine. I was thinking of increasing the system RAM and the Tomcat heap space allocated, but then how come on a different server with exactly the same system, Solr configuration and memory it is working fine? -Original Message- From: Li Li [fancye...@gmail.com] Sent: Thursday, March 15, 2012 11:11 AM To: solr-user@lucene.apache.org Subject: Re: Solr out of memory exception how much memory is allocated to the JVM? On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar yhus...@firstam.com wrote: Solr is giving an out of memory exception. Full indexing was completed fine. Later while searching, maybe when it tries to load the results in memory, it starts giving this exception. Though with the same memory allocated to Tomcat and exactly the same solr replica on another server it is working perfectly fine. I am working with 64-bit software, including Java and Tomcat, on Windows. Any help would be appreciated. Here are the logs: The server encountered an internal error (Severe errors in solr configuration. Check your log files for more detailed information on what may be wrong.
If you want solr to continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in null -
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068)
at org.apache.solr.core.SolrCore.init(SolrCore.java:579)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:115)
at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
at org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601)
at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943)
at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778)
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317)
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:840)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057)
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463)
at org.apache.catalina.core.StandardService.start(StandardService.java:525)
at org.apache.catalina.core.StandardServer.start(StandardServer.java:754)
at org.apache.catalina.startup.Catalina.start(Catalina.java:595)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:180)
at org.apache.lucene.index.TermInfosReader.init(TermInfosReader.java:91)
at
Re: Development inside or outside of Solr?
You could take a look at this: http://www.let.rug.nl/vannoord/TextCat/ Will probably require some work to integrate/implement though. François On Feb 20, 2012, at 3:37 AM, bing wrote: I have looked into the TikaCLI with the -language option, and learned that Tika can output only the language metadata. It cannot help me to solve my problem though, as my main concern is whether to change Solr or not. Thank you all the same.
Re: Solr logging
Ola Here is what I have for this:

##
# Log4J configuration for SOLR
# http://wiki.apache.org/solr/SolrLogging
#
# 1) Download LOG4J:
#    http://logging.apache.org/log4j/1.2/
#    http://logging.apache.org/log4j/1.2/download.html
#    http://www.apache.org/dyn/closer.cgi/logging/log4j/1.2.16/apache-log4j-1.2.16.tar.gz
#    http://newverhost.com/pub//logging/log4j/1.2.16/apache-log4j-1.2.16.tar.gz
#
# 2) Download SLF4J:
#    http://www.slf4j.org/
#    http://www.slf4j.org/download.html
#    http://www.slf4j.org/dist/slf4j-1.6.4.tar.gz
#
# 3) Unpack Solr:
#    jar xvf apache-solr-3.5.0.war
#
# 4) Delete:
#    WEB-INF/lib/log4j-over-slf4j-1.6.4.jar
#    WEB-INF/lib/slf4j-jdk14-1.6.4.jar
#
# 5) Copy:
#    apache-log4j-1.2.16/log4j-1.2.16.jar -> WEB-INF/lib
#    slf4j-1.6.4/slf4j-log4j12-1.6.4.jar  -> WEB-INF/lib
#    log4j.properties (this file)         -> WEB-INF/classes/ (needs to be created)
#
# 6) Pack Solr:
#    jar cvf apache-solr-3.5.0-omim.war admin favicon.ico index.jsp META-INF WEB-INF
#
# Author: Francois Schiettecatte
# Version: 1.0
##

##
# Logging levels (helpful reminder)
# DEBUG INFO WARN ERROR FATAL
##

# Logging setup
log4j.rootLogger=WARN, SOLR

# Daily Rolling File Appender (SOLR)
log4j.appender.SOLR=org.apache.log4j.DailyRollingFileAppender
log4j.appender.SOLR.File=${catalina.base}/logs/solr.log
log4j.appender.SOLR.Append=true
log4j.appender.SOLR.Encoding=UTF-8
log4j.appender.SOLR.DatePattern='-'yyyy-MM-dd
log4j.appender.SOLR.layout=org.apache.log4j.PatternLayout
log4j.appender.SOLR.layout.ConversionPattern=%d [%t] %-5p %c - %m%n

##
# Logging levels for SOLR
# Default logging level
log4j.logger.org.apache.solr=WARN
##

On Feb 20, 2012, at 5:15 AM, ola nowak wrote: Yep. I suppose it is. But I have several applications installed on glassfish and I want each one of them to write into a separate file. And your solution with this JVM option was redirecting all messages from all apps to one file. Does anyone know how to accomplish that? On Mon, Feb 20, 2012 at 11:09 AM, darul daru...@gmail.com wrote: Hmm, I did not try to achieve this but am interested if you find a way... Also, I believe that having the log4j config file outside the war archive is a better solution, as you may need to update its content for example.
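As a sanity check once the war has been repacked, any class that logs through slf4j - which is how Solr itself logs - should end up in the solr.log file defined above. A minimal sketch (the class name is hypothetical):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class LoggingSmokeTest {
        // With slf4j-log4j12 in WEB-INF/lib, slf4j calls are routed to the
        // appenders configured in log4j.properties.
        private static final Logger log = LoggerFactory.getLogger(LoggingSmokeTest.class);

        public static void main(String[] args) {
            log.warn("this should appear in {}/logs/solr.log", System.getProperty("catalina.base"));
        }
    }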
Re: Help:Solr can't put all pdf files into index
Have you tried checking any logs? Have you tried identifying a file which did not make it in and submitting just that one and seeing what happens? François On Feb 9, 2012, at 10:37 AM, Rong Kang wrote: Yes, I put all the files in one directory and I have tested the file names using code. At 2012-02-09 20:45:49, Jan Høydahl jan@cominvent.com wrote: Hi, Are you 100% sure that the filename is globally unique, since you use it as the uniqueKey? -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 9. feb. 2012, at 08:30, 荣康 wrote: Hey, I am using solr as my search engine to search my pdf files. I have 18219 files (all with different file names) and all the files are in one directory. But when I use solr to import the files into the index using the Dataimport method, solr reports that it imported only 17233 files. It's very strange. This problem has stopped our project for a few days. I can't handle it. Please help me! Schema.xml fields:

<fields>
  <field name="text" type="text" indexed="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="filename" type="filenametext" indexed="true" required="true" termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="id" type="string" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="filename" dest="text"/>

and dataConfig:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" recursive="true" rootEntity="false" dataSource="null" baseDir="H:/pdf/cls_1_16800_OCRed/1" fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)" onError="skip">
      <entity name="tika-test" processor="TikaEntityProcessor" url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
        <field column="text" name="text"/>
      </entity>
      <field column="file" name="id"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>

sincerely, Rong Kang
Re: Using UUID for uniqueId
Anderson I would say that this is highly unlikely, but you would need to pay attention to how they are generated; this would be a good place to start: http://en.wikipedia.org/wiki/Universally_unique_identifier Cheers François On Feb 8, 2012, at 1:31 PM, Anderson vasconcelos wrote: Hi all, if I use a UUID as the uniqueId and in the future I break my index into shards, will I have problems? Could the UUID generation produce the same UUID on different machines? Thanks
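For what it is worth, java.util.UUID generates version 4 (random) UUIDs from a cryptographically strong random source, so two machines producing the same value is vanishingly unlikely; a minimal sketch:

    import java.util.UUID;

    public class UuidKeyDemo {
        public static void main(String[] args) {
            // Version 4 UUIDs carry 122 random bits; a collision across
            // machines is negligible for any realistic index size.
            String docId = UUID.randomUUID().toString();
            System.out.println("uniqueKey = " + docId);
        }
    }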
Re: Question on Reverse Indexing
Using ReversedWildcardFilterFactory will double the size of your dictionary (more or less), maybe the drop in performance that you are seeing is a result of that? François On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote: Hi, For reverse indexing we are using the ReversedWildcardFilterFactory on Solr 4.0: <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/> ReversedWildcardFilterFactory was helping us to perform leading wildcard searches like *lock. But it was observed that the performance of the searches was not good after introducing the ReversedWildcardFilterFactory filter. Hence we disabled the ReversedWildcardFilterFactory filter and re-created the indexes, and this time we found the performance of the Solr query to be faster. But surprisingly it is observed that leading wildcard searches were still working in spite of disabling the ReversedWildcardFilterFactory filter. This behavior is puzzling everyone and we wanted to know how this behavior of reverse indexing works. Can anyone share insights on this Solr behavior? -Shyam
Re: best query for one-box search string over multiple types fields?
Johnny What you are going to want to do is boost the artist field with respect to the others, for example using edismax my 'qf' parameter is: number^5 title^3 default so hits in the number field get a five-fold boost and hits in the title field get a three-fold boost. In your case you might want to start with: artist^5 album^3 song Getting these parameters right will take a little work, and I would suggest you build a set of searches with known results so you can quickly check the effect of any tweaks you do. Useful reading would include: http://wiki.apache.org/solr/SolrRelevancyFAQ http://wiki.apache.org/solr/SolrRelevancyCookbook http://www.lucidimagination.com/blog/2011/12/14/options-to-tune-document’s-relevance-in-solr/ http://www.lucidimagination.com/blog/2011/03/10/solr-relevancy-function-queries/ Cheers François On Jan 15, 2012, at 1:19 AM, Johnny Marnell wrote: hi all, short of it: i want queen bohemian rhapsody to return that song named Bohemian Rhapsody by the artist named Queen, rather than songs with titles like Bohemian Rhapsody (Queen Cover). i'm indexing a catalog of music with these types of docs and their fields: artist (artistName), album (albumName, artistName), and song (songName, albumName, artistName). the client is one search box, and i'm having trouble handling searching over multiple multifields and weighting their exactness. when a user types queen, i want the artist Queen to be the first hit, and then albums and songs titled queen. if queen bohemian rhapsody is searched, i want to return that song, but instead i'm getting songs like Bohemian Rhapsody (Queen Cover) by Stupid Queen Tribute Band because all three terms are in the songName, i'm guessing. what kind of query do i need? i'm indexing all of these fields as multi-fields with ngram, shingle (i think this might be really useful for my use case?), keyword, and standard. that appears to be working, but i'm not sure how to combine all of this together over multiple multi-fields. if anyone has good links to broadly summarized use cases of Indexing and Querying, that would be great - i would think this would be a common situation but i can't find any good resources on the web. and i'm having trouble understanding scoring and boosting. this was my first post, hope i did it right, thanks so much! -j
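A sketch of what that looks like from SolrJ, assuming Solr 3.1 or later where edismax is available; the server URL is an assumption, the artist/album/song field names come from the suggestion above, and the boost values are starting points to tune:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class BoostedSearch {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrQuery query = new SolrQuery("queen bohemian rhapsody");
            // A hit in artist counts five times a hit in song, album three times.
            query.set("defType", "edismax");
            query.set("qf", "artist^5 album^3 song");
            QueryResponse rsp = server.query(query);
            System.out.println(rsp.getResults().getNumFound() + " hits");
        }
    }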
Re: Doing url search in solr is slow
About the search 'referal_url:*www.someurl.com*', having a wildcard at the start will cause a dictionary scan for every term you search on unless you use ReversedWildcardFilterFactory. That could be the cause of your slowdown if you are I/O bound, and even if you are CPU bound for that matter. François On Jan 8, 2012, at 8:44 PM, yu shen wrote: Hi, My solr document has up to 20 fields, containing data such as product name, date, url etc. The volume of documents is around 1.5m. My symptom is that a url search like [ url:*www.someurl.com* referal_url:*www.someurl.com* page_url:*www.someurl.com* ] gets an extraordinarily long response time, while searches against all other fields have normal response times. Can anyone share any insights on this? Spark
Re: Shutdown hook issue
I am not an expert on this but the oom-killer will kill off the process consuming the greatest amount of memory if the machine runs out of memory, and you should see something to that effect in the system log, /var/log/messages I think. François On Dec 14, 2011, at 2:54 PM, Adolfo Castro Menna wrote: I think I found the issue. The ubuntu server is running OOM-Killer which might be sending a SIGINT to the java process, probably because of memory consumption. Thanks, Adolfo. On Wed, Dec 14, 2011 at 12:44 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi, Solr won't shut down by itself just because it's idle. :) You could run it with debugger attached and breakpoint set in the shutdown hook you are talking about and see what calls it. Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html From: Adolfo Castro Menna adolfo.castrome...@gmail.com To: solr-user@lucene.apache.org Sent: Wednesday, December 14, 2011 8:17 AM Subject: Shutdown hook issue Hi All, I'm experiencing some issues with solr. From time to time solr goes down. After checking the logs, I see that it's due to the shutdown hook being triggered. I still don't know why it happens but it seems to be related to solr being idle. Does anyone have any insights? I'm using Ubuntu 10.04.2 LTS and solr 3.1.0 running on Jetty (default configuration). Solr runs in background, so it doesn't seem to be related to a SIGINT unless ubuntu is sending it for some odd reason. Thanks, Adolfo.
Re: Don't snowball depending on terms
It won't and depending on how your analyzer is set up the terms are most likely stemmed at index time. You could create a separate field for unstemmed terms though, or use a less aggressive stemmer such as EnglishMinimalStemFilterFactory. François On Nov 29, 2011, at 12:33 PM, Robert Brown wrote: Is it possible to search a field but not be affected by the snowball filter? ie, searching for manage is matching management, but a user may want to restrict results to only containing manage. I was hoping that simply quoting the term would do this, but it doesn't appear to make any difference. -- IntelCompute Web Design Local Online Marketing http://www.intelcompute.com
Re: how index words with their perfix in solr?
You might try the snowball stemmer too, I am not sure how closely that will fit your requirements though. Alternatively you could use synonyms. François On Nov 29, 2011, at 1:08 AM, mina wrote: thank you for your answer. i read it and i use this filter in my schema.xml in solr: <filter class="solr.PorterStemFilterFactory"/> but this filter doesn't understand all words with their suffix and prefix. this means when i search 'rain' solr doesn't show me any document that has 'rainy'.
Re: how index words with their perfix in solr?
It looks like you are using the plural stemmer, you might want to look into using the Porter stemmer instead: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming François On Nov 28, 2011, at 9:14 AM, mina wrote: I use solr 3.3, and I want solr to index words with their suffixes. When I index 'book' and 'books' and search 'book', solr shows any document that has 'book' or 'books', but when I index 'rain' and 'rainy' and search 'rain', solr shows only documents that have 'rain'. I want solr to show any document that has 'rain' or 'rainy'. Help me.
Re: query within search results
Wouldn't 'diseases AND water' or '+diseases +water' return you that result? Or you could search on 'water' while filtering on 'diseases'. Or am I missing something here? François On Nov 8, 2011, at 4:19 PM, sharnel pereira wrote: Hi, I have 10k records indexed using solr 1.4. We have a requirement to search within search results. Example: a query for 'water' returns 2000 results. I need the second query for 'diseases' to search within those 2000 results. (I can't add a facet as the second search should also check non-faceted fields.) Is there a way to get this working? Thanks Sharnel
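A sketch of the filter-query approach in SolrJ; the server URL is an assumption. The first search becomes a filter on the second, and since filter queries are cached separately they do not affect scoring:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class SearchWithinResults {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Search for 'diseases' restricted to the documents that the
            // earlier 'water' search matched.
            SolrQuery query = new SolrQuery("diseases");
            query.addFilterQuery("water");
            System.out.println(server.query(query).getResults().getNumFound() + " hits");
        }
    }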
Re: Is SQL Like operator feature available in Apache Solr query
Arshad Actually it is available, you need to use the ReversedWildcardFilterFactory which I am sure you can Google for. Solr and SQL address different problem sets with some overlap, but there are significant differences between the two technologies. Actually '%Solr%' is a worst case for SQL but handled quite elegantly in Solr. Hope this helps! Cheers François On Nov 1, 2011, at 7:46 AM, arshad ansari wrote: Hi, Is the SQL Like operator feature available in Apache Solr, just like we have it in SQL? SQL example below - *Select * from Employee where employee_name like '%Solr%'* If not, is it a bug with Solr? If this feature is available, please point to the examples available. Thanks! -- Best Regards, Arshad
Re: Is SQL Like operator feature available in Apache Solr query
Kuli Good point about just tokenizing the fields :) I ran a couple of tests to double-check my understanding, and you can have a wildcard operator at either or both ends of a term. Adding ReversedWildcardFilterFactory to your field analyzer will make leading wildcard searches a lot faster of course, but at the expense of index size. Cheers François On Nov 1, 2011, at 9:07 AM, Michael Kuhlmann wrote: Hi, this is not exactly true. In Solr, you can't have the wildcard operator on both sides of the term. However, you can tokenize your fields and simply query for Solr. That's what Solr is made for. :) -Kuli On 01.11.2011 13:24, François Schiettecatte wrote: Arshad Actually it is available, you need to use the ReversedWildcardFilterFactory which I am sure you can Google for. Solr and SQL address different problem sets with some overlap, but there are significant differences between the two technologies. Actually '%Solr%' is a worst case for SQL but handled quite elegantly in Solr. Hope this helps! Cheers François On Nov 1, 2011, at 7:46 AM, arshad ansari wrote: Hi, Is the SQL Like operator feature available in Apache Solr, just like we have it in SQL? SQL example below - *Select * from Employee where employee_name like '%Solr%'* If not, is it a bug with Solr? If this feature is available, please point to the examples available. Thanks! -- Best Regards, Arshad
Re: Uncomplete date expressions
Erik I would complement the date with default values as you suggest and store a boolean flag indicating whether the date was complete or not; or, probably better, store the original date when it is incomplete, because the presence of that data would tell you that the original date was incomplete and you would still have the original value too. Cheers François On Oct 29, 2011, at 9:12 AM, Erik Fäßler wrote: Hi all, I want to index MEDLINE documents which do not always contain complete dates of publication. The year is always known. Now the Solr documentation states that dates must have the format 1995-12-31T23:59:59Z, for which month, day and even the time of day must be known. I could, of course, just complement incomplete dates with default values, 01-01 for example. But then I won't be able to distinguish between complete and incomplete dates afterwards, which is of importance when displaying the documents. I could just store the known information, e.g. the year, in an integer-typed field, but then I won't have date math. Is there a good solution to my problem? Probably I'm just missing the obvious, perhaps you can help me :-) Best regards, Erik
Re: drastic performance decrease with 20 cores
You have not said how big your index is, but I suspect that allocating 13GB for your 20 cores is starving the OS of memory for caching file data. Have you tried 6GB with 20 cores? I suspect you will see the same performance as with 6GB and 10 cores. Generally it is better to allocate just enough memory to SOLR to run optimally rather than as much as possible. 'Just enough' depends as well; you will need to try out different allocations and see where the sweet spot is. Cheers François On Sep 26, 2011, at 9:53 AM, Bictor Man wrote: Hi everyone, Sorry if this issue has been discussed before, but I'm new to the list. I have a solr (3.4) instance running with 20 cores (around 4 million docs each). The instance has 13GB allocated on a 16GB RAM server. If I run several sets of queries sequentially in each of the cores, the I/O access goes very high, so does the system load, while the CPU percentage remains always low. It takes almost 1 hour to complete the set of queries. If I stop solr and restart it with 6GB allocated and 10 cores, after a bit the I/O access goes down and the CPU goes up, taking only around 5 minutes to complete all sets of queries. Meaning that for me it is MUCH more performant having 2 solr instances running with half the data and half the memory than a single instance with all the data and memory. It would be even way faster to have 1 instance with half the cores/memory, run the queries, shut it down, start a new instance and repeat the process than having a big instance running everything. Furthermore, if I take the 20-core/13GB instance, unload 10 of the cores, trigger the garbage collector and run the sets of queries again, the behavior still remains slow, taking like 30 minutes. Am I missing something here? Does solr change its caching policy depending on the number of cores at startup or something similar? Any hints will be very appreciated. Thanks, Victor
Re: synonyms.txt: different results on admin and on site..
Wildcard terms are not analyzed, so your synonyms.txt may come into play here, have you checked the analysis for deniz* ? François On Sep 7, 2011, at 10:08 PM, deniz wrote: well yea you are right... i realised that lack of detail issue here... so here it comes... This is from my schema.xml and basically i have a synonyms.txt file which contains deniz,denis,denise After posting here, I have checked some stuff that I have faced before, while trying to add accented letters to the system... so it seems like the same or similar stuff... so... As i want to support partial matches, the search string is modified on the php side. if the user enters deniz, it is sent to solr as deniz* when i check on the solr admin, i was able to make searches with deniz, denise, denis and they all return correct results, but when i put the wildcard, i get nothing... so with the above settings; deniz denise denis work smoothly deniz* denise* denis* return nothing... should i implement some kind of analyzer or tokenizer or any kind of component to overcome this thing? Rob Casson wrote: you should probably post your schema.xml and some parts of your synonyms.txt. it could be differences between your index and query analysis chains, synonym expansion errors, etc, but folks will likely need more details to help you out. cheers, rob On Wed, Sep 7, 2011 at 9:46 PM, deniz denizdurmu...@gmail.com wrote: could it be related with an analysis issue about synonyms once again? - Zeki ama calismiyor... Calissa yapar...
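Since wildcard terms bypass the analysis chain entirely (no lowercasing, no synonym expansion), one common workaround is to normalize the term on the client before appending the wildcard; a sketch, assuming the indexed terms are lowercased. Note that synonyms still will not expand on a wildcard term:

    public class WildcardQueryBuilder {
        // Wildcard terms are not analyzed, so lowercase and trim the user's
        // input here, client-side, before adding the '*'.
        public static String prefixQuery(String userTerm) {
            return userTerm.trim().toLowerCase() + "*";
        }

        public static void main(String[] args) {
            System.out.println(prefixQuery("Deniz")); // prints deniz*
        }
    }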
Re: MMapDirectory failed to map a 23G compound index segment
My memory of this is a little rusty but isn't mmap also limited by mem + swap on the box? What does 'free -g' report? François On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote: Ahoy ahoy! I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound index segment file. The stack trace looks pretty much like every other trace I've found when searching for OOM map failed [1]. My configuration follows: Solr 1.4.1/Lucene 2.9.3 (plus SOLR-1969 https://issues.apache.org/jira/browse/SOLR-1969 ) CentOS 4.9 (Final) Linux 2.6.9-100.ELsmp x86_64 yada yada yada Java SE (build 1.6.0_21-b06) Hotspot 64-bit Server VM (build 17.0-b16, mixed mode) ulimits:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 256000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 1064959
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Any suggestions? Thanks in advance, Rich

[1] ...
java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(Unknown Source)
at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown Source)
at org.apache.lucene.store.MMapDirectory$MMapIndexInput.init(Unknown Source)
at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
at org.apache.lucene.index.SegmentReader$CoreReaders.init(Unknown Source)
at org.apache.lucene.index.SegmentReader.get(Unknown Source)
at org.apache.lucene.index.SegmentReader.get(Unknown Source)
at org.apache.lucene.index.DirectoryReader.init(Unknown Source)
at org.apache.lucene.index.ReadOnlyDirectoryReader.init(Unknown Source)
at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown Source)
at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
at org.apache.lucene.index.IndexReader.open(Unknown Source)
...
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
...
Re: Solr and wikipedia for schools
I note that there is a full download option available, might be easier than crawling. François On Sep 4, 2011, at 9:56 AM, Markus Jelsma wrote: Hi, Solr is a search engine, not a crawler. You can use Apache Nutch to crawl your site and have it indexed in Solr. Cheers, Hi, I am new to Solr/Lucene, and have some problems trying to figure out the best way to perform indexing. I think I understand the general principles, but have some trouble translating this to my specific goal, which is the following: I want to use SolR as a search engine based on general (English) keywords, that has indexed Wikipedia for Schools (http://www.soschildrensvillages.org.uk/charity-news/archive/2008/10/2008-wikipedia-for-schools). I initially thought that it would be sufficient to add the root document (index.html) to Solr, after which everything would be automagically indexed, but this does not seem to work. I have also tried to use urldatasource in data-config.xml, but there I get a bit confused by the settings. Could anyone help me understand how I can achieve my goal? Thanks Kees
Re: shareSchema=true - location of schema.xml?
Satish You don't say which platform you are on but have you tried links (with ln on linux/unix)? François On Aug 31, 2011, at 12:25 AM, Satish Talim wrote: I have 1000's of cores and to reduce the cost of loading and unloading schema.xml, I have my solr.xml as mentioned here - http://wiki.apache.org/solr/CoreAdmin namely:

<solr>
  <cores adminPath="/admin/cores" shareSchema="true">
    ...
  </cores>
</solr>

However, I am not sure where to keep the common schema.xml file. In which case, do I need the schema.xml in the conf folder of each and every core? My folder structure is:

multicore (contains solr.xml)
|_ core0
|  |_ conf
|     |_ schema.xml
|     |_ solrconfig.xml
|     |_ other files
|_ core1
|  |_ conf
|     |_ schema.xml
|     |_ solrconfig.xml
|     |_ other files
|_ exampledocs (contains 1000's of .csv files and post.jar)

Satish
Re: Error while decoding %DC (Ü) from URL - results in ?
Merlin Just to make sure I understand what is going on here, you are getting searches from external crawlers. These are coming in the form of an HTTP request I assume? Have you checked the encoding specified in these requests (in the content type header)? If the encoding is not specified then iso-8859-1 is usually assumed. Also have you checked the default encoding of your container? If you are using tomcat that is set using URIEncoding, for example: <Connector address="localhost" port="8000" protocol="HTTP/1.1" connectionTimeout="2" URIEncoding="UTF-8" /> François On Aug 28, 2011, at 3:10 PM, Merlin Morgenstern wrote: I double checked all code on that page and it looks like everything is in utf-8 and works just perfectly. The problematic URLs are always called by bots like Googlebot. It looks like they are operating with a different encoding. The page itself has a utf-8 meta tag. So it looks like I have to find a way that checks for the encoding and encodes appropriately. This should be a common solr problem if all search engines treat utf-8 that way, right? Any ideas how to fix that? Is there maybe special solr functionality for this? 2011/8/27 François Schiettecatte fschietteca...@gmail.com Merlin Ü encodes to two bytes in utf-8 (C3 9C), and one in iso-8859-1 (%DC), so it looks like there is a charset mismatch somewhere. Cheers François On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote: Hello, I am having problems with searches that are issued from spiders and that contain the encoded character ü, for example in: Übersetzung. The solr log shows the following query request: /suche/%DCbersetzung which has been translated into the solr query: q=?ersetzung If you enter the search term directly as a user into the search box it will result in: /suche/Übersetzung which returns perfect results. I am decoding the URL within PHP: $term = trim(urldecode($q)); Somehow urldecode() translates the character Ü (%DC) into a ?, which is an illegal first character in Solr. I tried it without urldecode(), with rawurldecode() and with utf8_decode(), but none of those helped. Thank you for any help or hint on how to solve that problem. Regards, Merlin
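A quick way to see the mismatch is to encode the same character under both charsets; this is plain JDK API, nothing Solr-specific:

    import java.net.URLDecoder;
    import java.net.URLEncoder;

    public class EncodingDemo {
        public static void main(String[] args) throws Exception {
            // The same character yields different escapes per charset, which is
            // why a %DC sent by a bot and decoded as UTF-8 turns into '?'.
            System.out.println(URLEncoder.encode("Ü", "UTF-8"));      // %C3%9C
            System.out.println(URLEncoder.encode("Ü", "ISO-8859-1")); // %DC
            System.out.println(URLDecoder.decode("%C3%9C", "UTF-8")); // Ü
        }
    }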
Re: Error while decoding %DC (Ü) from URL - results in ?
Merlin Ü encodes to two bytes in utf-8 (C3 9C), and one in iso-8859-1 (%DC), so it looks like there is a charset mismatch somewhere. Cheers François On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote: Hello, I am having problems with searches that are issued from spiders and that contain the encoded character ü, for example in: Übersetzung. The solr log shows the following query request: /suche/%DCbersetzung which has been translated into the solr query: q=?ersetzung If you enter the search term directly as a user into the search box it will result in: /suche/Übersetzung which returns perfect results. I am decoding the URL within PHP: $term = trim(urldecode($q)); Somehow urldecode() translates the character Ü (%DC) into a ?, which is an illegal first character in Solr. I tried it without urldecode(), with rawurldecode() and with utf8_decode(), but none of those helped. Thank you for any help or hint on how to solve that problem. Regards, Merlin
Re: SolrServer instances
Sounds to me like you are looking for HTTP persistent connections (connection keep-alive as opposed to close), and a singleton object. This would be outside SOLR per se. A few caveats though: I am not sure if tomcat supports keep-alive, I am not sure how SOLR deals with multiple requests coming down the pipe, you will need to deal with concurrency, and I am not sure what you are looking to gain from this; opening an http connection is pretty cheap. François On Aug 26, 2011, at 2:09 AM, Jonty Rhods wrote: Am I also required to close the connection to the solr server (CommonHttpSolrServer)? regards On Fri, Aug 26, 2011 at 9:45 AM, Jonty Rhods jonty.rh...@gmail.com wrote: Dear all, please help, I am stuck here as I don't have much experience.. thanks On Thu, Aug 25, 2011 at 6:51 PM, Jonty Rhods jonty.rh...@gmail.com wrote: Hi All, I am using SolrJ (3.1) and Tomcat 6.x. I want to open the solr server once (20 concurrent connections) and reuse this across the whole site, something like a connection pool as we use for a DB (i.e. Apache DBCP). There is a way to use a static method, which is one option, but I want a better solution from you people. I read one thread where Ahmet suggests using something like this: String serverPath = "http://localhost:8983/solr"; HttpClient client = new HttpClient(new MultiThreadedHttpConnectionManager()); URL url = new URL(serverPath); CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(url, client); But how do I use an instance of this across all classes? Please suggest. regards Jonty
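Building on the snippet quoted above, a minimal singleton holder might look like this; the URL and the pool size of 20 are assumptions taken from the thread, not settled values:

    import java.net.URL;
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public final class SolrServerHolder {
        private static CommonsHttpSolrServer server;

        private SolrServerHolder() {}

        // One shared, thread-safe instance for the whole webapp; the
        // MultiThreadedHttpConnectionManager pools and reuses connections,
        // so there is nothing to close per request.
        public static synchronized CommonsHttpSolrServer getInstance() throws Exception {
            if (server == null) {
                HttpClient client = new HttpClient(new MultiThreadedHttpConnectionManager());
                client.getHttpConnectionManager().getParams().setMaxTotalConnections(20);
                server = new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"), client);
            }
            return server;
        }
    }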
Re: Solr 3.3 crashes after ~18 hours?
Assuming you are running on Linux, you might want to check /var/log/messages too (the location might vary), I think the kernel logs forced process terminations there. I recall that the kernel will usually pick the process consuming the most memory; there may be other factors involved too. François On Aug 2, 2011, at 9:04 AM, wakemaster 39 wrote: Monitor your memory usage. I used to encounter a problem like this before where nothing was in the logs and the process was just gone. Turned out my system was out of memory and swap got used up because of another process, which then forced the kernel to start killing off processes. Google 'OOM linux' and you will find plenty of other programs and people with a similar problem. Cameron On Aug 2, 2011 6:02 AM, alexander sulz a.s...@digiconcept.net wrote: Hello folks, I'm using the latest stable Solr release - 3.3 and I encounter a strange phenomenon with it. After about 19 hours it just crashes, but I can't find anything in the logs, no exceptions, no warnings, no suspicious info entries.. I have an index-job running from 6am to 8pm every 10 minutes. After each job there is a commit. An optimize-job is done twice a day at 12:15pm and 9:15pm. Does anyone have an idea what could possibly be wrong or where to look for further debug info? regards and thank you alex
Re: Solr can not index F**K!
That seems a little far-fetched, have you checked your analysis? François On Jul 31, 2011, at 4:58 PM, randohi wrote: One of our clients (a hot girl!) brought this to our attention: In this document there are many f* words: http://sec.gov/Archives/edgar/data/1474227/00014742271032/d424b3.htm and we have indexed it with the latest version of Solr (ver 3.3). But if we search F**K, it does not return the document. We have tried to index it with different text types, but it is still not working. Any idea why F* can not be indexed - being censored by the government? :D
Re: Solr can not index F**K!
Indeed, the analysis will show if the term is a stop word and gets removed by the stop filter; turning on verbose output shows that. François On Jul 31, 2011, at 6:27 PM, Shashi Kant wrote: Check your stop words list On Jul 31, 2011 6:25 PM, François Schiettecatte fschietteca...@gmail.com wrote: That seems a little far-fetched, have you checked your analysis? François On Jul 31, 2011, at 4:58 PM, randohi wrote: One of our clients (a hot girl!) brought this to our attention: In this document there are many f* words: http://sec.gov/Archives/edgar/data/1474227/00014742271032/d424b3.htm and we have indexed it with the latest version of Solr (ver 3.3). But if we search F**K, it does not return the document. We have tried to index it with different text types, but it is still not working. Any idea why F* can not be indexed - being censored by the government? :D
Re: schema.xml changes, need re-indexing ?
I have not seen this mentioned anywhere, but I found a useful 'trick' to restart solr without having to restart tomcat. All you need to do is 'touch' the solr.xml in the solr.home directory. It can take a few seconds but solr will restart and reload any config. Cheers François On Jul 27, 2011, at 2:56 PM, Alexei Martchenko wrote: I believe you're fine with that. You don't need to reindex the whole solr database. 2011/7/27 Charles-Andre Martin charles-andre.mar...@sunmedia.ca Hi, We currently have a big index in production. We would like to add 2 non-required fields to our schema.xml: <field name="myfield" type="boolean" indexed="true" stored="true" required="false"/> <field name="myotherfield" type="string" indexed="true" stored="true" required="false" multiValued="true"/> I made some tests: - I stopped tomcat - I changed the schema.xml - I started tomcat The data was still there and I was able to add new documents with these 2 fields. So far, it looks like I won't need to re-index all my data. Am I right? Do I need to re-index all my data or in that case am I fine? Thank you! Charles-André Martin -- *Alexei Martchenko* | *CEO* | Superdownloads ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) 5083.1018/5080.3535/5080.3533
Re: Spellcheck compounded words
FWIW, here is the process I follow to create a log4j-aware version of the apache solr war file, and the corresponding log4j.properties file. Have fun :) François

##
# Log4J configuration for SOLR
# http://wiki.apache.org/solr/SolrLogging
#
# 1) Download SLF4J:
#    http://www.slf4j.org/
#    http://www.slf4j.org/download.html
#    http://www.slf4j.org/dist/slf4j-1.6.1.tar.gz
#
# 2) Unpack Solr:
#    jar xvf apache-solr-3.3.0.war
#
# 3) Delete:
#    WEB-INF/lib/log4j-over-slf4j-1.6.1.jar
#    WEB-INF/lib/slf4j-jdk14-1.6.1.jar
#
# 4) Copy:
#    slf4j-1.6.1/slf4j-log4j12-1.6.1.jar -> WEB-INF/lib
#    log4j.properties (this file)        -> WEB-INF/classes/ (needs to be created)
#
# 5) Pack Solr:
#    jar cvf apache-solr-3.3.0.war admin favicon.ico index.jsp META-INF WEB-INF
#
# Author: Francois Schiettecatte
# Version: 1.0
##

##
# Logging levels (helpful reminder)
# DEBUG INFO WARN ERROR FATAL
##

# Logging setup
log4j.rootLogger=ERROR, SOLR

# Daily Rolling File Appender (SOLR)
log4j.appender.SOLR=org.apache.log4j.DailyRollingFileAppender
log4j.appender.SOLR.File=${catalina.base}/logs/solr.log
log4j.appender.SOLR.Append=true
log4j.appender.SOLR.Encoding=UTF-8
log4j.appender.SOLR.DatePattern='-'yyyy-MM-dd
log4j.appender.SOLR.layout=org.apache.log4j.PatternLayout
log4j.appender.SOLR.layout.ConversionPattern=%d [%t] %-5p %c - %m%n

##
# Logging levels for SOLR
# Default logging level
log4j.logger.org.apache.solr=ERROR
##

On Jul 26, 2011, at 2:49 PM, O. Klein wrote: Adding log4j-1.2.16.jar and deleting slf4j-jdk14-1.6.1.jar does not fix logging for 4.0 for me. Anyway, I tried it on 3.3 and Solr just hangs there also. No logging, no exceptions. I'll let you know if I manage to find the source of the problem.
Re: Spellcheck compounded words
I get slf4j-log4j12-1.6.1.jar from http://www.slf4j.org/dist/slf4j-1.6.1.tar.gz, it is what interfaces slf4j to log4j; you will also need to add log4j-1.2.16.jar to WEB-INF/lib. François On Jul 26, 2011, at 3:40 PM, O. Klein wrote: François Schiettecatte wrote: # 4) Copy: # slf4j-1.6.1/slf4j-log4j12-1.6.1.jar -> WEB-INF/lib # log4j.properties (this file) -> WEB-INF/classes/ (needs to be created) Don't you mean log4j-1.2.16/slf4j-log4j12-1.6.1.jar? Anyway, I was testing on 3.3 and found that when I added spellcheck.maxCollations=2&spellcheck.maxCollationTries=2 as parameters to the URL there was no problem at all. Adding <str name="spellcheck.maxCollations">2</str> <str name="spellcheck.maxCollationTries">2</str> to the default requestHandler in solrconfig.xml caused the request to hang. Can someone verify if this is a bug?
Re: problem searching on non standard characters
Check your analyzers to make sure that these characters are not getting stripped out in the tokenization process; the url for 3.3 is something along the lines of: http://localhost/solr/admin/analysis.jsp?highlight=on And you should indeed be searching on \#test. François On Jul 22, 2011, at 10:34 AM, Jason Toy wrote: How does one search for words with characters like # and +? I have tried searching solr with #test and \#test but all my results always come up with test and not #test. Is this some kind of configuration option I need to set in solr? -- - sent from my mobile 6176064373
Re: problem searching on non standard characters
Adding to my previous reply, I just did a quick check on the 'text_en' and 'text_en_splitting' field types and they both strip leading '#'. Cheers François On Jul 22, 2011, at 10:49 AM, Shawn Heisey wrote: On 7/22/2011 8:34 AM, Jason Toy wrote: How does one search for words with characters like # and +. I have tried searching solr with #test and \#test but all my results always come up with test and not #test. Is this some kind of configuration option I need to set in solr? I would guess that your analysis chain (in schema.xml) includes something that removes and/or splits terms at non-alphanumeric characters. There are a several components that do this, but WordDelimiterFilter is the one that comes to mind most readily. I've never used the StandardTokenizer, but I believe it might do something similar. Thanks, Shawn
Re: How to find whether solr server is running or not
I think anything but a 200 OK means it is dead like the proverbial parrot :) François On Jul 19, 2011, at 7:42 AM, Romi wrote: But the problem is when the solr server is not running *http://host:port/solr/admin/ping* will not give me any json response, so how will I get the status :( When I run this url the browser gives me the following error: *Unable to connect Firefox can't establish a connection to the server at 192.168.1.9:8983.* - Thanks & Regards Romi
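A bare-bones liveness check along those lines - anything other than a 200, including a connection refused, counts as down. The host, port and timeouts are assumptions:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SolrPingCheck {
        public static void main(String[] args) {
            try {
                URL url = new URL("http://localhost:8983/solr/admin/ping");
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setConnectTimeout(2000);
                conn.setReadTimeout(2000);
                // 200 OK means Solr is up; any other status means trouble.
                System.out.println(conn.getResponseCode() == 200 ? "up" : "down");
                conn.disconnect();
            } catch (Exception e) {
                // Connection refused etc. - the server is not reachable at all.
                System.out.println("down (" + e.getMessage() + ")");
            }
        }
    }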
Re: - character in search query
Easy, the hyphen is out on its own (with spaces on either side) and is probably getting removed from the search by the tokenizer. Check your analysis. François On Jul 14, 2011, at 6:05 AM, roySolr wrote: It looks like it's still not working. I send this to SOLR: q=arsenal \- london I get no results. When I look at the debugQuery I see this: (name: arsenal | city:arsenal)~1.0 (name: \ | city:\)~1.0 (name: london | city: london)~1.0 My requesthandler: <requestHandler name="dismax" class="solr.SearchHandler" default="true"> <lst name="defaults"> <str name="defType">dismax</str> <str name="qf">name city</str> </lst> </requestHandler> What is going wrong?
Re: Wildcard
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html http://wiki.apache.org/solr/SolrQuerySyntax François On Jul 13, 2011, at 1:29 PM, GAURAV PAREEK wrote: Hello, What are wildcards we can use with the SOLR ? Regards, Gaurav
Re: Result list order in case of ties
You just need to provide a second sort field along the lines of: sort=score desc, author desc François On Jul 12, 2011, at 6:13 AM, Lox wrote: Hi, In the case where two or more documents are returned with the same score, is there a way to tell Solr to sort them alphabetically? I have already tried to use the tie-breaker, but I have just one field to search. Thank you.
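In SolrJ the same thing is expressed with two sort fields; 'author' here stands in for whatever the field is actually called, and ascending order gives A-to-Z tie-breaking:

    import org.apache.solr.client.solrj.SolrQuery;

    public class TieBreakSort {
        public static void main(String[] args) {
            SolrQuery query = new SolrQuery("some query");
            // Primary sort by relevance; the second field only kicks in
            // when two documents have exactly the same score.
            query.addSortField("score", SolrQuery.ORDER.desc);
            query.addSortField("author", SolrQuery.ORDER.asc);
            System.out.println(query);
        }
    }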
Re: performance variation with respect to the index size
Hi I don't think that anyone has run such benchmarks; in fact this topic came up two weeks ago and I volunteered some time to do that because I have some spare time this week, so I am going to run some benchmarks this weekend and report back. The machine I have to do this is a core i7 960, 24GB, 4TB of disk. I am going to run SOLR 3.3 under Tomcat 7.0.16. I have three databases I can use for this: icwsm-2009 (38.5GB compressed), cdip (24GB compressed), trec vlc2 (31GB compressed). I could also use a copy of wikipedia. I have lots of user searches I can use (saved from Feedster days). I would like some input on a couple of things to make this test as real-world as possible. One is any optimizations I should set in solrconfig.xml, and the other is the heap/GC settings I should set for tomcat. Anything else? Cheers François On Jul 8, 2011, at 4:08 AM, jame vaalet wrote: hi, is there any performance degradation (response time etc.) if the index has document content text stored in it (stored=true)? -JAME
Re: Wildcard search not working if full word is queried
Celso You are very welcome and yes I should have mentioned that wildcard searches are not analyzed (which is a recurring theme). This also means that they are not downcased, so the search TEST* will probably not find anything either in your set up. Cheers François On Jul 1, 2011, at 5:16 AM, Celso Pinto wrote: Hi again, read (past tense) TFM :-) and: On wildcard and fuzzy searches, no text analysis is performed on the search word. Thanks a lot François! Regards, Celso On Fri, Jul 1, 2011 at 10:02 AM, Celso Pinto cpi...@yimports.com wrote: Hi François, it is indeed being stemmed, thanks a lot for the heads up. It appears that stemming is also configured for the query so it should work just the same, no? Thanks again. Regards, Celso 2011/6/30 François Schiettecatte fschietteca...@gmail.com: I would run that word through the analyzer, I suspect that the word 'teste' is being stemmed to 'test' in the index, at least that is the first place I would check. François On Jun 30, 2011, at 2:21 PM, Celso Pinto wrote: Hi everyone, I'm having some trouble figuring out why a query with an exact word followed by the * wildcard, eg. teste*, returns no results while a query for test* returns results that have the word teste in them. I've created a couple of pasties: Exact word with wildcard : http://pastebin.com/n9SMNsH0 Similar word: http://pastebin.com/jQ56Ww6b Parameters other than title, description and content have no effect other than filtering out unwanted results. In a two of the four results, the title has the complete word teste. On the other two, the word appears in the other fields. Does anyone have any insights about what I'm doing wrong? Thanks in advance. Regards, Celso
Re: Wildcard search not working if full word is queried
I would run that word through the analyzer, I suspect that the word 'teste' is being stemmed to 'test' in the index, at least that is the first place I would check. François On Jun 30, 2011, at 2:21 PM, Celso Pinto wrote: Hi everyone, I'm having some trouble figuring out why a query with an exact word followed by the * wildcard, eg. teste*, returns no results while a query for test* returns results that have the word teste in them. I've created a couple of pasties: Exact word with wildcard : http://pastebin.com/n9SMNsH0 Similar word: http://pastebin.com/jQ56Ww6b Parameters other than title, description and content have no effect other than filtering out unwanted results. In a two of the four results, the title has the complete word teste. On the other two, the word appears in the other fields. Does anyone have any insights about what I'm doing wrong? Thanks in advance. Regards, Celso
Re: filters effect on search results
Indeed, I find the Porter stemmer to be too 'aggressive' for my taste, I prefer the EnglishMinimalStemFilterFactory, with the caveat that it depends on your data set. Cheers François On Jun 29, 2011, at 6:21 AM, Ahmet Arslan wrote: Hi, when i query for elegant in solr i get results for elegance too. *I used these filters for index analysis:* WhitespaceTokenizerFactory StopFilterFactory WordDelimiterFilterFactory LowerCaseFilterFactory SynonymFilterFactory EnglishPorterFilterFactory RemoveDuplicatesTokenFilterFactory ReversedWildcardFilterFactory *and for query analysis:* WhitespaceTokenizerFactory SynonymFilterFactory StopFilterFactory WordDelimiterFilterFactory LowerCaseFilterFactory EnglishPorterFilterFactory RemoveDuplicatesTokenFilterFactory I want to know which filter is affecting my search result. It is EnglishPorterFilterFactory, you can verify it from the admin/analysis.jsp page.
Re: Include synonys in solr
Well you need to find word lists and/or a thesaurus. This is one place to start: http://wordlist.sourceforge.net/ I used the US/UK English word list for the synonyms in an index I have because it contains both US and UK English terms; the list lacks some medical terms though, so we just added them. Cheers François On Jun 28, 2011, at 6:55 AM, Romi wrote: Please see http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory No offence, but a simple Google search, or a search of the Wiki, would have turned this up. Please try such simpler avenues before dashing off a message to the list. Gora, I have already read the document and also included synonyms in my search results :) My question is, when I use this *<filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>* I need to enter synonyms manually in synonyms.txt, which is really tough if you have many words for synonyms. I wanted to ask if there is any other option so that I need not enter synonyms manually.. I hope you got my point :) - Thanks & Regards Romi
Re: Removing duplicate documents from search results
Create a hash from the url and use that as the unique key; md5 or sha1 would probably be good enough. Cheers François On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: I also have the problem of duplicate docs. I am indexing news articles; every news article will have the source URL, and if two news articles have the same URL, only one needs to be indexed - removal of duplicates at index time. On 23 June 2011 21:24, simon mtnes...@gmail.com wrote: have you checked out the deduplication process that's available at indexing time? This includes a fuzzy hash algorithm. http://wiki.apache.org/solr/Deduplication -Simon On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash pra...@gmail.com wrote: This approach would definitely work if the two documents are *exactly* the same. But this is very fragile: even if one extra space has been added, the whole hash would change. What I am really looking for is some percentage similarity between documents, to remove those documents which are more than 95% similar. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote: What you need to do is calculate some hash (using any message digest algorithm you want: md5, sha-1 and so on), then do some reading on Solr field collapse capabilities. Should not be too complicated. *Omri Cohen* Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295 My profiles: LinkedIn http://www.linkedin.com/in/omric | Twitter http://www.twitter.com/omricohe | WordPress http://omricohen.me -- Forwarded message -- From: Pranav Prakash pra...@gmail.com Date: Thu, Jun 23, 2011 at 12:26 PM Subject: Removing duplicate documents from search results To: solr-user@lucene.apache.org How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost similar (people submitting the same stuff multiple times, sometimes different people submitting the same stuff). Now when a search is performed for a keyword, the same document quite frequently comes up multiple times in the top N results. I want to remove those duplicate (or possibly duplicate) documents, very similar to what Google does when they say 'In order to show you the most relevant results, duplicates have been removed.' How can I achieve this functionality using Solr? Does Solr have a built-in mechanism or plugin which could help me with it? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny -- Thanks and Regards Mohammad Shariq
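For reference, the Deduplication page Simon links to hangs a SignatureUpdateProcessorFactory into the update chain; a minimal sketch of that config follows (the chain name and the url field are illustrative, not Mohammad's actual schema):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <!-- with overwriteDupes=true, a new document deletes older ones with the same signature -->
    <bool name="overwriteDupes">true</bool>
    <!-- fields that feed the signature; here the article URL -->
    <str name="fields">url</str>
    <!-- Lookup3Signature is exact; TextProfileSignature is the fuzzy variant for near-duplicates -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>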
Re: Removing duplicate documents from search results
Maybe there is a way to get Solr to reject documents that already exist in the index, but I doubt it; maybe someone else can chime in here. You could do a search for each document prior to indexing it to see if it is already in the index, but that is probably non-optimal; maybe it is easiest to check whether the document exists in your Riak repository - if not, add it and index it, and drop it if it already exists. François On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote: I am making the hash from the URL, but I can't use this as the uniqueKey because I am using a UUID as the uniqueKey. Since I am using Solr as the index engine only and Riak (key-value storage) as the storage engine, I don't want to overwrite on duplicate; I just need to discard the duplicates. 2011/6/28 François Schiettecatte fschietteca...@gmail.com Create a hash from the url and use that as the unique key; md5 or sha1 would probably be good enough. Cheers François On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: I also have the problem of duplicate docs. I am indexing news articles; every news article will have the source URL, and if two news articles have the same URL, only one needs to be indexed - removal of duplicates at index time. On 23 June 2011 21:24, simon mtnes...@gmail.com wrote: have you checked out the deduplication process that's available at indexing time? This includes a fuzzy hash algorithm. http://wiki.apache.org/solr/Deduplication -Simon On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash pra...@gmail.com wrote: This approach would definitely work if the two documents are *exactly* the same. But this is very fragile: even if one extra space has been added, the whole hash would change. What I am really looking for is some percentage similarity between documents, to remove those documents which are more than 95% similar. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote: What you need to do is calculate some hash (using any message digest algorithm you want: md5, sha-1 and so on), then do some reading on Solr field collapse capabilities. Should not be too complicated. *Omri Cohen* Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295 My profiles: LinkedIn http://www.linkedin.com/in/omric | Twitter http://www.twitter.com/omricohe | WordPress http://omricohen.me -- Forwarded message -- From: Pranav Prakash pra...@gmail.com Date: Thu, Jun 23, 2011 at 12:26 PM Subject: Removing duplicate documents from search results To: solr-user@lucene.apache.org How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost similar (people submitting the same stuff multiple times, sometimes different people submitting the same stuff).
Now when a search is performed for a keyword, the same document quite frequently comes up multiple times in the top N results. I want to remove those duplicate (or possibly duplicate) documents, very similar to what Google does when they say 'In order to show you the most relevant results, duplicates have been removed.' How can I achieve this functionality using Solr? Does Solr have a built-in mechanism or plugin which could help me with it? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny -- Thanks and Regards Mohammad Shariq -- Thanks and Regards Mohammad Shariq
Re: Removing duplicate documents from search results
Indeed, take a look at this: http://wiki.apache.org/solr/Deduplication I have not used it but it looks like it will do the trick. François On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote: I found the deduplication thing really useful, although I have not yet started to work on it, as there are some other low-hanging fruit I have to capture first. Will share my thoughts soon. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny 2011/6/28 François Schiettecatte fschietteca...@gmail.com Maybe there is a way to get Solr to reject documents that already exist in the index, but I doubt it; maybe someone else can chime in here. You could do a search for each document prior to indexing it to see if it is already in the index, but that is probably non-optimal; maybe it is easiest to check whether the document exists in your Riak repository - if not, add it and index it, and drop it if it already exists. François On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote: I am making the hash from the URL, but I can't use this as the uniqueKey because I am using a UUID as the uniqueKey. Since I am using Solr as the index engine only and Riak (key-value storage) as the storage engine, I don't want to overwrite on duplicate; I just need to discard the duplicates. 2011/6/28 François Schiettecatte fschietteca...@gmail.com Create a hash from the url and use that as the unique key; md5 or sha1 would probably be good enough. Cheers François On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: I also have the problem of duplicate docs. I am indexing news articles; every news article will have the source URL, and if two news articles have the same URL, only one needs to be indexed - removal of duplicates at index time. On 23 June 2011 21:24, simon mtnes...@gmail.com wrote: have you checked out the deduplication process that's available at indexing time? This includes a fuzzy hash algorithm. http://wiki.apache.org/solr/Deduplication -Simon On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash pra...@gmail.com wrote: This approach would definitely work if the two documents are *exactly* the same. But this is very fragile: even if one extra space has been added, the whole hash would change. What I am really looking for is some percentage similarity between documents, to remove those documents which are more than 95% similar. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote: What you need to do is calculate some hash (using any message digest algorithm you want: md5, sha-1 and so on), then do some reading on Solr field collapse capabilities. Should not be too complicated. *Omri Cohen* Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295 My profiles: LinkedIn http://www.linkedin.com/in/omric | Twitter http://www.twitter.com/omricohe | WordPress http://omricohen.me
-- Forwarded message -- From: Pranav Prakash pra...@gmail.com Date: Thu, Jun 23, 2011 at 12:26 PM Subject: Removing duplicate documents from search results To: solr-user@lucene.apache.org How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost similar (people submitting the same stuff multiple times, sometimes different people submitting the same stuff). Now when a search is performed for a keyword, the same document quite frequently comes up multiple times in the top N results. I want to remove those duplicate (or possibly duplicate) documents, very similar to what Google does when they say 'In order to show you the most relevant results, duplicates have been removed.' How can I achieve this functionality using Solr? Does Solr have a built-in mechanism or plugin which could help me with it? *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny -- Thanks and Regards Mohammad Shariq
Re: Removing duplicate documents from search results
Yeah, I read the overview, which suggests that duplicates can be prevented from entering the index, and scanned the rest; it does not look like you can actually drop the document entirely. Maybe I am missing something here. François On Jun 28, 2011, at 9:14 AM, Mohammad Shariq wrote: Hey François, thanks for your suggestion. I followed the same link (http://wiki.apache.org/solr/Deduplication); they offer two solutions: either make the hash the uniqueKey, OR overwrite on duplicate. I don't need either; I need discard on duplicate. I have not used it but it looks like it will do the trick. François On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote: I found the deduplication thing really useful, although I have not yet started to work on it, as there are some other low-hanging fruit I have to capture first. Will share my thoughts soon. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny 2011/6/28 François Schiettecatte fschietteca...@gmail.com Maybe there is a way to get Solr to reject documents that already exist in the index, but I doubt it; maybe someone else can chime in here. You could do a search for each document prior to indexing it to see if it is already in the index, but that is probably non-optimal; maybe it is easiest to check whether the document exists in your Riak repository - if not, add it and index it, and drop it if it already exists. François On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote: I am making the hash from the URL, but I can't use this as the uniqueKey because I am using a UUID as the uniqueKey. Since I am using Solr as the index engine only and Riak (key-value storage) as the storage engine, I don't want to overwrite on duplicate; I just need to discard the duplicates. 2011/6/28 François Schiettecatte fschietteca...@gmail.com Create a hash from the url and use that as the unique key; md5 or sha1 would probably be good enough. Cheers François On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: I also have the problem of duplicate docs. I am indexing news articles; every news article will have the source URL, and if two news articles have the same URL, only one needs to be indexed - removal of duplicates at index time. On 23 June 2011 21:24, simon mtnes...@gmail.com wrote: have you checked out the deduplication process that's available at indexing time? This includes a fuzzy hash algorithm. http://wiki.apache.org/solr/Deduplication -Simon On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash pra...@gmail.com wrote: This approach would definitely work if the two documents are *exactly* the same. But this is very fragile: even if one extra space has been added, the whole hash would change. What I am really looking for is some percentage similarity between documents, to remove those documents which are more than 95% similar. *Pranav Prakash* temet nosce Twitter http://twitter.com/pranavprakash | Blog http://blog.myblive.com | Google http://www.google.com/profiles/pranny On Thu, Jun 23, 2011 at 15:16, Omri Cohen o...@yotpo.com wrote: What you need to do is calculate some hash (using any message digest algorithm you want: md5, sha-1 and so on), then do some reading on Solr field collapse capabilities. Should not be too complicated. *Omri Cohen* Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295 My profiles: LinkedIn http://www.linkedin.com/in/omric | Twitter http://www.twitter.com/omricohe | WordPress http://omricohen.me
-- Forwarded message -- From: Pranav Prakash pra...@gmail.com Date: Thu, Jun 23, 2011 at 12:26 PM Subject: Removing duplicate documents from search results To: solr-user@lucene.apache.org How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost similar (people submitting the same stuff multiple times, sometimes different people submitting the same stuff). Now when a search is performed for a keyword, the same document quite frequently comes up multiple times in the top N results. I want to remove those duplicate (or possibly duplicate) documents.
Re: Include synonyms in solr
Well no, you need to see which files (if any) will suit your needs; they are not all synonyms files. I only needed the UK/US English file, and I needed to process it into a format suitable for the synonyms file. There may well be other word lists on the net suitable for your needs. I would not recommend the use of synonyms unless you have a specific need for them. I needed them because we have documents which mix UK/US English, and we need to be able to search on medical terms, e.g. hemoglobin/haemoglobin, and get the same results. Cheers François On Jun 28, 2011, at 9:21 AM, Romi wrote: Thanks François Schiettecatte, the information you provided is very helpful. I need to know one more thing: I downloaded one of the given dictionaries but it contains many files; do I need to add all of these files' data into synonyms.txt? - Thanks Regards Romi -- View this message in context: http://lucene.472066.n3.nabble.com/Include-synonyms-in-solr-tp3116836p3117733.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Extending Solr Highlighter to pull information from external source
Mike I would be very interested in the answer to that question too. My hunch is that the answer is no as well. I have a few text databases that range from 200MB to about 60GB with which I could run some tests. I will have some downtime in early July and will post results. From what I can tell, the Guardian newspaper is doing just that: http://www.guardian.co.uk/open-platform/blog/what-is-powering-the-content-api http://www.lucidimagination.com/blog/2010/04/29/for-the-guardian-solr-is-the-new-database/ Cheers François On Jun 20, 2011, at 9:05 AM, Mike Sokolov wrote: I'd be very interested in this as well, if you do it before me and are willing to share... A related question I have tried to ask on this list, and have never really gotten a good answer to, is whether it makes sense to just chuck the external storage and treat the Lucene index as the primary storage for documents. I have a feeling the answer is no, perhaps because of increased I/O costs for Lucene and Solr, but I don't really know. I've been considering doing some experimentation, but would really love an expert opinion... -Mike On 06/20/2011 08:41 AM, Jamie Johnson wrote: I am trying to index data where I'm concerned that storing the contents of a specific field will be a bit of a resource hog, so we are planning to retrieve this information as needed for highlighting from an external source. I am looking to extend the default Solr highlighting capability to work with information pulled from this external source, and it looks like this is possible by extending DefaultSolrHighlighter (line 418, to pull a particular field from the external source) for standard highlighting and BaseFragmentsBuilder (line 99) for FastVectorHighlighter. I could just hard-code this to say 'if the field name is a specific value, look in the external source'; is this the best way to accomplish this? Are there any other extension points to do what I'm suggesting?
Re: Searching in Traditional / Simplified Chinese Record
Wayne I am not sure what you mean by 'changing the record'. One option would be to implement something like the synonyms filter to generate the TC for the SC when you index the document, which would index both the TC and the SC in the same location. That way your users would be able to search with either TC or SC. Another option would be to use the same synonyms filter but do the expansion at search time. Cheers François On Jun 20, 2011, at 5:41 AM, waynelam wrote: Hi, I've recently made changes to my schema.xml to support the import of Chinese records. What I want to do is search both Traditional Chinese (TC) (e.g. ??) and Simplified Chinese (SC) (e.g. ??) records in the same query. I know I can do that by converting all SC records to TC, but I want to change the way I index rather than change the records. If anyone can show me the way it is much appreciated. Thanks Wayne -- - Wayne Lam Assistant Library Officer I Systems Development Support Fong Sum Wood Library Lingnan University 8 Castle Peak Road Tuen Mun, New Territories Hong Kong SAR China Phone: +852 26168585 Email: wayne...@ln.edu.hk Website: http://www.library.ln.edu.hk
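A minimal sketch of the first option, assuming the TC/SC pairs are maintained in a synonyms-format file (tc_sc.txt is an illustrative name); with expand="true" at index time, both forms end up indexed at the same position, so either can be searched:

<fieldType name="text_zh" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- tc_sc.txt lists Traditional/Simplified pairs, one pair per line -->
    <filter class="solr.SynonymFilterFactory" synonyms="tc_sc.txt" ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

A CJK-aware tokenizer may be a better fit than StandardTokenizerFactory for Chinese text; that choice is orthogonal to the synonym trick.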
Re: Is it true that I cannot delete stored content from the index?
That is correct, but you only need to commit; optimize is not a requirement here. François On Jun 18, 2011, at 11:54 PM, Mohammad Shariq wrote: I have defined a uniqueKey in my Solr and am deleting the docs from Solr using this uniqueKey, and then doing an optimize once a day. Is this the right way to delete? On 19 June 2011 05:14, Erick Erickson erickerick...@gmail.com wrote: Yep, you've got to delete and re-add. Although if you have a uniqueKey defined you can just re-add that document and Solr will automatically delete the underlying document. You might have to optimize the index afterwards to get the data to really disappear, since the deletion process just marks the document as deleted. Best Erick On Sat, Jun 18, 2011 at 1:20 PM, Gabriele Kahlout gabri...@mysimpatico.com wrote: Hello, I've been indexing with the content field stored. Now I'd like to delete all stored content; is there a way to do that without re-indexing? It seems not, from the Lucene FAQ http://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_update_a_document_or_a_set_of_documents_that_are_already_indexed.3F : How do I update a document or a set of documents that are already indexed? There is no direct update procedure in Lucene. To update an index incrementally you must first *delete* the documents that were updated, and *then re-add* them to the index. -- Regards, K. Gabriele -- Thanks and Regards Mohammad Shariq
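For reference, this is what delete-by-uniqueKey followed by a commit looks like through the XML update interface (the id value is illustrative); per François's point, the commit alone makes the deletions visible, and an occasional optimize merely reclaims disk space sooner:

<delete><id>d4f2a9e0-example-uuid</id></delete>
<commit/>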
Re: Why does paste get parsed into past?
What do you have set up for stemming? François On Jun 18, 2011, at 8:00 AM, Gabriele Kahlout wrote: Hello, Debugging query results I find that: <str name="querystring">paste</str> <str name="parsedquery">content:past</str> Now paste and past are two different words. Why does Solr not consider that, and how do I make it? -- Regards, K. Gabriele
Re: Why does paste get parsed into past?
What I meant was: what stemmer are you using? Maybe it is the stemmer that is cutting the 'e'. You can check that on the field analysis Solr web page. François On Jun 18, 2011, at 11:42 AM, Gabriele Kahlout wrote: I'm not sure where those are set, but on reflection I'd keep the default settings. My real issue is why query keywords are not treated as a set: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201106.mbox/%3CBANLkTikHunhyWc2WVTofRYU4ZW=c8oe...@mail.gmail.com%3E 2011/6/18 François Schiettecatte fschietteca...@gmail.com What do you have set up for stemming? François On Jun 18, 2011, at 8:00 AM, Gabriele Kahlout wrote: Hello, Debugging query results I find that: <str name="querystring">paste</str> <str name="parsedquery">content:past</str> Now paste and past are two different words. Why does Solr not consider that, and how do I make it? -- Regards, K. Gabriele
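If the stemmer does turn out to be the culprit and certain words must survive intact, one option is a protected-words file; a minimal sketch, assuming Solr 3.1+ where KeywordMarkerFilterFactory is available (protwords.txt ships with the example schema):

<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- tokens listed in protwords.txt (one per line, e.g. 'paste') are marked as keywords and skipped by the stemmer -->
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.PorterStemFilterFactory"/>
</analyzer>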
Re: Multiple indexes
Sure. François On Jun 18, 2011, at 2:25 PM, shacky wrote: 2011/6/15 Edoardo Tosca e.to...@sourcesense.com: Try to use multiple cores: http://wiki.apache.org/solr/CoreAdmin Can I do concurrent searches on multiple cores?
Re: Multiple indexes
You would need to run two independent searches and then 'join' the results. It is best not to apply a 'SQL' mindset to Solr when it comes to (de)normalization; whereas you strive for normalization in SQL, that is usually counter-productive in Solr. For example, I am working on a project with 30+ normalized tables, but only 4 cores. Perhaps describing what you are trying to achieve would give us greater insight and thus allow us to make a more concrete recommendation? Cheers François On Jun 18, 2011, at 2:36 PM, shacky wrote: On 18 June 2011 at 20:27, François Schiettecatte fschietteca...@gmail.com wrote: Sure. So I can have some searches similar to a JOIN in MySQL? The problem is that I need at least two tables in which to search data.
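As an aside: if the cores share a schema, one alternative to client-side joining is Solr's distributed search, which fans a single query out over several cores (host, port and core names below are illustrative):

http://localhost:8983/solr/core0/select?q=test&shards=localhost:8983/solr/core0,localhost:8983/solr/core1

Note that distributed search merges results from like-schema shards; it does not perform a relational join.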
Re: Performance loss - querying more than 64 cores (randomly)
I am assuming that you are running on Linux here; I have found atop to be very useful to see what is going on. http://freshmeat.net/projects/atop/ dstat is also very useful but needs a little more work to 'decode'. Obviously there is contention going on; you just need to figure out where it is. Most likely it is disk I/O, but it could also be the number of cores you have. Also, I would not say that performance is decreasing rapidly; it is probably more of a gentle slope down if you plot it (you double the number of cores every time). I would be very interested in hearing about what you find. Cheers François On Jun 16, 2011, at 10:00 AM, Andrzej Bialecki wrote: On 6/16/11 3:22 PM, Mark Schoy wrote: Hi, I set up a Solr instance with 512 cores. Each core has 100k documents and 15 fields. Solr is running on a CPU with 4 cores (2.7GHz) and 16GB RAM. Now I've done some benchmarks with JMeter. On each thread iteration JMeter queried another core at random. Here are the results (duration: 180 seconds each):

Randomly queried cores | queries per second
1 | 2016
2 | 2001
4 | 1978
8 | 1958
16 | 2047
32 | 1959
64 | 1879
128 | 1446
256 | 1009
512 | 428

Why are the queries per second constant up to 64 cores, and why does performance then decrease rapidly? Solr only uses 10GB of the 16GB of memory, so I think it is not a memory issue. This may be an OS-level disk buffer issue. With limited disk buffer space, the more random I/O occurs from different files, the higher the churn rate; and if the buffers are full then the churn rate may increase dramatically (and performance will drop). Modern OSes try to keep as much data in memory as possible, so the memory usage itself is not that informative - but check what the pagein/pageout rates are when you start hitting the 32 vs 64 cores. -- Best regards, Andrzej Bialecki http://www.sigram.com Contact: info at sigram dot com
Re: Strange behavior
I think you will need to provide more information than this; no-one on this list is omniscient AFAIK. François On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote: Hi. I've debugged search on the test machine; after copying the entire directory (the entire Solr directory) to the production server, I've noticed that one query (SDR S70EE K) does match on the test server but does not on production. How can that be?
Re: Solr Field name restrictions
Underscores and dashes are fine, but I would think that colons (:) are verboten. François On Jun 4, 2011, at 9:49 PM, Jamie Johnson wrote: Is there a list anywhere detailing field name restrictions? I imagine fields containing periods (.) are problematic if you try to use that field when doing faceted queries, but are there any others? Are underscores (_) or dashes (-) OK?
Re: synonyms problem
Are you sure solr.StrField is the way to go with this? solr.StrField stores the entire text verbatim and, I am pretty sure, skips any analysis. Perhaps you should use solr.TextField instead. François On Jun 2, 2011, at 2:28 AM, deniz wrote: Hi all, here is a piece from my config: <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> </analyzer> </fieldType> but somehow the synonyms are not read... I mean there is no match when I use a word from the synonyms file... any ideas? - Zeki ama calismiyor... Calissa yapar... -- View this message in context: http://lucene.472066.n3.nabble.com/synonyms-problem-tp3014006p3014006.html Sent from the Solr - User mailing list archive at Nabble.com.
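A minimal sketch of that suggestion, switching the type to solr.TextField so that an analyzer chain actually runs (the type name is illustrative); a common setup is to expand synonyms at index time and keep the query analyzer simple:

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>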
Re: how to request for Json object
This is not really an issue with Solr per se, and I have run into this before; you will need to read up on 'Access-Control-Allow-Origin', which needs to be set in the HTTP headers returned by the server your Ajax page is querying. Beware that not all browsers obey it, and Olivier is right when he suggested creating a proxy, which is what I did. François On Jun 2, 2011, at 3:27 AM, Romi wrote: How do I fetch JSON through Ajax when my Ajax pager is on one server (Tomcat) and the JSON object is on another server (the Solr server)? I mean, I have to make a request to another server; how can I do it? - Thanks Regards Romi -- View this message in context: http://lucene.472066.n3.nabble.com/how-to-request-for-Json-object-tp3014138p3014138.html Sent from the Solr - User mailing list archive at Nabble.com.
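For illustration, the server hosting the JSON (here, hypothetically, the Solr host or a proxy in front of it) would need to emit a response header along these lines; the origin value is an example:

Access-Control-Allow-Origin: http://myapp.example.com

A same-origin proxy that serves both the page and the Solr responses avoids the header, and the inconsistent browser support, altogether.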
Re: DIH: Exception with Too many connections
Hi You might also check the 'max_user_connections' setting if you have that set:

# Maximum number of connections, overall and per user
max_connections = 2048
max_user_connections = 2048

http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html Cheers François On May 31, 2011, at 7:39 AM, Stefan Matheis wrote: Tiffany, On Tue, May 31, 2011 at 12:45 PM, tiffany tiffany.c...@future.co.jp wrote: I executed the SHOW PROCESSLIST; command. (Is it what you mean? I've never tried it before...) Exactly this, yes :) On Tue, May 31, 2011 at 12:45 PM, tiffany tiffany.c...@future.co.jp wrote: So, if the number of threads in the process list is larger than max_connections, I would get the too many connections error. Am I thinking the right way? Yep, right. On Tue, May 31, 2011 at 12:45 PM, tiffany tiffany.c...@future.co.jp wrote: If it is right, maybe I should think of the commit timing, changing the number of max_connections, and/or some other ways... You may lift the allowed number of connections for the MySQL server? Or, of course - if possible - tweak your Solr settings, correct. Regards Stefan
Re: UniqueKey field in schema.xml
You concatenate the two keys into a single string, with some sort of delimiter between the two keys. François On May 26, 2011, at 6:05 AM, Romi wrote: What do you mean by combining the two fields customerID and ProductId? What I tried is: 1. make both fields unique, but that does not serve my purpose; 2. make a new field ID, copy both customerID and ProductId into ID using copyField, and then make ID the uniqueKey, but I got an error saying: Document specifies multiple unique ids - Romi -- View this message in context: http://lucene.472066.n3.nabble.com/UniqueKey-field-in-schema-xml-tp2987807p2988168.html Sent from the Solr - User mailing list archive at Nabble.com.
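The copyField attempt fails because copying two sources produces two values in what must be a single-valued uniqueKey field. A minimal sketch of the concatenation approach, using the field names from the thread and an underscore as the (arbitrary) delimiter; the client builds the id before posting the document, and schema.xml declares <uniqueKey>id</uniqueKey>:

<add>
  <doc>
    <!-- id is built client-side as customerID + "_" + ProductId -->
    <field name="id">CUST123_PROD456</field>
    <field name="customerID">CUST123</field>
    <field name="ProductId">PROD456</field>
  </doc>
</add>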