Re: rough maximum cores (shards) per machine?
I've tried (very simplistically) hitting a collection with a good variety of searches and looking at the collection's heap memory and working out the bytes/doc. I've seen results around 100 bytes/doc, and as low as 3 bytes/doc for collections with small docs. It's still a work in progress - not sure if it will scale with docs, or is too simplistic.

On 25 March 2015 at 17:49, Shai Erera ser...@gmail.com wrote:

While it's hard to answer this question because, as others have said, it depends, I think it will be good if we can quantify or assess the cost of running a SolrCore. For instance, let's say that a server can handle a load of 10M indexed documents (I omit search load on purpose for now) in a single SolrCore. Would the same server be able to handle the same number of documents if we indexed 1000 docs per SolrCore, in a total of 10,000 SolrCores? If the answer is no, then it means there is some cost that comes with each SolrCore, and we may at least be able to give an upper bound: on a server with X amount of storage, Y GB RAM and Z cores you can run up to maxSolrCores(X, Y, Z).

Another way to look at it: if I were to create empty SolrCores, would I be able to create an infinite number of cores if storage was infinite? Or do even empty cores have their toll on CPU and RAM? I know from the Lucene side of things that for each SolrCore (which carries a Lucene index) there is a toll to an index -- the lexicon, IW's RAM buffer, codecs that store things in memory etc. For instance, one downside of splitting a 10M-doc core into 10,000 cores is that the cost of holding the total lexicon (dictionary of indexed words) goes up drastically, since now every word (just the byte[] of the word) is potentially represented in memory 10,000 times.

What other RAM/CPU/storage costs does a SolrCore carry with it? There are the caches of course, which really depend on how many documents are indexed. Any other non-trivial or constant cost?

So yes, there isn't a single answer to this question. It's just like someone asking how many documents a single Lucene index can handle efficiently. But if we can come up with basic numbers as I outlined above, it might help people doing rough estimates. That doesn't mean people shouldn't benchmark, as that upper bound may be way too high for their data set, query workload and search needs. Shai

On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman dami...@gmail.com wrote: From my experience on a high-end server (256GB memory, 40-core CPU) testing collection numbers with one shard and two replicas, the maximum that would work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps half of that), depending on your startup-time requirements. (Though I have settled on a 6,000 collection maximum with some patching. See SOLR-7191.) You could create multiple clouds after that, and choose the least used cloud to create your collection. Regarding memory usage, I'd pencil in 6MB of overhead (no docs) in java heap per collection.

On 25 March 2015 at 13:46, Ian Rose ianr...@fullstory.com wrote: First off, thanks everyone for the very useful replies thus far. Shawn - thanks for the list of items to check. #1 and #2 should be fine for us and I'll check our ulimit for #3. To add a bit of clarification, we are indeed using SolrCloud. Our current setup is to create a new collection for each customer. For now we allow SolrCloud to decide for itself where to locate the initial shard(s), but in time we expect to refine this such that our system will automatically choose the least loaded nodes according to some metric(s).

Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Solr works well if there is an architect for the system.

Jack, can you explain a bit what you mean here? It looks like Toke caught your meaning but I'm afraid it missed me. What do you mean by business entity? Is your concern that with automatic creation of collections they will be distributed willy-nilly across the cluster, leading to uneven load across nodes? If it is relevant, the schema and solrconfig are controlled entirely by me and are the same for all collections. Thus theoretically we could actually just use one single collection for all of our customers (adding a 'customer:whatever' type fq to all queries), but since we never need to query across customers it seemed more performant (as well as safer - less chance of accidentally leaking data across customers) to use separate collections.

Better to give each tenant a separate Solr instance that you spin up and spin down based on demand.

Regarding this, if by tenant you mean customer, this is not viable for us from a cost perspective. As I mentioned initially, many of our customers are very small, so dedicating an entire machine to each of them would not be economical (or efficient).
Re: rough maximum cores (shards) per machine?
While it's hard to answer this question because, as others have said, it depends, I think it will be good if we can quantify or assess the cost of running a SolrCore. For instance, let's say that a server can handle a load of 10M indexed documents (I omit search load on purpose for now) in a single SolrCore. Would the same server be able to handle the same number of documents if we indexed 1000 docs per SolrCore, in a total of 10,000 SolrCores? If the answer is no, then it means there is some cost that comes with each SolrCore, and we may at least be able to give an upper bound: on a server with X amount of storage, Y GB RAM and Z cores you can run up to maxSolrCores(X, Y, Z).

Another way to look at it: if I were to create empty SolrCores, would I be able to create an infinite number of cores if storage was infinite? Or do even empty cores have their toll on CPU and RAM? I know from the Lucene side of things that for each SolrCore (which carries a Lucene index) there is a toll to an index -- the lexicon, IW's RAM buffer, codecs that store things in memory etc. For instance, one downside of splitting a 10M-doc core into 10,000 cores is that the cost of holding the total lexicon (dictionary of indexed words) goes up drastically, since now every word (just the byte[] of the word) is potentially represented in memory 10,000 times.

What other RAM/CPU/storage costs does a SolrCore carry with it? There are the caches of course, which really depend on how many documents are indexed. Any other non-trivial or constant cost?

So yes, there isn't a single answer to this question. It's just like someone asking how many documents a single Lucene index can handle efficiently. But if we can come up with basic numbers as I outlined above, it might help people doing rough estimates. That doesn't mean people shouldn't benchmark, as that upper bound may be way too high for their data set, query workload and search needs. Shai

On Wed, Mar 25, 2015 at 5:25 AM, Damien Kamerman dami...@gmail.com wrote: From my experience on a high-end server (256GB memory, 40-core CPU) testing collection numbers with one shard and two replicas, the maximum that would work is 3,000 cores (1,500 collections). I'd recommend much less (perhaps half of that), depending on your startup-time requirements. (Though I have settled on a 6,000 collection maximum with some patching. See SOLR-7191.) You could create multiple clouds after that, and choose the least used cloud to create your collection. Regarding memory usage, I'd pencil in 6MB of overhead (no docs) in java heap per collection.

On 25 March 2015 at 13:46, Ian Rose ianr...@fullstory.com wrote: First off, thanks everyone for the very useful replies thus far. Shawn - thanks for the list of items to check. #1 and #2 should be fine for us and I'll check our ulimit for #3. To add a bit of clarification, we are indeed using SolrCloud. Our current setup is to create a new collection for each customer. For now we allow SolrCloud to decide for itself where to locate the initial shard(s), but in time we expect to refine this such that our system will automatically choose the least loaded nodes according to some metric(s).

Having more than one business entity controlling the configuration of a single (Solr) server is a recipe for disaster. Solr works well if there is an architect for the system.

Jack, can you explain a bit what you mean here? It looks like Toke caught your meaning but I'm afraid it missed me. What do you mean by business entity? Is your concern that with automatic creation of collections they will be distributed willy-nilly across the cluster, leading to uneven load across nodes? If it is relevant, the schema and solrconfig are controlled entirely by me and are the same for all collections. Thus theoretically we could actually just use one single collection for all of our customers (adding a 'customer:whatever' type fq to all queries), but since we never need to query across customers it seemed more performant (as well as safer - less chance of accidentally leaking data across customers) to use separate collections.

Better to give each tenant a separate Solr instance that you spin up and spin down based on demand.

Regarding this, if by tenant you mean customer, this is not viable for us from a cost perspective. As I mentioned initially, many of our customers are very small, so dedicating an entire machine to each of them would not be economical (or efficient). Or perhaps I am not understanding what your definition of tenant is? Cheers, Ian

On Tue, Mar 24, 2015 at 4:51 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote: Jack Krupansky [jack.krupan...@gmail.com] wrote: I'm sure that I am quite unqualified to describe his hypothetical setup. I mean, he's the one using the term multi-tenancy, so it's for him to be clear.

It was my understanding that Ian
Re: Using G1 with Apache Solr
Interesting nonetheless, Shawn :) We use G1GC on our servers. We were on Java 7 (64-bit, RHEL6), but are trying to migrate to Java 8 (which seems to cause more GC issues, so we clearly need to tweak our settings). Will investigate 8u40 though.

On 25 March 2015 at 04:23, Shawn Heisey apa...@elyograg.org wrote: On 3/24/2015 9:52 PM, Shawn Heisey wrote: On 3/24/2015 3:48 PM, Kamran Khawaja wrote: I'm running Solr 4.7.2 with Java 7u75 with the following JVM params:

I really got my wires crossed. Kamran sent his message to the hotspot-gc-use mailing list, not the solr-user list! Thanks, Shawn
Re: Custom TokenFilter
Thanks Erick, I'm working on Solr 4.10.2 and all my dependency jars seem to be compatible with this version. I can't figure out which one causes this issue. Thanks, Regards,

On Tuesday, 24 March 2015 at 23:45, Erick Erickson erickerick...@gmail.com wrote:

bq: 13 more Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.

This usually means you have jar files from different versions of Solr in your classpath. Best, Erick

On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote: Hi there, I'm trying to create my own TokenizerFactory (from tamingtext's book). After setting schema.xml and adding the path in solrconfig.xml, I start Solr. I have this error message:

Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)
at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
... 13 more
Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
at java.lang.Class.asSubclass(Class.java:3208)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)

Someone can help? Thanks. Regards.
Re: rough maximum cores (shards) per machine?
On Wed, 2015-03-25 at 03:46 +0100, Ian Rose wrote: Thus theoretically we could actually just use one single collection for all of our customers (adding a 'customer:whatever' type fq to all queries) but since we never need to query across customers it seemed more performant (as well as safer - less chance of accidentally leaking data across customers) to use separate collections.

If only a few customers are active at a given time, it is more performant to use a collection per customer. If many of them are active, the more performant option is to lump them together and filter on a field, due to the redundancy reduction of larger indexes. The 1 collection/customer solution has another edge, as ranking will be calculated based on the corpus of the customer and not based on all customers. If the number of customers is low enough to get the individual-collections solution to work, that would be the preferable solution. - Toke Eskildsen, State and University Library, Denmark
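To make the lumped-together option concrete, here is a minimal sketch of the filter-field approach Toke describes; the host, collection name and customer value are illustrative, not from this thread:

# One shared collection; each request is restricted to a single customer
# via a filter query. Solr caches fq clauses in the filterCache
# independently of q, so a repeated per-customer filter is cheap.
curl 'http://localhost:8983/solr/shared/select?q=some+search+terms&fq=customer:acme&wt=json'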
Re: rough maximum cores (shards) per machine?
In one of our production environments we use 32GB, 4-core, 3TB RAID0 spinning-disk Dell servers (I do not remember the exact model). We have about 25 collections with 2 replicas (shard instances) per collection on each machine, across 25 machines. In total: 25 coll * 2 replicas/coll/machine * 25 machines = 1250 replicas. Each replica contains about 800 million pretty small documents - that's about 1 trillion documents all in all. We index about 1.5 billion new documents every day (mainly into one of the collections = 50 replicas across 25 machines) and keep a history of 2 years on the data, shifting the indexing into a new collection every month. We can fairly easily keep up with the indexing load. We have almost none of the data on the heap, but of course a small fraction of the data in the files will at any time be in the OS file cache.

Compared to our indexing frequency we do not do a lot of searches. We have about 10 users searching the system from time to time - anything from major extracts to small quick searches. Depending on the nature of the search we have response times between 1 sec and 5 min. But of course that is very dependent on clever choices for each field w.r.t. index, store, docValues etc. BUT we are not using out-of-the-box Apache Solr. We have made quite a lot of performance tweaks ourselves. Please note that, even though you disable all Solr caches, each replica will use heap memory linearly dependent on the number of documents (and their size) in that replica. But not much, so you can get pretty far with relatively little RAM. Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it did not get worse in newer releases. Just to give you some idea of what can at least be achieved - in the high end of #replicas and #docs, I guess.

Regards, Per Steffensen

On 24/03/15 14:02, Ian Rose wrote: Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems.
* I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it.

Thanks! - Ian
Re: Data indexing is going too slow on single shard Why?
On 3/25/2015 5:03 AM, Nitin Solanki wrote: Please can anyone assist me? I am indexing on a single shard and it is taking too much time to index data. I am indexing around 49GB of data on the single shard. What's wrong? Why is Solr taking so much time to index data? Earlier I was indexing the same data on 8 shards. That time, it was fast compared to a single shard. Why so? Any help please..

There's practically no information to go on here, so about all I can offer is general information in return: http://wiki.apache.org/solr/SolrPerformanceProblems

I looked over the previous messages that you have sent the list, and I can find very little of the required information about your index. I see a lot of questions from you, but they did not include the kind of details needed here:

How much total RAM is in each Solr server? Are there any other programs on the server with significant RAM requirements? An example of such a program would be a database server.

On each server, how much memory is dedicated to the java heap(s) for Solr? I gather from other questions that you are running SolrCloud, can you confirm?

On a per-server basis, how much disk space do all the index replicas take? How many documents are on each server? Note that for disk space and number of documents, I am asking you to count every replica, not take the total in the collection and divide it by the number of servers.

How are you doing your indexing? For this question, I am asking what program or Solr API is actually sending the data to Solr. Possible answers include the dataimport handler, a SolrJ program, one of the other Solr APIs such as a PHP client, and hand-crafted URLs with an HTTP client.

Thanks, Shawn
Re: rough maximum cores (shards) per machine?
Per - Wow, 1 trillion documents stored is pretty impressive. One clarification: when you say that you have 2 replicas per collection on each machine, what exactly does that mean? Do you mean that each collection is sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards per machine)? Or are some of these slave replicas (e.g. 25x sharding with 1 replica per shard)? Thanks!

On Wed, Mar 25, 2015 at 5:13 AM, Per Steffensen st...@designware.dk wrote: In one of our production environments we use 32GB, 4-core, 3TB RAID0 spinning-disk Dell servers (I do not remember the exact model). We have about 25 collections with 2 replicas (shard instances) per collection on each machine, across 25 machines. In total: 25 coll * 2 replicas/coll/machine * 25 machines = 1250 replicas. Each replica contains about 800 million pretty small documents - that's about 1 trillion documents all in all. We index about 1.5 billion new documents every day (mainly into one of the collections = 50 replicas across 25 machines) and keep a history of 2 years on the data, shifting the indexing into a new collection every month. We can fairly easily keep up with the indexing load. We have almost none of the data on the heap, but of course a small fraction of the data in the files will at any time be in the OS file cache.

Compared to our indexing frequency we do not do a lot of searches. We have about 10 users searching the system from time to time - anything from major extracts to small quick searches. Depending on the nature of the search we have response times between 1 sec and 5 min. But of course that is very dependent on clever choices for each field w.r.t. index, store, docValues etc. BUT we are not using out-of-the-box Apache Solr. We have made quite a lot of performance tweaks ourselves. Please note that, even though you disable all Solr caches, each replica will use heap memory linearly dependent on the number of documents (and their size) in that replica. But not much, so you can get pretty far with relatively little RAM. Our version of Solr is based on Apache Solr 4.4.0, but I expect/hope it did not get worse in newer releases. Just to give you some idea of what can at least be achieved - in the high end of #replicas and #docs, I guess.

Regards, Per Steffensen

On 24/03/15 14:02, Ian Rose wrote: Hi all - I'm sure this topic has been covered before but I was unable to find any clear references online or in the mailing list. Are there any rules of thumb for how many cores (aka shards, since I am using SolrCloud) is too many for one machine? I realize there is no one answer (depends on size of the machine, etc.) so I'm just looking for a rough idea. Something like the following would be very useful:

* People commonly run up to X cores/shards on a mid-sized (4 or 8 core) server without any problems.
* I have never heard of anyone successfully running X cores/shards on a single machine, even if you throw a lot of hardware at it.

Thanks! - Ian
Re: Data indexing is going too slow on single shard Why?
Hello, *Updating my question again.* Please can anyone assist me? I am indexing on a single shard and it is taking too much time to index data. I am indexing around 49GB of data on the single shard. What's wrong? Why is Solr taking so much time to index data? Earlier I was indexing the same data on 8 shards. That time, it was fast compared to a single shard. Why so? Any help please.. *HardCommit - 15 sec* *SoftCommit - 10 min.* ii) Searching a query/term is also taking too much time. Any help on this also.

On Wed, Mar 25, 2015 at 4:33 PM, Nitin Solanki nitinml...@gmail.com wrote: Hello, Please can anyone assist me? I am indexing on a single shard and it is taking too much time to index data. I am indexing around 49GB of data on the single shard. What's wrong? Why is Solr taking so much time to index data? Earlier I was indexing the same data on 8 shards. That time, it was fast compared to a single shard. Why so? Any help please.. *HardCommit - 15 sec* *SoftCommit - 10 min.* Best, Nitin
Re: rough maximum cores (shards) per machine?
On 25/03/15 15:03, Ian Rose wrote: Per - Wow, 1 trillion documents stored is pretty impressive. One clarification: when you say that you have 2 replicas per collection on each machine, what exactly does that mean? Do you mean that each collection is sharded into 50 shards, divided evenly over all 25 machines (thus 2 shards per machine)?

Yes

Or are some of these slave replicas (e.g. 25x sharding with 1 replica per shard)?

No replication. It does not work very well, at least in 4.4.0. Besides that, I am not a big fan of two (or more) machines having to do all the indexing work and making sure to keep synchronized. Use a distributed file system supporting multiple copies of every piece of data (like HDFS) for HA on the data level. Have only one Solr node handle the indexing into a particular shard - if this Solr node breaks down, let another Solr node take over the indexing leadership on this shard. Besides the indexing Solr node, several other Solr nodes can serve data from this shard - just watching the data folder (and commits) done by the indexing leader of this particular shard - and that will give you HA on the service level. That is probably how we are going to do HA - pretty soon. But that is another story.

Thanks! No problem
Information Retrieval/Text Mining opportunity @ GE Research Data Mining Labs, Bangalore
I have loved working on Solr, so I thought of posting an Information Retrieval/Text Mining requirement that we have for our GE Data Mining Research Labs @ Bangalore. Apologies if it is considered inappropriate here. Here goes the job description for those interested: If Information Retrieval, Text Mining, Natural Language Processing and Machine Learning fascinate you; if you are excited to research and build state-of-the-art algorithms working on massive data sets for an array of Text Mining problems (Search, Named Entity Recognition, Semantic Graphs, Sentiments, Spell Correction, Text Categorization, Clustering, Topic Modelling and so on…) then GE Global Research Data Mining Labs in Bangalore is looking out for you. The real scope of applied research in our lab goes way beyond the term “Natural” in Natural Language Processing. Do connect if you need more information. Even someone who has limited or no experience with the areas mentioned above, but is passionate about Information Retrieval/Text Mining and has a rock-solid background in algorithms, is encouraged to apply/connect. Check out more on GE Research: http://www.geglobalresearch.com/ Cheers, Yavar Husain Lead Data Scientist - Text Mining Laboratory GE Research, Bangalore LinkedIn: http://www.linkedin.com/pub/yavar-husain/5/805/151 Text@ yavarhus...@gmail.com
Re: Has anyone used Automatic Phrase Tokenization (AutoPhrasingTokenFilterFactory)?
Hi, I am a .NET developer, but I need to use Solr and specifically this good plugin, AutoPhrasingTokenFilter. I searched everywhere and couldn't find useful information; can anyone help me run it in Solr 5.0 or even previous versions? I am not able to add it to my Solr; it throws the error below when I put the lib folder, which also contains my jar files for the AutoPhrasingTokenFilter, under the core.

Error: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: JVM Error creating core [gettingstarted_shard1_replica1]: class org.apache.lucene.codecs.diskdv.DiskDocValuesFormat$1 cannot access its superclass org.apache.lucene.codecs.lucene45.Lucene45DocValuesConsumer
Sorting and Rerank
If I do an initial search without any field sorting, and then do the exact same query but also sort on one field, will I get the same result set in the subsequent query, just sorted? In other words, does simply applying a sort criterion affect the re-rank on the full search, or does it just sort the result from the main query?
Solr Monitoring - Stored Stats?
Hello, I am familiar with the JMX points that Solr exposes to allow for monitoring of statistics like QPS, numdocs, Average Query Time... I am wondering if there is a way to configure Solr to automatically store the value of these stats over time (for a given time interval), and then allow a user to query a stat over a time range. So for the QPS stat, the query might return a set that includes the QPS value for each hour in the time range specified. Thanks, Matt
Optimize SolrCloud without downtime
Hi, I didn't find the answer yet, please help. We have a standalone Solr 5.0.0 with a few cores so far. One of those cores contains: numDocs: 120M, deletedDocs: 110M. Our data change frequently; that's why there are so many deletedDocs. The optimized core takes around 50GB on disk; we are now at almost 100GB and I'm looking for the best solution for how to optimize this huge core without downtime. I know optimization works in the background, but anyway, when the optimization is running our search system is slow and sometimes I receive errors - this behavior is like downtime for us.

I would like to switch to SolrCloud. Performance is not an issue, so I don't need the sharding feature at this time. I'm more interested in replication and distributing requests via an Nginx proxy. The idea is:

1) proxy forwards requests to node1 and we optimize cores on node2
2) proxy forwards requests to node2 and we optimize cores on node1

But when I do an optimize on node2, node1 is doing the optimization as well, even if I use distrib=false with curl. Can you please recommend an architecture for optimizing without downtime? Many thanks. Pavel
Replica and node states
Hi, Is it possible for a replica to be DOWN while the node it resides on is under /live_nodes? If so, what can lead to it, aside from someone unloading a core? I don't know if each SolrCore reports status to ZK independently, or if it's done by the Solr process as a whole. Also, is it possible for a replica to report ACTIVE while the node it lives on is no longer under /live_nodes? Are there any ZK timings that can cause that? Shai
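One way to observe both pieces of state in a single snapshot is the Collections API CLUSTERSTATUS call (available since 4.8); a rough sketch, with the host as a placeholder:

# live_nodes comes from each node's ephemeral ZK registration, while each
# replica's state field is published separately to the cluster state, so
# (as far as I understand) the two can disagree for a window of time.
# This returns both in one response for comparison:
curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'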
Re: Data indexing is going too slow on single shard Why?
Hi Shawn, Sorry for leaving out all those details.

Server configuration: 8 CPUs, 32 GB RAM, O.S. - Linux

*Earlier*, I was using 8 shards without replicas (default of 1) using SolrCloud. On the server, only Solr is running; there are no other applications running. The Java heap is set to 4096 MB in Solr. While indexing, Solr (sometimes) eats up the whole RAM. I don't know how much RAM each Solr server takes. Each server was taking around 50 GB of data (indexed). Actually, I deleted the previous Solr architecture, so I don't have any idea how many documents were on each shard, and I also don't know the total number of documents.

*Currently*, I have 1 shard with 2 replicas using SolrCloud. Data size:

102G solr/node1/solr/wikingram_shard1_replica2
102G solr/node2/solr/wikingram_shard1_replica1

I am running a Python script to index data using the Solr REST API, sending 2 documents per request. If I missed anything related to Solr, please inform me. Thanks Shawn, waiting for your reply.

On Wed, Mar 25, 2015 at 7:33 PM, Shawn Heisey apa...@elyograg.org wrote: On 3/25/2015 5:03 AM, Nitin Solanki wrote: Please can anyone assist me? I am indexing on a single shard and it is taking too much time to index data. I am indexing around 49GB of data on the single shard. What's wrong? Why is Solr taking so much time to index data? Earlier I was indexing the same data on 8 shards. That time, it was fast compared to a single shard. Why so? Any help please..

There's practically no information to go on here, so about all I can offer is general information in return: http://wiki.apache.org/solr/SolrPerformanceProblems

I looked over the previous messages that you have sent the list, and I can find very little of the required information about your index. I see a lot of questions from you, but they did not include the kind of details needed here: How much total RAM is in each Solr server? Are there any other programs on the server with significant RAM requirements? An example of such a program would be a database server. On each server, how much memory is dedicated to the java heap(s) for Solr? I gather from other questions that you are running SolrCloud, can you confirm? On a per-server basis, how much disk space do all the index replicas take? How many documents are on each server? Note that for disk space and number of documents, I am asking you to count every replica, not take the total in the collection and divide it by the number of servers. How are you doing your indexing? For this question, I am asking what program or Solr API is actually sending the data to Solr. Possible answers include the dataimport handler, a SolrJ program, one of the other Solr APIs such as a PHP client, and hand-crafted URLs with an HTTP client.

Thanks, Shawn
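Sending only 2 documents per request is a likely bottleneck on its own; a rough sketch of batching over plain HTTP, using the wikingram collection named above (the field name and host are made up):

# Send documents in large batches rather than 2 at a time, and avoid an
# explicit commit per request; commitWithin lets Solr fold the commit
# into its normal schedule.
curl 'http://localhost:8983/solr/wikingram/update?commitWithin=60000' \
  -H 'Content-Type: application/json' \
  -d '[{"id":"1","text":"first doc"},{"id":"2","text":"second doc"}]'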
Re: Sorting and Rerank
Hi, You're right. Those result sets are the same as each other; only the document order is different. Koji

On 2015/03/26 0:53, innoculou wrote: If I do an initial search without any field sorting, and then do the exact same query but also sort on one field, will I get the same result set in the subsequent query, just sorted? In other words, does simply applying a sort criterion affect the re-rank on the full search, or does it just sort the result from the main query?
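This is easy to verify on any index: run the pair of queries and compare numFound and the id lists; only the order should change. Core and field names here are placeholders:

# Same query, same result set; the second differs only in ordering.
curl 'http://localhost:8983/solr/mycore/select?q=title:test&fl=id&rows=100&wt=json'
curl 'http://localhost:8983/solr/mycore/select?q=title:test&fl=id&rows=100&sort=price+asc&wt=json'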
Re: Unable to setup solr cloud with multiple collections.
You're still mixing master/slave with SolrCloud. Do _not_ reconfigure the replication. If you want your core (we call them replicas in SolrCloud) to appear on various nodes in your cluster, either create the collection with the nodes specified (createNodeSet) or, once the collection is created on any node (or set of nodes), do an ADDREPLICA (again with the Collections API) where you want replicas to appear. The rest is automatic, i.e. the replica's index will be copied from the leader, all updates will be forwarded etc., without you doing any other configuration. I think you're shooting yourself in the foot by trying to fiddle with replication. Or I misunderstand your problem entirely. Best, Erick

On Tue, Mar 24, 2015 at 8:09 PM, sthita sthit...@gmail.com wrote: Thanks Erick for your reply. I am trying to create a new core, i.e. dict_cn, which is totally different in terms of index data, configs etc. from the existing core abc. The core is created successfully on my master (i.e. mail) and I can run Solr queries on this newly created core. All the config files (schema.xml and solrconfig.xml) are on the mail server and ZooKeeper helps me share all config files with the other collections. I did a similar setup on the other collection, so that the newly created core should be available to all the collections, but it is still showing down.
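As a sketch, the two Collections API calls Erick mentions look roughly like this, reusing the dict_cn name from this thread; the node names, config name and shard count are placeholders:

# Create the collection on an explicit set of nodes:
curl 'http://mail:8983/solr/admin/collections?action=CREATE&name=dict_cn&numShards=1&replicationFactor=1&collection.configName=dict_cn&createNodeSet=mail:8983_solr'

# Later, add a replica on another node; SolrCloud copies the index from
# the leader and forwards updates automatically:
curl 'http://mail:8983/solr/admin/collections?action=ADDREPLICA&collection=dict_cn&shard=shard1&node=other-node:8983_solr'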
Re: Custom TokenFilter
Images don't come through the mailing list; we can't see your image. Whether or not all the jars in the directory you're working on are consistent is the least of your problems. Are the libs to be found in any _other_ place specified on your classpath? Best, Erick

On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote: Thanks Erick, I'm working on Solr 4.10.2 and all my dependency jars seem to be compatible with this version. [image: Inline image] I can't figure out which one causes this issue. Thanks, Regards,

On Tuesday, 24 March 2015 at 23:45, Erick Erickson erickerick...@gmail.com wrote: bq: 13 more Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr. This usually means you have jar files from different versions of Solr in your classpath. Best, Erick

On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote: Hi there, I'm trying to create my own TokenizerFactory (from tamingtext's book). After setting schema.xml and adding the path in solrconfig.xml, I start Solr. I have this error message:

Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:166)
at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType text: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
... 13 more
Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
at java.lang.Class.asSubclass(Class.java:3208)
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)

Someone can help? Thanks. Regards.
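A quick way to check for the mixed-version situation Erick describes is to list every Lucene/Solr jar visible to the core; the path below is a placeholder for wherever your install and lib directories live:

# Two different versions of the same artifact in this list is the usual
# cause of this kind of ClassCastException.
find /path/to/solr -name 'lucene-*.jar' -o -name 'solr-*.jar' | sort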
Re: Setting up SOLR 5 from an RPM
On 3/25/2015 5:49 AM, Tom Evans wrote: On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans tevans...@googlemail.com wrote: Hi all, We're migrating to SOLR 5 (from 4.8), and our infrastructure guys would prefer we installed SOLR from an RPM rather than extracting the tarball where we need it. They are creating the RPM file themselves, and it installs an init.d script and the equivalent of the tarball to /opt/solr. We're having problems running SOLR from the installed files, as SOLR wants to (I think) extract the WAR file and create various temporary files below /opt/solr/server. From the SOLR 5 reference guide, section Managing SOLR, sub-section Taking SOLR to production, it seems changing the ownership of the installed files to the user that will run SOLR is an explicit requirement if you do not wish to run as root. It would be better if this were not required. With most applications you do not normally require permission to modify the installed files in order to run the application; e.g. I do not need write permission to /usr/share/vim to run vim. It is a shame I need write permission to /opt/solr to run solr.

I think you will only need to change the ownership of the solr home and the location where the .war file is extracted, which by default is server/solr-webapp. The user must be able to *read* the program data, but should not need to write to it. If you are using the start script included with Solr 5 and one of the examples, I believe the logging destination will also be located under the solr home, but you should make sure that's the case. Thanks, Shawn
Re: Optimize SolrCloud without downtime
That's a high number of deleted documents as a percentage of your index! Or at least I find those numbers surprising. When segments are merged in the background during normal indexing, quite a bit of weight is given to segments that have a high percentage of deleted docs. I usually see at most 10-20% of docs deleted. So what kinds of things have you done to get into this state? Did you optimize previously? Change the merge policy? Anything else? Best, Erick

On Wed, Mar 25, 2015 at 8:08 AM, pavelhladik pavel.hla...@profimedia.cz wrote: Hi, I didn't find the answer yet, please help. We have a standalone Solr 5.0.0 with a few cores so far. One of those cores contains: numDocs: 120M, deletedDocs: 110M. Our data change frequently; that's why there are so many deletedDocs. The optimized core takes around 50GB on disk; we are now at almost 100GB and I'm looking for the best solution for how to optimize this huge core without downtime. I know optimization works in the background, but anyway, when the optimization is running our search system is slow and sometimes I receive errors - this behavior is like downtime for us. I would like to switch to SolrCloud. Performance is not an issue, so I don't need the sharding feature at this time. I'm more interested in replication and distributing requests via an Nginx proxy. The idea is: 1) proxy forwards requests to node1 and we optimize cores on node2 2) proxy forwards requests to node2 and we optimize cores on node1 But when I do an optimize on node2, node1 is doing the optimization as well, even if I use distrib=false with curl. Can you please recommend an architecture for optimizing without downtime? Many thanks. Pavel
Re: Optimize SolrCloud without downtime
On 3/25/2015 9:08 AM, pavelhladik wrote: Our data change frequently; that's why there are so many deletedDocs. The optimized core takes around 50GB on disk; we are now at almost 100GB and I'm looking for the best solution for how to optimize this huge core without downtime. I know optimization works in the background, but anyway, when the optimization is running our search system is slow and sometimes I receive errors - this behavior is like downtime for us. I would like to switch to SolrCloud. Performance is not an issue, so I don't need the sharding feature at this time. I'm more interested in replication and distributing requests via an Nginx proxy. The idea is: 1) proxy forwards requests to node1 and we optimize cores on node2 2) proxy forwards requests to node2 and we optimize cores on node1 But when I do an optimize on node2, node1 is doing the optimization as well, even if I use distrib=false with curl.

You are correct - with SolrCloud, any optimize command will optimize the entire collection, one shard replica at a time, regardless of any distrib parameter. It does NOT optimize multiple replicas or shards in parallel. I thought we had an issue in Jira asking to make optimize honor a distrib=false parameter, but I can't find it. Even if that were fixed, it would not help you, because SolrCloud is only optimizing one shard replica at any given moment.

Optimization does NOT directly result in downtime ... but because optimize generates a very large amount of disk I/O, it can be disruptive if your server does not have enough resources. I don't have enough information to say for sure, but I am betting that you don't have enough RAM in your machine to effectively cache your index, so anything that negatively affects performance, like an optimize, is too much for your server to handle at the same time as ongoing queries or indexing. The info on this wiki page can help you determine how much total RAM you might need: http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks, Shawn
Re: Optimize SolrCloud without downtime
bq: It does NOT optimize multiple replicas or shards in parallel.

This behavior was changed in 4.10 though, see: https://issues.apache.org/jira/browse/SOLR-6264 So with 5.0 Pavel is seeing the result of that JIRA, I bet. I have to agree with Shawn: the optimization step should proceed invisibly in the background, so I suspect you have something else going on here. FWIW, Erick

On Wed, Mar 25, 2015 at 9:54 AM, Shawn Heisey apa...@elyograg.org wrote: On 3/25/2015 9:08 AM, pavelhladik wrote: Our data change frequently; that's why there are so many deletedDocs. The optimized core takes around 50GB on disk; we are now at almost 100GB and I'm looking for the best solution for how to optimize this huge core without downtime. I know optimization works in the background, but anyway, when the optimization is running our search system is slow and sometimes I receive errors - this behavior is like downtime for us. I would like to switch to SolrCloud. Performance is not an issue, so I don't need the sharding feature at this time. I'm more interested in replication and distributing requests via an Nginx proxy. The idea is: 1) proxy forwards requests to node1 and we optimize cores on node2 2) proxy forwards requests to node2 and we optimize cores on node1 But when I do an optimize on node2, node1 is doing the optimization as well, even if I use distrib=false with curl.

You are correct - with SolrCloud, any optimize command will optimize the entire collection, one shard replica at a time, regardless of any distrib parameter. It does NOT optimize multiple replicas or shards in parallel. I thought we had an issue in Jira asking to make optimize honor a distrib=false parameter, but I can't find it. Even if that were fixed, it would not help you, because SolrCloud is only optimizing one shard replica at any given moment.

Optimization does NOT directly result in downtime ... but because optimize generates a very large amount of disk I/O, it can be disruptive if your server does not have enough resources. I don't have enough information to say for sure, but I am betting that you don't have enough RAM in your machine to effectively cache your index, so anything that negatively affects performance, like an optimize, is too much for your server to handle at the same time as ongoing queries or indexing. The info on this wiki page can help you determine how much total RAM you might need: http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks, Shawn
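For reference, the per-core optimize call the thread is discussing looks roughly like this (host and core name are placeholders); note that per the discussion above, on SolrCloud 4.10+ the command is distributed across the whole collection regardless of where you send it:

# waitSearcher=false returns immediately instead of blocking until the
# optimized index is searchable; maxSegments controls how far to merge.
curl 'http://node2:8983/solr/mycore/update?optimize=true&maxSegments=1&waitSearcher=false'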
Re: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces
Yeah, this is a head scratcher. But it _has_ to be that way for things like edismax to work, where you mix and match fielded and un-fielded terms. I.e. I can have a query like q=field1:whatever some more stuff&qf=field2,field3,field4 where I want "whatever" to be evaluated only against field1, but the remaining terms to be searched for in the three other fields. The deal is that how you want _individual terms_ handled at index time may be different than at query time; WordDelimiterFilterFactory and SynonymFilterFactory are prime examples of this. Getting my head around why field analysis is completely different from query _parsing_ took me a while. But the fact that both are "query" is confusing; I'm just not sure what would be better, since they're very closely related - they just both deal with queries, just at different times.

Missed the wildcards; you're right, you need to escape. Or use the prefix query parser. It'd look like: q={!prefix f=proj_name_sort}CR610070 An and no escaping is necessary. If you add debug=query to a query using the prefix parser, you see that there's an implied trailing *. Do be aware, though, that there is _no_ analysis done, so things like lowercasing would have to be done by the app. Neither one is more correct; in fact I believe that the wildcard query becomes a prefix query eventually. It's strictly a matter of how you want to deal with that in the app. Best, Erick

On Wed, Mar 25, 2015 at 10:04 AM, Vadim Gorlovetsky vadim...@amdocs.com wrote: Thanks for a quick response. A bit confusing that the query-type analyzer configured to use KeywordTokenizerFactory does not un-tokenize the query criteria. I guess whitespace is the only special case, because it separates phrases in a query and is handled prior to analyzing. Actually I am handling the query the way you recommended: double quotes for exact matching and escaped whitespace for values with wildcards (double quotes do not work, probably because the * wildcard is considered part of the criteria value). Thanks Vadim

-----Original Message----- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, March 25, 2015 6:34 PM To: solr-user@lucene.apache.org Subject: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

This is a _very_ common thing we all had to learn; what you're seeing is the result of the _query parser_, not the analysis chain. Anything like proj_name_sort:term1 term2 gets split at the query parser level; attaching debug=query to the URL should show, down in the parsed query section, something like: proj_name_sort:term1 default_search_field:term2. To get things through the query parser, enclose them in double quotes, escape the space, and such. That'll get the terms _as a single token_ to the analysis chain for that field, where the behavior will be what you expect. Best, Erick

On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky vadim...@amdocs.com wrote: Hello, solr.KeywordTokenizerFactory seems to split by whitespace, though according to the SOLR documentation it shouldn't do that. For example, I have the following configuration for the fields proj_name and proj_name_sort:

<field name="proj_name" type="sortable_text_general" indexed="true" stored="true"/>
<field name="proj_name_sort" type="string_sort" indexed="true" stored="false"/>
..
<copyField source="proj_name" dest="proj_name_sort"/>
..
<fieldType name="string_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire input string is preserved as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be useful when you want your sorting to be case insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

There are 3 indexed documents having the respective field values for proj_name: Test1008, CR610070 Test1, CR610070 Another Test2. Searching on proj_name_sort gives me the following results:

Query | Expected | Real | Comments
proj_name_sort : CR610070 Test1 | CR610070 Test1 | CR610070 Test1 | Expected, as it seems to search the exact un-tokenized value
proj_name_sort : CR610070 Te | None | None | Expected, as it seems to search the exact un-tokenized value
proj_name_sort : CR610070 Te* | CR610070 Test1 | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems to split into tokens by whitespace?
proj_name_sort : CR610070 An* | CR610070 Another Test2 | CR610070 Another Test2 | Expected, as it seems to apply the wildcard to the un-tokenized value
proj_name_sort : CR610070 Another Te* | CR610070 Another Test2 | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems to split into tokens by whitespace?
proj_name_sort : CR610070 Another Test1* | None | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems to split into tokens by whitespace?
Re: Solr Monitoring - Stored Stats?
Matt: Not really. There are a bunch of third-party log analysis tools that give much of this information (not everything exposed by JMX is in the log files, of course). Not quite sure whether things like Nagios, Zabbix and the like have this kind of stuff built in; it seems like a natural extension of those kinds of tools though. Not much help here... Erick

On Wed, Mar 25, 2015 at 8:26 AM, Matt Kuiper matt.kui...@issinc.com wrote: Hello, I am familiar with the JMX points that Solr exposes to allow for monitoring of statistics like QPS, numdocs, Average Query Time... I am wondering if there is a way to configure Solr to automatically store the value of these stats over time (for a given time interval), and then allow a user to query a stat over a time range. So for the QPS stat, the query might return a set that includes the QPS value for each hour in the time range specified. Thanks, Matt
German Compound Splitter words.fst causing problems.
Hello, Chris Morley here, of Wayfair.com. I am working on the German compound splitter by Dawid Weiss. I tried to upgrade the words.fst file that comes with the German compound splitter using Solr 3.5, but it doesn't work. Below is the IndexNotFoundException that I get.

cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp lucene/build/lucene-core-3.5-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader wordsFst
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
at org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
at org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)

The reason I'm attempting this at all is the answer here, http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7, which says to do the upgrade in a two-step process: first using Solr 3.5, and then the latest Solr version (4.10.3). When I try this while running the unit tests for my modified German compound splitter, I'm getting this same type of error. The thing is, this is an FST, not an index, which is a little confusing. The reason why I'm following this answer, though, is because I'm getting that exact same message when trying to build the (modified) project with maven, at the point at which it tries to load in words.fst. Below.

[main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter - Format version is not supported (resource: com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0 (needs to be between 3 and 4). This version of Lucene only supports indexes created with release 3.0 and later. Failed to initialize static data structures for German compound splitter.

Thanks, -Chris.
Re: Data indexing is going too slow on single shard Why?
On 3/25/2015 8:42 AM, Nitin Solanki wrote: Server configuration: 8 CPUs. 32 GB RAM O.S. - Linux snip are running. Java heap set to 4096 MB in Solr. While indexing, snip *Currently*, I have 1 shard with 2 replicas using SOLR CLOUD. Data Size: 102Gsolr/node1/solr/wikingram_shard1_replica2 102Gsolr/node2/solr/wikingram_shard1_replica1 If both of those are on the same machine, I'm guessing that you're running two Solr instances on that machine, so there's 8GB of RAM used for Java. That means you have about 24 GB of RAM left for caching ... and 200GB of index data to cache. 24GB is not enough to cache 200GB of index. If there is only one Solr instance (leaving 28GB for caching) with 102GB of data on the machine, it still might not be enough. See that SolrPerformanceProblems wiki page I linked in my earlier email. For 102GB of data per server, I recommend at least 64GB of total RAM, preferably 128GB. For 204GB of data per server, I recommend at least 128GB of total RAM, preferably 256GB. Thanks, Shawn
KeywordTokenizerFactory splits by whitespaces
Hello, solr.KeywordTokenizerFactory seems to split by whitespace, though according to the SOLR documentation it shouldn't do that. For example, I have the following configuration for the fields proj_name and proj_name_sort:

<field name="proj_name" type="sortable_text_general" indexed="true" stored="true"/>
<field name="proj_name_sort" type="string_sort" indexed="true" stored="false"/>
..
<copyField source="proj_name" dest="proj_name_sort"/>
..
<fieldType name="string_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no actual tokenizing, so the entire input string is preserved as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- The LowerCase TokenFilter does what you expect, which can be useful when you want your sorting to be case insensitive -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- The TrimFilter removes any leading or trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

There are 3 indexed documents having the respective field values for proj_name: Test1008, CR610070 Test1, CR610070 Another Test2. Searching on proj_name_sort gives me the following results:

Query | Expected | Real | Comments
proj_name_sort : CR610070 Test1 | CR610070 Test1 | CR610070 Test1 | Expected, as it seems to search the exact un-tokenized value
proj_name_sort : CR610070 Te | None | None | Expected, as it seems to search the exact un-tokenized value
proj_name_sort : CR610070 Te* | CR610070 Test1 | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems to split into tokens by whitespace?
proj_name_sort : CR610070 An* | CR610070 Another Test2 | CR610070 Another Test2 | Expected, as it seems to apply the wildcard to the un-tokenized value
proj_name_sort : CR610070 Another Te* | CR610070 Another Test2 | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems to split into tokens by whitespace?
proj_name_sort : CR610070 Another Test1* | None | CR610070 Test1, Test1008, CR610070 Another Test2 | Seems to split into tokens by whitespace?

Please advise on a way to search on un-tokenized fields using partial criteria and wildcards. Thanks Vadim
Re: Setting up SOLR 5 from an RPM
On Wed, Mar 25, 2015 at 2:40 PM, Shawn Heisey apa...@elyograg.org wrote: I think you will only need to change the ownership of the solr home and the location where the .war file is extracted, which by default is server/solr-webapp. The user must be able to *read* the program data, but should not need to write to it. If you are using the start script included with Solr 5 and one of the examples, I believe the logging destination will also be located under the solr home, but you should make sure that's the case.

Thanks Shawn, this sort of makes sense. The thing I cannot seem to do is change the location where the war file is extracted. I think this is probably because, as of Solr 5, I am not supposed to know or be aware that there is a war file, or that the war file is hosted in jetty, which makes it tricky to specify the jetty temporary directory. Our use case is that we want to create a single system image that would be usable for several projects; each project would check out its solr home and run solr as its own user (possibly on the same server), e.g. /data/projectA being a solr home for one project and /data/projectB being a solr home for another project, both running solr from the same location. Also, on a dev server, I want to install solr once and have each member of my team run it from that single location. Because they cannot change the temporary directory, and they cannot all own server/solr-webapp, this does not work, and they must each have their own copy of the solr install. I think the way we will go for this is, in production, to run all our solr instances as the solr user, who will own the files in /opt/solr, and have their solr home directory wherever they choose. In dev, we will just do something... Cheers, Tom
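For what it's worth, the Solr 5 start script can already point separate solr homes at one shared install; a sketch of the layout being described, with paths and ports as placeholders:

# One read-only install under /opt/solr, one writable solr home per
# project. The remaining writable piece is the webapp extraction dir
# under server/solr-webapp, which is the ownership issue in this thread.
/opt/solr/bin/solr start -p 8983 -s /data/projectA
/opt/solr/bin/solr start -p 8984 -s /data/projectB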
Re: KeywordTokenizerFactory splits by whitespaces
This is a _very_ common thing we all had to learn: what you're seeing is the result of the _query parser_, not the analysis chain. Anything like proj_name_sort:term1 term2 gets split at the query parser level. Attaching debug=query to the URL should show, down in the parsed-query section, something like:

proj_name_sort:term1 default_search_field:term2

To get things through the query parser intact, enclose them in double quotes, escape the space, and such. That'll get the terms _as a single token_ to the analysis chain for that field, where the behavior will be what you expect.

Best,
Erick

On Wed, Mar 25, 2015 at 9:26 AM, Vadim Gorlovetsky vadim...@amdocs.com wrote: solr.KeywordTokenizerFactory seems to split on whitespace, although according to the Solr documentation it shouldn't do that.
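To illustrate with the field from the thread (the exact query strings here are my own sketch, not tested against Vadim's index), either of these forms reaches the analysis chain as a single token:

q=proj_name_sort:"CR610070 Test1"
q=proj_name_sort:CR610070\ Another\ Te*

The first is an exact match (wildcards are not interpreted inside the quotes); the second escapes the spaces so the trailing wildcard still applies. With debug=query you should then see one proj_name_sort clause instead of one clause per word.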
Re: Solr Monitoring - Stored Stats?
On 3/25/2015 9:26 AM, Matt Kuiper wrote: I am familiar with the JMX points that Solr exposes to allow for monitoring of statistics like QPS, numdocs, Average Query Time... I am wondering if there is a way to configure Solr to automatically store the value of these stats over time (for a given time interval), and then allow a user to query a stat over a time range. So for the QPS stat, the query might return a set that includes the QPS value for each hour in the time range specified. I am reasonably sure that JMX does not have this ability built in, and Solr does not keep track of each stat over time. Some of the statistics, in particular the average and percentile statistics for QTime on a request handler, are relevant across the entire history of the handler -- so they are valid until the core is reloaded or Solr restarts. Thanks, Shawn
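Since neither JMX nor Solr stores history, the usual workaround is to sample the stats yourself and keep the time series elsewhere (Graphite, Ganglia, RRD, or even flat files). A minimal sketch, assuming the stock mbeans handler and a core named collection1 (paths and the interval are illustrative only):

# crontab: snapshot request-handler stats once a minute
* * * * * curl -s "http://localhost:8983/solr/collection1/admin/mbeans?stats=true&wt=json" >> /var/log/solr-stats/stats-$(date +\%Y\%m\%d).json

QPS for a given interval can then be computed by diffing the 'requests' counter between consecutive samples.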
Re: Replica and node states
Comments inline:

On Wed, Mar 25, 2015 at 8:30 AM, Shai Erera ser...@gmail.com wrote: Hi. Is it possible for a replica to be DOWN, while the node it resides on is under /live_nodes? If so, what can lead to it, aside from someone unloading a core?

Yes. Aside from someone unloading the index, this can happen in two ways: 1) during startup each core publishes its state as 'down' before it enters recovery, and 2) the leader force-publishes a replica as 'down' if it is not able to forward updates to that replica (this mechanism is called Leader-Initiated Recovery, or LIR for short). #2 can happen when the replica is partitioned from the leader but both are able to talk to ZooKeeper.

I don't know if each SolrCore reports status to ZK independently, or it's done by the Solr process as a whole.

It is done on a per-core basis for now, but the 'live' node is maintained once per Solr instance (JVM).

Also, is it possible for a replica to report ACTIVE, while the node it lives on is no longer under /live_nodes? Are there any ZK timings that can cause that?

Yes, this can happen if the JVM crashed. A replica publishes itself as 'down' on shutdown, so if the graceful shutdown step is skipped then the replica will continue to be 'active' in the cluster state. Even LIR doesn't apply here, because there's no point in the leader marking a replica as 'down' if its node is not 'live' already.

-- Regards, Shalin Shekhar Mangar.
RE: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces
Thanks for the quick response. It's a bit confusing that a query-side analyzer configured with KeywordTokenizerFactory still doesn't keep the query criteria un-tokenized. I guess whitespace is a special case because it separates clauses in a query, and that splitting runs before analysis. Actually, I am now handling queries the way you recommended: double quotes for exact matching, and escaped whitespace for values with wildcards (double quotes do not work there, probably because the * wildcard is then treated as a literal part of the quoted value).

Thanks,
Vadim

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 25, 2015 6:34 PM
To: solr-user@lucene.apache.org
Subject: [MARKETING] Re: KeywordTokenizerFactory splits by whitespaces

This is a _very_ common thing we all had to learn: what you're seeing is the result of the _query parser_, not the analysis chain.
Re: rough maximum cores (shards) per machine?
Just to give a specific answer to the original question: I would say that dozens of cores (collections) is certainly fine (assuming the total data load and query rate is reasonable), maybe 50 or even 100. Low hundreds of cores/collections MAY work, but isn't advisable. Thousands, if it works at all, is probably just asking for trouble and likely to be far more hassle than it could possibly be worth. Whether the number for you ends up being 37, 50, 75, 100, 237, or 1273, you will have to do a proof-of-concept implementation to validate it. I'm not sure where we are at these days for lazy-loading of cores. That may work for you with hundreds (thousands?!) of cores/collections for tenants who are mostly idle or dormant, but if the server runs long enough, it may build up a lot of memory usage for collections that were active but have gone idle after days or weeks.

-- Jack Krupansky
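For anyone who wants to experiment with the lazy-loading Jack mentions: my understanding is that this is the "transient cores" feature, configured per core plus a cap in solr.xml, and that it applies to standalone cores only, not SolrCloud collections. A sketch (the core name and cache size are illustrative):

# core.properties for a lazily-loaded core
name=tenant42
transient=true
loadOnStartup=false

<!-- solr.xml: keep at most 100 transient cores loaded at once -->
<solr>
  <int name="transientCacheSize">100</int>
</solr>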
Re: Replica and node states
Thanks. Does Solr ever clean up those states? I.e., does it ever remove 'down' replicas, or replicas belonging to non-live nodes, after some time? Or will these remain in the cluster state forever (assuming they never come back up)? If they remain there, is there any penalty? E.g., does Solr try to send them updates, or route search requests to them? I'm talking about replicas that stay in ACTIVE state while their nodes aren't under /live_nodes.

Shai
Re: Custom TokenFilter
Re,

Sorry about the image. Here are all my dependency jars, listed below:

- commons-cli-2.0-mahout.jar
- commons-compress-1.9.jar
- commons-io-2.4.jar
- commons-logging-1.2.jar
- httpclient-4.4.jar
- httpcore-4.4.jar
- httpmime-4.4.jar
- junit-4.10.jar
- log4j-1.2.17.jar
- lucene-analyzers-common-4.10.2.jar
- lucene-benchmark-4.10.2.jar
- lucene-core-4.10.2.jar
- mahout-core-0.9.jar
- noggit-0.5.jar
- opennlp-maxent-3.0.3.jar
- opennlp-tools-1.5.3.jar
- slf4j-api-1.7.9.jar
- slf4j-simple-1.7.10.jar
- solr-solrj-4.10.2.jar

I have put them into a specific directory (contrib/tamingtext/dependency), and the jar containing my class into another directory (contrib/tamingtext/lib). I added these paths in solrconfig.xml:

<lib dir="../../../contrib/tamingtext/lib" regex=".*\.jar" />
<lib dir="../../../contrib/tamingtext/dependency" regex=".*\.jar" />

Thanks in advance,
Regards.

On Wednesday, 25 March 2015 at 17:12, Erick Erickson erickerick...@gmail.com wrote: Images don't come through the mailing list, can't see your image. Whether or not all the jars in the directory you're working on are consistent is the least of your problems. Are the libs to be found in any _other_ place specified on your classpath? Best, Erick

On Wed, Mar 25, 2015 at 12:36 AM, Test Test andymish...@yahoo.fr wrote: Thanks Erick, I'm working on Solr 4.10.2 and all my dependency jars seem to be compatible with this version. I can't figure out which one causes this issue. Thanks, Regards.

On Tuesday, 24 March 2015 at 23:45, Erick Erickson erickerick...@gmail.com wrote: bq: 13 more Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr. This usually means you have jar files from different versions of Solr in your classpath. Best, Erick

On Tue, Mar 24, 2015 at 2:38 PM, Test Test andymish...@yahoo.fr wrote: Hi there, I'm trying to create my own TokenizerFactory (from the Taming Text book). After setting up schema.xml and adding the path in solrconfig.xml, I start Solr and get this error message:

Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory. Schema file is .../conf/schema.xml
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:595)
  at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
  at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
  at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
  at org.apache.solr.core.ConfigSetService.createIndexSchema(ConfigSetService.java:90)
  at org.apache.solr.core.ConfigSetService.getConfig(ConfigSetService.java:62)
  ... 7 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] fieldType "text": Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:486)
  ... 12 more
Caused by: org.apache.solr.common.SolrException: Plugin init failure for [schema.xml] analyzer/tokenizer: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:177)
  at org.apache.solr.schema.FieldTypePluginLoader.readAnalyzer(FieldTypePluginLoader.java:362)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:95)
  at org.apache.solr.schema.FieldTypePluginLoader.create(FieldTypePluginLoader.java:43)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)
  ... 13 more
Caused by: java.lang.ClassCastException: class com.tamingtext.texttamer.solr.SentenceTokenizerFactory
  at java.lang.Class.asSubclass(Class.java:3208)
  at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:474)
  at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:593)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:342)
  at org.apache.solr.schema.FieldTypePluginLoader$2.create(FieldTypePluginLoader.java:335)
  at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:151)

Can someone help? Thanks. Regards.
Applying Tokenizers and Filters to CopyFields
Hi all,

I am wondering what the process is for applying tokenizers and filters (as defined in the fieldType definition) to field contents that result from copyFields. To be more specific, in my Solr instance I would like to support query expansion by two means: removing stop words and adding inflected word forms as synonyms.

To use a specific example, let's say I have the following sentence to be indexed (from a Wittgenstein manuscript): „Was zum Wesen der Welt gehört, kann die Sprache nicht ausdrücken.“ This sentence will be indexed in a field called „original“ that is defined as follows:

<field name="original" type="text_original" indexed="true" stored="true" required="true"/>

<fieldType name="text_windex_original" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

Then, in order to create fields for the two types of query expansion, I have set up specific fields:

- One field where stopwords are removed, both from the indexed content and from the query. So, if the user is searching for a phrase like „der Sprache“, Solr should still find the segment above, because the determiners („der“ and „die“) are removed prior to indexing and prior to querying, respectively. This field is defined as follows:

<field name="stopwords_removed" type="text_stopwords_removed" indexed="true" stored="true" required="true"/>

<fieldType name="text_stopwords_removed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

- A second field where synonyms are added to the query so that more segments will be found. For instance, if the user is searching for the plural form „Sprachen“, Solr should return the segment above, due to this entry in the synonyms file: Sprache,Sprach,Sprachen. This field is defined as follows:

<field name="expanded" type="text_multiplied" indexed="true" stored="true" required="true"/>

<fieldType name="text_expanded" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt" format="snowball"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_de.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Finally, to avoid having to specify three fields with identical content in the import documents, I am defining the two fields for query expansion as copyFields:

<copyField source="original" dest="stopwords_removed"/>
<copyField source="original" dest="expanded"/>

Now, my expectation would be as follows:
- during import, two temporary fields are created by copying content from the original field
- these two temporary fields are then pre-processed as per the definitions above
- the pre-processed version of the text is added to the index
- the user can then search for „Sprache“, „sprache“, „Sprachen“ or „der Sprache“ and will always get the segment above as a matching result.

However, what actually happens is that I get matches only for „Sprache“ and „sprache“. The other thing that strikes me as odd is that when I restrict the search to one of the fields using the „fq“ parameter, I get no results. For instance:

http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true

returns no matches. I would have expected that using the fq parameter the user can specify what type of search (s)he would like to carry out: a standard search (field original) or an expanded search (one of the other two fields).

For debugging, I have checked the analysis screen and the results seem OK (posted below):

ST: Was zum Wesen der Welt gehört kann die Sprache nicht ausdrücken
SF: Was zum Wesen Welt gehört kann die Sprache nicht

Apologies for the long post, but I am really a bit stuck here (even after doing a lot of reading and googling). It is probably something simple that I am missing. Thanks a lot in advance for any help.

Cheers,
Martin
Uneven data distribution with composite router
Hi, I'm using a three-level composite router in a SolrCloud environment, primarily for multi-tenancy and field collapsing. The format is as follows: *language!topic!url*. Examples would be:

ENU!12345!www.testurl.com/enu/doc1
GER!12345!www.testurl.com/ger/doc2
CHS!67890!www.testurl.com/chs/doc3

The SolrCloud cluster contains 2 shards, each having 3 replicas. After indexing around 10 million documents, I'm observing that the index size on shard 1 is around 60gb while shard 2 is 15gb, so the bulk of the data is getting indexed on shard 1. Since 60% of the documents are English, I expect the index size to be higher on one shard, but the difference seems a little too high. The idea is to make sure that all ENU!12345 documents are routed to one shard so that distributed field collapsing works. Is there something I can do differently here to get a better distribution? Any pointers will be appreciated. Regards, Shamik
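Not an authoritative answer, but one thing to test: the compositeId router accepts a bits suffix on a route key (documented for two-level keys), which limits how many of the top hash bits that key contributes, e.g.:

ENU/2!12345!www.testurl.com/enu/doc1

The intent is that a dominant language then spreads across shards according to the lower-level keys, while documents sharing the same language!topic prefix still hash together for collapsing. Whether the /bits syntax combines with three-level keys is something I'd verify on a test cluster first; with only 2 shards, splitting into more shards may be the more direct fix anyway.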
Re: Applying Tokenizers and Filters to CopyFields
Thanks a lot, Michael. See replies below.

On 25.03.2015 at 21:41, Michael Della Bitta michael.della.bi...@appinions.com wrote:

1. You probably don't want to store your copyFields. That's literally going to be the same information each time.

OK, got it. I have set the targets of the copyFields to stored="false".

2. Your expectation "the pre-processed version of the text is added to the index" may be incorrect. Anything done in <analyzer type="query"> sections actually happens at query time. Not sure if that's significant for you.

I was actually referring to what happens at index time. So, the pre-processing steps are applied under <analyzer type="index">. And this point is not quite clear to me: assuming I have a simple case-folding step applied to the target of the copyField, how or where are the lower-case tokens stored, if the text isn't added to the index? How is the query supposed to retrieve the lower-case version? (Sorry if this sounds like a naive question, but I have a feeling that I am missing something really basic here.)

Cheers,
Martin
Re: Applying Tokenizers and Filters to CopyFields
Two other things I noticed:

1. You probably don't want to store your copyFields. That's literally going to be the same information each time.

2. Your expectation "the pre-processed version of the text is added to the index" may be incorrect. Anything done in <analyzer type="query"> sections actually happens at query time. Not sure if that's significant for you.

Michael Della Bitta
appinions inc.

On Wed, Mar 25, 2015 at 4:27 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote: Hi Martin, fq means filter query. Maybe what you want is the qf (query fields) parameter of edismax?
Re: Replica and node states
On Wed, Mar 25, 2015 at 12:51 PM, Shai Erera ser...@gmail.com wrote: Thanks. Does Solr ever clean up those states? I.e., does it ever remove 'down' replicas, or replicas belonging to non-live nodes, after some time? Or will these remain in the cluster state forever (assuming they never come back up)?

No, they remain there forever. You can still call the DELETEREPLICA API to clean them up. There's even a parameter, onlyIfDown=true, which will remove a replica only if it's already 'down'.

If they remain there, is there any penalty? E.g., does Solr try to send them updates, or route search requests to them? I'm talking about replicas that stay in ACTIVE state while their nodes aren't under /live_nodes.

No, there is no penalty, because we always check for state=active and liveness before routing any requests to a replica.

-- Regards, Shalin Shekhar Mangar.
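For reference, a sketch of that cleanup call (the collection, shard, and replica names here are placeholders):

http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node3&onlyIfDown=true

The replica parameter takes the core_nodeN name as it appears in clusterstate.json.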
Re: Applying Tokenizers and Filters to CopyFields
Hi Martin,

fq means "filter query". Maybe what you want is the qf (query fields) parameter of edismax?

On Wednesday, March 25, 2015 9:23 PM, Martin Wunderlich martin...@gmx.net wrote: Hi all, I am wondering what the process is for applying tokenizers and filters (as defined in the fieldType definition) to field contents that result from copyFields.
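To spell that out with the fields from Martin's schema (a sketch only; URL-escaping omitted):

http://localhost:8983/solr/windex/select?q=Sprache&defType=edismax&qf=original
http://localhost:8983/solr/windex/select?q=Sprachen&defType=edismax&qf=expanded

qf tells the edismax parser which field(s) to search, whereas fq=original was being parsed as a filter query for the bare term "original" against the default search field, which would explain why it matched nothing.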
Re: Applying Tokenizers and Filters to CopyFields
Thanks a lot, Ahmet. I've just read up on this query-fields parameter and it sounds good. Since the field contents are currently all identical, I can't really test it yet.

Cheers,
Martin

On 25.03.2015 at 21:27, Ahmet Arslan iori...@yahoo.com.INVALID wrote: Hi Martin, fq means filter query. Maybe what you want is the qf (query fields) parameter of edismax?
location field giving error for lat long
Hi, I have a field named GeoLocate with the location datatype. For some lat/long values it gives me the following error during the indexing process:

Can't parse point '139.9544301,35.4298081' because: Bad Y value 139.9544301 is not in boundary Rect(minX=-180.0,maxX=180.0,minY=-90.0,maxY=90.0)

Any idea what's wrong?
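A note on the error itself (my reading, not from the thread): the location field type expects points in "lat,lon" order, and 139.9544301 is outside the ±90 latitude range, so this value looks like it is in "lon,lat" order. Swapping the coordinates should index cleanly, e.g.:

<field name="GeoLocate">35.4298081,139.9544301</field>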
Re: Custom TokenFilter
Re,

Finally, I think I found where this problem comes from: I didn't extend the right class. Instead of the Tokenizer side, my class was built on TokenFilter. Erick, thanks for your replies. Regards.

On Wednesday, 25 March 2015 at 23:55, Test Test andymish...@yahoo.fr wrote: Re, I have tried to remove all the redundant jar files. Then I relaunched it, but it blocked directly on the same issue. It's very strange. Regards,

On Wednesday, 25 March 2015 at 23:31, Erick Erickson erickerick...@gmail.com wrote: Wait, you didn't put, say, lucene-core-4.10.2.jar into your contrib/tamingtext/dependency directory, did you? That means you have Lucene (and Solr and SolrJ and ...) in your classpath twice, since they're _already_ in your classpath by default because you're running Solr. All your jars should be in your aggregate classpath exactly once; having them in twice would explain the cast exception. You do not need those in the tamingtext/dependency subdirectory, just the things that are _not_ in Solr already. Best, Erick
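For the archives, a sketch of how the lib directives end up once the Lucene/Solr jars are removed from the dependency directory (paths as given earlier in the thread):

<!-- jar with the custom factory class -->
<lib dir="../../../contrib/tamingtext/lib" regex=".*\.jar" />
<!-- third-party dependencies only: no lucene-*, solr-*, or solrj jars in here -->
<lib dir="../../../contrib/tamingtext/dependency" regex=".*\.jar" />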
Re: Applying Tokenizers and Filters to CopyFields
I agree the terminology is possibly a little confusing. Stored refers to values that are stored verbatim; you can retrieve them verbatim, and analysis does not affect stored values. Indexed values are tokenized/transformed and stored inverted; you can't recover the literal analyzed version (at least, not easily). If what you really want is to store and retrieve case-folded versions of your data as well as the original, you need to use something like an UpdateRequestProcessor, which I personally am less familiar with.

On Wed, Mar 25, 2015 at 5:28 PM, Martin Wunderlich martin...@gmx.net wrote: So, the pre-processing steps are applied under analyzer type="index". And this point is not quite clear to me: assuming that I have a simple case-folding step applied to the target of the copyField, how or where are the lower-case tokens stored, if the text isn't added to the index? How is the query supposed to retrieve the lower-case version? (Sorry if this sounds like a naive question, but I have a feeling that I am missing something really basic here.)

Michael Della Bitta, Senior Software Engineer, appinions inc., 18 East 41st Street, New York, NY 10017 | appinions.com
Re: Problem with Terms Query Parser
That should work. Check to be sure that you really are running Solr 5.0. Was it an old version of trunk, or the 5x branch from before last August, when the terms query parser was added? -- Jack Krupansky

On Tue, Mar 24, 2015 at 5:15 PM, Shamik Bandopadhyay sham...@gmail.com wrote: Hi, I'm trying to use the Terms Query Parser for one of my use cases, where I apply an implicit filter on a bunch of sources. When I run the following query: fq={!terms f=Source}help,documentation,sfdc I get the following error:

<lst name="error"><str name="msg">Unknown query parser 'terms'</str><int name="code">400</int></lst>

What am I missing here? I'm using Solr 5.0. Any pointers will be appreciated. Regards, Shamik
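A quick way to rule out a version mismatch (host and port below are placeholders) is the system info handler, which reports the exact Solr and Lucene versions of the running instance:

http://localhost:8983/solr/admin/info/system?wt=json

If that reports 4.10 or later, the syntax fq={!terms f=Source}help,documentation,sfdc should be recognized, since the terms query parser shipped in 4.10.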
RE: German Compound Splitter words.fst causing problems.
Hello Chris - I don't know the token filter you mention, but I would like to recommend Lucene's HyphenationCompoundWordTokenFilter. It works reasonably well if you provide the hyphenation rules and a dictionary. It has some flaws, such as decompounding to irrelevant subwords, overlapping subwords, or subwords that do not form the whole compound word (minus genitives), but these can be fixed. Markus

-Original message- From: Chris Morley ch...@depahelix.com Sent: Wednesday 25th March 2015 17:59 To: solr-user@lucene.apache.org Subject: German Compound Splitter words.fst causing problems.

Hello, Chris Morley here, of Wayfair.com. I am working on the German compound-splitter by Dawid Weiss. I tried to upgrade the words.fst file that comes with the German compound-splitter using Solr 3.5, but it doesn't work. Below is the IndexNotFoundException that I get:

cmorley@Caracal01:~/Work/oss/git/apache-solr-3.5.0$ java -cp lucene/build/lucene-core-3.5-SNAPSHOT.jar org.apache.lucene.index.IndexUpgrader wordsFst
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: org.apache.lucene.store.MMapDirectory@/home/cmorley/Work/oss/git/apache-solr-3.5.0/wordsFst lockFactory=org.apache.lucene.store.NativeFSLockFactory@201a755e
at org.apache.lucene.index.IndexUpgrader.upgrade(IndexUpgrader.java:118)
at org.apache.lucene.index.IndexUpgrader.main(IndexUpgrader.java:85)

The reason I'm attempting this at all is the answer here, http://stackoverflow.com/questions/25450865/migrate-solr-1-4-index-files-to-4-7, which says to do the upgrade in a two-step process, first using Solr 3.5 and then the latest Solr version (4.10.3). When I run the unit tests for my modified German compound-splitter, I get this same type of error. The thing is, this is an FST, not an index, which is a little confusing. The reason I'm following this answer, though, is that I get that exact same message when trying to build the (modified) project with maven, at the point where it tries to load in words.fst. Below:

[main] ERROR com.wayfair.lucene.analysis.de.compound.GermanCompoundSplitter - Format version is not supported (resource: com.wayfair.lucene.analysis.de.compound.InputStreamDataInput@79a66240): 0 (needs to be between 3 and 4). This version of Lucene only supports indexes created with release 3.0 and later. Failed to initialize static data structures for German compound splitter.

Thanks, -Chris.
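For reference, a sketch of the filter Markus recommends, wired into a Solr analyzer chain; the hyphenation grammar and dictionary file names are placeholders that you must supply (FOP-style hyphenation rules plus a German word list):

<filter class="solr.HyphenationCompoundWordTokenFilterFactory"
        hyphenator="hyph_de.xml"
        dictionary="german-dictionary.txt"
        minWordSize="5" minSubwordSize="4" maxSubwordSize="15"
        onlyLongestMatch="true"/>

The subword size bounds and onlyLongestMatch are the usual knobs for suppressing the irrelevant and overlapping subwords he mentions.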
Re: Custom TokenFilter
Re, I have tried to remove all the redundant jar files, then relaunched, but it blocks directly on the same issue. It's very strange. Regards,
Re: Applying Tokenizers and Filters to CopyFields
Martin: Perhaps this will help:
- indexed=true, stored=true: the field can be searched, and the raw input (not analyzed in any way) can be shown to the user in the results list.
- indexed=true, stored=false: the field can be searched, but it can't be returned in the results list with the document.
- indexed=false, stored=true: the field cannot be searched, but the contents can be returned in the results list with the document. There are some use cases where this is desirable behavior.
- indexed=false, stored=false: the entire field is thrown out; it's just as if you didn't send the field to be indexed at all.

And one other thing: copyField gets the _raw_ data, not the analyzed data. Let's say you have two fields, src and dst. Copying from src to dst in schema.xml is identical to:

<add>
  <doc>
    <field name="src">original text</field>
    <field name="dst">original text</field>
  </doc>
</add>

That is, copyField directives are not chained.

Also, watch out for your query syntax. Michael's comments are spot-on; I'd just add this:

http://localhost:8983/solr/windex/select?q=Sprache&fq=original&wt=json&indent=true

is kind of odd. Let's assume you mean qf rather than fq. That _only_ matters if your query parser is edismax; it'll be ignored in this case, I believe. You'd want something like q=src:Sprache or q=dst:Sprache, or even

http://localhost:8983/solr/windex/select?q=Sprache&df=src
http://localhost:8983/solr/windex/select?q=Sprache&df=dst

where df is the default field and the search is applied against that field in the absence of a field qualification like my first two examples. Best, Erick
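To make Martin's case-folding scenario concrete, here is a minimal sketch (field and type names are invented for illustration, using the stock StandardTokenizer and LowerCaseFilter): the lower-case tokens live only in the inverted index of the copyField target, while the original field keeps the verbatim stored value for display.

<fieldType name="text_lower" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="src" type="string" indexed="true" stored="true"/>
<field name="dst" type="text_lower" indexed="true" stored="false"/>
<copyField source="src" dest="dst"/>

A query like q=dst:sprache matches the folded tokens, while fl=src returns the original, un-analyzed text.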
RE: Difference in indexing using config file vs client i.e SolrJ
Thanks Erick for the helpful explanations. Thanks, Sumit

From: Erick Erickson [erickerick...@gmail.com] Sent: Monday, March 23, 2015 4:58 PM To: solr-user@lucene.apache.org Subject: Re: Difference in indexing using config file vs client i.e SolrJ

1) Either none or lots, depending ;). You're talking schemaless here, I think. Schemaless mode guesses what the field should be based on the document and creates a field for it; pre-defined schemas require you to make that decision up front. In terms of what the underlying index looks like at the lower Lucene level, a field defined in schema.xml and one created dynamically are identical, so from that perspective there's no difference. However, whether the chosen field definitions best represent the problem you're trying to solve is another issue altogether. Schemaless simply cannot apply the same kind of domain-specific interpretation that a human can, not to mention construct analysis chains that reflect the characteristics specific to that domain.

2) There have been some anecdotal reports of schemaless copying everything into a _text field in ways that impact performance, but this is configurable.

3) Again, the underlying structure of the index at the Lucene level is the same. What's NOT the same is whether schemaless mode makes the right decisions. Almost invariably a human being can do better, since you're armed with knowledge of what's important and what's not.

Here's my take: schemaless mode is a great way to get started with minimal effort on your part, but pretty soon the problem domain requires that you take control of the schema and hand-craft schema.xml. For some problem spaces schemaless may be good enough; you have to evaluate your corpus and your problem space. Best, Erick

On Mon, Mar 23, 2015 at 4:41 PM, Purohit, Sumit sumit.puro...@pnnl.gov wrote: Hi All, I have recently started working with Solr and I have a trivial question to ask, as I could not find a suitable answer. A document's fields can be defined in a config file (such as schema.xml) or on the fly using a Solr client such as SolrJ. 1. What is the difference between the indexes created by the two approaches? 2. Is there any major performance gain from using a predefined schema instead of SolrJ? 3. Does Solr persist these indexes differently, and does that have any impact on query efficiency? Thanks, Sumit Purohit
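As a concrete illustration of the trade-off Erick describes, a hand-written schema.xml pins down type, analysis, and storage up front, where schemaless mode has to guess from the first document it sees. Field names here are hypothetical; tfloat and text_en are types shipped in the example schema:

<field name="price" type="tfloat" indexed="true" stored="true"/>
<field name="title" type="text_en" indexed="true" stored="true"/>

A schemaless guess might type "price" as a long or a string depending on the first value encountered, which is exactly the kind of decision a human gets right up front.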
Re: Using G1 with Apache Solr
The issue we had with Java 8 was with the DIH handler. We were using Rhino, and with the new JavaScript engine implementation in Java 8 we hit several regex-expression issues... We are almost ready to go now, since we moved away from Rhino and now use Java. Bill

On Wed, Mar 25, 2015 at 2:14 AM, Daniel Collins danwcoll...@gmail.com wrote: Interesting nonetheless, Shawn :) We use G1GC on our servers. We were on Java 7 (64-bit, RHEL6) but are trying to migrate to Java 8 (which seems to cause more GC issues, so we clearly need to tweak our settings); we will investigate 8u40, though.

On 25 March 2015 at 04:23, Shawn Heisey apa...@elyograg.org wrote: On 3/24/2015 9:52 PM, Shawn Heisey wrote: On 3/24/2015 3:48 PM, Kamran Khawaja wrote: I'm running Solr 4.7.2 with Java 7u75 with the following JVM params: I really got my wires crossed. Kamran sent his message to the hotspot-gc-use mailing list, not the solr-user list! Thanks, Shawn

-- Bill Bell billnb...@gmail.com cell 720-256-8076
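For anyone comparing notes, a typical G1 starting point looks something like the lines below; the heap size and pause target are illustrative assumptions to be validated against your own GC logs, not recommendations:

-Xms8g -Xmx8g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=250
-XX:+ParallelRefProcEnabled

ParallelRefProcEnabled is often worth testing with Solr, since reference processing can dominate pauses on heaps with large caches.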
Re: [MASSMAIL]Re: Issues to create new core
Erick, thanks for your help. I was able to fix the problem. I work in non-SolrCloud mode. Best regards, Ale

- Original message - From: Erick Erickson erickerick...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, 24 March 2015 10:14:22 Subject: [MASSMAIL]Re: Issues to create new core

Tell us all the steps you went through to do this. Note that you should _not_ be using the core admin in the admin UI if you're working with SolrCloud. For stand-alone Solr, the message above is probably caused by your not having a conf directory set up already. The core admin UI expects a pre-existing directory with a conf directory that contains solrconfig.xml, schema.xml, and all the rest of the configuration files. You can specify this via some of the parameters on the admin UI screen (see instanceDir and dataDir). Each core must be in a separate directory or Bad Things Happen. HTH, Erick

On Tue, Mar 24, 2015 at 7:01 AM, Alejandro Jesus Mariño Molerio ajmar...@estudiantes.uci.cu wrote: Dear Solr community: I just began to work with Solr. I chose Solr 5.0, but when I try to create a new core with the GUI, it shows the following error: Error CREATEing SolrCore 'datos': Unable to create core [datos] Caused by: Can't find resource 'solrconfig.xml' in classpath or 'C:\solr\server\solr\datos\conf'. My question is simple: how can I fix this problem? Thanks in advance for your consideration. Alejandro.
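For Solr 5, the simplest way around this error is to let the bin/solr script lay down a bundled configset (including solrconfig.xml) instead of using the core admin UI; 'datos' is the core name from the question:

bin/solr create -c datos

On Windows the equivalent is bin\solr.cmd create -c datos, which copies a default configset into server\solr\datos\conf, after which the core comes up without any manual conf directory setup.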
Re: Setting up SOLR 5 from an RPM
On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans tevans...@googlemail.com wrote: Hi all, we're migrating to SOLR 5 (from 4.8), and our infrastructure guys would prefer we installed SOLR from an RPM rather than extracting the tarball where we need it. They are creating the RPM file themselves, and it installs an init.d script and the equivalent of the tarball to /opt/solr. We're having problems running SOLR from the installed files, as SOLR wants to (I think) extract the WAR file and create various temporary files below /opt/solr/server. From the SOLR 5 reference guide, section "Managing SOLR", sub-section "Taking SOLR to production", it seems changing the ownership of the installed files to the user that will run SOLR is an explicit requirement if you do not wish to run as root. It would be better if this were not required. With most applications you do not normally require permission to modify the installed files in order to run the application; e.g. I do not need write permission to /usr/share/vim to run vim, so it is a shame that I need write permission to /opt/solr to run solr. Cheers, Tom
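A common way to keep root ownership off the running instance, assuming the RPM installs everything under /opt/solr as described above, is a dedicated user that owns only the directories Solr actually writes to; the user name and paths below are illustrative:

useradd -r -s /bin/false solr
chown -R solr:solr /opt/solr/server

The tarball also ships bin/install_solr_service.sh, which separates the installation directory from a writable var-style data directory, which is closer to the packaging behavior Tom is asking for.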
Re: Replica and node states
On Wed, Mar 25, 2015 at 9:24 PM, Shai Erera ser...@gmail.com wrote: "There's even a param onlyIfDown=true which will remove a replica only if it's already 'down'." That will only work if the replica is in the DOWN state, correct? That is, if the Solr JVM was killed and the replica stays ACTIVE, but its node is not under /live_nodes, it won't get deleted? What I chose to do is to delete the replica if its node is not under /live_nodes and I'm sure it will never return.

Probably not, and we should fix it. It should be possible to delete replicas which are not live, I guess. But there are more behaviors that need to be defined, e.g. what happens if a node was down, you deleted the replica which was supposed to be on it, and then the node came back up? Should we re-create the replica automatically, or ask the node to delete the local core and have something new assigned to it? Some of these behaviors are what we informally call "ZK as Truth" features, where we want to move to a world where ZK is the source of truth and nodes modify their state and cores depending on what's inside ZK.

"No, there is no penalty because we always check for state=active and live-ness before routing any requests to a replica." Well, that's also a penalty :), though I agree it's a minor one. There is also a penalty ZK-wise -- clusterstate.json still records these orphaned replicas, so I'll make sure I do this cleanup from time to time.

Yeah, but just to avoid any misunderstanding -- the live nodes are watched via ZK, so checking live-ness is a hash-set lookup; that is the cost, and a small one. But yes, you do need to clean up from time to time.

Thanks for the responses and clarifications! Shai

On Wed, Mar 25, 2015 at 11:39 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: On Wed, Mar 25, 2015 at 12:51 PM, Shai Erera ser...@gmail.com wrote: Thanks. Does Solr ever clean up those states? I.e. does it ever remove down replicas, or replicas belonging to non-live nodes, after some time? Or will these remain in the cluster state forever (assuming they never come back up)?

No, they remain there forever. You can still call the DELETEREPLICA API to clean them up. There's even a param onlyIfDown=true which will remove a replica only if it's already 'down'.

If they remain there, is there any penalty? E.g. does Solr try to send them updates, or route search requests to them? I'm talking about replicas that stay in ACTIVE state, but whose nodes aren't under /live_nodes.

No, there is no penalty because we always check for state=active and live-ness before routing any requests to a replica. Shai

On Wed, Mar 25, 2015 at 8:05 PM, Shalin Shekhar Mangar shalinman...@gmail.com wrote: Comments inline: On Wed, Mar 25, 2015 at 8:30 AM, Shai Erera ser...@gmail.com wrote: Hi, is it possible for a replica to be DOWN while the node it resides on is under /live_nodes? If so, what can lead to it, aside from someone unloading a core?

Yes, aside from someone unloading the index, this can happen in two ways: 1) during startup each core publishes its state as 'down' before it enters recovery, and 2) the leader force-publishes a replica as 'down' if it is not able to forward updates to that replica (this mechanism is called Leader-Initiated Recovery, or LIR in short). The #2 above can happen when the replica is partitioned from the leader but both are able to talk to ZooKeeper.

I don't know if each SolrCore reports status to ZK independently, or it's done by the Solr process as a whole.

It is done on a per-core basis for now. But the 'live' node is maintained one per Solr instance (JVM).

Also, is it possible for a replica to report ACTIVE while the node it lives on is no longer under /live_nodes? Are there any ZK timings that can cause that?

Yes, this can happen if the JVM crashed. A replica publishes itself as 'down' on shutdown, so if the graceful shutdown step is skipped then the replica will continue to be 'active' in the cluster state. Even LIR doesn't apply here because there's no point in the leader marking a node as 'down' if it is not 'live' already. Shai

-- Regards, Shalin Shekhar Mangar.
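For reference, the cleanup Shai describes is the Collections API DELETEREPLICA call; collection, shard, and replica names here are placeholders:

http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=mycoll&shard=shard1&replica=core_node3

Appending &onlyIfDown=true makes the call refuse to remove a replica that is not in the 'down' state, which is exactly why it does not cover the killed-JVM case discussed above, where the replica is still marked 'active'.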
Re: Custom TokenFilter
Wait, you didn't put, say, lucene-core-4.10.2.jar into your contrib/tamingtext/dependency directory, did you? That means you have Lucene (and Solr and SolrJ and ...) in your classpath twice, since they're _already_ in your classpath by default when you're running Solr. All your jars should be in your aggregate classpath exactly once; having them in twice would explain the cast exception. You do not need these in the tamingtext/dependency subdirectory, just the things that are _not_ in Solr already. Best, Erick
Retrieving list of words for highlighting
In Solr 5 (or 4), is there an easy way to retrieve the list of words to highlight? Use case: allow an external application to highlight the matching words of a matching document, rather than using the highlighted snippets returned by Solr. Thanks, Damien
Data indexing is going too slow on a single shard. Why?
Hello, please can anyone assist me? I am indexing on a single shard and it is taking too much time: around 49 GB of data on one shard. What's wrong? Why is Solr taking so long to index the data? Earlier I was indexing the same data on 8 shards, and it was fast compared to the single shard. Why so? Any help please. *HardCommit - 15 sec* *SoftCommit - 10 min.* Best, Nitin
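Nitin's commit settings, written out as the usual solrconfig.xml sketch (openSearcher=false on the hard commit is the common recommendation, so the 15-second commit only flushes and truncates the transaction log rather than rebuilding searchers):

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>600000</maxTime>
</autoSoftCommit>

With commits configured this way, the remaining slowdown is most likely just one machine doing the CPU and I/O work that eight shards previously shared.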
Re: Custom TokenFilter
Thanks for letting us know the resolution; the problem was bugging me. Erick
Re: Replica and node states
"There's even a param onlyIfDown=true which will remove a replica only if it's already 'down'." That will only work if the replica is in the DOWN state, correct? That is, if the Solr JVM was killed and the replica stays ACTIVE, but its node is not under /live_nodes, it won't get deleted? What I chose to do is to delete the replica if its node is not under /live_nodes and I'm sure it will never return.

"No, there is no penalty because we always check for state=active and live-ness before routing any requests to a replica." Well, that's also a penalty :), though I agree it's a minor one. There is also a penalty ZK-wise -- clusterstate.json still records these orphaned replicas, so I'll make sure I do this cleanup from time to time.

Thanks for the responses and clarifications! Shai