Re: Solr scores remain the same for exact match and nearly exact match
On 3 April 2013 10:52, amit amit.mal...@gmail.com wrote: Below is my query http://localhost:8983/solr/select/?q=subject:session management in php&fq=category:[*%20TO%20*]&fl=category,score,subject [...] Add debugQuery=on to your Solr URL, and you will get an explanation of the score. Your subject field is tokenised, so there is no a priori reason that an exact match should score higher. Several strategies are available if you want that behaviour. Try searching Google, e.g., for solr exact match higher score. Regards, Gora
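Gora's suggestion can be sketched as a properly encoded request; note the & separators between parameters, which the archived URL above lost. A minimal Python sketch (host, core, and field names are taken from the message; the phrase quoting is an illustrative tweak, not part of the original query):

```python
from urllib.parse import urlencode

# Standard Solr query parameters; debugQuery=on asks Solr to explain each score.
base = "http://localhost:8983/solr/select/"
params = {
    "q": 'subject:"session management in php"',  # phrase query, closer to an exact match
    "fq": "category:[* TO *]",                   # keep only docs that have a category
    "fl": "category,score,subject",
    "debugQuery": "on",
}
url = base + "?" + urlencode(params)
print(url)
```

urlencode handles the brackets, spaces, and commas so the parameters survive intact.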
Re: Out of memory on some faceting queries
On Tue, 2013-04-02 at 17:08 +0200, Dotan Cohen wrote: Most of the time I facet on one field that has about twenty unique values. They are likely to be disk cached so warming those for 9M documents should only take a few seconds. However, once per day I would like to facet on the text field, which is a free-text field usually around 1 KiB (about 100 words), in order to determine what the top keywords / topics are. That query would take up to 200 seconds to run, [...] If that query is somehow part of your warming, then I am surprised that search has worked at all with your commit frequency. That would however explain your OOM if you have multiple warmups running at the same time. It sounds like TermsComponent would be a better fit for getting top topics: https://wiki.apache.org/solr/TermsComponent
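A TermsComponent request like the one Toke suggests might be built as follows. This is only a sketch: the field name "text" and the /terms handler path are assumptions (the handler must be mapped to TermsComponent in solrconfig.xml):

```python
from urllib.parse import urlencode

# TermsComponent parameters for pulling the top-N terms of a field by
# document frequency -- far cheaper than faceting the whole text field.
params = {
    "terms": "true",
    "terms.fl": "text",     # assumed field name
    "terms.limit": "20",    # top-N terms
    "terms.sort": "count",  # sort by frequency rather than index order
}
terms_url = "http://localhost:8983/solr/terms?" + urlencode(params)
print(terms_url)
```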
maxWarmingSearchers in Solr 4.
I have been dragging the same solrconfig.xml from Solr 3.x to 4.0 to 4.1, with no customization (bad, bad me!). I'm now looking into customizing it, and I see that the Solr 4.1 solrconfig.xml is much simpler and shorter. Is this simply because many of the examples have been removed? In particular, I notice that there is no mention of maxWarmingSearchers in the Solr 4.1 solrconfig.xml. I assume that I can simply add it back in; are there any other critical config options that are missing that I should be looking into as well? Would I be better off using the old Solr 3.x solrconfig.xml in Solr 4.1, as it contains so many examples? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
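For reference, the element can indeed simply be added back. The sketch below embeds a minimal solrconfig.xml fragment and checks it parses; the value 2 is the old 3.x example default, not a recommendation for any particular setup:

```python
import xml.etree.ElementTree as ET

# Minimal solrconfig.xml fragment; maxWarmingSearchers is a direct child
# of <config> and caps how many searchers may warm concurrently.
fragment = """
<config>
  <maxWarmingSearchers>2</maxWarmingSearchers>
</config>
"""
root = ET.fromstring(fragment)
value = int(root.findtext("maxWarmingSearchers"))
print(value)
```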
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 6:26 PM, Andre Bois-Crettez andre.b...@kelkoo.com wrote: warmupTime is available on the admin page for each type of cache (in milliseconds): http://solr-box:8983/solr/#/core1/plugins/cache Or if you are only interested in the total: http://solr-box:8983/solr/core1/admin/mbeans?stats=true&key=searcher Thanks. Batches of 20-50 results are added to Solr a few times a minute, and a commit is done after each batch since I'm calling Solr as such: http://127.0.0.1:8983/solr/core/update/json?commit=true Should I remove commit=true and run a cron job to commit once per minute? Even better, it sounds like a job for CommitWithin: http://wiki.apache.org/solr/CommitWithin I'll look into that. Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
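The CommitWithin idea can be sketched like this: instead of commit=true on every batch, ask Solr to commit within a time window, and overlapping requests get folded into one commit. The document fields and the 60-second window are illustrative assumptions; only the request is built here, nothing is sent:

```python
import json
from urllib.parse import urlencode

# commitWithin is given in milliseconds; Solr guarantees a commit happens
# within that window rather than once per update request.
docs = [{"id": "1", "subject": "session management"}]
params = {"commitWithin": "60000"}
update_url = "http://127.0.0.1:8983/solr/core/update/json?" + urlencode(params)
payload = json.dumps(docs)
print(update_url)
print(payload)
```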
Re: Out of memory on some faceting queries
On Wed, Apr 3, 2013 at 10:11 AM, Toke Eskildsen t...@statsbiblioteket.dk wrote: However, once per day I would like to facet on the text field, which is a free-text field usually around 1 KiB (about 100 words), in order to determine what the top keywords / topics are. That query would take up to 200 seconds to run, [...] If that query is somehow part of your warming, then I am surprised that search has worked at all with your commit frequency. That would however explain your OOM if you have multiple warmups running at the same time. No, the 'heavy facet' is not part of the warming. I run it at most once per day, at the end of the day. Solr is not shut down daily. It sounds like TermsComponent would be a better fit for getting top topics: https://wiki.apache.org/solr/TermsComponent I had once looked at TermsComponent, but I think that I eliminated it as a possibility because I actually need the top keywords related to a specific keyword. For instance, I need to know which words are most commonly used with the word coffee. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
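The "words most commonly used with coffee" requirement is a co-occurrence count rather than a plain top-terms list, which is why TermsComponent alone does not fit. A toy sketch of the computation, over tiny made-up documents:

```python
from collections import Counter

# Count which words co-occur with a target term across documents.
docs = [
    "coffee with milk and sugar",
    "black coffee no sugar",
    "tea with milk",
]
target = "coffee"
co = Counter()
for doc in docs:
    words = set(doc.split())          # unique words per document
    if target in words:
        co.update(words - {target})   # credit every co-occurring word once
print(co.most_common(3))
```

In Solr terms this is closer to a filtered facet (fq on the target term, facet on the text field) than to TermsComponent, which ignores the query entirely.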
Re: Solr 4.2.0 results links
Thanks for the response. I found the issue. The data was being ingested correctly; it was just being echoed incorrectly. While inspecting the final HTML output I was able to find that the richtext-doc.vm file was used to display my data. The code in this file generated the links to local files. After some more research on Velocity coding and some trial and error, I now have my links displaying and working correctly. I'm still picking apart the example collections and Solr configs to suit my needs. I also ran into a heap memory issue, but that is more of a Java thing; I have adjusted the setting and am testing it out. Down the road I'd like to make the year a drop-down option, so that you only search the selected year and not the whole library, but that is a different topic and I need to do some more research. Again, thanks for the reply, ZeroEffect -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-4-2-0-results-links-tp4049788p4053420.html Sent from the Solr - User mailing list archive at Nabble.com.
Query parser cuts last letter from search term.
Hi, I have a strange problem with a Solr query. I added to my Solr index a new document with the word behave! inside the content. While trying to search for this document using the search term behave, it was impossible; only behave! returns a result. Additionally, the search debug returns the following information:

debug: {
  "rawquerystring": "behave",
  "querystring": "behave",
  "parsedquery": "allText:behav",
  "parsedquery_toString": "allText:behav",

Does anybody know how to deal with such a case? Below is my field type definition.

Field definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
</fieldType>

where characters.txt contains:

§ => ALPHA
$ => ALPHA
% => ALPHA
=> ALPHA
/ => ALPHA
( => ALPHA
) => ALPHA
= => ALPHA
? => ALPHA
+ => ALPHA
* => ALPHA
# => ALPHA
' => ALPHA
- => ALPHA
=> ALPHA
=> ALPHA

(some characters in this list were stripped by the archive)

-- View this message in context: http://lucene.472066.n3.nabble.com/Query-parser-cuts-last-letter-from-search-term-tp4053432.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: MoreLikeThis - Odd results - what am I doing wrong?
Thanks David - I suppose it is an AWS question, and thank you for the pointers. As a further input to the MLT question - it does seem that 3.6 behavior is different from 4.2 - the issue seems to be more in terms of the raw query that is generated. I will do some more research and report back with details. David Parks davidpark...@yahoo.com wrote: Isn't this an AWS security groups question? You should probably post this question on the AWS forums, but for the moment, here's the basic reading material - go set up your EC2 security groups and lock down your systems. http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html If you just want to password protect Solr, here are the instructions: http://wiki.apache.org/solr/SolrSecurity But I most certainly would not leave it open to the world even with a password (note that basic password authentication sends passwords in clear text if you're not using HTTPS; best lock the thing down behind a firewall). Dave -Original Message- From: DC tech [mailto:dctech1...@gmail.com] Sent: Tuesday, April 02, 2013 1:02 PM To: solr-user@lucene.apache.org Subject: Re: MoreLikeThis - Odd results - what am I doing wrong? OK - so I have my SOLR instance running on AWS. Any suggestions on how to safely share the link? Right now, the whole SOLR instance is totally open. Gagandeep singh gagan.g...@gmail.com wrote: Set debugQuery=true&mlt=true and see the scores for the MLT query, not a sample query. You can use Amazon EC2 to bring up your Solr; you should be able to get a micro instance for the free trial. On Mon, Apr 1, 2013 at 5:10 AM, dc tech dctech1...@gmail.com wrote: I did try the raw query against the *simi* field and those seem to return results in the order expected. For instance, Acura MDX has (large, SUV, 4WD, Luxury) in the simi field. Running a query with those words against the simi field returns the expected models (X5, Audi Q5, etc.) and then the subsequent documents have decreasing relevance.
So the basic query mechanism seems to be fine. The issue just seems to be with the MoreLikeThis component and handler. I can post the index on a public SOLR instance - any suggestions? (or for hosting) On Sun, Mar 31, 2013 at 1:54 PM, Gagandeep singh gagan.g...@gmail.com wrote: If you can bring up your Solr setup on a public machine then I'm sure a lot of debugging can be done. Without that, I think what you should look at are the tf-idf scores of the terms like camry etc. Usually idf is the deciding factor in which results show at the top (tf should be 1 for your data). Enable debugQuery=true and look at the explain section to see how the score is getting calculated. You should try giving different boosts to class, type, drive, and size to control the results. On Sun, Mar 31, 2013 at 8:52 PM, dc tech dctech1...@gmail.com wrote: I am running some experiments on MoreLikeThis and the results seem rather odd - I am doing something wrong but just cannot figure out what. Basically, the similarity results are decent - but not great. *Issue 1 = Quality* Toyota Camry: finds Altima (good) but then the next one is Camry Hybrid, whereas it should have found Accord. I have normalized the data into a simi field which has only the attributes that I care about. Without the simi field, I could not get mlt.qf boosts to work well enough to return results. *Issue 2* Some fields do not work at all. For instance, text+simi (in mlt.fl) works whereas just simi does not. So some weirdness that I am just not understanding. Would be grateful for your guidance! Here is the setup: *1. SOLR Version* solr-spec 4.2.0.2013.03.06.22.32.13 solr-impl 4.2.0 1453694 rmuir - 2013-03-06 22:32:13 lucene-spec 4.2.0 lucene-impl 4.2.0 1453694 - rmuir - 2013-03-06 22:25:29 *2. Machine Information* Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_23 19.0-b09) Windows 7 Home 64 Bit with 4 GB RAM *3. 
Sample Data * I created this 'dummy' data of cars - the idea being that these would be sufficient and simple to generate similarity and understand how it would work. There are 181 rows in the data set (I have attached it for reference in CSV format) [image: Inline image 1] *4. SCHEMA* *Field Definitions*

<field name="id" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="make" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="model" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="class" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="type" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field name="drive" type="string" indexed="true" stored="true" termVectors="true" multiValued="false"/>
<field
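For reference, an MLT request over this schema might be built as sketched below. The field names simi and text come from the thread; the boosts, the document id, and the mintf/mindf overrides are illustrative guesses (lowering mlt.mintf matters here because the data has tf = 1, and MLT's default minimum term frequency is higher):

```python
from urllib.parse import urlencode

# MoreLikeThis parameters; mlt.fl lists the fields to mine for
# "interesting" terms, mlt.qf weights them in the generated query.
params = {
    "q": "id:camry",                 # hypothetical seed document
    "mlt": "true",
    "mlt.fl": "simi,text",
    "mlt.qf": "simi^2.0 text^0.5",   # illustrative boosts
    "mlt.mintf": "1",
    "mlt.mindf": "1",
    "debugQuery": "true",            # see how the raw MLT query is built
}
mlt_url = "http://localhost:8983/solr/select?" + urlencode(params)
print(mlt_url)
```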
Re: Query parser cuts last letter from search term.
This is called 'stemming', and is caused by this: filter class=solr.SnowballPorterFilterFactory language=English/ It means that all of these terms would match: behave, behaving, behaved (and possibly more), because they would all stem down to 'behav'. This stemming happens at index time and at query time, so stemmed terms are stored in your index, and also, as you are seeing, stemming happens on your query terms. You can use the 'analysis' option in the admin interface to see what happens to terms at query/index time for your various field definitions. Upayavira On Wed, Apr 3, 2013, at 11:25 AM, vsl wrote: Hi, I have a strange problem with a Solr query. I added to my Solr index a new document with the word behave! inside the content. While trying to search for this document using the behave search term it was impossible. Only behave! returns a result. Additionally, the search debug returns the following information: debug: { rawquerystring: behave, querystring: behave, parsedquery: allText:behav, parsedquery_toString: allText:behav, Does anybody know how to deal with such a case? Below is my field type definition. 
Field definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
</fieldType>

where characters.txt maps § $ % / ( ) = ? + * # ' - (and a few characters stripped by the archive) to ALPHA.
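Upayavira's point - that behave, behaving, and behaved all collapse to the index term behav - can be illustrated with a toy stemmer. This is deliberately NOT the real Snowball algorithm, just a minimal suffix stripper showing why several surface forms share one index term:

```python
# Toy suffix stripping -- illustrative only, not Snowball/Porter.
def toy_stem(word):
    for suffix in ("ing", "ed", "e"):
        # require a reasonable stem length before stripping
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ("behave", "behaving", "behaved")])
```

Because the same stemmer runs at index and query time, the query term behave also becomes behav, which is exactly what the debug output's parsedquery shows.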
Re: Query parser cuts last letter from search term.
So why does Solr not return the proper document? -- View this message in context: http://lucene.472066.n3.nabble.com/Query-parser-cuts-last-letter-from-search-term-tp4053432p4053435.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Flow Chart of Solr
So, all in all, is there anybody who can write down just the main steps of Solr (including parsing, stemming, etc.)? 2013/4/2 Furkan KAMACI furkankam...@gmail.com I think about myself as an example. I have been doing research about Solr for just a few weeks. I have learned Solr and its related projects. My next step is writing down the main steps of Solr. We have separated the learning curve of Solr into two main categories. The first is for people using it as an out-of-the-box component. The second is the developer side, which actually branches into two. The first branch covers the general steps, i.e. a document comes into Solr (e.g. crawled data from Nutch), which analysis processes will be done (stemming etc.), and what happens after parsing, step by step; when a search query happens, what happens step by step, at which step scores are calculated, and so forth. The second branch is more code-specific, i.e. which handlers take in the data that is going to be indexed (no need to explain every handler at this step), which are the analyzer and tokenizer classes and what is the flow between them, and how response handlers work and what they are. Explaining the cloud side is separate work. Some explanations are currently present in the wiki (but some of them are in very deep places in the wiki and it is not easy to find their parent topic; maybe starting the wiki from a top page and branching out to all other topics from it would be better). If we could show the big picture, and beside it the smaller pictures within it, it would be great (if you know the main parts it will be easy to go deep into the code, i.e. you don't need to explain every handler; if you show the way, the developer could debug and find what is needed). Taking myself as an example: I have to write down the steps of Solr in some detail, and even though I have read many wiki pages and a book about it, I see that it is not easy even to write down the big picture of the developer side.
2013/4/2 Alexandre Rafalovitch arafa...@gmail.com Yago, My point - perhaps lost in too much text - was that Solr is presented - and can function - as a black box. Which makes it different from more traditional open-source projects. So, stage 2 happens exactly when the non-programmers have to cross the boundary from the black box into a code-first approach, and the hand-off is not particularly smooth. Or even when - say - a PHP or .NET programmer tries to get beyond the basic operations of their client library and has to understand the server-side aspects of Solr. Regards, Alex. On Tue, Apr 2, 2013 at 1:19 PM, Yago Riveiro yago.rive...@gmail.com wrote: Alexandre, You describe the normal path when a beginner tries to use a body of code they don't understand: black box, reading code, hacking, and "OK, now I know 10% of the project", with luck :p. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
Words being duplicated when highlighting with DictionaryCompoundWordTokenFilterFactory
I'm having issues with highlighting and DictionaryCompoundWordTokenFilterFactory in Solr 3.6.1/3.6.2. It's duplicating/adding words in the highlighted snippet. For example, my dictionary (Dutch) has the following words: premie, beter, ring. If I search for 'verbetering', results with 'verbeteringspremie' are correctly found, but highlighted as follows: Ver<highlight>beter</highlight><highlight>Verbetering</highlight>spremie. Words from the DictionaryCompoundWordTokenFilterFactory dictionary are added to the highlighted item, resulting in all kinds of gibberish. schema.xml: http://pastebin.com/SxGAg52N (the problem is happening for fields of type 'text') solrconfig.xml: http://pastebin.com/MUTkgZJq The only solution I can come up with at the moment is removing those words (beter, ring) from the dictionary (which disables compound-word searching on those words... which is unwanted). Any idea what this could be? I found someone else facing the exact same problem: http://stackoverflow.com/questions/13879349/solr-duplicating-words-in-highlighted-results - unfortunately, no workable solution has been given.
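The underlying mechanism can be sketched with a toy decomposer in the spirit of DictionaryCompoundWordTokenFilter (this is an approximation, not Lucene's algorithm): every dictionary word found inside the compound token is emitted as an extra token at the same position, and those overlapping subword tokens are what the highlighter then marks alongside the original word:

```python
# Toy compound decomposition: find every dictionary word embedded in a
# token. Each hit becomes an extra token overlapping the original.
dictionary = {"premie", "beter", "ring"}

def decompose(token, min_len=3):
    parts = []
    for i in range(len(token)):
        for j in range(i + min_len, len(token) + 1):
            if token[i:j] in dictionary:
                parts.append(token[i:j])
    return parts

print(decompose("verbeteringspremie"))
```

Here 'verbeteringspremie' yields beter, ring, and premie as extra tokens, which matches the subwords showing up inside the highlighted snippet.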
Solr ZooKeeper ensemble with HBase
Hi all, I have a running Hadoop + HBase cluster, and the HBase cluster is running its own ZooKeeper (HBase manages ZooKeeper). I would like to deploy my SolrCloud cluster on a portion of the machines in that cluster. My question is: should I expect any trouble / issues deploying an additional ZooKeeper ensemble? I don't want to use the HBase ZooKeeper because, well, first of all HBase manages it so I'm not sure it's possible, and second, I have HBase working pretty hard at times and I don't want to create any connection issues by overloading ZooKeeper. Thanks, Amit.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Clearing out its tlogs before starting it again may help. - Mark On Apr 2, 2013, at 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd. On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking this in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things? On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here, shard5-core2 is the instance we're having issues with... 
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com wrote: here is another one that looks interesting

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)

On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson jej2...@gmail.com wrote: Looking at the master it looks like at some point there were shards that went down. I am seeing things like what is below.

INFO: A cluster state change: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes, has occurred - updating... (live nodes size: 12)
Apr 2, 2013 8:12:52 PM org.apache.solr.common.cloud.ZkStateReader$3 process
INFO: Updating live nodes... (9)
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext runLeaderProcess
INFO: Running the leader process.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext shouldIBeLeader
INFO: Checking if I should try and be the leader.
Apr 2, 2013 8:12:52 PM org.apache.solr.cloud.ShardLeaderElectionContext
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
No, not that I know of, which is why I say we need to get to the bottom of it. - Mark On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote: Mark, is there a particular JIRA issue that you think may address this? I read through them quickly but didn't see one that jumped out. On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd. On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking this in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things? On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here, shard5-core2 is the instance we're having issues with... 
Re: Flow Chart of Solr
Sure, yes. But... it comes down to what level of detail you want and need for a specific task. In other words, there are probably a dozen or more levels of detail. The reality is that if you are going to work at the Solr code level, that is very, very different than being a user of Solr, and at that point your first step is to become familiar with the code itself. When you talk about parsing and stemming, you are really talking about the user level, not the Solr code level. Maybe what you really need is a cheat sheet that maps a user-visible feature to the main Solr code component that implements that user feature. There are a number of different forms of parsing in Solr - parsing of what? Queries? Requests? Solr documents? Function queries? Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does that. Lucene does all of the token filtering. Are you asking for details on how Lucene works? Maybe you meant to ask how term analysis works, which is split between Solr and Lucene. Or maybe you simply wanted to know when and where term analysis is done. Tell us your specific problem or specific question and we can probably quickly give you an answer. In truth, NOBODY uses flow charts anymore. Sure, there are some user-level diagrams, but not down to the code level. If you could focus on specific questions, we could give you specific answers. Main steps? That depends on what level you are working at. Tell us what problem you are trying to solve and we can point you to the relevant areas. In truth, if you become generally familiar with Solr at the user level (study the wikis), you will already know what the main steps are. So, it is not "main steps of Solr", but main steps of some specific request of Solr, at a specified level of detail, and for a specified area of Solr if greater detail is needed. Be more specific, and then we can be more specific. 
For now, the general advice for people who need or want to go far beyond the user level is to get familiar with the code - just LOOK at it - a lot of the package and class names are OBVIOUS, really, and follow the class hierarchy and code flow using the standard features of any modern Java IDE. If you are wondering where to start for some specific user-level feature, please ask specifically about that feature. But... make a diligent effort to discover and learn on your own before asking open-ended questions. Sure, there are lots of things in Lucene and Solr that are rather complex and seemingly convoluted, and not obvious, but people are more than willing to help you out if you simply ask a specific question. I mean, not everybody needs to know the fine detail of query parsing, analysis, building a Lucene-level stemmer, etc. If we tried to put all of that in a diagram, most people would be more confused than enlightened. At which step are scores calculated? That's more of a Lucene question. Or, are you really asking what code in Solr invokes Lucene search methods that calculate basic scores? In short, you need to be more specific. Don't force us to guess what problem you are trying to solve. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, April 03, 2013 6:52 AM To: solr-user@lucene.apache.org Subject: Re: Flow Chart of Solr
Re: Query parser cuts last letter from search term.
The standard tokenizer recognizes ! as a punctuation character, so it will be treated as white space. You could use the white space tokenizer if punctuation is considered significant.

-- Jack Krupansky

-----Original Message----- From: vsl Sent: Wednesday, April 03, 2013 6:25 AM To: solr-user@lucene.apache.org Subject: Query parser cuts last letter from search term.

Hi, I have a strange problem with a Solr query. I added to my Solr index a new document with the word behave! inside the content. While I was trying to find this document using the search term behave, it was impossible. Only behave! returns a result. Additionally, the search debug returns the following information:

debug: {
  rawquerystring: behave,
  querystring: behave,
  parsedquery: allText:behav,
  parsedquery_toString: allText:behav,

Does anybody know how to deal with such a case? Below is my field type definition:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="characters.txt"/>
  </analyzer>
</fieldType>

where characters.txt contains:

§ = ALPHA
$ = ALPHA
% = ALPHA
= ALPHA
/ = ALPHA
( = ALPHA
) = ALPHA
= = ALPHA
? = ALPHA
+ = ALPHA
* = ALPHA
# = ALPHA
' = ALPHA
- = ALPHA
= ALPHA
= ALPHA

-- View this message in context: http://lucene.472066.n3.nabble.com/Query-parser-cuts-last-letter-from-search-term-tp4053432.html Sent from the Solr - User mailing list archive at Nabble.com.
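Following Jack's suggestion, a sketch of an alternative field type that keeps punctuation such as "!" by tokenizing only on whitespace (the type name here is made up for illustration; any change to an index-time analyzer requires reindexing):

```xml
<!-- Hypothetical field type: tokens are split only on whitespace, so
     "behave!" is indexed as the literal token "behave!". -->
<fieldType name="text_ws_keep_punct" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that with this type a query for behave still will not match behave! as an exact token; the usual compromise is to keep the stemmed field for recall and add a second field like this one for punctuation-sensitive matching.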
RE: Confusion over Solr highlight hl.q parameter
Thank you for the response; unfortunately I'm still getting no highlighting hits for this query: ...hl.q={!dismax}text_it_IT:l'assieme...

-----Original Message----- From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] Sent: Tuesday, April 02, 2013 9:00 PM To: solr-user@lucene.apache.org Subject: Re: Confusion over Solr highlight hl.q parameter

(13/04/03 5:27), Van Tassell, Kristian wrote: Thanks Koji, this helped with some of our problems, but it is still not perfect. This query, for example, returns no highlighting: ?q=id:abc123&hl.q=text_it_IT:l'assieme&hl.fl=text_it_IT&hl=true&defType=edismax But this one does (when it is, in effect, the same query): ?q=text_it_IT:l'assieme&hl=true&defType=edismax&hl.fl=text_it_IT I've tried many combinations but can't seem to get the right one to work. Is this possibly a bug?

As hl.q doesn't honor the defType parameter but does honor localParams, can you try putting {!edismax} in the hl.q parameter? koji -- http://soleami.com/blog/lucene-4-is-super-convenient-for-developing-nlp-tools.html
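For reference, Koji's suggestion spelled out as a full request would look roughly like this (note {!edismax}, not the {!dismax} tried above; field name and query value are the ones from the thread, host and path are assumed):

```
http://localhost:8983/solr/select?q=id:abc123&hl=true&hl.fl=text_it_IT&hl.q={!edismax}text_it_IT:l'assieme
```

In a real request the hl.q value should be URL-encoded ({ and } become %7B and %7D).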
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0, and then I'll try on 4.2.1.

On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.com wrote: No, not that I know of, which is why I say we need to get to the bottom of it. - Mark

On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote: Mark, is there a particular JIRA issue that you think may address this? I read through it quickly but didn't see one that jumped out.

On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd.

On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark

On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry, I didn't ask the obvious question. Is there anything else that I should be looking for here, and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also, what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let Solr resync things?

On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here, shard5-core2 is the instance we're having issues with...
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com wrote: here is another one that looks interesting

Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:228)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:339)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1797)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:637)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:343)

On Tue, Apr 2, 2013 at 5:41 PM, Jamie Johnson jej2...@gmail.com wrote: Looking at the master, it looks like at some point there were shards that went down. I am seeing things like what is below. INFO: A cluster state change:
Re: is there a way we can build spell dictionary from solr index such that it only takes words, leaving all special characters
hi upayavira, you mean to say that I don't have to follow this: http://wiki.apache.org/solr/SpellCheckComponent and I can directly create a spell check field from a copyField and use it... I don't have to build a dictionary on the field, just use the copyField for spell suggestions? thanks regards Rohan

On Wed, Mar 13, 2013 at 12:56 PM, Upayavira u...@odoko.co.uk wrote: Use text analysis and copyField to create a new field that has terms as you expect them. Then use that for your spellcheck dictionary. Note, since 4.0, you don't need to create a dictionary. Solr can use your index directly. Upayavira

On Wed, Mar 13, 2013, at 06:00 AM, Rohan Thakur wrote: while building the spell dictionary...

On Wed, Mar 13, 2013 at 11:29 AM, Rohan Thakur rohan.i...@gmail.com wrote: I also do not want to break the words, as in samsung to s a m s u n g, or sII to s II, or s2 to s 2.

On Wed, Mar 13, 2013 at 11:28 AM, Rohan Thakur rohan.i...@gmail.com wrote: OK, as in, the field I am indexing from the database, like title, has characters like () - # /n//. Example: Screenguard for Samsung Galaxy SII (Matt and Gloss) (with Dual Protection, Cleaning Cloth and Bubble Remover) or samsung-galaxy-sii-screenguard-matt-and-gloss.html or /s/a/samsung_galaxy_sii_i9100_pink_.jpg or 4.27-inch Touchscreen, 3G, Android v2.3 OS, 8MP Camera with LED Flash. Now I want the spell dictionary to include only the words, and none of - , _ . ( ) /s/a/ or numerics like 4.27. How can I do that? thanks regards Rohan

On Tue, Mar 12, 2013 at 11:06 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Sorry, leaving them where? Can you give a concrete example or problem. Regards, Alex

On Mar 12, 2013 1:31 PM, Rohan Thakur rohan.i...@gmail.com wrote: hi all, I wanted to know: is there a way we can make a spell dictionary from the Solr index such that it only takes words from the index, leaving out all the special and unwanted characters? thanks regards Rohan
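To make Upayavira's suggestion concrete, a schema sketch along these lines could feed the spellchecker only clean words (all field and type names here are hypothetical, and the exact filter choices depend on the data):

```xml
<!-- Hypothetical spellcheck source field: copy the title in, then strip
     everything that is not a plain lowercase word before it is indexed. -->
<field name="spell" type="text_spell" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="spell"/>

<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- drop non-letter characters entirely, so "4.27-inch" contributes "inch" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^a-z]" replacement="" replace="all"/>
    <!-- discard tokens that became empty or too short after stripping -->
    <filter class="solr.LengthFilterFactory" min="2" max="50"/>
  </analyzer>
</fieldType>
```

The spellcheck component would then point at this field (e.g. via its field/dictionary configuration in solrconfig.xml); since Solr 4.0 it can read the field's indexed terms directly instead of a separately built dictionary.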
Re: Solr metrics in Codahale metrics and Graphite?
On 3/29/2013 12:07 PM, Walter Underwood wrote: What are folks using for this? I don't know that this really answers your question, but Solr 4.1 and later includes a big chunk of codahale metrics internally for request handler statistics - see SOLR-1972. First we tried including the jar and using the API, but that created thread leak problems, so the source code was added. Thanks, Shawn
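As a sketch of one common approach for getting these numbers into Graphite without touching Solr internals: poll the admin mbeans handler (which exposes the request-handler statistics Shawn mentions) and push the values from an external script. The core name below is an assumption:

```
http://localhost:8983/solr/collection1/admin/mbeans?stats=true&wt=json
```

The JSON response includes per-handler stats (requests, errors, average time per request, and so on) that map naturally onto Graphite metrics.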
Re: Synonyms problem
On 3/29/2013 12:14 PM, Plamen Mihaylov wrote: Can I ask you another question: I have Magento + Solr and have a requirement to create an admin magento module, where I can add/remove synonyms dynamically. Is this possible? I searched google but it seems not possible. If you change the synonym list that you are using in your index analyzer chain, you must rebuild your entire index. If you don't, the updated synonyms will only affect newly added records. This is because the index analyzer is only applied at index time. Thanks, Shawn
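One detail worth adding to Shawn's point: the reindex requirement applies to index-time synonyms. If the synonym filter sits only in the query analyzer, an edited synonyms.txt takes effect after a core reload with no reindex, at the cost of the known weaknesses of query-time multi-word synonyms. The filter line itself is standard:

```xml
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

For an admin module that edits synonyms dynamically, query-time synonyms plus a reload request is therefore the simpler integration; index-time synonyms give better scoring and phrase behaviour but force a full rebuild on every change.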
Question on Exact Matches - edismax
Hi All, I have a requirement wherein exact matches on 2 fields (Series Title, Title) should be ranked higher than partial matches. The configuration looks like below:

<requestHandler name="assetdismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">pg_series_title_ci^500 title_ci^300 pg_series_title^200 title^25 classifications^15 classifications_texts^15 parent_classifications^10 synonym_classifications^5 pg_brand_title^5 pg_series_working_title^5 p_programme_title^5 p_item_title^5 p_interstitial_title^5 description^15 pg_series_description annotations^0.1 classification_notes^0.05 pv_program_version_number^2 pv_program_version_number_ci^2 pv_program_number^2 pv_program_number_ci^2 p_program_number^2 ma_version_number^2 ma_recording_location ma_contributions^0.001 rel_pg_series_title rel_programme_title rel_programme_number rel_programme_number_ci pg_uuid^0.5 p_uuid^0.5 pv_uuid^0.5 ma_uuid^0.5</str>
    <str name="pf">pg_series_title_ci^500 title_ci^500</str>
    <int name="ps">0</int>
    <str name="q.alt">*:*</str>
    <str name="mm">100%</str>
    <str name="q.op">AND</str>
    <str name="facet">true</str>
    <str name="facet.limit">-1</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>

As you can see above, the search is against many fields. What I want is that documents with exact matches on the series title and title fields should rank higher than the rest. I have added 2 case-insensitive fields (pg_series_title_ci, title_ci) for series title and title, and have boosted them higher than the tokenized fields and the rest. I have also implemented a similarity class to override idf; however, I still get documents having partial matches in title and other fields ranking higher than an exact match in pg_series_title_ci. Many Thanks, Sandeep
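For comparison, the usual way to build a "_ci" exact-match field like the ones boosted above is a copyField into a type whose tokenizer emits the whole field value as a single lowercased token, so only a full-field match can hit it. The field names below mirror the thread, but the definition itself is a sketch, not the poster's actual schema:

```xml
<!-- Sketch: whole-value, case-insensitive matching. A query only matches
     title_ci when it equals the entire original title (ignoring case). -->
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_ci" type="string_ci" indexed="true" stored="false"/>
<copyField source="title" dest="title_ci"/>
```

With this in place, debugQuery=on against the edismax handler shows whether the _ci clause actually fired for a given query; if it never matches, the ^500 boost has nothing to boost, which is a common reason partial matches still win.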
Re: solre scores remains same for exact match and nearly exact match
Thanks. I added a copy field and that fixed the issue.

On Wed, Apr 3, 2013 at 12:29 PM, Gora Mohanty-3 [via Lucene] ml-node+s472066n4053412...@n3.nabble.com wrote: On 3 April 2013 10:52, amit wrote: Below is my query http://localhost:8983/solr/select/?q=subject:session management in phpfq=category:[*%20TO%20*]fl=category,score,subject [...] Add debugQuery=on to your Solr URL, and you will get an explanation of the score. Your subject field is tokenised, so that there is no a priori reason that an exact match should score higher. Several strategies are available if you want that behaviour. Try searching Google, e.g., for solr exact match higher score. Regards, Gora

-- View this message in context: http://lucene.472066.n3.nabble.com/solre-scores-remains-same-for-exact-match-and-nearly-exact-match-tp4053406p4053478.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1.

On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote: [earlier messages in this thread quoted; trimmed]
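To hunt for duplicate keys, one option (assuming the unique key field is named key, as in the message above) is a facet request that only returns values occurring more than once:

```
/solr/select?q=*:*&rows=0&facet=true&facet.field=key&facet.mincount=2&facet.limit=-1
```

In SolrCloud this facets across all shards, so a document that landed on two different shards under the same key should show up with a count of 2 even though a per-shard facet would report 1.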
Re: Solr ZooKeeper ensemble with HBase
Hello, Amit: My guess is that, if HBase is working hard, you're going to have more trouble with HBase and Solr on the same nodes than HBase and Solr sharing a Zookeeper. Solr's usage of Zookeeper is very minimal. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote: Hi all, I have a running Hadoop + HBase cluster and the HBase cluster is running it's own zookeeper (HBase manages zookeeper). I would like to deploy my SolrCloud cluster on a portion of the machines on that cluster. My question is: Should I have any trouble / issues deploying an additional ZooKeeper ensemble ? I don't want to use the HBase ZooKeeper because, well first of all HBase manages it so I'm not sure it's possible and second I have HBase working pretty hard at times and I don't want to create any connection issues by overloading ZooKeeper. Thanks, Amit.
Re: Flow Chart of Solr
There are three books on Solr, two with that in the title, and one, Taming Text, each of which has been very valuable in understanding Solr. Jack

On Wed, Apr 3, 2013 at 5:25 AM, Jack Krupansky j...@basetechnology.com wrote: [Jack's earlier reply in this thread quoted; trimmed]
Re: Solr ZooKeeper ensemble with HBase
Trouble in what way? If I have enough memory - HBase RegionServer 10GB and maybe 2GB for Solr? - or do you mean CPU / disk?

On Wed, Apr 3, 2013 at 5:54 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: [earlier reply quoted; trimmed]
Re: Solr ZooKeeper ensemble with HBase
Solr heavily uses RAM for disk caching, so depending on your index size and what you intend to do with it, 2 GB could easily not be enough. We run with 6 GB heaps on 34 GB boxes, and the remaining RAM is there solely to act as a disk cache. We're on EC2, though, so unless you're using the SSD instances, the disks are slow. Might not be a problem for you. Also, things like faceting and sorting can heavily hit the CPU. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game

On Wed, Apr 3, 2013 at 11:55 AM, Amit Sela am...@infolinks.com wrote: [earlier messages in this thread quoted; trimmed]
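For what it's worth, running a second ZooKeeper ensemble alongside HBase's managed one mainly requires giving it its own dataDir and a clientPort that doesn't clash with HBase's (usually 2181). A zoo.cfg sketch for a 3-node ensemble, with hostnames, ports, and paths as examples only:

```
# zoo.cfg sketch for a standalone 3-node ensemble dedicated to SolrCloud.
# All values below are illustrative, not recommendations.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper-solr
clientPort=2182   # non-default, to avoid colliding with HBase's ZK on 2181
server.1=zk1.example.com:2889:3889
server.2=zk2.example.com:2889:3889
server.3=zk3.example.com:2889:3889
```

Solr would then be started with -DzkHost=zk1.example.com:2182,zk2.example.com:2182,zk3.example.com:2182 pointing at this ensemble rather than HBase's.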
Re: Upgrade Solr3.5 to Solr4.1 - Index Reformat ?
On 4/1/2013 12:19 PM, feroz_kh wrote: Hi Shawn, I tried optimizing using this command... curl 'http://localhost:/solr/update?optimize=true&maxSegments=10&waitFlush=true' And I got this response within seconds...

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">840</int></lst>
</response>

Is this a valid response that one should get? I checked the statistics link from the /solr/admin page and it shows the number of segments got updated. Would this be a good indication that the optimization is complete? At the same time, I noticed the number of files in the data/index directory hasn't reduced, and not all files are updated. Since it took just a couple of seconds for the response (even with waitFlush=true), I doubt the optimization really happened, but the details on the statistics page show me the correct number of segments.

That looks like a valid success response. An optimize in Solr defaults to one segment. You asked it to do ten segments. Either you already had fewer than 10 segments, or it was able to find some very small segments to merge in order to get below 10. When you are optimizing in order to upgrade the index format, you should leave maxSegments off or set it to 1. Thanks, Shawn
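Per Shawn's advice, forcing the whole index to be rewritten into a single segment (and hence the new format) would look something like the following; the host, port, and core path are assumptions, and expect this to take far longer than a couple of seconds on a large index:

```
curl 'http://localhost:8983/solr/update?optimize=true&maxSegments=1&waitFlush=true'
```

Afterwards the data/index directory should shrink to one segment's worth of files (old files linger until the searcher releases them).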
Re: Lengthy description is converted to hash symbols
Yes... the str.. / is what I see in the admin console when I perform a search for the document. Currently, I am using SolrJ and the addBean() method to update the core. What's strange is that in our QA env the document indexed correctly, but in prod I see hash symbols, and thus any user search against that field fails to find the document. Btw, I see no errors in the logs! -- View this message in context: http://lucene.472066.n3.nabble.com/Lengthy-description-is-converted-to-hash-symbols-tp4053338p4053505.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Flow Chart of Solr
And another one on the way: http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957 Hopefully that help a lot as well. Plenty of diagrams. Lots of examples. -- Jack Krupansky -Original Message- From: Jack Park Sent: Wednesday, April 03, 2013 11:25 AM To: solr-user@lucene.apache.org Subject: Re: Flow Chart of Solr There are three books on Solr, two with that in the title, and one, Taming Text, each of which have been very valuable in understanding Solr. Jack On Wed, Apr 3, 2013 at 5:25 AM, Jack Krupansky j...@basetechnology.com wrote: Sure, yes. But... it comes down to what level of detail you want and need for a specific task. In other words, there are probably a dozen or more levels of detail. The reality is that if you are going to work at the Solr code level, that is very, very different than being a user of Solr, and at that point your first step is to become familiar with the code itself. When you talk about parsing and stemming, you are really talking about the user-level, not the Solr code level. Maybe what you really need is a cheat sheet that maps a user-visible feature to the main Solr code component for that implements that user feature. There are a number of different forms of parsing in Solr - parsing of what? Queries? Requests? Solr documents? Function queries? Stemming? Well, in truth, Solr doesn't even do stemming - Lucene does that. Lucene does all of the token filtering. Are you asking for details on how Lucene works? Maybe you meant to ask how term analysis works, which is split between Solr and Lucene. Or maybe you simply wanted to know when and where term analysis is done. Tell us your specific problem or specific question and we can probably quickly give you an answer. In truth, NOBODY uses flow charts anymore. Sure, there are some user-level diagrams, but not down to the code level. If you could focus on specific questions, we could give you specific answers. Main steps? That depends on what level you are working at. 
Tell us what problem you are trying to solve and we can point you to the relevant areas. In truth, if you become generally familiar with Solr at the user level (study the wikis), you will already know what the main steps are. So, it is not the main steps of Solr, but the main steps of some specific request to Solr, for a specified level of detail, and for a specified area of Solr if greater detail is needed. Be more specific, and then we can be more specific. For now, the general advice for people who need or want to go far beyond the user level is to get familiar with the code - just LOOK at it - a lot of the package and class names are OBVIOUS, really - and follow the class hierarchy and code flow using the standard features of any modern Java IDE. If you are wondering where to start for some specific user-level feature, please ask specifically about that feature. But... make a diligent effort to discover and learn on your own before asking open-ended questions. Sure, there are lots of things in Lucene and Solr that are rather complex and seemingly convoluted, and not obvious, but people are more than willing to help you out if you simply ask a specific question. I mean, not everybody needs to know the fine detail of query parsing, analysis, building a Lucene-level stemmer, etc. If we tried to put all of that in a diagram, most people would be more confused than enlightened. At which step are scores calculated? That's more of a Lucene question. Or, are you really asking what code in Solr invokes the Lucene search methods that calculate basic scores? In short, you need to be more specific. Don't force us to guess what problem you are trying to solve. -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, April 03, 2013 6:52 AM To: solr-user@lucene.apache.org Subject: Re: Flow Chart of Solr So, all in all, is there anybody who can write down just the main steps of Solr (including parsing, stemming, etc.)? 
2013/4/2 Furkan KAMACI furkankam...@gmail.com I think about myself as an example. I have started to do research about Solr just for some weeks. I have learned Solr and its related projects. My next step is writing down the main steps of Solr. We have separated the learning curve of Solr into two main categories. The first one is for those who are using it as out-of-the-box components. The second one is the developer side. Actually, the developer side branches into two ways. The first one is the general steps of it, i.e. a document comes into Solr (i.e. crawled data of Nutch), which analyzing processes are going to be done (stemming, hamming, etc.), and what will be done after parsing, step by step. When a search query happens, what happens step by step, at which step scores are calculated, and so on and so forth. The second one is more code specific, i.e. which handlers take into account the data that is going to be indexed (no need to explain every handler at this step). Which are the analyzer, tokenizer classes and what are the
Re: Lengthy description is converted to hash symbols
Show us the exact query URL as well as the request handler defaults. Make sure to try to do an explicit query on the field that has the # value. QA and prod may differ because maybe QA got completely reindexed more recently and maybe prod hasn't been fully reindexed recently. Maybe the schema changed but a full reindex wasn't done. -- Jack Krupansky -Original Message- From: Danny Watari Sent: Wednesday, April 03, 2013 12:15 PM To: solr-user@lucene.apache.org Subject: Re: Lengthy description is converted to hash symbols Yes... the str.. / is what I see in the admin console when I perform a search for the document. Currently, I am using solrj and the addBean() method to update the core. What's strange is that in our QA env the document indexed correctly, but in prod I see hash symbols, and thus any user search against that field fails to find the document. Btw, I see no errors in the logs! -- View this message in context: http://lucene.472066.n3.nabble.com/Lengthy-description-is-converted-to-hash-symbols-tp4053338p4053505.html Sent from the Solr - User mailing list archive at Nabble.com.
SolrCloud not distributing documents across shards
So we have 3 servers in a SolrCloud cluster. http://lucene.472066.n3.nabble.com/file/n4053506/Cloud1.png We have 2 shards for our collection (classic_bt), with a shard on each of the first two servers as the picture shows. The third server has replicas of the first 2 shards, just for high availability purposes. Now if we go into counts we have the following information: shard1 - Numdocs - 33010 shard2 - Numdocs - 85934 Both shards replicate to the third server with no issues. For some reason the documents aren't being distributed across the shards; nothing in the logs indicates a problem, but I'm not sure what we should be looking for. Let me know if you need more information. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506.html
Re: Filtering Search Cloud
On 4/1/2013 3:02 PM, Furkan KAMACI wrote: I want to separate my cloud into two logical parts. One of them is the indexer cloud of SolrCloud. The second one is the searcher cloud of SolrCloud. My first question is this: does separating my cloud system make sense for performance improvement? I think that when indexing, searching takes longer to respond, and if I separate them I get a performance improvement. On the other hand, maybe using all Solr machines as a whole (I mean not partitioning as I mentioned) lets SolrCloud do better load balancing; I would want to learn that. My second question is this: let's assume that I have separated my machines as I mentioned. Can I filter some indexes to be searchable or not from the searcher SolrCloud? SolrCloud gets rid of the master and slave designations. It also gets rid of the line between indexing and querying. Each shard has a replica that is designated the leader, but that has no real impact on searching and indexing, only on deciding which data to use when replicas get out of sync. In the old master-slave architecture, you indexed to the master and the updated index files were replicated to the slave. The slave did not handle the analysis for indexing, so it was usually better to send queries to slaves and let the master only do indexing. SolrCloud is very different. When you index, the documents are indexed on all replicas at about the same time. When you query, the requests are load balanced across all replicas. During normal operation, SolrCloud does not use replication at all. The replication feature is only used when a replica gets out of sync with the leader, and in that case, the entire index is replicated. Thanks, Shawn
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards, and a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the live nodes? I know that we don't specify the numShards param @ startup, so could this be what is happening? grep -c 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de * shard1-core1:0 shard1-core2:0 shard2-core1:0 shard2-core2:0 shard3-core1:1 shard3-core2:1 shard4-core1:0 shard4-core2:0 shard5-core1:1 shard5-core2:1 shard6-core1:0 shard6-core2:0 On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson jej2...@gmail.com wrote: Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up so I started the indexing again and indexed another 400,000 and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1. On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote: Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0 and then I'll try on 4.2.1 On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.com wrote: No, not that I know of, which is why I say we need to get to the bottom of it. - Mark On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote: Mark Is there a particular JIRA issue that you think may address this? I read through it quickly but didn't see one that jumped out On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. 
I will save off the logs and see if there is anything else odd On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry I didn't ask the obvious question. Is there anything else that I should be looking for here and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also what is the most appropriate mechanism to fix this. Is it required to kill the index that is out of sync and let solr resync things? On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here shard5-core2 is the instance we're having issues with... 
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException: Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332)
    at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
On Tue, Apr 2, 2013 at 5:43 PM, Jamie Johnson jej2...@gmail.com wrote: here is another one that looks interesting
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ClusterState says we are the leader, but locally we don't think so
    at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:293)
    at
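The duplicate-key check described above (export the keys for each shard, then compare) doesn't need a custom Java program; a minimal shell sketch, assuming one exported key file per leader core (the shardN-coreN.txt file names are hypothetical, matching the grep output above):

```shell
# Hypothetical sketch: each file holds one document key per line,
# exported from one shard's leader core. A key printed by find_dups
# appears in more than one shard, i.e. it was routed to the wrong place.
find_dups() {
  sort "$@" | uniq -d
}
# Usage (file names are assumptions):
# find_dups shard1-core1.txt shard2-core1.txt shard3-core1.txt \
#           shard4-core1.txt shard5-core1.txt shard6-core1.txt
```

Comparing only the leader core of each shard avoids false positives from the replica copies, which legitimately hold the same keys as their leader.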
SolrException: Error opening new searcher
We're suddenly seeing an error when trying to do updates/commits. This is on Solr 4.2 (Tomcat, solr war deployed to webapps, on Linux SuSE 11). Based off of some initial searching on things related to this issue, I have set ulimit in Linux to 'unlimited' and verified that Tomcat has enough memory for the virtual memory needed to run the Solr index (which is 1.1GB in size). Does anyone have any ideas?

11:25:41 SEVERE UpdateLog Error opening realtime searcher for deleteByQuery: org.apache.solr.common.SolrException: Error opening new searcher
11:25:39 SEVERE UpdateLog Replay exception: final commit. java.io.IOException: Map failed
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:761)
    at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
    at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:228)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:195)
    at org.apache.lucene.codecs.lucene41.Lucene41PostingsReader.<init>(Lucene41PostingsReader.java:81)
    at org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat.fieldsProducer(Lucene41PostingsFormat.java:430)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init>(PerFieldPostingsFormat.java:194)
    at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer(PerFieldPostingsFormat.java:233)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:127)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:56)
    at org.apache.lucene.index.ReadersAndLiveDocs.getReader(ReadersAndLiveDocs.java:121)
    at org.apache.lucene.index.BufferedDeletesStream.applyDeletes(BufferedDeletesStream.java:269)
    at org.apache.lucene.index.IndexWriter.applyAllDeletes(IndexWriter.java:2961)
    at org.apache.lucene.index.IndexWriter.maybeApplyDeletes(IndexWriter.java:2952)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2692)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2827)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:2807)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:541)
    at org.apache.solr.update.UpdateLog$LogReplayer.doReplay(UpdateLog.java:1341)
    at org.apache.solr.update.UpdateLog$LogReplayer.run(UpdateLog.java:1160)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.OutOfMemoryError: Map failed
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:758)
    ... 28 more

SolrConfig:

<query>
  <useColdSearcher>true</useColdSearcher>
  <maxBooleanClauses>1024</maxBooleanClauses>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultWindowSize>20</queryResultWindowSize>
  <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
  <maxWarmingSearchers>6</maxWarmingSearchers>
</query>
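Since "Map failed" comes from mmap rather than the Java heap, two OS-level limits are worth checking beyond Tomcat's memory settings; a sketch (standard Linux paths, and the 262144 value is just a commonly suggested starting point, not a recommendation):

```shell
# "java.io.IOException: Map failed" from MMapDirectory usually means the OS
# refused a new memory mapping, not that the Java heap is exhausted.

# Per-process virtual address space limit; MMapDirectory wants "unlimited".
ulimit -v

# Linux kernel cap on distinct mmap regions per process; many segments plus
# frequent searcher reopens can exhaust it. Raise it with something like:
#   sysctl -w vm.max_map_count=262144
cat /proc/sys/vm/max_map_count 2>/dev/null || echo "not available"
```

Note that these must be checked in the environment of the Tomcat process itself (e.g. its init script), since a ulimit set in an interactive shell does not necessarily apply to the service.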
Re: Lengthy description is converted to hash symbols
I looked at the text via the admin analysis tool. The text appeared to be ok! Unfortunately, the description is client data... so I can't post it here, but I do not see any issues when running the analysis tool. -- View this message in context: http://lucene.472066.n3.nabble.com/Lengthy-description-is-converted-to-hash-symbols-tp4053338p4053526.html
Re: Solr ZooKeeper ensemble with HBase
It will be limited by disk IO until you get the caches full. Then it will be limited by CPU. wunder On Apr 3, 2013, at 8:55 AM, Amit Sela am...@infolinks.com wrote: Trouble in what way? If I have enough memory - HBase RegionServer 10GB and maybe 2GB for Solr? - or do you mean CPU / disk? On Wed, Apr 3, 2013 at 5:54 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: Hello, Amit: My guess is that, if HBase is working hard, you're going to have more trouble with HBase and Solr on the same nodes than with HBase and Solr sharing a Zookeeper. Solr's usage of Zookeeper is very minimal. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 8:06 AM, Amit Sela am...@infolinks.com wrote: Hi all, I have a running Hadoop + HBase cluster, and the HBase cluster is running its own ZooKeeper (HBase manages ZooKeeper). I would like to deploy my SolrCloud cluster on a portion of the machines in that cluster. My question is: should I have any trouble / issues deploying an additional ZooKeeper ensemble? I don't want to use the HBase ZooKeeper because, first of all, HBase manages it so I'm not sure it's possible, and second, I have HBase working pretty hard at times and I don't want to create any connection issues by overloading ZooKeeper. Thanks, Amit.
Re: maxWarmingSearchers in Solr 4.
On 4/3/2013 1:48 AM, Dotan Cohen wrote: I have been dragging the same solrconfig.xml from Solr 3.x to 4.0 to 4.1, with no customization (bad, bad me!). I'm now looking into customizing it and I see that the Solr 4.1 solrconfig.xml is much simpler and shorter. Is this simply because many of the examples have been removed? In particular, I notice that there is no mention of maxWarmingSearchers in the Solr 4.1 solrconfig.xml. I assume that I can simply add it in, are there any other critical config options that are missing that I should be looking into as well? Would I be better off using the old Solr 3.x solrconfig.xml in Solr 4.1 as it contains so many examples? In situations where I don't want to change the default value, I prefer to leave config elements out of the solrconfig. It makes the config smaller, and it also makes it so that I will automatically see benefits from the default changing in new versions. In the case of maxWarmingSearchers, I would hope that you have your system set up so that you would never need more than 1 warming searcher at a time. If you do a commit while a previous commit is still warming, Solr will try to create a second warming searcher. I went poking in the code, and it seems that maxWarmingSearchers defaults to Integer.MAX_VALUE. I'm not sure whether this is a bad default or not. It does mean that a pathological setup without maxWarmingSearchers in the config will probably blow up with an OutOfMemory exception, but is that better or worse than commits that don't make new documents searchable? I can see arguments either way. Thanks, Shawn
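For reference, putting the setting back is a single top-level element inside solrconfig.xml; a minimal sketch (the value 2 here is only the value the old example configs shipped with, not a recommendation):

```xml
<!-- Caps how many searchers may be warming concurrently. Commits that would
     exceed the cap fail with a "maxWarmingSearchers exceeded" error instead
     of letting warming searchers pile up and exhaust memory. -->
<maxWarmingSearchers>2</maxWarmingSearchers>
```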
Re: Flow Chart of Solr
Jack, Is that new book up to the 4.+ series? Thanks The other Jack On Wed, Apr 3, 2013 at 9:19 AM, Jack Krupansky j...@basetechnology.com wrote: And another one on the way: http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957 Hopefully that helps a lot as well. Plenty of diagrams. Lots of examples. [...]
Re: Flow Chart of Solr
We're using the 4.x branch code as the basis for our writing. So, effectively it will be for at least 4.3 when the book comes out in the summer. Early access will be in about a month or so. O'Reilly will be showing a galley proof for 200 pages of the book next week at Big Data TechCon in Boston. -- Jack Krupansky -Original Message- From: Jack Park Sent: Wednesday, April 03, 2013 12:56 PM To: solr-user@lucene.apache.org Subject: Re: Flow Chart of Solr Jack, Is that new book up to the 4.+ series? Thanks The other Jack [...]
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
no, my thought was wrong; it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson jej2...@gmail.com wrote: Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards, a grep of the files for the keys found does indicate that they made it to the wrong places. [...]
Re: Query parser cuts last letter from search term.
On Wed, Apr 3, 2013, at 11:36 AM, vsl wrote: So why Solr does not return proper document? You're gonna have to give us a bit more than that. What is wrong with the documents it is returning? Upayavira
Re: Solr Multiword Search
I have been trying to use the MultiWordSpellingQueryConverter.java since I need to be able to find the documents that correspond to the suggested collations. At the moment it seems to be producing collations based on word matches, and arbitrary words from the field are picked up to form a collation, so nothing corresponds to any of the titles in our set of indexed documents. Could anyone please confirm that this would work if I took the following steps? Steps: 1. Get the solr4.2.war file. 2. Get to the WEB-INF lib and add the lucene-core-4.2.0.jar and the solr-core-4.2.0.jar to the classpath to compile the MultiWordSpellingQueryConverter.java. The code for this is in my previous post in this thread. 3. jar cvf multiwordspellchecker.jar com/foo/MultiWordSpellingQueryConverter.java 4. Copy this jar to the $SOLR_HOME/lib directory. 5. Define queryConverter. Question: Where does this need to go? I have just put this somewhere between the searchComponent and the requestHandler for spell checks. 6. Start webserver. I see this jar file getting registered at startup: 2013-04-03 12:56:22,243 INFO [org.apache.solr.core.SolrResourceLoader] (coreLoadExecutor-3-thread-1) Adding 'file:/solr/lib/multiwordspellchecker.jar' to classloader 7. When I run the spell query, I don't see my print statements, so I am not sure if this code is really being called. I don't think it's the logging that is failing; rather, the code is not being called at all. I would appreciate any information on what I might be doing wrong. Please help. Thanks. Regards, -- Sandeep -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Multiword-Search-tp4053038p4053534.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Out of memory on some faceting queries
On 4/2/2013 3:09 AM, Dotan Cohen wrote: I notice that this only occurs on queries that run facets. I start Solr with the following command: sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar /opt/solr-4.1.0/example/start.jar It looks like you've followed some advice that I gave previously on how to tune java. I have since learned that this advice is bad, it results in long GC pauses, even with heaps that aren't huge. As others have pointed out, you don't have a max heap setting, which would mean that you're using whatever Java chooses for its default, which might not be enough. If you can get Solr to successfully run for a while with queries and updates happening, the heap should eventually max out and the admin UI will show you what Java is choosing by default. Here is what I would now recommend for a beginning point on your Solr startup command. You may need to increase the heap beyond 4GB, but be careful that you still have enough free memory to be able to do effective caching of your index. sudo nohup java -Xms4096M -Xmx4096M -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:NewRatio=3 -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar /opt/solr-4.1.0/example/start.jar If you are running a really old build of java (latest versions on Oracle's website are 1.6 build 43 and 1.7 build 17), you might want to leave AggressiveOpts out. Some people would argue that you should never use that option. Thanks, Shawn
Re: SolrCloud not distributing documents across shards
Hello Vytenis, What exactly do you mean by aren't distributing across the shards? Do you mean that POSTs against the server for shard 1 never end up resulting in documents saved in shard 2? Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 12:31 PM, vsilgalis vsilga...@gmail.com wrote: So we have 3 servers in a SolrCloud cluster. http://lucene.472066.n3.nabble.com/file/n4053506/Cloud1.png We have 2 shards for our collection (classic_bt) with a shard on each of the first two servers as the picture shows. The third server has replicas of the first 2 shards just for high availability purposes. Now if we go into counts we have the following information: shard1 - Numdocs - 33010 shard2 - Numdocs - 85934 Both shards replicate to the third server with no issues. For some reason the documents aren't distributing across the shards, nothing in the logs indicates a problem but I'm not sure what we should be looking for. Let me know if you need more information. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Lengthy description is converted to hash symbols
Here is a query that should return 2 documents... but it only returns 1. /solr/m7779912/select?indent=on&version=2.2&q=description%3Agateway&fq=&start=0&rows=10&fl=description&qt=&wt=&explainOther=&hl.fl= Oddly enough, the descriptions of the two documents are exactly the same, except one is indexed correctly and the other contains the hash symbols. Btw, when the core was created, it was built from scratch via POJOs and the addBeans() method. -- View this message in context: http://lucene.472066.n3.nabble.com/Lengthy-description-is-converted-to-hash-symbols-tp4053338p4053544.html Sent from the Solr - User mailing list archive at Nabble.com.
Solr Tika Override
I am researching Solr and seeing if it would be a good fit for a document search service I am helping to develop. One of the requirements is that we will need to be able to customize how file contents are parsed beyond the default configurations that are offered out of the box by Tika. For example, we know that we will be indexing .pdf files that will contain a cover page with a project start date, and would like to pull this date out into a searchable field that is separate from the file content. I have seen several sources saying you can do this by overriding the ExtractingRequestHandler.createFactory() method, but I have not been able to find much documentation on how to implement a new parser. Can someone point me in the right direction on where to look, or let me know if the scenario I described above is even possible? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Tika-Override-tp4053552.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on. Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case. - Mark On Apr 3, 2013, at 1:13 PM, Jamie Johnson jej2...@gmail.com wrote: no, my thought was wrong, it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1 to see if I see the same behavior. On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson jej2...@gmail.com wrote: Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards; a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice, documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the live nodes? I know that we don't specify the numShards param @ startup so could this be what is happening? grep -c 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de * shard1-core1:0 shard1-core2:0 shard2-core1:0 shard2-core2:0 shard3-core1:1 shard3-core2:1 shard4-core1:0 shard4-core2:0 shard5-core1:1 shard5-core2:1 shard6-core1:0 shard6-core2:0 On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson jej2...@gmail.com wrote: Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up so I started the indexing again and indexed another 400,000 and I see 400,064 docs. Is there a good way to find possible duplicates? 
I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1. On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote: Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0 and then I'll try on 4.2.1 On Wed, Apr 3, 2013 at 8:21 AM, Mark Miller markrmil...@gmail.com wrote: No, not that I know of, which is why I say we need to get to the bottom of it. - Mark On Apr 2, 2013, at 10:18 PM, Jamie Johnson jej2...@gmail.com wrote: Mark Is there a particular JIRA issue that you think may address this? I read through it quickly but didn't see one that jumped out On Apr 2, 2013 10:07 PM, Jamie Johnson jej2...@gmail.com wrote: I brought the bad one down and back up and it did nothing. I can clear the index and try 4.2.1. I will save off the logs and see if there is anything else odd On Apr 2, 2013 9:13 PM, Mark Miller markrmil...@gmail.com wrote: It would appear it's a bug given what you have said. Any other exceptions would be useful. Might be best to start tracking in a JIRA issue as well. To fix, I'd bring the behind node down and back again. Unfortunately, I'm pressed for time, but we really need to get to the bottom of this and fix it, or determine if it's fixed in 4.2.1 (spreading to mirrors now). - Mark On Apr 2, 2013, at 7:21 PM, Jamie Johnson jej2...@gmail.com wrote: Sorry I didn't ask the obvious question. Is there anything else that I should be looking for here and is this a bug? I'd be happy to troll through the logs further if more information is needed, just let me know. Also what is the most appropriate mechanism to fix this? Is it required to kill the index that is out of sync and let solr resync things? On Tue, Apr 2, 2013 at 5:45 PM, Jamie Johnson jej2...@gmail.com wrote: sorry for spamming here shard5-core2 is the instance we're having issues with... 
Apr 2, 2013 7:27:14 PM org.apache.solr.common.SolrException log SEVERE: shard update error StdNode: http://10.38.33.17:7577/solr/dsc-shard5-core2/:org.apache.solr.common.SolrException : Server at http://10.38.33.17:7577/solr/dsc-shard5-core2 returned non ok status:503, message:Service Unavailable at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:373) at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:332) at org.apache.solr.update.SolrCmdDistributor$1.call(SolrCmdDistributor.java:306) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at
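Mark's description of up-front hash ranges can be sketched in a few lines. This is a minimal illustration of range-based routing, not Solr's implementation: String.hashCode() stands in for the MurmurHash3 that Solr actually uses, and the class and method names are made up for this example.

```java
// Minimal sketch of range-based document routing as described above:
// each shard owns a fixed slice of the 32-bit hash space, recorded in
// clusterstate.json when the collection is created. String.hashCode()
// stands in for the MurmurHash3 Solr actually uses.
public class RangeRouter {

    private final int numShards;

    public RangeRouter(int numShards) {
        this.numShards = numShards;
    }

    // Map a document id onto one of numShards contiguous hash ranges.
    public int shardFor(String id) {
        // shift the signed 32-bit hash into the range 0 .. 2^32-1
        long hash = (long) id.hashCode() - Integer.MIN_VALUE;
        long rangeSize = (1L << 32) / numShards;
        // clamp so the last shard absorbs the division remainder
        return (int) Math.min(hash / rangeSize, numShards - 1);
    }

    public static void main(String[] args) {
        RangeRouter router = new RangeRouter(6);
        // Routing depends only on the id, never on which nodes are live,
        // so one id should never legitimately land on two shards.
        String id = "7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de";
        System.out.println("shard index: " + router.shardFor(id));
    }
}
```

The point of the fixed ranges is exactly the symptom in this thread: if assignment instead depended on the set of live nodes, the same id could hash to different shards at different times.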
RE: AW: AW: java.lang.OutOfMemoryError: Map failed
I just posted a similar error and discovered that decreasing the Xmx fixed the problem for me. The free command/top, etc. indicated I was flying just below the threshold for my allowed memory, and with swap/virtual space available, so I'm still confused as to what the issue is, but you may try this in your configurations to see if it helps. -Original Message- From: Per Steffensen [mailto:st...@designware.dk] Sent: Tuesday, April 02, 2013 6:09 AM To: solr-user@lucene.apache.org Subject: Re: AW: AW: java.lang.OutOfMemoryError: Map failed I have seen the exact same on Ubuntu Server 12.04. It helped adding some swap space, but I do not understand why this is necessary, since the OS ought to just use the actual memory mapped files if there is not room in (virtual) memory, swapping pages in and out on demand. Note that I saw this for memory mapped files opened for read+write - not in the exact same context as you see it, where MMapDirectory is trying to map memory mapped files. If you find a solution/explanation, please post it here. I really want to know more about why FileChannel.map can cause OOM. I do not think the OOM is a real OOM indicating no more space on the Java heap, but is more an exception saying that the OS has no more memory (in some interpretation of that). Regards, Per Steffensen On 4/2/13 11:32 AM, Arkadi Colson wrote: It is running as root: root@solr01-dcg:~# ps aux | grep tom root 1809 10.2 67.5 49460420 6931232 ? Sl Mar28 706:29 /usr/bin/java -Djava.util.logging.config.file=/usr/local/tomcat/conf/logging.properties -server -Xms2048m -Xmx6144m -XX:PermSize=64m -XX:MaxPermSize=128m -XX:+UseG1GC -verbose:gc -Xloggc:/solr/tomcat-logs/gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Duser.timezone=UTC -Dfile.encoding=UTF8 -Dsolr.solr.home=/opt/solr/ -Dport=8983 -Dcollection.configName=smsc -DzkClientTimeout=2 -DzkHost=solr01-dcg.intnet.smartbit.be:2181,solr01-gs.intnet.smartbit.be:2181,solr02-dcg.intnet.smartbit.be:2181,solr02-gs.intnet.smartbit.be:2181,solr03-dcg.intnet.smartbit.be:2181,solr03-gs.intnet.smartbit.be:2181 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port= -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.endorsed.dirs=/usr/local/tomcat/endorsed -classpath /usr/local/tomcat/bin/bootstrap.jar:/usr/local/tomcat/bin/tomcat-juli.jar -Dcatalina.base=/usr/local/tomcat -Dcatalina.home=/usr/local/tomcat -Djava.io.tmpdir=/usr/local/tomcat/temp org.apache.catalina.startup.Bootstrap start Arkadi On 04/02/2013 11:29 AM, André Widhani wrote: The output is from the root user. Are you running Solr as root? If not, please try again using the operating system user that runs Solr. André From: Arkadi Colson [ark...@smartbit.be] Sent: Tuesday, April 2, 2013 11:26 To: solr-user@lucene.apache.org Cc: André Widhani Subject: Re: AW: java.lang.OutOfMemoryError: Map failed Hmmm I checked it and it seems to be ok: root@solr01-dcg:~# ulimit -v unlimited Any other tips or do you need more debug info? BR On 04/02/2013 11:15 AM, André Widhani wrote: Hi Arkadi, this error usually indicates that virtual memory is not sufficient (should be unlimited). Please see http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/69168 Regards, André From: Arkadi Colson [ark...@smartbit.be] Sent: Tuesday, April 2, 2013 10:24 To: solr-user@lucene.apache.org Subject: java.lang.OutOfMemoryError: Map failed Hi Recently solr crashed. I've found this in the error log. My commit settings are looking like this: <autoCommit> <maxTime>1</maxTime> <openSearcher>false</openSearcher> </autoCommit> <autoSoftCommit> <maxTime>2000</maxTime> </autoSoftCommit> The machine has 10GB of memory. Tomcat is running with -Xms2048m -Xmx6144m Versions Solr: 4.2 Tomcat: 7.0.33 Java: 1.7 Anybody any idea? Thx! 
Arkadi SEVERE: auto commit error...:org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1415) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1527) at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:562) at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) at
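The checks discussed in this thread can be run together. A sketch of the usual suspects behind mmap "Map failed" errors; run these as the operating-system user that actually runs Solr/Tomcat, since (as André points out) root's limits can differ:

```shell
# Limits worth checking when MMapDirectory fails with "Map failed".
ulimit -v   # virtual address space limit; should print "unlimited"
ulimit -n   # open-file limit; mmapped index segments count against it
# On Linux there is also a per-process cap on the number of mmap regions:
[ -r /proc/sys/vm/max_map_count ] && cat /proc/sys/vm/max_map_count || true
```

A large heap (-Xmx) leaves less address space and OS memory for the memory-mapped index, which is consistent with the report above that lowering Xmx made the error go away.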
RE: Solr Multiword Search
You have specified spellcheck.q in your query. The whole purpose of spellcheck.q is to bypass any query converter you've configured, giving it raw keywords instead. But possibly a custom query converter is not your best answer? I agree that charles to charlie is an edit distance of 2, so if everything is set up correctly then DirectSolrSpellChecker with maxEdits=2 should find it. The collate functionality as you have it set up would check the index and only give you re-written queries that are guaranteed to return hits. But there is a big caveat: if the word charles occurs at all in the dictionary (because any document in your index contains it), then the spellchecker (by default) assumes it is a correctly-spelled word and will not try to correct it. In this case, specify spellcheck.alternativeTermCount with a non-zero value. (See http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount) James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: skmirch [mailto:skmi...@hotmail.com] Sent: Wednesday, April 03, 2013 12:19 PM To: solr-user@lucene.apache.org Subject: Re: Solr Multiword Search I have been trying to use the MultiWordSpellingQueryConverter.java since I need to be able to find the documents that correspond to the suggested collations. At the moment it seems to be producing collations based on word matches, and arbitrary words from the field are picked up to form a collation, so nothing corresponds to any of the titles in our set of indexed documents. Could anyone please confirm that this would work if I took the following steps? Steps: 1. Get the solr4.2.war file. 2. Get to the WEB-INF lib and add the lucene-core-4.2.0.jar and the solr-core-4.2.0.jar to the classpath to compile the MultiWordSpellingQueryConverter.java. The code for this is in my previous post in this thread. 3. jar cvf multiwordspellchecker.jar com/foo/MultiWordSpellingQueryConverter.java 4. Copy this jar to the $SOLR_HOME/lib directory. 5. 
Define queryConverter. Question: Where does this need to go? I have just put this somewhere between the searchComponent and the requestHandler for spell checks. 6. Start webserver. I see this jar file getting registered at startup: 2013-04-03 12:56:22,243 INFO [org.apache.solr.core.SolrResourceLoader] (coreLoadExecutor-3-thread-1) Adding 'file:/solr/lib/multiwordspellchecker.jar' to classloader 7. When I run the spell query, I don't see my print statements, so I am not sure if this code is really being called. I don't think it's the logging that is failing; rather, the code is not being called at all. I would appreciate any information on what I might be doing wrong. Please help. Thanks. Regards, -- Sandeep -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Multiword-Search-tp4053038p4053534.html Sent from the Solr - User mailing list archive at Nabble.com.
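James's suggestions map to solrconfig.xml roughly as follows. This is a sketch, not a drop-in config: the component name and the field name are assumed, but maxEdits, spellcheck.alternativeTermCount, and the collation parameters are the actual knobs he refers to:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">title</str>  <!-- assumed field name -->
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- maxEdits=2 so that e.g. charles -> charlie is reachable -->
    <str name="maxEdits">2</str>
  </lst>
</searchComponent>

<requestHandler name="/spell" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <!-- also suggest for terms that DO occur in the index -->
    <str name="spellcheck.alternativeTermCount">5</str>
    <!-- only return collations that actually produce hits -->
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.maxCollationTries">10</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

Note that per James's first point, queries sent with spellcheck.q bypass any configured query converter entirely.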
Re: SolrCloud not distributing documents across shards
Michael Della Bitta-2 wrote Hello Vytenis, What exactly do you mean by aren't distributing across the shards? Do you mean that POSTs against the server for shard 1 never end up resulting in documents saved in shard 2? So we indexed a set of 33010 documents on server01 which are now in shard1. And we kicked off a set of 85934 documents on server02 which are now in shard2 (as tests). In my understanding of how SolrCloud works, the documents should be distributed across the shards in the collection. Now I have seen this work before in my environment. Not sure what I need to look at to ensure this distribution. Just as a FYI, this is SOLR 4.1 -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053563.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ). Perhaps something with my process is broken. What I do when I start from scratch is the following ZkCLI -cmd upconfig ... ZkCLI -cmd linkconfig but I don't ever explicitly create the collection. What should the steps from scratch be? I am moving from an unreleased snapshot of 4.0 so I never did that previously either so perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way. On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller markrmil...@gmail.com wrote: Thanks for digging Jamie. [...]
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state. If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json? - Mark On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com wrote: Where is this information stored in ZK? [...]
Re: SolrCloud not distributing documents across shards
: So we indexed a set of 33010 documents on server01 which are now in shard1. : And we kicked off a set of 85934 documents on server02 which are now in : shard2 (as tests). In my understanding of how SolrCloud works, the : documents should be distributed across the shards in the collection. Now I I'm not familiar with the details, but i've seen miller respond to a similar question with reference to the issue of not explicitly specifying numShards when creating your collections... http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c0aa0b422-f1de-4915-b602-53cb18492...@gmail.com%3E -Hoss
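The fix referenced in that thread generally amounts to passing numShards the first time the collection's nodes are started, since it is read once at collection creation and the resulting hash ranges are stored in ZooKeeper. A sketch of a first-node startup for Solr 4.x; the ZooKeeper hosts, config paths, and names below are placeholders:

```shell
# First start of the first node: numShards is read once, when the
# collection is created, and the shard hash ranges are recorded in ZK.
java -DzkHost=zk1:2181,zk2:2181,zk3:2181 \
     -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf \
     -DnumShards=2 \
     -jar start.jar
```

Without numShards, the collection falls back to implicit routing, where documents simply stay on whichever shard received them, which matches the behavior Vytenis describes.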
Re: It seems a issue of deal with chinese synonym for solr
On 3/11/13 6:15 PM, 李威 wrote: In org.apache.solr.parser.SolrQueryParserBase, there is a function: protected Query newFieldQuery(Analyzer analyzer, String field, String queryText, boolean quoted) throws SyntaxError The code below can't process Chinese correctly: BooleanClause.Occur occur = positionCount > 1 && operator == AND_OPERATOR ? BooleanClause.Occur.MUST : BooleanClause.Occur.SHOULD; For example, "北京市" and "北京" are synonyms; if I search for 北京市动物园, the expected parse result is +(北京市 北京) +动物园, but actually it is parsed to +北京市 +北京 +动物园. The code can process English, because English words are separated by spaces and each occupies only one position. An interesting feature of this example is that the difference between the two synonyms is the omission of one token, 市 (city). Doesn't the same problem happen if we define London City and London as synonyms, and execute a query like London City Zoo? Must a Chinese analyzer be used to reproduce this problem? I tried to test this but I couldn't reproduce it. The result of query string expansion using Solr 4.2's query interface with debug output shows: <str name="parsedquery">MultiPhraseQuery(text:(london london) city zoo)</str> I see no plus (+). What query parser did you use? -- Kuro Kurosaka
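For reference, query-time synonyms are wired up in schema.xml roughly like this. A sketch with an assumed field type name and synonyms file; multi-token entries such as the 北京市/北京 or London City/London pairs above are exactly the case where query-time synonym expansion is known to misbehave, because the query parser splits on whitespace before the analyzer runs:

```xml
<!-- sketch: a field type with query-time synonym expansion -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- synonyms.txt would contain lines like:
         london city, london -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Many setups sidestep the multi-word problem by applying synonyms at index time instead (analyzer type="index"), at the cost of reindexing whenever the synonym list changes.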
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
The router says implicit. I did start from a blank zk state but perhaps I missed one of the ZkCLI commands? One of my shards from the clusterstate.json is shown below. What is the process that should be done to bootstrap a cluster other than the ZkCLI commands I listed above? My process right now is run those ZkCLI commands and then start solr on all of the instances with a command like this java -server -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7575 -DhostPort=7575 -jar start.jar I feel like maybe I'm missing a step. shard5:{ state:active, replicas:{ 10.38.33.16:7575_solr_shard5-core1:{ shard:shard5, state:active, core:shard5-core1, collection:collection1, node_name:10.38.33.16:7575_solr, base_url:http://10.38.33.16:7575/solr;, leader:true}, 10.38.33.17:7577_solr_shard5-core2:{ shard:shard5, state:recovering, core:shard5-core2, collection:collection1, node_name:10.38.33.17:7577_solr, base_url:http://10.38.33.17:7577/solr}}} On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller markrmil...@gmail.com wrote: It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state. If you don't have that range information, I think there will be trouble. Do you have an router type defined in the clusterstate.json? - Mark On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com wrote: Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ). Perhaps something with my process is broken. What I do when I start from scratch is the following ZkCLI -cmd upconfig ... ZkCLI -cmd linkconfig but I don't ever explicitly create the collection. 
What should the steps from scratch be? I am moving from an unreleased snapshot of 4.0 so I never did that previously either so perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way. On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller markrmil...@gmail.com wrote: Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on. Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case. - Mark On Apr 3, 2013, at 1:13 PM, Jamie Johnson jej2...@gmail.com wrote: no, my thought was wrong, it appears that even with the parameter set I am seeing this behavior. I've been able to duplicate it on 4.2.0 by indexing 100,000 documents on 10 threads (10,000 each) when I get to 400,000 or so. I will try this on 4.2.1. to see if I see the same behavior On Wed, Apr 3, 2013 at 12:37 PM, Jamie Johnson jej2...@gmail.com wrote: Since I don't have that many items in my index I exported all of the keys for each shard and wrote a simple java program that checks for duplicates. I found some duplicate keys on different shards, a grep of the files for the keys found does indicate that they made it to the wrong places. If you notice documents with the same ID are on shard 3 and shard 5. Is it possible that the hash is being calculated taking into account only the live nodes? I know that we don't specify the numShards param @ startup so could this be what is happening? 
grep -c 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de *
shard1-core1:0
shard1-core2:0
shard2-core1:0
shard2-core2:0
shard3-core1:1
shard3-core2:1
shard4-core1:0
shard4-core2:0
shard5-core1:1
shard5-core2:1
shard6-core1:0
shard6-core2:0

On Wed, Apr 3, 2013 at 10:42 AM, Jamie Johnson jej2...@gmail.com wrote: Something interesting that I'm noticing as well: I just indexed 300,000 items, and somehow 300,020 ended up in the index. I thought perhaps I messed something up, so I started the indexing again and indexed another 400,000, and I see 400,064 docs. Is there a good way to find possible duplicates? I had tried to facet on key (our id field) but that didn't give me anything with more than a count of 1. On Wed, Apr 3, 2013 at 9:22 AM, Jamie Johnson jej2...@gmail.com wrote: Ok, so clearing the transaction log allowed things to go again. I am going to clear the index and try to replicate the problem on 4.2.0 and then I'll try on 4.2.1. On Wed,
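The export-and-compare step Jamie describes (dumping the keys of each core and checking for IDs that appear on more than one shard) can also be done with standard shell tools instead of a custom Java program. The sketch below assumes one exported key file per core, one doc ID per line; the sample files and IDs are fabricated just to make the demo self-contained.

```shell
# Sketch: find doc IDs that appear in more than one core dump.
# Assumes one key file per core (one ID per line); the files
# created here are fake sample data for illustration.
dir=$(mktemp -d)
printf 'id-1\nid-2\n' > "$dir/shard1-core1"
printf 'id-3\nid-4\n' > "$dir/shard3-core1"
printf 'id-2\nid-5\n' > "$dir/shard5-core1"
# Any ID printed here is present in two or more core dumps:
sort "$dir"/shard*-core* | uniq -d
# -> id-2
```

To ignore legitimate copies on replicas of the same shard, you would dump keys from only one core per shard before comparing.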
Re: SolrCloud not distributing documents across shards
Chris Hostetter-3 wrote: I'm not familiar with the details, but I've seen Miller respond to a similar question with reference to the issue of not explicitly specifying numShards when creating your collections... http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/% 3C0AA0B422-F1DE-4915-B602-53CB1849204A@ %3E -Hoss

Well, theoretically we are okay there. The commands we run to create our collection are as follows (note the numShards being specified):

http://server01/solr/admin/cores?action=CREATE&name=classic_bt&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt
http://server02/solr/admin/cores?action=CREATE&name=classic_bt&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt
http://server03/solr/admin/cores?action=CREATE&name=classic_bt_shard1&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt_shard1&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt&shard=shard1
http://server03/solr/admin/cores?action=CREATE&name=classic_bt_shard2&collection=classic_bt&numShards=2&instanceDir=instances/basistech&dataDir=/opt/index/classic_bt_shard2&config=solrconfig.xml&schema=schema.xml&collection.configName=classic_bt&shard=shard2

-- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053581.html Sent from the Solr - User mailing list archive at Nabble.com.
HTML entities being missed by DIH HTMLStripTransformer
Hi, I am using DIH to index some database fields. These fields contain HTML-formatted text in them. I use the 'HTMLStripTransformer' to remove that markup. This works fine when the text is, for example: <li>Item One</li> or <b>This is in Bold</b>. However, when the text has HTML entity names, as in: &lt;li&gt;Item One&lt;/li&gt; or &lt;b&gt;This is in Bold&lt;/b&gt;, NOTHING HAPPENS. Two questions: (1) Is this the expected behavior of the DIH HTMLStripTransformer? (2) If yes, is there another transformer that I can employ first to turn these HTML entities into their usual symbols, which can then be removed by the DIH HTMLStripTransformer? Thanks - ashok -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: HTML entities being missed by DIH HTMLStripTransformer
On 4 April 2013 00:30, Ashok ash...@qualcomm.com wrote: [...] Two questions. (1) Is this the expected behavior of DIH HTMLStripTransformer? Yes, I believe so. (2) If yes, is there another transformer that I can employ first to turn these HTML entities into their usual symbols that can then be removed by the DIH HTMLStripTransformer? How are the HTML tags getting converted into entities? Are you escaping input somewhere? Regards, Gora
Re: Filtering Search Cloud
Shawn, thanks for your detailed explanation. My system will work under high load. I mean, I will always be indexing something, and something will always be queried on my system. That is why I am considering physically separating the indexer and query-reply machines. Think about it: imagine a machine that both does indexing (one kind of disk I/O; I don't know the underlying system, maybe Solr does sequential I/O) and tries to answer queries (another kind of I/O) at the same time. That is my main challenge in deciding whether to separate them. And the next step is: if I separate them, can I filter the data of the indexer machines before the response? (I don't have any filtering issues right now, I just think that maybe I will need it in the future.) 2013/4/3 Shawn Heisey s...@elyograg.org On 4/1/2013 3:02 PM, Furkan KAMACI wrote: I want to separate my cloud into two logical parts. One of them is the indexer cloud of SolrCloud. The second one is the searcher cloud of SolrCloud. My first question: does separating my cloud system make sense for performance improvement? Because I think that when indexing, searching takes time to respond, and if I separate them I get a performance improvement. On the other hand, maybe using all Solr machines as a whole (I mean not partitioning as I mentioned) lets SolrCloud do better load balancing; I would want to learn that. My second question: let's assume that I have separated my machines as I mentioned. Can I filter some indexes to be searchable or not from the searcher SolrCloud? SolrCloud gets rid of the master and slave designations. It also gets rid of the line between indexing and querying. Each shard has a replica that is designated the leader, but that has no real impact on searching and indexing, only on deciding which data to use when replicas get out of sync. In the old master-slave architecture, you indexed to the master and the updated index files were replicated to the slave. 
The slave did not handle the analysis for indexing, so it was usually better to send queries to slaves and let the master only do indexing. SolrCloud is very different. When you index, the documents are indexed on all replicas at about the same time. When you query, the requests are load balanced across all replicas. During normal operation, SolrCloud does not use replication at all. The replication feature is only used when a replica gets out of sync with the leader, and in that case, the entire index is replicated. Thanks, Shawn
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself. - Mark On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote: The router says implicit. I did start from a blank zk state but perhaps I missed one of the ZkCLI commands? [...]
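Mark's point about up-front hash partitioning can be illustrated with a toy sketch. This is not Solr's actual router (the compositeId router uses MurmurHash3 over hash ranges pre-assigned per shard and stored in ZooKeeper); the function below is a stand-in. It only demonstrates the property that matters in this thread: with hash routing, the target shard is a pure function of the doc ID and numShards, so it cannot drift with the set of live nodes.

```shell
# Toy hash router (NOT Solr's MurmurHash3/range implementation):
# shard = f(id, numShards), independent of which nodes are live.
num_shards=6
shard_for() {
  h=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo "shard$(( h % num_shards + 1 ))"
}
shard_for 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de
shard_for 7cd1a717-3d94-4f5d-bcb1-9d8a95ca78de   # same ID, same shard, every time
```

Without numShards (the implicit router), there is no such function at all, and the same ID can legitimately be sent to different shards, which matches the duplicates observed above.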
Re: HTML entities being missed by DIH HTMLStripTransformer
Well, the database field has text, sometimes with HTML entities and at other times with html tags. I have no control over the process that populates the database tables with info. -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582p4053586.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Ah, interesting... so I need to specify numShards, blow out zk, and then try this again to see if things work properly now. What is really strange is that for the most part things have worked right, and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk, and begin again. If this works properly, what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. [...]
Re: SolrCloud not distributing documents across shards
With earlier versions of SolrCloud, if there was any error or warning when you made a collection, you likely were set up for implicit routing, which means that documents only go to the shard you're talking to. What you want is compositeId routing, which works how you think it should. Go into the Cloud GUI and look at clusterstate.json in the Tree tab. You should see the routing algorithm it's using in that file. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 2:59 PM, vsilgalis vsilga...@gmail.com wrote: Chris Hostetter-3 wrote: I'm not familiar with the details, but I've seen Miller respond to a similar question with reference to the issue of not explicitly specifying numShards when creating your collections... -Hoss Well, theoretically we are okay there. The commands we run to create our collection are as follows (note the numShards being specified): [...]
Re: HTML entities being missed by DIH HTMLStripTransformer
Then, I would say, you have a bigger problem... However, you can probably run a RegEx filter and replace those known escapes with real characters before you run your HTMLStrip filter. Or run HTMLStrip, RegEx, and HTMLStrip again. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Apr 3, 2013 at 3:19 PM, Ashok ash...@qualcomm.com wrote: Well, the database field has text, sometimes with HTML entities and at other times with HTML tags. I have no control over the process that populates the database tables with info.
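Alex's two-pass idea (turn the entities back into characters first, then strip the tags) can be sketched outside DIH with a plain sed pipeline; inside DIH the rough equivalent would be a RegexTransformer chained ahead of the HTMLStripTransformer. The entity list below is deliberately minimal, just enough to show the mechanism.

```shell
# Sketch of "unescape entities, then strip tags". Only a few
# entities are handled; a real field would need a fuller map,
# and &amp; must be unescaped last to avoid double-unescaping.
echo '&lt;b&gt;This is in Bold&lt;/b&gt;' \
  | sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g' \
  | sed -e 's/<[^>]*>//g'
# -> This is in Bold
```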
Re: Filtering Search Cloud
On 4/3/2013 1:13 PM, Furkan KAMACI wrote: Shawn, thanks for your detailed explanation. My system will work on high load. I mean I will always index something and something always will be queried at my system. That is why I consider physically separating indexer and query reply machines. [...] We do seem to have a language barrier, so let me try to be very clear: If you use SolrCloud, you can't separate querying and indexing. You will have to use the master-slave replication that has been part of Solr since at least 1.4, possibly earlier. Thanks, Shawn
Re: HTML entities being missed by DIH HTMLStripTransformer
Hi Ashok, HTMLStripTransformer uses HTMLStripCharFilter under the hood, and HTMLStripCharFilter converts all HTML entities to their corresponding characters. What version of Solr are you using? My guess is that it only appears that nothing is happening, since when they are presented in a browser, they show up as the characters the entities represent. I think (never done this myself) that if you apply the HTMLStripTransformer twice, it will first convert the entities to characters, and then on the second pass, remove the HTML constructs. From http://wiki.apache.org/solr/DataImportHandler#Transformer: - The entity transformer attribute can consist of a comma-separated list of transformers (say transformer=foo.X,foo.Y). The transformers are chained in this case and they are applied one after the other in the order in which they are specified. What this means is that after the fields are fetched from the datasource, the list of entity columns are processed one at a time in the order listed inside the entity tag and scanned by the first transformer to see if any of that transformer's attributes are present. If so, the transformer does its thing! When all of the listed entity columns have been scanned, the process is repeated using the next transformer in the list. - Steve On Apr 3, 2013, at 3:30 PM, Alexandre Rafalovitch arafa...@gmail.com wrote: Then, I would say, you have a bigger problem... [...]
Re: SolrCloud not distributing documents across shards
Michael Della Bitta-2 wrote: With earlier versions of SolrCloud, if there was any error or warning when you made a collection, you likely were set up for implicit routing, which means that documents only go to the shard you're talking to. What you want is compositeId routing [...] Michael Della Bitta That sounds like my huckleberry. router:implicit is in the collection info in the clusterstate.json. How do I fix this? Just wipe the clusterstate.json? Thanks for your help. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053593.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Answered my own question: it now says compositeId. What is problematic, though, is that in addition to my shards (which are, say, jamie-shard1) I see the Solr-created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: Ah, interesting... so I need to specify numShards, blow out zk, and then try this again to see if things work properly now. [...]
Re: SolrCloud not distributing documents across shards
If you can work with a clean state, I'd turn off all your shards, clear out the Solr directories in Zookeeper, reset solr.xml for each of your shards, upgrade to the latest version of Solr, and turn everything back on again. Then upload config, recreate your collection, etc. I do it like this, but YMMV: curl "http://localhost:8080/solr/admin/collections?action=CREATE&name=$name&numShards=$num&collection.configName=$config-name" Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Wed, Apr 3, 2013 at 3:40 PM, vsilgalis vsilga...@gmail.com wrote: Michael Della Bitta-2 wrote: [...] That sounds like my huckleberry. router:implicit is in the collection info in the clusterstate.json. How do I fix this? Just wipe the clusterstate.json? Thanks for your help.
Re: Filtering Search Cloud
Thanks for your explanation, you explained everything that I need. Just one more question. I see that I cannot do it with SolrCloud, but I can do something like it with master-slave replication in Solr. If I use master-slave replication, can I eliminate (filter) something that was indexed on the master from being returned in a response when querying the slaves? 2013/4/3 Shawn Heisey s...@elyograg.org On 4/3/2013 1:13 PM, Furkan KAMACI wrote: Shawn, thanks for your detailed explanation. My system will work on high load. [...] We do seem to have a language barrier, so let me try to be very clear: If you use SolrCloud, you can't separate querying and indexing. You will have to use the master-slave replication that has been part of Solr since at least 1.4, possibly earlier. Thanks, Shawn
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time. I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers. - Mark On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote: answered my own question, it now says compositeId. What is problematic though is that in addition to my shards (which are say jamie-shard1) I see the solr created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: ah interesting... so I need to specify numShards, blow out zk and then try this again to see if things work properly now. What is really strange is that for the most part things have worked right and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk and begin again. If this works properly what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself. - Mark On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote: The router says implicit.
I did start from a blank zk state but perhaps I missed one of the ZkCLI commands? One of my shards from the clusterstate.json is shown below. What is the process that should be done to bootstrap a cluster other than the ZkCLI commands I listed above? My process right now is to run those ZkCLI commands and then start solr on all of the instances with a command like this: java -server -Dshard=shard5 -DcoreName=shard5-core1 -Dsolr.data.dir=/solr/data/shard5-core1 -Dcollection.configName=solr-conf -Dcollection=collection1 -DzkHost=so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 -Djetty.port=7575 -DhostPort=7575 -jar start.jar I feel like maybe I'm missing a step.

"shard5":{
  "state":"active",
  "replicas":{
    "10.38.33.16:7575_solr_shard5-core1":{
      "shard":"shard5",
      "state":"active",
      "core":"shard5-core1",
      "collection":"collection1",
      "node_name":"10.38.33.16:7575_solr",
      "base_url":"http://10.38.33.16:7575/solr",
      "leader":"true"},
    "10.38.33.17:7577_solr_shard5-core2":{
      "shard":"shard5",
      "state":"recovering",
      "core":"shard5-core2",
      "collection":"collection1",
      "node_name":"10.38.33.17:7577_solr",
      "base_url":"http://10.38.33.17:7577/solr"}}}

On Wed, Apr 3, 2013 at 2:40 PM, Mark Miller markrmil...@gmail.com wrote: It should be part of your clusterstate.json. Some users have reported trouble upgrading a previous zk install when this change came. I recommended manually updating the clusterstate.json to have the right info, and that seemed to work. Otherwise, I guess you have to start from a clean zk state. If you don't have that range information, I think there will be trouble. Do you have a router type defined in the clusterstate.json? - Mark On Apr 3, 2013, at 2:24 PM, Jamie Johnson jej2...@gmail.com wrote: Where is this information stored in ZK? I don't see it in the cluster state (or perhaps I don't understand it ;) ). Perhaps something with my process is broken. What I do when I start from scratch is the following: ZkCLI -cmd upconfig ... ZkCLI -cmd linkconfig but I don't ever explicitly create the collection. What should the steps from scratch be?
I am moving from an unreleased snapshot of 4.0 so I never did that previously either so perhaps I did create the collection in one of my steps to get this working but have forgotten it along the way. On Wed, Apr 3, 2013 at 2:16 PM, Mark Miller markrmil...@gmail.com wrote: Thanks for digging Jamie. In 4.2, hash ranges are assigned up front when a collection is created - each shard gets a range, which is stored in zookeeper. You should not be able to end up with the same id on different shards - something very odd going on. Hopefully I'll have some time to try and help you reproduce. Ideally we can capture it in a test case. - Mark On Apr 3, 2013, at 1:13 PM, Jamie Johnson jej2...@gmail.com wrote: no, my thought was wrong, it
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote: I had thought you could - but looking at the code recently, I don't think you can anymore. [...]
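Pulling the thread together, a from-scratch bootstrap along the lines discussed (upload the config set, link it, then explicitly create the collection through the Collections API so numShards and the hash ranges get recorded in ZooKeeper) might look like this. Host names, ports, the classpath, and the config name are placeholders:

```shell
# upload the config set to ZooKeeper
java -classpath ... org.apache.solr.cloud.ZkCLI -cmd upconfig \
  -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -confdir ./solr-conf -confname solr-conf

# link the config set to the collection name
java -classpath ... org.apache.solr.cloud.ZkCLI -cmd linkconfig \
  -zkhost so-zoo1:2181,so-zoo2:2181,so-zoo3:2181 \
  -collection collection1 -confname solr-conf

# create the collection explicitly so the compositeId router
# and per-shard hash ranges end up in clusterstate.json
curl "http://localhost:7575/solr/admin/collections?action=CREATE&name=collection1&numShards=5&collection.configName=solr-conf"
```

The key step Jamie was missing is the last one: without an explicit CREATE (or numShards at startup), you get the implicit router and updates are not distributed for you.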
Re: HTML entities being missed by DIH HTMLStripTransformer
Hi Steve, Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice did the trick. I am using Solr 4.1. Thank you very much! - ashok -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582p4053609.html Sent from the Solr - User mailing list archive at Nabble.com.
do SearchComponents have access to response contents
I need to implement some SearchComponent that will deal with metrics on the response. Some things I see will be easy to get, like number of hits for instance, but I am more worried about this: we need to also track the size of the response (the size in bytes of the whole xml response that is streamed, with stored fields and all). I was a bit worried because I am wondering if a SearchComponent will actually have access to the response bytes... Can someone confirm one way or the other? We are targeting Solr 4.0. thanks xavier
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. [...]
Re: HTML entities being missed by DIH HTMLStripTransformer
Cool, glad I was able to help. On Apr 3, 2013, at 4:18 PM, Ashok ash...@qualcomm.com wrote: Hi Steve, Fabulous suggestion! Yup, that is it! Using the HTMLStripTransformer twice did the trick. I am using Solr 4.1. Thank you very much! - ashok -- View this message in context: http://lucene.472066.n3.nabble.com/HTML-entities-being-missed-by-DIH-HTMLStripTransformer-tp4053582p4053609.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud not distributing documents across shards
Michael Della Bitta-2 wrote If you can work with a clean state, I'd turn off all your shards, clear out the Solr directories in Zookeeper, reset solr.xml for each of your shards, upgrade to the latest version of Solr, and turn everything back on again. Then upload config, recreate your collection, etc. I do it like this, but YMMV: curl "http://localhost:8080/solr/admin/collections?action=CREATE&name=$name&numShards=$num&collection.configName=$config-name" Michael Della Bitta Looks like that was the problem. Thanks, much appreciated. Is there any insight into specifically what I should look into for preventing this in the future? -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053622.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Question on Exact Matches - edismax
Can you show us your *_ci field type? Solr does not really have a way to tell whether a match is exact or only partial, but you could hack around it with the fieldType. See https://github.com/cominvent/exactmatch for a possible solution. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 3. apr. 2013 kl. 15:55 skrev Sandeep Mestry sanmes...@gmail.com: Hi All, I have a requirement wherein exact matches for 2 fields (Series Title, Title) should be ranked higher than the partial matches. The configuration looks like below:

<requestHandler name="assetdismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">pg_series_title_ci^500 title_ci^300 pg_series_title^200 title^25
      classifications^15 classifications_texts^15 parent_classifications^10
      synonym_classifications^5 pg_brand_title^5 pg_series_working_title^5
      p_programme_title^5 p_item_title^5 p_interstitial_title^5 description^15
      pg_series_description annotations^0.1 classification_notes^0.05
      pv_program_version_number^2 pv_program_version_number_ci^2
      pv_program_number^2 pv_program_number_ci^2 p_program_number^2
      ma_version_number^2 ma_recording_location ma_contributions^0.001
      rel_pg_series_title rel_programme_title rel_programme_number
      rel_programme_number_ci pg_uuid^0.5 p_uuid^0.5 pv_uuid^0.5 ma_uuid^0.5</str>
    <str name="pf">pg_series_title_ci^500 title_ci^500</str>
    <int name="ps">0</int>
    <str name="q.alt">*:*</str>
    <str name="mm">100%</str>
    <str name="q.op">AND</str>
    <str name="facet">true</str>
    <str name="facet.limit">-1</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>

As you can see above, the search is against many fields. What I'd want is the documents that have exact matches for series title and title fields should rank higher than the rest.
I have added 2 case-insensitive fields (pg_series_title_ci, title_ci) for series title and title and have boosted them higher than the tokenized fields and the rest. I have also implemented a similarity class to override idf, however I still get documents with partial matches in title and other fields ranking higher than an exact match in pg_series_title_ci. Many Thanks, Sandeep
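For context, a typical case-insensitive exact-match field type of the kind Jan is asking about keeps the whole title as a single token, so only a full-field match can hit it. A sketch (the type name and the TrimFilter are assumptions, adjust to your schema):

```xml
<!-- Sketch of a *_ci field type: the whole value stays one token,
     lowercased, so a query matches only if the entire field matches. -->
<fieldType name="text_ci" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

If the _ci fields use a regular tokenizing analyzer instead, partial matches will still hit them, which would explain the ranking Sandeep is seeing.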
Re: do SearchComponents have access to response contents
The search components can see the response as a NamedList, but it is only when SolrDispatchFilter calls the QueryResponseWriter that XML or JSON or whatever other format (Javabin as well) is generated from the named list for final output in an HTTP response. You probably want a custom query response writer that wraps the XML response writer. Then you can generate the XML and then do whatever you want with it. See the QueryResponseWriter class and the queryResponseWriter element in solrconfig.xml. -- Jack Krupansky -Original Message- From: xavier jmlucjav Sent: Wednesday, April 03, 2013 4:22 PM To: solr-user@lucene.apache.org Subject: do SearchComponents have access to response contents I need to implement some SearchComponent that will deal with metrics on the response. [...]
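The counting part of Jack's wrapping suggestion can be illustrated with plain JDK classes: a FilterWriter that counts every character passing through to the underlying writer. The class names here are hypothetical; a real wrapping QueryResponseWriter would interpose something like this between Solr and the stock XML writer it delegates to.

```java
import java.io.FilterWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical sketch: counts all characters written through to the
// wrapped writer, the same trick a wrapping response writer could use
// to measure the size of the serialized response.
class CountingWriter extends FilterWriter {
    private long count = 0;

    CountingWriter(Writer out) { super(out); }

    @Override public void write(int c) throws IOException {
        super.write(c);
        count++;
    }

    @Override public void write(char[] cbuf, int off, int len) throws IOException {
        super.write(cbuf, off, len);
        count += len;
    }

    @Override public void write(String str, int off, int len) throws IOException {
        super.write(str, off, len);
        count += len;
    }

    long getCount() { return count; }
}

public class CountingWriterDemo {
    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        CountingWriter counting = new CountingWriter(sink);
        counting.write("<response><result numFound=\"42\"/></response>");
        counting.flush();
        // the response still reaches the sink unchanged; we also know its size
        System.out.println(counting.getCount());
    }
}
```

Note this measures characters before HTTP encoding; if you need exact wire bytes you would count at the OutputStream level instead.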
Re: Solr Tika Override
You'd probably want to work on the XML output from Tika's PDF parser, from which you can identify which page and context. Personally I would build a separate indexing application in Java and call Tika directly, then build a SolrInputDocument which you pass to solr through SolrJ. I.e. not use ExtractingRequestHandler, but put all this logic on the client side. This scales better, you can handle weird parsing errors and OOM situations better and you have full control of how to deal with the XML output from various file formats, and what metadata to pass on into the Solr document. This is possible with a customized ExtractingHandler too, but it will be uglier and harder to test. With a standalone indexer application you can write unit tests for all the special parsing requirements. see http://tika.apache.org for more. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 3. apr. 2013 kl. 20:09 skrev JerryC coss...@vt.edu: I am researching Solr and seeing if it would be a good fit for a document search service I am helping to develop. One of the requirements is that we will need to be able to customize how file contents are parsed beyond the default configurations that are offered out of the box by Tika. For example, we know that we will be indexing .pdf files that will contain a cover page with a project start date, and would like to pull this date out into a searchable field that is separate from the file content. I have seen several sources saying you can do this by overriding the ExtractingRequestHandler.createFactory() method, but I have not been able to find much documentation on how to implement a new parser. Can someone point me in the right direction on where to look, or let me know if the scenario I described above is even possible? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Tika-Override-tp4053552.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud not distributing documents across shards
From what I can tell, the Collections API has been hardened significantly since 4.2 and now will refuse to create a collection if you give it something ambiguous to do. So if you upgrade to 4.2, things will become more safe. But overall I'd find a way of using the Collections API that works and stick with it. Michael Della Bitta On Wed, Apr 3, 2013 at 5:01 PM, vsilgalis vsilga...@gmail.com wrote: Michael Della Bitta-2 wrote If you can work with a clean state, I'd turn off all your shards, clear out the Solr directories in Zookeeper, reset solr.xml for each of your shards, upgrade to the latest version of Solr, and turn everything back on again. [...] Looks like that was the problem. Thanks, much appreciated. Is there any insight into specifically what I should look into for preventing this in the future? -- View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-not-distributing-documents-across-shards-tp4053506p4053622.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SolrCloud not distributing documents across shards
On Apr 3, 2013, at 5:53 PM, Michael Della Bitta michael.della.bi...@appinions.com wrote: From what I can tell, the Collections API has been hardened significantly since 4.2 I did a lot of work here for 4.2.1 - there was a lot to improve. Hopefully there is much less now, but if anyone finds anything, I'll fix any JIRA's. - Mark
Re: Filtering Search Cloud
On 4/3/2013 1:52 PM, Furkan KAMACI wrote: Thanks for your explanation, you explained everything I need. Just one more question. I see that I cannot do it with SolrCloud, but I can do something like that with master-slave replication. If I use master-slave replication, can I filter something (something that was indexed on the master) out of the response when querying the slaves? I don't understand the question. I will attempt to give you more information, but it might not answer your question. If not, you'll have to try to improve your question. Your master and each of that master's slaves will have the same index as soon as replication is done. A query on the slave has no idea that the master exists. Thanks, Shawn
Streaming search results
Is it possible to stream search results from Solr? It seems that this feature is missing. I see two options to solve this: 1. Use the search results pagination feature. The idea is to implement a smart proxy that will stream chunks of search results using pagination. 2. Implement a Solr plugin with a search streaming feature (is that possible at all?). The first option is easy to implement and reliable, though I don't know what the drawbacks are. Regards, Viktor
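Option 1 boils down to a start/rows loop. A minimal sketch of just the paging logic, with the HTTP call stubbed out (fetchPage is a hypothetical stand-in for a real request with ?start=...&rows=..., so the loop is runnable offline):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PagingSketch {
    // Hypothetical stand-in for a Solr request with ?start=...&rows=...;
    // here it just slices a fake result set.
    static List<Integer> fetchPage(List<Integer> all, int start, int rows) {
        if (start >= all.size()) {
            return Collections.emptyList();
        }
        return all.subList(start, Math.min(start + rows, all.size()));
    }

    // Keep requesting the next window until a page comes back empty.
    // A real proxy would flush each chunk to the client as it arrives
    // instead of accumulating everything in memory.
    static List<Integer> streamAll(List<Integer> all, int rows) {
        List<Integer> out = new ArrayList<>();
        int start = 0;
        while (true) {
            List<Integer> page = fetchPage(all, start, rows);
            if (page.isEmpty()) break;
            out.addAll(page);
            start += rows; // advance the paging window
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) docs.add(i);
        System.out.println(streamAll(docs, 3).size()); // prints 10
    }
}
```

One drawback worth knowing about: with plain start/rows paging, deep pages get progressively more expensive for the server, since each request must internally collect and skip all preceding results.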
Re: Solr metrics in Codahale metrics and Graphite?
That sounds great. I'll check out the bug, I didn't see anything in the docs about this. And if I can't find it with a search engine, it probably isn't there. --wunder On Apr 3, 2013, at 6:39 AM, Shawn Heisey wrote: On 3/29/2013 12:07 PM, Walter Underwood wrote: What are folks using for this? I don't know that this really answers your question, but Solr 4.1 and later includes a big chunk of codahale metrics internally for request handler statistics - see SOLR-1972. First we tried including the jar and using the API, but that created thread leak problems, so the source code was added. Thanks, Shawn
Re: Solr metrics in Codahale metrics and Graphite?
It's there! :) http://search-lucene.com/?q=percentile&fc_project=Solr&fc_type=issue Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Apr 3, 2013 at 6:29 PM, Walter Underwood wun...@wunderwood.org wrote: That sounds great. I'll check out the bug, I didn't see anything in the docs about this. And if I can't find it with a search engine, it probably isn't there. --wunder On Apr 3, 2013, at 6:39 AM, Shawn Heisey wrote: On 3/29/2013 12:07 PM, Walter Underwood wrote: What are folks using for this? I don't know that this really answers your question, but Solr 4.1 and later includes a big chunk of codahale metrics internally for request handler statistics - see SOLR-1972. First we tried including the jar and using the API, but that created thread leak problems, so the source code was added. Thanks, Shawn
Re: Solr metrics in Codahale metrics and Graphite?
In the Jira, but not in the docs. It would be nice to have VM stats like GC, too, so we can have common monitoring and alerting on all our services. wunder On Apr 3, 2013, at 3:31 PM, Otis Gospodnetic wrote: It's there! :) http://search-lucene.com/?q=percentile&fc_project=Solr&fc_type=issue Otis -- Solr ElasticSearch Support http://sematext.com/ On Wed, Apr 3, 2013 at 6:29 PM, Walter Underwood wun...@wunderwood.org wrote: That sounds great. I'll check out the bug, I didn't see anything in the docs about this. And if I can't find it with a search engine, it probably isn't there. --wunder [...] -- Walter Underwood wun...@wunderwood.org
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
just an update, I'm at 1M records now with no issues. This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. [...]
RE: Solr Multiword Search
The following query is doing a word search (based on my previous post):

solr/spell?q=(charles+and+the+choclit+factory+OR+(title2:(charles+and+the+choclit+factory)))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory

It produces a lot of unwanted matches. In order to do a phrase search, I changed it to quote the phrase:

solr/spell?q=(%22charles+and+the+choclit+factory%22+OR+(title2:%22charles+and+the+choclit+factory%22))&spellcheck.collate=true&spellcheck=true&spellcheck.q=charles+and+the+choclit+factory

It does not find any match for the words in the phrase I am looking for and does poorly in the suggested collations. I want phrase corrections. How do I achieve this? "charles and the chocolit factory" produces the following collations:

<bool name="correctlySpelled">false</bool>
<lst name="collation">
  <str name="collationQuery">charles and the chocolat factory</str>
  <int name="hits">2849777</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocalit factory</str>
  <int name="hits">2849464</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocalit</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocolat factors</str>
  <int name="hits">2841190</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factors</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charley and the chocolat factory</str>
  <int name="hits">2827908</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charley</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocalit factors</str>
  <int name="hits">2840877</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocalit</str><str name="factory">factors</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocklit factory</str>
  <int name="hits">2849464</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocklit</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charles and the chocolat factorz</str>
  <int name="hits">2841173</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charles</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factorz</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charley and the chocalit factory</str>
  <int name="hits">2827595</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charley</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocalit</str><str name="factory">factory</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charley and the chocolat factors</str>
  <int name="hits">2819321</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charley</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factors</str></lst>
</lst>
<lst name="collation">
  <str name="collationQuery">charlies and the chocolat factory</str>
  <int name="hits">2826661</int>
  <lst name="misspellingsAndCorrections"><str name="charles">charlies</str><str name="and">and</str><str name="the">the</str><str name="chocolit">chocolat</str><str name="factory">factory</str></lst>
</lst>

Notice the number of hits. These counts do not look right for a phrase search. Please help. Thanks. -- Sandeep -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Multiword-Search-tp4053038p4053674.html Sent from the Solr - User mailing list archive at Nabble.com.
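One way to issue the phrase form with explicit quoting and URL encoding is sketched below. The host, port, and /spell handler path are assumptions based on the thread; the spellcheck parameters are the ones used above, and whether phrase search fixes the collation counts depends on the schema.

```python
# Building a properly URL-encoded phrase query for a /spell handler.
# Host, port, handler path, and the field name "title2" are taken from
# the thread and are assumptions, not a verified setup.
from urllib.parse import urlencode

phrase = "charles and the choclit factory"
params = {
    # Quote the phrase so the default field and title2 match it as a unit.
    "q": '("%s" OR title2:"%s")' % (phrase, phrase),
    "spellcheck": "true",
    "spellcheck.collate": "true",
    "spellcheck.q": phrase,
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/spell?" + query_string
print(url)
```

Letting urlencode do the escaping avoids hand-building `%22` and `+` sequences in the URL.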
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout?

WARNING: Overseer cannot talk to ZK
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.DistributedQueue$LatchChildWatcher process
INFO: Watcher fired on path: null state: Expired type None
Apr 3, 2013 11:14:25 PM org.apache.solr.cloud.Overseer$ClusterStateUpdater run
WARNING: Solr cannot talk to ZK, exiting Overseer main queue loop
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /overseer/queue
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
	at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1468)
	at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:236)
	at org.apache.solr.common.cloud.SolrZkClient$6.execute(SolrZkClient.java:233)
	at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65)
	at org.apache.solr.common.cloud.SolrZkClient.getChildren(SolrZkClient.java:233)
	at org.apache.solr.cloud.DistributedQueue.orderedChildren(DistributedQueue.java:89)
	at org.apache.solr.cloud.DistributedQueue.element(DistributedQueue.java:131)
	at org.apache.solr.cloud.DistributedQueue.peek(DistributedQueue.java:326)
	at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:128)
	at java.lang.Thread.run(Thread.java:662)

On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues. This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now.
I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote: I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time. I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers. - Mark On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote: answered my own question, it now says compositeId. What is problematic though is that in addition to my shards (which are say jamie-shard1) I see the solr created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: ah interesting... so I need to specify numShards, blow out zk and then try this again to see if things work properly now. What is really strange is that for the most part things have worked right and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk and begin again. If this works properly what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates.
In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself. - Mark On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote: The router says implicit. I did start from a blank zk state but perhaps I missed one of the ZkCLI commands? One of my shards from the clusterstate.json is shown below. What is the process that should be done to bootstrap a cluster other than the ZkCLI commands I listed above? My process right now is to run those ZkCLI commands and then start solr on all of the instances with a command like this java -server -Dshard=shard5 -DcoreName=shard5-core1
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Yeah. Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds. - Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...] On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues.
This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote: I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time. I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers. - Mark On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote: answered my own question, it now says compositeId. What is problematic though is that in addition to my shards (which are say jamie-shard1) I see the solr created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: ah interesting... so I need to specify numShards, blow out zk and then try this again to see if things work properly now.
What is really strange is that for the most part things have worked right and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk and begin again. If this works properly what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to distribute updates. In the past, partitioning was done on the fly - but for shard splitting and perhaps other features, we now divvy up the hash range up front based on numShards and store it in ZooKeeper. No numShards is now how you take complete control of updates yourself. - Mark On Apr 3, 2013, at 2:57 PM, Jamie Johnson jej2...@gmail.com wrote: The router says implicit. I did start from a blank zk state but perhaps
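The 15-second session timeout discussed in this thread is controlled by zkClientTimeout. A hedged sketch of where it can be raised in a Solr 4.x-style solr.xml follows; the zkClientTimeout attribute is the actual setting, while the surrounding attribute values mirror the stock example file and may differ from this cluster's setup:

```xml
<!-- solr.xml (Solr 4.x style): raise the ZooKeeper session timeout
     from the 15s default to 30s. zkClientTimeout is the real knob;
     the other attribute values here follow the stock example file. -->
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
         host="${host:}" hostPort="${jetty.port:8983}"
         hostContext="${hostContext:solr}"
         zkClientTimeout="${zkClientTimeout:30000}">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
```

Because the value is read through a property placeholder, it can also be overridden at startup with -DzkClientTimeout=30000.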
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
This shouldn't be a problem though, if things are working as they are supposed to. Another node should simply take over as the overseer and continue processing the work queue. It's just best if you configure so that session timeouts don't happen unless a node is really down. On the other hand, it's nicer to detect that faster. Your tradeoff to make. - Mark On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote: Yeah. Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds. - Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...] On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues. This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at 3:56 PM, Mark Miller markrmil...@gmail.com wrote: I had thought you could - but looking at the code recently, I don't think you can anymore. I think that's a technical limitation more than anything though. When these changes were made, I think support for that was simply not added at the time. I'm not sure exactly how straightforward it would be, but it seems doable - as it is, the overseer will preallocate shards when first creating the collection - that's when they get named shard(n). There would have to be logic to replace shard(n) with the custom shard name when the core actually registers. - Mark On Apr 3, 2013, at 3:42 PM, Jamie Johnson jej2...@gmail.com wrote: answered my own question, it now says compositeId.
What is problematic though is that in addition to my shards (which are say jamie-shard1) I see the solr created shards (shard1). I assume that these were created because of the numShards param. Is there no way to specify the names of these shards? On Wed, Apr 3, 2013 at 3:25 PM, Jamie Johnson jej2...@gmail.com wrote: ah interesting... so I need to specify numShards, blow out zk and then try this again to see if things work properly now. What is really strange is that for the most part things have worked right and on 4.2.1 I have 600,000 items indexed with no duplicates. In any event I will specify numShards, clear out zk and begin again. If this works properly what should the router type be? On Wed, Apr 3, 2013 at 3:14 PM, Mark Miller markrmil...@gmail.com wrote: If you don't specify numShards after 4.1, you get an implicit doc router and it's up to you to
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
I am not using the concurrent low pause garbage collector, I could look at switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC correct? I also just had a shard go down and am seeing this in the log:

SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state down for 10.38.33.17:7576_solr but I still do not see the requested state. I see state: recovering live:false
	at org.apache.solr.handler.admin.CoreAdminHandler.handleWaitForStateAction(CoreAdminHandler.java:890)
	at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
	at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:591)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:192)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:141)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)

Nothing other than this in the log jumps out as interesting though. On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote: This shouldn't be a problem though, if things are working as they are supposed to. Another node should simply take over as the overseer and continue processing the work queue. It's just best if you configure so that session timeouts don't happen unless a node is really down. On the other hand, it's nicer to detect that faster. Your tradeoff to make. - Mark On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote: Yeah.
Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds. - Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...] On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues.
This looks promising as to the cause of my issues, thanks for the help. Is the routing method with numShards documented anywhere? I know numShards is documented but I didn't know that the routing changed if you don't specify it. On Wed, Apr 3, 2013 at 4:44 PM, Jamie Johnson jej2...@gmail.com wrote: with these changes things are looking good, I'm up to 600,000 documents without any issues as of right now. I'll keep going and add more to see if I find anything. On Wed, Apr 3, 2013 at 4:01 PM, Jamie Johnson jej2...@gmail.com wrote: ok, so that's not a deal breaker for me. I just changed it to match the shards that are auto created and it looks like things are happy. I'll go ahead and try my test to see if I can get things out of sync. On Wed, Apr 3, 2013 at
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
On Apr 3, 2013, at 8:17 PM, Jamie Johnson jej2...@gmail.com wrote: I am not using the concurrent low pause garbage collector, I could look at switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC correct? Right - if you don't do that, the default is almost always the throughput collector (I've only seen OS X buck this trend, back when Apple handled Java). That means stop-the-world garbage collections, so with larger heaps, that can be a fair amount of time during which no threads can run. It's not great for something as interactive as search anyway, and it's especially bad when combined with heavy load and a 15 sec session timeout between Solr and zk. The below is odd - a replica node is waiting for the leader to see it as recovering and live - live means it has created an ephemeral node for that Solr CoreContainer in zk - it's very strange if that didn't happen, unless this happened during shutdown or something. I also just had a shard go down and am seeing this in the log SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state down for 10.38.33.17:7576_solr but I still do not see the requested state.
I see state: recovering live:false [...] Nothing other than this in the log jumps out as interesting though. On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote: This shouldn't be a problem though, if things are working as they are supposed to. Another node should simply take over as the overseer and continue processing the work queue. It's just best if you configure so that session timeouts don't happen unless a node is really down. On the other hand, it's nicer to detect that faster. Your tradeoff to make. - Mark On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote: Yeah. Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds.
- Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...] On Wed, Apr 3, 2013 at 7:25 PM, Jamie Johnson jej2...@gmail.com wrote: just an update, I'm at 1M records now with no issues. This looks promising as to
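For reference, a sketch of a launch command enabling the CMS (concurrent low-pause) collector discussed in this thread. The flags are standard HotSpot options of the Solr 4.x / Java 6-7 era, but the heap size is a placeholder and real tuning depends on the workload:

```shell
# Enable the CMS collector for a Solr 4.x jetty launch. Heap size and
# occupancy fraction are placeholders, not a tuned recommendation.
java -server -Xms2g -Xmx2g \
     -XX:+UseConcMarkSweepGC \
     -XX:+UseParNewGC \
     -XX:+CMSParallelRemarkEnabled \
     -XX:CMSInitiatingOccupancyFraction=75 \
     -XX:+UseCMSInitiatingOccupancyOnly \
     -DzkClientTimeout=30000 \
     -jar start.jar
```

Shorter pauses reduce the chance of blowing the ZooKeeper session timeout, which is what triggers the SessionExpiredException above.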
hl.usePhraseHighlighter defaults to true but Query form and wiki suggest otherwise
Minor issues - It seems that hl.usePhraseHighlighter is enabled by default, which definitely makes sense, but the wiki says its default value is false and the checkbox is unchecked by default on the Query form. This gives the impression that the parameter defaults to false. I'm assuming the code is right in this case and we just need a JIRA to bring the Query form in sync with the code. I can update the wiki ... just want to make sure that having this field enabled by default is the correct behavior before I update things. Cheers, Tim
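Until the wiki and Query form agree, the ambiguity can be sidestepped by setting the parameter explicitly per request. A small sketch of building such a request, where the host, core path, and the field name "text" are assumptions for illustration:

```python
# Pin hl.usePhraseHighlighter explicitly instead of relying on the default,
# so the wiki/Query-form discrepancy described above cannot bite.
# Host, handler path, and the field name "text" are assumptions.
from urllib.parse import urlencode

params = {
    "q": 'text:"session management"',
    "hl": "true",
    "hl.fl": "text",
    "hl.usePhraseHighlighter": "true",  # make the effective value explicit
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

An explicit parameter in the request also makes the behavior survive any future change to the server-side default.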
Re: Solr 4.2 Cloud Replication Replica has higher version than Master?
Thanks I will try that. On Wed, Apr 3, 2013 at 8:28 PM, Mark Miller markrmil...@gmail.com wrote: On Apr 3, 2013, at 8:17 PM, Jamie Johnson jej2...@gmail.com wrote: I am not using the concurrent low pause garbage collector, I could look at switching, I'm assuming you're talking about adding -XX:+UseConcMarkSweepGC correct? Right - if you don't do that, the default is almost always the throughput collector (I've only seen OSX buck this trend when apple handled java). That means stop the world garbage collections, so with larger heaps, that can be a fair amount of time that no threads can run. It's not that great for something as interactive as search generally is anyway, but it's always not that great when added to heavy load and a 15 sec session timeout between solr and zk. The below is odd - a replica node is waiting for the leader to see it as recovering and live - live means it has created an ephemeral node for that Solr corecontainer in zk - it's very strange if that didn't happen, unless this happened during shutdown or something. I also just had a shard go down and am seeing this in the log SEVERE: org.apache.solr.common.SolrException: I was asked to wait on state down for 10.38.33.17:7576_solr but I still do not see the requested state. 
I see state: recovering live:false [...] Nothing other than this in the log jumps out as interesting though. On Wed, Apr 3, 2013 at 7:47 PM, Mark Miller markrmil...@gmail.com wrote: This shouldn't be a problem though, if things are working as they are supposed to. Another node should simply take over as the overseer and continue processing the work queue. It's just best if you configure so that session timeouts don't happen unless a node is really down. On the other hand, it's nicer to detect that faster. Your tradeoff to make. - Mark On Apr 3, 2013, at 7:46 PM, Mark Miller markrmil...@gmail.com wrote: Yeah. Are you using the concurrent low pause garbage collector? This means the overseer wasn't able to communicate with zk for 15 seconds - due to load or gc or whatever. If you can't resolve the root cause of that, or the load just won't allow for it, next best thing you can do is raise it to 30 seconds.
- Mark On Apr 3, 2013, at 7:41 PM, Jamie Johnson jej2...@gmail.com wrote: I am occasionally seeing this in the log, is this just a timeout issue? Should I be increasing the zk client timeout? WARNING: Overseer cannot talk to ZK [...]
Difference Between Indexing and Reindexing
OK, this could be a very easy question, but I want to learn a bit more of the technical detail behind it. When I use Nutch to send documents to Solr for indexing, there are two parameters: -index and -reindex. What does Solr do for each one that is different from the other?