Maximum solr processes per machine
Hi, I'm thinking about Solr cluster architecture before purchasing machines. My total index size is around 5TB, and I want a replication factor of 3, so 15TB in total. I've understood that I should have 50-100% of the index size as RAM, for the OS cache. Let's say we're talking about around 10TB of memory. Now I need to split this memory across multiple servers and work out the machine spec I want to buy.

I'm thinking of running multiple Solr processes per machine. Is there an upper limit on the number of Solr processes per machine, assuming I make sure that the total size of the indexes of all nodes on the machine stays within the RAM percentage I've defined?

--
View this message in context: http://lucene.472066.n3.nabble.com/Maximum-solr-processes-per-machine-tp4092568.html
Sent from the Solr - User mailing list archive at Nabble.com.
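To make the sizing concrete, here is a rough back-of-the-envelope sketch of the RAM math described above; the machine count and the two-thirds cache fraction are illustrative assumptions, not recommendations.

```java
// Back-of-the-envelope cache-RAM sizing for the cluster described above.
// 5 TB index * 3 replicas = 15 TB on disk; the 50-100% rule of thumb says
// half to all of that again in free RAM for the OS page cache.
public class CacheSizing {
    // cacheFraction: 0.5 for the 50% rule, 1.0 for the 100% rule
    static double cacheRamTb(double indexTb, int replicationFactor, double cacheFraction) {
        return indexTb * replicationFactor * cacheFraction;
    }

    public static void main(String[] args) {
        double totalTb = cacheRamTb(5.0, 3, 2.0 / 3.0);   // roughly the 10 TB in the question
        int machines = 12;                                // hypothetical machine count
        System.out.printf("cluster cache RAM: %.1f TB, per machine: %.0f GB%n",
                totalTb, totalTb * 1024 / machines);
    }
}
```

Divide by however many machines are actually bought; the point is that per-machine RAM, not process count, is the real constraint.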
Re: Maximum solr processes per machine
bq: is there an upper limit on the number of solr processes per machine

No, assuming they're all in separate JVMs. I've seen reports, though, that increasing the number of JVMs past the number of CPU cores gets into iffy territory. And, depending on your disk storage, they may all be contending for disk access.

FWIW,
Erick

On Sun, Sep 29, 2013 at 9:21 AM, adfel70 adfe...@gmail.com wrote:
> Hi, I'm thinking about Solr cluster architecture before purchasing machines. [snip]
> Is there an upper limit on the number of Solr processes per machine, assuming I make sure that the total size of the indexes of all nodes on the machine stays within the RAM percentage I've defined?
Re: Maximum solr processes per machine
How can I configure the disk storage so that disk access is optimized? I'm considering RAID-10, and I think I'll have around 4-8 disks per machine. Should I point each Solr JVM at a data dir on a different disk, or is there some other way to optimize this?

Erick Erickson wrote
> bq: is there an upper limit on the number of solr processes per machine
> No, assuming they're all in separate JVMs. I've seen reports, though, that increasing the number of JVMs past the number of CPU cores gets into iffy territory. And, depending on your disk storage, they may all be contending for disk access. [snip]

--
View this message in context: http://lucene.472066.n3.nabble.com/Maximum-solr-processes-per-machine-tp4092568p4092574.html
Sent from the Solr - User mailing list archive at Nabble.com.
ClusteringComponent under Tomcat 7
Hi,

I'm trying to run Solr 4.3 (and 4.4) with -Dsolr.clustering.enabled=true. I've copied all the relevant jars to the ./lib directory under the instance. With Jetty it runs OK! But under Tomcat I receive the error (exception) below. Any idea/help?

Thanks,
-Ariel

org.apache.solr.common.SolrException: Error Instantiating SearchComponent, solr.clustering.ClusteringComponent failed to instantiate org.apache.solr.handler.component.SearchComponent
    at org.apache.solr.core.SolrCore.init(SolrCore.java:835)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:629)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:622)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:657)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:364)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: Error Instantiating SearchComponent, solr.clustering.ClusteringComponent failed to instantiate org.apache.solr.handler.component.SearchComponent
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:551)
    at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:586)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2173)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2167)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2200)
    at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:1231)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:766)
    ... 13 more
Caused by: java.lang.ClassCastException: class org.apache.solr.handler.clustering.ClusteringComponent
    at java.lang.Class.asSubclass(Unknown Source)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:443)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:381)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:530)
    ... 19 more

ERROR - 2013-09-29 05:58:13.519; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Unable to create core: att150K
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1150)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:666)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:364)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
    at java.util.concurrent.FutureTask.run(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: Error Instantiating SearchComponent, solr.clustering.ClusteringComponent failed to instantiate org.apache.solr.handler.component.SearchComponent
    at org.apache.solr.core.SolrCore.init(SolrCore.java:835)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:629)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:622)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:657)
    ... 10 more
Caused by: org.apache.solr.common.SolrException: Error Instantiating SearchComponent, solr.clustering.ClusteringComponent failed to instantiate org.apache.solr.handler.component.SearchComponent
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:551)
    at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:586)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2173)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2167)
    at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2200)
    at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:1231)
    at org.apache.solr.core.SolrCore.init(SolrCore.java:766)
    ... 13 more
Caused by: java.lang.ClassCastException: class org.apache.solr.handler.clustering.ClusteringComponent
    at java.lang.Class.asSubclass(Unknown Source)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:443)
    at
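A ClassCastException thrown from Class.asSubclass like the one above usually means the clustering classes were loaded twice by different classloaders - for example, the contrib jars copied both into Tomcat's shared lib directory and into the core's ./lib. A sketch of the usual fix is to keep the jars in exactly one place and reference them from solrconfig.xml (the paths are illustrative and depend on your layout):

```xml
<!-- Load the clustering contrib jars from one location only;
     remove any duplicate copies from Tomcat's common/shared lib dirs. -->
<lib dir="../../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../../dist/" regex="solr-clustering-\d.*\.jar" />
```

With the jars in a single location, the same classloader that loads SearchComponent also loads ClusteringComponent, and the cast succeeds.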
Re: Maximum solr processes per machine
On 09/29/2013 04:03 PM, adfel70 wrote:
> How can I configure the disk storage so that disk access is optimized? I'm considering RAID-10, and I think I'll have around 4-8 disks per machine. Should I point each Solr JVM at a data dir on a different disk, or is there some other way to optimize this?

The best way to deal with this is trial and error. There are many factors that can contribute to your hardware decisions. Will there be concurrent access on all Solr instances? Will some be used more than others? If you have a couple of highly used and many seldom-used instances, then there's no problem running each in a different JVM. If you have more highly used instances/JVMs than CPU cores... you're in trouble.

Are you doing real-time search? Or is the data mostly static? If the data doesn't change much, then good warming-up queries will be a lot more useful than trying to tie Solr to specific disks. If you're doing real time on a 5TB index, then you'll probably want to throw your money at the fastest storage you can afford (SSDs vs spinning rust made a huge difference in our benchmarks) and the fastest CPUs you can get your hands on. Memory is important too, but in our benchmarks it didn't have as much impact as the other factors. Keeping a 5TB index in memory is going to be tricky, so in my opinion you'd be better off investing in faster disks instead.

- Bram
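If you do end up spreading cores across disks, one way (a sketch, with hypothetical mount points) is the per-core dataDir attribute in the legacy Solr 4.x solr.xml:

```xml
<!-- One core per physical disk; mount points are hypothetical. -->
<cores adminPath="/admin/cores">
  <core name="shard1" instanceDir="shard1" dataDir="/mnt/disk1/solr/shard1/data" />
  <core name="shard2" instanceDir="shard2" dataDir="/mnt/disk2/solr/shard2/data" />
</cores>
```

Note that RAID-10 already stripes I/O across all the disks in the array, so per-core dataDirs mainly matter if the disks are used individually (JBOD) rather than in one array.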
Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception
How dumb can you get... obviously quite dumb. I would have to analyze the HTML pages with a nested instance like this:

<entity name="rec" processor="XPathEntityProcessor"
        url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
        forEach="/docs/doc" dataSource="main">
  <entity name="htm" processor="XPathEntityProcessor" url="${rec.urlParse}"
          forEach="/xhtml:html" dataSource="dataUrl">
    <field column="text" xpath="//content" />
    <field column="h_2" xpath="//body" />
    <field column="text_nohtml" xpath="//text" />
    <field column="h_1" xpath="//h:h1" />
  </entity>
</entity>

But I'm pretty sure the forEach and the xpath expressions are wrong. At the moment I'm getting the following error:

Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to java.io.Reader

On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote:
> OK, I see what you're getting at, but why doesn't the following work:
>
> <field xpath="//h:h1" column="h_1" />
> <field column="text" xpath="/xhtml:html/xhtml:body" />
>
> I removed the tika-processor. What am I missing? I haven't found anything in the wiki.

On 28. Sep 2013, at 12:28 AM, P Williams wrote:
> I spent some more time thinking about this. Do you really need to use the TikaEntityProcessor? It doesn't offer anything new to the document you are building that couldn't be accomplished by the XPathEntityProcessor alone, from what I can tell.
>
> I also tried to get the Advanced Parsing example (http://wiki.apache.org/solr/TikaEntityProcessor) to work, without success. There are some obvious typos (<document> instead of </document>) and an odd order to the pieces (dataSources is enclosed by document). It also looks like FieldStreamDataSource (http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html) is the one that is meant to work in this context. If Koji is still around, maybe he could offer some help?
Otherwise this bit of erroneous instruction should probably be removed from the wiki.

Cheers,
Tricia

$ svn diff
Index: solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
===================================================================
--- solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (revision 1526990)
+++ solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (working copy)
@@ -99,13 +99,13 @@
     runFullImport(getConfigHTML("identity"));
     assertQ(req("*:*"), testsHTMLIdentity);
   }
-
+
   private String getConfigHTML(String htmlMapper) {
     return
         "<dataConfig>" +
         "<dataSource type='BinFileDataSource'/>" +
         "<document>" +
-        "<entity name='Tika' format='xml' processor='TikaEntityProcessor' " +
+        "<entity name='Tika' format='html' processor='TikaEntityProcessor' " +
         "url='" + getFile("dihextras/structured.html").getAbsolutePath() + "' " +
         ((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper + "'")) + ">" +
         "<field column='text'/>" +
@@ -114,4 +114,36 @@
     "</dataConfig>";
   }
+
+  private String[] testsHTMLH1 = {
+      "//*[@numFound='1']"
+      , "//str[@name='h1'][contains(.,'H1 Header')]"
+  };
+
+  @Test
+  public void testTikaHTMLMapperSubEntity() throws Exception {
+    runFullImport(getConfigSubEntity("identity"));
+    assertQ(req("*:*"), testsHTMLH1);
+  }
+
+  private String getConfigSubEntity(String htmlMapper) {
+    return
+        "<dataConfig>" +
+        "<dataSource type='BinFileDataSource' name='bin'/>" +
+        "<dataSource type='FieldStreamDataSource' name='fld'/>" +
+        "<document>" +
+        "<entity name='tika' processor='TikaEntityProcessor' url='" + getFile("dihextras/structured.html").getAbsolutePath() + "' dataSource='bin' format='html' rootEntity='false'>" +
+        "<!-- Do appropriate mapping here; meta=\"true\" means it is a metadata field -->" +
+        "<field column='Author' meta='true' name='author'/>" +
+        "<field column='title' meta='true' name='title'/>" +
+        "<!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately -->" +
+        "<field name='text' column='text'/>" +
+        "<entity name='detail' type='XPathEntityProcessor' forEach='/html' dataSource='fld' dataField='tika.text' rootEntity='true'>" +
+        "<field xpath='//div' column='foo'/>" +
+        "<field xpath='//h1' column='h1'/>" +
+        "</entity>" +
+        "</entity>" +
+        "</document>" +
+        "</dataConfig>";
+  }
+}
Index:
Re: Hello and help :)
Thanks for the answer. Yes, you understood it correctly. The method you proposed should work perfectly, except I have one more requirement that I forgot to mention earlier, and I apologize for that. The true problem we are facing is:

* find all documents for userID=x, where userID=x has more than y documents in the index between dateA and dateB

And since dateA and dateB can be any dates, it's impossible to save the count, since we cannot foresee what date and what count will be requested.

2013/9/28 Upayavira u...@odoko.co.uk
> To phrase your need more generically:
>
> * find all documents for userID=x, where userID=x has more than y documents in the index
>
> Is that correct? If it is, I'd probably do some work at index time. First guess, I'd keep a separate core, which has a very small document per user, storing just:
>
> * userID
> * docCount
>
> Then, when you add/delete a document, you use atomic updates to either increase or decrease the docCount on that user doc. Then you can use a pseudo-join between these two cores relatively easily:
>
> q=user_id:x {!join fromIndex=user from=user_id to=user_id}+user_id:x +doc_count:[y TO *]
>
> Worst case, if you don't want to mess with your indexing code, I wonder if you could use a ScriptUpdateProcessor to do this work - not sure if you can have one add an entirely new, additional document to the list, but it may be possible.
>
> Upayavira

On Fri, Sep 27, 2013, at 09:50 PM, Matheus Salvia wrote:
> Sure, sorry for the inconvenience. I'm having a little trouble trying to make a query in Solr. The problem is: I must be able to retrieve documents that have the same value for a specified field, but they should only be retrieved if this value appeared more than X times for a specified user.
In pseudo-SQL it would be something like:

select user_id from documents where my_field=my_value and (select count(*) from documents where my_field=my_value and user_id=super.user_id) > X

I know that Solr returns a 'numFound' for each query you make, but I don't know how to retrieve this value in a subquery.

My Solr is organized in a way that a user is a document, and the properties of the user (such as name, age, etc.) are grouped in another document with a 'root_id' field. So let's suppose the following query that gets all the root documents whose children have the prefix some_prefix:

is_root:true AND _query_:"{!join from=root_id to=id}requests_prefix:\"some_prefix\""

Now, how can I get the root documents (users in some sense) that have more than X children matching 'requests_prefix:some_prefix' or any other condition? Is it possible?

P.S. It must be done in a single query; fields can be added at will, but the root/children structure should be preserved (preferentially).

2013/9/27 Upayavira u...@odoko.co.uk
> Matheus,
>
> Given these mails form a part of an archive that are themselves self-contained, can you please post your actual question here? You're more likely to get answers that way.
>
> Thanks,
> Upayavira

On Fri, Sep 27, 2013, at 04:36 PM, Matheus Salvia wrote:
> Hello everyone, I'm having a problem regarding how to make a Solr query; I've posted it on stackoverflow. Can someone help me?
> http://stackoverflow.com/questions/19039099/apache-solr-count-of-subquery-as-a-superquery-parameter
> Thanks in advance!

--
// Matheus Salvia
Desenvolvedor Mobile
Celular: +55 11 9-6446-2332
Skype: meta.faraday
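Since dateA and dateB are arbitrary, one read-time approach worth trying (not proposed in this thread, so treat it as a sketch) is plain field faceting with facet.mincount: filter to the date range, facet on the user field, and let mincount drop users below the threshold. Field names and dates here are hypothetical:

```
q=requests_prefix:some_prefix
fq=date:[2013-01-01T00:00:00Z TO 2013-09-30T23:59:59Z]
rows=0
facet=true
facet.field=user_id
facet.mincount=11
facet.limit=-1
```

facet.mincount is inclusive, so "more than y" means mincount=y+1 (11 for y=10 above). This first query returns the qualifying user_ids with their counts; a second query can then fetch the documents for those users.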
Nagle's Algorithm
How do I set TCP_NODELAY on the HTTP sockets for Jetty in Solr 4? Is there an option in jetty.xml?

/* Create new stream socket */
sock = socket( AF_INET, SOCK_STREAM, 0 );

/* Disable the Nagle (TCP No Delay) algorithm */
flag = 1;
ret = setsockopt( sock, IPPROTO_TCP, TCP_NODELAY, (char *)&flag, sizeof(flag) );

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
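For what it's worth, the Java-side equivalent of the C calls above is a one-liner on java.net.Socket. This sketch just demonstrates the API on a loopback connection; it is not a jetty.xml setting:

```java
import java.net.ServerSocket;
import java.net.Socket;

public class NoDelayDemo {
    public static void main(String[] args) throws Exception {
        // Loopback listener purely so the client socket has something to connect to.
        try (ServerSocket server = new ServerSocket(0);
             Socket sock = new Socket("localhost", server.getLocalPort())) {
            sock.setTcpNoDelay(true);   // same effect as setsockopt(..., TCP_NODELAY, ...)
            System.out.println("TCP_NODELAY = " + sock.getTcpNoDelay());
        }
    }
}
```

In practice, the socket is created inside the HTTP client library, so the question becomes whether that library (or the container) sets this flag for you.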
Re: Nagle's Algorithm
I don't keep up with this list well enough to know whether anyone else answered. I don't know how to do it in jetty.xml, but you can certainly tweak the code. java.net.Socket has a method setTcpNoDelay() that corresponds to the standard Unix system calls. Long-time past, my suggestion of this made Apache Axis 2.0 250ms faster per call (1). Now I want to know whether Apache Solr sets it.

One common way to test the overhead portion of latency is to project the latency for a zero-size request based on larger requests. What you do is warm requests (all in memory) for progressively fewer and fewer rows. You can make requests for 100, 90, 80, 70 ... 10 rows, each more than once, so that all is warmed. If you plot this, it should look like a linear function latency(rows) = m*rows + b, since all is cached in memory. You have to control what else is going on on the server to get the linear plot, of course - it can be quite hard to get this to work right on modern Linux. But once you have it, you can simply calculate f(0) and you have the latency for a theoretical zero-sized request.

This is a tangential answer at best - I wish I just knew a setting to give you.

(1) Latency Performance of SOAP Implementations: http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.21.8556&type=ab

On Sun, Sep 29, 2013 at 9:22 PM, William Bell billnb...@gmail.com wrote:
> How do I set TCP_NODELAY on the HTTP sockets for Jetty in Solr 4? Is there an option in jetty.xml? [snip]
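The extrapolation described above - fit latency(rows) = m*rows + b to warmed measurements and read off f(0) = b - is a plain least-squares fit. A minimal sketch, with made-up measurements:

```java
// Least-squares fit of latency(rows) = m*rows + b; the intercept b estimates
// the fixed per-request overhead (network, dispatch) of a zero-row request.
public class LatencyIntercept {
    // Returns {m, b} for y = m*x + b via ordinary least squares.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        double m = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double b = (sy - m * sx) / n;
        return new double[] { m, b };
    }

    public static void main(String[] args) {
        // Hypothetical warmed measurements: rows requested vs. latency in ms.
        double[] rows = { 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 };
        double[] ms   = { 6.1, 7.0, 8.2, 9.1, 9.9, 11.2, 12.0, 13.1, 13.9, 15.0 };
        double[] mb = fit(rows, ms);
        System.out.printf("per-row cost ~ %.3f ms, zero-row overhead ~ %.1f ms%n",
                mb[0], mb[1]);
    }
}
```

If the residuals are large, the linearity assumption is violated (caching or contention effects), and the extrapolated intercept is not trustworthy.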
Re: Nagle's Algorithm
I dunno, but this makes it look as if this may already be taken care of: http://jira.codehaus.org/browse/JETTY-1196

On 9/29/2013 9:22 PM, William Bell wrote:
> How do I set TCP_NODELAY on the HTTP sockets for Jetty in Solr 4? Is there an option in jetty.xml? [snip]
Re: Maximum solr processes per machine
On 9/29/2013 7:21 AM, adfel70 wrote:
> Hi, I'm thinking about Solr cluster architecture before purchasing machines. My total index size is around 5TB, and I want a replication factor of 3, so 15TB in total. [snip] I'm thinking of running multiple Solr processes per machine.

Running multiple Solr instances per machine is a really bad idea. One Solr instance can run many indexes, and there will be far less memory overhead if you're not running multiple servlet containers. You can also run all indexes on the same TCP port - no need to figure out different ports per instance. Configuration and deployment are not as complicated.

When you have multiple Solr instances per machine, the SolrCloud collections API has a tendency to place some or all of the replicas for each shard on the same machine, which means that it won't be fault tolerant. With one instance per machine, you can be absolutely sure that created collections will have all replicas for each shard on different machines.

I will echo the advice you've been given about using SSDs. You'll need much less OS disk cache memory with SSDs.

Thanks,
Shawn
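With one Solr instance per machine, replica placement can additionally be constrained at collection-creation time. A sketch of the SolrCloud Collections API call (host, collection name, shard count, and config name are all illustrative):

```
http://host:8983/solr/admin/collections?action=CREATE
    &name=mycollection
    &numShards=5
    &replicationFactor=3
    &maxShardsPerNode=1
    &collection.configName=myconf
```

maxShardsPerNode=1 caps each node at a single core for the collection, so replicas of the same shard can never be co-located; the call fails outright if there aren't enough nodes (here, 5 shards x 3 replicas = 15 nodes) rather than doubling up.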
Re: Nagle's Algorithm
On 9/29/2013 7:22 PM, William Bell wrote:
> How do I set TCP_NODELAY on the http sockets for Jetty in SOLR 4?

The client usually makes that decision, not the server. This parameter is turned on by default for recent HttpClient versions, the library used by SolrJ. Even the JETTY issue uncovered by Michael Sokolov refers to a client connection.

Thanks,
Shawn