Maximum solr processes per machine

2013-09-29 Thread adfel70
Hi,
I'm thinking about Solr cluster architecture before purchasing machines.


My total index size is around 5TB, and I want a replication factor of 3:
15TB in total.
I've understood that I should have 50-100% of the index size as RAM for the
OS cache, so let's say we're talking about around 10TB of memory.
Now I need to split this memory across multiple servers and settle on the
machine spec I want to buy.
I'm thinking of running multiple Solr processes per machine.
Is there an upper limit on the number of Solr processes per machine, assuming
I make sure that the total size of the indexes of all nodes on the machine
stays within the RAM percentage I've defined?
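
Spelling out the arithmetic behind those numbers:

    5 TB per copy x 3 replicas        = 15 TB on disk
    OS cache at 50-100% of index size = 7.5-15 TB of RAM, roughly 10 TB in the middle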





Re: Maximum solr processes per machine

2013-09-29 Thread Erick Erickson
bq: Is there an upper limit on the number of Solr processes per machine?

No, assuming they're all in separate JVMs. I've seen reports, though,
that increasing the number of JVMs past the number of CPU
cores gets into iffy territory.

And, depending on your disk storage they may all be contending for
disk access.

FWIW,
Erick



Re: Maximum solr processes per machine

2013-09-29 Thread adfel70
How can I configure the disk storage so that disk access is optimized?
I'm considering RAID-10,
and I think I'll have around 4-8 disks per machine.
Should I point each Solr JVM at a data dir on a different disk, or is
there some other way to optimize this?



Erick Erickson wrote
 And, depending on your disk storage they may all be contending for
 disk access.



ClusteringComponent under Tomcat 7

2013-09-29 Thread Lieberman, Ariel
Hi,

I'm trying to run Solr 4.3 (and 4.4) with -Dsolr.clustering.enabled=true

I've copied all the relevant jars to the ./lib directory under the instance.

With Jetty it runs OK, but under Tomcat I receive the error (exception) below.

Any idea/help?

Thanks,

-Ariel


org.apache.solr.common.SolrException: Error Instantiating SearchComponent, solr.clustering.ClusteringComponent failed to instantiate org.apache.solr.handler.component.SearchComponent
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:835)
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:629)
 at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:622)
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:657)
 at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:364)
 at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: Error Instantiating SearchComponent, solr.clustering.ClusteringComponent failed to instantiate org.apache.solr.handler.component.SearchComponent
 at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:551)
 at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:586)
 at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2173)
 at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2167)
 at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2200)
 at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:1231)
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:766)
 ... 13 more
Caused by: java.lang.ClassCastException: class org.apache.solr.handler.clustering.ClusteringComponent
 at java.lang.Class.asSubclass(Unknown Source)
 at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:443)
 at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:381)
 at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:530)
 ... 19 more
ERROR - 2013-09-29 05:58:13.519; org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Unable to create core: att150K
 at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1150)
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:666)
 at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:364)
 at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
 at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
 at java.util.concurrent.FutureTask.run(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
 at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.solr.common.SolrException: Error Instantiating SearchComponent, solr.clustering.ClusteringComponent failed to instantiate org.apache.solr.handler.component.SearchComponent
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:835)
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:629)
 at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:622)
 at org.apache.solr.core.CoreContainer.create(CoreContainer.java:657)
 ... 10 more
Caused by: org.apache.solr.common.SolrException: Error Instantiating SearchComponent, solr.clustering.ClusteringComponent failed to instantiate org.apache.solr.handler.component.SearchComponent
 at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:551)
 at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:586)
 at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2173)
 at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2167)
 at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2200)
 at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:1231)
 at org.apache.solr.core.SolrCore.<init>(SolrCore.java:766)
 ... 13 more
Caused by: java.lang.ClassCastException: class org.apache.solr.handler.clustering.ClusteringComponent
 at java.lang.Class.asSubclass(Unknown Source)
 at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:443)

Re: Maximum solr processes per machine

2013-09-29 Thread Bram Van Dam

On 09/29/2013 04:03 PM, adfel70 wrote:

How can I configure the disk storage so that disk access is optimized?
I'm considering RAID-10,
and I think I'll have around 4-8 disks per machine.
Should I point each Solr JVM at a data dir on a different disk, or is
there some other way to optimize this?


The best way to deal with this is trial and error. There are many factors 
that can contribute to your hardware decisions. Will there be concurrent 
access on all Solr instances? Will some be used more than others?


If you have a couple of highly used and many seldom used instances, then 
there's no problem in running each in a different JVM. If you have 
more highly used instances/JVMs than CPU cores... you're in trouble.


Are you doing real-time search? Or is the data mostly static? If the 
data doesn't change much, then good warm-up queries will be a lot 
more useful than trying to tie Solr to specific disks.
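
Normally you'd configure warm-up via newSearcher listener queries in
solrconfig.xml; as an external alternative, here is a minimal SolrJ sketch
that fires a few warming queries after startup (the URL and the queries are
placeholders, not recommendations):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class Warmup {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL - adjust to your own deployment.
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        // Hit the fields, sorts and facets your real traffic uses, so the
        // relevant index files get pulled into the OS cache.
        String[] queries = {"*:*", "title:test", "body:example"};
        for (String q : queries) {
            SolrQuery query = new SolrQuery(q);
            query.setRows(10);
            server.query(query);
        }
        server.shutdown();
    }
}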


If you're doing real time on a 5TB index then you'll probably want to 
throw your money at the fastest storage you can afford (SSDs vs spinning 
rust made a huge difference in our benchmarks) and the fastest CPUs you 
can get your hands on. Memory is important too, but in our benchmarks 
that didn't have as much impact as the other factors. Keeping a 5TB 
index in memory is going to be tricky, so in my opinion you'd be better 
off investing in faster disks instead.


 - Bram


Re: XPathEntityProcessor nested in TikaEntityProcessor query null exception

2013-09-29 Thread Andreas Owen
How dumb can you get... obviously quite dumb. I would have to analyze the 
html-pages with a nested instance like this:

<entity name="rec" processor="XPathEntityProcessor"
        url="file:///C:\ColdFusion10\cfusion\solr\solr\tkbintranet\docImportUrl.xml"
        forEach="/docs/doc" dataSource="main">

    <entity name="htm" processor="XPathEntityProcessor"
            url="${rec.urlParse}" forEach="/xhtml:html" dataSource="dataUrl">
        <field column="text" xpath="//content" />
        <field column="h_2" xpath="//body" />
        <field column="text_nohtml" xpath="//text" />
        <field column="h_1" xpath="//h:h1" />
    </entity>
</entity>

but I'm pretty sure the forEach and the XPath expressions are wrong. At the
moment I'm getting the following error:

Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: sun.net.www.protocol.http.HttpURLConnection$HttpInputStream cannot be cast to java.io.Reader





On 28. Sep 2013, at 1:39 AM, Andreas Owen wrote:

 OK, I see what you're getting at, but why doesn't the following work:
   
   <field xpath="//h:h1" column="h_1" />
   <field column="text" xpath="/xhtml:html/xhtml:body" />
 
 I removed the Tika processor. What am I missing? I haven't found anything in 
 the wiki.
 
 
 On 28. Sep 2013, at 12:28 AM, P Williams wrote:
 
 I spent some more time thinking about this.  Do you really need to use the
 TikaEntityProcessor?  From what I can tell, it doesn't offer anything new to
 the document you are building that couldn't be accomplished by the
 XPathEntityProcessor alone.
 
 I also tried to get the Advanced Parsing example
 (http://wiki.apache.org/solr/TikaEntityProcessor) to work, without success.
 There are some obvious typos (<document> instead of </document>) and an odd
 order to the pieces (dataSources is enclosed by document).  It also looks like
 FieldStreamDataSource
 (http://lucene.apache.org/solr/4_3_1/solr-dataimporthandler/org/apache/solr/handler/dataimport/FieldStreamDataSource.html)
 is the one that is meant to work in this context.  If Koji is still around
 maybe he could offer some help?  Otherwise this bit of erroneous
 instruction should probably be removed from the wiki.
 
 Cheers,
 Tricia
 
 $ svn diff
 Index: solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java
 ===================================================================
 --- solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (revision 1526990)
 +++ solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestTikaEntityProcessor.java (working copy)
 @@ -99,13 +99,13 @@
      runFullImport(getConfigHTML("identity"));
      assertQ(req("*:*"), testsHTMLIdentity);
    }
 -
 +
    private String getConfigHTML(String htmlMapper) {
      return
          "<dataConfig>" +
          "<dataSource type='BinFileDataSource'/>" +
          "<document>" +
 -        "<entity name='Tika' format='xml' processor='TikaEntityProcessor' " +
 +        "<entity name='Tika' format='html' processor='TikaEntityProcessor' " +
          " url='" + getFile("dihextras/structured.html").getAbsolutePath() + "' " +
          ((htmlMapper == null) ? "" : (" htmlMapper='" + htmlMapper + "'")) + ">" +
          "<field column='text'/>" +
 @@ -114,4 +114,36 @@
          "</dataConfig>";
  
    }
 +  private String[] testsHTMLH1 = {
 +      "//*[@numFound='1']"
 +      , "//str[@name='h1'][contains(.,'H1 Header')]"
 +  };
 +
 +  @Test
 +  public void testTikaHTMLMapperSubEntity() throws Exception {
 +    runFullImport(getConfigSubEntity("identity"));
 +    assertQ(req("*:*"), testsHTMLH1);
 +  }
 +
 +  private String getConfigSubEntity(String htmlMapper) {
 +    return
 +    "<dataConfig>" +
 +    "<dataSource type='BinFileDataSource' name='bin'/>" +
 +    "<dataSource type='FieldStreamDataSource' name='fld'/>" +
 +    "<document>" +
 +    "<entity name='tika' processor='TikaEntityProcessor' url='" + getFile("dihextras/structured.html").getAbsolutePath() + "' dataSource='bin' format='html' rootEntity='false'>" +
 +    "<!-- Do appropriate mapping here; meta=\"true\" means it is a metadata field -->" +
 +    "<field column='Author' meta='true' name='author'/>" +
 +    "<field column='title' meta='true' name='title'/>" +
 +    "<!-- 'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately. -->" +
 +    "<field name='text' column='text'/>" +
 +    "<entity name='detail' type='XPathEntityProcessor' forEach='/html' dataSource='fld' dataField='tika.text' rootEntity='true'>" +
 +    "<field xpath='//div' column='foo'/>" +
 +    "<field xpath='//h1' column='h1'/>" +
 +    "</entity>" +
 +    "</entity>" +
 +    "</document>" +
 +    "</dataConfig>";
 +  }
 +
  }
 Index:

Re: Hello and help :)

2013-09-29 Thread Matheus Salvia
Thanks for the answer. Yes, you understood it correctly.
The method you proposed should work perfectly, except I do have one more
requirement that I forgot to mention earlier, and I apologize for that.
The true problem we are facing is:
* find all documents for userID=x, where userID=x has more than y
 documents in the index between dateA and dateB

And since dateA and dateB can be any dates, it's impossible to save the
count, since we cannot foresee what date and what count will be requested.
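
For reference, here is how I understood the counter idea (a minimal SolrJ
sketch, without the date dimension; the "users" core and the field names are
made up, and it assumes user_id is the uniqueKey and the updateLog is
enabled so atomic updates work):

import java.util.Collections;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class UserDocCounter {
    private final SolrServer users =
        new HttpSolrServer("http://localhost:8983/solr/users");

    // Called with +1 whenever a document is added for a user, -1 on delete.
    public void adjustCount(String userId, int delta) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("user_id", userId);
        // Atomic update: increment doc_count in place instead of reindexing.
        doc.addField("doc_count", Collections.singletonMap("inc", delta));
        users.add(doc);
    }
}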


2013/9/28 Upayavira u...@odoko.co.uk

 To phrase your need more generically:

  * find all documents for userID=x, where userID=x has more than y
  documents in the index

 Is that correct?

 If it is, I'd probably do some work at index time. First guess, I'd keep
 a separate core, which has a very small document per user, storing just:

  * userID
  * docCount

 Then, when you add/delete a document, you use atomic updates to either
 increase or decrease the docCount on that user doc.

 Then you can use a pseudo join between these two cores relatively
 easily.

 q=user_id:x {!join fromIndex=user from=user_id to=user_id}+user_id:x
 +doc_count:[y TO *]

 Worst case, if you don't want to mess with your indexing code, I wonder
 if you could use a ScriptUpdateProcessor to do this work - not sure if
 you can have one add an entirely new, additional, document to the list,
 but may be possible.

 Upayavira

 On Fri, Sep 27, 2013, at 09:50 PM, Matheus Salvia wrote:
  Sure, sorry for the inconvenience.
 
  I'm having a little trouble trying to make a query in Solr. The problem
  is:
  I must be able to retrieve documents that have the same value for a
  specified field, but they should only be retrieved if this value appeared
  more than X times for a specified user. In pseudo-SQL it would be something
  like:
 
  select user_id from documents
  where my_field=my_value
  and
  (select count(*) from documents where my_field=my_value and
  user_id=super.user_id) > X
 
  I know that Solr returns a 'numFound' for each query you make, but I don't
  know how to retrieve this value in a subquery.
 
  My Solr is organized in a way that a user is a document, and the properties
  of the user (such as name, age, etc.) are grouped in another document with
  a 'root_id' field. So let's suppose the following query that gets all the
  root documents whose children have the prefix some_prefix.
 
  is_root:true AND _query_:"{!join from=root_id
  to=id}requests_prefix:\"some_prefix\""
 
  Now, how can I get the root documents (users in some sense) that have more
  than X children matching 'requests_prefix:some_prefix' or any other
  condition? Is it possible?
 
  P.S. It must be done in a single query, fields can be added at will, but
  the root/children structure should be preserved (preferentially).
 
 
  2013/9/27 Upayavira u...@odoko.co.uk
 
   Mattheus,
  
   Given these mails form a part of an archive that are themselves
   self-contained, can you please post your actual question here? You're
   more likely to get answers that way.
  
   Thanks, Upayavira
  
   On Fri, Sep 27, 2013, at 04:36 PM, Matheus Salvia wrote:
Hello everyone,
I'm having a problem regarding how to make a solr query, I've posted
 it
on
stackoverflow.
Can someone help me?
   
  
 http://stackoverflow.com/questions/19039099/apache-solr-count-of-subquery-as-a-superquery-parameter
   
Thanks in advance!
   
  
 
 
 




-- 
--
 // Matheus Salvia
Desenvolvedor Mobile
Celular: +55 11 9-6446-2332
Skype: meta.faraday


Nagle's Algorithm

2013-09-29 Thread William Bell
How do I set TCP_NODELAY on the http sockets for Jetty in SOLR 4?

Is there an option in jetty.xml ?

/* Create new stream socket */

sock = socket( AF_INET, SOCK_STREAM, 0 );


/* Disable the Nagle (TCP No Delay) algorithm */

flag = 1;

ret = setsockopt( sock, IPPROTO_TCP, TCP_NODELAY, (char *)&flag,
sizeof(flag) );




-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Nagle's Algorithm

2013-09-29 Thread Dan Davis
I don't keep up with this list well enough to know whether anyone else
answered.  I don't know how to do it in jetty.xml, but you can certainly
tweak the code.   java.net.Socket has a method setTcpNoDelay() that
corresponds with the standard Unix system calls.
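
For instance (a trivial sketch; the host and port are placeholders):

import java.io.IOException;
import java.net.Socket;

public class NoDelay {
    public static void main(String[] args) throws IOException {
        // 8983 is just the usual Solr example port, not a recommendation.
        try (Socket sock = new Socket("localhost", 8983)) {
            sock.setTcpNoDelay(true); // same effect as setsockopt(..., TCP_NODELAY, ...)
            System.out.println("TCP_NODELAY enabled: " + sock.getTcpNoDelay());
        }
    }
}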

A long time ago, my suggestion of this made Apache Axis 2.0 250ms faster per
call (1).   Now I want to know whether Apache Solr sets it.

One common way to test the overhead portion of latency is to project the
latency of a zero-size request from larger requests.   What you do is
warm requests (all in memory) for progressively fewer and fewer
rows.   You can make requests for 100, 90, 80, 70 ... 10 rows, each more
than once so that everything is warmed.   If you plot this, it should look
like a linear function latency(rows) = m*rows + b, since everything is
cached in memory.   You have to control what else is going on on the server
to get the linear plot of course - it can be quite hard to get this to work
right on modern Linux.   But once you have it, you can simply calculate
latency(0) and you have the latency of a theoretical zero-size request.
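
As an illustration, a least-squares sketch of that projection (the
measurements below are invented numbers, not real benchmarks):

public class ZeroSizeLatency {
    public static void main(String[] args) {
        // Invented warmed measurements: latency in ms per row count.
        double[] rows    = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        double[] latency = {12.1, 13.8, 16.2, 18.0, 20.3, 22.1, 24.0, 25.9, 28.2, 30.1};

        int n = rows.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx  += rows[i];
            sy  += latency[i];
            sxx += rows[i] * rows[i];
            sxy += rows[i] * latency[i];
        }
        // Fit latency(rows) = m*rows + b; b is the projected zero-size latency.
        double m = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        double b = (sy - m * sx) / n;
        System.out.printf("per-row cost = %.3f ms, zero-size latency = %.3f ms%n", m, b);
    }
}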

This is a tangential answer at best - I wish I just knew a setting to give
you.

(1) Latency Performance of SOAP Implementations:
http://citeseer.ist.psu.edu/viewdoc/similar?doi=10.1.1.21.8556&type=ab


On Sun, Sep 29, 2013 at 9:22 PM, William Bell billnb...@gmail.com wrote:

 How do I set TCP_NODELAY on the http sockets for Jetty in SOLR 4?


Re: Nagle's Algorithm

2013-09-29 Thread Michael Sokolov

I dunno, but this makes it look as if this may already be taken care of:

http://jira.codehaus.org/browse/JETTY-1196

On 9/29/2013 9:22 PM, William Bell wrote:

How do I set TCP_NODELAY on the http sockets for Jetty in SOLR 4?

Is there an option in jetty.xml ?









Re: Maximum solr processes per machine

2013-09-29 Thread Shawn Heisey
On 9/29/2013 7:21 AM, adfel70 wrote:
 My total index size is around 5TB, and I want a replication factor of 3.
 I'm thinking of running multiple Solr processes per machine.

Running multiple Solr instances per machine is a really bad idea.  One
Solr instance can run many indexes, and there will be far less memory
overhead if you're not running multiple servlet containers.  Everything can
also run on the same TCP port - no need to figure out a different port per
instance.  Configuration and deployment are also less complicated.

When you have multiple Solr instances per machine, the SolrCloud
collections API has a tendency to place some or all of the replicas for
each shard on the same machine, which means that it won't be fault
tolerant.  With one instance per machine, you can be absolutely sure
that created collections will have all replicas for each shard on
different machines.
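
For illustration, a collection could be created from SolrJ along these lines
(a sketch only: the URL, collection name and counts are hypothetical, and
maxShardsPerNode=1 is the knob that keeps replicas on separate machines):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class CreateCollection {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("action", "CREATE");
        params.set("name", "bigindex");
        params.set("numShards", 5);
        params.set("replicationFactor", 3);
        // At most one replica per node, so two replicas of the same shard
        // can never end up on the same machine.
        params.set("maxShardsPerNode", 1);

        QueryRequest request = new QueryRequest(params);
        request.setPath("/admin/collections");
        server.request(request);
        server.shutdown();
    }
}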

I will echo the advice you've been given about using SSD.  You'll need
much less OS disk cache memory with SSD.

Thanks,
Shawn



Re: Nagle's Algorithm

2013-09-29 Thread Shawn Heisey
On 9/29/2013 7:22 PM, William Bell wrote:
 How do I set TCP_NODELAY on the http sockets for Jetty in SOLR 4?

The client usually makes that decision, not the server.  This parameter
is turned on by default for recent HttpClient versions, the library used
by SolrJ.  Even the JETTY issue uncovered by Michael Sokolov refers to a
client connection.
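
If you want to make the setting explicit on the SolrJ side anyway, a sketch
with the HttpClient 4.x API of that era (the URL is a placeholder):

import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.params.HttpConnectionParams;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ClientNoDelay {
    public static void main(String[] args) {
        DefaultHttpClient httpClient = new DefaultHttpClient();
        // TCP_NODELAY is already true by default in recent HttpClient
        // versions; this just spells it out.
        HttpConnectionParams.setTcpNoDelay(httpClient.getParams(), true);

        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1", httpClient);
        // ... use server as usual ...
        server.shutdown();
    }
}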

Thanks,
Shawn