Re: Improving Solr performance

2011-01-14 Thread Gora Mohanty
On Fri, Jan 14, 2011 at 1:56 PM, supersoft  wrote:
>
> The tests are performed with a self-made program.
[...]

May I ask what language the program is written in? The reason for
asking is to rule out an issue with the threading model, e.g., if
you were using Python.

Would it be possible for you to run Apache Bench (ab) against
your Solr setup? Something like:

# For 10 simultaneous connections
ab -n 100 -c 10 http://localhost:8983/solr/select/?q=my_query1

# For 50 simultaneous connections
ab -n 500 -c 50 http://localhost:8983/solr/select/?q=my_query2

Please pay attention to the meaning of the -n parameter: it is the
total number of requests across all connections, not the number per
connection. See "man ab" for details on usage, or, for example,
http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
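If a custom client is preferred over ab, a minimal sketch of a concurrent benchmark is below (Python standard library only; the Solr URL in the comment is a placeholder, and the sleep call is just a stand-in workload):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(request_fn, total_requests, concurrency):
    """Fire total_requests calls of request_fn from a pool of `concurrency`
    workers; return per-request latencies (ms) and wall-clock throughput."""
    def timed_call(_):
        t0 = time.perf_counter()
        request_fn()
        return (time.perf_counter() - t0) * 1000.0

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall = time.perf_counter() - start
    return latencies, total_requests / wall

# Stand-in workload; against Solr you would use something like
# urllib.request.urlopen("http://localhost:8983/solr/select/?q=my_query1")
latencies, qps = benchmark(lambda: time.sleep(0.01),
                           total_requests=100, concurrency=10)
print(len(latencies))  # → 100
```

Note that, like ab's -n, total_requests is the total across all workers, not per worker.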

> In the last post, I wrote the results of the 100 threads example ordered
> by the response date. The results ordered by the creation date are:
[...]

OK, the numbers make more sense now.

As someone else has pointed out, your throughput does increase
with more simultaneous queries, and there are better ways to do
the measurement. Nevertheless, your results are very much at odds
with what we see, and I would like to understand the issue.

Regards,
Gora


Re: Improving Solr performance

2011-01-14 Thread Toke Eskildsen
On Thu, 2011-01-13 at 17:40 +0100, supersoft wrote:
> Although most of the queries are cache hits, the performance is still
> dependent on the number of simultaneous queries:
> 
> 1 simultaneous query: 3437 ms (cache fails)

Average response time: 3437 ms
Throughput: 0.29 queries/sec

> 2 simultaneous queries: 594, 954 ms
Average response time: 774 ms
Throughput: 2.58 queries/sec

> 10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500,
> 2938, 3000 ms
Average response time: 2030 ms
Throughput: 4.93 queries/sec

> 50 simultaneous queries: 1203, 1453, [...]
Average response time: 15478 ms
Throughput: 3.23 queries/sec

> 100 simultaneous queries: 1297, 1531, 1969, [...]
Average response time: 16285 ms
Throughput: 6.14 queries/sec

> Is this an expected situation?

Your numbers for 50 queries are strangely low, but the trend throughout
your tests indicates that your tests for 1, 2, 10, 50 and 100 threads do
not perform the same number of searches.

In order to compare the numbers, you need to let each test perform the
same number of searches and to start each test from exactly the same
warmup state. That means restarting Solr and flushing the disk cache,
which might require rebooting depending on your setup. It is also
recommended that you perform 5-10 searches before you start measuring
anything, as the first searches are not representative of general
performance.

Going with the numbers as they are, performance actually increases for
each thread you add: Look at throughput, not response time. This is
clearly bogus, but easily explained by the cache.
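Toke's figures can be reproduced from the raw latency lists. A small sketch, computing throughput as N divided by the average latency (which is what the figures above appear to use; the true wall-clock throughput of a simultaneous batch would rather be N divided by the slowest response):

```python
def stats(latencies_ms):
    """Average response time (ms) and throughput (queries/sec),
    with throughput computed as N / average latency."""
    n = len(latencies_ms)
    avg = sum(latencies_ms) / n
    return round(avg), round(n / (avg / 1000.0), 2)

print(stats([3437]))  # → (3437, 0.29)
print(stats([1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500, 2938, 3000]))
# → (2030, 4.93)
```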



Re: Improving Solr performance

2011-01-14 Thread supersoft

The tests are performed with a self-made program. The arguments are the
number of threads and the path to a file which contains available queries
(in the last test, only one). When each thread is created, it gets the
current date (in milliseconds), and when it gets the response from the
query, the thread logs the diff with that initial date.

In the last post, I wrote the results of the 100 threads example ordered
by the response date. The results ordered by the creation date are:

100 simultaneous queries: 9265, 11922, 12375, 4109, 4890, 7093, 21875, 8547,
13562, 13219, 1531, 11875, 21281, 31985, 11703, 7391, 32031, 22172, 21469,
13875, 1969, 11406, 8172, 9609, 16953, 13828, 17282, 22141, 16625, 2203,
24985, 2375, 25188, 2891, 5047, 6422, 20860, 7594, 23125, 32281, 32016,
5312, 23125, 11484, 10344, 11500, 18172, 3937, 11547, 13500, 28297, 20594,
24641, 7063, 24797, 12922, 1297, 8984, 20625, 13407, 23203, 32016, 15922,
21875, 8750, 12875, 23203, 26453, 26016, 11797, 31782, 24672, 21625, 7672,
18985, 14672, 22157, 26485, 23328, 9907, 5563, 24625, 14078, 4703, 25844,
12328, 11484, 6437, 25937, 26437, 18484, 13719, 16328, 28687, 23141, 14016,
26437, 13187, 25031, 31969
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2254121.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Improving Solr performance

2011-01-13 Thread Gora Mohanty
On Thu, Jan 13, 2011 at 10:10 PM, supersoft  wrote:
>
> On the one hand, I found really interesting those comments about the
> reasons for sharding. Documentation agrees with you about why to split an
> index into several shards (problems with big sizes), but I don't find any
> explanation about the drawbacks, such as an Access Control List. I guess
> there should be some, and they can be critical in this design. Any example?
[...]

Can I ask what might be a stupid question? How are you measuring
the numbers below, and what do they mean?

As your hit ratio is close to 1 (i.e., everything after the first query is
coming from the cache), these numbers seem a little strange. Are
these really the times for each of the N simultaneous queries? They
seem to be monotonically increasing (though with a couple of
strange exceptions), which leads me to suspect that they are some
kind of cumulative times; e.g., by this interpretation, for the case of
the 10 simultaneous queries, the first one takes 1047 ms, the second
266 ms, the third 125 ms, and so on.
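That cumulative-time reading is easy to check mechanically: if the logged values were cumulative, their successive differences would be the individual query times. A quick sketch:

```python
def successive_diffs(times_ms):
    """Differences between consecutive logged times; if the log is
    cumulative, these are the individual per-query times."""
    return [b - a for a, b in zip(times_ms, times_ms[1:])]

logged = [1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500, 2938, 3000]
print(successive_diffs(logged))
# → [266, 125, 359, 125, 172, 156, 250, 438, 62]
```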

We have run performance tests with pg_bench on an index of size
40GB on a single Solr server with about 6GB of RAM allocated
to Solr, and see what I would think of as expected behaviour, i.e.,
for every fresh query term, the first query takes the longest, and
the time for subsequent queries with the same term goes down
dramatically, as the result is coming out of the cache. This is at
odds with what you describe here, so I have to go back and check
that we did not miss something important.

> 1 simultaneous query: 3437 ms (cache fails)
>
> 2 simultaneous queries: 594, 954 ms
>
> 10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500,
> 2938, 3000 ms
>
> 50 simultaneous queries: 1203, 1453, 1453, 1437, 1625, 1953, 5688, 12938,
> 14953, 16281, 15984, 16453, 15812, 16469, 16563, 16844, 17703, 16843, 17359,
> 16828, 18235, 18219, 18172, 18203, 17672, 17344, 17453, 18484, 18157, 18531,
> 18297, 18359, 18063, 18516, 18125, 17516, 18562, 18016, 18187, 18610, 18703,
> 18672, 17829, 18344, 18797, 18781, 18265, 18875, 18250, 18812
>
> 100 simultaneous queries: 1297, 1531, 1969, 2203, 2375, 2891, 3937, 4109,
> 4703, 4890, 5047, 5312, 5563, 6422, 6437, 7063, 7093, 7391, 7594, 7672,
> 8172, 8547, 8750, 8984, 9265, 9609, 9907, 10344, 11406, 11484, 11484, 11500,
> 11547, 11703, 11797, 11875, 11922, 12328, 12375, 12875, 12922, 13187, 13219,
> 13407, 13500, 13562, 13719, 13828, 13875, 14016, 14078, 14672, 15922, 16328,
> 16625, 16953, 17282, 18172, 18484, 18985, 20594, 20625, 20860, 21281, 21469,
> 21625, 21875, 21875, 22141, 22157, 22172, 23125, 23125, 23141, 23203, 23203,
> 23328, 24625, 24641, 24672, 24797, 24985, 25031, 25188, 25844, 25937, 26016,
> 26437, 26453, 26437, 26485, 28297, 28687, 31782, 31985, 31969, 32016, 32031,
> 32016, 32281 ms
[...]

Regards,
Gora


Re: Improving Solr performance

2011-01-13 Thread supersoft

On the one hand, I found really interesting those comments about the
reasons for sharding. Documentation agrees with you about why to split an
index into several shards (problems with big sizes), but I don't find any
explanation about the drawbacks, such as an Access Control List. I guess
there should be some, and they can be critical in this design. Any example?

On the other hand, there are the performance problems. I have configured
big caches and I launch a test of simultaneous requests (with the same
query) without committing during the test. The caches are initially empty
and after the test:

name                  queryResultCache
stats:
lookups               1129
hits                  1120
hitratio              0.99
inserts               16
evictions             0
size                  9
warmupTime            0
cumulative_lookups    1129
cumulative_hits       1120
cumulative_hitratio   0.99
cumulative_inserts    16
cumulative_evictions  0

name                  documentCache
stats:
lookups               6750
hits                  6440
hitratio              0.95
inserts               310
evictions             0
size                  310
warmupTime            0
cumulative_lookups    6750
cumulative_hits       6440
cumulative_hitratio   0.95
cumulative_inserts    310
cumulative_evictions  0

Although most of the queries are cache hits, the performance is still
dependent on the number of simultaneous queries:

1 simultaneous query: 3437 ms (cache fails)

2 simultaneous queries: 594, 954 ms

10 simultaneous queries: 1047, 1313, 1438, 1797, 1922, 2094, 2250, 2500,
2938, 3000 ms

50 simultaneous queries: 1203, 1453, 1453, 1437, 1625, 1953, 5688, 12938,
14953, 16281, 15984, 16453, 15812, 16469, 16563, 16844, 17703, 16843, 17359,
16828, 18235, 18219, 18172, 18203, 17672, 17344, 17453, 18484, 18157, 18531,
18297, 18359, 18063, 18516, 18125, 17516, 18562, 18016, 18187, 18610, 18703,
18672, 17829, 18344, 18797, 18781, 18265, 18875, 18250, 18812

100 simultaneous queries: 1297, 1531, 1969, 2203, 2375, 2891, 3937, 4109,
4703, 4890, 5047, 5312, 5563, 6422, 6437, 7063, 7093, 7391, 7594, 7672,
8172, 8547, 8750, 8984, 9265, 9609, 9907, 10344, 11406, 11484, 11484, 11500,
11547, 11703, 11797, 11875, 11922, 12328, 12375, 12875, 12922, 13187, 13219,
13407, 13500, 13562, 13719, 13828, 13875, 14016, 14078, 14672, 15922, 16328,
16625, 16953, 17282, 18172, 18484, 18985, 20594, 20625, 20860, 21281, 21469,
21625, 21875, 21875, 22141, 22157, 22172, 23125, 23125, 23141, 23203, 23203,
23328, 24625, 24641, 24672, 24797, 24985, 25031, 25188, 25844, 25937, 26016,
26437, 26453, 26437, 26485, 28297, 28687, 31782, 31985, 31969, 32016, 32031,
32016, 32281 ms

Is this an expected situation? Is there any technique for not being so
dependent on the number of simultaneous queries? (Due to economic reasons,
replication on more servers is not an option.)

Thanks in advance (and also thanks for previous comments)
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2249108.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Improving Solr performance

2011-01-10 Thread Markus Jelsma
Any sources to cite for this statement? And are you talking about RAM 
allocated to the JVM or available for OS cache?

> Not sure if this was mentioned yet, but if you are doing slave/master
> replication you'll need 2x the RAM at replication time. Just something to
> keep in mind.
> 
> -mike
> 
> On Mon, Jan 10, 2011 at 5:01 PM, Toke Eskildsen 
wrote:
> > On Mon, 2011-01-10 at 21:43 +0100, Paul wrote:
> > > > I see from your other messages that these indexes all live on the
> > > > same machine.
> > > > You're almost certainly I/O bound, because you don't have enough
> > > > memory for the OS to cache your index files.  With 100GB of total
> > > > index size, you'll get best results with between 64GB and 128GB of
> > > > total RAM.
> > > 
> > > Is that a general rule of thumb? That it is best to have about the
> > > same amount of RAM as the size of your index?
> > 
> > It does not seem like there is a clear current consensus on hardware to
> > handle IO problems. I am firmly in the SSD camp, but as you can see from
> > the current thread, other people recommend RAM and/or extra machines.
> > 
> > I can say that our tests with RAM and spinning disks showed us that a
> > lot of RAM certainly helps a lot, but also that it takes a considerable
> > amount of time to warm the index before the performance is satisfactory.
> > It might be helped with disk cache tricks, such as copying the whole
> > index to /dev/null before opening it in Solr.
> > 
> > > So, with a 5GB index, I should have between 4GB and 8GB of RAM
> > > dedicated to solr?
> > 
> > Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~=
> > index size recommendation.


Re: Improving Solr performance

2011-01-10 Thread Jonathan Rochkind
And I don't think I've seen anyone suggest a separate core just for
Access Control Lists. I'm not sure what that would get you.


Perhaps a separate store that isn't Solr at all, in some cases.

On 1/10/2011 5:36 PM, Jonathan Rochkind wrote:

Access Control Lists


Re: Improving Solr performance

2011-01-10 Thread Jonathan Rochkind

On 1/10/2011 5:03 PM, Dennis Gearon wrote:

What I seem to see suggested here is to use different cores for the things you
suggested:
   different types of documents
   Access Control Lists

I wonder how sharding would work in that scenario?


Sharding has nothing to do with that scenario at all. Different cores
are essentially _entirely separate_.  While it can be convenient to use
different cores like this, it means you don't get ANY searches that
'join' over multiple 'kinds' of data in different cores.


Solr is not great at handling heterogeneous data like that.  Putting it
in separate cores is one solution, although then they are entirely
separate.  If that works, great.  Another solution is putting them in
the same index, but using mostly different fields, and perhaps having a 
'type' field shared amongst all of your 'kinds' of data, and then always 
querying with an 'fq' for the right 'kind'.  Or if the fields they use 
are entirely different, you don't even need the fq, since a query on a 
certain field will only match a certain 'kind' of document.
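As a sketch of the 'fq' approach described above (the 'type' field, the example query, and the localhost URL are illustrative, not taken from the thread):

```python
from urllib.parse import urlencode

# Hypothetical 'type' field marking the kind of document; restrict a
# search to one kind with a filter query instead of a separate shard.
base = "http://localhost:8983/solr/select"
params = {"q": "title:lucene", "fq": "type:book", "rows": 10}
url = base + "?" + urlencode(params)
print(url)
# → http://localhost:8983/solr/select?q=title%3Alucene&fq=type%3Abook&rows=10
```

Because the fq clause is a separate cache entry (the filterCache), repeated searches within one 'kind' reuse the cached filter.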


Solr is not great at handling complex queries over data with
heterogeneous schemata. Solr wants you to flatten all your data into
one single set of documents.


Sharding is a way of splitting up a single index (multiple cores are
_multiple indexes_) amongst several hosts for performance reasons,
mostly when you have a very large index.  That is it.  The end.  If you
have multiple cores, that's the same as having multiple Solr indexes
(which may or may not happen to be on the same machine). Any one or more
of those cores could be sharded if you want. This is a separate issue.






Re: Improving Solr performance

2011-01-10 Thread mike anderson
Not sure if this was mentioned yet, but if you are doing slave/master
replication you'll need 2x the RAM at replication time. Just something to
keep in mind.

-mike

On Mon, Jan 10, 2011 at 5:01 PM, Toke Eskildsen wrote:

> On Mon, 2011-01-10 at 21:43 +0100, Paul wrote:
> > > I see from your other messages that these indexes all live on the
> > > same machine.
> > > You're almost certainly I/O bound, because you don't have enough
> > > memory for the OS to cache your index files.  With 100GB of total
> > > index size, you'll get best results with between 64GB and 128GB of
> > > total RAM.
> >
> > Is that a general rule of thumb? That it is best to have about the
> > same amount of RAM as the size of your index?
>
> It does not seem like there is a clear current consensus on hardware to
> handle IO problems. I am firmly in the SSD camp, but as you can see from
> the current thread, other people recommend RAM and/or extra machines.
>
> I can say that our tests with RAM and spinning disks showed us that a
> lot of RAM certainly helps a lot, but also that it takes a considerable
> amount of time to warm the index before the performance is satisfactory.
> It might be helped with disk cache tricks, such as copying the whole
> index to /dev/null before opening it in Solr.
>
> > So, with a 5GB index, I should have between 4GB and 8GB of RAM
> > dedicated to solr?
>
> Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~=
> index size recommendation.
>
>


Re: Improving Solr performance

2011-01-10 Thread Dennis Gearon
What I seem to see suggested here is to use different cores for the things you 
suggested:
  different types of documents
  Access Control Lists

I wonder how sharding would work in that scenario?

Me, I plan on:
  For security:
Using a permissions field
  For different schemas:
Dynamic fields with enough premade fields to handle it.


The one thing I don't think my approach does well with is statistics.
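In schema.xml terms, the plan above might look roughly like this (field and type names are illustrative, not from the thread):

```xml
<!-- A permissions field for filtering by ACL at query time -->
<field name="permissions" type="string" indexed="true" stored="false"
       multiValued="true"/>

<!-- Pre-made dynamic fields to absorb differing schemas -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_i" type="int"    indexed="true" stored="true"/>
```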

 Dennis Gearon



- Original Message 
From: Jonathan Rochkind 
To: "solr-user@lucene.apache.org" 
Cc: supersoft 
Sent: Mon, January 10, 2011 1:08:00 PM
Subject: Re: Improving Solr performance

I see a lot of people using shards to hold "different types of documents", and 
it almost always seems to be a bad solution. Shards are intended for 
distributing a large index over multiple hosts -- that's it.  Not for some kind 
of federated search over multiple schemas, not for access control.

Why not put everything in the same index, without shards, and just use an 'fq'
limit in order to limit to the specific documents you'd like to search over in a
given search? I think that would achieve your goal a lot more simply than
shards -- then you use sharding only if and when your index grows to be so large
you'd like to distribute it over multiple hosts, and when you do so you choose a
shard key that will have more or less equal distribution across shards.

Using shards for access control or schema management just leads to headaches.

[Apparently Solr could use some highlighted documentation on what shards are 
really for, as it seems to be a very common issue on this list, someone trying 
to use them for something else and then inevitably finding problems with that 
approach.]

Jonathan

On 1/7/2011 6:48 AM, supersoft wrote:
> The reason of this distribution is the kind of the documents. In spite of
> having the same schema structure (and solr conf), a document belongs to 1 of
> 5 different kinds.
> 
> Each kind corresponds to a concrete shard and due to this, the implemented
> client tool avoids searching in all the shards when the user selects just
> one or a few of the kinds. The tool runs a multisharded query of the proper
> shards. I guess this is a right approach but correct me if I am wrong.
> 
> The real problem of this architecture is the correlation between concurrent
> users and response time:
> 1 query: n seconds
> 2 queries: 2*n second each query
> 3 queries: 3*n seconds each query
> and so...
> 
> This is being a real headache because 1 single query has an acceptable
> response time, but when many users are accessing the server the
> performance degrades badly.



Re: Improving Solr performance

2011-01-10 Thread Toke Eskildsen
On Mon, 2011-01-10 at 21:43 +0100, Paul wrote:
> > I see from your other messages that these indexes all live on the same
> > machine.
> > You're almost certainly I/O bound, because you don't have enough memory
> > for the OS to cache your index files.  With 100GB of total index size,
> > you'll get best results with between 64GB and 128GB of total RAM.
> 
> Is that a general rule of thumb? That it is best to have about the
> same amount of RAM as the size of your index?

It does not seem like there is a clear current consensus on hardware to
handle IO problems. I am firmly in the SSD camp, but as you can see from
the current thread, other people recommend RAM and/or extra machines.

I can say that our tests with RAM and spinning disks showed us that a
lot of RAM certainly helps a lot, but also that it takes a considerable
amount of time to warm the index before the performance is satisfactory.
It might be helped with disk cache tricks, such as copying the whole
index to /dev/null before opening it in Solr.

> So, with a 5GB index, I should have between 4GB and 8GB of RAM
> dedicated to solr?

Not as -Xmx, but free for disk cache, yes. If you follow the RAM ~=
index size recommendation.
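The warming trick Toke mentions (reading the whole index once so it lands in the OS page cache) can be sketched as follows; the index path in the comment is a placeholder:

```python
import os

def warm_disk_cache(index_dir, bufsize=1 << 20):
    """Read every file under index_dir once, discarding the data, so the
    OS page cache holds it -- the `cat index/* > /dev/null` trick.
    Returns the number of bytes read."""
    total = 0
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                while chunk := f.read(bufsize):
                    total += len(chunk)
    return total

# warm_disk_cache("/var/solr/data/index")  # placeholder path
```

This only helps if the index actually fits in free RAM; otherwise the cache is evicted again before queries benefit.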



Re: Improving Solr performance

2011-01-10 Thread Jonathan Rochkind
I see a lot of people using shards to hold "different types of 
documents", and it almost always seems to be a bad solution. Shards are 
intended for distributing a large index over multiple hosts -- that's 
it.  Not for some kind of federated search over multiple schemas, not 
for access control.


Why not put everything in the same index, without shards, and just use
an 'fq' limit in order to limit to the specific documents you'd like to
search over in a given search? I think that would achieve your goal a
lot more simply than shards -- then you use sharding only if and when
your index grows to be so large you'd like to distribute it over
multiple hosts, and when you do so you choose a shard key that will have
more or less equal distribution across shards.


Using shards for access control or schema management just leads to 
headaches.


[Apparently Solr could use some highlighted documentation on what shards 
are really for, as it seems to be a very common issue on this list, 
someone trying to use them for something else and then inevitably 
finding problems with that approach.]


Jonathan

On 1/7/2011 6:48 AM, supersoft wrote:

The reason of this distribution is the kind of the documents. In spite of
having the same schema structure (and solr conf), a document belongs to 1 of
5 different kinds.

Each kind corresponds to a concrete shard and due to this, the implemented
client tool avoids searching in all the shards when the user selects just
one or a few of the kinds. The tool runs a multisharded query of the proper
shards. I guess this is a right approach but correct me if I am wrong.

The real problem of this architecture is the correlation between concurrent
users and response time:
1 query: n seconds
2 queries: 2*n second each query
3 queries: 3*n seconds each query
and so...

This is being a real headache because 1 single query has an acceptable
response time, but when many users are accessing the server the
performance degrades badly.


Re: Improving Solr performance

2011-01-10 Thread Markus Jelsma
No, it also depends on the queries you execute (sorting is a big consumer) and 
the number of concurrent users.

> Is that a general rule of thumb? That it is best to have about the
> same amount of RAM as the size of your index?
> 
> So, with a 5GB index, I should have between 4GB and 8GB of RAM
> dedicated to solr?


Re: Improving Solr performance

2011-01-10 Thread Paul
> I see from your other messages that these indexes all live on the same 
> machine.
> You're almost certainly I/O bound, because you don't have enough memory for 
> the
> OS to cache your index files.  With 100GB of total index size, you'll get best
> results with between 64GB and 128GB of total RAM.

Is that a general rule of thumb? That it is best to have about the
same amount of RAM as the size of your index?

So, with a 5GB index, I should have between 4GB and 8GB of RAM
dedicated to solr?


Re: Improving Solr performance

2011-01-09 Thread Dennis Gearon
These are definitely server grade machines.

There aren't any desktops I know of (that aren't made for HD video
editing/rendering) that ever need that kind of memory.

 Dennis Gearon





- Original Message 
From: Shawn Heisey 
To: solr-user@lucene.apache.org
Sent: Sun, January 9, 2011 4:34:08 PM
Subject: Re: Improving Solr performance

On 1/7/2011 2:57 AM, supersoft wrote:
> I have deployed a 5-sharded infrastructure where: shard1 has 3124422 docs
> shard2 has 920414 docs shard3 has 602772 docs shard4 has 2083492 docs shard5
> has 11915639 docs Indexes total size: 100GB
> 
> The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420 and I
> run the server using Jetty (from Solr example download) with: java -Xmx3024M
> -Dsolr.solr.home=multicore -jar start.jar
> 
> The response time for a query is around 2-3 seconds. Nevertheless, if I
> execute several queries at the same time the performance goes down
> immediately: 1 simultaneous query: 2516ms 2 simultaneous queries: 4250, 4469
> ms 3 simultaneous queries: 5781, 6219, 6219 ms 4 simultaneous queries: 6484,
> 7203, 7719, 7781 ms...

I see from your other messages that these indexes all live on the same machine.
You're almost certainly I/O bound, because you don't have enough memory for the 
OS to cache your index files.  With 100GB of total index size, you'll get best 
results with between 64GB and 128GB of total RAM.  Alternatively, you could use 
SSD to store the indexes instead of spinning hard drives, or put each shard on 
its own physical machine with RAM appropriately sized for the index.  For 
shard5 
on its own machine, at 64GB index size, you might be able to get away with 
32GB, 
but ideally you'd want 48-64GB.

Can you do anything to reduce the index size?  Perhaps you are storing fields 
that you don't need to be returned in the search results.  Ideally, you should 
only include enough information to fully populate a search results grid, and 
retrieve detail information for an individual document from the original data 
source instead of Solr.

Thanks,
Shawn


Re: Improving Solr performance

2011-01-09 Thread Shawn Heisey

On 1/7/2011 2:57 AM, supersoft wrote:

I have deployed a 5-sharded infrastructure where: shard1 has 3124422 docs
shard2 has 920414 docs shard3 has 602772 docs shard4 has 2083492 docs shard5
has 11915639 docs Indexes total size: 100GB

The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420 and I
run the server using Jetty (from Solr example download) with: java -Xmx3024M
-Dsolr.solr.home=multicore -jar start.jar

The response time for a query is around 2-3 seconds. Nevertheless, if I
execute several queries at the same time the performance goes down
immediately: 1 simultaneous query: 2516ms 2 simultaneous queries: 4250, 4469
ms 3 simultaneous queries: 5781, 6219, 6219 ms 4 simultaneous queries: 6484,
7203, 7719, 7781 ms...


I see from your other messages that these indexes all live on the same 
machine.  You're almost certainly I/O bound, because you don't have 
enough memory for the OS to cache your index files.  With 100GB of total 
index size, you'll get best results with between 64GB and 128GB of total 
RAM.  Alternatively, you could use SSD to store the indexes instead of 
spinning hard drives, or put each shard on its own physical machine with 
RAM appropriately sized for the index.  For shard5 on its own machine, 
at 64GB index size, you might be able to get away with 32GB, but ideally 
you'd want 48-64GB.


Can you do anything to reduce the index size?  Perhaps you are storing 
fields that you don't need to be returned in the search results.  
Ideally, you should only include enough information to fully populate a 
search results grid, and retrieve detail information for an individual 
document from the original data source instead of Solr.


Thanks,
Shawn



Re: Improving Solr performance

2011-01-08 Thread Lance Norskog
Are you using the Solr caches? These are configured in solrconfig.xml
in each core. Make sure each kind of cache is configured with a size of
at least 50-100 entries.
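The caches Lance refers to live in the <query> section of each core's solrconfig.xml; a sketch with illustrative sizes (not tuned recommendations for any particular index):

```xml
<query>
  <filterCache      class="solr.LRUCache" size="512" initialSize="512"
                    autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512"
                    autowarmCount="128"/>
  <documentCache    class="solr.LRUCache" size="512" initialSize="512"/>
</query>
```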

Also, use filter queries: a filter query describes a subset of
documents. When you do a bunch of queries against the same filter
query, the second and subsequent queries are much much faster.

All of this is explained in the Enterprise Solr 1.4 book.

On Fri, Jan 7, 2011 at 7:20 AM, mike anderson  wrote:
> Making sure the index can fit in memory (you don't have to allocate that
> much to Solr, just make sure it's available to the OS so it can cache it --
> otherwise you are paging the hard drive, which is why you are probably IO
> bound) has been the key to our performance. We recently opted to use less
> RAM and store the indices on SSDs, we're still evaluating this approach but
> so far it seems to be comparable, so I agree with Toke! (We have 18 shards
> and over 100GB of index).
>
> On Fri, Jan 7, 2011 at 10:07 AM, Toke Eskildsen 
> wrote:
>
>> On Fri, 2011-01-07 at 10:57 +0100, supersoft wrote:
>>
>> [5 shards, 100GB, ~20M documents]
>>
>> ...
>>
>> [Low performance for concurrent searches]
>>
>> > Using JConsole for monitoring the server java process I checked that
>> Heap
>> > Memory and the CPU Usages don't reach the upper limits so the server
>> > shouldn't perform as overloaded.
>>
>> If memory and CPU is okay, the culprit is I/O.
>>
>> Solid state drives have more than proven their worth for random access
>> I/O, which is used a lot when searching with Solr/Lucene. SSDs are
>> plug-in replacements for hard drives and they virtually eliminate I/O
>> performance bottlenecks when searching. This also means shortened warm-up
>> requirements and less need for disk caching. Expanding RAM capacity
>> does not scale well and requires extensive warmup. Adding more machines
>> is expensive and often requires architectural changes. With the current
>> prices for SSDs, I consider them the generic first suggestion for
>> improving search performance.
>>
>> Extra spinning disks improve the query throughput in general and speed
>> up single queries when the shards are searched in parallel. They do not
>> help much for a single sequential search of shards, as the seek time
>> for a single I/O request is the same regardless of the number of drives.
>> If your current response time for a single user is satisfactory, adding
>> drives is a viable solution for you. I'll still recommend the SSD option
>> though, as it will also lower the response time for a single query.
>>
>> Regards,
>> Toke Eskildsen
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Improving Solr performance

2011-01-07 Thread mike anderson
Making sure the index can fit in memory (you don't have to allocate that
much to Solr, just make sure it's available to the OS so it can cache it --
otherwise you are paging the hard drive, which is why you are probably IO
bound) has been the key to our performance. We recently opted to use less
RAM and store the indices on SSDs, we're still evaluating this approach but
so far it seems to be comparable, so I agree with Toke! (We have 18 shards
and over 100GB of index).

On Fri, Jan 7, 2011 at 10:07 AM, Toke Eskildsen wrote:

> On Fri, 2011-01-07 at 10:57 +0100, supersoft wrote:
>
> [5 shards, 100GB, ~20M documents]
>
> ...
>
> [Low performance for concurrent searches]
>
> > > Using JConsole for monitoring the server java process I checked that
> Heap
> > Memory and the CPU Usages don't reach the upper limits so the server
> > shouldn't perform as overloaded.
>
> If memory and CPU is okay, the culprit is I/O.
>
> Solid state drives have more than proven their worth for random access
> I/O, which is used a lot when searching with Solr/Lucene. SSDs are
> plug-in replacements for hard drives and they virtually eliminate I/O
> performance bottlenecks when searching. This also means shortened warm-up
> requirements and less need for disk caching. Expanding RAM capacity
> does not scale well and requires extensive warmup. Adding more machines
> is expensive and often requires architectural changes. With the current
> prices for SSDs, I consider them the generic first suggestion for
> improving search performance.
>
> Extra spinning disks improve the query throughput in general and speed
> up single queries when the shards are searched in parallel. They do not
> help much for a single sequential search of shards, as the seek time
> for a single I/O request is the same regardless of the number of drives.
> If your current response time for a single user is satisfactory, adding
> drives is a viable solution for you. I'll still recommend the SSD option
> though, as it will also lower the response time for a single query.
>
> Regards,
> Toke Eskildsen
>
>


Re: Improving Solr performance

2011-01-07 Thread Toke Eskildsen
On Fri, 2011-01-07 at 10:57 +0100, supersoft wrote:

[5 shards, 100GB, ~20M documents]

...

[Low performance for concurrent searches]

> Using JConsole for monitoring the server java process I checked that Heap
> Memory and the CPU Usages don't reach the upper limits so the server
> shouldn't perform as overloaded.

If memory and CPU are okay, the culprit is I/O.

Solid state drives have more than proven their worth for random access
I/O, which is used heavily when searching with Solr/Lucene. SSDs are
plug-in replacements for hard drives, and they virtually eliminate I/O
performance bottlenecks when searching. This also means shorter warm-up
requirements and less need for disk caching. Expanding RAM capacity
does not scale well and requires extensive warm-up. Adding more machines
is expensive and often requires architectural changes. At current
SSD prices, I consider them the generic first suggestion for
improving search performance.

Extra spinning disks improve query throughput in general and speed
up single queries when the shards are searched in parallel. They do not
help much for a single sequential search of the shards, as the seek time
for a single I/O request is the same regardless of the number of drives.
If your current response time for a single user is satisfactory, adding
drives is a viable solution for you. I would still recommend the SSD
option, though, as it will also lower the response time for a single query.

Regards,
Toke Eskildsen



Re: Improving Solr performance

2011-01-07 Thread François Schiettecatte
It sounds like your system is I/O bound, and I suspect (would even bet) that
all your index files are on the same disk drive. Also, you have only 8GB of RAM
for 100GB of index, so while your Solr instance will cache some things and the
balance will be used for caching file blocks, there really isn't enough memory
there for effective caching.

I would suggest you check your machine's performance with something like atop ( 
http://www.atoptool.nl/ ) to see where your bottlenecks are (check the disk 
I/O). As I said I think you are I/O bound, and if all your shards are on the 
same drive there will be I/O contention when running simultaneous searches.

Your solutions are (in rough ascending order of cost):

- make your indices smaller (reduce disk I/O)

- buy more drives and spread your indices across the drives (reduce contention).

- buy more RAM (increase caching).

- buy more machines (more throughput).

Good luck!

François


On Jan 7, 2011, at 4:57 AM, supersoft wrote:

> 
> I have deployed a 5-sharded infrastructure where:
> shard1 has 3124422 docs
> shard2 has 920414 docs
> shard3 has 602772 docs
> shard4 has 2083492 docs
> shard5 has 11915639 docs
> Indexes total size: 100GB
> 
> The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420, and I
> run the server using Jetty (from the Solr example download) with:
> java -Xmx3024M -Dsolr.solr.home=multicore -jar start.jar
> 
> The response time for a query is around 2-3 seconds. Nevertheless, if I
> execute several queries at the same time, the performance degrades
> immediately:
> 1 simultaneous query: 2516 ms
> 2 simultaneous queries: 4250, 4469 ms
> 3 simultaneous queries: 5781, 6219, 6219 ms
> 4 simultaneous queries: 6484, 7203, 7719, 7781 ms
> ...
> 
> Using JConsole to monitor the server's Java process, I checked that heap
> memory and CPU usage don't reach their upper limits, so the server
> shouldn't be overloaded. Can anyone suggest how I should tune the
> instance so that it is not so heavily dependent on the number of
> simultaneous queries?
> 
> Thanks in advance
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2210843.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Improving Solr performance

2011-01-07 Thread supersoft

The reason for this distribution is the kind of document: despite sharing the
same schema structure (and Solr configuration), a document belongs to 1 of 5
different kinds.

Each kind corresponds to a concrete shard, and because of this, the client
tool we implemented avoids searching all the shards when the user selects
just one or a few kinds: it runs a multi-sharded query against only the
relevant shards. I believe this is the right approach, but correct me if I
am wrong.

The real problem with this architecture is the correlation between concurrent
users and response time:
1 query: n seconds
2 queries: 2*n seconds each
3 queries: 3*n seconds each
and so on...

This is a real headache, because a single query has an acceptable response
time, but when many users access the server, performance degrades sharply.
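One way to check whether a shared resource is serializing the queries is to convert the reported timings into throughput: if throughput stays roughly flat as concurrency grows, the queries are queuing behind a single resource (such as one disk). A small calculation over the timings quoted earlier in the thread:

```python
def stats(times_ms):
    """Average latency and throughput for one batch of concurrent queries."""
    avg = sum(times_ms) / len(times_ms)
    # All queries start together, so the batch finishes with the slowest one.
    throughput = len(times_ms) / (max(times_ms) / 1000.0)
    return avg, throughput

# Timings reported in the original post (ms).
for batch in ([2516], [4250, 4469], [5781, 6219, 6219], [6484, 7203, 7719, 7781]):
    avg, qps = stats(batch)
    print(f"{len(batch)} concurrent: avg {avg:.0f} ms, {qps:.2f} queries/sec")
```

Here throughput creeps from roughly 0.4 to 0.5 queries/sec while latency grows almost linearly, which is consistent with an I/O-bound system rather than one that scales with concurrency.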
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211305.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Improving Solr performance

2011-01-07 Thread Grijesh.singh

Please open a new mail conversation for that.

-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211300.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Improving Solr performance

2011-01-07 Thread Hong-Thai Nguyen
Hi,

Always an interesting question! Could anyone propose a generic (and
approximate) equation:

Search_time = F(Nb_of_servers, RAM_size_per_server, CPU_of_servers,
Nb_of_shards, Nb_of_documents, Total_size_of_documents or
Average_size_of_a_document, Nb_requests_per_minute,
Nb_indexed_fields_in_index, ...) ?

Regards,

---
Hong-Thai
-----Original Message-----
From: Grijesh.singh [mailto:pintu.grij...@gmail.com]
Sent: Friday, January 7, 2011 12:29
To: solr-user@lucene.apache.org
Subject: Re: Improving Solr performance


Shards are used when the index size becomes huge and performance goes down;
shards are distributed indexes. But if you put all the shards on the same
machine as multiple cores, it will not help performance much.

Also, shards should be distributed so the indexes are nearly equal in size.
There is also not enough RAM to perform better: if your whole index can be
loaded into cache, it will give you better performance.

Your indexes are not equally distributed, so each shard has a different
response time. When working with shards, please keep in mind that the main
searcher sends the query to all shards, waits for a response from every
shard, merges all the responses into a single result, and returns it.

So if any shard takes more time to respond, your total response time will
be affected.

-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211228.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Improving Solr performance

2011-01-07 Thread Grijesh.singh

Shards are used when the index size becomes huge and performance goes down;
shards are distributed indexes. But if you put all the shards on the same
machine as multiple cores, it will not help performance much.

Also, shards should be distributed so the indexes are nearly equal in size.
There is also not enough RAM to perform better: if your whole index can be
loaded into cache, it will give you better performance.

Your indexes are not equally distributed, so each shard has a different
response time. When working with shards, please keep in mind that the main
searcher sends the query to all shards, waits for a response from every
shard, merges all the responses into a single result, and returns it.

So if any shard takes more time to respond, your total response time will
be affected.
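The "slowest shard sets the total" behavior described above can be sketched as a toy model (the per-shard timings and merge overhead below are illustrative assumptions, not measurements from this thread):

```python
def distributed_response(shard_times_ms, merge_overhead_ms=10):
    """Coordinator waits for every shard, so the slowest one dominates."""
    return max(shard_times_ms) + merge_overhead_ms

# Five unevenly sized shards; the largest shard is assumed to be slowest.
shard_times = [400, 150, 100, 300, 2400]
print(distributed_response(shard_times))
```

Even if four shards answer quickly, the total response time is pinned to the slowest shard plus a small merge cost, which is why uneven shard sizes hurt distributed search.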

-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211228.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Improving Solr performance

2011-01-07 Thread supersoft

1 - Yes, all the shards are on the same machine
2 - The machine has 7.8 GB of RAM and I assign 3.4 GB to the Solr server
3 - The shard sizes (in GB) are 17, 5, 3, 11, and 64
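Those numbers make the caching shortfall concrete. A rough back-of-the-envelope sketch (the OS overhead figure is an assumption, not a measurement):

```python
ram_gb = 7.8
heap_gb = 3.4
index_gb = 17 + 5 + 3 + 11 + 64       # = 100 GB across the five shards

os_overhead_gb = 0.5                  # assumed, not measured
page_cache_gb = ram_gb - heap_gb - os_overhead_gb
coverage = page_cache_gb / index_gb
print(f"~{page_cache_gb:.1f} GB of page cache for {index_gb} GB of index "
      f"(~{coverage:.0%} coverage)")
```

With only a few GB of page cache for 100 GB of index, most random reads miss the cache and go to disk, which matches the I/O-bound behavior others in the thread suspect.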
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2211135.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Improving Solr performance

2011-01-07 Thread Grijesh.singh

Some questions:
1 - Are all shards on the same machine?
2 - What is your RAM size?
3 - What is the size of the index on each shard, in GB?


-
Grijesh
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210843p2210878.html
Sent from the Solr - User mailing list archive at Nabble.com.


Improving Solr performance

2011-01-07 Thread supersoft

I have deployed a 5-sharded infrastructure where:
shard1 has 3124422 docs
shard2 has 920414 docs
shard3 has 602772 docs
shard4 has 2083492 docs
shard5 has 11915639 docs
Indexes total size: 100GB

The OS is Linux x86_64 (Fedora release 8) with vMem equal to 7872420, and I
run the server using Jetty (from the Solr example download) with:
java -Xmx3024M -Dsolr.solr.home=multicore -jar start.jar

The response time for a query is around 2-3 seconds. Nevertheless, if I
execute several queries at the same time, the performance degrades
immediately:
1 simultaneous query: 2516 ms
2 simultaneous queries: 4250, 4469 ms
3 simultaneous queries: 5781, 6219, 6219 ms
4 simultaneous queries: 6484, 7203, 7719, 7781 ms
...

Using JConsole to monitor the server's Java process, I checked that heap
memory and CPU usage don't reach their upper limits, so the server
shouldn't be overloaded. Can anyone suggest how I should tune the instance
so that it is not so heavily dependent on the number of simultaneous
queries?

Thanks in advance
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-performance-tp2210842p2210842.html
Sent from the Solr - User mailing list archive at Nabble.com.

