Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
Hi,

I know that SolrCloud allows you to have multiple shards on different
machines (or a single machine). But it requires a ZooKeeper installation
for things like leader election, leader availability, etc.

While SolrCloud may be the ideal solution for my use case eventually, I'd
like to know if there's a way I can point my Solr instance to read Lucene
segments distributed across different disks attached to the same machine.

Thanks!

-Deepak


Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
@Greg - Are you suggesting RAID as a replacement for Solr, or making Solr
work with RAID? Could you elaborate more on the latter, if that's what you
meant?
We make use of Solr's advanced text-processing features, which would be hard
to replicate with RAID alone.


-Deepak



On Wed, Sep 11, 2013 at 12:11 PM, Greg Walters gwalt...@sherpaanalytics.com wrote:

 Why not use some form of RAID for your index store? You'd get the
 performance benefit of multiple disks without the complexity of managing
 them via Solr.

 Thanks,
 Greg



 -----Original Message-----
 From: Deepak Konidena [mailto:deepakk...@gmail.com]
 Sent: Wednesday, September 11, 2013 2:07 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Distributing lucene segments across multiple disks.

 Are you suggesting a multi-core setup, where all the cores share the same
 schema, and the cores lie on different disks?

 Basically, I'd like to know if I can distribute shards/segments on a
 single machine (with multiple disks) without the use of ZooKeeper.





 -Deepak



 On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote:

  I think you'll find it hard to distribute different segments between
  disks, as they are typically stored in the same directory.
 
  However, instantiating separate cores on different disks should be
  straightforward enough, and would give you a performance benefit.
 
  I've certainly heard of that done at Amazon, with a separate EBS
  volume per core giving some performance improvement.
 
  Upayavira
 
  On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:
   Hi,
  
   I know that SolrCloud allows you to have multiple shards on
   different machines (or a single machine). But it requires a
   ZooKeeper installation for things like leader election, leader
   availability, etc.

   While SolrCloud may be the ideal solution for my use case eventually,
   I'd like to know if there's a way I can point my Solr instance to
   read Lucene segments distributed across different disks attached to
   the same machine.
  
   Thanks!
  
   -Deepak
 



Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
Are you suggesting a multi-core setup, where all the cores share the same
schema, and the cores lie on different disks?

Basically, I'd like to know if I can distribute shards/segments on a single
machine (with multiple disks) without the use of ZooKeeper.





-Deepak



On Wed, Sep 11, 2013 at 11:55 AM, Upayavira u...@odoko.co.uk wrote:

 I think you'll find it hard to distribute different segments between
 disks, as they are typically stored in the same directory.

 However, instantiating separate cores on different disks should be
 straightforward enough, and would give you a performance benefit.

 I've certainly heard of that done at Amazon, with a separate EBS volume
 per core giving some performance improvement.

 Upayavira

 On Wed, Sep 11, 2013, at 07:35 PM, Deepak Konidena wrote:
  Hi,
 
  I know that SolrCloud allows you to have multiple shards on different
  machines (or a single machine). But it requires a ZooKeeper installation
  for things like leader election, leader availability, etc.

  While SolrCloud may be the ideal solution for my use case eventually, I'd
  like to know if there's a way I can point my Solr instance to read Lucene
  segments distributed across different disks attached to the same machine.
 
  Thanks!
 
  -Deepak
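
A minimal sketch of the multi-core setup Upayavira suggests above: separate
cores whose data directories live on different disks, created here through
the SolrJ CoreAdmin API against a plain (non-cloud) Solr 4.x node. The core
names, instance dirs, and mount points are hypothetical.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class MultiDiskCores {
        public static void main(String[] args) throws Exception {
            // Talk to the CoreAdmin handler of a plain (non-cloud) Solr node.
            SolrServer admin = new HttpSolrServer("http://localhost:8983/solr");

            // One core per physical disk: instanceDir holds conf/, while
            // dataDir points the index at a different mount (paths invented).
            String[][] cores = {
                {"core_disk1", "/opt/solr/core_disk1", "/mnt/disk1/core_disk1/data"},
                {"core_disk2", "/opt/solr/core_disk2", "/mnt/disk2/core_disk2/data"},
            };
            for (String[] c : cores) {
                CoreAdminRequest.Create create = new CoreAdminRequest.Create();
                create.setCoreName(c[0]);
                create.setInstanceDir(c[1]);
                create.setDataDir(c[2]);
                create.process(admin);
            }
        }
    }

Each core's index then sits on its own disk, spreading I/O without ZooKeeper
being involved.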



Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
I guess at this point in the discussion, I should probably give some more
background on why I am doing what I am doing. Having a single Solr shard
(multiple segments) on the same disk is posing severe performance problems
under load, in that calls to Solr cause a lot of connection timeouts. When
we looked at the ganglia stats for the Solr box, we saw that while memory,
CPU and network usage were quite normal, the I/O wait spiked. We are unsure
what caused the I/O wait and why there were no spikes in the CPU/memory
usage. Since the Solr box is a beefy box (multi-core setup, huge RAM, SSDs),
we'd like to distribute the segments to multiple locations (disks) and see
whether this improves performance under load.

@Greg - Thanks for clarifying that.  I just learnt that I can't set them up
using RAID, as some of the disks are SSDs and others are SATA (spinning)
disks.

@Shawn Heisey - Could you elaborate on the broker core and how it delegates
requests to other cores?


-Deepak



On Wed, Sep 11, 2013 at 1:10 PM, Shawn Heisey s...@elyograg.org wrote:

 On 9/11/2013 1:07 PM, Deepak Konidena wrote:

 Are you suggesting a multi-core setup, where all the cores share the same
 schema, and the cores lie on different disks?

 Basically, I'd like to know if I can distribute shards/segments on a
 single machine (with multiple disks) without the use of ZooKeeper.


 Sure, you can do it all manually.  At that point you would not be using
 SolrCloud at all, because the way to enable SolrCloud is to tell Solr where
 ZooKeeper lives.

 Without SolrCloud, there is no cluster automation at all.  There is no
 collection paradigm; you just have cores.  You have to send updates to
 the correct core; they will not be redirected for you.  Similarly, queries
 will not be load balanced automatically.  For Java clients, the
 CloudSolrServer object can work seamlessly when servers go down.  If you're
 not using SolrCloud, you can't use CloudSolrServer.

 You would be in charge of creating the shards parameter yourself.  The way
 that I do this on my index is that I have a broker core that has no index
 of its own, but its solrconfig.xml has the shards and shards.qt parameters
 in all the request handler definitions.  You can also include the parameter
 with the query.

 You would also have to handle redundancy yourself, either with replication
 or with independently updated indexes.  I use the latter method, because it
 offers a lot more flexibility than replication.

 As mentioned in another reply, setting up RAID with a lot of disks may be
 better than trying to split your index up on different filesystems that
 each reside on different disks.  I would recommend RAID10 for Solr, and it
 works best if it's hardware RAID and the controller has battery-backed (or
 NVRAM) cache.

 Thanks,
 Shawn
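
A minimal SolrJ 4.x sketch of the broker pattern Shawn describes: a query
sent through an empty "broker" core with an explicit shards parameter, so
the search is distributed without SolrCloud. The host names, core names, and
query string are hypothetical.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class BrokerQuery {
        public static void main(String[] args) throws Exception {
            // The broker core holds no index of its own; it fans out requests.
            HttpSolrServer broker =
                new HttpSolrServer("http://localhost:8983/solr/broker");

            SolrQuery q = new SolrQuery("text:segments");
            // Explicit list of shard cores; in Shawn's setup this lives in the
            // solrconfig.xml request handler defaults rather than on the request.
            q.set("shards",
                "localhost:8983/solr/core_disk1,localhost:8983/solr/core_disk2");
            // Handler the shards should use to answer the sub-requests.
            q.set("shards.qt", "/select");

            QueryResponse rsp = broker.query(q);
            System.out.println("numFound: " + rsp.getResults().getNumFound());
        }
    }

With the parameters baked into the broker's solrconfig.xml defaults, clients
can query the broker like any ordinary core.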




Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
@Greg - Thanks for the suggestion. Will pass it along to my folks.

@Shawn - That's the link I was looking for: the non-SolrCloud approach to
distributed search. Thanks for passing that along. Will give it a try.

As far as RAM usage goes, I believe we set the heap size to about 40% of
the RAM, and less than 10% is available for OS caching (since the replica
takes another 40%). Why does unallocated RAM help? How does it impact
performance under load?


-Deepak



On Wed, Sep 11, 2013 at 2:50 PM, Shawn Heisey s...@elyograg.org wrote:

 On 9/11/2013 2:57 PM, Deepak Konidena wrote:

 I guess at this point in the discussion, I should probably give some more
 background on why I am doing what I am doing. Having a single Solr shard
 (multiple segments) on the same disk is posing severe performance problems
 under load, in that calls to Solr cause a lot of connection timeouts. When
 we looked at the ganglia stats for the Solr box, we saw that while memory,
 CPU and network usage were quite normal, the I/O wait spiked. We are unsure
 what caused the I/O wait and why there were no spikes in the CPU/memory
 usage. Since the Solr box is a beefy box (multi-core setup, huge RAM, SSDs),
 we'd like to distribute the segments to multiple locations (disks) and see
 whether this improves performance under load.

 @Greg - Thanks for clarifying that.  I just learnt that I can't set them up
 using RAID, as some of the disks are SSDs and others are SATA (spinning)
 disks.

 @Shawn Heisey - Could you elaborate on the broker core and how it delegates
 requests to other cores?


 On the broker core - I have a core on my servers that has no index of its
 own.  In the /select handler (and others) I have placed a shards parameter,
 and many of them also have a shards.qt parameter.  The shards parameter is
 how a non-cloud distributed search is done.

 http://wiki.apache.org/solr/DistributedSearch

 Addressing your first paragraph: You say that you have lots of RAM ... but
 is there a lot of unallocated RAM that the OS can use for caching, or is it
 mostly allocated to processes, such as the Java heap for Solr?

 Depending on exactly how your indexes are composed, you need up to 100% of
 the total index size available as unallocated RAM.  With SSD, the
 requirement is less, but cannot be ignored.  I personally wouldn't go below
 about 25-50% even with SSD, and I'd plan on 50-100% for regular disks.

 There is some evidence to suggest that you only need unallocated RAM equal
 to 10% of your index size for caching with SSD, but that is only likely to
 work if you have a lot of stored (as opposed to indexed) data.  If most of
 your index is unstored, then more would be required.

 Thanks,
 Shawn
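
To make the sizing guidance concrete, a worked example using the figures
Deepak gives above (heap at roughly 40% of RAM, a replica taking another
40%) on a hypothetical 128 GB machine with a 100 GB index:

    \[
    128\,\mathrm{GB} - (0.4 + 0.4) \times 128\,\mathrm{GB} = 25.6\,\mathrm{GB\ unallocated},
    \qquad \frac{25.6\,\mathrm{GB}}{100\,\mathrm{GB}} \approx 26\%\ \text{of the index size}
    \]

That lands at the very bottom of the 25-50% range suggested for SSDs, before
the OS and other processes take their share (Deepak reports under 10% of RAM
actually free for caching), and well short of the 50-100% suggested for
spinning disks.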




Re: Distributing lucene segments across multiple disks.

2013-09-11 Thread Deepak Konidena
Very helpful link. Thanks for sharing that.


-Deepak



On Wed, Sep 11, 2013 at 4:34 PM, Shawn Heisey s...@elyograg.org wrote:

 On 9/11/2013 4:16 PM, Deepak Konidena wrote:

 As far as RAM usage goes, I believe we set the heap size to about 40% of
 the RAM, and less than 10% is available for OS caching (since the replica
 takes another 40%). Why does unallocated RAM help? How does it impact
 performance under load?


 Because once the data is in the OS disk cache, reading it becomes
 instantaneous; it doesn't need to go out to the disk.  Disks are glacial
 compared to RAM.  Even SSD has a far slower response time.  Any recent
 operating system does this automatically, including the one from Redmond
 that we all love to hate.

 http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

 Thanks,
 Shawn
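
The linked article concerns Lucene's MMapDirectory. As a minimal Lucene 4.x
sketch of the mechanism Shawn describes: on 64-bit JVMs, FSDirectory.open
typically returns an MMapDirectory, which maps the index files into virtual
memory so reads are served from the OS page cache. The index path below is
hypothetical.

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class OpenIndex {
        public static void main(String[] args) throws Exception {
            // On 64-bit platforms this typically returns an MMapDirectory.
            Directory dir =
                FSDirectory.open(new File("/mnt/disk1/core_disk1/data/index"));
            DirectoryReader reader = DirectoryReader.open(dir);
            System.out.println("segments: " + reader.leaves().size()
                    + ", docs: " + reader.numDocs());
            reader.close();
            dir.close();
        }
    }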




Order of fields in a search query.

2013-08-30 Thread Deepak Konidena
Does the order of fields matter in a Lucene query?

For instance,

q = A && B && C

Let's say A appears in a million documents, B in 1, C in 1000.

While the results would be identical irrespective of the order in which you
AND A, B and C, will the response times of the following queries differ in
any way?

C && B && A
A && B && C

Does Lucene/Solr pick the best query execution plan in terms of both space
and time for a given query?

-Deepak


Multiple _val_ inside a lucene query.

2013-08-12 Thread Deepak Konidena
One of my previous mails to the group helped me simulate short-circuit OR
behavior with the following (thanks to Yonik):

_val_:def(query(cond1,cond2,..))

where, if cond1 matches, the query returns without evaluating the
subsequent conditions.
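
A concrete single-attribute form, sketched in SolrJ, is shown below. This is
a hedged reconstruction from the description above, not a confirmed query:
the field names are invented, and the exact def()/query() nesting is an
assumption.

    import org.apache.solr.client.solrj.SolrQuery;

    public class ShortCircuitVal {
        public static void main(String[] args) {
            SolrQuery q = new SolrQuery();
            // Assumed shape: def() returns the first argument that yields a
            // value, so query($c2) is not consulted for docs matched by $c1.
            q.setQuery("_val_:\"def(query($c1),query($c2))\"");
            q.set("c1", "field1:value1");
            q.set("c2", "field2:value2");
            System.out.println(q);
        }
    }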

While it works for a single attribute, I am trying to extend it so I can
achieve the same behavior across multiple attributes. When I try multiple
_val_ clauses, the query returns an error.

How do I make a query with multiple _val_ clauses?

-Deepak


short-circuit OR operator in lucene/solr

2013-07-21 Thread Deepak Konidena
I understand that Lucene's AND (&&), OR (||) and NOT (!) operators are
shorthands for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why
one can't treat them as Boolean operators (adhering to Boolean algebra).

I have been trying to construct a simple OR expression, as follows

q = +(field1:value1 OR field2:value2)

expecting a match on either field1 or field2. But since OR merely marks a
clause as optional, for documents where both field1:value1 and field2:value2
match, the returned score reflects a match on both clauses.

How do I enforce short-circuiting in this context? In other words, how do I
implement short-circuiting as in Boolean algebra, where an expression A || B
|| C returns true if A is true, without even evaluating whether B or C could
be true?

-Deepak