Re: Setting up to index multiple datastores

Shawn Heisey Thu, 02 Mar 2017 17:15:08 -0800

On 3/2/2017 2:58 PM, Daniel Miller wrote:
> One of the many features of the Dovecot IMAP server is Solr support. 
> This obviously provides full-text-searching of stored mails - and it
> works great.  But...the focus of the Dovecot team and mailing list is
> Dovecot configuration.  I'm asking for some guidance on how I might
> optimize Solr.


I use Solr for work.  I use Dovecot for personal domains.  I have not
used them together.  I probably should -- my personal mailbox is many
gigabytes and would benefit from a boost in search performance.

> At the moment I have a (I think!) reasonably well-defined schema that
> seems to perform well.  In my particular use case, I have a single
> physical server running Linux with available VirtualBox virtual
> servers.  I am presently running Solr within one of the virtual
> servers, and I'm running SolrCloud even though I only have one core
> (it just seemed to work better).
>
> Now because I have a single collection/core/shard - all the mail users
> and all their mail folders are stored/indexed/searched by this single
> Solr instance.  I'm thinking that I'd like to split the indexing on at
> least a per-user fashion - possibly also on a per-mailbox fashion. 
> Dovecot does allow for variable substitution in the Solr URL - so I
> should be able to generate the necessary URL requests on the Dovecot
> side.  What I don't know is:
>
> 1.  Is it possible to split the "indexes" (I'm still learning Solr
> vocabulary) without creating separate "cores" (which to me means
> separate Java instances)?
> 2.  Can these separate "indexes" be created on-demand - or do they
> need to be explicitly created prior to use?

Here's a paragraph that hopefully clears up most confusion about Solr
terminology.  This is applicable to SolrCloud:

Collections are made up of one or more shards.  Shards are made up of
one or more replicas.  Each replica is a core.  One replica from each
shard is elected as the leader of that shard, and if there are multiple
replicas, the leader role can move between them in response to a change
in cluster state.
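
To make that hierarchy concrete, here is a rough sketch of it as plain
Python data classes.  This is only an illustration of the terminology,
not anything from Solr's own code or API: the class names, the
core_count() helper, and the core name "mail_shard1_replica1" are all
made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    core_name: str            # each replica is one core (hypothetical name)
    is_leader: bool = False


@dataclass
class Shard:
    name: str
    replicas: List[Replica] = field(default_factory=list)

    def leader(self) -> Replica:
        # One replica per shard holds the leader role at any given time.
        leaders = [r for r in self.replicas if r.is_leader]
        if len(leaders) != 1:
            raise ValueError(f"shard {self.name} needs exactly one leader")
        return leaders[0]


@dataclass
class Collection:
    name: str
    shards: List[Shard] = field(default_factory=list)

    def core_count(self) -> int:
        # Total cores = total replicas across all shards.
        return sum(len(s.replicas) for s in self.shards)


# The setup described in the question: one collection, one shard, one core.
mail = Collection(
    "mail",
    [Shard("shard1", [Replica("mail_shard1_replica1", is_leader=True)])],
)
print(mail.core_count(), mail.shards[0].leader().core_name)
```

With one shard and one replica, the collection, the shard, and the core
all refer to the same physical index, which is why the three terms blur
together in a small install.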

Further info: One Solr instance (JVM) can handle many cores, so separate
cores do not require separate Java instances.  SolrCloud allows multiple
Solr instances to coordinate with each other (via ZooKeeper) and form a
whole cluster.  Without SolrCloud, you have cores, but no collections
and no replicas.  Sharding is possible without SolrCloud, but is handled
mostly manually.  Replication is possible without SolrCloud, but works
very differently, and the master is a single point of failure because
switching master servers is not easy.  SolrCloud is a true cluster, with
no masters or slaves.

https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
https://cwiki.apache.org/confluence/display/solr/Index+Replication

SolrCloud also makes it VERY easy to create new collections (logical
indexes) if the desired index config is already in the ZooKeeper
database.  It can be done entirely with an HTTP request:

https://cwiki.apache.org/confluence/display/solr/Collections+API
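
As a rough sketch of what such a request might look like, the snippet
below builds a Collections API CREATE URL.  The host and port, the
collection name "mail_alice", and the config name "dovecot" are
assumptions for illustration; the config name must match a config set
that has already been uploaded to ZooKeeper.  The helper function is
hypothetical, not part of any Solr client.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, config_name,
                          num_shards=1, replication_factor=1):
    """Build a Collections API CREATE request URL (nothing is sent)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "collection.configName": config_name,  # config already in ZooKeeper
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Assumed values: a local Solr on the default port, one collection per user.
url = create_collection_url("http://localhost:8983/solr", "mail_alice", "dovecot")
print(url)

# To actually send it, something like this should work:
#   import urllib.request
#   with urllib.request.urlopen(url) as resp:
#       print(resp.read().decode("utf-8"))
```

Collections created this way have to exist before anything is indexed
into them, so whatever provisions a new user would need to issue the
request first.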

One thing to note:  SolrCloud begins to have performance issues when the
number of collections in the cloud reaches the low hundreds.  It's not
going to scale very well with a collection per user or per mailbox
unless there aren't very many users.  There are people looking into how
to scale better, but this hasn't really gone anywhere yet.  Here's one
issue about it, with a lot of very dense comments:

https://issues.apache.org/jira/browse/SOLR-7191

Thanks,
Shawn
