On 3/2/2017 2:58 PM, Daniel Miller wrote:
> One of the many features of the Dovecot IMAP server is Solr support.
> This obviously provides full-text-searching of stored mails - and it
> works great. But...the focus of the Dovecot team and mailing list is
> Dovecot configuration. I'm asking for some guidance on how I might
> optimize Solr.

I use Solr for work. I use Dovecot for personal domains. I have not
used them together. I probably should -- my personal mailbox is many
gigabytes and would benefit from a boost in search performance.

> At the moment I have a (I think!) reasonably well-defined schema that
> seems to perform well. In my particular use case, I have a single
> physical server running Linux with available VirtualBox virtual
> servers. I am presently running Solr within one of the virtual
> servers, and I'm running SolrCloud even though I only have one core
> (it just seemed to work better).
>
> Now because I have a single collection/core/shard - all the mail users
> and all their mail folders are stored/indexed/searched by this single
> Solr instance. I'm thinking that I'd like to split the indexing on at
> least a per-user fashion - possibly also on a per-mailbox fashion.
> Dovecot does allow for variable substitution in the Solr URL - so I
> should be able to generate the necessary URL requests on the Dovecot
> side. What I don't know is:
>
> 1. Is it possible to split the "indexes" (I'm still learning Solr
>    vocabulary) without creating separate "cores" (which to me means
>    separate Java instances)?
> 2. Can these separate "indexes" be created on-demand - or do they
>    need to be explicitly created prior to use?

Here's a paragraph that hopefully clears up most confusion about Solr
terminology. This is applicable to SolrCloud:

Collections are made up of one or more shards. Shards are made up of
one or more replicas. Each replica is a core. One replica from each
shard is elected as the leader of that shard, and if there are multiple
replicas, the leader role can move between them in response to a change
in cluster state.

Further info: One Solr instance (JVM) can handle many cores. SolrCloud
allows multiple Solr instances to coordinate with each other (via
ZooKeeper) and form a whole cluster. Without SolrCloud, you have cores,
but no collections and no replicas.

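If it helps to see the terminology as a data structure, here is a rough
Python sketch of the collection / shard / replica hierarchy described
above. This is only an illustration and not anything Solr itself
exposes: the class names, the collection name "mail", the core name, and
the node address are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Replica:
    core_name: str           # each replica is a core
    node: str                # the Solr instance (JVM) hosting this core
    is_leader: bool = False  # one replica per shard is elected leader


@dataclass
class Shard:
    name: str
    replicas: List[Replica] = field(default_factory=list)

    def leader(self) -> Optional[Replica]:
        # Return the replica currently marked as leader, if any.
        return next((r for r in self.replicas if r.is_leader), None)


@dataclass
class Collection:
    name: str
    shards: List[Shard] = field(default_factory=list)

    def core_count(self) -> int:
        # Every replica is a core, so count replicas across all shards.
        return sum(len(s.replicas) for s in self.shards)


# The "single collection/core/shard" setup from the quoted message
# (all names here are hypothetical):
mail = Collection(
    name="mail",
    shards=[
        Shard(
            name="shard1",
            replicas=[Replica("mail_shard1_replica1", "vm1:8983", is_leader=True)],
        )
    ],
)
print(mail.core_count(), mail.shards[0].leader().core_name)
```

A collection per user would mean one such Collection object (and at
least one core) for every mailbox owner, all hosted by the same JVM.
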
Sharding is possible without SolrCloud, but is handled mostly manually.
Replication is possible without SolrCloud, but works very differently,
and has a single point of failure due to the fact that switching master
servers isn't something that's done easily. SolrCloud is a true
cluster, no masters or slaves.

https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
https://cwiki.apache.org/confluence/display/solr/Index+Replication

SolrCloud also makes it VERY easy to create new collections (logical
indexes) if the desired index config is already in the zookeeper
database. It can be done entirely with an HTTP request:

https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding

One thing to note: SolrCloud begins to have performance issues when the
number of collections in the cloud reaches the low hundreds. It's not
going to scale very well with a collection per user or per mailbox
unless there aren't very many users. There are people looking into how
to scale better, but this hasn't really gone anywhere yet. Here's one
issue about it, with a lot of very dense comments:

https://issues.apache.org/jira/browse/SOLR-7191

Thanks,
Shawn
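
P.S. To make the "entirely with an HTTP request" remark concrete, here
is a minimal Python sketch that builds a Collections API CREATE URL for
a per-user collection. The base URL, the collection name "mail_alice",
and the config name "dovecot" are assumptions for illustration; the
config set would need to be in ZooKeeper already. The sketch only builds
the URL and does not send it.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, config_name,
                          num_shards=1, replication_factor=1):
    """Build a SolrCloud Collections API URL that creates a collection.

    base_url is the Solr root, e.g. "http://localhost:8983/solr"
    (an assumed address, adjust to your install).
    """
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        # Name of a config set already uploaded to ZooKeeper.
        "collection.configName": config_name,
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical per-user collection:
url = create_collection_url("http://localhost:8983/solr", "mail_alice", "dovecot")
print(url)
```

Fetching that URL with curl or a browser would issue the request.
Keep the scaling caveat in mind before generating one of these per user.
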