See inline:

On Thu, Apr 2, 2015 at 12:36 PM, Ben Hsu <ben....@criticalmedia.com> wrote:
> Hello
>
> I am playing with solr5 right now, to see if its cloud features can replace
> what we have with solr 3.6, and I have some questions, some newbie, and
> some not so newbie
>
> Background: the documents we are putting in solr have a date field. the
> majority of our searches are restricted to documents created within the
> last week, but searches do go back 60 days. documents older than 60 days
> are removed from the repo. we also want high availability in case a machine
> becomes unavailable
>
> our current method, using solr 3.6, is to split the data into 1 day chunks,
> within each day the data is split into several shards, and each shard has 2
> replicas. Our code generates the list of cores to be queried based on
> the time range in the query. Cores that fall off the 60 day range are
> deleted through solr's RESTful API.
>
> This all sounds a lot like what Solr Cloud provides, so I started looking
> at Solr Cloud's features.
>
> My newbie questions:
>
>  - it looks like the way to write a document is to pick a node (possibly
> using a LB), send it to that node, and let solr figure out which node that
> document is supposed to go to. is this the recommended way?

[EOE] That's totally fine. If you're using SolrJ, a better way is to use
CloudSolrClient, which sends the docs to the proper leader, thus saving
one hop.
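Here's a minimal SolrJ sketch of that (the ZooKeeper addresses and the
collection name are placeholders for your own setup):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexExample {
        public static void main(String[] args) throws Exception {
            // Point the client at ZooKeeper rather than any one Solr node;
            // it watches cluster state and routes docs to the current leaders.
            CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
            client.setDefaultCollection("gettingstarted");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc1");
            client.add(doc);   // sent straight to the right shard leader
            client.commit();
            client.close();
        }
    }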

>  - similarly, can I just randomly pick a core (using the demo example:
> http://localhost:7575/solr/#/gettingstarted_shard1_replica2/query ), query
> it, and let it scatter out the queries to the appropriate cores, and send
> me the results back? will it give me back results from all the shards?

[EOE] Yes. Actually, you don't even have to pick a core, just a collection.
The # is totally unneeded, it's just part of navigating around the UI. So this
should work:
http://localhost:7575/solr/gettingstarted/query?q=*:*
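Since most of your searches are time-restricted, a filter query goes
through the collection the same way (date_field is a stand-in for
whatever your date field is actually called):
http://localhost:7575/solr/gettingstarted/query?q=*:*&fq=date_field:[NOW/DAY-7DAYS TO NOW]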

>  - is there a recommended Python library?
[EOE] Unsure. If you do find one, check whether it has the equivalent of
CloudSolrClient support, as I expect that would take the most effort.

>
> My hopefully less newbie questions:
>  - does solr auto-detect when nodes become unavailable, and stop sending
> queries to them?

[EOE] Yes, that's what ZooKeeper is all about. As each Solr node comes up,
it registers itself as a listener for collection state changes. ZooKeeper
detects a node dying and notifies all the remaining nodes that nodeX is
out of commission, and they adjust accordingly.

>  - when the master node dies and the cluster elects a new master, what
> happens to writes?
[EOE] Stop thinking master/slave! It's "leaders" and "replicas" (although
I'm trying to use "leaders" and "followers"). When a leader dies, one of
the followers for that shard is elected as the new leader and writes
continue. The critical bit is that on an update, the raw document is
forwarded from the leader to all followers, so followers can come and go
without losing data. You simply cannot rely on a particular node that is
the leader remaining the leader. For instance, if you bring up your nodes
in a different order tomorrow, the leaders and followers won't be the same.
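If you want to see which replica is currently the leader for each shard,
the Collections API will tell you (same demo port as above):
http://localhost:7575/solr/admin/collections?action=CLUSTERSTATUS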


>  - what happens when a node is unavailable
[EOE] SolrCloud "does the right thing" and keeps on chugging. See the
comments about auto-detection above. The exception is that if _all_ the
nodes hosting a shard go down, you cannot add to the index, and queries
will fail unless you set shards.tolerant=true.
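For example, this returns whatever the surviving shards can provide
rather than an error:
http://localhost:7575/solr/gettingstarted/query?q=*:*&shards.tolerant=true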

>  - what is the procedure when a shard becomes too big for one machine, and
> needs to be split?
[EOE] There is the Collections API SPLITSHARD command you can use. This
means that you increase by powers of two, though; there's no such thing
as adding, say, one new shard to a 4-shard cluster.
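Splitting shard1 of the demo collection would look something like:
http://localhost:7575/solr/admin/collections?action=SPLITSHARD&collection=gettingstarted&shard=shard1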

You can also reindex from scratch.

You can also "overshard" when you initially create your collection and
host multiple
shards and/or replicas on a single machine, then physically move them when the
aggregate size exceeds your boundaries.

>  - what is the procedure when we lose a machine and the node needs replacing
[EOE] Use the Collections API DELETEREPLICA command to remove the
replicas on the dead node, then use ADDREPLICA to create new replicas on
the replacement machines.
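Something like the following (the replica and node names are illustrative;
get the real ones from CLUSTERSTATUS):
http://localhost:7575/solr/admin/collections?action=DELETEREPLICA&collection=gettingstarted&shard=shard1&replica=core_node3
http://localhost:7575/solr/admin/collections?action=ADDREPLICA&collection=gettingstarted&shard=shard1&node=newhost:8983_solr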

>  - how would we quickly bulk delete data within a date range?
[EOE]
...solr/update?commit=true&stream.body=<delete><query>date_field:[DATE1 TO DATE2]</query></delete>
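Or, equivalently, POST the delete against the demo collection (the date
range here is just an example):
curl 'http://localhost:7575/solr/gettingstarted/update?commit=true' -H 'Content-Type: text/xml' -d '<delete><query>date_field:[NOW-61DAYS TO NOW-60DAYS]</query></delete>'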

You can take explicit control of where your docs go by various routing
schemes. The default is to route based on a hash of the id field, but if
you choose, you can route all docs based on the value of a field
(_route_) or based on the first part of the unique key with the bang (!)
operator.
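For example, with the default compositeId router, ids like
2015-04-02!doc1 and 2015-04-02!doc2 (made-up ids) hash on the part
before the ! so both land on the same shard.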

Do note, though, that one of the consequences of putting all of a
day's data on a single shard (or subset of shards) is that you
concentrate all your searching on those machines, and the other ones
can be idle. At times you can get better throughput by just letting
the docs be distributed randomly. That's what I'd start with
anyway.....

Best,
Erick
