+1 for CloudSolrServer

CloudSolrServer also has built-in fault tolerance (i.e. if the shard
leader is not reachable, it sends the update to a replica) and much
better error reporting than ConcurrentUpdateSolrServer.  The only
downside is the lack of internal batching. As long as you are adding
documents in decent-sized batches (you can also use multiple threads to
add), you will get good indexing performance.
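
A minimal sketch of that pattern, assuming SolrJ 4.x (the ZooKeeper
ensemble string, collection name, document count, and batch size below
are made up for illustration):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
      public static void main(String[] args) throws Exception {
        // CloudSolrServer reads the cluster state from ZooKeeper and
        // routes each batch to the correct shard leader itself.
        CloudSolrServer server =
            new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-" + i);
          batch.add(doc);
          // send decent-sized batches rather than one doc per request
          if (batch.size() == 1000) {
            server.add(batch);
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          server.add(batch);
        }
        server.commit();
        server.shutdown();
      }
    }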

CP

On Thu, Oct 30, 2014 at 6:53 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Matt:
>
> You might want to look at SolrJ, in particular with the use of
> CloudSolrServer.
> The big benefit here is that it'll route the docs to the correct leader
> for each shard rather than relying on the nodes to communicate with
> each other.
>
> Here's a SolrJ example. NOTE: it uses ConcurrentUpdateSolrServer, which
> you should replace with CloudSolrServer. Other than making the c'tor
> work, that should be the only change you need as far as instantiating
> the right SolrServer.
>
> This one connects to a DB and also parses files with Tika, but you
> should be able to remove all that without too much trouble.
>
> https://lucidworks.com/blog/indexing-with-solrj/
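>
> For what it's worth, the c'tor swap itself is small; a rough sketch
> (the hostname, port, collection name, and ZooKeeper addresses below
> are placeholders, not anything from the blog post):
>
>     // before: ConcurrentUpdateSolrServer points at one node's URL and
>     // does its own internal batching (queue size 10, 4 threads)
>     SolrServer server = new ConcurrentUpdateSolrServer(
>         "http://node1:8983/solr/collection1", 10, 4);
>
>     // after: CloudSolrServer takes the ZooKeeper ensemble instead and
>     // finds the shard leaders from the cluster state
>     CloudSolrServer server =
>         new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
>     server.setDefaultCollection("collection1");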
>
> Best,
> Erick
>
> On Thu, Oct 30, 2014 at 10:08 AM, Matt Hilt <matt.h...@numerica.us> wrote:
> > Thanks for the info Daniel. I will go forth and make a better client.
> >
> >
> > On Oct 29, 2014, at 2:28 AM, Daniel Collins <danwcoll...@gmail.com>
> > wrote:
> >
> >> I kind of think this might be "working as designed", but I'll be happy
> >> to be corrected by others :)
> >>
> >> We had a similar issue, which we discovered by accident: we had 2 or 3
> >> collections spread across some machines, and we accidentally sent an
> >> indexing request to a node in the cloud that didn't have a replica of
> >> collection1 (but did have other collections). We saw an instant jump in
> >> indexing latency to 5s, which, given that the previous latencies had
> >> been ~20ms, was rather obvious!
> >>
> >> Querying seems to be fine with this kind of forwarding approach, but
> >> indexing would logically require ZK information (to find the right
> >> shard for the destination collection and the leader of that shard).
> >> So I'm wondering if a node in the cloud that has a replica of
> >> collection1 has that information cached, whereas a node in the (same)
> >> cloud that only has a collection2 replica has only collection2
> >> information cached and has to go to ZK for every "forwarding" request.
> >>
> >> I haven't checked the code recently, but that seems plausible to me.
> >> Would you really want all your collection2 nodes to be running ZK
> >> watches for all collection1 updates as well as their own collection2
> >> watches? That would clog them up processing updates that, in all
> >> honesty, they shouldn't have to deal with. Every node in the cloud
> >> would have to have a watch on everything else, which, if you have a
> >> lot of independent collections, would be an unnecessary burden on each
> >> of them.
> >>
> >> If you use SolrJ as a client, it will route requests to the correct
> >> node in the cloud (which is what we ended up using, through JNI, which
> >> was "interesting"), but if you are indexing over HTTP, that routing is
> >> something your application has to take care of.
> >>
> >> On 28 October 2014 19:29, Matt Hilt <matt.h...@numerica.us> wrote:
> >>
> >>> I have three equal machines, each running SolrCloud (4.8). I have
> >>> multiple collections that are replicated but not sharded. I also have
> >>> document generation processes running on these nodes, which involve
> >>> querying the collection ~5 times per document generated.
> >>>
> >>> Node 1 has a replica of collection A and is running document generation
> >>> code that pushes to the HTTP /update/json handler.
> >>> Node 2 is the leader of collection A.
> >>> Node 3 does not have a replica of collection A, but is running document
> >>> generation code for collection A.
> >>>
> >>> The issue I see is that node 1 can push documents into Solr 3-5 times
> >>> faster than node 3 when they both talk to the Solr instance on their
> >>> localhost. If either of them talks directly to the Solr instance on
> >>> node 2, the performance is excellent (on par with node 1). To me it
> >>> seems that the only difference between these cases is the query/put
> >>> request forwarding. Does this involve some slow ZooKeeper
> >>> communication that should be avoided? Any other insights?
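> >>>
> >>> (For concreteness, by "talks directly to the Solr instance on node 2"
> >>> I mean posting the JSON straight to node 2's /update/json handler; the
> >>> hostname and collection name here are made up:
> >>>
> >>>   curl 'http://node2:8983/solr/collectionA/update/json' \
> >>>     -H 'Content-Type: application/json' \
> >>>     -d '[{"id": "doc1"}]'
> >>>
> >>> versus pointing the same request at localhost on node 1 or node 3.)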
> >>>
> >>> Thanks
> >
>
