[ https://issues.apache.org/jira/browse/SOLR-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Anshum Gupta updated SOLR-5681:
-------------------------------

    Attachment: SOLR-5681-2.patch

Thanks for looking at that, Shalin. I've addressed everything but #6 and #7 (working on those right now). Everything else is either in the patch or explained below (or both).

bq. There are some unrelated changes in CollectionHandler.handleRequestStatus()

They are related. The request status flow earlier assumed that the same request would never be in both the workQueue and the completed/failed map. That changes with this patch: the workQueue entry is only removed by the parent thread, in a batched manner, while the entry in the completed/failed map is made by the task thread. We now need to check for the task in the completed/failed map before we look at the running map/workQueue; hence the change (see the first sketch below).

bq. The added synchronisation in CoreAdminHandler.addTask is required.

Created SOLR-6075 and put up a patch there. I was planning to commit it, but I'm having trouble logging in; I'll wait for my password to be reset or hope someone else commits it.

bq. DistributedQueue.peekTopN has the following code. It checks for topN.isEmpty but it should actually check for orderedChildren.isEmpty instead.

If orderedChildren isn't empty but topN is, we want to return null; hence the check.

bq. Also, is there any chance that headNode may be null?

The existing code does not assume that headNode can never be null, which is why I added the same check here too.

bq. Remove the e.printStackTrace() calls in DistributedQueue.getLastElementId()

Seems like you've been looking at an older patch. That method was renamed to getTailId() and the issues you mention have already been addressed.

bq. Instead of passing a shardHandler to OCP constructor, why not just pass a shardHandlerFactory?

That would require deeper changes to the mock etc. We could move to it later, but for now I think this makes sense.

bq. In OCP.cleanupWorkQueue, the synchronization on a ConcurrentHashMap is not required

It is required because this is a read-process-update issue. We iterate over the map's key set and clear the map at the end, while someone else might be adding to it; we don't want to clear a completed task that was never removed from the zk workQueue (see the second sketch below).

bq. What is the reason behind cleaning work queue twice and sleeping for 20ms in this code:

To maintain the concurrency limits and to clean up from zk after tasks complete. I've made it a little better by adding a waited boolean in that loop and only calling cleanUp the second time when waited is set.

bq. There are unrelated changes in OCP.prioritizeOverseerNodes

Merge issue, at least it seems like it.

bq. KeeperException.NodeExistsException thrown from markTaskAsRunning is ignored

This should never happen; it's just that DistributedMap.put throws that exception, so we need to catch it. I've added a log.error for it and a comment saying it should never happen.
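To make that lookup order concrete, here is a minimal sketch of the idea (the class, field, and method names below are simplified stand-ins, not the actual CollectionHandler/DistributedMap API): terminal states are checked before running/queued, because the task thread records completion immediately while the workQueue entry lingers until the parent thread's batched cleanup.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: simplified stand-ins for the completed/failed/running
// maps and the zk workQueue, not the actual Solr classes.
class RequestStatusSketch {
  final Set<String> completed = ConcurrentHashMap.newKeySet();
  final Set<String> failed    = ConcurrentHashMap.newKeySet();
  final Set<String> running   = ConcurrentHashMap.newKeySet();
  final Set<String> workQueue = ConcurrentHashMap.newKeySet();

  // The task thread adds to completed/failed as soon as it finishes, but the
  // parent thread only removes the workQueue entry later, in a batch. A finished
  // task can therefore still be present in the workQueue, so the terminal states
  // have to be checked first.
  String status(String requestId) {
    if (completed.contains(requestId)) return "completed";
    if (failed.contains(requestId))    return "failed";
    if (running.contains(requestId))   return "running";
    if (workQueue.contains(requestId)) return "submitted";
    return "notfound";
  }
}
{code}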
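And a minimal sketch of the read-process-update race behind the cleanupWorkQueue synchronization, again with illustrative names rather than the real OCP fields:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative names only. Even with a ConcurrentHashMap, iterate-then-clear is
// not atomic: a task completed between the loop and clear() would be dropped
// from the map without its entry ever being removed from the zk workQueue.
class WorkQueueCleanupSketch {
  final Map<String, byte[]> completedTasks = new ConcurrentHashMap<>();

  void cleanUpWorkQueue() {
    synchronized (completedTasks) {
      for (String taskId : completedTasks.keySet()) {
        removeFromZkWorkQueue(taskId); // hypothetical helper: delete the task node from zk
      }
      completedTasks.clear();          // safe inside the lock: nothing added after the loop is lost
    }
  }

  void removeFromZkWorkQueue(String taskId) { /* zk delete elided in this sketch */ }
}
{code}

For the guard to be complete, whichever thread records completed tasks into the map has to take the same lock (or otherwise happen-before the cleanup); the point is simply that ConcurrentHashMap alone doesn't make the iterate-then-clear pair atomic.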
> Make the OverseerCollectionProcessor multi-threaded
> ---------------------------------------------------
>
>                 Key: SOLR-5681
>                 URL: https://issues.apache.org/jira/browse/SOLR-5681
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Anshum Gupta
>            Assignee: Anshum Gupta
>        Attachments: SOLR-5681-2.patch, SOLR-5681-2.patch, SOLR-5681-2.patch,
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch,
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch,
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch,
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch,
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch
>
> Right now, the OverseerCollectionProcessor is single-threaded, i.e. submitting anything long-running blocks the processing of other, mutually exclusive tasks.
> When OCP tasks become optionally async (SOLR-5477), it'd be good to have truly non-blocking behavior by multi-threading the OCP itself.
> For example, a ShardSplit call on Collection1 would block the thread and thereby not process a create-collection task (which would stay queued in zk), even though the two tasks are mutually exclusive.
> Here are a few of the challenges:
> * Mutual exclusivity: Only let mutually exclusive tasks run in parallel. An easy way to handle that is to only let 1 task per collection run at a time (see the sketch below).
> * ZK Distributed Queue to feed tasks: The OCP consumes tasks from a queue. The task is only removed from the workQueue on completion, so that in case of a failure the new Overseer can re-consume the same task and retry. A queue is not the right data structure for looking ahead, i.e. getting the 2nd task from the queue while the 1st one is in process. Also, deleting tasks which are not at the head of a queue is not really an 'intuitive' thing.
> Proposed solutions for task management:
> * Task funnel and peekAfter(): The parent thread is responsible for getting the request and passing it to a new thread (or one from the pool). The parent uses peekAfter(last element) instead of peek(); peekAfter returns the task after the 'last element'. Maintain this request information and use it for deleting/cleaning up the workQueue.
> * Another (almost duplicate) queue: While offering tasks to the workQueue, also offer them to a new queue (call it volatileWorkQueue?). The difference is that as soon as a task from this queue is picked up for processing by a thread, it's removed from the queue. At the end, the cleanup is done from the workQueue.
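As a rough illustration of the "one task per collection at a time" idea from the description above (entirely hypothetical names, not code from any of the patches): the parent thread would only hand a task to a worker if its collection isn't already busy, and the worker releases it when done.

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the simplest mutual-exclusivity rule from the
// description: at most one task per collection at a time. Not actual OCP code.
class CollectionExclusivitySketch {
  private final Set<String> busyCollections = ConcurrentHashMap.newKeySet();

  /** Parent thread: returns false if a task for this collection is already running. */
  boolean tryAcquire(String collection) {
    return busyCollections.add(collection);
  }

  /** Worker thread: called once its task completes or fails. */
  void release(String collection) {
    busyCollections.remove(collection);
  }
}
{code}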