[ 
https://issues.apache.org/jira/browse/SOLR-5681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anshum Gupta updated SOLR-5681:
-------------------------------

    Attachment: SOLR-5681-2.patch

Thanks for looking at that, Shalin. I've addressed everything but #6 and #7 
(working on those right now). Everything else is either in the patch or 
explained below (or both).

bq. There are some unrelated changes in CollectionHandler.handleRequestStatus()
They are related. The request status flow earlier assumed that the same 
request would never be in both the workQueue and the completed/failed map; that 
changes with this patch. The workQueue entry is only removed by the parent 
thread, in batches, while the completed/failed map entry is made by the task 
thread as soon as it finishes. We therefore need to check the completed/failed 
maps before looking at the running map/workQueue, hence the change.
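To illustrate the ordering (a minimal, self-contained sketch; the set names 
stand in for the actual ZK-backed maps/queue and are not the real API):

{code:java}
import java.util.Set;

// Illustrative only: the sets stand in for the ZK-backed completed/failed/running
// maps and the workQueue. A finished task may linger in the workQueue until the
// parent thread's batched cleanup, so completed/failed must be consulted first.
class RequestStatusLookupSketch {
  static String status(String requestId, Set<String> completed, Set<String> failed,
                       Set<String> running, Set<String> workQueue) {
    if (completed.contains(requestId)) return "completed";
    if (failed.contains(requestId)) return "failed";
    if (running.contains(requestId) || workQueue.contains(requestId)) return "running";
    return "notfound";
  }
}
{code}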

bq. The added synchronisation in CoreAdminHandler.addTask is required.
Created SOLR-6075 and put up a patch there. I wanted to commit it but am 
having trouble logging in; I'll wait for my password to be reset or hope 
someone else commits it.

bq. DistributedQueue.peekTopN has the following code. It checks for 
topN.isEmpty but it should actually check for orderedChildren.isEmpty instead.
If orderedChildren isn’t empty but topN is, we want to return null, hence the 
check on topN.
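Roughly, the shape of the check is as below (a simplified sketch, not the 
actual DistributedQueue code; the exclude set is an assumption about why topN 
can end up empty):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.SortedSet;

// Sketch only: even when orderedChildren is non-empty, every child may be
// skipped (e.g. excluded), leaving topN empty -- and then we return null.
class PeekTopNSketch {
  static List<String> peekTopN(SortedSet<String> orderedChildren,
                               Set<String> excludeSet, int n) {
    List<String> topN = new ArrayList<>();
    for (String child : orderedChildren) {
      if (topN.size() >= n) break;
      if (!excludeSet.contains(child)) topN.add(child);
    }
    return topN.isEmpty() ? null : topN; // check topN, not orderedChildren
  }
}
{code}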

bq. Also, is there any chance that headNode may be null?
I looked at the existing code and it does not assume that headNode can never 
be null, so I added the same check here.

bq. Remove the e.printStackTrace() calls in DistributedQueue.getLastElementId()
Seems like you’ve been looking at an older patch. This method was renamed to 
getTailId() and the issues you mention have already been addressed.

bq. Instead of passing a shardHandler to OCP constructor, why not just pass a 
shardHandlerFactory?
That would require deeper changes to the mocks etc. We could move to that 
later, but for now I think this makes sense.

bq. In OCP.cleanupWorkQueue, the synchronization on a ConcurrentHashMap is not 
required
It’s required because it’s a read-process-update sequence: we iterate over the 
map’s key set and clear the map at the end, while another thread might be 
adding to it. Without the synchronization we could clear a completed task that 
was never removed from the zk workQueue.
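In other words, the synchronized block protects a compound operation of this 
shape (illustrative names, not the actual OCP code; the task threads are 
assumed to hold the same lock when they add):

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-ins: completedTasks for the in-memory map of finished
// tasks, zkWorkQueue for the ZooKeeper-backed work queue.
class WorkQueueCleanupSketch {
  private final Map<String, String> completedTasks = new ConcurrentHashMap<>();
  private final Set<String> zkWorkQueue;

  WorkQueueCleanupSketch(Set<String> zkWorkQueue) {
    this.zkWorkQueue = zkWorkQueue;
  }

  // Task threads add under the same lock, so a put cannot slip in between the
  // loop and clear() below.
  void markCompleted(String taskId) {
    synchronized (completedTasks) {
      completedTasks.put(taskId, taskId);
    }
  }

  // Read-process-update: remove each completed task from the work queue, then
  // clear the map. Without the lock, a task completed between the loop and
  // clear() would be dropped from the map without ever leaving the queue.
  void cleanUpWorkQueue() {
    synchronized (completedTasks) {
      for (String taskId : completedTasks.keySet()) {
        zkWorkQueue.remove(taskId);
      }
      completedTasks.clear();
    }
  }
}
{code}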

bq. What is the reason behind cleaning work queue twice and sleeping for 20ms 
in this code:
It maintains the concurrency limit and cleans up zk after tasks complete. I’ve 
made it a little better by adding a waited boolean in that loop and only 
calling the cleanup a second time when waited is set.
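A simplified sketch of the loop shape being described (names like 
MAX_PARALLEL_THREADS and cleanUpWorkQueue are stand-ins, and the real loop 
also fetches and dispatches tasks):

{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: enforce the concurrency limit, and only clean up a second time
// if we actually waited (i.e. tasks may have completed in the meantime).
class DispatchLoopSketch {
  static final int MAX_PARALLEL_THREADS = 10;
  final Set<String> runningTasks = ConcurrentHashMap.newKeySet();
  volatile boolean isClosed = false;

  void run() throws InterruptedException {
    while (!isClosed) {
      cleanUpWorkQueue();               // purge tasks that finished since the last pass
      boolean waited = false;
      while (runningTasks.size() >= MAX_PARALLEL_THREADS) {
        Thread.sleep(20);               // wait for a slot to free up
        waited = true;
      }
      if (waited) {
        cleanUpWorkQueue();             // tasks finished while we waited; purge them too
      }
      // ... peek the next batch from the zk work queue and hand tasks to worker threads ...
    }
  }

  void cleanUpWorkQueue() {
    // remove completed tasks from the zk work queue (see the sketch above)
  }
}
{code}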

bq. There are unrelated changes in OCP.prioritizeOverseerNodes
Looks like a merge issue.

bq. KeeperException.NodeExistsException thrown from markTaskAsRunning is ignored
This should never happen; it’s just that DistributedMap.put declares that 
exception, so we need to catch it. I’ve added a log.error for it and a comment 
noting that it should never happen.
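Roughly this (a sketch, with DistributedMapLike standing in for the real 
DistributedMap and markTaskAsRunning simplified):

{code:java}
import org.apache.zookeeper.KeeperException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: the put can throw NodeExistsException, so it has to be caught
// even though a task id should only ever be marked as running once.
class MarkTaskAsRunningSketch {
  private static final Logger log = LoggerFactory.getLogger(MarkTaskAsRunningSketch.class);

  interface DistributedMapLike {
    void put(String id, byte[] data) throws KeeperException, InterruptedException;
  }

  void markTaskAsRunning(DistributedMapLike runningMap, String taskId, byte[] data)
      throws KeeperException, InterruptedException {
    try {
      runningMap.put(taskId, data);
    } catch (KeeperException.NodeExistsException e) {
      // Should never happen: a task is marked as running exactly once.
      log.error("Task " + taskId + " is already present in the running map", e);
    }
  }
}
{code}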

> Make the OverseerCollectionProcessor multi-threaded
> ---------------------------------------------------
>
>                 Key: SOLR-5681
>                 URL: https://issues.apache.org/jira/browse/SOLR-5681
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Anshum Gupta
>            Assignee: Anshum Gupta
>         Attachments: SOLR-5681-2.patch, SOLR-5681-2.patch, SOLR-5681-2.patch, 
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, 
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, 
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, 
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch, 
> SOLR-5681.patch, SOLR-5681.patch, SOLR-5681.patch
>
>
> Right now, the OverseerCollectionProcessor is single threaded, i.e. submitting 
> anything long running would block the processing of other mutually 
> exclusive tasks.
> When OCP tasks become optionally async (SOLR-5477), it'd be good to have 
> truly non-blocking behavior by multi-threading the OCP itself.
> For example, a ShardSplit call on Collection1 would block the thread and 
> thereby block processing of a create collection task (which would stay queued 
> in zk) even though the two tasks are mutually exclusive.
> Here are a few of the challenges:
> * Mutual exclusivity: Only let mutually exclusive tasks run in parallel. An 
> easy way to handle that is to only let 1 task per collection run at a time.
> * ZK Distributed Queue to feed tasks: The OCP consumes tasks from a queue. 
> The task from the workQueue is only removed on completion so that in case of 
> a failure, the new Overseer can re-consume the same task and retry. A queue 
> is not the right data structure in the first place to look ahead i.e. get the 
> 2nd task from the queue when the 1st one is in process. Also, deleting tasks 
> which are not at the head of a queue is not really an 'intuitive' thing.
> Proposed solutions for task management:
> * Task funnel and peekAfter(): The parent thread is responsible for getting 
> and passing the request to a new thread (or one from the pool). The parent 
> method uses a peekAfter(last element) instead of a peek(). The peekAfter 
> returns the task after the 'last element'. Maintain this request information 
> and use it for deleting/cleaning up the workQueue.
> * Another (almost duplicate) queue: While offering tasks to workQueue, also 
> offer them to a new queue (call it volatileWorkQueue?). The difference is, as 
> soon as a task from this is picked up for processing by the thread, it's 
> removed from the queue. At the end, the cleanup is done from the workQueue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
