[ https://issues.apache.org/jira/browse/SOLR-10745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030023#comment-16030023 ]

Shalin Shekhar Mangar commented on SOLR-10745:
----------------------------------------------

Thanks Andrzej. I made a pass through the code on the jira/SOLR-10745 branch.

A few comments:
# Should we write events to nodeLost / nodeAdded even when there are no 
corresponding (active) triggers? -- it seems wasteful and, worse, the data will 
keep growing with no one to delete it
# I agree with your choice of persistent znodes for nodeLost events, and 
likewise with ephemeral znodes for nodeAdded: if the node goes away, the znode 
does too, and we obviously never want to fire a nodeAdded trigger for a node 
that no longer exists. I can't think of any cons to using ephemeral here except 
that it is inconsistent with how we handle nodeLost events.
# While processing these events, i.e. before adding them to the tracking map, 
we must check the actual state of the node at that time, e.g. if a node came 
back, we don't want to add it to the NodeLostTrigger's tracking map
# Perhaps add some error handling which ensures that we still mark the node as 
live even if the multi op fails? I don't think it can fail, but I want to 
ensure that Solr fails to start if we cannot create the live node.
# TriggerIntegrationTest can use SolrZkClient.clean(), which does the same 
thing as deleteChildrenRecursively
# nodeNameVsTimeAdded is now a ConcurrentHashMap, but it is never accessed 
concurrently?
# I'd prefer that retrieving marker paths be done once during startup in 
ScheduledTrigger.run(). Doing it each time the trigger runs is redundant.
# minor nit - in testNodesEventRegistration, the code comment says "we want 
both triggers to fire" but the latch is initialized with 3.
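
For point 3, the restore step can be sketched as a simple filter of the found 
markers against the current live nodes. This is only an illustrative sketch 
with hypothetical names (NodeLostMarkerScan, nodesToTrack), not the actual 
NodeLostTrigger code:

```java
import java.util.*;

public class NodeLostMarkerScan {
    /**
     * Given marker names found under /autoscaling/nodeLost and the set of
     * currently live nodes, return only the nodes that should be added to
     * the trigger's tracking map: a node that came back must be skipped
     * (and its stale marker can then be deleted by the caller).
     */
    public static Set<String> nodesToTrack(Collection<String> lostMarkers,
                                           Set<String> liveNodes) {
        Set<String> result = new TreeSet<>(lostMarkers);
        result.removeAll(liveNodes); // a node that rejoined is not lost
        return result;
    }
}
```

The same liveness check applies symmetrically to nodeAdded markers: a marker 
whose node is no longer live should not produce a nodeAdded event.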

> Reliably create nodeAdded / nodeLost events
> -------------------------------------------
>
>                 Key: SOLR-10745
>                 URL: https://issues.apache.org/jira/browse/SOLR-10745
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: SolrCloud
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>              Labels: autoscaling
>             Fix For: master (7.0)
>
>
> When the Overseer node goes down, a {{nodeLost}} event may not have been 
> generated, depending on the current phase of trigger execution. Similarly, 
> when a new node is added and the Overseer goes down before the trigger saves 
> a checkpoint (and before it produces a {{nodeAdded}} event), the event may 
> never be generated.
> The proposed solution would be to modify how nodeLost / nodeAdded information 
> is recorded in the cluster:
> * new nodes should do a ZK multi-write both to {{/live_nodes}} and 
> additionally to a predefined location, eg. 
> {{/autoscaling/nodeAdded/<nodeName>}}. On the first execution of Trigger.run 
> in the new Overseer leader, it would check this location for new znodes, 
> which would indicate that a node has been added, and then generate a new 
> event and remove the znode that corresponds to the event.
> * node lost events should also be recorded to a predefined location eg. 
> {{/autoscaling/nodeLost/<nodeName>}}. Writing to this znode would be 
> attempted simultaneously by a few randomly selected nodes to make sure at 
> least one of them succeeds. On the first run of the new trigger instance (in 
> new Overseer leader) event generation would follow the sequence described 
> above.
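
The proposed startup/recovery sequence for nodeAdded can be modeled with a toy 
in-memory store. This is a sketch only (hypothetical class and method names); 
the real implementation would use ZooKeeper's multi() against the paths 
proposed above:

```java
import java.util.*;

/**
 * Toy in-memory model of the proposed protocol, not Solr code: node startup
 * atomically creates both the live node and a nodeAdded marker; the first
 * Trigger.run() on a new Overseer turns leftover markers into events and
 * deletes them. Path names mirror the ones proposed in the issue.
 */
public class NodeAddedProtocolSketch {
    private final Set<String> znodes = new TreeSet<>();

    /** Atomic multi-create, like ZooKeeper's multi(): all paths or none. */
    public void multiCreate(String... paths) {
        for (String p : paths)
            if (znodes.contains(p)) throw new IllegalStateException("exists: " + p);
        znodes.addAll(Arrays.asList(paths));
    }

    /** A joining node writes its live node and marker in one multi op. */
    public void nodeJoins(String nodeName) {
        multiCreate("/live_nodes/" + nodeName,
                    "/autoscaling/nodeAdded/" + nodeName);
    }

    /** First run on a new Overseer: consume markers, returning event names. */
    public List<String> fireNodeAddedEvents() {
        List<String> events = new ArrayList<>();
        for (Iterator<String> it = znodes.iterator(); it.hasNext(); ) {
            String p = it.next();
            if (p.startsWith("/autoscaling/nodeAdded/")) {
                events.add(p.substring("/autoscaling/nodeAdded/".length()));
                it.remove(); // marker is removed once the event is generated
            }
        }
        return events;
    }
}
```

Because the marker and the live node are created in one atomic multi op, a 
crash cannot leave a live node without a marker, which is what makes the 
first-run scan reliable.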



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
