Hi, I'm currently prototyping enhancements to the Sling job handling to allow for better scaling in clustered / distributed environments. The current implementation makes a lot of assumptions and relies on JCR locks. These assumptions, in combination with the problems of JCR locks, usually lead to a setup where just a single instance in the cluster processes jobs. The goal is to be able to run jobs distributed across a cluster, but also to be able to process jobs only on specific instances (e.g. to offload some heavy jobs to dedicated machines).
Though this is still in an early phase, I would like to run some of the potential user-facing changes by this list:

a) Jobs containing queue configurations

The configuration of job handling is usually done through queue configurations. These queues are assigned to one or more job topics and have different characteristics, like whether jobs can be processed in parallel, how often a job should be retried, the delay between retries, etc. The queues are configured globally through OSGi ConfigAdmin and are therefore the same on all cluster nodes. When we started with job handling, we didn't have this configuration, so each and every job contained all of this information as properties of the job itself - which is clearly a maintenance nightmare, but can also lead to funny situations where two jobs with the same topic contain different configurations (e.g. one allowing parallel processing while the other does not). With the introduction of queue configurations, we already reduced the per-job configuration possibilities, and in some cases these are already ignored. For the new version I plan to discontinue the per-job configuration of queues, as it is simply not worth the effort to support it. Having a single source of truth for queue configurations also makes maintenance and troubleshooting much easier.

b) Job API

Until now, we've been leveraging the EventAdmin to add jobs and also to execute jobs. While this seemed elegant when we started with job handling, it adds another layer to the picture and some uncertainty: e.g. a job can be added by sending an event to the event admin, but the sender does not know whether this job really arrived at the job manager and/or got persisted at all. On the other hand, implementing a job processor based on the event admin looks more complicated than it should be. Therefore I think it's time to add a method to the JobManager for adding a job - if this method returns, the job is persisted and gets executed.
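To make the idea concrete, here is a rough sketch of what such a synchronous method could look like. All names (JobManagerSketch, addJob, Job) are illustrative assumptions, not the final API, and a plain list stands in for the real persistence layer:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of the proposed synchronous API; class and
// method names are assumptions, and an in-memory list stands in
// for real persistence.
class JobManagerSketch {

    /** Minimal job representation: a topic plus arbitrary properties. */
    static final class Job {
        final String id = UUID.randomUUID().toString();
        final String topic;
        final Map<String, Object> properties;

        Job(String topic, Map<String, Object> properties) {
            this.topic = topic;
            this.properties = Collections.unmodifiableMap(new HashMap<>(properties));
        }
    }

    // Stands in for the real store.
    private final List<Job> persisted = new ArrayList<>();

    /**
     * Adds a job. Unlike posting an event to the EventAdmin, the
     * caller gets a guarantee: if this method returns normally, the
     * job has been persisted and will be executed.
     */
    Job addJob(String topic, Map<String, Object> properties) {
        Job job = new Job(topic, properties);
        persisted.add(job); // a real implementation would persist here
        return job;
    }

    int jobCount() {
        return persisted.size();
    }
}
```

The point is the contract, not the shape of the code: the caller calls addJob and, once it returns, knows the job is stored - no fire-and-forget event involved.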
For processing, we make the job processor interface an OSGi service interface. Implementations register this service together with the topics they are able to process. This makes implementations easier, but also makes it possible to find out which topics can be processed on a given cluster node.

c) Deprecate the event admin based API

With b) in place, we don't need the event admin based API anymore and should deprecate it - but of course still support it for compatibility.

WDYT?

Regards
Carsten

--
Carsten Ziegeler
[email protected]
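PS: a rough sketch of the service-based processing described under b) above. The interface name JobProcessor and the topics handling are assumptions for illustration; a plain map stands in for the OSGi service registry, which would normally match services via a topics service property:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: processors register per topic. In a real OSGi
// setup this would happen via the service registry with a topics
// service property; a plain map stands in for the registry here.
class ProcessorRegistrySketch {

    /** The proposed service interface: implementations declare which topics they handle. */
    interface JobProcessor {
        /** Returns true if the job was processed successfully. */
        boolean process(String topic, Map<String, Object> properties);
    }

    private final Map<String, JobProcessor> byTopic = new HashMap<>();

    /** Corresponds to registering the service together with its topics. */
    void register(JobProcessor processor, List<String> topics) {
        for (String topic : topics) {
            byTopic.put(topic, processor);
        }
    }

    /** Which topics this node can process - the information needed to distribute jobs. */
    Set<String> processableTopics() {
        return byTopic.keySet();
    }

    /** Dispatches a job to the processor registered for its topic, if any. */
    boolean dispatch(String topic, Map<String, Object> properties) {
        JobProcessor p = byTopic.get(topic);
        return p != null && p.process(topic, properties);
    }
}
```

Because the registry knows which processors exist on a node, a cluster coordinator could route jobs only to nodes whose processableTopics() cover the job's topic - which is exactly what enables offloading heavy jobs to dedicated machines.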
