Hi,

I'm currently prototyping enhancements to the Sling job handling to allow
for better scaling in clustered / distributed environments. The current
implementation makes a lot of assumptions and relies on JCR locks. These
assumptions, in combination with problems with JCR locks, usually lead to a
setup where just a single instance in the cluster processes jobs.
The goal is to be able to run jobs distributed across a cluster, but also
to be able to process jobs only on specific instances (e.g. to offload some
heavy jobs to dedicated machines).

Though this is still in an early phase, I would like to run some of the
potential changes past the users on this list.

a) Jobs containing queue configurations
The configuration of job handling is usually done through queue
configurations. These queues are assigned to one or more job topics and
have different characteristics, like whether their jobs can be processed in
parallel, how often a job should be retried, the delay between retries,
etc. The queues are configured globally through OSGi ConfigAdmin and are
therefore the same on all cluster nodes.
When we started with the job handling, we didn't have this configuration,
so each and every job carried all of this information as properties of the
job itself - which clearly is a maintenance nightmare, but can also lead to
funny situations where two jobs with the same topic contain different
configurations (e.g. one allowing parallel processing while the other does
not).
With the introduction of the queue configurations, we already reduced the
per-job configuration possibilities, and in some cases these are already
ignored.

For the new version I plan to discontinue the per-job configuration of
queues, as it is simply not worth the effort to support it. And having a
single source of truth for queue configurations makes maintenance and
troubleshooting way easier.
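As a rough illustration, a single, globally shared queue configuration
might carry properties like the following (a sketch only - the property
names, the "heavy-jobs" queue, and the example topic are all made up here,
not actual Sling keys):

```java
import java.util.Dictionary;
import java.util.Hashtable;

/**
 * Hypothetical sketch of a global queue configuration as it could be
 * pushed through OSGi ConfigAdmin. All property names are illustrative.
 */
public class QueueConfigSketch {

    public static Dictionary<String, Object> heavyJobQueue() {
        Dictionary<String, Object> props = new Hashtable<>();
        props.put("queue.name", "heavy-jobs");
        // Topics handled by this queue (one queue, one or more topics).
        props.put("queue.topics", new String[] { "com/example/jobs/transcode" });
        // The characteristics mentioned above: parallelism, retries, delay.
        props.put("queue.parallel", Boolean.FALSE);
        props.put("queue.retries", 3);
        props.put("queue.retrydelay", 2000L); // milliseconds between retries
        return props;
    }

    public static void main(String[] args) {
        System.out.println(heavyJobQueue().get("queue.name")); // prints "heavy-jobs"
    }
}
```

Because this lives in ConfigAdmin rather than in each job, every cluster
node sees the same configuration, and two jobs with the same topic can no
longer disagree about it.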

b) Job API
Until now, we're leveraging the EventAdmin to add jobs but also to execute
them. While this seemed elegant when we started with job handling, it adds
another layer to the picture and some uncertainty: e.g. a job could be
added by sending an event to the event admin, but the sender does not know
whether this job really arrived at the job manager and/or got persisted at
all. On the other hand, implementing a job processor based on the event
admin looks more complicated than it should be.

Therefore I think it's time to add a method to the JobManager for adding a
job - if this method returns, the job is persisted and gets executed. For
processing, we make the job processor interface an OSGi service interface.
Implementations register this service together with the topics they are
able to process. This makes the implementation easier, but also allows us
to find out which topics can be processed on a cluster node.
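To make the idea concrete, here is a minimal in-memory sketch of what such
an API might look like - the Job, JobManager, and JobProcessor types below
are illustrative stand-ins for the proposal, not finalized interfaces, and
the topic/property names are invented:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class JobApiSketch {

    /** Hypothetical job handle returned to the caller. */
    public interface Job {
        String getTopic();
        Map<String, Object> getProperties();
    }

    /** Processor registered (as an OSGi service) for specific topics. */
    public interface JobProcessor {
        boolean process(Job job);
    }

    /** Minimal stand-in for the proposed JobManager. */
    public static class JobManager {
        private final List<Job> persisted = new ArrayList<>();
        private final Map<String, JobProcessor> processors = new HashMap<>();

        // Stands in for OSGi service registration with a topics property.
        public void registerProcessor(String topic, JobProcessor processor) {
            processors.put(topic, processor);
        }

        // When this returns non-null, the job has been persisted.
        public Job addJob(final String topic, final Map<String, Object> props) {
            Job job = new Job() {
                public String getTopic() { return topic; }
                public Map<String, Object> getProperties() { return props; }
            };
            persisted.add(job);              // persist first ...
            JobProcessor p = processors.get(topic);
            if (p != null) {
                p.process(job);              // ... then hand off for execution
            }
            return job;
        }

        // With processors as services, each node can report its topics.
        public Set<String> processableTopics() {
            return processors.keySet();
        }
    }

    public static void main(String[] args) {
        JobManager jm = new JobManager();
        jm.registerProcessor("com/example/jobs/transcode",
                job -> { System.out.println("processing " + job.getTopic()); return true; });
        jm.addJob("com/example/jobs/transcode", Map.of("file", "a.mp4"));
    }
}
```

The key difference to the event admin approach: addJob returning gives the
caller a guarantee that the job was persisted, and the per-topic processor
registrations make it trivial to query what a given node can process.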

c) Deprecate event admin based API
With b) in place, we don't need the event admin based API anymore and
should deprecate it - but of course, for compatibility, still support it.

WDYT?

Regards
Carsten
-- 
Carsten Ziegeler
[email protected]
