On Monday, August 04, 2014 10:11:41 AM Martin Vaeth wrote:
> J. Roeleveld <jo...@antarean.org> wrote:
> > These schedules then also can't be restarted from the beginning
> > when they stop halfway through without risking massive consistency
> > problems in the final data.
> 
> So you have a command which might break due to hardware error
> and cannot be rerun. I cannot see how any general-purpose scheduler
> might help you here: You either need to be able to split your command
> into several (sequential) commands or you need something adapted
> for your particular command.

A general-purpose scheduler can work; such schedulers do exist (with a price tag).
In the OSS world there is, to my knowledge, none.
Yours seems to be the most promising, as the missing features look like they
shouldn't be too difficult to add.

The commands are relatively simple, but they deal with large amounts of data. 
I am talking about ETL processes that, due to the amount of data being 
processed, can easily take several hours per step.
If, during one of these steps, the database or ETL process suffers a crash, 
the activities of the ETL process need to be rolled back to the point where 
you can restart it.
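
As a rough illustration of such a restart point, here is a minimal Python
sketch. Everything in it is an assumption made for the example (the
`run_pipeline` function, the `journal` table, the step names); it does not
describe any existing scheduler. Each step's writes are committed in the same
database transaction as a journal entry marking the step done, so a crash rolls
back the half-finished step and a rerun resumes at the first step that was
never recorded.

```python
import sqlite3


def run_pipeline(conn, steps):
    """Run (name, func) steps in order, skipping steps already journaled.

    Hypothetical sketch: the journal table and the step interface are
    assumptions for illustration only.
    """
    conn.execute("CREATE TABLE IF NOT EXISTS journal (step TEXT PRIMARY KEY)")
    conn.commit()
    done = {row[0] for row in conn.execute("SELECT step FROM journal")}
    executed = []
    for name, func in steps:
        if name in done:
            continue  # finished in an earlier run; do not redo it
        try:
            func(conn)  # the step's own writes (not yet committed)
            conn.execute("INSERT INTO journal (step) VALUES (?)", (name,))
            conn.commit()  # data and journal entry become durable together
        except Exception:
            conn.rollback()  # discard the half-finished step
            raise
        executed.append(name)
    return executed
```

This only works if the steps themselves do not commit and if all their writes
go through the same connection; a multi-hour ETL step that touches several
systems would need a coarser, step-specific rollback.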

I am not talking about simple schedules related to day-to-day maintenance of a 
few servers.

> > And then multiple of those starting at random times with
> > occasionally a whole bunch of the same schedule put into the
> > queue with dependencies to the previous run.
> 
> That's not a problem. Only if the granularity of one command is
> not fine enough, it becomes a problem.

If nothing goes wrong, it can all be put into a single script and the end
result will be the same. Problems start because the real world is not 100%
reliable.

> > If, during that time, one of the machines has a hardware failure
> > or the scheduling process crashes on one or more of the servers,
> > the last state needs to be recoverable.
> 
> One must distinguish two cases:
> 
> 1. The machine running "schedule-server" has a hardware failure.
>    (Let us assume that "schedule-server" does not have a software failure -
>    otherwise, you have problems anyway.)
> 2. Some other machine has a hardware failure.
> 
> Case 2. is not bad (as concerns the scheduling): Of course, the
> machine will not report that it completed the job, and you will
> have to think how to complete the job. But it is clear that in
> such exceptional cases you have to interfere manually in some sense.

Agreed, this happens more often than you might think.

> In order to deal with case 1., you can regularly (e.g. each minute)
> dump the output of "schedule list" (possibly suppressing non-important
> data through the options to keep it short).

Or all the necessary information is kept in sync on persistent storage. This
would then also allow easy fail-over if the master-schedule-node fails. A second
machine could quickly take over.
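
A minimal Python sketch of what keeping the state on persistent storage could
look like follows. The `save_state` and `load_state` helpers, the JSON layout
and the file name are all hypothetical and not part of "schedule"; the sketch
also assumes the state file lives on storage that both nodes can reach. The
state is written to a temporary file, flushed to disk, and then renamed over
the old file, so a crash in the middle of a write never leaves a half-written
state behind.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Atomically write the scheduler state (a JSON-serializable dict)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the rename stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before the rename
        os.replace(tmp, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)  # clean up the temp file if anything failed
        raise


def load_state(path):
    """Return the last saved state, or an empty job list if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"jobs": []}
```

A standby node would call `load_state` when it takes over. Deciding when to
take over, and making sure the old master has really stopped, is a separate
problem that this sketch does not address.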

> One could add a logging option to decrease the possible race of 1 minute,
> but in case of hardware failure a possible race cannot be excluded anyway.
> 
> In case 1. you manually have to re-queue the jobs and think what to do
> with the already started jobs. However, I cannot imagine that this
> occurs so frequently that this exceptional case becomes something
> one should seriously think about.

As I mentioned above, with BI infrastructure (large databases, complex ETL
processes, interactive report services, ...), the scheduler is busy 24/7. The
number of tasks, schedules, dependencies, states, ... that need to be kept track
of can easily lead to unforeseen issues and bugs.
