[gentoo-user] Re: Recommendations for scheduler

2014-08-08 Thread Martin Vaeth
 On Tuesday, August 05, 2014 06:33:59 AM Martin Vaeth wrote:

 When you are at it you should probably also encrypt the communication

schedule-0.15 is finally able to use encryption, hence the current mild
security risks will practically vanish, even if listening to a
world-wide port.

schedule-1.0 will probably soon be ready with encryption strengthened
even more.




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-06 Thread Peter Humphrey
On Tuesday 05 August 2014 22:43:42 J. Roeleveld wrote:

 I still remember running seti@home and similar programs in the past. Those
 were large clusters, but with a very badly designed network.

Was that in the days before BOINC, Joost? Do you think it's any better now? I 
run 5 BOINC projects here in the same general area as SETI. They seem to work 
all right, except for getting changes in what they call computing preferences 
propagated around the projects.

(Just an aside - I don't want to hijack this interesting thread.)

-- 
Regards
Peter




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-06 Thread J. Roeleveld
On Wednesday, August 06, 2014 09:29:53 AM Peter Humphrey wrote:
 On Tuesday 05 August 2014 22:43:42 J. Roeleveld wrote:
  I still remember running seti@home and similar programs in the past. Those
  were large clusters, but with a very badly designed network.
 
 Was that in the days before BOINC, Joost? Do you think it's any better now?
 I run 5 BOINC projects here in the same general area as SETI. They seem to
 work all right, except for getting changes in what they call computing
 preferences propagated around the projects.
 
 (Just an aside - I don't want to hijack this interesting thread.)

Yes, I did it for a short period sometime in 1999.

It worked alright, I just meant that running it on thousands of personal 
computers using dial-up to the internet is a badly designed network for a 
cluster.

--
Joost



[gentoo-user] Re: Recommendations for scheduler

2014-08-05 Thread Martin Vaeth
J. Roeleveld jo...@antarean.org wrote:

No, it wouldn't, since jobs just finishing and wanting to report their
status cannot do this when there is no server. You would need a rather
involved protocol to deal with such situations dynamically.
It can certainly be done, but it is not something which can
easily be added as a feature: If this is required, it has to be the
fundamental concept from the very beginning and everything else has to
follow this first aim. You need different protocols than TCP sockets,
to start with; something like dbus over IP with servers being able
to announce their new presence, etc.

 I think it's doable with standard networking protocols.

Yes, you can tunnel such a protocol over existing protocols,
but essentially you must use a different one.
Unless you want a static setup (use server A; if that fails, use
server B; and server A reports everything to server B),
it cannot be done in a simple way with only
one port open on the server: The client also needs a port open
to be informed about the current server. Even worse, you need
a daemon running for each client to handle this port.
In such a case, you might as well make each client its own server,
by spreading all changes to all clients immediately.
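
A minimal sketch of the static setup described above (try server A, fall back to server B), written in Python. The function name connect_first_available and the address list are hypothetical; this is not how schedule works, only an illustration of a fixed fallback list over plain TCP sockets.

```python
import socket


def connect_first_available(servers, timeout=2.0):
    """Try each (host, port) in order; return (socket, address) of the first that accepts."""
    last_error = None
    for address in servers:
        try:
            sock = socket.create_connection(address, timeout=timeout)
            return sock, address
        except OSError as exc:
            # Server unreachable or refusing: remember why and try the next one.
            last_error = exc
    raise ConnectionError(f"no server reachable: {last_error}")
```

A client would call it with something like [("serverA", 8471), ("serverB", 8471)] (made-up names and port). Note that this only covers the static case; it does nothing to tell a running client that the active server has changed.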

 But, either you have a master server which controls everything.
 Or you have a master process which has failover functionality
 using classical distributed software techniques.

This summarizes it quite well.
The concept of my schedule is to follow the first path (with the
advantage of being simple, having only one part, clients do nothing
while their task is running).
If you want to follow the latter, you need a rather different CLI
and a different protocol - which is practically everything schedule
consists of; so it is probably simpler to rewrite this from scratch.
As I said: It is not a feature you can easily add later on; it is a
fundamental decision you must make at the very beginning.
When you are at it you should probably also encrypt the communication
and establish methods for authentication, which is also something
I currently omitted in schedule for simplicity (although this might
be easier to add later on).




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-05 Thread J. Roeleveld
On Tuesday, August 05, 2014 06:33:59 AM Martin Vaeth wrote:
 J. Roeleveld jo...@antarean.org wrote:
 No, it wouldn't, since jobs just finishing and wanting to report their
 status cannot do this when there is no server. You would need a rather
 involved protocol to deal with such situations dynamically.
 It can certainly be done, but it is not something which can
 easily be added as a feature: If this is required, it has to be the
 fundamental concept from the very beginning and everything else has to
 follow this first aim. You need different protocols than TCP sockets,
 to start with; something like dbus over IP with servers being able
 to announce their new presence, etc.
 
  I think it's doable with standard networking protocols.
 
 Yes, you can tunnel such a protocol over existing protocols,
 but essentially you must use a different one.
 Unless you want a static setup (use server A; if that fails, use
 server B; and server A reports everything to server B),
 it cannot be done in a simple way that you have only
 one port open on the server: The client also needs a port open
 to be informed about the current server. Even worse, you need
 a daemon running for each client to handle this port.
 In such a case, you might make each client its own server,
 by spreading all changes to all clients immediately.

Not necessarily, the client listens on a port and the server connects to the 
clients it maintains. It then also knows when a client is dead and 
corresponding jobs have an issue.
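
A rough Python sketch of that idea: the server probes the port each client listens on and collects the jobs assigned to clients that no longer answer. The names is_alive and jobs_at_risk and the shape of the assignments mapping are assumptions made for illustration only.

```python
import socket


def is_alive(address, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        return False


def jobs_at_risk(assignments, timeout=1.0):
    """Return the job ids assigned to clients that do not answer.

    assignments maps a client (host, port) tuple to a list of job ids.
    """
    at_risk = []
    for address, jobs in assignments.items():
        if not is_alive(address, timeout):
            at_risk.extend(jobs)
    return at_risk
```

A failed probe only says the client is unreachable right now; deciding whether its jobs are really lost still needs a human or a retry policy.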

  But, either you have a master server which controls everything.
  Or you have a master process which has failover functionality
  using classical distributed software techniques.
 
 This summarizes it quite well.
 The concept of my schedule is to follow the first path (with the
 advantage of being simple, having only one part, clients do nothing
 while their task is running).
 If you want to follow the latter, you need a rather different CLI
 and a different protocol - which is practically everything schedule
 consists of; so it is probably simpler to rewrite this from scratch.
 As I said: It is not a feature you can easily add later on; it is a
 fundamental decision you must make at the very beginning.
 When you are at it you should probably also encrypt the communication
 and establish methods for authentication, which is also something
 I currently omitted in schedule for simplicity (although this might
 be easier to add later on).

I agree. schedule is good for most uses we might encounter. For the business 
case I have, I will need to write something myself.

Thanks to this discussion we've been having, I now have a much better idea of 
how to approach this project. For that I am very thankful.

--
Joost



Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-05 Thread J. Roeleveld
On Monday, August 04, 2014 10:38:57 PM Alan McKinnon wrote:
 On 04/08/2014 21:46, J. Roeleveld wrote:
  On 4 August 2014 15:35:41 CEST, Alan McKinnon alan.mckin...@gmail.com  
  Either make the ETL tool pick up where it stopped and continue, as it is
  the only one that knows what it was doing and how far it got. Or, wrap the
  entire script in a single transaction.
  
  Alan,
  
  That would be the ideal solution.
 
 You have the same concerns I do - how do you make a transaction around
 500 million rows. So I asked the in-house expert - Mrs Alan :-)

Have a very large temporary tablespace on the database server.

  However, a single transaction dealing with around 500,000,000 rows will
  get me shot by the DBAs :) (Never mind that the performance of this will
  be such that having it all done by an office full of secretaries might be
  quicker.)
 She reckons an ETL job *must* be self-contained; if it isn't then it's
 broken by design. It must be idempotent too, which can be as simple as
 Truncate, Load, Commit

Most common tactic (done by humans):
- delete from target table where INS_PCS_ID = crashed run-id;
- update target table set VLD_TO = null where UPD_PCS_ID = crashed run-id;
Then, restart the crashed run-id.

For this, you need to know which command failed to know where to find the 
actual run-id you need to roll back.
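
The two manual statements above can be sketched as one small Python function, using sqlite3 purely as a stand-in database. The column names INS_PCS_ID, UPD_PCS_ID and VLD_TO come from the statements above; the function name roll_back_run and the table name passed in are hypothetical.

```python
import sqlite3


def roll_back_run(conn, table, run_id):
    """Undo the inserts and the validity-closing updates made by one crashed run."""
    # Table names cannot be bound as parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    with conn:  # one transaction: both statements commit together or not at all
        deleted = conn.execute(
            f"DELETE FROM {table} WHERE INS_PCS_ID = ?", (run_id,)
        ).rowcount
        reopened = conn.execute(
            f"UPDATE {table} SET VLD_TO = NULL WHERE UPD_PCS_ID = ?", (run_id,)
        ).rowcount
    return deleted, reopened
```

The hard part remains what is said above: the caller still has to find out which run-id crashed before anything can be rolled back.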

  Having the ETL process clever enough to be able to pick up from any point
  requires a degree of forward thinking and planning that is never done in
  real life. I would love to design it like that as it isn't too difficult.
  But I always get brought into these projects when implementing these
  structures will require a full rewrite and getting the original
  architects to admit their design can't be made restartable without human
  intervention.
 I agree with that design actually - it's the job of the hardware and OS
 guys to make stuff reliable that the application layer can rely on. When
 a SAN connection goes away, it usually comes back and the app layer just
 carries on (never mind that it retried 100 times meanwhile).

Yes, until you find out the clustered FS being used causes the crashes... 
(Yes, been in that situation...)

 Sometimes this doesn't work out. The easiest, cheapest and quickest way
 to handle it is to just restart the whole job from the beginning. This
 offends the engineer in us sometimes, but it really is the best way and
 all of Unix is built on this very idea :-)

Which is generally done. Usually, requiring a manual clean up prior to 
restart. If done properly, the ETL process has the capability to roll back the 
failed run prior to redoing it.
This, however, requires extensive planning and design at the initial 
implementation phase.

 If the SAN goes away too often and it causes issues, then maybe the best
 approach is to get the SAN and facilities guys to get their act together

Instead of finger-pointing.

  At which point the business simply says it is acceptable to have people do
  a manual rollback and restart the schedules from wherever it went wrong.
 Exactly. One of the few cases where business has the correct idea.
 There's only so many pennies to spend and so many dollars to be delivered.

Nightly processes that fail and then have to wait for the day-shift to arrive 
often cost the business more because the reports are delayed.

  I'm sure your wife has similar experiences as this is why these projects
  are always late to deliver and over budget.
 She says her projects are subject to the same universal inviolate rule
 as mine:
 
 time and cost is always best engineering estimate times pi

Overhead, testing, maintenance, ... yes, it all adds up.

 We learn to deal with it. Which brings us back to Martin's initial
 statement: a scheduler cannot deal with any of this, the job itself
 must. It's an unpredictable event and schedulers can only deal with
 predictable events

True, but keeping the schedules and state stored in a way to make it easy to 
find out how far the whole process got makes recovery simpler.
Otherwise it's often quicker to simply roll back the entire schedule and 
restart. Even if only the last 2 of the 50 commands didn't run yet.
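
A minimal sketch of keeping that state on disk, in Python. The function name run_with_checkpoint and the JSON state file are assumptions for illustration and not a feature of any scheduler discussed here. It records each finished command, so a rerun skips what is already done.

```python
import json
import os
import subprocess
import tempfile


def run_with_checkpoint(commands, state_path):
    """Run shell commands in order, recording each success in state_path.

    On a rerun, commands already recorded as done are skipped.
    Returns the indices of the commands actually executed in this call.
    """
    done = []
    if os.path.exists(state_path):
        with open(state_path) as fh:
            done = json.load(fh)
    executed = []
    for index, command in enumerate(commands):
        if index in done:
            continue
        subprocess.run(command, shell=True, check=True)  # raises on failure
        done.append(index)
        executed.append(index)
        # Write to a temp file and rename it, so a crash never leaves a
        # half-written state file behind.
        directory = os.path.dirname(os.path.abspath(state_path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as fh:
            json.dump(done, fh)
        os.replace(tmp, state_path)
    return executed
```

This only tells you which commands finished. A command that crashed halfway still needs its own rollback before it can be restarted.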

--
Joost



[gentoo-user] Re: Recommendations for scheduler

2014-08-05 Thread James
Joost Roeleveld joost at antarean.org writes:


  Mesos looks promising for a variety of (Apache) reasons. Some key
  technologies folks may want to google that are related:
  
  Quincy (fair scheduler)
  Chronos (scheduler)
  Hadoop (scheduler)
 
 Hadoop is not a scheduler. It's a framework for a Big Data clustered
 database.

  HDFS (clustered file system)
 Unless it's changed recently, not suitable for anything other than Hadoop,
 and contains a single point of failure.

I'm curious as to more information about this 'single point of failure'. Can
you be more specific or provide links?

On this resource: 

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html

The section on JournalNode machines talks about surviving faults:

increase the number of failures the system can tolerate, you should run an
odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N
JournalNodes, the system can tolerate at most (N - 1) / 2 failures and
continue to function normally. 
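
The (N - 1) / 2 rule quoted above is easy to check with a few lines of Python. The function name tolerated_failures is made up for this sketch, and integer division is used so that even values of N round down.

```python
def tolerated_failures(journal_nodes):
    """Largest number of JournalNode failures tolerated: floor((N - 1) / 2)."""
    if journal_nodes < 1:
        raise ValueError("need at least one JournalNode")
    return (journal_nodes - 1) // 2


for n in (3, 4, 5, 7):
    print(n, tolerated_failures(n))
```

Going from 3 to 4 JournalNodes does not raise the number of tolerated failures (both give 1), which is why the documentation recommends odd numbers.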

 
  http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
  
  Zookeeper (Fault tolerance)
  SPARK (optimized for iterative jobs where a dataset is reused in many
  parallel operations; advanced math/science and many other apps)
  https://spark.apache.org/
  
  Dryad  Torque   MPICH2 MPI
  Globus toolkit
  
  mesos_tech_report.pdf
  
  It looks as though Amazon, google, facebook and many others
  large in the Cluster/Cloud arena are using Mesos..?
  
  So let's all post what we find, particularly in overlays.
 
 Unless you are dealing with Big Data projects, like Google, Facebook,
Amazon,  big banks,... you don't have much use for those projects.

Many scientific applications are using the cluster (cloud) or big data
approach to all sorts of problems. Furthermore, as GPU and the new
Arm systems with dozens and dozens of cpu cores inside one computer become
readily available, the cluster-cloud (big data) approach will become much
more pervasive in the next few years, imho.

http://blog.rescale.com/reservoir-simulation-moves-to-the-cloud/

There are thousands of small companies needing reservoir simulation, not to 
mention the millions of folks working on carbon sequestration.
Anything to do with Biological or Chemical Science is using or moving
to the Cloud-Clustered world. For me, a Cluster is just a cloud internally
managed, rather than outsourced to others; ymmv.


 Mesos looks like a nice project, just like Hadoop and related are also 
 nice. But for most people, they are as useful as using Exalytics.

I'm not excited about an Oracle solution to anything. Many of the folks
I know consult on moving technologies away from Oracle's sphere of influence,
not limited to mysql; ymmv. I know of one very large communications company
that went broke and had to merge because of those ridiculous Oracle fees.
Caveat Emptor; long live PostgreSQL.


 A scheduler should not have a large set of dependencies that you wouldn't
 use otherwise. That makes Chronos a non-option to me.

Those other technologies are often useful to folks who would be attracted to
something like chronos.

 Martin's project looks promising, but doesn't store the schedules 
 internally. For repeating schedules, like what Alan was describing, you 
 need to put those into scripts and start those from an existing cron.
 Of the 2, I think improving Martin's project is the most likely option 
 for me as it doesn't have additional dependencies and seems to be 
 easily implemented.
 Joost

Understood.
Like others, I'll be curious to follow what develops out of Martin's work.

For me Chronos, Mesos and the other aforementioned technologies look to be
more viable; particularly if one is preparing for a clustered world with
CPUs, GPUs, SoCs and Arm machines distributed about the ethernet  as
resources to be scheduled and utilized in a variety of schema. It's the
quest for one-infrastructure to solve many problems where scenarios compete. 

Big data is not the only reason for cloud-clusters. Theoretically,
(Clustered) systems can have a far greater resource utilization of networked
resources than traditional (distributed) approaches. I grant you that this
is a work in progress, but I personally know of dozens of mathematically
complex distributed systems that are  migrating to the clustered approach
rather than something custom or traditionally distributed.

Granted, Cloud -- Clustered -- Distributed are all overlapping approaches
to big problems. I do appreciate the candor of this thread.


James







Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-05 Thread J. Roeleveld
On 5 August 2014 21:57:56 CEST, James wirel...@tampabay.rr.com wrote:
Joost Roeleveld joost at antarean.org writes:


  Mesos looks promising for a variety of (Apache) reasons. Some key
  technologies folks may want to google that are related:
  
  Quincy (fair scheduler)
  Chronos (scheduler)
  Hadoop (scheduler)
 
 Hadoop is not a scheduler. It's a framework for a Big Data clustered
 database.

  HDFS (clustered file system)
 Unless it's changed recently, not suitable for anything other than Hadoop,
 and contains a single point of failure.

I'm curious as to more information about this 'single point of failure'. Can
you be more specific or provide links?

On this resource: 

http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html

The section on JournalNode machines talks about surviving faults:

increase the number of failures the system can tolerate, you should run an
odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N
JournalNodes, the system can tolerate at most (N - 1) / 2 failures and
continue to function normally.

Just read that part. Looks like they solved it partly since 2.2.
The problem lies with the NameNodes.
Prior to 2.2, you only had 1. If that one dies, you lose the entire cluster. 
If that one is unrecoverable, you lose all the data.

After 2.2, you can configure a standby NameNode. However, it still requires 
manual restart.

Considering that Hadoop is most often running on old machines, chances for 
hardware failure are higher when compared with clusters using newer hardware.

I'm not sure how other cluster FSs deal with this, but I consider it a design 
flaw if the disappearance of a single machine in a 100+ node cluster leaves the 
entire cluster in a broken state.
It's like running a single RAID5 with 100+ drives.
Anyone stupid enough to do that deserves to lose their data.

  http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
  
  Zookeeper (Fault tolerance)
  SPARK (optimized for iterative jobs where a dataset is reused in many
  parallel operations; advanced math/science and many other apps)
  https://spark.apache.org/
  
  Dryad  Torque   MPICH2 MPI
  Globus toolkit
  
  mesos_tech_report.pdf
  
  It looks as though Amazon, google, facebook and many others
  large in the Cluster/Cloud arena are using Mesos..?
  
  So let's all post what we find, particularly in overlays.
 
 Unless you are dealing with Big Data projects, like Google, Facebook,
Amazon,  big banks,... you don't have much use for those projects.

Many scientific applications are using the cluster (cloud) or big data
approach to all sorts of problems. Furthermore, as GPU and the new
Arm systems with dozens and dozens of cpu cores inside one computer become
readily available, the cluster-cloud (big data) approach will become much
more pervasive in the next few years, imho.

http://blog.rescale.com/reservoir-simulation-moves-to-the-cloud/

There are thousands of small companies needing reservoir simulation, not to
mention the millions of folks working on carbon sequestration.
Anything to do with Biological or Chemical Science is using or moving
to the Cloud-Clustered world. For me, a Cluster is just a cloud internally
managed, rather than outsourced to others; ymmv.

My apologies. I forgot the scientific research here. But that was mostly 
because they have been dealing with really large datasets and corresponding 
large compute clusters for decades.

The term Big Data is generally applied to financial and social media data.

 Mesos looks like a nice project, just like Hadoop and related are also
 nice. But for most people, they are as useful as using Exalytics.

I'm not excited about an Oracle solution to anything. Many of the folks
I know consult on moving technologies away from Oracle's sphere of influence,
not limited to mysql; ymmv. I know of one very large communications company
that went broke and had to merge because of those ridiculous Oracle fees.
Caveat Emptor; long live PostgreSQL.

I'd be interested in the name of that company. Even offlist.

And I definitely agree. PostgreSQL is often a valid alternative. Unfortunately, 
it is rarely possible to use it as a back end to enterprise software as these 
are all designed to be used with databases from the usual suspects (Oracle, 
IBM, Microsoft, ...)

Same goes for OSS projects. The developers are often unable to properly code 
the SQL layer and end up simply using MySQL and its broken SQL implementation.

 A scheduler should not have a large set of dependencies that you wouldn't
 use otherwise. That makes Chronos a non-option to me.

Those other technologies are often useful to folks who would be attracted to
something like chronos.

If you already use Mesos, using Chronos makes sense.
If you're only interested in a scheduler, installing Mesos just to use Chronos 
doesn't make sense.

 Martin's project looks promising, but doesn't store the schedules 
 internally. For repeating 

Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-05 Thread Alan McKinnon
On 05/08/2014 22:43, J. Roeleveld wrote:
 I believe Martin's scheduler will be very valuable. Even for me.
 I am very likely going to start using this for some of my regular maintenance 
 activities on the home network.
 
 But as the rest of the thread shows, I wouldn't be able to use it as a 
 scheduler for large projects where the schedules can get very complex very 
 quickly.


Martin will be happy to know I think his work will fit my needs just
nicely :-)



-- 
Alan McKinnon
alan.mckin...@gmail.com




[gentoo-user] Re: Recommendations for scheduler

2014-08-04 Thread Martin Vaeth
J. Roeleveld jo...@antarean.org wrote:

 With the kind of schedules I am working with (and I believe Alan will
 also end up with), restarting the whole process from the start can
 lead to issues.
 Finding out how far the process got before the service crashed can become
 rather complex.

I am not sure whether I understand this correctly:
schedule has no problem displaying which tasks have
finished/failed/are still running at any time.
Of course, a finer granularity than tasks is not possible (how far
has a certain task got?) because this would require knowledge
about the task and how to check it - you need to be able to
split your tasks into more shell commands to make a finer granularity
available for schedule.

You can just rerun your driving script, with the effect that the
tasks which are already finished/failed will not actually be
restarted; instead, they behave as if they finish immediately
and report that they are finished/failed. (If you plan to do this,
I would suggest scheduling things like sleep as separate tasks,
too, and not building them into the driving script.)

If there is an unexpected problem, and e.g. you want to re-run
a failed task anyway, you can just re-queue your new task in
the same place as the previous task, e.g.
schedule remove jobnr
schedule -j jobnr queue command to do your task
Then the old job (and its state) is replaced by the new queued job,
and your (identical as before) driving script will start it instead
of assuming that the job is already finished.

In order to avoid races, I would recommend doing the above only
while your driving script is not running (e.g., you can suspend it
with Ctrl-z if you have written it in (...) or
if it is really a classical script, and then continue it with fg;
or you can even stop it completely with Ctrl-c and re-run it, depending
on what you want): The problem is that between the above two commands
the jobs after jobnr are renumbered.
Alternatively, you can insert your new job at the end of the joblist
and then use something like (untested)
schedule -jjobnr insert 0 jobnr+1:-1
schedule remove 0
to re-sort your job list: The insert is race-free,
and having added a job at the end for some time will hopefully not
disturb anything.




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-04 Thread J. Roeleveld
On 4 August 2014 10:41:04 CEST, Martin Vaeth mar...@mvath.de wrote:
J. Roeleveld jo...@antarean.org wrote:

 With the kind of schedules I am working with (and I believe Alan will
 also end up with), restarting the whole process from the start can
 lead to issues.
 Finding out how far the process got before the service crashed can become
 rather complex.

I am not sure whether I understand this correctly:

The schedules I am used to dealing with easily span 8 to 14 hours, and 
occasionally even over a week.
These schedules then also can't be restarted from the beginning when they stop 
halfway through without risking massive consistency problems in the final data.

And then there are multiple of those starting at random times, with occasionally 
a whole bunch of the same schedule put into the queue with dependencies on the 
previous run.

If, during that time, one of the machines has a hardware failure or the 
scheduling process crashes on one or more of the servers, the last state needs 
to be recoverable.

If you have to clean up the environment and bring it back to a state where you 
can restart the schedules, it saves time if you know which commands and tasks 
were actually running at the time.

For this, the schedules, queues and current state for each node need to be 
stored on persistent storage.

Hope this clarifies it all a bit.
--
Joost





[gentoo-user] Re: Recommendations for scheduler

2014-08-04 Thread Martin Vaeth
J. Roeleveld jo...@antarean.org wrote:

 These schedules then also can't be restarted from the beginning
 when they stop halfway through without risking massive consistency
 problems in the final data.

So you have a command which might break due to hardware error
and cannot be rerun. I cannot see how any general-purpose scheduler
might help you here: You either need to be able to split your command
into several (sequential) commands or you need something adapted
for your particular command.

 And then multiple of those starting at random times with
 occasionally a whole bunch of the same schedule put into the
 queue with dependencies to the previous run.

That's not a problem. Only if the granularity of one command is
not fine enough, it becomes a problem.

 If, during that time, one of the machines has a hardware failure
 or the scheduling process crashes on one or more of the servers,
 the last state needs to be recoverable.

One must distinguish two cases:

1. The machine running schedule-server has a hardware failure.
   (Let us assume that schedule-server does not have a software failure -
   otherwise, you have problems anyway.)
2. Some other machine has a hardware failure.

Case 2. is not bad (as far as the scheduling is concerned): Of course, the
machine will not report that it completed the job, and you will
have to think about how to complete the job. But it is clear that in
such exceptional cases you have to intervene manually in some sense.

In order to deal with case 1., you can regularly (e.g. each minute)
dump the output of schedule list (possibly suppressing non-important
data through the options to keep it short).
One could add a logging option to decrease the possible race of 1 minute,
but in case of hardware failure a possible race cannot be excluded anyway.
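
A small Python sketch of that regular dump. The command schedule list is the one named above; the helper names dump_once and dump_forever, the target path and the 60-second interval are assumptions for illustration.

```python
import os
import subprocess
import tempfile
import time


def dump_once(command, target_path):
    """Run command and atomically replace target_path with its output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    directory = os.path.dirname(os.path.abspath(target_path))
    # Write to a temp file first, then rename: readers never see a partial dump.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        fh.write(result.stdout)
    os.replace(tmp, target_path)
    return result.stdout


def dump_forever(command, target_path, interval=60.0):
    """Call dump_once every `interval` seconds until interrupted."""
    while True:
        dump_once(command, target_path)
        time.sleep(interval)


# Hypothetical usage (path is made up):
# dump_forever(["schedule", "list"], "/var/tmp/schedule-state.txt")
```

To survive a hardware failure of the schedule-server machine, the dump would of course have to be written to storage that is not on that same machine.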

In case 1. you manually have to re-queue the jobs and think what to do
with the already started jobs. However, I cannot imagine that this
occurs so frequently that this exceptional case becomes something
one should seriously think about.




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-04 Thread J. Roeleveld
On Monday, August 04, 2014 10:11:41 AM Martin Vaeth wrote:
 J. Roeleveld jo...@antarean.org wrote:
  These schedules then also can't be restarted from the beginning
  when they stop halfway through without risking massive consistency
  problems in the final data.
 
 So you have a command which might break due to hardware error
 and cannot be rerun. I cannot see how any general-purpose scheduler
 might help you here: You either need to be able to split your command
 into several (sequential) commands or you need something adapted
 for your particular command.

A general-purpose scheduler can work, as they do exist. (With a price tag)
In the OSS world, there is, to my knowledge, none.
Yours seems to be the most promising as it looks like the missing features 
shouldn't be too difficult to add.

The commands are relatively simple, but they deal with large amounts of data. 
I am talking about ETL processes that, due to the amount of data being 
processed, can easily take several hours per step.
If, during one of these steps, the database or ETL process suffers a crash, 
the activities of the ETL process need to be rolled back to the point where 
you can restart it.

I am not talking about simple schedules related to day-to-day maintenance of a 
few servers.

  And then multiple of those starting at random times with
  occasionally a whole bunch of the same schedule put into the
  queue with dependencies to the previous run.
 
 That's not a problem. Only if the granularity of one command is
 not fine enough, it becomes a problem.

If nothing happens, it can all be stuck into a single script and the end 
result will be the same. Problems start because the real world is not 100% 
reliable.

  If, during that time, one of the machines has a hardware failure
  or the scheduling process crashes on one or more of the servers,
  the last state needs to be recoverable.
 
 One must distinguish two cases:
 
 1. The machine running schedule-server has a hardware failure.
    (Let us assume that schedule-server does not have a software failure -
    otherwise, you have problems anyway.)
 2. Some other machine has a hardware failure.
 
 Case 2. is not bad (as concerns the scheduling): Of course, the
 machine will not report that it completed the job, and you will
 have to think how to complete the job. But it is clear that in
 such exceptional cases you have to interfere manually in some sense.

Agreed, this happens more often than you might think.

 In order to deal with case 1., you can regularly (e.g. each minute)
 dump the output of schedule list (possibly suppressing non-important
 data through the options to keep it short).

Or all the necessary information is kept in-sync on persistent storage. This 
would then also allow easy fail-over if the master-schedule-node fails. A 2nd 
machine could quickly take over.

 One could add a logging option to decrease the possible race of 1 minute,
 but in case of hardware failure a possible race cannot be excluded anyway.
 
 In case 1. you manually have to re-queue the jobs and think what to do
 with the already started jobs. However, I cannot imagine that this
 occurs so frequently that this exceptional case becomes something
 one should seriously think about.

As I mentioned above, with BI infrastructure (large databases, complex ETL 
processes, interactive report services, ...), the scheduler is busy 24/7. The 
number of tasks, schedules, dependencies, states, ... that need to be kept track 
of can easily lead to unforeseen issues and bugs.



[gentoo-user] Re: Recommendations for scheduler

2014-08-04 Thread Martin Vaeth
J. Roeleveld jo...@antarean.org wrote:

 So you have a command which might break due to hardware error
 and cannot be rerun. I cannot see how any general-purpose scheduler
 might help you here: You either need to be able to split your command
 into several (sequential) commands or you need something adapted
 for your particular command.

 A general-purpose scheduler can work, as they do exist.

I doubt that they can solve your problem.
Let me repeat: You have a single program which accesses the database
in a complex way and somewhere in the course of accessing it, the
machine (or program) crashes.
No general-purpose program can recover from this: You need
particular knowledge of the database and the program if you even
want to have a *chance* to recover from such a situation.
A program with such a particular knowledge can hardly be called
general-purpose.

 If, during one of these steps, the database or ETL process suffers a
 crash, the activities of the ETL process need to be rolled back to
 the point where you can restart it.

I agree, but you need particular knowledge of the database and
your tasks to do this which is far beyond the job of a scheduler.
As already mentioned by someone in this thread, your problem needs
to be solved on the level of the database (using
snapshot capabilities etc.)

 In order to deal with case 1., you can regularly (e.g. each minute)
 dump the output of schedule list (possibly suppressing non-important
 data through the options to keep it short).

 Or all the necessary information is kept in-sync on persistent storage.
 This would then also allow easy fail-over if the master-schedule-node
 fails

No, it wouldn't, since jobs just finishing and wanting to report their
status cannot do this when there is no server. You would need a rather
involved protocol to deal with such situations dynamically.
It can certainly be done, but it is not something which can
easily be added as a feature: If this is required, it has to be the
fundamental concept from the very beginning and everything else has to
follow this first aim. You need different protocols than TCP sockets,
to start with; something like dbus over IP with servers being able
to announce their new presence, etc.




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-04 Thread Alan McKinnon
On 04/08/2014 15:31, Martin Vaeth wrote:
 J. Roeleveld jo...@antarean.org wrote:

 So you have a command which might break due to hardware error
 and cannot be rerun. I cannot see how any general-purpose scheduler
 might help you here: You either need to be able to split your command
 into several (sequential) commands or you need something adapted
 for your particular command.

 A general-purpose scheduler can work, as they do exist.
 
 I doubt that they can solve your problem.
 Let me repeat: You have a single program which accesses the database
 in a complex way and somewhere in the course of accessing it, the
 machine (or program) crashes.
 No general-purpose program can recover from this: You need
 particular knowledge of the database and the program if you even
 want to have a *chance* to recover from such a situation.
 A program with such a particular knowledge can hardly be called
 general-purpose.


Joost,

Either make the ETL tool pick up where it stopped and continue, as it is
the only one that knows what it was doing and how far it got. Or, wrap the
entire script in a single transaction.


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-04 Thread J. Roeleveld
On 4 August 2014 15:35:41 CEST, Alan McKinnon alan.mckin...@gmail.com wrote:
On 04/08/2014 15:31, Martin Vaeth wrote:
 J. Roeleveld jo...@antarean.org wrote:

 So you have a command which might break due to hardware error
 and cannot be rerun. I cannot see how any general-purpose scheduler
 might help you here: You either need to be able to split your command
 into several (sequential) commands or you need something adapted
 for your particular command.

 A general-purpose scheduler can work, as they do exist.
 
 I doubt that they can solve your problem.
 Let me repeat: You have a single program which accesses the database
 in a complex way and somewhere in the course of accessing it, the
 machine (or program) crashes.
 No general-purpose program can recover from this: You need
 particular knowledge of the database and the program if you even
 want to have a *chance* to recover from such a situation.
 A program with such a particular knowledge can hardly be called
 general-purpose.


Joost,

Either make the ETL tool pick up where it stopped and continue, as it is
the only one that knows what it was doing and how far it got. Or, wrap the
entire script in a single transaction.

Alan,

That would be the ideal solution.
However, a single transaction dealing with around 500,000,000 rows will get me 
shot by the DBAs :)
(Never mind that the performance of this will be such that having it all done 
by an office full of secretaries might be quicker.)

Having the ETL process clever enough to be able to pick up from any point 
requires a degree of forward thinking and planning that is never done in real 
life.
I would love to design it like that as it isn't too difficult. But I always get 
brought into these projects when implementing these structures will require a 
full rewrite and getting the original architects to admit their design can't be 
made restartable without human intervention.
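
For illustration, a minimal sketch of a restartable load that sits between "one huge transaction" and "no transactions at all": each batch is committed together with a checkpoint, so a rerun continues from the last committed batch. It uses Python's sqlite3 module; the table names target and checkpoint, the batch size and the fail_after switch (which only simulates a crash) are all assumptions made for the example.

```python
import sqlite3


def load_in_batches(conn, source, batch_size=2, fail_after=None):
    """Copy source rows into target, one committed batch at a time."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target (id INTEGER PRIMARY KEY, value TEXT)"
    )
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint (done INTEGER NOT NULL)")
    row = conn.execute("SELECT done FROM checkpoint").fetchone()
    done = row[0] if row else 0  # number of rows loaded by earlier runs
    batches = 0
    while done < len(source):
        if fail_after is not None and batches == fail_after:
            raise RuntimeError("simulated crash")
        batch = source[done:done + batch_size]
        with conn:  # the batch and its checkpoint commit or roll back together
            conn.executemany("INSERT INTO target VALUES (?, ?)", batch)
            conn.execute("DELETE FROM checkpoint")
            conn.execute(
                "INSERT INTO checkpoint VALUES (?)", (done + len(batch),)
            )
        done += len(batch)
        batches += 1
```

Calling load_in_batches() again after a failure skips everything the checkpoint already covers. A real ETL step would need a stable ordering of the source rows for this to be safe.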

At which point the business simply says it is acceptable to have people do a 
manual rollback and restart the schedules from wherever it went wrong.

I'm sure your wife has similar experiences as this is why these projects are 
always late to deliver and over budget.

--
Joost
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-04 Thread J. Roeleveld
On 4 August 2014 15:31:40 CEST, Martin Vaeth mar...@mvath.de wrote:
J. Roeleveld jo...@antarean.org wrote:

 So you have a command which might break due to hardware error
 and cannot be rerun. I cannot see how any general-purpose scheduler
 might help you here: You either need to be able to split your command
 into several (sequential) commands or you need something adapted
 for your particular command.

 A general-purpose scheduler can work, as they do exist.

I doubt that they can solve your problem.
Let me repeat: You have a single program which accesses the database
in a complex way and somewhere in the course of accessing it, the
machine (or program) crashes.
No general-purpose program can recover from this: You need
particular knowledge of the database and the program if you even
want to have a *chance* to recover from such a situation.
A program with such a particular knowledge can hardly be called
general-purpose.

The scheduler needs to be able to show which process failed/didn't finish. 
Then humans need to ensure that part finishes/reruns properly.
Then humans need to be able to mark the failed process as succeeded.

At which point the scheduler continues with the schedule(s)
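
The workflow above can be sketched roughly as follows, in Python. The Schedule class and its method names are hypothetical and only illustrate the idea of stopping at the failed step, letting a human fix it and mark it as succeeded, and then continuing.

```python
from enum import Enum


class State(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class Schedule:
    """Runs steps in order and halts at the first failure."""

    def __init__(self, steps):
        # steps: list of (name, callable returning True on success)
        self.steps = list(steps)
        self.state = {name: State.PENDING for name, _ in self.steps}

    def run(self):
        """Run all steps not yet succeeded; return the failed step's name, or None."""
        for name, action in self.steps:
            if self.state[name] is State.SUCCEEDED:
                continue  # done earlier, or marked as done by an operator
            try:
                ok = bool(action())
            except Exception:
                ok = False
            self.state[name] = State.SUCCEEDED if ok else State.FAILED
            if not ok:
                return name
        return None

    def mark_succeeded(self, name):
        """Operator override after the step was repaired by hand."""
        self.state[name] = State.SUCCEEDED
```

After a failure, the state dictionary shows which step stopped the schedule; once mark_succeeded() has been called for it, run() picks up with the next pending step.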

 If, during one of these steps, the database or ETL process suffers a
 crash, the activities of the ETL process need to be rolled back to
 the point where you can restart it.

I agree, but you need particular knowledge of the database and
your tasks to do this which is far beyond the job of a scheduler.
As already mentioned by someone in this thread, your problem needs
to be solved on the level of the database (using
snapshot capabilities etc.)

Or human intervention. Which requires a clear indication of where it went wrong 
and allows a simple action to continue the schedule from where it was after 
these humans solved the issues and ensure consistency.

 In order to deal with case 1., you can regularly (e.g. each minute)
 dump the output of schedule list (possibly suppressing non-important
 data through the options to keep it short).

 Or all the necessary information is kept in-sync on persistent storage.
 This would then also allow easy fail-over if the master-schedule-node
 fails

No, it wouldn't, since jobs just finishing and wanting to report their
status cannot do this when there is no server. You would need a rather
involved protocol to deal with such situations dynamically.
It can certainly be done, but it is not something which can
easily be added as a feature: If this is required, it has to be the
fundamental concept from the very beginning and everything else has to
follow this first aim. You need different protocols than TCP sockets,
to start with; something like dbus over IP with servers being able
to announce their new presence, etc.

I think it's doable with standard networking protocols.
But, either you have a master server which controls everything. Or you have a 
master process which has failover functionality using classical distributed 
software techniques.

These emails are actually quite useful as I am getting a clear picture in my 
head on how I could approach this properly.

Thanks,

Joost




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-04 Thread Alan McKinnon
On 04/08/2014 21:46, J. Roeleveld wrote:
 On 4 August 2014 15:35:41 CEST, Alan McKinnon alan.mckin...@gmail.com wrote:
 On 04/08/2014 15:31, Martin Vaeth wrote:
 J. Roeleveld jo...@antarean.org wrote:

 So you have a command which might break due to hardware error
 and cannot be rerun. I cannot see how any general-purpose scheduler
 might help you here: You either need to be able to split your command
 for your particular command.

 A general-purpose scheduler can work, as they do exist.

 I doubt that they can solve your problem.
 Let me repeat: You have a single program which accesses the database
 in a complex way and somewhere in the course of accessing it, the
 machine (or program) crashes.
 No general-purpose program can recover from this: You need
 particular knowledge of the database and the program if you even
 want to have a *chance* to recover from such a situation.
 A program with such a particular knowledge can hardly be called
 general-purpose.


 Joost,

 Either make the ETL tool pick up where it stopped and continue, as it is
 the only one that knows what it was doing and how far it got. Or, wrap the
 entire script in a single transaction.
 
 Alan,
 
 That would be the ideal solution.

You have the same concerns I do - how do you make a transaction around
500 million rows. So I asked the in-house expert - Mrs Alan :-)


 However, a single transaction dealing with around 500,000,000 rows will get 
 me shot by the DBAs :)
 (Never mind that the performance of this will be such that having it all done 
 by an office full of secretaries might be quicker.)

She reckons an ETL job *must* be self-contained; if it isn't then it's
broken by design. It must be idempotent too, which can be as simple as
Truncate, Load, Commit
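
A minimal sketch of that "Truncate, Load, Commit" pattern, using Python's sqlite3 module so it runs anywhere. The table name staging is an assumption, and DELETE stands in for TRUNCATE because SQLite has no TRUNCATE statement; on other databases TRUNCATE may not be transactional, so check before relying on rollback.

```python
import sqlite3


def reload_table(conn, rows):
    """Replace the whole contents of staging in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute("DELETE FROM staging")  # stands in for TRUNCATE
        conn.executemany(
            "INSERT INTO staging (id, value) VALUES (?, ?)", rows
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, value TEXT)")
rows = [(1, "a"), (2, "b")]
reload_table(conn, rows)
reload_table(conn, rows)  # rerunning leaves the same end state
```

Because the job always starts from an empty table, running it once or several times gives the same result, which is what makes a blind restart safe.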

 Having the ETL process clever enough to be able to pick up from any point 
 requires a degree of forward thinking and planning that is never done in real 
 life.
 I would love to design it like that as it isn't too difficult. But I always 
 get brought into these projects when implementing these structures will 
 require a full rewrite and getting the original architects to admit their 
 design can't be made restartable without human intervention.


I agree with that design actually - it's the job of the hardware and OS
guys to make stuff reliable that the application layer can rely on. When
a SAN connection goes away, it usually comes back and the app layer just
carries on (never mind that it retried 100 times meanwhile).

Sometimes this doesn't work out. The easiest, cheapest and quickest way
to handle it is to just restart the whole job from the beginning. This
offends the engineer in us sometimes, but it really is the best way and
all of Unix is built on this very idea :-)

If the SAN goes away too often and it causes issues, then maybe the best
approach is to get the SAN and facilities guys to get their act together

 At which point the business simply says it is acceptable to have people do a 
 manual rollback and restart the schedules from wherever it went wrong.

Exactly. One of the few cases where business has the correct idea.
There's only so many pennies to spend and so many dollars to be delivered.


 
 I'm sure your wife has similar experiences as this is why these projects are 
 always late to deliver and over budget.

She says her projects are subject to the same universal inviolate rule
as mine:

time and cost is always best engineering estimate times pi

We learn to deal with it. Which brings us back to Martin's initial
statement: a scheduler cannot deal with any of this, the job itself
must. It's an unpredictable event and schedulers can only deal with
predictable events


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread Joost Roeleveld
On Saturday 02 August 2014 16:53:26 James wrote:
 Alan McKinnon alan.mckinnon at gmail.com writes:
  Well, we've found 2 projects that at least in part seek to achieve our
  general goals - chronos and Martin's new project.
  Why don't we both fool around with them for a bit and get a sense of
  what it will take to add features etc? Then we can meet back here and
  discuss. Always better to build on an existing foundation
 
 Mesos looks promising for a variety of (Apache) reasons. Some key
 technologies folks may want google about that are related:
 
 Quincy (fair scheduler)
 Chronos (scheduler)
 Hadoop (scheduler)

Hadoop is not a scheduler. It's a framework for a Big Data clustered database.

 HDFS (clustered file system)

Unless it's changed recently, not suitable for anything other than Hadoop and 
contains a single point of failure.

 http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
 
 Zookeeper (Fault tolerance)
 SPARK (optimized for iterative jobs where a dataset is reused in many
 parallel operations; advanced math/science and many other apps)
 https://spark.apache.org/
 
 Dryad  Torque   MPICH2 MPI
 Globus toolkit
 
 mesos_tech_report.pdf
 
 It looks as though Amazon, google, facebook and many others
 large in the Cluster/Cloud arena are using Mesos..?
 
 So let's all post what we find, particularly in overlays.

Unless you are dealing with Big Data projects, like Google, Facebook, Amazon, 
big banks,... you don't have much use for those projects.

Mesos looks like a nice project, just like Hadoop and related are also nice. 
But for most people, they are as useful as using Exalytics.

A scheduler should not have a large set of dependencies that you wouldn't use 
otherwise. That makes Chronos a non-option to me.

Martin's project looks promising, but doesn't store the schedules internally. 
For repeating schedules, like what Alan was describing, you need to put those 
into scripts and start those from an existing cron.

Of the 2, I think improving Martin's project is the most likely option for me 
as it doesn't have additional dependencies and seems to be easily implemented.

--
Joost



[gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread Martin Vaeth
J. Roeleveld jo...@antarean.org wrote:

 Depends on the specific requirements.
 If you want:

In a sense, most of what you require can be done with the schedule
tool I mentioned, although perhaps the usage is not in the way you expected.
I reorder your points for a clearer explanation:

 - have schedules operate over multiple machines (eg. part run on 
 database, some on a compute-cluster, some other bit making nice graphs 
 and printing it,...)

Since schedule can use TCP for communication, this should not be
a problem if you let schedule-server listen world-wide
(export SCHEDULE_SERVER_OPTS=-a0.0.0.0)

For the actual scheduling you must set up your machines correspondingly:
Queue on one machine the task doing the database access you want
(with schedule -a[serveraddress] queue command_to_access_database)
and similarly on the other machines.
Of course, ssh or anything else can be used to do this without
physically accessing the machines.

Then, on one machine (not necessarily that of the server),
you run an appropriate driver script.

 - time based start of a schedule
 - dependencies in said schedules and between schedules which can delay
 the actual start
 - stop of schedule if error occurs

All this is not a problem, since the driver script is just a
shell script which calls schedule to start the tasks,
waits for them to finish and/or checks their exit status.
This is perhaps inconvenient but has the advantage of being
absolutely flexible:
You can use all linux tools like sleep (or also use at or cron)
to get any delays you want, do tests more powerful than checking
the exit status etc.

 - ability to restart schedule from crashed point

Running not-yet-started jobs after a crash is not a problem -
you just edit your driver script appropriately and restart it.
Jobs which were already running need to be re-queued if they
should be running again.
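
As a rough illustration of such a driver, here is a minimal sketch in Python rather than shell. It does not call schedule at all: the steps are arbitrary command lines, and run_driver and the JSON state file are hypothetical names. It stops at the first non-zero exit status and remembers finished steps, so a restart skips them instead of requiring the script to be edited by hand.

```python
import json
import os
import subprocess
import sys


def run_driver(steps, state_file):
    """Run (name, argv) steps in order; return the failed step's name, or None."""
    done = []
    if os.path.exists(state_file):
        with open(state_file) as fh:
            done = json.load(fh)  # steps finished by an earlier run
    for name, argv in steps:
        if name in done:
            continue
        if subprocess.run(argv).returncode != 0:
            return name  # stop the schedule at the first error
        done.append(name)
        with open(state_file, "w") as fh:
            json.dump(done, fh)
    return None
```

Delays and time-based starts would still come from sleep, at or cron around the driver, as described above. A step that was running when the machine crashed is not in the state file and is therefore simply run again.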




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread J. Roeleveld
On Sunday, August 03, 2014 07:50:57 AM Martin Vaeth wrote:
 J. Roeleveld jo...@antarean.org wrote:
  Depends on the specific requirements.
 
  If you want:
 In a sense, most you require can be done with my mentioned schedule
 tool, although perhaps the usage is not in the way you expected.

I agree, based on a quick look.

 I reorder your points for a clearer explanation:

snipped explanation

A useful addition to your schedule-tool would be to store the scripts in a way 
that makes editing simpler and then add an editing tool to make this process 
simpler.
Add monitoring (email alerts, webpage, front-end) to check the status of all 
the batch-jobs.


I might be mistaken, but I think the server keeps the entire queue in-memory 
and when the process dies, the status is lost?
Or is it kept somewhere?

--
Joost



[gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread Martin Vaeth
J. Roeleveld jo...@antarean.org wrote:

 A useful addition to your schedule-tool would be to store the
 scripts in a way that makes editing simpler

Since it is an arbitrary script in an arbitrary language,
I think doing this is outside the scope of this project.
In most cases where I have used it so far, 1-2 more or less complex lines
(maybe a few more if they were not complex)
in an interactive zsh were enough, and these are simple
enough to edit in zsh, i.e. I did not even write a script file
in the classical sense.

 I might be mistaken, but I think the server keeps the entire
 queue in-memory and when the process dies, the status is lost?

Yes, the server process must not die.

If it dies, not only the queue is lost but also the waiting processes
(that is: queued but not yet started) cannot be reached anymore:
These waiting processes do not have their own TCP socket but just
keep their established connection with the server's socket until
the server tells them through this connection to start or to cancel;
if this connection gets lost, the waiting processes die:
What else could they do, reasonably?

The already started processes have a unique ID (into which the
server's process is encoded): They reestablish the connection to report
the exit status according to this ID. If the server is stopped,
they cannot report this status, of course, and moreover,
a new server does not know their IDs either and thus will ignore these
status reports.
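
To make the consequence concrete, here is a toy model in Python of IDs that encode the server's process. The "pid-counter" ID format and the Server class are assumptions for illustration, not schedule's actual protocol; there are no sockets here, only the bookkeeping.

```python
import itertools
import os


class Server:
    """Toy model: job IDs carry the server's PID, so IDs from an old server are unknown."""

    def __init__(self, pid=None):
        self.pid = os.getpid() if pid is None else pid
        self._counter = itertools.count(1)
        self.running = {}   # job id -> command
        self.finished = {}  # job id -> exit status

    def start(self, command):
        """Hand out an ID of the (assumed) form '<server pid>-<n>'."""
        job_id = "%d-%d" % (self.pid, next(self._counter))
        self.running[job_id] = command
        return job_id

    def report(self, job_id, status):
        """Record an exit status; return False for IDs this server never issued."""
        if job_id not in self.running:
            return False
        del self.running[job_id]
        self.finished[job_id] = status
        return True
```

A job started by one Server instance and reported to a freshly started one is rejected, which is the situation described above: the new server has no record of the old IDs.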

Maybe this protocol is not the most clever solution, but it is
one which could be implemented without lots of overhead:
Mainly, I was after a quick solution which works well enough
for me: If the server has no bugs, why should it die?
Moreover, if the server dies for some strange reasons, it is probably
safer to re-queue the jobs again, anyway.




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread Alan McKinnon
On 03/08/2014 09:23, Joost Roeleveld wrote:
 On Saturday 02 August 2014 16:53:26 James wrote:
 Alan McKinnon alan.mckinnon at gmail.com writes:
 Well, we've found 2 projects that at least in part seek to achieve our
 general goals - chronos and Martin's new project.
 Why don't we both fool around with them for a bit and get a sense of
 what it will take to add features etc? Then we can meet back here and
 discuss. Always better to build on an existing foundation

 Mesos looks promising for a variety of (Apache) reasons. Some key
 technologies folks may want google about that are related:

 Quincy (fair scheduler)
 Chronos (scheduler)
 Hadoop (scheduler)
 
 Hadoop is not a scheduler. It's a framework for a Big Data clustered database.
 
 HDFS (clustered file system)
 
 Unless it's changed recently, not suitable for anything other than Hadoop and 
 contains a single point of failure.
 
 http://gpo.zugaina.org/sys-cluster/apache-hadoop-common

 Zookeeper (Fault tolerance)
 SPARK (optimized for iterative jobs where a dataset is reused in many
 parallel operations; advanced math/science and many other apps)
 https://spark.apache.org/

 Dryad  Torque   MPICH2 MPI
 Globus toolkit

 mesos_tech_report.pdf

 It looks as though Amazon, google, facebook and many others
 large in the Cluster/Cloud arena are using Mesos..?

 So let's all post what we find, particularly in overlays.
 
 Unless you are dealing with Big Data projects, like Google, Facebook, Amazon, 
 big banks,... you don't have much use for those projects.


My wife works in BigData for real, she and Joost speak the same
language, I don't :-)
She reckons Big Data is like teenage sex - everyone says they are doing
it and no-one really does ;-D


 Mesos looks like a nice project, just like Hadoop and related are also nice. 
 But for most people, they are as useful as using Exalytics.

A bit OT, but it might be worthwhile for interested persons to get good
ebuilds going for these projects. Someone will use it on Gentoo, and it
will add value to the project. Much like gems and other
business-oriented packages benefit


 
 A scheduler should not have a large set of dependencies that you wouldn't use 
 otherwise. That makes Chronos a non-option to me.
 
 Martin's project looks promising, but doesn't store the schedules internally. 
 For repeating schedules, like what Alan was describing, you need to put those 
 into scripts and start those from an existing cron.

Sounds like a small feature-add. If Martin did his groundwork
correctly[1] then the core logic will work and it's just a case of
adding some persistence and loading the data back in on demand

 Of the 2, I think improving Martin's project is the most likely option for me 
 as it doesn't have additional dependencies and seems to be easily implemented.

Don't forget Martin is the guy who does eix.
Street cred? check
Knows Gentoo? check





[1] I only say it this way as I haven't evaluated his code at all yet so
have no idea how far Martin has taken it


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread J. Roeleveld
On Sunday, August 03, 2014 02:16:37 PM Alan McKinnon wrote:
 On 03/08/2014 09:23, Joost Roeleveld wrote:
  On Saturday 02 August 2014 16:53:26 James wrote:
  Alan McKinnon alan.mckinnon at gmail.com writes:

snipped

  Unless you are dealing with Big Data projects, like Google, Facebook,
  Amazon, big banks,... you don't have much use for those projects.
 
 My wife works in BigData for real, she and Joost speak the same
 language, I don't :-)
 She reckons Big Data is like teenage sex - everyone says they are doing
 it and no-one really does ;-D

I know a few companies that actually do use it.
But the biggest issue with the whole Big Data thing is that no one really 
agrees on what it actually is.

  Mesos looks like a nice project, just like Hadoop and related are also
  nice. But for most people, they are as useful as using Exalytics.
 
 A bit OT, but it might be worthwhile for interested persons to get good
 ebuilds going for these projects. Someone will use it on Gentoo, and it
 will add value to the project. Much like gems and other
 business-oriented packages benefit

I agree, but just to implement a decent scheduler, I still think it's 
overkill.

  A scheduler should not have a large set of dependencies that you wouldn't
  use otherwise. That makes Chronos a non-option to me.
  
  Martin's project looks promising, but doesn't store the schedules
  internally. For repeating schedules, like what Alan was describing, you
  need to put those into scripts and start those from an existing cron.
 
 Sounds like a small feature-add. If Martin did his groundwork
 correctly[1] then the core logic will work and it's just a case of
 adding some persistence and loading the data back in on demand

The code looks clean and I think it shouldn't be too much work to add it.

  Of the 2, I think improving Martin's project is the most likely option for
  me as it doesn't have additional dependencies and seems to be easily
  implemented.
 Don't forget Martins is the guy who does eix.
 Street cred? check
 Knows Gentoo? check
 
 [1] I only say it this way as I haven't evaluated his code at all yet so
 have no idea how far Martin has taken it

The code is clean and does what Martin says it does.

--
Joost



Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread J. Roeleveld
On Sunday, August 03, 2014 12:10:49 PM Martin Vaeth wrote:
 J. Roeleveld jo...@antarean.org wrote:
  A useful addition to your schedule-tool would be to store the
  scripts in a way that makes editing simpler
 
 Since it is an arbitrary script in an arbitrary language,
 I think this is not in the scope of this project to do this.
 In most cases I used it so far, 1-2 more or less complex lines
 (maybe a few more if they would not be complex)
 in an interactive zsh were enough, and these are very simple
 enough to edit in zsh, i.e. I even did not write any script file
 in the classical sense.
 
  I might be mistaken, but I think the server keeps the entire
  queue in-memory and when the process dies, the status is lost?
 
 Yes, the server process must not die.
 
 If it dies, not only the queue is lost but also the waiting processes
 (that is: queued but not yet started) cannot be reached anymore:
 These waiting processes do not have their own TCP socket but just
 keep their established connection with the server's socket until
 the server tells them through this connection to start or to cancel;
 if this connection gets lost, the waiting processes die:
 What else could they do, reasonably?
 
 The already started processes have a unique ID (into which the
 server's process is encoded): They reestablish the connection to report
 the exit status according to this ID. If the server is stopped,
 they cannot report this status, of course, and moreover,
 a new server does not know their IDs either and thus will ignore these
 status reports.
 
 Maybe this protocol is not the most clever solution, but it is
 one which could be implemented without lots of overhead:
 Mainly, I was up to a quick solution which is working good enough
 for me: If the server has no bugs, why should it die?
 Moreover, if the server dies for some strange reasons, it is probably
 safer to re-queue the jobs again, anyway.

With the kind of schedules I am working with (and I believe Alan will also end 
up with), restarting the whole process from the start can lead to issues.
Finding out how far the process got before the service crashed can become 
rather complex.

--
Joost



Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread Alan McKinnon
On 03/08/2014 15:36, J. Roeleveld wrote:
 Maybe this protocol is not the most clever solution, but it is
  one which could be implemented without lots of overhead:
  Mainly, I was up to a quick solution which is working good enough
  for me: If the server has no bugs, why should it die?
  Moreover, if the server dies for some strange reasons, it is probably
  safer to re-queue the jobs again, anyway.

 With the kind of schedules I am working with (and I believe Alan will also 
 end up with), restarting the whole process from the start can lead to issues.
 Finding out how far the process got before the service crashed can become 
 rather complex.

Yes, very much so. My first concern is the database cleanups - without
scheduler guarantees I'd need transactions in MySQL.


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread J. Roeleveld
On Sunday, August 03, 2014 10:04:50 PM Alan McKinnon wrote:
 On 03/08/2014 15:36, J. Roeleveld wrote:
  Maybe this protocol is not the most clever solution, but it is
  
   one which could be implemented without lots of overhead:
   Mainly, I was up to a quick solution which is working good enough
   for me: If the server has no bugs, why should it die?
   Moreover, if the server dies for some strange reasons, it is probably
   safer to re-queue the jobs again, anyway.
  
  With the kind of schedules I am working with (and I believe Alan will also
  end up with), restarting the whole process from the start can lead to
  issues. Finding out how far the process got before the service crashed
  can become rather complex.
 
 Yes, very much so. My first concern is the database cleanups - without
 scheduler guarantees I'd need transactions in MySQL.

Or you migrate to PostgreSQL, but that is OT :)

--
Joost



Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread Alan McKinnon
On 03/08/2014 22:23, J. Roeleveld wrote:
 On Sunday, August 03, 2014 10:04:50 PM Alan McKinnon wrote:
 On 03/08/2014 15:36, J. Roeleveld wrote:
 Maybe this protocol is not the most clever solution, but it is

 one which could be implemented without lots of overhead:
 Mainly, I was up to a quick solution which is working good enough
 for me: If the server has no bugs, why should it die?
 Moreover, if the server dies for some strange reasons, it is probably
 safer to re-queue the jobs again, anyway.

 With the kind of schedules I am working with (and I believe Alan will also
 end up with), restarting the whole process from the start can lead to
 issues. Finding out how far the process got before the service crashed
 can become rather complex.

 Yes, very much so. My first concern is the database cleanups - without
 scheduler guarantees I'd need transactions in MySQL.
 
 Or you migrate to PostgreSQL, but that is OT :)


Maybe, but also valid :-)

I took one look at the schemas here and wondered Why MySQL? This is
Postgres territory. It's a case of LAMP tunnel vision.





-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-03 Thread J. Roeleveld
On Sunday, August 03, 2014 10:57:06 PM Alan McKinnon wrote:
 On 03/08/2014 22:23, J. Roeleveld wrote:
  On Sunday, August 03, 2014 10:04:50 PM Alan McKinnon wrote:
  On 03/08/2014 15:36, J. Roeleveld wrote:
  Maybe this protocol is not the most clever solution, but it is
  
  one which could be implemented without lots of overhead:
  Mainly, I was up to a quick solution which is working good enough
  for me: If the server has no bugs, why should it die?
  Moreover, if the server dies for some strange reasons, it is probably
  safer to re-queue the jobs again, anyway.
  
  With the kind of schedules I am working with (and I believe Alan will also
  end up with), restarting the whole process from the start can lead to
  issues. Finding out how far the process got before the service crashed
  can become rather complex.
  
  Yes, very much so. My first concern is the database cleanups - without
  scheduler guarantees I'd need transactions in MySQL.
  
  Or you migrate to PostgreSQL, but that is OT :)
 
 Maybe, but also valid :-)
 
 I took one look at the schemas here and wondered Why MySQL? This is
 Postgres territory. It's a case of LAMP tunnel vision.

That and that people who start with LAMP don't learn SQL.
This leads to code that is near impossible to port to a different database and 
when people actually want to do all the work to get the SQL to work on any 
database, the projects involved refuse the patches.

--
Joost



Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-02 Thread Alan McKinnon
On 01/08/2014 21:35, cov...@ccs.covici.com wrote:
 Alan McKinnon alan.mckin...@gmail.com wrote:
 
 On 01/08/2014 20:17, James wrote:
 Alan McKinnon alan.mckinnon at gmail.com writes:


 New job, new environment. Existing persons suffer from
 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
 ills. Result: a towering edifice of cron jobs that may or may not
 clobber each other's work, may or may not work at all, and implement no
 error handling at all. But my god, can they spew out mail from STDOUT

 Sounds like a department full of computer scientists I inherited a few
 decades ago...

 I've met folks like that
 Brilliant in their chosen field but completely useless outside it? The
 kind of fellows who see nothing wrong with eating a barbeque'd steak
 with a spoon because they can get a result?


 I know nothing about chronos, but I find it an interesting read, ymmv.


 http://nerds.airbnb.com/introducing-chronos/
 http://airbnb.github.io/chronos/
 https://github.com/airbnb/chronos

 Aaaah, now this sounds like something I can use. Proper dependency
 chains, Restful JSON interface so the devs can write code to drive it in
 automation.

 Good find, thanks!
 
 Unless I am missing something, chronos is not in the tree at all.
 

Correct, it isn't in the tree. But there's nothing stopping me from
getting it in there

-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-02 Thread Alan McKinnon
On 01/08/2014 23:02, Martin Vaeth wrote:
 Alan McKinnon alan.mckin...@gmail.com wrote:

 But cron has only one event trigger: wall-clock time. And it's a very
 blunt weapon. I'm looking for recommendations of alternative schedulers
 that satisfy real-world business needs that need some other event
 trigger than what the time is right now.
 
 I had a similar need recently, and since the discussion in
 
 https://forums.gentoo.org/viewtopic-t-992780-highlight-.html

Interesting thread :-)

Conceptually, your needs are the same as mine - sequence defined by
something other than wall-clock time.
The responders there do the same thing I keep running into: tunnel vision
with regard to cron. Sysadmins are used to cron, and sadly most of us
want to ram a purely cron-based solution into places where it most
certainly does not belong.

Business rules very seldom fit easily into a cron model; they usually
rely on a defined sequence.
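
A "defined sequence" of this kind can be sketched in a few lines: each job
runs only after all of its prerequisites have succeeded, and anything
downstream of a failure is skipped rather than run blindly at a fixed
time. This is an illustration only, not how any of the tools discussed in
this thread work. It assumes Python 3.9+ (for graphlib), and the function
name and job names are made up.

```python
from graphlib import TopologicalSorter


def run_in_order(jobs, deps):
    """Run jobs in dependency order and report what happened to each.

    jobs: dict mapping a job name to a zero-argument callable that
          returns True on success and False on failure.
    deps: dict mapping a job name to the set of job names it depends on.
          Assumption: every prerequisite is itself a key in jobs.
    Returns a dict mapping each job name to "ok", "failed" or "skipped".
    """
    graph = {name: set(deps.get(name, ())) for name in jobs}
    status = {}
    # static_order() yields every job after all of its prerequisites.
    for name in TopologicalSorter(graph).static_order():
        if any(status.get(dep) != "ok" for dep in graph[name]):
            status[name] = "skipped"  # a prerequisite failed or was skipped
        elif jobs[name]():
            status[name] = "ok"
        else:
            status[name] = "failed"
    return status
```

With cron, the equivalent is usually a guess such as "the export is
probably done by 02:30"; here the report simply never starts if the
cleanup before it failed.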


 
 had led to nothing satisfactory for me, I have written a
 scheduler tool which serves my needs
 (which might very well differ from yours...):
 
 The corresponding tool is still in beta testing phase:
 https://github.com/vaeth/schedule/
 
 You can install it from the mv overlay (available over layman).

Nice, thanks for the link :-)

Now I have two projects to evaluate.


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-02 Thread J. Roeleveld
On Saturday, August 02, 2014 11:18:32 AM Alan McKinnon wrote:
 On 01/08/2014 21:35, cov...@ccs.covici.com wrote:
  Alan McKinnon alan.mckin...@gmail.com wrote:
  On 01/08/2014 20:17, James wrote:
  Alan McKinnon alan.mckinnon at gmail.com writes:
  New job, new environment. Existing persons suffer from
  5-year-old-with-a-hammer syndrome and assume cron is the solution to
  all ills. Result: a towering edifice of cron jobs that may or may not
  clobber each other's work, may or may not work at all, and implement no
  error handling at all. But my god, can they spew out mail from STDOUT
  
  Sounds like a department full of computer scientists I inherited a few
  decades ago...
  
  I've met folks like that
  Brilliant in their chosen field but completely useless outside it? The
  kind of fellows who see nothing wrong with eating a barbeque'd steak
  with a spoon because they can get a result?
  
  I know nothing about chronos, but I find it an interesting read, ymmv.
  
  
  http://nerds.airbnb.com/introducing-chronos/
  http://airbnb.github.io/chronos/
  https://github.com/airbnb/chronos
  
  Aaaah, now this sounds like something I can use. Proper dependency
  chains, Restful JSON interface so the devs can write code to drive it in
  automation.
  
  Good find, thanks!
  
  Unless I am missing something, chronos is not in the tree at all.
 
 Correct, it isn't in the tree. But there's nothing stopping me from
 getting it in there

Neither are the dependencies. 

If you get it to work, don't forget to write a nice howto; from what I
found online, the documentation is incomplete and out of date.

--
Joost


[gentoo-user] Re: Recommendations for scheduler

2014-08-02 Thread James
Alan McKinnon alan.mckinnon at gmail.com writes:


 Well, we've found 2 projects that at least in part seek to achieve our
 general goals - chronos and Martin's new project.
 Why don't we both fool around with them for a bit and get a sense of
 what it will take to add features etc? Then we can meet back here and
 discuss. Always better to build on an existing foundation

Mesos looks promising for a variety of (Apache) reasons. Some key
technologies folks may want to google that are related:

Quincy (fair scheduler)
Chronos (scheduler)
Hadoop (scheduler)
HDFS (clustered file system)
http://gpo.zugaina.org/sys-cluster/apache-hadoop-common

Zookeeper (fault tolerance)
Spark (optimized for iterative jobs where a dataset is reused in many
parallel operations: advanced math/science and many other apps)
https://spark.apache.org/

Dryad, Torque, MPICH2, MPI
Globus Toolkit
mesos_tech_report.pdf

It looks as though Amazon, Google, Facebook and many other large players
in the cluster/cloud arena are using Mesos..?

So let's all post what we find, particularly in overlays.

hth,
James




[gentoo-user] Re: Recommendations for scheduler

2014-08-01 Thread James
Alan McKinnon alan.mckinnon at gmail.com writes:


 New job, new environment. Existing persons suffer from
 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
 ills. Result: a towering edifice of cron jobs that may or may not
 clobber each other's work, may or may not work at all, and implement no
 error handling at all. But my god, can they spew out mail from STDOUT

Sounds like a department full of computer scientists I inherited a few
decades ago...

I know nothing about chronos, but I find it an interesting read, ymmv.


http://nerds.airbnb.com/introducing-chronos/
http://airbnb.github.io/chronos/
https://github.com/airbnb/chronos


cheers mate!

James






Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-01 Thread Alan McKinnon
On 01/08/2014 20:17, James wrote:
 Alan McKinnon alan.mckinnon at gmail.com writes:
 
 
 New job, new environment. Existing persons suffer from
 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
 ills. Result: a towering edifice of cron jobs that may or may not
 clobber each other's work, may or may not work at all, and implement no
 error handling at all. But my god, can they spew out mail from STDOUT
 
 Sounds like a department full of computer scientists I inherited a few
 decades ago...

I've met folks like that
Brilliant in their chosen field but completely useless outside it? The
kind of fellows who see nothing wrong with eating a barbeque'd steak
with a spoon because they can get a result?

 
 I know nothing about chronos, but I find it an interesting read, ymmv.
 
 
 http://nerds.airbnb.com/introducing-chronos/
 http://airbnb.github.io/chronos/
 https://github.com/airbnb/chronos

Aaaah, now this sounds like something I can use. Proper dependency
chains, and a RESTful JSON interface so the devs can write code to drive
it in automation.

Good find, thanks!
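
As a rough idea of what "devs can write code to drive it" might look like,
the sketch below builds (but does not send) an HTTP request that registers
a job depending on another job. The /scheduler/dependency path and the
name, command, parents and owner fields are taken from the Chronos README
as it stood around this time and should be treated as assumptions to check
against the current documentation. The host name, job names and the helper
function are hypothetical.

```python
import json
from urllib.request import Request


def dependent_job_request(base_url, name, command, parents, owner):
    """Build a POST request describing a job that runs after its parents.

    Assumption: the server accepts a JSON body with these field names at
    <base_url>/scheduler/dependency. Nothing is sent here; pass the result
    to urllib.request.urlopen() to actually submit it.
    """
    payload = {
        "name": name,
        "command": command,
        "parents": list(parents),  # jobs that must finish first
        "owner": owner,
    }
    return Request(
        base_url.rstrip("/") + "/scheduler/dependency",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Building the request separately from sending it keeps the payload easy to
inspect and test without a running Mesos cluster.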




 
 
 cheers mate!
 
 James
 
 
 
 
 
 


-- 
Alan McKinnon
alan.mckin...@gmail.com




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-01 Thread covici
Alan McKinnon alan.mckin...@gmail.com wrote:

 On 01/08/2014 20:17, James wrote:
  Alan McKinnon alan.mckinnon at gmail.com writes:
  
  
  New job, new environment. Existing persons suffer from
  5-year-old-with-a-hammer syndrome and assume cron is the solution to all
  ills. Result: a towering edifice of cron jobs that may or may not
  clobber each other's work, may or may not work at all, and implement no
  error handling at all. But my god, can they spew out mail from STDOUT
  
  Sounds like a department full of computer scientists I inherited a few
  decades ago...
 
 I've met folks like that
 Brilliant in their chosen field but completely useless outside it? The
 kind of fellows who see nothing wrong with eating a barbeque'd steak
 with a spoon because they can get a result?
 
  
  I know nothing about chronos, but I find it an interesting read, ymmv.
  
  
  http://nerds.airbnb.com/introducing-chronos/
  http://airbnb.github.io/chronos/
  https://github.com/airbnb/chronos
 
 Aaaah, now this sounds like something I can use. Proper dependency
 chains, Restful JSON interface so the devs can write code to drive it in
 automation.
 
 Good find, thanks!

Unless I am missing something, chronos is not in the tree at all.

-- 
Your life is like a penny.  You're going to lose it.  The question is:
How do you spend it?

 John Covici
 cov...@ccs.covici.com



[gentoo-user] Re: Recommendations for scheduler

2014-08-01 Thread Martin Vaeth
Alan McKinnon alan.mckin...@gmail.com wrote:

 But cron has only one event trigger: wall-clock time. And it's a very
 blunt weapon. I'm looking for recommendations of alternative schedulers
 that satisfy real-world business needs that need some other event
 trigger than what the time is right now.

I had a similar need recently, and since the discussion in

https://forums.gentoo.org/viewtopic-t-992780-highlight-.html

had led to nothing satisfactory for me, I have written a
scheduler tool which serves my needs
(which might very well differ from yours...):

The corresponding tool is still in beta testing phase:
https://github.com/vaeth/schedule/

You can install it from the mv overlay (available over layman).




Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-01 Thread J. Roeleveld
On 1 August 2014 20:17:05 CEST, James wirel...@tampabay.rr.com wrote:
Alan McKinnon alan.mckinnon at gmail.com writes:


 New job, new environment. Existing persons suffer from
 5-year-old-with-a-hammer syndrome and assume cron is the solution to all
 ills. Result: a towering edifice of cron jobs that may or may not
 clobber each other's work, may or may not work at all, and implement no
 error handling at all. But my god, can they spew out mail from STDOUT

Sounds like a department full of computer scientists I inherited a few
decades ago...

I know nothing about chronos, but I find it an interesting read, ymmv.


http://nerds.airbnb.com/introducing-chronos/
http://airbnb.github.io/chronos/
https://github.com/airbnb/chronos


cheers mate!

James

Looks interesting.
Apart from it requiring a clustered environment (mesos).

Unless I misunderstand the part where it says it runs on top of mesos?

--
Joost
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [gentoo-user] Re: Recommendations for scheduler

2014-08-01 Thread J. Roeleveld
On 1 August 2014 23:02:11 CEST, Martin Vaeth mar...@mvath.de wrote:
Alan McKinnon alan.mckin...@gmail.com wrote:

 But cron has only one event trigger: wall-clock time. And it's a very
 blunt weapon. I'm looking for recommendations of alternative schedulers
 that satisfy real-world business needs that need some other event
 trigger than what the time is right now.

I had a similar need recently, and since the discussion in

https://forums.gentoo.org/viewtopic-t-992780-highlight-.html

had led to nothing satisfactory for me, I have written a
scheduler tool which serves my needs
(which might very well differ from yours...):

The corresponding tool is still in beta testing phase:
https://github.com/vaeth/schedule/

You can install it from the mv overlay (available over layman).

Going to have a look at this soon.

What features does it currently have, and what are you planning on
adding?

--
Joost



[gentoo-user] Re: Recommendations for scheduler

2014-08-01 Thread Martin Vaeth
J. Roeleveld jo...@antarean.org wrote:
https://github.com/vaeth/schedule/

 What are the features it currently has already

This is hard to answer, since at first glance the whole thing
does not even look like a scheduler: It looks more like a means to
communicate with some server, but after the discussions in the
gentoo forums, it became clear, to my surprise, that this is all
that is needed for the use cases I had in mind:
The real scheduler driving the whole thing can be a tiny script
(in shell or any other language) which just communicates with
that server.
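
The general shape of that idea (a small server that only holds state, and
a tiny driver script that makes the actual scheduling decisions) can be
sketched as follows. To be clear, this is NOT the protocol or interface of
the schedule tool; the "queue"/"next" commands, the class and the function
names are all invented for illustration, and the README linked above is
the place to look for the real usage.

```python
import socket
import socketserver
import threading
from collections import deque


class QueueHandler(socketserver.StreamRequestHandler):
    """Answer one line per connection: 'queue <command>' or 'next'."""

    def handle(self):
        line = self.rfile.readline().decode("utf-8").strip()
        verb, _, arg = line.partition(" ")
        with self.server.lock:
            if verb == "queue" and arg:
                self.server.jobs.append(arg)
                reply = f"queued {len(self.server.jobs)}"
            elif verb == "next":
                reply = self.server.jobs.popleft() if self.server.jobs else "empty"
            else:
                reply = "error"
        self.wfile.write((reply + "\n").encode("utf-8"))


def start_server():
    """Start the state-holding server on a free local port in a thread."""
    server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), QueueHandler)
    server.jobs = deque()           # pending commands, oldest first
    server.lock = threading.Lock()  # handlers run in separate threads
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(address, line):
    """Client side: send one request line and return the one-line reply."""
    with socket.create_connection(address) as conn:
        conn.sendall((line + "\n").encode("utf-8"))
        return conn.makefile("r", encoding="utf-8").readline().strip()
```

A driver script then only needs a loop around ask(address, "next") plus
whatever condition it wants to wait for before running the returned
command; the server itself contains no scheduling logic at all.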

To understand whether this can solve your problems, it is
probably best if you look at the examples in the README
(and/or the mentioned discussion in the gentoo forum).

 and what are you planning on adding?

Since it is sufficient for my purposes, I am currently not
planning to add anything (except possibly bug fixes or if I run
into a problem which I cannot solve with it).
Patches for extensions are welcome, of course.
(Also suggestions without patches are welcome, but my time is
currently very limited, and I do not make any promises.)