[gentoo-user] Re: Recommendations for scheduler
On Tuesday, August 05, 2014 06:33:59 AM Martin Vaeth wrote:
> When you are at it you should probably also encrypt the communication

schedule-0.15 is finally able to use encryption, hence the current mild security risks will practically vanish, even when listening on a port that is open to the whole world. schedule-1.0 will probably be ready soon, with even stronger encryption.
Re: [gentoo-user] Re: Recommendations for scheduler
On Tuesday 05 August 2014 22:43:42 J. Roeleveld wrote:
> I still remember running seti@home and similar programs in the past. Those were large clusters, but with a very badly designed network.

Was that in the days before BOINC, Joost? Do you think it's any better now? I run 5 BOINC projects here in the same general area as SETI. They seem to work all right, except for getting changes in what they call computing preferences propagated around the projects.

(Just an aside - I don't want to hijack this interesting thread.)

--
Regards
Peter
Re: [gentoo-user] Re: Recommendations for scheduler
On Wednesday, August 06, 2014 09:29:53 AM Peter Humphrey wrote:
> On Tuesday 05 August 2014 22:43:42 J. Roeleveld wrote:
>> I still remember running seti@home and similar programs in the past. Those were large clusters, but with a very badly designed network.
>
> Was that in the days before BOINC, Joost? Do you think it's any better now? I run 5 BOINC projects here in the same general area as SETI. They seem to work all right, except for getting changes in what they call computing preferences propagated around the projects.
>
> (Just an aside - I don't want to hijack this interesting thread.)

Yes, I did it for a short period sometime in 1999. It worked all right; I just meant that running it on thousands of personal computers using dial-up to the internet makes for a badly designed network for a cluster.

--
Joost
[gentoo-user] Re: Recommendations for scheduler
J. Roeleveld jo...@antarean.org wrote:
>> No, it wouldn't, since jobs just finishing and wanting to report their status cannot do this when there is no server. You would need a rather involved protocol to deal with such situations dynamically. It can certainly be done, but it is not something which can easily be added as a feature: If this is required, it has to be the fundamental concept from the very beginning and everything else has to follow this first aim. You need different protocols than TCP sockets, to start with; something like dbus over IP with servers being able to announce their new presence, etc.
>
> I think it's doable with standard networking protocols.

Yes, you can tunnel such a protocol over existing protocols, but essentially you must use a different one. Unless you want a static setup (use server A, if that fails use server B, and server A reports everything to server B), it cannot be done in the simple way of having only one port open on the server: the client also needs a port open to be informed about the current server. Even worse, you need a daemon running for each client to handle this port. In such a case, you might make each client its own server, by spreading all changes to all clients immediately.

> But, either you have a master server which controls everything. Or you have a master process which has failover functionality using classical distributed software techniques.

This summarizes it quite well. The concept of my schedule is to follow the first path (with the advantage of being simple, having only one part, clients do nothing while their task is running). If you want to follow the latter, you need a rather different CLI and a different protocol - which is practically everything schedule consists of; so it is probably simpler to rewrite this from scratch. As I said: it is not a feature you can easily add later on; it is a fundamental decision you must make at the very beginning.

When you are at it, you should probably also encrypt the communication and establish methods for authentication, which is also something I currently omitted in schedule for simplicity (although this might be easier to add later on).
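The static setup mentioned above ("use server A, if that fails use server B") could look roughly like the following on the client side. This is only a sketch, not how schedule works: the function name, the server addresses and the injected `can_connect` check are all hypothetical, and the "server A reports everything to server B" half is not covered.

```python
from typing import Callable, Optional, Sequence


def pick_server(servers: Sequence[str],
                can_connect: Callable[[str], bool]) -> Optional[str]:
    """Return the first server, in fixed priority order, that accepts a connection.

    `can_connect` is injected so the policy can be tried without a network;
    a real client would attempt a TCP connection here instead.
    """
    for address in servers:
        if can_connect(address):
            return address
    return None  # no server reachable: the job cannot report its status


# Hypothetical addresses, for illustration only.
SERVERS = ["server-a:8471", "server-b:8471"]

if __name__ == "__main__":
    reachable = {"server-b:8471"}  # pretend server A is down
    print(pick_server(SERVERS, lambda address: address in reachable))
```

The list is fixed at configuration time, which is exactly why this stays simple: nothing has to announce a new server to the clients.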
Re: [gentoo-user] Re: Recommendations for scheduler
On Tuesday, August 05, 2014 06:33:59 AM Martin Vaeth wrote:
> J. Roeleveld jo...@antarean.org wrote:
>>> No, it wouldn't, since jobs just finishing and wanting to report their status cannot do this when there is no server. You would need a rather involved protocol to deal with such situations dynamically. It can certainly be done, but it is not something which can easily be added as a feature: If this is required, it has to be the fundamental concept from the very beginning and everything else has to follow this first aim. You need different protocols than TCP sockets, to start with; something like dbus over IP with servers being able to announce their new presence, etc.
>>
>> I think it's doable with standard networking protocols.
>
> Yes, you can tunnel such a protocol over existing protocols, but essentially you must use a different one. Unless you want a static setup (use server A, if that fails use server B, and server A reports everything to server B), it cannot be done in the simple way of having only one port open on the server: the client also needs a port open to be informed about the current server. Even worse, you need a daemon running for each client to handle this port. In such a case, you might make each client its own server, by spreading all changes to all clients immediately.

Not necessarily: the client listens on a port and the server connects to the clients it maintains. It then also knows when a client is dead and the corresponding jobs have an issue.

>> But, either you have a master server which controls everything. Or you have a master process which has failover functionality using classical distributed software techniques.
>
> This summarizes it quite well. The concept of my schedule is to follow the first path (with the advantage of being simple, having only one part, clients do nothing while their task is running). If you want to follow the latter, you need a rather different CLI and a different protocol - which is practically everything schedule consists of; so it is probably simpler to rewrite this from scratch. As I said: it is not a feature you can easily add later on; it is a fundamental decision you must make at the very beginning.
>
> When you are at it, you should probably also encrypt the communication and establish methods for authentication, which is also something I currently omitted in schedule for simplicity (although this might be easier to add later on).

I agree. schedule is good for most uses we might encounter. For the business case I have, I will need to write something myself. Thanks to this discussion we've been having, I now have a much better idea of how to approach this project. For that I am very thankful.

--
Joost
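The idea that a server which connects to its clients "also knows when a client is dead and the corresponding jobs have an issue" can be sketched as below. Everything here is an assumption made for illustration: the function name, the 30-second timeout and the dictionaries are hypothetical, and how the server actually polls its clients is left out.

```python
from typing import Dict, List


def jobs_at_risk(last_seen: Dict[str, float],
                 jobs: Dict[str, str],
                 now: float,
                 timeout: float = 30.0) -> List[str]:
    """Return the ids of jobs whose client has not answered within `timeout` seconds.

    last_seen maps client name -> time of the last successful contact.
    jobs maps job id -> name of the client running it.
    """
    dead_clients = {client for client, seen in last_seen.items()
                    if now - seen > timeout}
    return sorted(job for job, client in jobs.items()
                  if client in dead_clients)


if __name__ == "__main__":
    last_seen = {"node1": 100.0, "node2": 60.0}  # node2 has been silent for 40 s
    jobs = {"load_sales": "node1", "load_stock": "node2"}
    print(jobs_at_risk(last_seen, jobs, now=100.0))
```

A client that is merely slow looks the same as a dead one in this sketch, so the timeout has to be chosen with the longest expected pause in mind.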
Re: [gentoo-user] Re: Recommendations for scheduler
On Monday, August 04, 2014 10:38:57 PM Alan McKinnon wrote:
> On 04/08/2014 21:46, J. Roeleveld wrote:
>> On 4 August 2014 15:35:41 CEST, Alan McKinnon alan.mckin...@gmail.com wrote:
>>> Either make the ETL tool pick up where it stopped and continue, as it is the only one that knows what it was doing and how far it got. Or, wrap the entire script in a single transaction.
>>
>> Alan, That would be the ideal solution.
>
> You have the same concerns I do - how do you make a transaction around 500 million rows. So I asked the in-house expert - Mrs Alan :-)

Have a very large temporary tablespace on the database server.

>> However, a single transaction dealing with around 500,000,000 rows will get me shot by the DBAs :) (Never mind that the performance of this will be such that having it all done by an office full of secretaries might be quicker.)
>
> She reckons an ETL job *must* be self-contained; if it isn't then it's broken by design. It must be idempotent too, which can be as simple as Truncate, Load, Commit

Most common tactic (done by humans):
- delete from target table where INS_PCS_ID = crashed run-id;
- update target table set VLD_TO = null where UPD_PCS_ID = crashed run-id;
Then, restart the crashed run-id. For this, you need to know which command failed in order to find the actual run-id you need to roll back.

>> Having the ETL process clever enough to be able to pick up from any point requires a degree of forward thinking and planning that is never done in real life. I would love to design it like that as it isn't too difficult. But I always get brought into these projects when implementing these structures will require a full rewrite and getting the original architects to admit their design can't be made restartable without human intervention.
>
> I agree with that design actually - it's the job of the hardware and OS guys to make stuff reliable that the application layer can rely on. When a SAN connection goes away, it usually comes back and the app layer just carries on (never mind that it retried 100 times meanwhile).

Yes, until you find out the clustered FS being used causes the crashes... (Yes, been in that situation...)

> Sometimes this doesn't work out. The easiest, cheapest and quickest way to handle it is to just restart the whole job from the beginning. This offends the engineer in us sometimes, but it really is the best way and all of Unix is built on this very idea :-)

Which is generally done, usually requiring a manual clean-up prior to restart. If done properly, the ETL process has the capability to roll back the failed run prior to redoing it. This, however, requires extensive planning and design at the initial implementation phase.

> If the SAN goes away too often and it causes issues, then maybe the best approach is to get the SAN and facilities guys to get their act together

Instead of finger-pointing.

>> At which point the business simply says it is acceptable to have people do a manual rollback and restart the schedules from wherever it went wrong.
>
> Exactly. One of the few cases where business has the correct idea. There's only so many pennies to spend and so many dollars to be delivered.

Nightly processes that fail and then have to wait for the day-shift to arrive often cost the business more because the reports are delayed.

>> I'm sure your wife has similar experiences as this is why these projects are always late to deliver and over budget.
>
> She says her projects are subject to the same universal inviolate rule as mine: time and cost are always the best engineering estimate times pi

Overhead, testing, maintenance, yes, it all adds up. We learn to deal with it.

> Which brings us back to Martin's initial statement: a scheduler cannot deal with any of this, the job itself must. It's an unpredictable event and schedulers can only deal with predictable events

True, but keeping the schedules and state stored in a way that makes it easy to find out how far the whole process got makes recovery simpler. Otherwise it's often quicker to simply roll back the entire schedule and restart, even if only the last 2 of the 50 commands didn't run yet.

--
Joost
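The manual rollback tactic described above (delete the rows a crashed run inserted, reopen the rows it closed) can be sketched with an in-memory SQLite database. The column names INS_PCS_ID, UPD_PCS_ID and VLD_TO come from the thread; the table name `target`, the `id` column and the demo data are assumptions made for illustration, and a real warehouse would use its own schema and database.

```python
import sqlite3


def roll_back_run(conn: sqlite3.Connection, run_id: int) -> None:
    """Undo the effects of one crashed run; running it twice changes nothing more."""
    with conn:  # both statements commit together, or not at all
        # Rows the crashed run inserted are removed.
        conn.execute("DELETE FROM target WHERE INS_PCS_ID = ?", (run_id,))
        # Rows the crashed run closed are reopened.
        conn.execute("UPDATE target SET VLD_TO = NULL WHERE UPD_PCS_ID = ?",
                     (run_id,))


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny table: run 1 loaded row 1; run 2 closed row 1 and added row 2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE target (id INTEGER, VLD_TO TEXT, "
                 "INS_PCS_ID INTEGER, UPD_PCS_ID INTEGER)")
    conn.executemany("INSERT INTO target VALUES (?, ?, ?, ?)",
                     [(1, "2014-08-04", 1, 2), (2, None, 2, None)])
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    roll_back_run(conn, run_id=2)  # pretend run 2 crashed halfway through
    print(conn.execute("SELECT id, VLD_TO FROM target ORDER BY id").fetchall())
```

As noted in the thread, the hard part is not these two statements but knowing which run-id crashed, which is where stored scheduler state helps.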
[gentoo-user] Re: Recommendations for scheduler
Joost Roeleveld joost at antarean.org writes:
>> Mesos looks promising for a variety of (Apache) reasons. Some related key technologies folks may want to google:
>> Quincy (fair scheduler)
>> Chronos (scheduler)
>> Hadoop (scheduler)
>
> Hadoop is not a scheduler. It's a framework for a Big Data clustered database.
>
>> HDFS (clustered file system)
>
> Unless it's changed recently, it is not suitable for anything other than Hadoop and contains a single point of failure.

I'm curious to get more information about this 'single point of failure'. Can you be more specific or provide links? On this resource:
http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html
the section on JournalNode machines talks about surviving faults: to "increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally."

>> http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
>> Zookeeper (fault tolerance)
>> SPARK (optimized for iterative jobs where a dataset is reused in many parallel operations (advanced math/science and many other apps)) https://spark.apache.org/
>> Dryad
>> Torque
>> MPICH2
>> MPI
>> Globus toolkit
>> mesos_tech_report.pdf
>>
>> It looks as though Amazon, Google, Facebook and many other large players in the Cluster/Cloud arena are using Mesos..? So let's all post what we find, particularly in overlays.
>
> Unless you are dealing with Big Data projects, like Google, Facebook, Amazon, big banks,... you don't have much use for those projects.

Many scientific applications are using the cluster (cloud) or big data approach for all sorts of problems. Furthermore, as GPUs and the new ARM systems with dozens and dozens of CPU cores inside one computer become readily available, the cluster-cloud (big data) approach will become much more pervasive in the next few years, imho.
http://blog.rescale.com/reservoir-simulation-moves-to-the-cloud/
There are thousands of small companies needing reservoir simulation, not to mention the millions of folks working on carbon sequestration. Anything to do with Biological or Chemical Science is using or moving to the Cloud-Clustered world. For me, a Cluster is just a cloud managed internally, rather than outsourced to others; ymmv.

> Mesos looks like a nice project, just like Hadoop and related are also nice. But for most people, they are as useful as using Exalytics.

I'm not excited about an Oracle solution to anything. Many of the folks I know consult on moving technologies away from Oracle's sphere of influence, not limited to MySQL; ymmv. I know of one very large communications company that went broke and had to merge because of those ridiculous Oracle fees. Caveat Emptor; long live PostgreSQL.

> A scheduler should not have a large set of dependencies that you wouldn't use otherwise. That makes Chronos a non-option to me.

Those other technologies are often useful to folks who would be attracted to something like Chronos.

> Martin's project looks promising, but doesn't store the schedules internally. For repeating schedules, like what Alan was describing, you need to put those into scripts and start those from an existing cron. Of the 2, I think improving Martin's project is the most likely option for me as it doesn't have additional dependencies and seems to be easily implemented.
> Joost

Understood. Like others, I'll be curious to follow what develops out of Martin's work. For me Chronos, Mesos and the other aforementioned technologies look to be more viable, particularly if one is preparing for a clustered world with CPUs, GPUs, SoCs and ARM machines distributed about the ethernet as resources to be scheduled and utilized in a variety of schemes. It's the quest for one infrastructure to solve many problems where scenarios compete. Big data is not the only reason for cloud-clusters.

Theoretically, (clustered) systems can achieve far greater utilization of networked resources than traditional (distributed) approaches. I grant you that this is a work in progress, but I personally know of dozens of mathematically complex distributed systems that are migrating to the clustered approach rather than something custom or traditionally distributed. Granted, Cloud, Clustered and Distributed are all overlapping approaches to big problems.

I do appreciate the candor of this thread.

James
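The JournalNode rule quoted from the Hadoop page, that N JournalNodes tolerate at most (N - 1) / 2 failures, is easy to check with a few lines. The function name is made up for this illustration, and the division is taken as integer division, which is the usual reading for a majority quorum.

```python
def tolerated_failures(journal_nodes: int) -> int:
    """Number of JournalNodes that may fail while a majority is still available."""
    if journal_nodes < 1:
        raise ValueError("need at least one JournalNode")
    return (journal_nodes - 1) // 2


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(n, "JournalNodes tolerate", tolerated_failures(n), "failure(s)")
    # 3 -> 1, 4 -> 1, 5 -> 2, 7 -> 3
```

Going from 3 to 4 nodes buys nothing, which is why the documentation recommends an odd number.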
Re: [gentoo-user] Re: Recommendations for scheduler
On 5 August 2014 21:57:56 CEST, James wirel...@tampabay.rr.com wrote:
> Joost Roeleveld joost at antarean.org writes:
>>> Mesos looks promising for a variety of (Apache) reasons. Some related key technologies folks may want to google:
>>> Quincy (fair scheduler)
>>> Chronos (scheduler)
>>> Hadoop (scheduler)
>>
>> Hadoop is not a scheduler. It's a framework for a Big Data clustered database.
>>
>>> HDFS (clustered file system)
>>
>> Unless it's changed recently, it is not suitable for anything other than Hadoop and contains a single point of failure.
>
> I'm curious to get more information about this 'single point of failure'. Can you be more specific or provide links? On this resource:
> http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html
> the section on JournalNode machines talks about surviving faults: to "increase the number of failures the system can tolerate, you should run an odd number of JNs, (i.e. 3, 5, 7, etc.). Note that when running with N JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally."

Just read that part. Looks like they have partly solved it since 2.2. The problem lies with the NameNodes. Prior to 2.2, you only had 1. If that one dies, you lose the entire cluster. If that one is unrecoverable, you lose all the data. After 2.2, you can configure a standby NameNode. However, it still requires a manual restart. Considering that Hadoop is most often running on old machines, chances of hardware failure are higher than in clusters using newer hardware.

I'm not sure how other cluster FSs deal with this, but I consider it a design flaw if the disappearance of a single machine in a 100+ node cluster leaves the entire cluster in a broken state. It's like running a single RAID5 with 100+ drives. Anyone stupid enough to do that deserves to lose their data.

>>> http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
>>> Zookeeper (fault tolerance)
>>> SPARK (optimized for iterative jobs where a dataset is reused in many parallel operations (advanced math/science and many other apps)) https://spark.apache.org/
>>> Dryad
>>> Torque
>>> MPICH2
>>> MPI
>>> Globus toolkit
>>> mesos_tech_report.pdf
>>>
>>> It looks as though Amazon, Google, Facebook and many other large players in the Cluster/Cloud arena are using Mesos..? So let's all post what we find, particularly in overlays.
>>
>> Unless you are dealing with Big Data projects, like Google, Facebook, Amazon, big banks,... you don't have much use for those projects.
>
> Many scientific applications are using the cluster (cloud) or big data approach for all sorts of problems. Furthermore, as GPUs and the new ARM systems with dozens and dozens of CPU cores inside one computer become readily available, the cluster-cloud (big data) approach will become much more pervasive in the next few years, imho.
> http://blog.rescale.com/reservoir-simulation-moves-to-the-cloud/
> There are thousands of small companies needing reservoir simulation, not to mention the millions of folks working on carbon sequestration. Anything to do with Biological or Chemical Science is using or moving to the Cloud-Clustered world. For me, a Cluster is just a cloud managed internally, rather than outsourced to others; ymmv.

My apologies. I forgot the scientific research here. But that was mostly because they have been dealing with really large datasets and correspondingly large compute clusters for decades. The term Big Data is generally applied to financial and social media data.

>> Mesos looks like a nice project, just like Hadoop and related are also nice. But for most people, they are as useful as using Exalytics.
>
> I'm not excited about an Oracle solution to anything. Many of the folks I know consult on moving technologies away from Oracle's sphere of influence, not limited to MySQL; ymmv. I know of one very large communications company that went broke and had to merge because of those ridiculous Oracle fees. Caveat Emptor; long live PostgreSQL.

I'd be interested in the name of that company. Even offlist. And I definitely agree. PostgreSQL is often a valid alternative. Unfortunately, it is rarely possible to use it as a back end to enterprise software, as these are all designed to be used with databases from the usual suspects (Oracle, IBM, Microsoft, ...). The same goes for OSS projects. The developers are often unable to properly code the SQL layer and end up simply using MySQL and its broken SQL implementation.

>> A scheduler should not have a large set of dependencies that you wouldn't use otherwise. That makes Chronos a non-option to me.
>
> Those other technologies are often useful to folks who would be attracted to something like Chronos.

If you already use Mesos, using Chronos makes sense. If you're only interested in a scheduler, installing Mesos just to use Chronos doesn't make sense.

>> Martin's project looks promising, but doesn't store the schedules internally. For repeating
Re: [gentoo-user] Re: Recommendations for scheduler
On 05/08/2014 22:43, J. Roeleveld wrote:
> I believe Martin's scheduler will be very valuable. Even for me. I am very likely going to start using this for some of my regular maintenance activities on the home network. But as the rest of the thread shows, I wouldn't be able to use it as a scheduler for large projects where the schedules can get very complex very quickly.

Martin will be happy to know I think his work will fit my needs just nicely :-)

--
Alan McKinnon
alan.mckin...@gmail.com
[gentoo-user] Re: Recommendations for scheduler
J. Roeleveld jo...@antarean.org wrote:
> With the kind of schedules I am working with (and I believe Alan will also end up with), restarting the whole process from the start can lead to issues. Finding out how far the process got before the service crashed can become rather complex.

I am not sure whether I understand this correctly: schedule has no problem displaying which tasks have finished, failed, or are still running at any time. Of course, a finer granularity than tasks is not possible (how far has a certain task got?) because this would require knowledge about the task and how to check it - you need to be able to split your tasks into more shell commands to make a finer granularity available to schedule.

You can just rerun your driving script, with the effect that the tasks which have already finished/failed will not actually be restarted; instead, they behave as if they finished immediately and report that they are finished/failed. (If you plan to do this, I would suggest scheduling things like sleep as separate tasks, too, rather than building them into the driving script.)

If there is an unexpected problem and, e.g., you want to re-run a failed task anyway, you can just re-queue your new task in the same place as the previous task, e.g.

    schedule remove jobnr
    schedule -j jobnr queue command to do your task

Then the old job (and its state) is replaced by the newly queued job, and your driving script (identical to before) will start it instead of assuming that the job is already finished.

In order to avoid races, I would recommend doing the above only while your driving script is not running (e.g., you can suspend it with Ctrl-Z if you have written it in (...) or if it is really a classical script, and then continue it with fg; or you can even stop it completely with Ctrl-C and re-run it, depending on what you want): the problem is that between the above two commands the jobs after jobnr are renumbered.

Alternatively, you can insert your new job at the end of the job list and then use something like (untested)

    schedule -jjobnr insert 0 jobnr+1:-1
    schedule remove 0

to re-sort your job list: the insert is race-free, and having a job added at the end for some time will hopefully not disturb anything.
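The rerun behaviour described above, where tasks that already finished or failed are not started again and only a re-queued slot is executed, can be illustrated with a small stand-alone sketch. This is not schedule's implementation or its CLI: `run_tasks`, the `state` dictionary and the use of the Python interpreter as a stand-in for real commands are assumptions made purely for illustration.

```python
import subprocess
import sys
from typing import Dict, List


def run_tasks(tasks: List[List[str]], state: Dict[int, int]) -> Dict[int, int]:
    """Run tasks in order, skipping every slot that already has a recorded exit status.

    state maps task number -> exit status of the finished (or failed) task.
    """
    for number, argv in enumerate(tasks):
        if number in state:
            continue  # already finished/failed: behave as if it returned at once
        state[number] = subprocess.run(argv).returncode
    return state


if __name__ == "__main__":
    ok = [sys.executable, "-c", "pass"]
    broken = [sys.executable, "-c", "raise SystemExit(3)"]

    state = run_tasks([ok, broken], {})
    print(state)   # task 1 is recorded as failed

    state.pop(1)   # "remove" the failed job ...
    state = run_tasks([ok, ok], state)  # ... and queue a fixed command in its place
    print(state)   # task 0 was not run again; only task 1 was
```

Dropping the recorded status plays the role of the remove-and-queue step: the driving loop itself stays identical between runs.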
Re: [gentoo-user] Re: Recommendations for scheduler
On 4 August 2014 10:41:04 CEST, Martin Vaeth mar...@mvath.de wrote:
> J. Roeleveld jo...@antarean.org wrote:
>> With the kind of schedules I am working with (and I believe Alan will also end up with), restarting the whole process from the start can lead to issues. Finding out how far the process got before the service crashed can become rather complex.
>
> I am not sure whether I understand this correctly:

The schedules I am used to dealing with easily span 8 - 14 hours, occasionally even over a week. These schedules then also can't be restarted from the beginning when they stop halfway through without risking massive consistency problems in the final data. And then there are several of those starting at random times, with occasionally a whole bunch of the same schedule put into the queue with dependencies on the previous run.

If, during that time, one of the machines has a hardware failure or the scheduling process crashes on one or more of the servers, the last state needs to be recoverable. If you have to clean up the environment and bring it back to a state where you can restart the schedules, it saves time if you know which commands and tasks were actually running at the time.

For this, the schedules, queues and current state for each node need to be stored on persistent storage.

Hope this clarifies it all a bit.

--
Joost
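Keeping the current state of each node on persistent storage, as asked for above, needs little more than an atomic write so that a crash never leaves a half-written file behind. The following is a minimal sketch under assumed names: the JSON layout, `save_state`, `load_state` and the file name are all hypothetical and not taken from any of the tools discussed.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_state(path: str, state: Dict[str, Any]) -> None:
    """Write the state to a temporary file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data has reached the disk
    os.replace(tmp_path, path)     # readers see either the old or the new file


def load_state(path: str) -> Dict[str, Any]:
    """Return the last saved state, or an empty state if nothing was saved yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "scheduler-state.json")
        save_state(path, {"node1": {"task": "load_sales", "status": "running"}})
        print(load_state(path))
```

After a crash, the saved file answers the question raised in the message: which commands and tasks were running on which node at the time.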
[gentoo-user] Re: Recommendations for scheduler
J. Roeleveld jo...@antarean.org wrote:
> These schedules then also can't be restarted from the beginning when they stop halfway through without risking massive consistency problems in the final data.

So you have a command which might break due to a hardware error and cannot be rerun. I cannot see how any general-purpose scheduler might help you here: you either need to be able to split your command into several (sequential) commands or you need something adapted for your particular command.

> And then multiple of those starting at random times with occasionally a whole bunch of the same schedule put into the queue with dependencies to the previous run.

That's not a problem. It only becomes a problem if the granularity of one command is not fine enough.

> If, during that time, one of the machines has a hardware failure or the scheduling process crashes on one or more of the servers, the last state needs to be recoverable.

One must distinguish two cases:
1. The machine running schedule-server has a hardware failure. (Let us assume that schedule-server does not have a software failure - otherwise, you have problems anyway.)
2. Some other machine has a hardware failure.

Case 2. is not bad (as far as the scheduling is concerned): of course, the machine will not report that it completed the job, and you will have to think about how to complete the job. But it is clear that in such exceptional cases you have to intervene manually in some sense.

In order to deal with case 1., you can regularly (e.g. each minute) dump the output of schedule list (possibly suppressing non-important data through the options to keep it short). One could add a logging option to decrease the possible race of 1 minute, but in case of hardware failure a possible race cannot be excluded anyway. In case 1. you manually have to re-queue the jobs and think about what to do with the already started jobs. However, I cannot imagine that this occurs so frequently that this exceptional case becomes something one should seriously think about.
Re: [gentoo-user] Re: Recommendations for scheduler
On Monday, August 04, 2014 10:11:41 AM Martin Vaeth wrote:
> J. Roeleveld jo...@antarean.org wrote:
>> These schedules then also can't be restarted from the beginning when they stop halfway through without risking massive consistency problems in the final data.
>
> So you have a command which might break due to a hardware error and cannot be rerun. I cannot see how any general-purpose scheduler might help you here: you either need to be able to split your command into several (sequential) commands or you need something adapted for your particular command.

A general-purpose scheduler can work, as they do exist (with a price tag). In the OSS world, there is, to my knowledge, none. Yours seems to be the most promising, as it looks like the missing features shouldn't be too difficult to add.

The commands are relatively simple, but they deal with large amounts of data. I am talking about ETL processes that, due to the amount of data being processed, can easily take several hours per step. If, during one of these steps, the database or ETL process suffers a crash, the activities of the ETL process need to be rolled back to the point where you can restart it.

I am not talking about simple schedules related to day-to-day maintenance of a few servers.

>> And then multiple of those starting at random times with occasionally a whole bunch of the same schedule put into the queue with dependencies to the previous run.
>
> That's not a problem. It only becomes a problem if the granularity of one command is not fine enough.

If nothing goes wrong, it can all be stuck into a single script and the end result will be the same. Problems start because the real world is not 100% reliable.

>> If, during that time, one of the machines has a hardware failure or the scheduling process crashes on one or more of the servers, the last state needs to be recoverable.
>
> One must distinguish two cases:
> 1. The machine running schedule-server has a hardware failure. (Let us assume that schedule-server does not have a software failure - otherwise, you have problems anyway.)
> 2. Some other machine has a hardware failure.
>
> Case 2. is not bad (as far as the scheduling is concerned): of course, the machine will not report that it completed the job, and you will have to think about how to complete the job. But it is clear that in such exceptional cases you have to intervene manually in some sense.

Agreed, this happens more often than you might think.

> In order to deal with case 1., you can regularly (e.g. each minute) dump the output of schedule list (possibly suppressing non-important data through the options to keep it short).

Or all the necessary information is kept in sync on persistent storage. This would then also allow easy fail-over if the master-schedule-node fails. A 2nd machine could quickly take over.

> One could add a logging option to decrease the possible race of 1 minute, but in case of hardware failure a possible race cannot be excluded anyway. In case 1. you manually have to re-queue the jobs and think about what to do with the already started jobs. However, I cannot imagine that this occurs so frequently that this exceptional case becomes something one should seriously think about.

As I mentioned above, with BI infrastructure (large databases, complex ETL processes, interactive report services, ...), the scheduler is busy 24/7. The number of tasks, schedules, dependencies and states that need to be kept track of can easily lead to unforeseen issues and bugs.
[gentoo-user] Re: Recommendations for scheduler
J. Roeleveld jo...@antarean.org wrote:
>> So you have a command which might break due to a hardware error and cannot be rerun. I cannot see how any general-purpose scheduler might help you here: you either need to be able to split your command into several (sequential) commands or you need something adapted for your particular command.
>
> A general-purpose scheduler can work, as they do exist.

I doubt that they can solve your problem. Let me repeat: you have a single program which accesses the database in a complex way and somewhere in the course of accessing it, the machine (or program) crashes. No general-purpose program can recover from this: you need particular knowledge of the database and the program if you even want to have a *chance* to recover from such a situation. A program with such particular knowledge can hardly be called general-purpose.

> If, during one of these steps, the database or ETL process suffers a crash, the activities of the ETL process need to be rolled back to the point where you can restart it.

I agree, but you need particular knowledge of the database and your tasks to do this, which is far beyond the job of a scheduler. As already mentioned by someone in this thread, your problem needs to be solved on the level of the database (using snapshot capabilities etc.)

>> In order to deal with case 1., you can regularly (e.g. each minute) dump the output of schedule list (possibly suppressing non-important data through the options to keep it short).
>
> Or all the necessary information is kept in-sync on persistent storage. This would then also allow easy fail-over if the master-schedule-node fails

No, it wouldn't, since jobs just finishing and wanting to report their status cannot do this when there is no server. You would need a rather involved protocol to deal with such situations dynamically. It can certainly be done, but it is not something which can easily be added as a feature: if this is required, it has to be the fundamental concept from the very beginning and everything else has to follow this first aim. You need different protocols than TCP sockets, to start with; something like dbus over IP with servers being able to announce their new presence, etc.
Re: [gentoo-user] Re: Recommendations for scheduler
On 04/08/2014 15:31, Martin Vaeth wrote: J. Roeleveld jo...@antarean.org wrote: So you have a command which might break due to hardware error and cannot be rerun. I cannot see how any general-purpose scheduler might help you here: You either need to be able to split your command into several (sequential) commands or you need something adapted for your particular command. A general-purpose scheduler can work, as they do exist. I doubt that they can solve your problem. Let me repeat: You have a single program which accesses the database in a complex way and somewhere in the course of accessing it, the machine (or program) crashes. No general-purpose program can recover from this: You need particular knowledge of the database and the program if you even want to have a *chance* to recover from such a situation. A program with such a particular knowledge can hardly be called general-purpose. Joost, Either make the ETL tool pick up where it stopped and continue as it is the only that knows what it was doing and how far it got. Or, wrap the entire script in a single transaction. -- Alan McKinnon alan.mckin...@gmail.com
Re: [gentoo-user] Re: Recommendations for scheduler
On 4 August 2014 15:35:41 CEST, Alan McKinnon alan.mckin...@gmail.com wrote: On 04/08/2014 15:31, Martin Vaeth wrote: J. Roeleveld jo...@antarean.org wrote: So you have a command which might break due to hardware error and cannot be rerun. I cannot see how any general-purpose scheduler might help you here: You either need to be able to split your command into several (sequential) commands or you need something adapted for your particular command. A general-purpose scheduler can work, as they do exist. I doubt that they can solve your problem. Let me repeat: You have a single program which accesses the database in a complex way and somewhere in the course of accessing it, the machine (or program) crashes. No general-purpose program can recover from this: You need particular knowledge of the database and the program if you even want to have a *chance* to recover from such a situation. A program with such a particular knowledge can hardly be called general-purpose. Joost, Either make the ETL tool pick up where it stopped and continue as it is the only that knows what it was doing and how far it got. Or, wrap the entire script in a single transaction. Alan, That would be the ideal solution. However, a single transaction dealing with around 500,000,000 rows will get me shot by the DBAs :) (Never mind that the performance of this will be such that having it all done by an office full of secretaries might be quicker.) Having the ETL process clever enough to be able to pick up from any point requires a degree of forward thinking and planning that is never done in real life. I would love to design it like that as it isn't too difficult. But I always get brought into these projects when implementing these structures will require a full rewrite and getting the original architects to admit their design can't be made restartable without human intervention. 
At which point the business simply says it is acceptable to have people do a manual rollback and restart the schedules from wherever it went wrong. I'm sure your wife has similar experiences as this is why these projects are always late to deliver and over budget. -- Joost -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
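For what "picking up from where it stopped" could look like without one giant transaction, here is a minimal sketch. It assumes a toy SQLite database and hypothetical table names (`target`, `checkpoint`); a real ETL tool and a real warehouse will differ considerably. Each batch is committed together with a marker recording the next unprocessed row, so a rerun continues after the last committed batch.

```python
import sqlite3


def load_in_batches(conn, rows, batch_size, fail_after=None):
    """Copy `rows` into table `target`, committing one batch at a time.

    The index of the next unprocessed row is stored in table `checkpoint`
    in the same transaction as the batch itself. `fail_after` only exists
    to simulate a crash in this example.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS target (id INTEGER PRIMARY KEY, val TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint (next_row INTEGER)")
    conn.commit()
    row = conn.execute("SELECT next_row FROM checkpoint").fetchone()
    start = row[0] if row else 0
    for offset in range(start, len(rows), batch_size):
        if fail_after is not None and offset >= fail_after:
            raise RuntimeError("simulated crash")
        batch = rows[offset:offset + batch_size]
        with conn:  # commits on success, rolls back on an exception
            conn.executemany("INSERT INTO target VALUES (?, ?)", batch)
            conn.execute("DELETE FROM checkpoint")
            conn.execute("INSERT INTO checkpoint VALUES (?)",
                         (offset + len(batch),))
```

Calling the function again after a failure resumes at the stored checkpoint instead of starting over, which keeps each transaction small.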
Re: [gentoo-user] Re: Recommendations for scheduler
On 4 August 2014 15:31:40 CEST, Martin Vaeth mar...@mvath.de wrote: J. Roeleveld jo...@antarean.org wrote: So you have a command which might break due to hardware error and cannot be rerun. I cannot see how any general-purpose scheduler might help you here: You either need to be able to split your command into several (sequential) commands or you need something adapted for your particular command. A general-purpose scheduler can work, as they do exist. I doubt that they can solve your problem. Let me repeat: You have a single program which accesses the database in a complex way and somewhere in the course of accessing it, the machine (or program) crashes. No general-purpose program can recover from this: You need particular knowledge of the database and the program if you even want to have a *chance* to recover from such a situation. A program with such a particular knowledge can hardly be called general-purpose. The scheduler needs to be able to show which process failed/didn't finish. Then humans need to ensure that part finishes/reruns properly. Then humans need to be able to mark the failed process as succeeded. At which point the scheduler continues with the schedule(s) If, during one of these steps, the database or ETL process suffers a crash, the activities of the ETL process need to be rolled back to the point where you can restart it. I agree, but you need particular knowledge of the database and your tasks to do this which is far beyond the job of a scheduler. As already mentioned by someone in this thread, your problem needs to be solved on the level of the database (using snapshopt capabilities etc.) Or human intervention. Which requires a clear indication of where it went wrong and allows a simple action to continue the schedule from where it was after these humans solved the issues and ensure consistency. In order to deal with case 1., you can regularly (e.g. 
each minute) dump the output of schedule list (possibly suppressing non-important data through the options to keep it short). Or all the necessary information is kept in-sync on persistent storage. This would then also allow easy fail-over if the master-schedule-node fails No, it wouldn't, since jobs just finishing and wanting to report their status cannot do this when there is no server. You would need a rather involved protocol to deal with such situations dynamically. It can certainly be done, but it is not something which can easily be added as a feature: If this is required, it has to be the fundamental concept from the very beginning and everything else has to follow this first aim. You need different protocols than TCP sockets, to start with; something like dbus over IP with servers being able to announce their new presence, etc. I think it's doable with standard networking protocols. But, either you have a master server which controls everything. Or you have a master process which has failover functionality using classical distributed software techniques. These emails are actually quite useful as I am getting a clear picture in my head on how I could approach this properly. Thanks, Joost
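The workflow described in this message (the scheduler shows which step failed, humans fix it and mark it as succeeded, then the schedule continues) could be modelled roughly like this. The class and method names are made up for illustration and do not correspond to any existing scheduler.

```python
class Schedule:
    """A linear chain of steps that stops at the first failure."""

    def __init__(self, steps):
        # steps: list of (name, callable); the callable returns True on success
        self.steps = steps
        self.state = {name: "pending" for name, _ in steps}

    def run(self):
        """Run pending steps in order.

        Returns the name of the step that blocks the schedule,
        or None when every step has succeeded.
        """
        for name, action in self.steps:
            if self.state[name] == "succeeded":
                continue
            if self.state[name] == "failed":
                return name  # still waiting for an operator
            succeeded = action()
            self.state[name] = "succeeded" if succeeded else "failed"
            if not succeeded:
                return name
        return None

    def mark_succeeded(self, name):
        """Operator override after the step was repaired by hand."""
        self.state[name] = "succeeded"
```

The scheduler itself never tries to repair anything here. It only records where the chain stopped and waits for the override, which matches the division of labour discussed in the thread.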
Re: [gentoo-user] Re: Recommendations for scheduler
On 04/08/2014 21:46, J. Roeleveld wrote: On 4 August 2014 15:35:41 CEST, Alan McKinnon alan.mckin...@gmail.com wrote: On 04/08/2014 15:31, Martin Vaeth wrote: J. Roeleveld jo...@antarean.org wrote: So you have a command which might break due to hardware error and cannot be rerun. I cannot see how any general-purpose scheduler might help you here: You either need to be able to split your command into several (sequential) commands or you need something adapted for your particular command. A general-purpose scheduler can work, as they do exist. I doubt that they can solve your problem. Let me repeat: You have a single program which accesses the database in a complex way and somewhere in the course of accessing it, the machine (or program) crashes. No general-purpose program can recover from this: You need particular knowledge of the database and the program if you even want to have a *chance* to recover from such a situation. A program with such a particular knowledge can hardly be called general-purpose. Joost, Either make the ETL tool pick up where it stopped and continue as it is the only that knows what it was doing and how far it got. Or, wrap the entire script in a single transaction. Alan, That would be the ideal solution. You have the same concerns I do - how do you make a transaction around 500 million rows. So I asked the in-house expert - Mrs Alan :-) However, a single transaction dealing with around 500,000,000 rows will get me shot by the DBAs :) (Never mind that the performance of this will be such that having it all done by an office full of secretaries might be quicker.) She reckons an ETL job *must* be self-contained; if it isn't then it's broken by design. It must be idempotent too, which can be as simple as Truncate, Load, Commit Having the ETL process clever enough to be able to pick up from any point requires a degree of forward thinking and planning that is never done in real life. I would love to design it like that as it isn't too difficult. 
But I always get brought into these projects when implementing these structures will require a full rewrite and getting the original architects to admit their design can't be made restartable without human intervention. I agree with that design actually - it's the job of the hardware and OS guys to make stuff reliable that the application layer can rely on. When a SAN connection goes away, it usually comes back and the app layer just carries on (never mind that it retried 100 times meanwhile). Sometimes this doesn't work out. The easiest, cheapest and quickest way to handle it is to just restart the whole job from the beginning. This offends the engineer in us sometimes, but it really is the best way and all of Unix is built on this very idea :-) If the SAN goes away too often and it causes issues, then maybe the best approach is to get the SAN and facilities guys to get their act together. At which point the business simply says it is acceptable to have people do a manual rollback and restart the schedules from wherever it went wrong. Exactly. One of the few cases where business has the correct idea. There are only so many pennies to spend and so many dollars to be delivered. I'm sure your wife has similar experiences as this is why these projects are always late to deliver and over budget. She says her projects are subject to the same universal inviolate rule as mine: time and cost is always best engineering estimate times pi. We learn to deal with it. Which brings us back to Martin's initial statement: a scheduler cannot deal with any of this, the job itself must. It's an unpredictable event and schedulers can only deal with predictable events. -- Alan McKinnon alan.mckin...@gmail.com
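The "Truncate, Load, Commit" pattern mentioned above is easy to show on a small scale. The sketch below uses SQLite purely as a stand-in, with a hypothetical `staging` table; SQLite has no TRUNCATE statement, so an unqualified DELETE takes its place. Because the delete and the load share one transaction, running the job twice leaves the same result as running it once, and a failure in the middle leaves the previous contents untouched.

```python
import sqlite3


def reload_table(conn, rows):
    """Idempotent load: empty the table and refill it in one transaction."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging (id INTEGER PRIMARY KEY, val TEXT)")
    conn.commit()
    with conn:  # commit on success, roll back on any exception
        conn.execute("DELETE FROM staging")  # stands in for TRUNCATE
        conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
```

This is exactly the "just restart the whole job" approach: no checkpoint is kept, so it suits tables small enough to reload in a single transaction.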
Re: [gentoo-user] Re: Recommendations for scheduler
On Saturday 02 August 2014 16:53:26 James wrote: Alan McKinnon alan.mckinnon at gmail.com writes: Well, we've found 2 projects that at least in part seek to achieve our general goals - chronos and Martin's new project. Why don't we both fool around with them for a bit and get a sense of what it will take to add features etc? Then we can meet back here and discuss. Always better to build on an existing foundation Mesos looks promising for a variety of (Apache) reasons. Some key technologies folks may want google about that are related: Quincy (fair schedular) Chronos (scheduler) Hadoop (scheduler) Hadoop not a scheduler. It's a framework for a Big Data clustered database. HDFS (clusterd file system) Unless it's changed recently, not suitable for anything else then Hadoop and contains a single point of failure. http://gpo.zugaina.org/sys-cluster/apache-hadoop-common Zookeeper (Fault tolerance) SPARK ( optimized for interative jobs where a datase is resued in many parallel operations (advanced math/science and many other apps.) https://spark.apache.org/ Dryad Torque Mpiche2 MPI Globus tookit mesos_tech_report.pdf It looks as though Amazon, google, facebook and many others large in the Cluster/Cloud arena are using Mesos..? So let's all post what we find, particularly in overlays. Unless you are dealing with Big Data projects, like Google, Facebook, Amazon, big banks,... you don't have much use for those projects. Mesos looks like a nice project, just like Hadoop and related are also nice. But for most people, they are as usefull as using Exalytics. A scheduler should not have a large set of dependencies that you wouldn't use otherwise. That makes Chronos a non-option to me. Martin's project looks promising, but doesn't store the schedules internally. For repeating schedules, like what Alan was describing, you need to put those into scripts and start those from an existing cron. 
Of the 2, I think improving Martin's project is the most likely option for me as it doesn't have additional dependencies and seems to be easily implemented. -- Joost
[gentoo-user] Re: Recommendations for scheduler
J. Roeleveld jo...@antarean.org wrote: Depends on the specific requirements. If you want: In a sense, most you require can be done with my mentioned schedule tool, although perhaps the usage is not in the way you expected. I reorder your points for a clearer explanation: - have schedules operate over multiple machines (eg. part run on database, some on a compute-cluster, some other bit making nice graphs and printing it,...) Since schedule can use TCP for communication, this should not be a problem if you let schedule-server listen world-wide (export SCHEDULE_SERVER_OPTS=-a0.0.0.0) For the actual scheduling you must setup your machines correspondingly: Queue on one machine the task doing the database access you want (with schedule -a[serveraddress] queue command_to_access_database) and similarly on the other machines. Of course, ssh or anything else can be used to do this without physically accessing the machines. Then, on one machine (not necessarily that of the server), you run an appropriate driver script. - time based start of a schedule - dependencies in said schedules and between schedules which can delay the actual start - stop of schedule if error occurs All this is not a problem, since the driver script is just a shell script which calls schedule to start the tasks, wait for them being finished and/or checking their exit status. This is perhaps inconvenient but has the advantage of being absolutely flexible: You can use all linux tools like sleep (or also use at or cron) to get any delays you want, do tests more powerful than checking the exit status etc. - ability to restart schedule from crashed point Running non-yet started jobs after a crash is not a problem - you just edit your driver script appropriately and restart it. Jobs which were already running need to be re-queued if they should be running again.
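The driver-script idea (start tasks in order, wait for each, check the exit status, stop on error) is described above for a shell script. The same logic is sketched here in Python with plain subprocesses so that it stays self-contained; it deliberately does not call the schedule command, whose exact subcommands are not spelled out in this message. The function name and return convention are assumptions for the example.

```python
import subprocess
import sys


def run_chain(commands):
    """Run each command in order and stop at the first non-zero exit status.

    `commands` is a list of argument lists. Returns a tuple
    (index_of_failed_command, exit_status), or (None, 0) if all succeeded.
    """
    for index, argv in enumerate(commands):
        status = subprocess.run(argv).returncode
        if status != 0:
            return index, status
    return None, 0


if __name__ == "__main__":
    # Two trivial placeholder tasks; the second one fails on purpose.
    chain = [
        [sys.executable, "-c", "print('database step done')"],
        [sys.executable, "-c", "raise SystemExit(3)"],
        [sys.executable, "-c", "print('never reached')"],
    ]
    print(run_chain(chain))
```

The returned index tells the operator which task to repair. After a crash, the chain can be restarted from that index by passing only the remaining commands.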
Re: [gentoo-user] Re: Recommendations for scheduler
On Sunday, August 03, 2014 07:50:57 AM Martin Vaeth wrote: J. Roeleveld jo...@antarean.org wrote: Depends on the specific requirements. If you want: In a sense, most you require can be done with my mentioned schedule tool, although perhaps the usage is not in the way you expected. I agree, based on a quick look. I reorder your points for a clearer explanation: snipped explanation A useful addition to your schedule-tool would be to store the scripts in a way that makes editing simpler and then add an editing tool to make this process simpler. Add monitoring (email alerts, webpage, front-end) to check the status of all the batch-jobs. I might be mistaken, but I think the server keeps the entire queue in-memory and when the process dies, the status is lost? Or is it kept somewhere? -- Joost
[gentoo-user] Re: Recommendations for scheduler
J. Roeleveld jo...@antarean.org wrote: A useful addition to your schedule-tool would be to store the scripts in a way that makes editing simpler Since it is an arbitrary script in an arbitrary language, I think this is not in the scope of this project to do this. In most cases I used it so far, 1-2 more or less complex lines (maybe a few more if they would not be complex) in an interactive zsh were enough, and these are very simple enough to edit in zsh, i.e. I even did not write any script file in the classical sense. I might be mistaken, but I think the server keeps the entire queue in-memory and when the process dies, the status is lost? Yes, the server process must not die. If it dies, not only the queue is lost but also the waiting processes (that is: queued but not yet started) cannot be reached anymore: These waiting processes do not have their own TCP socket but just keep their established connection with the server's socket until the server tells them through this connection to start or to cancel; if this connection gets lost, the waiting processes die: What else could they do, reasonably? The already started processes have a unique ID (into which the server's process is encoded): They reestablish the connection to report the exit status according to this ID. If the server is stopped, they cannot report this status, of course, and moreover, a new server does not know their IDs either and thus will ignore these status reports. Maybe this protocol is not the most clever solution, but it is one which could be implemented without lots of overhead: Mainly, I was up to a quick solution which is working good enough for me: If the server has no bugs, why should it die? Moreover, if the server dies for some strange reasons, it is probably safer to re-queue the jobs again, anyway.
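The behaviour of a waiting process described above (hold the established connection until the server says start or cancel, give up if the connection is lost) can be illustrated with a few lines of socket code. The one-word message format used here is an assumption for the example and is not the actual wire protocol of schedule.

```python
import socket


def wait_for_server(conn):
    """Block on an established connection until the server decides.

    Returns "start" or "cancel" if the server sent that word, and "lost"
    if the connection was closed or something unexpected arrived.
    """
    data = conn.recv(16)
    if not data:  # an empty read means the peer closed the connection
        return "lost"
    word = data.decode("ascii", "replace").strip()
    return word if word in ("start", "cancel") else "lost"
```

With this design a waiting client needs no listening port of its own, which is the simplicity argument made in the message. The price is that a dead server takes all waiting clients with it.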
Re: [gentoo-user] Re: Recommendations for scheduler
On 03/08/2014 09:23, Joost Roeleveld wrote: On Saturday 02 August 2014 16:53:26 James wrote: Alan McKinnon alan.mckinnon at gmail.com writes: Well, we've found 2 projects that at least in part seek to achieve our general goals - chronos and Martin's new project. Why don't we both fool around with them for a bit and get a sense of what it will take to add features etc? Then we can meet back here and discuss. Always better to build on an existing foundation Mesos looks promising for a variety of (Apache) reasons. Some key technologies folks may want google about that are related: Quincy (fair schedular) Chronos (scheduler) Hadoop (scheduler) Hadoop not a scheduler. It's a framework for a Big Data clustered database. HDFS (clusterd file system) Unless it's changed recently, not suitable for anything else then Hadoop and contains a single point of failure. http://gpo.zugaina.org/sys-cluster/apache-hadoop-common Zookeeper (Fault tolerance) SPARK ( optimized for interative jobs where a datase is resued in many parallel operations (advanced math/science and many other apps.) https://spark.apache.org/ Dryad Torque Mpiche2 MPI Globus tookit mesos_tech_report.pdf It looks as though Amazon, google, facebook and many others large in the Cluster/Cloud arena are using Mesos..? So let's all post what we find, particularly in overlays. Unless you are dealing with Big Data projects, like Google, Facebook, Amazon, big banks,... you don't have much use for those projects. My wife works in BigData for real, she and Joost speak the same language, I don't :-) She reckons Big Data is like teenage sex - everyone says they are doing it and no-one really does ;-D Mesos looks like a nice project, just like Hadoop and related are also nice. But for most people, they are as usefull as using Exalytics. A bit OT, but it might be worthwhile for interested persons to get good ebuilds going for these projects. Someone will use it on Gentoo, and it will add value to the project. 
Much like gems and other business-oriented packages benefit A scheduler should not have a large set of dependencies that you wouldn't use otherwise. That makes Chronos a non-option to me. Martin's project looks promising, but doesn't store the schedules internally. For repeating schedules, like what Alan was describing, you need to put those into scripts and start those from an existing cron. Sounds like a small feature-add. If Martin did his groundwork correctly[1] then the core logic will work and it's just a case of adding some persistence and loading the data back in on demand Of the 2, I think improving Martin's project is the most likely option for me as it doesn't have additional dependencies and seems to be easily implemented. Don't forget Martins is the guy who does eix. Street cred? check Knows Gentoo? check [1] I only say it this way as I haven't evaluated his code at all yet so have no idea how far Martin has taken it -- Alan McKinnon alan.mckin...@gmail.com
Re: [gentoo-user] Re: Recommendations for scheduler
On Sunday, August 03, 2014 02:16:37 PM Alan McKinnon wrote: On 03/08/2014 09:23, Joost Roeleveld wrote: On Saturday 02 August 2014 16:53:26 James wrote: Alan McKinnon alan.mckinnon at gmail.com writes: snipped Unless you are dealing with Big Data projects, like Google, Facebook, Amazon, big banks,... you don't have much use for those projects. My wife works in BigData for real, she and Joost speak the same language, I don't :-) She reckons Big Data is like teenage sex - everyone says they are doing it and no-one really does ;-D I know a few companies that actually do use it. But, the biggest issue with the whole Big Data thing is that noone really agrees on what it actually is. Mesos looks like a nice project, just like Hadoop and related are also nice. But for most people, they are as usefull as using Exalytics. A bit OT, but it might be worthwhile for interested persons to get good ebuilds going for these projects. Someone will use it on Gentoo, and it will add value to the project. Much like gems and other business-oriented packages benefit I agree, but just to implement a decent scheduler, I still think it's overkill. A scheduler should not have a large set of dependencies that you wouldn't use otherwise. That makes Chronos a non-option to me. Martin's project looks promising, but doesn't store the schedules internally. For repeating schedules, like what Alan was describing, you need to put those into scripts and start those from an existing cron. Sounds like a small feature-add. If Martin did his groundwork correctly[1] then the core logic will work and it's just a case of adding some persistence and loading the data back in on demand The code looks clean and I think it shouldn't be too much work to add it. Of the 2, I think improving Martin's project is the most likely option for me as it doesn't have additional dependencies and seems to be easily implemented. Don't forget Martins is the guy who does eix. Street cred? check Knows Gentoo? 
check [1] I only say it this way as I haven't evaluated his code at all yet so have no idea how far Martin has taken it The code is clean and does what Martin says it does. -- Joost
Re: [gentoo-user] Re: Recommendations for scheduler
On Sunday, August 03, 2014 12:10:49 PM Martin Vaeth wrote: J. Roeleveld jo...@antarean.org wrote: A useful addition to your schedule-tool would be to store the scripts in a way that makes editing simpler Since it is an arbitrary script in an arbitrary language, I think this is not in the scope of this project to do this. In most cases I used it so far, 1-2 more or less complex lines (maybe a few more if they would not be complex) in an interactive zsh were enough, and these are very simple enough to edit in zsh, i.e. I even did not write any script file in the classical sense. I might be mistaken, but I think the server keeps the entire queue in-memory and when the process dies, the status is lost? Yes, the server process must not die. If it dies, not only the queue is lost but also the waiting processes (that is: queued but not yet started) cannot be reached anymore: These waiting processes do not have their own TCP socket but just keep their established connection with the server's socket until the server tells them through this connection to start or to cancel; if this connection gets lost, the waiting processes die: What else could they do, reasonably? The already started processes have a unique ID (into which the server's process is encoded): They reestablish the connection to report the exit status according to this ID. If the server is stopped, they cannot report this status, of course, and moreover, a new server does not know their IDs either and thus will ignore these status reports. Maybe this protocol is not the most clever solution, but it is one which could be implemented without lots of overhead: Mainly, I was up to a quick solution which is working good enough for me: If the server has no bugs, why should it die? Moreover, if the server dies for some strange reasons, it is probably safer to re-queue the jobs again, anyway. 
With the kind of schedules I am working with (and I believe Alan will also end up with), restarting the whole process from the start can lead to issues. Finding out how far the process got before the service crashed can become rather complex. -- Joost
Re: [gentoo-user] Re: Recommendations for scheduler
On 03/08/2014 15:36, J. Roeleveld wrote: Maybe this protocol is not the most clever solution, but it is one which could be implemented without lots of overhead: Mainly, I was up to a quick solution which is working good enough for me: If the server has no bugs, why should it die? Moreover, if the server dies for some strange reasons, it is probably safer to re-queue the jobs again, anyway. With the kind of schedules I am working with (and I believe Alan will also end up with), restarting the whole process from the start can lead to issues. Finding out how far the process got before the service crashed can become rather complex. Yes, very much so. My first concern is the database cleanups - without scheduler guarantees I'd need transactions in MySQL. -- Alan McKinnon alan.mckin...@gmail.com
Re: [gentoo-user] Re: Recommendations for scheduler
On Sunday, August 03, 2014 10:04:50 PM Alan McKinnon wrote: On 03/08/2014 15:36, J. Roeleveld wrote: Maybe this protocol is not the most clever solution, but it is one which could be implemented without lots of overhead: Mainly, I was up to a quick solution which is working good enough for me: If the server has no bugs, why should it die? Moreover, if the server dies for some strange reasons, it is probably safer to re-queue the jobs again, anyway. With the kind of schedules I am working with (and I believe Alan will also end up with), restarting the whole process from the start can lead to issues. Finding out how far the process got before the service crashed can become rather complex. Yes, very much so. My first concern is the database cleanups - without scheduler guarantees I'd need transactions in MySQL. Or you migrate to PostgreSQL, but that is OT :) -- Joost
Re: [gentoo-user] Re: Recommendations for scheduler
On 03/08/2014 22:23, J. Roeleveld wrote: On Sunday, August 03, 2014 10:04:50 PM Alan McKinnon wrote: On 03/08/2014 15:36, J. Roeleveld wrote: Maybe this protocol is not the most clever solution, but it is one which could be implemented without lots of overhead: Mainly, I was up to a quick solution which is working good enough for me: If the server has no bugs, why should it die? Moreover, if the server dies for some strange reasons, it is probably safer to re-queue the jobs again, anyway. With the kind of schedules I am working with (and I believe Alan will also end up with), restarting the whole process from the start can lead to issues. Finding out how far the process got before the service crashed can become rather complex. Yes, very much so. My first concern is the database cleanups - without scheduler guarantees I'd need transactions in MySQL. Or you migrate to PostgreSQL, but that is OT :) Maybe, but also valid :-) I took one look at the schemas here and wondered Why MySQL? This is Postgres territory. It's a case of LAMP tunnel vision. -- Alan McKinnon alan.mckin...@gmail.com
Re: [gentoo-user] Re: Recommendations for scheduler
On Sunday, August 03, 2014 10:57:06 PM Alan McKinnon wrote: On 03/08/2014 22:23, J. Roeleveld wrote: On Sunday, August 03, 2014 10:04:50 PM Alan McKinnon wrote: On 03/08/2014 15:36, J. Roeleveld wrote: Maybe this protocol is not the most clever solution, but it is one which could be implemented without lots of overhead: Mainly, I was up to a quick solution which is working good enough for me: If the server has no bugs, why should it die? Moreover, if the server dies for some strange reasons, it is probably safer to re-queue the jobs again, anyway. With the kind of schedules I am working with (and I believe Alan will also end up with), restarting the whole process from the start can lead to issues. Finding out how far the process got before the service crashed can become rather complex. Yes, very much so. My first concern is the database cleanups - without scheduler guarantees I'd need transactions in MySQL. Or you migrate to PostgreSQL, but that is OT :) Maybe, but also valid :-) I took one look at the schemas here and wondered Why MySQL? This is Postgres territory. It's a case of LAMP tunnel vision. That and that people who start with LAMP don't learn SQL. This leads to code that is near impossible to port to a different database and when people actually want to do all the work to get the SQL to work on any database, the projects involved refuse the patches. -- Joost
Re: [gentoo-user] Re: Recommendations for scheduler
On 01/08/2014 21:35, cov...@ccs.covici.com wrote: Alan McKinnon alan.mckin...@gmail.com wrote: On 01/08/2014 20:17, James wrote: Alan McKinnon alan.mckinnon at gmail.com writes: New job, new environment. Existing persons suffer from 5-year-old-with-a-hammer syndrome and assume cron is the solution to all ills. Result: a towering edifice of cron jobs that may or may not clobber each other's work, may or may not work at all, and implement no error handling at all. But my god, can they spew out mail from STDOUT. Sounds like a department full of computer scientists I inherited a few decades ago... I've met folks like that. Brilliant in their chosen field but completely useless outside it? The kind of fellows who see nothing wrong with eating a barbeque'd steak with a spoon because they can get a result? I know nothing about chronos, but I find it an interesting read; ymmv. http://nerds.airbnb.com/introducing-chronos/ http://airbnb.github.io/chronos/ https://github.com/airbnb/chronos Aaaah, now this sounds like something I can use. Proper dependency chains, Restful JSON interface so the devs can write code to drive it in automation. Good find, thanks! Unless I am missing something, chronos is not in the tree at all. Correct, it isn't in the tree. But there's nothing stopping me from getting it in there -- Alan McKinnon alan.mckin...@gmail.com
Re: [gentoo-user] Re: Recommendations for scheduler
On 01/08/2014 23:02, Martin Vaeth wrote: Alan McKinnon alan.mckin...@gmail.com wrote: But cron has only one event trigger: wall-clock time. And it's a very blunt weapon. I'm looking for recommendations of alternative schedulers that satisfy real-world business needs that need some other event trigger than what the time is right now. I had a similar need recently, and since the discussion in https://forums.gentoo.org/viewtopic-t-992780-highlight-.html Interesting thread :-) Conceptually, your needs are the same as mine - sequence defined by something other than wall-clock time. The responders there do the same thing as I experience - tunnel vision with regard to cron. Sysadmins are used to cron and sadly most of us want to ram a purely cron-based solution into places where it most certainly does not belong. Business rules very seldom fit easily into a cron model, they usually rely on a defined sequence had led to nothing satisfactory for me, I have written a scheduler tool which serves my needs (which might very well differ from yours...): The corresponding tool is still in beta testing phase: https://github.com/vaeth/schedule/ You can install it from the mv overlay (available over layman). Nice, thanks for the link :-) Now I have two projects to evaluate. -- Alan McKinnon alan.mckin...@gmail.com
Re: [gentoo-user] Re: Recommendations for scheduler
On Saturday, August 02, 2014 11:18:32 AM Alan McKinnon wrote: On 01/08/2014 21:35, cov...@ccs.covici.com wrote: Alan McKinnon alan.mckin...@gmail.com wrote: On 01/08/2014 20:17, James wrote: Alan McKinnon alan.mckinnon at gmail.com writes: New job, new environment. Existing persons suffer from 5-year-old-with-a-hammer syndrome and assume cron is the solution to all ills. Result: a towering edifice of cron jobs that may or may not clobber each other's work, may or may not work at all, and implement no error handling at all. But my god, can they spew out mail from STDOUT. Sounds like a department full of computer scientists I inherited a few decades ago... I've met folks like that. Brilliant in their chosen field but completely useless outside it? The kind of fellows who see nothing wrong with eating a barbeque'd steak with a spoon because they can get a result? I know nothing about chronos, but I find it an interesting read; ymmv. http://nerds.airbnb.com/introducing-chronos/ http://airbnb.github.io/chronos/ https://github.com/airbnb/chronos Aaaah, now this sounds like something I can use. Proper dependency chains, RESTful JSON interface so the devs can write code to drive it in automation. Good find, thanks! Unless I am missing something, chronos is not in the tree at all. Correct, it isn't in the tree. But there's nothing stopping me from getting it in there. Neither are the dependencies. If you get it to work, don't forget to write a nice howto: from what I found online, the documentation is incomplete and out of date. -- Joost
[gentoo-user] Re: Recommendations for scheduler
Alan McKinnon alan.mckinnon at gmail.com writes: Well, we've found 2 projects that at least in part seek to achieve our general goals - chronos and Martin's new project. Why don't we both fool around with them for a bit and get a sense of what it will take to add features etc? Then we can meet back here and discuss. Always better to build on an existing foundation

Mesos looks promising for a variety of (Apache) reasons. Some related key technologies folks may want to google:

Quincy (fair scheduler)
Chronos (scheduler)
Hadoop (scheduler)
HDFS (clustered file system) http://gpo.zugaina.org/sys-cluster/apache-hadoop-common
Zookeeper (fault tolerance)
Spark (optimized for iterative jobs where a dataset is reused in many parallel operations; advanced math/science and many other apps) https://spark.apache.org/
Dryad
Torque
MPICH2
MPI
Globus toolkit
mesos_tech_report.pdf

It looks as though Amazon, Google, Facebook and many others large in the cluster/cloud arena are using Mesos..? So let's all post what we find, particularly in overlays. hth, James
[gentoo-user] Re: Recommendations for scheduler
Alan McKinnon alan.mckinnon at gmail.com writes: New job, new environment. Existing persons suffer from 5-year-old-with-a-hammer syndrome and assume cron is the solution to all ills. Result: a towering edifice of cron jobs that may or may not clobber each other's work, may or may not work at all, and implement no error handling at all. But my god, can they spew out mail from STDOUT. Sounds like a department full of computer scientists I inherited a few decades ago... I know nothing about chronos, but I find it an interesting read; ymmv. http://nerds.airbnb.com/introducing-chronos/ http://airbnb.github.io/chronos/ https://github.com/airbnb/chronos cheers mate! James
Re: [gentoo-user] Re: Recommendations for scheduler
On 01/08/2014 20:17, James wrote: Alan McKinnon alan.mckinnon at gmail.com writes: New job, new environment. Existing persons suffer from 5-year-old-with-a-hammer syndrome and assume cron is the solution to all ills. Result: a towering edifice of cron jobs that may or may not clobber each other's work, may or may not work at all, and implement no error handling at all. But my god, can they spew out mail from STDOUT. Sounds like a department full of computer scientists I inherited a few decades ago... I've met folks like that. Brilliant in their chosen field but completely useless outside it? The kind of fellows who see nothing wrong with eating a barbeque'd steak with a spoon because they can get a result? I know nothing about chronos, but I find it an interesting read; ymmv. http://nerds.airbnb.com/introducing-chronos/ http://airbnb.github.io/chronos/ https://github.com/airbnb/chronos Aaaah, now this sounds like something I can use. Proper dependency chains, RESTful JSON interface so the devs can write code to drive it in automation. Good find, thanks! cheers mate! James -- Alan McKinnon alan.mckin...@gmail.com
Re: [gentoo-user] Re: Recommendations for scheduler
Alan McKinnon alan.mckin...@gmail.com wrote: On 01/08/2014 20:17, James wrote: Alan McKinnon alan.mckinnon at gmail.com writes: New job, new environment. Existing persons suffer from 5-year-old-with-a-hammer syndrome and assume cron is the solution to all ills. Result: a towering edifice of cron jobs that may or may not clobber each other's work, may or may not work at all, and implement no error handling at all. But my god, can they spew out mail from STDOUT. Sounds like a department full of computer scientists I inherited a few decades ago... I've met folks like that. Brilliant in their chosen field but completely useless outside it? The kind of fellows who see nothing wrong with eating a barbeque'd steak with a spoon because they can get a result? I know nothing about chronos, but I find it an interesting read; ymmv. http://nerds.airbnb.com/introducing-chronos/ http://airbnb.github.io/chronos/ https://github.com/airbnb/chronos Aaaah, now this sounds like something I can use. Proper dependency chains, RESTful JSON interface so the devs can write code to drive it in automation. Good find, thanks! Unless I am missing something, chronos is not in the tree at all. -- Your life is like a penny. You're going to lose it. The question is: How do you spend it? John Covici cov...@ccs.covici.com
[gentoo-user] Re: Recommendations for scheduler
Alan McKinnon alan.mckin...@gmail.com wrote: But cron has only one event trigger: wall-clock time. And it's a very blunt weapon. I'm looking for recommendations of alternative schedulers that satisfy real-world business needs that need some other event trigger than what the time is right now. I had a similar need recently, and since the discussion in https://forums.gentoo.org/viewtopic-t-992780-highlight-.html had led to nothing satisfactory for me, I have written a scheduler tool which serves my needs (which might very well differ from yours...): The corresponding tool is still in beta testing phase: https://github.com/vaeth/schedule/ You can install it from the mv overlay (available over layman).
Re: [gentoo-user] Re: Recommendations for scheduler
On 1 August 2014 20:17:05 CEST, James wirel...@tampabay.rr.com wrote: Alan McKinnon alan.mckinnon at gmail.com writes: New job, new environment. Existing persons suffer from 5-year-old-with-a-hammer syndrome and assume cron is the solution to all ills. Result: a towering edifice of cron jobs that may or may not clobber each other's work, may or may not work at all, and implement no error handling at all. But my god, can they spew out mail from STDOUT. Sounds like a department full of computer scientists I inherited a few decades ago... I know nothing about chronos, but I find it an interesting read; ymmv. http://nerds.airbnb.com/introducing-chronos/ http://airbnb.github.io/chronos/ https://github.com/airbnb/chronos cheers mate! James Looks interesting, apart from it requiring a clustered environment (Mesos). Unless I misunderstand the part where it says it runs on top of Mesos? -- Joost -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [gentoo-user] Re: Recommendations for scheduler
On 1 August 2014 23:02:11 CEST, Martin Vaeth mar...@mvath.de wrote: Alan McKinnon alan.mckin...@gmail.com wrote: But cron has only one event trigger: wall-clock time. And it's a very blunt weapon. I'm looking for recommendations of alternative schedulers that satisfy real-world business needs that need some other event trigger than what the time is right now. I had a similar need recently, and since the discussion in https://forums.gentoo.org/viewtopic-t-992780-highlight-.html had led to nothing satisfactory for me, I have written a scheduler tool which serves my needs (which might very well differ from yours...): The corresponding tool is still in beta testing phase: https://github.com/vaeth/schedule/ You can install it from the mv overlay (available over layman). Going to have a look at this soon. What are the features it currently has already and what are you planning on adding? -- Joost
[gentoo-user] Re: Recommendations for scheduler
J. Roeleveld jo...@antarean.org wrote:
> https://github.com/vaeth/schedule/
> What are the features it currently has already

This is hard to answer, since at first glance the whole thing does not even look like a scheduler: it looks more like a means to communicate with some server. After the discussions in the Gentoo forums, however, it became clear to my surprise that this is all that is needed for the use cases I had in mind: the real scheduler driving the whole thing can be a tiny script (in shell or any other language) which just communicates with that server. To understand whether this can solve your problems, it is probably best to look at the examples in the README (and/or the mentioned discussion in the Gentoo forum).

> and what are you planning on adding?

Since it is sufficient for my purposes, I am currently not planning to add anything (except possibly bug fixes, or if I run into a problem which I cannot solve with it). Patches for extensions are welcome, of course. (Suggestions without patches are also welcome, but my time is currently very limited, and I do not make any promises.)
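To give a feel for how small such a driving script can be, here is a rough Python sketch of the decision logic alone. It does not use schedule's actual commands or protocol (see the README for those); the names ready_jobs and full_order and the job names in the usage note are hypothetical. The driver only needs to know which jobs each job waits for and which jobs have reported success so far.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def ready_jobs(prerequisites, finished):
    """Return the sorted list of jobs that may start now.

    prerequisites maps each job name to the set of jobs it waits for.
    finished is the set of jobs that have already completed successfully.
    """
    done = set(finished)
    return sorted(
        job for job, needs in prerequisites.items()
        if job not in done and set(needs) <= done
    )


def full_order(prerequisites):
    """Return one valid start-to-finish order; raises CycleError on a cycle."""
    return list(TopologicalSorter(prerequisites).static_order())
```

With prerequisites such as {"backup": set(), "compress": {"backup"}, "upload": {"compress"}}, the driver would call ready_jobs each time a job reports completion and start whatever it returns, so the trigger is an event rather than the time of day.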