This patch is a first attempt at using condor as a job management
system.  It removes the 'taskomatic' utilities and replaces them with
'condormatic' calls that use condor's command-line interfaces (no QMF,
GSOAP, etc.).

On server startup (and on any changes made while running), a set of
'classads' is created, one for each possible startup location for a
given combination of image and hardware profile that exists and is
usable, along with the backend info condor needs to start an instance
on the given provider.
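
As a concrete illustration, here is a minimal sketch of the classad
text advertised for one provider/account/image/hardware-profile
combination.  The attribute names match those used by
condormatic_classads_sync in this patch; the helper function and the
sample values are hypothetical:

```ruby
# Build the startd classad text for one provider combination.  The
# attribute names mirror condormatic.rb; the builder itself and the
# example values are illustrative only.
def build_provider_classad(index, attrs)
  lines = []
  lines << "Name=\"provider_combination_#{index}\""
  lines << 'MyType="Machine"'
  lines << 'Requirements=true'
  # Attributes the job's requirements expression matches against:
  lines << "hardwareprofile=\"#{attrs[:hwp_id]}\""
  lines << "image=\"#{attrs[:image_id]}\""
  # Backend info condor substitutes into the matched job:
  lines << "image_key=\"#{attrs[:image_key]}\""
  lines << "hardwareprofile_key=\"#{attrs[:hwp_key]}\""
  lines << "provider_url=\"#{attrs[:provider_url]}\""
  lines << "username=\"#{attrs[:username]}\""
  lines << "password=\"#{attrs[:password]}\""
  lines.join("\n")
end

ad = build_provider_classad(0,
  :hwp_id => 1, :image_id => 2,
  :image_key => 'ami-deadbeef', :hwp_key => 'm1-small',
  :provider_url => 'http://localhost:3001/api',
  :username => 'user', :password => 'secret')
puts ad
```

In the real code this text is piped to condor_advertise
UPDATE_STARTD_AD, once per combination.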

For each instance that you start, a job is created in condor.  Condor
then matches the hardware profile and image to a provider and starts
an instance on that provider.  When you stop or destroy that instance,
the job is removed (which isn't really how we want it to work, but it
will do for now).
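
The submit description piped to condor_submit for each instance can be
sketched as follows.  The directives match condormatic_instance_create
in this patch; the builder function itself is hypothetical, and the
$$(...) macros are filled in by condor from whichever provider classad
the job matches:

```ruby
# Build the condor_submit description for one instance.  The $$(...)
# macros are substituted by condor at match time from the provider
# classad; this builder function is illustrative only.
def build_submit_description(job_name, instance_name, hwp_id, image_id)
  <<DESC
universe = grid
executable = #{job_name}
grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) #{instance_name} NULL $$(hardwareprofile_key)
log = #{job_name}.log
requirements = hardwareprofile == "#{hwp_id}" && image == "#{image_id}"
notification = never
queue
DESC
end

puts build_submit_description('job_web01_42', 'web01', 3, 7)
```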

This patch requires our custom, hacked-up condor to be installed.
You can get it at:

http://people.redhat.com/clalance/condor-dcloud

Be sure to read the README.  Chris has written up very good instructions
on how to set up condor.

In general, everything here works.  There are, however, several known
bugs and deficiencies:

- To 'stop' a job in condor we should be using 'hold' instead of
  removing the job.  Removing it causes a few different problems.
- After stopping an instance, the condor job is removed but the
  instance continues to exist in deltacloud, so a subsequent 'start'
  fails.
- I'm only matching on image and hardware profiles, not realms, and
  I'm ignoring quotas too.
- We still reach directly into the Deltacloud API to get the list of
  available actions for each instance.  Maybe this is fine; I'm not
  sure.
- Classads are synced to condor on startup and whenever hardware
  profile or image records change.  However, if you restart condor it
  will have no classads to match against, and your jobs will fail.
- We still use 'on-demand' syncing of states from condor to the
  aggregator, i.e. listing the instances updates the state of each
  instance from condor at that time.  There is no event logging.
- There's no 'reboot' in condor yet.  Not sure how we'll deal with
  that.
- We've kept the tasks model and usage, but they are quasi-meaningless.
  The task table needs to turn into an event/audit log table.
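
For reference, the JobStatus-to-instance-state mapping that the
on-demand sync applies can be sketched like this.  The string values
here stand in for the Instance::STATE_* constants used by the actual
patch code:

```ruby
# Map a condor JobStatus value (as parsed from condor_q -xml) to an
# aggregator instance state.  The strings below are simplified
# stand-ins for the Instance::STATE_* Rails constants.
CONDOR_TO_INSTANCE_STATE = {
  '0' => 'pending',        # Unexpanded
  '1' => 'pending',        # Idle
  '2' => 'running',        # Running
  '3' => 'stopped',        # Removed
  '4' => 'stopped',        # Completed
  '5' => 'create_failed',  # Held
  '6' => 'create_failed',  # Submission_err
}
# Anything unrecognized falls back to pending.
CONDOR_TO_INSTANCE_STATE.default = 'pending'

puts CONDOR_TO_INSTANCE_STATE['2']   # running
```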

Many of these problems have fixes in-progress or will be addressed in
future patches.

Signed-off-by: Ian Main <[email protected]>
---
 src/app/controllers/instance_controller.rb        |   17 +-
 src/app/controllers/pool_controller.rb            |    5 +-
 src/app/models/hardware_profile_observer.rb       |    9 +
 src/app/models/image_observer.rb                  |    9 +
 src/app/util/condormatic.rb                       |  232 +++++++++++++++++++++
 src/config/environment.rb                         |    2 +-
 src/config/initializers/condor_classads_sync.rb   |    8 +
 src/db/migrate/20090804142049_create_instances.rb |    1 +
 8 files changed, 275 insertions(+), 8 deletions(-)
 create mode 100644 src/app/models/hardware_profile_observer.rb
 create mode 100644 src/app/models/image_observer.rb
 create mode 100644 src/app/util/condormatic.rb
 create mode 100644 src/config/initializers/condor_classads_sync.rb

diff --git a/src/app/controllers/instance_controller.rb b/src/app/controllers/instance_controller.rb
index 039ed3a..5664ec5 100644
--- a/src/app/controllers/instance_controller.rb
+++ b/src/app/controllers/instance_controller.rb
@@ -19,7 +19,7 @@
 # Filters added to this controller apply to all controllers in the application.
 # Likewise, all the methods added will be available for all controllers.
 
-require 'util/taskomatic'
+require 'util/condormatic'
 
 class InstanceController < ApplicationController
   before_filter :require_user
@@ -96,8 +96,7 @@ class InstanceController < ApplicationController
                                 :task_target => @instance,
                                 :action      => InstanceTask::ACTION_CREATE})
       if @task.save
-        task_impl = Taskomatic.new(@task,logger)
-        task_impl.instance_create
+        condormatic_instance_create(@task)
         flash[:notice] = "Instance added."
        redirect_to :controller => "pool", :action => 'show', :id => @instance.pool_id
       else
@@ -124,8 +123,16 @@ class InstanceController < ApplicationController
       raise ActionError.new("#{action} cannot be performed on this instance.")
     end
 
-    task_impl = Taskomatic.new(@task,logger)
-    task_impl.send "instance_#{action}"
+    case action
+      when 'stop'
+        condormatic_instance_stop(@task)
+      when 'destroy'
+        condormatic_instance_destroy(@task)
+      when 'start'
+        condormatic_instance_create(@task)
+      else
+        raise ActionError.new("Sorry, action '#{action}' is currently not supported by condor backend.")
+    end
 
    alert = "#{@instance.name}: #{action} was successfully queued."
     flash[:notice] = alert
diff --git a/src/app/controllers/pool_controller.rb b/src/app/controllers/pool_controller.rb
index e687c0b..9d53862 100644
--- a/src/app/controllers/pool_controller.rb
+++ b/src/app/controllers/pool_controller.rb
@@ -20,6 +20,7 @@
 # Likewise, all the methods added will be available for all controllers.
 
 require 'util/taskomatic'
+require 'util/condormatic'
 
 class PoolController < ApplicationController
   before_filter :require_user
@@ -36,8 +37,8 @@ class PoolController < ApplicationController
     #FIXME: clean this up, many error cases here
     @pool = Pool.find(params[:id])
     require_privilege(Privilege::INSTANCE_VIEW,@pool)
-    # pass nil into Taskomatic as we're not working off a task here
-    Taskomatic.new(nil,logger).pool_refresh(@pool)
+    # Go to condor and sync the database to the real instance states
+    condormatic_instances_sync_states
     @pool.reload
     @instances = @pool.instances
   end
diff --git a/src/app/models/hardware_profile_observer.rb b/src/app/models/hardware_profile_observer.rb
new file mode 100644
index 0000000..c924bdb
--- /dev/null
+++ b/src/app/models/hardware_profile_observer.rb
@@ -0,0 +1,9 @@
+class HardwareProfileObserver < ActiveRecord::Observer
+
+  def after_save(hwp)
+    condormatic_classads_sync
+  end
+end
+
+HardwareProfileObserver.instance
+
diff --git a/src/app/models/image_observer.rb b/src/app/models/image_observer.rb
new file mode 100644
index 0000000..68a5b85
--- /dev/null
+++ b/src/app/models/image_observer.rb
@@ -0,0 +1,9 @@
+class ImageObserver < ActiveRecord::Observer
+
+  def after_save(image)
+    condormatic_classads_sync
+  end
+end
+
+ImageObserver.instance
+
diff --git a/src/app/util/condormatic.rb b/src/app/util/condormatic.rb
new file mode 100644
index 0000000..7ec6e01
--- /dev/null
+++ b/src/app/util/condormatic.rb
@@ -0,0 +1,232 @@
+#
+# Copyright (C) 2010 Red Hat, Inc.
+#  Written by Ian Main <[email protected]>
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; version 2 of the License.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
+# MA  02110-1301, USA.  A copy of the GNU General Public License is
+# also available at http://www.gnu.org/copyleft/gpl.html.
+
+def condormatic_instance_create(task)
+
+  begin
+    instance = task.instance
+    # FIXME: We should be using the realm name and matching it in condor.
+    realm = instance.realm.external_key rescue nil
+
+    job_name = "job_#{instance.name}_#{instance.id}"
+
+
+    # I use the 2>&1 to get stderr and stdout together because popen3 does not support
+    # the ability to get the exit value of the command in ruby 1.8.
+    pipe = IO.popen("condor_submit 2>&1", "w+")
+    pipe.puts "universe = grid\n"
+    Rails.logger.info "universe = grid\n"
+    pipe.puts "executable = #{job_name}\n"
+    Rails.logger.info "executable = #{job_name}\n"
+    pipe.puts "grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) #{instance.name} NULL $$(hardwareprofile_key)\n"
+    Rails.logger.info "grid_resource = dcloud $$(provider_url) $$(username) $$(password) $$(image_key) #{instance.name} NULL $$(hardwareprofile_key)\n"
+    pipe.puts "log = #{job_name}.log\n"
+    Rails.logger.info "log = #{job_name}.log\n"
+    pipe.puts "requirements = hardwareprofile == \"#{instance.hardware_profile.id}\" && image == \"#{instance.image.id}\"\n"
+    Rails.logger.info "requirements = hardwareprofile == \"#{instance.hardware_profile.id}\" && image == \"#{instance.image.id}\"\n"
+    pipe.puts "notification = never\n"
+    Rails.logger.info "notification = never\n"
+    pipe.puts "queue\n"
+    Rails.logger.info "queue\n"
+    pipe.close_write
+    out = pipe.read
+    pipe.close
+
+    Rails.logger.info "$? (return value?) is #{$?}"
+    raise ("Error calling condor_submit: #{out}") if $? != 0
+
+    instance.condor_job_id = job_name
+    instance.save!
+
+  rescue Exception => ex
+    task.state = Task::STATE_FAILED
+    Rails.logger.error ex.message
+    Rails.logger.error ex.backtrace
+  else
+    # FIXME: We're kinda lying here.. we don't know the state for the task but I don't think that matters so much
+    # as we are just going to use the 'task' table as a kind of audit log.
+    task.state = Task::STATE_PENDING
+  end
+  task.instance.save!
+end
+
+# JobStatus for condor jobs:
+#
+# 0 Unexpanded  U
+# 1 Idle        I
+# 2 Running     R
+# 3 Removed     X
+# 4 Completed   C
+# 5 Held        H
+# 6 Submission_err  E
+#
+
+def condor_to_instance_state(state_val)
+  case state_val
+    when '0'
+      return Instance::STATE_PENDING
+    when '1'
+      return Instance::STATE_PENDING
+    when '2'
+      return Instance::STATE_RUNNING
+    when '3'
+      return Instance::STATE_STOPPED
+    when '4'
+      return Instance::STATE_STOPPED
+    when '5'
+      return Instance::STATE_CREATE_FAILED
+    when '6'
+      return Instance::STATE_CREATE_FAILED
+  else
+    return Instance::STATE_PENDING
+  end
+end
+
+def condormatic_instances_sync_states
+
+  begin
+    # I'm not going to do the 2&>1 trick here since we are parsing the output
+    # and I'm afraid we'll get a warning or something on stderr and it'll mess
+    # up the xml parsing.
+    pipe = IO.popen("condor_q -xml")
+    xml = pipe.read
+    pipe.close
+
+    raise ("Error calling condor_q -xml") if $? != 0
+
+    # Set them all to 'stopped' because if they aren't in the condor
+    # queue as jobs then they are not running, pending or anything else.
+    instances = Instance.find(:all)
+    instances.each do |instance|
+      instance.state = Instance::STATE_STOPPED
+      instance.save!
+    end
+
+    def find_value_int(job_ele, attrib)
+      if job_ele.attributes['n'] == attrib
+        cmd = job_ele.elements.each('i') do |i|
+          return i.text
+        end
+      end
+      return nil
+    end
+
+    def find_value_str(job_ele, attrib)
+      if job_ele.attributes['n'] == attrib
+        cmd = job_ele.elements.each('s') do |s|
+          return s.text
+        end
+      end
+      return nil
+    end
+
+    doc = REXML::Document.new(xml)
+    doc.elements.each('classads/c') do |jobs_ele|
+      job_name = nil
+      job_state = nil
+
+      jobs_ele.elements.each('a') do |job_ele|
+        value = find_value_str(job_ele, 'Cmd')
+        job_name = value if value != nil
+        value = find_value_int(job_ele, 'JobStatus')
+        job_state = value if value != nil
+      end
+
+      Rails.logger.info "job name is #{job_name}"
+      Rails.logger.info "job state is #{job_state}"
+
+      instance = Instance.find(:first, :conditions => {:condor_job_id => job_name})
+
+      if instance
+        instance.state = condor_to_instance_state(job_state)
+        instance.save!
+        Rails.logger.info "Instance state updated to #{condor_to_instance_state(job_state)}"
+      end
+    end
+  rescue Exception => ex
+    Rails.logger.error ex.message
+    Rails.logger.error ex.backtrace
+  end
+end
+
+def condormatic_instance_stop(task)
+    instance = task.instance
+
+    Rails.logger.info("calling condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+    pipe = IO.popen("condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+    out = pipe.read
+    pipe.close
+
+    Rails.logger.info("condor_rm return status is #{$?}")
+    Rails.logger.error("Error calling condor_rm (exit code #{$?}) on job: #{out}") if $? != 0
+end
+
+def condormatic_instance_destroy(task)
+    instance = task.instance
+
+    Rails.logger.info("calling condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+    pipe = IO.popen("condor_rm -constraint 'Cmd == \"#{instance.condor_job_id}\"' 2>&1")
+    out = pipe.read
+    pipe.close
+
+    Rails.logger.info("condor_rm return status is #{$?}")
+    Rails.logger.error("Error calling condor_rm (exit code #{$?}) on job: #{out}") if $? != 0
+end
+
+
+def condormatic_classads_sync
+
+  index = 0
+  providers = Provider.find(:all)
+  Rails.logger.info "Syncing classads.."
+
+  providers.each do |provider|
+    provider.cloud_accounts.each do |account|
+      provider.images.each do |image|
+        provider.hardware_profiles.each do |hwp|
+          pipe = IO.popen("condor_advertise UPDATE_STARTD_AD 2>&1", "w+")
+
+          pipe.puts "Name=\"provider_combination_#{index}\""
+          pipe.puts 'MyType="Machine"'
+          pipe.puts 'Requirements=true'
+          pipe.puts "\n# Stuff needed to match:"
+          pipe.puts "hardwareprofile=\"#{hwp.aggregator_hardware_profiles[0].id}\""
+          pipe.puts "image=\"#{image.aggregator_images[0].id}\""
+          pipe.puts "\n# Backend info to complete this job:"
+          pipe.puts "image_key=\"#{image.external_key}\""
+          pipe.puts "hardwareprofile_key=\"#{hwp.external_key}\""
+          pipe.puts "provider_url=\"#{account.provider.url}\""
+          pipe.puts "username=\"#{account.username}\""
+          pipe.puts "password=\"#{account.password}\""
+          pipe.close_write
+
+          out = pipe.read
+          pipe.close
+
+          Rails.logger.error "Unable to submit condor classad: #{out}" if $? != 0
+
+          index += 1
+        end
+      end
+    end
+
+    Rails.logger.info "done"
+  end
+end
+
diff --git a/src/config/environment.rb b/src/config/environment.rb
index 919a710..eb11f17 100644
--- a/src/config/environment.rb
+++ b/src/config/environment.rb
@@ -50,7 +50,7 @@ Rails::Initializer.run do |config|
   config.gem "gnuplot"
   config.gem "scruffy"
 
-  config.active_record.observers = :instance_observer, :task_observer
+  config.active_record.observers = :instance_observer, :task_observer, :hardware_profile_observer, :image_observer
  # Only load the plugins named here, in the order given. By default, all plugins
   # in vendor/plugins are loaded in alphabetical order.
   # :all can be used as a placeholder for all plugins not explicitly named
diff --git a/src/config/initializers/condor_classads_sync.rb b/src/config/initializers/condor_classads_sync.rb
new file mode 100644
index 0000000..9165f75
--- /dev/null
+++ b/src/config/initializers/condor_classads_sync.rb
@@ -0,0 +1,8 @@
+require 'util/condormatic'
+
+puts "Syncing condor classads.."
+# This pulls all the possible classad matches from the database and puts
+# them on condor on startup.
+condormatic_classads_sync
+puts "Done."
+
diff --git a/src/db/migrate/20090804142049_create_instances.rb b/src/db/migrate/20090804142049_create_instances.rb
index 335b93f..42706e1 100644
--- a/src/db/migrate/20090804142049_create_instances.rb
+++ b/src/db/migrate/20090804142049_create_instances.rb
@@ -32,6 +32,7 @@ class CreateInstances < ActiveRecord::Migration
       t.string    :public_address
       t.string    :private_address
       t.string    :state
+      t.string    :condor_job_id
       t.integer   :lock_version, :default => 0
       t.integer   :acc_pending_time, :default => 0
       t.integer   :acc_running_time, :default => 0
-- 
1.7.0.1
