I've recently been porting various web applications from other
backends and databases to Ramaze/Sequel (various DBs). This often
involves merging in additional data not previously in the database.

Batch pushing tables of data would be fine if it weren't for those
pesky references with autoincrement fields. You need to set up a map
between the old primary key and the new primary key. And you need to
update all the columns in all the tables that referenced the old field
with the new value. And you need to process the tables in the
right order based on which tables reference each other. And when a
table references within itself, you need to do a second pass on the
data after inserting the records to update those references.
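
The gist, in plain Ruby with made-up ids and a made-up table shape (not
the library code itself):

```ruby
old_to_new = {}   # old primary key => new primary key
old_rows = [ { :id => 10, :parent_id => nil },
             { :id => 11, :parent_id => 10  } ]

new_rows = old_rows.map do |row|
  new_id = old_to_new.size + 1          # stand-in for the DB's autoincrement
  old_to_new[ row[:id] ] = new_id
  { :id => new_id, :parent_id => row[:parent_id] }
end

# Second pass for the self-reference, once every old id is mapped:
new_rows.each do |row|
  row[:parent_id] = old_to_new[ row[:parent_id] ] if row[:parent_id]
end
# old_to_new    #=> {10=>1, 11=>2}
# new_rows.last #=> {:id=>2, :parent_id=>1}
```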

Anyhow, after doing this a couple times I decided to write a library
to abstract this out. It's working nicely now, except for speed.
Hopefully someone here can suggest ways to speed up the painful part
of it. The library:

- Loads data for tables from CSV files.
- Uses the associations specified in Sequel models to figure out the
  dependencies between them, and load the data in the right order.
- Allows you to translate column names from the CSV file into the new
  schema.

The part that (I think) is making it painfully slow for large data
sets is that Dataset#<< does not always return the primary key of the
record inserted. So I have this code:
        mapping = @transformation.call( ...rowdata... )
        # mapping is a hash of field symbols to data
        self << mapping
        new_row = self.first( mapping )
        @pk_after_transload[ orig_pk ] = new_row.pk

Is there a better Sequel, DB-agnostic way to be sure about the new
primary key values?

Anyhow, in case anyone is interested, here's the code in two files.
The first file shows how I'm using it to specify transloading of data
for a simple site, and the second file is the library itself in its
current form.

__model/migrate.rb__
$: << ".." # Because init.rb is used to being called from one level up
# Setup the DB and load the Sequel model definitions
require 'init.rb'
require 'TableTransloader.rb'

class NodeType
  extend Phrogz::TableTransloader
  transload_from( 'nodetypes.csv', :nodeTypeID ) do |old|
    {
      :node_type => old.nodeTypeID,
      :name      => old.nodeTypeName
    }
  end
end

class Node
  extend Phrogz::TableTransloader
  transload_from( 'nodes.csv', :nodeID ) do |old|
    {
      :name                => old.nodeName,
      :title               => old.title,
      :description         => old.description,
      :top_node_id         => old.topNodeID,    # These will be fixed up
      :parent_node_id      => old.parentNodeID, # by transload_old_data
      :content_html        => old.contentHTML,
      :created_on          => old.createDate,
      :content_mod_date    => old.contentModDate,
      :mod_date            => old.modDate,
      :descendant_mod_date => old.descendantModDate,
      :node_type           => NodeType.pk_after_transload[ old.nodeTypeID ],
      :allow_comments      => old.allowComments,
      :order_id            => old.orderID
    }
  end
end

class User
  extend Phrogz::TableTransloader
  transload_from( 'users.csv', :userID ) do |old|
    {
      :email        => old.UserEmail,
      :name         => old.UserName,
      :pass_hash    => old.Password,
      :access_level => old.AccessLevel
    }
  end
end

class Comment
  extend Phrogz::TableTransloader
  transload_from( 'comments.csv', :commentID ) do |old|
    {
      :node_id    => Node.pk_after_transload[ old.nodeID ],
      :user_id    => User.pk_after_transload[ old.userID ],
      :comment    => old.Comment,
      :created_on => old.timeDate,
      :updated_on => old.timeDate
    }
  end
end

class LogEntry
  extend Phrogz::TableTransloader
  transload_from( 'log.csv' ) do |old|
    {
      :node_id    => Node.pk_after_transload[ old.nodeID ],
      :created_on => old.timeDate,
      :ip         => old.ipAddress,
      :referrer   => old.referrer
    }
  end
end

# The order here doesn't matter; the method figures out the dependencies
Phrogz::TableTransloader.transload_all(
  Node, NodeType, User, Comment, LogEntry
)


__TableTransloader.rb__
module Phrogz; end

module Phrogz::RoughCSV
  def self.load( file_name )
    $KCODE = 'UTF-8'
    require 'time' # for parsing
    csv = IO.read( file_name )
    # Rough cleaning to match the simple regex
    :go while csv.gsub! ',,', ',NULL,'
    :go while csv.gsub! ',"",', ',NULL,'

    # first line must be columns
    header = csv[ /\A.+?\n/ ]
    csv = csv[ header.length..-1 ]
    columns = header.scan( /"([^"]+)"/ ).flatten.map{ |col| col.to_sym }

    rows = []
    rows << current_row = {}
    col_index = -1

    csv.scan( /([^",\n]+|".*?[^"]")(,|\n)/m ) do |item, sep|
      col = columns[ col_index+=1 ]

      # Guess at native Ruby value types
      current_row[ col ] = case
        when item[0]==?"
          item[1..-2].gsub('""','"')
        when item =~ /^-?\d+$/
          item.to_i
        when item == "True"
          true
        when item == "False"
          false
        when item == 'NULL'
          nil
        when item =~ /(?:20|19)\d{2}-\d{2}-\d{2}\b/
          Time.parse(item)
        else
          warn "Treating #{item} as String value..."
          item
      end

      if sep == "\n"
        rows << current_row = {}
        col_index = -1
      end
    end

    # Throw out the final, empty row
    rows.pop if rows.last.empty?
    rows
  end
end
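
# A standalone sanity check of the field-splitting regex above (added for
# illustration; csv_line is a made-up sample, not part of the library):
if __FILE__ == $0
  csv_line = %Q{1,"Bob ""B"" Smith"\n}
  fields = []
  csv_line.scan( /([^",\n]+|".*?[^"]")(,|\n)/m ) do |item, sep|
    fields << ( item[0,1] == '"' ? item[1..-2].gsub('""','"') : item )
  end
  fields #=> ["1", "Bob \"B\" Smith"]
end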

module Phrogz::TableTransloader
  attr_reader :pk_after_transload
  def transload_from( csv_file, original_pk=:id, &transformation )
    @transload_source   = csv_file
    @pk_after_transload = {}
    @pre_transload_pk   = original_pk
    @transformation     = transformation
  end

  def transload_old_data
    puts "Transloading data for #{self} from #{@transload_source}" if $SHOW_PROGRESS
    DB.transaction do
      Phrogz::RoughCSV.load( @transload_source ).each do |old_row|
        orig_pk = old_row[ @pre_transload_pk ]
        raise "No primary key #{@pre_transload_pk} supplied for #{old_row.inspect}" if orig_pk.nil?

        existing_row = @pk_after_transload[ orig_pk ]
        raise "Record with primary key #{orig_pk} already existed" if existing_row

        mapping = @transformation.call( Row.new( self, old_row ) )
        begin
          self << mapping
          new_row = self.first( mapping )
          @pk_after_transload[ orig_pk ] = new_row.pk
        rescue Sequel::DatabaseError => e
          p e, old_row
          puts "="*75
        end
      end

      # Update self-referencing IDs
      self_referencing_columns = model_references.
        select{ |key,model| model==self }.map{ |key,model| key }
      self_referencing_columns.concat( referenced_by[ self ] || [] )
      self_referencing_columns.uniq!
      unless self_referencing_columns.empty?
        each do |new_row|
          self_referencing_columns.each do |column|
            new_row[column] = @pk_after_transload[ new_row[ column ] ]
          end
          new_row.save
        end
      end
    end
  end

  def model_references
    references = {}
    self.all_association_reflections.
      reject{ |assoc| assoc[:type] == :one_to_many }.each do |assoc|
      model = Object.const_get( assoc[ :class_name ] )
      references[ assoc[:key] ] = model
    end
    references
  end

  def referenced_by
    references = {}
    self.all_association_reflections.
      select{ |assoc| assoc[:type] == :one_to_many }.each do |assoc|
      model = Object.const_get( assoc[ :class_name ] )
      references[ model ] ||= []
      references[ model ] << assoc[ :key ]
    end
    references
  end

  def self.transload_all( *models )
    dependencies = {}
    models.each do |model|
      dependencies[ model ] = model.model_references.values.uniq

      # Self-dependencies cause problems in the Dependency Walker
      # They are handled in transload_old_data
      dependencies[ model ].delete( model )
    end
    Phrogz::DependencyWalker.walk( dependencies ) do |model|
      model.delete_all
      model.transload_old_data
    end
  end

  class Row
    attr_reader :data, :model
    def initialize( model, data={} )
      @model = model
      @data  = data
      @pk    = model.primary_key
    end
    def pk
      @data[ @pk ]
    end
    def []( field )
      @data[ field ]
    end
    def method_missing( field, *args )
      field_name = field.to_s
      if field_name[-1] == ?=
        @data[ field_name[0..-2].to_sym ] = args.first
      else
        @data[ field ]
      end
    end
  end

end

module Phrogz::DependencyWalker
  # +dependencies+ are specified as a Hash mapping one object to the object(s)
  # that it depends upon. Objects that should be included in the traversal but
  # that do not depend on any others may be mapped to nil.
  #
  # All objects will be walked AFTER the objects they depend upon.
  # Circular references are not detected, and will cause a SystemStackError.
  # Any self-references (e.g. :a=>:a ) will hang the algorithm.
  #
  # Returns an array of objects in the traversal order.
  # If a block is supplied, it is executed on the objects in traversal order.
  #
  # Examples:
  #  include Phrogz
  #  p DependencyWalker.walk( :a=>:b )
  #  #=> [:b, :a]
  #
  #  p DependencyWalker.walk( :a=>:b, :c=>:a, :b=>[:d,:e], :e=>:d, :f=>nil )
  #  #=> [:f, :d, :e, :b, :a, :c]
  #
  #  DependencyWalker.walk( :a=>[:b], :b=>[:d,:c], :c=>:d ) do |obj|
  #    p obj
  #  end
  #  #=> :d
  #  #=> :c
  #  #=> :b
  #  #=> :a
  def self.walk( dependencies={} )
    chain = []
    # TODO: detect circular references and bail quietly
    # http://phrogz.net/nodes/traversingdirectedgraph.asp
    process = lambda do |key|
      chain << key
      dependent = dependencies[ key ]
      unless dependent.nil?
        if dependent.is_a?( Array )
          dependent.each{ |item| process[ item ] }
        else
          process[ dependent ]
        end
      end
    end
    dependencies.each{ |key,_| process[ key ] }
    chain.reverse!.uniq!
    chain.each{ |item| yield item } if block_given?
    chain
  end
end
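
For comparison (not something the library uses), Ruby's stdlib TSort
computes the same kind of dependencies-first ordering from a hash of this
shape; the symbols here are just the example names from the docs above:

```ruby
require 'tsort'

# A Hash subclass wired into TSort: keys are nodes, values are the
# node(s) each key depends upon.
class DepHash < Hash
  include TSort
  alias tsort_each_node each_key
  def tsort_each_child( node, &block )
    Array( fetch( node, [] ) ).each( &block )
  end
end

deps  = DepHash[ :a => [:b], :b => [:d, :c], :c => [:d] ]
order = deps.tsort
# Dependencies come first: :d before :c and :b, and :b before :a
```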



You received this message because you are subscribed to the Google Groups
"sequel-talk" group.