Hi,

I have an optimisation problem. I have a beautiful but large MySQL database
(22 GB, yes you read that right), and I made a mistake in populating part of
it, so I thought I would write a quick script with Sequel to weed out the
data that shouldn't be there. Unfortunately, it runs too slowly to be of
use...

--> a brief description of the db.

<big picture>
 The database contains a list of webcam image files, with various
statistics recorded about each image or part of an image; I study how the
statistics of the images vary in time.
 These statistics can be mean, max, min, histograms, entropy, etc.
</big picture>

 Each image has a number of different channels: r, g, b, H, S, L, greyscale, etc.
 Each type of statistic is in its own table (10 or so tables exist), so
more tests can be added later.
 Each channel of the image has its own statistic entry in each of the
statistics tables.
 A number of different experiments "determine" how the stats are calculated
(this is the mechanism I use to do science with my data).
 One table, "New_Stats", contains the glue that links individual image
files to their statistics; it currently has 33 million entries, one for each
image/statistics-table pair.

--> a brief description of the problem.

I made a mistake and processed some images twice (with the same experiment),
and due to a bug in my code a number of images have duplicate entries in
each of the statistics tables and in the "New_Stats" table. These later
duplicates contain an error in their maths, so the values are wrong.

--> how I want to solve it
This may not be the best way; I am open to suggestions for a better
technique.

I don't know exactly which images have been duplicated, so I figure I will
have to go through each image and its channels, then go to the stats
tables, and for each (channel, experiment, stats table) combination count
the number of entries; if the count is greater than 1, I push the
subsequent entries onto a stack for deletion later.

---> the problem,

  My simplistic implementation is woefully slow, taking ten minutes on my
machine for 10,000 images.

  Does anyone have suggestions as to how I can speed this up? I don't mind
if it takes a few days to process, but my estimates for this implementation
make me an evolved species before it has finished.

  I've put a copy of the critical bit of code at the bottom of this email
for dissection.

  Any help I can gather from the more knowledgeable masses will be most
appreciated.

 Thanks

 Joe


<ruby code snippet>
  channels = DB.from(:channels)
  stats_tables = DB.from(:Stats_Tables)
  # the problem only occurs with images from camera 5
  all_src_files = DB.from(:src_files_table).filter(:src_camera => 5).limit(10000)
  puts all_src_files.count # not useful/needed until you remove the limit clause above
  puts " now processing each file"
  to_be_removed = Array.new # array holds the items to be removed
  proc_count = 0
  all_src_files.each do |sf|
    # For each stats table, check whether there is more than one entry per
    # channel and experiment; if so, record the ids of the 2nd, 3rd, ... entries.
    exp_a = DB.from(:New_Stats).where(:Experiment_ID => 2).where(:src_file_id => sf[:id])
    channels.each do |ch| # go through each image channel
      stats_tables.each do |stble| # go through each statistics table
        tble_channel = exp_a.filter(:Channel => ch[:Channel_ID]).
                             filter(:Table_id => stble[:Table_id]).order(:Stat_ID.asc)

        if tble_channel.count > 1 # do we have too many entries?
          first_seen = false
          tble_channel.each do |tb|
            if first_seen # skip the first (correct) entry
              to_be_removed.push("id" => tb[:Stat_ID],
                                 "channel" => ch[:Channel_ID],
                                 "table" => stble[:Table_id],
                                 "table_key" => tb[:Table_key])
            else
              first_seen = true
            end
          end
        end
      end
    end
    proc_count += 1 # counter to indicate progress through the images
    if (proc_count % 1000) == 1
      puts "Completed #{proc_count} items, found #{to_be_removed.size} items"
    end
  end
</ruby code snippet>
--
You received this message because you are subscribed to the Google Groups "sequel-talk" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/sequel-talk?hl=en.
