Hi, I have an optimisation problem. I have a beautiful but large MySQL database (22 GB, yes you read that right), and I made a mistake populating part of it, so I thought I would write a quick script with Sequel to weed out the data that shouldn't be there. Unfortunately it runs too slowly to be of use...
--> a brief description of the db.
<big picture>
The database contains a list of webcam image files, with various
statistics recorded about each image or part of an image. I study how the
statistics of the images vary over time.
These statistics can be mean, max, min, histograms, entropy, etc.
</big picture>
Each image has a number of different channels: r, g, b, H, S, L, greyscale, etc.
Each type of statistic is in its own table (10 or so tables exist), so
more tests can be added later.
Each channel of the image has its own entry in each of the
statistics tables.
A number of different experiments determine how the stats are calculated
(this is the mechanism I use to do science with my data).
One table, "New_Stats", contains the glue that links individual image files
to their statistics; it currently has 33 million entries, one for each
image/statistics-table pair.
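For concreteness, a single glue row in "New_Stats" can be pictured like this (column names taken from the code at the bottom of this email; the values are purely illustrative):

```ruby
# One New_Stats row per image / channel / statistics-table / experiment.
new_stats_row = {
  :Stat_ID       => 101,  # primary key of the glue row
  :src_file_id   => 7,    # which image file
  :Channel       => 1,    # r, g, b, H, S, L, greyscale, ...
  :Table_id      => 3,    # which of the ~10 statistics tables
  :Table_key     => 55,   # row id inside that statistics table
  :Experiment_ID => 2,    # which experiment produced the stats
}
new_stats_row[:Table_id] # => 3
```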
--> a brief description of the problem.
I made a mistake and processed some images twice (with the same experiment),
and due to a bug in my code a number of images have duplicate entries in
each of the statistics tables and in the "New_Stats" table. These later
duplicates contain an error in their maths, so the values are wrong.
--> how I want to solve it
This may not be the best way; I am open to suggestions for a better
technique.
I don't know exactly which images have been duplicated, so I figure that I
will have to go through each image and its channels,
then go to the stats tables, and for each (channel, experiment, stats table)
count the number of entries; if the count is greater than 1, I push the
subsequent entries onto a stack for deletion later.
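The counting-and-keep-the-first logic above can be collapsed into a single grouping pass: group every stats row by the key that should be unique (file, channel, table) and queue everything after the first Stat_ID in each group. A minimal pure-Ruby sketch of that idea, using hashes in place of the real tables (names are illustrative):

```ruby
# Each row mimics a New_Stats entry for one experiment.
rows = [
  { stat_id: 1, src_file_id: 7, channel: 1, table_id: 3 },
  { stat_id: 2, src_file_id: 7, channel: 1, table_id: 3 }, # duplicate
  { stat_id: 3, src_file_id: 7, channel: 2, table_id: 3 },
]

# Group by the key that should be unique, then mark everything after the
# first (lowest Stat_ID) entry in each group for deletion.
to_be_removed = rows
  .group_by { |r| [r[:src_file_id], r[:channel], r[:table_id]] }
  .flat_map { |_key, grp| grp.sort_by { |r| r[:stat_id] }.drop(1) }

to_be_removed.map { |r| r[:stat_id] } # => [2]
```

In SQL the same idea is a single `GROUP BY ... HAVING COUNT(*) > 1` query over New_Stats, which would let MySQL do the grouping server-side instead of issuing one COUNT query per image/channel/table.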
---> the problem
My simplistic implementation is woefully slow, taking ten minutes on my
machine for 10000 images.
Does anyone have suggestions as to how I can speed this up? I don't mind if
it takes a few days to process, but my estimates for this implementation
would make me an evolved species before it has finished.
I put a copy of the critical bit of code at the bottom of this email for
dissemination.
Any help I can gather from the more knowledgeable masses will be most
appreciated.
Thanks
Joe
<ruby code snippet>
channels = DB.from(:channels)
stats_tables = DB.from(:Stats_Tables)
# the problem only occurs with images from camera 5
all_src_files = DB.from(:src_files_table).filter(:src_camera => 5).limit(10000)
puts all_src_files.count # not useful/needed until you remove the limit clause above

puts " now processing each file"
to_be_removed = Array.new # array holds the items to be removed
proc_count = 0
all_src_files.each do |sf|
  # What we need to do here: go through and see if, for each stats table,
  # there is more than one entry per channel and experiment. If that is the
  # case, then dump the id of the 2nd, 3rd, ... entries.
  # (The original also had .group_by(:Stat_ID), but grouping by a unique
  # key puts one row in every group and defeats the count check.)
  exp_a = DB.from(:New_Stats).where(:Experiment_ID => 2).where(:src_file_id => sf[:id])
  channels.each do |ch| # go through each image channel
    stats_tables.each do |stble| # go through each statistics table
      tble_channel = exp_a.filter(:Channel => ch[:Channel_ID])
                          .filter(:Table_id => stble[:Table_id])
                          .order(:Stat_ID.asc)
      if tble_channel.count > 1 # do we have too many entries?
        # Too many entries for this table/channel: keep the first
        # (lowest Stat_ID) row and queue the rest for deletion.
        first_seen = false
        tble_channel.each do |tb|
          if first_seen # skip the first entry
            # Bug fix: use the current row tb here; the original called
            # tble_channel.get(:Stat_ID), which always returns the first
            # row of the dataset, not the row being iterated.
            item = { "id" => tb[:Stat_ID], "channel" => ch[:Channel_ID],
                     "table" => stble[:Table_id], "table_key" => tb[:Table_key] }
            to_be_removed.push(item)
          else
            first_seen = true
          end
        end
      end
    end
  end
  proc_count += 1 # counter to indicate progress through the images
  if (proc_count % 1000) == 1
    puts "Completed #{proc_count} items, found #{to_be_removed.size} items"
  end
end
</ruby code snippet>
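The snippet stops at collecting ids; when it comes to the actual deletion, issuing one DELETE per row would be another round-trip trap. Slicing the id list into batches and sending one DELETE with an IN list per batch is much kinder to MySQL. A minimal sketch of the batching (the each_slice part is plain Ruby; the commented-out Sequel call shows where the real delete would go, and the batch size of 500 is an arbitrary assumption):

```ruby
# Hypothetical list of duplicate Stat_IDs gathered by the loop above.
ids_to_remove = (1..1200).to_a

batch_sizes = []
ids_to_remove.each_slice(500) do |batch|
  # One DELETE per batch instead of one per row, e.g. with Sequel:
  #   DB.from(:New_Stats).where(:Stat_ID => batch).delete
  batch_sizes << batch.size
end
batch_sizes # => [500, 500, 200]
```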
-- You received this message because you are subscribed to the Google Groups "sequel-talk" group.
