I didn't get a chance to look at your code much there, but it seems
like you should be able to do this much more quickly. 22GB isn't all
that big a database to begin with (likely something you could hold in
memory), so I'm wondering if there's some server-level optimization
you need to do.

Just a thought. I'd probably write a script that looks for identical
ones and then just axes the dupes.

For each key, just select the rows that are identical, then ax all but
one. I'd also add timestamps to your entries so that if you do anything
like that in the future you can easily just knock off the recent ones.

dave/tibbon

On Thu, Jan 14, 2010 at 1:47 PM, Joe Lovick <[email protected]> wrote:
> Hi,
>
> I have an optimisation problem. I have a beautiful but large db (22GB, yes
> you read that right) in MySQL, and I made a mistake in populating part of
> it, so I thought that I would write a quick script with Sequel to weed out
> the data that shouldn't be there; unfortunately it runs too slowly to be of
> use...
>
> --> a brief description of the db.
>
> <big picture>
>  the database contains a list of webcam image files, with various
> statistics recorded about each image or part of an image; I study how the
> statistics of the images vary over time.
>  these statistics can be mean, max, min, histograms, entropy, etc.
> </big picture>
>
>  each image has a number of different channels: r, g, b, H, S, L, greyscale, etc.
>  each type of statistic is in its own table (10 or so tables exist), so
> more tests can be added later
>  each channel of the image has its own entry in each of the statistics
> tables
>  a number of different experiments "determine" how the stats are calculated
> (this is the mechanism I use to do science with my data)
>  one table, "New_Stats", contains the glue that links individual image files
> to their statistics; it currently has 33 million entries, one for each
> image/statistics-table pair.
>
> --> a brief description of the problem.
>
> I made a mistake and processed some images twice (with the same experiment),
> and due to a bug in my code a number of images have duplicate entries in
> each of the statistics tables and in the "New_Stats" table; these later
> duplicates contain an error in their maths and so the values are wrong.
>
> --> how I want to solve it
> this may not be the best way; I am open to suggestions as to a better
> technique.
>
> I don't know exactly which images have been duplicated, so I figure that I
> will have to go through each image and the channels thereof, then go to
> the stats tables, and for each channel/experiment/stats-table combination
> count the number of entries; if this is greater than 1, I push the
> subsequent entries onto a stack for deletion later.
>
> ---> the problem
>
>   my simplistic implementation is woefully slow, taking ten minutes on my
> machine for 10000 images.
>
>   does anyone have suggestions as to how I can speed this up? I don't mind
> if it takes a few days to process, but my estimates for this implementation
> make me an evolved species before it has finished.
>
>   I put a copy of the critical bit of code at the bottom of this email for
> dissemination.
>
>   Any help I can gather from the more knowledgeable masses will be most
> appreciated.
>
>  Thanks
>
>  Joe
>
>
> <ruby code snippet>
>   channels = DB.from(:channels)
>   stats_tables = DB.from(:Stats_Tables)
>   # the problem only occurs with images from camera 5
>   all_src_files = DB.from(:src_files_table).filter(:src_camera => 5).limit(10000)
>   puts all_src_files.count # not useful/needed until you remove the limit clause above
>   puts " now processing each file"
>   to_be_removed = Array.new # holds the items to be removed
>   proc_count = 0
>   all_src_files.each do |sf|
>     # for each stats table, check whether there is more than one entry per
>     # channel and experiment; if so, record the ids of the 2nd, 3rd, ... entries
>     exp_a = DB.from(:New_Stats).where(:Experiment_ID => 2).
>       where(:src_file_id => sf[:id])
>     channels.each do |ch| # go through each image channel
>       stats_tables.each do |stble| # go through each statistics table
>         tble_channel = exp_a.filter(:Channel => ch[:Channel_ID]).
>           filter(:Table_id => stble[:Table_id]).order(:Stat_ID.asc)
>         if tble_channel.count > 1 # do we have too many entries?
>           # keep the first (lowest Stat_ID) row, queue the rest for deletion
>           tble_channel.all.drop(1).each do |tb|
>             to_be_removed.push("id" => tb[:Stat_ID],
>                                "channel" => ch[:Channel_ID],
>                                "table" => stble[:Table_id],
>                                "table_key" => tb[:Table_key])
>           end
>         end
>       end
>     end
>     proc_count += 1 # counter to indicate progress through the images
>     if (proc_count % 1000) == 1
>       puts "Completed #{proc_count} items, found #{to_be_removed.size} items"
>     end
>   end
> </ruby code snippet>
>
> --
> You received this message because you are subscribed to the Google Groups
> "sequel-talk" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/sequel-talk?hl=en.
>
>