Timestamps are an excellent idea! As soon as I am out of this mess I will add them to the design.
I will give Jeremy's technique a go before heading into server-level optimization (I assume you mean SQL directly on the server, and also MySQL performance tuning). Thanks

On Jan 14, 1:35 pm, David Fisher <[email protected]> wrote:
> I didn't get a chance to look at your code much there, but it seems
> like you should be able to do this much quicker. 22GB isn't all that
> big of a database to begin with (something you could likely hold in
> memory), so I'm wondering if there's some server-level optimization
> that you need to do.
>
> Just a thought. I'd probably write a script that looks for identical
> ones and then just axes the dupes.
>
> For each one, just select * that are the same, then ax those. I'd also
> add timestamps to your entries so that if you do anything like that
> in the future you can easily just knock off the recent ones.
>
> dave/tibbon
>
> On Thu, Jan 14, 2010 at 1:47 PM, Joe Lovick <[email protected]> wrote:
> > Hi,
> >
> > I have an optimisation problem. I have a beautiful but large DB (22GB, yes
> > you read that right) in MySQL, and I made a mistake in populating part of
> > it, so I thought I would write a quick script with Sequel to weed out
> > the data that shouldn't be there. Unfortunately it runs too slowly to be
> > of use...
> >
> > --> a brief description of the db
> >
> > <big picture>
> > The database contains a list of webcam image files, with various
> > statistics recorded about each image or part of an image. I study how
> > the statistics of the images vary in time.
> > These statistics can be mean, max, min, histograms, entropy, etc.
> > </big picture>
> >
> > Each image has a number of different channels: r, g, b, H, S, L, greyscale, etc.
> > Each type of statistic is in its own table (10 or so tables exist), so
> > more tests can be added later.
> > Each channel of the image has its own statistic entry in each of the
> > statistics tables.
> > A number of different experiments "determine" how the stats are calculated.
> > (This is the mechanism I use to do science with my data.)
> > One table contains the glue that links individual image files to their
> > statistics, "New_Stats"; it currently has 33 million entries, one for
> > each image/statistics table pair.
> >
> > --> a brief description of the problem
> >
> > I made a mistake and processed some images twice (with the same
> > experiment), and due to a bug in my code a number of images have
> > duplicate entries in each of the statistics tables and the "New_Stats"
> > table. These later duplicates contain an error in their math, so the
> > values are wrong.
> >
> > --> how I want to solve it
> > (This may not be the best way; I am open to suggestions as to a better
> > technique.)
> >
> > I don't know exactly which images have been duplicated, so I figure I
> > will have to go through each image and its channels,
> > then go to the stats tables, and for each channel, experiment, and
> > stats table, count the number of entries; if this is greater than 1, I
> > push the subsequent entries onto a stack for deletion later.
> >
> > --> the problem
> >
> > My simplistic implementation is woefully slow, taking ten minutes on my
> > machine for 10000 images.
> >
> > Does anyone have suggestions as to how I can speed this up? I don't
> > mind if it takes a few days to process, but my estimates for this
> > implementation make me an evolved species before it has finished.
> >
> > I put a copy of the critical bit of code at the bottom of this email
> > for dissemination.
> >
> > Any help I can gather from the more knowledgeable masses will be most
> > appreciated.
> > Thanks
> >
> > Joe
> >
> > <ruby code snippet>
> > channels = DB.from(:channels)
> > stats_tables = DB.from(:Stats_Tables)
> > # the problem only occurs with images from camera 5
> > all_src_files = DB.from(:src_files_table).filter(:src_camera => 5).limit(10000)
> > puts all_src_files.count # not useful/needed until you remove the limit clause above
> > puts " now processing each file"
> > to_be_removed = Array.new # array holds the items to be removed
> > proc_count = 0
> > all_src_files.each do |sf|
> >   # What we need to do here: go through and see if, for each stats table,
> >   # there is more than one entry per channel and experiment. If that is
> >   # the case, then dump the id for the 2nd, 3rd, ... entries.
> >   exp_a = DB.from(:New_Stats).where(:Experiment_ID => 2).where(:src_file_id => sf[:id]).group_by(:Stat_ID).order(:Stat_ID.desc)
> >   channels.each do |ch| # go through each image channel
> >     stats_tables.each do |stble| # go through each statistics table
> >       tble_channel = exp_a.filter(:Channel => ch[:Channel_ID]).filter(:Table_id => stble[:Table_id]).order(:Stat_ID.asc)
> >
> >       if (tble_channel.count > 1) # do we have too many entries?
> >         # we now have too many entries for this channel/table pair
> >         remove_item = nil
> >         tble_channel.each do |tb|
> >           if (!remove_item.nil?) # skip first entry
> >             # note: use the current row tb here, not tble_channel.get,
> >             # which always returns the first row's values
> >             item = {"id" => tb[:Stat_ID], "channel" => ch[:Channel_ID], "table" => stble[:Table_id], "table_key" => tb[:Table_key]}
> >             to_be_removed.push(item)
> >           else
> >             # ignore the first one
> >             remove_item = true
> >           end
> >         end
> >       end
> >     end
> >   end
> >   proc_count = proc_count + 1 # counter to indicate progress through the images
> >   if ((proc_count % 1000) == 1)
> >     puts "Completed #{proc_count} items, found #{to_be_removed.size} items"
> >   end
> > end
> > </ruby code snippet>
--
You received this message because you are subscribed to the Google Groups "sequel-talk" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/sequel-talk?hl=en.
