Timestamps are an excellent idea! As soon as I am out of this mess I
will add them to the design.

I will give Jeremy's technique a go before heading into server-level
optimization (I assume you mean running SQL directly on the server, plus
MySQL performance tuning).
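
For anyone following along, the rule I am trying to implement reduces
to: within each (channel, stats table) group for an experiment, keep the
row with the lowest Stat_ID and queue the rest for deletion. A
pure-Ruby sketch of that rule on made-up rows (no database involved,
column names taken from the snippet below):

```ruby
# Made-up in-memory rows standing in for the real New_Stats table.
rows = [
  { stat_id: 1, channel: 1, table_id: 10 },
  { stat_id: 2, channel: 1, table_id: 10 }, # duplicate in (1, 10)
  { stat_id: 3, channel: 2, table_id: 10 },
  { stat_id: 4, channel: 1, table_id: 11 },
  { stat_id: 5, channel: 1, table_id: 11 }, # duplicate in (1, 11)
  { stat_id: 6, channel: 1, table_id: 11 }, # duplicate in (1, 11)
]

# Group by the key that should be unique, keep the lowest stat_id in
# each group, and collect the rest for deletion.
to_be_removed = rows
  .group_by { |r| [r[:channel], r[:table_id]] }
  .values
  .flat_map { |group| group.sort_by { |r| r[:stat_id] }.drop(1) }
  .map { |r| r[:stat_id] }
  .sort

p to_be_removed # => [2, 5, 6]
```

The real version would push the same grouping into a single query
(GROUP BY on the key columns with HAVING COUNT(*) > 1) rather than one
query per channel/table pair, since the per-pair round trips are where
the time goes.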


Thanks


On Jan 14, 1:35 pm, David Fisher <[email protected]> wrote:
> I didn't get a chance to look at your code much there, but it seems
> like you should be able to do this much quicker. 22GB isn't all that
> big of a database to begin with (something you could hold in memory
> likely), so I'm wondering if there's some server-level optimization
> that you need to do.
>
> Just a thought. I'd probably write a script that looks for identical
> ones and then just axes the dupes.
>
> For each one, just select * that are the same, then ax those. I'd also
> add in timestamps to your entries so that if you do anything like that
> in the future you can easily just knock off the recent ones.
>
> dave/tibbon
>
> On Thu, Jan 14, 2010 at 1:47 PM, Joe Lovick <[email protected]> wrote:
> > Hi,
>
> > I have an optimisation problem: I have a beautiful but large DB (22 GB,
> > yes you read that right) in MySQL, and I made a mistake in populating
> > part of it, so I thought I would write a quick script with Sequel to
> > weed out the data that shouldn't be there. Unfortunately it runs too
> > slowly to be of use...
>
> > --> a brief description of the db.
>
> > <big picture>
> >  The database contains a list of webcam image files, with various
> > statistics recorded about each image (or part of an image); I study how
> > the statistics of the images vary over time.
> >  These statistics can be mean, max, min, histograms, entropy, etc.
> > </big picture>
>
> >  Each image has a number of different channels: r, g, b, H, S, L,
> > greyscale, etc.
> >  Each type of statistic is in its own table (10 or so tables exist), so
> > more tests can be added later.
> >  Each channel of the image has its own entry in each of the statistics
> > tables.
> >  A number of different experiments "determine" how the stats are
> > calculated (this is the mechanism I use to do science with my data).
> >  One table, "New_Stats", contains the glue that links individual image
> > files to their statistics; it currently has 33 million entries, one for
> > each image/statistics-table pair.
>
> > --> a brief description of the problem.
>
> > I made a mistake and processed some images twice (with the same
> > experiment), and due to a bug in my code a number of images have
> > duplicate entries in each of the statistics tables and in the
> > "New_Stats" table. These later duplicates contain an error in their
> > maths, so the values are wrong.
>
> > --> how i want to solve it
> > This may not be the best way; I am open to suggestions as to a better
> > technique.
>
> > I don't know exactly which images have been duplicated, so I figure I
> > will have to go through each image and its channels, then go to the
> > stats tables, and for each (channel, experiment, stats table) count the
> > number of entries; if the count is greater than 1, I push the
> > subsequent entries onto a stack for deletion later.
>
> > ---> the problem,
>
> >   My simplistic implementation is woefully slow, taking ten minutes on
> > my machine for 10000 images.
>
> >   Does anyone have suggestions as to how I can speed this up? I don't
> > mind if it takes a few days to process, but my estimates for this
> > implementation make me an evolved species before it has finished.
>
> >   I put a copy of the critical bit of code at the bottom of this email
> > for dissemination.
>
> >   Any help I can gather from the more knowledgeable masses will be most
> > appreciated.
>
> >  Thanks
>
> >  Joe
>
> > <ruby code snippet>
> >   channels = DB.from(:channels)
> >   stats_tables = DB.from(:Stats_Tables)
> >   # the problem only occurs with images from camera 5
> >   all_src_files = DB.from(:src_files_table).filter(:src_camera => 5).limit(10000)
> >   puts all_src_files.count # not useful/needed until you remove the limit clause above
> >   puts " now processing each file"
> >   to_be_removed = [] # holds the items to be removed
> >   proc_count = 0
> >   all_src_files.each do |sf|
> >     # For each stats table, check whether there is more than one entry
> >     # per channel and experiment; if so, queue the ids of the 2nd,
> >     # 3rd, ... entries for deletion.
> >     exp_a = DB.from(:New_Stats).where(:Experiment_ID => 2).
> >       where(:src_file_id => sf[:id])
> >     channels.each do |ch| # go through each image channel
> >       stats_tables.each do |stble| # go through each statistics table
> >         tble_channel = exp_a.filter(:Channel => ch[:Channel_ID]).
> >           filter(:Table_id => stble[:Table_id]).order(:Stat_ID.asc)
> >         rows = tble_channel.all # fetch once instead of counting then re-querying
> >         if rows.size > 1 # do we have too many entries?
> >           # keep the first (lowest Stat_ID) entry, queue the rest
> >           rows.drop(1).each do |tb|
> >             to_be_removed.push("id" => tb[:Stat_ID],
> >                                "channel" => ch[:Channel_ID],
> >                                "table" => stble[:Table_id],
> >                                "table_key" => tb[:Table_key])
> >           end
> >         end
> >       end
> >     end
> >     proc_count += 1 # counter to indicate progress through the images
> >     if (proc_count % 1000) == 1
> >       puts "Completed #{proc_count} items, found #{to_be_removed.size} items"
> >     end
> >   end
> > </ruby code snippet>
>
> > --
> > You received this message because you are subscribed to the Google Groups
> > "sequel-talk" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to
> > [email protected].
> > For more options, visit this group at
> >http://groups.google.com/group/sequel-talk?hl=en.
>
>