RE: getting certain rows from a group by

Rick James Wed, 19 Sep 2012 15:57:23 -0700

> my counts are 3 times too much.
Without studying the code, I would guess that there is a JOIN between he data 
that needs COUNTing and the GROUP BY for the COUNT.  That is, it collects more 
'joined' rows before counting.  Fixing it will probably make the query even 
messier.


> -----Original Message-----
> From: Larry Martell [mailto:larry.mart...@gmail.com]
> Sent: Wednesday, September 19, 2012 3:22 PM
> To: Rick James
> Cc: peter.braw...@earthlink.net; mysql@lists.mysql.com
> Subject: Re: getting certain rows from a group by
>
> Well, I'm getting the proper rows from the 4 joins I added, but now I'm
> seeing a weird side effect - my counts are 3 times to much.
>
> The query is really huge and nasty now, but I'm going to paste it
> below.
>
> In the outer most select, Wafers, Rerun, Runs, and Count are 3 times
> what they should be. If I take off the joins, I get the correct counts.
> I can't see why that's happening.
>
>
>         SELECT data_target.name as Target,
>                q1.ep as EP,
>                COUNT(*) as Wafers,
>                Lots,
>                FORMAT(SUM(numonep)/(COUNT(*)+(SUM(GREATEST(q1.reruns,
> 0)))), 1) as 'Sites/Wafer',
>                DATE_FORMAT(MAX(LastRun), '%m/%d/%y') as "Last Run",
>                SUM(GREATEST(q1.reruns, 0)) as Rerun,
>                COUNT(*)+SUM(GREATEST(q1.reruns, 0)) as Runs,
>                FORMAT(avgbottom, 1) as "Avg Bottom",
>                FORMAT(3*stdbottom, 2) as "3 Sig",
>                FORMAT(maxbottom, 1) as Max,
>                FORMAT(minbottom, 1) as Min,
>                SUM(numonep) as Count,
>                SUM(numonep) - SUM(numbottoms) as NAs,
>                CONCAT(FORMAT(100*SUM(numbottoms)/SUM(numonep), 1), '
> %') as "% Success",
>                FORMAT(3*stdbottom/avgbottom, 2) as "3Sig/Avg",
>                FORMAT(AVG(avgbottom), 1) as 'Wafer Avg',
>                FORMAT(AVG(Wafer3Sigma), 2) as 'Wafer 3 Sigma',
>                FORMAT(AVG(Ranges), 1) as 'Avg Range',
>                FORMAT(3*STD(Ranges), 2) as '3Sig of Ranges',
>                FORMAT(MAX(Ranges), 1) as 'Max Range',
>                FORMAT(MIN(Ranges), 1) as 'Min Range',
>                FORMAT((SUM(numonep) - SUM(numbottoms))/COUNT(*), 1) as
> 'NAs/Wafer',
>                mx.maxImage as maxImage,
>                mn.minImage as minImage,
>                lr.lrImage as lrlrImage,
>                cm.cmImage as cmcmImage
>            FROM (SELECT target_name_id,
>                         ep,
>                         wafer_id,
>                         COUNT(bottom) as numbottoms,
>                         AVG(bottom) as avgbottom,
>                         STD(bottom) as stdbottom,
>                         MAX(bottom) as maxbottom,
>                         MIN(bottom) as minbottom,
>                         MAX(date_time) as "LastRun",
>                         COUNT(*) as numonep,
>                         COUNT(DISTINCT target_name_id, ep, lot_id,
> data_file_id)-1 as reruns,
>                         COUNT(DISTINCT(lot_id)) as Lots,
>                         3*STD(bottom) as Wafer3Sigma,
>                         MAX(bottom) - MIN(bottom) as Ranges
>            FROM data_cst
>            WHERE TRUE
>          AND data_cst.target_name_id IN (775, 776, 777, 778, 779, 780,
> 45, 44, 116, 117, 118, 119, 120, 121)
>          AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04 23:59:59'
>          GROUP BY target_name_id, ep, wafer_id
>          HAVING count(*) < 999) q1
>                JOIN (SELECT data_cstimage.name as minImage,
>                             data_cst.bottom,
>                             data_cst.target_name_id,
>                             data_cst.date_time,
>                             data_cst.lot_id,
>                             data_cst.tool_id,
>                             data_cst.roiname,
>                             data_cst.recipe_id,
>                             data_cst.ep
>                      FROM data_cstimage, data_cst
>                      WHERE data_cstimage.id =
> data_cst.image_measurer_id) mn
>                      ON mn.bottom = q1.minbottom
>                 AND mn.target_name_id IN (775, 776, 777, 778, 779, 780,
> 45, 44, 116, 117, 118, 119, 120, 121)
>                 AND mn.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04 23:59:59'
>                JOIN (SELECT data_cstimage.name as maxImage,
>                             data_cst.bottom,
>                             data_cst.target_name_id,
>                             data_cst.date_time,
>                             data_cst.lot_id,
>                             data_cst.tool_id,
>                             data_cst.roiname,
>                             data_cst.recipe_id,
>                             data_cst.ep
>                      FROM data_cstimage, data_cst
>                      WHERE data_cstimage.id =
> data_cst.image_measurer_id) mx
>                      ON mx.bottom = q1.maxbottom
>                 AND mx.target_name_id IN (775, 776, 777, 778, 779, 780,
> 45, 44, 116, 117, 118, 119, 120, 121)
>                 AND mx.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04 23:59:59'
>                JOIN (SELECT data_cstimage.name as lrImage,
>                             data_cst.bottom,
>                             data_cst.target_name_id,
>                             data_cst.date_time,
>                             data_cst.lot_id,
>                             data_cst.tool_id,
>                             data_cst.roiname,
>                             data_cst.recipe_id,
>                             data_cst.ep
>                      FROM data_cstimage, data_cst
>                      WHERE data_cstimage.id =
> data_cst.image_measurer_id) lr
>                      ON lr.date_time = q1.LastRun
>                 AND lr.target_name_id IN (775, 776, 777, 778, 779, 780,
> 45, 44, 116, 117, 118, 119, 120, 121)
>                 AND lr.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04 23:59:59'
>                JOIN (SELECT data_cstimage.name as cmImage,
>                             data_cst.bottom,
>                             data_cst.target_name_id,
>                             data_cst.date_time,
>                             data_cst.lot_id,
>                             data_cst.tool_id,
>                             data_cst.roiname,
>                             data_cst.recipe_id,
>                             data_cst.ep
>                FROM data_cstimage, data_cst
>                WHERE data_cstimage.id = data_cst.image_measurer_id
>                 AND data_cst.target_name_id IN (775, 776, 777, 778,
> 779, 780, 45, 44, 116, 117, 118, 119, 120, 121)
>                 AND data_cst.date_time BETWEEN '2010-03-04 00:00:00'
> AND '2010-03-04 23:59:59'
>                GROUP BY target_name_id, ep
>                HAVING count(*) < 999
>                ORDER BY ABS(bottom - AVG(bottom))) cm
>                ON cm.target_name_id = q1.target_name_id
>                AND cm.ep = q1.ep
>                 AND cm.target_name_id IN (775, 776, 777, 778, 779, 780,
> 45, 44, 116, 117, 118, 119, 120, 121)
>                 AND cm.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04 23:59:59' ,
>                 data_target
>                 WHERE data_target.id = q1.target_name_id
>                 GROUP BY q1.target_name_id, q1.ep
>
>
> On Wed, Sep 19, 2012 at 3:27 PM, Larry Martell
> <larry.mart...@gmail.com> wrote:
> > OK, I think I have this working. The last join was this:
> >
> > JOIN (select id, target_name_id, ep,date_time from data_cst WHERE
> > target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44, 116, 117,
> > 118, 119, 120, 121) AND data_cst.date_time BETWEEN '2010-03-04
> > 00:00:00' AND '2010-03-04 23:59:59' GROUP BY  target_name_id, ep
> > HAVING count(*) < 999 ORDER BY  ABS(data_cst.bottom - AVG(bottom)))
> cm
> > on cm.target_name_id = q1.target_name_id and cm.ep = q1.ep and
> > cm.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04 23:59:59'
> >
> > I have do to more testing to ensure I'm getting the correct rows
> back.
> >
> > Thanks for all the help!
> >
> > On Wed, Sep 19, 2012 at 10:07 AM, Larry Martell
> <larry.mart...@gmail.com> wrote:
> >> On Tue, Sep 18, 2012 at 7:41 PM, Rick James <rja...@yahoo-inc.com>
> wrote:
> >>> SELECT  ((the appropriate id))          -- <--
> >>>     FROM  data_cst, rollup, data_target
> >>>     WHERE  data_target.name = rollup.Target
> >>>       AND  data_cst.ep = rollup.EP
> >>>       AND  data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
> '2010-03-04 23:59:59'
> >>>       AND  data_target.id = data_cst.target_name_id
> >>>     GROUP BY  rollup.Target, rollup.EP
> >>>     ORDER BY  ABS(data_cst.bottom - rollup.`Avg Bottom`)
> >>>     LIMIT  1               -- <--
> >>
> >> rollup is the name of the temp table that I was testing with. It
> >> contains the output of the original query. I don't have that in the
> >> real world. That's my issue - I can't figure out how to join with
> the
> >> result set and apply the order by to the join condition. What would
> I
> >> use in place of rollup to get the 'Avg Bottom'?
> >>
> >> Here's what my query looks like now:
> >>
> >> SELECT data_target.name as Target,
> >>        q1.ep as EP,
> >>        COUNT(*) as Wafers,
> >>        Lots,
> >>        SUM(numonep)/(COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns
> >> ELSE 0 END)) as 'Sites/Wafer',
> >>        MAX(LastRun) as "Last Run",
> >>        SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as Rerun,
> >>        COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END) as
> Runs,
> >>        avgbottom as "Avg Bottom",
> >>        3*stdbottom as "3 Sig",
> >>        maxbottom as Max,
> >>        minbottom as Min,
> >>        SUM(numonep) as Count, SUM(numonep) - SUM(numbottoms) as NAs,
> >>        100-((SUM(numonep) - SUM(numbottoms))/SUM(numonep)*100) as "%
> Success",
> >>        3*stdbottom/avgbottom as "3Sig/Avg",
> >>        AVG(avgbottom) as 'Wafer Avg',
> >>        AVG(Wafer3Sigma) as 'Wafer 3 Sigma',
> >>        AVG(Ranges) as 'Avg Range',
> >>        3*STD(Ranges) as '3Sig of Ranges',
> >>        MAX(Ranges) as 'Max Range',
> >>        MIN(Ranges) as 'Min Range', (SUM(numonep) -
> >> SUM(numbottoms))/COUNT(*) as 'NAs/Wafer',
> >>        mx.id as maxID,
> >>        mn.id as minID,
> >>        lr.id as lrID
> >> FROM (SELECT target_name_id,
> >>              ep,
> >>              wafer_id,
> >>              COUNT(bottom) as numbottoms,
> >>              AVG(bottom) as avgbottom,
> >>              STD(bottom) as stdbottom,
> >>              MAX(bottom) as maxbottom,
> >>              MIN(bottom) as minbottom,
> >>              MAX(date_time) as "LastRun",
> >>              COUNT(*) as numonep,
> >>              COUNT(DISTINCT target_name_id, ep, lot_id,
> >> data_file_id)-1 as reruns,
> >>              COUNT(DISTINCT(lot_id)) as Lots,
> >>              3*STD(bottom) as Wafer3Sigma,
> >>              MAX(bottom) - MIN(bottom) as Ranges
> >>       FROM data_cst
> >>       WHERE target_name_id IN (775, 776, 777, 778, 779, 780, 45, 44,
> >> 116, 117, 118, 119, 120, 121)
> >>       AND data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND
> >> '2010-03-04 23:59:59'
> >>       GROUP BY target_name_id, ep, wafer_id
> >>       HAVING count(*) < 999) q1
> >> join data_cst mn ON mn.bottom = q1.minbottom join data_cst mx on
> >> mx.bottom = q1.maxbottom join data_cst lr on lr.date_time =
> >> q1.LastRun, data_target WHERE data_target.id = q1.target_name_id
> >> GROUP BY q1.target_name_id, q1.ep;
> >>
> >>>
> >>>> -----Original Message-----
> >>>> From: Larry Martell [mailto:larry.mart...@gmail.com]
> >>>> Sent: Tuesday, September 18, 2012 2:57 PM
> >>>> To: Rick James
> >>>> Cc: peter.braw...@earthlink.net; mysql@lists.mysql.com
> >>>> Subject: Re: getting certain rows from a group by
> >>>>
> >>>> On Tue, Sep 18, 2012 at 4:01 PM, Rick James <rja...@yahoo-inc.com>
> >>>> wrote:
> >>>> > SELECT mn.*
> >>>> >        mx.*
> >>>> >> >    FROM ( SELECT @min = MIN(bottom),
> >>>> >> >                  @max = MAX(bottom), ... )
> >>>> >> >    JOIN data_cst mn  ON bottom = @min
> >>>> >> >    JOIN data_cst mx  ON bottom = @max
> >>>> >
> >>>> > That is, the FROM finds the value for the desired row.
> >>>> > The JOIN then gets to the rest of the fields.
> >>>> > Caveat:  If two rows have bottom = @min, it returns both.
> >>>>
> >>>>
> >>>> Thanks a lot Rick! This was super helpful. I've got it working for
> >>>> 3 of the 4 cases - min(bottom), max(bottom), and max(date_time).
> >>>> But I can't figure out how to work in the last case - where bottom
> >>>> is closest to avg(bottom). In an individual query I can get it
> with
> >>>> an order by, like
> >>>> this:
> >>>>
> >>>> SELECT rollup.Target, rollup.EP, rollup.`Avg Bottom`, data_cst.id,
> >>>> data_cst.bottom FROM data_cst, rollup, data_target WHERE
> >>>> data_target.name = rollup.Target AND data_cst.ep = rollup.EP AND
> >>>> data_cst.date_time BETWEEN '2010-03-04 00:00:00' AND '2010-03-04
> >>>> 23:59:59'
> >>>> AND data_target.id = data_cst.target_name_id GROUP BY
> >>>> rollup.Target, rollup.EP ORDER BY ABS(data_cst.bottom -
> rollup.`Avg
> >>>> Bottom`);
> >>>>
> >>>> Any way to work that into another join?
> >>>>
> >>>> Thanks!
> >>>>
> >>>> >
> >>>> >> -----Original Message-----
> >>>> >> From: Larry Martell [mailto:larry.mart...@gmail.com]
> >>>> >> Sent: Tuesday, September 18, 2012 12:54 PM
> >>>> >> To: Rick James
> >>>> >> Cc: peter.braw...@earthlink.net; mysql@lists.mysql.com
> >>>> >> Subject: Re: getting certain rows from a group by
> >>>> >>
> >>>> >> On Tue, Sep 18, 2012 at 2:05 PM, Rick James
> >>>> >> <rja...@yahoo-inc.com>
> >>>> >> wrote:
> >>>> >> > For single query...
> >>>> >> >
> >>>> >> > Plan A:
> >>>> >> > Repeat the SELECT as once for each (MIN, AVG, etc) as a
> >>>> >> > subquery
> >>>> in
> >>>> >> >    WHERE bottom = ( SELECT MIN(bottom) ... )
> >>>> >> >
> >>>> >> > Plan B:
> >>>> >> >    FROM ( SELECT @min = MIN(bottom),
> >>>> >> >                  @max = MAX(bottom), ... )
> >>>> >> >    JOIN data_cst mn  ON bottom = @min
> >>>> >> >    JOIN data_cst mx  ON bottom = @max
> >>>> >>
> >>>> >>
> >>>> >> Can you elaborate on this? I don't see how this will give me
> >>>> >> what I need. I'm not looking for the min or max bottom - I
> >>>> >> already have
> >>>> that
> >>>> >> - I'm looking for the row from each group that has the min and
> >>>> >> max bottom.
> >>>> >>
> >>>> >> > Plan C:
> >>>> >> > Get rid of 3rd party packages that eventually get in your way
> >>>> >> > instead
> >>>> >> of 'helping'.
> >>>> >>
> >>>> >> It's not django that is getting in the way. django is a
> >>>> >> fantastic framework for web development. It's just changing
> >>>> >> requirements. The original developer (who no longer works here)
> >>>> >> created a few
> >>>> different
> >>>> >> classes, and then based on the original requirements, chose to
> >>>> >> use the one that supports a single query. That worked then, but
> >>>> >> may not work now. There are other classes that allow multiple
> >>>> >> queries, but then you have to parse the data and stuff it into
> >>>> >> the context that will be picked up by the browser yourself. The
> >>>> >> single query class does that for you. I may yet go that way,
> but
> >>>> >> I'm trying to avoid a big rewrite if I can.
> >>>> >>
> >>>> >> >> -----Original Message-----
> >>>> >> >> From: Larry Martell [mailto:larry.mart...@gmail.com]
> >>>> >> >> Sent: Tuesday, September 18, 2012 4:57 AM
> >>>> >> >> To: peter.braw...@earthlink.net
> >>>> >> >> Cc: mysql@lists.mysql.com
> >>>> >> >> Subject: Re: getting certain rows from a group by
> >>>> >> >>
> >>>> >> >> On Tue, Sep 18, 2012 at 7:52 AM, Peter Brawley
> >>>> >> >> <peter.braw...@earthlink.net> wrote:
> >>>> >> >> > On 2012-09-18 5:53 AM, Larry Martell wrote:
> >>>> >> >> >>
> >>>> >> >> >> I have this query:
> >>>> >> >> >>
> >>>> >> >> >> SELECT data_target.name as Target,
> >>>> >> >> >>         q1.ep as EP,
> >>>> >> >> >>         COUNT(*) as Wafers,
> >>>> >> >> >>         Lots,
> >>>> >> >> >>         SUM(numonep)/(COUNT(*)+SUM(CASE WHEN reruns > 0
> >>>> >> >> >> THEN reruns ELSE 0 END)) as 'Sites/Wafer',
> >>>> >> >> >>         MAX(LastRun) as "Last Run",
> >>>> >> >> >>         SUM(CASE WHEN reruns > 0 THEN reruns ELSE 0 END)
> >>>> >> >> >> as
> >>>> >> Rerun,
> >>>> >> >> >>         COUNT(*)+SUM(CASE WHEN reruns > 0 THEN reruns
> ELSE
> >>>> >> >> >> 0
> >>>> >> >> >> END) as
> >>>> >> >> Runs,
> >>>> >> >> >>         avgbottom as "Avg Bottom",
> >>>> >> >> >>         3*stdbottom as "3 Sig",
> >>>> >> >> >>         maxbottom as Max,
> >>>> >> >> >>         minbottom as Min,
> >>>> >> >> >>         SUM(numonep) as Count,
> >>>> >> >> >>         SUM(numonep) - SUM(numbottoms) as NAs,
> >>>> >> >> >>         100-((SUM(numonep) -
> >>>> SUM(numbottoms))/SUM(numonep)*100)
> >>>> >> as
> >>>> >> >> "%
> >>>> >> >> >> Success",
> >>>> >> >> >>         3*stdbottom/avgbottom as "3Sig/Avg",
> >>>> >> >> >>         AVG(avgbottom) as 'Wafer Avg',
> >>>> >> >> >>         AVG(Wafer3Sigma) as 'Wafer 3 Sigma',
> >>>> >> >> >>         AVG(Ranges) as 'Avg Range',
> >>>> >> >> >>         3*STD(Ranges) as '3Sig of Ranges',
> >>>> >> >> >>         MAX(Ranges) as 'Max Range',
> >>>> >> >> >>         MIN(Ranges) as 'Min Range',
> >>>> >> >> >>         (SUM(numonep) - SUM(numbottoms))/COUNT(*) as
> >>>> 'NAs/Wafer'
> >>>> >> >> >> FROM (SELECT target_name_id,
> >>>> >> >> >>               ep,
> >>>> >> >> >>               wafer_id,
> >>>> >> >> >>               COUNT(bottom) as numbottoms,
> >>>> >> >> >>               AVG(bottom) as avgbottom,
> >>>> >> >> >>               STD(bottom) as stdbottom,
> >>>> >> >> >>               MAX(bottom) as maxbottom,
> >>>> >> >> >>               MIN(bottom) as minbottom,
> >>>> >> >> >>               MAX(date_time) as "LastRun",
> >>>> >> >> >>               COUNT(*) as numonep,
> >>>> >> >> >>               COUNT(DISTINCT target_name_id, ep, lot_id,
> >>>> >> >> >> data_file_id)-1 as reruns,
> >>>> >> >> >>               COUNT(DISTINCT(lot_id)) as Lots,
> >>>> >> >> >>               3*STD(bottom) as Wafer3Sigma,
> >>>> >> >> >>               MAX(bottom) - MIN(bottom) as Ranges
> >>>> >> >> >>        FROM data_cst
> >>>> >> >> >>        WHERE target_name_id IN (775, 776, 777, 778, 779,
> >>>> >> >> >> 780, 45,
> >>>> >> >> 44,
> >>>> >> >> >> 116, 117, 118, 119, 120, 121)
> >>>> >> >> >>        AND data_cst.date_time BETWEEN '2010-03-04
> 00:00:00'
> >>>> AND
> >>>> >> >> >> '2010-03-04 23:59:59'
> >>>> >> >> >>        GROUP BY target_name_id, ep, wafer_id
> >>>> >> >> >>        HAVING count(*) < 999) q1,
> >>>> >> >> >>       data_target
> >>>> >> >> >> WHERE data_target.id = target_name_id GROUP BY
> >>>> >> >> >> q1.target_name_id, q1.ep;
> >>>> >> >> >>
> >>>> >> >> >>
> >>>> >> >> >> This works fine. But now I need to get a certain column
> >>>> >> >> >> (image_measurer_id) with each row returned that
> >>>> >> >> >> corresponds to
> >>>> >> the
> >>>> >> >> >> row from the group that has bottom = Min(bottom), bottom
> =
> >>>> >> >> >> Max(bottom), bottom closest to Avg(bottom), and bottom
> >>>> >> >> >> from
> >>>> the
> >>>> >> >> >> row where date_time = Max(date_time).
> >>>> >> >> >>
> >>>> >> >> >> Is this even possible from one query?
> >>>> >> >> >
> >>>> >> >> >
> >>>> >> >> > Might be, but what's the importance of doing it as one
> query?
> >>>> >> >>
> >>>> >> >> Because it's part of a django based web app, and the class
> >>>> >> >> that this is part of only supports having a single query. To
> >>>> >> >> use multiple queries will require a fairly major rewrite of
> >>>> >> >> the
> >>>> server
> >>>> >> >> side of
> >>>> >> that app.
> >>>> >> >>
> >>>> >> >> > I'd start by
> >>>> >> >> > saving this result to a temp table and developing the new
> >>>> query.
> >>>> >> >> > When that's running, see if you can to optimise a query
> >>>> >> >> > built
> >>>> by
> >>>> >> >> > replacing the reference to the temp table with the
> original
> >>>> query.
> >>>> >> >>
> >>>> >> >> Thanks, I'll look into this.

RE: getting certain rows from a group by

Reply via email to