Re: grouping by the difference between values in rows

shawn l.green Wed, 15 Jan 2014 15:09:49 -0800

Hello Larry,

On 1/13/2014 2:43 AM, Peter Brawley wrote:


On 2014-01-12 9:13 PM, Larry Martell wrote:

On Sun, Jan 12, 2014 at 2:47 PM, Peter Brawley
<[email protected]> wrote:

On 2014-01-12 1:17 PM, Larry Martell wrote:

I've been asked to do something that I do not think is possible in SQL.

I have a query that has this basic form:

SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f
FROM t
GROUP BY a, b, c, d, f

x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or
10053.490, 2542.094).

The business issue is that if either x or y in 2 rows that are in the
same a, b, c, d group are within 1 of each other then they should be
grouped together. And to make it more complicated, the tolerance is
applied as a rolling continuum. For example, if the x and y in a set
of grouped rows are:

row 1: 1.5, 9.5
row 2: 2.4, 20.8
row 3: 3.3, 40.6
row 4: 4.2, 2.5
row 5: 5.1, 10.1
row 6: 6.0, 7.9
row 7: 8.0, 21.0
row 8: 100, 200

1 through 6 get combined because all their X values are within the
tolerance of some other X in the set that's been combined. 7's Y value
is within the tolerance of 2's Y, so that should be combined as well.
8 is not combined because neither the X or Y value is within the
tolerance of any X or Y in the set that was combined.

In python I can easily parse the data and identify the rows that need
to be combined, but then I've lost the ability to calculate the
average and std. The only way I can think of to do this is to remove
the grouping from the SQL and do all the grouping and aggregating
myself. But this query often returns 20k to 30k rows after grouping.
It could easily be 80k to 100k rows that I have to process if I remove
the grouping and I think that will be very slow.

Anyone have any ideas?


Could you compute the row-to-row values & write them to a temp table,
then
run the SQL that incorporates that result column?

I thought of temp tables, but I could not come up with a way to use
them for this. How can I apply the x/y tolerance grouping in sql?


Run the query you showed, saving the result to a temp table. In an sproc
or your preferred app language, do the row-to-row processing to generate
a new column in the temp table from the biz rules you outlined, now
query the revised temp table as desired.

This is a very interesting problem but I am not sure what to do with agrouping. To me this appears to be a two-dimensional,nearest-neighbor-type sorting problem.

The way you outline it you could end up with all of the elements in youranalysis plane in a single group depending on how closely they are toeach other. Are long strings of points and blobs (clusters of points) inyour 2-D search space (your plane is defined by your {a,b,c,d} tuple)acceptable results of your re-grouping process?

If I had to draw this out graphically, you are stamping each of your{x,y} points with a 2x2 boundary box (each point is in intersection inthe middle) and you want to know which sets of those boxes form acontiguous intersection area (they overlap).

Is that what you are looking for? One technique would be to start withPeter's suggestion. Then reprocess that list to generate an acyclicdirected graph (by eliminating the reverse matches from your set), thentracing down the tree. If point 2 links to point 7,then 7 also links to2 (both would be found by your nearness test. Eliminate any matcheswhere the first point appears after the second point from your list.


Yours,
--
Shawn Green
MySQL Senior Principal Technical Support Engineer
Oracle USA, Inc. - Hardware and Software, Engineered to Work Together.
Office: Blountville, TN

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql

Re: grouping by the difference between values in rows

Reply via email to