On Sun, Jan 12, 2014 at 6:27 PM, Chris Angelico <ros...@gmail.com> wrote: > On Mon, Jan 13, 2014 at 6:23 AM, Larry Martell <larry.mart...@gmail.com> > wrote: >> I have an python app that queries a MySQL DB. The query has this form: >> >> SELECT a, b, c, d, AVG(e), STD(e), CONCAT(x, ',', y) as f >> FROM t >> GROUP BY a, b, c, d, f >> >> x and y are numbers (378.18, 2213.797 or 378.218, 2213.949 or >> 10053.490, 2542.094). >> >> The business issue is that if either x or y in 2 rows that are in the >> same a, b, c, d group are within 1 of each other then they should be >> grouped together. And to make it more complicated, the tolerance is >> applied as a rolling continuum. For example, if the x and y in a set >> of grouped rows are: >> >> row 1: 1.5, 9.5 >> row 2: 2.4, 20.8 >> row 3: 3.3, 40.6 >> row 4: 4.2, 2.5 >> row 5: 5.1, 10.1 >> row 6: 6.0, 7.9 >> row 7: 8.0, 21.0 >> row 8: 100, 200 >> >> 1 through 6 get combined because all their X values are within the >> tolerance of some other X in the set that's been combined. 7's Y value >> is within the tolerance of 2's Y, so that should be combined as well. >> 8 is not combined because neither the X or Y value is within the >> tolerance of any X or Y in the set that was combined. > > Trying to get my head around this a bit more. Are columns a/b/c/d > treated as a big category (eg type, brand, category, model), such that > nothing will ever be grouped that has any difference in those four > columns? If so, we can effectively ignore them and pretend we have a > table with exactly one set (eg stick a WHERE clause onto the query > that stipulates their values). Then what you have is this: > > * Aggregate based on proximity of x and y > * Emit results derived from e > > Is that correct?
There will be multiple groups of a/b/c/d. I simplified the query for the purposes of posting my question. There is a where clause with values that come from user input. None, any, or all of a, b, c, or d could be in the where clause. > So here's my way of writing it. > > * Subselect: List all values for x, in order, and figure out which > ones are less than the previous value plus one > * Subselect: Ditto, for y. > * Outer select: Somehow do an either-or group. I'm not quite sure how > to do that part, actually! > > A PGSQL window function would cover the two subselects - at least, I'm > fairly sure it would. I can't quite get the whole thing, though; I can > get a true/false flag that says whether it's near to the previous one > (that's easy), and creating a grouping column value should be possible > from that but I'm not sure how. > > But an either-or grouping is a bit trickier. The best I can think of > is to collect all the y values for each group of x values, and then if > any two groups 'overlap' (ie have points within 1.0 of each other), > merge the groups. That's going to be seriously tricky to do in SQL, I > think, so you may have to go back to Python on that one. > > My analysis suggests that, whatever happens, you're going to need > every single y value somewhere. So it's probably not worth trying to > do any grouping/aggregation in SQL, since you need to further analyze > all the individual data points. I can't think of any way better than > just leafing through the whole table (either in Python or in a stored > procedure - if you can run your script on the same computer that's > running the database, I'd do that, otherwise consider a stored > procedure to reduce network transfers) and building up mappings. > > Of course, "I can't think of a way" does not equate to "There is no > way". There may be some magic trick that I didn't think of, or some > arcane incantation that gets what you want. Who knows? If you can > produce an ASCII art Mandelbrot set [1] in pure SQL, why not this! > > ChrisA > > [1] http://wiki.postgresql.org/wiki/Mandelbrot_set Thanks for the reply. I'm going to take a stab at removing the group by and doing it all in python. It doesn't look too hard, but I don't know how it will perform. -- https://mail.python.org/mailman/listinfo/python-list