Re: Order rows before applying GROUP BY / DISTINCT

Remi Mikalsen Tue, 05 Oct 2004 10:56:16 -0700

Ok... I've taken a close look at the various options.

* The use of a temporary table would solve the problem. However, I am very uncertain 
of the 
efficiency of the use of temporary tables the way you suggest. I would need 6 queries 
to the 
database, plus creation of a new table based on another that might contain more than 
1.000.000 records. Of course, the content of the temporary table would be very 
reduced, 
probably never more than a couple of thousands records. Is it possible to estimate the 
time 
needed for these operations? Can I be certain that this is more efficient than using 
two 
queries with scripting in between, the way I described before? I guess the main 
difference 
consists in 4 extra queries (creating, dropping, locking and unlocing tables) vs. 
reading 
1.000.000 records with PHP. I would believe that using MySQL is always better, but I 
am 
not sure; have never had the chance to compare these things on big tables.


* I'm running version 4.0.21, which rules out suggestion 2 and 3.
    - this is what I would have done things using any SQL-99 compliant DBS.

* The 4th suggestion also solves my problem; this one was completely new to me. But I 
realise that some creative mind came up with this due to the lack of a reasonable 
solution in 
MySQL. Would it be more or less efficient than the 1st? I'm not very keen on the idea 
of 
locking tables, so if the efficiency penalty using this solution is relatively small 
(<15-20%) 
compared to the first option, and as long as it is faster than my current solution, I 
would 
prefer it.

Thanks a lot for all these pointers!


Remi Mikalsen

E-Mail: [EMAIL PROTECTED]
URL:    http://www.iMikalsen.com

On 5 Oct 2004 at 9:45, [EMAIL PROTECTED] wrote:

> You are correct, there is no predicate you can use to pull the refurl 
> value from the same row that the MAX(entry) value comes from . That will 
> require a second processing stage. Depending on what version of MySQL you 
> are running, you have some options:
> 
> *store the results of this MAX() query in a temp table and join those 
> results back to the userlog table to get the other fields you need (like 
> refurl) where the dates and user match.
> 
> *use the "anonymous view" feature to skip the step of creating a temp 
> table (that's where you surround a query with parentheses and use it like 
> any other table in an outer query. Don't forget to alias your "view"). 
> This functionality is a side effect of the parsing engine's treatment of 
> parentheses in a query and is available as of 4.1 (it arrived with the 
> UNION statement) 
> 
> *You may be able to use a subquery to get matching non-aggregated values.
> 
> *You may be able to use the MAX-CONCAT trick as described in the manual.
> 
> Shawn Green
> Database Administrator
> Unimin Corporation - Spruce Pine
> 
> "Remi Mikalsen" <[EMAIL PROTECTED]> wrote on 10/04/2004 05:21:44 PM:
> 
> > If you aren't tired of this problem, I have one more question...
> > 
> > The last query worked well, except one small detail... 
> > SELECT user, refurl, max( entry ) AS most_recent, count( user ) AS 
> visits 
> > FROM userlog 
> > WHERE user != '' 
> > GROUP BY user 
> > ORDER BY most_recent DESC 
> > LIMIT 10 
> > 
> > I am now including ONE extra field: "refurl". MySQL doesn't grab the
> > refurl from the SAME 
> > row as it grabs the last entry time! I would like to be able to get 
> > this extra field (and maybe 
> > others too), and as they are potentially different from time to 
> > time, just any of them won't do 
> > it.
> > 
> > I could probably get around the problem with a join on the time 
> > column (maybe I should 
> > create an index to enhance efficiency?), but I wonder if there is a 
> > more efficient way to solve 
> > this? Additionally, IF there were two entries at the same time 
> > (which is highly improbable, 
> > and not critical at all if it happens once in a million) I might not
> > get the correct result anyway. 
> > Maybe I'm being picky, but while I'm at it it might as well get well 
> done. 
> > 
> > Thank you for the help so far anyway!! 
> > 
> > 
> > Remi Mikalsen 
> > 
> > E-Mail:   [EMAIL PROTECTED] 
> > URL:   http://www.iMikalsen.com 
> > 
> > 
> > On 4 Oct 2004 at 10:22, [EMAIL PROTECTED] wrote: 
> > 
> > > 
> > > 
> > > I would be very surprised if this query fails as you say it does. 
> > > 
> > > select user, max(time) as most_recent, count(user) as visits 
> > > from userlog where user!='' 
> > > group by user 
> > > order by most_recent desc 
> > > limit 10; 
> > > 
> > > This represents the last 10 users to sign in, when they signed in 
> > last, and many times they 
> > have 
> > > signed in so far. I may not know everything but unless you have 
> > some problem with the 
> > data so 
> > > that your datetime field "time" is not acting the same for all 
> > visitors, this should work. 
> > Please 
> > > describe why this data is wrong for your question. It is entirely 
> > likely I misunderstood what 
> > you 
> > > wanted to find in your data. That would make my suggestions wrong 
> > (my apologies if so!!) 
> > > 
> > > Shawn Green 
> > > Database Administrator 
> > > Unimin Corporation - Spruce Pine 
> > > 
> > > "Remi Mikalsen" <[EMAIL PROTECTED]>wrote on 10/04/2004 10:11:09 AM: 
> > > 
> > > > Thank you for the answer! However, it doesn't seem to solve my 
> > > > problem. Running the query 
> > > > without modifications repeated users (because of the "group by user, 
> 
> > > > HOST"). I removed the 
> > > > HOST, and ran the query over again. Now it returned unique users, 
> > > > but it still didn't return 
> > > > the LAST login of them many logins for each user. 
> > > > 
> > > > Currently I am using a new implementation with two queries and PHP 
> > > > in between. I won't 
> > > > write all the code here, just explain the basics. 
> > > > 
> > > > 1. A query gets the total number of logins for each user 
> > > > 2. PHP retrieves the logins and "orders" them with a perfect hash 
> > > >    - I create an array with indexes correspondig to usernames 
> > > >    - the values in the array are the numbers of logins 
> > > > corresponding to each username 
> > > >    - this way I have practically no overhead when accessing the 
> > > array 
> > > >        for example: $users['a_username'] = 10; /* 10 logins 
> > > > by user a_username */ 
> > > > 3. Another query gets all logins ordered by the time column, last 
> > > first 
> > > > 4. PHP reads the results. When a user is found that has more than 0 
> > > > logins in the $users 
> > > > array, I print the user and the amount of logins, and set the value 
> > > > in the $users array to 0. 
> > > > This guarantees only printing a user once, and only printing the 
> > > > last login due to ordering by 
> > > > the time field. 
> > > > 
> > > > But this also has obvious drawbacks as to performance. I may have to 
> 
> > > > retrieve 10.000 rows 
> > > > from the userlog table to get 10 unique users, or I might never have 
> 
> > > > 10 unique users in the 
> > > > table, even with 1.000.000.000 records! I just never know. I also 
> > > > have to make two queries, 
> > > > which gives a small overhead. Finally, using scripting to interprete 
> 
> > > > intermediate results is 
> > > > much slower than MySQL. All in all, this just doesn't seem to be a 
> > > > very scaleable solution. 
> > > > 
> > > > 
> > > > 
> > > > Remi Mikalsen 
> > > > 
> > > > E-Mail:  [EMAIL PROTECTED] 
> > > > URL:  http://www.iMikalsen.com 
> > > > 
> > > > 
> > > > 
> > > > On 4 Oct 2004 at 9:14, [EMAIL PROTECTED] wrote: 
> > > > 
> > > > > What you are looking for is the MAX() of the Time column (most 
> > > recent 
> > > > > entry) 
> > > > > 
> > > > > select user, host, max(time) as time, count(user) as visits 
> > > > > from userlog where user!='' 
> > > > > group by user, host 
> > > > > order by time desc 
> > > > > limit 10; 
> > > > > 
> > > > > The issue you ran into is caused by an non-standard SQL extension 
> > > created 
> > > > > by MySQL. Its behavior is documented here: 
> > > > > 
> > > > > http://dev.mysql.com/doc/mysql/en/GROUP-BY-hidden-fields.html 
> > > > > 
> > > > > <SOAPBOX ON> 
> > > > > IMHO, this extension of the SQL spec has caused more problems than 
>  
> > > it has 
> > > > > helped. I believe a better extension would have been to add a new 
> > > GROUP BY 
> > > > > predicate like FIRST() or LAST() so that the behavior would be 
> > > explicit 
> > > > > and controllable. It is even documented that the value returned is 
>  
> > > > > potentially random 
> > > > > 
> > > > > "Do not use this feature if the columns you omit from the GROUP BY 
>  
> > > part 
> > > > > are not unique in the group! You will get unpredictable results." 
> > > > > 
> > > > > I know it's too late to take this behavior out of the older 
> > > versions of 
> > > > > MySQL but is there any way to prevent its inclusion in future 
> > > versions? 
> > > > > Are there other options out there to "fix" this non-deterministic 
> > > > > behavior? I do not believe that educating the community to _avoid_ 
>  
> > > a 
> > > > > feature is a viable option. If it were, I do not think that we 
> > > would need 
> > > > > to explain this strange behavior to new users as often as we do. 
> > > > > <SOAPBOX OFF> 
> > > > > 
> > > > > Shawn Green 
> > > > > Database Administrator 
> > > > > Unimin Corporation - Spruce Pine 
> > > > > 
> > > > > 
> > > > > "Remi Mikalsen" <[EMAIL PROTECTED]> wrote on 10/01/2004 06:41:32 
> > > PM: 
> > > > > 
> > > > > > Hello. 
> > > > > > 
> > > > > > I'm having a problem where I seem to need to order a table 
> > > before 
> > > > > > applying group by (or 
> > > > > > distinct?) in my query. 
> > > > > > 
> > > > > > Here is a simplified table structure example: 
> > > > > > ID USER HOST TIME 
> > > > > > 
> > > > > > ID = Primary Key 
> > > > > > 
> > > > > > I would like to do the following in ONE query, if possible: 
> > > > > > I am looking to retrieve the LAST time 10 UNIQUE users were 
> > > > > > registered in the table 
> > > > > > (user+host+time). These users should be the last 10 people to be 
>  
> > > > > > inserted into the table 
> > > > > > (each user can appear various times in the table, like in a 
> > > log). At 
> > > > > > the same time, I would like 
> > > > > > to retrieve the TOTAL NUMBER of times each of these users appear 
>  
> > > in 
> > > > > > the table, but this is 
> > > > > > not very important. 
> > > > > > 
> > > > > > This was the query I adopted until noticing it had a severe 
> > > problem: 
> > > > > > 
> > > > > > select user, host, time, count(user) as times 
> > > > > > from userlog where user!='' 
> > > > > > group by user 
> > > > > > order by time desc 
> > > > > > limit 10; 
> > > > > > 
> > > > > > The problem is that the TIME associated with each person isn't 
> > > the 
> > > > > > LAST TIME a registry 
> > > > > > was done for the user. This makes me think that I might need to 
> > > > > > order the TIME column 
> > > > > > before doing the GROUP BY, but I do no know how (and it might 
> > > not 
> > > > > > even be the solution to 
> > > > > > the problem!). 
> > > > > > 
> > > > > > I do not know if I managed to express myself very well, but if 
> > > > > > anyone is willing to help, I 
> > > > > > would of course clarify things if necessary. 
> > > > > > 
> > > > > > 
> > > > > > Remi Mikalsen 
> > > > > > 
> > > > > > E-Mail:  [EMAIL PROTECTED] 
> > > > > > URL:  http://www.iMikalsen.com 


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: Order rows before applying GROUP BY / DISTINCT

Reply via email to