On Tue, Jun 07, 2011 at 05:38:09PM +0200, Sidney Cadot scratched on the wall:
> > Please someone show me what is the correct value of this avg() in practice.

> SELECT AVG(data) FROM TryAvg ORDER BY ABS(data) DESC;
> 
> But apparently, SQLite ignores the ORDER BY clause because of the
> AVG() function, perhaps -erroneously- assuming AVG() is commutative.

  No, this is basic SQL order of operations.  The ORDER BY is applied
  *after* the aggregation.  That SELECT returns one row, so the ORDER
  BY does nothing.

  Actually... the ORDER BY doesn't even make sense.  You're asking it
  to ORDER the AVG() result by the last "data" value in the GROUP,
  only you have no idea which actual "data" value that is.  Most DBs
  consider this statement to be an error.

> This works, but isn't guaranteed to work:
> 
> SELECT AVG(data) FROM (SELEFT data FROM TryAvg ORDER BY ABS(data) DESC);

  No, it isn't guaranteed to work, because ORDER BY only makes sense for
  output.  Row order within SELECTs are undefined, and that remains true
  of a sub-SELECT, even if it contains an ORDER BY.

> Funnily enough, in Postgres, the test seems to expose an actual bug,
> or at least a strange error message:
> 
> sidney=# select AVG(data) from TryAvg ORDER BY data DESC;
> ERROR:  column "tryavg.data" must appear in the GROUP BY clause or be
> used in an aggregate function
> LINE 1: select AVG(data) from TryAvg ORDER BY data DESC;

  Not a bug, just as the message says...  You can't ORDER BY something
  you didn't GROUP BY, or wasn't the output of an aggregate.
  Using an aggregate without a GROUP BY implies a group across the
  whole selection.  Unlike SQLite, PostgreSQL will not let you attempt
  to ORDER BY undefined data.  Just remember the AVG() is computed,
  grouped rows are collapsed, *then* the ORDER BY is applied.

  I like this aspect of SQLite, but many people consider it questionable.
  Personally I'd rather have the freedom to shoot myself in the foot.

> Now, to address your actual point. The fact that we cannot control
> evaluation order in SQL does mean that one has to be wary in case
> one's data is badly conditioned. The problem is in SQL: it assumes
> commutativity for aggregate functions, and that is a property that no
> floating point format conceived can deliver.

  Which really an issue with floating-point.  Addition *is*
  commutative, as any math teacher can tell you.  The fact that 
  it isn't (nor are a lot of other basic arithmetic principals) when
  using a floating-point representation is an issue with the
  representation.  This very fact is one of the main reasons
  traditional RDBMS systems have always kept away from using FP for
  real numbers-- especially in financial operations, where every penny
  counts, no matter how large the sum.

> This is an interesting
> fact (as is the fact that, as a consequence of this, identical
> invocations of aggregate functions can yield different results under
> seemingly identical circumstances in SQL), but I disagree that this
> disqualifies using SQL for use in a scientific setting. There are
> preciously little types of measurements where one has to calculate
> with 16 orders of magnitude as your example does, and in such
> circumstances, the scientist needs to be exceedingly on his or her
> guard to know what she is doing in ANY language, with ANY storage
> mechanism. I think therefore that this example, interesting as it is,
> does not provide a conclusive argument in the discussion at hand.

  Which sounds more or less like "using the right tool for the job."
  If you're aware enough to understand these kinds of issues**
  and if you truly need IEEE 754 down to the bit, you need to be
  writing to the raw hardware and doing all your data manipulation
  yourself.  If you're doing common sense operations with values in
  a similar number domain, SQL, SQLite, Perl, Python, or whatever--
  despite not presenting a 100% IEEE 754 environment-- will generally
  do something "close enough" to the right thing that nobody cares.

  Even if you do care, careful ordering of FP operations isn't so much
  about getting the "right" answer (in most cases) as it is about
  tightening the "close enough" until it falls back into the "nobody
  cares" range.

  ** In my own experiences working at NCSA, two national labs, and
     several university HPC projects, most people doing "scientific
     computing" don't have a damn clue.  Most of the "scientific
     computing" code I saw was written by domain grad students or
     interns that would have a hard time writing Hello World without
     their notes.  If it is a physics program, chances are still
     better than not it is written in FORTRAN.  Error bars?  What are
     those?

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
 but showing it to the wrong people has the tendency to make them
 feel uncomfortable." -- Angela Johnson
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to