Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Dean Rasheed
On 26 June 2013 22:48, Noah Misch n...@leadboat.com wrote:
 On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote:
 On 26 June 2013 19:32, Noah Misch n...@leadboat.com wrote:
  On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

  md5_agg() is well-defined and not cryptographically novel, and your use 
  case
  is credible.  However, not every useful-sometimes function belongs in 
  core; we
  mostly stick to functions with ubiquitous value and functions that would be
  onerous to implement externally.  (We do carry legacy stuff that wouldn't 
  make
  the cut today.)  In my estimation, md5_agg() does not make that cut.  The
  variety of intuitive md5_agg() definitions attested upthread doesn't bode 
  well
  for its broad applicability.  It could just as well live in an extension
  published on PGXN.  Mine is just one opinion, though; I won't be horrified 
  if
  the community wants an md5_agg() in core.

 I disagree with this though. I see md5_agg() as a natural extension of
 the already-in-core md5() functions and underlying code, satisfying a
 well-defined use-case, and providing a checksum comparable with
 externally generated md5 sums.

 All true, but I don't see those facts justifying core inclusion.

 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables. So I think that there is a case for having md5_agg()
 in core as an alternative to such hacks, while having more
 sophisticated crypto functions available as extensions.

 My nine Google hits for md5(string_agg included one Stack Overflow answer,
 several mirrors of that answer, and a few posts on this thread.


I found more than that using Google. It's not uncommon for people to
use Google and then pick the first accepted answer on Stack
Overflow, in which case they might well pick one of the answers here
[1] or here [2]. Or they might find this [3]. Or if they came to our
lists they might find this [4], or deduce from this [5] that it isn't
possible.

[1] 
http://stackoverflow.com/questions/4020033/how-can-i-get-a-hash-of-an-entire-table-in-postgresql

[2] 
http://stackoverflow.com/questions/13554333/finding-out-the-hash-value-of-a-group-of-rows

[3] https://ralphholz.de/?q=node/298

[4] http://www.postgresql.org/message-id/4e5f469c.5070...@compulab.co.il

[5] 
http://www.postgresql.org/message-id/e94d85500801301153u6b976e31m89e311c7134a0...@mail.gmail.com

I'd say there are clearly people who want it, and the nature of some
of those answers suggests to me that we ought to have a better answer
in core.


 I haven't looked at pgcrypto because, as I said, performance wasn't my
 primary criterion either, but removing the unnessary data copy just
 seemed like good sense.

 I'll take a look at it, and then as you and Peter suggest, look to
 split the patch into two. I think I will withdraw md5_total() because
 I was never entirely happy with that anyway.

 Understood; feel free to back off from any performance improvements in which
 you didn't wish to involve yourself.

 If we do end up moving forward with md5_agg(), I note that the pgcrypto
 version already has an initialize/accumulate/finalize API division.  A patch
 importing that code largely-unchanged would be easier to verify than a patch
 restructuring the src/backend/libpq/md5.c implementation.  The two patches
 would then be a use pgcrypto md5.c in core patch and an md5_agg() patch.


I'll take a look at it, although I think there is still the potential
for bugs either way.

What I've done with libpq's md5.c is just to rearrange the internal
pieces, without touching the core algorithm code or the higher level
callers. So the most likely types of bugs are boundary/size errors. If
I'd broken any of pg_md5_hash(), pg_md5_binary() or pg_md5_crypt(),
then I'd have broken them all. It's possible to get a reasonable level
of confidence in those changes using queries like this on HEAD and
with the patch:

SELECT md5(string_agg(md5(str) || repeat(' ', i), '')) FROM (
  SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,1) i,
 generate_series(0,i) j
   GROUP BY i
) t;

and these with the patch:

SELECT md5_agg(md5(str) || repeat(' ', i)) FROM (
  SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,1) i,
 generate_series(0,i) j
   GROUP BY i
) t;

SELECT md5_agg(md5_agg || repeat(' ', i)) FROM (
  SELECT i, md5_agg(chr(32+(i+j*31)%224))
FROM generate_series(0,1) i,
 generate_series(0,i) j
   GROUP BY i
) t;

which all produce the same overall sum.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Dean Rasheed
On 26 June 2013 21:46, Peter Eisentraut pete...@gmx.net wrote:
 On 6/26/13 4:04 PM, Dean Rasheed wrote:
 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables.

 The thread discussed several other options of checksumming tables that
 did not have the air of a crytographic offering, as Noah put it.


True but md5 has the advantage of being directly comparable with the
output of Unix md5sum, which would be useful if you loaded data from
external files and wanted to confirm that your import process didn't
mangle it.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Marko Kreen
On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed dean.a.rash...@gmail.com wrote:
 On 26 June 2013 21:46, Peter Eisentraut pete...@gmx.net wrote:
 On 6/26/13 4:04 PM, Dean Rasheed wrote:
 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables.

 The thread discussed several other options of checksumming tables that
 did not have the air of a crytographic offering, as Noah put it.


 True but md5 has the advantage of being directly comparable with the
 output of Unix md5sum, which would be useful if you loaded data from
 external files and wanted to confirm that your import process didn't
 mangle it.

The problem with md5_agg() is that it's only useful in toy scenarios.

It's more useful give people script that does same sum(hash(row))
on dump file than try to run MD5 on ordered rows.

Also, I don't think anybody actually cares about MD5(table-as-bytes), instead
people want way to check if 2 tables or table and dump are same.

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Robert Haas
On Thu, Jun 27, 2013 at 7:29 AM, Marko Kreen mark...@gmail.com wrote:
 On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed dean.a.rash...@gmail.com 
 wrote:
 On 26 June 2013 21:46, Peter Eisentraut pete...@gmx.net wrote:
 On 6/26/13 4:04 PM, Dean Rasheed wrote:
 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables.

 The thread discussed several other options of checksumming tables that
 did not have the air of a crytographic offering, as Noah put it.


 True but md5 has the advantage of being directly comparable with the
 output of Unix md5sum, which would be useful if you loaded data from
 external files and wanted to confirm that your import process didn't
 mangle it.

 The problem with md5_agg() is that it's only useful in toy scenarios.

 It's more useful give people script that does same sum(hash(row))
 on dump file than try to run MD5 on ordered rows.

 Also, I don't think anybody actually cares about MD5(table-as-bytes), instead
 people want way to check if 2 tables or table and dump are same.

I think you're trying to tell Dean to write the patch that you want
instead of the patch that he wants.  There are certainly other things
that could be done that some people might sometimes prefer, but that
doesn't mean what he did isn't useful.

That having been said, I basically agree with Noah: I think this would
be a useful extension (perhaps even in contrib?) but I don't think we
need to install it by default.  It's useful, but it's also narrow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Peter Eisentraut
On 6/27/13 4:19 AM, Dean Rasheed wrote:
 I'd say there are clearly people who want it, and the nature of some
 of those answers suggests to me that we ought to have a better answer
 in core.

It's not clear what these people wanted this functionality for.  They
all wanted to analyze a table to compare with another table (or the same
table later).  Either, they wanted this to detect data changes, in which
case the right tool is a checksum, not a cryptographic hash.  We already
have several checksum implementations in core, so we could expose on of
them.  Or they wanted this to protect their data from tampering, in
which case the right tool is a cryptographic hash, but Noah argues that
a sum of MD5 hashes is not cryptographically sound.  (And in any case,
we don't put cryptographic functionality into the core.)

The reason md5_agg is proposed here and in all those cited posts is
presumably because the md5() function was already there anyway.  The the
md5() function is there because the md5 code was already there anyway
because of the authentication.  Let's not add higher-order
already-there-anyway code. ;-)



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-27 Thread Dean Rasheed
On 27 June 2013 17:47, Peter Eisentraut pete...@gmx.net wrote:
 On 6/27/13 4:19 AM, Dean Rasheed wrote:
 I'd say there are clearly people who want it, and the nature of some
 of those answers suggests to me that we ought to have a better answer
 in core.

 It's not clear what these people wanted this functionality for.  They
 all wanted to analyze a table to compare with another table (or the same
 table later).  Either, they wanted this to detect data changes, in which
 case the right tool is a checksum, not a cryptographic hash.  We already
 have several checksum implementations in core, so we could expose on of
 them.  Or they wanted this to protect their data from tampering, in
 which case the right tool is a cryptographic hash, but Noah argues that
 a sum of MD5 hashes is not cryptographically sound.  (And in any case,
 we don't put cryptographic functionality into the core.)

 The reason md5_agg is proposed here and in all those cited posts is
 presumably because the md5() function was already there anyway.  The the
 md5() function is there because the md5 code was already there anyway
 because of the authentication.  Let's not add higher-order
 already-there-anyway code. ;-)


OK fair enough. It's looking like there are more people who don't want
md5_agg() in core, or want something different, than who do want it.
Also, if we're taking the view that the existing md5() function is
only for hashing passwords, then it's probably not worth trying to
optimise it.

At this stage it's probably best to mark this as returned with
feedback, and I'll consider the options for rewriting it, but not
during this commitfest.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-26 Thread Noah Misch
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
 I've been playing around with the idea of an aggregate that computes
 the sum of the md5 hashes of each of its inputs, which I've called
 md5_total() for now, although I'm not particularly wedded to that
 name. Comparing it with md5_agg() on a 100M row table (see attached
 test script) produces interesting results:
 
 SELECT md5_agg(foo.*::text)
   FROM (SELECT * FROM foo ORDER BY id) foo;
 
  50bc42127fb9b028c9708248f835ed8f
 
 Time: 92960.021 ms
 
 SELECT md5_total(foo.*::text) FROM foo;
 
  02faea7fafee4d253fc94cfae031afc43c03479c
 
 Time: 96190.343 ms
 
 Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
 result is longer) but it seems like it would be very useful for
 quickly comparing data in SQL, since its value is not dependent on the
 row-order making it easier to use and better performing if there is no
 usable index for ordering.

I took a look at this patch.  First, as Peter suggested upthread, it should
have been two patches: one to optimize calculateDigestFromBuffer() and
callees, another atop the first adding new SQL functions and adjusting the
infrastructure to meet their needs.

 + primarymd5_total/primary
 +/indexterm
 +functionmd5_total(replaceable 
 class=parameterexpression/replaceable)/function
 +   /entry
 +   entrytypetext/type or typebytea/type/entry
 +   entrytypetext/type/entry
 +   entry
 +sum of the MD5 hash values of the input values, independent of their
 +order

This is not specific enough.  We would need to provide either an algorithm
specification or a literature reference.  However, that just leads to the
problem that we should not put a literature-unrecognized cryptographic
algorithm into core PostgreSQL.  I realize that the use case inspiring this
patch wasn't concerned with cryptographic properties, but the association with
md5 immediately makes it a cryptography offering.

md5_agg() is well-defined and not cryptographically novel, and your use case
is credible.  However, not every useful-sometimes function belongs in core; we
mostly stick to functions with ubiquitous value and functions that would be
onerous to implement externally.  (We do carry legacy stuff that wouldn't make
the cut today.)  In my estimation, md5_agg() does not make that cut.  The
variety of intuitive md5_agg() definitions attested upthread doesn't bode well
for its broad applicability.  It could just as well live in an extension
published on PGXN.  Mine is just one opinion, though; I won't be horrified if
the community wants an md5_agg() in core.

 *** a/src/backend/libpq/md5.c
 --- b/src/backend/libpq/md5.c
 ***
 *** 1,14 
   /*
*  md5.c
*
 !  *  Implements  the  MD5 Message-Digest Algorithm as specified in
 !  *  RFC  1321.  This  implementation  is a simple one, in that it
 !  *  needs  every  input  byte  to  be  buffered  before doing any
 !  *  calculations.  I  do  not  expect  this  file  to be used for
 !  *  general  purpose  MD5'ing  of large amounts of data, only for
 !  *  generating hashed passwords from limited input.

In other words, performance wasn't a design criterion.  I wonder if we would
be wasting our time with a narrow performance change like removing the data
copy.  What's your opinion of copying pgcrypto's md5 implementation, which
already avoids the data copy?

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-26 Thread Dean Rasheed
On 26 June 2013 19:32, Noah Misch n...@leadboat.com wrote:
 On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
 I've been playing around with the idea of an aggregate that computes
 the sum of the md5 hashes of each of its inputs, which I've called
 md5_total() for now, although I'm not particularly wedded to that
 name. Comparing it with md5_agg() on a 100M row table (see attached
 test script) produces interesting results:

 SELECT md5_agg(foo.*::text)
   FROM (SELECT * FROM foo ORDER BY id) foo;

  50bc42127fb9b028c9708248f835ed8f

 Time: 92960.021 ms

 SELECT md5_total(foo.*::text) FROM foo;

  02faea7fafee4d253fc94cfae031afc43c03479c

 Time: 96190.343 ms

 Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
 result is longer) but it seems like it would be very useful for
 quickly comparing data in SQL, since its value is not dependent on the
 row-order making it easier to use and better performing if there is no
 usable index for ordering.

 I took a look at this patch.  First, as Peter suggested upthread, it should
 have been two patches: one to optimize calculateDigestFromBuffer() and
 callees, another atop the first adding new SQL functions and adjusting the
 infrastructure to meet their needs.


I didn't really set out with the aim of optimising the existing md5()
functions at all, but merely to restructure the code in a way that
would expose the necessary API for md5_agg(). As the timings up-thread
show, the performance gain is really very modest, although the memory
saving might be more of a factor in some setups. In fact, for the
sorts of queries shown, the vast majority of the time is spent
elsewhere, as can be seen by replacing md5() with a more trivial
function such as length().


 + primarymd5_total/primary
 +/indexterm
 +functionmd5_total(replaceable 
 class=parameterexpression/replaceable)/function
 +   /entry
 +   entrytypetext/type or typebytea/type/entry
 +   entrytypetext/type/entry
 +   entry
 +sum of the MD5 hash values of the input values, independent of their
 +order

 This is not specific enough.  We would need to provide either an algorithm
 specification or a literature reference.  However, that just leads to the
 problem that we should not put a literature-unrecognized cryptographic
 algorithm into core PostgreSQL.  I realize that the use case inspiring this
 patch wasn't concerned with cryptographic properties, but the association with
 md5 immediately makes it a cryptography offering.


Yes, I can totally understand that point-of-view. I wasn't really
intending to write md5_total() at the outset - it was more of an
intellectual exercise following some of the previous points raised. So
on reflection, I think I can also agree that that probably doesn't
belong in core.


 md5_agg() is well-defined and not cryptographically novel, and your use case
 is credible.  However, not every useful-sometimes function belongs in core; we
 mostly stick to functions with ubiquitous value and functions that would be
 onerous to implement externally.  (We do carry legacy stuff that wouldn't make
 the cut today.)  In my estimation, md5_agg() does not make that cut.  The
 variety of intuitive md5_agg() definitions attested upthread doesn't bode well
 for its broad applicability.  It could just as well live in an extension
 published on PGXN.  Mine is just one opinion, though; I won't be horrified if
 the community wants an md5_agg() in core.


I disagree with this though. I see md5_agg() as a natural extension of
the already-in-core md5() functions and underlying code, satisfying a
well-defined use-case, and providing a checksum comparable with
externally generated md5 sums.

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables. So I think that there is a case for having md5_agg()
in core as an alternative to such hacks, while having more
sophisticated crypto functions available as extensions.


 *** a/src/backend/libpq/md5.c
 --- b/src/backend/libpq/md5.c
 ***
 *** 1,14 
   /*
*  md5.c
*
 !  *  Implements  the  MD5 Message-Digest Algorithm as specified in
 !  *  RFC  1321.  This  implementation  is a simple one, in that it
 !  *  needs  every  input  byte  to  be  buffered  before doing any
 !  *  calculations.  I  do  not  expect  this  file  to be used for
 !  *  general  purpose  MD5'ing  of large amounts of data, only for
 !  *  generating hashed passwords from limited input.

 In other words, performance wasn't a design criterion.  I wonder if we would
 be wasting our time with a narrow performance change like removing the data
 copy.  What's your opinion of copying pgcrypto's md5 implementation, which
 already avoids the data copy?


I haven't looked at pgcrypto because, as I said, performance wasn't my
primary 

Re: [HACKERS] MD5 aggregate

2013-06-26 Thread Peter Eisentraut
On 6/26/13 4:04 PM, Dean Rasheed wrote:
 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables.

The thread discussed several other options of checksumming tables that
did not have the air of a crytographic offering, as Noah put it.

 So I think that there is a case for having md5_agg()
 in core as an alternative to such hacks, while having more
 sophisticated crypto functions available as extensions.

Well, in general, I'd rather see the sophisticated stuff in core and the
hacks in extensions.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-26 Thread Noah Misch
On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote:
 On 26 June 2013 19:32, Noah Misch n...@leadboat.com wrote:
  On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

  md5_agg() is well-defined and not cryptographically novel, and your use case
  is credible.  However, not every useful-sometimes function belongs in core; 
  we
  mostly stick to functions with ubiquitous value and functions that would be
  onerous to implement externally.  (We do carry legacy stuff that wouldn't 
  make
  the cut today.)  In my estimation, md5_agg() does not make that cut.  The
  variety of intuitive md5_agg() definitions attested upthread doesn't bode 
  well
  for its broad applicability.  It could just as well live in an extension
  published on PGXN.  Mine is just one opinion, though; I won't be horrified 
  if
  the community wants an md5_agg() in core.
 
 I disagree with this though. I see md5_agg() as a natural extension of
 the already-in-core md5() functions and underlying code, satisfying a
 well-defined use-case, and providing a checksum comparable with
 externally generated md5 sums.

All true, but I don't see those facts justifying core inclusion.

 A quick google search reveals several people asking for something like
 this, and people recommending md5(string_agg(...)) or
 md5(string_agg(md5(...))) based solutions, which are doomed to failure
 on larger tables. So I think that there is a case for having md5_agg()
 in core as an alternative to such hacks, while having more
 sophisticated crypto functions available as extensions.

My nine Google hits for md5(string_agg included one Stack Overflow answer,
several mirrors of that answer, and a few posts on this thread.

 I haven't looked at pgcrypto because, as I said, performance wasn't my
 primary criterion either, but removing the unnessary data copy just
 seemed like good sense.
 
 I'll take a look at it, and then as you and Peter suggest, look to
 split the patch into two. I think I will withdraw md5_total() because
 I was never entirely happy with that anyway.

Understood; feel free to back off from any performance improvements in which
you didn't wish to involve yourself.

If we do end up moving forward with md5_agg(), I note that the pgcrypto
version already has an initialize/accumulate/finalize API division.  A patch
importing that code largely-unchanged would be easier to verify than a patch
restructuring the src/backend/libpq/md5.c implementation.  The two patches
would then be a use pgcrypto md5.c in core patch and an md5_agg() patch.

Thanks,
nm

-- 
Noah Misch
EnterpriseDB http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Review [was Re: [HACKERS] MD5 aggregate]

2013-06-23 Thread Dean Rasheed
On 21 June 2013 21:04, David Fetter da...@fetter.org wrote:
 On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
 On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
  On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
   There seem to be 2 separate directions that this could go, which
   really meet different requirements:
  
   1). Produce an unordered sum for SQL to compare 2 tables regardless of
   the order in which they are scanned. A possible approach to this might
   be something like an aggregate
  
   md5_total(text/bytea) returns text
  
   that returns the sum of the md5 values of each input value, treating
   each md5 value as an unsigned 128-bit integer, and then producing the
   hexadecimal representation of the final sum. This should out-perform a
   solution based on numeric addition, and in typical cases, the result
   wouldn't be much longer than a regular md5 sum, and so would be easy
   to eyeball for differences.
  
 
  I've been playing around with the idea of an aggregate that computes
  the sum of the md5 hashes of each of its inputs, which I've called
  md5_total() for now, although I'm not particularly wedded to that
  name. Comparing it with md5_agg() on a 100M row table (see attached
  test script) produces interesting results:
 
  SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;
 
   50bc42127fb9b028c9708248f835ed8f
 
  Time: 92960.021 ms
 
  SELECT md5_total(foo.*::text) FROM foo;
 
   02faea7fafee4d253fc94cfae031afc43c03479c
 
  Time: 96190.343 ms
 
  Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
  result is longer) but it seems like it would be very useful for
  quickly comparing data in SQL, since its value is not dependent on the
  row-order making it easier to use and better performing if there is no
  usable index for ordering.
 
  Note, however, that if there is an index that can be used for
  ordering, the performance is not necessarily better than md5_agg(), as
  this example shows. There is a small additional overhead per row for
  initialising the MD5 sums, and adding the results to the total, but I
  think the biggest factor is that md5_total() is processing more data.
  The reason is that MD5 works on 64-byte blocks, so the total amount of
  data going through the core MD5 algorithm is each row's size is
  rounded up to a multiple of 64. In this simple case it ends up
  processing around 1.5 times as much data:
 
  SELECT sum(length(foo.*::text)) AS md5_agg,
 sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
 
md5_agg   |  md5_total
  +-
   8103815438 | 12799909248
 
  although of course that overhead won't be as large on wider tables,
  and even in this case the overall performance is still on a par with
  md5_agg().
 
  ISTM that both aggregates are potentially useful in different
  situations. I would probably typically use md5_total() because of its
  simplicity/order-independence and consistent performance, but
  md5_agg() might also be useful when comparing with external data.
 
  Regards,
  Dean

 Performance review (skills needed: ability to time performance)

 Does the patch slow down simple tests?

 Yes.  For an MD5 checksum of some random data, here are
 results from master:

 shackle@postgres:5493=# WITH t1 AS (SELECT 
 string_agg(chr(floor(255*random()+1)::int),'') AS a FROM 
 generate_series(1,1)),
 postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN 
 generate_series(1,1))
 postgres-# select md5(a::text) FROM t2;
 shackle@postgres:5493=# \timing
 Timing is on.
 shackle@postgres:5493=# \g
 Time: 955.393 ms
 shackle@postgres:5493=# \g
 Time: 950.909 ms
 shackle@postgres:5493=# \g
 Time: 947.955 ms
 shackle@postgres:5493=# \g
 Time: 946.193 ms
 shackle@postgres:5493=# \g
 Time: 950.591 ms
 shackle@postgres:5493=# \g
 Time: 952.346 ms
 shackle@postgres:5493=# \g
 Time: 948.623 ms
 shackle@postgres:5493=# \g
 Time: 939.819 ms

 and here from master + the patch:

 WITH t1 AS (SELECT 
 string_agg(chr(floor(255*random()+1)::int),'') AS a FROM 
 generate_series(1,1)),
 t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1))
 select md5(a::text) FROM t2;
 Time: 1129.178 ms
 shackle@postgres:5494=# \g
 Time: 1134.013 ms
 shackle@postgres:5494=# \g
 Time: 1124.387 ms
 shackle@postgres:5494=# \g
 Time: 1119.733 ms
 shackle@postgres:5494=# \g
 Time: 1104.906 ms
 shackle@postgres:5494=# \g
 Time: 1137.055 ms
 shackle@postgres:5494=# \g
 Time: 1128.977 ms
 

Review [was Re: [HACKERS] MD5 aggregate]

2013-06-21 Thread David Fetter
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
 On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
  There seem to be 2 separate directions that this could go, which
  really meet different requirements:
 
  1). Produce an unordered sum for SQL to compare 2 tables regardless of
  the order in which they are scanned. A possible approach to this might
  be something like an aggregate
 
  md5_total(text/bytea) returns text
 
  that returns the sum of the md5 values of each input value, treating
  each md5 value as an unsigned 128-bit integer, and then producing the
  hexadecimal representation of the final sum. This should out-perform a
  solution based on numeric addition, and in typical cases, the result
  wouldn't be much longer than a regular md5 sum, and so would be easy
  to eyeball for differences.
 
 
 I've been playing around with the idea of an aggregate that computes
 the sum of the md5 hashes of each of its inputs, which I've called
 md5_total() for now, although I'm not particularly wedded to that
 name. Comparing it with md5_agg() on a 100M row table (see attached
 test script) produces interesting results:
 
 SELECT md5_agg(foo.*::text)
   FROM (SELECT * FROM foo ORDER BY id) foo;
 
  50bc42127fb9b028c9708248f835ed8f
 
 Time: 92960.021 ms
 
 SELECT md5_total(foo.*::text) FROM foo;
 
  02faea7fafee4d253fc94cfae031afc43c03479c
 
 Time: 96190.343 ms
 
 Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
 result is longer) but it seems like it would be very useful for
 quickly comparing data in SQL, since its value is not dependent on the
 row-order making it easier to use and better performing if there is no
 usable index for ordering.
 
 Note, however, that if there is an index that can be used for
 ordering, the performance is not necessarily better than md5_agg(), as
 this example shows. There is a small additional overhead per row for
 initialising the MD5 sums, and adding the results to the total, but I
 think the biggest factor is that md5_total() is processing more data.
 The reason is that MD5 works on 64-byte blocks, so the total amount of
 data going through the core MD5 algorithm is each row's size is
 rounded up to a multiple of 64. In this simple case it ends up
 processing around 1.5 times as much data:
 
 SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
 
   md5_agg   |  md5_total
 +-
  8103815438 | 12799909248
 
 although of course that overhead won't be as large on wider tables,
 and even in this case the overall performance is still on a par with
 md5_agg().
 
 ISTM that both aggregates are potentially useful in different
 situations. I would probably typically use md5_total() because of its
 simplicity/order-independence and consistent performance, but
 md5_agg() might also be useful when comparing with external data.
 
 Regards,
 Dean

Submission review:

Is the patch in a patch format which has context? (eg: context diff format)

Yes.

Does it apply cleanly to the current git master?

Yes.

Does it include reasonable tests, necessary doc patches, etc? 

Yes.

Usability review:

Does the patch actually implement that?

Yes.

Do we want that?

Enough do.

Do we already have it?

No.

Does it follow SQL spec, or the community-agreed behavior?

No, and yes, respectively.

Does it include pg_dump support (if applicable)?

N/A

Are there dangers?

Patch changes the MD5 implementation, which could
theoretically result in backward incompatibility if the
changes are not 100% backward-compatible.

Have all the bases been covered? 

Yes.

Feature test:

Does the feature work as advertised?

Yes.

Are there corner cases the author has failed to consider?

Not that I've found so far.

Are there any assertion failures or crashes?

No.

Performance review (skills needed: ability to time performance)

Does the patch slow down simple tests?

Yes.  For an MD5 checksum of some random data, here are
results from master:

shackle@postgres:5493=# WITH t1 AS (SELECT 
string_agg(chr(floor(255*random()+1)::int),'') AS a FROM 
generate_series(1,1)),
postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN 
generate_series(1,1))
postgres-# select md5(a::text) FROM t2;
shackle@postgres:5493=# \timing 
Timing is on.
shackle@postgres:5493=# \g
Time: 955.393 ms
shackle@postgres:5493=# \g
Time: 950.909 ms
shackle@postgres:5493=# \g
Time: 947.955 ms
shackle@postgres:5493=# \g
Time: 946.193 ms
shackle@postgres:5493=# \g
Time: 950.591 ms
shackle@postgres:5493=# \g
Time: 952.346 ms
  

Re: Review [was Re: [HACKERS] MD5 aggregate]

2013-06-21 Thread David Fetter
On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
 On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
  On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
   There seem to be 2 separate directions that this could go, which
   really meet different requirements:
  
   1). Produce an unordered sum for SQL to compare 2 tables regardless of
   the order in which they are scanned. A possible approach to this might
   be something like an aggregate
  
   md5_total(text/bytea) returns text
  
   that returns the sum of the md5 values of each input value, treating
   each md5 value as an unsigned 128-bit integer, and then producing the
   hexadecimal representation of the final sum. This should out-perform a
   solution based on numeric addition, and in typical cases, the result
   wouldn't be much longer than a regular md5 sum, and so would be easy
   to eyeball for differences.
  
  
  I've been playing around with the idea of an aggregate that computes
  the sum of the md5 hashes of each of its inputs, which I've called
  md5_total() for now, although I'm not particularly wedded to that
  name. Comparing it with md5_agg() on a 100M row table (see attached
  test script) produces interesting results:
  
  SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;
  
   50bc42127fb9b028c9708248f835ed8f
  
  Time: 92960.021 ms
  
  SELECT md5_total(foo.*::text) FROM foo;
  
   02faea7fafee4d253fc94cfae031afc43c03479c
  
  Time: 96190.343 ms
  
  Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
  result is longer) but it seems like it would be very useful for
  quickly comparing data in SQL, since its value is not dependent on the
  row-order making it easier to use and better performing if there is no
  usable index for ordering.
  
  Note, however, that if there is an index that can be used for
  ordering, the performance is not necessarily better than md5_agg(), as
  this example shows. There is a small additional overhead per row for
  initialising the MD5 sums, and adding the results to the total, but I
  think the biggest factor is that md5_total() is processing more data.
  The reason is that MD5 works on 64-byte blocks, so the total amount of
  data going through the core MD5 algorithm is each row's size is
  rounded up to a multiple of 64. In this simple case it ends up
  processing around 1.5 times as much data:
  
  SELECT sum(length(foo.*::text)) AS md5_agg,
 sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
  
md5_agg   |  md5_total
  +-
   8103815438 | 12799909248
  
  although of course that overhead won't be as large on wider tables,
  and even in this case the overall performance is still on a par with
  md5_agg().
  
  ISTM that both aggregates are potentially useful in different
  situations. I would probably typically use md5_total() because of its
  simplicity/order-independence and consistent performance, but
  md5_agg() might also be useful when comparing with external data.
  
  Regards,
  Dean

 Performance review (skills needed: ability to time performance)
 
 Does the patch slow down simple tests?
 
 Yes.  For an MD5 checksum of some random data, here are
 results from master:
 
 shackle@postgres:5493=# WITH t1 AS (SELECT 
 string_agg(chr(floor(255*random()+1)::int),'') AS a FROM 
 generate_series(1,1)),
 postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN 
 generate_series(1,1))
 postgres-# select md5(a::text) FROM t2;
 shackle@postgres:5493=# \timing 
 Timing is on.
 shackle@postgres:5493=# \g
 Time: 955.393 ms
 shackle@postgres:5493=# \g
 Time: 950.909 ms
 shackle@postgres:5493=# \g
 Time: 947.955 ms
 shackle@postgres:5493=# \g
 Time: 946.193 ms
 shackle@postgres:5493=# \g
 Time: 950.591 ms
 shackle@postgres:5493=# \g
 Time: 952.346 ms
 shackle@postgres:5493=# \g
 Time: 948.623 ms
 shackle@postgres:5493=# \g
 Time: 939.819 ms
 
 and here from master + the patch:
 
 WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') 
 AS a FROM generate_series(1,1)),
 t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1))
 select md5(a::text) FROM t2;
 Time: 1129.178 ms
 shackle@postgres:5494=# \g
 Time: 1134.013 ms
 shackle@postgres:5494=# \g
 Time: 1124.387 ms
 shackle@postgres:5494=# \g
 Time: 1119.733 ms
 shackle@postgres:5494=# \g
 Time: 1104.906 ms
 shackle@postgres:5494=# \g
 Time: 1137.055 ms
 shackle@postgres:5494=# \g
 Time: 1128.977 ms
 shackle@postgres:5494=# \g

Re: [HACKERS] MD5 aggregate

2013-06-17 Thread Dean Rasheed
On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
 There seem to be 2 separate directions that this could go, which
 really meet different requirements:

 1). Produce an unordered sum for SQL to compare 2 tables regardless of
 the order in which they are scanned. A possible approach to this might
 be something like an aggregate

 md5_total(text/bytea) returns text

 that returns the sum of the md5 values of each input value, treating
 each md5 value as an unsigned 128-bit integer, and then producing the
 hexadecimal representation of the final sum. This should out-perform a
 solution based on numeric addition, and in typical cases, the result
 wouldn't be much longer than a regular md5 sum, and so would be easy
 to eyeball for differences.


I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
  FROM (SELECT * FROM foo ORDER BY id) foo;

 50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

 02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
   sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

  md5_agg   |  md5_total
+-
 8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean


md5_agg_v2.patch
Description: Binary data


md5-100m-row-test.sql
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-17 Thread Marko Kreen
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
 On 15 June 2013 10:22, Dean Rasheed dean.a.rash...@gmail.com wrote:
  There seem to be 2 separate directions that this could go, which
  really meet different requirements:
 
  1). Produce an unordered sum for SQL to compare 2 tables regardless of
  the order in which they are scanned. A possible approach to this might
  be something like an aggregate
 
  md5_total(text/bytea) returns text
 
  that returns the sum of the md5 values of each input value, treating
  each md5 value as an unsigned 128-bit integer, and then producing the
  hexadecimal representation of the final sum. This should out-perform a
  solution based on numeric addition, and in typical cases, the result
  wouldn't be much longer than a regular md5 sum, and so would be easy
  to eyeball for differences.
 
 
 I've been playing around with the idea of an aggregate that computes
 the sum of the md5 hashes of each of its inputs, which I've called
 md5_total() for now, although I'm not particularly wedded to that
 name. Comparing it with md5_agg() on a 100M row table (see attached
 test script) produces interesting results:
 
 SELECT md5_agg(foo.*::text)
   FROM (SELECT * FROM foo ORDER BY id) foo;
 
  50bc42127fb9b028c9708248f835ed8f
 
 Time: 92960.021 ms
 
 SELECT md5_total(foo.*::text) FROM foo;
 
  02faea7fafee4d253fc94cfae031afc43c03479c
 
 Time: 96190.343 ms
 
 Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
 result is longer) but it seems like it would be very useful for
 quickly comparing data in SQL, since its value is not dependent on the
 row-order making it easier to use and better performing if there is no
 usable index for ordering.
 
 Note, however, that if there is an index that can be used for
 ordering, the performance is not necessarily better than md5_agg(), as
 this example shows. There is a small additional overhead per row for
 initialising the MD5 sums, and adding the results to the total, but I
 think the biggest factor is that md5_total() is processing more data.
 The reason is that MD5 works on 64-byte blocks, so the total amount of
 data going through the core MD5 algorithm is each row's size is
 rounded up to a multiple of 64. In this simple case it ends up
 processing around 1.5 times as much data:
 
 SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
 
   md5_agg   |  md5_total
 +-
  8103815438 | 12799909248
 
 although of course that overhead won't be as large on wider tables,
 and even in this case the overall performance is still on a par with
 md5_agg().
 
 ISTM that both aggregates are potentially useful in different
 situations. I would probably typically use md5_total() because of its
 simplicity/order-independence and consistent performance, but
 md5_agg() might also be useful when comparing with external data.

Few notes:

- Index-scan over whole table is *very* bad for larger tables (few times
  bigger than available RAM).  If you want to promote such use you should
  also warn against use on loaded server.

- It's pointless to worry about overflow on 128-bit ints.  Just let it
  happen.  Adding complexity for that does not bring any advantage.

- Using some faster 128-bit hash may be useful - check out CityHash
  or SpookyHash.  You can get C implementation from pghashlib.

-- 
marko



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-15 Thread Dean Rasheed
On 13 June 2013 10:35, Dean Rasheed dean.a.rash...@gmail.com wrote:
 Hi,

 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows. This is
 something I've wished for a number of times. I think the primary use
 case is to do a quick check that 2 tables, possibly on different
 servers, contain the same data, using a query like

   SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

 or

   SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

2). Produce an ordered MD5 sum compatible with COPY, whose result
would match that of running unix md5sum on the COPY output. Given all
the possible COPY options that would affect the result (format,
delimiters, headers, quoting, escaping, ...), I think that such a
thing would only reasonably be possible as an extension to the COPY
command itself.

I guess in its simplest form this would just be a new option MD5 to
COPY that would cause it to pipe its output to the md5 aggregator and
then send the final sum to the COPY destination at the end (e.g.,
COPY foo TO STDOUT MD5 would produce the ordered MD5 sum of the data
in foo).


I still think my original md5_agg() has its uses, since what it
produces is comparable with external md5 sums, and is directly
available to SQL, but (1) is probably the most useful for quickly
comparing 2 tables. I'm much less convinced about the value of (2),
but on the face of it, it doesn't seem like it would be hard to
implement.

Thoughts?

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Marko Kreen
On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows. This is
 something I've wished for a number of times. I think the primary use
 case is to do a quick check that 2 tables, possibly on different
 servers, contain the same data, using a query like

   SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

 or

   SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

 these would be equivalent to

   SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

 but without the excessive memory consumption for the intermediate
 concatenated string, and the resulting 1GB table size limit.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

-- 
marko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Tom Lane
Marko Kreen mark...@gmail.com writes:
 On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com 
 wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows.

 It's more efficient to calculate per-row md5, and then sum() them.
 This avoids the need for ORDER BY.

Good point.  The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Benedikt Grundmann
On Fri, Jun 14, 2013 at 2:14 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Marko Kreen mark...@gmail.com writes:
  On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com
 wrote:
  Attached is a patch implementing a new aggregate function md5_agg() to
  compute the aggregate MD5 sum across a number of rows.

  It's more efficient to calculate per-row md5, and then sum() them.
  This avoids the need for ORDER BY.

 Good point.  The aggregate md5 function also fails to distinguish the
 case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
 from the case where they contain 'xyz' followed by 'zyxyz'.

 Now, as against that, you lose any sensitivity to the ordering of the
 values.

 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.

 regards, tom lane


xor works but only if each row is different (e.g. at the very least all
columns together make a unique key).





 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Stephen Frost
* Tom Lane (t...@sss.pgh.pa.us) wrote:
 Marko Kreen mark...@gmail.com writes:
  On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com 
  wrote:
  Attached is a patch implementing a new aggregate function md5_agg() to
  compute the aggregate MD5 sum across a number of rows.
 
  It's more efficient to calculate per-row md5, and then sum() them.
  This avoids the need for ORDER BY.
 
 Good point.  The aggregate md5 function also fails to distinguish the
 case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
 from the case where they contain 'xyz' followed by 'zyxyz'.
 
 Now, as against that, you lose any sensitivity to the ordering of the
 values.
 
 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out.  Yes, that means
we'd need a way to get back what would this row look like if it was
sent through COPY with these parameters, but I've long wanted that
also.

No, no clue about how to put all that together.  Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Andrew Dunstan


On 06/14/2013 09:40 AM, Stephen Frost wrote:

* Tom Lane (t...@sss.pgh.pa.us) wrote:

Marko Kreen mark...@gmail.com writes:

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com wrote:

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

Good point.  The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out.  Yes, that means
we'd need a way to get back what would this row look like if it was
sent through COPY with these parameters, but I've long wanted that
also.

No, no clue about how to put all that together.  Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)






I'd rather go the other way, processing the records without having to 
process them otherwise at all. Turning things into text must slow things 
down, surely.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Stephen Frost
* Andrew Dunstan (and...@dunslane.net) wrote:
 I'd rather go the other way, processing the records without having
 to process them otherwise at all. Turning things into text must slow
 things down, surely.

That's certainly an interesting idea also..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Dean Rasheed
On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Marko Kreen mark...@gmail.com writes:
 On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed dean.a.rash...@gmail.com 
 wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows.

 It's more efficient to calculate per-row md5, and then sum() them.
 This avoids the need for ORDER BY.

 Good point.  The aggregate md5 function also fails to distinguish the
 case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
 from the case where they contain 'xyz' followed by 'zyxyz'.


Well, if you aggregated foo.*::text as in my original example, then
the textual representation of the row would protect you from that. But
yes, if you were just doing it with a single text column that might be
a risk.


 Now, as against that, you lose any sensitivity to the ordering of the
 values.

 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.


But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Tom Lane
Dean Rasheed dean.a.rash...@gmail.com writes:
 On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.

 But this would be a much riskier thing to do with a single column,
 because if you updated multiple rows in the same way (e.g., UPDATE t
 SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
 there were an even number of matches.

I was implicitly thinking that the sum would be a modulo sum so that the
final result is still the size of an md5 signature.  If that's true,
then leaking bits via carry out is just as bad as xor's deficiencies.
Now, you could certainly make it a non-modulo sum and not lose any
information to carries, if you're willing to do the arithmetic in
NUMERIC and have a variable-width result.  Sounds a bit slow though.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Dean Rasheed
On 14 June 2013 15:19, Stephen Frost sfr...@snowman.net wrote:
 * Andrew Dunstan (and...@dunslane.net) wrote:
 I'd rather go the other way, processing the records without having
 to process them otherwise at all. Turning things into text must slow
 things down, surely.

 That's certainly an interesting idea also..


md5_agg(record) ?

Yes, I like it.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Andres Freund
On 2013-06-14 15:49:31 +0100, Dean Rasheed wrote:
 On 14 June 2013 15:19, Stephen Frost sfr...@snowman.net wrote:
  * Andrew Dunstan (and...@dunslane.net) wrote:
  I'd rather go the other way, processing the records without having
  to process them otherwise at all. Turning things into text must slow
  things down, surely.
 
  That's certainly an interesting idea also..
 
 
 md5_agg(record) ?
 
 Yes, I like it.

It's more complex than just memcmp()ing HeapTupleData though. At least
if the Datum contains varlena columns there's so many different
representations (short, long, compressed, external, external compressed)
of the same data that a md5 without normalizing that wouldn't be very
interesting.
So you would at least need a normalizing version of
toast_flatten_tuple() that also deals with short/long varlenas. But even
after that, you would still need to deal with Datums that can have
different representation (like short numerics, old style hstore, ...).

It might be more realistic to use the binary output functions, but I am
not sure whether all of those are sufficiently reproduceable.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Hannu Krosing
On 06/14/2013 04:47 PM, Tom Lane wrote:
 Dean Rasheed dean.a.rash...@gmail.com writes:
 On 14 June 2013 14:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Personally I'd be a bit inclined to xor the per-row md5's rather than
 sum them, but that's a small matter.
 But this would be a much riskier thing to do with a single column,
 because if you updated multiple rows in the same way (e.g., UPDATE t
 SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
 there were an even number of matches.
 I was implicitly thinking that the sum would be a modulo sum so that the
 final result is still the size of an md5 signature.  If that's true,
 then leaking bits via carry out is just as bad as xor's deficiencies.
 Now, you could certainly make it a non-modulo sum and not lose any
 information to carries, if you're willing to do the arithmetic in
 NUMERIC and have a variable-width result.  Sounds a bit slow though.
What skytools/pgq/londiste uses for comparing tables on master
and slave is query like this

select sum(hashtext(t.*::text)) from yourtable t;

This is non-modulo sum and does not use md5 but relies on
whatever the hashtext() du jour is :)

So it is not comparable to anything external (like the md5sum
compatible idea above) but is usually good enough for fast
checks of compatible tables.

As tables are unordered by definition anyway, this should be
good enough for most SQL.

The speed comes from both fast(er) hashtext() function and
avoiding the sort.

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Dean Rasheed
On 14 June 2013 16:09, Hannu Krosing ha...@2ndquadrant.com wrote:
 What skytools/pgq/londiste uses for comparing tables on master
 and slave is query like this

 select sum(hashtext(t.*::text)) from yourtable t;

 This is non-modulo sum and does not use md5 but relies on
 whatever the hashtext() du jour is :)

 So it is not comparable to anything external (like the md5sum
 compatible idea above) but is usually good enough for fast
 checks of compatible tables.

 As tables are unordered by definition anyway, this should be
 good enough for most SQL.

 The speed comes from both fast(er) hashtext() function and
 avoiding the sort.


That sounds like a pretty good approach. We could do that if we had a
version of md5() that returned numeric. My impression is that numeric
computations are pretty fast compared to the sorting overhead.

On the other hand, if there is a usable index, select md5_agg(..) from
(sub-query) will do and index scan rather than a sort, making it much
faster than using an ORDER BY in the aggregate.

Regards,
Dean


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Craig Ringer

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 06/14/2013 09:40 PM, Stephen Frost wrote:
 Where I'd take this is actually in a completely different direction..
 I'd like the aggregate to be able to match the results of running the
 'md5sum' unix utility on a file that's been COPY'd out.
Until I started looking at the follow-up discussion I didn't realise it
wasn't supposed to.

If it isn't the md5sum of the ordered rows, it shouldn't be called
'md5'. It might still be useful, but please don't call it md5.

The proposals to make it produce the same result with different row
orderings sound useful in the context of SQL; if it produced a different
result with different orderings I'd want a way to force an explicit
ORDER BY clause on the aggregate and error out if one wasn't present.
Making it ignore order means that it's no longer md5, though.

row_sums(...) ?


- -- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.13 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJRu+XMAAoJELBXNkqjr+S2mkkH/j8gi8d07dI6+G742f0U+v0J
u8DGhtDQuuWHalqlaUDOssmi4fRDg99OzLlR+Mid0yGL/UfFMoL47H+kNRoMkuzV
stUz3vf5rp8TbqEnikT3EwEKIuzaWrae0Fn3TKIYXVSRVvWjGzRSZsvJZsdfcS7T
7lZ9sf6QGekT9bAi6BIFsG7Z1bFLb6Q6AeTsX04++dLBCrjm96CSyisBswY5J2qg
zD0WrK6IOsSn9ljlIZRGSTtP+tdM5mOi/DHdeEd+glGx5YKQ9t9yq++oayoqb9mp
hPtsBo6UwcMaylPA2vQnhVi0q2bl9FMa+QGpaWBe6YfXLPF4PWhET2OixkRat1w=
=ncT/
-END PGP SIGNATURE-



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-14 Thread Craig Ringer
On 06/13/2013 05:35 PM, Dean Rasheed wrote:
 Hi,

 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows. This is
 something I've wished for a number of times. I think the primary use
 case is to do a quick check that 2 tables, possibly on different
 servers, contain the same data, using a query like

   SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

 or

   SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

That's a very useful thing to be able to do, but I'm hesitant to make
the fact that it uses md5 too prominent in the name if it doesn't
produce a result that an external user could reasonably expect from
md5'ing the same data.

I imagine having an md5_agg(text) and md5(bytea) that was the more
efficient, streaming equivalent of:

md5(string_agg(the_col,''))

would be rather handy.

It'd be less useful for other types (floats, integers, etc) unless we
had a way to get the binary representations of those in a well defined
form, like int8le(1) . Casting to 'text' would be sufficient for most of
the purposes I can imagine, though, and for those that it wouldn't
things would quickly get so complicated that you'd want to be using a
PL/something function anyway.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] MD5 aggregate

2013-06-13 Thread Dean Rasheed
Hi,

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

  SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

  SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

these would be equivalent to

  SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

but without the excessive memory consumption for the intermediate
concatenated string, and the resulting 1GB table size limit.

I've added 2 variants: md5_agg(text) and md5_agg(bytea) to match the 2
variants of md5(), so pure binary data can also be checksummed.

In passing, I've tidied up and optimised the code in md5.c a bit ---
specifically I've removed the malloc()/memcpy()/free() code that was
unnecessarily making a copy of the entire input data just to pad it
and append the bit count. This reduces the memory consumption of the
existing md5() functions for large inputs, and gives a modest
performance boost. As a result, the md5() function can no longer throw
an out-of-memory error.

Regards,
Dean


md5_agg.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] MD5 aggregate

2013-06-13 Thread Peter Eisentraut
On 6/13/13 5:35 AM, Dean Rasheed wrote:
 Attached is a patch implementing a new aggregate function md5_agg() to
 compute the aggregate MD5 sum across a number of rows.

That seems somewhat useful.

 In passing, I've tidied up and optimised the code in md5.c a bit ---
 specifically I've removed the malloc()/memcpy()/free() code that was
 unnecessarily making a copy of the entire input data just to pad it
 and append the bit count. This reduces the memory consumption of the
 existing md5() functions for large inputs, and gives a modest
 performance boost. As a result, the md5() function can no longer throw
 an out-of-memory error.

I think it would be better if you split this into two separate patches.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers