On Thu, Aug 20, 2015 at 5:02 PM, Greg Stark <st...@mit.edu> wrote:
> I haven't thought through the exponential
> growth carefully enough to tell if doubling the run size should
> decrease the number of passes linearly or by a constant number.
It seems that with 5 times the data that previously required ~30MB to avoid a multi-pass sort (where ~2300MB is required for an internal sort -- the benchmark query), it took ~60MB to avoid a multi-pass sort. I didn't determine either threshold exactly, since that takes too long, but it appears that, as predicted, every time the input size quadruples, the required amount of work_mem to avoid multiple passes only doubles. That will need to be verified more rigorously, but it looks that way.

> But you're right that seems to be less and less a realistic scenario.
> Times when users are really processing data sets that large nowadays
> they'll just throw it into Hadoop or BigQuery or whatever to get the
> parallelism of many cpus.

Or maybe Citus and the like. I'm not sure that even that's generally true, simply because sorting a huge amount of data is very expensive -- it's not really a "big data" thing, so to speak. Look at recent results on this site:

http://sortbenchmark.org

Last year's winning "Gray" entrant, TritonSort, used a huge parallel cluster of 186 machines, but only sorted 100TB. That's just over 500GB per node. Each node is a 32-core Intel Xeon EC2 instance with 244GB of memory and lots of SSDs. It seems like the point of the 100TB minimum rule in the "Gray" contest category is that that much data is practically impossible to fit entirely in memory (to avoid merging). Eventually, linearithmic growth becomes extremely painful, no matter how much processing power you have. It takes a while, though.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
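[Editor's note: the quadruple-input/double-memory relationship above can be sketched with a back-of-the-envelope model. This is not PostgreSQL's actual tuplesort code, and `tape_buffer` is a made-up stand-in for the per-run buffering constant; it only illustrates why the single-pass threshold should grow as the square root of the input size.]

```python
import math

def min_memory_for_one_pass(input_size, tape_buffer=1.0):
    """Toy model of an external sort: with work_mem M we produce about
    input_size / M initial runs, and one merge pass can combine roughly
    M / tape_buffer runs at once.  A single merge pass therefore suffices
    when input_size / M <= M / tape_buffer, i.e. when
    M >= sqrt(input_size * tape_buffer)."""
    return math.sqrt(input_size * tape_buffer)

# Quadrupling the input only doubles the memory needed for one pass:
for n in (2300, 4 * 2300, 16 * 2300):  # MB, loosely based on the benchmark size
    print(n, min_memory_for_one_pass(n))
```

Under this model the single-pass threshold is proportional to the square root of the input, which matches the observed pattern (4x data, 2x work_mem); the absolute numbers depend on the buffering constant, which the sketch does not try to calibrate.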