Re: [HACKERS] CUDA Sorting

Greg Smith Sun, 12 Feb 2012 23:27:29 -0800

On 02/11/2012 08:14 PM, Gaetano Mendola wrote:

The trend is to have servers capable of running CUDA, providing GPUs via external hardware (PCI Express interface with PCI Express switches); look, for example, at the PowerEdge C410x PCIe Expansion Chassis from Dell.

The C410x adds 16 PCIe slots to a server, housed inside a separate 3U enclosure. That's a completely sensible purchase if your goal is to build a computing cluster, where a lot of work is handed off to a set of GPUs. I think that's even less likely to be a cost-effective option for a database server. Adding a single dedicated GPU installed in a server to accelerate sorting is something that might be justifiable, based on your benchmarks. This is a much more expensive option than that, though. Details at http://www.dell.com/us/enterprise/p/poweredge-c410x/pd for anyone who wants to see just how big this external box is.

I did some experiments timing the sort done with CUDA and the sort done with pg_qsort:

                       CUDA      pg_qsort
33 million integers: ~ 900 ms,  ~ 6000 ms
1 million integers:  ~  21 ms,  ~  162 ms
100k integers:       ~   2 ms,  ~   13 ms

The CUDA time already includes the copy operations (host->device, device->host). As GPU I was using a C2050, and the CPU doing the pg_qsort was an Intel(R) Xeon(R) CPU X5650 @ 2.67GHz.

That's really interesting, and the X5650 is by no means a slow CPU. So this benchmark is providing a lot of CPU power yet still seeing over a 6X speedup in sort times. It sounds like the PCI Express bus has gotten fast enough that the time to hand data over and get it back again can easily be justified for medium to large sized sorts.
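As a quick check of the "over 6X" figure, the ratios can be recomputed from the timings quoted above. This is only arithmetic on the reported approximate numbers; the dictionary and function names are made up for illustration.

```python
# Timings in milliseconds, copied from the benchmark quoted above
# as (CUDA including copies, pg_qsort). All values are approximate.
timings_ms = {
    "33M integers": (900, 6000),
    "1M integers": (21, 162),
    "100k integers": (2, 13),
}


def speedup(cuda_ms, qsort_ms):
    """Return how many times faster the CUDA sort was than pg_qsort."""
    return qsort_ms / cuda_ms


speedups = {label: speedup(c, q) for label, (c, q) in timings_ms.items()}

for label, s in speedups.items():
    print(f"{label}: {s:.1f}x")
```

All three sizes come out between roughly 6.5x and 7.7x, so the speedup does not appear to collapse at the small end, at least within the precision of the reported timings.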

It would be helpful to take this patch and confirm whether it scales when used in parallel. The easiest way to do that would be to use the pgbench "-f" feature, which allows running an arbitrary number of copies of some query at once. Seeing whether this acceleration continues to hold as the number of clients increases would be a useful data point.
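A minimal sketch of how such a pgbench run could be set up. The table name sort_test, the column val, the database name bench, and the client counts are all hypothetical placeholders. The script only writes the custom query file and prints the command lines; it does not run pgbench itself.

```python
import shlex
import tempfile

# Hypothetical query: sort_test/val stand in for any table large enough
# that the ORDER BY forces a real sort.
SORT_QUERY = "SELECT val FROM sort_test ORDER BY val;\n"

# Hypothetical client counts to sweep over.
CLIENT_COUNTS = [1, 2, 4, 8, 16]


def write_script(query):
    """Write the query to a temporary .sql file usable with pgbench -f."""
    with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
        f.write(query)
        return f.name


def pgbench_command(script_path, clients, seconds=60, dbname="bench"):
    """Build a pgbench command line for a custom script.

    -n skips vacuuming the standard pgbench tables, -f names the custom
    script, -c sets the number of concurrent clients, -T the duration.
    """
    return ["pgbench", "-n", "-f", script_path,
            "-c", str(clients), "-T", str(seconds), dbname]


script_path = write_script(SORT_QUERY)
commands = [pgbench_command(script_path, c) for c in CLIENT_COUNTS]

for cmd in commands:
    print(shlex.join(cmd))
```

Running each printed command once with the patch and once without, then comparing the reported transactions per second at each client count, would show whether the speedup holds up as concurrency rises.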

Is it possible for you to break down where the time is being spent? For example, how much of this time is consumed in the GPU itself, compared to time spent transferring data between CPU and GPU? I'm also curious where the bottleneck is with this approach. If it's the speed of the PCI-E bus for smaller data sets, adding more GPUs may never be practical. If the bus can handle quite a few of these at once before it saturates, it might be possible to overload a single GPU. That seems like it would be really hard to reach for database sorting, though I can't really justify my gut feel for that being true.
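Until real profiling numbers are available, a rough upper bound on bus speed gives a feel for the transfer side. The sketch below assumes 4-byte integer keys and the theoretical PCI Express 2.0 rate of 500 MB/s per lane per direction; neither assumption comes from the benchmark, and achieved bandwidth is normally lower than the theoretical peak. It also ignores latency and kernel launch overhead.

```python
# Assumption: each sorted item is a plain 4-byte integer.
BYTES_PER_INT = 4

# Theoretical PCIe 2.0 rate: 5 GT/s with 8b/10b encoding gives
# 500 MB/s per lane per direction. Real transfers achieve less.
PCIE2_LANE_BYTES_PER_S = 500e6

# Total CUDA times in ms as reported in the benchmark (copies included).
reported_total_ms = {33_000_000: 900, 1_000_000: 21, 100_000: 2}


def round_trip_ms(n_items, lanes):
    """Estimate host->device plus device->host copy time in milliseconds."""
    bytes_one_way = n_items * BYTES_PER_INT
    bandwidth = lanes * PCIE2_LANE_BYTES_PER_S
    return 2 * bytes_one_way / bandwidth * 1000


for n_items, total_ms in reported_total_ms.items():
    for lanes in (16, 8):
        copy_ms = round_trip_ms(n_items, lanes)
        share = 100 * copy_ms / total_ms
        print(f"{n_items:>10} items, x{lanes:<2}: "
              f"~{copy_ms:.1f} ms copy, ~{share:.0f}% of {total_ms} ms")
```

Under these assumptions the copies would be a small share of the reported totals even on an X8 link, but that is only an estimate. Timing the copies and the kernel separately is what would actually answer the question.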

> I've never seen a PostgreSQL server capable of running CUDA, and I
> don't expect that to change.

That sounds like:

"I think there is a world market for maybe five computers."
- IBM Chairman Thomas Watson, 1943

Yes, and "640K will be enough for everyone", ha ha. (Having said the 640K thing is flat out denied by Gates, BTW, and no one has come up with proof otherwise.)

I think you've made an interesting case for this sort of acceleration now being useful for systems doing what's typically considered a data warehouse task. I regularly see servers waiting for far more than 13M integers to sort. And I am seeing a clear trend toward providing more PCI-E slots in servers now. Dell's R810 is the most popular single server model my customers have deployed in the last year, and it has 5 X8 slots in it. It's rare all 5 of those are filled. As long as a dedicated GPU works fine when dropped to X8 speeds, I know a fair number of systems where one of those could be added now.

There's another data point in your favor I didn't notice before your last e-mail. Amazon has a "Cluster GPU Quadruple Extra Large" node type that runs with NVIDIA Tesla hardware. That means the installed base of people who could consider CUDA is higher than I expected. To demonstrate how much that costs, provisioning a GPU-enabled reserved instance from Amazon for one year costs $2410 at "Light Utilization", giving a system with 22GB of RAM and 1690GB of storage. (I find the reserved prices easier to compare with dedicated hardware than the hourly ones.) That's halfway between the High-Memory Double Extra Large Instance (34GB RAM/850GB disk) at $1100 and the High-Memory Quadruple Extra Large Instance (64GB RAM/1690GB disk) at $2200. If someone could prove sorting was a bottleneck on their server, that isn't an unreasonable option to consider on a cloud-based database deployment.

I still think that an approach based on OpenCL is more likely to be suitable for PostgreSQL, which was part of why I gave CUDA low odds here. The points in favor of OpenCL are:

-Since you last posted, OpenCL compiling has switched to using LLVM as its standard compiler. Good PostgreSQL support for LLVM isn't far away. It looks to me like the compiler situation for CUDA requires their PathScale based compiler. I don't know enough about this area to say which compiling tool chain will end up being easier to deal with.

-Intel is making GPU support standard for OpenCL, as I mentioned before. NVIDIA will be hard pressed to compete with Intel for GPU acceleration once more systems supporting that enter the market.

-Easy availability of OpenCL on Mac OS X for development's sake. Lots of Postgres hackers have OS X systems, even though there aren't too many OS X database servers.

The fact that Amazon provides a way to crack the chicken/egg hardware problem immediately helps a lot, though; I don't even need a physical card here to test CUDA GPU acceleration on Linux now. With that data point, your benchmarks are good enough to say I'd be willing to help review a patch in this area here as part of the 9.3 development cycle. That may validate that GPU acceleration is useful, and then the next step would be considering how portable that will be to other GPU interfaces. I still expect CUDA will be looked back on as a dead end for GPU accelerated computing one day. Computing history is not filled with many single-vendor standards that competed successfully against Intel providing the same thing. AMD's x86-64 is the only example I can think of where Intel didn't win that sort of race, which happened (IMHO) only because Intel's Itanium failed to prioritize backwards compatibility highly enough.


--
Greg Smith   2ndQuadrant US    [email protected]   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

