On Thursday, 7 May 2020 at 14:49:43 UTC, data pulverizer wrote:
After I ran the Julia code past the Julia community, they made some changes (using views rather than passing copies of the array) and their time has come down to ~2.5 seconds. The plot thickens.

I've also run the Chapel code past the Chapel programming language people and they've brought the time down to ~6.5 seconds. I've disallowed calling BLAS because I'm looking at the performance of the language implementations themselves rather than their ability to call other libraries.

So far the times are looking like this:

D:      ~ 1.5 seconds
Julia:  ~ 2.5 seconds
Chapel: ~ 6.5 seconds

I've been working on the Nim benchmark and have written a little set of byte-order functions for big -> little endian conversion (https://gist.github.com/dataPulverizer/744fadf8924ae96135fc600ac86c7060), which was fun to do and provides ntoh, hton, and related functions that can be applied to any basic type. I'm now writing a little matrix type in the same vein as the D matrix type I wrote; then comes the easy bit, which is writing the kernel matrix algorithm itself.

In the end I'll run the benchmark on data of various sizes. Currently I'm just running it on the (10,000 x 784) data set, which outputs a (10,000 x 10,000) matrix. I'll end up running (5,000 x 784), (10,000 x 784), (20,000 x 784), (30,000 x 784), (40,000 x 784), (50,000 x 784), and (60,000 x 784). Ideally I'd measure each one 100 times and plot confidence intervals, but I'll have to settle for measuring each one 3 times and taking the average, otherwise it would take too much time. I don't think that D will have it its own way for all the data sizes; from what I can see, Julia may do better at the largest data set, where SIMD may be a factor.

The data set sizes are not randomly chosen. In many common data science tasks (maybe > 90% of what data scientists currently work on), people work with data sets in this range or even smaller; the big data stuff is much less common unless you're working for Google (or another FANG) or a specialist startup. I remember running a kernel clustering task in oft-used "data science" languages (none of which I'm benchmarking here): it wasn't done after an hour and then hung and crashed, while the version I implemented in Julia was done in a minute. Calculating kernel matrices is the cornerstone of many kernel-based machine learning methods: kernel PCA, kernel clustering, SVMs, and so on. It's a pretty important thing to calculate and shows the potential of these languages in the data science field. I think an article like this is valid for people that implement numerical libraries. I'm also hoping to throw in C++ by way of comparison.

