On Tue, 21 May 2019 at 18:55, Nicolas Cellier <nicolas.cellier.aka.n...@gmail.com> wrote:
> I have updated Smallapack to version 1.6.1 so as to accelerate sum.
>
> | a b c |
> a := LapackSGEMatrix randNormal: #(10000 1).
> b := a as: FloatArray.
> c := a asAbstractMatrix.
> {a sum. b sum. c sum.}.
> {[a sum] bench. [b sum] bench. [c sum] bench.}.
> '27,500 per second. 36.3 microseconds per run.'  "LapackMatrix"
> '117,000 per second. 8.54 microseconds per run.'  "FloatArray"
> '1,180 per second. 845 microseconds per run.'  "Un-accelerated AbstractMatrix"
>
> I measured with single precision for everything so as to have fair comparisons.
> As you can see, LapackMatrix sum is still slower than FloatArray sum.
> This is because we have to create a Matrix of ones first (xLASET) before calling the BLAS dot-product primitive (xDOTU).
>
> In Squeak, a profile can be obtained thru (I guess it is not much different in Pharo):
> AndreasSystemProfiler spyOn: [[a sum] bench].
>
> which gives:
> **Leaves**
> 37.19 (1,865) LapackSLibrary xlasetWithuplo:m:n:alpha:beta:a:lda:length:
> 27.12 (1,360) BlasSLibrary dotF2CWithn:x:incx:y:incy:
> 9.31 (467) Behavior basicNew:
> 2.42 (121) ByteString class compare:with:collated:
> 1.43 (72) ByteString hashWithInitialHash:
>
> Note that this is un-accelerated BLAS (no Intel Math Kernel Library or other accelerated version), but that does not make much difference for these BLAS level-1 functions (those with cost O(N)).
>
> But we can avoid that cost and compute all the cumulative sums, or some of them, at once with a single Matrix operation (xTRMM / xGEMM):
>
> | a b c |
> a := LapackSGEMatrix randUniform: #(10000 1).
> b := LapackSGEMatrix nrow: 10000 ncol: 10000 withAll: 1. "huge matrix, that's not cheap, don't even bench it!"
> "cumsum"
> c := b lowerTriangle * a.
>
> | a b c |
> a := LapackSGEMatrix randUniform: #(10000 1).
> b := LapackSGEMatrix nrow: 10000 ncol: 10000 withAll: 1. "huge matrix, that's not cheap, don't even bench it!"
> "cumsum, 1 out of 100"
> c := (b lowerTriangle atRows: (100 to: 10000 by: 100)) * a.
>
> | a b c |
> a := LapackSGEMatrix randUniform: #(10000 1).
> "cheaper construction of the partial cumsum multiplier"
> [b := LapackSGEMatrix rows: ((100 to: a nrow by: 100) collect: [:n |
>     (LapackSGEMatrix nrow: 1 ncol: n withAll: 1.0) , (LapackSGEMatrix nrow: 1 ncol: a nrow - n)]).
> c := b * a.] bench.
>
> '3.5 per second. 286 milliseconds per run.'
>
> So as you see, with some moderate effort we have computed 100 partial cumulative sums in about 286 ms, that's 2.86 ms per cumsum on average, not too bad.
> Maybe not as straightforward as numpy, and maybe still not as fast, but not completely off the pace.
>
> Oops, I posted too fast! ms are not µs! [c := b * a] bench gives '1,690 per second. 593 microseconds per run.', that's about 6 µs per sum; but including the construction of the multiplier it is way too much (it's not accelerated!).
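The xLASET cost described above is a construction cost: if the row of ones is kept around and reused, repeated sums only pay for the BLAS product. A minimal sketch of that idea, reusing the class and message names from the snippets above; this is only an illustration of the point, not Smallapack's actual implementation of sum, and it has not been benchmarked here:

| a ones |
a := LapackSGEMatrix randNormal: #(10000 1).
"Build the 1 x 10000 multiplier of ones once, outside the benchmark."
ones := LapackSGEMatrix nrow: 1 ncol: a nrow withAll: 1.0.
"Each sum is then a single matrix product (xGEMM/xDOTU territory), plus taking the first element of the 1x1 result."
[(ones * a) at: 1] bench.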
> On Tue, 21 May 2019 at 12:51, Jimmie Houchin <jlhouc...@gmail.com> wrote:
>
>> Not a problem. I greatly respect other people's time and priorities and their personal lives.
>>
>> Just for the record, I am using 64-bit Pharo on a fast i7, 16 GB RAM laptop running Xubuntu 18.04 64-bit.
>>
>> I do not remember any problems loading. And within the small amount of experimenting that I did, it seemed to operate fine.
>>
>> Again, thanks for your contribution. I know it is a lot of work and a pretty large area to cover. Python/Numpy has armies of people working on this.
>>
>> Jimmie
>>
>> On 5/21/19 2:54 AM, Nicolas Cellier wrote:
>>
>> Hi Jimmie,
>> I didn't take time yesterday to analyze your specific example because it was quite late, but here are some remarks:
>> 1) First, I recommend using 64-bit Pharo, because number crunching and Float operations will be faster (not FloatArray though).
>> 2) It would be nice to use a profiler to analyze where time is spent. I would not be amazed if (Float readFrom: ...) took a non-negligible percentage of it.
>> 3) ExternalDoubleArray only adds overhead if no bulk operation is performed (like reading raw binary data, or serving as a storage area passed to Lapack/Blas primitives); it does not provide accelerated features by itself, indeed. I think it is too low-level to serve as a primary interface.
>> 4) LapackXXXMatrix sum has effectively not been optimized to use BLAS, and this can easily be corrected; thanks for giving this example.
>> With some cooperation we could easily make some progress; there are low-hanging fruits. But I understand if you prefer to stick with the more mature numpy solution.
>> Thanks for trying. At least you were able to load and use Smallapack in Pharo, and that is already good feedback.
>> If you have time, I'll publish a small enhancement for accelerating sum and ask you to retry.
>> Thanks again
>>
>> On Tue, 21 May 2019 at 05:13, Serge Stinckwich <serge.stinckw...@gmail.com> wrote:
>>
>>> There is another solution with my TensorFlow Pharo binding:
>>> https://github.com/PolyMathOrg/libtensorflow-pharo-bindings
>>>
>>> You can do a matrix multiplication like this:
>>>
>>> | graph t1 t2 c1 c2 mult session result |
>>> graph := TF_Graph create.
>>> t1 := TF_Tensor fromFloats: (1 to: 1000000) asArray shape: #(1000 1000).
>>> t2 := TF_Tensor fromFloats: (1 to: 1000000) asArray shape: #(1000 1000).
>>> c1 := graph const: 'c1' value: t1.
>>> c2 := graph const: 'c2' value: t2.
>>> mult := c1 * c2.
>>> session := TF_Session on: graph.
>>> result := session runOutput: (mult output: 0).
>>> result asNumbers
>>>
>>> Here I'm multiplying two matrices of size 1000x1000 in 537 ms on my computer.
>>>
>>> All operations can be assembled into a graph of operations that is run outside Pharo, so it could be very fast. Operations can be done on CPU or GPU. 32-bit or 64-bit float operations are possible.
>>>
>>> This is a work in progress but can already be used.
>>> Regards,
>>>
>>> On Tue, May 21, 2019 at 6:54 AM Jimmie Houchin <jlhouc...@gmail.com> wrote:
>>>
>>>> I wasn't worried about how to do sliding windows. My problem is that using LapackDGEMatrix in my example was 18x slower than FloatArray, which is slower than Numpy. It isn't what I was expecting.
>>>>
>>>> What I didn't know is whether I was doing something wrong to cause such a tremendous slowdown.
>>>>
>>>> Python and Numpy are not my favorite, but they aren't uncomfortable.
>>>>
>>>> So I gave up and went back to Numpy.
>>>>
>>>> Thanks.
>>>>
>>>> On 5/20/19 5:17 PM, Nicolas Cellier wrote:
>>>>
>>>> Hi Jimmie,
>>>> effectively I did not subscribe...
>>>> Having efficient methods for sliding-window averages is possible; here is how I would do it:
>>>>
>>>> "Create a vector with 100,000 rows filled with random values (uniform distribution in [0,1])"
>>>> v := LapackDGEMatrix randUniform: #(100000 1).
>>>>
>>>> "extract values from rank 10001 to 20000"
>>>> w1 := v atIntervalFrom: 10001 to: 20000 by: 1.
>>>>
>>>> "create a left multiplier matrix for performing the average of w1"
>>>> a := LapackDGEMatrix nrow: 1 ncol: w1 nrow withAll: 1.0 / w1 size.
>>>>
>>>> "get the average (this is a 1x1 matrix from which we take the first element)"
>>>> avg1 := (a * w1) at: 1.
>>>>
>>>> [ "select another slice of the same size"
>>>> w2 := v atIntervalFrom: 15001 to: 25000 by: 1.
>>>> "get the average (we can recycle a)"
>>>> avg2 := (a * w2) at: 1 ] bench.
>>>>
>>>> This gives:
>>>> '16,500 per second. 60.7 microseconds per run.'
>>>> versus:
>>>> [w2 sum / w2 size] bench.
>>>> '1,100 per second. 908 microseconds per run.'
>>>>
>>>> For max and min it's harder. Lapack/Blas only provides max of absolute value as a primitive:
>>>> [w2 absMax] bench.
>>>> '19,400 per second. 51.5 microseconds per run.'
>>>>
>>>> Everything else will be slower, unless we write new primitives in C and connect them...
>>>> [w2 maxOf: [:each | each]] bench.
>>>> '984 per second. 1.02 milliseconds per run.'
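The averaging multiplier above generalizes to several windows at once: stack shifted copies of the averaging row and perform one product, in the same spirit as the partial-cumsum construction earlier in the thread. A rough, unbenchmarked sketch under that assumption, using only the class and message names already shown; the window length (10,000) and the starts every 5,000 elements are made up for illustration:

| v rows m avgs |
v := LapackDGEMatrix randUniform: #(100000 1).
"One row per window: k*5000 zeros, then 10,000 coefficients of 1/10000, then zeros again,
so row k averages elements (k*5000)+1 through (k*5000)+10000."
rows := (1 to: 9) collect: [:k |
    (LapackDGEMatrix nrow: 1 ncol: k * 5000) ,
    (LapackDGEMatrix nrow: 1 ncol: 10000 withAll: 1.0 / 10000) ,
    (LapackDGEMatrix nrow: 1 ncol: 100000 - (k * 5000) - 10000)].
m := LapackDGEMatrix rows: rows.
"avgs is a 9x1 matrix holding the nine window averages, computed in one xGEMM call."
avgs := m * v.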
>>>> >>>> "create a left multiplier matrix for performing average of w1" >>>> a := LapackDGEMatrix nrow: 1 ncol: w1 nrow withAll: 1.0 / w1 size. >>>> >>>> "get the average (this is a 1x1 matrix from which we take first >>>> element)" >>>> avg1 := (a * w1) at: 1. >>>> >>>> [ "select another slice of same size" >>>> w2 := v atIntervalFrom: 15001 to: 25000 by: 1. >>>> >>>> "get the average (we can recycle a)" >>>> avg2 := (a * w2) at: 1 ] bench. >>>> >>>> This gives: >>>> '16,500 per second. 60.7 microseconds per run.' >>>> versus: >>>> [w2 sum / w2 size] bench. >>>> '1,100 per second. 908 microseconds per run.' >>>> >>>> For max and min, it's harder. Lapack/Blas only provide max of absolute >>>> value as primitive: >>>> [w2 absMax] bench. >>>> '19,400 per second. 51.5 microseconds per run.' >>>> >>>> Everything else will be slower, unless we write new primitives in C and >>>> connect them... >>>> [w2 maxOf: [:each | each]] bench. >>>> '984 per second. 1.02 milliseconds per run.' >>>> >>>> Le dim. 19 mai 2019 à 14:58, Jimmie <jlhouc...@gmail.com> a écrit : >>>> >>>>> On 5/16/19 1:26 PM, Nicolas Cellier wrote:> Any feedback on this? >>>>> > Did someone tried to use Smallapack in Pharo? >>>>> > Jimmie? >>>>> > >>>>> >>>>> I am going to guess that you are not on pharo-users. My bad. >>>>> I posted this in pharo-users as I it wasn't Pharo development question. >>>>> >>>>> I probably should have posted here or emailed you directly. >>>>> >>>>> All I really need is good performance with a simple array of floats. >>>>> No >>>>> matrix math. Nothing complicated. Moving Averages over a slice of the >>>>> array. A variety of different averages, weighted, etc. Max/min of the >>>>> array. But just a single simple array. >>>>> >>>>> Any help greatly appreciated. >>>>> >>>>> Thanks. >>>>> >>>>> >>>>> On 4/28/19 8:32 PM, Jimmie Houchin wrote: >>>>> Hello, >>>>> >>>>> I have installed Smallapack into Pharo 7.0.3. Thanks Nicholas. >>>>> >>>>> I am very unsure on my use of Smallapack. I am not a mathematician or >>>>> scientist. However the only part of Smallapack I am trying to use at >>>>> the >>>>> moment is something that would be 64bit and compare to FloatArray so >>>>> that I can do some simple accessing, slicing, sum, and average on the >>>>> array. >>>>> >>>>> Here is some sample code I wrote just to play in a playground. >>>>> >>>>> I have an ExternalDoubleArray, LapackDGEMatrix, and a FloatArray >>>>> samples. The ones not in use are commented out for any run. >>>>> >>>>> fp is a download from >>>>> http://ratedata.gaincapital.com/2018/12%20December/EUR_USD_Week1.zip >>>>> and unzipped to a directory. >>>>> >>>>> fp := '/home/jimmie/data/EUR_USD_Week1.csv' >>>>> index := 0. >>>>> pricesSum := 0. >>>>> asum := 0. >>>>> ttr := [ >>>>> lines := fp asFileReference contents lines allButFirst. >>>>> a := ExternalDoubleArray new: lines size. >>>>> "la := LapackDGEMatrix allocateNrow: lines size ncol: 1. >>>>> a := la columnAt: 1." >>>>> "a := FloatArray new: lines size." >>>>> lines do: [ :line || parts price | >>>>> parts := ',' split: line. >>>>> index := index + 1. >>>>> price := Float readFrom: (parts last). >>>>> a at: index put: price. >>>>> pricesSum := pricesSum + price. >>>>> (index rem: 100) = 0 ifTrue: [ >>>>> asum := a sum. >>>>> ]]] timeToRun. >>>>> { index. pricesSum. asum. ttr }. 
>>>>> "ExternalDoubleArray an Array(337588 383662.5627699992 >>>>> 383562.2956199993 0:00:01:59.885)" >>>>> "FloatArray an Array(337588 383662.5627699992 383562.2954441309 >>>>> 0:00:00:06.555)" >>>>> >>>>> FloatArray is not the precision I need. But it is over 18x faster. >>>>> >>>>> I am afraid I must be doing something badly wrong. Python/Numpy is >>>>> over >>>>> 4x faster than FloatArray for the above. >>>>> >>>>> If I am using Smallapack incorrectly please help. >>>>> >>>>> Any help greatly appreciated. >>>>> >>>>> Thanks. >>>>> >>>>> >>>>> >>> >>> -- >>> Serge Stinckwic >>> h >>> >>> Int. Research Unit >>> on Modelling/Simulation of Complex Systems (UMMISCO) >>> Sorbonne University >>> (SU) >>> French National Research Institute for Sustainable Development (IRD) >>> U >>> niversity of Yaoundé I, Cameroon >>> "Programs must be written for people to read, and only incidentally for >>> machines to execute." >>> https://twitter.com/SergeStinckwich >>> >>