On Tue, 21 May 2019 at 18:55, Nicolas Cellier <nicolas.cellier.aka.n...@gmail.com> wrote:
> I have updated Smallapack to version 1.6.1 so as to accelerate sum.
>
> | a b c |
> a := LapackSGEMatrix randNormal: #(10000 1).
> b := a as: FloatArray.
> c := a asAbstractMatrix.
> {a sum. b sum. c sum.}.
> {[a sum] bench. [b sum] bench. [c sum] bench.}.
> '27,500 per second. 36.3 microseconds per run.'  "LapackMatrix"
> '117,000 per second. 8.54 microseconds per run.'  "FloatArray"
> '1,180 per second. 845 microseconds per run.'  "Un-accelerated AbstractMatrix"
>
> I measured with single precision for everything so as to have fair comparisons.
> As you can see, LapackMatrix sum is still slower than FloatArray sum.
> This is because we have to create a Matrix of ones first (xLASET) before calling the BLAS dot-product primitive (xDOTU).
>
> In Squeak, a profile can be obtained thru (I guess it is not much different in Pharo):
> AndreasSystemProfiler spyOn: [[a sum] bench].
>
> which gives:
> **Leaves**
> 37.19 (1,865) LapackSLibrary xlasetWithuplo:m:n:alpha:beta:a:lda:length:
> 27.12 (1,360) BlasSLibrary dotF2CWithn:x:incx:y:incy:
> 9.31 (467) Behavior basicNew:
> 2.42 (121) ByteString class compare:with:collated:
> 1.43 (72) ByteString hashWithInitialHash:
>
> Note that this is un-accelerated BLAS (no Intel Math Kernel Library or other accelerated version), but that does not make much difference for these BLAS level-1 functions (those with cost O(N)).
>
> But we can avoid that cost and compute all the cumulative sums, or some of them, at once with a single Matrix operation (xTRMM / xGEMM):
>
> | a b c |
> a := LapackSGEMatrix randUniform: #(10000 1).
> b := LapackSGEMatrix nrow: 10000 ncol: 10000 withAll: 1. "huge matrix, that's not cheap, don't even bench it!"
> "cumsum"
> c := b lowerTriangle * a.
>
> | a b c |
> a := LapackSGEMatrix randUniform: #(10000 1).
> b := LapackSGEMatrix nrow: 10000 ncol: 10000 withAll: 1. "huge matrix, that's not cheap, don't even bench it!"
> "cumsum, 1 out of 100"
> c := (b lowerTriangle atRows: (100 to: 10000 by: 100)) * a.
>
> | a b c |
> a := LapackSGEMatrix randUniform: #(10000 1).
> "cheaper construction of the partial cumsum multiplier"
> [b := LapackSGEMatrix rows: ((100 to: a nrow by: 100) collect: [:n |
>     (LapackSGEMatrix nrow: 1 ncol: n withAll: 1.0) , (LapackSGEMatrix nrow: 1 ncol: a nrow - n)]).
> c := b * a.] bench.
>
> '3.5 per second. 286 milliseconds per run.'
>
> So as you see, with some moderate effort we have computed 100 partial cumulative sums in about 286 ms, that's 2.86 ms per cumsum on average, not too bad.
> Maybe not as straightforward as numpy, and maybe still not as fast, but not completely off the pace.
>
> Oops, I posted too fast! ms are not µs! [c := b * a] bench gives '1,690 per second. 593 microseconds per run.', that's about 6 µs per sum; but including the construction of the multiplier it is way too much (it's not accelerated!).
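The xLASET cost described above is a construction cost: if the row of ones is kept around and reused, repeated sums only pay for the BLAS product. A minimal sketch of that idea, reusing the class and message names from the snippets above; this is only an illustration of the point, not Smallapack's actual implementation of sum, and it has not been benchmarked here:

| a ones |
a := LapackSGEMatrix randNormal: #(10000 1).
"Build the 1 x 10000 multiplier of ones once, outside the benchmark."
ones := LapackSGEMatrix nrow: 1 ncol: a nrow withAll: 1.0.
"Each sum is then a single matrix product (xGEMM/xDOTU territory), plus taking the first element of the 1x1 result."
[(ones * a) at: 1] bench.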
> On Tue, 21 May 2019 at 12:51, Jimmie Houchin <jlhouc...@gmail.com> wrote:
>
>> Not a problem. I greatly respect other people's time and priorities and their personal lives.
>>
>> Just for the record, I am using 64-bit Pharo on a fast i7, 16 GB RAM laptop running Xubuntu 18.04 64-bit.
>>
>> I do not remember any problems loading. And within the small amount of experimenting that I did, it seemed to operate fine.
>>
>> Again, thanks for your contribution. I know it is a lot of work and a pretty large area to cover. Python/Numpy has armies of people working on this.
>>
>> Jimmie
>>
>> On 5/21/19 2:54 AM, Nicolas Cellier wrote:
>>
>> Hi Jimmie,
>> I didn't take time yesterday to analyze your specific example because it was quite late, but here are some remarks:
>> 1) First, I recommend using 64-bit Pharo, because number crunching and Float operations will be faster (not FloatArray though).
>> 2) It would be nice to use a profiler to analyze where time is spent. I would not be amazed if (Float readFrom: ...) took a non-negligible percentage of it.
>> 3) ExternalDoubleArray only adds overhead if no bulk operation is performed (like reading raw binary data, or serving as a storage area passed to Lapack/Blas primitives); it does not provide accelerated features by itself, indeed. I think it is too low-level to serve as a primary interface.
>> 4) LapackXXXMatrix sum has effectively not been optimized to use BLAS, and this can easily be corrected; thanks for giving this example.
>> With some cooperation we could easily make some progress; there are low-hanging fruits. But I understand if you prefer to stick with the more mature numpy solution.
>> Thanks for trying. At least you were able to load and use Smallapack in Pharo, and that is already good feedback.
>> If you have time, I'll publish a small enhancement for accelerating sum and ask you to retry.
>> Thanks again
>>
>> On Tue, 21 May 2019 at 05:13, Serge Stinckwich <serge.stinckw...@gmail.com> wrote:
>>
>>> There is another solution with my TensorFlow Pharo binding:
>>> https://github.com/PolyMathOrg/libtensorflow-pharo-bindings
>>>
>>> You can do a matrix multiplication like this:
>>>
>>> | graph t1 t2 c1 c2 mult session result |
>>> graph := TF_Graph create.
>>> t1 := TF_Tensor fromFloats: (1 to: 1000000) asArray shape: #(1000 1000).
>>> t2 := TF_Tensor fromFloats: (1 to: 1000000) asArray shape: #(1000 1000).
>>> c1 := graph const: 'c1' value: t1.
>>> c2 := graph const: 'c2' value: t2.
>>> mult := c1 * c2.
>>> session := TF_Session on: graph.
>>> result := session runOutput: (mult output: 0).
>>> result asNumbers
>>>
>>> Here I'm multiplying two matrices of size 1000x1000 in 537 ms on my computer.
>>>
>>> All operations can be assembled into a graph of operations that is run outside Pharo, so it could be very fast. Operations can be done on CPU or GPU. 32-bit or 64-bit float operations are possible.
>>>
>>> This is a work in progress but can already be used.
>>> Regards,
>>>
>>> On Tue, May 21, 2019 at 6:54 AM Jimmie Houchin <jlhouc...@gmail.com> wrote:
>>>
>>>> I wasn't worried about how to do sliding windows. My problem is that using LapackDGEMatrix in my example was 18x slower than FloatArray, which is slower than Numpy. It isn't what I was expecting.
>>>>
>>>> What I didn't know is whether I was doing something wrong to cause such a tremendous slowdown.
>>>>
>>>> Python and Numpy are not my favorite, but they aren't uncomfortable.
>>>>
>>>> So I gave up and went back to Numpy.
>>>>
>>>> Thanks.
>>>>
>>>> On 5/20/19 5:17 PM, Nicolas Cellier wrote:
>>>>
>>>> Hi Jimmie,
>>>> effectively I did not subscribe...
>>>> Having efficient methods for sliding-window averages is possible; here is how I would do it:
>>>>
>>>> "Create a vector with 100,000 rows filled with random values (uniform distribution in [0,1])"
>>>> v := LapackDGEMatrix randUniform: #(100000 1).
>>>>
>>>> "extract values from rank 10001 to 20000"
>>>> w1 := v atIntervalFrom: 10001 to: 20000 by: 1.
>>>>
>>>> "create a left multiplier matrix for performing the average of w1"
>>>> a := LapackDGEMatrix nrow: 1 ncol: w1 nrow withAll: 1.0 / w1 size.
>>>>
>>>> "get the average (this is a 1x1 matrix from which we take the first element)"
>>>> avg1 := (a * w1) at: 1.
>>>>
>>>> [ "select another slice of the same size"
>>>> w2 := v atIntervalFrom: 15001 to: 25000 by: 1.
>>>> "get the average (we can recycle a)"
>>>> avg2 := (a * w2) at: 1 ] bench.
>>>>
>>>> This gives:
>>>> '16,500 per second. 60.7 microseconds per run.'
>>>> versus:
>>>> [w2 sum / w2 size] bench.
>>>> '1,100 per second. 908 microseconds per run.'
>>>>
>>>> For max and min it's harder. Lapack/Blas only provides max of absolute value as a primitive:
>>>> [w2 absMax] bench.
>>>> '19,400 per second. 51.5 microseconds per run.'
>>>>
>>>> Everything else will be slower, unless we write new primitives in C and connect them...
>>>> [w2 maxOf: [:each | each]] bench.
>>>> '984 per second. 1.02 milliseconds per run.'
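The averaging multiplier above generalizes to several windows at once: stack shifted copies of the averaging row and perform one product, in the same spirit as the partial-cumsum construction earlier in the thread. A rough, unbenchmarked sketch under that assumption, using only the class and message names already shown; the window length (10,000) and the starts every 5,000 elements are made up for illustration:

| v rows m avgs |
v := LapackDGEMatrix randUniform: #(100000 1).
"One row per window: k*5000 zeros, then 10,000 coefficients of 1/10000, then zeros again,
so row k averages elements (k*5000)+1 through (k*5000)+10000."
rows := (1 to: 9) collect: [:k |
    (LapackDGEMatrix nrow: 1 ncol: k * 5000) ,
    (LapackDGEMatrix nrow: 1 ncol: 10000 withAll: 1.0 / 10000) ,
    (LapackDGEMatrix nrow: 1 ncol: 100000 - (k * 5000) - 10000)].
m := LapackDGEMatrix rows: rows.
"avgs is a 9x1 matrix holding the nine window averages, computed in one xGEMM call."
avgs := m * v.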
>>>> >>>> "create a left multiplier matrix for performing average of w1" >>>> a := LapackDGEMatrix nrow: 1 ncol: w1 nrow withAll: 1.0 / w1 size. >>>> >>>> "get the average (this is a 1x1 matrix from which we take first >>>> element)" >>>> avg1 := (a * w1) at: 1. >>>> >>>> [ "select another slice of same size" >>>> w2 := v atIntervalFrom: 15001 to: 25000 by: 1. >>>> >>>> "get the average (we can recycle a)" >>>> avg2 := (a * w2) at: 1 ] bench. >>>> >>>> This gives: >>>> '16,500 per second. 60.7 microseconds per run.' >>>> versus: >>>> [w2 sum / w2 size] bench. >>>> '1,100 per second. 908 microseconds per run.' >>>> >>>> For max and min, it's harder. Lapack/Blas only provide max of absolute >>>> value as primitive: >>>> [w2 absMax] bench. >>>> '19,400 per second. 51.5 microseconds per run.' >>>> >>>> Everything else will be slower, unless we write new primitives in C and >>>> connect them... >>>> [w2 maxOf: [:each | each]] bench. >>>> '984 per second. 1.02 milliseconds per run.' >>>> >>>> Le dim. 19 mai 2019 à 14:58, Jimmie <jlhouc...@gmail.com> a écrit : >>>> >>>>> On 5/16/19 1:26 PM, Nicolas Cellier wrote:> Any feedback on this? >>>>> > Did someone tried to use Smallapack in Pharo? >>>>> > Jimmie? >>>>> > >>>>> >>>>> I am going to guess that you are not on pharo-users. My bad. >>>>> I posted this in pharo-users as I it wasn't Pharo development question. >>>>> >>>>> I probably should have posted here or emailed you directly. >>>>> >>>>> All I really need is good performance with a simple array of floats. >>>>> No >>>>> matrix math. Nothing complicated. Moving Averages over a slice of the >>>>> array. A variety of different averages, weighted, etc. Max/min of the >>>>> array. But just a single simple array. >>>>> >>>>> Any help greatly appreciated. >>>>> >>>>> Thanks. >>>>> >>>>> >>>>> On 4/28/19 8:32 PM, Jimmie Houchin wrote: >>>>> Hello, >>>>> >>>>> I have installed Smallapack into Pharo 7.0.3. Thanks Nicholas. >>>>> >>>>> I am very unsure on my use of Smallapack. I am not a mathematician or >>>>> scientist. However the only part of Smallapack I am trying to use at >>>>> the >>>>> moment is something that would be 64bit and compare to FloatArray so >>>>> that I can do some simple accessing, slicing, sum, and average on the >>>>> array. >>>>> >>>>> Here is some sample code I wrote just to play in a playground. >>>>> >>>>> I have an ExternalDoubleArray, LapackDGEMatrix, and a FloatArray >>>>> samples. The ones not in use are commented out for any run. >>>>> >>>>> fp is a download from >>>>> http://ratedata.gaincapital.com/2018/12%20December/EUR_USD_Week1.zip >>>>> and unzipped to a directory. >>>>> >>>>> fp := '/home/jimmie/data/EUR_USD_Week1.csv' >>>>> index := 0. >>>>> pricesSum := 0. >>>>> asum := 0. >>>>> ttr := [ >>>>> lines := fp asFileReference contents lines allButFirst. >>>>> a := ExternalDoubleArray new: lines size. >>>>> "la := LapackDGEMatrix allocateNrow: lines size ncol: 1. >>>>> a := la columnAt: 1." >>>>> "a := FloatArray new: lines size." >>>>> lines do: [ :line || parts price | >>>>> parts := ',' split: line. >>>>> index := index + 1. >>>>> price := Float readFrom: (parts last). >>>>> a at: index put: price. >>>>> pricesSum := pricesSum + price. >>>>> (index rem: 100) = 0 ifTrue: [ >>>>> asum := a sum. >>>>> ]]] timeToRun. >>>>> { index. pricesSum. asum. ttr }. 
>>>>> "ExternalDoubleArray an Array(337588 383662.5627699992 >>>>> 383562.2956199993 0:00:01:59.885)" >>>>> "FloatArray an Array(337588 383662.5627699992 383562.2954441309 >>>>> 0:00:00:06.555)" >>>>> >>>>> FloatArray is not the precision I need. But it is over 18x faster. >>>>> >>>>> I am afraid I must be doing something badly wrong. Python/Numpy is >>>>> over >>>>> 4x faster than FloatArray for the above. >>>>> >>>>> If I am using Smallapack incorrectly please help. >>>>> >>>>> Any help greatly appreciated. >>>>> >>>>> Thanks. >>>>> >>>>> >>>>> >>> >>> -- >>> Serge Stinckwic >>> h >>> >>> Int. Research Unit >>> on Modelling/Simulation of Complex Systems (UMMISCO) >>> Sorbonne University >>> (SU) >>> French National Research Institute for Sustainable Development (IRD) >>> U >>> niversity of Yaoundé I, Cameroon >>> "Programs must be written for people to read, and only incidentally for >>> machines to execute." >>> https://twitter.com/SergeStinckwich >>> >>