[julia-users] Re: ANN: support for the original role of quiet NaNs

2015-08-03 Thread Jason Riedy
And Jeffrey Sarnoff writes:
> IEEE 754-2008 makes it clear that QNaN payload values are fair game: 
> (it says details of NaN propagation may have vendor differences, and)

Having been in the room at the time of utter despair... (And
admittedly not having looked at your package yet.)

One "right" way of using the NaN values without following their
propagation is the following:
  - ensure the inputs have NaN payloads denoting something about
their origin and the likelihood of causing issues later,
  - check the invalid flag at a felicitous time, then
  - scan the inputs to rank the possible issues.

This somewhat fits a Lispy with-NaN-debugging macro style (that
can at compile time ignore absolutely everything depending on the
safety level).  Naturally, it's not guaranteed to work, and I'm
not at all sure how a user could fill in the payloads
appropriately.  If there is a good answer to filling in the
payloads, well, the standard is up for revision...
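
As a rough illustration of the first step, here is a minimal sketch of
stuffing and reading a payload in a quiet Float64 NaN; the helper names
and the 51-bit payload mask are mine, not the package's:

  # Build a quiet NaN carrying `payload` in its low 51 bits.
  payload_nan(payload::UInt64) =
      reinterpret(Float64, 0x7ff8000000000000 | (payload & 0x0007ffffffffffff))

  # Recover the payload bits from any Float64 NaN.
  nan_payload(x::Float64) = reinterpret(UInt64, x) & 0x0007ffffffffffff

  x = payload_nan(UInt64(42))    # tag an input with its origin
  y = 1.0 + x                    # payload propagation is platform-dependent
  isnan(y) && nan_payload(y)     # inspect whatever payload survived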



[julia-users] Re: ANN: support for the original role of quiet NaNs

2015-08-03 Thread Jason Riedy
And Jeffrey Sarnoff writes:
> AFAIK Julia and most other languages use one or two of each in
> most circumstances.

And many chips produce only one NaN, the platform's "canonical" NaN.
Some pass one of the argument NaNs through, but architectures rarely
specify which.
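
A quick, purely illustrative check of which NaN survives an addition
(the result varies by CPU):

  a = reinterpret(Float64, 0x7ff8000000000001)   # quiet NaN, payload 1
  b = reinterpret(Float64, 0x7ff8000000000002)   # quiet NaN, payload 2
  reinterpret(UInt64, a + b)   # often the canonical NaN or `a`'s bits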



[julia-users] Re: Can you make this linear algebra code faster?

2015-03-25 Thread Jason Riedy
And Jiahao Chen writes:
> I tried to manually inline idxmaxabs. It made absolutely no difference
> on my machine. The row scaling takes ~0.05% of total execution time.

Simply inlining, sure, but you could scale inside the outer loop
and find the next pivot in the inner loop.  Making only a single
pass over the data should save more than 0.05% once you leave
cache.  But as long as you're in cache (500x500 is approx. 2MiB),
not much will matter.
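
For concreteness, a hedged sketch of fusing the column scaling, the
Schur-complement update, and the next complete-pivot search into one
sweep; all names here are mine, not from the posted code:

  function lu_step_fused!(A::Matrix{Float64}, k::Int)
      n = size(A, 1)
      invp = inv(A[k, k])
      @inbounds for i in k+1:n
          A[i, k] *= invp                 # scale the pivot column
      end
      best = -1.0; bi = k + 1; bj = k + 1
      @inbounds for j in k+1:n
          akj = A[k, j]
          for i in k+1:n
              v = A[i, j] - A[i, k]*akj   # Schur-complement update
              A[i, j] = v
              if abs(v) > best            # track the next complete pivot
                  best = abs(v); bi = i; bj = j
              end
          end
      end
      bi, bj                              # location of the next pivot
  end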

Ultimately, I'm not sure who's interested in complete pivoting
for LU.  That choice alone kills performance on modern machines
for negligible benefit.  You likely would find more interest in
column-pivoted QR or rook pivoting in LDL^T.



[julia-users] Re: Can you make this linear algebra code faster?

2015-03-25 Thread Jason Riedy
And Tim Holy writes:
> Obviously you could do even better by devectorizing, but it
> wouldn't be as pretty.

Similarly for moving the row scaling and next pivot search into
the loop.



[julia-users] Re: Rounding and the IEEE rule

2014-12-28 Thread Jason Riedy
And Tamas Papp writes:
> My rule of thumb is that if I need to worry about the IEEE rounding mode
> making a difference in my results, then my problem is really
> ill-conditioned and rounding is the least of my worries. So I usually just
> ignore it.

You know of ill-conditioning, so you are not the concern.
Round-to-even is meant to allay bias problems for the vast
majority of programmers, like those responsible for the Vancouver
Stock Exchange index in 1982.  They used truncation, which, over
time, roughly halved the index.  Rounding upward in the same
scenario would have doubled it...
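
A toy illustration of the bias (my own construction, not the
exchange's actual arithmetic): keep two decimal digits after every
update and watch truncation drift while round-to-even holds steady.

  trunc2(x) = trunc(x * 100) / 100    # chop to two decimal digits
  rne2(x)   = round(x * 100) / 100    # round half to even, the 754 default

  function drift(n)
      a = b = 1000.0
      for k in 1:n
          dx = 0.005 * randn()        # tiny, symmetric perturbations
          a = trunc2(a + dx)          # truncation biases each step downward
          b = rne2(b + dx)            # round-to-even stays unbiased
      end
      a, b
  end
  drift(10000)                        # `a` ends up far below `b`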

The choices in 754 were made with a nearly revolting amount of
deliberation.  Always makes me sad to see them shrugged away.



[julia-users] Re: Why lufact() is faster than my code?

2014-09-25 Thread Jason Riedy
And Staro Pickle writes:
> I am doing some matrix factorizations like LU, LDL'. I find that the
> julia's lufact() is much faster than my code. I wonder why my code is
> slow and why lufact is fast (is this one question or two?). I use
> vectorization (is that vectorization?).

Others have given some reasons...  If you want to see contortions
that were (perhaps still are) necessary to be competitive, see:
  https://groups.google.com/forum/#!msg/julia-users/DJRxApmpJmU/NqepNaX5ARIJ

> n=5, a=10
> Matrix to be decomposed:
> 1 0 0 0 a 
> 0 1 0 0 a
> 0 0 1 0 a
> 0 0 0 1 a
> a a a a 1

However, if your matrices always have this form, I would highly
recommend solving a few by hand to notice the pattern.
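
To hint at the pattern without spoiling it: the matrix is an
arrowhead, so elimination only fills in the last row and column, and
a solve collapses to O(n) work.  A rough sketch for Ax = b with
A = [I a*e; a*e' 1] (the names are mine):

  function arrow_solve(a, b)
      n = length(b)
      s = 1 - (n - 1)*a^2               # the 1-by-1 Schur complement
      xn = (b[n] - a*sum(b[1:n-1])) / s
      [b[1:n-1] .- a*xn; xn]            # back-substitute the identity block
  end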



[julia-users] Re: Matlab bench in Julia

2014-09-18 Thread Jason Riedy
And Elliot Saba writes:
> The first thing you should do is run your code once to warm up the
> JIT, and then run it again to measure the actual run time, rather
> than compile time + run time.

To be fair, he seems to be timing MATLAB in the same way, so he's
comparing systems appropriately at that level.
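
For reference, the warm-up pattern in Julia looks like this (the
kernel is a stand-in, not the original benchmark):

  f(x) = sum(sin, x)   # stand-in for the real benchmark kernel
  x = rand(10^6)
  f(x)                 # first call: compilation + execution
  @time f(x)           # second call: execution alone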

It's just the tuned BLAS+LAPACK & FFTW vs. the default ones.  This
is one reason why MATLAB bundles so much.  (Another reason being
the differences in numerical results causing support calls.  It took
a long time before MATLAB gave in to per-platform-tuned libraries.)



[julia-users] Re: needless loss of performance: immutable conflates two concepts

2014-08-07 Thread Jason Riedy
And Stephen Vavasis writes:
> You say that probably no improvement would be possible if 'fast'
> updating were available for immutable in my code.  In this case, why
> is there such a huge difference between fast and slow updating for
> the non-immutable case?  Is there some optimization available in the
> 'immutable' case but not the 'type' case, or is this a compiler bug?

Likely no improvement is possible...  And I admit I'm going more
by familiarity with GCC's optimizations than LLVM's.

> function immut_upd_slow()
[...]
> @inbounds a[i] = xyimmut(a[i].x, a[i].y, a[i].z, true)

With a value type (and without aliasing, and without signaling
NaNs trapping on moves), the compiler could detect that
a[i].{x,y,z} do not change.  Because it's a value type and not a
general object, the compiler could know there's no allocation.
An allocation *may* trigger a GC, so there are side-effects.  I
know gcc pulls this off with C structs in some doubled-precision
code.

I suppose Julia could maintain extra attributes on general
objects to decide if they're morally value objects, but that
wanders close to sufficiently smart compiler territory.

Besides, if you really want this to be fast, the "array" should
be an object holding three arrays: one Int array for x, one
Float64 2-d array for y & z, and one Bool array for summed.  Just
like Matlab's complex, and with all the interop issues, but then
you avoid needless cache lines, open up vectorization, etc...
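
A minimal sketch of that structure-of-arrays layout, reusing the
field names from the xyimmut example (the wrapper name and accessor
are mine):

  immutable XYZArray
      x::Vector{Int}
      yz::Matrix{Float64}       # 2-by-n: row 1 holds y, row 2 holds z
      summed::Vector{Bool}
  end

  # Updating `summed` touches one Bool array, not whole structs.
  setsummed!(a::XYZArray, i) = (a.summed[i] = true)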



[julia-users] Re: needless loss of performance: immutable conflates two concepts

2014-08-05 Thread Jason Riedy
And Steve Vavasis writes:
> Since fast update is not available for 'immutable', we can only
> guess how much improvement is possible.

Likely none at all.  Decomposing structured types into their
components is a common optimization now, and the immutable
semantics permit it.

"Immutable" is just Julia's word for a value type.  You can
declare a structured type that acts just like an integer, float,
etc.  Just as you cannot change 1 to 2, you cannot change an
immutable value (which consists of its components).  You can
change the value associated with a variable, e.g. v += 1.
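
Concretely, with an illustrative type of my own:

  immutable Pt
      x::Float64
      y::Float64
  end

  p = Pt(1.0, 2.0)
  p = Pt(p.x + 1, p.y)   # fine: rebinds the variable to a new value
  # p.x = 3.0            # error: an immutable value cannot be mutated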

GC and the rest is just a consequence.  There is no 2 object
implemented to which variables of integer or fixnum types point,
so similarly there is no GC-able object to which values of
immutable type point.  There could be, but that would be an
implementation choice.




[julia-users] Re: ways to improve performance of a non-integer power?

2014-07-16 Thread Jason Riedy
And Florian Oswald writes:
> I would have never guessed that there could be a fast/slow
> issue with such basic functions. Little do I know!

See also Jean-Michel Muller's text "Elementary Functions:
Algorithms and Implementation" and others, including (imho)
Markstein's "IA-64 and Elementary Functions: Speed and Precision"
for very concrete descriptions of trade-offs on one (possibly
odd) architecture.

It'd be fantastic if something like the NIST Digital Library of
Mathematical Functions (http://dlmf.nist.gov/) discussed more of
the computational trade-offs, but there are the dual issues of
funding and the slow deterioration of knowledge.  There might be
a role for a high-performance, high-level, free language like
Julia here.

A few people have made routines dependable enough that few people
worry.  While this is good for general use, it creates a
difficult position for continuing the knowledge.  I know that I
don't have it...



[julia-users] LU factorization in Julia can be almost as fast as native code.

2014-05-26 Thread Jason Riedy
First, the caveats:
- on large matrices,
- with few threads, and
- punching a hole directly to the BLAS library.

Sivan Toledo's recursive LU factorization
(http://dx.doi.org/10.1137/S0895479896297744) isn't just good for
optimizing cache behavior but also for avoiding top-level overheads.  A
somewhat straight-forward iterative implementation in Julia, attached, performs
about as quickly as the current LAPACK call for square matrices of dimension
3000 or above, so long as you limit OpenBLAS to a single thread.  I haven't
tested against MKL.

Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM and
xTRSV directly.  The allocations and construction of StridedMatrix views at
every step destroy any performance gains.  Similarly with a light-weight
wrapper attempt called LASubDomain below.  Also, I have to write loops rather
than vector expressions, which leads to ugly code compared to modern Fortran
versions.

Current ratio of recursive LU time to the Julia LAPACK dispatch on a dual-core, 
Sandy Bridge server by dimension (vertical) and number of threads (horizontal):

|     N | NT: 1   |       2 |       4 |       8 |      16 |      24 |      32 |
|-------+---------+---------+---------+---------+---------+---------+---------|
|    10 | 2301.01 | 2291.16 | 2335.39 |  2268.1 | 2532.12 | 2304.23 | 2320.84 |
|    30 |  661.25 |  560.34 |  443.07 |  514.52 |  392.04 |  425.56 |     472 |
|   100 |   98.62 |   90.38 |   96.69 |   94.31 |    74.6 |   73.61 |   76.33 |
|   300 |    8.12 |    8.14 |   12.19 |   16.15 |   14.55 |   14.16 |   15.82 |
|  1000 |    1.6  |    1.55 |    2.36 |    3.35 |    3.5  |    3.86 |    3.48 |
|  3000 |    1.04 |    1.13 |    1.36 |    1.99 |    1.62 |    1.36 |    1.39 |
| 10000 |    1.02 |    1.06 |    1.13 |    1.26 |    1.55 |    1.54 |    1.58 |

The best case for traditional LU factorization relying on xGEMM for the Schur 
complement is 7.7x slower than the LAPACK dispatch, and it scales far worse.

If there's interest, I'd love to work this into a faster generic
base/linalg/lu.jl for use with other data types.  I know I need to consider
whether the la_swap!, lu_trisolve!, and lu_schur! functions are the right
utility methods.  I'm interested in supporting double-double arithmetic, etc.
with relatively high performance.  That implies building everything out of at
least dot products.  For "multiplied" precisions, those can be optimized to
avoid continual re-normalization (see the work of Siegfried Rump et al.).
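
As a hedged sketch of that dot-product idea, using error-free
transformations in the style of Ogita, Rump & Oishi (the helper names
are mine, not the planned utility methods):

  function twoprod(a::Float64, b::Float64)
      p = a*b
      p, fma(a, b, -p)                # exact product error via fma
  end

  function twosum(a::Float64, b::Float64)
      s = a + b
      v = s - a
      s, (a - (s - v)) + (b - v)      # exact addition error (Knuth)
  end

  function dot2(x, y)
      s = c = 0.0
      for i in 1:length(x)
          p, ep = twoprod(x[i], y[i])
          s, es = twosum(s, p)
          c += ep + es                # carry the error terms along...
      end
      s + c                           # ...renormalizing once at the end
  end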

(I actually wrote the code last November, but this is the first chance I've had
to describe it.  It might be a tad out of date style-wise with respect to
current Julia.  Given that lag, I suppose I shouldn't promise to beat this into
shape at any fast pace...)
module RecFactorization
# Beginnings of recursive factorization implementations for Julia.
# Currently just LU.  QR should be straight-forward.  Unfortunately,
# vector expressions must be written as loops to avoid needless memory
# allocation.  I hate writing loops.

export reclufact!, lufact_schur!, lufact_noblas!

import Base.LinAlg: BlasFloat, BlasChar, BlasInt, blas_int,
    DimensionMismatch, chksquare, axpy!

# Attempt at defining a pretty LAPACK-style sub-array...
## Using LASubDomain causes far too much allocation.  I suspect using
## ArrayViews would have the same problem.
immutable LASubDomain
    offset::BlasInt
    leadingdim::BlasInt
    nrow::BlasInt
    ncol::BlasInt

    function LASubDomain{T}(A::DenseArray{T,2}, i, j)
        (M,N) = size(A)
        offset = (i-1) + (j-1)*M
        nrow = M-i
        ncol = N-j
        return new(offset, M, nrow, ncol)
    end
    function LASubDomain{T}(A::DenseArray{T,2}, i, nrow, j, ncol)
        (M,N) = size(A)
        offset = (i-1) + (j-1)*M
        return new(offset, M, nrow, ncol)
    end
end

import Base.pointer
pointer{T}(A::DenseArray{T, 2}, dom::LASubDomain) =
    pointer(A) + dom.offset*sizeof(T)
pointer{T}(A::DenseArray{T, 2}, i, j) =
    pointer(A) + ((i-1) + (j-1)*size(A,1))*sizeof(T)
import Base.size
size(dom::LASubDomain) = (dom.nrow, dom.ncol)
size(dom::LASubDomain, k::Int) = (dom.nrow, dom.ncol)[k]

leadingdim{T}(A::DenseArray{T, 2}) = size(A, 1)
leadingdim(dom::LASubDomain) = dom.leadingdim
LAArg{T}(A::DenseArray{T,2}, dom::LASubDomain) =
    (pointer(A, dom), leadingdim(dom))

# Dig in and expose the raw BLAS calls.
const libblas = Base.libblas_name

for (gemm, elty) in
    ((:dgemm_,:Float64),
     (:sgemm_,:Float32),
     (:zgemm_,:Complex128),
     (:cgemm_,:Complex64))
    @eval begin
        # SUBROUTINE DGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        # * .. Scalar Arguments ..
        #   DOUBLE PRECISION ALPHA,BETA
        #   INTEGER K,LDA,LDB,LDC,M,N
        #   CHARACTER TRANSA,TRANSB
        # * .. Array Arguments ..
        #   DOUBLE PRECISION A(LDA,*),B(LDB,*),C(LDC,*)
        function painful_gemm!(transA::BlasChar, transB::BlasChar, m::BlasInt, n::B