[julia-users] Re: ANN: support for the original role of quiet NaNs
And Jeffrey Sarnoff writes:
> IEEE 754-2008 makes it clear that QNaN payload values are fair game:
> (it says details of NaN propagation may have vendor differences, and)

Having been in the room at the time of utter despair... (And admittedly not having looked at your package yet.) One "right" way of using the NaN payloads without relying on their propagation is the following:
- ensure the inputs have NaN payloads denoting something about their origin and the likelihood of causing issues later,
- check the invalid flag at a felicitous time, then
- scan the inputs to rank the possible issues.

This somewhat fits a Lispy with-NaN-debugging macro style (one that, at compile time, can ignore absolutely everything depending on the safety level). Naturally, it's not guaranteed to work, and I'm not at all sure how a user could fill in the payloads appropriately. If there is a good answer to filling in the payloads, well, the standard is up for revision...
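For concreteness, here is a minimal sketch of stashing and recovering an origin payload in a Float64 quiet NaN by bit-twiddling. These helpers are my own hypothetical names, not part of the announced package:

```julia
# Hypothetical helpers for tagging quiet NaNs with an origin payload.
# 0x7ff8000000000000 is the canonical Float64 quiet-NaN bit pattern; the
# quiet bit is the top mantissa bit, leaving the low 51 bits free.
payload_nan(p::UInt64) =
    reinterpret(Float64, 0x7ff8000000000000 | (p & 0x0007ffffffffffff))
nan_payload(x::Float64) =
    reinterpret(UInt64, x) & 0x0007ffffffffffff
```

Whether such a payload survives a computation is exactly the vendor-specific propagation question quoted above.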
[julia-users] Re: ANN: support for the original role of quiet NaNs
And Jeffrey Sarnoff writes:
> AFAIK Julia and most other languages use one or two of each in
> most circumstances.

And many chips produce only one, the platform's "canonical" NaN. Some pass one of the argument NaNs through, but they rarely specify which.
[julia-users] Re: Can you make this linear algebra code faster?
And Jiahao Chen writes:
> I tried to manually inline idxmaxabs. It made absolutely no difference
> on my machine. The row scaling takes ~0.05% of total execution time.

Simply inlining, sure, but you could scale inside the outer loop and find the next pivot in the inner loop. Making only a single pass over the data should save more than 0.05% once you leave cache. But as long as you're in cache (500x500 is approx. 2 MiB), not much will matter.

Ultimately, I'm not sure who's interested in complete pivoting for LU. That choice alone kills performance on modern machines for negligible benefit. You would likely find more interest in column-pivoted QR or rook pivoting in LDL^T.
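To illustrate the single-pass idea (a hedged sketch of mine, not Jiahao's code): while applying the current elimination step to a row of the trailing submatrix, track the largest-magnitude entry so the next pivot search costs no extra sweep over the data:

```julia
# Hypothetical sketch: update one row of the Schur complement and, in the
# same pass, return that row's largest-magnitude entry for the next pivot
# search.  One traversal instead of two.
function update_row_track_pivot!(row, multiplier, pivrow)
    bestj, bestmag = 0, -1.0
    for j in 1:length(row)
        row[j] -= multiplier * pivrow[j]   # elimination update
        m = abs(row[j])
        if m > bestmag                     # pivot candidate, for free
            bestmag, bestj = m, j
        end
    end
    return bestj, bestmag
end
```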
[julia-users] Re: Can you make this linear algebra code faster?
And Tim Holy writes:
> Obviously you could do even better by devectorizing, but it
> wouldn't be as pretty.

Similarly for moving the row scaling and the next pivot search into the loop.
[julia-users] Re: Rounding and the IEEE rule
And Tamas Papp writes:
> My rule of thumb is that if I need to worry about the IEEE rounding mode
> making a difference in my results, then my problem is really
> ill-conditioned and rounding is the least of my worries. So I usually just
> ignore it.

You know of ill-conditioning, so you are not the concern. Round-to-even is meant to allay bias problems for the vast majority of programmers, like those responsible for the Vancouver Stock Exchange index in 1982. They used truncation, which, over time, roughly halved the index. Rounding upward in the same scenario would have doubled it...

The choices in 754 were made with a nearly revolting amount of deliberation. It always makes me sad to see them shrugged away.
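A toy illustration of the bias (my own construction, not the exchange's actual computation): repeatedly keep a running value to two decimal places while adding small increments. Truncation discards every increment; round-to-nearest tracks the true value:

```julia
# Keep an "index" at two decimal places after every small update.
trunc2(x) = floor(x * 100) / 100   # truncation, as the VSE did
round2(x) = round(x * 100) / 100   # round to nearest

function run_index(roundfn; start = 1000.0, steps = 100)
    v = start
    for _ in 1:steps
        v = roundfn(v + 0.004)   # below half a cent: always lost
        v = roundfn(v + 0.006)   # above half a cent: kept only by rounding
    end
    return v
end
```

With truncation the index never moves off 1000.0; with round-to-nearest it ends at the true value of 1001.0.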
[julia-users] Re: Why lufact() is faster than my code?
And Staro Pickle writes:
> I am doing some matrix factorizations like LU, LDL'. I find that
> Julia's lufact() is much faster than my code. I wonder why my code is
> slow and why lufact is fast (is this one question or two?). I use
> vectorization (is that vectorization?).

Others have given some reasons... If you want to see the contortions that were (perhaps still are) necessary to be competitive, see:
https://groups.google.com/forum/#!msg/julia-users/DJRxApmpJmU/NqepNaX5ARIJ

> n=5, a=10
> Matrix to be decomposed:
> 1 0 0 0 a
> 0 1 0 0 a
> 0 0 1 0 a
> 0 0 0 1 a
> a a a a 1

However, if your matrices always have this form, I would highly recommend solving a few by hand to notice the pattern.
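To spell out the hand-solving hint (a hedged sketch; `arrowhead_solve` is my own name, and I'm assuming the matrix really is identity-plus-border as quoted): one elimination pass reduces the last equation to a scalar Schur complement, so the whole solve is O(n) instead of O(n^3):

```julia
# Solve A x = b for the arrowhead matrix A = [I a*e; a*e' 1], e = ones(n-1).
function arrowhead_solve(a, n, b)
    d = 1 - (n - 1) * a^2                    # scalar Schur complement
    xn = (b[n] - a * sum(b[1:n-1])) / d      # eliminate the last row
    x = b[1:n-1] .- a * xn                   # back-substitute identity rows
    return [x; xn]
end
```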
[julia-users] Re: Matlab bench in Julia
And Elliot Saba writes:
> The first thing you should do is run your code once to warm up the
> JIT, and then run it again to measure the actual run time, rather
> than compile time + run time.

To be fair, he seems to be timing MATLAB in the same way, so he's comparing the systems appropriately at that level. It's just the tuned BLAS+LAPACK & FFTW vs. the default ones. This is one reason why MATLAB bundles so much. (Another reason being that differences in numerical results cause support calls. It took a long time before MATLAB gave in to per-platform-tuned libraries.)
[julia-users] Re: needless loss of performance: immutable conflates two concepts
And Stephen Vavasis writes:
> You say that probably no improvement would be possible if 'fast'
> updating were available for immutable in my code. In this case, why
> is there such a huge difference between fast and slow updating for
> the non-immutable case? Is there some optimization available in the
> 'immutable' case but not the 'type' case, or is this a compiler bug?

Likely, no improvement is possible... And I admit I'm going more by familiarity with gcc optimizations than llvm ones.

> function immut_upd_slow() [...]
> @inbounds a[i] = xyimmut(a[i].x, a[i].y, a[i].z, true)

With a value type (and without aliasing, and without signaling NaNs firing on data motion), the compiler can detect that a[i].{x,y,z} do not change. Because it's a value type and not a general object, the compiler also knows there is no allocation. An allocation *may* trigger a GC, so it has side-effects. I know gcc pulls this off with C structs in some doubled-precision code.

I suppose Julia could maintain extra attributes on general objects to decide if they're morally value objects, but that wanders close to sufficiently-smart-compiler territory.

Besides, if you really want this to be fast, the "array" should be an object holding three arrays: one Int array for x, one Float64 2-d array for y & z, and one Bool array for summed. Just like MATLAB's complex storage, and with all its interop issues, but then you avoid fetching needless cache lines, open up vectorization, etc...
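A hedged sketch of that struct-of-arrays layout (field names taken from the quoted xyimmut example; the helper names are my own): updating `summed` then touches one Bool array and never reconstructs or reallocates a record:

```julia
# Struct-of-arrays layout for records (x::Int, y::Float64, z::Float64,
# summed::Bool): parallel arrays instead of an array of objects.
function make_records(n)
    return (zeros(Int, n),   # x
            zeros(n, 2),     # y (column 1) and z (column 2)
            falses(n))       # summed
end

# The "fast update": set every summed flag without rebuilding any record.
# Only the Bool array's cache lines are touched.
function set_all_summed!(summed)
    for i in 1:length(summed)
        summed[i] = true
    end
    return summed
end
```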
[julia-users] Re: needless loss of performance: immutable conflates two concepts
And Steve Vavasis writes:
> Since fast update is not available for 'immutable', we can only
> guess how much improvement is possible.

Likely none at all. Decomposing structured types into their components is a common optimization now, and the immutable semantics permit it.

"Immutable" is just Julia's word for a value type. You can declare a structured type that acts just like an integer, float, etc. Just as you cannot change 1 to 2, you cannot change an immutable value (which consists of its components). You can change the value associated with a variable, e.g. v += 1.

GC and the rest is just a consequence. There is no 2 object to which variables of integer or fixnum types point, so similarly there is no GC-able object to which values of immutable type point. There could be, but that would be an implementation choice.
[julia-users] Re: ways to improve performance of a non-integer power?
And Florian Oswald writes:
> I would have never guessed that there could be a fast/slow
> issue with such basic functions. Little do I know!

See also Jean-Michel Muller's text "Elementary Functions: Algorithms and Implementation" and others (including, imho, Markstein's "IA-64 and Elementary Functions: Speed and Precision" for very concrete descriptions of trade-offs on one (possibly odd) architecture).

It'd be fantastic if something like the NIST Digital Library of Mathematical Functions (http://dlmf.nist.gov/) discussed more of the computational trade-offs, but there are the dual issues of funding and the slow deterioration of knowledge. There might be a role for a high-performance, high-level, free language like Julia here.

A few people have made these routines dependable enough that few people worry. While this is good for general use, it creates a difficult position for continuing the knowledge. I know that I don't have it...
[julia-users] LU factorization in Julia can be almost as fast as native code.
First, the caveats:
- on large matrices,
- with few threads, and
- punching a hole directly to the BLAS library.

Sivan Toledo's recursive LU factorization (http://dx.doi.org/10.1137/S0895479896297744) isn't just good for optimizing cache behavior but also for avoiding top-level overheads. A somewhat straight-forward iterative implementation in Julia, attached, performs about as quickly as the current LAPACK call for square matrices of dimension 3000 or above, so long as you limit OpenBLAS to a single thread. I haven't tested against MKL.

Unfortunately, I have to avoid the pretty BLAS interface and call xGEMM and xTRSV directly. The allocations and construction of StridedMatrix views at every step destroy any performance. Similarly with a light-weight wrapper attempt called LASubDomain below. Also, I have to write loops rather than vector expressions. That leads to ugly code compared to modern Fortran versions.

Current ratio of recursive LU time to the Julia LAPACK dispatch on a dual-core, Sandy Bridge server by dimension (vertical) and number of threads (horizontal):

| N     | NT: 1   | 2       | 4       | 8       | 16      | 24      | 32      |
|-------+---------+---------+---------+---------+---------+---------+---------|
| 10    | 2301.01 | 2291.16 | 2335.39 | 2268.1  | 2532.12 | 2304.23 | 2320.84 |
| 30    | 661.25  | 560.34  | 443.07  | 514.52  | 392.04  | 425.56  | 472     |
| 100   | 98.62   | 90.38   | 96.69   | 94.31   | 74.6    | 73.61   | 76.33   |
| 300   | 8.12    | 8.14    | 12.19   | 16.15   | 14.55   | 14.16   | 15.82   |
| 1000  | 1.6     | 1.55    | 2.36    | 3.35    | 3.5     | 3.86    | 3.48    |
| 3000  | 1.04    | 1.13    | 1.36    | 1.99    | 1.62    | 1.36    | 1.39    |
| 10000 | 1.02    | 1.06    | 1.13    | 1.26    | 1.55    | 1.54    | 1.58    |

The best case for traditional LU factorization relying on xGEMM for the Schur complement is 7.7x slower than the LAPACK dispatch, and it scales far worse.

If there's interest, I'd love to work this into a faster generic base/linalg/lu.jl for use with other data types. I know I need to consider whether the la_swap!, lu_trisolve!, and lu_schur! functions are the right utility methods. I'm interested in supporting doubled double, etc.
with relatively high performance. That implies building everything out of at least dot products. For "multiplied" precisions, those can be optimized to avoid continual re-normalization (see Siegfried Rump et al.'s work).

(I actually wrote the code last November, but this is the first chance I've had to describe it. It might be a tad out of date style-wise with respect to current Julia. Given that lag, I suppose I shouldn't promise to beat this into shape at any fast pace...)

module RecFactorization
# Beginnings of recursive factorization implementations for Julia.
# Currently just LU.  QR should be straight-forward.  Unfortunately,
# vector expressions must be written as loops to avoid needless memory
# allocation.  I hate writing loops.

export reclufact!, lufact_schur!, lufact_noblas!

import Base.LinAlg: BlasFloat, BlasChar, BlasInt, blas_int,
    DimensionMismatch, chksquare, axpy!

# Attempt at defining a pretty LAPACK-style sub-array...
## Using LASubDomain causes far too much allocation.  I suspect using
## ArrayViews would have the same problem.
immutable LASubDomain
    offset::BlasInt
    leadingdim::BlasInt
    nrow::BlasInt
    ncol::BlasInt
    function LASubDomain{T}(A::DenseArray{T,2}, i, j)
        (M, N) = size(A)
        offset = i-1 + (j-1)*M
        nrow = M - i
        ncol = N - j
        return new(offset, M, nrow, ncol)
    end
    function LASubDomain{T}(A::DenseArray{T,2}, i, nrow, j, ncol)
        (M, N) = size(A)
        offset = (i-1) + (j-1)*M
        return new(offset, M, nrow, ncol)
    end
end

import Base.pointer
pointer{T}(A::DenseArray{T,2}, dom::LASubDomain) =
    pointer(A) + dom.offset*sizeof(T)
pointer{T}(A::DenseArray{T,2}, i, j) =
    pointer(A) + ((i-1) + (j-1)*size(A,1))*sizeof(T)

import Base.size
size(dom::LASubDomain) = (dom.nrow, dom.ncol)
size(dom::LASubDomain, k::Int) = (dom.nrow, dom.ncol)[k]

leadingdim{T}(A::DenseArray{T,2}) = size(A, 1)
leadingdim(dom::LASubDomain) = dom.leadingdim

LAArg{T}(A::DenseArray{T,2}, dom::LASubDomain) = (pointer(A, dom), leadingdim(dom))

# Dig in and expose the raw BLAS calls.
const libblas = Base.libblas_name

for (gemm, elty) in ((:dgemm_,:Float64), (:sgemm_,:Float32),
                     (:zgemm_,:Complex128), (:cgemm_,:Complex64))
    @eval begin
        # SUBROUTINE DGEMM(TRANSA,TRANSB,M,N,K,ALPHA,A,LDA,B,LDB,BETA,C,LDC)
        # *     .. Scalar Arguments ..
        #       DOUBLE PRECISION ALPHA,BETA
        #       INTEGER K,LDA,LDB,LDC,M,N
        #       CHARACTER TRANSA,TRANSB
        # *     .. Array Arguments ..
        #       DOUBLE PRECISION A(LDA,*),B(LDB,*),C(LDC,*)
        function painful_gemm!(transA::BlasChar, transB::BlasChar,
                               m::BlasInt, n::B
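The recursive structure itself, stripped of pivoting and the raw-pointer BLAS plumbing, can be sketched in a few lines of current-era Julia. This is my own simplification for exposition, not the attached reclufact!:

```julia
using LinearAlgebra

# Unpivoted recursive LU sketch: factor the leading block, triangular-solve
# the off-diagonal blocks, update the Schur complement, recurse.  L (unit
# lower) and U (upper) overwrite A.  Assumes all leading minors are nonzero.
function reclu!(A::AbstractMatrix)
    n = size(A, 1)
    n <= 1 && return A
    k = div(n, 2)
    A11 = view(A, 1:k,   1:k);  A12 = view(A, 1:k,   k+1:n)
    A21 = view(A, k+1:n, 1:k);  A22 = view(A, k+1:n, k+1:n)
    reclu!(A11)                               # A11 = L11, U11 packed together
    ldiv!(UnitLowerTriangular(A11), A12)      # A12 <- U12 = L11 \ A12
    rdiv!(A21, UpperTriangular(A11))          # A21 <- L21 = A21 / U11
    mul!(A22, A21, A12,                       # A22 <- A22 - L21*U12
         -one(eltype(A)), one(eltype(A)))     # (the Schur complement)
    reclu!(A22)
    return A
end
```

The mul! on the trailing block is where nearly all the time goes for large matrices, which is why the attached version punches through to xGEMM directly.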