Agreed! Axis-reducing dot product operators might be a reasonable addition
to the standard library, especially since BLAS provides the (presumably
highly optimized or even multi-threaded) dot product primitive, which a
library function could easily call with the appropriate strides.
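
A rough sketch of what I have in mind, assuming a current Julia version with
the `LinearAlgebra` standard library. The name `rowdot_views` is made up for
illustration and is not an existing library function. It calls `dot` on
strided views of each row, so no m-by-n temporary is created; whether `dot`
actually reaches BLAS for those views depends on the element type and the
Julia version, so treat that part as an assumption.

```julia
using LinearAlgebra

# Hypothetical helper (not in the standard library): row-wise dot product
# computed by calling `dot` on a strided view of each row.
function rowdot_views(a::AbstractMatrix, b::AbstractMatrix)
    size(a) == size(b) || throw(DimensionMismatch("a and b must have the same size"))
    # One scalar per row; the views do not copy the row data.
    return [dot(view(a, i, :), view(b, i, :)) for i in axes(a, 1)]
end

a = [1.0 2.0; 3.0 4.0]
b = [5.0 6.0; 7.0 8.0]
r = rowdot_views(a, b)
```

For real-valued matrices this should agree with the vectorized
`vec(sum(a .* b, dims=2))`, which does allocate the full elementwise product.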

>> A good solution for this particular problem, though presumably uses more
>> memory than a dedicated axis-aware dot product method.
> You are correct that the method I showed does create a matrix of the same
> size as `a` and `b` to evaluate the `.*` operation.  You can avoid doing so
> but, of course, the code becomes more opaque.  I think the point for me is
> that if `a` and `b` are so large that the allocation and freeing of the
> memory becomes problematic, I can write the space conserving version in
> Julia and get performance comparable to compiled code.  Lately when
> describing Julia to colleagues I mention the type system and multiple
> dispatch and several other aspects showing how well-designed Julia is.  But
> the point that I emphasize is "one language", which sometimes I extend to
> "One language to rule them all" (I assume everyone is familiar with "Lord
> of the Rings").  I can write Julia code in a high-level, vectorized
> style (like R, Matlab/Octave) but I can also, if I need to, write low-level
> iterative code in Julia.  I don't need to use a compiled language and write
> interface code.
> If I have very large arrays, perhaps even memory-mapped arrays because
> they are so large, I could define a function
> function rowdot(a::DenseMatrix{T}, b::DenseMatrix{T}) where {T}
>     ((m,n) = size(a)) == size(b) || throw(DimensionMismatch("a and b must have the same size"))
>     res = zeros(T,(m,))
>     for j in 1:n, i in 1:m
>         res[i] += a[i,j] * b[i,j]
>     end
>     res
> end
> that avoided creating the temporary.  Once I convinced myself that there
> were no problems in the code (and my first version did indeed have a bug) I
> could change the loop to
>     for j in 1:n
>         @simd for i in 1:m
>             @inbounds res[i] += a[i,j] * b[i,j]
>         end
>     end
> and improve the performance.  In the next iteration I could use
> SharedArrays and parallelize the calculation if it really needed it.
> As a programmer I am grateful for the incredible freedom that Julia gives
> me to get as obsessive compulsive about performance as I want.
> Thanks!
> You're welcome.
>>>> Working through the excellent Coursera machine-learning course, I found
>>>> myself using the row-wise (axis-wise) dot product in Octave, but found
>>>> there was no obvious equivalent in Julia.
>>>> In Octave/Matlab, one can call dot(a,b,2) to get the row-wise dot
>>>> product of two mxn matrices, returned as a new column vector of size mx1.
>>>> Even though Julia makes for loops faster, I like sum(dot(a,b,2)) for
>>>> its concision over the equivalent array comprehension or explicit for loop.
>>>> Hopefully I'm just missing an overload or alternate name?
