I find this kind of amusing:

> You may have heard before that R is a vectorized language, but what do we
> mean by that? One way to read that is to say that many functions in R can
> operate efficiently on vectors (in addition to singletons).


That's certainly one way to spin it. Following it up with two blog posts
giving four different ways to work around this language "feature" suggests
something a bit less sunny, however. A more cynical take is that a
vectorized language is one that is bad at both iteration and recursion, and
thus forces its users to vectorize their code so that all the hard work can
be done in a faster language. Perhaps a more even-handed characterization
is that vectorized languages provide many vectorized functions and writing
vectorized code is idiomatic – which may make Julia a vectorized language,
but not in the sense of forcing you to do everything vectorized.
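
To make that concrete, here's a tiny sketch (not from the blog posts – just
an illustrative sum of squares) of how both styles look in Julia. Neither
version is second-class; if anything, the loop is slightly faster because it
avoids allocating a temporary array:

function sumsq_vectorized(x)
    # whole-array style, as one would write it in R
    return sum(x .^ 2)
end

function sumsq_loop(x)
    # explicit iteration – equally idiomatic in Julia, no loop penalty
    s = zero(eltype(x))
    for xi in x
        s += xi * xi
    end
    return s
end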

This simple Collatz computation is clearly a case where vectorization fails
pretty badly and the straightforward iterative version is clear and easy –
and in a language that allows efficient iteration, fast. Here's some
related code for calculating the Collatz stopping times for all integers
from 1 up to m, reusing shared work (adapted from the solution to Euler
problem 14<https://github.com/JuliaLang/julia/blob/458d27216f02d61e3e1249b8cd0f0b446c31e719/test/euler.jl#L219-L238>):

function collatz_up_to(m)
    c = fill(-1, m)    # c[n] is the stopping time of n; -1 means unknown
    c[1] = 0
    for n::Int64 = 2:m
        # walk forward from n until hitting a value with a known stopping time
        nʹ, d = n, 0
        while nʹ > length(c) || c[nʹ] < 0
            nʹ = iseven(nʹ) ? nʹ>>1 : 3nʹ+1
            d += 1
        end
        d += c[nʹ]
        # walk the same path again, recording stopping times along the way
        while n > length(c) || c[n] < 0
            n <= length(c) && (c[n] = d)
            n = iseven(n) ? n>>1 : 3n+1
            d -= 1
        end
    end
    return c
end
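
As a quick sanity check, the first ten values agree with the head of the
map(collatz, 1:1000000) output Doug quotes below:

collatz_up_to(10) == [0, 1, 7, 2, 5, 8, 16, 3, 19, 6]  # true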


On my system, calling collatz_up_to(10^6) is 11x faster than [ collatz(k)
for k=1:10^6 ] and 14x faster than map(collatz, 1:10^6). The speedup factor
does not seem to grow as the problem gets larger – if anything, it shrinks
a little, presumably because the fraction of work shared between inputs
stays roughly fixed as the range grows. Still, an 11x speedup
isn't bad. This also seems like a computation that's essentially impossible
to vectorize.
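
And as Doug notes below, the map version parallelizes with very little
effort. A rough sketch of how that might look (the worker and chunk counts
are arbitrary; chunking matters because for a function this cheap, the
per-element scheduling overhead of pmap would otherwise dominate):

addprocs(4)  # spin up some local workers

@everywhere function collatz(n::Integer)
    count = 0
    while n > 1
        n = isodd(n) ? 3n+1 : n >> 1
        count += 1
    end
    count
end

# hand each worker a contiguous chunk, then concatenate the results
chunks = [i:min(i+9999, 10^6) for i = 1:10000:10^6]
stopping_times = vcat(pmap(r -> map(collatz, r), chunks)...)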

I'm kind of thinking of making a new benchmark suite, grouping benchmarks
by whether they test iteration, recursion, array swapping, parsing, etc.
I'd like to head off all the squabbling about the unfairness of not using
vectorized implementations by picking problems like Collatz computation,
where vectorization isn't effective. Of course, I'm sure there will then be
complaints about how the choice of problems is unfair. Perhaps we should
include simpler problems where vectorization is viable and compare both
vectorized and devectorized versions in various languages – kind of like
John Myles White's blog
post<http://www.johnmyleswhite.com/notebook/2013/12/22/the-relationship-between-vectorized-and-devectorized-code/>
on vectorization vs. devectorization. Making a new benchmark suite would
be a lot of work, but unlike when we wrote all the microbenchmark
implementations back in 2011, today we may be able to distribute the work
and have people who are experts in each language write the
implementations.
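
For the vectorizable problems, the cross-language comparison could look
something like this sketch (the function names are mine, and vector
addition is just a stand-in for whatever problems we actually pick):

function add_vectorized(x, y)
    # whole-array style: roughly the same speed in R, NumPy, and Julia
    return x + y
end

function add_devectorized(x, y)
    # explicit loop: fast in Julia, painfully slow in R
    z = similar(x)
    for i = 1:length(x)
        z[i] = x[i] + y[i]
    end
    return z
end

x, y = rand(10^7), rand(10^7)
@time add_vectorized(x, y)
@time add_devectorized(x, y)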

On Fri, May 9, 2014 at 6:26 PM, Douglas Bates <dmba...@gmail.com> wrote:

> On Friday, May 9, 2014 4:45:00 PM UTC-5, Ted Fujimoto wrote:
>>
>> Hi all,
>>
>> In the first few minutes of this video (
>> https://www.youtube.com/watch?v=v9Io-p_iymI) Prof. Doug Bates skims
>> through why one should invest in Julia over R. However, I felt it wasn't
>> detailed enough. Why should statisticians care about classes instead of
>> "tags"? Do these practices really make a difference? Could anyone share
>> some more insight into his reasoning? Reasons other than speed (say
>> programming practices, simplicity, etc.) would also be appreciated.
>>
>> Thanks.
>>
>
> Bear in mind that the video you mention is an old video, by the standards
> of the Julia language.
>
> You may want to take a look at some
> slides <http://www.stat.wisc.edu/~bates/2014-04-09-Evanston.html> from a
> more recent presentation (type "w" to get a widescreen view).
>
> The big difference for me between using R and using Julia is that I almost
> always end up writing some C or C++ code when I am developing in R whereas
> in Julia I almost never write C or C++ code for a project.  The value of
> being able to write in the high-level REPL-based language, with all its
> clarity and flexibility, and still attain reasonable or very good
> performance cannot be overemphasized.
>
> A recent
> posting<http://blog.revolutionanalytics.com/2014/05/r-and-the-collatz-conjecture-part-2.html>
> on the Revolution Analytics blog (see also the previous
> posting<http://blog.revolutionanalytics.com/2014/04/a-look-a-r-vectorization-through-the-collatz-conjecture.html>
> on the topic) showed various ways of evaluating the Collatz stopping time for
> a vector of integers.  I was rather surprised at these postings because, in
> my opinion, they don't show the R language to advantage.  In Julia I would
> write the evaluation of the stopping time for an integer as
>
> function collatz(n::Integer)
>     count = 0
>     while n > 1
>         n = isodd(n) ? 3n+1 : n >> 1
>         count += 1
>     end
>     count
> end
>
> which I think is pretty straightforward.  (Well, I do use the expression
> n >> 1 to divide n by 2, but you have to have some fun.)
>
> On my machine this outperforms all of the R versions.
>
> julia> @time map(collatz, 1:1000000)
> elapsed time: 0.520773287 seconds (39983824 bytes allocated)
> 1000000-element Array{Int64,1}:
>    0
>    1
>    7
>    2
>    5
>    8
>   16
>    3
>   19
>    6
>    ⋮
>  165
>  113
>  165
>  113
>  258
>  113
>  113
>  258
>  258
>  152
>
> And, as a bonus, you can parallelize the mapping of this function with
> very little effort.
>
> Regarding my comment in the video about "tags", I think it is important.
>  There are now at least three systems for defining classes and dispatching
> calls to generics in R but the one that is the most widely used, known as
> "S3", doesn't define classes.  In R
>
> class(x) <- "foo"
>
> doesn't check anything about the structure of 'x', which is a recipe for
> hard-to-detect bugs.
>
> Anyway, enough of my comments.  Perhaps others can contribute.
>
