Re: [julia-users] Re: Why should computer scientists and computational statisticians invest in Julia instead of R?

Douglas Bates Mon, 12 May 2014 07:20:33 -0700

On Sunday, May 11, 2014 12:55:47 PM UTC-5, Stefan Karpinski wrote:
>
> I find this kind of amusing:
>
> You may have heard before that R is a vectorized language, but what do we 
>> mean by that? One way to read that is to say that many functions in R can 
>> operate efficiently on vectors (in addition to singletons).
>
>


> That's certainly one way to spin it. Following it up with two blog posts 
> giving four different ways to work around this language "feature" suggests 
> something a bit less sunny, however. A more cynical take is that a 
> vectorized language is one that is bad at both iteration and recursion, and 
> thus forces its users to vectorize their code so that all the hard work can 
> be done in a faster language. Perhaps a more even-handed characterization 
> is that vectorized languages provide many vectorized functions and writing 
> vectorized code is idiomatic – which may make Julia a vectorized language, 
> but not in the sense of forcing you to doing everything vectorized.
>
>
I must admit that was my reaction too when I saw those posts.  If I wanted 
to come up with an example of a problem showing why forced vectorization (R 
does not have scalars so any arithmetic operation written in R is 
vectorized) I would look for a calculation where each step in the iteration 
is conditional on the current value.  For example, this calculation.

I don't know the person who wrote these postings.  It's possible that he or 
she doesn't have a lot of experience. I am surprised that others are 
Revolution Analytics didn't pick up on the fact that this example doesn't 
show R in a good light.
 

> This simple Collatz computation is clearly a case where vectorization 
> fails pretty badly and the straightforward iterative version is clear and 
> easy – and in a language that allows efficient iteration, fast. Here's some 
> related code for calculating all the Collatz numbers from 1 up to n, 
> reusing work (adapted from the solution to Euler problem 
> 14<https://github.com/JuliaLang/julia/blob/458d27216f02d61e3e1249b8cd0f0b446c31e719/test/euler.jl#L219-L238>
> ):
>
> function collatz_up_to(m)
>     c = fill(-1,m)
>     c[1] = 0
>     for n::Int64 = 2:m
>         nʹ, d = n, 0
>         while nʹ > length(c) || c[nʹ] < 0
>              nʹ = iseven(nʹ) ? nʹ>>1 : 3nʹ+1
>             d += 1
>         end
>         d += c[nʹ]
>         while n > length(c) || c[n] < 0
>             n <= length(c) && (c[n] = d)
>             n = iseven(n) ? n>>1 : 3n+1
>             d -= 1
>         end
>     end
>     return c
> end
>
>
>
I spent the weekend thinking on and off about memoizing this calculation 
but didn't come up with anything as clear as this.  You obviously have a 
more devious mind than I do.

I think there is a possible enhancement in that once you know c[n] you can 
fill in c[n]*2^k until that product is > m.
 

> On my system, calling collatz_up_to(10^6) is 11x faster than [ collatz(k) 
> for k=1:10^6 ] and 14x faster than map(collatz, 1:10^6). The speedup 
> factor does not seem to grow as the problem gets larger – if anything, it 
> shrinks a little. Maybe the fraction of shared work stays fixed as the 
> number of Collatz iterations one wants to compute grows larger. Still, an 
> 11x speedup isn't bad. This also seems like a computation that's 
> essentially impossible to vectorize.
>
> I'm kind of thinking of making a new benchmark suite, grouping benchmarks 
> by whether they test iteration, recursion, array swapping, parsing, etc. 
> I'd like to head off all the squabbling about the unfairness of not using 
> vectorized implementations by picking problems like Collatz computation, 
> where vectorization isn't effective. Of course, I'm sure there will then be 
> complaints about how the choice of problems is unfair. Perhaps we should 
> include simpler problems where vectorization is viable and compare both 
> vectorized and devectorized versions in various languages – kind of like 
> John Myles White's blog 
> post<http://www.johnmyleswhite.com/notebook/2013/12/22/the-relationship-between-vectorized-and-devectorized-code/>
>  on 
> vectorization vs. devectorization. Making a new benchmark suite would be a 
> lot of work, but unlike when we wrote all the microbenchmark 
> implementations back in 2011, today we may be able to distribute the work 
> and have people who are experts in each language to write the 
> implementations.
>
> On Fri, May 9, 2014 at 6:26 PM, Douglas Bates <dmb...@gmail.com<javascript:>
> > wrote:
>
>> On Friday, May 9, 2014 4:45:00 PM UTC-5, Ted Fujimoto wrote:
>>>
>>> Hi all,
>>>
>>> In the first few minutes of this video (https://www.youtube.com/
>>> watch?v=v9Io-p_iymI) Prof. Doug Bates skims through why one should 
>>> invest in Julia over R. However, I felt it wasn't detailed enough. Why 
>>> should statisticians care about classes instead of "tags"? Do these 
>>> practices really make a difference? Could anyone share some more insight 
>>> into his reasoning? Other reasons other than speed (say programming 
>>> practices, simplicity, etc.) would also be appreciated.
>>>
>>> Thanks.
>>>
>>
>> Bear in mind that the video you mention is an old video, by the standards 
>> of the Julia language.
>>
>> You may want to take a look at some slides from a more recent 
>> presentation.some 
>> slides<http://www.stat.wisc.edu/~bates/2014-04-09-Evanston.html> (type 
>> "w" to get a widescreen view)
>>
>> The big difference for me between using R and using Julia is that I 
>> almost always end up writing some C or C++ code when I am developing in R 
>> whereas in Julia I almost never write C or C++ code for a project.  For 
>> clarity and flexibility being able to write in the high-level REPL-based 
>> language and attain reasonable or very good performance cannot be 
>> overemphasized.
>>
>> A recent 
>> posting<http://blog.revolutionanalytics.com/2014/05/r-and-the-collatz-conjecture-part-2.html>
>>  on 
>> the Revolution Analytics blog (see also the previous 
>> posting<http://blog.revolutionanalytics.com/2014/04/a-look-a-r-vectorization-through-the-collatz-conjecture.html>
>>  on 
>> the topic) showed various ways of evaluating the Collatz stopping time for 
>> a vector of integers.  I was rather surprised at these postings because, in 
>> my opinion, they don't show the R language to advantage.  In Julia I would 
>> write the evaluation of the stopping time for an integer as
>>
>> function collatz(n::Integer)
>>     count = 0
>>     while n > 1
>>         n = isodd(n) ? 3n+1 : n >> 1
>>         count += 1
>>     end
>>     count
>> end
>>
>> which I think is pretty straightforward.  (Well, I do use the expression 
>> n >> 1 to divide n by 2 but you have to have some fun.)
>>
>> On my machine this outperforms all of the R versions.
>>
>> julia> @time map(collatz, 1:1000000)
>> elapsed time: 0.520773287 seconds (39983824 bytes allocated)
>> 1000000-element Array{Int64,1}:
>>    0
>>    1
>>    7
>>    2
>>    5
>>    8
>>   16
>>    3
>>   19
>>    6
>>    ⋮
>>  165
>>  113
>>  165
>>  113
>>  258
>>  113
>>  113
>>  258
>>   258
>>  152
>>
>> And, as a bonus, you can parallelize the mapping of this function with 
>> very little effort.
>>
>> Regarding my comment in the video about "tags", I think it is important. 
>>  There are now at least three systems for defining classes and dispatching 
>> calls to generics in R but the one that is the most widely used, known as 
>> "S3", doesn't define classes.  In R
>>
>> class(x) <- "foo"
>>
>> doesn't check anything about the structure of 'x', which is a recipe for 
>> hard-to-detect bugs.
>>
>> Anyway, enough of my comments.  Perhaps others can contribute.
>>
>
>

Re: [julia-users] Re: Why should computer scientists and computational statisticians invest in Julia instead of R?

Reply via email to