Re: [julia-users] Re: what's the best way to do R table() in julia? (why does StatsBase.count(x,k) need k?)

David van Leeuwen Wed, 26 Nov 2014 09:31:31 -0800

Hello again,

I worked hard on NamedArrays.jl to solve the problems indicated below:


On Monday, November 10, 2014 1:43:57 AM UTC+1, Dahua Lin wrote:
>
> NamedArrays.jl generally goes along this way. However, it remains limited 
> in two aspects:
>
> 1. Some fields in NamedArrays are not declared of specific types. In 
> particular, the field `dicts` is of the type `Vector{Dict}`, and the use of 
> this field is on the critical path when looping over the table, e.g. when 
> counting. This would potentially lead to substantial impact on performance.
>
A NamedArray is now parameterized by the complete set of Dicts that are 
used for the indices.  It took me a while to get the constructors right; in 
intermediate stages of the development I ended up with TypeVar parameters 
of NamedArray.  
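
To illustrate why the typing of the dictionaries matters, here is a minimal 
sketch.  It is not the NamedArrays.jl definition: `LabeledArray` and `lookup` 
are hypothetical names invented for this example, and the syntax is that of 
a current Julia release (1.x), not julia-0.3.  The point is only that the 
tuple of Dicts is a type parameter, so every name lookup sees a concrete 
dictionary type instead of an abstract `Dict`.

```julia
# Hypothetical type for illustration only; NOT the NamedArrays.jl definition.
# D is the concrete tuple type of the per-dimension dictionaries, so the
# field `dicts` is concretely typed (unlike a field declared Vector{Dict}).
struct LabeledArray{T,N,D<:NTuple{N,AbstractDict}}
    data::Array{T,N}
    dicts::D          # one name => position dictionary per dimension
end

# Translate one name per dimension into integer positions, then index.
# `lookup` is a made-up helper name, not part of any package.
function lookup(a::LabeledArray{T,N}, names::Vararg{Any,N}) where {T,N}
    positions = map((d, n) -> d[n], a.dicts, names)
    return a.data[positions...]
end

table = LabeledArray([1 2; 3 4],
                     (Dict("a" => 1, "b" => 2), Dict(:x => 1, :y => 2)))
value = lookup(table, "b", :x)   # row named "b", column named :x
```

Because each dimension has its own dictionary type, the two dimensions above 
can use different key types (String and Symbol) without losing type 
information.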
 

> 2. Currently, it only accepts a limited set of types for indices, e.g. 
> Real and String. But in some cases, people may go beyond this. I don't 
> think we have to impose this limit. 
>
The indexing code is now completely overhauled, and the indices() methods 
are explicitly parameterized by the dictionary key type, so calling them 
should be efficient.  It should now be possible to index a NamedArray with 
any type, although some types (AbstractVector, Range, Int) are interpreted 
specially.  

As a consequence, the type of the key for the indices cannot be altered 
after initialization of a NamedArray (the names themselves still can). 
Thus, if you want key types other than ASCIIString (which is used to give 
default names to indices), you need to call a constructor with your names 
already prepared instead of filling them in afterwards. 
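
The constraint can be illustrated with plain Dicts standing in for the name 
lookup.  This is only a sketch of the idea in current Julia (1.x) syntax, 
not NamedArrays.jl code, and the variable names are made up for the example.

```julia
# Plain Dicts stand in for a name => position lookup; illustration only,
# this is not NamedArrays.jl code.
default_names = Dict{String,Int}("1" => 1, "2" => 2)

# The key type is part of the Dict's type, so a Symbol key is rejected
# when we try to fill it in afterwards.
rejected = try
    default_names[:apple] = 1
    false
catch
    true
end

# Preparing the names first gives the lookup the key type you want.
fruit = [:apple, :pear]
fruit_names = Dict{Symbol,Int}(name => i for (i, name) in enumerate(fruit))
```

Renaming within the same key type (one String replaced by another) stays 
possible; only the key type itself is fixed once the dictionary exists.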

You can try the code for julia-0.3 with Pkg.checkout("NamedArrays"), or 
read it on GitHub. 

Cheers, 

---david
 

> Dahua
>
>
> On Monday, November 10, 2014 8:35:32 AM UTC+8, Dahua Lin wrote:
>>
>> I have been observing an interesting difference between people coming 
>> from stats and people coming from machine learning.
>>
>> Stats people tend to favor the approach that allows one to directly use 
>> the category names to index the table, e.g. A["apple"]. This tendency is 
>> clearly reflected in the design of R, where one can attach a name to 
>> everything.
>>
>> In machine learning practice, by contrast, it is a common convention to 
>> just encode categories as integers and simply use an ordinary array to 
>> represent a counting table. Although this is a little inconvenient in an 
>> interactive environment, it is generally more efficient when you have to 
>> deal with these categories over a large number of samples.
>>
>> These differences aside, I believe, however, that there exists a very 
>> generic approach to this problem -- a multi-dimensional associative map, 
>> which allows one to write A[i1, i2, ...] where the indices can be arbitrary 
>> hashable & equality-comparable instances, including integers, strings, 
>> symbols, among many other things.
>>
>> A multi-dimensional associative map can be considered a 
>> multi-dimensional generalization of dictionaries, which can be easily 
>> implemented via a multidimensional array and several dictionaries, one 
>> per dimension, that map user-side indexes to integer indexes. 
>>
>> - Dahua
>>
>>
>>
>>
>> On Monday, November 10, 2014 8:12:54 AM UTC+8, David van Leeuwen wrote:
>>>
>>> Hi, 
>>>
>>> On Sunday, November 9, 2014 5:10:19 PM UTC+1, Milan Bouchet-Valat wrote:
>>>
>>>> Actually I didn't do it because NamedArrays.jl didn't work well on 0.3 
>>>> when I first worked on the package. Now I see the tests are still failing. 
>>>> Do you know what is needed to make them work?
>>>>
>>> What exactly is not working? Could you maybe file an issue?  Travis 
>>> tells me all is fine. 
>>>
>>> ---david
>>>  
>>>
>>>> Another point is that I think this deserves going into StatsBase, but 
>>>> before that we need everybody to agree on a design for NamedArrays.
>>>>
>>>> Regards
>>>>
>>>>
>>>>  On Sunday, November 9, 2014 4:26:45 PM UTC+1, Milan Bouchet-Valat 
>>>> wrote: 
>>>>
>>>>  On Thursday, 6 November 2014 at 11:17 -0800, Conrad Stack wrote: 
>>>>
>>>> I was also looking for a function like this, but could not find one in 
>>>> docs.julialang.org.  I was doing this (v0.4.0-dev), for anyone who is 
>>>> interested:
>>>>
>>>>
>>>> example = rand(1:10,100)
>>>> uexample = sort(unique(example))
>>>> counts = map(x->count(y->x==y,example),uexample)
>>>>
>>>>
>>>> It's pretty ugly, so thanks, Johan, for pointing out the 
>>>> StatsBase->countmap 
>>>>
>>>> I've also put together a small package precisely aimed at offering an 
>>>> equivalent of R's table():
>>>> https://github.com/nalimilan/Tables.jl
>>>>
>>>> But there's a more general issue about how to handle arrays with 
>>>> dimension names in Julia. NamedArrays.jl (which is used in my package) 
>>>> attempts to tackle this issue, but I don't think we've reached a consensus 
>>>> yet about the best solution.
>>>>
>>>>
>>>> Regards
>>>>
>>>>  
>>>>
>>>>
>>>> On Sunday, August 17, 2014 9:56:29 AM UTC-4, Johan Sigfrids wrote:
>>>>
>>>> I think countmap comes closest to giving you what you want:
>>>>
>>>> using StatsBase
>>>> data = sample(["a", "b", "c"], 20)
>>>> countmap(data)
>>>>
>>>>
>>>> Dict{ASCIIString,Int64} with 3 entries:
>>>>   "c" => 3
>>>>   "b" => 10
>>>>   "a" => 7
>>>>
>>>>
>>>> On Sunday, August 17, 2014 4:45:21 PM UTC+3, Florian Oswald wrote: 
>>>>
>>>> Hi 
>>>>
>>>>
>>>> I'm looking for the best way to count how many times a certain value 
>>>> x_i appears in a vector x, where x could hold integers, floats, or 
>>>> strings. In R I would do table(x). I found StatsBase.counts(x,k) but I'm 
>>>> a bit confused by k (where k goes into 1:k, i.e. the vector is scanned 
>>>> to find how many elements fall on each point of 1:k). Most of the time I 
>>>> don't know k, and in fact I would do table(x) just to find out what k 
>>>> is. Apart from that, I don't think I could use this with strings, as I 
>>>> can't construct a range object from strings. 
>>>>
>>>>
>>>> I'm wondering whether a method StatsBase.counts(x::Vector) just 
>>>> returning the frequency of each element appearing would be useful. 
>>>>
>>>>
>>>> The same applies to Base.hist if I understand correctly. I just don't 
>>>> want to have to specify the edges of bins. 
>>>>
>>>>
>>>>
>>>>
>>>>   
>>>>
>>>>  
>>>>
