Hello,

On Monday, November 10, 2014 1:43:57 AM UTC+1, Dahua Lin wrote:
>
> NamedArrays.jl generally goes along this way. However, it remains limited
> in two aspects:
>
> 1. Some fields in NamedArrays are not declared of specific types. In
> particular, the field `dicts` is of the type `Vector{Dict}`, and the use of
> this field is on the critical path when looping over the table, e.g. when
> counting. This would potentially lead to substantial impact on performance.
>

In the beginning I experimented with indexing speed, mainly to sort out the various forms of getindex(), and although I don't remember the exact result, I do remember that I found the drop in performance w.r.t. integer indexing surprisingly small.
I suppose the problem you indicate can be alleviated by making NamedArray parameterized by the type of the key in the dict as well.

> 2. Currently, it only accepts a limited set of types for indices, e.g. Real
> and String. But in some cases, people may go beyond this. I don't think we
> have to impose this limit.
>

Ah, I now see what you mean. I thought I had built in support for all types as index, but there obviously is no catch-all rule in getindex. I suppose NamedArray needs an update there.

---david

> Dahua
>
>
> On Monday, November 10, 2014 8:35:32 AM UTC+8, Dahua Lin wrote:
>>
>> I have been observing an interesting difference between people coming
>> from stats and machine learning.
>>
>> Stats people tend to favor the approach that allows one to directly use
>> the category names to index the table, e.g. A["apple"]. This tendency is
>> clearly reflected in the design of R, where one can attach a name to
>> everything.
>>
>> While in machine learning practice, it is a common convention to just
>> encode categories into integers, and simply use an ordinary array to
>> represent a counting table. Whereas it makes it a little bit inconvenient
>> in an interactive environment, this way is generally more efficient when
>> you have to deal with these categories over a large number of samples.
>>
>> These differences aside, I believe, however, that there exists a very
>> generic approach to this problem -- a multi-dimensional associative map,
>> which allows one to write A[i1, i2, ...] where the indices can be arbitrary
>> hashable & equality-comparable instances, including integers, strings,
>> symbols, among many other things.
>>
>> A multi-dimensional associative map can be considered as a
>> multi-dimensional generalization of dictionaries, which can be easily
>> implemented via a multidimensional array and several dictionaries, each
>> for one dimension, to map user-side indexes to integer indexes.
>>
>> - Dahua
>>
>>
>> On Monday, November 10, 2014 8:12:54 AM UTC+8, David van Leeuwen wrote:
>>>
>>> Hi,
>>>
>>> On Sunday, November 9, 2014 5:10:19 PM UTC+1, Milan Bouchet-Valat wrote:
>>>
>>>> Actually I didn't do it because NamedArrays.jl didn't work well on 0.3
>>>> when I first worked on the package. Now I see the tests are still failing.
>>>> Do you know what is needed to make them work?
>>>>
>>> What exactly is not working, could you maybe file an issue? Travis
>>> tells me all is fine.
>>>
>>> ---david
>>>
>>>
>>>> Another point is that I think this deserves going into StatsBase, but
>>>> before that we need everybody to agree on a design for NamedArrays.
>>>>
>>>> Regards
>>>>
>>>>
>>>> On Sunday, November 9, 2014 4:26:45 PM UTC+1, Milan Bouchet-Valat
>>>> wrote:
>>>>
>>>> On Thursday, November 6, 2014 at 11:17 -0800, Conrad Stack wrote:
>>>>
>>>> I was also looking for a function like this, but could not find one in
>>>> docs.julialang.org. I was doing this (v0.4.0-dev), for anyone who is
>>>> interested:
>>>>
>>>>
>>>> example = rand(1:10,100)
>>>> uexample = sort(unique(example))
>>>> counts = map(x->count(y->x==y,example),uexample)
>>>>
>>>>
>>>> It's pretty ugly, so thanks, Johan, for pointing out the
>>>> StatsBase->countmap
>>>>
>>>> I've also put together a small package precisely aimed at offering an
>>>> equivalent of R's table():
>>>> https://github.com/nalimilan/Tables.jl
>>>>
>>>> But there's a more general issue about how to handle arrays with
>>>> dimension names in Julia. NamedArrays.jl (which is used in my package)
>>>> attempts to tackle this issue, but I don't think we've reached a consensus
>>>> yet about the best solution.
>>>>
>>>>
>>>> Regards
>>>>
>>>>
>>>> On Sunday, August 17, 2014 9:56:29 AM UTC-4, Johan Sigfrids wrote:
>>>>
>>>> I think countmap comes closest to giving you what you want:
>>>>
>>>> using StatsBase
>>>> data = sample(["a", "b", "c"], 20)
>>>> countmap(data)
>>>>
>>>>
>>>> Dict{ASCIIString,Int64} with 3 entries:
>>>>   "c" => 3
>>>>   "b" => 10
>>>>   "a" => 7
>>>>
>>>>
>>>> On Sunday, August 17, 2014 4:45:21 PM UTC+3, Florian Oswald wrote:
>>>>
>>>> Hi
>>>>
>>>>
>>>> I'm looking for the best way to count how many times a certain value
>>>> x_i appears in vector x, where x could be integers, floats, strings. In R
>>>> I would do table(x). I found StatsBase.counts(x,k) but I'm a bit confused by
>>>> k (where k goes into 1:k, i.e. the vector is scanned to find how many
>>>> elements locate at each point of 1:k). Most of the time I don't know k,
>>>> and in fact I would do table(x) just to find out what k was. Apart from
>>>> that, I don't think I could use this with strings, as I can't construct a
>>>> range object from strings.
>>>>
>>>>
>>>> I'm wondering whether a method StatsBase.counts(x::Vector) just
>>>> returning the frequency of each element appearing would be useful.
>>>>
>>>>
>>>> The same applies to Base.hist if I understand correctly. I just don't
>>>> want to have to specify the edges of bins.
>>>>
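
For reference, the multi-dimensional associative map that Dahua describes in this thread (a dense array plus one dictionary per dimension, mapping user-side keys to integer positions) could be sketched roughly as below. This is only an illustration under assumptions: the type name `AssocArray` and its fields `data` and `lookups` are hypothetical and are not the NamedArrays.jl API, and the code is written for a current Julia release rather than the 0.3/0.4 versions discussed above. Keeping the dictionaries in a concretely typed tuple instead of a `Vector{Dict}` is one possible way to address the performance concern about untyped fields, and the untyped `getindex` arguments act as the catch-all rule for key types.

```julia
# Hypothetical illustration type; NOT the NamedArrays.jl API.
# `data` holds the values, `lookups` holds one Dict per dimension that
# maps a user-side key (String, Symbol, Int, ...) to an integer position.
struct AssocArray{T,N,K<:Tuple}
    data::Array{T,N}
    lookups::K   # e.g. Tuple{Dict{String,Int},Dict{Symbol,Int}}, concretely typed
end

# Build the per-dimension dictionaries from one vector of labels per dimension.
function AssocArray(data::Array{T,N}, labels::Vararg{AbstractVector,N}) where {T,N}
    for d in 1:N
        length(labels[d]) == size(data, d) ||
            throw(DimensionMismatch("dimension $d: label count does not match array size"))
    end
    lookups = map(ls -> Dict(l => i for (i, l) in enumerate(ls)), labels)
    return AssocArray{T,N,typeof(lookups)}(data, lookups)
end

# Catch-all indexing: any hashable, equality-comparable key is accepted.
# An unknown key raises a KeyError from the Dict lookup.
function Base.getindex(a::AssocArray{T,N}, ks::Vararg{Any,N}) where {T,N}
    idx = map((lookup, k) -> lookup[k], a.lookups, ks)
    return a.data[idx...]
end

function Base.setindex!(a::AssocArray{T,N}, v, ks::Vararg{Any,N}) where {T,N}
    idx = map((lookup, k) -> lookup[k], a.lookups, ks)
    a.data[idx...] = v
    return a
end

# Usage: a two-way counting table indexed by category names.
fruits = ["apple", "pear", "apple", "fig", "apple", "pear"]
colors = [:red, :green, :green, :purple, :red, :green]

table = AssocArray(zeros(Int, 3, 3),
                   ["apple", "fig", "pear"],
                   [:green, :purple, :red])

for (f, c) in zip(fruits, colors)
    table[f, c] += 1
end

println(table["apple", :red])   # 2: "apple" occurs twice together with :red
```

In this sketch the key types of each dimension become part of the array's type parameters, which is in the spirit of David's suggestion to parameterize the array by the dictionary key type. Whether that fits NamedArrays.jl itself is a design question this sketch does not settle.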