[julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?
Hi David,

Thanks for the example and the link to the source file. I found it very educational and useful. My original motivation was to check what is (or could be) best practice in Julia for creating, e.g., all interactions of, say, 3 variables and having sensible names filled in automatically. NamedArrays seems to be a good start.

In addition, is there any documentation for the Symbol and Dictionary types? I found very little about them in the Julia manual.

Best regards,
Jan
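As a rough illustration of the interaction question, and of how Symbol and Dict fit together, here is a minimal sketch in plain Julia. It assumes Julia 1.x syntax (not the 0.3 used elsewhere in this thread) and does not use NamedArrays. The helper `interactions` and its naming scheme (concatenating column names, as in x1x2x3) are hypothetical and chosen only for illustration.

```julia
# Hypothetical helper: element-wise products of every subset of two or more
# columns of X, with names built by concatenating the column names.
function interactions(X::AbstractMatrix, colnames::Vector{Symbol})
    size(X, 2) == length(colnames) || throw(DimensionMismatch("one name per column expected"))
    p = length(colnames)
    cols = Vector{Vector{eltype(X)}}()
    newnames = Symbol[]
    # Each bit pattern from 1 to 2^p - 1 selects a non-empty subset of columns.
    for mask in 1:(2^p - 1)
        idx = [j for j in 1:p if ((mask >> (j - 1)) & 1) == 1]
        length(idx) >= 2 || continue               # skip the single columns
        push!(cols, vec(prod(X[:, idx], dims=2)))  # row-wise product of the subset
        push!(newnames, Symbol(colnames[idx]...))  # Symbol(:x1, :x2) == :x1x2
    end
    return reduce(hcat, cols), newnames
end

X = [1.0 2.0 3.0;
     4.0 5.0 6.0]
Z, znames = interactions(X, [:x1, :x2, :x3])

# A Dict maps each generated name to its column position, the same kind of
# name-to-index lookup that shows up in the NamedArray type printed in this thread.
index = Dict(name => j for (j, name) in enumerate(znames))
x123 = Z[:, index[:x1x2x3]]
```

Symbols are interned names that compare cheaply, which is why they are a common choice for Dict keys such as column labels.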
[julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?
Hello Ján,

On Monday, December 8, 2014 11:04:08 AM UTC+1, Ján Dolinský wrote:
> Hi,
> Thanks for the explanation. Suppose I have a named array X with 3 columns x1, x2 and x3 and I do prod(X, 2). Will the resulting array (a single column in this case) have a sensible name like x1x2x3? Or more generally, how are these new names generated and for which operations?
> Thanks,
> Jan

For some operators NamedArrays tries to generate sensible names, including prod():

    julia> n = NamedArray(rand(2,3), ([:a, :b], [:x1, :x2, :x3]), ("rows", "cols"))
    2x3 NamedArray{Float64,2,Array{Float64,2},(Dict{Symbol,Int64},Dict{Symbol,Int64})}
    rows \ cols       x1       x2       x3
    a           0.712609 0.607843  0.18794
    b           0.208052   0.4409 0.282238

    julia> prod(n, 2)
    2x1 NamedArray{Float64,2,Array{Float64,2},(Dict{Symbol,Int64},Dict{ASCIIString,Int64})}
    rows \ cols prod(cols)
    a            0.0814106
    b            0.0258897

So the column in this case is called prod(cols). Note that we started with symbols as names for the column indices in this example, but prod() normalizes this to ASCIIStrings. I've worked hard to make the index type a free choice, but some automatically generated names default to Strings.

You can find the list of operations for which NamedArrays tries to give sensible names here: https://github.com/davidavdav/NamedArrays/blob/master/src/changingnames.jl, in the source code. I am open to suggestions and pull requests for more functions or other names.

As a result of your original question I am now implementing the matrix operation support more thoroughly, basically following the Julia manual for possible matrix operations. Some take a bit of effort to implement; for others I need to think about sensible names for the dimensions. Expect an update within the next week or so.

Cheers,
---david
Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?
> For DataFrames, it depends on what you want to do. It is difficult to get good performance with DataArrays as columns using the current implementation. With the ongoing work by John Myles White on the use of a Nullable type, that should get much better. Also, you can use standard Arrays as columns of a DataFrame. It's not well documented, but it can be done.
>
> Also, if you want to treat a DataFrame like a matrix, then generally the answer is no. With some trickery, you can store a view to a matrix in a DataFrame. Basically, you have to create column views into the matrix. Here is an example. It might be useful if you want to treat all or part of a DataFrame as a matrix.

Thanks a lot for an excellent explanation and the example. I'll try it out. Looks promising.

Jan
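The column-view idea quoted in this message can be tried without DataFrames. The sketch below assumes Julia 1.x, where `view` takes the place of the `sub` function of the 0.3 era. It uses a plain Dict from column name to view as a stand-in for a DataFrame, and the variable names are illustrative only.

```julia
m = [1.0 2.0 3.0;
     4.0 5.0 6.0]
colnames = [:a, :b, :c]

# One view per column; no data is copied, each view refers back to `m`.
columns = Dict(name => view(m, :, j) for (j, name) in enumerate(colnames))

# Writing through a view changes the matrix ...
columns[:b][1] = 20.0
# ... and writing to the matrix is visible through the view.
m[2, 3] = 60.0

b = collect(columns[:b])   # materialize a copy of column b
c = collect(columns[:c])   # materialize a copy of column c
```

Because every column shares memory with `m`, the matrix can still be handed to linear-algebra routines as a whole while individual columns are addressed by name.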
[julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?
> I can only speak for NamedArrays. On the one hand, the deployment of BLAS should be transparent, and using a NamedArray instead of an Array should not lead to much degradation in performance. E.g., a * b with `a` and `b` both NamedArrays effectively calls a.array * b.array, which Base implements with BLAS.gemm(). There is just a little overhead from filling in sensible names in the result---so if you have small matrices in an inner loop, you're going to get hurt.
>
> On the other hand, I am not sure how much of the Julia BLAS cleverness is retained in NamedArrays---but the intention of the package is that it is completely transparent, and if you notice bad performance in a particular situation then you should file an issue (or make a PR :-).
>
> Individual element indexing of a NamedArray with integers is just a little bit slower than that of an Array. Indexing by name is quite a bit slower---you may try a different Associative than the standard Dict.
>
> Incidentally, I've been toying with the idea of having the NamedArrays `*` check the consistency of index and dimension names, but my guess is that people would find such a thing annoying.
>
> ArrayViews are currently not aware of NamedArrays. I believe the views are going to be part of julia-0.4, so then it would be a task for NamedArrays to implement views of NamedArrays, I gather.
>
> Cheers,
> ---david

Hi,

Thanks for the explanation. Suppose I have a named array X with 3 columns x1, x2 and x3 and I do prod(X, 2). Will the resulting array (a single column in this case) have a sensible name like x1x2x3? Or more generally, how are these new names generated and for which operations?

Thanks,
Jan
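To make the delegation described in the quoted reply concrete, here is a toy wrapper written for Julia 1.x. The `Labeled` type and its fields are hypothetical and far simpler than the real NamedArray. The sketch only shows that `*` can forward to the wrapped matrices (where Base is free to use BLAS for Float64 data), that the wrapped array can be passed to a BLAS routine such as gemv directly, and that indexing by name adds a Dict lookup.

```julia
using LinearAlgebra

# Hypothetical, simplified stand-in for a named matrix.
struct Labeled
    array::Matrix{Float64}
    rows::Dict{Symbol,Int}   # row name => row index
    cols::Dict{Symbol,Int}   # column name => column index
end

# Multiplication forwards to the wrapped matrices; the only extra work is
# carrying the row names of `a` and the column names of `b` into the result.
Base.:*(a::Labeled, b::Labeled) = Labeled(a.array * b.array, a.rows, b.cols)

# Integer indexing goes straight to the array.
Base.getindex(a::Labeled, i::Int, j::Int) = a.array[i, j]
# Indexing by name first resolves both names through the dictionaries.
Base.getindex(a::Labeled, r::Symbol, c::Symbol) = a.array[a.rows[r], a.cols[c]]

a = Labeled([1.0 2.0; 3.0 4.0], Dict(:r1 => 1, :r2 => 2), Dict(:k1 => 1, :k2 => 2))
b = Labeled([5.0 6.0; 7.0 8.0], Dict(:k1 => 1, :k2 => 2), Dict(:c1 => 1, :c2 => 2))
c = a * b

# The wrapped Matrix{Float64} can be handed to a BLAS routine directly.
x = [1.0, 1.0]
y = BLAS.gemv('N', a.array, x)   # computes a.array * x
```

The per-call cost of building the result's name dictionaries is what hurts with small matrices in an inner loop; in this toy version the dictionaries are simply reused.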
Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?
In 0.4, the views are a revamp of SubArray. So if NamedArrays already interacts well with SubArrays, you're basically set.

FYI, the implementation in 0.4 is largely backwards-compatible, but there are some important differences. If you need to dig into the internal implementation of SubArrays, you can find documentation on the changes for 0.4 here:
http://docs.julialang.org/en/latest/devdocs/subarrays/

--Tim

On Friday, December 05, 2014 04:18:08 AM David van Leeuwen wrote:
> Hi,
>
> On Friday, December 5, 2014 8:47:22 AM UTC+1, Ján Dolinský wrote:
> > Hi,
> > I am exploring the DataFrames and NamedArrays packages and I would like to ask whether they are suitable for heavier computations and whether I can use them directly in BLAS calls (e.g. gemv() etc.). In addition, is it possible to create views of e.g. DataFrames or NamedArrays?
> > Thanks,
> > Jan
>
> I can only speak for NamedArrays. On the one hand, the deployment of BLAS should be transparent, and using a NamedArray instead of an Array should not lead to much degradation in performance. E.g., a * b with `a` and `b` both NamedArrays effectively calls a.array * b.array, which Base implements with BLAS.gemm(). There is just a little overhead from filling in sensible names in the result---so if you have small matrices in an inner loop, you're going to get hurt.
>
> On the other hand, I am not sure how much of the Julia BLAS cleverness is retained in NamedArrays---but the intention of the package is that it is completely transparent, and if you notice bad performance in a particular situation then you should file an issue (or make a PR :-).
>
> Individual element indexing of a NamedArray with integers is just a little bit slower than that of an Array. Indexing by name is quite a bit slower---you may try a different Associative than the standard Dict.
>
> Incidentally, I've been toying with the idea of having the NamedArrays `*` check the consistency of index and dimension names, but my guess is that people would find such a thing annoying.
>
> ArrayViews are currently not aware of NamedArrays. I believe the views are going to be part of julia-0.4, so then it would be a task for NamedArrays to implement views of NamedArrays, I gather.
>
> Cheers,
> ---david
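For readers on a current Julia release, views are created with `view` or the `@view` macro and are returned as SubArray objects. The following is a brief sketch assuming Julia 1.x; the `sub` function mentioned in this thread belongs to the 0.3/0.4 era and is not used here.

```julia
A = reshape(collect(1.0:12.0), 3, 4)   # 3x4 matrix, filled column by column

v = view(A, 2:3, [1, 4])   # rows 2-3 of columns 1 and 4, without copying
w = @view A[:, 2]          # macro form of the same idea: the second column

v[1, 2] = -1.0             # writes through to A[2, 4]

total = sum(w)             # views work with ordinary array functions
```

A package that wraps arrays, as NamedArrays does, would additionally have to decide how names carry over to such a view; that part is not shown here.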
Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?
So what is the relation between ArrayViews and the 0.4 `SubArray revamp'? Are they targeting different use cases, or is one of them going to be phased out?

On Friday, December 5, 2014 1:28:16 PM UTC+1, Tim Holy wrote:
> In 0.4, the views are a revamp of SubArray. So if NamedArrays already interacts well with SubArrays, you're basically set.

Most likely not, as SubArrays are new to me (I've tried sub() in production code in the past, but it was always better to write out a loop).

> FYI, the implementation in 0.4 is largely backwards-compatible, but there are some important differences. If you need to dig into the internal implementation of SubArrays, you can find documentation on the changes for 0.4 here:
> http://docs.julialang.org/en/latest/devdocs/subarrays/

OK, I'll have to study this in order to understand the implications for NamedArrays.

---david
Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?
The new SubArrays steal mercilessly from the good ideas of both the old SubArray and ArrayViews, and then add some new tricks of their own. In theory, they should be a strict improvement on both of their predecessors.

--Tim

On Friday, December 05, 2014 05:13:28 AM David van Leeuwen wrote:
> So what is the relation between ArrayViews and the 0.4 `SubArray revamp'? Are they targeting different use cases, or is one of them going to be phased out?
Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?
For DataFrames, it depends on what you want to do. It is difficult to get good performance with DataArrays as columns using the current implementation. With the ongoing work by John Myles White on the use of a Nullable type, that should get much better. Also, you can use standard Arrays as columns of a DataFrame. It's not well documented, but it can be done.

Also, if you want to treat a DataFrame like a matrix, then generally the answer is no. With some trickery, you can store a view to a matrix in a DataFrame. Basically, you have to create column views into the matrix. Here is an example. It might be useful if you want to treat all or part of a DataFrame as a matrix.

    julia> using DataFrames

    julia> m = rand(5,5)
    5x5 Array{Float64,2}:
     0.186736  0.247699   0.0968634  0.471383   0.145244
     0.985306  0.966015   0.663865   0.0468244  0.0471465
     0.981947  0.707241   0.0841202  0.0539529  0.692217
     0.918222  0.0415162  0.646298   0.581983   0.653881
     0.515692  0.0344289  0.0821672  0.877258   0.653756

    julia> d = DataFrame(Any[sub(m, :, i) for i in 1:size(m, 2)], [:a, :b, :c, :d, :e])
    5x5 DataFrames.DataFrame
    | Row | a        | b         | c         | d         | e         |
    |-----|----------|-----------|-----------|-----------|-----------|
    | 1   | 0.186736 | 0.247699  | 0.0968634 | 0.471383  | 0.145244  |
    | 2   | 0.985306 | 0.966015  | 0.663865  | 0.0468244 | 0.0471465 |
    | 3   | 0.981947 | 0.707241  | 0.0841202 | 0.0539529 | 0.692217  |
    | 4   | 0.918222 | 0.0415162 | 0.646298  | 0.581983  | 0.653881  |
    | 5   | 0.515692 | 0.0344289 | 0.0821672 | 0.877258  | 0.653756  |

    julia> d[:b]
    5-element SubArray{Float64,1,Array{Float64,2},(Colon,Int64),2}:
     0.247699
     0.966015
     0.707241
     0.0415162
     0.0344289

On Fri, Dec 5, 2014 at 8:37 AM, Tim Holy tim.h...@gmail.com wrote:
> The new SubArrays steal mercilessly from the good ideas of both the old SubArray and ArrayViews, and then add some new tricks of their own. In theory, they should be a strict improvement on both of their predecessors.
>
> --Tim