[julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?

2014-12-10 Thread Ján Dolinský




Hi David,

Thanks for the example and the link to the source file. I found it very 
educational and useful. My original motivation was to check what is (or 
could be) a best practice in Julia for creating e.g. all interactions of, 
say, 3 variables and having sensible names filled in automatically. 
NamedArrays seems to be a good start.
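
A rough sketch of what pairwise interactions with generated names could look like, using only Base Julia in current (1.x) syntax rather than NamedArrays itself. The names X, colnames, idx, inter and inter_names are made up for illustration:

```julia
# Sketch in Julia 1.x syntax; all variable names here are hypothetical.
colnames = [:x1, :x2, :x3]
X = [1.0 2.0 3.0;
     4.0 5.0 6.0]

# All unordered pairs of distinct columns: (1,2), (1,3), (2,3)
idx = [(i, j) for i in 1:3 for j in (i + 1):3]

# One element-wise product column per pair, collected into a matrix
inter = hcat([X[:, i] .* X[:, j] for (i, j) in idx]...)

# Matching names built from the original symbols, e.g. Symbol("x1*x2")
inter_names = [Symbol(colnames[i], "*", colnames[j]) for (i, j) in idx]
```

The three-way interaction is then just the row product of X, which is the prod() case discussed in this thread.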

In addition, is there any documentation for the Symbol and Dictionary types? I 
found very little about them in the Julia manual.

Best regards,
Jan  


[julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?

2014-12-09 Thread David van Leeuwen
Hello Ján,

On Monday, December 8, 2014 11:04:08 AM UTC+1, Ján Dolinský wrote:


 Hi,

 Thanks for the explanation. Suppose I have a named array X with 3 columns 
 x1, x2 and x3 and I do prod(X, 2). Will the resulting array (a single 
 column in this case) have a sensible name like x1x2x3? Or more 
 generally, how are these new names generated and for which operations?

For some operations NamedArrays tries to generate sensible names, including 
prod():

julia> n = NamedArray(rand(2,3), ([:a, :b], [:x1, :x2, :x3]), ("rows", "cols"))
2x3 NamedArray{Float64,2,Array{Float64,2},(Dict{Symbol,Int64},Dict{Symbol,Int64})}
rows \ cols        x1        x2        x3
a            0.712609  0.607843   0.18794
b            0.208052    0.4409  0.282238

julia> prod(n, 2)
2x1 NamedArray{Float64,2,Array{Float64,2},(Dict{Symbol,Int64},Dict{ASCIIString,Int64})}
rows \ cols  prod(cols)
a             0.0814106
b             0.0258897


So the column in this case is called prod(cols).  Note that we started 
with symbols as names for the column indices in this example, but prod() 
converts them to ASCIIStrings.  

I've worked hard to make the index type a free choice, but some 
automatically generated names default to Strings.  
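
To illustrate why a generated name ends up as a string, here is a minimal sketch in Base Julia (1.x syntax, where prod(X, 2) is written prod(X, dims=2)). The helper reduced_name is hypothetical and is not taken from the NamedArrays source:

```julia
# Hypothetical helper (not the NamedArrays source): build a label for a
# dimension that has been reduced by the function `f`.
reduced_name(f, dimname) = string(f, "(", dimname, ")")

X = [0.5 2.0 3.0;
     1.0 4.0 0.25]

p = prod(X, dims=2)                  # row products as a 2x1 matrix
label = reduced_name(prod, :cols)    # a String, even though :cols is a Symbol
```

Building the label with string() is the simplest choice, and it is also what turns a Symbol dimension name into a String in this sketch.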

You can find the list of operations for which NamedArrays tries to give 
sensible names in the source code: 
https://github.com/davidavdav/NamedArrays/blob/master/src/changingnames.jl 
I am open to suggestions and pull requests for more functions or other 
names.

As a result of your original question I am now implementing the matrix 
operation support more thoroughly, basically following the Julia manual for 
possible matrix operations.  Some take a bit of effort to implement; for 
others I need to think about sensible names for the dimensions.  Expect an 
update within the next week or so. 

Cheers, 

---david




Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?

2014-12-08 Thread Ján Dolinský




Thanks a lot for an excellent explanation and the example. I'll try it out. 
Looks promising.

Jan 


[julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?

2014-12-08 Thread Ján Dolinský


 I can only speak for NamedArrays.  On the one hand, the use of BLAS 
 should be transparent, and using a NamedArray instead of an Array should 
 not lead to much degradation in performance.  E.g., a * b with `a` and `b` 
 both NamedArrays effectively calls a.array * b.array, which Base implements 
 with BLAS.gemm().  There is just a little overhead from filling in sensible 
 names in the result, so if you multiply small matrices in an inner loop, 
 you're going to get hurt. 
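
The delegation described above can be sketched with a toy wrapper type. This is only an illustration in Julia 1.x syntax; the type Labeled and its fields are hypothetical and much simpler than the real NamedArray type:

```julia
# Hypothetical wrapper type in Julia 1.x syntax; not the NamedArrays source.
struct Labeled{T}
    array::Matrix{T}
    rownames::Vector{Symbol}
    colnames::Vector{Symbol}
end

# Forward `*` to the wrapped matrices, so Base does the numerical work;
# the wrapper only carries the names along to the result.
Base.:*(a::Labeled, b::Labeled) =
    Labeled(a.array * b.array, a.rownames, b.colnames)

a = Labeled([1.0 2.0; 3.0 4.0], [:r1, :r2], [:x1, :x2])
b = Labeled([1.0 0.0; 0.0 1.0], [:x1, :x2], [:c1, :c2])
c = a * b
```

The only extra work per multiplication is constructing the new wrapper, which is negligible for large matrices but can dominate for very small ones.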

 On the other hand, I am not sure how much of the Julia BLAS cleverness is 
 retained in NamedArrays---but the intention of the package is that it is 
 completely transparent, and if you notice bad performance for a particular 
 situation then you should file an issue (or make a PR:-).  Individual 
 element indexing of a NamedArray with integers is just a little bit slower 
 than that of an Array.  Indexing by name is quite a bit slower---you may 
 try a different Associative than the standard Dict. 
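
A small sketch of why indexing by name costs more than indexing by integer: the name first has to be looked up in a dictionary. This uses a plain Dict and a plain matrix in Base Julia; the names colindex and m are hypothetical and this is not how NamedArrays is written internally:

```julia
# Hypothetical illustration; the Dict plays the role of the name-to-index map.
colindex = Dict(:x1 => 1, :x2 => 2, :x3 => 3)
m = [10.0 20.0 30.0;
     40.0 50.0 60.0]

by_position = m[2, 3]             # direct integer indexing
by_name = m[2, colindex[:x3]]     # hash lookup first, then integer indexing
```

Both expressions read the same element; the second one pays for hashing the key on every access.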

 Incidentally, I've been toying with the idea of having `*` for NamedArrays 
 check the consistency of index and dimension names, but my guess is that 
 people would find such a thing annoying.  

 ArrayViews are currently not aware of NamedArrays.  I believe the views are 
 going to be part of julia-0.4, so then it would be a task for NamedArrays 
 to implement views of NamedArrays, I gather. 

 Cheers, 

 ---david


Hi,

Thanks for the explanation. Suppose I have a named array X with 3 columns 
x1, x2 and x3 and I do prod(X, 2). Will the resulting array (a single 
column in this case) have a sensible name like x1x2x3? Or more 
generally, how are these new names generated and for which operations?

Thanks,
Jan 


Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?

2014-12-05 Thread Tim Holy
In 0.4, the views are a revamp of SubArray. So if NamedArrays already 
interacts well with SubArrays, you're basically set.

FYI the implementation in 0.4 is largely backwards-compatible, but there are 
some important differences. If you need to dig into the internal implementation 
of SubArrays, you can find documentation on the changes for 0.4 here:
http://docs.julialang.org/en/latest/devdocs/subarrays/

--Tim

On Friday, December 05, 2014 04:18:08 AM David van Leeuwen wrote:
 Hi,

 On Friday, December 5, 2014 8:47:22 AM UTC+1, Ján Dolinský wrote:
  Hi,

  I am exploring the DataFrames and NamedArrays packages and I would like to
  ask whether they are suitable for heavier computations and whether I can
  use them directly in BLAS calls (e.g. gemv() etc.). In addition, is it
  possible to create views of e.g. DataFrames or NamedArrays?
  



Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?

2014-12-05 Thread David van Leeuwen
So what is the relation between ArrayViews and 0.4 `SubArray revamp'?   Are 
they targeting different use cases or is one of them going to be phased 
out? 

On Friday, December 5, 2014 1:28:16 PM UTC+1, Tim Holy wrote:

 In 0.4, the views are a revamp of SubArray. So if NamedArrays already 
 interacts well with SubArrays, you're basically set. 

Most likely not, as SubArrays are new to me (I've tried sub() in 
production code in the past, but it was always better to write out a 
loop). 

 FYI the implementation in 0.4 is largely backwards-compatible, but there 
 are some important differences. If you need to dig into the internal 
 implementation of SubArrays, you can find documentation on the changes for 
 0.4 here: 
 http://docs.julialang.org/en/latest/devdocs/subarrays/ 

OK, I'll have to study this in order to understand the implications for 
NamedArrays.  

---david
 




Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?

2014-12-05 Thread Tim Holy
The new SubArrays steal mercilessly from the good ideas of both the old 
SubArray and ArrayViews, and then add some new tricks of their own. In theory, 
they should be a strict improvement on both of their predecessors.

--Tim




Re: [julia-users] Re: DataFrames and NamedArrays: are they suitable for heavier computations ?

2014-12-05 Thread Tom Short
For DataFrames, it depends on what you want to do. It is difficult to get
good performance with DataArrays as columns using the current implementation.
With the ongoing work by John Myles White on the use of a Nullable type,
that should be much better. Also, you can use standard Arrays as columns of
a DataFrame. It's not documented well, but it can be done.

Also, if you want to treat a DataFrame like a matrix, then generally the
answer is no. With some trickery, you can store a view to a matrix in a
DataFrame. Basically, you have to create column views into the matrix. Here
is an example. It might be useful if you want to treat all or part of a
DataFrame as a matrix.

julia> using DataFrames

julia> m = rand(5,5)
5x5 Array{Float64,2}:
 0.186736  0.247699   0.0968634  0.471383   0.145244
 0.985306  0.966015   0.663865   0.0468244  0.0471465
 0.981947  0.707241   0.0841202  0.0539529  0.692217
 0.918222  0.0415162  0.646298   0.581983   0.653881
 0.515692  0.0344289  0.0821672  0.877258   0.653756

julia> d = DataFrame(Any[sub(m, :, i) for i in 1:size(m, 2)], [:a, :b, :c, :d, :e])
5x5 DataFrames.DataFrame
| Row | a        | b         | c         | d         | e         |
|-----|----------|-----------|-----------|-----------|-----------|
| 1   | 0.186736 | 0.247699  | 0.0968634 | 0.471383  | 0.145244  |
| 2   | 0.985306 | 0.966015  | 0.663865  | 0.0468244 | 0.0471465 |
| 3   | 0.981947 | 0.707241  | 0.0841202 | 0.0539529 | 0.692217  |
| 4   | 0.918222 | 0.0415162 | 0.646298  | 0.581983  | 0.653881  |
| 5   | 0.515692 | 0.0344289 | 0.0821672 | 0.877258  | 0.653756  |

julia> d[:b]
5-element SubArray{Float64,1,Array{Float64,2},(Colon,Int64),2}:
 0.247699
 0.966015
 0.707241
 0.0415162
 0.0344289
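
The key property of the column views is that they share memory with the matrix. A minimal sketch of that property in Base Julia only, assuming a newer Julia (1.x) where `view` is used in place of `sub`; the DataFrame step is left out and the names m and cols are just for illustration:

```julia
# Julia 1.x sketch: `view` plays the role that `sub` has in the example above.
m = [1.0 2.0;
     3.0 4.0]

# One view per column; no data is copied.
cols = [view(m, :, i) for i in 1:size(m, 2)]

# Writing through a view changes the parent matrix.
cols[2][1] = 20.0
```

After the assignment, m[1, 2] holds the new value, which is what makes it possible to treat the columns and the matrix as two faces of the same data.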


