I was using the DataFrames convert method that allows replacement of NA 
with an arbitrary value.  I thought I had it working, but maybe I forgot to 
save and was running an old version.

Anyway, it appears I am using the method from the DataFrames documentation, 
but it results in a method error:

    using DataFrames

    city = readtable(fname)
    points = convert(Array, city[:,2:end], NaN)  # converts NA values to NaN (not a number)

Results in:

    ERROR: MethodError: `convert` has no method matching convert(::Type{Array{T,N}}, ::DataFrames.DataFrame, ::Float64)
    This may have arisen from a call to the constructor Array{T,N}(...),
    since type constructors fall back to convert methods.
    Closest candidates are:
      convert{T,N}(::Type{Array{T,N}}, ::DataArrays.DataArray{T,N}, ::Any)
      convert{T,R,N}(::Type{Array{T,N}}, ::DataArrays.PooledDataArray{T,R,N}, ::Any)
      convert(::Type{Array{T,N}}, ::DataFrames.AbstractDataFrame)
      ...
     in hcluster at /Users/lewislevinmbr/Dropbox/Online Coursework/MIT Intro 6002x/Assignments/Probset_6/df_hcluster.jl:85

DataFrames documentation shows:

    dv = @data([NA, 3, 2, 5, 4])
    mean(convert(Array, dv, 11))

Seems like I am doing the same thing, just using the float value NaN.  The 
columns of city that are being sliced are indeed Float64.  

This certainly works, but will fail if any value is NA (not a problem with 
sample dataset, but I would like to generalize...):

    points = Array{Float64}(city[:, 2:end])  # fails if any value is NA

Kept breaking this down and solved it.  The convert with replacement of NA 
values only works on a DataArray, not on a DataFrame.  So, first convert 
to a DataArray, then do the conversion with replacement of NA, thus:

    city = readtable(fname)
    points = convert(Array{Float64,2}, DataArray(city[:, 2:end]), NaN)  # converts NA to NaN

That's what I wanted, and got.  Works like a champ.  I think replacing NA 
with NaN is pretty useful.  NaNs will propagate through calculations much 
like NAs do in a DataArray, but Array{Float64} is noticeably faster.
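
To illustrate the propagation point, here is a minimal sketch in plain Julia with no DataFrames involved. The 2x2 matrix is made-up data standing in for `points` after the conversion; it is not from my dataset.

```julia
# Hypothetical stand-in for `points` once NA has been replaced by NaN.
points = [1.0 NaN; 3.0 4.0]

# A single NaN carries through the whole reduction, much as NA would.
total = sum(points)
println(isnan(total))

# Filtering out the NaN entries first gives a usable number again.
valid = filter(x -> !isnan(x), vec(points))
println(sum(valid))
```

So the NaN entries still have to be dealt with before any reduction; the gain is only that the container is a plain Array{Float64}.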

You could ask, "why are you doing this?  ...like why even use a DataFrame 
at all with its ability to handle NAs if you are just going to convert back 
out of it....?"
Well, good question.  The simple answer: the DataFrame gives me easy data 
reading, handling of row/col names, simple summary stats, etc.  And then, 
since the core data array has no NAs and is float, I get the improved 
performance of handling the data subset as Array{Float64,2}.

And, yes, I did the experiment with readcsv and this also works, but 
provides no handling of NA.

So, I think the most general is loading the data as a DataFrame, deciding 
what to do with NA, and then converting.
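
For anyone reading this on a current Julia release: the code above targets the older DataFrames/DataArrays API. Newer DataFrames.jl versions use `missing` instead of NA and no longer provide `readtable`. The following is only a sketch of the same load, decide, convert idea under that assumption; the small table is invented for illustration and is built in memory rather than read from a file.

```julia
using DataFrames  # assumes a recent DataFrames.jl that uses `missing`, not NA

# Hypothetical data: a label column followed by two numeric columns.
city = DataFrame(name = ["Austin", "Boston"],
                 temp = [1.0, missing],
                 rain = [3.0, 4.0])

# Take the numeric columns as a matrix, then replace every `missing` with NaN.
points = coalesce.(Matrix(city[:, 2:end]), NaN)
```

Here `coalesce.` plays the role of the third argument to convert in the older API.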

There is enough here to handle lots of different approaches.  Just 
experimenting. 
