Ok. I’m coming around to this. How would you do I/O? If we make DataFrames expose a nullable property, we could plausibly produce vectors instead of data vectors when parsing CSV files.
-- John

On Jan 23, 2014, at 7:38 PM, Sean Garborg <sean.garb...@gmail.com> wrote:

> I'd think of #3 as a feature, too.
>
> Just to throw another use case in the ring, if DataFrames with a mix of
> Vectors and DataVectors (with NAs) were performant, my co-workers and I would
> usually pull in data marking all columns as Vectors, these columns would
> remain Vectors, and derived columns would be mostly DataVectors.
>
>
> On Thursday, January 23, 2014 8:48:42 PM UTC-6, tshort wrote:
> I think of item #3 as a feature, not a bug. I don't like the idea of
> auto-conversion. If I choose Vectors, I should not expect them to
> support missing values. R sometimes irritates me by adding NAs when I
> don't expect it. I'd rather have the error than have NAs sneak in
> there. Also, there may be other types of AbstractDataFrames where we
> don't have the ability to assign missing values. HDF5 tables are one
> example I can think of. We wouldn't want to try to autoconvert a huge
> HDF5 column to a DataVector.
>
>
>
> On Thu, Jan 23, 2014 at 8:58 PM, John Myles White
> <johnmyl...@gmail.com> wrote:
> > A couple of points that expand on Tom’s comments:
> >
> > (1) We need to add Tom’s definition of countna(a::Array) = 0 in order to
> > show() wide DataFrames that contain any columns that are Vectors. I never
> > use DataFrames like that, so I forgot that others might. It’s also
> > impossible to produce such a DataFrame using our current I/O routines.
> >
> > (2) The constructor you’re using does exist, Jacob, but you should
> > typically pass in a Vector{Any}, each element of which is either a
> > DataVector or PooledDataVector. See Point (3) for why, at the moment, using
> > a Vector as a column is subtly broken.
> >
> > (3) If people are going to put Vectors in DataFrames for performance
> > reasons, all of our setindex!() functions for DataFrames need to add
> > methods that automatically convert Vectors to DataVectors if an NA is
> > inserted in a Vector.
> > Right now that kind of insertion is just going to error out. This check
> > isn’t too hard, but it’s totally missing from our current codebase.
> >
> > Personally, I would prefer that we not allow any of the columns of a
> > DataFrame to be Vectors. It’s a weird edge case that doesn’t actually
> > offer reliable high performance, because the potential performance
> > improvements rely on the unsafe assumption that a DataFrame won’t contain
> > any columns with NAs in it.
> >
> > -- John
> >
> > On Jan 23, 2014, at 1:33 PM, Tom Short <tshort...@gmail.com> wrote:
> >
> >> That works, but columns will be Arrays instead of DataArrays. That's
> >> the way it's always worked. If you want them to be DataArrays, then
> >> convert to DataArrays right at the end.
> >>
> >> To fix show to support columns that are arrays, we probably need (at
> >> least) to define the following:
> >>
> >> countna(da::Array) = 0
> >>
> >>
> >>
> >> On Thu, Jan 23, 2014 at 4:07 PM, Jacob Quinn <quinn....@gmail.com> wrote:
> >>> Great investigative work. Is
> >>> DataFrames( array_of_arrays, Index(column_names_array) )
> >>> not the right way to hand construct DataFrames any more? I think I can
> >>> allocate DataArrays instead, but at every step of the way, I was trying
> >>> to hand-optimize the result fetching process, which resulted in not
> >>> creating a DataArray or DataFrame until right before we return to the
> >>> user.
> >>>
> >>> -Jacob
> >>>
> >>>
> >>> On Thu, Jan 23, 2014 at 3:27 PM, bp2012 <bert.pr...@gmail.com> wrote:
> >>>>
> >>>> To check Jacob's suggestion about a version mismatch I completely removed
> >>>> the DataFrames and ODBC packages using Pkg.rm and physically deleted the
> >>>> directories from disk. I then added them via Pkg.add and Pkg.update.
> >>>>
> >>>> I am running the julia nightlies build.
> >>>> julia> versioninfo()
> >>>> Julia Version 0.3.0-prerelease+1127
> >>>> Commit bc73674* (2014-01-22 20:09 UTC)
> >>>>
> >>>> Pkg.status()
> >>>> - DataFrames 0.5.1
> >>>> - ODBC 0.3.5
> >>>>
> >>>> Pkg.checkout("ODBC")
> >>>> INFO: Checking out ODBC master...
> >>>> INFO: Pulling ODBC latest master...
> >>>> INFO: No packages to install, update or remove
> >>>>
> >>>> julia> Pkg.checkout("DataFrames")
> >>>> INFO: Checking out DataFrames master...
> >>>> INFO: Pulling DataFrames latest master...
> >>>> INFO: No packages to install, update or remove
> >>>>
> >>>> I did some digging. It looks like there is a mismatch in that countna
> >>>> expects DataFrame columns to be DataArrays. However the ODBC package
> >>>> returns DataFrames that have array columns (using the first constructor
> >>>> in dataframe.jl). You guys would know better as to whether a change is
> >>>> needed in the constructor or if countna should also accept Array columns.
> >>>>
> >>>>
> >>>> I made some local changes to work around the issue.
> >>>>
> >>>> show.jl:
> >>>> line 42: if isna(col, i) changed to if isna(col[i])
> >>>> line 322: missing[j] = countna(adf[j]) changed to
> >>>>           missing[j] = countna(isa(adf[j], DataArray) ? adf[j] : DataArray(adf[j]))
> >>>>
> >>>> These work great for me.
> >>>>
> >>>
> >
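
For anyone reading this thread later, here is a rough sketch of the two ideas discussed above: a countna fallback for plain array columns, and the promote-on-insert behaviour from point (3). It is not the DataFrames code. It assumes a current Julia, where `missing` and `Union{Missing, T}` columns stand in for NA and DataVector, and `set_or_promote` is a hypothetical helper name made up for illustration.

```julia
# Sketch only: uses Base Julia's `missing` as a stand-in for the 2014-era NA.

# countna for any vector column. If the element type cannot hold `missing`
# (e.g. Vector{Int}), the answer is 0 without scanning the data, which is
# the spirit of Tom's `countna(a::Array) = 0`.
countna(col::AbstractVector) =
    Missing <: eltype(col) ? count(ismissing, col) : 0

# Hypothetical helper illustrating point (3): store `value` at index `i`,
# widening the column first if it cannot hold `missing`.
# Note: when widening happens a new vector is returned and the original is
# left untouched; otherwise the input is mutated in place and returned.
function set_or_promote(col::AbstractVector, i::Integer, value)
    if value === missing && !(Missing <: eltype(col))
        col = Vector{Union{Missing, eltype(col)}}(col)  # copies the data
    end
    col[i] = value
    return col
end

plain = [1, 2, 3]                          # plain Vector{Int}
widened = set_or_promote(plain, 2, missing)
println(countna(plain))                    # plain column never holds missing
println(countna(widened))                  # widened column now holds one
```

The copy made when widening is the cost Tom is worried about for large columns such as HDF5-backed tables, so a strict variant would skip the promotion branch and let the assignment throw instead.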