Ok. I’m coming around to this.

How would you do I/O? If we make DataFrames expose a nullable property, we
could plausibly produce Vectors instead of DataVectors when parsing CSV files.
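
(Purely a sketch of the idea, not an existing readtable option; it assumes the
DataArrays-era API, where DataArray(v) wraps a plain Vector and allows NAs:)

    using DataFrames   # provides DataArray in this era

    # Hypothetical helper: once the parser has produced plain Vectors, decide
    # column storage from a nullable flag instead of always wrapping.
    function finalize_columns(parsed_cols::Vector{Any}, nullable::Bool)
        Any[nullable ? DataArray(col) : col for col in parsed_cols]
    end

    finalize_columns(Any[[1, 2, 3], ["a", "b", "c"]], false)  # plain Vectors survive
    finalize_columns(Any[[1, 2, 3], ["a", "b", "c"]], true)   # every column becomes a DataVector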

 — John

On Jan 23, 2014, at 7:38 PM, Sean Garborg <sean.garb...@gmail.com> wrote:

> I'd think of #3 as a feature, too.
> 
> Just to throw another use case in the ring: if DataFrames with a mix of
> Vectors and DataVectors (with NAs) were performant, my co-workers and I would
> usually pull in data with all columns marked as Vectors; those columns would
> remain Vectors, and derived columns would mostly be DataVectors.
> 
> 
> On Thursday, January 23, 2014 8:48:42 PM UTC-6, tshort wrote:
> I think of item #3 as a feature, not a bug. I don't like the idea of
> auto-conversion. If I choose Vectors, I should not expect them to support
> missing values. R sometimes irritates me by adding NAs when I don't expect
> it. I'd rather have the error than have NAs sneak in there. Also, there may
> be other types of AbstractDataFrames where we don't have the ability to
> assign missing values; HDF5 tables are one example I can think of. We
> wouldn't want to try to autoconvert a huge HDF5 column to a DataVector.
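> 
> (A quick, hedged illustration of the distinction, assuming the DataArrays-era
> API, where DataArray and NA come in via DataFrames:)
> 
>     using DataFrames
> 
>     dv = DataArray([1, 2, 3])   # a DataVector{Int}
>     dv[2] = NA                  # fine: the NA mask absorbs it
> 
>     v = [1, 2, 3]               # a plain Vector{Int}
>     v[2] = NA                   # errors: an Int slot cannot hold NA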
> 
> 
> 
> On Thu, Jan 23, 2014 at 8:58 PM, John Myles White 
> <johnmyl...@gmail.com> wrote: 
> > A couple of points that expand on Tom’s comments: 
> > 
> > (1) We need to add Tom's definition of countna(a::Array) = 0 so that show()
> > works on wide DataFrames that contain any columns that are Vectors. I never
> > use DataFrames like that, so I forgot that others might. It's also impossible
> > to produce such a DataFrame using our current I/O routines.
> > 
> > (2) The constructor you’re using does exist, Jacob, but you should 
> > typically pass in a Vector{Any}, each element of which is either a 
> > DataVector or PooledDataVector. See Point (3) for why, at the moment, using 
> > a Vector as a column is subtly broken. 
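> > 
> > (A minimal sketch of that construction path, with column names as strings as
> > in this era; Index is qualified here in case it isn't exported:)
> > 
> >     using DataFrames
> > 
> >     cols = Any[DataArray([1, 2, 3]), PooledDataArray(["a", "b", "a"])]
> >     df = DataFrame(cols, DataFrames.Index(["x", "y"]))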
> > 
> > (3) If people are going to put Vectors in DataFrames for performance
> > reasons, all of our setindex!() functions for DataFrames need new methods
> > that automatically convert a Vector to a DataVector when an NA is inserted
> > into it. Right now that kind of insertion is just going to error out. This
> > check isn't too hard, but it's totally missing from our current codebase.
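> > 
> > (Roughly what one such method might look like; only a sketch, assuming the
> > current DataFrame fields (columns::Vector{Any}), and the real change would
> > have to cover every setindex!() entry point:)
> > 
> >     import Base.setindex!
> > 
> >     # If an NA lands in a plain Vector column, promote the column first.
> >     function setindex!(df::DataFrame, ::NAtype, row::Int, col::Int)
> >         if !isa(df.columns[col], AbstractDataArray)
> >             df.columns[col] = DataArray(df.columns[col])  # Vector -> DataVector
> >         end
> >         df.columns[col][row] = NA
> >         return df
> >     end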
> > 
> > Personally, I would prefer that we not allow any of the columns of a
> > DataFrame to be Vectors. It's a weird edge case that doesn't actually offer
> > reliable high performance, because the potential performance improvement
> > relies on the unsafe assumption that a DataFrame won't contain any columns
> > with NAs in them.
> > 
> >  — John 
> > 
> > On Jan 23, 2014, at 1:33 PM, Tom Short <tshort...@gmail.com> wrote: 
> > 
> >> That works, but columns will be Arrays instead of DataArrays. That's 
> >> the way it's always worked. If you want them to be DataArrays, then 
> >> convert to DataArrays right at the end. 
> >> 
> >> To fix show to support columns that are arrays, we probably need (at 
> >> least) to define the following: 
> >> 
> >> countna(da::Array) = 0 
> >> 
> >> 
> >> 
> >> On Thu, Jan 23, 2014 at 4:07 PM, Jacob Quinn <quinn....@gmail.com> wrote: 
> >>> Great investigative work. Is
> >>> DataFrame( array_of_arrays, Index(column_names_array) )
> >>> not the right way to hand-construct DataFrames any more? I think I can
> >>> allocate DataArrays instead, but at every step of the way I was trying to
> >>> hand-optimize the result-fetching process, which resulted in not creating
> >>> a DataArray or DataFrame until right before we return to the user.
> >>> 
> >>> -Jacob 
> >>> 
> >>> 
> >>> On Thu, Jan 23, 2014 at 3:27 PM, bp2012 <bert.pr...@gmail.com> wrote: 
> >>>> 
> >>>> To check Jacob's suggestion about a version mismatch, I completely removed
> >>>> the DataFrames and ODBC packages using Pkg.rm and physically deleted the
> >>>> directories from disk. I then added them back via Pkg.add and Pkg.update.
> >>>> 
> >>>> I am running the Julia nightlies build.
> >>>> julia> versioninfo() 
> >>>> Julia Version 0.3.0-prerelease+1127 
> >>>> Commit bc73674* (2014-01-22 20:09 UTC) 
> >>>> 
> >>>> julia> Pkg.status()
> >>>> - DataFrames    0.5.1
> >>>> - ODBC          0.3.5
> >>>> 
> >>>> julia> Pkg.checkout("ODBC")
> >>>> INFO: Checking out ODBC master... 
> >>>> INFO: Pulling ODBC latest master... 
> >>>> INFO: No packages to install, update or remove 
> >>>> 
> >>>> julia> Pkg.checkout("DataFrames") 
> >>>> INFO: Checking out DataFrames master... 
> >>>> INFO: Pulling DataFrames latest master... 
> >>>> INFO: No packages to install, update or remove 
> >>>> 
> >>>> I did some digging. It looks like there is a mismatch: countna expects
> >>>> DataFrame columns to be DataArrays, but the ODBC package returns DataFrames
> >>>> that have Array columns (it uses the first constructor in dataframe.jl).
> >>>> You guys would know better whether a change is needed in the constructor
> >>>> or whether countna should also accept Array columns.
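> >>>> 
> >>>> A minimal reproduction of that mismatch (the construction mirrors the ODBC
> >>>> path discussed above; names are illustrative):
> >>>> 
> >>>>     using DataFrames
> >>>> 
> >>>>     df = DataFrame(Any[[1, 2, 3]], DataFrames.Index(["x"]))  # plain Vector column
> >>>>     countna(df["x"])   # MethodError until a countna(::Array) method exists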
> >>>> 
> >>>> 
> >>>> I made some local changes to work around the issue. 
> >>>> 
> >>>> show.jl:
> >>>> line 42:   if isna(col, i)   changed to   if isna(col[i])
> >>>> line 322:  missing[j] = countna(adf[j])   changed to
> >>>>            missing[j] = countna(isa(adf[j], DataArray) ? adf[j] : DataArray(adf[j]))
> >>>> 
> >>>> These work great for me. 
> >>>> 
> >>> 
> >
