From ?scan: "the *type* of what gives the type of data to be read".
So 'what' should be something like list(integer(), integer(), double(), raw(), ...).
In your code all columns are being read as character, because every element
of as.list(coltypes) is a character string; the contents of those strings
("numeric", "factor", ...) are ignored.
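
To make the difference concrete, here is a minimal sketch (the three column
types are just for illustration):

    # every element is a character string, so scan() reads every column as
    # character -- this is effectively what your call did
    what.wrong <- as.list(c("numeric", "factor", "numeric"))

    # each element is a zero-length vector of the desired type, so scan()
    # reads column 1 as double, column 2 as integer, column 3 as double
    what.right <- list(double(), integer(), double())

(Note there is no "factor" type in scan(); read such columns as character or
integer and convert with factor() afterwards.)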

I have to admit that I have added the *'s in *type*.  I have been caught out
by this too.  It's not the most convenient way to specify the types of a
large number of columns either.  As you have a lot of columns you might want
to do something like as.list(rep(integer(1), 250)), assuming your dummies
are together, to save typing (see the sketch below).  Also, storage.mode()
is useful for telling you the precise type (and therefore size) of an
object; e.g. sapply(coltypes, storage.mode) shows the types scan() will
actually use (all "character" in your case).  Note that 'numeric' could be
'double' or 'integer', which matters in your case for fitting inside the 1GB
limit, because 'integer' (4 bytes) is half the size of 'double' (8 bytes).
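
For your 266 columns, the whole 'what' could be built along these lines (a
sketch that assumes the 16 numeric columns come first and the 250 dummies
follow; adjust the order and file name to your data):

    # 16 double columns followed by 250 integer dummy columns
    what <- c(as.list(rep(double(1), 16)),
              as.list(rep(integer(1), 250)))

    # the exact storage types scan() will use for each column
    sapply(what, storage.mode)   # "double" x 16, then "integer" x 250

    # rough object size: 350000 * (16*8 + 250*4) bytes, about 376MB,
    # versus roughly 710MB if the dummies were read as double
    # x <- scan("C:/temp/data.csv", what = what, sep = ",", skip = 1)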

Perhaps someone on r-devel could enhance the documentation to make "type"
stand out in bold capitals in help(scan)?  Or maybe scan() could be clever
enough to accept a character vector 'what'.  Or maybe I'm missing a good
reason why this isn't possible - anyone?  How about allowing a character
vector of length one, with each character representing the type of the
corresponding column?  E.g. what="IIIIDDCD" would mean 4 integers, followed
by 2 doubles, followed by a character column, followed finally by a double
column: 8 columns in total.  Probably someone somewhere has done that
already, but I'm not aware of anyone having wrapped it up conveniently?
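
Such a wrapper is only a few lines; here is a rough sketch (the function name
and letter codes are made up for illustration):

    # turn a compact type string into a 'what' list for scan()
    what.from.string <- function(spec) {
      proto <- list(I = integer(), D = double(), C = character(),
                    L = logical(), R = raw())
      codes <- strsplit(spec, "")[[1]]
      stopifnot(all(codes %in% names(proto)))
      unname(proto[codes])
    }

    what.from.string("IIIIDDCD")
    # a list of 8 zero-length prototypes: 4 integer, 2 double, 1 character, 1 double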

On 25/04/06, Sachin J <[EMAIL PROTECTED]> wrote:
>
>  Mark:
>
> Here is the information I didn't provide in my earlier post. The R version
> is 2.2.1, running on Windows XP.  My dataset has 16 variables with the
> following data types.
> ColNumber:   1              2              3  .......16
> Datatypes:
>
> "numeric","numeric","numeric","numeric","numeric","numeric","character","numeric","numeric","character","character","numeric","numeric","numeric","numeric","numeric","numeric","numeric"
>
> Variable (2), which is numeric, and the variables denoted as character are
> to be treated as dummy variables in the regression.
>
> A search of the R-help list suggested I can also use read.csv() with the
> colClasses option, instead of using scan() and then converting to a data
> frame as you suggested. I am trying both methods but am unable to resolve
> the syntax errors.
>
> >coltypes<-
> c("numeric","factor","numeric","numeric","numeric","numeric","factor","numeric","numeric","factor","factor","numeric","numeric","numeric","numeric","numeric","numeric","numeric")
>
> >mydf <- read.csv("C:/temp/data.csv", header=FALSE, colClasses = coltypes,
> strip.white=TRUE)
>
> ERROR: Error in scan(file = file, what = what, sep = sep, quote = quote,
> dec = dec,  :
>         scan() expected 'a real', got 'V1'
>
> No idea what the problem is.
>
> AS PER YOUR SUGGESTION I TRIED scan() as follows:
>
>
> >coltypes<-c("numeric","factor","numeric","numeric","numeric","numeric","factor","numeric","numeric","factor","factor","numeric","numeric","numeric","numeric","numeric","numeric","numeric")
> >x<-scan(file = "C:/temp/data.dbf",what=as.list(coltypes),sep=",",quiet=TRUE,skip=1)
>
> >names(x)<-scan(file = "C:/temp/data.dbf",what="",nlines=1, sep=",")
> >x<-as.data.frame(x)
>
> This runs without error, but x has no data in it; it contains
> > x
>
>  [1] X._.   NA.    NA..1  NA..2  NA..3  NA..4  NA..5  NA..6  NA..7  NA..8
> NA..9  NA..10 NA..11
> [14] NA..12 NA..13 NA..14 NA..15 NA..16
> <0 rows> (or 0-length row.names)
>
> Please let me know how to properly use scan() or the colClasses option.
>
> Sachin
>
> Mark Stephens <[EMAIL PROTECTED]> wrote:
>
> Sachin,
> With your dummies stored as integer, the size of your object would appear
> to be 350000 * (4*250 + 8*16) bytes = 376MB.
> You said "PC" but did not provide R version information, so I am assuming
> Windows...
> With 1GB RAM you should be able to load a 376MB object into memory. If you
> can store the dummies as 'raw' then object size is only 126MB.
> You don't say how you attempted to load the data. Assuming your input data
> is in a text file (or can be), have you tried scan()? Set up the 'what'
> argument with length 266 and make sure the dummy columns are set to
> integer() or raw(). Then x = scan(...); class(x) = "data.frame".
> What is the result of memory.limit()? If it is 256MB or 512MB, then try
> starting R with --max-mem-size=800M (I forget the syntax exactly). Leave a
> bit of room below 1GB. Once the object is in memory R may need to copy it
> once, or a few times. You may need to close all other apps in memory, or
> send them to swap.
> I don't really see why your data should not fit into the memory you have.
> Purchasing an extra 1GB may help. Knowing the object size calculation (as
> above) should help you gauge whether it is worth it.
> Have you watched a process monitor to see the memory growing as R loads the
> data? This can be useful.
> If all of the above fails, then consider 64-bit and purchasing as much
> memory as you can afford. R can use over 64GB of RAM on 64-bit machines.
> Maybe you can hire some time on a 64-bit server farm - I hear it's quite
> cheap, but I have never tried it myself. You shouldn't need to go that far
> with this data set though.
> Hope this helps,
> Mark
>
>
> Hi Roger,
>
> I want to carry out regression analysis on this dataset. So I believe I
> can't read the dataset in chunks. Any other solution?
>
> TIA
> Sachin
>
>
> roger koenker < [EMAIL PROTECTED]> wrote:
> You can read chunks of it at a time and store it in sparse matrix
> form using the packages SparseM or Matrix, but then you need
> to think about what you want to do with it.... least squares sorts
> of things are ok, but other options are somewhat limited...
>
>
> url: www.econ.uiuc.edu/~roger Roger Koenker
> email [EMAIL PROTECTED] Department of Economics
> vox: 217-333-4558 University of Illinois
> fax: 217-244-6678 Champaign, IL 61820
>
>
> On Apr 24, 2006, at 12:41 PM, Sachin J wrote:
>
> > Hi,
> >
> > I have a dataset consisting of 350,000 rows and 266 columns. Out
> > of the 266 columns, 250 are dummy variable columns. I am trying to read
> > this data set into an R dataframe object but am unable to do so due to
> > memory size limitations (the object created is too large to handle
> > in R). Is there a way to handle such a large dataset in R?
> >
> > My PC has 1GB of RAM and 55GB of hard disk space, running Windows XP.
> >
> > Any pointers would be of great help.
> >
> > TIA
> > Sachin
> >
>

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
