Re: [R] Huge data sets and RAM problems
Dear all,

Thank you very much for your replies and help. I will try to work with your suggestions and come back to you if I need anything more.

Kind regards,
Stella Pachidi

On Thu, Apr 22, 2010 at 5:30 AM, kMan wrote:
> You may have set records to NULL (delete, shift up). Or perhaps your system is
> susceptible to butterflies on the other side of the world.
>
> Your code may have 'worked' on a small section of data, but that data did
> not include all of the cases needed to fully test your code. So... test
> your code!
>
> scan(), used with 'nlines', 'skip', 'sep', and 'what', will cut your read
> time by at least half while using less RAM, let you do most of your
> post-processing, and give you something better to test your code with.
> Skip 'nlines', though, and you lose the time/memory benefits over
> read.table(). 'skip' will get you "right to the point" just before where
> things failed; that would be an interesting small segment of data to test
> with.
>
> WordPad can read your file (and then some). Eventually.
>
> Sincerely,
> KeithC.
>
> -----Original Message-----
> From: Stella Pachidi [mailto:stella.pach...@gmail.com]
> Sent: Monday, April 19, 2010 2:07 PM
> To: r-h...@stat.math.ethz.ch
> Subject: [R] Huge data sets and RAM problems
>
> Dear all,
>
> This is the first time I am sending mail to the mailing list, so I hope I
> do not make a mistake...
>
> For the last few months I have been working on my MSc thesis project,
> performing data mining techniques on user logs of a software-as-a-service
> application. The main problem I am experiencing is how to process the huge
> amount of data. More specifically:
>
> I am using R 2.10.1 on a laptop with Windows 7 (32-bit), 2GB RAM and an
> Intel Core Duo 2GHz CPU.
>
> The user log data come from a Crystal Reports query (.rpt file) which I
> transform with some Java code into a tab-separated file.
>
> Although with a small subset of my data everything runs, when I increase
> the data set I get several problems:
>
> The first problem is with the use of read.delim(). When I try to read a
> large amount of data (over 2,400,000 rows with 18 attributes per row) it
> does not seem to turn the whole table into a data frame: the data frame
> returned has only 1,220,987 rows.
>
> Furthermore, as one of the attributes is a DateTime, when I try to split
> this column into two columns (one with the Date and one with the Time),
> the result is quite strange, as the two new columns appear to have more
> rows than the data frame:
>
> applicLog.dat <- read.delim("file.txt")
> # Process the syscreated column (DateTime --> Date + Time)
> copyDate <- applicLog.dat[["ï..syscreated"]]
> copyDate <- as.character(copyDate)
> splitDate <- strsplit(copyDate, " ")
> splitDate <- unlist(splitDate)
> splitDateIndex <- c(1:length(splitDate))
> sysCreatedDate <- splitDate[splitDateIndex %% 2 == 1]
> sysCreatedTime <- splitDate[splitDateIndex %% 2 == 0]
> sysCreatedDate <- strptime(sysCreatedDate, format = "%Y-%m-%d")
> op <- options(digits.secs = 3)
> sysCreatedTime <- strptime(sysCreatedTime, format = "%H:%M:%OS")
> applicLog.dat[["ï..syscreated"]] <- NULL
> applicLog.dat <- cbind(sysCreatedDate, sysCreatedTime, applicLog.dat)
>
> Then I get the error:
> Error in data.frame(..., check.names = FALSE) :
>   arguments imply differing number of rows: 1221063, 1221062, 1220987
>
> Finally, another problem I have is when I perform association mining on
> the data set using the package arules: I turn the data frame into a
> transactions table and then run the apriori algorithm.
> When I set the support too low in order to find the rules I need, the
> vector of rules becomes too big and I get memory problems such as:
>
> Error: cannot allocate vector of size 923.1 Mb
> In addition: Warning messages:
> 1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)
>
> Could you please help me with how I could allocate more RAM? Or do you
> think there is a way to process the data by loading them into a file
> instead of loading everything into RAM? Do you know how I could manage to
> read my whole data set?
>
> I would really appreciate your help.
>
> Kind regards,
> Stella Pachidi
>
> PS: Do you know any text editor that can read huge .txt files?
>
> --
> Stella Pachidi
> Master in Business Informatics student
> Utrecht University

--
Stella Pachidi
Master in Business Informatics student
Utrecht University
email: s.pach...@students.uu.nl
tel: +31644478898

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
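[Editor's sketch] KeithC's scan() suggestion can be illustrated roughly as below. This is not code from the thread: the function name read_in_chunks, the toy three-column 'what' template, and the chunk size are all invented for illustration; a real 18-column log file would need one template entry per column.

```r
# Chunked reading with scan(): process a tab-separated file piece by piece
# instead of loading it whole with read.delim(). The 'what' template tells
# scan() the type of each column, which also speeds up the read.
read_in_chunks <- function(path, template, chunk.size = 100000,
                           process = function(chunk) NULL) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0
  repeat {
    chunk <- scan(con, what = template, nlines = chunk.size,
                  sep = "\t", quiet = TRUE)
    n <- length(chunk[[1]])
    if (n == 0) break        # end of file reached
    process(chunk)           # aggregate/keep only what you need here
    total <- total + n
  }
  total                      # number of data rows actually read
}
```

Adding a 'skip =' argument to scan() would jump straight to the rows around line 1,220,987, where read.delim() apparently stopped short, giving exactly the small test segment KeithC describes.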
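[Editor's sketch] The even/odd indexing trick in the quoted date-splitting code assumes every value splits into exactly two pieces; a single blank or malformed entry shifts the pairing and yields vectors of differing lengths, which matches the "differing number of rows" error. A hedged alternative (sample values invented for illustration) that keeps both outputs the same length as the input:

```r
# Split a "YYYY-MM-DD HH:MM:OS" column at the first space, so that
# malformed rows become NA instead of shifting the alignment.
syscreated <- c("2010-04-19 22:07:03.123",
                "2010-04-20 09:15:00.000",
                "bad-entry")

date.part <- sub(" .*$", "", syscreated)
time.part <- ifelse(grepl(" ", syscreated),
                    sub("^[^ ]* ", "", syscreated),
                    NA_character_)

# Both vectors are guaranteed to be as long as the input, so a later
# cbind() cannot fail with differing row counts; bad rows show up as NA.
op <- options(digits.secs = 3)
created.date <- as.Date(date.part, format = "%Y-%m-%d")      # NA for "bad-entry"
created.time <- strptime(time.part, format = "%H:%M:%OS")
```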
Re: [R] Huge data sets and RAM problems
Stella,

A few brief words of advice:

1. Work through your code a line at a time, making sure that each step gives what you would expect. I think some of your later problems are a result of something early not being as expected. For example, if read.delim() is in fact not giving you what you expect, stop there before moving onwards. I suspect some funny character(s) or character encodings might be a problem.

2. 32-bit Windows can be limiting. With 2 GB of RAM, you're probably not going to be able to work effectively in native R with objects over 200-300 MB, and the error indicates that something (you or a package you're using) has simply run out of memory. So...

3. Consider more RAM (and preferably 64-bit R). Other solutions might be possible, such as using a database to handle the data transition into R. 2.5 million rows by 18 columns is apt to be around 360 MB. Although you can afford 1 (or a few) copies of this, it doesn't leave you much room for the memory overhead of working with such an object.

Part of the original message below.

Jay

------------------------------

Message: 80
Date: Mon, 19 Apr 2010 22:07:03 +0200
From: Stella Pachidi
To: r-h...@stat.math.ethz.ch
Subject: [R] Huge data sets and RAM problems
Message-ID:
Content-Type: text/plain; charset=ISO-8859-1

Dear all,

I am using R 2.10.1 in a laptop with Windows 7 - 32bit system, 2GB RAM and CPU Intel Core Duo 2GHz.

...

Finally, another problem I have is when I perform association mining on the data set using the package arules: I turn the data frame into transactions table and then run the apriori algorithm. When I put too low support in order to manage to find the rules I need, the vector of rules becomes too big and I get problems with the memory such as:

Error: cannot allocate vector of size 923.1 Mb
In addition: Warning messages:
1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)

Could you please help me with how I could allocate more RAM?
Or, do you think there is a way to process the data by loading them into a document instead of loading all into RAM? Do you know how I could manage to read all my data set?

I would really appreciate your help.

Kind regards,
Stella Pachidi

--
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay
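[Editor's sketch] Jay's ~360 MB figure is easy to check: a numeric (double) cell takes 8 bytes in R, so 2.5 million rows by 18 columns comes to roughly 343 MiB before any copies or overhead (character columns vary with how much repetition they contain). A quick back-of-the-envelope check:

```r
# Rough size of a 2.5M x 18 all-numeric data frame: 8 bytes per double.
rows <- 2.5e6
cols <- 18
mb <- rows * cols * 8 / 2^20   # about 343 MiB, in line with Jay's estimate

# For an object you already have, measure it directly:
x <- data.frame(a = rnorm(1000), b = rnorm(1000))
print(object.size(x), units = "Kb")
```

On a 2 GB 32-bit Windows machine, a few working copies of an object this size exhaust the address space quickly, which is consistent with the "cannot allocate vector" errors in the thread.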