Dear all,

This is the first time I am sending mail to the mailing list, so I hope I do not make a mistake...
For the last few months I have been working on my MSc thesis project, which applies data mining techniques to the user logs of a software-as-a-service application. The main problem I am experiencing is how to process the huge amount of data. More specifically: I am using R 2.10.1 on a laptop with 32-bit Windows 7, 2GB RAM and an Intel Core Duo 2GHz CPU. The user log data come from a Crystal report query (.rpt file), which I transform with some Java code into a tab-separated file.

Although everything runs with a small subset of my data, I get several problems when I increase the size of the data set.

The first problem is with the use of read.delim(). When I try to read a large amount of data (over 2.400.000 rows with 18 attributes per row), it does not seem to read the whole table into a data frame. In particular, the data frame returned has only 1.220.987 rows.

Furthermore, one of the data attributes is a DateTime. When I try to split this column into two columns (one with the Date and one with the Time), the result is quite strange, as the two new columns appear to have more rows than the data frame:

    applicLog.dat <- read.delim("file.txt")
    #Process the syscreated column (Date time --> Date + time)
    copyDate <- applicLog.dat[["ï..syscreated"]]
    copyDate <- as.character(copyDate)
    splitDate <- strsplit(copyDate, " ")
    splitDate <- unlist(splitDate)
    splitDateIndex <- c(1:length(splitDate))
    sysCreatedDate <- splitDate[splitDateIndex %% 2 == 1]
    sysCreatedTime <- splitDate[splitDateIndex %% 2 == 0]
    sysCreatedDate <- strptime(sysCreatedDate, format="%Y-%m-%d")
    op <- options(digits.secs = 3)
    sysCreatedTime <- strptime(sysCreatedTime, format ="%H:%M:%OS")
    applicLog.dat[["ï..syscreated"]] <- NULL
    applicLog.dat <- cbind (sysCreatedDate,sysCreatedTime,applicLog.dat)

Then I get the error:

    Error in data.frame(..., check.names = FALSE) :
      arguments imply differing number of rows: 1221063, 1221062, 1220987

Finally, another problem I have is when I perform association mining on the data set using
the package arules: I turn the data frame into a transactions table and then run the apriori algorithm. When I set the support low enough to find the rules I need, the vector of rules becomes too big and I run into memory problems such as:

    Error: cannot allocate vector of size 923.1 Mb
    In addition: Warning messages:
    1: In items(x) : Reached total allocation of 153Mb: see help(memory.size)

Could you please help me with how I could allocate more RAM? Or do you think there is a way to process the data by loading them into a document instead of loading everything into RAM? Do you know how I could manage to read my whole data set?

I would really appreciate your help.

Kind regards,
Stella Pachidi

PS: Do you know of any text editor that can read huge .txt files?

--
Stella Pachidi
Master in Business Informatics student
Utrecht University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.