I have two suggestions to speed up your code, if you must use a loop.

First, don't grow your output dataset at each iteration. Instead of

    cases <- 0
    output <- numeric(cases)
    while(length(line <- readLines(input, n=1))==1) {
        cases <- cases + 1
        output[cases] <- as.numeric(line)
    }

preallocate the output vector to be about the size of its eventual length (slightly bigger is better), replacing

    output <- numeric(0)

with the likes of

    output <- numeric(500000)

and when you are done with the loop trim down the length if it is too big

    if (cases < length(output)) length(output) <- cases

Growing your dataset in a loop can cause quadratic or worse growth in time with problem size and the above sort of code should make the time grow linearly with problem size.
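A minimal, self-contained sketch of the preallocate-then-trim pattern follows. The in-memory textConnection and the preallocated size of 10 are stand-ins chosen only so the example runs without a real file; with actual data you would open the file as in the original code and pick a size near the expected number of cases.

```r
# Hypothetical stand-in for the real file: three numeric lines held in memory.
input <- textConnection(c("1.5", "2.5", "4"))

# Preallocate more slots than we expect to need (10 is an arbitrary choice here).
output <- numeric(10)
cases <- 0

# Read one line at a time; readLines() returns character(0) at end of input.
while (length(line <- readLines(input, n = 1)) == 1) {
  cases <- cases + 1
  output[cases] <- as.numeric(line)  # fill an existing slot, no reallocation
}
close(input)

# Trim the unused slots at the end.
if (cases < length(output)) length(output) <- cases

print(output)
```

If the preallocated size turns out to be too small, the assignment still works because R extends the vector, but those extra iterations fall back to the slow growing behaviour, which is why guessing slightly too big is the safer choice.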
Second, don't do data.frame subscripting inside your loop. Instead of

    data <- data.frame(Id=numeric(cases))
    while(...) {
        data[cases, 1] <- newValue
    }

do

    Id <- numeric(cases)
    while(...) {
        Id[cases] <- newValue
    }
    data <- data.frame(Id = Id)

This is just the general principle that you don't want to repeat the same operation over and over in a loop. dataFrame[i,j] first extracts column j, then extracts element i from that column. Since the column is the same every iteration you may as well extract the column outside of the loop.

Avoiding the loop altogether is the fastest. E.g., the code you showed does the same thing as

    idLines <- grep(value=TRUE, "Id:", readLines(file))
    data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))

You can also use an external process (perl or grep) to filter out the lines that are not of interest.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org
> [mailto:r-help-boun...@r-project.org] On Behalf Of Freds
> Sent: Wednesday, April 13, 2011 10:58 AM
> To: r-help@r-project.org
> Subject: Re: [R] Incremental ReadLines
>
> Hi there,
>
> I am having a similar problem with reading in a large text file with
> around 550.000 observations with each 10 to 100 lines of description.
> I am trying to parse it in R but I have troubles with the size of the
> file. It seems like it is slowing down dramatically at some point. I
> would be happy for any suggestions. Here is my code, which works fine
> when I am doing a subsample of my dataset.
>
> #Defining datasource
> file <- "filename.txt"
>
> #Creating placeholder for data and assigning column names
> data <- data.frame(Id=NA)
>
> #Starting by case = 0
> case <- 0
>
> #Opening a connection to data
> input <- file(file, "rt")
>
> #Going through cases
> repeat {
>     line <- readLines(input, n=1)
>     if (length(line)==0) break
>     if (length(grep("Id:",line)) != 0) {
>         case <- case + 1 ; data[case,] <-NA
>         split_line <- strsplit(line,"Id:")
>         data[case,1] <- as.numeric(split_line[[1]][2])
>     }
> }
>
> #Closing connection
> close(input)
>
> #Saving dataframe
> write.csv(data,'data.csv')
>
>
> Kind regards,
>
>
> Frederik
>
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.