Hi Mike,

Thanks for your comment.
I must admit that I am very new to R, and although what you write sounds interesting, I have no idea where to start. Can you give some functions or examples where I can see how it can be done? I was under the impression that I had to use a loop since my blocks of observations are of varying length.

Thanks again,

Frederik

On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka <marchy...@hotmail.com> wrote:

> ----------------------------------------
> > Date: Wed, 13 Apr 2011 10:57:58 -0700
> > From: frederikl...@gmail.com
> > To: r-help@r-project.org
> > Subject: Re: [R] Incremental ReadLines
> >
> > Hi there,
> >
> > I am having a similar problem with reading in a large text file with around
> > 550.000 observations with each 10 to 100 lines of description. I am trying
> > to parse it in R but I have troubles with the size of the file. It seems
> > like it is slowing down dramatically at some point. I would be happy for any
>
> This probably occurs when you run out of physical memory, but you can
> probably verify that by looking at task manager. A "readline()" method
> wouldn't fit real well with R, as you try to hand blocks of data over
> so that inner loops, implemented largely in native code, can operate
> efficiently. The thing you want is a data structure that can use
> disk more effectively and hide these details from you and the algorithm.
> This works best if the algorithm works with the data structure to avoid
> lots of disk thrashing. You could imagine that your "read" would do
> nothing until each item is needed, but often people want the whole
> file validated before processing; lots of details come up with exception
> handling as you get fancy here. Note of course that your parse output
> could be stored in a hash or something representing a DOM, and this could
> get arbitrarily large. Since it is designed for random access, this may
> cause lots of thrashing if partially on disk.
> Anything you can do to make access patterns more regular, for example
> sorting your data, would help.
>
> > suggestions. Here is my code, which works fine when I am doing a subsample
> > of my dataset.
> >
> > #Defining datasource
> > file <- "filename.txt"
> >
> > #Creating placeholder for data and assigning column names
> > data <- data.frame(Id=NA)
> >
> > #Starting by case = 0
> > case <- 0
> >
> > #Opening a connection to data
> > input <- file(file, "rt")
> >
> > #Going through cases
> > repeat {
> >   line <- readLines(input, n=1)
> >   if (length(line)==0) break
> >   if (length(grep("Id:",line)) != 0) {
> >     case <- case + 1 ; data[case,] <-NA
> >     split_line <- strsplit(line,"Id:")
> >     data[case,1] <- as.numeric(split_line[[1]][2])
> >   }
> > }
> >
> > #Closing connection
> > close(input)
> >
> > #Saving dataframe
> > write.csv(data,'data.csv')
> >
> > Kind regards,
> >
> > Frederik
> >
> > --
> > View this message in context:
> > http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
> > Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
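
The line-by-line loop quoted in this thread could also be written in a chunked form, which still copes with blocks of varying length but avoids calling readLines() once per line and growing a data frame one row at a time (both of which tend to get slower as the data grows). What follows is only a minimal sketch under assumptions: the names read_ids and chunk_size are hypothetical and do not come from the thread, and it assumes, as the quoted code does, that each observation has a line containing "Id:" followed by a number.

```r
# Hypothetical helper (not from the thread): collect the number that follows
# "Id:" on each matching line, reading the file in chunks of lines.
read_ids <- function(path, chunk_size = 10000L) {
  con <- file(path, "rt")
  on.exit(close(con))            # close the connection even if an error occurs

  ids <- list()                  # one numeric vector per chunk that has matches
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break              # end of file reached

    # Keep only the lines that contain the literal text "Id:"
    hits <- grep("Id:", lines, fixed = TRUE, value = TRUE)
    if (length(hits) > 0) {
      # Drop everything up to and including the first "Id:", then convert
      ids[[length(ids) + 1L]] <- as.numeric(sub("^.*?Id:", "", hits, perl = TRUE))
    }
  }

  # Build the data frame once, at the end
  data.frame(Id = as.numeric(unlist(ids)))
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("Id: 1", "some description", "more text",
             "Id: 2", "text", "Id: 3"), tmp)
demo <- read_ids(tmp, chunk_size = 2L)
print(demo)
```

Only one chunk of lines is held in memory at a time, so chunk_size can be tuned to the available memory. Parsing further fields from each block would need extra bookkeeping for blocks that straddle a chunk boundary, which this sketch does not attempt.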