On Tue, 2007-01-30 at 17:23 -0500, Kimpel, Mark William wrote:
> The main problem I am trying to solve is this:
>
> I am importing a tab delimited file whose first line contains only one
> column, which is a descriptor of the form "col_1 col_2 col_3", i.e. the
> colnames are not tab delineated but are separated by whitespace. I would
> like to parse this first line and make it such that it becomes the colnames
> of the rest of the file, which I am reading into R using read.delim().
> The file is so huge that I must do this in R.
>
> My first question is this: What is the best way to accomplish what I
> want to do?

Mark,

The first thing that comes to mind is a two pass approach on the file:

First pass: (using example file with your first line)

  # Get the first line into a vector to set the colnames for the DF
  # during the second pass
  ColNames <- unlist(read.table("test.txt", nrows = 1, as.is = TRUE))

  > str(ColNames)
   Named chr [1:3] "col_1" "col_2" "col_3"
   - attr(*, "names")= chr [1:3] "V1" "V2" "V3"

Second pass:

  # Now read the rest of the file, skipping the first line.
  # header = FALSE keeps read.delim() (which defaults to header = TRUE)
  # from consuming the first data row as a header line.
  DF <- read.delim("test.txt", skip = 1, header = FALSE,
                   col.names = ColNames)

I believe that should get you the full data set and set the colnames
based upon the first line.

This should pretty much obviate the need for everything below here.

> My other questions revolve around some failed attempts on my part to
> solve the problem on my own using regular expressions. I thought that
> perhaps I could change the first line to "c("col_1", "col_2", "col_3")
> using gsub. I was having trouble figuring out how R uses the backslash
> character because I know that sometimes the backslash one would use in
> Perl needs to be a double backslash in R.

You would not want to change the first line as you have it above, as it
would not be parsed properly using read.table() family functions.

> Here is a sample of what I tried and what I got:
>
> > a<-"col_1 col_2 col_3"
>
> > gsub("\\s", " " , a)
>
> [1] "col_1 col_2 col_3"
>
> > gsub("\\s", "\\s" , a)
>
> [1] "col_1scol_2scol_3"
>
> As you can see, it looks like R is taking a regular expression for
> "pattern", but not taking it for "replacement". Why is this?

There are various settings for how regular expressions are interpreted
within R. See ?grep and note the various arguments to the functions
there and how they impact R's behavior here.

Also, note that there is a difference (to further complicate your
life...) between the characters that R displays by default using print()
and how they are displayed using cat(). See below.
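
To make the pattern/replacement distinction concrete, here is a minimal sketch. It assumes the sample string 'a' from the question; the names r1, r2, r3 and cols are only illustrative. As far as I read ?gsub, the replacement is not itself a regular expression: backreferences such as "\\1" are the special case, and any other backslash simply escapes the character that follows it, which would explain the plain "s" in the output quoted above.

```r
# Assumption: 'a' mirrors the sample string from the question.
a <- "col_1 col_2 col_3"

# Pattern "\\s" is the regex \s (a whitespace character).
# In the replacement, "\\s" is just an escaped "s".
r1 <- gsub("\\s", "\\s", a)
print(r1)                      # [1] "col_1scol_2scol_3"

# Backreferences are what the replacement does understand:
# "\\1" re-inserts whatever the parenthesized group matched.
r2 <- gsub("(\\s)", "<\\1>", a)
print(r2)                      # [1] "col_1< >col_2< >col_3"

# A literal backslash in the result needs two backslashes in the
# replacement, which is four inside an R string literal.
r3 <- gsub("\\s", "\\\\s", a)
print(r3)                      # print() shows the escaped form
cat(r3, "\n")                  # cat() shows the characters actually stored
nchar(r3)                      # each space became two characters

# If the goal is a character vector rather than the text "c(...)",
# splitting on whitespace gets there directly.
cols <- strsplit(a, "\\s+")[[1]]
print(cols)
```

The print()/cat() pair on r3 is the same display difference that the paste() example illustrates: the string holds one backslash per replaced space, but print() shows it doubled.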

  > a
  [1] "col_1 col_2 col_3"

  > gsub(" ", ", " , a)
  [1] "col_1, col_2, col_3"

or to get you to your vector statement above:

Note the result here:

  > paste("c(\"", gsub(" ", "\", \"" , a), "\")", sep = "")
  [1] "c(\"col_1\", \"col_2\", \"col_3\")"

Now see how it displays when the escaped double quote chars are
interpreted properly using cat():

  > cat(paste("c(\"", gsub(" ", "\", \"" , a), "\")", sep = ""), "\n")
  c("col_1", "col_2", "col_3")

> Assuming that I did want to solve my original problem with gsub and then
> turn the string into an R object, how would I get gsub to return
> "c("col_1", "col_2", "col_3") using my original string?

Again, note the two pass solution above. It's easier, unless you would
want to consider using awk/sed from a CLI, which I generally avoid at
all costs...

> Finally, is there a way to declare a string as a regular expression so
> that R sees it the same way other languages, such as Perl do, i.e. make
> the backslash be interpreted the same way? For someone who is just
> learning regular expressions as I am, it is very frustrating to read
> about them in references and then have to translate what I've learned
> into R syntax. I was thinking that instead of enclosing the string in
> "", one could use THIS.IS.A.REGULAR.EXPRESSION(), similar to the way we
> use I() in formulae.

Part of the challenge is noting the different behaviors of regex within
R and how that behavior is affected by the aforementioned arguments.
The other part is noting how output is displayed within R relative to
the interpretation of escaped characters, as seen above.

> These are a bunch of questions, but obviously I have a lot to learn!
>
> Thanks,
>
> Mark

HTH,

Marc Schwartz

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.