Hi All,

Thanks so much for your input; it's so nice to have such a helpful community. I ended up combining ideas from the different replies, and my final code is copied below.

All the best,

Emmanuel

mat = read.csv("~/format_test.csv", fill=TRUE, header=FALSE, as.is=TRUE)
# rows whose first column holds an ID such as "1:"
first.col.idx = grep(":", mat[,1])
first.col.val = grep(":", mat[,1], value=TRUE)
first.col.val = gsub(pattern=":", replacement="", first.col.val)
mat.clean = mat[-first.col.idx,]
# number of value rows that follow each ID row
reps = diff(c(first.col.idx, nrow(mat)+1)) - 1
mat.final = cbind(rep(first.col.val, reps), mat.clean)

On Fri, 31 Jan 2020 at 20:31, Berry, Charles <ccbe...@health.ucsd.edu> wrote:

>
> > On Jan 31, 2020, at 1:04 AM, Emmanuel Levy <emmanuel.l...@gmail.com> wrote:
> >
> > Hi,
> >
> > I'd like to use the Netflix challenge data and just can't figure out how to
> > efficiently "scan" the files.
> > https://www.kaggle.com/netflix-inc/netflix-prize-data
> >
> > The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
> > 3 values associated to each ID:
> >
> > The format is as follows:
> > *1:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *2:*
> > value1,value2, value3
> > value1,value2, value3
> > *3:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *4:*
> > etc ...
> >
> > And I want to create a matrix where each line is of the form:
> >
> > ID value1, value2, value3
> >
> > So "ID" needs to be duplicated - I could write a Perl script to convert
> > this format to CSV, but I'm sure there's a simple R trick.
> >
>
> I'd be tempted to use pipe() to separately read the ID lines and the value
> lines, but in R you can do this:
>
> fc <- count.fields( "yourfile.txt", sep = ",")
> rlines <- split( readLines( "yourfile.txt" ), fc)
> mat <-
>   cbind( rlines[[1]],
>     do.call( rbind, strsplit( rlines[[2]], ",")))
>
>
> This assumes that there are exactly 1 or 3 fields in each row of
> "yourfile.txt", if not, some incantation of grepl() applied to the text of
> readLines() should suffice.
>
> HTH,
>
> Chuck
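
As a footnote to the thread, the grepl() route mentioned in the quoted reply could be sketched roughly as follows. This is only an illustration and not code from either poster. The sample vector `lines` and the names `is.id`, `ids` and `id.per.line` are made up for the example, and it assumes that every ID line ends in a colon and that every other line holds comma-separated values.

```r
# Hypothetical sample in the same layout as the files described above:
# an ID line such as "1:", followed by comma-separated value rows.
lines <- c("1:", "a,b,c", "d,e,f",
           "2:", "g,h,i",
           "3:", "j,k,l", "m,n,o")

is.id <- grepl(":$", lines)            # TRUE on ID lines, FALSE on value rows
ids   <- sub(":$", "", lines[is.id])   # ID text with the trailing colon removed

# cumsum(is.id) numbers every line by the most recent ID line at or above it,
# so indexing ids with it repeats each ID once per line in its block.
id.per.line <- ids[cumsum(is.id)]

# Split the value rows into columns and prepend the matching ID.
values    <- do.call(rbind, strsplit(lines[!is.id], ",", fixed = TRUE))
mat.final <- cbind(id.per.line[!is.id], values)
```

For a real file, `lines` would presumably come from readLines() on the file path instead of the inline vector. The result is a character matrix, so numeric columns would still need converting afterwards.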