> On Jan 31, 2020, at 1:04 AM, Emmanuel Levy <emmanuel.l...@gmail.com> wrote:
>
> Hi,
>
> I'd like to use the Netflix challenge data and just can't figure out how to
> efficiently "scan" the files.
> https://www.kaggle.com/netflix-inc/netflix-prize-data
>
> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
> 3 values associated to each ID:
>
> The format is as follows:
> *1:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *2:*
> value1,value2, value3
> value1,value2, value3
> *3:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *4:*
> etc ...
>
> And I want to create a matrix where each line is of the form:
>
> ID value1, value2, value3
>
> So "ID" needs to be duplicated - I could write a Perl script to convert
> this format to CSV, but I'm sure there's a simple R trick.
>

I'd be tempted to use pipe() to separately read the ID lines and the value lines, but in R you can do this:

    txt <- readLines("yourfile.txt")
    fc <- count.fields("yourfile.txt", sep = ",")
    is.id <- fc == 1
    # repeat each ID line for every row that follows it, then drop the ":"
    id <- sub(":", "", txt[is.id][cumsum(is.id)], fixed = TRUE)
    mat <- cbind(id[!is.id], do.call(rbind, strsplit(txt[!is.id], ",")))

This assumes that there are exactly 1 or 3 fields in each row of "yourfile.txt" and no blank lines (count.fields() skips those by default, so its result would no longer line up with readLines()). If not, some incantation of grepl() applied to the text of readLines() should suffice.

HTH,

Chuck
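In case it helps, here is a rough sketch of the grepl() route mentioned above. It is only an illustration under a few assumptions: the sample file contents are made up, ID lines are taken to be digits followed by ":", and every value row is assumed to have the same number of fields. The names path, is_id, id, vals and mat are just placeholders.

```r
# Hypothetical sample in the format described above (ID lines end in ":").
path <- tempfile(fileext = ".txt")
writeLines(c("1:", "a,3, 2005-09-06", "b,5, 2005-05-13",
             "2:", "c,4, 2005-10-19",
             "3:", "d,1, 2004-01-02", "e,2, 2004-03-04"), path)

txt <- readLines(path)
txt <- txt[nzchar(trimws(txt))]       # drop blank lines, if any
is_id <- grepl("^[0-9]+:$", txt)      # TRUE for "1:", "2:", ...

# cumsum(is_id) numbers the blocks, so it picks the ID each line belongs to
id <- sub(":$", "", txt[is_id])[cumsum(is_id)]

# split the value rows on commas, tolerating spaces around them
vals <- do.call(rbind, strsplit(txt[!is_id], " *, *"))
mat <- cbind(id[!is_id], vals)
colnames(mat) <- c("ID", "value1", "value2", "value3")
print(mat)
```

The result is a character matrix with the ID repeated on every row. If the values need to be numeric, converting to a data frame afterwards and coercing the columns is probably the easiest route.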