> On Jan 31, 2020, at 1:04 AM, Emmanuel Levy <emmanuel.l...@gmail.com> wrote:
> 
> Hi,
> 
> I'd like to use the Netflix challenge data and just can't figure out how to
> efficiently "scan" the files.
> https://www.kaggle.com/netflix-inc/netflix-prize-data
> 
> The files have two types of rows: either an *ID*, e.g. "1:", "2:", etc., or
> 3 values associated with each ID:
> 
> The format is as follows:
> *1:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *2:*
> value1,value2, value3
> value1,value2, value3
> *3:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *4:*
> etc ...
> 
> And I want to create a matrix where each line is of the form:
> 
> ID value1, value2, value3
> 
> Si "ID" needs to be duplicated - I could write a Perl script to convert
> this format to CSV, but I'm sure there's a simple R trick.
> 

I'd be tempted to use pipe() to separately read the ID lines and the value 
lines, but in R you can do this:

fc  <- count.fields( "yourfile.txt", sep = ",")   # 1 for the ID rows, 3 for the value rows
txt <- readLines( "yourfile.txt" )
ids <- sub( ":", "", txt[ fc == 1 ] )             # "1:", "2:", ... with the colon stripped
grp <- cumsum( fc == 1 )[ fc == 3 ]               # which ID block each value row falls in
mat <-
  cbind( ids[ grp ],
         do.call( rbind, strsplit( txt[ fc == 3 ], ",")))
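
For what it's worth, the pipe() route I mentioned would look roughly like
this (a sketch only, untested, assuming a Unix-style grep on your PATH and
that the values themselves never contain a colon):

ids  <- sub( ":", "", readLines( pipe( "grep ':' yourfile.txt" ) ) )
vals <- as.matrix( read.table( pipe( "grep -v ':' yourfile.txt" ),
                               sep = ",", colClasses = "character" ) )

but you would still need the cumsum() bookkeeping above to pair each value
row with its ID, so I would stay in R here.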


The count.fields() version assumes that each row of "yourfile.txt" has 
exactly 1 or 3 fields; if not, some incantation of grepl() applied to the 
text of readLines() should suffice.
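
Something along these lines, say (again just a sketch; the pattern assumes
the ID rows are nothing but digits followed by a colon):

txt   <- readLines( "yourfile.txt" )
is_id <- grepl( "^[0-9]+:$", txt )          # TRUE on the "1:", "2:", ... rows
grp   <- cumsum( is_id )[ !is_id ]          # block index for each value row
mat <-
  cbind( sub( ":", "", txt[ is_id ] )[ grp ],
         do.call( rbind, strsplit( txt[ !is_id ], ",")))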

HTH,

Chuck 

