I am sure Rainer's approach is good, and I know my R programming is truly terrible, but here's a crude script in base R that does what you want:

# rawDat <- readLines(con = "netflix.dat")
fil <- tempfile(fileext = ".dat")
cat("*1:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
value1,value2, value3
*2:*
value1,value2, value3
value1,value2, value3
*3:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
*4:*",
file = fil, sep = "\n")
rawDat <- readLines(fil, n = -1)
unlink(fil) # tidy up data input

### create a data frame for output
### this first line will be overwritten by the actual data
outDF <- as.data.frame(list(id = 1,
                            value1 = "",
                            value2 = "",
                            value3 = ""),
                       stringsAsFactors = FALSE) # necessary to avoid mess with character to factor conversion

j <- 0 # counter for entries
for (i in seq_along(rawDat)) {
  rawDat[i] <- trimws(rawDat[i])
  if (nchar(rawDat[i]) == 0) next # skip empty lines
  if (grepl(":*", rawDat[i], fixed = TRUE)) {
    ### got an ID line
    id <- sub("\\*([0123456789]*):\\*", "\\1", rawDat[i])
  } else {
    ### not an ID line so one of the one or more following lines of data
    ### I have assumed these are all of the same form
    j <- j + 1
    rawDat[i] <- gsub(" ", "", rawDat[i], fixed = TRUE)
    tmpDat <- unlist(strsplit(rawDat[i], ","))
    outDF[j, 1] <- id
    outDF[j, 2:4] <- tmpDat
  }
}
outDF

I am slowly adapting to the tidyverse, but this is something I still find easier to do in a very crude for loop in base R. Plea: my formal programming training is one week of "Introduction to FORTRAN" on teletypes in 1975, but I confess it's both lack of formal training _and_ lack of native ability that means my coding is so bad. If any gurus have a moment, show us really elegant and tidyverse ways to do this!
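
For comparison, the loop can probably be avoided altogether. What follows is only a sketch in base R (not tidyverse), and it rests on some assumptions: the first non-empty line is an ID line, every data line has exactly three comma-separated fields, and there is at least one data line. The name parse_blocks is made up for illustration. The idea is to flag the ID lines with a regex and use cumsum() on that flag so that every data line knows which ID it belongs to.

```r
# Hypothetical helper (not from the thread): turn "*ID:*" / data-row text
# into a data frame without an explicit loop.
parse_blocks <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]                    # drop empty lines

  # ID lines look like "*1:*" (or "1:" without the asterisks)
  is_id <- grepl("^\\*?[0-9]+:\\*?$", lines)
  ids   <- sub("^\\*?([0-9]+):\\*?$", "\\1", lines[is_id])

  # cumsum() counts the ID lines seen so far, so for each data line it is
  # the index (into ids) of the block that line belongs to
  block <- cumsum(is_id)[!is_id]

  # strip spaces, split on commas, and stack the fields into a matrix
  dat  <- gsub(" ", "", lines[!is_id], fixed = TRUE)
  vals <- do.call(rbind, strsplit(dat, ",", fixed = TRUE))

  data.frame(id     = as.integer(ids[block]),
             value1 = vals[, 1],
             value2 = vals[, 2],
             value3 = vals[, 3],
             stringsAsFactors = FALSE)
}

# Same toy data as in the script above
example <- c("*1:*",
             "value1,value2, value3", "value1,value2, value3",
             "value1,value2, value3", "value1,value2, value3",
             "*2:*",
             "value1,value2, value3", "value1,value2, value3",
             "*3:*",
             "value1,value2, value3", "value1,value2, value3",
             "value1,value2, value3",
             "*4:*")
res <- parse_blocks(example)
print(res)
```

An ID with no data lines after it (like *4:* here) simply produces no rows. For the real files, rawDat from readLines() could be passed in place of example.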

Very best all,

Chris

----- Original Message -----
> From: "Rainer M Krug" <rai...@krugs.de>
> To: "Emmanuel Levy" <emmanuel.l...@gmail.com>
> Cc: "R-help Mailing List" <r-help@r-project.org>
> Sent: Friday, 31 January, 2020 10:55:46
> Subject: Re: [R] How to read a file containing two types of rows - (for the
> Netflix challenge data format)

> I did something similar yesterday…
>
> Use readLines() to read it in and identify the “*1:*”, … with a regex. Then you
> have your dividers. In a second step, use read.csv(skip = …, Ncollumns = …) to
> read the enclosed blocks, and last, combine them accordingly.
>
> This is written without an R installation, so the argument names are likely
> wrong.
>
> Rainer
>
>
>> On 31 Jan 2020, at 10:04, Emmanuel Levy <emmanuel.l...@gmail.com> wrote:
>>
>> Hi,
>>
>> I'd like to use the Netflix challenge data and just can't figure out how to
>> efficiently "scan" the files.
>> https://www.kaggle.com/netflix-inc/netflix-prize-data
>>
>> The files have two types of row, either an *ID* e.g., "1:", "2:", etc. or
>> 3 values associated to each ID:
>>
>> The format is as follows:
>> *1:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *2:*
>> value1,value2, value3
>> value1,value2, value3
>> *3:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *4:*
>> etc ...
>>
>> And I want to create a matrix where each line is of the form:
>>
>> ID value1, value2, value3
>>
>> So "ID" needs to be duplicated - I could write a Perl script to convert
>> this format to CSV, but I'm sure there's a simple R trick.
>>
>> Thanks for suggestions!
>>
>> Emmanuel
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology,
> UCT), Dipl. Phys. (Germany)
>
> Orcid ID: 0000-0002-7490-0066
>
> Department of Evolutionary Biology and Environmental Studies
> University of Zürich
> Office Y34-J-74
> Winterthurerstrasse 190
> 8075 Zürich
> Switzerland
>
> Office: +41 (0)44 635 47 64
> Cell: +41 (0)78 630 66 57
> email: rainer.k...@uzh.ch
> rai...@krugs.de
> Skype: RMkrug
>
> PGP: 0x0F52F982

--
Chris Evans <ch...@psyctc.org>
Visiting Professor, University of Sheffield <chris.ev...@sheffield.ac.uk>

I do some consultation work for the University of Roehampton
<chris.ev...@roehampton.ac.uk> and other places but <ch...@psyctc.org>
remains my main Email address.

I have a work web site at:
https://www.psyctc.org/psyctc/
and a site I manage for CORE and CORE system trust at:
http://www.coresystemtrust.org.uk/
I have "semigrated" to France, see:
https://www.psyctc.org/pelerinage2016/semigrating-to-france/
That page will also take you to my blog which started with earlier joys in
France and Spain!

If you want to book to talk, I am trying to keep that to Thursdays and my
diary is at:
https://www.psyctc.org/pelerinage2016/ceworkdiary/
Beware: French time, generally an hour ahead of UK.