Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)

2020-01-31 Thread Berry, Charles



> On Jan 31, 2020, at 1:04 AM, Emmanuel Levy  wrote:
> 
> Hi,
> 
> I'd like to use the Netflix challenge data and just can't figure out how to
> efficiently "scan" the files.
> https://www.kaggle.com/netflix-inc/netflix-prize-data
> 
> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
> 3 values associated to each ID:
> 
> The format is as follows:
> *1:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *2:*
> value1,value2, value3
> value1,value2, value3
> *3:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *4:*
> etc ...
> 
> And I want to create a matrix where each line is of the form:
> 
> ID value1, value2, value3
> 
> Si "ID" needs to be duplicated - I could write a Perl script to convert
> this format to CSV, but I'm sure there's a simple R trick.
> 

I'd be tempted to use pipe() to separately read the ID lines and the value 
lines, but in R you can do this:

fc <- count.fields( "yourfile.txt", sep = ",")
rlines <- split( readLines( "yourfile.txt" ), fc)
mat <-
  cbind( rlines[[1]],
do.call( rbind, strsplit( rlines[[2]], ",")))


This assumes that there are exactly 1 or 3 fields in each row of 
"yourfile.txt", if not, some incantation of grepl() applied to the text of 
readLines() should suffice.

HTH,

Chuck 

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)

2020-01-31 Thread Emmanuel Levy
Hi All,

Thanks so much for your inputs, it's so nice to have such a helpful
community -- I wrote some kind of mix between the different replies, I copy
my final code below.

All the best,

Emmanuel

mat = read.csv("~/format_test.csv", fill=TRUE, header=FALSE, as.is=TRUE)

first.col.idx = grep(":",mat[,1])

first.col.val = grep(":",mat[,1],value=TRUE)

first.col.val = gsub(pattern=":", replacement="", first.col.val)

mat.clean = mat[-first.col.idx,]

reps = diff(c(first.col.idx,length(mat1[,1])))

reps[1] = reps[1]+1

mat.final = cbind( rep(first.col.val, reps-1), mat.clean)








On Fri, 31 Jan 2020 at 20:31, Berry, Charles 
wrote:

>
>
> > On Jan 31, 2020, at 1:04 AM, Emmanuel Levy 
> wrote:
> >
> > Hi,
> >
> > I'd like to use the Netflix challenge data and just can't figure out how
> to
> > efficiently "scan" the files.
> > https://www.kaggle.com/netflix-inc/netflix-prize-data
> >
> > The files have two types of row, either an *ID* e.g., "1:" , "2:", etc.
> or
> > 3 values associated to each ID:
> >
> > The format is as follows:
> > *1:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *2:*
> > value1,value2, value3
> > value1,value2, value3
> > *3:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *4:*
> > etc ...
> >
> > And I want to create a matrix where each line is of the form:
> >
> > ID value1, value2, value3
> >
> > Si "ID" needs to be duplicated - I could write a Perl script to convert
> > this format to CSV, but I'm sure there's a simple R trick.
> >
>
> I'd be tempted to use pipe() to separately read the ID lines and the value
> lines, but in R you can do this:
>
> fc <- count.fields( "yourfile.txt", sep = ",")
> rlines <- split( readLines( "yourfile.txt" ), fc)
> mat <-
>   cbind( rlines[[1]],
> do.call( rbind, strsplit( rlines[[2]], ",")))
>
>
> This assumes that there are exactly 1 or 3 fields in each row of
> "yourfile.txt", if not, some incantation of grepl() applied to the text of
> readLines() should suffice.
>
> HTH,
>
> Chuck
>
>
>
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)

2020-01-31 Thread Jan Galkowski
With the *data.table* package, *R* can use *fread* as follows:

> grab<- function(file)
> {
>  fin<- fread(file=file,
>  sep=NULL,
>  dec=".",
>  quote="", nrows=Inf, header=FALSE,
>  stringsAsFactors=FALSE, verbose=FALSE,
>  col.names=c("record"),
>  check.names=FALSE, fill=FALSE, blank.lines.skip=FALSE,
>  showProgress=TRUE,
>  data.table=FALSE, skip=0,
>  nThread=2, logical01=FALSE, keepLeadingZeros=FALSE)
>  cat(sprintf("Read '%s'.\n", file))
>  #
>  substance<- apply(X=fin, MARGIN=1, FUN=function(r) chartr(",", "\t", r[1]))
>  cat(sprintf("Translated '%s'.\n", file))
>  D<- fread(text=substance,
>  sep="\t",
>  dec=".",
>  quote="", nrows=Inf, header=FALSE,
>  stringsAsFactors=FALSE, verbose=FALSE,
>  col.names=c("ip", "valid.hits", "err.hits", "megabytes"),
>  check.names=FALSE, fill=FALSE, blank.lines.skip=FALSE,
>  showProgress=TRUE,
>  data.table=FALSE, skip=0,
>  nThread=2, logical01=FALSE, keepLeadingZeros=FALSE)
>  cat(sprintf("Parsed '%s'.\n", file))
>  ip<- D$ip
>  withinBlock<- sapply(X=ip, FUN=function(s) as.integer((strsplit(x=s, 
> split=".", fixed=TRUE)[[1]])[4]))
>  D$within.block<- withinBlock
>  return(D)
> }
> 

In short, one pass pulls in all the records into an internal structure, which 
can be edited or manipulated at will, and then a second call to *fread* parses 
it properly. 

*fread *is fast, even for big datasets.


--
Jan Galkowski 

https://www.linkedin.com/in/deepdevelopment

member,

... American Statistical Association
... International Society for Bayesian Analysis
... Ecological Society of America
... International Association of Survey Statisticians
... American Association for the Advancement of Science
... TeX Users Group

(pronouns: *he, him, his*)

*Keep your energy local*. --John Farrell, *ILSR *


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)

2020-01-31 Thread Chris Evans
I am sure Rainer's approach is good and I know my R programming is truly 
terrible but here's a crude script in base R that does what you want

# rawDat <- readLines(con = "netflix.dat")
fil <- tempfile(fileext = ".dat")
cat("*1:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
value1,value2, value3
*2:*
value1,value2, value3
value1,value2, value3
*3:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
*4:*", 
file = fil,
sep = "\n")
rawDat <- readLines(fil, n = -1)
unlink(fil) # tidy up data input

### create a data frame for output
### this first line will be overwritten by the actual data
outDF <- as.data.frame(list(id = 1,
value1 = "",
value2 = "",
value3 = ""),
   stringsAsFactors = FALSE) # necessary to avoid mess with 
character to factor conversion

j <- 0 # counter for entries
for (i in 1:length(rawDat)) {
  rawDat[i] <- trimws(rawDat[i])
  if (nchar(rawDat[i]) == 0) next # skip empty lines
  if (grepl(":*", rawDat[i], fixed = TRUE)) {
### got an ID line
id <- sub("\\*([0123456789]*):\\*", "\\1", rawDat[i])
  } else {
### not an ID line so one of the one or more following lines of data
### I have assumed these are all of the same form
j <- j + 1
rawDat[i] <- gsub(" ", "", rawDat[i], fixed = TRUE)
tmpDat <- unlist(strsplit(rawDat[i], ","))
outDF[j,1] <- id
outDF[j,2:4] <- tmpDat
  }
}
outDF

I am slowly adapting to the tidyverse but this is something I still find easier 
to do in very crude for loop, base R.

Plea: my formal programming training is one week of "Introduction to FORTRAN" 
on teletypes in 1975, but I confess it's 
both lack of formal training _and_ lack of native ability that means my coding 
is so bad.

If any gurus have a moment, show us really elegant and tidyverse ways to do 
this!

Very best all,

Chris

- Original Message -
> From: "Rainer M Krug" 
> To: "Emmanuel Levy" 
> Cc: "R-help Mailing List" 
> Sent: Friday, 31 January, 2020 10:55:46
> Subject: Re: [R] How to read a file containing two types of rows - (for the 
> Netflix challenge data format)

> I did something similar yesterday…
> 
> Use readLine() to read at in and identify the “*1:*, … with a regex. Than you
> have your dividers. In a second step, use read.csv(skip = …, Ncollumns = …)  
> to
> read the enclosed blocks, and last, combine them accordingly.
> 
> This is written without an R installation, so the argument names are likely
> wrong.
> 
> Rainer
> 
> 
>> On 31 Jan 2020, at 10:04, Emmanuel Levy  wrote:
>> 
>> Hi,
>> 
>> I'd like to use the Netflix challenge data and just can't figure out how to
>> efficiently "scan" the files.
>> https://www.kaggle.com/netflix-inc/netflix-prize-data
>> 
>> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
>> 3 values associated to each ID:
>> 
>> The format is as follows:
>> *1:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *2:*
>> value1,value2, value3
>> value1,value2, value3
>> *3:*
>> value1,value2, value3
>> value1,value2, value3
>> value1,value2, value3
>> *4:*
>> etc ...
>> 
>> And I want to create a matrix where each line is of the form:
>> 
>> ID value1, value2, value3
>> 
>> Si "ID" needs to be duplicated - I could write a Perl script to convert
>> this format to CSV, but I'm sure there's a simple R trick.
>> 
>> Thanks for suggestions!
>> 
>> Emmanuel
>> 
>>  [[alternative HTML version deleted]]
>> 
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology,
> UCT), Dipl. Phys. (Germany)
> 
> Orcid ID: -0002-7490-0066
> 
> Department of Evolutionary Biology and Environmental Studies
> University of Zürich
> Office Y34-J-74
> Winterthurerstrasse 190
> 8075 Zürich
> Switzerland
> 
> Office:   +41 (0)44 635 47 64
> Cell: +41 (0)78 630 66 57
> email:  rainer.k...@uzh.ch
>   rai...@krugs.de
> Skype: RMkrug
> 
> PGP: 0x0F52F982
> 
> 
> 
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Chris Evans  Visiting Professor, University of Sheffield 

I do some consultation work for the University of Roehampton 
 and other places
but  remains my main Email address.  I have a work web site 
at:
   

Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)

2020-01-31 Thread Rainer M Krug
I did something similar yesterday…

Use readLine() to read at in and identify the “*1:*, … with a regex. Than you 
have your dividers. In a second step, use read.csv(skip = …, Ncollumns = …)  to 
read the enclosed blocks, and last, combine them accordingly.

This is written without an R installation, so the argument names are likely 
wrong.

Rainer


> On 31 Jan 2020, at 10:04, Emmanuel Levy  wrote:
> 
> Hi,
> 
> I'd like to use the Netflix challenge data and just can't figure out how to
> efficiently "scan" the files.
> https://www.kaggle.com/netflix-inc/netflix-prize-data
> 
> The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
> 3 values associated to each ID:
> 
> The format is as follows:
> *1:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *2:*
> value1,value2, value3
> value1,value2, value3
> *3:*
> value1,value2, value3
> value1,value2, value3
> value1,value2, value3
> *4:*
> etc ...
> 
> And I want to create a matrix where each line is of the form:
> 
> ID value1, value2, value3
> 
> Si "ID" needs to be duplicated - I could write a Perl script to convert
> this format to CSV, but I'm sure there's a simple R trick.
> 
> Thanks for suggestions!
> 
> Emmanuel
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, 
UCT), Dipl. Phys. (Germany)

Orcid ID: -0002-7490-0066

Department of Evolutionary Biology and Environmental Studies
University of Zürich
Office Y34-J-74
Winterthurerstrasse 190
8075 Zürich
Switzerland

Office: +41 (0)44 635 47 64
Cell:   +41 (0)78 630 66 57
email:  rainer.k...@uzh.ch
rai...@krugs.de
Skype: RMkrug

PGP: 0x0F52F982




[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.