[R] Data consideration in executing pca

2024-02-25 Thread Jiji Sid
Dear R users,

 I have a txt file named 'data_1.txt' whose first column contains the names
of the individuals and whose other columns contain the values of four
variables, X_1, X_2, X_3 and X_4. I read it into R from its location and
called the result data. I'd like to do a normalized principal component
analysis, so I started by calculating the correlation matrix:
cor(data)
 I got the following message:
Error in cor(nnotes) : 'x' must be numeric.
I eliminated the first column of names, read the modified data again, and
then added the names of the individuals with rownames(). I was able to run
the PCA and obtain graphical representations of the variables: their
coordinates, contributions and cos2. However, when I used fviz_pca_ind to
get the graphical representation of the individuals, I got a graphic that
treats the variables as individuals. Can you please help me? I have
attached the data_1.txt file.

Many thanks in advance.
Name     X_1   X_2  X_3  X_4
John     10.2  10   6    5
Ricardo  14.75 13.5 10.5 9
Suzane   17.75 16.5 12.5 10
Monica   15    14   13   9
Meriam   16    19   14   13
Philipps 17.75 17   12   12
Sonia    20    17   13   14
__
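A minimal sketch of a workflow that avoids both reported problems (the data is inlined here for self-containment, using the unambiguous rows from the post; the `row.names = 1` read and passing the `prcomp` object to the plotting function are suggestions, not something confirmed in the thread):

```r
# Keep the first column as row names so every remaining column is numeric.
# With the real file this would be:
#   dat <- read.table("data_1.txt", header = TRUE, row.names = 1)
txt <- "Name X_1 X_2 X_3 X_4
Ricardo 14.75 13.5 10.5 9
Suzane 17.75 16.5 12.5 10
Monica 15 14 13 9
Meriam 16 19 14 13
Philipps 17.75 17 12 12
Sonia 20 17 13 14"
dat <- read.table(text = txt, header = TRUE, row.names = 1)

cor(dat)                                          # no longer errors: all columns numeric
pca <- prcomp(dat, center = TRUE, scale. = TRUE)  # normalized PCA

# For the individuals plot, pass the PCA object itself, e.g. with the
# factoextra package: fviz_pca_ind(pca). Passing a table of variable
# results instead would draw the variables as if they were individuals.
```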
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data frame returned from sapply but vector expected

2022-11-04 Thread PIKAL Petr
Hallo Ivan

Thanks, yes, it seems to be working. I had also thought of removing the
NULLs with

mylist2[sapply(mylist2, is.null)] <- NULL

but your approach is probably better (in any case simpler).

Thanks again.

Petr

> -----Original Message-----
> From: Ivan Krylov 
> Sent: Friday, November 4, 2022 1:37 PM
> To: PIKAL Petr 
> Cc: R-help Mailing List 
> Subject: Re: [R] data frame returned from sapply but vector expected
> 
> On Fri, 4 Nov 2022 15:30:27 +0300
> Ivan Krylov  wrote:
> 
> > sapply(mylist2, `[[`, 'b')
> 
> Wait, that would simplify the return value into a matrix when there are
> no NULLs. But lapply(mylist2, `[[`, 'b') should work in both cases, which
> in my opinion goes to show the dangers of using simplifying functions in
> to-be-library code.
> 
> Sorry for the double-post!
> 
> --
> Best regards,
> Ivan


Re: [R] data frame returned from sapply but vector expected

2022-11-04 Thread Ivan Krylov
On Fri, 4 Nov 2022 15:30:27 +0300
Ivan Krylov  wrote:

> sapply(mylist2, `[[`, 'b')

Wait, that would simplify the return value into a matrix when there are
no NULLs. But lapply(mylist2, `[[`, 'b') should work in both cases,
which in my opinion goes to show the dangers of using simplifying
functions in to-be-library code.
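A short self-contained sketch of the difference (the data frames here only stand in for the poster's data):

```r
df <- data.frame(a = rnorm(5), b = runif(5), c = rlnorm(5))
mylist2 <- list(NULL, df, df)

s <- sapply(mylist2, `[`, "b")   # NULL leaf blocks simplification:
str(s)                           # a list mixing NULL and 1-column data frames

l <- lapply(mylist2, `[[`, "b")  # always a plain list of vectors/NULLs
str(l)
```

`lapply()` never simplifies, so downstream code sees the same shape whether or not a leaf is NULL.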

Sorry for the double-post!

-- 
Best regards,
Ivan



Re: [R] data frame returned from sapply but vector expected

2022-11-04 Thread Ivan Krylov
On Fri, 4 Nov 2022 12:19:09 +
PIKAL Petr  wrote:

> > str(sapply(mylist2, "[", "b"))
> List of 3
>  $ : NULL
>  $ :'data.frame':  5 obs. of  1 variable:
>   ..$ b: num [1:5] 0.01733 0.46055 0.19421 0.11609 0.00789
>  $ :'data.frame':  5 obs. of  1 variable:
>   ..$ b: num [1:5] 0.593 0.478 0.299 0.185 0.847

Is sapply(mylist2, `[[`, 'b') closer to what you'd like to see, i.e. a
list of vectors or NULLs?

-- 
Best regards,
Ivan



[R] data frame returned from sapply but vector expected

2022-11-04 Thread PIKAL Petr
Hello all,

I have found a strange problem that occurs when part of a list is NULL.

In this case the sapply result is a ***list of data frames***, but if
there is no NULL leaf, the result is a ***list of vectors***.

I tried the simplify option, but it did not help, and I found nothing in
the help page either.

The code is part of a bigger project in which I fill a list by reading in
data; if a read fails, the leaf is set to NULL. A boxplot is then created
simply by

boxplot(sapply(mylist2, "[", "b"))

and the user is asked to select whether the values should be rbind-ed or
not.

Is it possible to perform some *apply without getting data frames as the
result when there is a NULL leaf?

Here is an example (without the boxplot):

df1 <- data.frame(a=rnorm(5), b=runif(5), c=rlnorm(5))
df2 <- data.frame(a=rnorm(5), b=runif(5), c=rlnorm(5))
df3 <- data.frame(a=rnorm(5), b=runif(5), c=rlnorm(5))  # df3 was missing from the original post
mylist1 <- list(df1, df2, df3)
mylist2 <- list(NULL, df2, df3)

> str(sapply(mylist1, "[", "b"))
List of 3
 $ b: num [1:5] 0.387 0.69 0.876 0.836 0.819
 $ b: num [1:5] 0.01733 0.46055 0.19421 0.11609 0.00789
 $ b: num [1:5] 0.593 0.478 0.299 0.185 0.847

> str(sapply(mylist2, "[", "b"))
List of 3
 $ : NULL
 $ :'data.frame':  5 obs. of  1 variable:
  ..$ b: num [1:5] 0.01733 0.46055 0.19421 0.11609 0.00789
 $ :'data.frame':  5 obs. of  1 variable:
  ..$ b: num [1:5] 0.593 0.478 0.299 0.185 0.847

 

S pozdravem | Best Regards

RNDr. Petr PIKAL
Vedoucí Výzkumu a vývoje | Research Manager

PRECHEZA a.s.
nábř. Dr. Edvarda Beneše 1170/24 | 750 02 Přerov | Czech Republic
Tel: +420 581 252 256 | GSM: +420 724 008 364
petr.pi...@precheza.cz | www.precheza.cz




Re: [R] data manipulation question

2021-08-23 Thread Jim Lemon
Hi Kai,
How about setting:

germlinepatients$DisclosureStatus <- NA

then having your three conditional statements as indices:

germlinepatients$DisclosureStatus[germlinepatients$gl_resultsdisclosed == 1] <- "DISCLOSED"
germlinepatients$DisclosureStatus[germlinepatients$gl_resultsdisclosed == 0] <- "ATTEMPTED"
germlinepatients$DisclosureStatus[is.na(germlinepatients$gl_resultsdisclosed) &
  germlinepatients$gl_discloseattempt1 != ""] <- "ATTEMPTED"

I know it's not elegant and you could join the last two statements
with OR (|) but it may work.

Jim

On Tue, Aug 24, 2021 at 9:22 AM Kai Yang via R-help
 wrote:
>
> Hello List,
> I wrote the script below to assign values to a new field, DisclosureStatus.
> My goal: if gl_resultsdisclosed = 1 then DisclosureStatus = DISCLOSED;
> else if gl_resultsdisclosed = 0 then DisclosureStatus = ATTEMPTED;
> else if gl_resultsdisclosed is missing and gl_discloseattempt1 is not
> missing, then DisclosureStatus = ATTEMPTED; else missing.
>
>
> germlinepatients$DisclosureStatus <-
>   ifelse(germlinepatients$gl_resultsdisclosed == 1, "DISCLOSED",
>     ifelse(germlinepatients$gl_resultsdisclosed == 0, "ATTEMPTED",
>       ifelse(is.na(germlinepatients$gl_resultsdisclosed) &
>                germlinepatients$gl_discloseattempt1 != '', "ATTEMPTED",
>              NA)))
>
> The first two conditions give me the right result, but the third does not.
> After checking the data, there are 23 cases where gl_resultsdisclosed is
> missing and gl_discloseattempt1 is not missing. The code doesn't give any
> error message.
> Please help.
> Thank you
>


[R] data manipulation question

2021-08-23 Thread Kai Yang via R-help
Hello List,
I wrote the script below to assign values to a new field, DisclosureStatus.
My goal: if gl_resultsdisclosed = 1 then DisclosureStatus = DISCLOSED;
else if gl_resultsdisclosed = 0 then DisclosureStatus = ATTEMPTED;
else if gl_resultsdisclosed is missing and gl_discloseattempt1 is not
missing, then DisclosureStatus = ATTEMPTED; else missing.


germlinepatients$DisclosureStatus <-
  ifelse(germlinepatients$gl_resultsdisclosed == 1, "DISCLOSED",
    ifelse(germlinepatients$gl_resultsdisclosed == 0, "ATTEMPTED",
      ifelse(is.na(germlinepatients$gl_resultsdisclosed) &
               germlinepatients$gl_discloseattempt1 != '', "ATTEMPTED",
             NA)))

The first two conditions give me the right result, but the third does not.
After checking the data, there are 23 cases where gl_resultsdisclosed is
missing and gl_discloseattempt1 is not missing. The code doesn't give any
error message.
Please help.
thank you
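A likely explanation, offered as a guess since the thread does not state it: `ifelse()` returns `NA` wherever its test is `NA`, so rows with missing `gl_resultsdisclosed` never reach the third branch at all. A small reproduction (the vectors are hypothetical stand-ins for the real columns):

```r
x         <- c(1, 0, NA)       # stand-in for gl_resultsdisclosed
attempted <- c("", "", "yes")  # stand-in for gl_discloseattempt1

out <- ifelse(x == 1, "DISCLOSED",
         ifelse(x == 0, "ATTEMPTED",
           ifelse(is.na(x) & attempted != "", "ATTEMPTED", NA)))

out  # third element is NA: the outer test `x == 1` is NA there, and
     # ifelse() propagates that NA without consulting the inner branches
```

Assigning into a pre-initialized column with logical indices (e.g. `df$y[is.na(x) & attempted != ""] <- "ATTEMPTED"`) sidesteps this behaviour.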



Re: [R] Data is not properly written in csv file

2021-06-21 Thread David Winsemius
This was an exact duplicate of a posting to StackOverflow where it has a 
response. You are asked in the Posting Guide not to crosspost.



--

David.

On 6/20/21 8:03 AM, Sri Priya wrote:

location <- '
http://keic.mica-apps.net/wwwisis/ET_Annual_Reports/Religare_Enterprises_Ltd/RELIGARE-2017-2018.pdf
'

# Extract the table
out <- extract_tables(location)




Re: [R] Data is not properly written in csv file

2021-06-21 Thread Marc Schwartz via R-help

Hi,

If the extracted tables do not have consistent content and structure,
that may be causing problems as you append each to the same file.


You might want to modify your loop so that each table gets written to a 
different CSV file and see what that looks like.


Also, review ?write.table and take note of the default arguments that 
are used for write.csv(), as noted in the CSV Files section, and in the 
Examples.
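A sketch of the one-file-per-table suggestion; the `out` list is faked here with two small matrices standing in for the result of `extract_tables()`:

```r
# Stand-in for: out <- tabulizer::extract_tables(location)
out <- list(matrix(1:4, nrow = 2),
            matrix(letters[1:6], nrow = 2))

# One CSV per extracted table, instead of appending all to one file.
files <- file.path(tempdir(), sprintf("Output_%02d.csv", seq_along(out)))
for (i in seq_along(out)) {
  write.table(out[[i]], file = files[i], sep = ",", quote = FALSE,
              row.names = FALSE, col.names = FALSE)  # note [[i]], not [i]
}
file.exists(files)
```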


Regards,

Marc Schwartz

Sri Priya wrote on 6/20/21 11:03 AM:

Dear R Users,

I am working on extracting tables from a PDF and writing them to a csv
file. When I executed the code, the tables were not written properly to
the csv file.

Here is my code:

library(tabulizer)
# Location of pdf file.
location <- '
http://keic.mica-apps.net/wwwisis/ET_Annual_Reports/Religare_Enterprises_Ltd/RELIGARE-2017-2018.pdf
'

# Extract the table
out <- extract_tables(location)
for(i in 1:length(out))
{
 write.table(out[i], file='Output.csv',append=TRUE, sep=",",quote =
FALSE)
}
  I enclosed the screenshot of the output file. In that you can see
the tables are incomplete.

Any help would be appreciated.

Thanks
Sripriya.





Re: [R] Data is not properly written in csv file

2021-06-21 Thread Bert Gunter
Please read the posting guide, linked below, which says:

"For questions about functions in standard packages distributed with R (see
the FAQ 'Add-on packages in R'), ask questions on R-help.
If the question relates to a *contributed package*, e.g., one downloaded
from CRAN, try contacting the package maintainer first. You can also use
find("functionname") and packageDescription("packagename") to find this
information. *Only* send such questions to R-help or R-devel if you get no
reply or need further assistance. This applies to both requests for help
and to bug reports."

You may get lucky here and someone familiar with the tabulizer package will
respond; but unless you have already done so and received no response -- in
which case say so -- you should contact the maintainer about your problem.


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Jun 21, 2021 at 2:14 PM Sri Priya  wrote:

> Dear R Users,
>
> I am working on extracting tables from PDF and I am writing that in a csv
> file. When I executed the code, the tables were not properly written in the
> csv file.
>
> Here is my code:
>
> library(tabulizer)
> # Location of pdf file.
> location <- '
>
> http://keic.mica-apps.net/wwwisis/ET_Annual_Reports/Religare_Enterprises_Ltd/RELIGARE-2017-2018.pdf
> '
>
> # Extract the table
> out <- extract_tables(location)
> for(i in 1:length(out))
> {
> write.table(out[i], file='Output.csv',append=TRUE, sep=",",quote =
> FALSE)
> }
>  I enclosed the screenshot of the output file. In that you can see
> the tables are incomplete.
>
> Any help would be appreciated.
>
> Thanks
> Sripriya.


Re: [R] Data transformation problem

2020-11-12 Thread phil

Thank you so much for this elegant solution, Jeff.

Philip

On 2020-11-12 02:20, Jeff Newmiller wrote:

I am not a data.table aficionado, but here is how I would do it with
dplyr/tidyr:

library(dplyr)
library(tidyr)

do_per_REL <- function( DF ) {
  rng <- range( DF$REF1 ) # watch out for missing months?
  DF <- (   data.frame( REF1 = seq( rng[ 1 ], rng[ 2 ], by = "month" ) )
        %>% left_join( DF, by = "REF1" )
        %>% arrange( REF1 )
        )
  with( DF
      , data.frame( REF2 = REF1[ -1 ]
                  , VAL2 = 100 * diff( VAL1 ) / VAL1[ -length( VAL1 ) ]
                  )
      )
}

df2a <- (   df1
        %>% mutate( REF1 = as.Date( REF1 )
                  , REL1 = as.Date( REL1 )
                  )
        %>% nest( data = -REL1 )
        %>% rename( REL2 = REL1 )
        %>% rowwise()
        %>% mutate( data = list( do_per_REL( data ) ) )
        %>% ungroup()
        %>% unnest( cols = "data" )
        %>% select( REF2, REL2, VAL2 )
        %>% arrange( REF2, desc( REL2 ), VAL2 )
        )
df2a

On Wed, 11 Nov 2020, p...@philipsmith.ca wrote:

I am stuck on a data transformation problem. I have a data frame, df1 
in my example, with some original "levels" data. The data pertain to 
some variable, such as GDP, in various reference periods, REF, as 
estimated and released in various release periods, REL. The release 
periods follow after the reference periods by two months or more, 
sometimes by several years. I want to build a second data frame, 
called df2 in my example, with the month-to-month growth rates that 
existed in each reference period, revealing the revisions to those 
growth rates in subsequent periods.


REF1 <- c("2017-01-01","2017-01-01","2017-01-01","2017-01-01","2017-01-01",
          "2017-02-01","2017-02-01","2017-02-01","2017-02-01","2017-02-01",
          "2017-03-01","2017-03-01","2017-03-01","2017-03-01","2017-03-01")
REL1 <- c("2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
          "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
          "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01")
VAL1 <- c(17974,14567,13425,NA,12900,17974,14000,14000,12999,13245,17197,11500,
          19900,18765,13467)
df1 <- data.frame(REF1,REL1,VAL1)
REF2 <- c("2017-02-01","2017-02-01","2017-02-01","2017-02-01","2017-02-01",
          "2017-03-01","2017-03-01","2017-03-01","2017-03-01","2017-03-01")
REL2 <- c("2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
          "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01")
VAL2 <- c(0.0,-3.9,4.3,NA,2.3,-4.3,-17.9,42.1,44.4,1.7)
df2 <- data.frame(REF2,REL2,VAL2)

In my example I have provided some sample data pertaining to three 
reference months, 2017-01-01 through 2017-03-01, and five release 
periods, "2020-09-01","2020-08-01","2020-07-01","2020-06-01" and 
"2019-05-01". In my actual problem I have millions of REF-REL 
combinations, so my data frame is quite large. I am using data.table 
for faster processing, though I am more familiar with the tidyverse. I 
am providing df2 as the target data frame for my example, so you can 
see what I am trying to achieve.


I have not been able to find an efficient way to do these 
calculations. I have tried "for" loops with "if" statements, without 
success so far, and anyway this approach would be too slow, I fear. 
Suggestions as to how I might proceed would be much appreciated.


Philip








Re: [R] Data transformation problem

2020-11-11 Thread Jeff Newmiller
I am not a data.table aficionado, but here is how I would do it with
dplyr/tidyr:


library(dplyr)
library(tidyr)

do_per_REL <- function( DF ) {
  rng <- range( DF$REF1 ) # watch out for missing months?
  DF <- (   data.frame( REF1 = seq( rng[ 1 ], rng[ 2 ], by = "month" ) )
%>% left_join( DF, by = "REF1" )
%>% arrange( REF1 )
)
  with( DF
  , data.frame( REF2 = REF1[ -1 ]
  , VAL2 = 100 * diff( VAL1 ) / VAL1[ -length( VAL1 ) ]
  )
  )
}

df2a <- (   df1
%>% mutate( REF1 = as.Date( REF1 )
  , REL1 = as.Date( REL1 )
  )
%>% nest( data = -REL1 )
%>% rename( REL2 = REL1 )
%>% rowwise()
%>% mutate( data = list( do_per_REL( data ) ) )
%>% ungroup()
%>% unnest( cols = "data" )
%>% select( REF2, REL2, VAL2 )
%>% arrange( REF2, desc( REL2 ), VAL2 )
)
df2a

On Wed, 11 Nov 2020, p...@philipsmith.ca wrote:

I am stuck on a data transformation problem. I have a data frame, df1 in my 
example, with some original "levels" data. The data pertain to some variable, 
such as GDP, in various reference periods, REF, as estimated and released in 
various release periods, REL. The release periods follow after the reference 
periods by two months or more, sometimes by several years. I want to build a 
second data frame, called df2 in my example, with the month-to-month growth 
rates that existed in each reference period, revealing the revisions to those 
growth rates in subsequent periods.


REF1 <- c("2017-01-01","2017-01-01","2017-01-01","2017-01-01","2017-01-01",
 "2017-02-01","2017-02-01","2017-02-01","2017-02-01","2017-02-01",
 "2017-03-01","2017-03-01","2017-03-01","2017-03-01","2017-03-01")
REL1 <- c("2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
 "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
 "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01")
VAL1 <- c(17974,14567,13425,NA,12900,17974,14000,14000,12999,13245,17197,11500,
          19900,18765,13467)
df1 <- data.frame(REF1,REL1,VAL1)
REF2 <- c("2017-02-01","2017-02-01","2017-02-01","2017-02-01","2017-02-01",
 "2017-03-01","2017-03-01","2017-03-01","2017-03-01","2017-03-01")
REL2 <- c("2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
 "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01")
VAL2 <- c(0.0,-3.9,4.3,NA,2.3,-4.3,-17.9,42.1,44.4,1.7)
df2 <- data.frame(REF2,REL2,VAL2)

In my example I have provided some sample data pertaining to three reference 
months, 2017-01-01 through 2017-03-01, and five release periods, 
"2020-09-01","2020-08-01","2020-07-01","2020-06-01" and "2019-05-01". In my 
actual problem I have millions of REF-REL combinations, so my data frame is 
quite large. I am using data.table for faster processing, though I am more 
familiar with the tidyverse. I am providing df2 as the target data frame for 
my example, so you can see what I am trying to achieve.


I have not been able to find an efficient way to do these calculations. I 
have tried "for" loops with "if" statements, without success so far, and 
anyway this approach would be too slow, I fear. Suggestions as to how I might 
proceed would be much appreciated.


Philip




---
Jeff NewmillerThe .   .  Go Live...
DCN:Basics: ##.#.   ##.#.  Live Go...
  Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
/Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k



[R] Data transformation problem

2020-11-11 Thread phil
I am stuck on a data transformation problem. I have a data frame, df1 in 
my example, with some original "levels" data. The data pertain to some 
variable, such as GDP, in various reference periods, REF, as estimated 
and released in various release periods, REL. The release periods follow 
after the reference periods by two months or more, sometimes by several 
years. I want to build a second data frame, called df2 in my example, 
with the month-to-month growth rates that existed in each reference 
period, revealing the revisions to those growth rates in subsequent 
periods.


REF1 <- c("2017-01-01","2017-01-01","2017-01-01","2017-01-01","2017-01-01",
          "2017-02-01","2017-02-01","2017-02-01","2017-02-01","2017-02-01",
          "2017-03-01","2017-03-01","2017-03-01","2017-03-01","2017-03-01")
REL1 <- c("2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
          "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
          "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01")
VAL1 <- c(17974,14567,13425,NA,12900,17974,14000,14000,12999,13245,17197,11500,
          19900,18765,13467)
df1 <- data.frame(REF1,REL1,VAL1)
REF2 <- c("2017-02-01","2017-02-01","2017-02-01","2017-02-01","2017-02-01",
          "2017-03-01","2017-03-01","2017-03-01","2017-03-01","2017-03-01")
REL2 <- c("2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
          "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01")
VAL2 <- c(0.0,-3.9,4.3,NA,2.3,-4.3,-17.9,42.1,44.4,1.7)
df2 <- data.frame(REF2,REL2,VAL2)

In my example I have provided some sample data pertaining to three 
reference months, 2017-01-01 through 2017-03-01, and five release 
periods, "2020-09-01","2020-08-01","2020-07-01","2020-06-01" and 
"2019-05-01". In my actual problem I have millions of REF-REL 
combinations, so my data frame is quite large. I am using data.table for 
faster processing, though I am more familiar with the tidyverse. I am 
providing df2 as the target data frame for my example, so you can see 
what I am trying to achieve.


I have not been able to find an efficient way to do these calculations. 
I have tried "for" loops with "if" statements, without success so far, 
and anyway this approach would be too slow, I fear. Suggestions as to 
how I might proceed would be much appreciated.


Philip
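One possible base-R approach, sketched on a cut-down version of the posted data: split by release period, order by reference month, then take month-over-month growth. (Missing months would need padding first, and at millions of REF-REL combinations a data.table grouping would likely still be preferable; this only illustrates the shape of the computation.)

```r
# Cut-down version of the posted df1: two releases, three reference months.
df1 <- data.frame(
  REF1 = as.Date(rep(sprintf("2017-%02d-01", 1:3), each = 2)),
  REL1 = as.Date(rep(c("2020-09-01", "2020-08-01"), times = 3)),
  VAL1 = c(17974, 14567, 17974, 14000, 17197, 11500)
)

# Per release period: order by reference month, compute % growth.
growth <- do.call(rbind, lapply(split(df1, df1$REL1), function(d) {
  d <- d[order(d$REF1), ]
  data.frame(REF2 = d$REF1[-1],
             REL2 = d$REL1[-1],
             VAL2 = 100 * diff(d$VAL1) / head(d$VAL1, -1))
}))
round(growth$VAL2, 1)  # -3.9 -17.9 0.0 -4.3, matching the revisions in the posted df2
```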



Re: [R] Data Table not rendering properly using R shiny

2020-11-07 Thread Rui Barradas

Hello,

Or maybe


logical_idx <- max_usage_hours_per_region$Region %in% input$Region


Another option is ?match


Hope this helps,

Rui Barradas


At 15:41 on 07/11/20, Jeff Newmiller wrote:

This looks odd...

max_usage_hours_per_region[input$Region,]

This would only work if you had rownames on that data frame corresponding to 
the names of the Regions. This is a common R mistake... you probably need

logical_idx <- max_usage_hours_per_region$Region == input$Region
max_usage_hours_per_region[  logical_idx,]

That said, it is very difficult to separate out R questions when mixed into 
shiny code, so you would help yourself and this list to work on minimal 
reproducible examples that focus on the R syntax if possible for posts here. 
Read the Posting Guide.

On November 7, 2020 2:42:58 AM PST, Ritwik Mohapatra  wrote:

Hi All,

I have a data output as below. I want to display it in an interactive HTML
report using shiny, but the data table is not rendering properly and
instead gives NA values.

max_usage_hours_per_region<-setNames(aggregate(df3_machine_region$sum_as_hours~df3_machine_region$Region,df3_machine_region,max),c("Region","Sum_as_Hours"))

Region Sum_as_Hours
1 Africa 1156.0833
2 Americas 740.1667
3 APAC 740.2833
4 Europe 1895.2000
5 PDO 1053.3500
6 UK 0.


Rshiny code:

library(shiny)

ui <- fluidPage(
selectInput("Region","Select
Region",max_usage_hours_per_region$Region,selected = TRUE),
tableOutput("table")
)
server <- function(input, output) {
output$table <- renderTable(
max_usage_hours_per_region[input$Region,])
}
shinyApp(ui = ui, server = server)







Re: [R] Data Table not rendering properly using R shiny

2020-11-07 Thread Jeff Newmiller
This looks odd...

max_usage_hours_per_region[input$Region,]

This would only work if you had rownames on that data frame corresponding to 
the names of the Regions. This is a common R mistake... you probably need

logical_idx <- max_usage_hours_per_region$Region == input$Region
max_usage_hours_per_region[  logical_idx,]

That said, it is very difficult to separate out R questions when mixed into 
shiny code, so you would help yourself and this list to work on minimal 
reproducible examples that focus on the R syntax if possible for posts here. 
Read the Posting Guide.
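The mistake can be shown without shiny at all; a hypothetical two-row version of the aggregated table:

```r
d <- data.frame(Region = c("Africa", "APAC"),
                Sum_as_Hours = c(1156.0833, 740.2833))

d["Africa", ]              # rowname lookup: rownames are "1", "2", so this
                           # returns a row of NAs -- what the app displayed
d[d$Region == "Africa", ]  # logical index on the Region column: correct row
```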

On November 7, 2020 2:42:58 AM PST, Ritwik Mohapatra  wrote:
>Hi All,
>
>I have a data output as below. I want to display it in an interactive HTML
>report using shiny, but the data table is not rendering properly and
>instead gives NA values.
>
>max_usage_hours_per_region<-setNames(aggregate(df3_machine_region$sum_as_hours~df3_machine_region$Region,df3_machine_region,max),c("Region","Sum_as_Hours"))
>
>Region Sum_as_Hours
>1 Africa 1156.0833
>2 Americas 740.1667
>3 APAC 740.2833
>4 Europe 1895.2000
>5 PDO 1053.3500
>6 UK 0.
>
>
>Rshiny code:
>
>library(shiny)
>
>ui <- fluidPage(
>selectInput("Region","Select
>Region",max_usage_hours_per_region$Region,selected = TRUE),
>tableOutput("table")
>)
>server <- function(input, output) {
>output$table <- renderTable(
>max_usage_hours_per_region[input$Region,])
>}
>shinyApp(ui = ui, server = server)
>

-- 
Sent from my phone. Please excuse my brevity.



Re: [R] Data Table not rendering properly using R shiny

2020-11-07 Thread Marc Schwartz via R-help
Hi,

Please drop R-Devel as a cc: from this thread for further replies.

This topic is definitely not relevant there and cross-posting is not needed, 
but does require manual moderation.

Thanks,

Marc Schwartz

> On Nov 7, 2020, at 10:23 AM, Bert Gunter  wrote:
> 
> Better to post on RStudio support, I think. Shiny is an RStudio package
> and product, and this list is for R language/programming help. The two are
> separate.
> 
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> 
>> On Sat, Nov 7, 2020 at 2:43 AM Ritwik Mohapatra  wrote:
>> 
>> Hi All,
>> 
>> I have a data output as below.I want to display them in an interactive html
>> report using shiny but the data table is not rendering properly and instead
>> giving NA values.
>> 
>> 
>> max_usage_hours_per_region<-setNames(aggregate(df3_machine_region$sum_as_hours~df3_machine_region$Region,df3_machine_region,max),c("Region","Sum_as_Hours"))
>> 
>> Region Sum_as_Hours
>> 1 Africa 1156.0833
>> 2 Americas 740.1667
>> 3 APAC 740.2833
>> 4 Europe 1895.2000
>> 5 PDO 1053.3500
>> 6 UK 0.
>> 
>> 
>> Rshiny code:
>> 
>> library(shiny)
>> 
>> ui <- fluidPage(
>> selectInput("Region","Select
>> Region",max_usage_hours_per_region$Region,selected = TRUE),
>> tableOutput("table")
>> )
>> server <- function(input, output) {
>> output$table <- renderTable(
>> max_usage_hours_per_region[input$Region,])
>> }
>> shinyApp(ui = ui, server = server)
>> 


Re: [R] Data Table not rendering properly using R shiny

2020-11-07 Thread Bert Gunter
Better to post on RStudio support, I think. Shiny is an RStudio package
and product, and this list is for R language/programming help. The two are
separate.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sat, Nov 7, 2020 at 2:43 AM Ritwik Mohapatra  wrote:

> Hi All,
>
> I have a data output as below. I want to display it in an interactive HTML
> report using Shiny, but the data table is not rendering properly and instead
> gives NA values.
>
>
> max_usage_hours_per_region<-setNames(aggregate(df3_machine_region$sum_as_hours~df3_machine_region$Region,df3_machine_region,max),c("Region","Sum_as_Hours"))
>
> Region Sum_as_Hours
> 1 Africa 1156.0833
> 2 Americas 740.1667
> 3 APAC 740.2833
> 4 Europe 1895.2000
> 5 PDO 1053.3500
> 6 UK 0.
>
>
> Rshiny code:
>
> library(shiny)
>
> ui <- fluidPage(
> selectInput("Region","Select
> Region",max_usage_hours_per_region$Region,selected = TRUE),
> tableOutput("table")
> )
> server <- function(input, output) {
> output$table <- renderTable(
> max_usage_hours_per_region[input$Region,])
> }
> shinyApp(ui = ui, server = server)
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



[R] Data Table not rendering properly using R shiny

2020-11-07 Thread Ritwik Mohapatra
Hi All,

I have a data output as below. I want to display it in an interactive HTML
report using Shiny, but the data table is not rendering properly and instead
gives NA values.

max_usage_hours_per_region<-setNames(aggregate(df3_machine_region$sum_as_hours~df3_machine_region$Region,df3_machine_region,max),c("Region","Sum_as_Hours"))

Region Sum_as_Hours
1 Africa 1156.0833
2 Americas 740.1667
3 APAC 740.2833
4 Europe 1895.2000
5 PDO 1053.3500
6 UK 0.


Rshiny code:

library(shiny)

ui <- fluidPage(
selectInput("Region","Select
Region",max_usage_hours_per_region$Region,selected = TRUE),
tableOutput("table")
)
server <- function(input, output) {
output$table <- renderTable(
max_usage_hours_per_region[input$Region,])
}
shinyApp(ui = ui, server = server)

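The NA rows in the thread above have a likely base-R explanation: `max_usage_hours_per_region[input$Region, ]` indexes rows by *row name* when given a character value, and a data frame built by aggregate()/setNames() has row names "1".."6", not the region labels, so the lookup fails. A minimal sketch (values copied from the post; the corrected subset is a suggestion, not tested against the original Shiny app):

```r
# Demo data mirroring the post (three of the six regions are enough).
max_usage_hours_per_region <- data.frame(
  Region = c("Africa", "Americas", "APAC"),
  Sum_as_Hours = c(1156.0833, 740.1667, 740.2833)
)

# What renderTable() effectively did: row names are "1","2","3", so a
# character lookup like "Africa" finds no row and returns an all-NA row.
bad <- max_usage_hours_per_region["Africa", ]
all(is.na(bad$Sum_as_Hours))  # TRUE

# Fix: subset on the Region column instead of the row index.
good <- max_usage_hours_per_region[max_usage_hours_per_region$Region == "Africa", ]
good$Sum_as_Hours  # 1156.0833
```

Inside the server function that would read `max_usage_hours_per_region[max_usage_hours_per_region$Region == input$Region, ]`.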


[R] [R-pkgs] xmlconvert: A package for converting XML data to R data frames and vice versa

2020-11-05 Thread joachim
Hello everyone,



Driven by the need to work with XML data from medical systems that use
object-oriented databases I have developed the 'xmlconvert' package. With
its easy-to-use functions xml_to_df() and df_to_xml() it allows to convert
data from XML to R data frames and vice versa. A variety of arguments gives
you control over the specifics of the conversion process.



The package is available on CRAN (visit
https://CRAN.R-project.org/package=xmlconvert for more details). Install it
by executing

> install.packages("xmlconvert", dependencies = TRUE)

in the R console.



You will find more information on GitHub
(https://github.com/jsugarelli/xmlconvert). The GitHub README provides an
intro how to use the package and how to adjust for different ways in which
the data can be represented in the XML documents.



Best,



Joachim





Joachim L. Zuckarelli

E-mail: joac...@zuckarelli.de

Website: http://www.zuckarelli.de

Twitter: https://twitter.com/jsugarelli







___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages



Re: [R] data error problem

2020-10-05 Thread Jim Lemon
Hi Mir,
Without knowing what the data looks like, this is only a guess.
read.table() expects a white space delimiter and if you have a space
in one of your column names it will consider it as two names instead
of one. How many columns do you expect?

Jim

On Mon, Oct 5, 2020 at 6:14 PM Mohammad Tanvir Ahamed via R-help
 wrote:
>
> Hi, in your data file the first row does not have the same number of columns
> as the rest of the rows. Check your data file, especially the first row.
> Regards, Tanvir Ahamed, Stockholm, Sweden |
> mashra...@yahoo.com
>
> On Monday, 5 October 2020, 08:11:48 am GMT+2, Mir Md. Abdus Salam 
>  wrote:
>
>  Dear all,
>
> I need urgent help. I am a new user of R. I got the following error
>
> anovamine<-read.table("spike cu.txt",header=TRUE)
>
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
> line 1 did not have 9 elements
>
> Can anybody please help me solve this problem and explain why I am getting
> this kind of error?
>
> Thanks
>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

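Jim's guess about whitespace in a header can be checked directly: count.fields() reports how many fields read.table() sees on each line, so a header like "spike cu" (with a space) shows up as one field too many on line 1. A small self-contained sketch (the column names are made up for illustration):

```r
# A header containing a space ("spike cu") splits into two fields,
# so line 1 has 4 fields while the data lines have 3.
txt <- "spike cu X1 X2
a 1 2
b 3 4"

tc <- textConnection(txt)
count.fields(tc)   # 4 3 3  -> line 1 has one field too many
close(tc)

# One workaround: skip the broken header and supply column names yourself.
# (Quoting the header in the file, or using a tab delimiter, also works.)
tc <- textConnection(txt)
d <- read.table(tc, skip = 1, col.names = c("spike_cu", "X1", "X2"))
close(tc)
d
```

Running count.fields() on the real "spike cu.txt" (with the appropriate `sep`) should point at exactly which line disagrees with the rest.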


Re: [R] data error problem

2020-10-05 Thread Mohammad Tanvir Ahamed via R-help
Hi, in your data file the first row does not have the same number of columns
as the rest of the rows. Check your data file, especially the first row.
Regards, Tanvir Ahamed, Stockholm, Sweden |
mashra...@yahoo.com

On Monday, 5 October 2020, 08:11:48 am GMT+2, Mir Md. Abdus Salam 
 wrote:  
 
 Dear all,

I need urgent help. I am a new user of R. I got the following error

anovamine<-read.table("spike cu.txt",header=TRUE)

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
line 1 did not have 9 elements

Can anybody please help me solve this problem and explain why I am getting
this kind of error?

Thanks




[R] data error problem

2020-10-05 Thread Mir Md. Abdus Salam
Dear all,

I need urgent help. I am a new user of R. I got the following error

anovamine<-read.table("spike cu.txt",header=TRUE)

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
line 1 did not have 9 elements

Can anybody please help me solve this problem and explain why I am getting
this kind of error?

Thanks




Re: [R] Data With Ordinal Responses: Calculate ICC & Assessing Model Fit

2020-08-17 Thread Bert Gunter
I believe you should post on r-sig-mixed-models, not here. You are more
likely to find the interest and expertise you seek there.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Aug 17, 2020 at 3:28 AM Sidoti, Salvatore A. <
sidoti...@buckeyemail.osu.edu> wrote:

> To begin with, I'm not a fan of cross-posting. However, I posted my
> question on Stack Exchange more than two weeks ago, but I have yet to
> receive a sufficient answer:
>
>
> https://stats.stackexchange.com/questions/479600/data-with-ordinal-responses-calculate-icc-assessing-model-fit
>
> Here's what I've learned since then (hopefully):
>
> 1) ICC of a CLMM:
> Computed like this:
> (variance of the random effect) / (variance of the random effect + 1)
> If this is correct, I would love to see a reference/citation for it.
>
> 2) 95% Confidence Interval for the ICC from a CLMM Model
> To my current understanding, a confidence interval for an ICC is only
> obtainable via simulation. I've conducted simulations with GLMM model
> objects ('lme4' package) and the bootMer() function. Unfortunately,
> bootMer() will not accept a CLMM model ('ordinal' package).
>
> 3) Model Fit of a CLMM
> Assuming that the model converges without incident, the model summary
> includes a condition number of the Hessian ('cond.H'). This value should be
> below 10^4 for a "good fit". This is straightforward enough. However, I am
> not as sure about the value for 'max.grad', which needs to be "well below
> 1". The question is, to what magnitude should max.grad < 1 for a decent
> model fit? My reference is linked below (Christensen, 2019), but it does
> not elaborate further on this point:
>
>
> https://documentcloud.adobe.com/link/track?uri=urn:aaid:scds:US:b6a61fe2-b851-49ce-b8b1-cd760d290636
>
> 4) Effect Size of a CLMM
> The random variable's effect is determined by a comparison between the
> full model to a model with only the fixed effects via the anova() function.
> I found this information on the 'rcompanion' package website:
>
> https://rcompanion.org/handbook/G_12.html
>
> The output of this particular anova() will include a value named
> 'LR.stat', the likelihood ratio statistic. The LR.stat is twice the
> difference of each log-likelihood (absolute value) of the respective
> models. Is LR.stat the mixed-model version of an "effect size"? If so, how
> does one determine if the effect is small, large, in-between, etc?
>
> Cheers,
> Sal
>
> Salvatore A. Sidoti
> PhD Candidate
> Behavioral Ecology
> The Ohio State University
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



[R] Data With Ordinal Responses: Calculate ICC & Assessing Model Fit

2020-08-17 Thread Sidoti, Salvatore A.
To begin with, I'm not a fan of cross-posting. However, I posted my question on 
Stack Exchange more than two weeks ago, but I have yet to receive a sufficient 
answer:

https://stats.stackexchange.com/questions/479600/data-with-ordinal-responses-calculate-icc-assessing-model-fit
 
Here's what I've learned since then (hopefully):
 
1) ICC of a CLMM:
Computed like this:
(variance of the random effect) / (variance of the random effect + 1)
If this is correct, I would love to see a reference/citation for it.
 
2) 95% Confidence Interval for the ICC from a CLMM Model
To my current understanding, a confidence interval for an ICC is only 
obtainable via simulation. I've conducted simulations with GLMM model objects 
('lme4' package) and the bootMer() function. Unfortunately, bootMer() will not 
accept a CLMM model ('ordinal' package).
 
3) Model Fit of a CLMM
Assuming that the model converges without incident, the model summary includes 
a condition number of the Hessian ('cond.H'). This value should be below 10^4 
for a "good fit". This is straightforward enough. However, I am not as sure 
about the value for 'max.grad', which needs to be "well below 1". The question 
is, to what magnitude should max.grad < 1 for a decent model fit? My reference 
is linked below (Christensen, 2019), but it does not elaborate further on this 
point:
 
https://documentcloud.adobe.com/link/track?uri=urn:aaid:scds:US:b6a61fe2-b851-49ce-b8b1-cd760d290636
 
4) Effect Size of a CLMM
The random variable's effect is determined by a comparison between the full 
model to a model with only the fixed effects via the anova() function. I found 
this information on the 'rcompanion' package website:
 
https://rcompanion.org/handbook/G_12.html
 
The output of this particular anova() will include a value named 'LR.stat', the 
likelihood ratio statistic. The LR.stat is twice the difference of each 
log-likelihood (absolute value) of the respective models. Is LR.stat the 
mixed-model version of an "effect size"? If so, how does one determine if the 
effect is small, large, in-between, etc?

Cheers,
Sal

Salvatore A. Sidoti
PhD Candidate
Behavioral Ecology
The Ohio State University

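On point 1 above: the "(variance of the random effect + 1)" denominator corresponds to a *probit* link, where the latent residual variance is fixed at 1. For the logit link (the default in ordinal::clmm) the commonly cited latent-variable formula uses pi^2/3 ≈ 3.29 instead (see e.g. Snijders & Bosker's multilevel-analysis text for the binary/ordinal latent-scale ICC). A hedged arithmetic sketch, with a made-up random-effect variance standing in for the value a model summary would report:

```r
# 'sigma2_u' is a hypothetical random-effect variance (as reported by
# VarCorr on a fitted model); the two ICCs differ only in the assumed
# latent residual variance: 1 for probit, pi^2/3 for logit.
sigma2_u <- 0.8

icc_probit <- sigma2_u / (sigma2_u + 1)
icc_logit  <- sigma2_u / (sigma2_u + pi^2 / 3)

round(c(probit = icc_probit, logit = icc_logit), 3)
```

So which denominator is right depends on the link function of the fitted CLMM; the "+ 1" version understates the residual variance under a logit link.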


Re: [R] Data frames intersections

2020-04-22 Thread Jim Lemon
Hi Julie,
Your task is a bit obscure and I don't have the function
"st_intersects", but I'll suggest this:

br_list<-list()
# your commands would have only run once
for (i in 1:nrow(arrets_buffer)) {
  br_list[[i]]<- st_intersects(x = batiments, arrets_buffer[i,], sparse = FALSE)
}

You should get a list with nrow(arrets_buffer) elements, each of which
will be the result of your "st_intersects" function.

Jim

On Wed, Apr 22, 2020 at 6:00 PM Julie Poitevin  wrote:
>
> Hello,
> > I want to build a map (bus accessibility map) and for that I need to 
> > identify some polygons intersections. To do that I have 2 data.frame: 
> > batiments (that gives buildings in a city) and arrets_buffer (that gives 
> > bus stops (points) with a buffer around the point).
> >
> > I want to have a column giving the intersect binary result (TRUE or FALSE) 
> > in batiments data.frame. I use st_intersects from sf package.
> >
> > For that I wanted to perform this loop, but it's not a good idea:
> >
> > for (i in nrow(arrets_buffer)) {
> >   batiments$in_recharges <- st_intersects(x = batiments, arrets_buffer[i,], 
> > sparse = FALSE)
> > }
> >
> > each time i is incremented, batiments$in_recharges is removed with new 
> > values. So at the end I have results for i=nrow(arrets_buffer) only
> >
> > No possibility to add a nested loop, as the number of lines is quite
> > large (the loop would run for a long time).
> >
> > Do you have any ideas to help me?
> >
> > Many thanks for your help
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



[R] Data frames intersections

2020-04-22 Thread Julie Poitevin
Hello, 
> I want to build a map (bus accessibility map) and for that I need to identify 
> some polygons intersections. To do that I have 2 data.frame: batiments (that 
> gives buildings in a city) and arrets_buffer (that gives bus stops (points) 
> with a buffer around the point).
> 
> I want to have a column giving the intersect binary result (TRUE or FALSE) in 
> batiments data.frame. I use st_intersects from sf package.
> 
> For that I wanted to perform this loop, but it's not a good idea:
> 
> for (i in nrow(arrets_buffer)) {
>   batiments$in_recharges <- st_intersects(x = batiments, arrets_buffer[i,], 
> sparse = FALSE)
> }
> 
> each time i is incremented, batiments$in_recharges is removed with new 
> values. So at the end I have results for i=nrow(arrets_buffer) only
> 
> No possibility to add a nested loop, as the number of lines is quite
> large (the loop would run for a long time).
> 
> Do you have any ideas to help me?
> 
> Many thanks for your help

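Two separate problems hide in the original loop: `for (i in nrow(arrets_buffer))` iterates over the single number nrow(...), so the body runs exactly once, and even with `1:nrow(...)` the assigned column is overwritten on every pass, keeping only the last buffer. A base-R sketch with the sf objects replaced by a plain logical matrix (rows = buildings, columns = bus-stop buffers), so it runs without the sf package:

```r
# 3 "buildings" x 2 "buffers"; TRUE means the building intersects the buffer.
m <- matrix(c(TRUE, FALSE, FALSE,
              FALSE, TRUE,  FALSE), nrow = 3)

# Mimics the original bug: the result is overwritten on every pass,
# so only the last buffer's column survives.
res <- logical(3)
for (i in 1:ncol(m)) res <- m[, i]
identical(res, m[, 2])   # TRUE: earlier buffers were lost

# Fix: combine across buffers instead of overwriting, e.g.
# "building intersects at least one buffer":
in_any <- apply(m, 1, any)
in_any                   # TRUE TRUE FALSE
```

With sf itself, a single call like `st_intersects(batiments, arrets_buffer, sparse = FALSE)` should already return the whole building-by-buffer matrix (per sf's documented behavior), so `apply(..., 1, any)` over that result avoids the loop entirely.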


Re: [R] Data Carpentry - Creating a New SQLite Database

2020-01-10 Thread William Michels via R-help
Hi Phillip,

Skipping to the last few lines of your email, did you download a
program to look at Sqlite databases (independent of R) as listed
below? Maybe that program ("DB Browser for SQLite") and/or the
instructions below can help you locate your database directory:

https://datacarpentry.org/semester-biology/computer-setup/
https://datacarpentry.org/semester-biology/materials/sql-for-dplyr-users/

If you do have that program, and you're still seeing an error, you
might consider looking for similar issues at the appropriate
'datacarpentry' repository on Github (or posting a new issue
yourself):

https://github.com/datacarpentry/R-ecology-lesson/issues

Finally, I really feel you'll benefit from reading over the documents
pertaining to "R Data Import/Export" on the www.r-project.org website.
No disrespect to the people at 'datacarpentry', but you'll find
similar (and possibly, easier) R code to follow at section 4.3.1
'Packages using DBI' :

https://cran.r-project.org/doc/manuals/r-release/R-data.html

HTH, Bill.

W. Michels, Ph.D.




On Fri, Jan 10, 2020 at 10:32 AM Phillip Heinrich  wrote:
>
> Working my way through a tutorial named Data Carpentry 
> (https://datacarpentry.org/R-ecology-lesson/).  for the most part it is 
> excellent but I’m stuck on the very last section 
> (https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html).
>
> First, below are the packages I have loaded:
> [1] "forcats"   "stringr"   "purrr" "readr" "tidyr" "tibble"
> "ggplot2"   "tidyverse" "dbplyr""RMySQL""DBI"
> [12] "dplyr" "RSQLite"   "stats" "graphics"  "grDevices" "utils" 
> "datasets"  "methods"   "base"
>
>
> >
>
>
> Second,
> Second, is the text of the last section of the last chapter titled “Creating 
> a New SQLite Database”.
> Second, below is the text from the tutorial.  The black type is from the 
> tutorial.  The green and blue is the suggested R code.  My comments are in 
> red.
> Creating a new SQLite database
> So far, we have used a previously prepared SQLite database. But we can also 
> use R to create a new database, e.g. from existing csv files. Let’s recreate 
> the mammals database that we’ve been working with, in R. First let’s download 
> and read in the csv files. We’ll import tidyverse to gain access to the 
> read_csv() function.
>
> download.file("https://ndownloader.figshare.com/files/3299483",
>   "data_raw/species.csv")
> download.file("https://ndownloader.figshare.com/files/10717177",
>   "data_raw/surveys.csv")
> download.file("https://ndownloader.figshare.com/files/3299474",
>   "data_raw/plots.csv")
> library(tidyverse)
> species <- read_csv("data_raw/species.csv")No problem here.  I’m pulling 
> three databases from the Web and saving them to a folder on my hard drive. 
> (...data_raw/species.csv) etc.surveys <- read_csv("data_raw/surveys.csv") 
> plots <- read_csv("data_raw/plots.csv")Again no problem.  I’m just creating 
> an R data files.  But here is where I loose it.  I’m creating something named 
> my_db_file from another file named portal-database-output with an sqlite 
> extension and then creating my_db from the My_db_file.  Not sure where the 
> sqlite extension file came from. Creating a new SQLite database with dplyr is 
> easy. You can re-use the same command we used above to open an existing 
> .sqlite file. The create = TRUE argument instructs R to create a new, empty 
> database instead.
>
> Caution: When create = TRUE is added, any existing database at the same 
> location is overwritten without warning.
>
> my_db_file <- "data/portal-database-output.sqlite"
> my_db <- src_sqlite(my_db_file, create = TRUE)Currently, our new database is 
> empty, it doesn’t contain any tables:
>
> my_db#> src:  sqlite 3.29.0 [data/portal-database-output.sqlite]
> #> tbls:To add tables, we copy the existing data.frames into the database one 
> by one:
>
> copy_to(my_db, surveys)
> copy_to(my_db, plots)
> my_dbI can follow the directions to fill in my_db but I have no idea how to 
> access the tables.  The text from the tutorial below says to check the 
> location of our database.  Huh!  Can someone give me some direction.  Thanks.
>
>
>
>
>
> If you check the location of our database you’ll see that data is 
> automatically being written to disk. R and dplyr not only provide easy ways 
> to query existing databases, they also allows you to easily create your o

Re: [R] Data Carpentry - Creating a New SQLite Database

2020-01-10 Thread Bert Gunter
Please note that tidyverse packages have their own support resources at
RStudio, whence they came; e.g. here:
https://education.rstudio.com/learn/beginner/
You may also do better asking about issues that concern them at their
support site:  https://support.rstudio.com/hc/en-us
 though, as you already found out, there are folks here who may help also.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Jan 10, 2020 at 10:32 AM Phillip Heinrich  wrote:

> Working my way through a tutorial named Data Carpentry (
> https://datacarpentry.org/R-ecology-lesson/).  for the most part it is
> excellent but I’m stuck on the very last section (
> https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html).
>
> First, below are the packages I have loaded:
> [1] "forcats"   "stringr"   "purrr" "readr" "tidyr" "tibble"
>   "ggplot2"   "tidyverse" "dbplyr""RMySQL""DBI"
> [12] "dplyr" "RSQLite"   "stats" "graphics"  "grDevices" "utils"
>"datasets"  "methods"   "base"
>
>
> >
>
>
> Second,
> Second, is the text of the last section of the last chapter titled
> “Creating a New SQLite Database”.
> Second, below is the text from the tutorial.  The black type is from the
> tutorial.  The green and blue is the suggested R code.  My comments are in
> red.
> Creating a new SQLite database
> So far, we have used a previously prepared SQLite database. But we can
> also use R to create a new database, e.g. from existing csv files. Let’s
> recreate the mammals database that we’ve been working with, in R. First
> let’s download and read in the csv files. We’ll import tidyverse to gain
> access to the read_csv() function.
>
> download.file("https://ndownloader.figshare.com/files/3299483",
>   "data_raw/species.csv")
> download.file("https://ndownloader.figshare.com/files/10717177",
>   "data_raw/surveys.csv")
> download.file("https://ndownloader.figshare.com/files/3299474",
>   "data_raw/plots.csv")
> library(tidyverse)
> species <- read_csv("data_raw/species.csv")No problem here.  I’m pulling
> three databases from the Web and saving them to a folder on my hard drive.
> (...data_raw/species.csv) etc.surveys <- read_csv("data_raw/surveys.csv")
> plots <- read_csv("data_raw/plots.csv")Again no problem.  I’m just creating
> an R data files.  But here is where I loose it.  I’m creating something
> named my_db_file from another file named portal-database-output with an
> sqlite extension and then creating my_db from the My_db_file.  Not sure
> where the sqlite extension file came from. Creating a new SQLite database
> with dplyr is easy. You can re-use the same command we used above to open
> an existing .sqlite file. The create = TRUE argument instructs R to create
> a new, empty database instead.
>
> Caution: When create = TRUE is added, any existing database at the same
> location is overwritten without warning.
>
> my_db_file <- "data/portal-database-output.sqlite"
> my_db <- src_sqlite(my_db_file, create = TRUE)Currently, our new database
> is empty, it doesn’t contain any tables:
>
> my_db#> src:  sqlite 3.29.0 [data/portal-database-output.sqlite]
> #> tbls:To add tables, we copy the existing data.frames into the database
> one by one:
>
> copy_to(my_db, surveys)
> copy_to(my_db, plots)
> my_dbI can follow the directions to fill in my_db but I have no idea how
> to access the tables.  The text from the tutorial below says to check the
> location of our database.  Huh!  Can someone give me some direction.
> Thanks.
>
>
>
>
>
> If you check the location of our database you’ll see that data is
> automatically being written to disk. R and dplyr not only provide easy ways
> to query existing databases, they also allows you to easily create your own
> databases from flat files!
>
>
>
> Here is where I loose it.
>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



Re: [R] Data Carpentry - Creating a New SQLite Database

2020-01-10 Thread Ivan Krylov
On Fri, 10 Jan 2020 11:31:58 -0700
"Phillip Heinrich"  wrote:

> below is the text from the tutorial.  The black type is from the
> tutorial.  The green and blue is the suggested R code.  My comments
> are in red

R-help is a plain text mailing list, so the markup has been stripped
off (and since HTML-enabled mail clients don't quite care how the plain
text version of the e-mail looks, some paragraph breaks had to go, too).

> etc.surveys <- read_csv("data_raw/surveys.csv")
> plots <- read_csv("data_raw/plots.csv")

> Again no problem.  I’m just creating an R data files.

Note that it is not files that you are creating by running read_csv(),
but variables (of type "tibble", which is like "data.frame", either of
which should have been covered in earlier chapters in a good tutorial)
in the R environment. The files you downloaded previously are opened
in read only mode and are never changed.

> my_db_file <- "data/portal-database-output.sqlite"

> I’m creating something named my_db_file from another file named
> portal-database-output with an sqlite extension and then creating
> my_db from the My_db_file.

This something is just a text string that happens to contain a *path*
to a file. Just like the variable `greeting` in the following snippet:

greeting <- "Hello world"
print(greeting)

See [1] for more info on character vectors in R.

> Not sure where the sqlite extension file came from.

The authors of the tutorial decided that the file to be created should
be named like this. Feel free to change the extension (or the path) to
anything else: neither R, nor SQLite cares about it much (but the file
manager you use may display a different icon for it or become confused
if you name it .txt or .pdf).

> I can follow the directions to fill in my_db but I have no idea
> how to access the tables.

What exactly do you mean by "access"? At this point my_db should be a
dplyr "src" object, so the tools described in dplyr vignettes [2] should
be applicable. Try calling tbl() on it and passing the name of one of
the tables you have just created. Also try running:

example("src_sqlite")

> The text from the tutorial below says to check the location of our
> database.  Huh!  Can someone give me some direction.

The variable my_db_file contains the location of the file where the
database is stored. This is the same variable that you passed to the
src_sqlite() function.

-- 
Best regards,
Ivan

[1]
https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Character-vectors

[2]
https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
https://cran.r-project.org/web/packages/dplyr/vignettes/window-functions.html

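Complementing Ivan's tbl() suggestion, the DBI route William points to (section 4.3.1 of "R Data Import/Export") can be sketched as follows. This is a hedged alternative to the tutorial's src_sqlite() workflow, not the tutorial's own code; it uses an in-memory database and a tiny made-up `surveys` frame so it is self-contained, and it is guarded so it only runs if DBI and RSQLite are installed:

```r
# Creating an SQLite database and reading a table back with DBI/RSQLite.
# ":memory:" keeps the demo self-contained; use a path such as
# "data/portal-database-output.sqlite" to write a file on disk.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

  surveys <- data.frame(record_id = 1:3, species_id = c("NL", "DM", "DM"))
  DBI::dbWriteTable(con, "surveys", surveys)   # creates the table

  print(DBI::dbListTables(con))                # shows "surveys"
  print(DBI::dbGetQuery(con, "SELECT * FROM surveys"))  # rows back out

  DBI::dbDisconnect(con)
}
```

dbListTables() answers the "how do I access the tables" question directly, and dbGetQuery() (or dplyr::tbl() on the connection) retrieves their contents.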


[R] Data Carpentry - Creating a New SQLite Database

2020-01-10 Thread Phillip Heinrich
Working my way through a tutorial named Data Carpentry 
(https://datacarpentry.org/R-ecology-lesson/). For the most part it is 
excellent, but I'm stuck on the very last section 
(https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html).

First, below are the packages I have loaded:
[1] "forcats"   "stringr"   "purrr" "readr" "tidyr" "tibble"
"ggplot2"   "tidyverse" "dbplyr""RMySQL""DBI"  
[12] "dplyr" "RSQLite"   "stats" "graphics"  "grDevices" "utils" 
"datasets"  "methods"   "base" 
 
 
Second, below is the text of the last section of the last chapter, titled 
"Creating a New SQLite Database". The black type is from the tutorial, the 
green and blue is the suggested R code, and my comments are in red.
Creating a new SQLite database
So far, we have used a previously prepared SQLite database. But we can also use 
R to create a new database, e.g. from existing csv files. Let’s recreate the 
mammals database that we’ve been working with, in R. First let’s download and 
read in the csv files. We’ll import tidyverse to gain access to the read_csv() 
function.

download.file("https://ndownloader.figshare.com/files/3299483;,
  "data_raw/species.csv")
download.file("https://ndownloader.figshare.com/files/10717177;,
  "data_raw/surveys.csv")
download.file("https://ndownloader.figshare.com/files/3299474;,
  "data_raw/plots.csv")
library(tidyverse)
species <- read_csv("data_raw/species.csv")
surveys <- read_csv("data_raw/surveys.csv")
plots <- read_csv("data_raw/plots.csv")

No problem here. I'm pulling three files from the web, saving them to a folder 
on my hard drive (...data_raw/species.csv, etc.), and reading them into R data 
frames. But here is where I lose it: I'm creating something named my_db_file 
from another file named portal-database-output with an .sqlite extension, and 
then creating my_db from my_db_file. I'm not sure where the .sqlite extension 
file came from.

Creating a new SQLite database with dplyr is easy. You can re-use the same 
command we used above to open an existing .sqlite file. The create = TRUE 
argument instructs R to create a new, empty database instead.

Caution: When create = TRUE is added, any existing database at the same 
location is overwritten without warning.

my_db_file <- "data/portal-database-output.sqlite"
my_db <- src_sqlite(my_db_file, create = TRUE)

Currently, our new database is empty; it doesn't contain any tables:

my_db
#> src:  sqlite 3.29.0 [data/portal-database-output.sqlite]
#> tbls:

To add tables, we copy the existing data.frames into the database one by one:

copy_to(my_db, surveys)
copy_to(my_db, plots)
my_db

I can follow the directions to fill in my_db, but I have no idea how to access 
the tables. The text from the tutorial below says to check the location of our 
database. Huh! Can someone give me some direction? Thanks.





If you check the location of our database you’ll see that data is automatically 
being written to disk. R and dplyr not only provide easy ways to query existing 
databases, they also allow you to easily create your own databases from flat 
files!
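For what it's worth, here is a minimal sketch of how the copied tables might be listed and queried once the .sqlite file exists on disk. This is not from the tutorial; the connection/query calls are standard DBI and dplyr functions, and the file path is the one the tutorial uses.

```r
library(DBI)
library(RSQLite)
library(dplyr)

# Sketch: open the database file the tutorial created and inspect it.
con <- dbConnect(RSQLite::SQLite(), "data/portal-database-output.sqlite")
dbListTables(con)                    # names of the tables copy_to() created

surveys_tbl <- tbl(con, "surveys")   # lazy dplyr reference to one table
surveys_tbl %>% head()               # pulls the first rows from SQLite

dbDisconnect(con)
```

tbl() gives a lazy table that dplyr verbs translate to SQL, so the data stays in SQLite until you collect() it.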





[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data reshape

2019-12-20 Thread Yuan Chun Ding
Hi Bert,

Thank you for the elegant code example. I achieved my goal using the lapply 
and do.call functions together. The Reduce function is a nicer one and I am 
looking into it.

Ding

From: Bert Gunter [mailto:bgunter.4...@gmail.com]
Sent: Friday, December 20, 2019 11:47 AM
To: Yuan Chun Ding
Cc: r-help@r-project.org
Subject: Re: [R] data reshape


[Attention: This email came from an external source. Do not open attachments or 
click on links from unknown senders or unexpected emails.]

It is perhaps worth noting that (assuming I understand correctly) this can 
easily be done in one go, without any overt looping, as a nice application of 
Reduce() after all your files are read into your global environment.

Example:

> a.out <- data.frame(x = 1:3, y1 = 11:13)
> b.out <- data.frame(x = c(1,3), y2 = 21:22)
> d.out <- data.frame(x = c(2:3), y3 = c(.5,.6))

> nm <- ls(pat = ".*out$")
> f <- function(dat, y) merge(dat, get(y), all = TRUE)
> allofthem <- Reduce(f, nm[-1], init = get(nm[1]))
> allofthem
  x y1 y2  y3
1 1 11 21  NA
2 2 12 NA 0.5
3 3 13 22 0.6

## note the change to "all = TRUE" in the merge() call

Cheers,
Bert



On Fri, Dec 20, 2019 at 9:37 AM Bert Gunter wrote:
?merge ## note the all.x option
Example:
> a <- data.frame(x = 1:3, y1 = 11:13)
> b <- data.frame(x = c(1,3), y2 = 21:22)

> merge(a,b, all.x = TRUE)
  x y1 y2
1 1 11 21
2 2 12 NA
3 3 13 22


Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Dec 20, 2019 at 9:00 AM Yuan Chun Ding wrote:
Hi Bert,

Sorry, I was in a hurry to go home yesterday afternoon and just posted my 
question, hoping to get some advice.

Here is what I got yesterday before going home.
---
setwd("C:/Awork/VNTR/GETXdata/GTEx_genotypes")

file_list <- list.files(pattern="*.out")

#to read all 652 files into Rstudio and found that NOT all files have same 
number of rows
for (i in 1:length(file_list)){

  assign( substr(file_list[i], 1, nchar(file_list[i]) -4) ,

 read.delim(file_list[i], head=F))
}

#the first file, GTEX_1117F, in the following format,  one column and 19482 rows
#4 is marker id, 25/48 is its marker value;
#  V1
#  4
# 25/48
# 201
# 2/2
# ...
# 648589
# None

#to make this one-column file into a two-column file as below
# so first column is marker id, second is corresponding marker values for the 
sample GTEX_1117F
#  VNTRid  GTEX_1117F
#   4   25/48
#   2012/2
#...  ...
# 648589  None

for (i in 1:length(file_list)){
  temp <- read.delim(file_list[i], head=F)
  even <-seq(2, length(temp$V1),2)
  odd <-seq(1, length(temp$V1)-1, 2)
  output <-matrix(0, ncol=2, nrow=length(temp$V1)/2)
  colnames(output)<- c("VNTRid",substr(file_list[i], 1, nchar(file_list[i]) -4))
  for (j in 1:length(temp$V1)/2){
  output[j,1]<- as.character(temp$V1)[odd[j]]
  output[j,2]<- as.character(temp$V1)[even[j]]}
  assign(gsub("-","_", substr(file_list[i], 1, nchar(file_list[i])-4)), 
as.data.frame(output))
 }
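The inner j loop above can be avoided with R's logical-index recycling. A sketch (not the poster's code) under the assumption that each file strictly alternates marker-id and marker-value lines:

```r
# Sketch: reshape one alternating id/value column without an inner loop.
# Assumes an even number of rows alternating marker id, marker value.
reshape_one <- function(path) {
  v <- as.character(read.delim(path, header = FALSE)$V1)
  out <- data.frame(VNTRid = v[c(TRUE, FALSE)],   # odd rows: marker ids
                    value  = v[c(FALSE, TRUE)],   # even rows: marker values
                    stringsAsFactors = FALSE)
  # name the value column after the file, e.g. "GTEX_1117F"
  names(out)[2] <- gsub("-", "_", sub("\\.out$", "", basename(path)))
  out
}
```

The recycled logical index c(TRUE, FALSE) selects odd positions and c(FALSE, TRUE) the even ones, replacing the seq()/matrix bookkeeping.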

Yesterday, I intended to reshape the output file above from long to wide using 
VNTRid as key. Since not all files have the same number of rows, after 
reshaping those files would not bind correctly using the rbind function.
On my way to work this morning, I changed my intention: I will not reshape to 
wide format; I actually like the long format I generated. I will read in a 
VNTR marker annotation file with VNTRid in the first column and marker 
locations on human chromosomes in the second column; this annotation file 
should include all the VNTR markers. I know the VNTRid values in the 
annotation file are the same as the VNTRid values in the 652 files I read in.

Do you know a good way to merge all those 652 files (with two columns) ?

Thank you,

Ding


#merge all 652 files into one file with VNTRid as first column; columns 2 to 
#653 are genotypes, with sample ID as header
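Merging many two-column frames on a shared key is a natural fit for Reduce() over merge(). A small self-contained sketch with toy data (the sample names here are illustrative, not real files); all = TRUE keeps every marker and leaves NA where a sample lacks a row:

```r
# Sketch: merge a list of two-column data frames on the shared VNTRid key.
df_list <- list(
  data.frame(VNTRid = c(4, 201, 648589),
             GTEX_1117F = c("25/48", "2/2", "None")),
  data.frame(VNTRid = c(4, 201),
             GTEX_1A3MV = c("30/48", "2/3"))
)
merged <- Reduce(function(x, y) merge(x, y, by = "VNTRid", all = TRUE), df_list)
# one row per VNTRid, with NA where a sample has no value for that marker
```

In practice the 652 frames could be collected into such a list with lapply() over file_list, instead of assign()ing each one into the global environment.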

From: Bert Gunter [mailto:bgunter.4...@gmail.com]
Sent: Thursday, December 19, 2019 6:52 PM
To: Yuan Chun Ding
Cc: r-help@r-project.org
Subject: Re: [R] data reshape


[Attention: This email came from an external source. Do not open attachments or 
click on links from unknown senders or unexpected emails.]

Did you even make an attempt to do this? -- or would you like us to do all 
your work for you?

If you made an attempt, show us your code and errors.

Re: [R] data reshape

2019-12-20 Thread Bert Gunter
It is perhaps worth noting that (assuming I understand correctly) this can
easily be done in one go, without any overt looping, as a nice application of
Reduce() after all your files are read into your global environment.

Example:

> a.out <- data.frame(x = 1:3, y1 = 11:13)
> b.out <- data.frame(x = c(1,3), y2 = 21:22)
> d.out <- data.frame(x = c(2:3), y3 = c(.5,.6))

> nm <- ls(pat = ".*out$")
> f <- function(dat, y) merge(dat, get(y), all = TRUE)
> allofthem <- Reduce(f, nm[-1], init = get(nm[1]))
> allofthem
  x y1 y2  y3
1 1 11 21  NA
2 2 12 NA 0.5
3 3 13 22 0.6

## note the change to "all = TRUE" in the merge() call

Cheers,
Bert



On Fri, Dec 20, 2019 at 9:37 AM Bert Gunter  wrote:

> ?merge ## note the all.x option
> Example:
> > a <- data.frame(x = 1:3, y1 = 11:13)
> > b <- data.frame(x = c(1,3), y2 = 21:22)
>
> > merge(a,b, all.x = TRUE)
>   x y1 y2
> 1 1 11 21
> 2 2 12 NA
> 3 3 13 22
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Fri, Dec 20, 2019 at 9:00 AM Yuan Chun Ding  wrote:
>
>> Hi Bert,
>>
>>
>>
>> Sorry that I was in a hurry  going home yesterday afternoon and just
>> posted my question and hoped to get some advice.
>>
>>
>>
>> Here is what I got yesterday before going home.
>>
>> ---
>>
>> setwd("C:/Awork/VNTR/GETXdata/GTEx_genotypes")
>>
>>
>>
>> file_list <- list.files(pattern="*.out")
>>
>>
>>
>> #to read all 652 files into Rstudio and found that NOT all files have
>> same number of rows
>>
>> for (i in 1:length(file_list)){
>>
>>
>>
>>   assign( substr(file_list[i], 1, nchar(file_list[i]) -4) ,
>>
>>
>>
>>  read.delim(file_list[i], head=F))
>>
>> }
>>
>>
>>
>> #the first file, GTEX_1117F, in the following format,  one column and
>> 19482 rows
>>
>> #4 is marker id, 25/48 is its marker value;
>>
>> #  V1
>>
>> #  4
>>
>> # 25/48
>>
>> # 201
>>
>> # 2/2
>>
>> # ...
>>
>> # 648589
>>
>> # None
>>
>>
>>
>> #to make this one-column file into a two-column file as below
>>
>> # so first column is marker id, second is corresponding marker values for
>> the sample GTEX_1117F
>>
>> #  VNTRid  GTEX_1117F
>>
>> #   4   25/48
>>
>> #   2012/2
>>
>> #...  ...
>>
>> # 648589  None
>>
>>
>>
>> for (i in 1:length(file_list)){
>>
>>   temp <- read.delim(file_list[i], head=F)
>>
>>   even <-seq(2, length(temp$V1),2)
>>
>>   odd <-seq(1, length(temp$V1)-1, 2)
>>
>>   output <-matrix(0, ncol=2, nrow=length(temp$V1)/2)
>>
>>   colnames(output)<- c("VNTRid",substr(file_list[i], 1,
>> nchar(file_list[i]) -4))
>>
>>   for (j in 1:length(temp$V1)/2){
>>
>>   output[j,1]<- as.character(temp$V1)[odd[j]]
>>
>>   output[j,2]<- as.character(temp$V1)[even[j]]}
>>
>>   assign(gsub("-","_", substr(file_list[i], 1, nchar(file_list[i])-4)),
>> as.data.frame(output))
>>
>>  }
>>
>>
>>
>> Yesterday, I intended to reshape the output file above from long to wide
>> using VNTRid as key.
>>
>> Since not all files have the same number of rows, after reshaping, those
>> file would not bind correctly using rbind function.
>>
>> One my way to work place this morning, I changed my intension; I will not
>> reshape to wide format and actually like the long format I generated. I
>> will read in a VNTR marker annotation file including VNTRid in first column
>> and marker locations in human chromosomes in the second column, this
>> annotation file should include all the VNTR markers.  I know the VNTRid in
>> the annotation file are same as the VNTRid in the 652 file I read in.
>>
>>
>>
>> Do you know a good way to merge all those 652 files (with two columns) ?
>>
>>
>>
>> Thank you,
>>
>>
>>
>> Ding
>>
>>
>>
>>
>>
>> #merge all 652 files into one file with VNTRid as first column, 2nd t

Re: [R] data reshape

2019-12-20 Thread Bert Gunter
?merge ## note the all.x option
Example:
> a <- data.frame(x = 1:3, y1 = 11:13)
> b <- data.frame(x = c(1,3), y2 = 21:22)

> merge(a,b, all.x = TRUE)
  x y1 y2
1 1 11 21
2 2 12 NA
3 3 13 22


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Dec 20, 2019 at 9:00 AM Yuan Chun Ding  wrote:

> Hi Bert,
>
>
>
> Sorry that I was in a hurry  going home yesterday afternoon and just
> posted my question and hoped to get some advice.
>
>
>
> Here is what I got yesterday before going home.
>
> ---
>
> setwd("C:/Awork/VNTR/GETXdata/GTEx_genotypes")
>
>
>
> file_list <- list.files(pattern="*.out")
>
>
>
> #to read all 652 files into Rstudio and found that NOT all files have same
> number of rows
>
> for (i in 1:length(file_list)){
>
>
>
>   assign( substr(file_list[i], 1, nchar(file_list[i]) -4) ,
>
>
>
>  read.delim(file_list[i], head=F))
>
> }
>
>
>
> #the first file, GTEX_1117F, in the following format,  one column and
> 19482 rows
>
> #4 is marker id, 25/48 is its marker value;
>
> #  V1
>
> #  4
>
> # 25/48
>
> # 201
>
> # 2/2
>
> # ...
>
> # 648589
>
> # None
>
>
>
> #to make this one-column file into a two-column file as below
>
> # so first column is marker id, second is corresponding marker values for
> the sample GTEX_1117F
>
> #  VNTRid  GTEX_1117F
>
> #   4   25/48
>
> #   2012/2
>
> #...  ...
>
> # 648589  None
>
>
>
> for (i in 1:length(file_list)){
>
>   temp <- read.delim(file_list[i], head=F)
>
>   even <-seq(2, length(temp$V1),2)
>
>   odd <-seq(1, length(temp$V1)-1, 2)
>
>   output <-matrix(0, ncol=2, nrow=length(temp$V1)/2)
>
>   colnames(output)<- c("VNTRid",substr(file_list[i], 1,
> nchar(file_list[i]) -4))
>
>   for (j in 1:length(temp$V1)/2){
>
>   output[j,1]<- as.character(temp$V1)[odd[j]]
>
>   output[j,2]<- as.character(temp$V1)[even[j]]}
>
>   assign(gsub("-","_", substr(file_list[i], 1, nchar(file_list[i])-4)),
> as.data.frame(output))
>
>  }
>
>
>
> Yesterday, I intended to reshape the output file above from long to wide
> using VNTRid as key.
>
> Since not all files have the same number of rows, after reshaping, those
> file would not bind correctly using rbind function.
>
> One my way to work place this morning, I changed my intension; I will not
> reshape to wide format and actually like the long format I generated. I
> will read in a VNTR marker annotation file including VNTRid in first column
> and marker locations in human chromosomes in the second column, this
> annotation file should include all the VNTR markers.  I know the VNTRid in
> the annotation file are same as the VNTRid in the 652 file I read in.
>
>
>
> Do you know a good way to merge all those 652 files (with two columns) ?
>
>
>
> Thank you,
>
>
>
> Ding
>
>
>
>
>
> #merge all 652 files into one file with VNTRid as first column, 2nd to
> 653th column are genotype with header
>
> #as sample ID,  so
>
>
>
> *From:* Bert Gunter [mailto:bgunter.4...@gmail.com]
> *Sent:* Thursday, December 19, 2019 6:52 PM
> *To:* Yuan Chun Ding
> *Cc:* r-help@r-project.org
> *Subject:* Re: [R] data reshape
>
>
> --
>
> [Attention: This email came from an external source. Do not open
> attachments or click on links from unknown senders or unexpected emails.]
> --
>
> Did you even make an attempt to do this? -- or would you like us do all
> your work for you?
>
>
>
> If you made an attempt, show us your code and errors.
>
> If not, we usually expect you to try on your own first.
>
> If you have no idea where to start, perhaps you need to spend some more
> time with tutorials to learn basic R functionality before proceeding.
>
>
>
> Bert
>
>
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
>
>
>
> On Thu, Dec 19, 2019 at 6:01 PM Yuan Chun Ding  wrote:
>
> Hi R users,
>
> I have a folder (called genotype) with 652 files; the file names are
> GTEX-1A3MV.out, GTEX-1A3MX.out, GTEX-1B8SF.out, etc; in each

Re: [R] data reshape

2019-12-20 Thread Yuan Chun Ding
Hi Bert,

Sorry, I was in a hurry to go home yesterday afternoon and just posted my 
question, hoping to get some advice.

Here is what I got yesterday before going home.
---
setwd("C:/Awork/VNTR/GETXdata/GTEx_genotypes")

file_list <- list.files(pattern="*.out")

#to read all 652 files into Rstudio and found that NOT all files have same 
number of rows
for (i in 1:length(file_list)){

  assign( substr(file_list[i], 1, nchar(file_list[i]) -4) ,

 read.delim(file_list[i], head=F))
}

#the first file, GTEX_1117F, in the following format,  one column and 19482 rows
#4 is marker id, 25/48 is its marker value;
#  V1
#  4
# 25/48
# 201
# 2/2
# ...
# 648589
# None

#to make this one-column file into a two-column file as below
# so first column is marker id, second is corresponding marker values for the 
sample GTEX_1117F
#  VNTRid  GTEX_1117F
#   4   25/48
#   2012/2
#...  ...
# 648589  None

for (i in 1:length(file_list)){
  temp <- read.delim(file_list[i], head=F)
  even <-seq(2, length(temp$V1),2)
  odd <-seq(1, length(temp$V1)-1, 2)
  output <-matrix(0, ncol=2, nrow=length(temp$V1)/2)
  colnames(output)<- c("VNTRid",substr(file_list[i], 1, nchar(file_list[i]) -4))
  for (j in 1:length(temp$V1)/2){
  output[j,1]<- as.character(temp$V1)[odd[j]]
  output[j,2]<- as.character(temp$V1)[even[j]]}
  assign(gsub("-","_", substr(file_list[i], 1, nchar(file_list[i])-4)), 
as.data.frame(output))
 }

Yesterday, I intended to reshape the output file above from long to wide using 
VNTRid as key. Since not all files have the same number of rows, after 
reshaping those files would not bind correctly using the rbind function.
On my way to work this morning, I changed my intention: I will not reshape to 
wide format; I actually like the long format I generated. I will read in a 
VNTR marker annotation file with VNTRid in the first column and marker 
locations on human chromosomes in the second column; this annotation file 
should include all the VNTR markers. I know the VNTRid values in the 
annotation file are the same as the VNTRid values in the 652 files I read in.

Do you know a good way to merge all those 652 files (with two columns) ?

Thank you,

Ding


#merge all 652 files into one file with VNTRid as first column; columns 2 to 
#653 are genotypes, with sample ID as header

From: Bert Gunter [mailto:bgunter.4...@gmail.com]
Sent: Thursday, December 19, 2019 6:52 PM
To: Yuan Chun Ding
Cc: r-help@r-project.org
Subject: Re: [R] data reshape


[Attention: This email came from an external source. Do not open attachments or 
click on links from unknown senders or unexpected emails.]

Did you even make an attempt to do this? -- or would you like us to do all 
your work for you?

If you made an attempt, show us your code and errors.
If not, we usually expect you to try on your own first.
If you have no idea where to start, perhaps you need to spend some more time 
with tutorials to learn basic R functionality before proceeding.

Bert

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Thu, Dec 19, 2019 at 6:01 PM Yuan Chun Ding wrote:
Hi R users,

I have a folder (called genotype) with 652 files; the file names are  
GTEX-1A3MV.out, GTEX-1A3MX.out, GTEX-1B8SF.out, etc; in each file,  only one 
column of data without a header as below
201
2/2
238
3/4
245
1/2
.
983255
3/3
983766
None


A total of 20528 rows;

I need to read all those 652 files in the genotype folder and then reshape the 
one column in each file as:
SampleID    201  238  245  ...  983255  983766
GTEX-1A3MV  2/2  3/4  1/2  ...  3/3     None

There are 10264 data columns plus the sample ID column, so 10265 columns in 
total after data reshaping.

After reading those 652 files and reshaping the one column in each file, I 
will stack them with the rbind function; then I have a file with dimensions 
of 653 rows by 10265 columns.


Thank you,

Ding

--

-SECURITY/CONFIDENTIALITY WARNING-

This message and any attachments are intended solely for the individual or 
entity to which they are addressed. This communication may contain information 
that is privileged, confidential, or exempt from disclosure under applicable 
law (e.g., personal health information, research data, financial information). 
Because this e-mail has been sent without encryption, individuals other than 
the intended recipient may be able to view th

Re: [R] data reshape

2019-12-19 Thread Bert Gunter
Did you even make an attempt to do this? -- or would you like us to do all
your work for you?

If you made an attempt, show us your code and errors.
If not, we usually expect you to try on your own first.
If you have no idea where to start, perhaps you need to spend some more
time with tutorials to learn basic R functionality before proceeding.

Bert

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Thu, Dec 19, 2019 at 6:01 PM Yuan Chun Ding  wrote:

> Hi R users,
>
> I have a folder (called genotype) with 652 files; the file names are
> GTEX-1A3MV.out, GTEX-1A3MX.out, GTEX-1B8SF.out, etc; in each file,  only
> one column of data without a header as below
> 201
> 2/2
> 238
> 3/4
> 245
> 1/2
> .
> 983255
> 3/3
> 983766
> None
>
>
> A total of 20528 rows;
>
> I need to read all those 652 files in the genotype folder and then reshape
> the one column in each file as:
> SampleID 201238245   983255
>  983766
> GTEX-1A3MV 2/2 3/41/2 3/3
>None
>
> There are 10264 data columns plus the sample ID column, so 10265 columns
> in total after data reshaping.
>
> After reading those 652 file and reshape the one column in each file, I
> will stack them by the rbind function, then I have a file with a dimension
> of 653 row, 10265 column.
>
>
> Thank you,
>
> Ding
>
> --
> 
> -SECURITY/CONFIDENTIALITY WARNING-
>
> This message and any attachments are intended solely for the individual or
> entity to which they are addressed. This communication may contain
> information that is privileged, confidential, or exempt from disclosure
> under applicable law (e.g., personal health information, research data,
> financial information). Because this e-mail has been sent without
> encryption, individuals other than the intended recipient may be able to
> view the information, forward it to others or tamper with the information
> without the knowledge or consent of the sender. If you are not the intended
> recipient, or the employee or person responsible for delivering the message
> to the intended recipient, any dissemination, distribution or copying of
> the communication is strictly prohibited. If you received the communication
> in error, please notify the sender immediately by replying to this message
> and deleting the message and any accompanying files from your system. If,
> due to the security risks, you do not wish to receive further communications
> via e-mail, please reply to this message and
> inform the sender that you do not wish to receive further e-mail from the
> sender. (LCP301)
> 
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] data reshape

2019-12-19 Thread Yuan Chun Ding
Hi R users,

I have a folder (called genotype) with 652 files; the file names are  
GTEX-1A3MV.out, GTEX-1A3MX.out, GTEX-1B8SF.out, etc; in each file,  only one 
column of data without a header as below
201
2/2
238
3/4
245
1/2
.
983255
3/3
983766
None


A total of 20528 rows;

I need to read all those 652 files in the genotype folder and then reshape the 
one column in each file as:
SampleID    201  238  245  ...  983255  983766
GTEX-1A3MV  2/2  3/4  1/2  ...  3/3     None

There are 10264 data columns plus the sample ID column, so 10265 columns in 
total after data reshaping.

After reading those 652 files and reshaping the one column in each file, I 
will stack them with the rbind function; then I have a file with dimensions 
of 653 rows by 10265 columns.
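One way to build that 653 x 10265 layout is to turn each file into a single named row and stack the rows. A sketch (not the poster's code) under the assumption stated in the post that every file lists the same markers in the same order:

```r
# Sketch: one wide row per genotype file, values named by marker id.
# Assumes rows alternate marker id, marker value, identically across files.
one_row <- function(path) {
  v <- as.character(read.delim(path, header = FALSE)$V1)
  setNames(v[c(FALSE, TRUE)], v[c(TRUE, FALSE)])  # values keyed by marker id
}
files <- list.files("genotype", pattern = "\\.out$", full.names = TRUE)
wide <- do.call(rbind, lapply(files, one_row))
rownames(wide) <- sub("\\.out$", "", basename(files))  # sample IDs
```

If the files do not all list the same markers, a keyed merge (e.g. Reduce over merge by marker id) is safer than positional rbind.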


Thank you,

Ding

--

-SECURITY/CONFIDENTIALITY WARNING-  

This message and any attachments are intended solely for the individual or 
entity to which they are addressed. This communication may contain information 
that is privileged, confidential, or exempt from disclosure under applicable 
law (e.g., personal health information, research data, financial information). 
Because this e-mail has been sent without encryption, individuals other than 
the intended recipient may be able to view the information, forward it to 
others or tamper with the information without the knowledge or consent of the 
sender. If you are not the intended recipient, or the employee or person 
responsible for delivering the message to the intended recipient, any 
dissemination, distribution or copying of the communication is strictly 
prohibited. If you received the communication in error, please notify the 
sender immediately by replying to this message and deleting the message and any 
accompanying files from your system. If, due to the security risks, you do not 
wish to receive further communications via e-mail, please reply to this 
message and 
inform the sender that you do not wish to receive further e-mail from the 
sender. (LCP301)


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data Structure to Unnest_tokens in tidytext package

2019-12-11 Thread Eric Berger
Hi Sarah,
I looked at the documentation that you linked to. It contains the step

text_df <- tibble(line = 1:4, text = text)

before it does the step

text_df %>%
  unnest_tokens(word, text)

So you may be missing a step.

Best,
Eric
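In code, that missing step might look like the following. This is a sketch, assuming the novel was read in as a plain character vector (e.g. with readLines()); "quicksandr.txt" is the poster's file name.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Sketch: wrap the raw character vector in a tibble first;
# unnest_tokens() can then split the text column into words.
text <- readLines("quicksandr.txt")
text_df <- tibble(line = seq_along(text), text = text)

tidy_novel <- text_df %>%
  unnest_tokens(word, text)   # one row per word, keeping the line number
```

The key point is that unnest_tokens() wants a data frame whose text column is a character vector, not a list-column or a misparsed tibble from read_csv().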

On Tue, Dec 10, 2019 at 9:05 PM Sarah Payne  wrote:
>
> Hi--I'm fairly new to R and trying to do a text mining project on a novel
> using the tidytext package. The novel is saved as a plain text document and
> I can import it into RStudio just fine. For reference I'm trying to do
> something similar to section 1.3 of this tidy text tutorial
> , except I'm working with one
> novel instead of many. So I import the novel and then run:
>
> "tidy_novel <- quicksandr %>%
> unnest_tokens (word, text)"
>
> I get the following error:
>
> Error in check_input(x) :
>   Input must be a character vector of any length or a list of character
>   vectors, each of which has a length of 1.
>
> typeof(novel) returns "list" and str(novel) returns
>
> Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 955 obs. of  1
> variable:
>  $ FOR E. S. I.: chr  "FOR E. S. I." "My old man died in a fine big house.
> My ma died in a shack. I wonder where I'm gonna die, Being neither white
> nor black?'" "LANGSTON HUGHES" "ONE" ...
>  - attr(*, "problems")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 8 obs. of
>  5 variables:
>   ..$ row : int  530 726 733 836 853 886 889 942
>   ..$ col : chr  NA NA NA NA ...
>   ..$ expected: chr  "1 columns" "1 columns" "1 columns" "1 columns" ...
>   ..$ actual  : chr  "2 columns" "2 columns" "2 columns" "2 columns" ...
>   ..$ file: chr  "'quicksandr.txt'" "'quicksandr.txt'"
> "'quicksandr.txt'" "'quicksandr.txt'" ...
>  - attr(*, "spec")=
>   .. cols(
>   ..   `FOR E. S. I.` = col_character()
>   .. )
> >
>
> I'm just importing the text file and then trying to run the unnest_tokens
> function, so maybe I'm missing a step in between? I seem to need my text
> file in a different format, so would appreciate answers on how to do that.
> Thanks, and let me know if I need to provide more info!
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Data Structure to Unnest_tokens in tidytext package

2019-12-10 Thread Sarah Payne
Hi--I'm fairly new to R and trying to do a text mining project on a novel
using the tidytext package. The novel is saved as a plain text document and
I can import it into RStudio just fine. For reference I'm trying to do
something similar to section 1.3 of this tidy text tutorial
, except I'm working with one
novel instead of many. So I import the novel and then run:

"tidy_novel <- quicksandr %>%
unnest_tokens (word, text)"

I get the following error:

Error in check_input(x) :
  Input must be a character vector of any length or a list of character
  vectors, each of which has a length of 1.

typeof(novel) returns "list" and str(novel) returns

Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 955 obs. of  1
variable:
 $ FOR E. S. I.: chr  "FOR E. S. I." "My old man died in a fine big house.
My ma died in a shack. I wonder where I'm gonna die, Being neither white
nor black?'" "LANGSTON HUGHES" "ONE" ...
 - attr(*, "problems")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 8 obs. of
 5 variables:
  ..$ row : int  530 726 733 836 853 886 889 942
  ..$ col : chr  NA NA NA NA ...
  ..$ expected: chr  "1 columns" "1 columns" "1 columns" "1 columns" ...
  ..$ actual  : chr  "2 columns" "2 columns" "2 columns" "2 columns" ...
  ..$ file: chr  "'quicksandr.txt'" "'quicksandr.txt'"
"'quicksandr.txt'" "'quicksandr.txt'" ...
 - attr(*, "spec")=
  .. cols(
  ..   `FOR E. S. I.` = col_character()
  .. )
>

I'm just importing the text file and then trying to run the unnest_tokens
function, so maybe I'm missing a step in between? I seem to need my text
file in a different format, so would appreciate answers on how to do that.
Thanks, and let me know if I need to provide more info!

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data load from excel files

2019-11-13 Thread Rui Barradas

Hello,

Try which.max?

Hope this helps,

Rui Barradas
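A sketch of how which.max could replace the which(RR == ...) lookup (toy data here; in the real code, alldata holds the Tanggal and RR columns):

```r
# Sketch: which.max returns the row index of the (first) maximum,
# so the value and its date come from the same row -- avoiding the
# "number of items to replace is not a multiple" error.
alldata <- data.frame(
  Tanggal = as.Date(c("2001-09-12", "2001-09-20", "2001-09-22")),
  RR      = c(66.0, 74.5, 49.6)
)
i <- which.max(alldata$RR)
alldata$RR[i]        # the maximum rainfall
alldata$Tanggal[i]   # the date it occurred
```

Unlike which(RR == max), which.max always returns exactly one index, even when the maximum value occurs on several dates.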

On 13/11/19 at 14:10, ani jaya wrote:

Thank you very much Mr. Rui, but to delete the duplicated rows I use:

...
library(tidyverse)
alldata<-data.frame(Reduce(rbind, pon1))
c<-(which(duplicated(alldata$Tanggal))) #duplicate
alldata<-alldata[-c,]
attach(alldata)


because not every last row of every df is a bad one.

Another problem is that I want to know when the max value occurred. Basically 
I have a maximum value for every month (maxi, n=360, from 1986 to 2015) and I 
want to find the annual maxima.



...
maxi<-lapply(pon1, function(x) max(x$RR,na.rm=T))
maxi<-data.frame(Reduce(rbind, maxi))
names(maxi)<-"maxi"
annual_maxima <- rep(NA,30)
date <- rep(NA,30)
for(i in 1:30){

   annual_maxima[i] <- max(maxi$maxi[(i*12-11):(i*12)])
   date[i]<-Tanggal[which(RR==annual_maxima[i])]
}


Here "alldata" contains "Tanggal" (the date, in this case) and rainfall 
("RR"). What I get is an error stating:


In date[i] <- Tanggal[which(RR == annual_maxima[i])] : number of items 
to replace is not a multiple of replacement length



Maybe you have some idea where the problem is; I would be thankful.

Best,
Ani





On Wed, Nov 13, 2019 at 5:49 PM Rui Barradas wrote:


Hello,

Maybe the following will get you close to what you want.


# remove the last row from every df
pon1 <- lapply(pon1, function(DF){
    DF[[1]] <- as.Date(DF[["Tanggal"]], "%d-%m-%Y")
    DF[-nrow(DF), ]
})


# order the list by year-month
inx_ym <- sapply(pon1, function(DF){
    format(DF[["Tanggal"]][1], "%Y-%m")
})
pon1 <- pon1[order(inx_ym)]


# get the minimum and maximum of every "RR"
min.RR <- sapply(pon1, function(DF) min(DF[["RR"]], na.rm = TRUE))
max.RR <- sapply(pon1, function(DF) max(DF[["RR"]], na.rm = TRUE))


Hope this helps,

Rui Barradas



On 13/11/19 at 07:50, ani jaya wrote:
 > Dear R-Help,
 >
 > I have 30 of year-based excel files and each file contain month
sheets. I
 > have some problem here. My data is daily rainfall but there is
extra 1 day
 > (first date of next month) for several sheets. My main goal is to
get the
 > minimum value for every month.
 >
 > First, how to extract those data to list of data frame based on
year and
 > delete every overlapping date?
 > Second, how to sort it based on date with ascending order (old to
new)?
 > Third, how to get the maximum together with the date?
 >
 > I did this one,
 >
 > ...
 > file.list <- list.files(pattern='*.xlsx')
 > file.list<-mixedsort(file.list)
 >
 > #
 >

https://stackoverflow.com/questions/12945687/read-all-worksheets-in-an-excel-workbook-into-an-r-list-with-data-frames
 >
 > read_excel_allsheets <- function(filename, tibble = FALSE) {
 >    sheets <- readxl::excel_sheets(filename)
 >    x <- lapply(sheets, function(X) read.xlsx(filename, sheet=X,
rows=9:40,
 > cols=1:2))
 >    if(!tibble) x <- lapply(x, as.data.frame)
 >    names(x) <- sheets
 >    x
 > }
 >
 > pon<-lapply(file.list, function(i) read_excel_allsheets(i))
 > pon1<-do.call("rbind",pon)
 > names(pon1) <- paste0("M.", 1:360)
 > pon1 <-lapply(pon1,function(x){x$RR[x$RR==] <- NA; x})
 > pon1 <-lapply(pon1,function(x){x$RR[x$RR==""] <- NA; x})
 > maxi<-lapply(pon1, function(x) max(x$RR,na.rm=T))
 > maxi<-data.frame(Reduce(rbind, maxi))
 > names(maxi)<-"maxi"
 > 
 >
 > but the list start from January for every year, and move to
February and so
 > on. And there is no date in "maxi". Here some sample what I get
from my
 > simple code.
 >
 >> pon1[256:258]$M.256
 >        Tanggal   RR
 > 1  01-09-2001  5.2
 > 2  02-09-2001  0.3
 > 3  03-09-2001 29.0
 > 4  04-09-2001  0.7
 > 5  05-09-2001  9.6
 > 6  06-09-2001  0.7
 > 7  07-09-2001   NA
 > 8  08-09-2001 13.2
 > 9  09-09-2001   NA
 > 10 10-09-2001   NA
 > 11 11-09-2001  0.0
 > 12 12-09-2001 66.0
 > 13 13-09-2001  0.0
 > 14 14-09-2001 57.6
 > 15 15-09-2001 18.0
 > 16 16-09-2001 29.2
 > 17 17-09-2001 52.2
 > 18 18-09-2001  7.0
 > 19 19-09-2001   NA
 > 20 20-09-2001 74.5
 > 21 21-09-2001 20.3
 > 22 22-09-2001 49.6
 > 23 23-09-2001  0.0
 > 24 24-09-2001  1.3
 > 25 25-09-2001  0.0
 > 26 26-09-2001  1.0
 > 27 27-09-2001  0.1
 > 28 28-09-2001  1.9
 > 29 29-09-2001  9.5
 > 30 30-09-2001  3.3
 > 31 01-10-2001  0.0
 >
 > $M.257
 >        Tanggal   RR
 > 1  01-09-2002  0.0
 > 2  02-09-2002  0.0
 > 3  03-09-2002  0.0
 > 4  04-09-2002 12.8
 > 5  05-09-2002  1.0
 > 6  06-09-2002  0.0
 > 7  07-09-2002   NA
 > 8  08-09-2002 22.2
 > 9  09-09-2002   NA
 > 10 10-09-2002   NA
 > 11 11-09-2002  0.0
 > 12 12-09-2002  0.0

Re: [R] data load from excel files

2019-11-13 Thread ani jaya
Thank you very much Mr. Rui, but to delete the duplicated rows I use:

...
library(tidyverse)
alldata<-data.frame(Reduce(rbind, pon1))
c<-(which(duplicated(alldata$Tanggal))) #duplicate
alldata<-alldata[-c,]
attach(alldata)


because not every last row of every df is a bad one.

Another problem is that I want to know when the max value occurred. So
basically I have a maximum value for every month (maxi, n=360, from 1986 to 2015)
and I want to find annual_maxima.


...
maxi<-lapply(pon1, function(x) max(x$RR,na.rm=T))
maxi<-data.frame(Reduce(rbind, maxi))
names(maxi)<-"maxi"
annual_maxima <- rep(NA,30)
date <- rep(NA,30)
for(i in 1:30){

  annual_maxima[i] <- max(maxi$maxi[(i*12-11):(i*12)])
  date[i]<-Tanggal[which(RR==annual_maxima[i])]
}


Here "alldata" contains "Tanggal" (the date in this case) and rainfall ("RR").
What I get is an error stating:

In date[i] <- Tanggal[which(RR == annual_maxima[i])] :  number of
items to replace is not a multiple of replacement length


Maybe you have some idea where the problem is; I would be thankful.

Best,
Ani
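[Editor's note] The error quoted above has two causes: `which(RR == annual_maxima[i])` can return more than one index when the same value occurs on several days, and assigning a `Date` into the numeric vector `date` drops the date class. A minimal sketch of a loop that avoids both, using toy data standing in for the thread's `maxi` and `Tanggal` (the values are invented for illustration):

```r
# Toy stand-ins: one monthly maximum (and the date it fell on) for two years.
maxi <- c(5, 9, 12, 3, 8, 15, 7, 2, 11, 6, 4, 10,
          14, 1, 9, 13, 2, 6, 18, 5, 7, 3, 12, 8)
Tanggal <- as.Date("2001-01-15") + seq(0, by = 30, length.out = 24)

n_years <- 2
annual_maxima <- numeric(n_years)
annual_dates  <- rep(as.Date(NA_character_), n_years)  # keep Date class from the start

for (i in seq_len(n_years)) {
  idx <- (i * 12 - 11):(i * 12)   # the 12 months of year i
  j   <- which.max(maxi[idx])     # a single index, even when values tie
  annual_maxima[i] <- maxi[idx][j]
  annual_dates[i]  <- Tanggal[idx][j]
}
annual_maxima   # 15 18
```

`which.max()` always returns one index, and pre-allocating `annual_dates` as `Date` keeps the assignment from coercing dates to their numeric codes.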





On Wed, Nov 13, 2019 at 5:49 PM Rui Barradas  wrote:

> Hello,
>
> Maybe the following will get you close to what you want.
>
>
> # remove the last row from every df
> pon1 <- lapply(pon1, function(DF){
>DF[[1]] <- as.Date(DF[["Tanggal"]], "%d-%m-%Y")
>DF[-nrow(DF), ]
> })
>
>
> # order the list by year-month
> inx_ym <- sapply(pon1, function(DF){
>format(DF[["Tanggal"]][1], "%Y-%m")
> })
> pon1 <- pon1[order(inx_ym)]
>
>
> # get the minimum and maximum of every "RR"
> min.RR <- sapply(pon1, function(DF) min(DF[["RR"]], na.rm = TRUE))
> max.RR <- sapply(pon1, function(DF) max(DF[["RR"]], na.rm = TRUE))
>
>
> Hope this helps,
>
> Rui Barradas
>
>
>
> At 07:50 on 13/11/19, ani jaya wrote:
> > Dear R-Help,
> >
> > I have 30 of year-based excel files and each file contain month sheets. I
> > have some problem here. My data is daily rainfall but there is extra 1
> day
> > (first date of next month) for several sheets. My main goal is to get the
> > minimum value for every month.
> >
> > First, how to extract those data to list of data frame based on year and
> > delete every overlapping date?
> > Second, how to sort it based on date with ascending order (old to new)?
> > Third, how to get the maximum together with the date?
> >
> > I did this one,
> >
> > ...
> > file.list <- list.files(pattern='*.xlsx')
> > file.list<-mixedsort(file.list)
> >
> > #
> >
> https://stackoverflow.com/questions/12945687/read-all-worksheets-in-an-excel-workbook-into-an-r-list-with-data-frames
> >
> > read_excel_allsheets <- function(filename, tibble = FALSE) {
> >sheets <- readxl::excel_sheets(filename)
> >x <- lapply(sheets, function(X) read.xlsx(filename, sheet=X,
> rows=9:40,
> > cols=1:2))
> >if(!tibble) x <- lapply(x, as.data.frame)
> >names(x) <- sheets
> >x
> > }
> >
> > pon<-lapply(file.list, function(i) read_excel_allsheets(i))
> > pon1<-do.call("rbind",pon)
> > names(pon1) <- paste0("M.", 1:360)
> > pon1 <-lapply(pon1,function(x){x$RR[x$RR==] <- NA; x})
> > pon1 <-lapply(pon1,function(x){x$RR[x$RR==""] <- NA; x})
> > maxi<-lapply(pon1, function(x) max(x$RR,na.rm=T))
> > maxi<-data.frame(Reduce(rbind, maxi))
> > names(maxi)<-"maxi"
> > 
> >
> > but the list start from January for every year, and move to February and
> so
> > on. And there is no date in "maxi". Here some sample what I get from my
> > simple code.
> >
> >> pon1[256:258]$M.256
> >Tanggal   RR
> > 1  01-09-2001  5.2
> > 2  02-09-2001  0.3
> > 3  03-09-2001 29.0
> > 4  04-09-2001  0.7
> > 5  05-09-2001  9.6
> > 6  06-09-2001  0.7
> > 7  07-09-2001   NA
> > 8  08-09-2001 13.2
> > 9  09-09-2001   NA
> > 10 10-09-2001   NA
> > 11 11-09-2001  0.0
> > 12 12-09-2001 66.0
> > 13 13-09-2001  0.0
> > 14 14-09-2001 57.6
> > 15 15-09-2001 18.0
> > 16 16-09-2001 29.2
> > 17 17-09-2001 52.2
> > 18 18-09-2001  7.0
> > 19 19-09-2001   NA
> > 20 20-09-2001 74.5
> > 21 21-09-2001 20.3
> > 22 22-09-2001 49.6
> > 23 23-09-2001  0.0
> > 24 24-09-2001  1.3
> > 25 25-09-2001  0.0
> > 26 26-09-2001  1.0
> > 27 27-09-2001  0.1
> > 28 28-09-2001  1.9
> > 29 29-09-2001  9.5
> > 30 30-09-2001  3.3
> > 31 01-10-2001  0.0
> >
> > $M.257
> >Tanggal   RR
> > 1  01-09-2002  0.0
> > 2  02-09-2002  0.0
> > 3  03-09-2002  0.0
> > 4  04-09-2002 12.8
> > 5  05-09-2002  1.0
> > 6  06-09-2002  0.0
> > 7  07-09-2002   NA
> > 8  08-09-2002 22.2
> > 9  09-09-2002   NA
> > 10 10-09-2002   NA
> > 11 11-09-2002  0.0
> > 12 12-09-2002  0.0
> > 13 13-09-2002  0.0
> > 14 14-09-2002   NA
> > 15 15-09-2002  0.0
> > 16 16-09-2002  0.0
> > 17 17-09-2002  0.0
> > 18 18-09-2002 13.3
> > 19 19-09-2002  0.0
> > 20 20-09-2002  0.0
> > 21 21-09-2002  0.0
> > 22 22-09-2002  0.0
> > 23 23-09-2002  0.0
> > 24 24-09-2002  0.0
> > 25 25-09-2002  0.0
> > 26 26-09-2002  0.5
> > 27 27-09-2002  2.1
> > 28 28-09-2002   NA
> > 29 29-09-2002 18.5
> > 30 30-09-2002  0.0
> > 31 01-10-2002   NA
> >
> > $M.258
> >

Re: [R] data load from excel files

2019-11-13 Thread Rui Barradas

Hello,

Maybe the following will get you close to what you want.


# remove the last row from every df
pon1 <- lapply(pon1, function(DF){
  DF[[1]] <- as.Date(DF[["Tanggal"]], "%d-%m-%Y")
  DF[-nrow(DF), ]
})


# order the list by year-month
inx_ym <- sapply(pon1, function(DF){
  format(DF[["Tanggal"]][1], "%Y-%m")
})
pon1 <- pon1[order(inx_ym)]


# get the minimum and maximum of every "RR"
min.RR <- sapply(pon1, function(DF) min(DF[["RR"]], na.rm = TRUE))
max.RR <- sapply(pon1, function(DF) max(DF[["RR"]], na.rm = TRUE))
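[Editor's note] The same `sapply()` pattern extends to recovering the date of each monthly maximum, which is what the follow-up question asks for. A sketch, assuming each data frame in `pon1` has `Tanggal` already converted to `Date` (as in the first step above) and an `RR` column; the toy list below stands in for the real data:

```r
# Example list shaped like pon1: one data frame per month.
pon1 <- list(
  M.1 = data.frame(Tanggal = as.Date("2001-09-01") + 0:29,
                   RR = c(5.2, 0.3, 29, rep(0, 27))),
  M.2 = data.frame(Tanggal = as.Date("2001-10-01") + 0:30,
                   RR = c(rep(0, 12), 66, rep(1, 18)))
)

# For each month, the maximum RR and the day it fell on.
max.RR   <- sapply(pon1, function(DF) max(DF[["RR"]], na.rm = TRUE))
max.date <- as.Date(sapply(pon1, function(DF) {
  DF[["Tanggal"]][which.max(DF[["RR"]])]   # sapply coerces Date to numeric
}), origin = "1970-01-01")                 # so convert back afterwards
```

`sapply()` simplifies the dates to their numeric day counts, hence the `as.Date(..., origin = "1970-01-01")` wrapper to restore the class.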


Hope this helps,

Rui Barradas



At 07:50 on 13/11/19, ani jaya wrote:

Dear R-Help,

I have 30 of year-based excel files and each file contain month sheets. I
have some problem here. My data is daily rainfall but there is extra 1 day
(first date of next month) for several sheets. My main goal is to get the
minimum value for every month.

First, how to extract those data to list of data frame based on year and
delete every overlapping date?
Second, how to sort it based on date with ascending order (old to new)?
Third, how to get the maximum together with the date?

I did this one,

...
file.list <- list.files(pattern='*.xlsx')
file.list<-mixedsort(file.list)

#
https://stackoverflow.com/questions/12945687/read-all-worksheets-in-an-excel-workbook-into-an-r-list-with-data-frames

read_excel_allsheets <- function(filename, tibble = FALSE) {
   sheets <- readxl::excel_sheets(filename)
   x <- lapply(sheets, function(X) read.xlsx(filename, sheet=X, rows=9:40,
cols=1:2))
   if(!tibble) x <- lapply(x, as.data.frame)
   names(x) <- sheets
   x
}

pon<-lapply(file.list, function(i) read_excel_allsheets(i))
pon1<-do.call("rbind",pon)
names(pon1) <- paste0("M.", 1:360)
pon1 <-lapply(pon1,function(x){x$RR[x$RR==] <- NA; x})
pon1 <-lapply(pon1,function(x){x$RR[x$RR==""] <- NA; x})
maxi<-lapply(pon1, function(x) max(x$RR,na.rm=T))
maxi<-data.frame(Reduce(rbind, maxi))
names(maxi)<-"maxi"


but the list start from January for every year, and move to February and so
on. And there is no date in "maxi". Here some sample what I get from my
simple code.


pon1[256:258]$M.256

   Tanggal   RR
1  01-09-2001  5.2
2  02-09-2001  0.3
3  03-09-2001 29.0
4  04-09-2001  0.7
5  05-09-2001  9.6
6  06-09-2001  0.7
7  07-09-2001   NA
8  08-09-2001 13.2
9  09-09-2001   NA
10 10-09-2001   NA
11 11-09-2001  0.0
12 12-09-2001 66.0
13 13-09-2001  0.0
14 14-09-2001 57.6
15 15-09-2001 18.0
16 16-09-2001 29.2
17 17-09-2001 52.2
18 18-09-2001  7.0
19 19-09-2001   NA
20 20-09-2001 74.5
21 21-09-2001 20.3
22 22-09-2001 49.6
23 23-09-2001  0.0
24 24-09-2001  1.3
25 25-09-2001  0.0
26 26-09-2001  1.0
27 27-09-2001  0.1
28 28-09-2001  1.9
29 29-09-2001  9.5
30 30-09-2001  3.3
31 01-10-2001  0.0

$M.257
   Tanggal   RR
1  01-09-2002  0.0
2  02-09-2002  0.0
3  03-09-2002  0.0
4  04-09-2002 12.8
5  05-09-2002  1.0
6  06-09-2002  0.0
7  07-09-2002   NA
8  08-09-2002 22.2
9  09-09-2002   NA
10 10-09-2002   NA
11 11-09-2002  0.0
12 12-09-2002  0.0
13 13-09-2002  0.0
14 14-09-2002   NA
15 15-09-2002  0.0
16 16-09-2002  0.0
17 17-09-2002  0.0
18 18-09-2002 13.3
19 19-09-2002  0.0
20 20-09-2002  0.0
21 21-09-2002  0.0
22 22-09-2002  0.0
23 23-09-2002  0.0
24 24-09-2002  0.0
25 25-09-2002  0.0
26 26-09-2002  0.5
27 27-09-2002  2.1
28 28-09-2002   NA
29 29-09-2002 18.5
30 30-09-2002  0.0
31 01-10-2002   NA

$M.258
   Tanggal   RR
1  01-09-2003  0.0
2  02-09-2003  0.0
3  03-09-2003  0.0
4  04-09-2003  4.0
5  05-09-2003  0.3
6  06-09-2003  0.0
7  07-09-2003   NA
8  08-09-2003  0.0
9  09-09-2003  0.0
10 10-09-2003  0.0
11 11-09-2003   NA
12 12-09-2003  1.0
13 13-09-2003  0.0
14 14-09-2003 60.0
15 15-09-2003  4.5
16 16-09-2003  0.1
17 17-09-2003  2.1
18 18-09-2003   NA
19 19-09-2003  0.0
20 20-09-2003   NA
21 21-09-2003   NA
22 22-09-2003 31.5
23 23-09-2003 42.0
24 24-09-2003 43.3
25 25-09-2003  2.8
26 26-09-2003 21.4
27 27-09-2003  0.8
28 28-09-2003 42.3
29 29-09-2003  5.3
30 30-09-2003 17.3
31 01-10-2003  0.0


Any lead or help is very much appreciated.

Best,

Ani

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] data load from excel files

2019-11-12 Thread ani jaya
Dear R-Help,

I have 30 year-based Excel files and each file contains monthly sheets. I
have a problem here. My data is daily rainfall but there is an extra day
(the first date of the next month) in several sheets. My main goal is to get the
minimum value for every month.

First, how to extract those data into a list of data frames based on year and
delete every overlapping date?
Second, how to sort it based on date with ascending order (old to new)?
Third, how to get the maximum together with the date?

I did this one,

...
file.list <- list.files(pattern='*.xlsx')
file.list<-mixedsort(file.list)

#
https://stackoverflow.com/questions/12945687/read-all-worksheets-in-an-excel-workbook-into-an-r-list-with-data-frames

read_excel_allsheets <- function(filename, tibble = FALSE) {
  sheets <- readxl::excel_sheets(filename)
  x <- lapply(sheets, function(X) read.xlsx(filename, sheet=X, rows=9:40,
cols=1:2))
  if(!tibble) x <- lapply(x, as.data.frame)
  names(x) <- sheets
  x
}

pon<-lapply(file.list, function(i) read_excel_allsheets(i))
pon1<-do.call("rbind",pon)
names(pon1) <- paste0("M.", 1:360)
pon1 <-lapply(pon1,function(x){x$RR[x$RR==] <- NA; x})
pon1 <-lapply(pon1,function(x){x$RR[x$RR==""] <- NA; x})
maxi<-lapply(pon1, function(x) max(x$RR,na.rm=T))
maxi<-data.frame(Reduce(rbind, maxi))
names(maxi)<-"maxi"


but the list starts from January for every year, then moves to February and so
on. And there is no date in "maxi". Here is a sample of what I get from my
simple code.

> pon1[256:258]$M.256
  Tanggal   RR
1  01-09-2001  5.2
2  02-09-2001  0.3
3  03-09-2001 29.0
4  04-09-2001  0.7
5  05-09-2001  9.6
6  06-09-2001  0.7
7  07-09-2001   NA
8  08-09-2001 13.2
9  09-09-2001   NA
10 10-09-2001   NA
11 11-09-2001  0.0
12 12-09-2001 66.0
13 13-09-2001  0.0
14 14-09-2001 57.6
15 15-09-2001 18.0
16 16-09-2001 29.2
17 17-09-2001 52.2
18 18-09-2001  7.0
19 19-09-2001   NA
20 20-09-2001 74.5
21 21-09-2001 20.3
22 22-09-2001 49.6
23 23-09-2001  0.0
24 24-09-2001  1.3
25 25-09-2001  0.0
26 26-09-2001  1.0
27 27-09-2001  0.1
28 28-09-2001  1.9
29 29-09-2001  9.5
30 30-09-2001  3.3
31 01-10-2001  0.0

$M.257
  Tanggal   RR
1  01-09-2002  0.0
2  02-09-2002  0.0
3  03-09-2002  0.0
4  04-09-2002 12.8
5  05-09-2002  1.0
6  06-09-2002  0.0
7  07-09-2002   NA
8  08-09-2002 22.2
9  09-09-2002   NA
10 10-09-2002   NA
11 11-09-2002  0.0
12 12-09-2002  0.0
13 13-09-2002  0.0
14 14-09-2002   NA
15 15-09-2002  0.0
16 16-09-2002  0.0
17 17-09-2002  0.0
18 18-09-2002 13.3
19 19-09-2002  0.0
20 20-09-2002  0.0
21 21-09-2002  0.0
22 22-09-2002  0.0
23 23-09-2002  0.0
24 24-09-2002  0.0
25 25-09-2002  0.0
26 26-09-2002  0.5
27 27-09-2002  2.1
28 28-09-2002   NA
29 29-09-2002 18.5
30 30-09-2002  0.0
31 01-10-2002   NA

$M.258
  Tanggal   RR
1  01-09-2003  0.0
2  02-09-2003  0.0
3  03-09-2003  0.0
4  04-09-2003  4.0
5  05-09-2003  0.3
6  06-09-2003  0.0
7  07-09-2003   NA
8  08-09-2003  0.0
9  09-09-2003  0.0
10 10-09-2003  0.0
11 11-09-2003   NA
12 12-09-2003  1.0
13 13-09-2003  0.0
14 14-09-2003 60.0
15 15-09-2003  4.5
16 16-09-2003  0.1
17 17-09-2003  2.1
18 18-09-2003   NA
19 19-09-2003  0.0
20 20-09-2003   NA
21 21-09-2003   NA
22 22-09-2003 31.5
23 23-09-2003 42.0
24 24-09-2003 43.3
25 25-09-2003  2.8
26 26-09-2003 21.4
27 27-09-2003  0.8
28 28-09-2003 42.3
29 29-09-2003  5.3
30 30-09-2003 17.3
31 01-10-2003  0.0


Any lead or help is very much appreciated.

Best,

Ani

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data conversion

2019-09-18 Thread Jim Lemon
Hi Edward,
Say your "data frame" is named "epdat". This may do it:

epmat<-matrix(unlist(epdat[10:289]),nrow=28)
colnames(epmat)<-sub("1","",names(epdat[10:289])[seq(1,270,by=28)])

This one looks like the Sorcerer's Apprentice tangled with one of
those experimental schedule scripting programs.

Jim
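[Editor's note] The wide-to-long pivot Jim builds by hand can also be expressed with `stats::reshape`. A sketch on a small stand-in for the TAP data (two measures, three trials; the column-name pattern `RT1`, `Feedback1`, ... follows the structure dump below):

```r
# Small stand-in for the one-row TAP data frame.
epdat <- data.frame(TAP_ID = "967372",
                    RT1 = 284, RT2 = 228, RT3 = 226,
                    Feedback1 = 2, Feedback2 = 2, Feedback3 = 3)

# Pivot so trials run down the rows, one column per measure.
long <- reshape(epdat,
                varying   = list(paste0("RT", 1:3), paste0("Feedback", 1:3)),
                v.names   = c("RT", "Feedback"),
                timevar   = "Trial",
                direction = "long")
# long has one row per trial, with columns TAP_ID, Trial, RT, Feedback.
```

For the full data set, the `varying` list would grow to one name vector per measure (ITI, Shock, Delay, Hold, and so on), each of length 28.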

On Thu, Sep 19, 2019 at 7:04 AM Patzelt, Edward  wrote:
>
> Hi R Help,
>
> How would I convert the data below so that I have it formatted with trials
> along the rows and then each type of measure separately? e.g.,
> Subject RT OnOff Feedback
> Trial_1
> Trial_2
> Trial_3
> Trial_4
>
> Thanks!
>
> Edward
>
>
>
>
>
> structure(list(TAP_ID = "967372   ", TAP_Date = NA_real_, TAP_Time = 29700,
> TAP_Study = "  ", SexOfTarget = "M",
> SexOfSubj = "M", OperatorName = "", LowThresh = 220,
> HighThresh = 1320, Trial1 = 1, Trial2 = 2, Trial3 = 3, Trial4 = 4,
> Trial5 = 5, Trial6 = 6, Trial7 = 7, Trial8 = 8, Trial9 = 9,
> Trial10 = 10, Trial11 = 11, Trial12 = 12, Trial13 = 13, Trial14 = 14,
> Trial15 = 15, Trial16 = 16, Trial17 = 17, Trial18 = 18, Trial19 = 19,
> Trial20 = 20, Trial21 = 21, Trial22 = 22, Trial23 = 23, Trial24 = 24,
> Trial25 = 25, Trial26 = 26, Trial27 = 27, Trial28 = 28, ITI1 = 5,
> ITI2 = 5, ITI3 = 5, ITI4 = 5, ITI5 = 5, ITI6 = 5, ITI7 = 5,
> ITI8 = 5, ITI9 = 5, ITI10 = 5, ITI11 = 5, ITI12 = 5, ITI13 = 5,
> ITI14 = 5, ITI15 = 5, ITI16 = 5, ITI17 = 5, ITI18 = 5, ITI19 = 5,
> ITI20 = 5, ITI21 = 5, ITI22 = 5, ITI23 = 5, ITI24 = 5, ITI25 = 5,
> ITI26 = 5, ITI27 = 5, ITI28 = 5, Shock1 = 0, Shock2 = 0,
> Shock3 = 0, Shock4 = 0, Shock5 = 0, Shock6 = 0, Shock7 = 0,
> Shock8 = 0, Shock9 = 0, Shock10 = 0, Shock11 = 0, Shock12 = 0,
> Shock13 = 0, Shock14 = 0, Shock15 = 0, Shock16 = 0, Shock17 = 0,
> Shock18 = 0, Shock19 = 0, Shock20 = 0, Shock21 = 0, Shock22 = 0,
> Shock23 = 0, Shock24 = 0, Shock25 = 0, Shock26 = 0, Shock27 = 0,
> Shock28 = 0, Delay1 = 1102, Delay2 = 993, Delay3 = 446, Delay4 = 613,
> Delay5 = 649, Delay6 = 333, Delay7 = 342, Delay8 = 366, Delay9 = 360,
> Delay10 = 307, Delay11 = 372, Delay12 = 335, Delay13 = 328,
> Delay14 = 296, Delay15 = 521, Delay16 = 393, Delay17 = 491,
> Delay18 = 467, Delay19 = 401, Delay20 = 483, Delay21 = 312,
> Delay22 = 311, Delay23 = 274, Delay24 = 348, Delay25 = 422,
> Delay26 = 305, Delay27 = 637, Delay28 = 429, Hold1 = 1203,
> Hold2 = 598, Hold3 = 1209, Hold4 = 1373, Hold5 = 1170, Hold6 = 1442,
> Hold7 = 2192, Hold8 = 1802, Hold9 = 1891, Hold10 = 1880,
> Hold11 = 1204, Hold12 = 1597, Hold13 = 809, Hold14 = 848,
> Hold15 = 1328, Hold16 = 767, Hold17 = 1053, Hold18 = 1648,
> Hold19 = 1365, Hold20 = 1889, Hold21 = 1452, Hold22 = 1468,
> Hold23 = 1595, Hold24 = 2060, Hold25 = 1213, Hold26 = 1060,
> Hold27 = 745, Hold28 = 1110, RTdelay1 = 800, RTdelay2 = 251,
> RTdelay3 = 422, RTdelay4 = 264, RTdelay5 = 397, RTdelay6 = 472,
> RTdelay7 = 225, RTdelay8 = 228, RTdelay9 = 430, RTdelay10 = 527,
> RTdelay11 = 244, RTdelay12 = 747, RTdelay13 = 269, RTdelay14 = 330,
> RTdelay15 = 400, RTdelay16 = 401, RTdelay17 = 394, RTdelay18 = 364,
> RTdelay19 = 210, RTdelay20 = 415, RTdelay21 = 267, RTdelay22 = 248,
> RTdelay23 = 209, RTdelay24 = 277, RTdelay25 = 498, RTdelay26 = 663,
> RTdelay27 = 331, RTdelay28 = 494, RTheld1 = 2318, RTheld2 = 2039,
> RTheld3 = 2594, RTheld4 = 2061, RTheld5 = 1501, RTheld6 = 1070,
> RTheld7 = 2492, RTheld8 = 1532, RTheld9 = 2034, RTheld10 = 2338,
> RTheld11 = 1095, RTheld12 = 2227, RTheld13 = 2402, RTheld14 = 1057,
> RTheld15 = 1718, RTheld16 = 1789, RTheld17 = 1611, RTheld18 = 1824,
> RTheld19 = 1582, RTheld20 = 2749, RTheld21 = 1407, RTheld22 = 1780,
> RTheld23 = 1103, RTheld24 = 1513, RTheld25 = 1562, RTheld26 = 2283,
> RTheld27 = 2722, RTheld28 = 2703, RT1 = 284, RT2 = 228, RT3 = 226,
> RT4 = 186, RT5 = 208, RT6 = 223, RT7 = 206, RT8 = 189, RT9 = 229,
> RT10 = 198, RT11 = 224, RT12 = 203, RT13 = 199, RT14 = 224,
> RT15 = 220, RT16 = 208, RT17 = 270, RT18 = 188, RT19 = 205,
> RT20 = 191, RT21 = 190, RT22 = 183, RT23 = 193, RT24 = 176,
> RT25 = 195, RT26 = 196, RT27 = 185, RT28 = 160, Feedback1 = 2,
> Feedback2 = 2, Feedback3 = 3, Feedback4 = 3, Feedback5 = 2,
> Feedback6 = 3, Feedback7 = 4, Feedback8 = 5, Feedback9 = 5,
> Feedback10 = 6, Feedback11 = 5, Feedback12 = 6, Feedback13 = 6,
> Feedback14 = 7, Feedback15 = 9, Feedback16 = 8, Feedback17 = 9,
> Feedback18 = 8, Feedback19 = 8, Feedback20 = 9, Feedback21 = 20,
> Feedback22 = 9, Feedback23 = 8, Feedback24 = 9, Feedback25 = 8,
> Feedback26 = 8, Feedback27 = 9, Feedback28 = 8, OnOff1 = 1,
> OnOff2 = 0, OnOff3 = 0, OnOff4 = 1, OnOff5 = 0, OnOff6 = 1,
> OnOff7 = 0, OnOff8 = 0, OnOff9 = 1, OnOff10 = 1, OnOff11 = 0,
> OnOff12 = 1, OnOff13 = 0, OnOff14 = 1, OnOff15 = 1, OnOff16 = 0,
> OnOff17 = 1, OnOff18 = 0, OnOff19 = 

[R] Data conversion

2019-09-18 Thread Patzelt, Edward
Hi R Help,

How would I convert the data below so that I have it formatted with trials
along the rows and then each type of measure separately? e.g.,
Subject RT OnOff Feedback
Trial_1
Trial_2
Trial_3
Trial_4

Thanks!

Edward





structure(list(TAP_ID = "967372   ", TAP_Date = NA_real_, TAP_Time = 29700,
TAP_Study = "  ", SexOfTarget = "M",
SexOfSubj = "M", OperatorName = "", LowThresh = 220,
HighThresh = 1320, Trial1 = 1, Trial2 = 2, Trial3 = 3, Trial4 = 4,
Trial5 = 5, Trial6 = 6, Trial7 = 7, Trial8 = 8, Trial9 = 9,
Trial10 = 10, Trial11 = 11, Trial12 = 12, Trial13 = 13, Trial14 = 14,
Trial15 = 15, Trial16 = 16, Trial17 = 17, Trial18 = 18, Trial19 = 19,
Trial20 = 20, Trial21 = 21, Trial22 = 22, Trial23 = 23, Trial24 = 24,
Trial25 = 25, Trial26 = 26, Trial27 = 27, Trial28 = 28, ITI1 = 5,
ITI2 = 5, ITI3 = 5, ITI4 = 5, ITI5 = 5, ITI6 = 5, ITI7 = 5,
ITI8 = 5, ITI9 = 5, ITI10 = 5, ITI11 = 5, ITI12 = 5, ITI13 = 5,
ITI14 = 5, ITI15 = 5, ITI16 = 5, ITI17 = 5, ITI18 = 5, ITI19 = 5,
ITI20 = 5, ITI21 = 5, ITI22 = 5, ITI23 = 5, ITI24 = 5, ITI25 = 5,
ITI26 = 5, ITI27 = 5, ITI28 = 5, Shock1 = 0, Shock2 = 0,
Shock3 = 0, Shock4 = 0, Shock5 = 0, Shock6 = 0, Shock7 = 0,
Shock8 = 0, Shock9 = 0, Shock10 = 0, Shock11 = 0, Shock12 = 0,
Shock13 = 0, Shock14 = 0, Shock15 = 0, Shock16 = 0, Shock17 = 0,
Shock18 = 0, Shock19 = 0, Shock20 = 0, Shock21 = 0, Shock22 = 0,
Shock23 = 0, Shock24 = 0, Shock25 = 0, Shock26 = 0, Shock27 = 0,
Shock28 = 0, Delay1 = 1102, Delay2 = 993, Delay3 = 446, Delay4 = 613,
Delay5 = 649, Delay6 = 333, Delay7 = 342, Delay8 = 366, Delay9 = 360,
Delay10 = 307, Delay11 = 372, Delay12 = 335, Delay13 = 328,
Delay14 = 296, Delay15 = 521, Delay16 = 393, Delay17 = 491,
Delay18 = 467, Delay19 = 401, Delay20 = 483, Delay21 = 312,
Delay22 = 311, Delay23 = 274, Delay24 = 348, Delay25 = 422,
Delay26 = 305, Delay27 = 637, Delay28 = 429, Hold1 = 1203,
Hold2 = 598, Hold3 = 1209, Hold4 = 1373, Hold5 = 1170, Hold6 = 1442,
Hold7 = 2192, Hold8 = 1802, Hold9 = 1891, Hold10 = 1880,
Hold11 = 1204, Hold12 = 1597, Hold13 = 809, Hold14 = 848,
Hold15 = 1328, Hold16 = 767, Hold17 = 1053, Hold18 = 1648,
Hold19 = 1365, Hold20 = 1889, Hold21 = 1452, Hold22 = 1468,
Hold23 = 1595, Hold24 = 2060, Hold25 = 1213, Hold26 = 1060,
Hold27 = 745, Hold28 = 1110, RTdelay1 = 800, RTdelay2 = 251,
RTdelay3 = 422, RTdelay4 = 264, RTdelay5 = 397, RTdelay6 = 472,
RTdelay7 = 225, RTdelay8 = 228, RTdelay9 = 430, RTdelay10 = 527,
RTdelay11 = 244, RTdelay12 = 747, RTdelay13 = 269, RTdelay14 = 330,
RTdelay15 = 400, RTdelay16 = 401, RTdelay17 = 394, RTdelay18 = 364,
RTdelay19 = 210, RTdelay20 = 415, RTdelay21 = 267, RTdelay22 = 248,
RTdelay23 = 209, RTdelay24 = 277, RTdelay25 = 498, RTdelay26 = 663,
RTdelay27 = 331, RTdelay28 = 494, RTheld1 = 2318, RTheld2 = 2039,
RTheld3 = 2594, RTheld4 = 2061, RTheld5 = 1501, RTheld6 = 1070,
RTheld7 = 2492, RTheld8 = 1532, RTheld9 = 2034, RTheld10 = 2338,
RTheld11 = 1095, RTheld12 = 2227, RTheld13 = 2402, RTheld14 = 1057,
RTheld15 = 1718, RTheld16 = 1789, RTheld17 = 1611, RTheld18 = 1824,
RTheld19 = 1582, RTheld20 = 2749, RTheld21 = 1407, RTheld22 = 1780,
RTheld23 = 1103, RTheld24 = 1513, RTheld25 = 1562, RTheld26 = 2283,
RTheld27 = 2722, RTheld28 = 2703, RT1 = 284, RT2 = 228, RT3 = 226,
RT4 = 186, RT5 = 208, RT6 = 223, RT7 = 206, RT8 = 189, RT9 = 229,
RT10 = 198, RT11 = 224, RT12 = 203, RT13 = 199, RT14 = 224,
RT15 = 220, RT16 = 208, RT17 = 270, RT18 = 188, RT19 = 205,
RT20 = 191, RT21 = 190, RT22 = 183, RT23 = 193, RT24 = 176,
RT25 = 195, RT26 = 196, RT27 = 185, RT28 = 160, Feedback1 = 2,
Feedback2 = 2, Feedback3 = 3, Feedback4 = 3, Feedback5 = 2,
Feedback6 = 3, Feedback7 = 4, Feedback8 = 5, Feedback9 = 5,
Feedback10 = 6, Feedback11 = 5, Feedback12 = 6, Feedback13 = 6,
Feedback14 = 7, Feedback15 = 9, Feedback16 = 8, Feedback17 = 9,
Feedback18 = 8, Feedback19 = 8, Feedback20 = 9, Feedback21 = 20,
Feedback22 = 9, Feedback23 = 8, Feedback24 = 9, Feedback25 = 8,
Feedback26 = 8, Feedback27 = 9, Feedback28 = 8, OnOff1 = 1,
OnOff2 = 0, OnOff3 = 0, OnOff4 = 1, OnOff5 = 0, OnOff6 = 1,
OnOff7 = 0, OnOff8 = 0, OnOff9 = 1, OnOff10 = 1, OnOff11 = 0,
OnOff12 = 1, OnOff13 = 0, OnOff14 = 1, OnOff15 = 1, OnOff16 = 0,
OnOff17 = 1, OnOff18 = 0, OnOff19 = 1, OnOff20 = 0, OnOff21 = 0,
OnOff22 = 1, OnOff23 = 0, OnOff24 = 1, OnOff25 = 0, OnOff26 = 1,
OnOff27 = 0, OnOff28 = 1), class = "data.frame", row.names = c(NA,
-1L), variable.labels = c(TAP_ID = "", TAP_Date = "", TAP_Time = "",
TAP_Study = "", SexOfTarget = "", SexOfSubj = "", OperatorName = "",
LowThresh = "", HighThresh = "", Trial1 = "Trial number 1", Trial2 = "Trial
number 2",
Trial3 = "Trial number 3", Trial4 = "Trial number 4", Trial5 = "Trial
number 5",
Trial6 = "Trial number 6", 

Re: [R] Data frame organization

2019-08-27 Thread Arnaud Mosnier
Aaaah finally !!! Thanks a lot !!!

Arnaud

Le lun. 26 août 2019 18 h 28, Jim Lemon  a écrit :

> Hi Arnaud,
> The reason I wrote the following function is that it always takes me
> half a dozen tries with "reshape" before I get the syntax right:
>
> amdf<-read.table(text="A   10
> B   5
> C   9
> A   5
> B   15
> C   20")
> library(prettyR)
> stretch_df(amdf,"V1","V2")
>  V1 V2_1 V2_2
> 1  A   10    5
> 2  B    5   15
> 3  C    9   20
>
> Jim
>
> On Tue, Aug 27, 2019 at 4:06 AM Arnaud Mosnier 
> wrote:
> >
> > Hi,
> >
> > I have a really simple question.
> > I need to convert a data.frame with the following format
> >
> > A   10
> > B   5
> > C   9
> > A   5
> > B   15
> > C   20
> >
> > in this format
> >
> > A   10   5
> > B   5    15
> > C   9    20
> >
> > Thanks !!!
> >
> > [[alternative HTML version deleted]]
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data frame organization

2019-08-26 Thread Jim Lemon
Hi Arnaud,
The reason I wrote the following function is that it always takes me
half a dozen tries with "reshape" before I get the syntax right:

amdf<-read.table(text="A   10
B   5
C   9
A   5
B   15
C   20")
library(prettyR)
stretch_df(amdf,"V1","V2")
 V1 V2_1 V2_2
1  A   10    5
2  B    5   15
3  C    9   20

Jim
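[Editor's note] A base-R route that needs no extra package is to number each occurrence of the key with `ave()` and then widen with `reshape()`; a minimal sketch on the same data:

```r
amdf <- data.frame(V1 = c("A", "B", "C", "A", "B", "C"),
                   V2 = c(10, 5, 9, 5, 15, 20))

# Number the occurrences within each key, then pivot to one row per key.
amdf$occ <- ave(amdf$V2, amdf$V1, FUN = seq_along)
wide <- reshape(amdf, idvar = "V1", timevar = "occ", direction = "wide")
# wide:
#   V1 V2.1 V2.2
# 1  A   10    5
# 2  B    5   15
# 3  C    9   20
```

The occurrence counter plays the role of `reshape`'s "time" variable, which is the part that is easy to get wrong on the first try.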

On Tue, Aug 27, 2019 at 4:06 AM Arnaud Mosnier  wrote:
>
> Hi,
>
> I have a really simple question.
> I need to convert a data.frame with the following format
>
> A   10
> B   5
> C   9
> A   5
> B   15
> C   20
>
> in this format
>
> A   10   5
> B   5    15
> C   9    20
>
> Thanks !!!
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Data Frame Organization

2019-08-26 Thread Sam Charya via R-help
 There is some issue with the plain text vs. HTML - please find the answer 
again. If illegible kindly see the attached pic.
Best Wishes.
s.

x <- c('A', 'B', 'C', 'A', 'B', 'C')
y <- c(10, 5, 9, 5, 15, 20)
df <- data.frame(x,y)
df
f <- reshape(df, v.names = "y", idvar = "x", timevar = "y", direction = "wide")
RESULT:
> f
  x y.10 y.5 y.9 y.15 y.20
1 A   10   5  NA   NA   NA
2 B   NA   5  NA   15   NA
3 C   NA  NA   9   NA   20
  
__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data frame organization

2019-08-26 Thread Sam Charya via R-help
 Dear Arnaud,
I just played around with your data a bit and found this to be useful. But 
kindly note that I am NO expert like the other people in the group. My answer 
to you is purely for help purposes. My knowledge in R too is limited. I used 
the reshape function and arrived at something. I am sure others will arrive at 
a better and more crisp answer than I have. Again please note: I am only a 
novice.
x <- c('A', 'B', 'C', 'A', 'B', 'C')
y <- c(10, 5, 9, 5, 15, 20)
df <- data.frame(x,y)
df
f <- reshape(df, v.names = "y", idvar = "x", timevar = "y", direction = "wide")
RESULT:
> f
  x y.10 y.5 y.9 y.15 y.20
1 A   10   5  NA   NA   NA
2 B   NA   5  NA   15   NA
3 C   NA  NA   9   NA   20

Hope this is of any use. 
Kind Regards,
s. 

On Monday, 26 August 2019, 11:37:13 pm GMT+5:30, Arnaud Mosnier 
 wrote:  
 
 Hi,

I have a really simple question.
I need to convert a data.frame with the following format

A  10
B  5
C  9
A  5
B  15
C  20

in this format

A  10  5
B  5    15
C  9    20

Thanks !!!

    [[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
  
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Data frame organization

2019-08-26 Thread Arnaud Mosnier
Hi,

I have a really simple question.
I need to convert a data.frame with the following format

A   10
B   5
C   9
A   5
B   15
C   20

in this format

A   10   5
B   5    15
C   9    20

Thanks !!!

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Downloading R Data

2019-04-14 Thread Jim Lemon
Hi Spencer,
Just download it to your R working directory and:

load("GBM_data.Rdata")

Worked okay for me (all 53.9 Mb)

Jim
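[Editor's note] `load()` returns the names of the objects it restored, which is handy for inspecting an unfamiliar `.RData` file; loading into a fresh environment also keeps it from silently overwriting objects in the workspace. A sketch with a small self-made file (standing in for the thread's `GBM_Data.RData`):

```r
# Save a small example workspace to load back.
x <- 1:10; y <- letters[1:3]
save(x, y, file = "example.RData")

# Load into a separate environment and list what came in, with classes.
e <- new.env()
loaded <- load("example.RData", envir = e)
sapply(loaded, function(nm) class(get(nm, envir = e)))
# x: "integer", y: "character"
```

Omitting `envir` loads the objects straight into the global environment, which is what Jim's one-liner does.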

On Mon, Apr 15, 2019 at 8:39 AM Spencer Brackett
 wrote:
>
>   I am also looking to be able to read this file on an appropriate
> application. As of now, it’s too large to view directly in GoogleDrive or
> word, and I can only get a mistranslated version of the script included as
> a .txt file.
>
>
>
> [image: File]
> GBM_Data.RData
> 
>
> Best,
>
> Spencer
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Downloading R Data

2019-04-14 Thread Daniel Nordlund

On 4/14/2019 3:36 PM, Spencer Brackett wrote:

   I am also looking to be able to read this file on an appropriate
application. As of now, it’s too large to view directly in GoogleDrive or
word, and I can only get a mistranslated version of the script included as
a .txt file.



[image: File]
GBM_Data.RData


Best,

Spencer

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Spencer,

this looks like a saved R workspace.  I went to the provided link using 
Firefox and clicked on the icon to download a file.  Once the file was 
downloaded, I started the RGui and entered the following command


load(file.choose())

This opened a window in which I could browse to the file and load it 
into R.  Also, I could double-click on the file and RStudio would load 
the file into the workspace.


If you wish to do something else, you will need to be more specific 
about what you want.



Hope this is helpful,

Dan

--
Daniel Nordlund
Port Townsend, WA  USA

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Downloading R Data

2019-04-14 Thread Spencer Brackett
  I am also looking to be able to read this file on an appropriate
application. As of now, it’s too large to view directly in GoogleDrive or
word, and I can only get a mistranslated version of the script included as
a .txt file.



[image: File]
GBM_Data.RData


Best,

Spencer

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data frame solution

2019-03-20 Thread Izmirlian, Grant (NIH/NCI) [E] via R-help
Statements like c(rbind(x, xx+yy), max(t)) and rep(0,length(df$b[1])) don't 
make any sense. Your example will be easier to understand if you show us the 
nrow(df) ==3
case. Thanks


Grant Izmirlian, Ph.D.
Mathematical Statistician
izmir...@mail.nih.gov

Delivery Address:
9609 Medical Center Dr, RM 5E130
Rockville MD 20850

Postal Address:
BG 9609 RM 5E130 MSC 9789
9609 Medical Center Dr
Bethesda, MD 20892-9789

 ofc:  240-276-7025
 cell: 240-888-7367
  fax: 240-276-7845



From: Andras Farkas 
Sent: Tuesday, March 19, 2019 7:06 AM
To: R-help Mailing List
Subject: [R] data frame solution

Hello All,

wonder if you have thoughts on a clever solution for this code:



df   <- data.frame(a = c(6,1), b = c(1000,1200), c =c(-1,3))

#the caveat here is that the number of rows for df can be anything from 1 row 
to in the hundreds. I kept it to 2 to have minimal reproducible

t<-seq(-5,24,0.1) #min(t) will always be <=df$c[1], which is the value that is 
always going to equal to min(df$c)

times1 <- c(rbind(df$c[1],df$c[1]+df$a[1]),max(t)) #length of times1 will 
always be 3, see times2 is of length 4

input1   <- c(rbind(df$b[1]/df$a[1],rep(0,length(df$b[1]))),0) #length of 
input1 will always be 3, see input2 is of length 4
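The times1/input1 pair above describes a piecewise-constant input; a hedged sketch (not the poster's exact code) of the same idea using stats::stepfun(), which pairs n breakpoints with n+1 levels directly:

```r
# Reconstruct the first row's step input: 0 before df$c[1], the rate
# df$b[1]/df$a[1] between the breakpoints, and 0 afterwards.
df <- data.frame(a = c(6, 1), b = c(1000, 1200), c = c(-1, 3))
t  <- seq(-5, 24, 0.1)

f1 <- stepfun(c(df$c[1], df$c[1] + df$a[1]),   # breakpoints: -1 and 5
              c(0, df$b[1] / df$a[1], 0))      # levels: before / between / after
out1 <- data.frame(t, input = f1(t))
```

stepfun() sidesteps the nested ifelse chain entirely and generalizes row by row, so the same construction works however many rows df has.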

out1 <- data.frame(t, ifelse(t >= times1[1] ... ))  # [nested ifelse over the
# times1/times2 breakpoints; the comparison operators were lost in the archive]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] data frame solution

2019-03-19 Thread Andras Farkas via R-help
Hello All,

wonder if you have thoughts on a clever solution for this code:



df       <- data.frame(a = c(6,1), b = c(1000,1200), c =c(-1,3)) 

#the caveat here is that the number of rows for df can be anything from 1 row 
to in the hundreds. I kept it to 2 to have minimal reproducible

t<-seq(-5,24,0.1) #min(t) will always be <=df$c[1], which is the value that is 
always going to equal to min(df$c)

times1 <- c(rbind(df$c[1],df$c[1]+df$a[1]),max(t)) #length of times1 will 
always be 3, see times2 is of length 4

input1   <- c(rbind(df$b[1]/df$a[1],rep(0,length(df$b[1]))),0) #length of 
input1 will always be 3, see input2 is of length 4

out1 <- data.frame(t, ifelse(t >= times1[1] ... ))  # [nested ifelse over the
# times1/times2 breakpoints; the comparison operators were lost in the archive]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R Data

2019-02-14 Thread Fowler, Mark
I am not sure I would use the word ‘accounted’, more like discounted (tossed 
out).

From: Spencer Brackett 
Sent: February 14, 2019 9:21 AM
To: Fowler, Mark 
Cc: R-help ; Sarah Goslee ; 
Caitlin Gibbons ; Jeff Newmiller 

Subject: Re: R Data

Mr. Fowler,

Thank you! This information is most helpful. So, from my understanding, I can 
use the regression coefficients shown (via the coding I originally sent) to 
generate a continuous distribution with what is essentially a line of best fit? 
The data added here had some 30,000 variables (it is genomic data from TCGA); 
does this mean that any non-NA data is being accounted for in said 
distribution?

Best,

Spencer Brackett



On Thursday, February 14, 2019, Fowler, Mark 
mailto:mark.fow...@dfo-mpo.gc.ca>> wrote:
Hi Spencer,

The an1 syntax is adding regression coefficients (or NAs where a regression 
could not be done) to the downloaded and processed data, which ends up a 
matrix. The cbind function adds the regression coefficients to the last column 
of the matrix (i.e. bind the columns of the inputs in the order given). Simple 
example below. Not actually any need for the separate cbind commands, could 
have just used an1=cbind(an,p,t). The cbind function expects all the columns to 
be of the same length, hence the use of the tryCatch function to capture NA's 
for failed regression attempts, ensuring that p and t correspond row by row 
with the matrix.

 x=seq(1,5)
 y=seq(6,10)
 z=seq(1,5)
xyz=cbind(x,y,z)
xyz
   x  y z
[1,] 1  6 1
[2,] 2  7 2
[3,] 3  8 3
[4,] 4  9 4
[5,] 5 10 5
dangs=rep(NA,5)
xyzd=cbind(xyz,dangs)
xyzd
     x  y z dangs
[1,] 1  6 1    NA
[2,] 2  7 2    NA
[3,] 3  8 3    NA
[4,] 4  9 4    NA
[5,] 5 10 5    NA
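The tryCatch-to-NA pattern described above can be seen in isolation with a toy sketch (not the thread's regression code): a failing computation yields NA instead of aborting the loop, so the result keeps one entry per input and lines up row by row with the matrix.

```r
# The second element is non-numeric, so mean() warns; the handlers
# convert both warnings and errors into NA instead of stopping sapply().
vals <- sapply(list(1:5, "oops"), function(v)
  tryCatch(mean(v),
           warning = function(w) NA,
           error   = function(e) NA))
vals  # c(3, NA)
```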

-Original Message-
From: R-help 
mailto:r-help-boun...@r-project.org>> On Behalf 
Of Spencer Brackett
Sent: February 14, 2019 12:32 AM
To: R-help mailto:r-help@r-project.org>>; Sarah Goslee 
mailto:sarah.gos...@gmail.com>>; Caitlin Gibbons 
mailto:bioprogram...@gmail.com>>; Jeff Newmiller 
mailto:jdnew...@dcn.davis.ca.us>>
Subject: [R] R Data

Hello everyone,

The following is a portion of coding that a colleague sent. Given my lack of 
experience in R, I am not quite sure what the significance of the following 
arguments. Could anyone help me translate? For context, I am aware of the 
downloading portion of the script... library(data.table) etc., but am not 
familiar with the portion pertaining to an1 .

library(data.table)
anno = as.data.frame(fread(file =
"/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/450K/mapper.txt", sep ="\t", 
header = T)) meth = read.table(file = 
"/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/27K/GBM.txt", sep  ="\t", header = 
T, row.names = 1) meth = as.matrix(meth) """ the loop just formats the 
methylation column names to match format"""
colnames(meth) = sapply(colnames(meth), function(i){
  c1 = strsplit(i,split = '.', fixed = T)[[1]]
  c1[4] = paste(strsplit(c1[4],split = "",fixed = T)[[1]][1:2],collapse =
"")
  paste(c1,collapse = ".")
})
exp = read.table(file =
"/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/RNAseq/GBM.txt", sep = "\t", 
header = T, row.names = 1) exp = as.matrix(exp) c = 
intersect(colnames(exp),colnames(meth))
exp = exp[,c]
meth = meth[,c]
m = apply(meth, 1, function(i){
  log2(i/(1-i))
})
m = t(as.matrix(m))
an = anno[anno$probe %in% rownames(m),]
an = an[an$gene %in% rownames(exp),]
an = an[an$location %in% c("TSS200","TSS1500"),]

p = apply(an,1,function(i){
  tryCatch(summary(lm(exp[as.character(i[2]),] ~ 
m[as.character(i[1]),]))$coefficient[2,4], error= function(e)NA)
})
t = apply(an,1,function(i){
  tryCatch(summary(lm(exp[as.character(i[2]),] ~ 
m[as.character(i[1]),]))$coefficient[2,3], error= function(e)NA)
})
an1 =cbind(an,p)
an1 = cbind(an1,t)
an1$q = p.adjust(as.numeric(an1$p))
summary(lm(exp["MAOB",] ~ m["cg00121904",]$coefficient[2,c(3:4)]
###

[[alternative HTML version deleted]]

__
R-help@r-project.org<mailto:R-help@r-project.org> mailing list -- To 
UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R Data

2019-02-14 Thread Spencer Brackett
Mr. Fowler,

Thank you! This information is most helpful. So, from my understanding, I
can use the regression coefficients shown (via the coding I originally
sent) to generate a continuous distribution with what is essentially a line
of best fit? The data added here had some 30,000 variables (it is genomic
data from TCGA); does this mean that any non-NA data is being accounted
for in said distribution?

Best,

Spencer Brackett



On Thursday, February 14, 2019, Fowler, Mark 
wrote:

> Hi Spencer,
>
> The an1 syntax is adding regression coefficients (or NAs where a
> regression could not be done) to the downloaded and processed data, which
> ends up a matrix. The cbind function adds the regression coefficients to
> the last column of the matrix (i.e. bind the columns of the inputs in the
> order given). Simple example below. Not actually any need for the separate
> cbind commands, could have just used an1=cbind(an,p,t). The cbind function
> expects all the columns to be of the same length, hence the use of the
> tryCatch function to capture NA's for failed regression attempts, ensuring
> that p and t correspond row by row with the matrix.
>
>  x=seq(1,5)
>  y=seq(6,10)
>  z=seq(1,5)
> xyz=cbind(x,y,z)
> xyz
>x  y z
> [1,] 1  6 1
> [2,] 2  7 2
> [3,] 3  8 3
> [4,] 4  9 4
> [5,] 5 10 5
> dangs=rep(NA,5)
> xyzd=cbind(xyz,dangs)
> xyzd
>      x  y z dangs
> [1,] 1  6 1    NA
> [2,] 2  7 2    NA
> [3,] 3  8 3    NA
> [4,] 4  9 4    NA
> [5,] 5 10 5    NA
>
> -Original Message-
> From: R-help  On Behalf Of Spencer Brackett
> Sent: February 14, 2019 12:32 AM
> To: R-help ; Sarah Goslee ;
> Caitlin Gibbons ; Jeff Newmiller <
> jdnew...@dcn.davis.ca.us>
> Subject: [R] R Data
>
> Hello everyone,
>
> The following is a portion of coding that a colleague sent. Given my lack
> of experience in R, I am not quite sure what the significance of the
> following arguments. Could anyone help me translate? For context, I am
> aware of the downloading portion of the script... library(data.table) etc.,
> but am not familiar with the portion pertaining to an1 .
>
> library(data.table)
> anno = as.data.frame(fread(file =
> "/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/450K/mapper.txt", sep ="\t",
> header = T)) meth = read.table(file = "/rsrch1/bcb/kchen_group/v_
> mohanty/data/TCGA/27K/GBM.txt", sep  ="\t", header = T, row.names = 1)
> meth = as.matrix(meth) """ the loop just formats the methylation column
> names to match format"""
> colnames(meth) = sapply(colnames(meth), function(i){
>   c1 = strsplit(i,split = '.', fixed = T)[[1]]
>   c1[4] = paste(strsplit(c1[4],split = "",fixed = T)[[1]][1:2],collapse =
> "")
>   paste(c1,collapse = ".")
> })
> exp = read.table(file =
> "/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/RNAseq/GBM.txt", sep = "\t",
> header = T, row.names = 1) exp = as.matrix(exp) c = intersect(colnames(exp),
> colnames(meth))
> exp = exp[,c]
> meth = meth[,c]
> m = apply(meth, 1, function(i){
>   log2(i/(1-i))
> })
> m = t(as.matrix(m))
> an = anno[anno$probe %in% rownames(m),]
> an = an[an$gene %in% rownames(exp),]
> an = an[an$location %in% c("TSS200","TSS1500"),]
>
> p = apply(an,1,function(i){
>   tryCatch(summary(lm(exp[as.character(i[2]),] ~ 
> m[as.character(i[1]),]))$coefficient[2,4],
> error= function(e)NA)
> })
> t = apply(an,1,function(i){
>   tryCatch(summary(lm(exp[as.character(i[2]),] ~ 
> m[as.character(i[1]),]))$coefficient[2,3],
> error= function(e)NA)
> })
> an1 =cbind(an,p)
> an1 = cbind(an1,t)
> an1$q = p.adjust(as.numeric(an1$p))
> summary(lm(exp["MAOB",] ~ m["cg00121904",]$coefficient[2,c(3:4)]
> ###
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/
> posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R Data

2019-02-14 Thread Fowler, Mark
Hi Spencer,

The an1 syntax is adding regression coefficients (or NAs where a regression 
could not be done) to the downloaded and processed data, which ends up a 
matrix. The cbind function adds the regression coefficients to the last column 
of the matrix (i.e. bind the columns of the inputs in the order given). Simple 
example below. Not actually any need for the separate cbind commands, could 
have just used an1=cbind(an,p,t). The cbind function expects all the columns to 
be of the same length, hence the use of the tryCatch function to capture NA's 
for failed regression attempts, ensuring that p and t correspond row by row 
with the matrix.

 x=seq(1,5)
 y=seq(6,10)
 z=seq(1,5)
xyz=cbind(x,y,z)
xyz
   x  y z
[1,] 1  6 1
[2,] 2  7 2
[3,] 3  8 3
[4,] 4  9 4
[5,] 5 10 5
dangs=rep(NA,5)
xyzd=cbind(xyz,dangs)
xyzd
     x  y z dangs
[1,] 1  6 1    NA
[2,] 2  7 2    NA
[3,] 3  8 3    NA
[4,] 4  9 4    NA
[5,] 5 10 5    NA

-Original Message-
From: R-help  On Behalf Of Spencer Brackett
Sent: February 14, 2019 12:32 AM
To: R-help ; Sarah Goslee ; 
Caitlin Gibbons ; Jeff Newmiller 

Subject: [R] R Data

Hello everyone,

The following is a portion of coding that a colleague sent. Given my lack of 
experience in R, I am not quite sure what the significance of the following 
arguments. Could anyone help me translate? For context, I am aware of the 
downloading portion of the script... library(data.table) etc., but am not 
familiar with the portion pertaining to an1 .

library(data.table)
anno = as.data.frame(fread(file =
"/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/450K/mapper.txt", sep ="\t", 
header = T)) meth = read.table(file = 
"/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/27K/GBM.txt", sep  ="\t", header = 
T, row.names = 1) meth = as.matrix(meth) """ the loop just formats the 
methylation column names to match format"""
colnames(meth) = sapply(colnames(meth), function(i){
  c1 = strsplit(i,split = '.', fixed = T)[[1]]
  c1[4] = paste(strsplit(c1[4],split = "",fixed = T)[[1]][1:2],collapse =
"")
  paste(c1,collapse = ".")
})
exp = read.table(file =
"/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/RNAseq/GBM.txt", sep = "\t", 
header = T, row.names = 1) exp = as.matrix(exp) c = 
intersect(colnames(exp),colnames(meth))
exp = exp[,c]
meth = meth[,c]
m = apply(meth, 1, function(i){
  log2(i/(1-i))
})
m = t(as.matrix(m))
an = anno[anno$probe %in% rownames(m),]
an = an[an$gene %in% rownames(exp),]
an = an[an$location %in% c("TSS200","TSS1500"),]

p = apply(an,1,function(i){
  tryCatch(summary(lm(exp[as.character(i[2]),] ~ 
m[as.character(i[1]),]))$coefficient[2,4], error= function(e)NA)
})
t = apply(an,1,function(i){
  tryCatch(summary(lm(exp[as.character(i[2]),] ~ 
m[as.character(i[1]),]))$coefficient[2,3], error= function(e)NA)
})
an1 =cbind(an,p)
an1 = cbind(an1,t)
an1$q = p.adjust(as.numeric(an1$p))
summary(lm(exp["MAOB",] ~ m["cg00121904",]$coefficient[2,c(3:4)]
###

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see 
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] R Data

2019-02-13 Thread Spencer Brackett
Hello everyone,

The following is a portion of coding that a colleague sent. Given my lack
of experience in R, I am not quite sure what the significance of the
following arguments. Could anyone help me translate? For context, I am
aware of the downloading portion of the script... library(data.table) etc.,
but am not familiar with the portion pertaining to an1 .

library(data.table)
anno = as.data.frame(fread(file =
"/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/450K/mapper.txt", sep ="\t",
header = T))
meth = read.table(file =
"/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/27K/GBM.txt", sep  ="\t",
header = T, row.names = 1)
meth = as.matrix(meth)
""" the loop just formats the methylation column names to match format"""
colnames(meth) = sapply(colnames(meth), function(i){
  c1 = strsplit(i,split = '.', fixed = T)[[1]]
  c1[4] = paste(strsplit(c1[4],split = "",fixed = T)[[1]][1:2],collapse =
"")
  paste(c1,collapse = ".")
})
exp = read.table(file =
"/rsrch1/bcb/kchen_group/v_mohanty/data/TCGA/RNAseq/GBM.txt", sep = "\t",
header = T, row.names = 1)
exp = as.matrix(exp)
c = intersect(colnames(exp),colnames(meth))
exp = exp[,c]
meth = meth[,c]
m = apply(meth, 1, function(i){
  log2(i/(1-i))
})
m = t(as.matrix(m))
an = anno[anno$probe %in% rownames(m),]
an = an[an$gene %in% rownames(exp),]
an = an[an$location %in% c("TSS200","TSS1500"),]

p = apply(an,1,function(i){
  tryCatch(summary(lm(exp[as.character(i[2]),] ~
m[as.character(i[1]),]))$coefficient[2,4], error= function(e)NA)
})
t = apply(an,1,function(i){
  tryCatch(summary(lm(exp[as.character(i[2]),] ~
m[as.character(i[1]),]))$coefficient[2,3], error= function(e)NA)
})
an1 =cbind(an,p)
an1 = cbind(an1,t)
an1$q = p.adjust(as.numeric(an1$p))
summary(lm(exp["MAOB",] ~ m["cg00121904",]$coefficient[2,c(3:4)]
###

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data transformation

2019-01-20 Thread Jeff Newmiller
There is no "perhaps" about it. Nonsense phrases like "similar to logit, where 
I dont [sic] lose normality of the data" that lead into off-topic discussions 
of why one introduces transformations in the first place are perfect examples 
of why questions like this belong on a statistical theory discussion forum like 
StackExchange rather than here where the topic is the R language.

On January 20, 2019 6:02:15 AM PST, Adrian Johnson  
wrote:
>Dear group,
>My question, perhaps is more of a statistical question using R
>I have a data matrix ( 400 x 400 normally distributed) with data
>points ranging from -1 to +1..
>For certain clustering algorithms, I suspect the tight data range is
>not helping resolving the clusters.
>
>Is there a way to transform the data something similar to logit, where
>I dont lose normality of the data and yet I can better expand the data
>ranges.
>
>Thanks
>Adrian
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

-- 
Sent from my phone. Please excuse my brevity.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data transformation

2019-01-20 Thread Richard M. Heiberger
this might work for you

newy <- sign(oldy)*f(abs(oldy))

where f() is a monotonic transformation, perhaps a power function.
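A concrete sketch of this sign-preserving transform; the square root is an arbitrary illustrative choice of monotonic f(), which spreads out values near zero while keeping the sign (and hence the distinct meanings of the negative and positive ranges) intact:

```r
# Transform magnitudes with f(x) = sqrt(x), then restore the sign,
# so negative and positive values are stretched independently.
oldy <- c(-0.9, -0.1, 0, 0.2, 0.8)
newy <- sign(oldy) * sqrt(abs(oldy))
newy  # signs unchanged; small magnitudes pushed away from zero
```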

On Sun, Jan 20, 2019 at 11:08 AM Adrian Johnson
 wrote:
>
> I apologize,  I forgot to mention another key operation.
> in my matrix -1 to <0 has a different meaning while values between >0
> to 1 has a different set of meaning.  So If I do logit transformation
> some of the positives becomes negative (values < 0.5 etc.). In such
> case, the resulting transformed matrix is incorrect.
>
> I want to transform numbers ranging from -1 to <0   and numbers
> between >0 and 1 independently.
>
> Thanks
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data transformation

2019-01-20 Thread David L Carlson
I don't think you have given us enough information. For example, is the 400x400 
matrix a distance matrix, or does it represent 400 columns of information about 
400 rows of observations? If a distance matrix, how is distance being measured? 
Your clarification suggests it may be a distance matrix of correlation 
coefficients? If distance has different meanings between -1 and 0 and 0 and +1, 
getting interpretable results from cluster analysis will be difficult, but it 
is not clear what you mean by that.

-
David L. Carlson
Department of Anthropology
Texas A&M University

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Adrian Johnson
Sent: Sunday, January 20, 2019 8:02 AM
To: r-help 
Subject: [R] data transformation

Dear group,
My question, perhaps is more of a statistical question using R
I have a data matrix ( 400 x 400 normally distributed) with data
points ranging from -1 to +1..
For certain clustering algorithms, I suspect the tight data range is
not helping resolving the clusters.

Is there a way to transform the data something similar to logit, where
I dont lose normality of the data and yet I can better expand the data
ranges.

Thanks
Adrian

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Adrian Johnson
Sent: Sunday, January 20, 2019 10:08 AM
To: r-help 
Subject: Re: [R] data transformation

I apologize,  I forgot to mention another key operation.
in my matrix -1 to <0 has a different meaning while values between >0
to 1 has a different set of meaning.  So If I do logit transformation
some of the positives becomes negative (values < 0.5 etc.). In such
case, the resulting transformed matrix is incorrect.

I want to transform numbers ranging from -1 to <0   and numbers
between >0 and 1 independently.

Thanks

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data transformation

2019-01-20 Thread Adrian Johnson
I apologize,  I forgot to mention another key operation.
in my matrix -1 to <0 has a different meaning while values between >0
to 1 has a different set of meaning.  So If I do logit transformation
some of the positives becomes negative (values < 0.5 etc.). In such
case, the resulting transformed matrix is incorrect.

I want to transform numbers ranging from -1 to <0   and numbers
between >0 and 1 independently.

Thanks

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] data transformation

2019-01-20 Thread Adrian Johnson
Dear group,
My question, perhaps is more of a statistical question using R
I have a data matrix ( 400 x 400 normally distributed) with data
points ranging from -1 to +1..
For certain clustering algorithms, I suspect the tight data range is
not helping resolving the clusters.

Is there a way to transform the data something similar to logit, where
I dont lose normality of the data and yet I can better expand the data
ranges.

Thanks
Adrian

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data frame transformation

2019-01-07 Thread Andras Farkas via R-help
Thanks Bert, this will do...
Andras

Sent from Yahoo Mail on Android 
 
On Sun, Jan 6, 2019 at 1:09 PM, Bert Gunter wrote:
... and my reordering of column indices was unnecessary:
    merge(dat, d, all.y = TRUE)
will do.
Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jan 6, 2019 at 5:16 AM Andras Farkas via R-help  
wrote:

Hello Everyone,

would you be able to assist with some expertise on how to get the following 
done in a way that can be applied to a data set with different dimensions and 
without all the line items here?

we have:

id<-c(1,1,1,2,2,2,2,3,3,4,4,4,4,5,5,5,5)#length of unique IDs may differ of 
course in real data set, usually in magnitude of 1
letter<-c(sample(c("A","B","C","D","E"),3),sample(c("A","B","C","D","E"),4),sample(c("A","B","C","D","E"),2),
          
sample(c("A","B","C","D","E"),4),sample(c("A","B","C","D","E"),4))#number of 
unique "letters" is less than 4000 in real data set and they are no duplicates 
within same ID
weight<-c(sample(c(1:30),3),sample(c(1:30),4),sample(c(1:30),2),
          sample(c(1:30),4),sample(c(1:30),4))#number of unique weights is 
below 50 in real data set and they are no duplicates within same ID


data<-data.frame(id=id,letter=letter,weight=weight)

#goal is to get the following transformation where a column is added for each 
unique letter and the weight is pulled into the column if the letter exist 
within the ID, otherwise NA
#so we would get datatransform like below but without the many steps described 
here

datatransfer<-data.frame(data,apply(data[2],2,function(x) 
ifelse(x=="A",data$weight,NA)))
datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x) 
ifelse(x=="B",data$weight,NA)))
datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x) 
ifelse(x=="C",data$weight,NA)))
datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x) 
ifelse(x=="D",data$weight,NA)))
datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x) 
ifelse(x=="E",data$weight,NA)))

colnames(datatransfer)<-c("id","weight","letter","A","B","C","D","E")
much appreciate the help,

thanks

Andras 
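The five repeated apply() calls in the question collapse into a single vectorized pass over the letters actually present. A hedged sketch with a small hard-coded data frame (the sample()-based data above is random, so fixed values are used here for reproducibility):

```r
# One sapply() over the unique letters builds every per-letter weight
# column at once: the weight where the letter matches, NA elsewhere.
data <- data.frame(id     = c(1, 1, 2, 2),
                   letter = c("A", "B", "A", "C"),
                   weight = c(10, 20, 30, 40))

cols <- sapply(sort(unique(data$letter)),
               function(L) ifelse(data$letter == L, data$weight, NA))
datatransfer <- data.frame(data, cols)
```

Because the set of letters is taken from the data, this adapts automatically to any number of rows or distinct letters, with no hand-written column list.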

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

  

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data frame transformation

2019-01-06 Thread Bert Gunter
... and my reordering of column indices was unnecessary:
merge(dat, d, all.y = TRUE)
will do.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jan 6, 2019 at 5:16 AM Andras Farkas via R-help <
r-help@r-project.org> wrote:

> Hello Everyone,
>
> would you be able to assist with some expertise on how to get the
> following done in a way that can be applied to a data set with different
> dimensions and without all the line items here?
>
> we have:
>
> id<-c(1,1,1,2,2,2,2,3,3,4,4,4,4,5,5,5,5)#length of unique IDs may differ
> of course in real data set, usually in magnitude of 1
>
> letter<-c(sample(c("A","B","C","D","E"),3),sample(c("A","B","C","D","E"),4),sample(c("A","B","C","D","E"),2),
>
> sample(c("A","B","C","D","E"),4),sample(c("A","B","C","D","E"),4))#number
> of unique "letters" is less than 4000 in real data set and they are no
> duplicates within same ID
> weight<-c(sample(c(1:30),3),sample(c(1:30),4),sample(c(1:30),2),
>   sample(c(1:30),4),sample(c(1:30),4))#number of unique weights is
> below 50 in real data set and they are no duplicates within same ID
>
>
> data<-data.frame(id=id,letter=letter,weight=weight)
>
> #goal is to get the following transformation where a column is added for
> each unique letter and the weight is pulled into the column if the letter
> exist within the ID, otherwise NA
> #so we would get datatransform like below but without the many steps
> described here
>
> datatransfer<-data.frame(data,apply(data[2],2,function(x)
> ifelse(x=="A",data$weight,NA)))
> datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="B",data$weight,NA)))
> datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="C",data$weight,NA)))
> datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="D",data$weight,NA)))
> datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="E",data$weight,NA)))
>
> colnames(datatransfer)<-c("id","weight","letter","A","B","C","D","E")
> much appreciate the help,
>
> thanks
>
> Andras
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data frame transformation

2019-01-06 Thread Bert Gunter
Like this (using base R only)?

dat<-data.frame(id=id,letter=letter,weight=weight) # using your data

ud <- unique(dat$id)
ul = unique(dat$letter)
d <- with(dat,
  data.frame(
  letter = rep(ul, e = length(ud)),
  id = rep(ud, length(ul))
  ) )

 merge(dat[,c(2,1,3)],d, all.y = TRUE)
## resulting in:

   letter id weight
1       A  1     25
2       A  2     28
3       A  3     14
4       A  4     27
5       A  5     NA
6       B  1     13
7       B  2     14
8       B  3     NA
9       B  4     15
10      B  5      2
11      C  1     NA
12      C  2     NA
13      C  3     NA
14      C  4     NA
15      C  5     25
16      D  1     24
17      D  2     18
18      D  3     NA
19      D  4     29
20      D  5     27
21      E  1     NA
22      E  2      2
23      E  3     20
24      E  4     25
25      E  5     28


Cheers,

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Sun, Jan 6, 2019 at 5:16 AM Andras Farkas via R-help <
r-help@r-project.org> wrote:

> Hello Everyone,
>
> would you be able to assist with some expertise on how to get the
> following done in a way that can be applied to a data set with different
> dimensions and without all the line items here?
>
> we have:
>
> id<-c(1,1,1,2,2,2,2,3,3,4,4,4,4,5,5,5,5)#length of unique IDs may differ
> of course in real data set, usually in magnitude of 1
>
> letter<-c(sample(c("A","B","C","D","E"),3),sample(c("A","B","C","D","E"),4),sample(c("A","B","C","D","E"),2),
>
> sample(c("A","B","C","D","E"),4),sample(c("A","B","C","D","E"),4))#number
> of unique "letters" is less than 4000 in real data set and they are no
> duplicates within same ID
> weight<-c(sample(c(1:30),3),sample(c(1:30),4),sample(c(1:30),2),
>   sample(c(1:30),4),sample(c(1:30),4))#number of unique weights is
> below 50 in real data set and they are no duplicates within same ID
>
>
> data<-data.frame(id=id,letter=letter,weight=weight)
>
> #goal is to get the following transformation where a column is added for
> each unique letter and the weight is pulled into the column if the letter
> exist within the ID, otherwise NA
> #so we would get datatransform like below but without the many steps
> described here
>
> datatransfer<-data.frame(data,apply(data[2],2,function(x)
> ifelse(x=="A",data$weight,NA)))
> datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="B",data$weight,NA)))
> datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="C",data$weight,NA)))
> datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="D",data$weight,NA)))
> datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="E",data$weight,NA)))
>
> colnames(datatransfer)<-c("id","weight","letter","A","B","C","D","E")
> much appreciate the help,
>
> thanks
>
> Andras
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data frame transformation

2019-01-06 Thread K. Elo
Hi!

Maybe this would do the trick:

--- snip ---

library(reshape2) # Use 'reshape2'
library(dplyr)    # Use 'dplyr'

datatransfer<-data %>% mutate(letter2=letter) %>% 
  dcast(id+letter~letter2, value.var="weight")

--- snip ---
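[Editor's note: with the newer tidyr interface, the same spread can be written with `pivot_wider()` (a sketch, assuming tidyr >= 1.0.0 is installed; `pivot_wider()` is the successor to `dcast()`-style reshaping and is not part of the original reply):

```r
library(dplyr)
library(tidyr)

# Duplicate the letter column first so the original 'letter' survives the pivot,
# mirroring the mutate(letter2 = letter) trick above
datatransfer <- data %>%
  mutate(letter2 = letter) %>%
  pivot_wider(names_from = letter2, values_from = weight)
```

Rows with a missing letter/weight combination get NA automatically.]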

Or did I misunderstand something?

Best,

Kimmo

2019-01-06, 13:16 +, Andras Farkas via R-help wrote:
> Hello Everyone,
> 
> would you be able to assist with some expertise on how to get the
> following done in a way that can be applied to a data set with
> different dimensions and without all the line items here?
> 
> we have:
> 
> id<-c(1,1,1,2,2,2,2,3,3,4,4,4,4,5,5,5,5)#length of unique IDs may
> differ of course in real data set, usually in magnitude of 1
> letter<-
> c(sample(c("A","B","C","D","E"),3),sample(c("A","B","C","D","E"),4),s
> ample(c("A","B","C","D","E"),2),
>  
> sample(c("A","B","C","D","E"),4),sample(c("A","B","C","D","E"),4))#nu
> mber of unique "letters" is less than 4000 in real data set and they
> are no duplicates within same ID
> weight<-c(sample(c(1:30),3),sample(c(1:30),4),sample(c(1:30),2),
>   sample(c(1:30),4),sample(c(1:30),4))#number of unique
> weights is below 50 in real data set and they are no duplicates
> within same ID
> 
> 
> data<-data.frame(id=id,letter=letter,weight=weight)
> 
> #goal is to get the following transformation where a column is added
> for each unique letter and the weight is pulled into the column if
> the letter exist within the ID, otherwise NA
> #so we would get datatransform like below but without the many steps
> described here
> 
> datatransfer<-data.frame(data,apply(data[2],2,function(x)
> ifelse(x=="A",data$weight,NA)))
> datatransfer<-
> data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="B",data$weight,NA)))
> datatransfer<-
> data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="C",data$weight,NA)))
> datatransfer<-
> data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="D",data$weight,NA)))
> datatransfer<-
> data.frame(datatransfer,apply(datatransfer[2],2,function(x)
> ifelse(x=="E",data$weight,NA)))
> 
> colnames(datatransfer)<-c("id","weight","letter","A","B","C","D","E")
> much appreciate the help,
> 
> thanks
> 
> Andras 
> 
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] data frame transformation

2019-01-06 Thread Andras Farkas via R-help
Hello Everyone,

would you be able to assist with some expertise on how to get the following 
done in a way that can be applied to a data set with different dimensions and 
without all the line items here?

we have:

id<-c(1,1,1,2,2,2,2,3,3,4,4,4,4,5,5,5,5)#length of unique IDs may differ of 
course in real data set, usually in magnitude of 1
letter<-c(sample(c("A","B","C","D","E"),3),sample(c("A","B","C","D","E"),4),sample(c("A","B","C","D","E"),2),
          
sample(c("A","B","C","D","E"),4),sample(c("A","B","C","D","E"),4))#number of 
unique "letters" is less than 4000 in real data set and they are no duplicates 
within same ID
weight<-c(sample(c(1:30),3),sample(c(1:30),4),sample(c(1:30),2),
          sample(c(1:30),4),sample(c(1:30),4))#number of unique weights is 
below 50 in the real data set and there are no duplicates within the same ID


data<-data.frame(id=id,letter=letter,weight=weight)

#goal is to get the following transformation where a column is added for each 
unique letter and the weight is pulled into the column if the letter exists 
within the ID, otherwise NA
#so we would get datatransfer like below but without the many steps described 
here

datatransfer<-data.frame(data,apply(data[2],2,function(x) 
ifelse(x=="A",data$weight,NA)))
datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x) 
ifelse(x=="B",data$weight,NA)))
datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x) 
ifelse(x=="C",data$weight,NA)))
datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x) 
ifelse(x=="D",data$weight,NA)))
datatransfer<-data.frame(datatransfer,apply(datatransfer[2],2,function(x) 
ifelse(x=="E",data$weight,NA)))

colnames(datatransfer)<-c("id","weight","letter","A","B","C","D","E")
much appreciate the help,

thanks

Andras 
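[Editor's note: the five repeated `apply()` calls above can be collapsed into a single step that loops over whatever letters actually occur, so the code adapts to any number of letters. A base-R sketch, not taken from the thread:

```r
# One column per unique letter; the weight is copied in where the row's
# letter matches, otherwise NA
newcols <- sapply(sort(unique(data$letter)),
                  function(l) ifelse(data$letter == l, data$weight, NA))
datatransfer <- cbind(data, newcols)
```

`sapply()` returns a matrix whose column names are the letters, so no manual `colnames()` call is needed.]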

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] r-data partitioning considering two variables (character and numeric)

2018-08-27 Thread Ahmed Attia
Thanks Bert, worked nicely. Yes, genotypes with only one ID will be
eliminated before partitioning the data.


Best regards

Ahmed Attia






On Mon, Aug 27, 2018 at 8:09 PM, Bert Gunter  wrote:
> Just partition the unique stand_ID's and select on them using %in% , say:
>
> id <- unique(dataGenotype$stand_ID)
> tst <- sample(id, floor(length(id)/2))
> wh <- dataGenotype$stand_ID %in% tst ## logical vector
> test<- dataGenotype[wh,]
> train <- dataGenotype[!wh,]
>
> There are a million variations on this theme I'm sure.
>
> -- Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Aug 27, 2018 at 3:54 PM Ahmed Attia  wrote:
>>
>> I would like to partition the following dataset (dataGenotype) based
>> on two variables; Genotype and stand_ID, for example, for Genotype
>> H13: stand_ID number 7 may go to training and stand_ID number 18 and
>> 21 may go to testing.
>>
>> Genotypestand_IDInventory_date  stemC   mheight
>> H13 75/18/2006  1940.1075   11.33995
>> H13 711/1/2008  10898.9597  23.20395
>> H13 74/14/2009  12830.1284  23.77395
>> H131811/3/2005  2726.42 13.4432
>> H13186/30/2008  12226.1554  24.091967
>> H13184/14/2009  14141.6825.0922
>> H13215/18/2006  4981.7158   15.7173
>> H13214/14/2009  20327.0667  27.9155
>> H159 3/31/2006  3570.06 14.7898
>> H159 11/1/2008  15138.8383  26.2088
>> H159 4/14/2009  17035.4688  26.8778
>> H15   20 1/18/2005  3016.88114.1886
>> H15   2010/4/2006   8330.4688   20.19425
>> H15   206/30/2008   13576.5 25.4774
>> H15   322/1/20063426.2525   14.31815
>> U21   3 1/9/20063660.41615.09925
>> U21   3 6/30/2008   13236.2924.27634
>> U21   3 4/14/2009   16124.192   25.79562
>> U21   6711/4/2005   2812.8425   13.60485
>> U21   674/14/2009   13468.455   24.6203
>>
>> And the desired output is the following;
>>
>> A-training
>>
>> Genotypestand_IDInventory_date  stemC   mheight
>> H137 5/18/2006  1940.1075   11.33995
>> H137 11/1/2008  10898.9597  23.20395
>> H137 4/14/2009  12830.1284  23.77395
>> H159 3/31/2006  3570.06 14.7898
>> H159 11/1/2008  15138.8383  26.2088
>> H159 4/14/2009  17035.4688  26.8778
>> U216711/4/2005  2812.8425   13.60485
>> U21674/14/2009  13468.455   24.6203
>>
>> B-testing
>>
>> Genotypestand_IDInventory_date  stemC   mheight
>> H13 18   11/3/2005  2726.42 13.4432
>> H13 18   6/30/2008  12226.1554  24.091967
>> H13 18   4/14/2009  14141.6825.0922
>> H13 21   5/18/2006  4981.7158   15.7173
>> H13 21   4/14/2009  20327.0667  27.9155
>> H15 20   1/18/2005  3016.88114.1886
>> H15 20   10/4/2006  8330.4688   20.19425
>> H15 20   6/30/2008  13576.5 25.4774
>> H15 32   2/1/2006   3426.2525   14.31815
>> U21 31/9/2006   3660.41615.09925
>> U21 36/30/2008  13236.2924.27634
>> U21 34/14/2009  16124.192   25.79562
>>
>> I tried the following code;
>>
>> library(caret)
>> dataPartitioning <-
>> createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
>> train = dataGenotype[dataPartitioning,]
>> test = dataGenotype[-dataPartitioning,]
>>
>> Also tried
>>
>> createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)
>>
>> It did not produce the desired output, the data are partitioned within
>> the stand_ID. For example, one row of stand_ID 7 goes to training and
>> two rows of stand_ID 7 go to testing. How can I partition the data by
>> Genotype and stand_ID together?.
>>
>>
>>
>> Ahmed Attia
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] r-data partitioning considering two variables (character and numeric)

2018-08-27 Thread Bert Gunter
Sorry, my bad -- careless reading: you need to do the partitioning within
genotype.
Something like:

by(dataGenotype, dataGenotype$Genotype, function(x){

  u <- unique(x$stand_ID)

  tst <- x$stand_ID %in% sample(u, floor(length(u)/2))

  list(test = x[tst, ], train = x[!tst, ])

})


This will give a list each component of which will split the Genotype into
test and train dataframe subsets by ID. These lists of data frames can then
be recombined into a single test and train dataframe by, e.g. an
appropriate rbind() call.
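[Editor's note: the recombination step could look like this (a sketch, assuming the result of the `by()` call above has been saved as `res`):

```r
# res <- by(dataGenotype, dataGenotype$Genotype, ...)  # as above

# Stack the per-genotype pieces back into single test/train data frames
test  <- do.call(rbind, lapply(res, `[[`, "test"))
train <- do.call(rbind, lapply(res, `[[`, "train"))
rownames(test) <- rownames(train) <- NULL
```
]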


HOWEVER, note that you will need to modify this function to decide what to
do if/when there is only one ID in a Genotype, as Don MacQueen already
pointed out.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Aug 27, 2018 at 4:09 PM Bert Gunter  wrote:

> Just partition the unique stand_ID's and select on them using %in% , say:
>
> id <- unique(dataGenotype$stand_ID)
> tst <- sample(id, floor(length(id)/2))
> wh <- dataGenotype$stand_ID %in% tst ## logical vector
> test<- dataGenotype[wh,]
> train <- dataGenotype[!wh,]
>
> There are a million variations on this theme I'm sure.
>
> -- Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Aug 27, 2018 at 3:54 PM Ahmed Attia  wrote:
>
>> I would like to partition the following dataset (dataGenotype) based
>> on two variables; Genotype and stand_ID, for example, for Genotype
>> H13: stand_ID number 7 may go to training and stand_ID number 18 and
>> 21 may go to testing.
>>
>> Genotypestand_IDInventory_date  stemC   mheight
>> H13 75/18/2006  1940.1075   11.33995
>> H13 711/1/2008  10898.9597  23.20395
>> H13 74/14/2009  12830.1284  23.77395
>> H131811/3/2005  2726.42 13.4432
>> H13186/30/2008  12226.1554  24.091967
>> H13184/14/2009  14141.6825.0922
>> H13215/18/2006  4981.7158   15.7173
>> H13214/14/2009  20327.0667  27.9155
>> H159 3/31/2006  3570.06 14.7898
>> H159 11/1/2008  15138.8383  26.2088
>> H159 4/14/2009  17035.4688  26.8778
>> H15   20 1/18/2005  3016.88114.1886
>> H15   2010/4/2006   8330.4688   20.19425
>> H15   206/30/2008   13576.5 25.4774
>> H15   322/1/20063426.2525   14.31815
>> U21   3 1/9/20063660.41615.09925
>> U21   3 6/30/2008   13236.2924.27634
>> U21   3 4/14/2009   16124.192   25.79562
>> U21   6711/4/2005   2812.8425   13.60485
>> U21   674/14/2009   13468.455   24.6203
>>
>> And the desired output is the following;
>>
>> A-training
>>
>> Genotypestand_IDInventory_date  stemC   mheight
>> H137 5/18/2006  1940.1075   11.33995
>> H137 11/1/2008  10898.9597  23.20395
>> H137 4/14/2009  12830.1284  23.77395
>> H159 3/31/2006  3570.06 14.7898
>> H159 11/1/2008  15138.8383  26.2088
>> H159 4/14/2009  17035.4688  26.8778
>> U216711/4/2005  2812.8425   13.60485
>> U21674/14/2009  13468.455   24.6203
>>
>> B-testing
>>
>> Genotypestand_IDInventory_date  stemC   mheight
>> H13 18   11/3/2005  2726.42 13.4432
>> H13 18   6/30/2008  12226.1554  24.091967
>> H13 18   4/14/2009  14141.6825.0922
>> H13 21   5/18/2006  4981.7158   15.7173
>> H13 21   4/14/2009  20327.0667  27.9155
>> H15 20   1/18/2005  3016.88114.1886
>> H15 20   10/4/2006  8330.4688   20.19425
>> H15 20   6/30/2008  13576.5 25.4774
>> H15 32   2/1/2006   3426.2525   14.31815
>> U21 31/9/2006   3660.41615.09925
>> U21 36/30/2008  13236.2924.27634
>> U21 34/14/2009  16124.192   25.79562
>>
>> I tried the following code;
>>
>> library(caret)
>> dataPartitioning <-
>> createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
>> train = dataGenotype[dataPartitioning,]
>> test = dataGenotype[-dataPartitioning,]
>>
>> Also tried
>>
>> createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)
>>
>> It did not produce the desired output, the data are partitioned within
>> the stand_ID. For example, one row of stand_ID 7 goes to training and
>> two rows of stand_ID 7 go to testing. How can I partition the data by
>> Genotype and stand_ID together?.
>>
>>
>>
>> Ahmed Attia
>>
>> 

Re: [R] r-data partitioning considering two variables (character and numeric)

2018-08-27 Thread MacQueen, Don via R-help
And yes, I ignored Genotype, but for the example data none of the stand_ID 
values are present in more than one Genotype, so it doesn't matter. If that's 
not true in general, then constructing the grp variable is a little more 
complex, but the principle is the same.

--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
Lab cell 925-724-7509
 
 

On 8/27/18, 4:10 PM, "R-help on behalf of MacQueen, Don via R-help" 
 wrote:

You could start with split()

grp <- rep('', nrow(mydata) )
grp[mydata$stand_ID %in% c(7,9,67)] <- 'A-training'
grp[mydata$stand_ID %in% c(3,18,20,21,32)] <- 'B-testing'

split(mydata, grp)

or perhaps

grp <- ifelse(  mydata$stand_ID %in% c(7,9,67) , 'A-training', 'B-testing' )
split(mydata, grp)

-Don

--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
Lab cell 925-724-7509
 
 

On 8/27/18, 3:54 PM, "R-help on behalf of Ahmed Attia" 
 wrote:

I would like to partition the following dataset (dataGenotype) based
on two variables; Genotype and stand_ID, for example, for Genotype
H13: stand_ID number 7 may go to training and stand_ID number 18 and
21 may go to testing.

Genotypestand_IDInventory_date  stemC   mheight
H13 75/18/2006  1940.1075   11.33995
H13 711/1/2008  10898.9597  23.20395
H13 74/14/2009  12830.1284  23.77395
H131811/3/2005  2726.42 13.4432
H13186/30/2008  12226.1554  24.091967
H13184/14/2009  14141.6825.0922
H13215/18/2006  4981.7158   15.7173
H13214/14/2009  20327.0667  27.9155
H159 3/31/2006  3570.06 14.7898
H159 11/1/2008  15138.8383  26.2088
H159 4/14/2009  17035.4688  26.8778
H15   20 1/18/2005  3016.88114.1886
H15   2010/4/2006   8330.4688   20.19425
H15   206/30/2008   13576.5 25.4774
H15   322/1/20063426.2525   14.31815
U21   3 1/9/20063660.41615.09925
U21   3 6/30/2008   13236.2924.27634
U21   3 4/14/2009   16124.192   25.79562
U21   6711/4/2005   2812.8425   13.60485
U21   674/14/2009   13468.455   24.6203

And the desired output is the following;

A-training

Genotypestand_IDInventory_date  stemC   mheight
H137 5/18/2006  1940.1075   11.33995
H137 11/1/2008  10898.9597  23.20395
H137 4/14/2009  12830.1284  23.77395
H159 3/31/2006  3570.06 14.7898
H159 11/1/2008  15138.8383  26.2088
H159 4/14/2009  17035.4688  26.8778
U216711/4/2005  2812.8425   13.60485
U21674/14/2009  13468.455   24.6203

B-testing

Genotypestand_IDInventory_date  stemC   mheight
H13 18   11/3/2005  2726.42 13.4432
H13 18   6/30/2008  12226.1554  24.091967
H13 18   4/14/2009  14141.6825.0922
H13 21   5/18/2006  4981.7158   15.7173
H13 21   4/14/2009  20327.0667  27.9155
H15 20   1/18/2005  3016.88114.1886
H15 20   10/4/2006  8330.4688   20.19425
H15 20   6/30/2008  13576.5 25.4774
H15 32   2/1/2006   3426.2525   14.31815
U21 31/9/2006   3660.41615.09925
U21 36/30/2008  13236.2924.27634
U21 34/14/2009  16124.192   25.79562

I tried the following code;

library(caret)
dataPartitioning <- 
createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
train = dataGenotype[dataPartitioning,]
test = dataGenotype[-dataPartitioning,]

Also tried

createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)

It did not produce the desired output, the data are partitioned within
the stand_ID. For example, one row of stand_ID 7 goes to training and
two rows of stand_ID 7 go to testing. How can I partition the data by
Genotype and stand_ID together?.



Ahmed Attia


Re: [R] r-data partitioning considering two variables (character and numeric)

2018-08-27 Thread MacQueen, Don via R-help
You could start with split()

grp <- rep('', nrow(mydata) )
grp[mydata$stand_ID %in% c(7,9,67)] <- 'A-training'
grp[mydata$stand_ID %in% c(3,18,20,21,32)] <- 'B-testing'

split(mydata, grp)

or perhaps

grp <- ifelse(  mydata$stand_ID %in% c(7,9,67) , 'A-training', 'B-testing' )
split(mydata, grp)
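[Editor's note: to choose the held-out stand_IDs at random rather than hard-coding them, the group labels themselves could be sampled first (a sketch, not part of the original reply):

```r
ids <- unique(mydata$stand_ID)
test_ids <- sample(ids, floor(length(ids) / 2))   # half the stands to testing

grp <- ifelse(mydata$stand_ID %in% test_ids, 'B-testing', 'A-training')
split(mydata, grp)
```
]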

-Don

--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
Lab cell 925-724-7509
 
 

On 8/27/18, 3:54 PM, "R-help on behalf of Ahmed Attia" 
 wrote:

I would like to partition the following dataset (dataGenotype) based
on two variables; Genotype and stand_ID, for example, for Genotype
H13: stand_ID number 7 may go to training and stand_ID number 18 and
21 may go to testing.

Genotypestand_IDInventory_date  stemC   mheight
H13 75/18/2006  1940.1075   11.33995
H13 711/1/2008  10898.9597  23.20395
H13 74/14/2009  12830.1284  23.77395
H131811/3/2005  2726.42 13.4432
H13186/30/2008  12226.1554  24.091967
H13184/14/2009  14141.6825.0922
H13215/18/2006  4981.7158   15.7173
H13214/14/2009  20327.0667  27.9155
H159 3/31/2006  3570.06 14.7898
H159 11/1/2008  15138.8383  26.2088
H159 4/14/2009  17035.4688  26.8778
H15   20 1/18/2005  3016.88114.1886
H15   2010/4/2006   8330.4688   20.19425
H15   206/30/2008   13576.5 25.4774
H15   322/1/20063426.2525   14.31815
U21   3 1/9/20063660.41615.09925
U21   3 6/30/2008   13236.2924.27634
U21   3 4/14/2009   16124.192   25.79562
U21   6711/4/2005   2812.8425   13.60485
U21   674/14/2009   13468.455   24.6203

And the desired output is the following;

A-training

Genotypestand_IDInventory_date  stemC   mheight
H137 5/18/2006  1940.1075   11.33995
H137 11/1/2008  10898.9597  23.20395
H137 4/14/2009  12830.1284  23.77395
H159 3/31/2006  3570.06 14.7898
H159 11/1/2008  15138.8383  26.2088
H159 4/14/2009  17035.4688  26.8778
U216711/4/2005  2812.8425   13.60485
U21674/14/2009  13468.455   24.6203

B-testing

Genotypestand_IDInventory_date  stemC   mheight
H13 18   11/3/2005  2726.42 13.4432
H13 18   6/30/2008  12226.1554  24.091967
H13 18   4/14/2009  14141.6825.0922
H13 21   5/18/2006  4981.7158   15.7173
H13 21   4/14/2009  20327.0667  27.9155
H15 20   1/18/2005  3016.88114.1886
H15 20   10/4/2006  8330.4688   20.19425
H15 20   6/30/2008  13576.5 25.4774
H15 32   2/1/2006   3426.2525   14.31815
U21 31/9/2006   3660.41615.09925
U21 36/30/2008  13236.2924.27634
U21 34/14/2009  16124.192   25.79562

I tried the following code;

library(caret)
dataPartitioning <- 
createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
train = dataGenotype[dataPartitioning,]
test = dataGenotype[-dataPartitioning,]

Also tried

createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)

It did not produce the desired output, the data are partitioned within
the stand_ID. For example, one row of stand_ID 7 goes to training and
two rows of stand_ID 7 go to testing. How can I partition the data by
Genotype and stand_ID together?.



Ahmed Attia

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] r-data partitioning considering two variables (character and numeric)

2018-08-27 Thread Bert Gunter
Just partition the unique stand_ID's and select on them using %in% , say:

id <- unique(dataGenotype$stand_ID)
tst <- sample(id, floor(length(id)/2))
wh <- dataGenotype$stand_ID %in% tst ## logical vector
test<- dataGenotype[wh,]
train <- dataGenotype[!wh,]

There are a million variations on this theme I'm sure.

-- Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Aug 27, 2018 at 3:54 PM Ahmed Attia  wrote:

> I would like to partition the following dataset (dataGenotype) based
> on two variables; Genotype and stand_ID, for example, for Genotype
> H13: stand_ID number 7 may go to training and stand_ID number 18 and
> 21 may go to testing.
>
> Genotypestand_IDInventory_date  stemC   mheight
> H13 75/18/2006  1940.1075   11.33995
> H13 711/1/2008  10898.9597  23.20395
> H13 74/14/2009  12830.1284  23.77395
> H131811/3/2005  2726.42 13.4432
> H13186/30/2008  12226.1554  24.091967
> H13184/14/2009  14141.6825.0922
> H13215/18/2006  4981.7158   15.7173
> H13214/14/2009  20327.0667  27.9155
> H159 3/31/2006  3570.06 14.7898
> H159 11/1/2008  15138.8383  26.2088
> H159 4/14/2009  17035.4688  26.8778
> H15   20 1/18/2005  3016.88114.1886
> H15   2010/4/2006   8330.4688   20.19425
> H15   206/30/2008   13576.5 25.4774
> H15   322/1/20063426.2525   14.31815
> U21   3 1/9/20063660.41615.09925
> U21   3 6/30/2008   13236.2924.27634
> U21   3 4/14/2009   16124.192   25.79562
> U21   6711/4/2005   2812.8425   13.60485
> U21   674/14/2009   13468.455   24.6203
>
> And the desired output is the following;
>
> A-training
>
> Genotypestand_IDInventory_date  stemC   mheight
> H137 5/18/2006  1940.1075   11.33995
> H137 11/1/2008  10898.9597  23.20395
> H137 4/14/2009  12830.1284  23.77395
> H159 3/31/2006  3570.06 14.7898
> H159 11/1/2008  15138.8383  26.2088
> H159 4/14/2009  17035.4688  26.8778
> U216711/4/2005  2812.8425   13.60485
> U21674/14/2009  13468.455   24.6203
>
> B-testing
>
> Genotypestand_IDInventory_date  stemC   mheight
> H13 18   11/3/2005  2726.42 13.4432
> H13 18   6/30/2008  12226.1554  24.091967
> H13 18   4/14/2009  14141.6825.0922
> H13 21   5/18/2006  4981.7158   15.7173
> H13 21   4/14/2009  20327.0667  27.9155
> H15 20   1/18/2005  3016.88114.1886
> H15 20   10/4/2006  8330.4688   20.19425
> H15 20   6/30/2008  13576.5 25.4774
> H15 32   2/1/2006   3426.2525   14.31815
> U21 31/9/2006   3660.41615.09925
> U21 36/30/2008  13236.2924.27634
> U21 34/14/2009  16124.192   25.79562
>
> I tried the following code;
>
> library(caret)
> dataPartitioning <-
> createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
> train = dataGenotype[dataPartitioning,]
> test = dataGenotype[-dataPartitioning,]
>
> Also tried
>
> createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)
>
> It did not produce the desired output, the data are partitioned within
> the stand_ID. For example, one row of stand_ID 7 goes to training and
> two rows of stand_ID 7 go to testing. How can I partition the data by
> Genotype and stand_ID together?.
>
>
>
> Ahmed Attia
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] r-data partitioning considering two variables (character and numeric)

2018-08-27 Thread Ahmed Attia
I would like to partition the following dataset (dataGenotype) based
on two variables; Genotype and stand_ID, for example, for Genotype
H13: stand_ID number 7 may go to training and stand_ID number 18 and
21 may go to testing.

Genotype  stand_ID  Inventory_date  stemC       mheight
H13        7        5/18/2006        1940.1075  11.33995
H13        7        11/1/2008       10898.9597  23.20395
H13        7        4/14/2009       12830.1284  23.77395
H13       18        11/3/2005        2726.42    13.4432
H13       18        6/30/2008       12226.1554  24.091967
H13       18        4/14/2009       14141.68    25.0922
H13       21        5/18/2006        4981.7158  15.7173
H13       21        4/14/2009       20327.0667  27.9155
H15        9        3/31/2006        3570.06    14.7898
H15        9        11/1/2008       15138.8383  26.2088
H15        9        4/14/2009       17035.4688  26.8778
H15       20        1/18/2005        3016.881   14.1886
H15       20        10/4/2006        8330.4688  20.19425
H15       20        6/30/2008       13576.5     25.4774
H15       32        2/1/2006         3426.2525  14.31815
U21        3        1/9/2006         3660.416   15.09925
U21        3        6/30/2008       13236.29    24.27634
U21        3        4/14/2009       16124.192   25.79562
U21       67        11/4/2005        2812.8425  13.60485
U21       67        4/14/2009       13468.455   24.6203

And the desired output is the following;

A-training

Genotype  stand_ID  Inventory_date  stemC       mheight
H13        7        5/18/2006        1940.1075  11.33995
H13        7        11/1/2008       10898.9597  23.20395
H13        7        4/14/2009       12830.1284  23.77395
H15        9        3/31/2006        3570.06    14.7898
H15        9        11/1/2008       15138.8383  26.2088
H15        9        4/14/2009       17035.4688  26.8778
U21       67        11/4/2005        2812.8425  13.60485
U21       67        4/14/2009       13468.455   24.6203

B-testing

Genotype  stand_ID  Inventory_date  stemC       mheight
H13       18        11/3/2005        2726.42    13.4432
H13       18        6/30/2008       12226.1554  24.091967
H13       18        4/14/2009       14141.68    25.0922
H13       21        5/18/2006        4981.7158  15.7173
H13       21        4/14/2009       20327.0667  27.9155
H15       20        1/18/2005        3016.881   14.1886
H15       20        10/4/2006        8330.4688  20.19425
H15       20        6/30/2008       13576.5     25.4774
H15       32        2/1/2006         3426.2525  14.31815
U21        3        1/9/2006         3660.416   15.09925
U21        3        6/30/2008       13236.29    24.27634
U21        3        4/14/2009       16124.192   25.79562

I tried the following code;

library(caret)
dataPartitioning <- createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
train = dataGenotype[dataPartitioning,]
test = dataGenotype[-dataPartitioning,]

Also tried

createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)

It did not produce the desired output, the data are partitioned within
the stand_ID. For example, one row of stand_ID 7 goes to training and
two rows of stand_ID 7 go to testing. How can I partition the data by
Genotype and stand_ID together?.



Ahmed Attia
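[Editor's note: caret itself provides a group-aware splitter, `groupKFold()`, which keeps every row of a group in the same fold (a sketch, assuming a caret version that exports `groupKFold()`; it returns lists of training-row indices):

```r
library(caret)

# Combine Genotype and stand_ID into one grouping factor, so a split
# never cuts a stand (within its genotype) across train and test
grp <- interaction(dataGenotype$Genotype, dataGenotype$stand_ID, drop = TRUE)
folds <- groupKFold(grp, k = 2)

train <- dataGenotype[folds[[1]], ]
test  <- dataGenotype[-folds[[1]], ]
```
]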

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data frame with Factor column missing data change to NA

2018-06-14 Thread Bill Poling
Jim,
Actually, I got this to work.

df$NonAcceptanceOther[df$NonAcceptanceOther==""]<- NA
df$NonAcceptanceOther

missingDF <- plot_missing(df)

# missingDF
#  featurenum_missing   pct_missinggroup
# 13 NonAcceptanceOther   26157   0.86859932257 Remove


Good to go now, for the moment, big smile!

Thank you for your help Sir.


WHP




From: Bill Poling
Sent: Thursday, June 14, 2018 6:49 AM
To: 'Jim Lemon' 
Cc: r-help (r-help@r-project.org) 
Subject: RE: [R] Data frame with Factor column missing data change to NA

#Good morning Jim, thank you for your response and guidance.


So I ran the suggested and got: Error in nchar(df2$NonAcceptanceOther) :   
'nchar()' requires a character vector

So I ran this:

df2$NonAcceptanceOther[] <- lapply(df2$NonAcceptanceOther,as.character)

#Then tried again.

#But still getting the error?

#Because the column remains a factor?
names(df2)

#[1] "PlaceOfService" "ClaimStatusID"  "NonAcceptanceOther" 
"RejectionCodeID""CPTCats""RevCodeCats""GCode2" 
"ClaimTypeID"

classes <- as.character(sapply(df2, class))
classes


#[1] "factor"  "integer" "factor"  "integer" "factor"  "factor"  "integer" 
"integer"



#Not sure if this structure helps; I guess that the 1L's are the missing values

dput(head(df2$NonAcceptanceOther, 25))

#structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 118L, 1L,

#1L, 1L, 64L, 64L, 134L, 134L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)



#View from the CSV file original data (pasted column; most cells are blank)

NonAcceptanceOther
ERS
Claim paid without PHX recommended savings
Claim paid without PHX recommended savings
MRC Amount
MRC Amount


Appreciate your help Sir.

WHP

From: Jim Lemon [mailto:drjimle...@gmail.com]
Sent: Wednesday, June 13, 2018 8:30 PM
To: Bill Poling <bill.pol...@zelis.com>
Cc: r-help (r-help@r-project.org<mailto:r-help@r-project.org>) 
mailto:r-help@r-project.org>>
Subject: Re: [R] Data frame with Factor column missing data change to NA

Hi Bill,
It may be that the NonAcceptanceOther, being a character value, has ""
(0 length string) rather than NA. You can convert that to NA like
this:

df2$NonAcceptanceOther[nchar(df2$NonAcceptanceOther) == 0]<-NA

Jim


On Thu, Jun 14, 2018 at 12:47 AM, Bill Poling 
<bill.pol...@zelis.com> wrote:
> Good morning.
>
> #I have df with a Factor column called "NonAcceptanceOther" that contains 
> missing data.
>
> #Not every record in the df is expected to have a value in this column.
>
> # Typical values look like:
> # ERS
> # Claim paid without PHX recommended savings
> # Claim paid without PHX recommended savings
> # MRC Amount
> # MRC Amount
> # PPO per provider
> #Or they are missing (blank)
>
> #Example
>
> df2 <- 
> df[,c("PlaceOfService","ClaimStatusID","NonAcceptanceOther","RejectionCodeID","CPTCats","RevCodeCats","GCode2","ClaimTypeID")]
> head(df2, n=20)
>
> PlaceOfService ClaimStatusID NonAcceptanceOther RejectionCodeID CPTCats 
> RevCodeCats GCode2 ClaimTypeID
>
> 1 11 2 NA ResPSys NotValidRevCode 2 2
>
> 2 81 3 53 PathandLab NotValidRevCode 2 2
>
> 3 11 3 47 Medicine NotValidRevCode 1 2
>
> 4 09 2 NA NotCPT NotValidRevCode 1 2
>
> 5 11 2 NA Radiology NotValidRevCode 2 2
>
> 6 23 2 NA MusculoSys NotValidRevCode 2 2
>
> 7 12 3 47 NotCPT NotValidRevCode 2 2
>
> 8 12 2 NA Medicine NotValidRevCode 2 2
>
> 9 11 3 47 Medicine NotValidRevCode 1 2
>
> 10 21 2 NA Anesthesia NotValidRevCode 2 2
>
> 11 11 3 ERS 30 EvalandMgmt NotValidRevCode 2 2
>
> 12 81 2 NA PathandLab NotValidRevCode 2 2
>
> 13 21 2 NA Radiology NotValidRevCode 1 2
>
> 14 11 2 NA Medicine NotValidRevCode 1 2
>
> 15 99 3 Claim paid without PHX recommended savings 30 CardioHemLympSys Lab 0 1
>
> 16 99 3 Claim paid without PHX recommended savings 30 PathandLab Lab 0 1
>
> 17 99 3 MRC Amount 30 NotCPT Pharma 2 1
>
> 18 99 3 MRC Amount 30 PathandLab Lab 2 1
>
> 19 81 2 NA PathandLab NotValidRevCode 2 2
>
> 20 23 2 NA IntegSys NotValidRevCode 1 2
>
> #I would like to set these missing to NA and have them reflected similarly to 
> an NA in a numeric or integer column if possible.
>
> #I have tried several approaches from Googled references:
>
> NonAcceptanceOther <- df$NonAcceptanceOther
> table(addNA(NonAcceptanceOther))
>
> is.na <- df$NonAcceptanceOther
>
> df[NonAcceptanceOther == '' | NonAcceptanceOther == 'NA'] <- NA
>
> #However, when I go to use:
>
> missingDF <- PlotMissing

Re: [R] Data frame with Factor column missing data change to NA

2018-06-14 Thread Bill Poling
#Good morning Jim, thank you for your response and guidance.


So I ran the suggested and got: Error in nchar(df2$NonAcceptanceOther) :   
'nchar()' requires a character vector

So I ran this:

df2$NonAcceptanceOther[] <- lapply(df2$NonAcceptanceOther,as.character)

#Then tried again.

#But still getting the error?

#Because the column remains a factor?
names(df2)

#[1] "PlaceOfService"  "ClaimStatusID"  "NonAcceptanceOther"  "RejectionCodeID"
#[5] "CPTCats"  "RevCodeCats"  "GCode2"  "ClaimTypeID"

classes <- as.character(sapply(df2, class))
classes


#[1] "factor"  "integer" "factor"  "integer" "factor"  "factor"  "integer"  "integer"



#Not sure if this structure helps, I guess that the 1L’s are the missing

dput(head(df2$NonAcceptanceOther, 25))

#structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 118L, 1L,

#1L, 1L, 64L, 64L, 134L, 134L, 1L, 1L, 1L, 1L, 1L, 1L, 1L)



#View from the CSV file original data


NonAcceptanceOther











ERS




Claim paid without PHX recommended savings

Claim paid without PHX recommended savings

MRC Amount

MRC Amount









Appreciate your help Sir.

WHP

From: Jim Lemon [mailto:drjimle...@gmail.com]
Sent: Wednesday, June 13, 2018 8:30 PM
To: Bill Poling 
Cc: r-help (r-help@r-project.org) 
Subject: Re: [R] Data frame with Factor column missing data change to NA

Hi Bill,
It may be that the NonAcceptanceOther, being a character value, has ""
(0 length string) rather than NA. You can convert that to NA like
this:

df2$NonAcceptanceOther[nchar(df2$NonAcceptanceOther) == 0]<-NA

Jim


On Thu, Jun 14, 2018 at 12:47 AM, Bill Poling 
<bill.pol...@zelis.com> wrote:
> Good morning.
>
> #I have df with a Factor column called "NonAcceptanceOther" that contains 
> missing data.
>
> #Not every record in the df is expected to have a value in this column.
>
> # Typical values look like:
> # ERS
> # Claim paid without PHX recommended savings
> # Claim paid without PHX recommended savings
> # MRC Amount
> # MRC Amount
> # PPO per provider
> #Or they are missing (blank)
>
> #Example
>
> df2 <- 
> df[,c("PlaceOfService","ClaimStatusID","NonAcceptanceOther","RejectionCodeID","CPTCats","RevCodeCats","GCode2","ClaimTypeID")]
> head(df2, n=20)
>
> PlaceOfService ClaimStatusID NonAcceptanceOther RejectionCodeID CPTCats 
> RevCodeCats GCode2 ClaimTypeID
>
> 1 11 2 NA ResPSys NotValidRevCode 2 2
>
> 2 81 3 53 PathandLab NotValidRevCode 2 2
>
> 3 11 3 47 Medicine NotValidRevCode 1 2
>
> 4 09 2 NA NotCPT NotValidRevCode 1 2
>
> 5 11 2 NA Radiology NotValidRevCode 2 2
>
> 6 23 2 NA MusculoSys NotValidRevCode 2 2
>
> 7 12 3 47 NotCPT NotValidRevCode 2 2
>
> 8 12 2 NA Medicine NotValidRevCode 2 2
>
> 9 11 3 47 Medicine NotValidRevCode 1 2
>
> 10 21 2 NA Anesthesia NotValidRevCode 2 2
>
> 11 11 3 ERS 30 EvalandMgmt NotValidRevCode 2 2
>
> 12 81 2 NA PathandLab NotValidRevCode 2 2
>
> 13 21 2 NA Radiology NotValidRevCode 1 2
>
> 14 11 2 NA Medicine NotValidRevCode 1 2
>
> 15 99 3 Claim paid without PHX recommended savings 30 CardioHemLympSys Lab 0 1
>
> 16 99 3 Claim paid without PHX recommended savings 30 PathandLab Lab 0 1
>
> 17 99 3 MRC Amount 30 NotCPT Pharma 2 1
>
> 18 99 3 MRC Amount 30 PathandLab Lab 2 1
>
> 19 81 2 NA PathandLab NotValidRevCode 2 2
>
> 20 23 2 NA IntegSys NotValidRevCode 1 2
>
> #I would like to set these missing to NA and have them reflected similarly to 
> an NA in a numeric or integer column if possible.
>
> #I have tried several approaches from Googled references:
>
> NonAcceptanceOther <- df$NonAcceptanceOther
> table(addNA(NonAcceptanceOther))
>
> is.na <- df$NonAcceptanceOther
>
> df[NonAcceptanceOther == '' | NonAcceptanceOther == 'NA'] <- NA
>
> #However, when I go to use:
>
> missingDF <- PlotMissing(df)
>
> #Only the columns that are numeric or integer reflect their missing values 
> (i.e. RejectionCodeID) and this "NonAcceptanceOther" column does not reflect 
> or hold the NA values?
>
> Thank you for any advice.
>
> WHP
>
>
>
>
>
>
>
>
>
>
>
>
> Confidentiality Notice This message is sent from Zelis. ...{{dropped:16}}
>
> __
> R-help@r-project.org mailing list -- To 
> UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read t

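The conversion Bill attempted above (`df2$NonAcceptanceOther[] <- lapply(...)`) left the column a factor, as his follow-up `classes` output shows. A minimal sketch of the usual idiom, on a hypothetical two-column data frame in the spirit of the posted data:

```r
# Hypothetical data: two factor columns, one containing blank ("") levels
df2 <- data.frame(
  NonAcceptanceOther = factor(c("", "ERS", "MRC Amount")),
  CPTCats            = factor(c("ResPSys", "PathandLab", "Medicine"))
)

# Convert every factor column to character in one pass
is_fac <- vapply(df2, is.factor, logical(1))
df2[is_fac] <- lapply(df2[is_fac], as.character)

# (For a single column: df2$NonAcceptanceOther <- as.character(df2$NonAcceptanceOther))

vapply(df2, class, character(1))  # both columns are now "character"
```

With the columns converted, the `nchar()`-based recoding from Jim's reply runs without the "'nchar()' requires a character vector" error.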
Re: [R] Data frame with Factor column missing data change to NA

2018-06-13 Thread Jim Lemon
Hi Bill,
It may be that the NonAcceptanceOther, being a character value, has ""
(0 length string) rather than NA. You can convert that to NA like
this:

df2$NonAcceptanceOther[nchar(df2$NonAcceptanceOther) == 0]<-NA

Jim


On Thu, Jun 14, 2018 at 12:47 AM, Bill Poling  wrote:
> Good morning.
>
> #I have df with a Factor column called "NonAcceptanceOther" that contains 
> missing data.
>
> #Not every record in the df is expected to have a value in this column.
>
> # Typical values look like:
> # ERS
> # Claim paid without PHX recommended savings
> # Claim paid without PHX recommended savings
> # MRC Amount
> # MRC Amount
> # PPO per provider
> #Or they are missing (blank)
>
> #Example
>
> df2 <- 
> df[,c("PlaceOfService","ClaimStatusID","NonAcceptanceOther","RejectionCodeID","CPTCats","RevCodeCats","GCode2","ClaimTypeID")]
> head(df2, n=20)
>
>PlaceOfService ClaimStatusID NonAcceptanceOther 
> RejectionCodeID  CPTCats RevCodeCats GCode2 ClaimTypeID
>
> 1  11 2   
>   NA  ResPSys NotValidRevCode  2   2
>
> 2  81 3   
>   53   PathandLab NotValidRevCode  2   2
>
> 3  11 3   
>   47 Medicine NotValidRevCode  1   2
>
> 4  09 2   
>   NA   NotCPT NotValidRevCode  1   2
>
> 5  11 2   
>   NARadiology NotValidRevCode  2   2
>
> 6  23 2   
>   NA   MusculoSys NotValidRevCode  2   2
>
> 7  12 3   
>   47   NotCPT NotValidRevCode  2   2
>
> 8  12 2   
>   NA Medicine NotValidRevCode  2   2
>
> 9  11 3   
>   47 Medicine NotValidRevCode  1   2
>
> 10 21 2   
>   NA   Anesthesia NotValidRevCode  2   2
>
> 11 11 3ERS
>   30  EvalandMgmt NotValidRevCode  2   2
>
> 12 81 2   
>   NA   PathandLab NotValidRevCode  2   2
>
> 13 21 2   
>   NARadiology NotValidRevCode  1   2
>
> 14 11 2   
>   NA Medicine NotValidRevCode  1   2
>
> 15 99 3 Claim paid without PHX recommended savings
>   30 CardioHemLympSys Lab  0   1
>
> 16 99 3 Claim paid without PHX recommended savings
>   30   PathandLab Lab  0   1
>
> 17 99 3 MRC Amount
>   30   NotCPT  Pharma  2   1
>
> 18 99 3 MRC Amount
>   30   PathandLab Lab  2   1
>
> 19 81 2   
>   NA   PathandLab NotValidRevCode  2   2
>
> 20 23 2   
>   NA IntegSys NotValidRevCode  1   2
>
> #I would like to set these missing to NA and have them reflected similarly to 
> an NA in a numeric or integer column if possible.
>
> #I have tried several approaches from Googled references:
>
> NonAcceptanceOther <- df$NonAcceptanceOther
> table(addNA(NonAcceptanceOther))
>
> is.na <- df$NonAcceptanceOther
>
> df[NonAcceptanceOther == '' | NonAcceptanceOther == 'NA'] <- NA
>
> #However, when I go to use:
>
> missingDF <- PlotMissing(df)
>
> #Only the columns that are numeric or integer reflect their missing values 
> (i.e. RejectionCodeID)  and this "NonAcceptanceOther" column does not reflect 
> or hold the NA values?
>
> Thank you for any advice.
>
> WHP
>
>
>
>
>
>
>
>
>
>
>
>
> Confidentiality Notice This message is sent from Zelis. ...{{dropped:16}}
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 

[R] Data frame with Factor column missing data change to NA

2018-06-13 Thread Bill Poling
Good morning.

#I have df with a Factor column called "NonAcceptanceOther" that contains 
missing data.

#Not every record in the df is expected to have a value in this column.

# Typical values look like:
# ERS
# Claim paid without PHX recommended savings
# Claim paid without PHX recommended savings
# MRC Amount
# MRC Amount
# PPO per provider
#Or they are missing (blank)

#Example

df2 <- 
df[,c("PlaceOfService","ClaimStatusID","NonAcceptanceOther","RejectionCodeID","CPTCats","RevCodeCats","GCode2","ClaimTypeID")]
head(df2, n=20)

   PlaceOfService ClaimStatusID NonAcceptanceOther 
RejectionCodeID  CPTCats RevCodeCats GCode2 ClaimTypeID

1  11 2 
NA  ResPSys NotValidRevCode  2   2

2  81 3 
53   PathandLab NotValidRevCode  2   2

3  11 3 
47 Medicine NotValidRevCode  1   2

4  09 2 
NA   NotCPT NotValidRevCode  1   2

5  11 2 
NARadiology NotValidRevCode  2   2

6  23 2 
NA   MusculoSys NotValidRevCode  2   2

7  12 3 
47   NotCPT NotValidRevCode  2   2

8  12 2 
NA Medicine NotValidRevCode  2   2

9  11 3 
47 Medicine NotValidRevCode  1   2

10 21 2 
NA   Anesthesia NotValidRevCode  2   2

11 11 3ERS  
30  EvalandMgmt NotValidRevCode  2   2

12 81 2 
NA   PathandLab NotValidRevCode  2   2

13 21 2 
NARadiology NotValidRevCode  1   2

14 11 2 
NA Medicine NotValidRevCode  1   2

15 99 3 Claim paid without PHX recommended savings  
30 CardioHemLympSys Lab  0   1

16 99 3 Claim paid without PHX recommended savings  
30   PathandLab Lab  0   1

17 99 3 MRC Amount  
30   NotCPT  Pharma  2   1

18 99 3 MRC Amount  
30   PathandLab Lab  2   1

19 81 2 
NA   PathandLab NotValidRevCode  2   2

20 23 2 
NA IntegSys NotValidRevCode  1   2

#I would like to set these missing to NA and have them reflected similarly to 
an NA in a numeric or integer column if possible.

#I have tried several approaches from Googled references:

NonAcceptanceOther <- df$NonAcceptanceOther
table(addNA(NonAcceptanceOther))

is.na <- df$NonAcceptanceOther

df[NonAcceptanceOther == '' | NonAcceptanceOther == 'NA'] <- NA

#However, when I go to use:

missingDF <- PlotMissing(df)

#Only the columns that are numeric or integer reflect their missing values 
(i.e. RejectionCodeID)  and this "NonAcceptanceOther" column does not reflect 
or hold the NA values?

Thank you for any advice.

WHP

Confidentiality Notice This message is sent from Zelis. ...{{dropped:16}}



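A minimal, self-contained sketch of the fix that resolved this thread (the data frame below is hypothetical, just two columns in the spirit of the posted data): convert the factor column to character, then recode blank ("") entries to NA.

```r
# Hypothetical two-column example in the spirit of the posted data
df2 <- data.frame(
  NonAcceptanceOther = factor(c("", "ERS", "", "MRC Amount", "")),
  RejectionCodeID    = c(NA, 30L, NA, 30L, NA)
)

# nchar() requires a character vector, so convert the factor first
df2$NonAcceptanceOther <- as.character(df2$NonAcceptanceOther)
df2$NonAcceptanceOther[nchar(df2$NonAcceptanceOther) == 0] <- NA

# Equivalent recoding directly on the factor:
#   df2$NonAcceptanceOther[df2$NonAcceptanceOther == ""] <- NA
# ("" then remains as an unused level; follow with droplevels() if needed)

sum(is.na(df2$NonAcceptanceOther))  # 3
```

After this, missing-value summaries treat the column like a numeric or integer column with NAs, which is what the original question asked for.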
Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread Ding, Yuan Chun
Thanks a lot; after reading this message, I think I see the advantage of Bert's 
coding. Those two drugs indeed do not interact with each other, so the additive 
assumption is valid.

I learned a lot today. Thanks again.

Ding

-Original Message-
From: David Winsemius [mailto:dwinsem...@comcast.net] 
Sent: Monday, March 05, 2018 3:55 PM
To: Bert Gunter
Cc: Ding, Yuan Chun; r-help@r-project.org
Subject: Re: [R] data analysis for partial two-by-two factorial design


> On Mar 5, 2018, at 3:04 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> 
> But of course the whole point of additivity is to decompose the combined 
> effect as the sum of individual effects.

Agreed. Furthermore your encoding of the treatment assignments has the 
advantage that the default treatment contrast for A+B will have a statistical 
estimate associated with it. That was a deficiency of my encoding that Ding 
found problematic. I did have the incorrect notion that the encoding of Drug B 
in the single drug situation would have been NA and that the `lm`-function 
would produce nothing useful. Your setup had not occurred to me.

Best;
David.

> 
> "Mislead" is a subjective judgment, so no comment. The explanation I provided 
> is standard. I used it for decades when I taught in industry.
> 
> Cheers,
> Bert
> 
> 
> 
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along and 
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> On Mon, Mar 5, 2018 at 3:00 PM, David Winsemius <dwinsem...@comcast.net> 
> wrote:
> 
> > On Mar 5, 2018, at 2:27 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> >
> > David:
> >
> > I believe your response on SO is incorrect. This is a standard OFAT (one 
> > factor at a time) design, so that assuming additivity (no interactions), 
> > the effects of drugA and drugB can be determined via the model you rejected:
> 
> >> three groups, no drugA/no drugB, yes drugA/no drugB, yes drugA/yes drug B, 
> >> omitting the fourth group of no drugA/yes drugB.
> 
> >
> > For example, if baseline control (no drugs) has a response of 0, drugA has 
> > an effect of 1, drugB has an effect of 2, and the effects are additive, 
> > with no noise we would have:
> >
> > > d <- data.frame(drugA = c("n","y","y"),drugB = c("n","n","y"))
> 
> d2 <- data.frame(trt = c("Baseline","DrugA_only","DrugA_drugB"))
> >
> > > y <- c(0,1,3)
> >
> > And a straightforward linear model recovers the effects:
> >
> > > lm(y ~ drugA + drugB, data=d)
> >
> > Call:
> > lm(formula = y ~ drugA + drugB, data = d)
> >
> > Coefficients:
> > (Intercept)   drugAy   drugBy
> >   1.282e-161.000e+002.000e+00
> 
> I think the labeling above is rather misleading since what is labeled drugB 
> is actually A. I think the method I suggest is more likely to be 
> interpreted correctly:
> 
> > d2 <- data.frame(trt = c("Baseline","DrugA_only","DrugA_drugB"))
> >  y <- c(0,1,3)
> > lm(y ~ trt, data=d2)
> 
> Call:
> lm(formula = y ~ trt, data = d2)
> 
> Coefficients:
>(Intercept)  trtDrugA_drugB   trtDrugA_only
>  2.564e-16   3.000e+00   1.000e+00
> 
> --
> David.
> >
> > As usual, OFAT designs are blind to interactions, so that if they really 
> > exist, the interpretation as additive effects is incorrect.
> >
> > Cheers,
> > Bert
> >
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along and 
> > sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> > On Mon, Mar 5, 2018 at 2:03 PM, David Winsemius <dwinsem...@comcast.net> 
> > wrote:
> >
> > > On Mar 5, 2018, at 8:52 AM, Ding, Yuan Chun <ycd...@coh.org> wrote:
> > >
> > > Hi Bert,
> > >
> > > I am very sorry to bother you again.
> > >
> > > For the following question, as you suggested, I posted it in both 
> > > Biostars website and stackexchange website, so far no reply.
> > >
> > > I really hope that you can do me a great favor to share your points about 
> > > how to explain the coefficients for drug A and drug B if run anova model 
> > > (response variable = drug A + drug B). is it different from running three 
> > > separate T tests?
> > >
> > > T

Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread David Winsemius

> On Mar 5, 2018, at 3:04 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> 
> But of course the whole point of additivity is to decompose the combined 
> effect as the sum of individual effects.

Agreed. Furthermore your encoding of the treatment assignments has the 
advantage that the default treatment contrast for A+B will have a statistical 
estimate associated with it. That was a deficiency of my encoding that Ding 
found problematic. I did have the incorrect notion that the encoding of Drug B 
in the single drug situation would have been NA and that the `lm`-function 
would produce nothing useful. Your setup had not occurred to me.

Best;
David.

> 
> "Mislead" is a subjective judgment, so no comment. The explanation I provided 
> is standard. I used it for decades when I taught in industry.
> 
> Cheers,
> Bert
> 
> 
> 
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along and 
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> On Mon, Mar 5, 2018 at 3:00 PM, David Winsemius <dwinsem...@comcast.net> 
> wrote:
> 
> > On Mar 5, 2018, at 2:27 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> >
> > David:
> >
> > I believe your response on SO is incorrect. This is a standard OFAT (one 
> > factor at a time) design, so that assuming additivity (no interactions), 
> > the effects of drugA and drugB can be determined via the model you rejected:
> 
> >> three groups, no drugA/no drugB, yes drugA/no drugB, yes drugA/yes drug B, 
> >> omitting the fourth group of no drugA/yes drugB.
> 
> >
> > For example, if baseline control (no drugs) has a response of 0, drugA has 
> > an effect of 1, drugB has an effect of 2, and the effects are additive, 
> > with no noise we would have:
> >
> > > d <- data.frame(drugA = c("n","y","y"),drugB = c("n","n","y"))
> 
> d2 <- data.frame(trt = c("Baseline","DrugA_only","DrugA_drugB"))
> >
> > > y <- c(0,1,3)
> >
> > And a straightforward linear model recovers the effects:
> >
> > > lm(y ~ drugA + drugB, data=d)
> >
> > Call:
> > lm(formula = y ~ drugA + drugB, data = d)
> >
> > Coefficients:
> > (Intercept)   drugAy   drugBy
> >   1.282e-161.000e+002.000e+00
> 
> I think the labeling above is rather misleading since what is labeled drugB 
> is actually A. I think the method I suggest is more likely to be 
> interpreted correctly:
> 
> > d2 <- data.frame(trt = c("Baseline","DrugA_only","DrugA_drugB"))
> >  y <- c(0,1,3)
> > lm(y ~ trt, data=d2)
> 
> Call:
> lm(formula = y ~ trt, data = d2)
> 
> Coefficients:
>(Intercept)  trtDrugA_drugB   trtDrugA_only
>  2.564e-16   3.000e+00   1.000e+00
> 
> --
> David.
> >
> > As usual, OFAT designs are blind to interactions, so that if they really 
> > exist, the interpretation as additive effects is incorrect.
> >
> > Cheers,
> > Bert
> >
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along and 
> > sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> > On Mon, Mar 5, 2018 at 2:03 PM, David Winsemius <dwinsem...@comcast.net> 
> > wrote:
> >
> > > On Mar 5, 2018, at 8:52 AM, Ding, Yuan Chun <ycd...@coh.org> wrote:
> > >
> > > Hi Bert,
> > >
> > > I am very sorry to bother you again.
> > >
> > > For the following question, as you suggested, I posted it in both 
> > > Biostars website and stackexchange website, so far no reply.
> > >
> > > I really hope that you can do me a great favor to share your points about 
> > > how to explain the coefficients for drug A and drug B if run anova model 
> > > (response variable = drug A + drug B). is it different from running three 
> > > separate T tests?
> > >
> > > Thank you so much!!
> > >
> > > Ding
> > >
> > > I need to analyze data generated from a partial two-by-two factorial 
> > > design: two levels for drug A (yes, no), two levels for drug B (yes, no); 
> > >  however, data points are available only for three groups, no drugA/no 
> > > drugB, yes drugA/no drugB, yes drugA/yes drug B, omitting the fourth 
> > > group of no drugA/yes drugB.  I think we can

Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread Bert Gunter
Yuan:

IMHO you need to stop making up your own statistical analyses and get local
expert help.

I have nothing further to say. Do what you will.

Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Mar 5, 2018 at 2:44 PM, Ding, Yuan Chun <ycd...@coh.org> wrote:

> Hi Bert and David,
>
>
>
> Thank you so much for willingness to spend some time on my problem!!!  I
> have some statistical knowledge (going to get a master in applied
> statistics), but have not had a chance to pursue a PhD in statistics, so I
> am always careful before starting to do analysis and hope to gather
> supportive information from real statisticians.
>
>
>
> Sorry that I did not tell more info about experiment design.
>
>
>
> I did not do this experiment, my collaborator did it and I only got chance
> to analyze the data.
>
>
>
> There are nine dishes of cells.  Three replicates for each treatment
> combination.  So randomly select three dishes for no drug A/no drug B
> treatment, a second three dishes for drug A only, then last three dishes to
> add both A and B drugs.  After drug treatments, they measure DNA
> methylation and genes or gene expression as outcome or response
> variables(two differnet types of response variables).
>
>
>
> My boss might want to find out net effect of drug B, but I think we can
> not exclude the confounding effect of drugA. For example, it is possible
> that drug B has no effect, only has effect when drug A is present.   I
> asked my collaborator why she omitted the fourth combination drugA only
> treatment, she said it was expensive to measure methylation or gene
> expression, so they performed the experiments based on their hypothesis
> which is too complicated to illustrate here in detail.  I am
> still not happy that they did not just add three more replicates to do a full
> 2X2 design.
>
>
>
> On the weekend, I also thought about doing a one-way anova, but then I
> have to do three pairwise comparisons to find out the pair to show
> difference if p value for one way anova is significant.
>
>
>
> Thanks,
>
>
> Ding
>
>
>
> *From:* Bert Gunter [mailto:bgunter.4...@gmail.com]
> *Sent:* Monday, March 05, 2018 2:27 PM
> *To:* David Winsemius
> *Cc:* Ding, Yuan Chun; r-help@r-project.org
>
> *Subject:* Re: [R] data analysis for partial two-by-two factorial design
>
>
>
> David:
>
> I believe your response on SO is incorrect. This is a standard OFAT (one
> factor at a time) design, so that assuming additivity (no interactions),
> the effects of drugA and drugB can be determined via the model you rejected:
>
> For example, if baseline control (no drugs) has a response of 0, drugA has
> an effect of 1, drugB has an effect of 2, and the effects are additive,
> with no noise we would have:
>
> > d <- data.frame(drugA = c("n","y","y"),drugB = c("n","n","y"))
> > y <- c(0,1,3)
>
> And a straightforward linear model recovers the effects:
>
>
> > lm(y ~ drugA + drugB, data=d)
>
> Call:
> lm(formula = y ~ drugA + drugB, data = d)
>
> Coefficients:
> (Intercept)   drugAy   drugBy
>   1.282e-161.000e+002.000e+00
>
> As usual, OFAT designs are blind to interactions, so that if they really
> exist, the interpretation as additive effects is incorrect.
>
>
>
> Cheers,
>
> Bert
>
>
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
>
> On Mon, Mar 5, 2018 at 2:03 PM, David Winsemius <dwinsem...@comcast.net>
> wrote:
>
>
> > On Mar 5, 2018, at 8:52 AM, Ding, Yuan Chun <ycd...@coh.org> wrote:
> >
> > Hi Bert,
> >
> > I am very sorry to bother you again.
> >
> > For the following question, as you suggested, I posted it in both
> Biostars website and stackexchange website, so far no reply.
> >
> > I really hope that you can do me a great favor to share your points
> about how to explain the coefficients for drug A and drug B if run anova
> model (response variable = drug A + drug B). is it different from running
> three separate T tests?
> >
> > Thank you so much!!
> >
> > Ding
> >
> > I need to analyze data generated from a partial two-by-two factorial
> design: two levels for drug A (yes, no), two levels for drug B (yes, no);
> 

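Bert's worked example from this thread runs as-is; here it is as a self-contained sketch showing both encodings side by side (the additive two-factor model and David's single three-level treatment factor), on the same noise-free toy data:

```r
# Toy data from the thread: baseline = 0, drugA adds 1, drugB adds 2,
# effects additive, no noise, and no "no-A/yes-B" group (OFAT design).
d <- data.frame(drugA = c("n", "y", "y"), drugB = c("n", "n", "y"))
y <- c(0, 1, 3)

# Additive encoding: recovers the individual effects, but only under
# the (untestable, in this design) assumption of no interaction.
fit_add <- lm(y ~ drugA + drugB, data = d)
round(coef(fit_add), 10)
# (Intercept) = 0, drugAy = 1, drugBy = 2

# One-way encoding: one coefficient per treatment group vs. baseline;
# DrugA_drugB estimates the combined A+B effect (3), not B alone.
d2 <- data.frame(trt = c("Baseline", "DrugA_only", "DrugA_drugB"))
fit_oneway <- lm(y ~ trt, data = d2)
round(coef(fit_oneway), 10)
# (Intercept) = 0, trtDrugA_drugB = 3, trtDrugA_only = 1
```

With three observations and three parameters, both fits are saturated, so the coefficients reproduce the toy effects exactly (up to floating-point noise); the thread's disagreement is only about which parameterization is less likely to be misread.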
Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread Bert Gunter
But of course the whole point of additivity is to decompose the combined
effect as the sum of individual effects.

"Mislead" is a subjective judgment, so no comment. The explanation I
provided is standard. I used it for decades when I taught in industry.

Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Mar 5, 2018 at 3:00 PM, David Winsemius <dwinsem...@comcast.net>
wrote:

>
> > On Mar 5, 2018, at 2:27 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> >
> > David:
> >
> > I believe your response on SO is incorrect. This is a standard OFAT (one
> factor at a time) design, so that assuming additivity (no interactions),
> the effects of drugA and drugB can be determined via the model you rejected:
>
> >> three groups, no drugA/no drugB, yes drugA/no drugB, yes drugA/yes drug
> B, omitting the fourth group of no drugA/yes drugB.
>
> >
> > For example, if baseline control (no drugs) has a response of 0, drugA
> has an effect of 1, drugB has an effect of 2, and the effects are additive,
> with no noise we would have:
> >
> > > d <- data.frame(drugA = c("n","y","y"),drugB = c("n","n","y"))
>
> d2 <- data.frame(trt = c("Baseline","DrugA_only","DrugA_drugB"))
> >
> > > y <- c(0,1,3)
> >
> > And a straightforward linear model recovers the effects:
> >
> > > lm(y ~ drugA + drugB, data=d)
> >
> > Call:
> > lm(formula = y ~ drugA + drugB, data = d)
> >
> > Coefficients:
> > (Intercept)   drugAy   drugBy
> >   1.282e-161.000e+002.000e+00
>
> I think the labeling above is rather misleading since what is labeled
> drugB is actually A. I think the method I suggest is more likely to be
> interpreted correctly:
>
> > d2 <- data.frame(trt = c("Baseline","DrugA_only","DrugA_drugB"))
> >  y <- c(0,1,3)
> > lm(y ~ trt, data=d2)
>
> Call:
> lm(formula = y ~ trt, data = d2)
>
> Coefficients:
>(Intercept)  trtDrugA_drugB   trtDrugA_only
>  2.564e-16   3.000e+00   1.000e+00
>
> --
> David.
> >
> > As usual, OFAT designs are blind to interactions, so that if they really
> exist, the interpretation as additive effects is incorrect.
> >
> > Cheers,
> > Bert
> >
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> > On Mon, Mar 5, 2018 at 2:03 PM, David Winsemius <dwinsem...@comcast.net>
> wrote:
> >
> > > On Mar 5, 2018, at 8:52 AM, Ding, Yuan Chun <ycd...@coh.org> wrote:
> > >
> > > Hi Bert,
> > >
> > > I am very sorry to bother you again.
> > >
> > > For the following question, as you suggested, I posted it in both
> Biostars website and stackexchange website, so far no reply.
> > >
> > > I really hope that you can do me a great favor to share your points
> about how to explain the coefficients for drug A and drug B if run anova
> model (response variable = drug A + drug B). is it different from running
> three separate T tests?
> > >
> > > Thank you so much!!
> > >
> > > Ding
> > >
> > > I need to analyze data generated from a partial two-by-two factorial
> design: two levels for drug A (yes, no), two levels for drug B (yes, no);
> however, data points are available only for three groups, no drugA/no
> drugB, yes drugA/no drugB, yes drugA/yes drug B, omitting the fourth group
> of no drugA/yes drugB.  I think we can not investigate interaction between
> drug A and drug B, can I still run  model using R as usual:  response
> variable = drug A + drug B?  any suggestion is appreciated.
> >
> > Replied on CrossValidated where this would be on-topic.
> >
> > --
> > David,
> >
> > >
> > >
> > > From: Bert Gunter [mailto:bgunter.4...@gmail.com]
> > > Sent: Friday, March 02, 2018 12:32 PM
> > > To: Ding, Yuan Chun
> > > Cc: r-help@r-project.org
> > > Subject: Re: [R] data analysis for partial two-by-two factorial design
> > >
> > > 
> > > [Attention: This email came from an external source. Do not open
> attachments or click on links from unknown se

Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread David Winsemius

> On Mar 5, 2018, at 2:27 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:
> 
> David:
> 
> I believe your response on SO is incorrect. This is a standard OFAT (one 
> factor at a time) design, so that assuming additivity (no interactions), the 
> effects of drugA and drugB can be determined via the model you rejected:

>> three groups, no drugA/no drugB, yes drugA/no drugB, yes drugA/yes drug B, 
>> omitting the fourth group of no drugA/yes drugB.

> 
> For example, if baseline control (no drugs) has a response of 0, drugA has an 
> effect of 1, drugB has an effect of 2, and the effects are additive, with no 
> noise we would have:
> 
> > d <- data.frame(drugA = c("n","y","y"),drugB = c("n","n","y"))

d2 <- data.frame(trt = c("Baseline","DrugA_only","DrugA_drugB"))
> 
> > y <- c(0,1,3)
> 
> And a straightforward linear model recovers the effects:
> 
> > lm(y ~ drugA + drugB, data=d)
> 
> Call:
> lm(formula = y ~ drugA + drugB, data = d)
> 
> Coefficients:
> (Intercept)   drugAy   drugBy  
>   1.282e-161.000e+002.000e+00  

I think the labeling above is rather misleading since what is labeled drugB is 
actually A. I think the method I suggest is more likely to be interpreted 
correctly:

> d2 <- data.frame(trt = c("Baseline","DrugA_only","DrugA_drugB"))
>  y <- c(0,1,3)
> lm(y ~ trt, data=d2)

Call:
lm(formula = y ~ trt, data = d2)

Coefficients:
   (Intercept)  trtDrugA_drugB   trtDrugA_only  
 2.564e-16   3.000e+00   1.000e+00  

-- 
David.
> 
> As usual, OFAT designs are blind to interactions, so that if they really 
> exist, the interpretation as additive effects is incorrect.
> 
> Cheers,
> Bert
> 
> 
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along and 
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> On Mon, Mar 5, 2018 at 2:03 PM, David Winsemius <dwinsem...@comcast.net> 
> wrote:
> 
> > On Mar 5, 2018, at 8:52 AM, Ding, Yuan Chun <ycd...@coh.org> wrote:
> >
> > Hi Bert,
> >
> > I am very sorry to bother you again.
> >
> > For the following question, as you suggested, I posted it in both Biostars 
> > website and stackexchange website, so far no reply.
> >
> > I really hope that you can do me a great favor to share your points about 
> > how to explain the coefficients for drug A and drug B if run anova model 
> > (response variable = drug A + drug B). is it different from running three 
> > separate T tests?
> >
> > Thank you so much!!
> >
> > Ding
> >
> > I need to analyze data generated from a partial two-by-two factorial 
> > design: two levels for drug A (yes, no), two levels for drug B (yes, no);  
> > however, data points are available only for three groups, no drugA/no 
> > drugB, yes drugA/no drugB, yes drugA/yes drug B, omitting the fourth group 
> > of no drugA/yes drugB.  I think we can not investigate interaction between 
> > drug A and drug B, can I still run  model using R as usual:  response 
> > variable = drug A + drug B?  any suggestion is appreciated.
> 
> Replied on CrossValidated where this would be on-topic.
> 
> --
> David,
> 
> >
> >
> > From: Bert Gunter [mailto:bgunter.4...@gmail.com]
> > Sent: Friday, March 02, 2018 12:32 PM
> > To: Ding, Yuan Chun
> > Cc: r-help@r-project.org
> > Subject: Re: [R] data analysis for partial two-by-two factorial design
> >
> > 
> > [Attention: This email came from an external source. Do not open 
> > attachments or click on links from unknown senders or unexpected emails.]
> > 
> >
> > This list provides help on R programming (see the posting guide linked 
> > below for details on what is/is not considered on topic), and generally 
> > avoids discussion of purely statistical issues, which is what your query 
> > appears to be. The simple answer is yes, you can fit the model as 
> > described,  but you clearly need the off topic discussion as to what it 
> > does or does not mean. For that, you might try the 
> > stats.stackexchange.com<http://stats.stackexchange.com> statistical site.
> >
> > Cheers,
> > Bert
> >
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along and 
> > sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread Ding, Yuan Chun
I am sorry that I made a typo.  I wrote "I asked my collaborator whey she 
omitted the fourth combination drugA only treatment"; I wanted to say "I asked 
my collaborator why she omitted the fourth combination drugB only treatment".

Ding 

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Ding, Yuan Chun
Sent: Monday, March 05, 2018 2:45 PM
To: Bert Gunter; David Winsemius
Cc: r-help@r-project.org
Subject: Re: [R] data analysis for partial two-by-two factorial design

Hi Bert and David,

Thank you so much for your willingness to spend some time on my problem!!!  I 
have some statistical knowledge (going to get a master's in applied statistics), 
but did not have a chance to pursue a PhD in statistics, so I am always careful 
before starting an analysis and hope to gather supportive information from real 
statisticians.

Sorry that I did not tell more info about experiment design.

I did not do this experiment; my collaborator did it, and I only got the chance 
to analyze the data.

There are nine dishes of cells, with three replicates for each treatment 
combination: three dishes were randomly selected for the no drug A/no drug B 
treatment, a second three dishes for drug A only, and the last three dishes 
received both drugs A and B.  After the drug treatments, they measured DNA 
methylation and gene expression as outcome or response variables (two different 
types of response variables).

My boss might want to find out net effect of drug B, but I think we can not 
exclude the confounding effect of drugA. For example, it is possible that drug 
B has no effect, only has effect when drug A is present.   I asked my 
collaborator whey she omitted the fourth combination drugA only treatment, she 
said it was expensive to measure methylation or gene expression, so they 
performed the experiments based on their hypothesis which is too complicated 
here, so not illustrated here in details.  I am still not happy that they could 
just add three more replicates to do a full 2X2 design.

On the weekend, I also thought about doing a one-way ANOVA, but then I have to 
do three pairwise comparisons to find out which pairs differ if the p value for 
the one-way ANOVA is significant.

Thanks,

Ding
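
The one-way route described above can be sketched as follows.  The response
values here are hypothetical stand-ins for the nine dishes (the real
methylation/expression data are not shown in the thread); only the three
observed cells of the 2x2 design exist.

```r
# Hypothetical response values for the nine dishes, three per observed group.
trt <- factor(rep(c("none", "A_only", "A_and_B"), each = 3),
              levels = c("none", "A_only", "A_and_B"))
y   <- c(0.1, -0.2, 0.0,   0.9, 1.1, 1.0,   2.8, 3.1, 3.0)

fit <- aov(y ~ trt)
summary(fit)      # overall one-way ANOVA F test
TukeyHSD(fit)     # all three pairwise comparisons, multiplicity-adjusted
```

TukeyHSD() handles the multiplicity adjustment for the three pairwise
comparisons in one step, rather than running three separate t tests.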

From: Bert Gunter [mailto:bgunter.4...@gmail.com]
Sent: Monday, March 05, 2018 2:27 PM
To: David Winsemius
Cc: Ding, Yuan Chun; r-help@r-project.org
Subject: Re: [R] data analysis for partial two-by-two factorial design

David:
I believe your response on SO is incorrect. This is a standard OFAT (one factor 
at a time) design, so that assuming additivity (no interactions), the effects 
of drugA and drugB can be determined via the model you rejected:
For example, if baseline control (no drugs) has a response of 0, drugA has an 
effect of 1, drugB has an effect of 2, and the effects are additive, with no 
noise we would have:

> d <- data.frame(drugA = c("n","y","y"), drugB = c("n","n","y"))
> y <- c(0,1,3)
And a straightforward linear model recovers the effects:

> lm(y ~ drugA + drugB, data=d)

Call:
lm(formula = y ~ drugA + drugB, data = d)

Coefficients:
(Intercept)   drugAy   drugBy
  1.282e-16    1.000e+00    2.000e+00
As usual, OFAT designs are blind to interactions, so that if they really exist, 
the interpretation as additive effects is incorrect.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Mar 5, 2018 at 2:03 PM, David Winsemius 
<dwinsem...@comcast.net<mailto:dwinsem...@comcast.net>> wrote:

> On Mar 5, 2018, at 8:52 AM, Ding, Yuan Chun 
> <ycd...@coh.org<mailto:ycd...@coh.org>> wrote:
>
> Hi Bert,
>
> I am very sorry to bother you again.
>
> For the following question, as you suggested, I posted it in both Biostars 
> website and stackexchange website, so far no reply.
>
> I really hope that you can do me a great favor to share your points about how 
> to explain the coefficients for drug A and drug B if run anova model 
> (response variable = drug A + drug B). is it different from running three 
> separate T tests?
>
> Thank you so much!!
>
> Ding
>
> I need to analyze data generated from a partial two-by-two factorial design: 
> two levels for drug A (yes, no), two levels for drug B (yes, no);  however, 
> data points are available only for three groups, no drugA/no drugB, yes 
> drugA/no drugB, yes drugA/yes drug B, omitting the fourth group of no 
> drugA/yes drugB.  I think we can not investigate interaction between drug A 
> and drug B, can I still run  model using R as usual:  response variable = 
> drug A + drug B?  any suggestion is appreciated.

Replied on CrossValidated where this would be on-topic.

--
David,

Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread Ding, Yuan Chun
Hi Bert and David,

Thank you so much for your willingness to spend some time on my problem!!!  I 
have some statistical knowledge (going to get a master's in applied statistics), 
but did not have a chance to pursue a PhD in statistics, so I am always careful 
before starting an analysis and hope to gather supportive information from real 
statisticians.

Sorry that I did not tell more info about experiment design.

I did not do this experiment; my collaborator did it, and I only got the chance 
to analyze the data.

There are nine dishes of cells, with three replicates for each treatment 
combination: three dishes were randomly selected for the no drug A/no drug B 
treatment, a second three dishes for drug A only, and the last three dishes 
received both drugs A and B.  After the drug treatments, they measured DNA 
methylation and gene expression as outcome or response variables (two different 
types of response variables).

My boss might want to find out net effect of drug B, but I think we can not 
exclude the confounding effect of drugA. For example, it is possible that drug 
B has no effect, only has effect when drug A is present.   I asked my 
collaborator whey she omitted the fourth combination drugA only treatment, she 
said it was expensive to measure methylation or gene expression, so they 
performed the experiments based on their hypothesis which is too complicated 
here, so not illustrated here in details.  I am still not happy that they could 
just add three more replicates to do a full 2X2 design.

On the weekend, I also thought about doing a one-way ANOVA, but then I have to 
do three pairwise comparisons to find out which pairs differ if the p value for 
the one-way ANOVA is significant.

Thanks,

Ding

From: Bert Gunter [mailto:bgunter.4...@gmail.com]
Sent: Monday, March 05, 2018 2:27 PM
To: David Winsemius
Cc: Ding, Yuan Chun; r-help@r-project.org
Subject: Re: [R] data analysis for partial two-by-two factorial design

David:
I believe your response on SO is incorrect. This is a standard OFAT (one factor 
at a time) design, so that assuming additivity (no interactions), the effects 
of drugA and drugB can be determined via the model you rejected:
For example, if baseline control (no drugs) has a response of 0, drugA has an 
effect of 1, drugB has an effect of 2, and the effects are additive, with no 
noise we would have:

> d <- data.frame(drugA = c("n","y","y"),drugB = c("n","n","y"))
> y <- c(0,1,3)
And a straightforward linear model recovers the effects:

> lm(y ~ drugA + drugB, data=d)

Call:
lm(formula = y ~ drugA + drugB, data = d)

Coefficients:
(Intercept)   drugAy   drugBy
  1.282e-16    1.000e+00    2.000e+00
As usual, OFAT designs are blind to interactions, so that if they really exist, 
the interpretation as additive effects is incorrect.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Mar 5, 2018 at 2:03 PM, David Winsemius 
<dwinsem...@comcast.net<mailto:dwinsem...@comcast.net>> wrote:

> On Mar 5, 2018, at 8:52 AM, Ding, Yuan Chun 
> <ycd...@coh.org<mailto:ycd...@coh.org>> wrote:
>
> Hi Bert,
>
> I am very sorry to bother you again.
>
> For the following question, as you suggested, I posted it in both Biostars 
> website and stackexchange website, so far no reply.
>
> I really hope that you can do me a great favor to share your points about how 
> to explain the coefficients for drug A and drug B if run anova model 
> (response variable = drug A + drug B). is it different from running three 
> separate T tests?
>
> Thank you so much!!
>
> Ding
>
> I need to analyze data generated from a partial two-by-two factorial design: 
> two levels for drug A (yes, no), two levels for drug B (yes, no);  however, 
> data points are available only for three groups, no drugA/no drugB, yes 
> drugA/no drugB, yes drugA/yes drug B, omitting the fourth group of no 
> drugA/yes drugB.  I think we can not investigate interaction between drug A 
> and drug B, can I still run  model using R as usual:  response variable = 
> drug A + drug B?  any suggestion is appreciated.

Replied on CrossValidated where this would be on-topic.

--
David,

>
>
> From: Bert Gunter 
> [mailto:bgunter.4...@gmail.com<mailto:bgunter.4...@gmail.com>]
> Sent: Friday, March 02, 2018 12:32 PM
> To: Ding, Yuan Chun
> Cc: r-help@r-project.org<mailto:r-help@r-project.org>
> Subject: Re: [R] data analysis for partial two-by-two factorial design
>
> 
> [Attention: This email came from an external source. Do not open attachments 
> or click on links from unknown senders or unexpected emails.]

Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread Bert Gunter
David:

I believe your response on SO is incorrect. This is a standard OFAT (one
factor at a time) design, so that assuming additivity (no interactions),
the effects of drugA and drugB can be determined via the model you rejected:

For example, if baseline control (no drugs) has a response of 0, drugA has
an effect of 1, drugB has an effect of 2, and the effects are additive,
with no noise we would have:

> d <- data.frame(drugA = c("n","y","y"),drugB = c("n","n","y"))
> y <- c(0,1,3)

And a straightforward linear model recovers the effects:

> lm(y ~ drugA + drugB, data=d)

Call:
lm(formula = y ~ drugA + drugB, data = d)

Coefficients:
(Intercept)   drugAy   drugBy
  1.282e-16    1.000e+00    2.000e+00

As usual, OFAT designs are blind to interactions, so that if they really
exist, the interpretation as additive effects is incorrect.
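
A small sketch of this point with the toy numbers above: the additive model is
estimable from the three observed cells, while the interaction model is
rank-deficient (four parameters, three cells), so R returns NA for the
interaction coefficient.

```r
d <- data.frame(drugA = c("n", "y", "y"),
                drugB = c("n", "n", "y"))
y <- c(0, 1, 3)                        # baseline 0, drugA effect 1, drugB effect 2

coef(lm(y ~ drugA + drugB, data = d))  # additive fit recovers ~0, 1, 2
coef(lm(y ~ drugA * drugB, data = d))  # interaction coefficient is NA:
                                       # not estimable from three cells
```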

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Mon, Mar 5, 2018 at 2:03 PM, David Winsemius <dwinsem...@comcast.net>
wrote:

>
> > On Mar 5, 2018, at 8:52 AM, Ding, Yuan Chun <ycd...@coh.org> wrote:
> >
> > Hi Bert,
> >
> > I am very sorry to bother you again.
> >
> > For the following question, as you suggested, I posted it in both
> Biostars website and stackexchange website, so far no reply.
> >
> > I really hope that you can do me a great favor to share your points
> about how to explain the coefficients for drug A and drug B if run anova
> model (response variable = drug A + drug B). is it different from running
> three separate T tests?
> >
> > Thank you so much!!
> >
> > Ding
> >
> > I need to analyze data generated from a partial two-by-two factorial
> design: two levels for drug A (yes, no), two levels for drug B (yes, no);
> however, data points are available only for three groups, no drugA/no
> drugB, yes drugA/no drugB, yes drugA/yes drug B, omitting the fourth group
> of no drugA/yes drugB.  I think we can not investigate interaction between
> drug A and drug B, can I still run  model using R as usual:  response
> variable = drug A + drug B?  any suggestion is appreciated.
>
> Replied on CrossValidated where this would be on-topic.
>
> --
> David,
>
> >
> >
> > From: Bert Gunter [mailto:bgunter.4...@gmail.com]
> > Sent: Friday, March 02, 2018 12:32 PM
> > To: Ding, Yuan Chun
> > Cc: r-help@r-project.org
> > Subject: Re: [R] data analysis for partial two-by-two factorial design
> >
> > 
> > [Attention: This email came from an external source. Do not open
> attachments or click on links from unknown senders or unexpected emails.]
> > 
> >
> > This list provides help on R programming (see the posting guide linked
> below for details on what is/is not considered on topic), and generally
> avoids discussion of purely statistical issues, which is what your query
> appears to be. The simple answer is yes, you can fit the model as
> described,  but you clearly need the off topic discussion as to what it
> does or does not mean. For that, you might try the stats.stackexchange.com
> <http://stats.stackexchange.com> statistical site.
> >
> > Cheers,
> > Bert
> >
> >
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> > On Fri, Mar 2, 2018 at 10:34 AM, Ding, Yuan Chun <ycd...@coh.org<mailto:ycd...@coh.org>> wrote:
> > Dear R users,
> >
> > I need to analyze data generated from a partial two-by-two factorial
> design: two levels for drug A (yes, no), two levels for drug B (yes, no);
> however, data points are available only for three groups, no drugA/no
> drugB, yes drugA/no drugB, yes drugA/yes drug B, omitting the fourth group
> of no drugA/yes drugB.  I think we can not investigate interaction between
> drug A and drug B, can I still run  model using R as usual:  response
> variable = drug A + drug B?  any suggestion is appreciated.
> >
> > Thank you very much!
> >
> > Yuan Chun Ding
> >
> >
> > -
> > -SECURITY/CONFIDENTIALITY WARNING-
> > This message (and any attachments) are intended solely f...{{dropped:28}}
> >

Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread David Winsemius

> On Mar 5, 2018, at 8:52 AM, Ding, Yuan Chun <ycd...@coh.org> wrote:
> 
> Hi Bert,
> 
> I am very sorry to bother you again.
> 
> For the following question, as you suggested, I posted it in both Biostars 
> website and stackexchange website, so far no reply.
> 
> I really hope that you can do me a great favor to share your points about how 
> to explain the coefficients for drug A and drug B if run anova model 
> (response variable = drug A + drug B). is it different from running three 
> separate T tests?
> 
> Thank you so much!!
> 
> Ding
> 
> I need to analyze data generated from a partial two-by-two factorial design: 
> two levels for drug A (yes, no), two levels for drug B (yes, no);  however, 
> data points are available only for three groups, no drugA/no drugB, yes 
> drugA/no drugB, yes drugA/yes drug B, omitting the fourth group of no 
> drugA/yes drugB.  I think we can not investigate interaction between drug A 
> and drug B, can I still run  model using R as usual:  response variable = 
> drug A + drug B?  any suggestion is appreciated.

Replied on CrossValidated where this would be on-topic.

-- 
David,

> 
> 
> From: Bert Gunter [mailto:bgunter.4...@gmail.com]
> Sent: Friday, March 02, 2018 12:32 PM
> To: Ding, Yuan Chun
> Cc: r-help@r-project.org
> Subject: Re: [R] data analysis for partial two-by-two factorial design
> 
> 
> [Attention: This email came from an external source. Do not open attachments 
> or click on links from unknown senders or unexpected emails.]
> 
> 
> This list provides help on R programming (see the posting guide linked below 
> for details on what is/is not considered on topic), and generally avoids 
> discussion of purely statistical issues, which is what your query appears to 
> be. The simple answer is yes, you can fit the model as described,  but you 
> clearly need the off topic discussion as to what it does or does not mean. 
> For that, you might try the 
> stats.stackexchange.com<http://stats.stackexchange.com> statistical site.
> 
> Cheers,
> Bert
> 
> 
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along and 
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> On Fri, Mar 2, 2018 at 10:34 AM, Ding, Yuan Chun 
> <ycd...@coh.org<mailto:ycd...@coh.org>> wrote:
> Dear R users,
> 
> I need to analyze data generated from a partial two-by-two factorial design: 
> two levels for drug A (yes, no), two levels for drug B (yes, no);  however, 
> data points are available only for three groups, no drugA/no drugB, yes 
> drugA/no drugB, yes drugA/yes drug B, omitting the fourth group of no 
> drugA/yes drugB.  I think we can not investigate interaction between drug A 
> and drug B, can I still run  model using R as usual:  response variable = 
> drug A + drug B?  any suggestion is appreciated.
> 
> Thank you very much!
> 
> Yuan Chun Ding
> 
> 

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] data analysis for partial two-by-two factorial design

2018-03-05 Thread Ding, Yuan Chun
Hi Bert,

I am very sorry to bother you again.

For the following question, as you suggested, I posted it in both Biostars 
website and stackexchange website, so far no reply.

I really hope that you can do me a great favor to share your points about how 
to explain the coefficients for drug A and drug B if run anova model (response 
variable = drug A + drug B). is it different from running three separate T 
tests?

Thank you so much!!

Ding

I need to analyze data generated from a partial two-by-two factorial design: 
two levels for drug A (yes, no), two levels for drug B (yes, no);  however, 
data points are available only for three groups, no drugA/no drugB, yes 
drugA/no drugB, yes drugA/yes drug B, omitting the fourth group of no drugA/yes 
drugB.  I think we can not investigate interaction between drug A and drug B, 
can I still run  model using R as usual:  response variable = drug A + drug B?  
any suggestion is appreciated.


From: Bert Gunter [mailto:bgunter.4...@gmail.com]
Sent: Friday, March 02, 2018 12:32 PM
To: Ding, Yuan Chun
Cc: r-help@r-project.org
Subject: Re: [R] data analysis for partial two-by-two factorial design


[Attention: This email came from an external source. Do not open attachments or 
click on links from unknown senders or unexpected emails.]


This list provides help on R programming (see the posting guide linked below 
for details on what is/is not considered on topic), and generally avoids 
discussion of purely statistical issues, which is what your query appears to 
be. The simple answer is yes, you can fit the model as described,  but you 
clearly need the off topic discussion as to what it does or does not mean. For 
that, you might try the stats.stackexchange.com<http://stats.stackexchange.com> 
statistical site.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Mar 2, 2018 at 10:34 AM, Ding, Yuan Chun 
<ycd...@coh.org<mailto:ycd...@coh.org>> wrote:
Dear R users,

I need to analyze data generated from a partial two-by-two factorial design: 
two levels for drug A (yes, no), two levels for drug B (yes, no);  however, 
data points are available only for three groups, no drugA/no drugB, yes 
drugA/no drugB, yes drugA/yes drug B, omitting the fourth group of no drugA/yes 
drugB.  I think we can not investigate interaction between drug A and drug B, 
can I still run  model using R as usual:  response variable = drug A + drug B?  
any suggestion is appreciated.

Thank you very much!

Yuan Chun Ding








Re: [R] data analysis for partial two-by-two factorial design

2018-03-02 Thread Ding, Yuan Chun
Hi Bert,

Thank  you so much for your direction, I have asked a question on stackexchange 
website.

Ding

From: Bert Gunter [mailto:bgunter.4...@gmail.com]
Sent: Friday, March 02, 2018 12:32 PM
To: Ding, Yuan Chun
Cc: r-help@r-project.org
Subject: Re: [R] data analysis for partial two-by-two factorial design


[Attention: This email came from an external source. Do not open attachments or 
click on links from unknown senders or unexpected emails.]


This list provides help on R programming (see the posting guide linked below 
for details on what is/is not considered on topic), and generally avoids 
discussion of purely statistical issues, which is what your query appears to 
be. The simple answer is yes, you can fit the model as described,  but you 
clearly need the off topic discussion as to what it does or does not mean. For 
that, you might try the stats.stackexchange.com<http://stats.stackexchange.com> 
statistical site.

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along and 
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Mar 2, 2018 at 10:34 AM, Ding, Yuan Chun 
<ycd...@coh.org<mailto:ycd...@coh.org>> wrote:
Dear R users,

I need to analyze data generated from a partial two-by-two factorial design: 
two levels for drug A (yes, no), two levels for drug B (yes, no);  however, 
data points are available only for three groups, no drugA/no drugB, yes 
drugA/no drugB, yes drugA/yes drug B, omitting the fourth group of no drugA/yes 
drugB.  I think we can not investigate interaction between drug A and drug B, 
can I still run  model using R as usual:  response variable = drug A + drug B?  
any suggestion is appreciated.

Thank you very much!

Yuan Chun Ding








Re: [R] data analysis for partial two-by-two factorial design

2018-03-02 Thread Bert Gunter
This list provides help on R programming (see the posting guide linked
below for details on what is/is not considered on topic), and generally
avoids discussion of purely statistical issues, which is what your query
appears to be. The simple answer is yes, you can fit the model as
described,  but you clearly need the off topic discussion as to what it
does or does not mean. For that, you might try the stats.stackexchange.com
statistical site.

Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Fri, Mar 2, 2018 at 10:34 AM, Ding, Yuan Chun  wrote:

> Dear R users,
>
> I need to analyze data generated from a partial two-by-two factorial
> design: two levels for drug A (yes, no), two levels for drug B (yes, no);
> however, data points are available only for three groups, no drugA/no
> drugB, yes drugA/no drugB, yes drugA/yes drug B, omitting the fourth group
> of no drugA/yes drugB.  I think we can not investigate interaction between
> drug A and drug B, can I still run  model using R as usual:  response
> variable = drug A + drug B?  any suggestion is appreciated.
>
> Thank you very much!
>
> Yuan Chun Ding
>
>



Re: [R] data analysis for partial two-by-two factorial design

2018-03-02 Thread Ding, Yuan Chun
Dear R users,

I need to analyze data generated from a partial two-by-two factorial design: 
two levels for drug A (yes, no), two levels for drug B (yes, no);  however, 
data points are available only for three groups, no drugA/no drugB, yes 
drugA/no drugB, yes drugA/yes drug B, omitting the fourth group of no drugA/yes 
drugB.  I think we can not investigate interaction between drug A and drug B, 
can I still run  model using R as usual:  response variable = drug A + drug B?  
any suggestion is appreciated.

Thank you very much!

Yuan Chun Ding





Re: [R] Data Table Merge Help

2018-02-01 Thread Jeff Newmiller
I rarely use data.table, but I think the vignette for the package discusses 
rolling joins. Also,  Google popped up [1].

[1] https://www.r-bloggers.com/understanding-data-table-rolling-joins/
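
A rolling-join sketch along those lines, using the sample data from the quoted
question below.  With dtDepartments as `x` and dtDistributions as `i`,
`roll = Inf` matches each payment to the department row with the greatest
EffDT on or before PaymentDT (last observation carried forward).  This is a
sketch, not tested against the full production data:

```r
library(data.table)

dtDistributions <- data.table(
  PayeeName  = c("Bob", "Tracy", "Tom"),
  Department = factor(c("H229000", "H135000", "H047800")),
  Amount     = c(5, 34, 87),
  PaymentDT  = as.Date(c("2016-01-01", "2015-01-01", "2015-01-01")))

dtDepartments <- data.table(
  Department = factor(c("H229000", "H229000", "H229000", "H135000", "H047800")),
  EffDT      = as.Date(c("2019-01-01", "2012-01-01", "1901-01-01",
                         "1901-01-01", "1901-01-01")),
  Descr      = c("Final Name", "Modified Name", "Original Name",
                 "Payables", "Postal"))

# For each payment, pick the department row whose EffDT is the most recent
# one on or before PaymentDT; after the join, EffDT carries i's PaymentDT.
finalDT <- dtDepartments[dtDistributions,
                         on   = .(Department, EffDT = PaymentDT),
                         roll = Inf,
                         .(PayeeName, Descr, PaymentDT = EffDT, Amount)]
# Expected: Bob -> "Modified Name", Tracy -> "Payables", Tom -> "Postal"
```

This replaces the correlated MAX() subquery in the sqldf version with a single
keyed lookup, which is what makes it fast on large data.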
-- 
Sent from my phone. Please excuse my brevity.

On February 1, 2018 9:45:53 AM PST, "Graeve, Nick"  
wrote:
>Hello
>
>I'm not sure if this is an appropriate use of this mailing list or not,
>please let me know if it isn't.  I'm struggling to figure out how to
>merge two data tables based on max effective date logic compared to
>when a payment occurred.  My dtDistributions DT is a transactional
>dataset while dtDepartments is a domain data set containing all
>department names and the effective date of when department name changes
>have occurred.  For the Bob example below, there was a payment on
>2016-01-01 which occurred in H229000.  In 2012, this department was
>named "Modified Name", in 2019 the department will be named "Final
>Name".  When I merge these two tables, I'd like it to pull the
>transactional data and match it up to department name "Modified Name"
>since that was the active department name at the time of that
>transaction.  I've read documentation on foverlaps, but I'm not sure if
>this problem is considered a range of dates or not.  At the bottom of
>this post is a temporarily solution that is working but it runs for a
>long time due to the amount of data in my actual source.
>
>Here is some sample data to get started:
>library(data.table)
>dtDistributions <- data.table(
>  PayeeName = c("Bob", "Tracy", "Tom"),
>  Department = factor(c("H229000", "H135000", "H047800")),
>  Amount = c(5, 34, 87),
>  PaymentDT = as.Date(c("2016-01-01", "2015-01-01", "2015-01-01")))
>
>dtDepartments <- data.table(
>  Department = factor(c("H229000", "H229000", "H229000", "H135000", "H047800")),
>  EffDT = as.Date(c("2019-01-01", "2012-01-01", "1901-01-01", "1901-01-01", "1901-01-01")),
>  Descr = c("Final Name","Modified Name","Original Name","Payables","Postal"))
>
>Here is the output I would like to see:
>PayeeName  Descr  PaymentDT   Amount
>Bob    Modified Name  2016-01-01  5
>Tracy  Payables   2015-01-01  34
>Tom    Postal     2015-01-01  87
>
>I was able to get this working by using the sqldf library, but it runs
>for a very long time in my actual dataset and I'd like to use
>data.table if at all possible.
>library(sqldf)
>joinString <- "SELECT A.PayeeName, B.Descr, A.PaymentDT, A.Amount
>    FROM dtDistributions A, dtDepartments B
>    WHERE A.DEPARTMENT = B.Department
>    AND B.EffDT = (SELECT MAX(ED.EffDT)
>    FROM dtDepartments ED
>    WHERE B.Department = ED.Department
>    AND ED.EffDT <= A.PaymentDT)"
>
>finalDT <- data.table(sqldf(joinString))
>
>
>
>
>__
>R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Data Table Merge Help

2018-02-01 Thread Bert Gunter
Did you search first? (This is suggested by the posting guide -- below
-- prior to posting).

"merge 2 data.tables in R" brought up what looked like useful stuff,
in particular the  merge() function for data tables. If this does not
do what you want, it may help to explain why not.

Alternatively, there is a merge.data.frame function that may do the
job if you first convert your data.table to a data.frame.

As I do not use the data.table package, you or others may have to fill
in details to make these work -- if they *can* work.

Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Thu, Feb 1, 2018 at 9:45 AM, Graeve, Nick  wrote:
> Hello
>
> I'm not sure if this is an appropriate use of this mailing list or not, 
> please let me know if it isn't.  I'm struggling to figure out how to merge 
> two data tables based on max effective date logic compared to when a payment 
> occurred.  My dtDistributions DT is a transactional dataset while 
> dtDepartments is a domain data set containing all department names and the 
> effective date of when department name changes have occurred.  For the Bob 
> example below, there was a payment on 2016-01-01 which occurred in H229000.  
> In 2012, this department was named "Modified Name", in 2019 the department 
> will be named "Final Name".  When I merge these two tables, I'd like it to 
> pull the transactional data and match it up to department name "Modified 
> Name" since that was the active department name at the time of that 
> transaction.  I've read documentation on foverlaps, but I'm not sure if this 
> problem is considered a range of dates or not.  At the bottom of this post is 
> a temporary solution that is working, but it runs for a long time due to the
> amount of data in my actual source.
>
> Here is some sample data to get started:
> library(data.table)
> dtDistributions <- data.table(PayeeName = c("Bob", "Tracy", "Tom"),
>   Department = factor(c("H229000", "H135000", 
> "H047800")),
>   Amount = c(5, 34, 87),
>   PaymentDT = as.Date(c("2016-01-01", "2015-01-01", 
> "2015-01-01")))
>
> dtDepartments <- data.table(Department = factor(c("H229000", "H229000", 
> "H229000", "H135000", "H047800")),
> EffDT = as.Date(c("2019-01-01", "2012-01-01", 
> "1901-01-01", "1901-01-01", "1901-01-01")),
> Descr = c("Final Name","Modified Name","Original 
> Name","Payables","Postal"))
>
> Here is the output I would like to see:
> PayeeName  Department     PaymentDT   Amount
> Bob        Modified Name  2016-01-01  5
> Tracy      Payables       2015-01-01  34
> Tom        Postal         2015-01-01  87
>
> I was able to get this working by using the sqldf library, but it runs for a 
> very long time in my actual dataset and I'd like to use data.table if at all 
> possible.
> library(sqldf)
> joinString <- "SELECT A.PayeeName, B.Descr, A.PaymentDT, A.Amount
> FROM dtDistributions A, dtDepartments B
> WHERE A.DEPARTMENT = B.Department
> AND B.EffDT = (SELECT MAX(ED.EffDT)
> FROM dtDepartments ED
> WHERE B.Department = ED.Department
> AND ED.EffDT <= A.PaymentDT)"
>
> finalDT <- data.table(sqldf(joinString))
>

[R] Data Table Merge Help

2018-02-01 Thread Graeve, Nick
Hello

I'm not sure if this is an appropriate use of this mailing list or not, please 
let me know if it isn't.  I'm struggling to figure out how to merge two data 
tables based on max effective date logic compared to when a payment occurred.  
My dtDistributions DT is a transactional dataset while dtDepartments is a 
domain data set containing all department names and the effective date of when 
department name changes have occurred.  For the Bob example below, there was a 
payment on 2016-01-01 which occurred in H229000.  In 2012, this department was 
named "Modified Name", in 2019 the department will be named "Final Name".  When 
I merge these two tables, I'd like it to pull the transactional data and match 
it up to department name "Modified Name" since that was the active department 
name at the time of that transaction.  I've read documentation on foverlaps, 
but I'm not sure if this problem is considered a range of dates or not.  At the 
bottom of this post is a temporary solution that is working, but it runs for a 
long time due to the amount of data in my actual source.

Here is some sample data to get started:
library(data.table)
dtDistributions <- data.table(PayeeName = c("Bob", "Tracy", "Tom"),
  Department = factor(c("H229000", "H135000", 
"H047800")),
  Amount = c(5, 34, 87),
  PaymentDT = as.Date(c("2016-01-01", "2015-01-01", 
"2015-01-01")))

dtDepartments <- data.table(Department = factor(c("H229000", "H229000", 
"H229000", "H135000", "H047800")),
    EffDT = as.Date(c("2019-01-01", "2012-01-01", 
"1901-01-01", "1901-01-01", "1901-01-01")),
    Descr = c("Final Name","Modified Name","Original 
Name","Payables","Postal"))

Here is the output I would like to see:
PayeeName  Department     PaymentDT   Amount
Bob        Modified Name  2016-01-01  5
Tracy      Payables       2015-01-01  34
Tom        Postal         2015-01-01  87

I was able to get this working by using the sqldf library, but it runs for a 
very long time in my actual dataset and I'd like to use data.table if at all 
possible.
library(sqldf)
joinString <- "SELECT A.PayeeName, B.Descr, A.PaymentDT, A.Amount
    FROM dtDistributions A, dtDepartments B
    WHERE A.DEPARTMENT = B.Department
    AND B.EffDT = (SELECT MAX(ED.EffDT)
    FROM dtDepartments ED
    WHERE B.Department = ED.Department
    AND ED.EffDT <= A.PaymentDT)"

finalDT <- data.table(sqldf(joinString))




__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
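For later readers: the "latest EffDT at or before PaymentDT" lookup described
above maps onto a data.table rolling join (roll = TRUE, i.e. last observation
carried forward). A sketch against the sample data only -- not tested on the
poster's full source; Department is kept as character for the simplest join:

```r
library(data.table)

dtDistributions <- data.table(
  PayeeName  = c("Bob", "Tracy", "Tom"),
  Department = c("H229000", "H135000", "H047800"),
  Amount     = c(5, 34, 87),
  PaymentDT  = as.Date(c("2016-01-01", "2015-01-01", "2015-01-01")))

dtDepartments <- data.table(
  Department = c("H229000", "H229000", "H229000", "H135000", "H047800"),
  EffDT      = as.Date(c("2019-01-01", "2012-01-01", "1901-01-01",
                         "1901-01-01", "1901-01-01")),
  Descr      = c("Final Name", "Modified Name", "Original Name",
                 "Payables", "Postal"))

# Rolling join: for each payment, take the department row whose EffDT is the
# largest one not exceeding PaymentDT (roll = TRUE rolls the last EffDT forward).
finalDT <- dtDepartments[dtDistributions,
                         .(PayeeName, Department = Descr,
                           PaymentDT = EffDT, Amount),
                         on = .(Department, EffDT = PaymentDT),
                         roll = TRUE]
```

This replaces the correlated MAX(EffDT) subquery with a single keyed lookup,
which is the main reason it should scale better than the sqldf version.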


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-12-11 Thread Robert Wilkins
Dominik (and others)

If it is indeed still the biggest pain point, even in 2017, then maybe we
can do something about that, with more efforts at different user interface
design and try-outs with them on specialized datasets.
[The fact that in some specialties, clinical trials for example, getting
access to public-domain datasets (rather than a tiny "toy" dataset, which
nobody will pay attention to) is difficult does make it harder.]

It would help if academia (both comp-sci and statistics departments) would
support those who invest resources in drafting and test-driving new product
designs. If, in the year 2017, it is still a big pain point, doesn't that
make sense? More speculative work in statistical programming language
design has not been a priority in academia since before 1980.

On Thu, Nov 30, 2017 at 4:11 AM, Dominik Schneider <
dominik.schnei...@colorado.edu> wrote:

> I would agree that getting data into R from various sources is the biggest
> pain point. Even if there is an api, the results are not always consistent
> and you have to do lots of dimension checking to get it right. Or there
> isn't an open api at all and you have to hack it by web scraping or
> otherwise- http://enpiar.com/2017/08/11/one-hour-package/
>
> On Thu, Nov 30, 2017 at 1:00 AM, Jim Lemon  wrote:
>
>> Hi again,
>> Typo in the last email. Should read "about 40 standard deviations".
>>
>> Jim
>>
>> On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon  wrote:
>> > Hi Robert,
>> > People want different levels of automation in the software they use.
>> > What concerns many of us is the desire for the function
>> > "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
>> > Such users typically want something that justifies its use by being
>> > written by someone who seems to know what they're doing and lots of
>> > other people use it. One advantage of many R functions is their
>> > modular construction. This encourages users to at least consider the
>> > steps that are taken rather than just accept what comes out of that
>> > long tube.
>> >
>> > Take the contentious problem of outlier identification. If I just let
>> > the black box peel off some values, I don't know what I have lost. On
>> > the other hand, if I import data and examine it with a summary
>> > function, I may find that one woman has a height of 5.2 meters. I can
>> > range check by looking up the Guinness Book of Records. It's an
>> > outlier. I can estimate the probability of such a height.  Hmm, about
>> > 4 standard deviations above the mean. It's an outlier. I can attempt a
>> > Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
>> > has been recorded as a metric value". It's not an outlier.
>> >
>> > The more R gravitates toward "black box" functions, the more some
>> > users are encouraged to let them do the work. You pays your money and
>> > you takes your chances.
>> >
>> > Jim
>> >
>> >
>> > On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins 
>> wrote:
>> >> R has a very wide audience, clinical research, astronomy, psychology,
>> and
>> >> so on and so on.
>> >> I would consider data analysis work to be three stages: data
>> preparation,
>> >> statistical analysis, and producing the report.
>> >> This regards the process of getting the data ready for analysis and
>> >> reporting, sometimes called "data cleaning" or "data munging" or "data
>> >> wrangling".
>> >>
>> >> So as regards tools for data preparation, speaking to the highly
>> diverse
>> >> audience mentioned, here is my question:
>> >>
>> >> What do you want?
>> >> Or are you already quite happy with the range of tools that is
>> currently
>> >> before you?
>> >>
>> >> [BTW,  I posed the same question last week to the r-devel list, and was
>> >> advised that r-help might be a more suitable audience by one of the
>> >> moderators.]
>> >>
>> >> Robert Wilkins
>> >>
>> >> [[alternative HTML version deleted]]
>> >>
>> >> __
>> >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-30 Thread Dominik Schneider
I would agree that getting data into R from various sources is the biggest
pain point. Even if there is an api, the results are not always consistent
and you have to do lots of dimension checking to get it right. Or there
isn't an open api at all and you have to hack it by web scraping or
otherwise- http://enpiar.com/2017/08/11/one-hour-package/

On Thu, Nov 30, 2017 at 1:00 AM, Jim Lemon  wrote:

> Hi again,
> Typo in the last email. Should read "about 40 standard deviations".
>
> Jim
>
> On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon  wrote:
> > Hi Robert,
> > People want different levels of automation in the software they use.
> > What concerns many of us is the desire for the function
> > "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
> > Such users typically want something that justifies its use by being
> > written by someone who seems to know what they're doing and lots of
> > other people use it. One advantage of many R functions is their
> > modular construction. This encourages users to at least consider the
> > steps that are taken rather than just accept what comes out of that
> > long tube.
> >
> > Take the contentious problem of outlier identification. If I just let
> > the black box peel off some values, I don't know what I have lost. On
> > the other hand, if I import data and examine it with a summary
> > function, I may find that one woman has a height of 5.2 meters. I can
> > range check by looking up the Guinness Book of Records. It's an
> > outlier. I can estimate the probability of such a height.  Hmm, about
> > 4 standard deviations above the mean. It's an outlier. I can attempt a
> > Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
> > has been recorded as a metric value". It's not an outlier.
> >
> > The more R gravitates toward "black box" functions, the more some
> > users are encouraged to let them do the work. You pays your money and
> > you takes your chances.
> >
> > Jim
> >
> >
> > On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins 
> wrote:
> >> R has a very wide audience, clinical research, astronomy, psychology,
> and
> >> so on and so on.
> >> I would consider data analysis work to be three stages: data
> preparation,
> >> statistical analysis, and producing the report.
> >> This regards the process of getting the data ready for analysis and
> >> reporting, sometimes called "data cleaning" or "data munging" or "data
> >> wrangling".
> >>
> >> So as regards tools for data preparation, speaking to the highly diverse
> >> audience mentioned, here is my question:
> >>
> >> What do you want?
> >> Or are you already quite happy with the range of tools that is currently
> >> before you?
> >>
> >> [BTW,  I posed the same question last week to the r-devel list, and was
> >> advised that r-help might be a more suitable audience by one of the
> >> moderators.]
> >>
> >> Robert Wilkins
> >>
> >> [[alternative HTML version deleted]]
> >>
> >> __
> >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
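The "lots of dimension checking" mentioned above can be factored into one
reusable guard. A sketch -- check_frame() and the column names are
illustrative, not taken from any particular API:

```r
# Fail fast if externally fetched data is not the shape the analysis expects.
check_frame <- function(dat, required_cols) {
  stopifnot(is.data.frame(dat), nrow(dat) > 0)
  missing_cols <- setdiff(required_cols, names(dat))
  if (length(missing_cols) > 0)
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  invisible(dat)
}

# Stand-in for a data frame returned by some web service
dat <- data.frame(date = as.Date("2017-11-30"), value = 3.2)
check_frame(dat, c("date", "value"))      # passes silently
# check_frame(dat, c("date", "region"))   # would stop: missing columns: region
```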


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Jim Lemon
Hi again,
Typo in the last email. Should read "about 40 standard deviations".

Jim

On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon  wrote:
> Hi Robert,
> People want different levels of automation in the software they use.
> What concerns many of us is the desire for the function
> "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
> Such users typically want something that justifies its use by being
> written by someone who seems to know what they're doing and lots of
> other people use it. One advantage of many R functions is their
> modular construction. This encourages users to at least consider the
> steps that are taken rather than just accept what comes out of that
> long tube.
>
> Take the contentious problem of outlier identification. If I just let
> the black box peel off some values, I don't know what I have lost. On
> the other hand, if I import data and examine it with a summary
> function, I may find that one woman has a height of 5.2 meters. I can
> range check by looking up the Guinness Book of Records. It's an
> outlier. I can estimate the probability of such a height.  Hmm, about
> 4 standard deviations above the mean. It's an outlier. I can attempt a
> Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
> has been recorded as a metric value". It's not an outlier.
>
> The more R gravitates toward "black box" functions, the more some
> users are encouraged to let them do the work. You pays your money and
> you takes your chances.
>
> Jim
>
>
> On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins  wrote:
>> R has a very wide audience, clinical research, astronomy, psychology, and
>> so on and so on.
>> I would consider data analysis work to be three stages: data preparation,
>> statistical analysis, and producing the report.
>> This regards the process of getting the data ready for analysis and
>> reporting, sometimes called "data cleaning" or "data munging" or "data
>> wrangling".
>>
>> So as regards tools for data preparation, speaking to the highly diverse
>> audience mentioned, here is my question:
>>
>> What do you want?
>> Or are you already quite happy with the range of tools that is currently
>> before you?
>>
>> [BTW,  I posed the same question last week to the r-devel list, and was
>> advised that r-help might be a more suitable audience by one of the
>> moderators.]
>>
>> Robert Wilkins
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Jim Lemon
Hi Robert,
People want different levels of automation in the software they use.
What concerns many of us is the desire for the function
"figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
Such users typically want something that justifies its use by being
written by someone who seems to know what they're doing and lots of
other people use it. One advantage of many R functions is their
modular construction. This encourages users to at least consider the
steps that are taken rather than just accept what comes out of that
long tube.

Take the contentious problem of outlier identification. If I just let
the black box peel off some values, I don't know what I have lost. On
the other hand, if I import data and examine it with a summary
function, I may find that one woman has a height of 5.2 meters. I can
range check by looking up the Guinness Book of Records. It's an
outlier. I can estimate the probability of such a height.  Hmm, about
4 standard deviations above the mean. It's an outlier. I can attempt a
Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
has been recorded as a metric value". It's not an outlier.

The more R gravitates toward "black box" functions, the more some
users are encouraged to let them do the work. You pays your money and
you takes your chances.

Jim


On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins  wrote:
> R has a very wide audience, clinical research, astronomy, psychology, and
> so on and so on.
> I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report.
> This regards the process of getting the data ready for analysis and
> reporting, sometimes called "data cleaning" or "data munging" or "data
> wrangling".
>
> So as regards tools for data preparation, speaking to the highly diverse
> audience mentioned, here is my question:
>
> What do you want?
> Or are you already quite happy with the range of tools that is currently
> before you?
>
> [BTW,  I posed the same question last week to the r-devel list, and was
> advised that r-help might be a more suitable audience by one of the
> moderators.]
>
> Robert Wilkins
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
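Jim's 5.2-metre example is easy to act out; a sketch with invented heights,
showing both the statistical flag and the "Sherlock Holmes" unit check:

```r
heights <- c(1.60, 1.72, 5.2, 1.55, 1.68)   # metres; 5.2 is the suspect value

# The black-box view: 5.2 has by far the most extreme z-score...
z <- (heights - mean(heights)) / sd(heights)

# ...but reinterpreting it as feet and inches (5'2") gives a plausible height,
# so the right fix is a unit correction, not deletion.
feet <- 5; inches <- 2
corrected <- (feet * 12 + inches) * 0.0254   # 1.5748 metres
```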


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Robert Wilkins
Christopher,

OK, well what about a range of functions in an R package that
automatically, with very little syntax, pulls in data from a variety of
formats (CSV, SQLite, and so on) and converts them to an R data frame. You
seem to be pointing to something like that.
Something like that, in some form or another, probably already exists,
though it might be either imperfect (not as user-friendly as possible) or
not well publicised, or both.
Or another tangent: your co-workers are not going to stop using Excel,
whether you like it or not, and many end-users are stuck in the exact same
position as you (co-workers who deliver the data in Excel). I will guess
that data stored in Excel tends to be dirty in somewhat predictable ways.
(And again, those other end-user's coworkers are not going to change their
behaviour). And so: a data munging tool that makes it as easy as possible
to clean up the data in Excel spreadsheets and export them to R data
frames. One prerequisite: an understanding of what tends to go wrong with
data with Excel ( the data in Excel tends to be dirty, but dirty in what
way?).

Thank you for your response Christopher. What state are you in?


On Wed, Nov 29, 2017 at 11:52 AM, Christopher W. Ryan <cr...@binghamton.edu>
wrote:

> Great question. What do I want? I want my co-workers to stop using Excel
> spreadsheets for data entry, storage, and sharing! I want them to
> understand the value of data discipline. But alas . . . .
>
> I work in a county health department in the US. Between dplyr, stringr,
> grep, grepl, and the base R read() functions, I'm doing OK.
>
> I need to learn more about APIs, so I can see if I can make R directly
> grab data from, e.g. our state health department sources. My biggest
> hassle is having to download a data file, save it somewhere, and then
> open R and read it in. I'd like to be able to do it all in R. Would make
> the generation of recurring reports easier.
>
> --Chris Ryan
>
> Robert Wilkins wrote:
> > R has a very wide audience, clinical research, astronomy, psychology, and
> > so on and so on.
> > I would consider data analysis work to be three stages: data preparation,
> > statistical analysis, and producing the report.
> > This regards the process of getting the data ready for analysis and
> > reporting, sometimes called "data cleaning" or "data munging" or "data
> > wrangling".
> >
> > So as regards tools for data preparation, speaking to the highly diverse
> > audience mentioned, here is my question:
> >
> > What do you want?
> > Or are you already quite happy with the range of tools that is currently
> > before you?
> >
> > [BTW,  I posed the same question last week to the r-devel list, and was
> > advised that r-help might be a more suitable audience by one of the
> > moderators.]
> >
> > Robert Wilkins
> >
> >   [[alternative HTML version deleted]]
> >
> > __
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
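On Chris's download-save-read cycle: base R's read.csv() accepts a URL or any
connection directly, so the manual download step can often be dropped. A
sketch, with a temp file standing in for a hypothetical health-department
endpoint (the URL in the comment is invented):

```r
# With a real endpoint this would simply be:
#   dat <- read.csv("https://health.example.gov/reports/cases.csv")
tmp <- tempfile(fileext = ".csv")
writeLines(c("county,cases", "Broome,12", "Tioga,3"), tmp)

dat <- read.csv(tmp, stringsAsFactors = FALSE)
```

For endpoints that need authentication or return JSON, contributed packages
(e.g. httr, jsonlite) cover the rest, but plain CSV-over-HTTP needs nothing
beyond base R.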


Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-29 Thread Bert Gunter
Oh Crap! I mistakenly replied onlist. PLEASE IGNORE -- these are only my
ignorant opinions.

-- Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Wed, Nov 29, 2017 at 8:48 AM, Bert Gunter  wrote:

> I don't think my view is of interest to many, so offlist.
>
> I reject this:
>
> " I would consider data analysis work to be three stages: data preparation,
> statistical analysis, and producing the report."
>
> For example, there is no such thing as "outliers" -- data to be removed as
> part of cleaning/preparation -- without a statistical model to be an
> "outlier" **from**, which is part of the statistical analysis. And the
> structure of the data (data preparation) may need to change depending on
> the course of the analysis (including graphics, also part of the analysis).
> So I think your view reflects a naïve view of the nature of data analysis,
> which is an iterative and holistic process. I suspect your training is as a
> computer scientist and you have not done much 1-1 consulting with
> researchers, though you should certainly feel free to reject this canard.
> Building software for large scale automated analysis of data required a
> much different analytical paradigm than the statistical consulting model,
> which is largely my background.
>
> No reply necessary. Just my opinion, which you are of course free to trash.
>
> Cheers,
> Bert
>
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along and
> sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
> On Wed, Nov 29, 2017 at 8:37 AM, Robert Wilkins 
> wrote:
>
>> R has a very wide audience, clinical research, astronomy, psychology, and
>> so on and so on.
>> I would consider data analysis work to be three stages: data preparation,
>> statistical analysis, and producing the report.
>> This regards the process of getting the data ready for analysis and
>> reporting, sometimes called "data cleaning" or "data munging" or "data
>> wrangling".
>>
>> So as regards tools for data preparation, speaking to the highly diverse
>> audience mentioned, here is my question:
>>
>> What do you want?
>> Or are you already quite happy with the range of tools that is currently
>> before you?
>>
>> [BTW,  I posed the same question last week to the r-devel list, and was
>> advised that r-help might be a more suitable audience by one of the
>> moderators.]
>>
>> Robert Wilkins
>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
