Re: [R] the quantile function and problems.

2022-07-14 Thread Ivan Krylov
On Thu, 14 Jul 2022 14:58:17 +0200,
Uwe Brauer wrote:

> What turns me crazy is that the way R, matlab and the JCR calculate
> the quartiles gives different results.

R by itself can give up to 9 slightly different results:

sapply(1:9, function(type) quantile(1:267, 1:3/4, type = type))
#     [,1] [,2] [,3]   [,4]   [,5] [,6]  [,7]      [,8]     [,9]
# 25%   67   67   67  66.75  67.25   67  67.5  67.16667  67.1875
# 50%  134  134  134 133.50 134.00  134 134.0 134.00000 134.0000
# 75%  201  201  200 200.25 200.75  201 200.5 200.83333 200.8125

Choose the ones that fit your ideas of quantile best. See ?quantile for
more info.
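
For what it is worth, here is a minimal sketch of how the numbers in this thread line up. It assumes the data are simply the ranks 1:267, as in the question. Type 5 happens to reproduce the MATLAB figures quoted by the original poster, and the JCR rule Z = X/Y <= p amounts to taking floor(Y * p) as the last rank in each quartile. The variable names are made up for illustration.

```r
x <- 1:267
probs <- c(0.25, 0.5, 0.75)

# R's default is type 7; type 5 gives the MATLAB values quoted in the thread
r_default   <- quantile(x, probs, type = 7, names = FALSE)
matlab_like <- quantile(x, probs, type = 5, names = FALSE)

# JCR rule: a journal of rank X is in quartile p if X / Y <= p,
# so the last rank in each quartile is floor(Y * p)
jcr_last <- floor(length(x) * probs)

# Assign every rank to its JCR quartile; cut() uses (lower, upper] intervals
quartile <- cut(x / length(x), breaks = c(0, 0.25, 0.5, 0.75, 1),
                labels = paste0("Q", 1:4))
counts <- table(quartile)
```

So none of the three is "wrong"; they answer slightly different questions, and the JCR rule is not an interpolating quantile at all.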

-- 
Best regards,
Ivan

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] mice: selecting small subset of variables to impute from dataset with many variables (> 2500)

2022-07-14 Thread avi.e.gross
Tim,

Your reply is reasonable if you want to read in EVERYTHING and use various
nice features of the select() function in the dplyr package of the tidyverse
that let you exclude a bunch of columns based on names starting or ending or
containing various characters or not being of type integer and so on. 

But another category wants to skip creating some columns in the first place.
Many reader functions that take in data from something like a .CSV file will
allow you to effectively ignore some of the columns of data and thus
hopefully cut down on some overhead.
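
As a rough illustration of skipping columns at read time, base R's read.csv() accepts colClasses, where the string "NULL" drops a column. The tiny inline CSV and its column names below are invented for the example; a real file with 2500+ columns would need the vector built programmatically.

```r
# A made-up three-column CSV, supplied inline instead of from a file
csv_text <- "id,comment,score\n1,first,2.5\n2,second,3.5"

# "NULL" in colClasses tells the reader to skip that column entirely
wanted <- read.csv(text = csv_text,
                   colClasses = c("integer", "NULL", "numeric"))
```

Other readers have their own way of doing this, so check the documentation of whichever one you use.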

I assume most of us have no real experience with the package called "mice",
and few will be willing to read to page 72 or so of this document:
https://cran.r-project.org/web/packages/mice/mice.pdf

Anywho, the mice() function this person wants to use has arguments meant to
control what is brought in and stored in whatever internal format as in not
taking some rows. A cursory glance suggests no way to suppress columns other
than not including them before calling the function as it does not read the
data from a file and expects either a data.frame or a matrix.

So your answer is valid. The questioner can use any method they wish to
adjust the initial data.frame and create a partial copy to use. If they want
a small subset of 2500+ columns (and who wouldn't) then it may be easiest to
simply name them in base R or select as in:

New.df <- Old.df[, c("col36", "col89", "hike")]

On the other hand, if they merely want to exclude lots of columns that have
something in common, yes, select() allows things like:

New.df <- select(Old.df, -ends_with(c("extra", "comment")))

The tidyverse keeps being rewritten, so some new ways may be replacing old
ones, but there are variants like select_if() that allow arbitrary functions
to decide which columns to include or exclude, such as by the type they
contain.
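
For anyone who prefers to stay in base R, the three kinds of selection mentioned here (by name, by name pattern, by type) can be sketched as follows. The data frame and its column names are hypothetical and only serve to make the example runnable.

```r
# Hypothetical stand-in for the wide data frame
Old.df <- data.frame(col36 = 1:3, hike = c(2.5, 3.1, 4.8),
                     notes_extra = c("x", "y", "z"),
                     free_comment = c("a", "b", "c"))

# Keep columns by name
by_name <- Old.df[, c("col36", "hike")]

# Drop columns whose names end in "extra" or "comment"
by_suffix <- Old.df[, !grepl("(extra|comment)$", names(Old.df)), drop = FALSE]

# Keep columns by type, here only the numeric ones
by_type <- Old.df[, vapply(Old.df, is.numeric, logical(1)), drop = FALSE]
```

In this toy case all three give the same two columns, which is just a coincidence of how the example was built.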

So the key is to trim before calling the function but leave in everything
needed.

Only the one asking the question knows what all the columns mean and what
rhyme or reason decides which to keep or exclude. A more specific question
may get a more specific answer.


-Original Message-
From: R-help  On Behalf Of Ebert,Timothy Aaron
Sent: Thursday, July 14, 2022 2:12 PM
To: Bert Gunter ; Ian McPhail 
Cc: R-help 
Subject: Re: [R] mice: selecting small subset of variables to impute from
dataset with many variables (> 2500)

Maybe this is too simple but could you use the select() function from dplyr?
Tim

-Original Message-
From: R-help  On Behalf Of Bert Gunter
Sent: Thursday, July 14, 2022 2:10 PM
To: Ian McPhail 
Cc: R-help 
Subject: Re: [R] mice: selecting small subset of variables to impute from
dataset with many variables (> 2500)

[External Email]

If I understand your query correctly, you can use negative indexing to omit
variables. See ?'[' for details.

> dat <- data.frame(a = 1:3, b = letters[1:3], c = 4:6, d = letters[5:7])
> dat
  a b c d
1 1 a 4 e
2 2 b 5 f
3 3 c 6 g
> dat[,-c(2,4)]
  a c
1 1 4
2 2 5
3 3 6

Of course you have to know the numerical index of the columns you wish to
omit, but something of the sort seems unavoidable in any case.

Cheers,
Bert

On Thu, Jul 14, 2022 at 11:00 AM Ian McPhail  wrote:
>
> Hello,
>
> I am looking for some advice on how to select subsets of variables for 
> imputing when using the mice package.
>
> From Van Buuren's original mice paper, I see that selecting variables 
> to be 'skipped' in an imputation can be written as:
>
> ini <- mice(nhanes2, maxit = 0, print = FALSE)
> pred <- ini$pred
> pred[, "bmi"] <- 0
> meth <- ini$meth
> meth["bmi"] <- ""
>
> With the last two lines specifying that the "bmi" variable gets skipped
> over and not imputed.
>
> And I have come across other examples, but all that I have seen lay 
> out a method of skipping variables where EVERY variable is named (as 
> "bmi" is named above). I am wondering if there is a reasonably easy 
> way to select out approximately 30 variables for imputation from a 
> larger dataset with around 2500 variables, without having to name all
> 2450+ other variables.
>
> Thank you,
>
> Ian

Re: [R] the quantile function and problems.

2022-07-14 Thread Bert Gunter
Read ?quantile carefully, please (and any references therein that you
may wish to consult).

You are estimating a continuous function by a discrete finite step
function, and as the Help page (and further references) explains,
there are many ways to do this.
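
To make the step-function point concrete, here is a small sketch using the ranks 1:267 from the question as data. It only illustrates that quantile type 1 is the inverse of the empirical CDF; the other types interpolate between the steps in different ways.

```r
x <- 1:267
Fn <- ecdf(x)   # the empirical CDF, a step function
p <- 0.25

# Type 1 returns the smallest data value at which the ECDF reaches p
q1 <- quantile(x, p, type = 1, names = FALSE)

reaches_p   <- Fn(q1) >= p      # the step at q1 is at or above p
below_before <- Fn(q1 - 1) < p  # one rank earlier it is still below p
```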

Bert


On Thu, Jul 14, 2022 at 2:33 PM Uwe Brauer  wrote:
>
>
> Hi
>
> I am very acquainted with R. I use it occasionally via the org-babel library 
> of GNU emacs.
>
> I wanted to check the first, second and third quartiles of the scientific 
> science index JCR
> https://support.clarivate.com/ScientificandAcademicResearch/s/article/Journal-Citation-Reports-Quartile-rankings-and-other-metrics?language=en_US
> Its criterion is
> #+begin_src
> | Quartil | range             |                                       |
> |---------+-------------------+---------------------------------------|
> | Q1      | 0.0 < Z \leq 0.25 | Highest ranked journals in a category |
> | Q2      | 0.25 < Z \leq 0.5 |                                       |
> | Q3      | 0.5 < Z \leq 0.75 |                                       |
> | Q4      | 0.75 < Z          | Lowest ranked journals in a category  |
> #+end_src
>
> Z=(X/Y)
>
> Where X is the journal rank in category and Y is the number of journals in 
> the category.
>
> Now I have a list of 267 journals.
>
> What turns me crazy is that the way R, matlab and the JCR calculate the 
> quartiles gives different results.
>
> Here is a table
> #+begin_matlab :exports both :eval never-export :results output latex
> #+RESULTS:
> | quartil-limit (last member) |    | floor_Rlang | jcr | jcr_check | floor_check |
> |-----------------------------+----+-------------+-----+-----------+-------------|
> |                        67.5 | Q1 |          67 |  66 |    0.2472 |      0.2509 |
> |                         134 | Q2 |         134 | 133 |    0.4981 |      0.5019 |
> |                       200.5 | Q3 |         200 | 200 |    0.7491 |      0.7491 |
> |                         267 |    |         267 | 267 |         1 |           1 |
> #+TBLFM: $5=$4/267::$6=$3/267
> #+end_matlab
>
> I calculated using R (I don't provide the vector from 1 to 267)
>
> #+begin_src R :colnames t :var t1=jcr22
>   quantile(t1$Data,c(1/4,1/2,3/4,1))
> #+end_src
> #+begin_src
> #+RESULTS:
> | x |
> |---|
> |  67.5 |
> |   134 |
> | 200.5 |
> |   267 |
> #+end_src
>
>
> So you see the problem with Q1 and Q2.
>
> On top of that matlab gives
>
> #+begin_src matlab :exports results :eval never-export :results output latex
> format short
> x=1:267;
> q1 = quantile(x,1/4);
> q2 = quantile(x,1/2);
> q3 = quantile(x,3/4);
> Q=[q1; q2; q3];
> sprintf('|%g|   \n', Q)
> #+end_src
>
> #+RESULTS:
> #+begin_export latex
> |67.25|
> |134|
> |200.75|
> #+end_export
>
> Which is also slightly different from R.
>
> Can anybody enlighten me please?
> Thanks and regards
>
> Uwe Brauer
>
> --
> I strongly condemn Putin's war of aggression against the Ukraine.
> I support to deliver weapons to Ukraine's military.
> I support the ban of Russia from SWIFT.
> I support the EU membership of the Ukraine.


[R] the quantile function and problems.

2022-07-14 Thread Uwe Brauer


Hi 

I am very acquainted with R. I use it occasionally via the org-babel library of 
GNU emacs.

I wanted to check the first, second and third quartiles of the scientific 
science index JCR
https://support.clarivate.com/ScientificandAcademicResearch/s/article/Journal-Citation-Reports-Quartile-rankings-and-other-metrics?language=en_US
Its criterion is 
#+begin_src 
| Quartil | range             |                                       |
|---------+-------------------+---------------------------------------|
| Q1      | 0.0 < Z \leq 0.25 | Highest ranked journals in a category |
| Q2      | 0.25 < Z \leq 0.5 |                                       |
| Q3      | 0.5 < Z \leq 0.75 |                                       |
| Q4      | 0.75 < Z          | Lowest ranked journals in a category  |
#+end_src

Z=(X/Y)

Where X is the journal rank in category and Y is the number of journals in the 
category.

Now I have a list of 267 journals.

What turns me crazy is that the way R, matlab and the JCR calculate the 
quartiles gives different results.

Here is a table 
#+begin_matlab :exports both :eval never-export :results output latex
#+RESULTS:
| quartil-limit (last member) |    | floor_Rlang | jcr | jcr_check | floor_check |
|-----------------------------+----+-------------+-----+-----------+-------------|
|                        67.5 | Q1 |          67 |  66 |    0.2472 |      0.2509 |
|                         134 | Q2 |         134 | 133 |    0.4981 |      0.5019 |
|                       200.5 | Q3 |         200 | 200 |    0.7491 |      0.7491 |
|                         267 |    |         267 | 267 |         1 |           1 |
#+TBLFM: $5=$4/267::$6=$3/267
#+end_matlab

I calculated using R (I don't provide the vector from 1 to 267)

#+begin_src R :colnames t :var t1=jcr22
  quantile(t1$Data,c(1/4,1/2,3/4,1))
#+end_src
#+begin_src 
#+RESULTS:
| x |
|---|
|  67.5 |
|   134 |
| 200.5 |
|   267 |
#+end_src


So you see the problem with Q1 and Q2.

On top of that matlab gives

#+begin_src matlab :exports results :eval never-export :results output latex
format short
x=1:267;
q1 = quantile(x,1/4);
q2 = quantile(x,1/2);
q3 = quantile(x,3/4);
Q=[q1; q2; q3];
sprintf('|%g|   \n', Q)
#+end_src

#+RESULTS:
#+begin_export latex
|67.25|   
|134|   
|200.75|   
#+end_export

Which is also slightly different from R.

Can anybody enlighten me please?
Thanks and regards 

Uwe Brauer 

-- 
I strongly condemn Putin's war of aggression against the Ukraine.
I support to deliver weapons to Ukraine's military. 
I support the ban of Russia from SWIFT.
I support the EU membership of the Ukraine.



Re: [R] mice: selecting small subset of variables to impute from dataset with many variables (> 2500)

2022-07-14 Thread Rui Barradas

Hello,

You can use the mice() argument predictorMatrix to tell mice() which
variables/blocks are used as predictors when imputing each target. Rows are
the targets and columns are the predictors, so if a variable's column is set
to zeros, that variable is not used in the imputation of any other column or
block.



library(mice)

predmat <- matrix(1L, ncol(nhanes2), ncol(nhanes2),
  dimnames = list(names(nhanes2), names(nhanes2)))
diag(predmat) <- 0L
predmat[, "bmi"] <- 0L
predmat
#>     age bmi hyp chl
#> age   0   0   1   1
#> bmi   1   0   1   1
#> hyp   1   0   0   1
#> chl   1   0   1   0




Then use the argument where to skip the variables you do not want imputed.
Note that this is not the same as zeroing a column of predmat, which only
stops that variable from being used as a predictor.


The default of where is the matrix is.na(nhanes2), so make a copy of this
matrix, set column "bmi" to FALSE, and then call mice().




predmat <- matrix(1L, ncol(nhanes2), ncol(nhanes2),
  dimnames = list(names(nhanes2), names(nhanes2)))
diag(predmat) <- 0L
predmat[, "bmi"] <- 0L
predmat
#>     age bmi hyp chl
#> age   0   0   1   1
#> bmi   1   0   1   1
#> hyp   1   0   0   1
#> chl   1   0   1   0

not_bmi <- is.na(nhanes2)
not_bmi[, "bmi"] <- FALSE

ini_all <- mice(nhanes2, print = FALSE)
ini_bmi <- mice(nhanes2,
predictorMatrix = predmat,
where = not_bmi,
print = FALSE)


cmpl_all <- complete(ini_all)
head(cmpl_all)
#>     age  bmi hyp chl
#> 1 20-39 28.7  no 187
#> 2 40-59 22.7  no 187
#> 3 20-39 30.1  no 187
#> 4 60-99 27.5 yes 284
#> 5 20-39 20.4  no 113
#> 6 60-99 20.4  no 184
cmpl_bmi <- complete(ini_bmi)
head(cmpl_bmi)
#>     age  bmi hyp chl
#> 1 20-39   NA  no 187
#> 2 40-59 22.7  no 187
#> 3 20-39   NA  no 187
#> 4 60-99   NA yes 206
#> 5 20-39 20.4  no 113
#> 6 60-99   NA yes 184


Hope this helps,

Rui Barradas

At 18:59 on 14/07/2022, Ian McPhail wrote:

Hello,

I am looking for some advice on how to select subsets of variables for
imputing when using the mice package.

 From Van Buuren's original mice paper, I see that selecting variables to be
'skipped' in an imputation can be written as:

ini <- mice(nhanes2, maxit = 0, print = FALSE)
pred <- ini$pred
pred[, "bmi"] <- 0
meth <- ini$meth
meth["bmi"] <- ""

With the last two lines specifying that the "bmi" variable gets skipped over
and not imputed.

And I have come across other examples, but all that I have seen lay out a
method of skipping variables where EVERY variable is named (as "bmi" is
named above). I am wondering if there is a reasonably easy way to select
out approximately 30 variables for imputation from a larger dataset with
around 2500 variables, without having to name all 2450+ other variables.

Thank you,

Ian



[R] running a scraping code in parallel...

2022-07-14 Thread akshay kulkarni
Dear members,

please feel free to ignore this mail if you feel that it is not about Base R.

I have the following web scraping code (I have 500 stocks to iterate over):

getFirmsDates <- function(i) {
  rD <- rsDriver(browser = "chrome")
  remDr <- rD$client

  # ... scrape for stock i ...
}

Will the following code work?

DATES <- mclapply(1:500, getFirmsDates, mc.cores = 48)

Basically, there must be 500 Chrome instances, and rD and remDr are the same
for all iterations. If not, are there any suggestions on how to accomplish
the task?

I am using RSelenium and rvest packages.
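
One way to think about the question is sketched below, with a stand-in function instead of a real browser session so that it runs anywhere. It assumes that each task needs its own port and that the function called by mclapply() must accept the index it is handed. The function name scrape_one and the base port are invented for illustration.

```r
library(parallel)

# Hypothetical stand-in for the real scraper: takes the stock index i and
# derives a per-task port so that no two sessions would share one.
scrape_one <- function(i, base_port = 4567L) {
  port <- base_port + as.integer(i)
  # Real code would start a driver here on that port, for example via
  # RSelenium::rsDriver(browser = "chrome", port = port), scrape, and
  # close the session again before returning.
  list(stock = i, port = port)
}

# Forking is not available on Windows, so fall back to one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
DATES <- mclapply(1:6, scrape_one, mc.cores = n_cores)
ports <- vapply(DATES, function(r) r$port, integer(1))
```

Whether 48 simultaneous Chrome sessions behave well on a given machine is a separate matter that this sketch does not address.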

Thanking you,
yours sincerely,
AKSHAY M KULKARNI



Re: [R] mice: selecting small subset of variables to impute from dataset with many variables (> 2500)

2022-07-14 Thread Ebert,Timothy Aaron
Maybe this is too simple but could you use the select() function from dplyr?
Tim

-Original Message-
From: R-help  On Behalf Of Bert Gunter
Sent: Thursday, July 14, 2022 2:10 PM
To: Ian McPhail 
Cc: R-help 
Subject: Re: [R] mice: selecting small subset of variables to impute from 
dataset with many variables (> 2500)

[External Email]

If I understand your query correctly, you can use negative indexing to omit 
variables. See ?'[' for details.

> dat <- data.frame(a = 1:3, b = letters[1:3], c = 4:6, d = letters[5:7])
> dat
  a b c d
1 1 a 4 e
2 2 b 5 f
3 3 c 6 g
> dat[,-c(2,4)]
  a c
1 1 4
2 2 5
3 3 6

Of course you have to know the numerical index of the columns you wish to omit, 
but something of the sort seems unavoidable in any case.

Cheers,
Bert

On Thu, Jul 14, 2022 at 11:00 AM Ian McPhail  wrote:
>
> Hello,
>
> I am looking for some advice on how to select subsets of variables for 
> imputing when using the mice package.
>
> From Van Buuren's original mice paper, I see that selecting variables 
> to be 'skipped' in an imputation can be written as:
>
> ini <- mice(nhanes2, maxit = 0, print = FALSE)
> pred <- ini$pred
> pred[, "bmi"] <- 0
> meth <- ini$meth
> meth["bmi"] <- ""
>
> With the last two lines specifying that the "bmi" variable gets skipped
> over and not imputed.
>
> And I have come across other examples, but all that I have seen lay 
> out a method of skipping variables where EVERY variable is named (as 
> "bmi" is named above). I am wondering if there is a reasonably easy 
> way to select out approximately 30 variables for imputation from a 
> larger dataset with around 2500 variables, without having to name all 2450+ 
> other variables.
>
> Thank you,
>
> Ian


Re: [R] mice: selecting small subset of variables to impute from dataset with many variables (> 2500)

2022-07-14 Thread Bert Gunter
If I understand your query correctly, you can use negative indexing to
omit variables. See ?'[' for details.

> dat <- data.frame (a = 1:3, b = letters[1:3], c = 4:6, d = letters[5:7])
> dat
  a b c d
1 1 a 4 e
2 2 b 5 f
3 3 c 6 g
> dat[,-c(2,4)]
  a c
1 1 4
2 2 5
3 3 6

Of course you have to know the numerical index of the columns you wish
to omit, but something of the sort seems unavoidable in any case.
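
If column names are easier to come by than positions, the same result can be had by name. This is a small sketch on the toy data frame above; the vector drop is just an illustrative name.

```r
dat <- data.frame(a = 1:3, b = letters[1:3], c = 4:6, d = letters[5:7])
drop <- c("b", "d")

# Keep everything except the named columns
kept <- dat[, setdiff(names(dat), drop), drop = FALSE]

# Equivalent logical form
same <- dat[, !(names(dat) %in% drop), drop = FALSE]

# Or recover the numeric positions for negative indexing
idx <- match(drop, names(dat))
by_index <- dat[, -idx]
```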

Cheers,
Bert

On Thu, Jul 14, 2022 at 11:00 AM Ian McPhail  wrote:
>
> Hello,
>
> I am looking for some advice on how to select subsets of variables for
> imputing when using the mice package.
>
> From Van Buuren's original mice paper, I see that selecting variables to be
> 'skipped' in an imputation can be written as:
>
> ini <- mice(nhanes2, maxit = 0, print = FALSE)
> pred <- ini$pred
> pred[, "bmi"] <- 0
> meth <- ini$meth
> meth["bmi"] <- ""
>
> With the last two lines specifying that the "bmi" variable gets skipped over
> and not imputed.
>
> And I have come across other examples, but all that I have seen lay out a
> method of skipping variables where EVERY variable is named (as "bmi" is
> named above). I am wondering if there is a reasonably easy way to select
> out approximately 30 variables for imputation from a larger dataset with
> around 2500 variables, without having to name all 2450+ other variables.
>
> Thank you,
>
> Ian


[R] mice: selecting small subset of variables to impute from dataset with many variables (> 2500)

2022-07-14 Thread Ian McPhail
Hello,

I am looking for some advice on how to select subsets of variables for
imputing when using the mice package.

From Van Buuren's original mice paper, I see that selecting variables to be
'skipped' in an imputation can be written as:

ini <- mice(nhanes2, maxit = 0, print = FALSE)
pred <- ini$pred
pred[, "bmi"] <- 0
meth <- ini$meth
meth["bmi"] <- ""

With the last two lines specifying that the "bmi" variable gets skipped over
and not imputed.

And I have come across other examples, but all that I have seen lay out a
method of skipping variables where EVERY variable is named (as "bmi" is
named above). I am wondering if there is a reasonably easy way to select
out approximately 30 variables for imputation from a larger dataset with
around 2500 variables, without having to name all 2450+ other variables.

Thank you,

Ian



Re: [R] How to parse a really silly date with lubridate

2022-07-14 Thread avi.e.gross
To be clear, I take no credit for the rather extraordinary function call shown
below:

mutate(Date = lubridate::dmy_hm(Date))

I would pretty much never have constructed such an interesting and rather 
unnecessary line of code.

ALL the work is done within the parentheses:

Date = lubridate::dmy_hm(Date)

The above creates a tibble and assigns it to the name of Date. But first it
takes a vector called Date containing text and uses a function to parse it and
return a form more suitable to use as a date.

So what does mutate() do when it is called with a tibble and asked to do 
nothing? I mean it sees something more like this:

mutate(.data = Date)

What is missing are the usual assignment statements to modify existing columns
or make new ones. So it does nothing and exits without complaint.

My suggestion was more like the following, which uses mutate. For
illustration, I will not use the name "Date" repeatedly; instead I make new
columns and use original_date as the source. Note that I am not saying the
lubridate call used will work, as I do not see how it handles the initial part
of the string containing an index number.

old_tibble <- tibble(useless = original_date)
new_tibble <- mutate(old_tibble, useful = lubridate::dmy_hm(useless))

The above can be more compact by making the tibble directly in the first 
argument, and it can be done using old or new pipelines. 

The reason the suggested way worked is because it used the vectorized methods 
of base-R and I mentioned there was no reason you must use dplyr for this or 
many other things especially when it is simple.

Now if you wanted to make multiple new columns containing character or integer 
versions of the month and other parts of the date or even calculate what day of 
the week that full date was in that year, then mutate can be very useful as you 
can keep adding requests to make a new column using all old and new columns 
already specified.
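
As a rough base-R sketch of that idea, the two strings from this thread can be parsed with an explicit format and then used to derive further columns. It assumes an English or C locale for the month abbreviation, and the column names are invented for the example.

```r
raw <- c("9. Jul 2022 at 11:39", "10. Jul 2022 at 01:58")

# Spell out the layout, including the literal ". " and " at "
parsed <- as.POSIXct(raw, format = "%d. %b %Y at %H:%M", tz = "UTC")

df <- data.frame(raw = raw, parsed = parsed)
df$month <- as.integer(format(df$parsed, "%m"))  # month as a number
df$wday  <- as.POSIXlt(df$parsed)$wday           # 0 = Sunday ... 6 = Saturday
```

With dplyr the two derived columns would simply be further arguments to one mutate() call.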

Sometimes when code works, we don't look to see if it works inadvertently, LOL!


-Original Message-
From: R-help  On Behalf Of Dr Eberhard W Lisse
Sent: Wednesday, July 13, 2022 5:49 PM
To: r-help@r-project.org
Subject: Re: [R] How to parse a really silly date with lubridate


Rui,

thanks, this is what Avi suggested in an email to me as well and works.

It's so easy if you know it :-)-O

el

On 2022-07-13 23:40 , Rui Barradas wrote:
> Hello,
> 
> Are you looking for mutate? In the example below I haven't included 
> the filter, since the tibble only has 2 rows. But the date column is 
> coerced to an actual datetime class in place, without the need for NewDate.
> 
> suppressPackageStartupMessages({
>   library(tibble)
>   library(dplyr)
> })
> 
> DDATA <- tibble(Date = c('9. Jul 2022 at 11:39', '10. Jul 2022 at 01:58'))
> 
> DDATA %>%
>   mutate(Date = lubridate::dmy_hm(Date))
> #> # A tibble: 2 × 1
> #>   Date
> #>   <dttm>
> #> 1 2022-07-09 11:39:00
> #> 2 2022-07-10 01:58:00
> 
> 
> Hope this helps,
> 
> Rui Barradas
[...]



Re: [R] aborting the execution of a function...

2022-07-14 Thread akshay kulkarni
Dear Bill,
Many thanks ..

Yours sincerely,
AKSHAY M KULKARNI

From: Bill Dunlap 
Sent: Thursday, July 14, 2022 1:28 AM
To: akshay kulkarni 
Cc: R help Mailing list 
Subject: Re: [R] aborting the execution of a function...

You could write a function that returns an environment (or list if you prefer) 
containing the results collected before the interrupt by using 
tryCatch(interrupt=...).  E.g.,

doMany <- function(names) {
    resultEnv <- new.env(parent=emptyenv())
    tryCatch(
        # replace Sys.sleep(1) by getStuffFromWeb(name)
        for(name in names) resultEnv[[name]] <- Sys.sleep(1),
        interrupt = function(e) NULL)
    resultEnv
}

Use it as

> system.time(e <- doMany(state.name)) # hit Esc or ^C after a few seconds
^C   user  system elapsed
  0.001   0.000   4.390
> names(e)
[1] "Alabama"  "Alaska"   "Arizona"  "Arkansas"
> eapply(e, identity)
$Alabama
NULL

$Alaska
NULL

$Arizona
NULL

$Arkansas
NULL

-Bill

On Wed, Jul 13, 2022 at 12:20 PM akshay kulkarni <akshay...@hotmail.com> wrote:
Dear members,
 I am running a large scraping code in a very powerful 
AWS ec2 instance:

DATES <- getFirmsDates()

It iterates over 500 stocks from a website. Despite the power of the machine, 
the execution is very slow.

If I abort the function (by ctrl + C) after, say, the 150th iteration, the
DATES object will still contain the scraped data until the 150th iteration,
right? (The rest of the 350 entries will be NA's, I suppose).

Many thanks in advance.

Yours sincerely,
AKSHAY M KULKARNI





Re: [R] aborting the execution of a function...

2022-07-14 Thread akshay kulkarni
Dear Avi,
Thanks a lot...

Yours sincerely,
AKSHAY M KULKARNI

From: R-help  on behalf of avi.e.gr...@gmail.com 

Sent: Thursday, July 14, 2022 1:39 AM
To: 'R help Mailing list' 
Subject: Re: [R] aborting the execution of a function...

Jeff & Akshay,

What you say is true if you just call a function and do not arrange to
handle various interrupts and errors.

Obviously if you find a way to gracefully handle an error then you can opt
to have your work so far saved. For what is meant to be a fatal error,
though, you are expected to GIVE UP or at least exit rapidly after doing
something graceful.

I am not clear what machine the user is using and their R setup. It may be
you can find something like a try() or suspendInterrupts method to catch and
handle control-C but may I offer a DIFFERENT solution?

You can identify your R process and send it some other milder signal that
can easily be caught or change your loop to slow it down a bit more.

For example, change your code so it opens a file and writes data to it in
some format like one result per line or a CSV format or whatever makes you
happy. Your loop adds new lines/items to the file and perhaps even closes
the file on interrupt or whatever works. Or it writes the results to the
console where you can copy/paste from if not too long.
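
A minimal sketch of the write-as-you-go idea follows. The file is a temporary one, and the "scraping" step is a placeholder (nchar) so the example is self-contained; the column names are made up.

```r
out_file <- tempfile(fileext = ".csv")
writeLines("item,value", out_file)      # header line, written once

for (item in c("A", "B", "C")) {
  value <- nchar(item)                  # placeholder for the slow scraping step
  # Append one result per line, so earlier results are already on disk
  # if the loop is interrupted later
  cat(sprintf("%s,%d\n", item, value), file = out_file, append = TRUE)
}

saved <- read.csv(out_file)
```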

And, consider having long-running code periodically check something in the
environment looking for a signal. Say every hundredth iteration it checks
for the existence of a file called "STOP_IT.stupid" and if it sees it,
removes it and exits gracefully while preserving your results somehow or
whatever you need. No interrupt needed, just create an empty file or other
logical marker.
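
The stop-file check could look roughly like this. To keep the example self-contained, the loop itself creates the file at iteration 150, standing in for a user doing so from another session; the squared number is a placeholder for real work.

```r
stop_file <- file.path(tempdir(), "STOP_IT.stupid")
unlink(stop_file)                 # make sure no stale marker is lying around
results <- list()

for (i in 1:500) {
  # Every hundredth iteration, look for the marker file
  if (i %% 100 == 0 && file.exists(stop_file)) {
    file.remove(stop_file)
    break                         # leave the loop with results intact
  }
  results[[i]] <- i^2             # placeholder for the real work
  if (i == 150) file.create(stop_file)  # simulates the user asking to stop
}
```

Here the request made at iteration 150 is only noticed at iteration 200, which is the price of checking rarely.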

Another variant is to use some form of threading or subprocess that does the
work somewhat in the background but can get commands from the foreground
process as needed including a request to stop. Again, no horrible signals
that kill the program.

And note on some systems, a process can be halted and resumed, if the
problem is that it has run a long time and is using too many resources at a
time they are needed.



-Original Message-
From: R-help  On Behalf Of Jeff Newmiller
Sent: Wednesday, July 13, 2022 3:49 PM
To: r-help@r-project.org; akshay kulkarni ; R help
Mailing list 
Subject: Re: [R] aborting the execution of a function...

This would be easy for you to test on a small example on your local
computer.

But the answer is "no". Nothing is assigned if the function does not return
normally... and Ctrl+C is anything but normal.
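
A small local test along these lines is sketched below. Since a Ctrl+C cannot be scripted, it uses an error as a stand-in for the abort, on the assumption that an interrupt behaves the same way as far as the assignment is concerned.

```r
f <- function() {
  partial <- 1:3            # work done before the abort
  stop("simulated abort")   # stand-in for Ctrl+C
  partial                   # never reached
}

result <- "untouched"
try(result <- f(), silent = TRUE)
# result still holds its old value: the assignment never happened
```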

On July 13, 2022 12:19:58 PM PDT, akshay kulkarni 
wrote:
>Dear members,
> I am running a large scraping code in a very
>powerful AWS ec2 instance:
>
>DATES <- getFirmsDates()
>
>It iterates over 500 stocks from a website. Despite the power of the
>machine, the execution is very slow.
>
>If I abort the function (by ctrl + C) after, say, the 150th iteration, the
>DATES object will still contain the scraped data until the 150th iteration,
>right? (The rest of the 350 entries will be NA's, I suppose).
>
>Many thanks in advance.
>
>Yours sincerely,
>AKSHAY M KULKARNI

--
Sent from my phone. Please excuse my brevity.



[R] Thanks to this list ....

2022-07-14 Thread Rolf Turner


I set out to appeal to this list for help with disentangling
a bewildering anomaly that was produced by some dynamically loaded
Fortran code.

In composing an email to explain the nature of the anomaly, I *FINALLY*
spotted the loony!  I had an expression in a nested do loop:

j = npro + (r-1)*nvym1 + s

where r and s were the indices of two of the nested loops (explicitly
declared to be integer at the start of the subroutine in question).

When I copied the foregoing expression into the email I at last noticed
that the variable "nvym1" *should* have been "nyvm1".  See the subtle
difference?  :-)

The variable nvym1 was never initialised, so it took on strange values
plucked out of RAM, I guess.  Whence the anomaly.  Once I corrected that
trivial typo, things were OK.

Thanks everybody!!! :-)

cheers,

Rolf Turner

P.S. What I can't figure out (and won't waste any time trying) is
why the first four values of j were as they should have been, and things
did not go to hell in a handcart until the code got to the 5th value
of j.  I guess the computer gods were just amusing themselves at my
expense.

R. T.

-- 
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
