On 12/31/2014 12:22 AM, Karim Mezhoud wrote:
Thanks,
It seems the for loop takes less time ;)

with
dim(DataFrame)
[1] 338  70

For loop has
    user  system elapsed
   0.012   0.000   0.012

and apply has
   user  system elapsed
   0.020   0.000   0.021

The timings are so short that the answer in terms of speed is 'it does not 
matter'.

Here is a selection of approaches

f0 <- function(df) {
    for (i in seq_along(df))
        df[,i] <- as.numeric(df[,i])
    df
}

f0a <- function(df) {
    ## data.frame is a list-of-equal-length vectors; access each
    ## column with "[["
    for (i in seq_along(df))
        df[[i]] <- as.numeric(df[[i]])
    df
}

f0c <- compiler::cmpfun(f0)  ## loops sometimes benefit from compilation

f1 <- function(df)
    as.data.frame(apply(df, 2, as.numeric))

f2 <- function(df) {
    ## replace all columns of df with list-of-vectors
    df[] <- lapply(df, as.numeric)
    df
}

f3 <- function(df) {
    ## coerce to matrix to avoid the explicit loop, use mode<- to
    ## change storage of elements
    m <- as.matrix(df)
    mode(m) <- "numeric"
    as.data.frame(m)
}

f4 <- function(df) {
    ## if it's a matrix, why are we returning a data.frame?
    m <- as.matrix(df)
    mode(m) <- "numeric"
    m
}

f4a <- function(df)
    ## unlist to single vector, coerce, then format as matrix
    matrix(as.numeric(unlist(df, use.names=FALSE)), nrow(df),
           dimnames=dimnames(df))
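
A small sketch of why f3 and f4 use mode<- rather than as.numeric() on the matrix: as.numeric() returns a plain vector, whereas mode<- changes the storage of the elements and keeps dim and dimnames. The matrix 'cm' below is made up for illustration only.

```r
## Hypothetical 2 x 2 character matrix with column names
cm <- matrix(c("1", "2", "3", "4"), 2, dimnames = list(NULL, c("a", "b")))

v <- as.numeric(cm)       # plain numeric vector; dim and dimnames are dropped
nm <- cm
mode(nm) <- "numeric"     # still a 2 x 2 matrix with its column names

is.null(dim(v))           # TRUE
dim(nm)                   # 2 2
```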

It's important to test that different methods return the same result (perhaps allowing for differences in attributes such as row or column names). The microbenchmark package repeats timings across multiple trials (default 100 times).
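
To illustrate the point about attributes, here is a minimal sketch (the objects 'a' and 'b' are invented for this example and are not part of the benchmark). identical() also compares attributes such as dimnames, whereas all.equal() with check.attributes=FALSE compares only the values. Note that all.equal() returns a character description rather than FALSE when the objects differ, hence the isTRUE().

```r
## Two matrices holding the same values; only 'b' has column names
a <- matrix(c(1, 2, 3, 4), nrow = 2)
b <- a
colnames(b) <- c("V1", "V2")

same_strict <- identical(a, b)                                    # FALSE
same_values <- isTRUE(all.equal(a, b, check.attributes = FALSE))  # TRUE
```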

library(microbenchmark)
test <- function(df) {
    stopifnot(
        identical(f0(df), f0a(df)),
        identical(f0(df), f0c(df)),
        identical(f0(df), f1(df)),
        identical(f0(df), f2(df)),
        identical(f0(df), f3(df)),
        identical(as.matrix(f0(df)), f4(df)),
        all.equal(f4(df), f4a(df), check.attributes=FALSE))
    microbenchmark(f0(df), f0a(df), f0c(df), f1(df), f2(df), f3(df), f4(df),
                   f4a(df))
}

Here are some data sets

m <- matrix(rnorm(338 * 70), 338)
df <- as.data.frame(m)
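dfc <- as.data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
## factor columns; the explicit stringsAsFactors=TRUE is needed from
## R 4.0.0 on, where the default became FALSE
dff <- as.data.frame(lapply(df, as.character), stringsAsFactors=TRUE)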

and results

> test(df)
Unit: microseconds
    expr      min        lq      mean    median        uq      max neval
  f0(df) 6208.956 6270.5500 6367.4138 6306.7110 6362.2225 7731.281   100
 f0a(df) 2917.973 2975.2090 3024.8623 3002.3805 3036.5365 3951.618   100
 f0c(df) 6078.399 6150.1085 6264.0998 6188.3690 6244.5725 7684.116   100
  f1(df) 2698.074 2743.2905 2821.8453 2769.3655 2805.5345 4033.229   100
  f2(df) 1989.057 2041.0685 2066.1830 2055.0020 2083.8545 2267.732   100
  f3(df) 1532.435 1572.9810 1609.7378 1597.6245 1624.2305 2003.584   100
  f4(df)  808.593  828.5445  852.2626  847.5355  864.6665 1180.977   100
 f4a(df)  422.657  437.2705  458.9845  455.2470  465.5815  695.443   100
> test(dfc)
Unit: milliseconds
    expr       min        lq      mean    median        uq       max neval
  f0(df) 11.416532 11.647858 11.915287 11.767647 12.016276 14.239622   100
 f0a(df)  8.095709  8.211116  8.380638  8.289895  8.454948  9.529026   100
 f0c(df) 11.339293 11.577811 11.772087 11.702341 11.896729 12.674766   100
  f1(df)  8.227371  8.277147  8.422412  8.331403  8.490411  9.145499   100
  f2(df)  6.907888  7.010828  7.162529  7.147198  7.239048  7.763758   100
  f3(df)  6.608107  6.688232  6.845936  6.792066  6.892635  8.359274   100
  f4(df)  5.859482  5.939680  6.046976  5.993804  6.105388  6.968601   100
 f4a(df)  5.372214  5.460987  5.556687  5.521542  5.614482  6.107081   100
> test(dff)
Error: identical(f0(df), f1(df)) is not TRUE

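Leaving factors aside, the explicit loop with "[,i]" indexing (f0) is the slowest approach and the matrix-based methods (f3, f4, f4a) are the fastest; byte-compiling the loop (f0c) makes no real difference, while switching to "[[" (f0a) roughly halves the time on numeric input.

Factors are a different matter, because the methods no longer agree. The methods that go through as.matrix() (f1, f3, f4) coerce the level labels to numeric, whereas the methods that apply as.numeric() to each column (f0, f0a, f0c, f2) return the underlying integer codes of the factor. Obviously great care needs to be taken.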

> f0(dff)[1:5, 1:5]
   V1  V2  V3  V4  V5
1 150 232 294  88  56
2 159   8  89  59  10
3 132 171  40 205 119
4 214 273  26 262 216
5 281  49 255  31 233
> f1(dff)[1:5, 1:5]
          V1          V2         V3         V4          V5
1 -1.7092463  0.50234009  0.8492982 -0.5636901 -0.38545566
2 -2.3020854 -0.05580931 -0.5963673 -0.3671748 -0.09408031
3 -1.2915110 -2.46181533 -0.2470108  0.3301129 -1.06810225
4  0.3065989  0.89263099 -0.1717432  0.7721411  0.35856334
5  0.8795616 -0.43049898  0.4560515 -0.1722099  0.46125149
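
The difference between codes and labels can be seen on a single factor. This is only a sketch with a made-up factor 'f'; the two idioms shown for recovering the numbers that the labels spell out are standard R, not something specific to the functions above.

```r
## Hypothetical factor whose labels look like numbers
f <- factor(c("10.5", "2", "10.5", "3"))

codes   <- as.numeric(f)                 # integer codes: 1 2 1 3
values  <- as.numeric(as.character(f))   # the labels as numbers: 10.5 2 10.5 3
values2 <- as.numeric(levels(f))[f]      # same result, converting each level once
```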

In terms of 'best practice', I would represent my data in the appropriate data structure in the first place (as a matrix of appropriate type, rather than data.frame, so the entire coercion is irrelevant). If faced with a data.frame with specific columns to coerce I would use the approach

    cidx <- sapply(df, is.character)      # index of columns to coerce
    df[cidx] <- lapply(df[cidx], as.numeric)

which seems to be reasonably correct, expressive, compact, and speedy.
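
Applied to a small example, this might look as follows. The data.frame 'd' and its column names are invented for illustration, and vapply() is used here only as a type-stable variant of sapply(); the approach is otherwise the one described above.

```r
## Hypothetical data.frame: a character column of numbers, a numeric
## column, and a factor that should be left alone
d <- data.frame(a = c("1.5", "2", "3"),
                b = c(10, 20, 30),
                g = factor(c("x", "y", "x")),
                stringsAsFactors = FALSE)

cidx <- vapply(d, is.character, logical(1))  # TRUE only for column 'a'
d[cidx] <- lapply(d[cidx], as.numeric)       # coerce just those columns

str(d)
```

Only the character column is touched; the factor 'g' keeps its class, which avoids the codes-versus-labels problem.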

Martin Morgan


   Ô__
  c/ /'_;~~~~kmezhoud
(*) \(*)   ⴽⴰⵔⵉⵎ  ⵎⴻⵣⵀⵓⴷ
http://bioinformatics.tn/



On Wed, Dec 31, 2014 at 8:54 AM, Berend Hasselman <b...@xs4all.nl> wrote:


On 31-12-2014, at 08:40, Karim Mezhoud <kmezh...@gmail.com> wrote:

Hi All,
I would like to choose between these two ways of converting a data frame.
Which is faster?

    for (i in 1:ncol(DataFrame)) {
        DataFrame[,i] <- as.numeric(DataFrame[,i])
    }

OR

    DataFrame <- as.data.frame(apply(DataFrame, 2, function(x) as.numeric(x)))



Try it and use system.time.
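
A minimal sketch of what that could look like, assuming a character data frame with the dimensions mentioned in this thread (the random data and the names d1, d2, t_loop and t_apply are made up for the example):

```r
## Made-up 338 x 70 data frame of numbers stored as character strings
DataFrame <- as.data.frame(matrix(as.character(rnorm(338 * 70)), 338),
                           stringsAsFactors = FALSE)

## Time the explicit loop on a copy
t_loop <- system.time({
    d1 <- DataFrame
    for (i in seq_along(d1))
        d1[, i] <- as.numeric(d1[, i])
})

## Time the apply() version
t_apply <- system.time(
    d2 <- as.data.frame(apply(DataFrame, 2, as.numeric))
)

print(t_loop)
print(t_apply)
```

A single run at this size takes a few milliseconds, so the numbers will vary noticeably from run to run.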

Berend

Thanks
Karim

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.






--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
