Re: [R-pkg-devel] Fast Matrix Serialization in R?
On Fri, 10 May 2024 15:12:17 +1200 Simon Urbanek wrote:

> I wonder if it may be worth doing something a bit smarter and tag
> officially a "reverse XDR" format instead - that way it would be
> well-defined and could be made the default.

Do you mean changing R so that when reading a "B\n" serialized stream, a
format code read as 0x0200 or 0x0300 would mean regular formats 2 or 3 but
byte-swapped? That would be backwards-compatible, and we probably weren't
going to have >= 65536 format versions anyway...

-- 
Best regards,
Ivan

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
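[The format-version word Ivan refers to can be inspected directly. A minimal
sketch, not a proposal implementation; it assumes a current R where the
binary serialization format version is 2 or 3:]

```r
## Peek at the header of a native-endian ("B\n") serialization to see the
## 32-bit format-version word in which the suggested 0x0200/0x0300
## byte-swap tag would live.
x <- serialize(1L, NULL, xdr = FALSE)

rawToChar(x[1:2])    # "B\n" -- marker for the native-endian binary format
## The next 4 bytes are the format version, written in native byte order,
## so reading them back natively yields the plain version number (2 or 3):
readBin(x[3:6], "integer", n = 1)
```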
Re: [R-pkg-devel] Fast Matrix Serialization in R?
> On 10/05/2024, at 12:31 PM, Henrik Bengtsson wrote:
>
> On Thu, May 9, 2024 at 3:46 PM Simon Urbanek wrote:
>>
>> FWIW serialize() is binary so there is no conversion to text:
>>
>>> serialize(1:10+0L, NULL)
>>  [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
>> [26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
>> [51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a
>>
>> It uses the native representation so it is actually not as bad as it sounds.
>>
>> One aspect I forgot to mention in the earlier thread is that if you don't
>> need to exchange the serialized objects between machines with different
>> endianness then avoiding the swap makes it faster. E.g., on Intel (which is
>> little-endian and thus needs swapping):
>>
>>> a=1:1e8/2
>>> system.time(serialize(a, NULL))
>>    user  system elapsed
>>   2.123   0.468   2.661
>>> system.time(serialize(a, NULL, xdr=FALSE))
>>    user  system elapsed
>>   0.393   0.348   0.742
>
> Would it be worth looking into making xdr=FALSE the default? From
> help("serialize"):
>
>   xdr: a logical: if a binary representation is used, should a
>        big-endian one (XDR) be used?
>   ...
>   As almost all systems in current use are little-endian, xdr = FALSE
>   can be used to avoid byte-shuffling at both ends when transferring
>   data from one little-endian machine to another (or between processes
>   on the same machine). Depending on the system, this can speed up
>   serialization and unserialization by a factor of up to 3x.
>
> This seems like low-hanging fruit that could spare the world from
> wasting unnecessary CPU cycles.

I thought about it before, but the main problem here is (as often)
compatibility. The current default guarantees that the output can be safely
read on any machine, while xdr=FALSE only works between machines with the
same endianness and will fail horribly otherwise.

R cannot really know whether the user intends to transport the serialized
data to another machine, so it cannot assume it is safe unless the user
indicates so. Therefore all we can safely do is tell users to use it where
appropriate -- and the documentation explicitly says so:

   As almost all systems in current use are little-endian, 'xdr = FALSE'
   can be used to avoid byte-shuffling at both ends when transferring
   data from one little-endian machine to another (or between processes
   on the same machine). Depending on the system, this can speed up
   serialization and unserialization by a factor of up to 3x.

Unfortunately, no one bothers to read the documentation, so this is not as
effective as changing the default would be, but for the reasons above the
default is just not that easy to change. I do acknowledge that the risk is
relatively low since big-endian machines are becoming rare, but it's not
zero.

That said, what worries me a bit more is that some derived functions such as
saveRDS() don't expose the xdr option, so you actually have no way to use
the native binary format there. I understand the logic - see above - but, as
you said, that makes them unnecessarily slow.

I wonder if it may be worth doing something a bit smarter and officially
tag a "reverse XDR" format instead - that way it would be well-defined and
could be made the default. Interestingly, the deserialization part actually
doesn't care, so you can use readRDS() on the binary serialization even in
current R versions, so just adding the option would still be
backwards-compatible. Definitely something to think about...

Cheers,
Simon
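[The observation that the deserializer doesn't care about byte order can be
checked directly in current R. A minimal sketch; the tempfile name is
arbitrary:]

```r
## Write a native-endian ("B\n") stream with serialize(xdr = FALSE) --
## something saveRDS() itself cannot produce -- and read it back with
## readRDS(), which auto-detects the format of the stream.
a <- runif(100)
f <- tempfile(fileext = ".rds")

con <- file(f, "wb")
serialize(a, con, xdr = FALSE)   # native byte order, no swap
close(con)

b <- readRDS(f)                  # reads the binary stream just fine
identical(a, b)                  # TRUE on the same machine
```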
Re: [R-pkg-devel] Fast Matrix Serialization in R?
On Thu, May 9, 2024 at 3:46 PM Simon Urbanek wrote:
>
>> On 9/05/2024, at 11:58 PM, Vladimir Dergachev wrote:
>>
>> On Thu, 9 May 2024, Sameh Abdulah wrote:
>>
>>> Hi,
>>>
>>> I need to serialize and save a 20K x 20K matrix as a binary file. This
>>> process is significantly slower in R compared to Python (4X slower).
>>>
>>> I'm not sure about the best approach to optimize the below code. Is it
>>> possible to parallelize the serialization function to enhance
>>> performance?
>>
>> Parallelization should not help - a single CPU thread should be able to
>> saturate your disk or your network, assuming you have a typical computer.
>>
>> The problem is possibly the conversion to text; writing it as binary
>> should be much faster.
>
> FWIW serialize() is binary so there is no conversion to text:
>
>> serialize(1:10+0L, NULL)
>  [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
> [26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
> [51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a
>
> It uses the native representation so it is actually not as bad as it sounds.
>
> One aspect I forgot to mention in the earlier thread is that if you don't
> need to exchange the serialized objects between machines with different
> endianness then avoiding the swap makes it faster. E.g., on Intel (which is
> little-endian and thus needs swapping):
>
>> a=1:1e8/2
>> system.time(serialize(a, NULL))
>    user  system elapsed
>   2.123   0.468   2.661
>> system.time(serialize(a, NULL, xdr=FALSE))
>    user  system elapsed
>   0.393   0.348   0.742

Would it be worth looking into making xdr=FALSE the default? From
help("serialize"):

  xdr: a logical: if a binary representation is used, should a
       big-endian one (XDR) be used?
  ...
  As almost all systems in current use are little-endian, xdr = FALSE
  can be used to avoid byte-shuffling at both ends when transferring
  data from one little-endian machine to another (or between processes
  on the same machine). Depending on the system, this can speed up
  serialization and unserialization by a factor of up to 3x.

This seems like low-hanging fruit that could spare the world from wasting
unnecessary CPU cycles.

/Henrik
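[The "up to 3x" figure from help("serialize") is easy to check locally. A
minimal sketch; timings vary by machine, and a smaller vector than the 1e8
example is used to keep it quick:]

```r
## Compare XDR (big-endian, the default) against the native representation.
## On little-endian hardware, xdr = FALSE skips the byte swap entirely.
a <- 1:1e7/2                                   # 10 million doubles

t_xdr    <- system.time(serialize(a, NULL))                # with byte swap
t_native <- system.time(serialize(a, NULL, xdr = FALSE))   # without

t_xdr["elapsed"]
t_native["elapsed"]
```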
Re: [R-pkg-devel] Fast Matrix Serialization in R?
> On 9/05/2024, at 11:58 PM, Vladimir Dergachev wrote:
>
> On Thu, 9 May 2024, Sameh Abdulah wrote:
>
>> Hi,
>>
>> I need to serialize and save a 20K x 20K matrix as a binary file. This
>> process is significantly slower in R compared to Python (4X slower).
>>
>> I'm not sure about the best approach to optimize the below code. Is it
>> possible to parallelize the serialization function to enhance performance?
>
> Parallelization should not help - a single CPU thread should be able to
> saturate your disk or your network, assuming you have a typical computer.
>
> The problem is possibly the conversion to text; writing it as binary
> should be much faster.

FWIW serialize() is binary so there is no conversion to text:

> serialize(1:10+0L, NULL)
 [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
[26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
[51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a

It uses the native representation so it is actually not as bad as it sounds.

One aspect I forgot to mention in the earlier thread is that if you don't
need to exchange the serialized objects between machines with different
endianness then avoiding the swap makes it faster. E.g., on Intel (which is
little-endian and thus needs swapping):

> a=1:1e8/2
> system.time(serialize(a, NULL))
   user  system elapsed
  2.123   0.468   2.661
> system.time(serialize(a, NULL, xdr=FALSE))
   user  system elapsed
  0.393   0.348   0.742

Cheers,
Simon
Re: [R-pkg-devel] Fast Matrix Serialization in R?
On Thu, 9 May 2024, Sameh Abdulah wrote:

> Hi,
>
> I need to serialize and save a 20K x 20K matrix as a binary file. This
> process is significantly slower in R compared to Python (4X slower).
>
> I'm not sure about the best approach to optimize the below code. Is it
> possible to parallelize the serialization function to enhance performance?

Parallelization should not help - a single CPU thread should be able to
saturate your disk or your network, assuming you have a typical computer.

The problem is possibly the conversion to text; writing it as binary should
be much faster.

To add to other suggestions, you might want to try my package "RMVL" -
aside from fast writes, it also gives you the ability to share data between
users of the package.

best

Vladimir Dergachev

PS Example:

library("RMVL")

M <- mvl_open("test1.mvl", append = TRUE, create = TRUE)

n <- 2^2
m <- sqrt(n)
cat("Generating matrices ... ")
INI.TIME <- proc.time()
A <- matrix(runif(n), ncol = m)
END_GEN.TIME <- proc.time()

mvl_write(M, A, name = "A")
mvl_close(M)
END_SER.TIME <- proc.time()

# Use in another script:
library("RMVL")
M2 <- mvl_open("test1.mvl")
print(M2$A[1:m, 1:m])
Re: [R-pkg-devel] Fast Matrix Serialization in R?
Sameh,

if it's a matrix, that's easy as you can write it directly, which is the
fastest possible way without compression - e.g., quick proof of concept:

n <- 2^2
A <- matrix(runif(n), ncol = sqrt(n))

## write (dim + payload)
con <- file(description = "matrix_file", open = "wb")
system.time({
    writeBin(d <- dim(A), con)
    dim(A) <- NULL
    writeBin(A, con)
    dim(A) <- d
})
close(con)

## read
con <- file(description = "matrix_file", open = "rb")
system.time({
    d <- readBin(con, 1L, 2)
    A1 <- readBin(con, 1, d[1] * d[2])
    dim(A1) <- d
})
close(con)

identical(A, A1)

   user  system elapsed
  0.931   2.713   3.644
   user  system elapsed
  0.089   1.360   1.451
[1] TRUE

So it's really just limited by the speed of your disk; parallelization won't
help here. Note that in general you get faster read times by using
compression, as most data is reasonably compressible, so that's where
parallelization can be useful. There are plenty of packages with more
tricks, like mmapping the files etc., but the above is just base R.

Cheers,
Simon

> On 9/05/2024, at 3:20 PM, Sameh Abdulah wrote:
>
> Hi,
>
> I need to serialize and save a 20K x 20K matrix as a binary file. This
> process is significantly slower in R compared to Python (4X slower).
>
> I'm not sure about the best approach to optimize the below code. Is it
> possible to parallelize the serialization function to enhance performance?
>
> n <- 2^2
> m <- sqrt(n)
> cat("Generating matrices ... ")
> INI.TIME <- proc.time()
> A <- matrix(runif(n), ncol = m)
> END_GEN.TIME <- proc.time()
> arg_ser <- serialize(object = A, connection = NULL)
> END_SER.TIME <- proc.time()
>
> con <- file(description = "matrix_file", open = "wb")
> writeBin(object = arg_ser, con = con)
> close(con)
> END_WRITE.TIME <- proc.time()
>
> con <- file(description = "matrix_file", open = "rb")
> par_raw <- readBin(con, what = raw(), n = file.info("matrix_file")$size)
> END_READ.TIME <- proc.time()
> B <- unserialize(connection = par_raw)
> close(con)
> END_DES.TIME <- proc.time()
>
> TIME <- END_GEN.TIME - INI.TIME
> cat("Generation time", TIME[3], " seconds.")
> TIME <- END_SER.TIME - END_GEN.TIME
> cat("Serialization time", TIME[3], " seconds.")
> TIME <- END_WRITE.TIME - END_SER.TIME
> cat("Writing time", TIME[3], " seconds.")
> TIME <- END_READ.TIME - END_WRITE.TIME
> cat("Read time", TIME[3], " seconds.")
> TIME <- END_DES.TIME - END_READ.TIME
> cat("Deserialize time", TIME[3], " seconds.")
>
> Best,
> --Sameh
Re: [R-pkg-devel] Fast Matrix Serialization in R?
On 9 May 2024 at 03:20, Sameh Abdulah wrote:
| I need to serialize and save a 20K x 20K matrix as a binary file.

Hm, that is an incomplete specification: _what_ do you want to do with it?
Read it back in R? Share it with other languages (like Python)? I.e., what
really is your use case?

Also, you only seem to use readBin / writeBin. Why not readRDS / saveRDS,
which at least give you compression?

If it is to read/write from / to R, look into the qs package. It is good.
The README.md at its repo has benchmarks: https://github.com/traversc/qs

If you want to index into the stored data, look into fst. Else also look
at databases.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
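[A minimal sketch of the two suggestions above. The qs part assumes the
CRAN package qs is installed, which is not part of base R, so it is shown
commented out:]

```r
A <- matrix(runif(2^2), ncol = 2)
f <- tempfile(fileext = ".rds")

## base R: saveRDS compresses by default; compress = FALSE trades file
## size for write speed
saveRDS(A, f, compress = FALSE)
A1 <- readRDS(f)
identical(A, A1)

## qs (CRAN package, assumed installed): fast serialization with
## multithreaded compression
# library(qs)
# qsave(A, "matrix.qs")
# A2 <- qread("matrix.qs")
```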
[R-pkg-devel] Fast Matrix Serialization in R?
Hi,

I need to serialize and save a 20K x 20K matrix as a binary file. This
process is significantly slower in R compared to Python (4X slower).

I'm not sure about the best approach to optimize the below code. Is it
possible to parallelize the serialization function to enhance performance?

n <- 2^2
m <- sqrt(n)
cat("Generating matrices ... ")
INI.TIME <- proc.time()
A <- matrix(runif(n), ncol = m)
END_GEN.TIME <- proc.time()
arg_ser <- serialize(object = A, connection = NULL)
END_SER.TIME <- proc.time()

con <- file(description = "matrix_file", open = "wb")
writeBin(object = arg_ser, con = con)
close(con)
END_WRITE.TIME <- proc.time()

con <- file(description = "matrix_file", open = "rb")
par_raw <- readBin(con, what = raw(), n = file.info("matrix_file")$size)
END_READ.TIME <- proc.time()
B <- unserialize(connection = par_raw)
close(con)
END_DES.TIME <- proc.time()

TIME <- END_GEN.TIME - INI.TIME
cat("Generation time", TIME[3], " seconds.")
TIME <- END_SER.TIME - END_GEN.TIME
cat("Serialization time", TIME[3], " seconds.")
TIME <- END_WRITE.TIME - END_SER.TIME
cat("Writing time", TIME[3], " seconds.")
TIME <- END_READ.TIME - END_WRITE.TIME
cat("Read time", TIME[3], " seconds.")
TIME <- END_DES.TIME - END_READ.TIME
cat("Deserialize time", TIME[3], " seconds.")

Best,
--Sameh

--
This message and its contents, including attachments are intended solely
for the original recipient. If you are not the intended recipient or have
received this message in error, please notify me immediately and delete
this message from your computer system. Any unauthorized use or
distribution is prohibited. Please consider the environment before printing
this email.