Re: [R-pkg-devel] Fast Matrix Serialization in R?

2024-05-10 Thread Ivan Krylov via R-package-devel
On Fri, 10 May 2024 15:12:17 +1200
Simon Urbanek  wrote:

> I wonder if it may be worth doing something a bit smarter and tag
> officially a "reverse XDR" format instead - that way it would be
> well-defined and could be made the default.

Do you mean changing R so that when reading a "B\n" serialized stream,
a format code read as 0x02000000 or 0x03000000 would mean regular
formats 2 or 3 but byte-swapped? That would be backwards-compatible,
and we probably weren't going to have >= 65536 format versions anyway...
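
A minimal sketch of that detection idea in base R, assuming the version
field is a 32-bit integer (the XDR dump elsewhere in this thread shows it
as the bytes 00 00 00 03). The helper decode_version() is hypothetical and
not part of R; it only illustrates that a small version number and its
byte-swapped form cannot be confused:

```r
# Hypothetical helper (not part of R): decode a 4-byte format-version field
# that may have been written in either byte order.
decode_version <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) == 4L)
  other <- if (.Platform$endian == "little") "big" else "little"
  native  <- readBin(bytes, what = "integer", size = 4L,
                     endian = .Platform$endian)
  swapped <- readBin(bytes, what = "integer", size = 4L, endian = other)
  # Assumption: real version numbers stay well below 65536.
  plausible <- function(v) !is.na(v) && v > 0L && v < 65536L
  if (plausible(native))  return(list(version = native,  byte_swapped = FALSE))
  if (plausible(swapped)) return(list(version = swapped, byte_swapped = TRUE))
  stop("unrecognized format version")
}

le3 <- writeBin(3L, raw(), endian = "little")  # bytes 03 00 00 00
be3 <- writeBin(3L, raw(), endian = "big")     # bytes 00 00 00 03

# Read with the wrong byte order, version 3 shows up as 0x03000000.
readBin(le3, what = "integer", size = 4L, endian = "big")  # 50331648

decode_version(le3)$version  # 3
decode_version(be3)$version  # 3
```

Exactly one of the two inputs is reported as byte-swapped; which one
depends on the endianness of the machine running the code.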

-- 
Best regards,
Ivan

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] Fast Matrix Serialization in R?

2024-05-09 Thread Simon Urbanek



> On 10/05/2024, at 12:31 PM, Henrik Bengtsson wrote:
> 
> On Thu, May 9, 2024 at 3:46 PM Simon Urbanek wrote:
>> 
>> FWIW serialize() is binary so there is no conversion to text:
>> 
>> > serialize(1:10+0L, NULL)
>>  [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
>> [26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
>> [51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a
>> 
>> It uses the native representation so it is actually not as bad as it sounds.
>> 
>> One aspect I forgot to mention in the earlier thread is that if you don't 
>> need to exchange the serialized objects between machines with different 
>> endianness then avoiding the swap makes it faster. E.g., on Intel (which is 
>> little-endian and thus needs swapping):
>> 
>> > a=1:1e8/2
>> > system.time(serialize(a, NULL))
>>    user  system elapsed
>>   2.123   0.468   2.661
>> > system.time(serialize(a, NULL, xdr=FALSE))
>>    user  system elapsed
>>   0.393   0.348   0.742
> 
> Would it be worth looking into making xdr=FALSE the default? From
> help("serialize"):
> 
> xdr: a logical: if a binary representation is used, should a
> big-endian one (XDR) be used?
> ...
> As almost all systems in current use are little-endian, xdr = FALSE
> can be used to avoid byte-shuffling at both ends when transferring
> data from one little-endian machine to another (or between processes
> on the same machine). Depending on the system, this can speed up
> serialization and unserialization by a factor of up to 3x.
> 
> This seems like a low-hanging fruit that could spare the world from
> wasting unnecessary CPU cycles.
> 


I thought about it before, but the main problem here is (as often) 
compatibility. The current default guarantees that the output can be safely 
read on any machine while xdr=FALSE only works if used on machines with the 
same endianness and will fail horribly otherwise. R cannot really know whether 
the user intends to transport the serialized data to another machine or not, so 
it cannot assume it is safe unless the user indicates so. Therefore all we can 
safely do is tell the users that they should use it where appropriate -- and 
the documentation explicitly says so:

 As almost all systems in current use are little-endian, ‘xdr =
 FALSE’ can be used to avoid byte-shuffling at both ends when
 transferring data from one little-endian machine to another (or
 between processes on the same machine).  Depending on the system,
 this can speed up serialization and unserialization by a factor of
 up to 3x.

Unfortunately, no one bothers to read the documentation so it is not as 
effective as changing the default, but for the reasons above it is just not as 
easy to change. I do acknowledge that the risk is relatively low since 
big-endian machines are becoming rare, but it's not zero.

That said, what worries me a bit more is that some derived functions such as 
saveRDS() don't expose the xdr option, so you actually have no way to use the 
native binary format. I understand the logic - see above, but as you said, that 
makes them unnecessarily slow. I wonder if it may be worth doing something a 
bit smarter and tag officially a "reverse XDR" format instead - that way it 
would be well-defined and could be made the default. Interestingly, the 
de-serialization part actually doesn't care, so you can use readRDS() on the 
binary serialization even in current R versions, so just adding the option 
would still be backwards-compatible. Definitely something to think about...
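
A minimal sketch of that last point, assuming the file is written with
serialize(..., xdr = FALSE) on a plain (uncompressed) file connection and
read back on the same machine. The temporary file and the test matrix are
placeholders:

```r
x <- matrix(runif(1e4), ncol = 100)
tmp <- tempfile(fileext = ".rds")

# Write the native-endian ("B\n") serialization straight to a file.
con <- file(tmp, open = "wb")
serialize(x, con, xdr = FALSE)
close(con)

# The stream starts with "B\n" rather than the "X\n" that saveRDS() produces.
header <- readBin(tmp, what = "raw", n = 2L)
rawToChar(header)

# readRDS() detects the format from the header and reads the file as usual.
y <- readRDS(tmp)
identical(x, y)
```

The file is not compressed here, so it is larger than what saveRDS()
would write by default.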

Cheers,
Simon




Re: [R-pkg-devel] Fast Matrix Serialization in R?

2024-05-09 Thread Henrik Bengtsson
On Thu, May 9, 2024 at 3:46 PM Simon Urbanek wrote:
>
> > On 9/05/2024, at 11:58 PM, Vladimir Dergachev wrote:
> >
> >
> >
> > On Thu, 9 May 2024, Sameh Abdulah wrote:
> >
> >> Hi,
> >>
> >> I need to serialize and save a 20K x 20K matrix as a binary file. This 
> >> process is significantly slower in R compared to Python (4X slower).
> >>
> >> I'm not sure about the best approach to optimize the below code. Is it 
> >> possible to parallelize the serialization function to enhance performance?
> >
> > Parallelization should not help - a single CPU thread should be able to 
> > saturate your disk or your network, assuming you have a typical computer.
> >
> > The problem is possibly the conversion to text, writing it as binary should 
> > be much faster.
> >
>
>
> FWIW serialize() is binary so there is no conversion to text:
>
> > serialize(1:10+0L, NULL)
>  [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
> [26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
> [51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a
>
> It uses the native representation so it is actually not as bad as it sounds.
>
> One aspect I forgot to mention in the earlier thread is that if you don't 
> need to exchange the serialized objects between machines with different 
> endianness then avoiding the swap makes it faster. E.g., on Intel (which is 
> little-endian and thus needs swapping):
>
> > a=1:1e8/2
> > system.time(serialize(a, NULL))
>    user  system elapsed
>   2.123   0.468   2.661
> > system.time(serialize(a, NULL, xdr=FALSE))
>    user  system elapsed
>   0.393   0.348   0.742

Would it be worth looking into making xdr=FALSE the default? From
help("serialize"):

xdr: a logical: if a binary representation is used, should a
big-endian one (XDR) be used?
...
As almost all systems in current use are little-endian, xdr = FALSE
can be used to avoid byte-shuffling at both ends when transferring
data from one little-endian machine to another (or between processes
on the same machine). Depending on the system, this can speed up
serialization and unserialization by a factor of up to 3x.

This seems like a low-hanging fruit that could spare the world from
wasting unnecessary CPU cycles.

/Henrik





Re: [R-pkg-devel] Fast Matrix Serialization in R?

2024-05-09 Thread Simon Urbanek



> On 9/05/2024, at 11:58 PM, Vladimir Dergachev  wrote:
> 
> 
> 
> On Thu, 9 May 2024, Sameh Abdulah wrote:
> 
>> Hi,
>> 
>> I need to serialize and save a 20K x 20K matrix as a binary file. This 
>> process is significantly slower in R compared to Python (4X slower).
>> 
>> I'm not sure about the best approach to optimize the below code. Is it 
>> possible to parallelize the serialization function to enhance performance?
> 
> Parallelization should not help - a single CPU thread should be able to 
> saturate your disk or your network, assuming you have a typical computer.
> 
> The problem is possibly the conversion to text, writing it as binary should 
> be much faster.
> 


FWIW serialize() is binary so there is no conversion to text:

> serialize(1:10+0L, NULL)
 [1] 58 0a 00 00 00 03 00 04 02 00 00 03 05 00 00 00 00 05 55 54 46 2d 38 00 00
[26] 00 0d 00 00 00 0a 00 00 00 01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00
[51] 05 00 00 00 06 00 00 00 07 00 00 00 08 00 00 00 09 00 00 00 0a

It uses the native representation so it is actually not as bad as it sounds.

One aspect I forgot to mention in the earlier thread is that if you don't need 
to exchange the serialized objects between machines with different endianness 
then avoiding the swap makes it faster. E.g., on Intel (which is little-endian 
and thus needs swapping):

> a=1:1e8/2
> system.time(serialize(a, NULL))
   user  system elapsed 
  2.123   0.468   2.661 
> system.time(serialize(a, NULL, xdr=FALSE))
   user  system elapsed 
  0.393   0.348   0.742 
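
A small sketch of what the flag changes, using only base R: the stream
header switches from "X\n" (XDR, big-endian) to "B\n" (native binary), and
unserialize() accepts either. The vector x is just a placeholder:

```r
x <- c(1.5, 2.5, 3.5)

xdr_bytes    <- serialize(x, connection = NULL)               # default, xdr = TRUE
native_bytes <- serialize(x, connection = NULL, xdr = FALSE)  # no byte swapping

# The first two bytes identify the format.
rawToChar(xdr_bytes[1:2])     # "X\n"
rawToChar(native_bytes[1:2])  # "B\n"

# Both streams decode to the same object on the machine that wrote them.
identical(unserialize(xdr_bytes), x)
identical(unserialize(native_bytes), x)
```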

Cheers,
Simon



Re: [R-pkg-devel] Fast Matrix Serialization in R?

2024-05-09 Thread Vladimir Dergachev




On Thu, 9 May 2024, Sameh Abdulah wrote:

> Hi,
>
> I need to serialize and save a 20K x 20K matrix as a binary file. This process 
> is significantly slower in R compared to Python (4X slower).
>
> I'm not sure about the best approach to optimize the below code. Is it possible 
> to parallelize the serialization function to enhance performance?

Parallelization should not help - a single CPU thread should be able to 
saturate your disk or your network, assuming you have a typical computer.

The problem is possibly the conversion to text, writing it as binary 
should be much faster.


To add to the other suggestions, you might want to try my package "RMVL" - 
aside from fast writes, it also gives you the ability to share data between 
ultimate users of the package.


best

Vladimir Dergachev

PS Example:

library("RMVL")

M <- mvl_open("test1.mvl", append=TRUE, create=TRUE)

n <- 20000^2
cat("Generating matrices ... ")
INI.TIME <- proc.time()
# 20K x 20K matrix, so the number of columns is sqrt(n)
A <- matrix(runif(n), ncol = sqrt(n))
END_GEN.TIME <- proc.time()

mvl_write(M, A, name="A")

mvl_close(M)

END_SER.TIME <- proc.time()


# Use in another script:

library("RMVL")

M2<-mvl_open("test1.mvl")

print(M2$A[1:10, 1:10])



Re: [R-pkg-devel] Fast Matrix Serialization in R?

2024-05-08 Thread Simon Urbanek
Sameh,

if it's a matrix, that's easy as you can write it directly which is the fastest 
possible way without compression - e.g. quick proof of concept:

n <- 20000^2
A <- matrix(runif(n), ncol = sqrt(n))

## write (dim + payload)
con <- file(description = "matrix_file", open = "wb")
system.time({
  writeBin(d <- dim(A), con)   # two integers: nrow, ncol
  dim(A) <- NULL               # drop dims so the payload is a plain vector
  writeBin(A, con)
  dim(A) <- d                  # restore dims
})
close(con)

## read
con <- file(description = "matrix_file", open = "rb")
system.time({
  d <- readBin(con, 1L, 2)             # the two dimensions
  A1 <- readBin(con, 1, d[1] * d[2])   # nrow * ncol doubles
  dim(A1) <- d
})
close(con)
identical(A, A1)

   user  system elapsed 
  0.931   2.713   3.644 
   user  system elapsed 
  0.089   1.360   1.451 
[1] TRUE

So it's really just limited by the speed of your disk; parallelization won't 
help here.

Note that in general you get faster read times by using compression as most 
data is reasonably compressible, so that's where parallelization can be useful. 
There are plenty of packages with more tricks like mmapping the files etc., but 
the above is just base R.
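
A minimal sketch of the compressed variant in base R, assuming gzfile() as
the compressed connection; the temporary file and the small, rounded test
matrix are placeholders. Whether this reads faster than the uncompressed
file depends on the disk and on how compressible the data is (full-precision
runif() output compresses poorly):

```r
A <- matrix(round(runif(1e4), 2), ncol = 100)
path <- tempfile()

## write (dim + payload) through a gzip-compressed connection
con <- gzfile(path, open = "wb")
writeBin(dim(A), con)
writeBin(as.vector(A), con)   # as.vector() drops dims, column-major order
close(con)

## read back; gzfile() decompresses transparently
con <- gzfile(path, open = "rb")
d  <- readBin(con, what = integer(), n = 2L)
A1 <- matrix(readBin(con, what = double(), n = prod(d)),
             nrow = d[1], ncol = d[2])
close(con)

identical(A, A1)
```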

Cheers,
Simon



> On 9/05/2024, at 3:20 PM, Sameh Abdulah  wrote:
> 
> Hi,
> 
> I need to serialize and save a 20K x 20K matrix as a binary file. This 
> process is significantly slower in R compared to Python (4X slower).
> 
> I'm not sure about the best approach to optimize the below code. Is it 
> possible to parallelize the serialization function to enhance performance?
> 
> 
>  n <- 20000^2
>  cat("Generating matrices ... ")
>  INI.TIME <- proc.time()
>  A <- matrix(runif(n), ncol = sqrt(n))
>  END_GEN.TIME <- proc.time()
>  arg_ser <- serialize(object = A, connection = NULL)
> 
>  END_SER.TIME <- proc.time()
>  con <- file(description = "matrix_file", open = "wb")
>  writeBin(object = arg_ser, con = con)
>  close(con)
>  END_WRITE.TIME <- proc.time()
>  con <- file(description = "matrix_file", open = "rb")
>  par_raw <- readBin(con, what = raw(), n = file.info("matrix_file")$size)
>  END_READ.TIME <- proc.time()
>  B <- unserialize(connection = par_raw)
>  close(con)
>  END_DES.TIME <- proc.time()
>  TIME <- END_GEN.TIME - INI.TIME
>  cat("Generation time", TIME[3], " seconds.")
> 
>  TIME <- END_SER.TIME - END_GEN.TIME
>  cat("Serialization time", TIME[3], " seconds.")
> 
>  TIME <- END_WRITE.TIME - END_SER.TIME
>  cat("Writing time", TIME[3], " seconds.")
> 
>  TIME <- END_READ.TIME - END_WRITE.TIME
>  cat("Read time", TIME[3], " seconds.")
> 
>  TIME <- END_DES.TIME - END_READ.TIME
>  cat("Deserialize time", TIME[3], " seconds.")
> 
> 
> 
> 
> Best,
> --Sameh
> 


Re: [R-pkg-devel] Fast Matrix Serialization in R?

2024-05-08 Thread Dirk Eddelbuettel


On 9 May 2024 at 03:20, Sameh Abdulah wrote:
| I need to serialize and save a 20K x 20K matrix as a binary file.

Hm that is an incomplete specification: _what_ do you want to do with it?
Read it back in R?  Share it with other languages (like Python) ? I.e. what
really is your use case?  Also, you only seem to use readBin / writeBin. Why
not readRDS / saveRDS which at least give you compression?
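
A minimal sketch of the saveRDS() / readRDS() route in base R, with and
without compression; the temporary files and the small test matrix are
placeholders, and the size difference will vary with the data:

```r
A <- matrix(runif(1e4), ncol = 100)

f_gz  <- tempfile(fileext = ".rds")
f_raw <- tempfile(fileext = ".rds")

saveRDS(A, f_gz)                      # gzip compression is the default
saveRDS(A, f_raw, compress = FALSE)   # skip compression, larger file

# Both files read back to the same matrix.
B_gz  <- readRDS(f_gz)
B_raw <- readRDS(f_raw)
identical(A, B_gz) && identical(A, B_raw)

# Compare on-disk sizes in bytes.
file.size(c(f_gz, f_raw))
```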

If it is to read/write from / to R look into the qs package. It is good. The
README.md at its repo (https://github.com/traversc/qs) has benchmarks. If you
want to index into the stored data look into fst. Else also look at databases.

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



[R-pkg-devel] Fast Matrix Serialization in R?

2024-05-08 Thread Sameh Abdulah
Hi,

I need to serialize and save a 20K x 20K matrix as a binary file. This process 
is significantly slower in R compared to Python (4X slower).

I'm not sure about the best approach to optimize the below code. Is it possible 
to parallelize the serialization function to enhance performance?


  n <- 20000^2
  cat("Generating matrices ... ")
  INI.TIME <- proc.time()
  A <- matrix(runif(n), ncol = sqrt(n))
  END_GEN.TIME <- proc.time()
  arg_ser <- serialize(object = A, connection = NULL)

  END_SER.TIME <- proc.time()
  con <- file(description = "matrix_file", open = "wb")
  writeBin(object = arg_ser, con = con)
  close(con)
  END_WRITE.TIME <- proc.time()
  con <- file(description = "matrix_file", open = "rb")
  par_raw <- readBin(con, what = raw(), n = file.info("matrix_file")$size)
  END_READ.TIME <- proc.time()
  B <- unserialize(connection = par_raw)
  close(con)
  END_DES.TIME <- proc.time()
  TIME <- END_GEN.TIME - INI.TIME
  cat("Generation time", TIME[3], " seconds.")

  TIME <- END_SER.TIME - END_GEN.TIME
  cat("Serialization time", TIME[3], " seconds.")

  TIME <- END_WRITE.TIME - END_SER.TIME
  cat("Writing time", TIME[3], " seconds.")

  TIME <- END_READ.TIME - END_WRITE.TIME
  cat("Read time", TIME[3], " seconds.")

  TIME <- END_DES.TIME - END_READ.TIME
  cat("Deserialize time", TIME[3], " seconds.")




Best,
--Sameh
