Re: [Rd] slightly speeding up readChar()

2011-08-05 Thread Michael Lachmann

On 5 Aug 2011, at 1:20AM, Dirk Eddelbuettel wrote:

 When you know the (fixed) structure of the data, the CRAN package mmap can be
 a huge winner.

Thanks! I didn't know that.

Is there a package that provides methods for mmap objects, like sum(x) or maybe even 
y = x + z
where x and z are mmaps?

I assume that once you mmap a huge file, you do operations on it by working 
on one chunk at a time... are there packages for that, or do I have to write my 
own code?
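
Right now I imagine hand-rolling something like the sketch below (the file name 
and element type are made up; it chunk-sums a flat binary file of doubles using 
only base R):

f <- file("big_doubles.bin", "rb")
total <- 0
repeat {
    x <- readBin(f, what = "double", n = 1e6)  # at most 1e6 values per chunk
    if (length(x) == 0) break                  # end of file reached
    total <- total + sum(x)
}
close(f)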

Thanks!

Michael


Re: [Rd] slightly speeding up readChar()

2011-08-05 Thread Jeffrey Ryan
Michael,

The mmap package currently provides an Ops method for comparisons, which
returns matching indices (much as which() does in R), since a logical vector of
the same length would likely be too big to handle in many cases.
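
In use that looks roughly like the sketch below (the file name is made up, and
the exact semantics may differ between the CRAN and R-forge versions):

library(mmap)
m <- mmap("ints.bin", mode = int32())  # a flat file of 4-byte integers
idx <- m == 100L                       # comparison returns matching indices
vals <- m[idx]                         # pull just those elements into RAM
munmap(m)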

At some point I'll implement mmap-to-mmap operations, though for vectorized
ops that produce non-logical output (e.g. numeric), I haven't yet decided how
that should be implemented.  Something like a results buffer on disk or in
memory has been my thinking, but anyone with additional (better!) suggestions
should feel free to send me ideas off-list.

I'll look to add some basic summary statistics as well.

Note that you need a binary representation on disk (via fwrite in C, or
writeBin or as.mmap in R) for this to work.  The package currently supports
something like 16 data types, including bit logicals, fixed-width character
strings (\0-delimited vectors), 4-byte floats, and 64-bit ints.  The vignette
covers a lot of the details.
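
As a minimal sketch of that round trip (file name made up):

writeBin(1:1000, "ints.bin", size = 4L)  # binary representation on disk
library(mmap)
m <- mmap("ints.bin", mode = int32())    # map it; nothing is read into RAM yet
m[c(1L, 500L, 1000L)]                    # random access by index
munmap(m)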

Additionally, if you have struct-style data (think row-oriented, with varying
types), you can use the struct() feature.  This maps to an R list, but
allows very fast access if you are pulling complete rows.
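
Roughly, under an assumed record layout of a 4-byte int followed by an
8-byte double (file name made up):

library(mmap)
rec <- struct(int32(), real64())      # one row = int32 field + real64 field
m <- mmap("records.bin", mode = rec)
m[1]                                  # a list holding both fields of row 1
munmap(m)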

example(mmap)
example(types)
example(struct)

The R-forge version currently has more functionality than the CRAN version, but
I'll be pushing a new release to CRAN soon.

Jeff

On Fri, Aug 5, 2011 at 8:22 AM, Michael Lachmann <lachm...@eva.mpg.de> wrote:


 On 5 Aug 2011, at 1:20AM, Dirk Eddelbuettel wrote:

  When you know the (fixed) structure of the data, the CRAN package mmap
 can be
  a huge winner.

 Thanks! I didn't know that.

 Is there a package that provides methods for mmap objects, like sum(x) or maybe
 even y = x + z
 where x and z are mmaps?

 I assume that once you mmap a huge file, you do operations on it by
 working on one chunk at a time... are there packages for that, or do I have to
 write my own code?

 Thanks!

 Michael




-- 
Jeffrey Ryan
jeffrey.r...@lemnica.com

www.lemnica.com
www.esotericR.com



Re: [Rd] slightly speeding up readChar()

2011-08-04 Thread Simon Urbanek

On Aug 4, 2011, at 5:26 PM, Michael Lachmann wrote:

 Hi,
 
 I was trying to have R read files faster with readChar(). That was before I 
 noticed that readChar() is not that bad! In any case, below I suggest a few 
 simple changes that will make readChar slightly faster.
 
 I followed readChar(useBytes=T), and tried to identify all O(N) operations, 
 where N is the size of the file. The assumption is that for LARGE files we 
 want to avoid any O(N) operations, and any O(N) memory allocations.
 

I'm not sure it's really worth bothering with such optimizations; on my machine 
I get

> system.time(readChar(large.file,1e8,T))
   user  system elapsed 
  0.295   0.048   0.343 

so that is a fraction of a second for 100MB of data. Besides, why in the world 
would you want to use character vectors to store bytes? It's much more 
efficient to work with raw vectors instead...
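
For comparison, the raw route avoids the CHARSXP machinery entirely; counting
newline bytes, say:

r <- readBin(large.file, what = "raw", n = 1e8)  # raw vector, no CHARSXP, no hashing
sum(r == as.raw(0x0a))                           # operate on the bytes directly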

AFAICS the only lesson from the list below is that we could add in an internal 
function for safe_mkCharLenCE that doesn't check for NULs. But that seems a 
little contrived for a very special case. Everything else is either increasing 
complexity at the cost of safety or could lead to internal inconsistencies 
(hashing is mandatory, otherwise you can't compare CHARSXPs).

BTW: your code doesn't do what you think it does - you'll force a pretty ugly 
buffer overflow - a perfect illustration of why optimization should be done only 
if really needed; otherwise you are just likely to introduce bugs...

Cheers,
Simon



 Here they are:
 
 1. In readFixedString in envir.c, an N-sized vector is 
 allocated and memset to 0. O(N)
 
 2. The file is read into the buffer with con->read O(N) (but this probably 
 can't be dropped)
 
 3. mkChar is called, which calls mkCharLenCE(name, strlen(name), CE_NATIVE);
   strlen is O(N)
 
 4. In mkCharLenCE, a loop along the string looks for 0s to tell if the string 
 includes NULs (notice that because strlen was called before, that can't 
 really happen) O(N)
 
 5. A hashcode is computed for the string to see if it is already in memory. 
 That is an O(N) operation.
 
 6. A Charsxp of size N is allocated
 
 7. The data is copied to the Charsxp - O(N).
 
 So, as far as I could tell, in addition to the reading operation, 5 O(N) 
 operations are done, and double the memory of what is needed is allocated.
 
 A couple of these operations are easy to drop:
 
 1. One could zero only the memory beyond what was read, in case fewer than N 
 chars were read.
 
 2. We know the length of the string, so we can call mkCharLenCE directly from 
 readFixedString with the right length. 
 
 Others could maybe be dropped.
 
 3. Does one really need to look for 0s?
 In readFixedString there is a comment:
/* String may contain nuls which we now (R >= 2.8.0) assume to be
   padding and ignore silently */
 4. If a file was just read, is it likely that it is in the hash? Is it worth 
 paying the time for those people who read in the same file twice? 
 
 Finally about the allocation.
 
 Could the Charsxp be allocated to begin with, and the data read straight into 
 it?
 Then we'd save one extra allocation, and a memcpy. For that one would need 
 something like mkEmptyCharLen.
 
 One could also allocate a slightly bigger memory region, and then pass that so 
 that instead of allocating anew the old pointer is used (?).
 
 In any case, here is an updated readFixedString(), which would drop 2 O(N) 
 operations.
 
 ---
 static SEXP
 readFixedString(Rconnection con, int len, int useBytes)
 {
     SEXP ans;
     char *buf;
     int  m;
     const void *vmax = vmaxget();

     if(utf8locale && !useBytes) {
         int i, clen;
         char *p, *q;

         p = buf = (char *) R_alloc(MB_CUR_MAX*len+1, sizeof(char));
         memset(buf, 0, MB_CUR_MAX*len+1);
         for(i = 0; i < len; i++) {
             q = p;
             m = con->read(p, sizeof(char), 1, con);
             if(!m) { if(i == 0) return R_NilValue; else break; }
             clen = utf8clen(*p++);
             if(clen > 1) {
                 m = con->read(p, sizeof(char), clen - 1, con);
                 if(m < clen - 1) error(_("invalid UTF-8 input in readChar()"));
                 p += clen - 1;
                 /* NB: this only checks validity of multi-byte characters */
                 if((int)mbrtowc(NULL, q, clen, NULL) < 0)
                     error(_("invalid UTF-8 input in readChar()"));
             }
         }
     } else {
         buf = (char *) R_alloc(len+1, sizeof(char));
         //memset() was here
         m = con->read(buf, sizeof(char), len, con);
         if(m < len)
             memset(buf, m+1, len+1); // changed
         if(len && !m) return R_NilValue;
     }
     /* String may contain nuls which we now (R >= 2.8.0) assume to be
        padding

Re: [Rd] slightly speeding up readChar()

2011-08-04 Thread Michael Lachmann

On 4 Aug 2011, at 11:50PM, Simon Urbanek wrote:

 
 On Aug 4, 2011, at 5:26 PM, Michael Lachmann wrote:
 
 Hi,
 
 I was trying to have R read files faster with readChar(). That was before I 
 noticed that readChar() is not that bad! In any case, below I suggest a few 
 simple changes that will make readChar slightly faster.
 
 I followed readChar(useBytes=T), and tried to identify all O(N) operations, 
 where N is the size of the file. The assumption is that for LARGE files we 
 want to avoid any O(N) operations, and any O(N) memory allocations.
 
 
 I'm not sure it's really worth bothering with such optimizations, on my 
 machine I get

No, it isn't worth it, you're right. Though 100MB is much smaller than my 
average file size. But you're right, readChar is quite efficient.

 
 > system.time(readChar(large.file,1e8,T))
   user  system elapsed 
  0.295   0.048   0.343 
 
 so that is a fraction of a second for 100MB of data. Besides, why in the 
 world would you want to use character vectors to store bytes? It's much more 
 efficient to work with raw vectors instead...
 
 AFAICS the only lesson from the list below is that we could add in an 
 internal function for safe_mkCharLenCE that doesn't check for NULs. But that 
 seems a little contrived for a very special case. Everything else is either 
 increasing complexity at the cost of safety or could lead to internal 
 inconsistencies (hashing is mandatory, otherwise you can't compare CHARSXPs).

You're right. Then the harder changes aren't worth it, unless you manage to 
save on one of the allocations. 

 
 BTW: your code doesn't do what you think it does - you'll force a pretty ugly 
 buffer overflow - 

But here I don't agree. memset and strlen can be dropped. 

 a perfect illustration of why optimization should be done only if really needed; 
 otherwise you are just likely to introduce bugs...


Right again. I shouldn't really have bothered... but I did.

Michael


 Cheers,
 Simon
 
 
 
 Here they are:
 
 1. In readFixedString in envir.c, an N-sized vector is 
 allocated and memset to 0. O(N)
 
 2. The file is read into the buffer with con->read O(N) (but this probably 
 can't be dropped)
 
 3. mkChar is called, which calls mkCharLenCE(name, strlen(name), CE_NATIVE);
  strlen is O(N)
 
 4. In mkCharLenCE, a loop along the string looks for 0s to tell if the 
 string includes NULs (notice that because strlen was called before, that 
 can't really happen) O(N)
 
 5. A hashcode is computed for the string to see if it is already in memory. 
 That is an O(N) operation.
 
 6. A Charsxp of size N is allocated
 
 7. The data is copied to the Charsxp - O(N).
 
 So, as far as I could tell, in addition to the reading operation, 5 O(N) 
 operations are done, and double the memory of what is needed is allocated.
 
 A couple of these operations are easy to drop:
 
 1. One could zero only the memory beyond what was read, in case fewer than N 
 chars were read.
 
 2. We know the length of the string, so we can call mkCharLenCE directly 
 from readFixedString with the right length. 
 
 Others could maybe be dropped.
 
 3. Does one really need to look for 0s?
 In readFixedString there is a comment:
   /* String may contain nuls which we now (R >= 2.8.0) assume to be
  padding and ignore silently */
 4. If a file was just read, is it likely that it is in the hash? Is it worth 
 paying the time for those people who read in the same file twice? 
 
 Finally about the allocation.
 
 Could the Charsxp be allocated to begin with, and the data read straight 
 into it?
 Then we'd save one extra allocation, and a memcpy. For that one would need 
 something like mkEmptyCharLen.
 
 One could also allocate a slightly bigger memory region, and then pass that 
 so that instead of allocating anew the old pointer is used (?).
 
 In any case, here is an updated readFixedString(), which would drop 2 O(N) 
 operations.
 
 ---
 static SEXP
 readFixedString(Rconnection con, int len, int useBytes)
 {
     SEXP ans;
     char *buf;
     int  m;
     const void *vmax = vmaxget();

     if(utf8locale && !useBytes) {
         int i, clen;
         char *p, *q;

         p = buf = (char *) R_alloc(MB_CUR_MAX*len+1, sizeof(char));
         memset(buf, 0, MB_CUR_MAX*len+1);
         for(i = 0; i < len; i++) {
             q = p;
             m = con->read(p, sizeof(char), 1, con);
             if(!m) { if(i == 0) return R_NilValue; else break; }
             clen = utf8clen(*p++);
             if(clen > 1) {
                 m = con->read(p, sizeof(char), clen - 1, con);
                 if(m < clen - 1) error(_("invalid UTF-8 input in readChar()"));
                 p += clen - 1;
                 /* NB: this only checks validity of multi-byte characters */
                 if((int)mbrtowc(NULL, q, clen, NULL) < 0)
                     error(_("invalid UTF-8 input in

Re: [Rd] slightly speeding up readChar()

2011-08-04 Thread Dirk Eddelbuettel

On 5 August 2011 at 00:15, Michael Lachmann wrote:
|  I'm not sure it's really worth bothering with such optimizations, on my 
machine I get
| 
| No, it isn't worth it, you're right. Though 100MB is much smaller than my 
| average file size. But you're right, readChar is quite efficient.

When you know the (fixed) structure of the data, the CRAN package mmap can be
a huge winner.

Dirk

-- 
Gauss once played himself in a zero-sum game and won $50.
  -- #11 at http://www.gaussfacts.com
