RE: [R] fast mkChar

2004-06-09 Thread Vadim Ogranovich
Thank you for the lead, Peter. It may be useful for other packages I
write.

As to the strings, I think I have to take what is already there. I agree
that strings would be better managed in malloc-style fashion (probably
with a reference counter) and not by gc(). However, I don't want to have
a system with two different string classes; such close relatives seldom
coexist peacefully.

BTW, the slowness of mkChar explains why R is so slow when it needs to
compute names for long vectors.

Thank you for an interesting discussion,
Vadim 


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Wrong question [Wasn't: Re: [R] fast mkChar]

2004-06-09 Thread Matej Cepl
On Wednesday 09 of June 2004 09:52, you wrote:
> This is my first message to the list and I believe the question
> I am including is a simple one.

http://www.r-project.org/posting-guide.html

-- 
Matej Cepl, http://www.ceplovi.cz/matej



RE: [R] fast mkChar

2004-06-09 Thread Paulo Nuin
Hello everyone

This is my first message to the list and I believe the question I am
including is a simple one.

I have a matrix for which I need to calculate an ANOVA for each row,
where each column represents a different treatment. I would like to know
if there is a command or a series of commands that I can enter to do
that.

At the moment I have an external script that extracts each row from the
matrix, transforms it into a column, adds a factor column, and passes the
resulting text file to Rterm --vanilla.

Any help is appreciated.

Thanks a lot

Paulo Nuin



Re: [R] fast mkChar

2004-06-08 Thread Peter Dalgaard
"Vadim Ogranovich" <[EMAIL PROTECTED]> writes:

> I am no expert in memory management in R so it's hard for me to tell
> what is and what is not doable. From reading the code of allocVector()
> in memory.c I think that the critical part is to vectorize
> CLASS_GET_FREE_NODE and use the vectorized version along the lines of
> the code fragment below (taken from memory.c).
> 
>   if (node_class < NUM_SMALL_NODE_CLASSES) {
>   CLASS_GET_FREE_NODE(node_class, s); 
> 
> If this is possible, then the rest is just a matter of code refactoring.
> 
> By vectorizing I mean writing a macro CLASS_GET_FREE_NODE2(node_class,
> s, n) which in one go allocates n little objects of class node_class and
> "inscribes" them into the elements of vector s, which is assumed to be
> long enough to hold these objects.
> 
> If this is doable, then the only missing piece would be a new function
> setChar(CHARSXP rstr, const char * cstr), which copies 'cstr' into 'rstr'
> and (re)allocates the heap memory if necessary. Here setChar() is safe
> since the s[i] are all brand new and thus not shared with any other
> object.

I had a similar idea initially, but I don't think it can fly: First,
allocating n objects at once is not likely to be much faster than
allocating them one-by-one, especially when you consider the
implications of having to deal with near-out-of-memory conditions.
Second, you have to know the string lengths when allocating, since the
structure of a vector object (CHARSXP) is a header immediately
followed by the data.
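
To make the second point concrete, here is a minimal sketch of a
"header immediately followed by the data" layout. It is not R's actual
CHARSXP definition; ToyString and toy_mk_char() are hypothetical names
used only for illustration. The size passed to malloc() depends on the
string length, which is why the length has to be known at allocation
time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a CHARSXP-like object (not R's real layout):
   a fixed header with the character data stored directly behind it in
   the same allocation. */
typedef struct {
    size_t length;   /* number of bytes in data, excluding the final '\0' */
    char data[];     /* C99 flexible array member: bytes follow the header */
} ToyString;

/* Allocate header and data as one block. The block size depends on
   strlen(cstr), so the length must be known before allocating.
   Returns NULL if malloc fails; the caller frees the result with free(). */
ToyString *toy_mk_char(const char *cstr)
{
    assert(cstr != NULL);
    size_t len = strlen(cstr);
    ToyString *s = malloc(sizeof *s + len + 1);
    if (s == NULL)
        return NULL;
    s->length = len;
    memcpy(s->data, cstr, len + 1);   /* copy including the terminator */
    return s;
}
```

With this layout an object cannot be handed out first and filled in with
a longer string later without reallocating it, which is the obstacle for
a setChar()-style approach.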

A more interesting line to pursue is that - depending on what it
really is that you need - you might be able to create a different kind
of object that could "walk and quack" like a character vector, but is
stored differently internally. E.g. you could set up a representation
that is just a block of pointers, pointing to strings that are being
maintained in malloc-style.

Have a look at External pointers and finalization.
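
As a rough illustration of the "block of pointers" representation, the
sketch below keeps malloc'ed C strings in a plain table. StrTable and
the str_table_*() functions are hypothetical names, not part of R. In an
R package one would presumably wrap the table pointer with
R_MakeExternalPtr() and have a small finalizer, registered with
R_RegisterCFinalizerEx(), call the release function; that wiring is left
out here so that the sketch compiles without the R headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string table: a block of pointers to malloc'ed C strings. */
typedef struct {
    size_t n;       /* number of slots */
    char **items;   /* items[i] is NULL or an owned, malloc'ed string */
} StrTable;

/* Create a table with n empty slots. Returns NULL if malloc fails. */
StrTable *str_table_new(size_t n)
{
    StrTable *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->items = malloc((n ? n : 1) * sizeof *t->items);
    if (t->items == NULL) {
        free(t);
        return NULL;
    }
    for (size_t i = 0; i < n; i++)
        t->items[i] = NULL;
    t->n = n;
    return t;
}

/* Store a private copy of cstr in slot i, replacing any previous string.
   Returns 0 on success, -1 if malloc fails (the old string is kept). */
int str_table_set(StrTable *t, size_t i, const char *cstr)
{
    assert(t != NULL && cstr != NULL && i < t->n);
    size_t len = strlen(cstr);
    char *copy = malloc(len + 1);
    if (copy == NULL)
        return -1;
    memcpy(copy, cstr, len + 1);
    free(t->items[i]);
    t->items[i] = copy;
    return 0;
}

/* Read slot i; returns NULL for a slot that was never set. */
const char *str_table_get(const StrTable *t, size_t i)
{
    assert(t != NULL && i < t->n);
    return t->items[i];
}

/* Release all strings and the table itself. This is the work a finalizer
   would do once the R-side handle becomes unreachable. */
void str_table_free(StrTable *t)
{
    if (t == NULL)
        return;
    for (size_t i = 0; i < t->n; i++)
        free(t->items[i]);
    free(t->items);
    free(t);
}
```

Each slot can be reassigned without touching the others, and no R object
is allocated per string; the cost is that such a table is not a real
character vector, so every R-level operation on it would need its own
method.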


-- 
   O__   Peter Dalgaard Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics 2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark  Ph: (+45) 35327918
~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907



RE: [R] fast mkChar

2004-06-08 Thread Vadim Ogranovich
I am no expert in memory management in R so it's hard for me to tell
what is and what is not doable. From reading the code of allocVector()
in memory.c I think that the critical part is to vectorize
CLASS_GET_FREE_NODE and use the vectorized version along the lines of
the code fragment below (taken from memory.c).

    if (node_class < NUM_SMALL_NODE_CLASSES) {
        CLASS_GET_FREE_NODE(node_class, s);

If this is possible, then the rest is just a matter of code refactoring.

By vectorizing I mean writing a macro CLASS_GET_FREE_NODE2(node_class,
s, n) which in one go allocates n little objects of class node_class and
"inscribes" them into the elements of vector s, which is assumed to be
long enough to hold these objects.

If this is doable, then the only missing piece would be a new function
setChar(CHARSXP rstr, const char * cstr), which copies 'cstr' into 'rstr'
and (re)allocates the heap memory if necessary. Here setChar() is safe
since the s[i] are all brand new and thus not shared with any other
object.
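
To illustrate the kind of batched allocation meant here, below is a toy
free-list allocator. It is only a sketch and not R's memory.c: Node,
FreeList, free_list_grow(), get_free_node() and get_free_nodes() are all
hypothetical names, and the real CLASS_GET_FREE_NODE macro works on R's
own node classes and pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy free-list allocator; NOT R's memory.c. All names are hypothetical. */
typedef struct Node {
    struct Node *next;   /* link used while the node sits on the free list */
    char payload[16];    /* room for a small object */
} Node;

typedef struct {
    Node *free_head;     /* singly linked list of free nodes */
    size_t n_free;       /* number of nodes currently on the list */
} FreeList;

/* Add `count` freshly malloc'ed nodes to the free list.
   Returns 0 on success, -1 if malloc fails (the list stays valid). */
int free_list_grow(FreeList *fl, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        Node *node = malloc(sizeof *node);
        if (node == NULL)
            return -1;
        node->next = fl->free_head;
        fl->free_head = node;
        fl->n_free++;
    }
    return 0;
}

/* One-at-a-time version: pop a single node, or NULL if none are free. */
Node *get_free_node(FreeList *fl)
{
    Node *node = fl->free_head;
    if (node != NULL) {
        fl->free_head = node->next;
        fl->n_free--;
    }
    return node;
}

/* Batched version: pop n nodes into out[0..n-1] in one call.
   All-or-nothing: returns -1 and takes nothing if fewer than n nodes are
   free, so the caller can collect or grow the pool and then retry. */
int get_free_nodes(FreeList *fl, Node **out, size_t n)
{
    assert(out != NULL || n == 0);
    if (fl->n_free < n)
        return -1;
    for (size_t i = 0; i < n; i++)
        out[i] = get_free_node(fl);
    return 0;
}
```

Note that the batched call still does the same per-node work; in this
toy version it only moves the "is there enough free space" check out of
the loop, so any gain would have to come from doing that check once per
batch rather than once per string.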






Re: [R] fast mkChar

2004-06-08 Thread Duncan Murdoch
On Tue, 8 Jun 2004 12:23:58 -0700, "Vadim Ogranovich"
<[EMAIL PROTECTED]> wrote :

>Hi,
> 
>To speed up reading of large (few million lines) CSV files I am writing
>custom read functions (in C). By timing various approaches I figured out
>that one of the bottlenecks in reading character fields is the mkChar()
>function which on each call incurs a lot of garbage-collection-related
>overhead.
> 
>I wonder if there is a "vectorized" version of mkChar, say mkChar2(char
>**, int length) that converts an array of C strings to a string vector,
>which somehow amortizes the gc overhead over the entire array?
> 
>If no such function exists, I'd appreciate any hint as to how to write
>it.

It's not easy.  Internally R strings always have a header at the
front, so you need to allocate memory and move C strings to get R to
understand them.  

Duncan Murdoch



Re: [R] fast mkChar

2004-06-08 Thread Peter Dalgaard
"Vadim Ogranovich" <[EMAIL PROTECTED]> writes:

> Hi,
>  
> To speed up reading of large (few million lines) CSV files I am writing
> custom read functions (in C). By timing various approaches I figured out
> that one of the bottlenecks in reading character fields is the mkChar()
> function which on each call incurs a lot of garbage-collection-related
> overhead.
>  
> I wonder if there is a "vectorized" version of mkChar, say mkChar2(char
> **, int length) that converts an array of C strings to a string vector,
> which somehow amortizes the gc overhead over the entire array?
>  
> If no such function exists, I'd appreciate any hint as to how to write
> it.

The real issue here is that character vectors are implemented as
generic vectors of little R objects (CHARSXP type) that each hold one
string. Allocating all those objects is probably what does you in.

The reason behind the implementation is probably that doing it that
way allows the mechanics of the garbage collector to be applied
directly (CHARSXPs are just vectors of bytes), but it is obviously
wasteful in terms of total allocation. If you can think up something
better, please say so (but remember that the memory management issues
are nontrivial).

-- 
   O__   Peter Dalgaard Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics 2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark  Ph: (+45) 35327918
~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907
