[Rd] Commenting conventions
This might be a dumb question, but I couldn't figure out how to find the answer: why is it that comments in R documentation files (i.e. in examples) typically start with a double hash (##) instead of a single hash? -- Dave __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Commenting conventions
Erik Iverson wrote:
> dhi...@sonic.net wrote:
> > This might be a dumb question, but I couldn't figure out how to find
> > the answer: why is it that comments in R documentation files (i.e. in
> > examples) typically start with a double hash (##) instead of a single
> > hash?
> See the second paragraph in section 7.5 for the likely answer.
> http://ess.r-project.org/Manual/ess.html#Indenting

Ahh. I'd forgotten the (setq ess-fancy-comments nil) in my .emacs file! I thought the explanation would turn up in an R coding standards document and/or in Writing R Documentation Files, and it isn't an easy thing to google.

-- Dave
Re: [Rd] Moderating consequences of garbage collection when in C
Martin Morgan wrote:
> Allocating many small objects triggers numerous garbage collections as R
> grows its memory, seriously degrading performance. The specific use case
> is in creating a STRSXP of several 1,000,000's of elements of 60-100
> characters each; a simplified illustration understating the effects
> (because there is initially little to garbage collect, in contrast to an
> R session with several packages loaded) is below.

What a coincidence -- I was just going to post a question about why it is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10 characters long. I had noticed that this seemed to show much worse than linear scaling. I had not thought of garbage collection as the culprit -- but indeed it is. By manipulating the GC trigger, I can make this operation take as little as 3 seconds (with no GC) or as long as 76 seconds (with 31 garbage collections).

-- Dave
Re: [Rd] Moderating consequences of garbage collection when in C
dhi...@sonic.net wrote:
> Martin Morgan wrote:
> > Allocating many small objects triggers numerous garbage collections as R
> > grows its memory, seriously degrading performance. The specific use case
> > is in creating a STRSXP of several 1,000,000's of elements of 60-100
> > characters each; a simplified illustration understating the effects
> > (because there is initially little to garbage collect, in contrast to an
> > R session with several packages loaded) is below.
> What a coincidence -- I was just going to post a question about why it
> is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
> characters long. I had noticed that this seemed to show much worse
> than linear scaling. I had not thought of garbage collection as the
> culprit -- but indeed it is. By manipulating the GC trigger, I can
> make this operation take as little as 3 seconds (with no GC) or as
> long as 76 seconds (with 31 garbage collections).

I had done some google searches on this issue, since it seemed like it should not be too uncommon, but the only other hit I could come up with was a thread from 2006:

  https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html

In any case, one issue with your suggested workaround is that it requires knowing how much additional storage is needed, which may be expensive to determine. I've just tried implementing a different approach, which is to define two new functions to either disable or enable GC. The function to disable GC first invokes R_gc_full() to shrink the heap as much as possible, then sets a flag. Then in R_gc_internal(), I first check that flag, and if it is set, I call AdjustHeapSize(size_needed) and exit immediately.

These calls could be used to bracket any code section that expects to make lots of calls to R's memory allocator. The downside is that every path out of such a code section (including error handling) has to take care to unset the GC-disabled flag. I think I would want to hear from someone on the R team about whether they think this is a good idea.

A final alternative might be to provide a vectorized version of mkChar that would accept a char ** and use one of these methods internally, rather than exporting the underlying methods as part of R's API. I don't know if there are other clear use cases where GC is a serious bottleneck, besides constructing large vectors of mostly unique strings. Such a function would be less generally useful since it would require that the full vector of C strings be assembled at one time.

-- Dave
Re: [Rd] Moderating consequences of garbage collection when in C
Martin Morgan wrote:
> On 11/14/2011 11:47 AM, dhi...@sonic.net wrote:
> > dhi...@sonic.net wrote:
> >> Martin Morgan wrote:
> >
> > I had done some google searches on this issue, since it seemed like it
> > should not be too uncommon, but the only other hit I could come up
> > with was a thread from 2006:
> >
> > https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html
> >
> > In any case, one issue with your suggested workaround is that it
> > requires knowing how much additional storage is needed, which may be
> > an expensive operation to determine. I've just tried implementing a
> > different approach, which is to define two new functions to either
> > disable or enable GC. The function to disable GC first invokes
> > R_gc_full() to shrink the heap as much as possible, then sets a flag.
> > Then in R_gc_internal(), I first check that flag, and if it is set, I
> > call AdjustHeapSize(size_needed) and exit immediately.
> I think this is a better approach; mine seriously understated the
> complexity of figuring out required size.
> > These calls could be used to bracket any code section that expects to
> > make lots of calls to R's memory allocator. The down side is that
> > this approach requires that all paths out of such a code section
> > (including error handling) need to take care to unset the GC-disabled
> > flag. I think I would want to hear from someone on the R team about
> > whether they think this is a good idea.
> >
> Another place where this comes up is during package load, especially for
> packages with many S4 instances.

Do you know if this is all happening inside a C function that could handle disabling and enabling GC? Or would it require doing this at the R level? For testing, I am turning GC on and off at the R level but I am thinking about where we would need to check for failures to re-enable GC. I suppose one approach would be to provide an R wrapper that would evaluate an expression with GC disabled using tryCatch to guarantee that it would exit with GC enabled.

> > system.time(as.character(1:1000))
>    user  system elapsed
>  61.908   0.297  62.303

I get 6 seconds for this with GC disabled.

> There's a hierarchy of CHARSXP / STRSXP, so maybe that could be
> exploited in the mark phase?

I haven't explored whether GC could be made smarter so that this isn't as big of a hit. I don't really understand the GC process.

-- Dave
Re: [Rd] Moderating consequences of garbage collection when in C
Martin Morgan wrote:
> > Do you know if this is all happening inside a C function that could
> > handle disabling and enabling GC? Or would it require doing this at
> > the R level? For testing, I am turning GC on and off at the R level
> Generally complicated operations across multiple function calls.
> Something like
>
>    f = function() {
>        state <- gcdisable(TRUE)
>        on.exit(gcdisable(state))
>        as.character(1:1000)
>    }
>
> might be used.

Here is how I've implemented the core part of this (for discussion, not a complete patch)

-- Dave

--- memory.c.orig 2011-04-04 15:05:04.0 -0700
+++ memory.c 2011-11-14 15:21:42.0 -0800
@@ -98,6 +98,7 @@
 */

 static int gc_reporting = 0;
+static int gc_disabled = 0;
 static int gc_count = 0;

 #ifdef TESTING_WRITE_BARRIER
@@ -2467,6 +2468,17 @@
     R_gc_internal(size_needed);
 }

+SEXP attribute_hidden do_gcdisable(SEXP call, SEXP op, SEXP args, SEXP rho)
+{
+    int i;
+    SEXP old = ScalarLogical(gc_disabled);
+    checkArity(op, args);
+    i = asLogical(CAR(args));
+    if (i != NA_LOGICAL)
+        gc_disabled = i;
+    return old;
+}
+
 #ifdef _R_HAVE_TIMING_
 double R_getClockIncrement(void);
 void R_getProcTime(double *data);
@@ -2541,6 +2553,14 @@
     SEXP first_bad_sexp_type_sexp = NULL;
     int first_bad_sexp_type_line = 0;

+    if (gc_disabled) {
+        AdjustHeapSize(size_needed);
+        if (NO_FREE_NODES() || VHEAP_FREE() < size_needed) {
+            gc_disabled = 0;
+            error("Heap adjustment failed -- enabling GC");
+        } else return;
+    }
+
  again:

     gc_count++;
[Rd] Maybe a bug in warning() for condition objects?
I'm using R-2.3.1 but the code in question is the same in the 01-Oct-2006 snapshot for release 2.4.0.

I'd like to evaluate an expression, catching errors and handling them as warnings. My first attempt:

    x <- tryCatch(lm(xyzzy), error=warning)

didn't work; the error is still treated as an error, not a warning. So I thought, hmmm, the condition is still classed as an "error", how about if I change that:

    as.warning <- function(e) warning(simpleWarning(e$message,e$call))
    x <- tryCatch(lm(xyzzy), error=as.warning)

Still no luck. But this works:

    as.warning <- function(e) .signalSimpleWarning(e$message,e$call)
    x <- tryCatch(lm(xyzzy), error=as.warning)

I think the problem here is that warning() contains the code:

    withRestarts({
        .Internal(.signalCondition(cond, message, call))
        .Internal(.dfltStop(message, call))
    }, muffleWarning = function() NULL)

i.e., the default action is .dfltStop(), not .dfltWarn(). Is this intentional? It seems to make more sense to me for the default action for conditions passed to warning() to be .dfltWarn(), but I may well be misunderstanding something.

-- David Hinds
[Rd] Bug in warning() for condition objects (PR#9274)
Full_Name: David Hinds
Version: 2.4.0
OS: Windows XP
Submission from: (NULL) (64.168.232.238)

A (maybe naive) use of tryCatch to trap errors and report as warnings does not work, i.e.:

    x <- tryCatch(lm(xyzzy), error=warning)

In src/library/base/R/stop.R, the warning() function contains the following code, for handling condition objects:

    withRestarts({
        .Internal(.signalCondition(cond, message, call))
        .Internal(.dfltStop(message, call))
    }, muffleWarning = function() NULL)

So all conditions result in calling .dfltStop(). It would seem more useful and/or consistent for warning() to call .dfltWarn(), or in the alternative, to choose between .dfltStop and .dfltWarn based on the class of the condition object.
Re: [Rd] Watch out for the latest Cygwin upgrade
Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> I just updated my copy of Cygwin to the latest version, and now Windows
> builds are failing on that machine. The only parts of the R toolset I
> changed were the Cygwin dlls. I haven't tracked down exactly what the
> problem is, and probably won't be able to do so for a few days.
> So if you're a Windows user thinking about a Cygwin upgrade, be prepared
> for problems...

The change that bit me is that the latest cygwin bash is unhappy with cr/lf line endings in scripts, which breaks some of the "R CMD ..." scripts in non-obvious ways. The error messages are hard to interpret because they have embedded "\r" characters, causing some of the text of the messages to be overwritten. There is a sort-of workaround in the latest release (the "igncr" shell option) but that seems to still be in flux so I decided to revert to the previous release for now.

There seems to be a somewhat cavalier attitude among Cygwin developers about backwards compatibility. They've said that their primary focus is on making cygwin as linux-like as possible, and they're willing to sacrifice interoperability to do so.

-- Dave
Re: [Rd] DBI + ROracle problem with parser ?? (PR#9424)
[EMAIL PROTECTED] wrote:
> doesn't:
> dbGetQuery(conn, "\nselect * from dual")
> dbGetQuery(conn, "select\n * from dual")
> dbGetQuery(conn, "/* comment */ select * from dual")

This sounds like my doing. What version of Oracle are you using? Oracle 9i has a bug that interferes with the documented mechanism for asking Oracle for the type of an SQL statement (i.e. whether it is a query returning row data, or a statement that modifies rows). So I asked David James for a quick fix that consisted of checking the beginning of the SQL for either "select" or "with". We could be more sophisticated about parsing things, I guess skipping over any arbitrary combination of comments and white space.

Or, if you're using a version of Oracle not affected by the bug, you can edit src/Makefile and comment out the line:

    WORKAROUND = "-DRS_ORA_SQLGLS_WORKAROUND"

-- Dave
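The more thorough check mentioned in the message above (skip any mix of comments and white space, then look for "select" or "with") could be sketched roughly as follows. This is only an illustration and not the code in ROracle: the function names skip_sql_prefix, starts_with_keyword and looks_like_query are hypothetical, and the sketch assumes that only /* ... */ block comments and "--" line comments need to be skipped. It does not handle queries that begin with a parenthesis.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Skip any mix of white space, block comments and "--" line comments at
   the start of an SQL string.  Returns a pointer to the first character
   of the statement proper (or to the terminating NUL). */
static const char *skip_sql_prefix(const char *s)
{
    for (;;) {
        while (isspace((unsigned char) *s)) s++;
        if (s[0] == '/' && s[1] == '*') {
            const char *end = strstr(s + 2, "*/");
            if (!end) return s + strlen(s);   /* unterminated comment */
            s = end + 2;
        } else if (s[0] == '-' && s[1] == '-') {
            while (*s && *s != '\n') s++;     /* skip to end of line */
        } else
            return s;
    }
}

/* Case-insensitive test that s starts with the lower-case keyword kw,
   followed by a non-identifier character (so "selection" does not match). */
static int starts_with_keyword(const char *s, const char *kw)
{
    size_t i;
    for (i = 0; kw[i]; i++)
        if (tolower((unsigned char) s[i]) != kw[i]) return 0;
    return !(isalnum((unsigned char) s[i]) || s[i] == '_');
}

/* Hypothetical helper: 1 if the statement looks like a query, else 0. */
int looks_like_query(const char *sql)
{
    const char *s = skip_sql_prefix(sql);
    return starts_with_keyword(s, "select") || starts_with_keyword(s, "with");
}
```

With this sketch, all three statements quoted in the report ("\nselect * from dual", "select\n * from dual" and "/* comment */ select * from dual") would be classified as queries.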
Re: [Rd] "ROracle" Packages is not to be installed (PR#10652)
[EMAIL PROTECTED] wrote:
> /opt/oracle/product/10g/lib/libclntst10.a: file not recognized: File truncated

Here:

  http://osdir.com/ml/lang.r.mac/2006-08/msg00031.html

I found a suggestion to do this:

  $ cd $ORACLE_HOME/bin
  $ ./genclntst

-- Dave
[Rd] RFC: "loop connections"
I've just implemented a generalization of R's text connections, to also support reading/writing raw binary data. There is very little new code to speak of. For input connections, I wrote code to populate the old text connection buffer from a raw vector, and provided a new raw_read() method. For output connections, I wrote a raw_write() to append to a raw vector. On input, the mode (text or binary) is determined by the data type of the input object; on output, I use the requested output mode (i.e. "w" / "wb"). For example:

    > con <- loopConnection("r", "wb")
    > a <- c(10,100,1000)
    > writeBin(a, con, size=4)
    > r
     [1] 00 00 20 41 00 00 c8 42 00 00 7a 44
    > close(con)
    > con <- loopConnection(r)
    > readBin(con, "double", n=3, size=4)
    [1]   10  100 1000
    > close(con)

I think "loop connection" is a better name for this sort of connection than "text connection" was even for the old version; that confuses the mode of the connection (text vs binary) with the mechanism (file, socket, etc).

I've appended a patch to the end of this message. As implemented here, textConnection is replaced by loopConnection but functionally this is a superset of the old textConnection. For compatibility, one could add:

    textConnection <- function(...) loopConnection(...)

The patch is against R-2.1.1. I can investigate whether any changes are required for the current development tree. I can also update the documentation files as required. I thought I'd first check whether anyone else thought this was worth inclusion before spending more time on it.

The raw_write() code could be improved with smarter memory allocation (grabbing bigger chunks rather than reallocating the raw vector for every write), but this is at least a proof of principle.
-- David Hinds

--- src/main/connections.c.orig 2005-06-17 19:05:02.0 -0700
+++ src/main/connections.c 2005-08-22 15:54:03.156038200 -0700
@@ -1644,13 +1644,13 @@
     return ans;
 }

-/* --- text connections - */
+/* --- loop connections - */

 /* read a R character vector into a buffer */
 static void text_init(Rconnection con, SEXP text)
 {
     int i, nlines = length(text), nchars = 0;
-    Rtextconn this = (Rtextconn)con->private;
+    Rloopconn this = (Rloopconn)con->private;

     for(i = 0; i < nlines; i++)
         nchars += strlen(CHAR(STRING_ELT(text, i))) + 1;
@@ -1668,19 +1668,35 @@
     this->cur = this->save = 0;
 }

-static Rboolean text_open(Rconnection con)
+/* read a R raw vector into a buffer */
+static void raw_init(Rconnection con, SEXP raw)
+{
+    int nbytes = length(raw);
+    Rloopconn this = (Rloopconn)con->private;
+
+    this->data = (char *) malloc(nbytes);
+    if(!this->data) {
+        free(this); free(con->description); free(con->class); free(con);
+        error(_("cannot allocate memory for raw connection"));
+    }
+    memcpy(this->data, RAW(raw), nbytes);
+    this->nchars = nbytes;
+    this->cur = this->save = 0;
+}
+
+static Rboolean loop_open(Rconnection con)
 {
     con->save = -1000;
     return TRUE;
 }

-static void text_close(Rconnection con)
+static void loop_close(Rconnection con)
 {
 }

-static void text_destroy(Rconnection con)
+static void loop_destroy(Rconnection con)
 {
-    Rtextconn this = (Rtextconn)con->private;
+    Rloopconn this = (Rloopconn)con->private;

     free(this->data);
     /* this->cur = this->nchars = 0; */
@@ -1689,7 +1705,7 @@

 static int text_fgetc(Rconnection con)
 {
-    Rtextconn this = (Rtextconn)con->private;
+    Rloopconn this = (Rloopconn)con->private;
     if(this->save) {
         int c;
         c = this->save;
@@ -1700,48 +1716,69 @@
     else return (int) (this->data[this->cur++]);
 }

-static double text_seek(Rconnection con, double where, int origin, int rw)
+static double loop_seek(Rconnection con, double where, int origin, int rw)
 {
-    if(where >= 0) error(_("seek is not relevant for text connection"));
+    if(where >= 0) error(_("seek is not relevant for loop connection"));
     return 0; /* if just asking, always at the beginning */
 }

-static Rconnection newtext(char *description, SEXP text)
+static size_t raw_read(void *ptr, size_t size, size_t nitems,
+                       Rconnection con)
+{
+    Rloopconn this = (Rloopconn)con->private;
+    if (this->cur + size*nitems > this->nchars) {
+        nitems = (this->nchars - this->cur)/size;
+        memcpy(ptr, this->data+this->cur, size*nitems);
+        this->cur = this->nchars;
+    } else {
+        memcpy(ptr, this->data+this->cur, size*nitems);
+        this->cur += size*nitems;
+    }
+    return nitems;
+}
+
+static Rconnection newloop(char *description, SEXP data)
 {
     Rconnection new;
     new = (Rconnection) malloc(sizeof(struct Rconn));
-    if(!new) error(_("allocation of text connection failed"));
-    new->class = (char *) malloc(strlen("textConnection") + 1);
+    if(!new) error(_("allocation of loop connection failed"));
+
Re: [Rd] Typo(s) in proc.time.Rd and comment about ?proc.time (PR#8092)
[EMAIL PROTECTED] wrote:
> On Wed, 24 Aug 2005 [EMAIL PROTECTED] wrote:
> > I just downloaded the file
> >
> >   ftp://ftp.stat.math.ethz.ch/Software/R/R-devel.tar.gz
> >
> > and within proc.time.Rd, the second paragraph of the \value
> > section contains a typo:
> I believe your understanding of the English language is different from the
> author here, who is English. (You on the other hand seem to think there
> is no need to give your country in your address when writing an address in
> Denmark.) The preferred language for R documentation is English (and not
> American).
> > The resolution of the times will be system-specific; it is common for
> > them to be recorded to of the order of 1/100 second, and elapsed [...]
> >                     ^
> >
> > I'd say replacing "to of" with just "of" would grammatically
> > fix the sentence.
> I'd say it was correct and your correction is incorrect. In English we
> say `recorded to 1/100th of a second', not `recorded 1/100th second'.

The correction was incorrect, but so was the original. I've never heard the expression "of the order of"; common usage (in English or American, as far as I know) is "on the order of". Your "recorded to 1/100th of a second" is also ok.

> > Second, the \note{} section for Unix-like machines reads:
> >
> >   It is possible to compile \R without support for \code{proc.time},
> >   when the function will throw an error.
> >
> > I believe this is ungrammatical and suggest replacing
> > "when the function will throw an error" with "in which
> > case the function will throw an error".
> Again, the statement given is the intended meaning.

I think a clearer wording might be, "it is possible to compile R without support for proc.time, when the function *would* throw an error".

-- Dave
Re: [Rd] RFC: "loop connections"
I accidentally left one small change out of my previous patch.

So... no response to my request for comments. Does that mean no one has an opinion about whether this is a good idea or not? I'd appreciate a response from an R core member one way or the other; if this is not the right way to get a response, should I email people instead?

-- David Hinds

--- src/include/Internal.h.orig 2005-05-20 05:51:37.0 -0700
+++ src/include/Internal.h 2005-08-22 15:46:48.968190600 -0700
@@ -518,7 +518,7 @@
 SEXP do_pushback(SEXP, SEXP, SEXP, SEXP);
 SEXP do_pushbacklength(SEXP, SEXP, SEXP, SEXP);
 SEXP do_clearpushback(SEXP, SEXP, SEXP, SEXP);
-SEXP do_textconnection(SEXP, SEXP, SEXP, SEXP);
+SEXP do_loopconnection(SEXP, SEXP, SEXP, SEXP);
 SEXP do_getallconnections(SEXP, SEXP, SEXP, SEXP);
 SEXP do_sumconnection(SEXP, SEXP, SEXP, SEXP);
 SEXP do_download(SEXP, SEXP, SEXP, SEXP);
Re: [Rd] RFC: "loop connections"
Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> OK. I guess you want one of the core people to respond but in the
> interim can you explain the terminology "loop"?
> Also, do you have any prototypical applications in mind?

"loop" is short for "loopback". A loop or loopback device is one that just returns the data sent to it.

The prototypical applications are the same sort of applications text connections are used for: data transformation, in this case of raw binary data, rather than formatted text data. In my case, I needed to interpret a "long raw" column from an Oracle table, that consisted of packed single precision floating point numbers. The caTools package on CRAN includes less capable raw2bin and bin2raw functions, used to implement Base64 encoders and decoders.

-- Dave
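To make the "packed single precision" use case concrete, here is a minimal C sketch of the decoding step that a binary read with size=4 has to perform on such a column. It is not code from the patch: the function name unpack_floats_le is hypothetical, and the sketch assumes the bytes are little-endian IEEE 754 values and that the platform's float is a 32-bit IEEE 754 type.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode up to n little-endian IEEE 754 single-precision values from buf
   (len bytes) into out.  Returns the number of values decoded; trailing
   bytes that do not make up a whole value are ignored. */
size_t unpack_floats_le(const unsigned char *buf, size_t len,
                        double *out, size_t n)
{
    size_t i, avail = len / 4;
    if (n > avail) n = avail;
    for (i = 0; i < n; i++) {
        const unsigned char *p = buf + 4 * i;
        /* Assemble the 32-bit pattern byte by byte, so the result does
           not depend on the byte order of the host. */
        uint32_t bits = (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
                        ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
        float f;
        memcpy(&f, &bits, sizeof f);   /* reinterpret the bit pattern */
        out[i] = f;                    /* widen to double, as R stores it */
    }
    return n;
}
```

Applied to the twelve bytes 00 00 20 41 00 00 c8 42 00 00 7a 44 from the example session in the original proposal, this yields 10, 100 and 1000.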
Re: [Rd] RFC: "loop connections"
Martin Maechler <[EMAIL PROTECTED]> wrote:
> In the mean time, I think it has become clear that
> "loopconnection" isn't necessarily a better name, and that
> textConnection() has been there in "the S literature" for a
> good reason and for quite a while.
> Let's forget about the naming and the exact UI for the moment.

That is entirely fine with me.

> I think the main point of David's proposal is still worth
> consideration: One way to see text connections is as a way to
> treat some kind of R objects as "generalized files" i.e., connections.
> And AFAICS David proposes to enlarge the kind of R objects that
> can be dealt with as connections
>    from {"character"}
>    to   {"character", "raw"}
> something which has some appeal to me.
> IIUC, Brian Ripley is doubting the potential use for the
> proposed generalization, whereas David makes a point of someone
> else (the 'caTools' author) having written raw2bin / bin2raw function
> for a related use case.
> Maybe you can elaborate on the above a bit, David?

I'm not sure what more can be said on the subject. Most connection types support both text-mode and binary-mode, so this is partly a proposal for symmetry and consistency. Prof. Ripley is correct that binary anonymous connections provide overlapping functionality, but the semantics are slightly different, and performance is different. I don't see an advantage for having the "text-like" connection only support text access.

I ran some quick benchmarks on three implementations, where the task was conversion back and forth between a numeric vector of length 1000, and a packed raw vector of single precision floats, repeated 1000 times. The first method uses a new anonymous connection for each transformation. The second reuses a single anonymous connection. The third uses a new raw textConnection for each transformation.

    usr   sys  elapsed
    1.5   9.5   14.6    anonymous
    1.1   0.1    1.2    persistent
    0.9   0.0    0.9    raw

Setting up and tearing down anonymous connections is very slow (at least on Windows) because it requires substantial OS intervention. If a program can be easily organized so that a single connection can be used, performance is much better.

I would appreciate feedback on how to improve raw_write() for the case of appending to an existing vector. Is it possible to reserve free space at the end of a vector for appending? I see that there is a distinction between LENGTH() and TRUELENGTH() but I'm not sure if this is the intended use.

> In any case, as you might have guessed by now, R-core would have
> been more positive to a proposal to generalize current
> textConnection() - fully back-compatibly - rather than renaming
> it first.

I have no interest in sacrificing back compatibility; I did intend that there would always be a textConnection() entry point, if only as a wrapper for the new constructor. The only reason for a new name (and I'm certainly open to suggestions) is because the notion of a binary or raw textConnection seemed wrong.

-- David Hinds
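The "reserve free space at the end" idea raised above is the usual length-versus-capacity scheme for append-only buffers. The following is a minimal sketch in plain C with malloc'd memory, not R vectors: the bytebuf type and its functions are hypothetical, and whether LENGTH()/TRUELENGTH() can safely play the roles of len and cap inside R is exactly the open question in the message.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical append-only byte buffer: len is the number of bytes in
   use, cap the number allocated (cap >= len). */
typedef struct {
    unsigned char *data;
    size_t len, cap;
} bytebuf;

/* Append n bytes, doubling the capacity when the reserve runs out, so a
   long run of small writes needs only O(log n) reallocations.  Returns 0
   on success, -1 if memory cannot be obtained (buffer left unchanged). */
int bytebuf_append(bytebuf *b, const void *src, size_t n)
{
    if (n == 0) return 0;
    if (n > b->cap - b->len) {
        size_t need, newcap = b->cap ? b->cap : 64;
        unsigned char *p;
        if (n > (size_t) -1 - b->len) return -1;   /* size_t overflow */
        need = b->len + n;
        while (newcap < need) {
            if (newcap > (size_t) -1 / 2) { newcap = need; break; }
            newcap *= 2;
        }
        p = realloc(b->data, newcap);
        if (!p) return -1;
        b->data = p;
        b->cap = newcap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Release the storage and reset the buffer to its empty state. */
void bytebuf_free(bytebuf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

A buffer is started with `bytebuf b = {0};`. Only the first len bytes are valid data, so a final copy or truncation to len is still needed when the result is handed back to the caller.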
Re: [Rd] floating point control on windows
Chris Paulse <[EMAIL PROTECTED]> wrote:
> Hi,
> I'm sure that this question has come up many times before. When I load an R
> extension dll I've built with the Microsoft compiler, I get the warning:
> Warning message:
> DLL attempted to change FPU control word from 8001f to 9001f

I think the simplest fix for this problem is to add "fp10.obj" to the link line for your code. This file is provided by Microsoft to flip the precision of the run time library to 80 bits. The linker should find it automatically.

  http://msdn.microsoft.com/library/en-us/vclib/html/_crt_floating.2d.point_support.asp

-- David Hinds
Re: [Rd] RFC: "loop connections"
Gabor Grothendieck <[EMAIL PROTECTED]> wrote:
> Just to be concrete, suppose one wants to run the following as a
> concurrent process to R. (What it does is it implicitly sets x to
> zero and then for each line of stdin it adds the first field of the
> input to x and prints that to stdout unless the first field is
> "exit" in which case it exits. gawk has an implicit read/process
> loop so one does not have to specify the read step. The fflush()
> command just makes sure that output is emitted, rather than
> buffered, as it is produced.)

It seems you're just trying to reinvent fifo and/or pipe connections for interprocess communication. That is not directly related to the problem I wanted to address.

-- Dave
Re: [Rd] RFC: "loop connections"
Martin Maechler <[EMAIL PROTECTED]> wrote:
> I think the main point of David's proposal is still worth
> consideration: One way to see text connections is as a way to
> treat some kind of R objects as "generalized files" i.e., connections.

To summarize the motivation for the proposal, again:

- There are two modes of connections: text and binary. The operations supported on text and binary connections are mostly disjoint. Most connection classes (socket, file, etc) support both modes.

- textConnection() binds a character vector to a text connection. There is no equivalent for a binary connection. There are workarounds (i.e. anonymous connections, equivalent to temporary files), but these have substantial performance penalties.

- Both connection modes have useful applications. textConnection() is useful, or it would not exist. Orthogonality is good, special cases are bad.

- Only about 50 lines of code are required to implement a binary form of textConnection() in the R core. Implementing this functionality in a separate package requires substantially more code.

- I need it, and in at least one case, another R package developer has implemented it using temporary files (caTools). I also just noticed that Duncan Murdoch recently proposed the EXACT SAME feature on r-help:

  https://stat.ethz.ch/pipermail/r-help/2005-April/067651.html

I think that just about sums it up. I've attached a smaller patch that makes fewer changes to R source, doesn't change any existing function names, etc. The feature adds 400 bytes to the size of R.dll.
-- Dave

--- src/main/connections.c.orig 2005-06-17 19:05:02.0 -0700
+++ src/main/connections.c 2005-08-31 15:26:19.947195100 -0700
@@ -1644,7 +1644,7 @@
     return ans;
 }

-/* --- text connections - */
+/* --- text and raw connections - */

 /* read a R character vector into a buffer */
 static void text_init(Rconnection con, SEXP text)
@@ -1668,6 +1668,22 @@
     this->cur = this->save = 0;
 }

+/* read a R raw vector into a buffer */
+static void raw_init(Rconnection con, SEXP raw)
+{
+    int nbytes = length(raw);
+    Rtextconn this = (Rtextconn)con->private;
+
+    this->data = (char *) malloc(nbytes);
+    if(!this->data) {
+        free(this); free(con->description); free(con->class); free(con);
+        error(_("cannot allocate memory for raw connection"));
+    }
+    memcpy(this->data, RAW(raw), nbytes);
+    this->nchars = nbytes;
+    this->cur = this->save = 0;
+}
+
 static Rboolean text_open(Rconnection con)
 {
     con->save = -1000;
@@ -1702,41 +1718,60 @@

 static double text_seek(Rconnection con, double where, int origin, int rw)
 {
-    if(where >= 0) error(_("seek is not relevant for text connection"));
+    if(where >= 0) error(_("seek is not relevant for this connection"));
     return 0; /* if just asking, always at the beginning */
 }

-static Rconnection newtext(char *description, SEXP text)
+static size_t raw_read(void *ptr, size_t size, size_t nitems,
+                       Rconnection con)
+{
+    Rtextconn this = (Rtextconn)con->private;
+    if (this->cur + size*nitems > this->nchars) {
+        nitems = (this->nchars - this->cur)/size;
+        memcpy(ptr, this->data+this->cur, size*nitems);
+        this->cur = this->nchars;
+    } else {
+        memcpy(ptr, this->data+this->cur, size*nitems);
+        this->cur += size*nitems;
+    }
+    return nitems;
+}
+
+static Rconnection newtext(char *description, SEXP data)
 {
     Rconnection new;
+    int isText = isString(data);
     new = (Rconnection) malloc(sizeof(struct Rconn));
-    if(!new) error(_("allocation of text connection failed"));
-    new->class = (char *) malloc(strlen("textConnection") + 1);
-    if(!new->class) {
-        free(new);
-        error(_("allocation of text connection failed"));
-    }
-    strcpy(new->class, "textConnection");
+    if(!new) goto f1;
+    new->class = (char *) malloc(strlen("Connection") + 1);
+    if(!new->class) goto f2;
+    sprintf(new->class, "%sConnection", isText ? "text" : "raw");
     new->description = (char *) malloc(strlen(description) + 1);
-    if(!new->description) {
-        free(new->class); free(new);
-        error(_("allocation of text connection failed"));
-    }
+    if(!new->description) goto f3;
     init_con(new, description, "r");
     new->isopen = TRUE;
     new->canwrite = FALSE;
     new->open = &text_open;
     new->close = &text_close;
     new->destroy = &text_destroy;
-    new->fgetc = &text_fgetc;
     new->seek = &text_seek;
     new->private = (void*) malloc(sizeof(struct textconn));
-    if(!new->private) {
-        free(new->description); free(new->class); free(new);
-        error(_("allocation of text connection failed"));
+    if(!new->private) goto f4;
+    new->text = isText;
+    if (new->text) {
+        new->fgetc = &text_fgetc;
+        text_init(new, data);
+    } else {
+        new->re
Re: [Rd] RFC: rawConnection (was "loop connections")
Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> I would implement it differently from the way you did. I'd call it
> a rawConnection, taking a raw variable (or converting something else
> using as.raw) as the input, and providing both text and binary
> read/write modes (using the same conventions for text mode as a file
> connection would). It *should* support seek, at least in binary
> mode.

I was trying to reuse as much of the textConnection semantics and underlying code as possible...

Having a rawConnection() entry point is simple enough. Seeking also seems straightforward. I'm not so sure about using as.raw(). I wondered about that, but also thought that rather than coercing to raw, it might make more sense to cast atomic vector types to raw, byte-for-byte.

Can you give an example of where a text-mode raw connection would be a useful thing?

-- Dave
Re: [Rd] RFC: rawConnection (was "loop connections")
Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> > Having a rawConnection() entry point is simple enough. Seeking also
> > seems straightforward. I'm not so sure about using as.raw(). I
> > wondered about that, but also thought that rather than coercing to
> > raw, it might make more sense to cast atomic vector types to raw,
> > byte-for-byte.
> I'd prefer as.raw, so that we don't end up with two incompatible ways to
> convert other objects to raw objects.

An advantage of skipping as.raw() would be that you could create a raw connection on an object without making an extra copy, which was another of your requests. But there would be a lack of symmetry, because you could read ("r") from an arbitrary R object but only write ("w") to a raw vector, unless there was also a way of specifying a type for the result vector.

Having the backing store be an R object with no copy does seem tricky, however. Currently, textConnection() makes a copy for "r" connections but writes directly to an R object for "w" connections. The "w" case is buggy; you can crash R by removing the target object while the connection is being used. I'm not familiar enough with R internals to know how to fix that. Maybe the object has to be searched for every time the connection is used, to avoid potentially stale pointers?

> > Can you give an example of where a text-mode raw connection would be
> > a useful thing?
> No, but someone else might. Why unnecessarily let the source of the
> bytes determine the mode of the connection? In the case of
> textConnection, there are natural line breaks, so a text mode connection
> makes sense. A raw object can contain anything, so why wouldn't someone
> want to put text in it some day?

It seems that a text-mode raw connection would be equivalent to a textConnection on the result of rawToChar(), no?

While some of these possibilities seem like they might be useful, I'm not sure that all need to be implemented immediately. If we can agree on the basic interface and semantics, then we could implement a basic version now, and relax restrictions on the arguments later as needed?

-- Dave
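The difference between coercing to raw and casting byte-for-byte, as discussed in this message, can be illustrated in plain C. This is only a sketch under stated assumptions: coerce_to_bytes and reinterpret_as_bytes are hypothetical names, not R functions, and the coercion here fails on out-of-range values, whereas R's as.raw() stores 0 with a warning.

```c
#include <stddef.h>
#include <string.h>

/* Coercion in the spirit of as.raw(): one output byte per input element.
 * Returns the number of bytes written, or (size_t)-1 if a value does not
 * fit in 0..255 (a simplification of what R does). */
size_t coerce_to_bytes(const int *src, size_t n, unsigned char *dst)
{
    for (size_t i = 0; i < n; i++) {
        if (src[i] < 0 || src[i] > 255)
            return (size_t)-1;
        dst[i] = (unsigned char)src[i];
    }
    return n;
}

/* Byte-for-byte cast: copy the in-memory representation unchanged.
 * Always yields n * sizeof(int) bytes; the byte order depends on the
 * platform. dst must have room for n * sizeof(int) bytes. */
size_t reinterpret_as_bytes(const int *src, size_t n, unsigned char *dst)
{
    memcpy(dst, src, n * sizeof(int));
    return n * sizeof(int);
}
```

With three integers as input, the first function produces three bytes and can lose information, while the second produces 3 * sizeof(int) bytes and loses nothing, which is why the two conversions would be incompatible if both were offered.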
Re: [Rd] RFC: rawConnection (was "loop connections")
Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> I think the cost of duplicating as.raw is worse than the cost of using
> extra memory. If the lack of symmetry bothers you, a solution is to
> require a raw object as input.

It wouldn't exactly be duplicating as.raw, since this way of converting to raw is actually to do nothing at all, just to treat the object as if it is already raw. But I don't have a strong opinion.

> > Currently, textConnection() makes a copy for "r" connections
> > but writes directly to an R object for "w" connections. The "w" case
> > is buggy; you can crash R by removing the target object while the
> > connection is being used. I'm not familiar enough with R internals to
> > know how to fix that. Maybe the object has to be searched for every
> > time the connection is used, to avoid potentially stale pointers?
> I've been having an argument with some other people about something
> related to this. I think they would say that the language doesn't
> support writing to a variable.

I tried changing textConnection output connections to look up the destination object on every access, and that seems to solve the problem without being terribly expensive.

> If so, then a binary mode rawConnection (with mention of the way to
> convert in the Rd file) would be good enough for me.

It seems we are coming back to something close to what I had originally implemented?

-- Dave
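The idea of looking up the destination on every access, rather than caching a pointer that can go stale, can be sketched with a toy environment in plain C. Everything here (struct env, env_find, env_define, env_remove, struct writer, writer_add) is hypothetical and much simpler than R's environments; it only shows why a writer that stores the name fails cleanly once the target is removed.

```c
#include <stddef.h>
#include <string.h>

#define MAX_VARS 8

/* Toy environment: a fixed table of named integer accumulators. */
struct var { const char *name; int value; int bound; };
struct env { struct var vars[MAX_VARS]; };

/* Return the slot bound to name, or NULL if nothing is bound. */
struct var *env_find(struct env *e, const char *name)
{
    for (size_t i = 0; i < MAX_VARS; i++)
        if (e->vars[i].bound && strcmp(e->vars[i].name, name) == 0)
            return &e->vars[i];
    return NULL;
}

/* Bind name to value, reusing an existing binding if there is one.
 * Returns 0 on success, -1 if the table is full. */
int env_define(struct env *e, const char *name, int value)
{
    struct var *v = env_find(e, name);
    for (size_t i = 0; !v && i < MAX_VARS; i++)
        if (!e->vars[i].bound)
            v = &e->vars[i];
    if (!v)
        return -1;
    v->name = name;
    v->value = value;
    v->bound = 1;
    return 0;
}

/* Remove the binding for name, if any (the analogue of rm()). */
void env_remove(struct env *e, const char *name)
{
    struct var *v = env_find(e, name);
    if (v)
        v->bound = 0;
}

/* A writer remembers only the target's name, never a pointer to it. */
struct writer { struct env *env; const char *name; };

/* Add x to the target. Returns 0 on success, or -1 if the target is no
 * longer bound, instead of writing through a stale pointer. */
int writer_add(struct writer *w, int x)
{
    struct var *v = env_find(w->env, w->name);
    if (!v)
        return -1;
    v->value += x;
    return 0;
}
```

The cost is one lookup per write, which matches the observation above that the approach is not terribly expensive when writes are line-sized.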
Re: [Rd] RFC: rawConnection (was "loop connections")
Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> Probably! The differences I still know about are:
> - I'd like the name to reflect the data source, so rawConnection or
>   something similar rather than overloading textConnection.
> - It needs a man page, or to be included on the textConnection man page.

Here is an updated patch, with the rawConnection() entry point, and a man page, against today's R-devel snapshot. This also fixes (text or raw) output connections to verify that the target object still exists before writing to that object.

-- Dave

--- src/main/connections.c.orig 2005-08-29 17:47:35.0 -0700
+++ src/main/connections.c 2005-09-03 13:34:25.098514900 -0700
@@ -1678,7 +1678,7 @@
     return ans;
 }

-/* --- text connections - */
+/* --- text and raw connections - */

 /* read a R character vector into a buffer */
 static void text_init(Rconnection con, SEXP text)
@@ -1702,6 +1702,22 @@
     this->cur = this->save = 0;
 }

+/* read a R raw vector into a buffer */
+static void raw_init(Rconnection con, SEXP raw)
+{
+    int nbytes = length(raw);
+    Rtextconn this = (Rtextconn)con->private;
+
+    this->data = (char *) malloc(nbytes);
+    if(!this->data) {
+        free(this); free(con->description); free(con->class); free(con);
+        error(_("cannot allocate memory for raw connection"));
+    }
+    memcpy(this->data, RAW(raw), nbytes);
+    this->nchars = nbytes;
+    this->cur = this->save = 0;
+}
+
 static Rboolean text_open(Rconnection con)
 {
     con->save = -1000;
@@ -1736,41 +1752,60 @@
 static double text_seek(Rconnection con, double where, int origin, int rw)
 {
-    if(where >= 0) error(_("seek is not relevant for text connection"));
+    if(where >= 0) error(_("seek is not relevant for this connection"));
     return 0; /* if just asking, always at the beginning */
 }

-static Rconnection newtext(char *description, SEXP text)
+static size_t raw_read(void *ptr, size_t size, size_t nitems,
+                       Rconnection con)
+{
+    Rtextconn this = (Rtextconn)con->private;
+    if (this->cur + size*nitems > this->nchars) {
+        nitems = (this->nchars - this->cur)/size;
+        memcpy(ptr, this->data+this->cur, size*nitems);
+        this->cur = this->nchars;
+    } else {
+        memcpy(ptr, this->data+this->cur, size*nitems);
+        this->cur += size*nitems;
+    }
+    return nitems;
+}
+
+static Rconnection newtext(char *description, SEXP data)
 {
     Rconnection new;
+    int isText = isString(data);
     new = (Rconnection) malloc(sizeof(struct Rconn));
-    if(!new) error(_("allocation of text connection failed"));
-    new->class = (char *) malloc(strlen("textConnection") + 1);
-    if(!new->class) {
-        free(new);
-        error(_("allocation of text connection failed"));
-    }
-    strcpy(new->class, "textConnection");
+    if(!new) goto f1;
+    new->class = (char *) malloc(strlen("textConnection") + 1);
+    if(!new->class) goto f2;
+    sprintf(new->class, "%sConnection", isText ? "text" : "raw");
     new->description = (char *) malloc(strlen(description) + 1);
-    if(!new->description) {
-        free(new->class); free(new);
-        error(_("allocation of text connection failed"));
-    }
+    if(!new->description) goto f3;
     init_con(new, description, "r");
     new->isopen = TRUE;
     new->canwrite = FALSE;
     new->open = &text_open;
     new->close = &text_close;
     new->destroy = &text_destroy;
-    new->fgetc = &text_fgetc;
     new->seek = &text_seek;
     new->private = (void*) malloc(sizeof(struct textconn));
-    if(!new->private) {
-        free(new->description); free(new->class); free(new);
-        error(_("allocation of text connection failed"));
+    if(!new->private) goto f4;
+    new->text = isText;
+    if (new->text) {
+        new->fgetc = &text_fgetc;
+        text_init(new, data);
+    } else {
+        new->read = &raw_read;
+        raw_init(new, data);
     }
-    text_init(new, text);
     return new;
+
+f4: free(new->description);
+f3: free(new->class);
+f2: free(new);
+f1: error(_("allocation of %s connection failed"),
+          isText ? "text" : "raw");
 }

 static void outtext_close(Rconnection con)
@@ -1780,10 +1815,13 @@
     int idx = ConnIndex(con);

     if(strlen(this->lastline) > 0) {
-        PROTECT(tmp = lengthgets(this->data, ++this->len));
+        tmp = findVar1(this->namesymbol, VECTOR_ELT(OutTextData, idx),
+                       STRSXP, FALSE);
+        if (tmp == R_UnboundValue)
+            error(_("connection endpoint unbound"));
+        PROTECT(tmp = lengthgets(tmp, ++this->len));
         SET_STRING_ELT(tmp, this->len - 1, mkChar(this->lastline));
         defineVar(this->namesymbol, tmp, VECTOR_ELT(OutTextData, idx));
-        this->data = tmp;
         UNPROTECT(1);
     }
     SET_VECTOR_ELT(OutTextData, idx, R_NilValue);
@@ -1843,10 +1881,13 @@
     if(q) {
         int idx = ConnIndex(con);
[Rd] A memory management question
Can someone explain the use of SETLENGTH() and SETTRUELENGTH()?

I would like to allocate a vector and reserve some space at the end, so that it appears shorter than the allocated size and I can append to it more efficiently, without requiring a new copy every time. So I'd like to use SETLENGTH() with a shorter apparent length, and bump this up as needed until I've used the entire space.

There are only a couple of users of SETLENGTH() in R, and they all appear at first glance to be pointless: a few routines use allocVector() and then call SETLENGTH() to set the vector length to the value that was just allocated.

What are valid uses for SETLENGTH()? And what are the intended semantics for "truelength" as opposed to the regular length? If GC happens and an object is moved, and its apparent LENGTH() differs from its allocated length, does GC preserve the allocated length, or the updated LENGTH()? Is there any way to get at the original allocated length, given an SEXP?

-- Dave
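The over-allocation scheme asked about here (a visible length that is smaller than the allocated size, bumped up as data is appended) is the usual growable-buffer technique. A minimal sketch in plain C follows; struct bytevec, bytevec_append and bytevec_free are hypothetical names, the code uses malloc-family allocation rather than R's allocator, and it says nothing about whether SETLENGTH() may be used this way inside R.

```c
#include <stdlib.h>
#include <string.h>

/* A byte vector whose visible length (len) may be smaller than the
 * space actually allocated (cap). */
struct bytevec {
    unsigned char *data;
    size_t len;   /* bytes in use, what callers see */
    size_t cap;   /* bytes allocated */
};

/* Append n bytes, growing the allocation geometrically so that most
 * appends do not copy. Returns 0 on success, -1 on allocation failure.
 * Overflow of the capacity computation is not checked in this sketch. */
int bytevec_append(struct bytevec *v, const void *src, size_t n)
{
    if (n > v->cap - v->len) {
        size_t need = v->len + n;
        size_t cap = v->cap ? v->cap : 16;
        unsigned char *p;

        while (cap < need)
            cap *= 2;
        p = realloc(v->data, cap);
        if (!p)
            return -1;
        v->data = p;
        v->cap = cap;
    }
    memcpy(v->data + v->len, src, n);
    v->len += n;
    return 0;
}

/* Release the storage and reset the vector to its empty state. */
void bytevec_free(struct bytevec *v)
{
    free(v->data);
    v->data = NULL;
    v->len = v->cap = 0;
}
```

Doubling the capacity keeps the total amount of copying proportional to the final size, in contrast to growing by one element per append.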
Re: [Rd] A memory management question
Luke Tierney <[EMAIL PROTECTED]> wrote:
> It might or might not work now but is not guaranteed to do so reliably
> in the future. Seeing the risks of leaving SETLENGTH exposed, it is
> very likely that SETLENGTH will be removed from the sources after the
> 2.2.0 release.
> If you provide your own methods to read and write the external pointer
> then you don't need this; this is safer than relying on undocumented
> behavior of [ and [<- in any case. You also then don't need to use
> R_PreserveObject unless you really need to use it from the C level
> outside of a context where an R reference exists.

I'm not sure I follow this. Maybe I should explain the context for the problem.

textConnection("xyz", "w") creates a connection, the output of which is deposited in a char vector named "xyz", which is updated line by line as output is sent to the connection. The current code maintains a pointer to "xyz" in the form of an unprotected SEXP. Hence if the user does rm(xyz), bad things happen. A small bug, I admit. I think the best fix is to use a protected reference to the result vector. I think this is safe and doesn't rely on any abuse of the interfaces.

There's also a performance issue: the result is updated after every line of output, resulting in a vast amount of copying if a large result is accumulated. This is the part that could be fixed by using SETLENGTH to manage the length of the protected result vector.

I'm not sure what you mean by undocumented behavior of [ and [<-. I think all I'm relying on is that as long as an outstanding reference to the result vector exists, R has to make sure the reference remains valid, and hence can't change the memory allocation of the result vector in any way. I don't care what else happens to the contents of the vector, as long as I get to control when it is released. It is ok with me if the user modifies the result vector in-place, since my reference stays valid. So I don't actually care how [ and [<- work.

I think the only undocumented thing I'm relying on is that the memory manager doesn't pay attention to the LENGTH of objects that it isn't actively doing anything to. Currently, it actually only uses LENGTH in one spot: for updating R_LargeVallocSize when a large vector is released. The true allocation sizes for individual objects are always kept in another place (either by malloc, or in the node class of the object).

It seems like in this limited usage, SETLENGTH does represent a useful feature, by permitting safe over-allocation of a protected object, and might be worth preserving (and documenting) for that purpose.

Of course, the real problem here is the semantics of textConnection(), which make life much more difficult and can't be changed because they are specified outside of R.

-- Dave
Re: [Rd] A memory management question
Luke Tierney <[EMAIL PROTECTED]> wrote:
> I am not comfortable making this available at this point. It might be
> useful to have but would need careful thought. Without some way to
> find out the true length there are potential problems. Without some
> way of making sure the fields in VECSXP and STRSXP that are added are
> valid there are potential problems (not the first time but if the size
> is shrunk and then increased). Not that this can't be resolved but it
> would take time that I don't have now, and this isn't high priority
> enough to schedule in the near future. So for now you should not use
> SETLENGTH if you want your code to work beyond 2.2.0.

Ok, that's fine... given the lack of other valid uses of SETLENGTH, it doesn't seem worth preserving it just for this one debatable usage.

> It may be possible to expand the semantics by adding a logical
> argument that controls whether the vector is to be over-allocated and
> filled with zero length strings and truncated to the true length on
> close. Another variant would be to have a logical argument that says
> to keep the input internally and provide a function, say
> textConnectionOutput, to retrieve the internal output.

These are possible... or optionally just don't reveal the intermediate output at all, and just make the final result visible on close...

-- Dave
[Rd] Updated rawConnection() patch
Here's an update of my rawConnection() implementation. In addition to providing a raw version of textConnection(), this fixes two existing issues with textConnection(): one is that the current textConnection() implementation carries around unprotected SEXP pointers, the other is a performance problem due to prolific copying of the output buffer as output is accumulated line by line.

This new version uses a separate buffer for connection output, which is extended in larger chunks, so that resize operations are less frequent. And the buffer is hidden behind an active binding, so that the user can't corrupt it.

My original need for this is largely addressed by Brian Ripley's recent extension of readBin/writeBin to operate on raw vectors as well as connections, in the latest development tree. But I think having a raw version of textConnection is still a bit more orthogonal and flexible, and requires very little code.

-- Dave

--- ./src/include/Internal.h.orig 2005-08-29 17:47:27.0 -0700
+++ ./src/include/Internal.h 2005-09-18 00:32:08.196336200 -0700
@@ -525,6 +525,7 @@
 SEXP do_pushbacklength(SEXP, SEXP, SEXP, SEXP);
 SEXP do_clearpushback(SEXP, SEXP, SEXP, SEXP);
 SEXP do_textconnection(SEXP, SEXP, SEXP, SEXP);
+SEXP do_graboutput(SEXP, SEXP, SEXP, SEXP);
 SEXP do_getallconnections(SEXP, SEXP, SEXP, SEXP);
 SEXP do_sumconnection(SEXP, SEXP, SEXP, SEXP);
 SEXP do_download(SEXP, SEXP, SEXP, SEXP);
--- ./src/include/Rconnections.h.orig 2005-08-03 08:50:36.0 -0700
+++ ./src/include/Rconnections.h 2005-09-17 23:56:01.875475000 -0700
@@ -94,8 +94,7 @@

 typedef struct outtextconn {
     int len; /* number of lines */
-    SEXP namesymbol;
-    SEXP data;
+    SEXP namesymbol, data, venv;
     char *lastline;
     int lastlinelength; /* buffer size */
 } *Routtextconn;
--- ./src/library/base/man/rawconnections.Rd.orig 2005-09-18 11:37:18.004405000 -0700
+++ ./src/library/base/man/rawconnections.Rd 2005-09-18 11:37:00.535655300 -0700
@@ -0,0 +1,71 @@
+\name{rawConnection}
+\alias{rawConnection}
+\title{Raw Connections}
+\description{
+  Input and output raw connections.
+}
+\usage{
+rawConnection(object, open = "r", local = FALSE)
+}
+\arguments{
+  \item{object}{raw or character. A description of the connection.
+    For an input this is an \R raw vector object, and for an output
+    connection the name for the \R raw vector to receive the
+    output.
+  }
+  \item{open}{character. Either \code{"rb"} (or equivalently \code{""})
+    for an input connection or \code{"wb"} or \code{"ab"} for an output
+    connection.}
+  \item{local}{logical. Used only for output connections. If \code{TRUE},
+    output is assigned to a variable in the calling environment. Otherwise
+    the global environment is used.}
+}
+\details{
+  An input raw connection is opened and the raw vector is copied
+  at the time the connection object is created, and \code{close}
+  destroys the copy.
+
+  An output raw connection is opened and creates an \R raw vector of
+  the given name in the user's workspace or in the calling
+  environment, depending on the value of the \code{local} argument.
+  This object will at all times hold the accumulated output to the
+  connection.
+
+  Opening a raw connection with \code{mode = "ab"} will attempt to
+  append to an existing raw vector with the given name in the user's
+  workspace or the calling environment. If none is found (even if an
+  object exists of the right name but the wrong type) a new raw vector
+  will be created, with a warning.
+
+  You cannot \code{seek} on a raw connection, and \code{seek} will
+  always return zero as the position.
+}
+
+\value{
+  A binary-mode connection object of class \code{"rawConnection"}
+  which inherits from class \code{"connection"}.
+}
+
+\seealso{
+  \code{\link{connections}}, \code{\link{showConnections}},
+  \code{\link{readBin}}, \code{\link{writeBin}},
+  \code{\link{textConnection}}.
+}
+
+\examples{
+zz <- rawConnection("foo", "wb")
+writeBin(1:2, zz)
+writeBin(1:8, zz, size=1)
+writeBin(pi, zz, size=4)
+close(zz)
+foo
+
+zz <- rawConnection(foo)
+readBin(zz, "integer", n=2)
+sprintf("\%04x", readBin(zz, "integer", n=2, size=2))
+sprintf("\%08x", readBin(zz, "integer", endian="swap"))
+readBin(zz, "numeric", n=1, size=4)
+close(zz)
+}
+\keyword{file}
+\keyword{connection}
--- ./src/library/base/man/textconnections.Rd.orig 2005-09-03 13:55:48.274305900 -0700
+++ ./src/library/base/man/textconnections.Rd 2005-09-18 11:37:03.457530300 -0700
@@ -45,16 +45,11 @@
 }

 \value{
-  A connection object of class \code{"textConnection"} which inherits
-  from class \code{"connection"}.
+  A text-mode connection object of class \code{"textConnection"} which
+  inherits from class \code{"connection"}.
 }

 \note{
-  As output text connections keep the character vector up to date
-  line-by-line, they are relatively expensive to use, and it is often
-  better to use an anonymous \code{\link{file
[Rd] Future plans for raw data type?
I've been working with raw vectors quite a bit and was wondering if the R team might comment on where they see raw vector support going in the long run. Is the intent that 'raw' will eventually become a first-class data type on the same level as 'integer'? Or should 'raw' have more limited support, by design?

For example, with very minor changes to subassign.c to implement some automatic coercions, raw vectors can become arguments to ifelse() and can be members of data frames. Would this be desirable?

-- David Hinds
Re: [Rd] Problems with autoconf example from r-ext.
Prof Brian Ripley <[EMAIL PROTECTED]> wrote:
> The current R-exts.texi has
>   AC_INIT([RODBC], 1.1.4) dnl package name, version
> and that is crucially different from your example. Autoconf 2.59 has a
> barely documented back-compatibility mode that is invoked for AC_INIT with
> just one argument.

I was tripped up by this same issue, and was not easily able to figure out from the autoconf documentation how AC_INIT had changed over time. The one-argument AC_INIT, for the version of autoconf I was using (2.57), expects its argument to be a path to a file that is relatively unique to the package.

However, this isn't actually related to the problem at hand:

> > R CMD INSTALL
> > --configure-args='--with-sbmlode-lib=/data/opt/sbmlodesolve/include \
> > --with-sbmlode-include=/data/opt/sbmlodesolve/lib' \
> > SBMLodeSolveR

This is a shell programming error. Remove the '\' inside your quoted --configure-args argument. Inside single quotes a backslash is not a line continuation: it is passed to the configure script literally, together with the newline that follows it, which confuses the argument parser. You don't need the backslash because a quoted string is automatically continued until the closing quote is seen.

-- Dave