Re: [Rd] readLines() segfaults on large file & question on how to work around
Jennifer: why not try SparkR? https://spark.apache.org/docs/1.6.1/api/R/read.json.html

On 2 September 2017 at 23:15, Jennifer Lyon wrote:
> Thank you for your suggestion. Unfortunately, while R doesn't segfault
> calling readr::read_file() on the test file I described, I get the error
> message:
>
>   Error in read_file_(ds, locale) : negative length vectors are not allowed
>
> Jen
>
> [...]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] readLines() segfaults on large file & question on how to work around
2017-09-02 20:58 GMT+02:00 Jennifer Lyon:
> Hi:
>
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.
>
> When I try to read in this file using readLines(), R segfaults.
>
> I believe the two salient issues with this file are
>   1) its size
>   2) it is a single line (no line breaks)

As a workaround you can pipe the input through something like
"sed 's/,/,\n/g'" before your R script to insert line breaks.

Iñaki
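Iñaki's suggestion above can be sketched as a small pipeline. This is a minimal illustration with a toy JSON string; note that sed splits on every comma, including commas inside quoted string values, which is harmless only if the pieces are later re-joined or stream-parsed in R:

```shell
# Pipe the JSON through sed before it reaches the R script, so readLines()
# sees many short lines instead of one multi-gigabyte line.
# (GNU sed interprets \n in the replacement as a newline.)
printf '{"a":1,"b":[2,3],"c":4}' | sed 's/,/,\n/g'
# -> {"a":1,
#    "b":[2,
#    3],
#    "c":4}
```

Inside R, `paste0(readLines("file.txt"), collapse="")` would undo the splitting before handing the string to a JSON parser, at which point the single-string size limit becomes the remaining obstacle.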
Re: [Rd] I have corrected a dead link in the treering documentation
> Thomas Levine <_...@thomaslevine.com>
>     on Fri, 1 Sep 2017 13:23:47 + writes:

> Martin Maechler writes:
>> There may be one small problem: IIUC, the wayback machine is a +-
>> private endeavor and really great and fantastic, but it does need (US?
>> tax deductible) donations, https://archive.org/donate/, to continue
>> thriving. This makes me hesitate a bit to link to it within the "base
>> R" documentation. But that may be wrong -- and I should really use it
>> to *help* the project?

> I agree that the Wayback Machine is a private endeavor. After reviewing
> other base library documentation, I have concluded that it would
> regardless be consistent with current practice to reference it in the
> base documentation.

> I share your concern regarding the support of other institutions, and I
> have found some references that are more problematic to me than the one
> of present interest. I would thus support an initiative to consider the
> social implications of the different references and to adjust the
> references accordingly.

> Below I start by making a distinction between two types of references
> that I think should be treated differently in terms of your concern.
> Next, I assess whether there is a precedent for inclusion of references
> to private publishers, as in the present patch; I conclude that there is
> such a precedent. Then I present my opinion regarding the present patch.
> Finally, I present some other considerations that I find relevant to the
> discussion.

> Distinguishing between two link types
> -------------------------------------
> For discussion of this issue, I think it is helpful to distinguish
> between references to sources and references to other materials.
> In the case of references to sources, there is little choice but to
> reference the publisher, even though the overwhelming majority of
> referenced publishers are private companies that impose restrictive
> licenses on their journals and books and cannot reasonably be trusted to
> maintain access to the materials or the availability of webpages. With
> other references, it is possible to replace the reference with a
> different document that contains similar information.

> For example, if a function implements a method based on a particular
> journal article, that article's citation needs to stay, even if the
> journal is published by a private institution. On the other hand, if the
> reference just provides context or suggestions related to usage, then
> the reference is provided just as information and can be replaced.

> Precedent for inclusion of private non-source materials
> -------------------------------------------------------
> The dead link of interest is only informational, not a citation of a
> source, and so it could be replaced. So I assessed whether it would
> match current practice to include it, and I concluded that there is
> substantial precedent for inclusion of private reference materials other
> than strict sources. Not having access to a good library at the moment,
> I have limited my research on this matter to website references.

> In SVN revision 73164, \url calls are distributed among 148 files, from
> 1 call to 13 calls per file, with a mean of 1.75 and a median of 1.

>     grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -n

> The total number of library documentation files is 1419.

>     find src/library/ -name \*.Rd | wc -l

> I randomly selected 20 matching files for further study.
>     % grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -R | head -n 20 | tee /tmp/rd
>     2 src/library/grDevices/man/pdf.Rd
>     1 src/library/base/man/taskCallbackNames.Rd
>     1 src/library/stats/man/shapiro.test.Rd
>     1 src/library/tcltk/man/TkWidgets.Rd
>     2 src/library/graphics/man/assocplot.Rd
>     1 src/library/base/man/sprintf.Rd
>     6 src/library/base/man/regex.Rd
>     3 src/library/datasets/man/HairEyeColor.Rd
>     1 src/library/stats/man/optimize.Rd
>     1 src/library/datasets/man/UKDriverDeaths.Rd
>     1 src/library/utils/man/object.size.Rd
>     1 src/library/utils/man/unzip.Rd
>     1 src/library/base/man/dcf.Rd
>     1 src/library/base/man/DateTimeClasses.Rd
>     3 src/library/stats/man/GammaDist.Rd
>     2 src/library/utils/man/maintainer.Rd
>     2 src/library/base/man/libcurlVersion.Rd
>     2 src/library/base/man/eigen.Rd
>     2 src/library/base/man/chol2inv.Rd
>     1 src/library/tools/man/update_pkg_po.Rd

> From these 20 I composed a table with statistical unit of \url call and
> with variables filename, url, type of reference, and type of publisher.
> The following commands were helpful.

> se
Re: [Rd] readLines() segfaults on large file & question on how to work around
Thank you for your suggestion. Unfortunately, while R doesn't segfault
calling readr::read_file() on the test file I described, I get the error
message:

  Error in read_file_(ds, locale) : negative length vectors are not allowed

Jen

On Sat, Sep 2, 2017 at 1:38 PM, Ista Zahn wrote:
> As a work-around I suggest readr::read_file.
>
> --Ista
>
> [...]
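Jen's read_file() error is consistent with a hard size limit rather than a bug in readr: read_file() pulls the whole file into a single R character string, and an individual R string is capped at 2^31 - 1 bytes, so a length computation on a larger file overflows to a negative number. The arithmetic can be checked directly (a sketch; the limit value is a property of R's internal string representation, and the file size is the 36 * 2^26 bytes produced by the doubling loop in the original report):

```shell
# Compare the generated file's size against the maximum length of a single
# R character string; the overflow explains "negative length vectors".
limit=2147483647      # 2^31 - 1, max bytes in one R string
size=2415919104       # 36-byte seed doubled 26 times: 36 * 2^26
if [ "$size" -gt "$limit" ]; then
  echo "file exceeds the single-string limit by $(( size - limit )) bytes"
fi
# -> file exceeds the single-string limit by 268435457 bytes
```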
Re: [Rd] readLines() segfaults on large file & question on how to work around
As a work-around I suggest readr::read_file.

--Ista

On Sep 2, 2017 2:58 PM, "Jennifer Lyon" wrote:
> Hi:
>
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.
>
> When I try to read in this file using readLines(), R segfaults.
>
> [...]
[Rd] readLines() segfaults on large file & question on how to work around
Hi:

I have a 2.1GB JSON file. Typically I use readLines() and
jsonlite::fromJSON() to extract data from a JSON file.

When I try to read in this file using readLines(), R segfaults.

I believe the two salient issues with this file are
  1) its size
  2) it is a single line (no line breaks)

I can reproduce this issue as follows.

# Generate a big file with no line breaks
# In R:
> writeLines(paste0(c(letters, 0:9), collapse=""), "alpha.txt", sep="")

# In a unix shell:
cp alpha.txt file.txt
for i in {1..26}; do cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt; done

This generates a 2.3GB file with no line breaks.

In R:
> moo <- readLines("file.txt")

 *** caught segfault ***
address 0x7cff, cause 'memory not mapped'

Traceback:
 1: readLines("file.txt")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3

I conclude: I am potentially running up against a limit in R, which should
give a reasonable error, but currently just segfaults.

My question: most of the content of the JSON is approximately a 100K x 6K
JSON equivalent of a data frame, and I know R can handle much bigger than
this size. I am expecting these JSON files to get even larger. My R code
lives in a bigger system, and the JSON comes in via stdin, so I have
absolutely no control over the data format. I can imagine trying to
incrementally parse the JSON so I don't bump up against the limit, but I
am eager for suggestions of simpler solutions.

Also, I apologize for the timing of this bug report, as I know folks are
working to get out the next release of R, but like so many things I have
no control over when bugs leap up.

Thanks.
Jen

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS:   R-3.4.1/lib/libRblas.so
LAPACK: R-3.4.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.1
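A further shell-level workaround in the same spirit as the sed suggestion elsewhere in the thread: since readLines() only needs the lines to be short enough, a stream filter such as fold can cap the line length before the data reaches R. A sketch with hypothetical file names and a toy width (a real run would use a width of a few million bytes); fold splits at arbitrary byte positions, so the resulting lines are only meaningful once re-joined:

```shell
# fold caps the line length of a stream; applied to the one-line test file
# it yields lines readLines() can handle, and
#   paste0(readLines("chunked.txt"), collapse="")
# reassembles the original string inside R before JSON parsing.
printf 'abcdefghij' > big_oneline.txt     # tiny stand-in for the 2.3GB file
fold -w 4 big_oneline.txt > chunked.txt   # 4-byte lines: abcd / efgh / ij
grep -c '^' chunked.txt                   # -> 3
```

Re-joining still runs into the 2^31 - 1 byte limit on a single R string, so this only helps if the chunks are consumed incrementally (e.g. fed piecewise to a streaming JSON parser) rather than pasted back into one string.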
Re: [Rd] Please avoid direct use of NAMED and SET_NAMED macros
On Sat, 2 Sep 2017, Radford Neal wrote:

>> To allow for future changes in the way the need for duplication is
>> detected in R internal C code, package C code should avoid direct use
>> of NAMED and SET_NAMED, or assumptions on the maximal value of NAMED.
>> Use the macros MAYBE_REFERENCED, MAYBE_SHARED, and MARK_NOT_MUTABLE
>> instead. These currently correspond to
>>
>>   MAYBE_REFERENCED(x): NAMED(x) > 0
>>   MAYBE_SHARED(x):     NAMED(x) > 1
>>   MARK_NOT_MUTABLE(x): SET_NAMED(x, NAMEDMAX)
>>
>> Best,
>>
>> luke
>
> Checking https://cran.r-project.org/doc/manuals/r-release/R-exts.html
> shows that currently there is no mention of these macros in the
> documentation for package writers. Of course, the explanation of NAMED
> there also does not adequately describe what it is supposed to mean,
> which may explain why it's often not used correctly.

As of yesterday they are mentioned in the R-devel version of this manual,
which will make it to the web in due course.

> Before embarking on a major change to the C API, I'd suggest that you
> produce clear and complete documentation on the new scheme.
>
> Radford Neal

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone: 319-335-3386
Department of Statistics and        Fax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email: luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu
Re: [Rd] Please avoid direct use of NAMED and SET_NAMED macros
> To allow for future changes in the way the need for duplication is
> detected in R internal C code, package C code should avoid direct use of
> NAMED and SET_NAMED, or assumptions on the maximal value of NAMED. Use
> the macros MAYBE_REFERENCED, MAYBE_SHARED, and MARK_NOT_MUTABLE instead.
> These currently correspond to
>
>   MAYBE_REFERENCED(x): NAMED(x) > 0
>   MAYBE_SHARED(x):     NAMED(x) > 1
>   MARK_NOT_MUTABLE(x): SET_NAMED(x, NAMEDMAX)
>
> Best,
>
> luke

Checking https://cran.r-project.org/doc/manuals/r-release/R-exts.html
shows that currently there is no mention of these macros in the
documentation for package writers. Of course, the explanation of NAMED
there also does not adequately describe what it is supposed to mean, which
may explain why it's often not used correctly.

Before embarking on a major change to the C API, I'd suggest that you
produce clear and complete documentation on the new scheme.

Radford Neal
Re: [Rd] Missing y label
On 1 September 2017 at 15:50, Therneau, Terry M., Ph.D. wrote:
| The system admins here ...

I suggest you get these local admins to help you. These CRAN repos for
Ubuntu are used by thousands of people every day, and they "just work",
for both the recent releases and the most recent LTS.

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
Re: [Rd] Wayback and related questions (was: RE: I have corrected a dead link ...)
> If the R project cannot use or reference any site that uses non-open
> code, including minified javascript - which appears to be the principal
> issue for GitHub - I suspect that you will be obliged to discontinue
> links to almost every journal, university, charity, government and
> research establishment site currently in existence as soon as GNU get
> round to assessing them. I personally have great difficulty seeing that
> as sensible.

The policy that you suggest would indeed be completely stupid.
Fortunately, a reasonable policy that vaguely matches current practices is
likely to affect hardly any documentation files.

I don't have a strong opinion as to whether publishing characteristics of
references should be a consideration during the composition of R
documentation files, and I trust the R developers to decide well.