Re: [Rd] readLines() segfaults on large file & question on how to work around
As of R-devel 72925 one gets a proper error message instead of the crash.

Tomas

On 09/04/2017 08:46 AM, rh...@eoos.dds.nl wrote:
> Although the problem can apparently be avoided in this case, readLines()
> causing a segfault still seems unwanted behaviour to me. [...]
Re: [Rd] readLines() segfaults on large file & question on how to work around
Although the problem can apparently be avoided in this case, readLines()
causing a segfault still seems unwanted behaviour to me. I can replicate
this with the example below (sessionInfo is further down):

# Generate an example file
l <- paste0(sample(c(letters, LETTERS), 1E6, replace = TRUE), collapse = "")
con <- file("test.txt", "wt")
for (i in seq_len(2500)) {
  writeLines(l, con, sep = "")
}
close(con)

# Causes segfault:
readLines("test.txt")

The error reported by readr is also reproduced (a more informative error
message and checking for integer overflows would be nice). I will report
this with readr.

library(readr)
read_file("test.txt")
# Error in read_file_(ds, locale) : negative length vectors are not
# allowed

--
Jan

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2 hms_0.3 tools_3.4.1 tibble_1.3.3 Rcpp_0.12.12 rlang_0.1.2

On 03-09-17 20:50, Jennifer Lyon wrote:
> Jeroen: Thank you for pointing me to ndjson, which I had not heard of and
> is exactly my case. [...]
Re: [Rd] readLines() segfaults on large file & question on how to work around
Jeroen:

Thank you for pointing me to ndjson, which I had not heard of and is
exactly my case. My experience:

  jsonlite::stream_in  - segfaults
  ndjson::stream_in    - my fault, I am running Ubuntu 14.04 and it is too
                         old so it won't compile the package
  corpus::read_ndjson  - works!!!

Of course it does a different simplification than jsonlite::fromJSON, so I
have to change some code, but it works beautifully, at least in simple
tests. The memory-map option may be of use in the future.

Another correspondent said that strings in R can only be 2^31-1 characters
long, which is why any "solution" that tries to load the whole file into R
first as a single string will fail.

Thanks for suggesting a path forward for me!

Jen

On Sun, Sep 3, 2017 at 2:15 AM, Jeroen Ooms wrote:
> If your data consists of one json object per line, this is called
> 'ndjson'. There are several packages specialized to read ndjson files. [...]
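For illustration, a minimal sketch of the corpus route described above; the
file name "data.ndjson" is made up, and the memory-map option is assumed to
be exposed as the mmap argument of read_ndjson (check the corpus
documentation for the exact name):

  library(corpus)

  ## Read one JSON object per line; with mmap = TRUE the file is
  ## memory-mapped instead of being loaded into RAM up front.
  dat <- read_ndjson("data.ndjson", mmap = TRUE)

  ## The result behaves like a data frame, so the usual inspection works.
  str(dat)
  head(dat)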
Re: [Rd] readLines() segfaults on large file & question on how to work around
On Sat, Sep 2, 2017 at 8:58 PM, Jennifer Lyon wrote:
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.

If your data consists of one json object per line, this is called 'ndjson'.
There are several packages specialized to read ndjson files:

 - corpus::read_ndjson
 - ndjson::stream_in
 - jsonlite::stream_in

In particular the 'corpus' package handles large files really well because
it has an option to memory-map the file instead of reading all of its data
into memory.

If the data is too large to read, you can preprocess it using
https://stedolan.github.io/jq/ to extract the fields that you need. You
really don't need hadoop/spark/etc for this.
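A hedged sketch of that jq preprocessing idea, driven from R through a
pipe() connection; the file name "data.ndjson" and the fields "id" and
"score" are hypothetical, so adjust the jq filter to the fields you
actually need:

  library(jsonlite)

  ## jq -c emits one compact JSON object per line, so the reduced stream
  ## is still ndjson and can be parsed incrementally by stream_in().
  con <- pipe("jq -c '{id: .id, score: .score}' data.ndjson")
  dat <- stream_in(con)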
Re: [Rd] readLines() segfaults on large file & question on how to work around
Jennifer,

Why don't you try SparkR?
https://spark.apache.org/docs/1.6.1/api/R/read.json.html

On 2 September 2017 at 23:15, Jennifer Lyon wrote:
> Thank you for your suggestion. Unfortunately, while R doesn't segfault
> calling readr::read_file() on the test file I described, I get the error
> message:
>
> Error in read_file_(ds, locale) : negative length vectors are not allowed
> [...]
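For concreteness, a rough sketch of what the SparkR route might look like.
This assumes a local Spark 2.x installation with the SparkR package on the
library path (the linked documentation is for the older 1.6.1 API), and a
hypothetical file name:

  library(SparkR)

  ## Start a local Spark session (Spark 2.x style API).
  sparkR.session(master = "local[*]")

  ## read.json() expects line-delimited JSON by default; it returns a
  ## SparkDataFrame that is held and processed outside R's memory.
  sdf <- read.json("data.json")
  printSchema(sdf)

  ## Pull a manageable subset back into an ordinary R data.frame.
  small <- collect(limit(sdf, 1000))

  sparkR.session.stop()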
Re: [Rd] readLines() segfaults on large file & question on how to work around
2017-09-02 20:58 GMT+02:00 Jennifer Lyon:
> Hi:
>
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.
>
> When I try and read in this file using readLines() R segfaults.
>
> I believe the two salient issues with this file are
> 1). Its size
> 2). It is a single line (no line breaks)

As a workaround you can pipe something like "sed s/,/,\\n/g" before your R
script to insert line breaks.

Iñaki
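One way that sed workaround could be wired up from inside R, using a pipe()
connection; the file name "file.json" is hypothetical and GNU sed is
assumed (it interprets \n in the replacement as a newline):

  ## Let sed break the stream at commas so no single line is longer than
  ## readLines() can hold. The double backslash is only R string escaping;
  ## sed itself sees \n. Note that this also splits at commas inside JSON
  ## strings, so the resulting lines are not individual JSON records; it
  ## only avoids the one giant line.
  lines <- readLines(pipe("sed 's/,/,\\n/g' file.json"))
  length(lines)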
Re: [Rd] readLines() segfaults on large file & question on how to work around
Thank you for your suggestion. Unfortunately, while R doesn't segfault
calling readr::read_file() on the test file I described, I get the error
message:

Error in read_file_(ds, locale) : negative length vectors are not allowed

Jen

On Sat, Sep 2, 2017 at 1:38 PM, Ista Zahn wrote:
> As a work-around I suggest readr::read_file. [...]
Re: [Rd] readLines() segfaults on large file & question on how to work around
As a work-around I suggest readr::read_file.

--Ista

On Sep 2, 2017 2:58 PM, "Jennifer Lyon" wrote:
> I have a 2.1GB JSON file. Typically I use readLines() and
> jsonlite::fromJSON() to extract data from a JSON file.
>
> When I try and read in this file using readLines() R segfaults. [...]
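A short sketch of how that suggestion would be used (file name is
hypothetical): read_file() returns the entire file as one string, which is
then handed to jsonlite. Note that this still requires the whole file to
fit in a single R string (at most 2^31 - 1 characters), which is the limit
the reply above runs into.

  library(readr)
  library(jsonlite)

  ## Read the whole file into one string, then parse it in memory.
  txt <- read_file("file.json")
  dat <- fromJSON(txt)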
[Rd] readLines() segfaults on large file & question on how to work around
Hi:

I have a 2.1GB JSON file. Typically I use readLines() and
jsonlite::fromJSON() to extract data from a JSON file.

When I try and read in this file using readLines() R segfaults.

I believe the two salient issues with this file are
1). Its size
2). It is a single line (no line breaks)

I can reproduce this issue as follows.

# Generate a big file with no line breaks
# In R
> writeLines(paste0(c(letters, 0:9), collapse=""), "alpha.txt", sep="")

# In a unix shell
cp alpha.txt file.txt
for i in {1..26}; do cat file.txt file.txt > file2.txt && mv -f file2.txt file.txt; done

This generates a 2.3GB file with no line breaks.

In R:
> moo <- readLines("file.txt")

 *** caught segfault ***
address 0x7cff, cause 'memory not mapped'

Traceback:
 1: readLines("file.txt")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3

I conclude:
I am potentially running up against a limit in R, which should give a
reasonable error, but currently just segfaults.

My question:
Most of the content of the JSON is an approximately 100K x 6K JSON
equivalent of a dataframe, and I know R can handle much bigger than this
size. I am expecting these JSON files to get even larger. My R code lives
in a bigger system, and the JSON comes in via stdin, so I have absolutely
no control over the data format. I can imagine trying to incrementally
parse the JSON so I don't bump up against the limit, but I am eager for
suggestions of simpler solutions.

Also, I apologize for the timing of this bug report, as I know folks are
working to get out the next release of R, but like so many things I have
no control over when bugs leap up.

Thanks.

Jen

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: R-3.4.1/lib/libRblas.so
LAPACK: R-3.4.1/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.1
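Not part of the original report, but a crude sketch of a guard one could
put in front of readLines() while the segfault exists; the 2^31 - 1 figure
is the string-length limit mentioned elsewhere in this thread, and the
file name is the one generated above:

  ## A file with no newlines must fit into a single R string, and R
  ## strings are limited to 2^31 - 1 characters, so a size check flags
  ## the problem case. This is only a heuristic: a large file *with*
  ## newlines is perfectly fine for readLines().
  path <- "file.txt"
  if (file.info(path)$size >= 2^31 - 1) {
    warning("file is larger than a single R string can hold; ",
            "if it has no line breaks, readLines() cannot represent it")
  }
  moo <- readLines(path)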