[jira] [Updated] (ARROW-17543) [R] %in% on an empty vector c() fails
[ https://issues.apache.org/jira/browse/ARROW-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Egill Axfjord Fridgeirsson updated ARROW-17543:
-----------------------------------------------
    Component/s: R

> [R] %in% on an empty vector c() fails
> -------------------------------------
>
> Key: ARROW-17543
> URL: https://issues.apache.org/jira/browse/ARROW-17543
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: Egill Axfjord Fridgeirsson
> Assignee: Egill Axfjord Fridgeirsson
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When using %in% on empty vectors I'm getting an error
> "Error: Cannot infer type from vector"
> I'd expect this to work the same as base R where you can use %in% on empty
> vectors.
> The arrow::is_in compute function does accept nulls as the value_set. If I
> manually create an empty array of type NULL it does work as expected.
> Reprex:
> {code:java}
> library(dplyr)
> library(arrow)
> options(arrow.debug=T)
> #base R
> a <- c(1,2,3)
> b <- c() # NULL
> a %in% b
> #> [1] FALSE FALSE FALSE
> # arrow arrays
> arrowArray <- arrow::Array$create(c(1,2,3))
> arrow::is_in(arrowArray, c())
> #> Error: Cannot infer type from vector
> # define type of c() manually
> arrow::is_in(arrowArray, arrow::Array$create(c(), type=arrow::null()))
> #> Array
> #>
> #> [
> #>   false,
> #>   false,
> #>   false
> #> ]
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (ARROW-17543) [R] %in% on an empty vector c() fails
[ https://issues.apache.org/jira/browse/ARROW-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597051#comment-17597051 ]

Egill Axfjord Fridgeirsson commented on ARROW-17543:
----------------------------------------------------

That does look straightforward. I'd be happy to make a pull request for this.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Assigned] (ARROW-17543) [R] %in% on an empty vector c() fails
[ https://issues.apache.org/jira/browse/ARROW-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Egill Axfjord Fridgeirsson reassigned ARROW-17543:
--------------------------------------------------

    Assignee: Egill Axfjord Fridgeirsson

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (ARROW-17543) [R] %in% on an empty vector c() fails
Egill Axfjord Fridgeirsson created ARROW-17543:
--------------------------------------------------

Summary: [R] %in% on an empty vector c() fails
Key: ARROW-17543
URL: https://issues.apache.org/jira/browse/ARROW-17543
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 9.0.0
Reporter: Egill Axfjord Fridgeirsson

When using %in% on an empty vector I get the error

"Error: Cannot infer type from vector"

I'd expect this to work the same as in base R, where you can use %in% on empty vectors.

The arrow::is_in compute function does accept nulls as the value_set. If I manually create an empty array of type NULL, it works as expected.

Reprex:
{code:java}
library(dplyr)
library(arrow)
options(arrow.debug=T)
#base R
a <- c(1,2,3)
b <- c() # NULL
a %in% b
#> [1] FALSE FALSE FALSE
# arrow arrays
arrowArray <- arrow::Array$create(c(1,2,3))
arrow::is_in(arrowArray, c())
#> Error: Cannot infer type from vector
# define type of c() manually
arrow::is_in(arrowArray, arrow::Array$create(c(), type=arrow::null()))
#> Array
#>
#> [
#>   false,
#>   false,
#>   false
#> ]
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
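The manual workaround from the reprex can be wrapped in a small helper. This is only a sketch: `is_in_safe` is a hypothetical name, not part of arrow, and it relies on the behaviour reported above for 9.0.0, namely that `arrow::is_in` accepts a value set of type null.

```r
library(arrow)

# Hypothetical helper (not part of arrow): when the value set is empty,
# substitute an explicitly typed null array, as done by hand in the reprex.
is_in_safe <- function(x, table) {
  if (length(table) == 0) {
    # Assumption taken from the report: is_in accepts a null-typed value set
    table <- arrow::Array$create(c(), type = arrow::null())
  }
  arrow::is_in(x, table)
}

arrowArray <- arrow::Array$create(c(1, 2, 3))
result <- is_in_safe(arrowArray, c())
```

Non-empty value sets are passed through unchanged, so the helper only alters the case that raised the error.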
[jira] [Commented] (ARROW-17373) [R] copying dataset and immediately writing the copy to a different location fails
[ https://issues.apache.org/jira/browse/ARROW-17373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584108#comment-17584108 ]

Egill Axfjord Fridgeirsson commented on ARROW-17373:
----------------------------------------------------

My bad, I didn't realize I was trying to overwrite the same dataset. IMO, stopping me with a useful error message would have helped here.

> [R] copying dataset and immediately writing the copy to a different location fails
> -----------------------------------------------------------------------------------
>
> Key: ARROW-17373
> URL: https://issues.apache.org/jira/browse/ARROW-17373
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Environment: Ubuntu 22.04
> Reporter: Egill Axfjord Fridgeirsson
> Priority: Major
>
> When I copy large feather files, open a dataset from that file and
> immediately write that dataset to a new location I get the following error:
>
> ```Error: Invalid: Expected to read 144 metadata bytes but got 0```
>
> I have made a reproducible example below:
>
> ``` r
> df <- data.frame(replicate(1,sample(0:1,100e6,rep=TRUE)))
> savePath <- file.path(tempdir(), 'arrowTest')
> if (!dir.exists(savePath)) {
>   dir.create(savePath)
> }
> arrow::write_feather(df, file.path(savePath, 'part-0.feather'))
> copyPath <- file.path(tempdir(),'arrowTest')
> if (!dir.exists(copyPath)) {
>   dir.create(copyPath)
> }
> writePath <- file.path(tempdir(), 'arrowTest')
> if (!dir.exists(writePath)) {
>   dir.create(writePath)
> }
> arrow::copy_files(savePath, copyPath)
> dataset <- arrow::open_dataset(copyPath, format='feather')
> arrow::write_dataset(dataset = dataset, path = writePath, format = 'feather')
> ```

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
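The kind of early check asked for in this comment could look roughly like the following on the user side. It is a sketch in base R only: `assert_distinct_paths` is a hypothetical helper, not an arrow function, and nothing here says how arrow itself handles this case.

```r
# Hypothetical guard (not part of arrow): stop early when the source and
# destination of a dataset write resolve to the same directory.
assert_distinct_paths <- function(source, destination) {
  # mustWork = FALSE so that a destination that does not exist yet is allowed
  src <- normalizePath(source, mustWork = FALSE)
  dst <- normalizePath(destination, mustWork = FALSE)
  if (identical(src, dst)) {
    stop("source and destination are the same directory: ", src, call. = FALSE)
  }
  invisible(TRUE)
}

savePath <- file.path(tempdir(), "arrowTest")
writePath <- file.path(tempdir(), "arrowTestCopy")  # illustrative second directory
assert_distinct_paths(savePath, writePath)
```

Called with the three identical paths from the reprex, the guard stops with an error before any file is opened or written.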
[jira] [Updated] (ARROW-17373) [R] copying dataset and immediately writing the copy to a different location fails
[ https://issues.apache.org/jira/browse/ARROW-17373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Egill Axfjord Fridgeirsson updated ARROW-17373:
-----------------------------------------------
    Environment: Ubuntu 22.04  (was: Ubuntu 22.10)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (ARROW-17373) [R] copying dataset and immediately writing the copy to a different location fails
[ https://issues.apache.org/jira/browse/ARROW-17373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578358#comment-17578358 ]

Egill Axfjord Fridgeirsson commented on ARROW-17373:
----------------------------------------------------

After some further testing it seems the copying is unnecessary. Opening a large dataset and writing to a different location seems to produce the error in most cases.

Here is a slightly simpler reprex:
{code:java}
df <- data.frame(replicate(1,sample(0:1,100e6,rep=TRUE)))
savePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(savePath)) {
  dir.create(savePath)
}
arrow::write_feather(df, file.path(savePath, 'part-0.feather'))
writePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(writePath)) {
  dir.create(writePath)
}
dataset <- arrow::open_dataset(savePath, format='feather')
arrow::write_dataset(dataset = dataset, path = writePath, format = 'feather')
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Created] (ARROW-17373) copying dataset and immediately writing the copy to a different location fails
Egill Axfjord Fridgeirsson created ARROW-17373:
--------------------------------------------------

Summary: copying dataset and immediately writing the copy to a different location fails
Key: ARROW-17373
URL: https://issues.apache.org/jira/browse/ARROW-17373
Project: Apache Arrow
Issue Type: Bug
Components: R
Affects Versions: 9.0.0
Environment: Ubuntu 22.10
Reporter: Egill Axfjord Fridgeirsson

When I copy large feather files, open a dataset from that file, and immediately write that dataset to a new location, I get the following error:

```Error: Invalid: Expected to read 144 metadata bytes but got 0```

I have made a reproducible example below:

``` r
df <- data.frame(replicate(1,sample(0:1,100e6,rep=TRUE)))
savePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(savePath)) {
  dir.create(savePath)
}
arrow::write_feather(df, file.path(savePath, 'part-0.feather'))
copyPath <- file.path(tempdir(),'arrowTest')
if (!dir.exists(copyPath)) {
  dir.create(copyPath)
}
writePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(writePath)) {
  dir.create(writePath)
}
arrow::copy_files(savePath, copyPath)
dataset <- arrow::open_dataset(copyPath, format='feather')
arrow::write_dataset(dataset = dataset, path = writePath, format = 'feather')
```

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory
[ https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541428#comment-17541428 ]

Egill Axfjord Fridgeirsson commented on ARROW-16157:
----------------------------------------------------

Thanks [~thisisnic]! That's good to know.

> [R] Inconsistent behavior for arrow datasets vs working in memory
> -----------------------------------------------------------------
>
> Key: ARROW-16157
> URL: https://issues.apache.org/jira/browse/ARROW-16157
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 7.0.0
> Environment: Ubuntu 21.10
> R 4.1.3.
> Arrow 7.0.0
> Reporter: Egill Axfjord Fridgeirsson
> Assignee: Nicola Crane
> Priority: Major
>
> When I generate a sparse matrix using indices from an arrow dataset I get
> inconsistent behavior, sometimes there are duplicated indexes resulting in a
> matrix with values more than one at some places. When loading the dataset
> first in memory everything works as expected and all the values are one
> Repro
> {code:java}
> library(Matrix)
> library(dplyr)
> library(arrow)
> sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
> dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
> arrow::write_dataset(dF, path='./data/feather', format='feather')
> arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
> # run the below a few times, and at some time the output is more than just
> # 1 for unique(newSparse@x), indicating there are
> # duplicate indices for the sparse matrix (then it adds the values there)
> newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
>                                   j = arrowDataset %>% pull(j),
>                                   x = 1)
> unique(newSparse@x) # here is the bug, @x is the slot for values
> arrowInMemory <- arrowDataset %>% collect()
> # after loading in memory the output is never more than 1 no matter how
> # often I run it
> newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
>                                   j = arrowInMemory %>% pull(j),
>                                   x = 1)
> unique(newSparse@x){code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
[jira] [Commented] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory
[ https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521577#comment-17521577 ]

Egill Axfjord Fridgeirsson commented on ARROW-16157:
----------------------------------------------------

Hi [~thisisnic], I updated to the dev version and unfortunately I still get the issue.

Here is my arrow::arrow_info() output, if that helps:
{code:java}
> arrow::arrow_info()
Arrow package version: 7.0.0.20220412

Capabilities:

dataset    TRUE
engine    FALSE
parquet    TRUE
json       TRUE
s3        FALSE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip      FALSE
brotli    FALSE
zstd      FALSE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2       FALSE
jemalloc  FALSE
mimalloc  FALSE

To reinstall with more optional capabilities enabled, see
   https://arrow.apache.org/docs/r/articles/install.html

Memory:

Allocator   system
Current   76.29 Mb
Max        76.3 Mb

Runtime:

SIMD Level          avx2
Detected SIMD Level avx2

Build:

C++ Library Version  8.0.0-SNAPSHOT
C++ Compiler                    GNU
C++ Compiler Version         11.2.0
{code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Updated] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory
[ https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Egill Axfjord Fridgeirsson updated ARROW-16157:
-----------------------------------------------
    Description:
When I generate a sparse matrix using indices from an arrow dataset I get inconsistent behavior, sometimes there are duplicated indexes resulting in a matrix with values more than one at some places. When loading the dataset first in memory everything works as expected and all the values are one

Repro
{code:java}
library(Matrix)
library(dplyr)
library(arrow)
sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
# run the below a few times, and at some time the output is more than just
# 1 for unique(newSparse@x), indicating there are
# duplicate indices for the sparse matrix (then it adds the values there)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug, @x is the slot for values
arrowInMemory <- arrowDataset %>% collect()
# after loading in memory the output is never more than 1 no matter how
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x){code}

  was:
When I generate a sparse matrix using indices from an arrow dataset I get inconsistent behavior, sometimes there are duplicated indexes resulting in a matrix with values more than one at some places. When loading the dataset first in memory everything works as expected and all the values are one

Repro
{code:java}
library(Matrix)
library(dplyr)
library(arrow)
sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
# run the below a few times, and at some time the output is more than just
# 1 for unique(newSparse@x), indicating there are duplicate indices for
# the sparse matrix (then it adds the values there)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug, @x is the slot for values
arrowInMemory <- arrowDataset %>% collect()
# after loading in memory the output is never more than 1 no matter how
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x){code}
[jira] [Created] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory
Egill Axfjord Fridgeirsson created ARROW-16157:
--------------------------------------------------

Summary: [R] Inconsistent behavior for arrow datasets vs working in memory
Key: ARROW-16157
URL: https://issues.apache.org/jira/browse/ARROW-16157
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 7.0.0
Environment: Ubuntu 21.10
R 4.1.3.
Arrow 7.0.0
Reporter: Egill Axfjord Fridgeirsson

When I generate a sparse matrix using indices from an arrow dataset I get inconsistent behavior: sometimes there are duplicated indices, resulting in a matrix with values greater than one in some places. When I load the dataset into memory first, everything works as expected and all the values are one.

Repro
{code:java}
library(Matrix)
library(dplyr)
library(arrow)
sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
# run the below a few times, and at some time the output is more than just
# 1 for unique(newSparse@x), indicating there are duplicate indices for
# the sparse matrix (then it adds the values there)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug, @x is the slot for values
arrowInMemory <- arrowDataset %>% collect()
# after loading in memory the output is never more than 1 no matter how
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x){code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
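One possible explanation, which is an assumption and not something established in this thread, is that two separate pull() calls on a dataset need not return rows in the same order, so i and j can end up misaligned. The base R sketch below does not use arrow; it only illustrates how misaligned index vectors turn unique (i, j) pairs into duplicates.

```r
# Unique (i, j) pairs, as in the triplet representation of a sparse matrix
i <- c(1, 1, 2, 2)
j <- c(1, 2, 1, 2)
aligned_dups <- anyDuplicated(data.frame(i = i, j = j))

# Assumption for illustration: j arrives in a different row order than i
j_reordered <- j[c(1, 3, 2, 4)]
misaligned_dups <- anyDuplicated(data.frame(i = i, j = j_reordered))
```

With the rows paired correctly there are no repeated pairs; after reordering j alone, some pairs repeat, which is the situation in which sparseMatrix adds the values together. Taking both columns from a single collect() result, as in the second half of the reprex, keeps the rows paired.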