[ https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541428#comment-17541428 ]
Egill Axfjord Fridgeirsson commented on ARROW-16157:
----------------------------------------------------

Thanks [~thisisnic]! That's good to know.

> [R] Inconsistent behavior for arrow datasets vs working in memory
> -----------------------------------------------------------------
>
>                 Key: ARROW-16157
>                 URL: https://issues.apache.org/jira/browse/ARROW-16157
>             Project: Apache Arrow
>          Issue Type: Bug
>    Affects Versions: 7.0.0
>         Environment: Ubuntu 21.10
>                      R 4.1.3
>                      Arrow 7.0.0
>            Reporter: Egill Axfjord Fridgeirsson
>            Assignee: Nicola Crane
>            Priority: Major
>
> When I generate a sparse matrix using indices from an arrow dataset, I get
> inconsistent behavior: sometimes there are duplicated indices, resulting in a
> matrix with values greater than one in some places. When I load the dataset
> into memory first, everything works as expected and all the values are one.
>
> Repro
> {code:java}
> library(Matrix)
> library(dplyr)
> library(arrow)
>
> sparseMatrix <- Matrix::rsparsematrix(1e5, 1e3, 0.05, repr="T")
> dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
> arrow::write_dataset(dF, path='./data/feather', format='feather')
> arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
>
> # Run the block below a few times. At some point unique(newSparse@x)
> # returns values other than 1, indicating that there are duplicate
> # indices for the sparse matrix (the values at those indices are summed).
> newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i),
>                                   j = arrowDataset %>% pull(j),
>                                   x = 1)
> unique(newSparse@x) # here is the bug; @x is the slot for values
>
> arrowInMemory <- arrowDataset %>% collect()
> # After loading into memory, the output is never more than 1, no matter
> # how often I run it.
> newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i),
>                                   j = arrowInMemory %>% pull(j),
>                                   x = 1)
> unique(newSparse@x)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)