[jira] [Updated] (ARROW-17543) [R] %in% on an empty vector c() fails

2022-08-29 Thread Egill Axfjord Fridgeirsson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egill Axfjord Fridgeirsson updated ARROW-17543:
---
Component/s: R

> [R] %in% on an empty vector c() fails
> -
>
> Key: ARROW-17543
> URL: https://issues.apache.org/jira/browse/ARROW-17543
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Reporter: Egill Axfjord Fridgeirsson
> Assignee: Egill Axfjord Fridgeirsson
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When using %in% on empty vectors I'm getting an error
> "Error: Cannot infer type from vector"
> I'd expect this to work the same as base R where you can use %in% on empty 
> vectors.
> The arrow::is_in compute function does accept nulls as the value_set. If I 
> manually create an empty array of type NULL it does work as expected.
> Reprex:
> {code:java}
> library(dplyr)
> library(arrow)
> options(arrow.debug=T)
> #base R
> a <- c(1,2,3)
> b <- c() # NULL
> a %in% b
> #> [1] FALSE FALSE FALSE
> # arrow arrays
> arrowArray <- arrow::Array$create(c(1,2,3))
> arrow::is_in(arrowArray, c())
> #> Error: Cannot infer type from vector
> # define type of c() manually
> arrow::is_in(arrowArray, arrow::Array$create(c(), type=arrow::null()))
> #> Array
> #> 
> #> [
> #>   false,
> #>   false,
> #>   false
> #> ]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17543) [R] %in% on an empty vector c() fails

2022-08-29 Thread Egill Axfjord Fridgeirsson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17597051#comment-17597051
 ] 

Egill Axfjord Fridgeirsson commented on ARROW-17543:


That does look straightforward. I'd be happy to make a pull request for this.






[jira] [Assigned] (ARROW-17543) [R] %in% on an empty vector c() fails

2022-08-29 Thread Egill Axfjord Fridgeirsson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egill Axfjord Fridgeirsson reassigned ARROW-17543:
--

Assignee: Egill Axfjord Fridgeirsson






[jira] [Created] (ARROW-17543) [R] %in% on an empty vector c() fails

2022-08-28 Thread Egill Axfjord Fridgeirsson (Jira)
Egill Axfjord Fridgeirsson created ARROW-17543:
--

 Summary: [R] %in% on an empty vector c() fails
 Key: ARROW-17543
 URL: https://issues.apache.org/jira/browse/ARROW-17543
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 9.0.0
Reporter: Egill Axfjord Fridgeirsson


When using %in% on empty vectors I'm getting an error

"Error: Cannot infer type from vector"

I'd expect this to work the same as base R where you can use %in% on empty 
vectors.

The arrow::is_in compute function does accept nulls as the value_set. If I 
manually create an empty array of type NULL it does work as expected.

Reprex:
{code:java}
library(dplyr)
library(arrow)

options(arrow.debug=T)

#base R
a <- c(1,2,3)
b <- c() # NULL
a %in% b
#> [1] FALSE FALSE FALSE

# arrow arrays
arrowArray <- arrow::Array$create(c(1,2,3))
arrow::is_in(arrowArray, c())
#> Error: Cannot infer type from vector

# define type of c() manually
arrow::is_in(arrowArray, arrow::Array$create(c(), type=arrow::null()))
#> Array
#> 
#> [
#>   false,
#>   false,
#>   false
#> ]
{code}
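
Until `c()` is handled directly, the workaround from the reprex can be wrapped in a small helper. This is only a sketch: `is_in_safe` is a hypothetical name, not part of arrow, and it assumes that an empty array of type null is accepted as the value set, as the last call in the reprex shows for 9.0.0.

```r
# Hypothetical helper (not an arrow function): c() evaluates to NULL, for which
# arrow cannot infer a type, so substitute an empty array of type null.
is_in_safe <- function(x, value_set) {
  if (is.null(value_set)) {
    value_set <- arrow::Array$create(c(), type = arrow::null())
  }
  arrow::is_in(x, value_set)
}

arrowArray <- arrow::Array$create(c(1, 2, 3))
res <- is_in_safe(arrowArray, c())
```

Any non-NULL value set is passed through to `arrow::is_in` unchanged.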





[jira] [Commented] (ARROW-17373) [R] copying dataset and immediately writing the copy to a different location fails

2022-08-24 Thread Egill Axfjord Fridgeirsson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584108#comment-17584108
 ] 

Egill Axfjord Fridgeirsson commented on ARROW-17373:


My bad, I didn't realize I was trying to overwrite the same dataset. IMO, 
stopping me with a useful error message would have helped here.
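
The kind of early check meant here could look like the sketch below. It uses base R only; `check_distinct_paths` is a hypothetical helper, not an arrow function, and comparing normalized paths is an assumption about what such a check would do.

```r
# Hypothetical guard: stop before writing when the source and destination
# directories resolve to the same location.
check_distinct_paths <- function(source, destination) {
  src <- normalizePath(source, mustWork = FALSE)
  dst <- normalizePath(destination, mustWork = FALSE)
  if (identical(src, dst)) {
    stop("Source and destination resolve to the same directory: ", src,
         call. = FALSE)
  }
  invisible(TRUE)
}

savePath  <- file.path(tempdir(), "arrowTest")
writePath <- file.path(tempdir(), "arrowTest")  # same directory, as in the reprex

# Capture the error message instead of aborting the script.
result <- tryCatch(
  check_distinct_paths(savePath, writePath),
  error = function(e) conditionMessage(e)
)
```

Calling the guard before `arrow::write_dataset()` would have flagged the reprex, where `savePath`, `copyPath` and `writePath` are all the same directory.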

> [R] copying dataset and immediately writing the copy to a different location 
> fails
> -
>
> Key: ARROW-17373
> URL: https://issues.apache.org/jira/browse/ARROW-17373
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 9.0.0
> Environment: Ubuntu 22.04
> Reporter: Egill Axfjord Fridgeirsson
> Priority: Major
>
> When I copy large feather files, open a dataset from that file and 
> immediately write that dataset to a new location I get the following error:
>  
> ```Error: Invalid: Expected to read 144 metadata bytes but got 0```
>  
> I have made a reproducible example below:
>  
> ``` r
> df <- data.frame(replicate(1,sample(0:1,100e6,rep=TRUE)))
> savePath <- file.path(tempdir(), 'arrowTest')
> if (!dir.exists(savePath)) {
>   dir.create(savePath)
> }
> arrow::write_feather(df, file.path(savePath, 'part-0.feather'))
> copyPath <- file.path(tempdir(),'arrowTest')
> if (!dir.exists(copyPath)) {
>   dir.create(copyPath)
> }
> writePath <- file.path(tempdir(), 'arrowTest')
> if (!dir.exists(writePath)) {
>   dir.create(writePath)
> }
> arrow::copy_files(savePath, copyPath)
> dataset <- arrow::open_dataset(copyPath, format='feather')
> arrow::write_dataset(dataset = dataset, path = writePath, format = 'feather')
> ```





[jira] [Updated] (ARROW-17373) [R] copying dataset and immediately writing the copy to a different location fails

2022-08-17 Thread Egill Axfjord Fridgeirsson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egill Axfjord Fridgeirsson updated ARROW-17373:
---
Environment: Ubuntu 22.04  (was: Ubuntu 22.10)






[jira] [Commented] (ARROW-17373) [R] copying dataset and immediately writing the copy to a different location fails

2022-08-11 Thread Egill Axfjord Fridgeirsson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17578358#comment-17578358
 ] 

Egill Axfjord Fridgeirsson commented on ARROW-17373:


After some further testing it seems the copying is unnecessary.

Opening a large dataset and writing to a different location seems to produce 
the error in most cases.

 

Here is a slightly simpler reprex:
{code:java}
df <- data.frame(replicate(1,sample(0:1,100e6,rep=TRUE)))
savePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(savePath)) {
  dir.create(savePath)
}
arrow::write_feather(df, file.path(savePath, 'part-0.feather'))
writePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(writePath)) {
  dir.create(writePath)
}
dataset <- arrow::open_dataset(savePath, format='feather')
arrow::write_dataset(dataset = dataset, path = writePath, format = 'feather')
{code}






[jira] [Created] (ARROW-17373) copying dataset and immediately writing the copy to a different location fails

2022-08-10 Thread Egill Axfjord Fridgeirsson (Jira)
Egill Axfjord Fridgeirsson created ARROW-17373:
--

 Summary: copying dataset and immediately writing the copy to a 
different location fails
 Key: ARROW-17373
 URL: https://issues.apache.org/jira/browse/ARROW-17373
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 9.0.0
 Environment: Ubuntu 22.10
Reporter: Egill Axfjord Fridgeirsson


When I copy large feather files, open a dataset from that file and immediately 
write that dataset to a new location I get the following error:

 

```Error: Invalid: Expected to read 144 metadata bytes but got 0```

 

I have made a reproducible example below:

 

``` r
df <- data.frame(replicate(1,sample(0:1,100e6,rep=TRUE)))
savePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(savePath)) {
  dir.create(savePath)
}

arrow::write_feather(df, file.path(savePath, 'part-0.feather'))

copyPath <- file.path(tempdir(),'arrowTest')
if (!dir.exists(copyPath)) {
  dir.create(copyPath)
}

writePath <- file.path(tempdir(), 'arrowTest')
if (!dir.exists(writePath)) {
  dir.create(writePath)
}
arrow::copy_files(savePath, copyPath)

dataset <- arrow::open_dataset(copyPath, format='feather')
arrow::write_dataset(dataset = dataset, path = writePath, format = 'feather')
```





[jira] [Commented] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory

2022-05-24 Thread Egill Axfjord Fridgeirsson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17541428#comment-17541428
 ] 

Egill Axfjord Fridgeirsson commented on ARROW-16157:


Thanks [~thisisnic] ! That's good to know.

> [R] Inconsistent behavior for arrow datasets vs working in memory
> -
>
> Key: ARROW-16157
> URL: https://issues.apache.org/jira/browse/ARROW-16157
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 7.0.0
> Environment: Ubuntu 21.10, R 4.1.3, Arrow 7.0.0
> Reporter: Egill Axfjord Fridgeirsson
> Assignee: Nicola Crane
> Priority: Major
>
> When I generate a sparse matrix using indices from an arrow dataset I get 
> inconsistent behavior: sometimes there are duplicated indices, resulting in a 
> matrix with values greater than one in some places. When I load the dataset 
> into memory first, everything works as expected and all the values are one.
> Repro
> {code:java}
> library(Matrix)
> library(dplyr)
> library(arrow)
> sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
> dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
> arrow::write_dataset(dF, path='./data/feather', format='feather')
> arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
> # run the below a few times, and at some time the output is more than just
> # 1 for unique(newSparse@x), indicating there are
> # duplicate indices for the sparse matrix (then it adds the values there)
> newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
>                                   j = arrowDataset %>% pull(j),
>                                   x = 1)
> unique(newSparse@x) # here is the bug, @x is the slot for values
> arrowInMemory <- arrowDataset %>% collect()
> # after loading in memory the output is never more than 1 no matter how 
> # often I run it
> newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
>                                   j = arrowInMemory %>% pull(j),
>                                   x = 1)
> unique(newSparse@x){code}





[jira] [Commented] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory

2022-04-13 Thread Egill Axfjord Fridgeirsson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521577#comment-17521577
 ] 

Egill Axfjord Fridgeirsson commented on ARROW-16157:


Hi [~thisisnic] , 

I updated to the dev version and unfortunately I still get the issue. 

Here is my arrow::arrow_info() output, if that helps:

{code:java}
> arrow::arrow_info()
Arrow package version: 7.0.0.20220412

Capabilities:

dataset    TRUE
engine    FALSE
parquet    TRUE
json       TRUE
s3        FALSE
utf8proc   TRUE
re2        TRUE
snappy     TRUE
gzip      FALSE
brotli    FALSE
zstd      FALSE
lz4        TRUE
lz4_frame  TRUE
lzo       FALSE
bz2       FALSE
jemalloc  FALSE
mimalloc  FALSE

To reinstall with more optional capabilities enabled, see
   https://arrow.apache.org/docs/r/articles/install.html

Memory:

Allocator   system
Current   76.29 Mb
Max        76.3 Mb

Runtime:

SIMD Level          avx2
Detected SIMD Level avx2

Build:

C++ Library Version  8.0.0-SNAPSHOT
C++ Compiler                    GNU
C++ Compiler Version         11.2.0
{code}







[jira] [Updated] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory

2022-04-08 Thread Egill Axfjord Fridgeirsson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egill Axfjord Fridgeirsson updated ARROW-16157:
---
Description: 
When I generate a sparse matrix using indices from an arrow dataset I get 
inconsistent behavior: sometimes there are duplicated indices, resulting in a 
matrix with values greater than one in some places. When I load the dataset 
into memory first, everything works as expected and all the values are one.

Repro
{code:java}
library(Matrix)
library(dplyr)
library(arrow)

sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")

dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)

arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')

# run the below a few times, and at some time the output is more than just
# 1 for unique(newSparse@x), indicating there are
# duplicate indices for the sparse matrix (then it adds the values there)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug, @x is the slot for values


arrowInMemory <- arrowDataset %>% collect()

# after loading in memory the output is never more than 1 no matter how 
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x){code}

  was:
When I generate a sparse matrix using indices from an arrow dataset I get 
inconsistent behavior, sometimes there are duplicated indexes resulting in a 
matrix with values more than one at some places. When loading the dataset first 
in memory everything works as expected and all the values are one

Repro
{code:java}
library(Matrix)
library(dplyr)
library(arrow)

sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")

dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)

arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')

# run the below a few times, and at some time the output is more than just
# 1 for unique(newSparse@x), indicating there are duplicate indices for
# the sparse matrix (then it adds the values there)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug, @x is the slot for values


arrowInMemory <- arrowDataset %>% collect()

# after loading in memory the output is never more than 1 no matter how 
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x){code}






[jira] [Created] (ARROW-16157) [R] Inconsistent behavior for arrow datasets vs working in memory

2022-04-08 Thread Egill Axfjord Fridgeirsson (Jira)
Egill Axfjord Fridgeirsson created ARROW-16157:
--

 Summary: [R] Inconsistent behavior for arrow datasets vs working 
in memory
 Key: ARROW-16157
 URL: https://issues.apache.org/jira/browse/ARROW-16157
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 7.0.0
 Environment: Ubuntu 21.10
R 4.1.3.
Arrow 7.0.0
Reporter: Egill Axfjord Fridgeirsson


When I generate a sparse matrix using indices from an arrow dataset I get 
inconsistent behavior: sometimes there are duplicated indices, resulting in a 
matrix with values greater than one in some places. When I load the dataset 
into memory first, everything works as expected and all the values are one.

Repro
{code:java}
library(Matrix)
library(dplyr)
library(arrow)

sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")

dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)

arrow::write_dataset(dF, path='./data/feather', format='feather')
arrowDataset <- arrow::open_dataset('./data/feather', format='feather')

# run the below a few times, and at some time the output is more than just
# 1 for unique(newSparse@x), indicating there are duplicate indices for
# the sparse matrix (then it adds the values there)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x) # here is the bug, @x is the slot for values


arrowInMemory <- arrowDataset %>% collect()

# after loading in memory the output is never more than 1 no matter how 
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x){code}
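
Since Matrix::sparseMatrix adds up the values of repeated (i, j) pairs, the index vectors can also be checked for duplicates directly, before the matrix is built. A minimal sketch in base R with toy data; the vectors `i` and `j` below stand in for the columns pulled from the dataset and are assumptions for illustration only.

```r
# Toy index vectors standing in for the columns pulled from the dataset.
i <- c(1, 2, 3, 2)
j <- c(1, 1, 2, 1)  # the pair (2, 1) occurs twice

pairs <- data.frame(i = i, j = j)
dup <- duplicated(pairs)  # TRUE for every repeat of an earlier (i, j) pair
n_dup <- sum(dup)         # number of repeated pairs
pairs[dup, ]              # the repeated pairs themselves
```

A non-zero `n_dup` on the pulled columns would confirm that the values above one come from repeated index pairs rather than from the matrix construction.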


