[ 
https://issues.apache.org/jira/browse/ARROW-10125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10125:
-----------------------------------
    Labels: pull-request-available  (was: )

> [R] Int64 downcast check doesn't consider all chunks
> ----------------------------------------------------
>
>                 Key: ARROW-10125
>                 URL: https://issues.apache.org/jira/browse/ARROW-10125
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.0.1
>            Reporter: Kyle Kavanagh
>            Assignee: Neal Richardson
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I've got a proprietary dataset where one of the columns is an integer64 but 
> all of the values would fit within 32 bits.  As I understand it, arrow/feather 
> will downcast that column when the data is read back into R (not ideal IMO, 
> but not an issue generally).  However, I'm having some trouble with a 
> specific dataset. 
> When I read in the data, the column is set to the class "integer64"; however, 
> the column type (typeof) is 'integer' and not 'double', which is the 
> underlying type used by bit64.  This mismatch causes R data.table to error 
> out 
> ([https://github.com/Rdatatable/data.table/blob/master/src/rbindlist.c#L325]).
> I do not have any issue with integer64 columns which have values > 2^32, and 
> suspiciously I am also unable to recreate the issue by manually creating a 
> data.table with an int64 column with small values (e.g. 
> data.table(col=as.integer64(c(1,2,3)))).
> I did look through the arrow::r cpp source and couldn't find an obvious case 
> where the underlying storage array would be an integer but also have the 
> 'integer64' class attr assigned...  A fix would either be to remove the 
> integer64 class attr, or ensure that the underlying data store is a REALSXP 
> instead of INTEGERSXP.
> My company's network policies won't let me upload the sample dataset, so I'm 
> hoping to see if this triggers any immediate thoughts.  If not, I can try to 
> figure out how to upload the dataset or otherwise provide details from it as 
> requested.
>  
> {code:java}
> > arrow::write_feather(df[,list(testCol)][1], '~/test.feather')
> > test = arrow::read_feather('~/test.feather')
> > class(test$testCol)
> [1] "integer64" "np.ulong"
> > typeof(test$testCol)
> [1] "integer"
> > str(test)
> Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       1 obs. of  1 variable:
>  $ testCol:Error in as.character.integer64(object) :
>   REAL() can only be applied to a 'numeric', not a 'integer'
> # In the larger original dataset, it handles most columns properly; only
> # 'testCol' breaks things.  Note the difference:
> > typeof(df$goodCol)
> [1] "double"
> > class(df$goodCol)
> [1] "integer64" "np.ulong"
> > typeof(df$testCol)
> [1] "integer"
> > class(df$testCol)
> [1] "integer64" "np.ulong"
> > str(df)
> Classes ‘data.table’ and 'data.frame':  214781 obs. of  17 variables: 
> $ goodCol        :integer64 1599777000000604025 ... 
> $ testCol        :Error in as.character.integer64(object) :
> > sessionInfo()
> R version 3.6.1 (2019-07-05)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Red Hat Enterprise Linux Server 7.7 (Maipo)
>
> Matrix products: default
> BLAS:   /usr/lib64/libblas.so.3.4.2
> LAPACK: /usr/lib64/liblapack.so.3.4.2
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] data.table_1.13.0 bit64_4.0.5       bit_4.0.4
>
> loaded via a namespace (and not attached):
>  [1] Rcpp_1.0.5           lattice_0.20-41      arrow_1.0.1
>  [4] assertthat_0.2.1     rappdirs_0.3.1       grid_3.6.1
>  [7] R6_2.4.1             jsonlite_1.7.1       magrittr_1.5
> [10] rlang_0.4.7          Matrix_1.2-18        vctrs_0.3.4
> [13] reticulate_1.14-9001 tools_3.6.1          glue_1.4.2
> [16] purrr_0.3.4          compiler_3.6.1       tidyselect_1.1.0
> {code}
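
Until a fixed release is available, the first remedy suggested in the description (removing the integer64 class attr from a column whose storage is INTEGERSXP) could be applied on the R side after reading the file. The following is only a rough sketch under stated assumptions: `is_broken_int64` and `repair_int64_columns` are hypothetical helper names, not functions from arrow, bit64, or data.table, and the malformed column is simulated in base R rather than produced by `read_feather`.

```r
# Hypothetical helper (not part of arrow or bit64): TRUE when a column claims
# to be integer64 but is not stored as a double vector (REALSXP).
is_broken_int64 <- function(x) {
  inherits(x, "integer64") && typeof(x) != "double"
}

# Hypothetical repair: drop the misleading class attributes so the column
# becomes a plain integer vector. The stored values are left unchanged.
repair_int64_columns <- function(df) {
  for (nm in names(df)) {
    if (is_broken_int64(df[[nm]])) {
      df[[nm]] <- as.integer(unclass(df[[nm]]))
    }
  }
  df
}

# Simulate the malformed column from the report without loading bit64:
# integer storage that carries the "integer64" class attribute.
bad <- structure(c(1L, 2L, 3L), class = c("integer64", "np.ulong"))
df <- data.frame(id = 1:3)
df$testCol <- bad

fixed <- repair_int64_columns(df)
typeof(fixed$testCol)
class(fixed$testCol)
```

Stripping the class is the simpler of the two remedies because it needs no extra package. Converting the column to a genuine double-backed integer64 would be the alternative if downstream code expects that class.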



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
