[jira] [Created] (ARROW-15202) Create pyarrow array using an object's `__array__` method.

2021-12-24 Thread A. Coady (Jira)
A. Coady created ARROW-15202:


 Summary: Create pyarrow array using an object's `__array__` method.
 Key: ARROW-15202
 URL: https://issues.apache.org/jira/browse/ARROW-15202
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Affects Versions: 6.0.1
Reporter: A. Coady


`pa.array` supports optimized creation from an object with an 
`__arrow_array__` method, or from a literal NumPy ndarray. But there is a 
performance gap when the input object implements only NumPy's `__array__` 
method, since `pa.array` does not use it.

 

So the user has to know to call `np.asarray` first. And even if the original 
object could be extended to support `__arrow_array__`, it doesn't seem like a 
great workaround if all that method would do is call 
`pa.array(np.asarray(self))`.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15201) Problem counting number of records of a parquet dataset created using Spark

2021-12-24 Thread Nelson Areal (Jira)
Nelson Areal created ARROW-15201:


 Summary: Problem counting number of records of a parquet dataset 
created using Spark
 Key: ARROW-15201
 URL: https://issues.apache.org/jira/browse/ARROW-15201
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 6.0.1
Reporter: Nelson Areal


When I open a dataset of parquet files created by Spark, I cannot get a count 
of the number of records: the process hangs with 100% CPU usage.

If I use DuckDB (to_duckdb) to perform the count, the operation completes as 
expected.

The example below reproduces the problem:
{code:r}
library(tidyverse) # v 1.3.1
library(arrow) # v 6.0.1
library(duckdb) # v 0.3.1-1
library(sparklyr) # v 1.7.3

# Using Spark: 3.0.0, but the same occurs when using Spark 2.4
sc <- spark_connect(master = "local")

# Create a simple data frame and save it to parquet using Spark
test_df <- tibble(a = 1:10e6)
test_spark_tbl <- copy_to(sc, test_df)
spark_write_parquet(test_spark_tbl, path="test")

test_arrow_ds <- open_dataset(sources = "test")

# This works as expected
system.time(
  test_arrow_ds %>%
    to_duckdb() %>%
    count()
)
#  user  system elapsed 
#  0.039   0.040   0.065 


# The following will hang the process with 100% CPU usage 
test_arrow_ds %>% 
  count() %>% 
  collect()
{code}
 
The session information:
{noformat}
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.1

Matrix products: default
LAPACK: 
/Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
 [1] sparklyr_1.7.3  duckdb_0.3.1-1  DBI_1.1.2       arrow_6.0.1
 [5] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4
 [9] readr_2.1.1     tidyr_1.1.4     tibble_3.1.6    ggplot2_3.3.5
[13] tidyverse_1.3.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7        lubridate_1.8.0   forge_0.2.0       rprojroot_2.0.2
 [5] assertthat_0.2.1  digest_0.6.29     utf8_1.2.2        R6_2.5.1
 [9] cellranger_1.1.0  backports_1.4.1   reprex_2.0.1      evaluate_0.14
[13] httr_1.4.2        pillar_1.6.4      rlang_0.4.12      readxl_1.3.1
[17] rstudioapi_0.13   blob_1.2.2        rmarkdown_2.11    htmlwidgets_1.5.4
[21] r2d3_0.2.5        bit_4.0.4         munsell_0.5.0     broom_0.7.10
[25] compiler_4.1.2    modelr_0.1.8      xfun_0.29         pkgconfig_2.0.3
[29] base64enc_0.1-3   htmltools_0.5.2   tidyselect_1.1.1  fansi_0.5.0
[33] crayon_1.4.2      tzdb_0.2.0        dbplyr_2.1.1      withr_2.4.3
[37] grid_4.1.2        jsonlite_1.7.2    gtable_0.3.0      lifecycle_1.0.1
[41] magrittr_2.0.1    scales_1.1.1      cli_3.1.0         stringi_1.7.6
[45] fs_1.5.2          xml2_1.3.3        ellipsis_0.3.2    generics_0.1.1
[49] vctrs_0.3.8       tools_4.1.2       bit64_4.0.5       glue_1.6.0
[53] hms_1.1.1         fastmap_1.1.0     yaml_2.2.1        colorspace_2.0-2
[57] rvest_1.0.2       knitr_1.37        haven_2.4.3
{noformat}
I can also reproduce this on a Linux machine.


