[ https://issues.apache.org/jira/browse/ARROW-13293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicola Crane reassigned ARROW-13293:
------------------------------------

    Assignee:     (was: Nicola Crane)

> [R] open_dataset followed by collect hangs (while compute works)
> ----------------------------------------------------------------
>
>                 Key: ARROW-13293
>                 URL: https://issues.apache.org/jira/browse/ARROW-13293
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 4.0.1
>         Environment: Windows 10 (see also session info included in reprex)
>            Reporter: Hans Van Calster
>            Priority: Minor
>
> Tried to make a reproducible example using the iris dataset, but it works as
> expected for that dataset. So the issue might be specific to the dataset I am
> using (which contains over 100 columns). The example below illustrates the
> issue.
> The parquet data used in the example can be downloaded from
> [this link|https://drive.google.com/file/d/1MHaq3KqlheqrNm8dk71we74n_ip9hMqJ/view?usp=sharing]
>
> The issue I see is the following:
>
> * calling open_dataset() %>% filter() %>% collect() hangs on my machine
> (while I would expect that a tibble 1,646 x 116 would be returned very fast)
> * The two alternative calls (one using read_parquet on the specific parquet
> file within the Dataset on which I filter, and the other using compute()
> instead of collect()) seem to work as expected
>
> ``` r
> library(dplyr)
> #>
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #>
> #>     filter, lag
> #> The following objects are masked from 'package:base':
> #>
> #>     intersect, setdiff, setequal, union
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> read_parquet("data/lucas_harmonised/1_table/parquet_hive/year=2018/part-4.parquet") %>%
>   filter(nuts1 == "BE2")
> #> # A tibble: 1,646 x 116
> #>        id point_id nuts0 nuts1 nuts2 nuts3 th_lat th_long office_pi ex_ante
> #>     <int>    <int> <chr> <chr> <chr> <chr>  <dbl>   <dbl> <chr>     <chr>
> #>  1 199451 39803106 BE    BE2   BE22  BE221   51.0    5.14 1         0
> #>  2 220669 39623116 BE    BE2   BE21  BE213   51.0    4.88 1         0
> #>  3 215557 39483154 BE    BE2   BE21  BE211   51.4    4.64 1         0
> #>  4 223579 40303122 BE    BE2   BE22  BE222   51.1    5.84 1         0
> #>  5 331079 39783134 BE    BE2   BE21  BE213   51.2    5.09 0         0
> #>  6 225417 39403150 BE    BE2   BE21  BE211   51.3    4.53 1         0
> #>  7   3340 38863118 BE    BE2   BE23  BE234   51.0    3.79 1         0
> #>  8 137361 38143132 BE    BE2   BE25  BE258   51.1    2.75 1         0
> #>  9 221861 38343148 BE    BE2   BE25  BE255   51.2    3.02 1         0
> #> 10    787 39523148 BE    BE2   BE21  BE211   51.3    4.70 1         0
> #> # ... with 1,636 more rows, and 106 more variables: survey_date <chr>,
> #> #   car_latitude <dbl>, car_ew <chr>, car_longitude <dbl>, gps_proj <chr>,
> #> #   gps_prec <int>, gps_altitude <int>, gps_lat <dbl>, gps_ew <chr>,
> #> #   gps_long <dbl>, obs_dist <dbl>, obs_direct <chr>, obs_type <chr>,
> #> #   obs_radius <chr>, letter_group <chr>, lc1 <chr>, lc1_label <chr>,
> #> #   lc1_spec <chr>, lc1_spec_label <chr>, lc1_perc <chr>, lc2 <chr>,
> #> #   lc2_label <chr>, lc2_spec <chr>, lc2_spec_label <chr>, lc2_perc <chr>,
> #> #   lu1 <chr>, lu1_label <chr>, lu1_type <chr>, lu1_type_label <chr>,
> #> #   lu1_perc <chr>, lu2 <chr>, lu2_label <chr>, lu2_type <chr>,
> #> #   lu2_type_label <chr>, lu2_perc <chr>, parcel_area_ha <chr>,
> #> #   tree_height_maturity <chr>, tree_height_survey <chr>, feature_width <chr>,
> #> #   lm_stone_walls <chr>, crop_residues <chr>, lm_grass_margins <chr>,
> #> #   grazing <chr>, special_status <chr>, lc_lu_special_remark <chr>,
> #> #   cprn_cando <chr>, cprn_lc <chr>, cprn_lc_label <chr>, cprn_lc1n <int>,
> #> #   cprnc_lc1e <int>, cprnc_lc1s <int>, cprnc_lc1w <int>,
> #> #   cprn_lc1n_brdth <int>, cprn_lc1e_brdth <int>, cprn_lc1s_brdth <int>,
> #> #   cprn_lc1w_brdth <int>, cprn_lc1n_next <chr>, cprn_lc1s_next <chr>,
> #> #   cprn_lc1e_next <chr>, cprn_lc1w_next <chr>, cprn_urban <chr>,
> #> #   cprn_impervious_perc <int>, inspire_plcc1 <int>, inspire_plcc2 <int>,
> #> #   inspire_plcc3 <int>, inspire_plcc4 <int>, inspire_plcc5 <int>,
> #> #   inspire_plcc6 <int>, inspire_plcc7 <int>, inspire_plcc8 <int>,
> #> #   eunis_complex <chr>, grassland_sample <chr>, grass_cando <chr>, wm <chr>,
> #> #   wm_source <chr>, wm_type <chr>, wm_delivery <chr>, erosion_cando <chr>,
> #> #   soil_stones_perc <chr>, bio_sample <chr>, soil_bio_taken <chr>,
> #> #   bulk0_10_sample <chr>, soil_blk_0_10_taken <chr>, bulk10_20_sample <chr>,
> #> #   soil_blk_10_20_taken <chr>, bulk20_30_sample <chr>,
> #> #   soil_blk_20_30_taken <chr>, standard_sample <chr>, soil_std_taken <chr>,
> #> #   organic_sample <chr>, soil_org_depth_cando <chr>, soil_taken <chr>,
> #> #   soil_crop <chr>, photo_point <chr>, photo_north <chr>, photo_south <chr>,
> #> #   photo_east <chr>, photo_west <chr>, transect <chr>, revisit <int>, ...
> open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
>   filter(nuts1 == "BE2", year == 2018) %>%
>   compute()
> #> Table
> #> 1646 rows x 117 columns
> #> $id <int64>
> #> $point_id <int64>
> #> $nuts0 <string>
> #> $nuts1 <string>
> #> $nuts2 <string>
> #> $nuts3 <string>
> #> $th_lat <double>
> #> $th_long <double>
> #> $office_pi <string>
> #> $ex_ante <string>
> #> $survey_date <string>
> #> $car_latitude <double>
> #> $car_ew <string>
> #> $car_longitude <double>
> #> $gps_proj <string>
> #> $gps_prec <int64>
> #> $gps_altitude <int64>
> #> $gps_lat <double>
> #> $gps_ew <string>
> #> $gps_long <double>
> #> $obs_dist <double>
> #> $obs_direct <string>
> #> $obs_type <string>
> #> $obs_radius <string>
> #> $letter_group <string>
> #> $lc1 <string>
> #> $lc1_label <string>
> #> $lc1_spec <string>
> #> $lc1_spec_label <string>
> #> $lc1_perc <string>
> #> $lc2 <string>
> #> $lc2_label <string>
> #> $lc2_spec <string>
> #> $lc2_spec_label <string>
> #> $lc2_perc <string>
> #> $lu1 <string>
> #> $lu1_label <string>
> #> $lu1_type <string>
> #> $lu1_type_label <string>
> #> $lu1_perc <string>
> #> $lu2 <string>
> #> $lu2_label <string>
> #> $lu2_type <string>
> #> $lu2_type_label <string>
> #> $lu2_perc <string>
> #> $parcel_area_ha <string>
> #> $tree_height_maturity <string>
> #> $tree_height_survey <string>
> #> $feature_width <string>
> #> $lm_stone_walls <string>
> #> $crop_residues <string>
> #> $lm_grass_margins <string>
> #> $grazing <string>
> #> $special_status <string>
> #> $lc_lu_special_remark <string>
> #> $cprn_cando <string>
> #> $cprn_lc <string>
> #> $cprn_lc_label <string>
> #> $cprn_lc1n <int64>
> #> $cprnc_lc1e <int64>
> #> $cprnc_lc1s <int64>
> #> $cprnc_lc1w <int64>
> #> $cprn_lc1n_brdth <int64>
> #> $cprn_lc1e_brdth <int64>
> #> $cprn_lc1s_brdth <int64>
> #> $cprn_lc1w_brdth <int64>
> #> $cprn_lc1n_next <string>
> #> $cprn_lc1s_next <string>
> #> $cprn_lc1e_next <string>
> #> $cprn_lc1w_next <string>
> #> $cprn_urban <string>
> #> $cprn_impervious_perc <int64>
> #> $inspire_plcc1 <int64>
> #> $inspire_plcc2 <int64>
> #> $inspire_plcc3 <int64>
> #> $inspire_plcc4 <int64>
> #> $inspire_plcc5 <int64>
> #> $inspire_plcc6 <int64>
> #> $inspire_plcc7 <int64>
> #> $inspire_plcc8 <int64>
> #> $eunis_complex <string>
> #> $grassland_sample <string>
> #> $grass_cando <string>
> #> $wm <string>
> #> $wm_source <string>
> #> $wm_type <string>
> #> $wm_delivery <string>
> #> $erosion_cando <string>
> #> $soil_stones_perc <string>
> #> $bio_sample <string>
> #> $soil_bio_taken <string>
> #> $bulk0_10_sample <string>
> #> $soil_blk_0_10_taken <string>
> #> $bulk10_20_sample <string>
> #> $soil_blk_10_20_taken <string>
> #> $bulk20_30_sample <string>
> #> $soil_blk_20_30_taken <string>
> #> $standard_sample <string>
> #> $soil_std_taken <string>
> #> $organic_sample <string>
> #> $soil_org_depth_cando <string>
> #> $soil_taken <string>
> #> $soil_crop <string>
> #> $photo_point <string>
> #> $photo_north <string>
> #> $photo_south <string>
> #> $photo_east <string>
> #> $photo_west <string>
> #> $transect <string>
> #> $revisit <int64>
> #> $th_gps_dist <double>
> #> $file_path_gisco_north <string>
> #> $file_path_gisco_south <string>
> #> $file_path_gisco_east <string>
> #> $file_path_gisco_west <string>
> #> $file_path_gisco_point <string>
> #> $year <int32>
> #open_dataset("data/lucas_harmonised/1_table/parquet_hive/") %>%
> #  filter(nuts1 == "BE2", year == 2018) %>%
> #  collect()
> # not run: this will hang
> ```
> <sup>Created on 2021-07-09 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
> <details style="margin-bottom:10px;">
> <summary>
> Session info
> </summary>
> ``` r
> sessioninfo::session_info()
> #> - Session info ---------------------------------------------------------------
> #>  setting  value
> #>  version  R version 4.1.0 (2021-05-18)
> #>  os       Windows 10 x64
> #>  system   x86_64, mingw32
> #>  ui       RTerm
> #>  language (EN)
> #>  collate  Dutch_Belgium.1252
> #>  ctype    Dutch_Belgium.1252
> #>  tz       Europe/Paris
> #>  date     2021-07-09
> #>
> #> - Packages -------------------------------------------------------------------
> #>  package     * version date       lib source
> #>  arrow       * 4.0.1   2021-05-28 [1] CRAN (R 4.1.0)
> #>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
> #>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.1.0)
> #>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.1.0)
> #>  cli           2.5.0   2021-04-26 [1] CRAN (R 4.0.5)
> #>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
> #>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
> #>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
> #>  dplyr       * 1.0.7   2021-06-18 [1] CRAN (R 4.0.5)
> #>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
> #>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
> #>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.0.5)
> #>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
> #>  generics      0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
> #>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
> #>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
> #>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
> #>  knitr         1.33    2021-04-24 [1] CRAN (R 4.1.0)
> #>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
> #>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
> #>  pillar        1.6.1   2021-05-16 [1] CRAN (R 4.1.0)
> #>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
> #>  ps            1.6.0   2021-02-28 [1] CRAN (R 4.1.0)
> #>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
> #>  R6            2.5.0   2020-10-28 [1] CRAN (R 4.1.0)
> #>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.1.0)
> #>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
> #>  rmarkdown     2.9     2021-06-15 [1] CRAN (R 4.0.5)
> #>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
> #>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
> #>  stringi       1.6.2   2021-05-17 [1] CRAN (R 4.0.5)
> #>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
> #>  tibble        3.1.2   2021-05-16 [1] CRAN (R 4.1.0)
> #>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
> #>  utf8          1.2.1   2021-03-12 [1] CRAN (R 4.1.0)
> #>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
> #>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
> #>  xfun          0.24    2021-06-15 [1] CRAN (R 4.0.5)
> #>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
> #>
> #> [1] C:/R/library
> #> [2] C:/R/R-4.1.0/library
> ```
> </details>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)