[ https://issues.apache.org/jira/browse/ARROW-16776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson resolved ARROW-16776.
-------------------------------------
Resolution: Fixed

Issue resolved by pull request 13563
[https://github.com/apache/arrow/pull/13563]

> [R] dplyr::glimpse method for arrow table and datasets
> ------------------------------------------------------
>
> Key: ARROW-16776
> URL: https://issues.apache.org/jira/browse/ARROW-16776
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Thomas Mock
> Assignee: Neal Richardson
> Priority: Minor
> Labels: pull-request-available
> Fix For: 9.0.0
>
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> When working with Arrow datasets/tables, I often find myself wanting to
> interactively print or "see" the results of a query or the first few rows of
> the data without having to fully collect into memory.
>
> I can perform exploratory data analysis on large out-of-memory datasets via
> Arrow + dplyr, but in order to print the returned values I have to collect()
> into memory or send to_duckdb().
>
> * compute() - returns number of rows/columns, but no data
> * collect() - returns data fully into memory, can be combined with head()
> * to_duckdb() - keeps data out of memory, always returns top 10 rows and all
> columns, optionally increase/decrease number of printed rows
>
> While to_duckdb() gives me the ability to do true EDA, it seems
> counterintuitive to need to send the arrow table over to a duckdb database
> just to see the glimpse()/head() equivalent.
>
> My feature request is that there is a dplyr::glimpse() method that will
> lazily print the first few values of a table/dataset. The expected output would
> be something like the below.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>%
>   glimpse()
> Rows: ??
> Columns: 11
> $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, …
> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, …
> $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 36…
> $ hp <dbl> 110, 110, 93, 110, 175, 105, 2…
> $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, …
> $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.…
> $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17…
> $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, …
> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
> $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, …
> $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, …
> ```
> Currently, glimpse() returns a list output, most of which has nothing to do
> with the actual data/values.
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds %>%
>   glimpse()
> #> Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' <FileSystemDataset>
> #> Inherits from: <Dataset>
> #> Public:
> #> .:xp:.: externalptr
> #> .class_title: function ()
> #> clone: function (deep = FALSE)
> #> files: active binding
> #> filesystem: active binding
> #> format: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> metadata: active binding
> #> NewScan: function ()
> #> num_cols: active binding
> #> num_rows: active binding
> #> pointer: function ()
> #> print: function (...)
> #> schema: active binding
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: active binding
> car_ds %>%
>   filter(cyl == 6) %>%
>   glimpse()
> #> List of 7
> #> $ mpg :Classes 'FileSystemDataset', 'Dataset', 'ArrowObject', 'R6' <FileSystemDataset>
> #> Inherits from: <Dataset>
> #> Public:
> #> .:xp:.: externalptr
> #> .class_title: function ()
> #> clone: function (deep = FALSE)
> #> files: active binding
> #> filesystem: active binding
> #> format: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> metadata: active binding
> #> NewScan: function ()
> #> num_cols: active binding
> #> num_rows: active binding
> #> pointer: function ()
> #> print: function (...)
> #> schema: active binding
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: active binding
> #> $ cyl :List of 11
> #> ..$ mpg :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ cyl :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ disp:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ hp :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ drat:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ wt :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ qsec:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ vs :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ am :Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ gear:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> ..$ carb:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> $ disp:Classes 'Expression', 'ArrowObject', 'R6' <Expression>
> #> Inherits from: <ArrowObject>
> #> Public:
> #> .:xp:.: externalptr
> #> cast: function (to_type, safe = TRUE, ...)
> #> clone: function (deep = FALSE)
> #> Equals: function (other, ...)
> #> field_name: active binding
> #> initialize: function (xp)
> #> invalidate: function ()
> #> pointer: function ()
> #> print: function (...)
> #> schema: Schema, ArrowObject, R6
> #> set_pointer: function (xp)
> #> ToString: function ()
> #> type: function (schema = self$schema)
> #> type_id: function (schema = self$schema)
> #> $ hp : chr(0)
> #> $ drat: NULL
> #> $ wt : list()
> #> $ qsec: logi(0)
> #> - attr(*, "class")= chr "arrow_dplyr_query"
> ```
> <sup>Created on 2022-06-07 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)