[jira] [Created] (ARROW-18429) [R] Bump dev version following 10.0.1 patch release
Nicola Crane created ARROW-18429: Summary: [R] Bump dev version following 10.0.1 patch release Key: ARROW-18429 URL: https://issues.apache.org/jira/browse/ARROW-18429 Project: Apache Arrow Issue Type: Bug Components: Continuous Integration, R Reporter: Nicola Crane Assignee: Nicola Crane Fix For: 11.0.0 CI job fails with:
{code:java}
Insufficient package version (submitted: 10.0.0.9000, existing: 10.0.1)
Version contains large components (10.0.0.9000)
{code}
https://github.com/apache/arrow/actions/runs/3639669477/jobs/6145488845#step:10:567 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-18416) [R] Update NEWS for 10.0.1
Nicola Crane created ARROW-18416: Summary: [R] Update NEWS for 10.0.1 Key: ARROW-18416 URL: https://issues.apache.org/jira/browse/ARROW-18416 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane
[jira] [Created] (ARROW-18415) [R] Update R package README to reference GH Issues
Nicola Crane created ARROW-18415: Summary: [R] Update R package README to reference GH Issues Key: ARROW-18415 URL: https://issues.apache.org/jira/browse/ARROW-18415 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane The R package README should be updated to refer to GH Issues for users who don't have a JIRA account
[jira] [Created] (ARROW-18403) [C++] Error consuming Substrait plan which uses count function: "only unary aggregate functions are currently supported"
Nicola Crane created ARROW-18403: Summary: [C++] Error consuming Substrait plan which uses count function: "only unary aggregate functions are currently supported" Key: ARROW-18403 URL: https://issues.apache.org/jira/browse/ARROW-18403 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Nicola Crane ARROW-17523 added support for the Substrait extension function "count", but when I write code which produces a Substrait plan which calls it, and then try to run it in Acero, I get an error. The plan:
{code:r}
message of type 'substrait.Plan' with 3 fields set
extension_uris { extension_uri_anchor: 1 uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml" }
extension_uris { extension_uri_anchor: 2 uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml" }
extension_uris { extension_uri_anchor: 3 uri: "https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml" }
extensions { extension_function { extension_uri_reference: 3 function_anchor: 2 name: "count" } }
relations { rel { aggregate { input { project {
  common { emit { output_mapping: 9 output_mapping: 10 output_mapping: 11 output_mapping: 12 output_mapping: 13 output_mapping: 14 output_mapping: 15 output_mapping: 16 output_mapping: 17 } }
  input { read {
    base_schema {
      names: "int" names: "dbl" names: "dbl2" names: "lgl" names: "false" names: "chr" names: "verses" names: "padded_strings" names: "some_negative"
      struct_ { types { i32 { nullability: NULLABILITY_NULLABLE } } types { fp64 { nullability: NULLABILITY_NULLABLE } } types { fp64 { nullability: NULLABILITY_NULLABLE } } types { bool_ { nullability: NULLABILITY_NULLABLE } } types { bool_ { nullability: NULLABILITY_NULLABLE } } types { string { nullability: NULLABILITY_NULLABLE } } types { string { nullability: NULLABILITY_NULLABLE } } types { string { nullability: NULLABILITY_NULLABLE } } types { fp64 { nullability: NULLABILITY_NULLABLE } } }
    }
    local_files { items { uri_file: "file:///tmp/RtmpsBsoZJ/file1915f604cff4a" parquet { } } }
  } }
  expressions { selection { direct_reference { struct_field { } } root_reference { } } }
  expressions { selection { direct_reference { struct_field { field: 1 } } root_reference { } } }
  expressions { selection { direct_reference { struct_field { field: 2 } } root_reference { } } }
  expressions { selection { direct_reference { struct_field { field: 3 } } root_reference { } } }
  expressions { selection { direct_reference { struct_field { field: 4 } } root_reference { } } }
  expressions { selection { direct_reference {
[jira] [Created] (ARROW-18393) [Docs][R] Include warning when viewing old docs (redirecting to stable docs)
Nicola Crane created ARROW-18393: Summary: [Docs][R] Include warning when viewing old docs (redirecting to stable docs) Key: ARROW-18393 URL: https://issues.apache.org/jira/browse/ARROW-18393 Project: Apache Arrow Issue Type: Improvement Components: Documentation Reporter: Joris Van den Bossche Assignee: Alenka Frim Now that we have versioned docs, we also have old versions of the developer docs (e.g. https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those might be outdated (e.g. regarding communication channels, build instructions, etc.), and when contributing to or developing with the latest Arrow, one should _always_ check the latest dev version of the contributing docs. We could add a warning box pointing this out and linking to the dev docs, similarly to how some projects warn about viewing old docs in general and point to the stable docs (e.g. https://mne.tools/1.1/index.html or https://scikit-learn.org/1.0/user_guide.html). In this case we could have a custom box on pages under /developers pointing to the dev docs instead of the stable docs.
[jira] [Created] (ARROW-18391) [R] Fix the version selector dropdown
Nicola Crane created ARROW-18391: Summary: [R] Fix the version selector dropdown Key: ARROW-18391 URL: https://issues.apache.org/jira/browse/ARROW-18391 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane ARROW-17887 updates the docs to use Bootstrap 5 which will break the docs version dropdown selector, as it relies on replacing a page element, but the page elements are different in this version of Bootstrap.
[jira] [Created] (ARROW-18358) [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow
Nicola Crane created ARROW-18358: Summary: [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow Key: ARROW-18358 URL: https://issues.apache.org/jira/browse/ARROW-18358 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane To make the transition between the different CSV reading functions as smooth as possible, we could introduce a version of open_dataset specifically for reading CSVs, with a signature more closely matching that of read_csv_arrow. This would just pass the arguments through to open_dataset (via the ellipsis), but would make it simpler to have a docs page showing these options explicitly, and thus be clearer for users.
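A minimal sketch of what such a pass-through wrapper might look like. The function name and the subset of surfaced arguments are assumptions for illustration, not the final API; the point is only that the wrapper's job is to give read_csv_arrow-style arguments an explicit signature (and hence a docs page) while forwarding everything to open_dataset():

```r
# Hypothetical wrapper (name and argument list are assumptions):
# surface a few read_csv_arrow-style arguments explicitly, then
# forward them all to open_dataset() with format = "csv".
open_csv_dataset <- function(sources,
                             schema = NULL,
                             delim = ",",
                             quote = "\"",
                             skip = 0L,
                             ...) {
  arrow::open_dataset(
    sources,
    schema = schema,
    format = "csv",
    delim = delim,
    quote = quote,
    skip = skip,
    ...
  )
}
```

Because the body is a single forwarding call, the wrapper stays trivially in sync with open_dataset(); only the documented signature differs.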
[jira] [Created] (ARROW-18357) [R] support parse_options, read_options, convert_options in open_dataset to mirror read_csv_arrow
Nicola Crane created ARROW-18357: Summary: [R] support parse_options, read_options, convert_options in open_dataset to mirror read_csv_arrow Key: ARROW-18357 URL: https://issues.apache.org/jira/browse/ARROW-18357 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane The {{read_csv_arrow()}} function allows users to pass in options via its parse_options, convert_options, and read_options parameters. We could allow users to pass these into {{open_dataset()}} to enable users to more easily switch between {{read_csv_arrow()}} and {{open_dataset()}}.
[jira] [Created] (ARROW-18356) [R] Handle as_data_frame argument if passed into open_dataset for CSVs
Nicola Crane created ARROW-18356: Summary: [R] Handle as_data_frame argument if passed into open_dataset for CSVs Key: ARROW-18356 URL: https://issues.apache.org/jira/browse/ARROW-18356 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane Currently, if the argument {{as_data_frame}} is passed into {{open_dataset()}} with a CSV format dataset, the error message returned is:
{code:r}
Error: The following option is supported in "read_delim_arrow" functions but not yet supported here: "as_data_frame"
{code}
Instead, we could silently ignore it if {{as_data_frame}} is set to {{FALSE}}, and give a more helpful error if it is set to {{TRUE}} (i.e. direct the user to call {{as.data.frame()}} or {{collect()}}). Reasoning: it'd be great to get to a point where users can just swap their {{read_csv_arrow()}} syntax for {{open_dataset()}} and get helpful results.
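The proposed behaviour could be sketched roughly as below. The helper name and the exact wording are assumptions, not the implementation; the sketch only shows the ignore-FALSE / error-helpfully-on-TRUE logic described above:

```r
# Hypothetical sketch (helper name is an assumption): given the list of
# extra arguments passed to open_dataset(), drop as_data_frame = FALSE
# silently and error with guidance when as_data_frame = TRUE.
check_as_data_frame <- function(args) {
  if (!is.null(args$as_data_frame)) {
    if (isTRUE(args$as_data_frame)) {
      stop(
        "`as_data_frame = TRUE` is not supported by `open_dataset()`; ",
        "call `dplyr::collect()` or `as.data.frame()` on the result instead.",
        call. = FALSE
      )
    }
    # FALSE matches what open_dataset() does anyway, so just drop it
    args$as_data_frame <- NULL
  }
  args
}
```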
[jira] [Created] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null
Nicola Crane created ARROW-18355: Summary: [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null Key: ARROW-18355 URL: https://issues.apache.org/jira/browse/ARROW-18355 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane
[jira] [Created] (ARROW-18354) [R] Better document the CSV read/parse/convert options we can use with open_dataset()
Nicola Crane created ARROW-18354: Summary: [R] Better document the CSV read/parse/convert options we can use with open_dataset() Key: ARROW-18354 URL: https://issues.apache.org/jira/browse/ARROW-18354 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane When a user opens a CSV dataset using open_dataset, they can take advantage of a lot of different options which can be specified via {{CsvReadOptions$create()}} etc. However, as these are passed in via the ellipsis ({{...}}) argument, it's not particularly clear to users which arguments are supported. They are not documented in the {{open_dataset()}} docs, and things are further confused (see the code for {{CsvFileFormat$create()}}) by the fact that we support a mix of Arrow and readr parameters. We should better document the arguments we do support.
[jira] [Created] (ARROW-18352) [R] Datasets API interface improvements
Nicola Crane created ARROW-18352: Summary: [R] Datasets API interface improvements Key: ARROW-18352 URL: https://issues.apache.org/jira/browse/ARROW-18352 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Umbrella ticket for improvements for our interface to the datasets API, and making the experience more consistent between {{open_dataset()}} and the {{read_*()}} functions.
[jira] [Created] (ARROW-18266) [R] Make it more obvious how to read in a Parquet file with a different schema to the inferred one
Nicola Crane created ARROW-18266: Summary: [R] Make it more obvious how to read in a Parquet file with a different schema to the inferred one Key: ARROW-18266 URL: https://issues.apache.org/jira/browse/ARROW-18266 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane It's not all that clear from our docs that if we want to read in a Parquet file and change the schema, we need to call the {{cast()}} method on the Table, e.g.
{code:r}
# Write out data
data <- tibble::tibble(x = c(letters[1:5], NA), y = 1:6)
data_with_schema <- arrow_table(data, schema = schema(x = string(), y = int64()))
write_parquet(data_with_schema, "data_with_schema.parquet")

# Read in data while specifying a schema
data_in <- read_parquet("data_with_schema.parquet", as_data_frame = FALSE)
data_in$cast(target_schema = schema(x = string(), y = int32()))
{code}
We should document this more clearly. Perhaps we could even update the code here to do some of this automatically if we pass a schema into the {{...}} argument of {{read_parquet()}} _and_ the returned data doesn't match the desired schema?
[jira] [Created] (ARROW-18263) [R] Error when trying to write POSIXlt data to CSV
Nicola Crane created ARROW-18263: Summary: [R] Error when trying to write POSIXlt data to CSV Key: ARROW-18263 URL: https://issues.apache.org/jira/browse/ARROW-18263 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane I get an error trying to write a tibble of POSIXlt data to a file. The error is a bit misleading, as it refers to the column being of length 0.
{code:r}
posixlt_data <- tibble::tibble(x = as.POSIXlt(Sys.time()))
write_csv_arrow(posixlt_data, "posixlt_data.csv")
{code}
{code:r}
Error: Invalid: Unsupported Type:POSIXlt of length 0
{code}
[jira] [Created] (ARROW-18236) [R] Improve error message when providing a mix of readr and Arrow options
Nicola Crane created ARROW-18236: Summary: [R] Improve error message when providing a mix of readr and Arrow options Key: ARROW-18236 URL: https://issues.apache.org/jira/browse/ARROW-18236 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane I was trying to solve a user issue today and ran the following code:
{code:r}
df = tibble(x = c("a", "b", "", "d"))
write_tsv(df, "data.tsv")
open_dataset("data.tsv", format = "tsv", skip_rows = 1, schema = schema(x = string()), skip_empty_rows = TRUE) %>%
  collect()
{code}
which gives me the error
{code:r}
Error: Use either Arrow parse options or readr parse options, not both
{code}
which is unhelpful, as no context is provided about which options are being referred to or what the possible options are. It's also not obvious why we can't support a mix of both; this seems like a totally valid use case. I think both a code update and a more informative error message are needed here.
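A sketch of what a more informative version of the check could look like. The function name, argument names, and message wording are all assumptions, not the package's actual internals; the idea is simply that the error should name the offending options from each camp:

```r
# Hypothetical sketch: if both Arrow-style and readr-style parse options
# were supplied, name them explicitly in the error rather than just
# saying "not both".
check_option_mix <- function(arrow_opts, readr_opts) {
  if (length(arrow_opts) > 0 && length(readr_opts) > 0) {
    stop(
      "Use either Arrow parse options or readr parse options, not both.\n",
      "Arrow options supplied: ", paste(names(arrow_opts), collapse = ", "), "\n",
      "readr options supplied: ", paste(names(readr_opts), collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

With the repro above, the error would then point at {{skip_rows}} (Arrow) and {{skip_empty_rows}} (readr) by name, making the fix obvious to the user.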
[jira] [Created] (ARROW-18216) [R] Better error message when creating an array from decimals
Nicola Crane created ARROW-18216: Summary: [R] Better error message when creating an array from decimals Key: ARROW-18216 URL: https://issues.apache.org/jira/browse/ARROW-18216 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane We should first check why this doesn't work, and see if we can fix the problem itself rather than just improving the error message.
{code:r}
> ChunkedArray$create(c(1.4, 525.5), type = decimal(precision = 1, scale = 3))
Error: NotImplemented: Extend
{code}
[jira] [Created] (ARROW-18215) [R] User experience improvements
Nicola Crane created ARROW-18215: Summary: [R] User experience improvements Key: ARROW-18215 URL: https://issues.apache.org/jira/browse/ARROW-18215 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Umbrella ticket to collect together tickets relating to improving error messages, and general dev-experience tweaks
[jira] [Created] (ARROW-18200) [R] Misleading error message if opening CSV dataset with invalid file in directory
Nicola Crane created ARROW-18200: Summary: [R] Misleading error message if opening CSV dataset with invalid file in directory Key: ARROW-18200 URL: https://issues.apache.org/jira/browse/ARROW-18200 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane I made a mistake where I thought a dataset contained CSVs which were, in fact, Parquet files, and the error message I got was unhelpful:
{code:r}
library(arrow)
download.file(
  url = "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
  destfile = here::here("data/nyc-taxi-tiny.zip")
)
# (unzip the zip file into the data directory but don't delete it after)
open_dataset("data", format = "csv")
{code}
{code:r}
Error in nchar(x) : invalid multibyte string, element 1
In addition: Warning message:
In grepl("No match for FieldRef.Name(__filename)", msg, fixed = TRUE) :
  input string 1 is invalid in this locale
{code}
Note, this only occurs with {{format="csv"}}; omitting this argument (i.e. the default of {{format="parquet"}}) leaves us with the much better error:
{code:r}
Error in `open_dataset()`:
! Invalid: Error creating dataset. Could not read schema from '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Could not open Parquet input source '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
/home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338  GetReader(source, scan_options). Is this a 'parquet' file?
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:44  InspectSchemas(std::move(options))
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:265  Inspect(options.inspect_options)
ℹ Did you mean to specify a 'format' other than the default (parquet)?
{code}
[jira] [Created] (ARROW-18199) [R] Misleading error message in query using across()
Nicola Crane created ARROW-18199: Summary: [R] Misleading error message in query using across() Key: ARROW-18199 URL: https://issues.apache.org/jira/browse/ARROW-18199 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane Error handling looks like it's happening in the wrong place - a comma has been missed in the {{select()}} call, but it wrongly appears to be an issue with {{across()}}. Can we do something to make this not happen?
{code:r}
download.file(
  url = "https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
  destfile = here::here("data/nyc-taxi-tiny.zip")
)
library(arrow)
library(dplyr)
open_dataset("data") %>%
  select(pickup_datetime, pickup_longitude, pickup_latitude ends_with("amount")) %>%
  mutate(across(ends_with("amount"), ~.x * 0.87, .names = "{.col}_gbp")) %>%
  collect()
{code}
{code:r}
Error in `across()`:
! Must be used inside dplyr verbs.
Run `rlang::last_error()` to see where the error occurred.
{code}
[jira] [Created] (ARROW-18181) [R] CSV Reader Improvements
Nicola Crane created ARROW-18181: Summary: [R] CSV Reader Improvements Key: ARROW-18181 URL: https://issues.apache.org/jira/browse/ARROW-18181 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Umbrella ticket for tickets relating to CSV reader improvements in R.
[jira] [Created] (ARROW-18180) [R] GCS Improvements
Nicola Crane created ARROW-18180: Summary: [R] GCS Improvements Key: ARROW-18180 URL: https://issues.apache.org/jira/browse/ARROW-18180 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane
[jira] [Created] (ARROW-18079) [R] Performance regressions after ARROW-12105
Nicola Crane created ARROW-18079: Summary: [R] Performance regressions after ARROW-12105 Key: ARROW-18079 URL: https://issues.apache.org/jira/browse/ARROW-18079 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane The functionality implemented in ARROW-12105 introduced some performance regressions that we should sort out before the release.
[jira] [Created] (ARROW-18062) [R] error in CI jobs for R 3.5 and 3.6 when R package being installed
Nicola Crane created ARROW-18062: Summary: [R] error in CI jobs for R 3.5 and 3.6 when R package being installed Key: ARROW-18062 URL: https://issues.apache.org/jira/browse/ARROW-18062 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane e.g. https://github.com/ursacomputing/crossbow/actions/runs/3246698242/jobs/5325752692#step:5:3164 From the install logs on that CI job:
{code}
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘arrow’:
 .onLoad failed in loadNamespace() for 'arrow', details:
  call: fun_cache[[unqualified_name]] <- fun
  error: invalid type/length (closure/0) in vector allocation
Error: loading failed
{code}
It is currently erroring for R 3.5 and 3.6 in the nightlies with this error. The line of code where the error comes from was added in ARROW-16444, but seeing as that was 3 months ago, it seems unlikely that this change introduced the error.
[jira] [Created] (ARROW-18057) [R] test for slice functions fail on builds without Datasets capability
Nicola Crane created ARROW-18057: Summary: [R] test for slice functions fail on builds without Datasets capability Key: ARROW-18057 URL: https://issues.apache.org/jira/browse/ARROW-18057 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane The changes in ARROW-13766 introduced a test which depends on datasets functionality being enabled - we should skip this on CI builds where it is not.
[jira] [Created] (ARROW-18049) [R] Support column renaming in col_select argument to file reading functions
Nicola Crane created ARROW-18049: Summary: [R] Support column renaming in col_select argument to file reading functions Key: ARROW-18049 URL: https://issues.apache.org/jira/browse/ARROW-18049 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane We should support the ability to rename columns when reading in data via the CSV/Parquet/Feather/JSON file readers. We currently have an argument {{col_select}}, which allows users to choose which columns to read in, but renaming doesn't work. To implement this, we'd need to check whether any columns have been renamed by {{col_select}} and then update the schema of the object being returned once the file has been read.
{code:r}
library(readr)
library(arrow)

readr::read_csv(readr_example("mtcars.csv"), col_select = c(not_hp = hp))
#> # A tibble: 32 × 1
#>    not_hp
#>
#>  1    110
#>  2    110
#>  3     93
#>  4    110
#>  5    175
#>  6    105
#>  7    245
#>  8     62
#>  9     95
#> 10    123
#> # … with 22 more rows

arrow::read_csv_arrow(readr_example("mtcars.csv"), col_select = c(not_hp = hp))
#> # A tibble: 32 × 1
#>       hp
#>
#>  1   110
#>  2   110
#>  3    93
#>  4   110
#>  5   175
#>  6   105
#>  7   245
#>  8    62
#>  9    95
#> 10   123
#> # … with 22 more rows
{code}
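The rename-detection step described above could be sketched as a small helper. The helper name and its exact interface are assumptions for illustration only; it just computes the old-name-to-new-name mapping that the reader would then apply to the returned Table's schema:

```r
# Hypothetical sketch: given a col_select specification (as a named
# character vector, e.g. c(not_hp = "hp")) and the file's column names,
# return a mapping from original name to requested output name.
col_select_renames <- function(col_select, original_names) {
  selected <- unname(col_select)
  new_names <- names(col_select)
  if (is.null(new_names)) new_names <- selected
  # Entries with a non-empty name are renames; the rest keep their name
  renamed <- nzchar(new_names)
  new_names[!renamed] <- selected[!renamed]
  stats::setNames(new_names, selected)
}
```

After reading, the reader could walk this mapping and rename any field whose output name differs from its original name, so {{col_select = c(not_hp = hp)}} would yield a column called {{not_hp}} as readr does.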
[jira] [Created] (ARROW-18043) [R] Properly instantiate empty arrays of extension types in Table__from_schema
Nicola Crane created ARROW-18043: Summary: [R] Properly instantiate empty arrays of extension types in Table__from_schema Key: ARROW-18043 URL: https://issues.apache.org/jira/browse/ARROW-18043 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane The PR for ARROW-12105 introduces the function Table__from_schema which creates an empty Table from a Schema object. Currently it can't handle extension types, and instead just returns NULL type objects.
[jira] [Created] (ARROW-17987) [R] Warning message when building Arrow
Nicola Crane created ARROW-17987: Summary: [R] Warning message when building Arrow Key: ARROW-17987 URL: https://issues.apache.org/jira/browse/ARROW-17987 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane I just got the following message when I rebuilt Arrow after pulling from a different fork:
{code:r}
Warning message:
Failed to enable user cancellation: Signal stop source already set up
{code}
I'm not sure exactly what it is or how to reproduce it (it disappeared after I restarted my R session), but we might want to check that end users won't end up seeing this?
[jira] [Created] (ARROW-17986) [R] native type checking in where()
Nicola Crane created ARROW-17986: Summary: [R] native type checking in where() Key: ARROW-17986 URL: https://issues.apache.org/jira/browse/ARROW-17986 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane The {{where()}} implementation in ARROW-12105 requires simulating a tibble from an Arrow Schema. Could we have a version of this where we allow native type checks, such as {{is_int32()}} or {{is_decimal()}}?
[jira] [Created] (ARROW-17948) [R] arrow_eval user-defined generic functions
Nicola Crane created ARROW-17948: Summary: [R] arrow_eval user-defined generic functions Key: ARROW-17948 URL: https://issues.apache.org/jira/browse/ARROW-17948 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-14071 covers evaluating user-defined functions, but once this is implemented, would it be possible to evaluate generics? Here's an example of how that works in dplyr from a [Stack Overflow question|https://stackoverflow.com/questions/73950714/is-it-possible-to-use-generics-in-apache-arrow]:
{code:r}
library(dplyr)

df <- data.frame(a = c("these", "are", "some", "strings"), b = 1:4)

boop <- function(x, ...) UseMethod("boop", x)
boop.numeric <- function(x) mean(x, na.rm = TRUE)
boop.character <- function(x) mean(nchar(x), na.rm = TRUE)

df %>% summarise(across(everything(), boop))
{code}
[jira] [Created] (ARROW-17911) [R] Implement `across()` within `transmute()`
Nicola Crane created ARROW-17911: Summary: [R] Implement `across()` within `transmute()` Key: ARROW-17911 URL: https://issues.apache.org/jira/browse/ARROW-17911 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane
[jira] [Created] (ARROW-17895) [R] Implement dplyr::across()
Nicola Crane created ARROW-17895: Summary: [R] Implement dplyr::across() Key: ARROW-17895 URL: https://issues.apache.org/jira/browse/ARROW-17895 Project: Apache Arrow Issue Type: Improvement Reporter: Nicola Crane Umbrella ticket for implementing {{across()}}
[jira] [Created] (ARROW-17784) [C++] Opening a dataset where partitioning variable is in the dataset should error differently
Nicola Crane created ARROW-17784: Summary: [C++] Opening a dataset where partitioning variable is in the dataset should error differently Key: ARROW-17784 URL: https://issues.apache.org/jira/browse/ARROW-17784 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Nicola Crane The error message given when the name supplied for the partitioning matches a field in the dataset is a bit misleading - can we catch this earlier and give a different error message?
{code:r}
library(dplyr)
library(arrow)

tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl", hive_style = FALSE)

# The schema fed into `partitioning` should refer to `cyl` and not `wt`,
# but the error message doesn't refer to the duplication here
open_dataset(tf, partitioning = schema(wt = int64())) %>% collect()
#> Error in `open_dataset()`:
#> ! Invalid: Unable to merge: Field wt has incompatible types: double vs int64
#> /home/nic2/arrow/cpp/src/arrow/type.cc:1692  fields_[i]->MergeWith(field)
#> /home/nic2/arrow/cpp/src/arrow/type.cc:1755  AddField(field)
#> /home/nic2/arrow/cpp/src/arrow/type.cc:1826  builder.AddSchema(schema)
#> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  Inspect(options.inspect_options)
{code}
[jira] [Created] (ARROW-17700) [R] Can't open CSV dataset with partitioning and a schema
Nicola Crane created ARROW-17700: Summary: [R] Can't open CSV dataset with partitioning and a schema Key: ARROW-17700 URL: https://issues.apache.org/jira/browse/ARROW-17700 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane I feel like this might be a duplicate of a previous ticket, but can't find it.
{code:r}
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp

# all good!
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, format = "csv")
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

# all good
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv")
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 11
#>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb   cyl
#>
#>  1  22.8  108     93  3.85  2.32  18.6     1     1     4     1     4
#>  2  24.4  147.    62  3.69  3.19  20       1     0     4     2     4
#>  3  22.8  141.    95  3.92  3.15  22.9     1     0     4     2     4
#>  4  32.4   78.7   66  4.08  2.2   19.5     1     1     4     1     4
#>  5  30.4   75.7   52  4.93  1.62  18.5     1     1     4     2     4
#>  6  33.9   71.1   65  4.22  1.84  19.9     1     1     4     1     4
#>  7  21.5  120.    97  3.7   2.46  20.0     1     0     3     1     4
#>  8  27.3   79     66  4.08  1.94  18.9     1     1     4     1     4
#>  9  26    120.    91  4.43  2.14  16.7     0     1     5     2     4
#> 10  30.4   95.1  113  3.77  1.51  16.9     1     1     5     2     4
#> # … with 22 more rows
list.files(tf)
#> [1] "cyl=4" "cyl=6" "cyl=8"

# hive-style=FALSE leads to no `cyl` column, which, sure, makes sense
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv", hive_style = FALSE)
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 10
#>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>
#>  1  22.8  108     93  3.85  2.32  18.6     1     1     4     1
#>  2  24.4  147.    62  3.69  3.19  20       1     0     4     2
#>  3  22.8  141.    95  3.92  3.15  22.9     1     0     4     2
#>  4  32.4   78.7   66  4.08  2.2   19.5     1     1     4     1
#>  5  30.4   75.7   52  4.93  1.62  18.5     1     1     4     2
#>  6  33.9   71.1   65  4.22  1.84  19.9     1     1     4     1
#>  7  21.5  120.    97  3.7   2.46  20.0     1     0     3     1
#>  8  27.3   79     66  4.08  1.94  18.9     1     1     4     1
#>  9  26    120.    91  4.43  2.14  16.7     0     1     5     2
#> 10  30.4   95.1  113  3.77  1.51  16.9     1     1     5     2
#> # … with 22 more rows
list.files(tf)
#> [1] "4" "6" "8"

# *but* if we try to add it in via a schema, it doesn't work
desired_schema <- schema(
  mpg = float64(), disp = float64(), hp = int64(), drat = float64(),
  wt = float64(), qsec = float64(), vs = int64(), am = int64(),
  gear = int64(), carb = int64(), cyl = int64()
)
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv", hive_style = FALSE)
open_dataset(tf, format = "csv", schema = desired_schema) %>% collect()
#> Error in `dplyr::collect()`:
#> ! Invalid: Could not open CSV input source '/tmp/RtmpnInOwc/file13f0d38c5b994/4/part-0.csv': Invalid: CSV parse error: Row #1: Expected 11 columns, got 10: "mpg","disp","hp","drat","wt","qsec","vs","am","gear","carb"
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:477  (ParseLine(values_writer, parsed_writer, data, data_end, is_final, _end, bulk_filter))
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:566  ParseChunk( _writer, _writer, data, data_end, is_final,
[jira] [Created] (ARROW-17699) [R] Error message erroneously triggered when opening partitioned CSV dataset with schema
Nicola Crane created ARROW-17699: Summary: [R] Error message erroneously triggered when opening partitioned CSV dataset with schema Key: ARROW-17699 URL: https://issues.apache.org/jira/browse/ARROW-17699 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane
{code:r}
library(dplyr)
library(arrow)

# all good!
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, format = "csv")
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

# all good
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv")
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 11
#>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb   cyl
#>
#>  1  22.8  108     93  3.85  2.32  18.6     1     1     4     1     4
#>  2  24.4  147.    62  3.69  3.19  20       1     0     4     2     4
#>  3  22.8  141.    95  3.92  3.15  22.9     1     0     4     2     4
#>  4  32.4   78.7   66  4.08  2.2   19.5     1     1     4     1     4
#>  5  30.4   75.7   52  4.93  1.62  18.5     1     1     4     2     4
#>  6  33.9   71.1   65  4.22  1.84  19.9     1     1     4     1     4
#>  7  21.5  120.    97  3.7   2.46  20.0     1     0     3     1     4
#>  8  27.3   79     66  4.08  1.94  18.9     1     1     4     1     4
#>  9  26    120.    91  4.43  2.14  16.7     0     1     5     2     4
#> 10  30.4   95.1  113  3.77  1.51  16.9     1     1     5     2     4
#> # … with 22 more rows
list.files(tf)
#> [1] "cyl=4" "cyl=6" "cyl=8"

# hive-style=FALSE leads to no `cyl` column, which, sure, makes sense
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv", hive_style = FALSE)
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 10
#>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>
#>  1  22.8  108     93  3.85  2.32  18.6     1     1     4     1
#>  2  24.4  147.    62  3.69  3.19  20       1     0     4     2
#>  3  22.8  141.    95  3.92  3.15  22.9     1     0     4     2
#>  4  32.4   78.7   66  4.08  2.2   19.5     1     1     4     1
#>  5  30.4   75.7   52  4.93  1.62  18.5     1     1     4     2
#>  6  33.9   71.1   65  4.22  1.84  19.9     1     1     4     1
#>  7  21.5  120.    97  3.7   2.46  20.0     1     0     3     1
#>  8  27.3   79     66  4.08  1.94  18.9     1     1     4     1
#>  9  26    120.    91  4.43  2.14  16.7     0     1     5     2
#> 10  30.4   95.1  113  3.77  1.51  16.9     1     1     5     2
#> # … with 22 more rows
list.files(tf)
#> [1] "4" "6" "8"

# *but* if we try to add it in via a schema, it doesn't work
desired_schema <- schema(
  mpg = float64(), disp = float64(), hp = int64(), drat = float64(),
  wt = float64(), qsec = float64(), vs = int64(), am = int64(),
  gear = int64(), carb = int64(), cyl = int64()
)
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv", hive_style = FALSE)
open_dataset(tf, format = "csv", schema = desired_schema) %>% collect()
#> Error in `CsvFileFormat$create()`:
#> ! Values in `column_names` must match `schema` field names
#> ✖ `column_names` and `schema` field names match but are not in the same order
list.files(tf)
#> [1] "4" "6" "8"
{code}
[jira] [Created] (ARROW-17698) [R] Implement use of `where()` inside `across()`
Nicola Crane created ARROW-17698: Summary: [R] Implement use of `where()` inside `across()` Key: ARROW-17698 URL: https://issues.apache.org/jira/browse/ARROW-17698 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
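A sketch of the usage this ticket targets (the column selection and transformation are illustrative, not from the ticket); this already works in plain dplyr, and the goal is for the same call to work lazily on an Arrow Table or Dataset:
{code:r}
library(arrow)
library(dplyr)

# Goal: tidyselect predicates like where() inside across() should be
# translated by arrow, as they already are for plain data frames.
mtcars %>%
  arrow_table() %>%
  mutate(across(where(is.numeric), ~ .x * 2)) %>%
  collect()
{code}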
[jira] [Created] (ARROW-17689) [R] Implement dplyr::across() inside group_by()
Nicola Crane created ARROW-17689: Summary: [R] Implement dplyr::across() inside group_by() Key: ARROW-17689 URL: https://issues.apache.org/jira/browse/ARROW-17689 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17690) [R] Implement dplyr::across() inside distinct()
Nicola Crane created ARROW-17690: Summary: [R] Implement dplyr::across() inside distinct() Key: ARROW-17690 URL: https://issues.apache.org/jira/browse/ARROW-17690 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17680) [Dev] More descriptive error output in merge script
Nicola Crane created ARROW-17680: Summary: [Dev] More descriptive error output in merge script Key: ARROW-17680 URL: https://issues.apache.org/jira/browse/ARROW-17680 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Nicola Crane I've just updated to the newer version of the merge script, and something is going wrong; however, the error message I'm getting isn't super-helpful for working out what's happened:
{code:java}
File "/home/nic2/arrow_for_merging_prs_only/dev/merge_arrow_pr.py", line 539, in connect_jira
    return jira.client.JIRA(options={'server': JIRA_API_BASE},
TypeError: __init__() got an unexpected keyword argument 'token_auth'
{code}
Is there some object we could just dump the output of, in cases of failure, so it provides a few more hints to work out what's gone wrong? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17654) [R] Add link to cookbook from README
Nicola Crane created ARROW-17654: Summary: [R] Add link to cookbook from README Key: ARROW-17654 URL: https://issues.apache.org/jira/browse/ARROW-17654 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17637) [R] as.Date fails going from timestamp[s]
Nicola Crane created ARROW-17637: Summary: [R] as.Date fails going from timestamp[s] Key: ARROW-17637 URL: https://issues.apache.org/jira/browse/ARROW-17637 Project: Apache Arrow Issue Type: Improvement Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17528) [R] Tidy up the pkgdown articles site
Nicola Crane created ARROW-17528: Summary: [R] Tidy up the pkgdown articles site Key: ARROW-17528 URL: https://issues.apache.org/jira/browse/ARROW-17528 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane We could better organise the different articles we have to make it easier for users to find the right info -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17490) [R] Differing results in log bindings
Nicola Crane created ARROW-17490: Summary: [R] Differing results in log bindings Key: ARROW-17490 URL: https://issues.apache.org/jira/browse/ARROW-17490 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane We get different results for dplyr versus Acero if we call log on a column that contains 0, i.e.
{code:r}
library(arrow)
library(dplyr)

df <- tibble(x = 0:10)

df %>% mutate(y = log(x)) %>% collect()
#> # A tibble: 11 × 2
#>        x      y
#>    <int>  <dbl>
#>  1     0 -Inf
#>  2     1  0
#>  3     2  0.693
#>  4     3  1.10
#>  5     4  1.39
#>  6     5  1.61
#>  7     6  1.79
#>  8     7  1.95
#>  9     8  2.08
#> 10     9  2.20
#> 11    10  2.30

df %>% arrow_table() %>% mutate(y = log(x)) %>% collect()
#> Error in `collect()`:
#> ! Invalid: logarithm of zero
{code}
This is because R defines {{log(0)}} as {{-Inf}} whereas Acero defines it as an error. Not sure what the solution is here; do we want to request the addition of an Acero option to define behaviour for this? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17489) [R] Nightly builds failing due to test referencing unreleased stringr functions
Nicola Crane created ARROW-17489: Summary: [R] Nightly builds failing due to test referencing unreleased stringr functions Key: ARROW-17489 URL: https://issues.apache.org/jira/browse/ARROW-17489 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Many of the nightly builds are failing (e.g. https://github.com/ursacomputing/crossbow/runs/7942883382?check_suite_focus=true#step:5:24666) due to a test which runs conditionally on the version of stringr available. The cause is an NSE function we have implemented which only exists in the dev version of stringr; we expected it to be included in the next stringr release, but it was not. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17445) [R] Add vignette on ExecPlans and how they work
Nicola Crane created ARROW-17445: Summary: [R] Add vignette on ExecPlans and how they work Key: ARROW-17445 URL: https://issues.apache.org/jira/browse/ARROW-17445 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane I've been working on a blog post to showcase the new {{show_exec_plan()}} function, but there's a lot of information that people think would make a good addition that would be better placed in a new vignette or pkgdown article. There's sufficient R-related content to include here (i.e. about how {{show_exec_plan()}} works) that it's worth having this in the R docs and not in general docs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17387) [R] Implement dplyr::across() inside filter()
Nicola Crane created ARROW-17387: Summary: [R] Implement dplyr::across() inside filter() Key: ARROW-17387 URL: https://issues.apache.org/jira/browse/ARROW-17387 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane Fix For: 10.0.0 ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate(). Once this is merged, we should also add the ability to do so within dplyr::filter(). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17384) [R] Additional dplyr functionality
Nicola Crane created ARROW-17384: Summary: [R] Additional dplyr functionality Key: ARROW-17384 URL: https://issues.apache.org/jira/browse/ARROW-17384 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Umbrella ticket to collect together tickets relating to implementing additional dplyr verbs or unimplemented arguments for implemented verbs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17371) [R] Remove as.factor to dictionary_encode mapping
Nicola Crane created ARROW-17371: Summary: [R] Remove as.factor to dictionary_encode mapping Key: ARROW-17371 URL: https://issues.apache.org/jira/browse/ARROW-17371 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane There is an NSE func mapping between {{base::as.factor}} and Acero's {{dictionary_encode}}. However, it doesn't work at present - see ARROW-12632. At present, calling this function results in an error. We should remove this mapping so that an error is raised and we call {{as.factor}} in R not Acero. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17366) [R] Support purrr-style lambda functions in .fns argument to across()
Nicola Crane created ARROW-17366: Summary: [R] Support purrr-style lambda functions in .fns argument to across() Key: ARROW-17366 URL: https://issues.apache.org/jira/browse/ARROW-17366 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds support for dplyr::across inside a mutate(). The .fns argument does not yet support purrr-style lambda functions (e.g. {{~round(.x, digits = -1)}}), but support for them should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
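For reference, a sketch of the lambda form in question (the columns chosen are illustrative); this already works in plain dplyr, and the ticket asks for the same call to be translated when the input is an Arrow Table:
{code:r}
library(arrow)
library(dplyr)

# purrr-style lambda: ~ round(.x, digits = -1) is shorthand for
# function(x) round(x, digits = -1)
mtcars %>%
  arrow_table() %>%
  mutate(across(c(disp, hp), ~ round(.x, digits = -1))) %>%
  collect()
{code}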
[jira] [Created] (ARROW-17365) [R] Implement ... argument inside across()
Nicola Crane created ARROW-17365: Summary: [R] Implement ... argument inside across() Key: ARROW-17365 URL: https://issues.apache.org/jira/browse/ARROW-17365 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{...}} argument is not yet supported but should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17364) [R] Implement .names argument inside across()
Nicola Crane created ARROW-17364: Summary: [R] Implement .names argument inside across() Key: ARROW-17364 URL: https://issues.apache.org/jira/browse/ARROW-17364 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}. The {{.names}} argument is not yet supported but should be added. -- This message was sent by Atlassian Jira (v8.20.10#820010)
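A sketch of the {{.names}} behaviour being requested (column choices are illustrative); in dplyr, {{.names}} takes a glue-style template where "{.col}" expands to the column name:
{code:r}
library(arrow)
library(dplyr)

# In plain dplyr this adds mpg_rounded and wt_rounded alongside the
# original columns; the ticket asks for the same on an Arrow Table.
mtcars %>%
  arrow_table() %>%
  mutate(across(c(mpg, wt), round, .names = "{.col}_rounded")) %>%
  collect()
{code}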
[jira] [Created] (ARROW-17362) [R] Implement dplyr::across() inside summarise()
Nicola Crane created ARROW-17362: Summary: [R] Implement dplyr::across() inside summarise() Key: ARROW-17362 URL: https://issues.apache.org/jira/browse/ARROW-17362 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate(). Once this is merged, we should also add the ability to do so within dplyr::summarise(). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17356) [R] Update binding for add_filename() NSE function to error if used on Table
Nicola Crane created ARROW-17356: Summary: [R] Update binding for add_filename() NSE function to error if used on Table Key: ARROW-17356 URL: https://issues.apache.org/jira/browse/ARROW-17356 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane ARROW-15260 adds a function which allows the user to add the filename as an output field. This function only makes sense to use with datasets and not tables. Currently, the error generated from using it with a table is handled by {{handle_augmented_field_misuse()}}. Instead, we should follow [one of the suggestions from the PR|https://github.com/apache/arrow/pull/12826#issuecomment-1192007298] to detect this when the function is called. -- This message was sent by Atlassian Jira (v8.20.10#820010)
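To illustrate the distinction (the dataset path is hypothetical): {{add_filename()}} is well-defined when scanning files from disk, but an in-memory Table has no source file, so the call should fail early:
{code:r}
library(arrow)
library(dplyr)

# Meaningful on a Dataset: each row is tagged with the file it came from
open_dataset("path/to/dataset") %>%   # hypothetical path
  mutate(file = add_filename()) %>%
  collect()

# On a Table there is no source file, so this should raise a clear error
# at call time rather than deep inside the ExecPlan
arrow_table(mtcars) %>%
  mutate(file = add_filename()) %>%
  collect()
{code}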
[jira] [Created] (ARROW-17355) [R] Refactor the handle_* utility functions for a better dev experience
Nicola Crane created ARROW-17355: Summary: [R] Refactor the handle_* utility functions for a better dev experience Key: ARROW-17355 URL: https://issues.apache.org/jira/browse/ARROW-17355 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane In ARROW-15260, the utility functions for handling different kinds of reading errors (handle_parquet_io_error, handle_csv_read_error, and handle_augmented_field_misuse) were refactored so that multiple ones could be chained together. An issue with this is that, if the handlers are used without any error they don't capture being manually re-raised afterwards, those other errors may be silently swallowed. We should update the code to prevent this from being possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17102) [R] Test fails on test-r-offline-minimal nightly build
Nicola Crane created ARROW-17102: Summary: [R] Test fails on test-r-offline-minimal nightly build Key: ARROW-17102 URL: https://issues.apache.org/jira/browse/ARROW-17102 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane May be due to a missing option to skip the test if Parquet support is not available https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=29590=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=17703 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17054) [R] Creating an Array from an object bigger than 2^31 results in an Array of length 0
Nicola Crane created ARROW-17054: Summary: [R] Creating an Array from an object bigger than 2^31 results in an Array of length 0 Key: ARROW-17054 URL: https://issues.apache.org/jira/browse/ARROW-17054 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Apologies for the lack of a proper reprex but it crashes my session when I try to make one. I'm working on ARROW-16977, which is all about the reporting of object size having integer overflow issues, but this affects object creation. {code:r}
library(arrow, warn.conflicts = TRUE)

# works - creates a huge array, hurrah
big_logical <- vector(mode = "logical", length = .Machine$integer.max)
big_logical_array <- Array$create(big_logical)
length(big_logical)
## [1] 2147483647
length(big_logical_array)
## [1] 2147483647

# creates an array of length 0, boo!
too_big <- vector(mode = "logical", length = .Machine$integer.max + 1)
too_big_array <- Array$create(too_big)
length(too_big)
## [1] 2147483648
length(too_big_array)
## [1] 0
{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16977) [R] Update dataset row counting so no integer overflow on large datasets
Nicola Crane created ARROW-16977: Summary: [R] Update dataset row counting so no integer overflow on large datasets Key: ARROW-16977 URL: https://issues.apache.org/jira/browse/ARROW-16977 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16973) [R] segfault on some CI jobs when calling `flight_put()`
Nicola Crane created ARROW-16973: Summary: [R] segfault on some CI jobs when calling `flight_put()` Key: ARROW-16973 URL: https://issues.apache.org/jira/browse/ARROW-16973 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane CI jobs for PRs unrelated to this area of the codebase have been segfaulting recently, e.g.: * [https://github.com/apache/arrow/runs/7180218227?check_suite_focus=true] * [https://github.com/apache/arrow/runs/7139495271?check_suite_focus=true#step:7:22897] * [https://github.com/apache/arrow/runs/7134531302?check_suite_focus=true#step:7:25791] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-16862) [C++] Add option for casting failure values to default to NULL/NA
Nicola Crane created ARROW-16862: Summary: [C++] Add option for casting failure values to default to NULL/NA Key: ARROW-16862 URL: https://issues.apache.org/jira/browse/ARROW-16862 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Nicola Crane In ARROW-16833, a user is complaining that they are unable to cast their messy string data to integer data and they receive an error message. In R, it's possible to convert this kind of data to integers, with values that fail just being converted to NA values. Would it be possible to enable this as an option in Arrow? -- This message was sent by Atlassian Jira (v8.20.7#820007)
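For comparison, a sketch of the two behaviours side by side (the exact Arrow error text may differ by version):
{code:r}
library(arrow)

# Base R coerces unparseable values to NA, with a warning:
as.integer(c("1", "2", "oops"))
#> Warning: NAs introduced by coercion
#> [1]  1  2 NA

# Arrow's cast currently errors on the first unparseable value instead
# of producing a null:
Array$create(c("1", "2", "oops"))$cast(int32())
{code}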
[jira] [Created] (ARROW-16829) [R] Add link to new contributors guide to developer guide
Nicola Crane created ARROW-16829: Summary: [R] Add link to new contributors guide to developer guide Key: ARROW-16829 URL: https://issues.apache.org/jira/browse/ARROW-16829 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16650) [R] Binding for between() is in dplyr-funcs-type.R
Nicola Crane created ARROW-16650: Summary: [R] Binding for between() is in dplyr-funcs-type.R Key: ARROW-16650 URL: https://issues.apache.org/jira/browse/ARROW-16650 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane I was looking for the binding for `dplyr::between()` and was surprised to find it in `dplyr-funcs-type.R`; we should move it somewhere more appropriate, like `dplyr-funcs-math.R`. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16649) [C++] Add support for sorting to the Substrait consumer
Nicola Crane created ARROW-16649: Summary: [C++] Add support for sorting to the Substrait consumer Key: ARROW-16649 URL: https://issues.apache.org/jira/browse/ARROW-16649 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Nicola Crane The streaming execution engine supports sorting (I believe, as a sink node option?), but the Substrait consumer does not currently consume sort relations. Please can we have support for this? Here's the example code/plan I tested with:
{code:java}
library(dplyr)
library(substrait)

# create a basic table and order it
out <- tibble::tibble(a = 1, b = 2) %>%
  arrow_substrait_compiler() %>%
  arrange(a)

# take a look at the plan created
out$plan()
#> message of type 'substrait.Plan' with 2 fields set
#> extension_uris {
#>   extension_uri_anchor: 1
#> }
#> relations {
#>   root {
#>     input {
#>       sort {
#>         input {
#>           read {
#>             base_schema {
#>               names: "a"
#>               names: "b"
#>               struct_ {
#>                 types {
#>                   fp64 {
#>                   }
#>                 }
#>                 types {
#>                   fp64 {
#>                   }
#>                 }
#>               }
#>             }
#>             named_table {
#>               names: "named_table_1"
#>             }
#>           }
#>         }
#>         sorts {
#>           expr {
#>             selection {
#>               direct_reference {
#>                 struct_field {
#>                 }
#>               }
#>             }
#>           }
#>           direction: SORT_DIRECTION_ASC_NULLS_LAST
#>         }
#>       }
#>     }
#>     names: "a"
#>     names: "b"
#>   }
#> }

# try to run the plan
collect(out)
#> Error: NotImplemented: conversion to arrow::compute::Declaration from Substrait relation sort { ...
#> /home/nic2/arrow/cpp/src/arrow/engine/substrait/serde.cc:73 FromProto(plan_rel.rel(), ext_set)
{code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16560) [Website][Release] Version JSON files not updated in release
Nicola Crane created ARROW-16560: Summary: [Website][Release] Version JSON files not updated in release Key: ARROW-16560 URL: https://issues.apache.org/jira/browse/ARROW-16560 Project: Apache Arrow Issue Type: Bug Components: Website Reporter: Nicola Crane ARROW-15366 added a script to automatically increment the version switchers for the docs, which was updated as part of the changes in ARROW-1. However, the latest release did not increment the version numbers (and ARROW-1 changes the script to update on snapshots instead of releases - could be the reason for it not happening?) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16480) [R] Update read_csv_arrow parse_options, read_options, and convert_options to take lists
Nicola Crane created ARROW-16480: Summary: [R] Update read_csv_arrow parse_options, read_options, and convert_options to take lists Key: ARROW-16480 URL: https://issues.apache.org/jira/browse/ARROW-16480 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Currently if we want to specify Arrow-specific read options such as encoding, we'd have to do something like this: {code:java} df <- read_csv_arrow(tf, read_options = CsvReadOptions$create(encoding = "utf8")) {code} We should update the code inside {{read_csv_arrow()}} so that the user can specify {{read_options}} as a list which we then pass through to CsvReadOptions internally, so we could instead call the much more user-friendly code below: {code:java} df <- read_csv_arrow(tf, read_options = list(encoding = "utf8")) {code} We should then add an example of this to the function doc examples. We also should do the same for parse_options and convert_options. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16447) [R] Integer overflow causes error - (in dplyr we get an NA with a warning)
Nicola Crane created ARROW-16447: Summary: [R] Integer overflow causes error - (in dplyr we get an NA with a warning) Key: ARROW-16447 URL: https://issues.apache.org/jira/browse/ARROW-16447 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane {code:java}
library(dplyr)
library(arrow)

.input = tibble::tibble(x = .Machine$integer.max)

# in dplyr
.input %>% mutate(x2 = x + 6L) %>% collect()
#> Warning in x + 6L: NAs produced by integer overflow
#> # A tibble: 1 × 2
#>            x    x2
#>
#> 1 2147483647    NA

# in Arrow via arrow
.input %>% arrow_table() %>% mutate(x2 = x + 6L) %>% collect()
#> Error in `collect()`:
#> ! Invalid: overflow
#> /home/nic2/arrow/cpp/src/arrow/compute/exec.cc:701  kernel_->exec(kernel_ctx_, batch, )
#> /home/nic2/arrow/cpp/src/arrow/compute/exec.cc:642  ExecuteBatch(batch, listener)
#> /home/nic2/arrow/cpp/src/arrow/compute/exec/expression.cc:547  executor->Execute(arguments, )
#> /home/nic2/arrow/cpp/src/arrow/compute/exec/project_node.cc:90  ExecuteScalarExpression(simplified_expr, target, plan()->exec_context())
#> /home/nic2/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:463  iterator_.Next()
#> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext()
#> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()
{code}
Do we want to enable the return of NAs on integer overflow, or just give the user a more specific hint in the error message? -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16376) [R][CI] Update test-r-devdocs on Windows to build UCRT and don't pin to R 4.1
Nicola Crane created ARROW-16376: Summary: [R][CI] Update test-r-devdocs on Windows to build UCRT and don't pin to R 4.1 Key: ARROW-16376 URL: https://issues.apache.org/jira/browse/ARROW-16376 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane The failed devdocs builds were fixed by pinning the R version to 4.1 in ARROW-16375 but we should instead just add UCRT to the build and not pin the version -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16375) [R] Pin test-r-devdocs on Windows to R 4.1
Nicola Crane created ARROW-16375: Summary: [R] Pin test-r-devdocs on Windows to R 4.1 Key: ARROW-16375 URL: https://issues.apache.org/jira/browse/ARROW-16375 Project: Apache Arrow Issue Type: Bug Reporter: Nicola Crane This build is failing on Windows, likely because R 4.2 on Windows requires UCRT. A short-term solution is pinning these builds to R 4.1 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16310) [R] test-fedora-r-clang-sanitizer job fails - possible tzdb installation issue
Nicola Crane created ARROW-16310: Summary: [R] test-fedora-r-clang-sanitizer job fails - possible tzdb installation issue Key: ARROW-16310 URL: https://issues.apache.org/jira/browse/ARROW-16310 Project: Apache Arrow Issue Type: Bug Reporter: Nicola Crane We're seeing an error on a sanitizer build for https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=23988=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=3034 I think it's something to do with tzdb installation: {code:java} make: Target 'all' not remade because of errors. * installing *source* package ‘tzdb’ ... ** package ‘tzdb’ successfully unpacked and MD5 sums checked ** using staged installation make[1]: *** [/opt/R-devel/lib64/R/etc/Makeconf:178: api.o] Error 1 make[1]: Leaving directory '/tmp/Rtmp0aqclz/R.INSTALL51cc14b8c441/tzdb/src' ERROR: compilation failed for package ‘tzdb’ * removing ‘/opt/R-devel/lib64/R/library/tzdb’ The downloaded source packages are in ‘/tmp/Rtmpg6gyGy/downloaded_packages’ Updating HTML index of packages in '.Library' Making 'packages.html' ... done Warning messages: 1: package ‘’ is not available for this version of R A version of this package for your version of R might be available elsewhere, see the ideas at https://cran.r-project.org/doc/manuals/r-devel/R-admin.html#Installing-packages 2: In i.p(...) : installation of one or more packages failed, probably ‘tzdb’ > > / + popd {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16164) [C++] Pushdown filters on augmented columns like fragment filename
Nicola Crane created ARROW-16164: Summary: [C++] Pushdown filters on augmented columns like fragment filename Key: ARROW-16164 URL: https://issues.apache.org/jira/browse/ARROW-16164 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Nicola Crane In the discussion on ARROW-15260, if we run the following code in R, we might expect it to push down the filter so we can just read in the relevant files: {code:r} filter = Expression$create( "match_substring", Expression$field_ref("__filename"), options = list(pattern = "cyl=8") ) {code} As mentioned by [~westonpace]: "You might think we would get the hint and only read files matching that pattern. This is not the case. We will read the entire dataset and apply the "cyl=8" filter in memory. If we want to pushdown filters on the filename column we will need to add some special logic." -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16154) [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing
Nicola Crane created ARROW-16154: Summary: [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing Key: ARROW-16154 URL: https://issues.apache.org/jira/browse/ARROW-16154 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Fix For: 8.0.0 See discussion here for context: https://github.com/apache/arrow/pull/12826#issuecomment-1092052001 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16106) [R] Support for filename-based partitioning
Nicola Crane created ARROW-16106: Summary: [R] Support for filename-based partitioning Key: ARROW-16106 URL: https://issues.apache.org/jira/browse/ARROW-16106 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane This was added in ARROW-14612 and now needs implementing in R -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16080) [R][Documentation] Document filename-based partitioning
Nicola Crane created ARROW-16080: Summary: [R][Documentation] Document filename-based partitioning Key: ARROW-16080 URL: https://issues.apache.org/jira/browse/ARROW-16080 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Fix For: 8.0.0 Filename-based partitioning has been implemented in C++; we should add something to our docs about this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16011) [R] CI jobs should fail if lintr picked up issues
Nicola Crane created ARROW-16011: Summary: [R] CI jobs should fail if lintr picked up issues Key: ARROW-16011 URL: https://issues.apache.org/jira/browse/ARROW-16011 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Currently lintr flags styling issues on every PR, which can lead to it flagging stylistic issues on unrelated PRs if a previous R-related PR has introduced a linting issue. We should instead make the R CI build fail in these cases, so these problems are not merged in the first place. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding
Nicola Crane created ARROW-16000: Summary: [C++][Dataset] Support Latin-1 encoding Key: ARROW-16000 URL: https://issues.apache.org/jira/browse/ARROW-16000 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Nicola Crane In ARROW-15992 a user is reporting issues with trying to read in files with Latin-1 encoding. I had a look through the docs for the Dataset API and I don't think this is currently supported. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string
Nicola Crane created ARROW-15943: Summary: [C++] Filter which files to be read in as part of filesystem, filtered using a string Key: ARROW-15943 URL: https://issues.apache.org/jira/browse/ARROW-15943 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Nicola Crane There is a report from a user (see this Stack Overflow post [1]) who has used the {{basename_template}} parameter to write files to a dataset, some of which have the prefix {{"summary"}} and others which have the prefix {{"prediction"}}. This data is saved in partitioned directories. They want to be able to read the data back in so that, as well as getting the partition variables in their dataset, they can choose which subset (predictions vs. summaries) to read back in. This isn't currently possible; if they try to open a dataset with a list of files, they cannot read it in as partitioned data. A short-term solution is to suggest they change the structure of how their data is stored, but it could be useful to be able to pass in some sort of filter to determine which files get read in as a dataset. [1] https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r -- This message was sent by Atlassian Jira (v8.20.1#820001)
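A sketch of the current workaround (the directory name and file prefix follow the report; the exact layout is assumed): an explicit file list can be opened as a dataset, but the partition information encoded in the directory names is then lost:
{code:r}
library(arrow)

# Select only the "summary" files from the partitioned directory tree
files <- list.files("dataset_dir", pattern = "^summary",
                    recursive = TRUE, full.names = TRUE)

# This reads the subset, but the partition columns encoded in the
# directory names are not reconstructed from a plain file list
ds <- open_dataset(files, format = "parquet")
{code}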
[jira] [Created] (ARROW-15880) [C++] Can't open partitioned dataset if the root directory has "=" in its name
Nicola Crane created ARROW-15880: Summary: [C++] Can't open partitioned dataset if the root directory has "=" in its name Key: ARROW-15880 URL: https://issues.apache.org/jira/browse/ARROW-15880 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Nicola Crane Not sure if this is a bug or "just how Hive style partitioning works" but if I try to open a dataset where the root directory has an "=" in it, I have to specify that directory in my partitioning to be able to successfully open it. This has caused users to trip up when they've saved one directory from a partitioned dataset somewhere and tried to then open this directory as a dataset.
{code:r}
library(arrow)
td <- tempfile()
dir.create(td)

# directory with equals sign in name
subdir <- file.path(td, "foo=bar")
dir.create(subdir)
write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foo=bar/am=0/part-0.parquet" "foo=bar/am=1/part-0.parquet"

# doesn't work
open_dataset(subdir, partitioning = "am")
#> Error:
#> ! "partitioning" does not match the detected Hive-style partitions: c("foo", "am")
#> ℹ Omit "partitioning" to use the Hive partitions
#> ℹ Set `hive_style = FALSE` to override what was detected
#> ℹ Or, to rename partition columns, call `select()` or `rename()` after opening the dataset

# works
open_dataset(subdir, partitioning = c("foo", "am"))
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> foo: string
#> am: int32
#>
#> See $metadata for additional Schema metadata
{code}
Compare this with the same example but the folder is just called "foobar" instead of "foo=bar".
{code:r}
td <- tempfile()
dir.create(td)
subdir <- file.path(td, "foobar")
dir.create(subdir)
write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foobar/am=0/part-0.parquet" "foobar/am=1/part-0.parquet"

# works
open_dataset(subdir, partitioning = "am")
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> am: int32
#>
#> See $metadata for additional Schema metadata
{code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15827) [R] Improve UX of write_dataset(..., max_rows_per_group)
Nicola Crane created ARROW-15827: Summary: [R] Improve UX of write_dataset(..., max_rows_per_group) Key: ARROW-15827 URL: https://issues.apache.org/jira/browse/ARROW-15827 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane When using {{write_dataset()}}, if we set {{max_rows_per_file}} without also setting {{max_rows_per_group}}, we always get the error shown below. {code:r} library(arrow) td <- tempfile() dir.create(td) write_dataset(mtcars, td, max_rows_per_file = 5L) #> Error: Invalid: max_rows_per_group must be less than or equal to max_rows_per_file {code} We should change the behaviour so we can specify one without having to also specify the other. -- This message was sent by Atlassian Jira (v8.20.1#820001)
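Until the defaults are reconciled, a workaround sketch (not the proposed fix) is to set both parameters, keeping {{max_rows_per_group}} no larger than {{max_rows_per_file}}:

{code:r}
library(arrow)
td <- tempfile()
dir.create(td)
# Setting both values satisfies the check that
# max_rows_per_group <= max_rows_per_file.
write_dataset(mtcars, td, max_rows_per_file = 5L, max_rows_per_group = 5L)
{code}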
[jira] [Created] (ARROW-15819) [R] R docs version switcher doesn't work on Safari on MacOS
Nicola Crane created ARROW-15819: Summary: [R] R docs version switcher doesn't work on Safari on MacOS Key: ARROW-15819 URL: https://issues.apache.org/jira/browse/ARROW-15819 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane Reported as missing on Safari on MacOS by both Ian and Neal -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15812) [R] Allow user to supply col_names argument when reading in a CSV dataset
Nicola Crane created ARROW-15812: Summary: [R] Allow user to supply col_names argument when reading in a CSV dataset Key: ARROW-15812 URL: https://issues.apache.org/jira/browse/ARROW-15812 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Allow the user to supply the {{col_names}} argument from {{readr}} when reading in a dataset. This is already possible when reading in a single CSV file via {{arrow::read_csv_arrow()}}, through the {{readr_to_csv_read_options}} function, so once the C++ functionality to autogenerate column names for Datasets is implemented, we should hook up {{readr_to_csv_read_options}} in {{csv_file_format_read_opts}} just like we have with {{readr_to_csv_parse_options}} in {{csv_file_format_parse_options}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15797) [R] Supplying column names to open_dataset results in all columns being read in as strings
Nicola Crane created ARROW-15797: Summary: [R] Supplying column names to open_dataset results in all columns being read in as strings Key: ARROW-15797 URL: https://issues.apache.org/jira/browse/ARROW-15797 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane {code:r} library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp td <- tempfile() dir.create(td) write_dataset(mtcars, td, format = "csv") # Correct column types open_dataset(td, format = "csv") #> FileSystemDataset with 1 csv file #> mpg: double #> cyl: int64 #> disp: double #> hp: int64 #> drat: double #> wt: double #> qsec: double #> vs: int64 #> am: int64 #> gear: int64 #> carb: int64 # Incorrect column types open_dataset(td, format = "csv", column_names = c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb")) #> FileSystemDataset with 1 csv file #> mpg: string #> cyl: string #> disp: string #> hp: string #> drat: string #> wt: string #> qsec: string #> vs: string #> am: string #> gear: string #> carb: string {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15743) [R] `skip` not connected up to `skip_rows` on open_dataset despite error messages indicating otherwise
Nicola Crane created ARROW-15743: Summary: [R] `skip` not connected up to `skip_rows` on open_dataset despite error messages indicating otherwise Key: ARROW-15743 URL: https://issues.apache.org/jira/browse/ARROW-15743 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane If I open a dataset of CSVs with a schema, the error message tells me to supply {{`skip = 1`}} if my data contains a header row (to prevent it being read in as data), but only {{skip_rows = 1}} actually works. {code:r} library(arrow) library(dplyr) td <- tempfile() dir.create(td) write_dataset(mtcars, td, format = "csv") schema <- schema(mpg = float64(), cyl = float64(), disp = float64(), hp = float64(), drat = float64(), wt = float64(), qsec = float64(), vs = float64(), am = float64(), gear = float64(), carb = float64()) open_dataset(td, format = "csv", schema = schema) %>% collect() #> Error in `handle_csv_read_error()`: #> ! Invalid: Could not open CSV input source '/tmp/RtmppZbpeF/file6cec135ed29c/part-0.csv': Invalid: In CSV column #0: Row #1: CSV conversion error to double: invalid value 'mpg' #> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550 decoder_.Decode(data, size, quoted, ) #> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123 status #> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554 parser.VisitColumn(col_index, visit) #> /home/nic2/arrow/cpp/src/arrow/csv/reader.cc:463 arrow::internal::UnwrapOrRaise(maybe_decoded_arrays) #> /home/nic2/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:445 iterator_.Next() #> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:336 ReadNext() #> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:347 ReadAll() #> ℹ If you have supplied a schema and your data contains a header row, you should supply the argument `skip = 1` to prevent the header being read in as data. 
open_dataset(td, format = "csv", schema = schema, skip = 1) %>% collect()
#> Error: The following option is supported in "read_delim_arrow" functions but not yet supported here: "skip"
open_dataset(td, format = "csv", schema = schema, skip_rows = 1) %>% collect()
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows
{code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15567) [R] Implement as_substrait() and from_substrait() for integers
Nicola Crane created ARROW-15567: Summary: [R] Implement as_substrait() and from_substrait() for integers Key: ARROW-15567 URL: https://issues.apache.org/jira/browse/ARROW-15567 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15566) [R] Create initial implementation
Nicola Crane created ARROW-15566: Summary: [R] Create initial implementation Key: ARROW-15566 URL: https://issues.apache.org/jira/browse/ARROW-15566 Project: Apache Arrow Issue Type: Sub-task Components: R Reporter: Nicola Crane Assignee: Dewey Dunnington Create an initial implementation of an R package which will generate Substrait plans from dplyr code -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15507) [R] Refactor repeated code into check_match function
Nicola Crane created ARROW-15507: Summary: [R] Refactor repeated code into check_match function Key: ARROW-15507 URL: https://issues.apache.org/jira/browse/ARROW-15507 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane In https://github.com/apache/arrow/pull/12277#discussion_r794636116 we discuss similar reasoning in two different places in the codebase; this should be refactored into a function to make the code easier to skim. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15480) [R] Expand on schema/colnames mismatch error messages
Nicola Crane created ARROW-15480: Summary: [R] Expand on schema/colnames mismatch error messages Key: ARROW-15480 URL: https://issues.apache.org/jira/browse/ARROW-15480 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane In ARROW-14744 extra checks were added for when {{open_dataset()}} is used and there are conflicts between the column names from the schema and those passed in explicitly - we should expand on the messaging and tests for the different possible cases here. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15476) [R][Docs] Update the links in the developing vignette so they don't point to absolute paths
Nicola Crane created ARROW-15476: Summary: [R][Docs] Update the links in the developing vignette so they don't point to absolute paths Key: ARROW-15476 URL: https://issues.apache.org/jira/browse/ARROW-15476 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Reporter: Nicola Crane There are 3 links in the "developing" vignettes which point to absolute paths to articles for developers. This works for the package vignettes, but not for the pkgdown versions: there the links point to the latest published version of those articles, so in the dev docs, for example, they result in "Not Found" because those docs are not yet published on the main site. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15470) [C++] Allow user to specify string to be used for missing data when writing CSV dataset
Nicola Crane created ARROW-15470: Summary: [C++] Allow user to specify string to be used for missing data when writing CSV dataset Key: ARROW-15470 URL: https://issues.apache.org/jira/browse/ARROW-15470 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Nicola Crane The ability to select the string to be used for missing data was implemented for the CSV Writer in ARROW-14903, but would it be possible to also allow this when writing CSV datasets? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15366) [R] Automate incrementing of pkgdown version for dropdown menu
Nicola Crane created ARROW-15366: Summary: [R] Automate incrementing of pkgdown version for dropdown menu Key: ARROW-15366 URL: https://issues.apache.org/jira/browse/ARROW-15366 Project: Apache Arrow Issue Type: Bug Components: R Reporter: Nicola Crane Assignee: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15337) [Doc] New contributors guide updates
Nicola Crane created ARROW-15337: Summary: [Doc] New contributors guide updates Key: ARROW-15337 URL: https://issues.apache.org/jira/browse/ARROW-15337 Project: Apache Arrow Issue Type: Sub-task Reporter: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15303) [R] linting errors
Nicola Crane created ARROW-15303: Summary: [R] linting errors Key: ARROW-15303 URL: https://issues.apache.org/jira/browse/ARROW-15303 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15281) [C++] Implement ability to retrieve fragment filename
Nicola Crane created ARROW-15281: Summary: [C++] Implement ability to retrieve fragment filename Key: ARROW-15281 URL: https://issues.apache.org/jira/browse/ARROW-15281 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Nicola Crane A user has requested the ability to include the filename of the CSV in the dataset output - see discussion on ARROW-15260 for more context. Relevant info from that ticket: "From a C++ perspective we've got many of the pieces needed already. One challenge is that the datasets API is written to work with "fragments" and not "files". For example, a dataset might be an in-memory table in which case we are working with InMemoryFragment and not FileFragment so there is no concept of "filename". That being said, the low level ScanBatchesAsync method actually returns a generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is a struct with the record batch as well as the source fragment for that record batch. So if you were to execute scan, you could inspect the fragment and, if it is a FileFragment, you could extract the filename. Another challenge is that R is moving towards more and more access through an exec plan and not directly using a scanner. In order for that to work we would need to augment the scan results with the filename in C++ before sending into the exec plan. Luckily, we already do this a bit as well. We currently augment the scan results with fragment index, batch index, and whether the batch is the last batch in the fragment. Since ExecBatch can work with constants efficiently I don't think there will be much performance cost in always including the filename. So the work remaining is simply to add a new augmented field {{__fragment_source_name}} which is always attached if the underlying fragment is a filename. Then users can get this field if they want by including "__fragment_source_name" in the list of columns they query for." 
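If the augmented field described above were added, usage from R might look something like the sketch below; note that the column name {{__fragment_source_name}} is the ticket's proposal, not a shipped API:

{code:r}
library(arrow)
library(dplyr)

ds <- open_dataset("some/dataset/dir", format = "csv")

# Request the proposed virtual column alongside the real ones;
# each row would then carry the path of its source file.
ds %>%
  select(all_of(c(names(ds), "__fragment_source_name"))) %>%
  collect()
{code}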
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15278) [R] Reorganise tests for dates and datetimes to test them together
Nicola Crane created ARROW-15278: Summary: [R] Reorganise tests for dates and datetimes to test them together Key: ARROW-15278 URL: https://issues.apache.org/jira/browse/ARROW-15278 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane The tests in {{arrow/r/tests/test-dplyr-funcs-datetime.R}} have dates and datetimes tested separately. Given that both I (the person who originally wrote them like that!) and subsequent contributors have ended up accidentally forgetting to test one of these classes, it would make more sense for the tests to just test both at once. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15279) [R] Update "writing bindings" dev docs based on user feedback
Nicola Crane created ARROW-15279: Summary: [R] Update "writing bindings" dev docs based on user feedback Key: ARROW-15279 URL: https://issues.apache.org/jira/browse/ARROW-15279 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Fix For: 7.0.0 I would add two comments for the article on [Writing bindings|https://ursalabs.org/arrow-r-nightly/articles/developers/bindings.html#writing-bindings] : * in [Step -1|https://ursalabs.org/arrow-r-nightly/articles/developers/bindings.html#step-1---add-unit-tests] I suggest to add that {{compare_dplyr_binding()}} and {{compare_dplyr_error()}} can be found in {{arrow/r/tests/testthat/helper-expectation.R}} * Due to [ARROW-15010|https://github.com/apache/arrow/pull/11904] Step 3b should be corrected {{nse_funcs$startsWith ...}} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15254) [C++] Ability to skip CSV footer when reading in dataset
Nicola Crane created ARROW-15254: Summary: [C++] Ability to skip CSV footer when reading in dataset Key: ARROW-15254 URL: https://issues.apache.org/jira/browse/ARROW-15254 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Nicola Crane In ARROW-15252 a user reports wanting to be able to skip the final row of a CSV (the footer) when reading in a dataset of CSVs - is this possible to implement? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15145) [R] test-r-minimal-build fails due to updated error message
Nicola Crane created ARROW-15145: Summary: [R] test-r-minimal-build fails due to updated error message Key: ARROW-15145 URL: https://issues.apache.org/jira/browse/ARROW-15145 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane In ARROW-15047, the error messaging in {{read_compressed_error()}} was updated to be more user-friendly - the corresponding unit test (named "Error messages are shown when the compression algorithm lz4 is not found") needs updating to reflect this change -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15128) [C++] segfault when writing CSV from RecordBatchReader
Nicola Crane created ARROW-15128: Summary: [C++] segfault when writing CSV from RecordBatchReader Key: ARROW-15128 URL: https://issues.apache.org/jira/browse/ARROW-15128 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Nicola Crane I'm currently trying to implement functionality in R so that we can open a dataset and then write to a CSV file, but I'm getting a segfault when I run my tests: {code:r} tbl <- tibble::tibble( dbl = c(1:8, NA, 10) + .1, lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE), false = logical(10), chr = letters[c(1:5, NA, 7:10)] ) make_temp_dir <- function() { path <- tempfile() dir.create(path) normalizePath(path, winslash = "/") } data_dir <- make_temp_dir() write_dataset(tbl, data_dir, partitioning = "lgl") data_in <- open_dataset(data_dir) csv_file <- tempfile() tbl_out <- write_csv_arrow(data_in, csv_file) {code} {code:java} Thread 1 "R" received signal SIGSEGV, Segmentation fault. 0x7fffee51fdd7 in __gnu_cxx::__exchange_and_add (__mem=0xe9, __val=-1) at /usr/include/c++/9/ext/atomicity.h:49 49{ return __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); } {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15103) [Documentation][C++] Error building docs: "arrow/cpp/src/arrow/csv/options.h:182: error: Found unknown command '\r' "
Nicola Crane created ARROW-15103: Summary: [Documentation][C++] Error building docs: "arrow/cpp/src/arrow/csv/options.h:182: error: Found unknown command '\r' " Key: ARROW-15103 URL: https://issues.apache.org/jira/browse/ARROW-15103 Project: Apache Arrow Issue Type: Improvement Components: C++, Documentation Reporter: Nicola Crane I am trying to build the docs, following the instructions at https://arrow.apache.org/docs/developers/documentation.html However, after running {{pip install -r docs/requirements.txt}} and then going to {{cpp/apidoc}} and running {{doxygen}} I get the following error: {code:java} warning: ignoring unsupported tag 'HTML_FORMULA_FORMAT' at line 1537, file Doxyfile /home/nic2/arrow/cpp/src/arrow/csv/options.h:182: error: Found unknown command '\r' (warning treated as error, aborting now) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15040) [R] Enable write_csv_arrow to take a RecordBatchReader as input
Nicola Crane created ARROW-15040: Summary: [R] Enable write_csv_arrow to take a RecordBatchReader as input Key: ARROW-15040 URL: https://issues.apache.org/jira/browse/ARROW-15040 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Currently, this code fails: {code:r} dataset <- open_dataset("some/folder/with/parquet/files") write_csv_arrow(dataset, sink = "dataset.csv") {code} with this error message: {code:r} Error: x must be an object of class 'data.frame', 'RecordBatch', or 'Table', not 'FileSystemDataset'. {code} In ARROW-14741, support was added for reading from a RecordBatchReader, so we should be able to now extend {{write_csv_arrow()}} to allow this behaviour. -- This message was sent by Atlassian Jira (v8.20.1#820001)
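In the meantime, a workaround sketch is to materialise the dataset before writing, which loses the streaming behaviour this ticket is asking for:

{code:r}
library(arrow)
dataset <- open_dataset("some/folder/with/parquet/files")
# collect() pulls the whole dataset into memory as a data.frame,
# which write_csv_arrow() already accepts.
write_csv_arrow(dplyr::collect(dataset), sink = "dataset.csv")
{code}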
[jira] [Created] (ARROW-15022) [R] install vignette and installation dev vignette need alt text for images
Nicola Crane created ARROW-15022: Summary: [R] install vignette and installation dev vignette need alt text for images Key: ARROW-15022 URL: https://issues.apache.org/jira/browse/ARROW-15022 Project: Apache Arrow Issue Type: Improvement Reporter: Nicola Crane Fix For: 7.0.0 The installation docs have been updated recently, with images added, but there is no alt text to accompany them. Alt text should be added to all images, and extra text should be added to the flowchart describing installation on Windows, given that it is too complex for a simple alt text description. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14989) [R] Update num_rows methods to output doubles not integers to prevent integer overflow
Nicola Crane created ARROW-14989: Summary: [R] Update num_rows methods to output doubles not integers to prevent integer overflow Key: ARROW-14989 URL: https://issues.apache.org/jira/browse/ARROW-14989 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane In cases where Arrow objects are particularly large, this can result in an integer overflow when returning their size. See discussion on https://github.com/apache/arrow/pull/11783 for more details of a possible solution. {code:r} library(arrow) test_array1 <- Array$create(raw(2^31 - 1)) test_array2 <- Array$create(raw(1)) big_chunked <- chunked_array(test_array1, test_array2) big_table <- Table$create(col = big_chunked) big_table$num_rows # NA {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-14988) [R] Improve source build experience
Nicola Crane created ARROW-14988: Summary: [R] Improve source build experience Key: ARROW-14988 URL: https://issues.apache.org/jira/browse/ARROW-14988 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Nicola Crane Assignee: Nicola Crane
* We should make ARROW_DEPENDENCY_SOURCE=AUTO the default and then document how to install the dependencies using apt/yum (such that you can); that will speed up source builds
* In the default case where they aren't downloading a binary, we could advertise:
** For a faster installation, set the environment variable LIBARROW_BINARY=true before installing, or something like that. I think that wouldn't be against CRAN policy
* We could also message more loudly in the default source build that not all features are enabled; set LIBARROW_MINIMAL=false and reinstall if you need them
-- This message was sent by Atlassian Jira (v8.20.1#820001)
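The environment variables mentioned above can be set from within R before installing; a sketch (these variables are the ones documented in the package's installation vignette):

{code:r}
# Prefer a prebuilt libarrow binary where one is available:
Sys.setenv(LIBARROW_BINARY = "true")

# Or, for a source build with the optional features enabled:
Sys.setenv(LIBARROW_MINIMAL = "false")

install.packages("arrow")
{code}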