[jira] [Created] (ARROW-18429) [R] Bump dev version following 10.0.1 patch release

2022-12-08 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18429:


 Summary: [R] Bump dev version following 10.0.1 patch release
 Key: ARROW-18429
 URL: https://issues.apache.org/jira/browse/ARROW-18429
 Project: Apache Arrow
  Issue Type: Bug
  Components: Continuous Integration, R
Reporter: Nicola Crane
Assignee: Nicola Crane
 Fix For: 11.0.0


CI job fails with:


{code:java}
   Insufficient package version (submitted: 10.0.0.9000, existing: 10.0.1)
  Version contains large components (10.0.0.9000)
{code}


https://github.com/apache/arrow/actions/runs/3639669477/jobs/6145488845#step:10:567



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18416) [R] Update NEWS for 10.0.1

2022-11-29 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18416:


 Summary: [R] Update NEWS for 10.0.1
 Key: ARROW-18416
 URL: https://issues.apache.org/jira/browse/ARROW-18416
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
Assignee: Nicola Crane








[jira] [Created] (ARROW-18415) [R] Update R package README to reference GH Issues

2022-11-29 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18415:


 Summary: [R] Update R package README to reference GH Issues
 Key: ARROW-18415
 URL: https://issues.apache.org/jira/browse/ARROW-18415
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


The R package README should be updated to refer to GH Issues for users who 
don't have a JIRA account





[jira] [Created] (ARROW-18403) [C++] Error consuming Substrait plan which uses count function: "only unary aggregate functions are currently supported"

2022-11-24 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18403:


 Summary: [C++] Error consuming Substrait plan which uses count 
function: "only unary aggregate functions are currently supported"
 Key: ARROW-18403
 URL: https://issues.apache.org/jira/browse/ARROW-18403
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Nicola Crane


ARROW-17523 added support for the Substrait extension function "count", but 
when I write code which produces a Substrait plan that calls it, and then try 
to run it in Acero, I get an error.

The plan:

{code:r}
message of type 'substrait.Plan' with 3 fields set
extension_uris {
  extension_uri_anchor: 1
  uri: 
"https://github.com/substrait-io/substrait/blob/main/extensions/functions_arithmetic.yaml"
}
extension_uris {
  extension_uri_anchor: 2
  uri: 
"https://github.com/substrait-io/substrait/blob/main/extensions/functions_comparison.yaml"
}
extension_uris {
  extension_uri_anchor: 3
  uri: 
"https://github.com/substrait-io/substrait/blob/main/extensions/functions_aggregate_generic.yaml"
}
extensions {
  extension_function {
extension_uri_reference: 3
function_anchor: 2
name: "count"
  }
}
relations {
  rel {
aggregate {
  input {
project {
  common {
emit {
  output_mapping: 9
  output_mapping: 10
  output_mapping: 11
  output_mapping: 12
  output_mapping: 13
  output_mapping: 14
  output_mapping: 15
  output_mapping: 16
  output_mapping: 17
}
  }
  input {
read {
  base_schema {
names: "int"
names: "dbl"
names: "dbl2"
names: "lgl"
names: "false"
names: "chr"
names: "verses"
names: "padded_strings"
names: "some_negative"
struct_ {
  types {
i32 {
  nullability: NULLABILITY_NULLABLE
}
  }
  types {
fp64 {
  nullability: NULLABILITY_NULLABLE
}
  }
  types {
fp64 {
  nullability: NULLABILITY_NULLABLE
}
  }
  types {
bool_ {
  nullability: NULLABILITY_NULLABLE
}
  }
  types {
bool_ {
  nullability: NULLABILITY_NULLABLE
}
  }
  types {
string {
  nullability: NULLABILITY_NULLABLE
}
  }
  types {
string {
  nullability: NULLABILITY_NULLABLE
}
  }
  types {
string {
  nullability: NULLABILITY_NULLABLE
}
  }
  types {
fp64 {
  nullability: NULLABILITY_NULLABLE
}
  }
}
  }
  local_files {
items {
  uri_file: "file:///tmp/RtmpsBsoZJ/file1915f604cff4a"
  parquet {
  }
}
  }
}
  }
  expressions {
selection {
  direct_reference {
struct_field {
}
  }
  root_reference {
  }
}
  }
  expressions {
selection {
  direct_reference {
struct_field {
  field: 1
}
  }
  root_reference {
  }
}
  }
  expressions {
selection {
  direct_reference {
struct_field {
  field: 2
}
  }
  root_reference {
  }
}
  }
  expressions {
selection {
  direct_reference {
struct_field {
  field: 3
}
  }
  root_reference {
  }
}
  }
  expressions {
selection {
  direct_reference {
struct_field {
  field: 4
}
  }
  root_reference {
  }
}
  }
  expressions {
    selection {
      direct_reference {
...
{code}
[jira] [Created] (ARROW-18393) [Docs][R] Include warning when viewing old docs (redirecting to stable docs)

2022-11-23 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18393:


 Summary: [Docs][R] Include warning when viewing old docs 
(redirecting to stable docs)
 Key: ARROW-18393
 URL: https://issues.apache.org/jira/browse/ARROW-18393
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Joris Van den Bossche
Assignee: Alenka Frim


Now that we have versioned docs, we also host old versions of the developer 
docs (eg 
https://arrow.apache.org/docs/9.0/developers/guide/communication.html). Those 
might be outdated (eg regarding communication channels, build instructions, 
etc), and when contributing to or developing against the latest Arrow, one 
should _always_ check the latest dev version of the contributing docs.

We could add a warning box pointing this out and linking to the dev docs. 

For example, similarly to how some projects warn about viewing old docs in 
general and point to the stable docs (eg https://mne.tools/1.1/index.html or 
https://scikit-learn.org/1.0/user_guide.html). In this case we could have a 
custom box on pages under /developers that points to the dev docs instead of 
the stable docs.





[jira] [Created] (ARROW-18391) [R] Fix the version selector dropdown

2022-11-23 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18391:


 Summary: [R] Fix the version selector dropdown
 Key: ARROW-18391
 URL: https://issues.apache.org/jira/browse/ARROW-18391
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
Assignee: Nicola Crane


ARROW-17887 updates the docs to use Bootstrap 5, which will break the docs 
version dropdown selector: it relies on replacing a page element, and the 
page elements are different in this version of Bootstrap.





[jira] [Created] (ARROW-18358) [R] Implement new function open_dataset_csv with signature more closely matching read_csv_arrow

2022-11-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18358:


 Summary: [R] Implement new function open_dataset_csv with 
signature more closely matching read_csv_arrow
 Key: ARROW-18358
 URL: https://issues.apache.org/jira/browse/ARROW-18358
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane


In order to make the transition between the different CSV reading functions as 
smooth as possible, we could introduce a version of open_dataset specifically 
for reading CSVs, with a signature more closely matching that of 
read_csv_arrow. This would just pass the arguments through to open_dataset 
(via the ellipsis), but would make it simpler to have a docs page showing 
these options explicitly, and thus be clearer for users.
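
A minimal sketch of what such a wrapper might look like (the name comes from 
the ticket summary; the argument handling shown is an assumption, not a final 
signature):

{code:r}
# Hypothetical sketch only: a thin CSV-specific wrapper around open_dataset().
open_dataset_csv <- function(sources, ..., delim = ",") {
  arrow::open_dataset(sources, format = "csv", delim = delim, ...)
}
{code}

The real version would spell out the read_csv_arrow arguments explicitly so 
that they can be documented on their own page.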






[jira] [Created] (ARROW-18357) [R] support parse_options, read_options, convert_options in open_dataset to mirror read_csv_arrow

2022-11-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18357:


 Summary: [R] support parse_options, read_options, convert_options 
in open_dataset to mirror read_csv_arrow
 Key: ARROW-18357
 URL: https://issues.apache.org/jira/browse/ARROW-18357
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane


The {{read_csv_arrow()}} function allows users to pass in options via its 
parse_options, convert_options, and read_options parameters.  We could accept 
these in {{open_dataset()}} too, to let users more easily switch between 
{{read_csv_arrow()}} and {{open_dataset()}}.
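
To illustrate, a call might eventually look like this (hedged: 
{{open_dataset()}} does not accept these parameters today, though the option 
classes shown already exist in the R package):

{code:r}
# Hypothetical usage once these parameters are supported:
ds <- open_dataset(
  "data_dir",
  format = "csv",
  parse_options   = CsvParseOptions$create(delimiter = ";"),
  convert_options = CsvConvertOptions$create(strings_can_be_null = TRUE),
  read_options    = CsvReadOptions$create(skip_rows = 1)
)
{code}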





[jira] [Created] (ARROW-18356) [R] Handle as_data_frame argument if passed into open_dataset for CSVs

2022-11-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18356:


 Summary: [R] Handle as_data_frame argument if passed into 
open_dataset for CSVs
 Key: ARROW-18356
 URL: https://issues.apache.org/jira/browse/ARROW-18356
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane


Currently, if the argument {{as_data_frame}} is passed into {{open_dataset()}} 
with a CSV format dataset, the error message returned is:

{code:r}
Error: The following option is supported in "read_delim_arrow" functions but 
not yet supported here: "as_data_frame"
{code}

Instead, we could silently ignore it if as_data_frame is set to {{FALSE}} and 
give a more helpful error if set to {{TRUE}} (i.e. direct the user to call 
{{as.data.frame()}} or {{collect()}}).

Reasoning: it'd be great to get to a point where users can just swap their 
{{read_csv_arrow()}} syntax for {{open_dataset()}} and get helpful results.
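
A minimal sketch of the suggested behaviour (function name and message wording 
are placeholders, not an actual implementation):

{code:r}
# Hypothetical sketch of the proposed handling (names are placeholders):
handle_as_data_frame <- function(as_data_frame = FALSE) {
  if (isFALSE(as_data_frame)) {
    # silently ignore: open_dataset() never returns a data frame directly
    return(invisible(NULL))
  }
  stop(
    "`as_data_frame = TRUE` is not supported by open_dataset(); ",
    "call collect() or as.data.frame() on the result instead."
  )
}
{code}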





[jira] [Created] (ARROW-18355) [R] support the quoted_na argument in open_dataset for CSVs by mapping it to CSVConvertOptions$strings_can_be_null

2022-11-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18355:


 Summary: [R] support the quoted_na argument in open_dataset for 
CSVs by mapping it to CSVConvertOptions$strings_can_be_null
 Key: ARROW-18355
 URL: https://issues.apache.org/jira/browse/ARROW-18355
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane
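
The summary describes the whole change; as an illustration, the mapping could 
look like this (the function name is a placeholder, not part of the ticket):

{code:r}
# Hypothetical sketch: translate readr's quoted_na argument into Arrow's
# strings_can_be_null option when building the CSV convert options.
map_quoted_na <- function(quoted_na = TRUE) {
  CsvConvertOptions$create(strings_can_be_null = quoted_na)
}
{code}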








[jira] [Created] (ARROW-18354) [R] Better document the CSV read/parse/convert options we can use with open_dataset()

2022-11-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18354:


 Summary: [R] Better document the CSV read/parse/convert options we 
can use with open_dataset()
 Key: ARROW-18354
 URL: https://issues.apache.org/jira/browse/ARROW-18354
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane


When a user opens a CSV dataset using open_dataset, they can take advantage of 
many different options which can be specified via {{CsvReadOptions$create()}} 
etc.

However, as they are passed in via the ellipsis ({{...}}) argument, it's not 
particularly clear to users which arguments are supported.  They are not 
documented in the {{open_dataset()}} docs, and things are further confused 
(see the code for {{CsvFileFormat$create()}}) by the fact that we support a 
mix of Arrow and readr parameters.

We should better document the arguments we do support.







[jira] [Created] (ARROW-18352) [R] Datasets API interface improvements

2022-11-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18352:


 Summary: [R] Datasets API interface improvements
 Key: ARROW-18352
 URL: https://issues.apache.org/jira/browse/ARROW-18352
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Umbrella ticket for improvements for our interface to the datasets API, and 
making the experience more consistent between {{open_dataset()}} and the 
{{read_*()}} functions.





[jira] [Created] (ARROW-18266) [R] Make it more obvious how to read in a Parquet file with a different schema to the inferred one

2022-11-07 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18266:


 Summary: [R] Make it more obvious how to read in a Parquet file 
with a different schema to the inferred one
 Key: ARROW-18266
 URL: https://issues.apache.org/jira/browse/ARROW-18266
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


It's not all that clear from our docs that if we want to read in a Parquet file 
and change the schema, we need to call the {{cast()}} method on the Table, e.g. 

{code:r}
# Write out data
data <- tibble::tibble(x = c(letters[1:5], NA), y = 1:6)
data_with_schema <- arrow_table(data, schema = schema(x = string(), y = 
int64()))
write_parquet(data_with_schema, "data_with_schema.parquet")

# Read in data while specifying a schema
data_in <- read_parquet("data_with_schema.parquet", as_data_frame = FALSE)  
data_in$cast(target_schema = schema(x = string(), y = int32()))
{code}

We should document this more clearly. Perhaps we could even update the code 
here to do some of this automatically if a schema is passed to the {{...}} 
argument of {{read_parquet}} _and_ the returned data doesn't match the 
desired schema?





[jira] [Created] (ARROW-18263) [R] Error when trying to write POSIXlt data to CSV

2022-11-07 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18263:


 Summary: [R] Error when trying to write POSIXlt data to CSV
 Key: ARROW-18263
 URL: https://issues.apache.org/jira/browse/ARROW-18263
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


I get an error trying to write a tibble of POSIXlt data to a file.  The error 
is a bit misleading as it refers to the column being of length 0.

{code:r}
posixlt_data <- tibble::tibble(x = as.POSIXlt(Sys.time()))
write_csv_arrow(posixlt_data, "posixlt_data.csv")
{code}


{code:r}
Error: Invalid: Unsupported Type:POSIXlt of length 0
{code}






[jira] [Created] (ARROW-18236) [R] Improve error message when providing a mix of readr and Arrow options

2022-11-03 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18236:


 Summary: [R] Improve error message when providing a mix of readr 
and Arrow options
 Key: ARROW-18236
 URL: https://issues.apache.org/jira/browse/ARROW-18236
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane


I was trying to solve a user issue today and tried to run the following code:

{code:r}
df = tibble(x = c("a","b",  ""  , "d"))
write_tsv(df, "data.tsv")
open_dataset("data.tsv", format="tsv", skip_rows=1, schema=schema(x=string()), 
skip_empty_rows = TRUE) %>%
  collect()
{code}

 which gives me the error

{code:r}
Error: Use either Arrow parse options or readr parse options, not both
{code}

which is somewhat obnoxious, as no context is provided to indicate which 
options are being referred to or what the possible options are.

Also, why can't we have a mix of both? This is a totally valid use case.  
I think both a code update and a more informative error message are needed here.






[jira] [Created] (ARROW-18216) [R] Better error message when creating an array from decimals

2022-11-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18216:


 Summary: [R] Better error message when creating an array from 
decimals
 Key: ARROW-18216
 URL: https://issues.apache.org/jira/browse/ARROW-18216
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane


We should first check why this doesn't work, and see whether we can fix the 
underlying problem rather than just the error message

{code:r}
> ChunkedArray$create(c(1.4, 525.5), type = decimal(precision = 1, scale = 3))
Error: NotImplemented: Extend
{code}






[jira] [Created] (ARROW-18215) [R] User experience improvements

2022-11-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18215:


 Summary: [R] User experience improvements
 Key: ARROW-18215
 URL: https://issues.apache.org/jira/browse/ARROW-18215
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Umbrella ticket to collect tickets relating to improving error messages and 
general dev-experience tweaks





[jira] [Created] (ARROW-18200) [R] Misleading error message if opening CSV dataset with invalid file in directory

2022-10-31 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18200:


 Summary: [R] Misleading error message if opening CSV dataset with 
invalid file in directory
 Key: ARROW-18200
 URL: https://issues.apache.org/jira/browse/ARROW-18200
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


I made a mistake before where I thought a dataset contained CSVs which were, in 
fact, Parquet files, but the error message I got was super unhelpful

{code:r}
library(arrow)

download.file(
  url = 
"https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
  destfile = here::here("data/nyc-taxi-tiny.zip")
)
 # (unzip the zip file into the data directory but don't delete it after)

open_dataset("data", format = "csv")
{code}


{code:r}
Error in nchar(x) : invalid multibyte string, element 1
In addition: Warning message:
In grepl("No match for FieldRef.Name(__filename)", msg, fixed = TRUE) :
  input string 1 is invalid in this locale
{code}

Note, this only occurs with {{format="csv"}}; omitting this argument (i.e. 
using the default of {{format="parquet"}}) leaves us with the much better error:


{code:r}
Error in `open_dataset()`:
! Invalid: Error creating dataset. Could not read schema from 
'/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Could not open Parquet 
input source '/home/nic2/arrow_10_twitter/data/nyc-taxi-tiny.zip': Parquet 
magic bytes not found in footer. Either the file is corrupted or this is not a 
parquet file.
/home/nic2/arrow/cpp/src/arrow/dataset/file_parquet.cc:338  GetReader(source, 
scan_options). Is this a 'parquet' file?
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:44  
InspectSchemas(std::move(options))
/home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:265  
Inspect(options.inspect_options)
ℹ Did you mean to specify a 'format' other than the default (parquet)?
{code}







[jira] [Created] (ARROW-18199) [R] Misleading error message in query using across()

2022-10-31 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18199:


 Summary: [R] Misleading error message in query using across()
 Key: ARROW-18199
 URL: https://issues.apache.org/jira/browse/ARROW-18199
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


Error handling looks like it's happening in the wrong place - a comma has been 
missed in the {{select()}} call, but it wrongly appears to be an issue with 
{{across()}}.  Can we do something to make this not happen?

{code:r}
download.file(
  url = 
"https://github.com/djnavarro/arrow-user2022/releases/download/v0.1/nyc-taxi-tiny.zip",
  destfile = here::here("data/nyc-taxi-tiny.zip")
)

library(arrow)
library(dplyr)

open_dataset("data") %>%
  select(pickup_datetime, pickup_longitude, pickup_latitude 
ends_with("amount")) %>%
  mutate(across(ends_with("amount"), ~.x * 0.87, .names = "{.col}_gbp")) %>%
  collect()
{code}


{code:r}
Error in `across()`:
! Must be used inside dplyr verbs.
Run `rlang::last_error()` to see where the error occurred.
{code}







[jira] [Created] (ARROW-18181) [R] CSV Reader Improvements

2022-10-27 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18181:


 Summary: [R] CSV Reader Improvements
 Key: ARROW-18181
 URL: https://issues.apache.org/jira/browse/ARROW-18181
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Umbrella ticket for tickets relating to CSV reader improvements in R.





[jira] [Created] (ARROW-18180) [R] GCS Improvements

2022-10-27 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18180:


 Summary: [R] GCS Improvements
 Key: ARROW-18180
 URL: https://issues.apache.org/jira/browse/ARROW-18180
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-18079) [R] Performance regressions after ARROW-12105

2022-10-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18079:


 Summary: [R] Performance regressions after ARROW-12105
 Key: ARROW-18079
 URL: https://issues.apache.org/jira/browse/ARROW-18079
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


The functionality implemented in ARROW-12105 introduced some performance 
regressions that we should sort out before the release.





[jira] [Created] (ARROW-18062) [R] error in CI jobs for R 3.5 and 3.6 when R package being installed

2022-10-14 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18062:


 Summary: [R] error in CI jobs for R 3.5 and 3.6 when R package 
being installed
 Key: ARROW-18062
 URL: https://issues.apache.org/jira/browse/ARROW-18062
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


e.g. 
https://github.com/ursacomputing/crossbow/actions/runs/3246698242/jobs/5325752692#step:5:3164

From the install logs on that CI job:
{code}
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
Error: package or namespace load failed for ‘arrow’:
 .onLoad failed in loadNamespace() for 'arrow', details:
  call: fun_cache[[unqualified_name]] <- fun
  error: invalid type/length (closure/0) in vector allocation
Error: loading failed
{code}

The nightlies for R 3.5 and 3.6 are currently failing with this error.

The line of code where the error comes from was added in ARROW-16444, but 
seeing as that was 3 months ago, it seems unlikely that this change introduced 
the error.





[jira] [Created] (ARROW-18057) [R] test for slice functions fail on builds without Datasets capability

2022-10-14 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18057:


 Summary: [R] test for slice functions fail on builds without 
Datasets capability
 Key: ARROW-18057
 URL: https://issues.apache.org/jira/browse/ARROW-18057
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


The changes in ARROW-13766 introduced a test which depends on the Datasets 
functionality being enabled - we should skip it on CI builds where it is not.





[jira] [Created] (ARROW-18049) [R] Support column renaming in col_select argument to file reading functions

2022-10-14 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18049:


 Summary: [R] Support column renaming in col_select argument to 
file reading functions
 Key: ARROW-18049
 URL: https://issues.apache.org/jira/browse/ARROW-18049
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


We should support the ability to rename columns when reading in data via the 
CSV/Parquet/Feather/JSON file readers.

We currently have an argument {{col_select}}, which allows users to choose 
which columns to read in, but renaming doesn't work.  

To implement this, we'd need to check whether any columns have been renamed 
via {{col_select}} and then update the schema of the object being returned 
once the file has been read.

{code:r}

library(readr)
library(arrow)
readr::read_csv(readr_example("mtcars.csv"), col_select = c(not_hp = hp))
#> # A tibble: 32 × 1
#>    not_hp
#>
#>  1    110
#>  2    110
#>  3     93
#>  4    110
#>  5    175
#>  6    105
#>  7    245
#>  8     62
#>  9     95
#> 10    123
#> # … with 22 more rows
arrow::read_csv_arrow(readr_example("mtcars.csv"), col_select = c(not_hp = hp))
#> # A tibble: 32 × 1
#>        hp
#>
#>  1    110
#>  2    110
#>  3     93
#>  4    110
#>  5    175
#>  6    105
#>  7    245
#>  8     62
#>  9     95
#> 10    123
#> # … with 22 more rows
{code}
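
One possible approach, sketched below (the helper is hypothetical; a real fix 
would live inside the reader functions after tidyselect has resolved 
col_select):

{code:r}
# Hypothetical sketch: apply renames requested via col_select after the file
# has been read into a tibble. `sel` maps new names to existing column names,
# e.g. c(not_hp = "hp").
apply_col_select_renames <- function(df, sel) {
  idx <- match(unname(sel), names(df))
  names(df)[idx] <- names(sel)
  df
}
{code}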






[jira] [Created] (ARROW-18043) [R] Properly instantiate empty arrays of extension types in Table__from_schema

2022-10-13 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-18043:


 Summary: [R] Properly instantiate empty arrays of extension types 
in Table__from_schema
 Key: ARROW-18043
 URL: https://issues.apache.org/jira/browse/ARROW-18043
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


The PR for ARROW-12105 introduces the function Table__from_schema, which 
creates an empty Table from a Schema object.  Currently it can't handle 
extension types, and instead just returns null type objects.





[jira] [Created] (ARROW-17987) [R] Warning message when building Arrow

2022-10-11 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17987:


 Summary: [R] Warning message when building Arrow
 Key: ARROW-17987
 URL: https://issues.apache.org/jira/browse/ARROW-17987
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


I just got the following message when I rebuilt Arrow after pulling from a 
different fork:

{code:r}
Warning message:
Failed to enable user cancellation: Signal stop source already set up 
{code}

I'm not sure exactly what it is or how to reproduce it (it disappeared after I 
restarted my R session), but we might want to check that end users won't end up 
seeing this?  





[jira] [Created] (ARROW-17986) [R] native type checking in where()

2022-10-11 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17986:


 Summary: [R] native type checking in where()
 Key: ARROW-17986
 URL: https://issues.apache.org/jira/browse/ARROW-17986
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


The {{where()}} implementation in ARROW-12105 requires simulating a tibble 
from an Arrow Schema.  Could we have a version of this which allows native 
type checks, such as {{is_int32()}} or {{is_decimal()}}?
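
A sketch of what a native check might look like (entirely hypothetical; the 
predicate would need to receive the Field from the Schema rather than a 
simulated column):

{code:r}
# Hypothetical sketch: a type predicate evaluated against the Schema directly,
# avoiding the simulated tibble.
is_int32 <- function(field) field$type == int32()
{code}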





[jira] [Created] (ARROW-17948) [R] arrow_eval user-defined generic functions

2022-10-06 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17948:


 Summary: [R] arrow_eval user-defined generic functions  
 Key: ARROW-17948
 URL: https://issues.apache.org/jira/browse/ARROW-17948
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


ARROW-14071 covers evaluating user-defined functions, but once this is 
implemented, would it be possible to evaluate generics?  Here's an example of 
how that works in dplyr from a [Stack Overflow 
question|https://stackoverflow.com/questions/73950714/is-it-possible-to-use-generics-in-apache-arrow]:


{code:r}
library(dplyr)

df <- data.frame(a = c("these", "are", "some", "strings"),
 b = 1:4)

boop <- function(x, ...) UseMethod("boop", x)

boop.numeric <- function(x) mean(x, na.rm = TRUE)

boop.character <- function(x) mean(nchar(x), na.rm = TRUE)

df %>% summarise(across(everything(), boop))
{code}






[jira] [Created] (ARROW-17911) [R] Implement `across()` within `transmute()`

2022-10-02 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17911:


 Summary: [R] Implement `across()` within `transmute()`
 Key: ARROW-17911
 URL: https://issues.apache.org/jira/browse/ARROW-17911
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane








[jira] [Created] (ARROW-17895) [R] Implement dplyr::across()

2022-09-29 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17895:


 Summary: [R] Implement dplyr::across()
 Key: ARROW-17895
 URL: https://issues.apache.org/jira/browse/ARROW-17895
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Nicola Crane


Umbrella ticket for implementing {{across()}}





[jira] [Created] (ARROW-17784) [C++] Opening a dataset where partitioning variable is in the dataset should error differently

2022-09-20 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17784:


 Summary: [C++] Opening a dataset where partitioning variable is in 
the dataset should error differently
 Key: ARROW-17784
 URL: https://issues.apache.org/jira/browse/ARROW-17784
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Nicola Crane


The error message given when the name supplied to the partitioning matches a 
field already in the dataset is a bit misleading - can we catch this earlier 
and give a different error message?


{code:r}
library(dplyr)
library(arrow)

tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl", hive_style = FALSE)
# The schema fed into `partitioning` should refer to `cyl` and not `wt`, but
# the error message doesn't refer to the duplication here
open_dataset(tf, partitioning = schema(wt = int64())) %>% collect()
#> Error in `open_dataset()`:
#> ! Invalid: Unable to merge: Field wt has incompatible types: double vs int64
#> /home/nic2/arrow/cpp/src/arrow/type.cc:1692  fields_[i]->MergeWith(field)
#> /home/nic2/arrow/cpp/src/arrow/type.cc:1755  AddField(field)
#> /home/nic2/arrow/cpp/src/arrow/type.cc:1826  builder.AddSchema(schema)
#> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:262  
Inspect(options.inspect_options)
{code}






[jira] [Created] (ARROW-17700) [R] Can't open CSV dataset with partitioning and a schema

2022-09-13 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17700:


 Summary: [R] Can't open CSV dataset with partitioning and a schema
 Key: ARROW-17700
 URL: https://issues.apache.org/jira/browse/ARROW-17700
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


I feel like this might be a duplicate of a previous ticket, but can't find it.


{code:r}
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#> filter, lag
#> The following objects are masked from 'package:base':
#> 
#> intersect, setdiff, setequal, union
library(arrow)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for 
more information.
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#> timestamp

# all good!
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, format = "csv")
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

# all good
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv")
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 11
#>      mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb   cyl
#>
#>  1  22.8  108     93  3.85  2.32  18.6     1     1     4     1     4
#>  2  24.4  147.    62  3.69  3.19  20       1     0     4     2     4
#>  3  22.8  141.    95  3.92  3.15  22.9     1     0     4     2     4
#>  4  32.4   78.7   66  4.08  2.2   19.5     1     1     4     1     4
#>  5  30.4   75.7   52  4.93  1.62  18.5     1     1     4     2     4
#>  6  33.9   71.1   65  4.22  1.84  19.9     1     1     4     1     4
#>  7  21.5  120.    97  3.7   2.46  20.0     1     0     3     1     4
#>  8  27.3   79     66  4.08  1.94  18.9     1     1     4     1     4
#>  9  26    120.    91  4.43  2.14  16.7     0     1     5     2     4
#> 10  30.4   95.1  113  3.77  1.51  16.9     1     1     5     2     4
#> # … with 22 more rows
list.files(tf)
#> [1] "cyl=4" "cyl=6" "cyl=8"

# hive-style=FALSE leads to no `cyl` column, which, sure, makes sense
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv", hive_style = FALSE)
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 10
#>  mpg  disphp  dratwt  qsecvsam  gear  carb
#> 
#>  1  22.8 108  93  3.85  2.32  18.6 1 1 4 1
#>  2  24.4 147. 62  3.69  3.19  20   1 0 4 2
#>  3  22.8 141. 95  3.92  3.15  22.9 1 0 4 2
#>  4  32.4  78.766  4.08  2.2   19.5 1 1 4 1
#>  5  30.4  75.752  4.93  1.62  18.5 1 1 4 2
#>  6  33.9  71.165  4.22  1.84  19.9 1 1 4 1
#>  7  21.5 120. 97  3.7   2.46  20.0 1 0 3 1
#>  8  27.3  79  66  4.08  1.94  18.9 1 1 4 1
#>  9  26   120. 91  4.43  2.14  16.7 0 1 5 2
#> 10  30.4  95.1   113  3.77  1.51  16.9 1 1 5 2
#> # … with 22 more rows
list.files(tf)
#> [1] "4" "6" "8"


# *but* if we try to add it in via a schema, it doesn't work

desired_schema <- schema(
  mpg = float64(), disp = float64(), hp = int64(), drat = float64(),
  wt = float64(), qsec = float64(), vs = int64(), am = int64(),
  gear = int64(), carb = int64(), cyl = int64()
)

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv", hive_style = FALSE)
open_dataset(tf, format = "csv", schema = desired_schema) %>% collect()
#> Error in `dplyr::collect()`:
#> ! Invalid: Could not open CSV input source 
'/tmp/RtmpnInOwc/file13f0d38c5b994/4/part-0.csv': Invalid: CSV parse error: Row 
#1: Expected 11 columns, got 10: 
"mpg","disp","hp","drat","wt","qsec","vs","am","gear","carb"
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:477  
(ParseLine(values_writer, parsed_writer, data, 
data_end, is_final, _end, bulk_filter))
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:566  
ParseChunk( _writer, _writer, data, data_end, 
is_final, 

[jira] [Created] (ARROW-17699) [R] Error message erroneously triggered when opening partitioned CSV dataset with schema

2022-09-13 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17699:


 Summary: [R] Error message erroneously triggered when opening 
partitioned CSV dataset with schema
 Key: ARROW-17699
 URL: https://issues.apache.org/jira/browse/ARROW-17699
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane



{code:r}

library(dplyr)

# all good!
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, format = "csv")
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 11
#>  mpg   cyl  disphp  dratwt  qsecvsam  gear  carb
#>  
#>  1  21   6  160110  3.9   2.62  16.5 0 1 4 4
#>  2  21   6  160110  3.9   2.88  17.0 0 1 4 4
#>  3  22.8 4  108 93  3.85  2.32  18.6 1 1 4 1
#>  4  21.4 6  258110  3.08  3.22  19.4 1 0 3 1
#>  5  18.7 8  360175  3.15  3.44  17.0 0 0 3 2
#>  6  18.1 6  225105  2.76  3.46  20.2 1 0 3 1
#>  7  14.3 8  360245  3.21  3.57  15.8 0 0 3 4
#>  8  24.4 4  147.62  3.69  3.19  20   1 0 4 2
#>  9  22.8 4  141.95  3.92  3.15  22.9 1 0 4 2
#> 10  19.2 6  168.   123  3.92  3.44  18.3 1 0 4 4
#> # … with 22 more rows

# all good
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv")
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 11
#>  mpg  disphp  dratwt  qsecvsam  gear  carb   cyl
#>  
#>  1  22.8 108  93  3.85  2.32  18.6 1 1 4 1 4
#>  2  24.4 147. 62  3.69  3.19  20   1 0 4 2 4
#>  3  22.8 141. 95  3.92  3.15  22.9 1 0 4 2 4
#>  4  32.4  78.766  4.08  2.2   19.5 1 1 4 1 4
#>  5  30.4  75.752  4.93  1.62  18.5 1 1 4 2 4
#>  6  33.9  71.165  4.22  1.84  19.9 1 1 4 1 4
#>  7  21.5 120. 97  3.7   2.46  20.0 1 0 3 1 4
#>  8  27.3  79  66  4.08  1.94  18.9 1 1 4 1 4
#>  9  26   120. 91  4.43  2.14  16.7 0 1 5 2 4
#> 10  30.4  95.1   113  3.77  1.51  16.9 1 1 5 2 4
#> # … with 22 more rows
list.files(tf)
#> [1] "cyl=4" "cyl=6" "cyl=8"

# hive-style=FALSE leads to no `cyl` column, which, sure, makes sense
tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv", hive_style = FALSE)
open_dataset(tf, format = "csv") %>% collect()
#> # A tibble: 32 × 10
#>  mpg  disphp  dratwt  qsecvsam  gear  carb
#> 
#>  1  22.8 108  93  3.85  2.32  18.6 1 1 4 1
#>  2  24.4 147. 62  3.69  3.19  20   1 0 4 2
#>  3  22.8 141. 95  3.92  3.15  22.9 1 0 4 2
#>  4  32.4  78.766  4.08  2.2   19.5 1 1 4 1
#>  5  30.4  75.752  4.93  1.62  18.5 1 1 4 2
#>  6  33.9  71.165  4.22  1.84  19.9 1 1 4 1
#>  7  21.5 120. 97  3.7   2.46  20.0 1 0 3 1
#>  8  27.3  79  66  4.08  1.94  18.9 1 1 4 1
#>  9  26   120. 91  4.43  2.14  16.7 0 1 5 2
#> 10  30.4  95.1   113  3.77  1.51  16.9 1 1 5 2
#> # … with 22 more rows
list.files(tf)
#> [1] "4" "6" "8"


# *but* if we try to add it in via a schema, it doesn't work

desired_schema <- schema(
  mpg = float64(), disp = float64(), hp = int64(), drat = float64(),
  wt = float64(), qsec = float64(), vs = int64(), am = int64(),
  gear = int64(), carb = int64(), cyl = int64()
)

tf <- tempfile()
dir.create(tf)
write_dataset(group_by(mtcars, cyl), tf, format = "csv", hive_style = FALSE)
open_dataset(tf, format = "csv", schema = desired_schema) %>% collect()
#> Error in `CsvFileFormat$create()`:
#> ! Values in `column_names` must match `schema` field names
#> ✖ `column_names` and `schema` field names match but are not in the same order
list.files(tf)
#> [1] "4" "6" "8"

{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17698) [R] Implement use of `where()` inside `across()`

2022-09-13 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17698:


 Summary: [R] Implement use of `where()` inside `across()`
 Key: ARROW-17698
 URL: https://issues.apache.org/jira/browse/ARROW-17698
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane
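
For reference, a minimal sketch (plain dplyr, not arrow) of the behaviour this 
ticket asks for — {{where()}} used inside {{across()}} to select columns by a 
predicate:

{code:r}
library(dplyr)

# Plain dplyr: transform every numeric column in one call.
# The request is for the same expression to work on an arrow_table() or dataset.
mtcars %>%
  mutate(across(where(is.numeric), ~ round(.x, digits = 1)))
{code}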






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17689) [R] Implement dplyr::across() inside group_by()

2022-09-12 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17689:


 Summary: [R] Implement dplyr::across() inside group_by()
 Key: ARROW-17689
 URL: https://issues.apache.org/jira/browse/ARROW-17689
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17690) [R] Implement dplyr::across() inside distinct()

2022-09-12 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17690:


 Summary: [R] Implement dplyr::across() inside distinct()
 Key: ARROW-17690
 URL: https://issues.apache.org/jira/browse/ARROW-17690
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17680) [Dev] More descriptive error output in merge script

2022-09-12 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17680:


 Summary: [Dev] More descriptive error output in merge script
 Key: ARROW-17680
 URL: https://issues.apache.org/jira/browse/ARROW-17680
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Nicola Crane


I've just updated to the newer version of the merge script, and something is 
going wrong; however, the error message I'm getting isn't super-helpful for 
working out what's happened: 

{code:java}
  File "/home/nic2/arrow_for_merging_prs_only/dev/merge_arrow_pr.py", line 539, 
in connect_jira
return jira.client.JIRA(options={'server': JIRA_API_BASE},
TypeError: __init__() got an unexpected keyword argument 'token_auth'
{code}

Is there some object we could just dump the output of, in cases of failure, so 
it provides a few more hints to work out what's gone wrong?  




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17654) [R] Add link to cookbook from README

2022-09-08 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17654:


 Summary: [R] Add link to cookbook from README
 Key: ARROW-17654
 URL: https://issues.apache.org/jira/browse/ARROW-17654
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17637) [R] as.Date fails going from timestamp[s

2022-09-06 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17637:


 Summary: [R] as.Date fails going from timestamp[s
 Key: ARROW-17637
 URL: https://issues.apache.org/jira/browse/ARROW-17637
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Nicola Crane






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17528) [R] Tidy up the pkgdown articles site

2022-08-25 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17528:


 Summary: [R] Tidy up the pkgdown articles site 
 Key: ARROW-17528
 URL: https://issues.apache.org/jira/browse/ARROW-17528
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


We could better organise the different articles we have to make it easier for 
users to find the right information.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17490) [R] Differing results in log bindings

2022-08-22 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17490:


 Summary: [R] Differing results in log bindings
 Key: ARROW-17490
 URL: https://issues.apache.org/jira/browse/ARROW-17490
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


We get different results from dplyr versus Acero if we call log on a column 
that contains 0, e.g.

{code:r}
library(arrow)
library(dplyr)

df <- tibble(x = 0:10)

df %>%
  mutate(y = log(x)) %>%
  collect()
#> # A tibble: 11 × 2
#>xy
#>
#>  1 0 -Inf
#>  2 10
#>  3 20.693
#>  4 31.10 
#>  5 41.39 
#>  6 51.61 
#>  7 61.79 
#>  8 71.95 
#>  9 82.08 
#> 10 92.20 
#> 11102.30

df %>%
  arrow_table() %>%
  mutate(y = log(x)) %>%
  collect()
#> Error in `collect()`:
#> ! Invalid: logarithm of zero
{code}

This is because R defines {{log(0)}} as {{-Inf}} whereas Acero defines it as an 
error.  Not sure what the solution is here; do we want to request the addition 
of an Acero option to define behaviour for this?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17489) [R] Nightly builds failing due to test referencing unreleased stringr functions

2022-08-22 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17489:


 Summary: [R] Nightly builds failing due to test referencing 
unreleased stringr functions
 Key: ARROW-17489
 URL: https://issues.apache.org/jira/browse/ARROW-17489
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Many of the nightly builds are failing (e.g. 
https://github.com/ursacomputing/crossbow/runs/7942883382?check_suite_focus=true#step:5:24666)
 due to a test which conditionally runs based on the version of stringr 
available.  This is due to an NSE function we have implemented which is only in 
the dev version of stringr, and which we expected to be included in the next 
release but was not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17445) [R] Add vignette on ExecPlans and how they work

2022-08-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17445:


 Summary: [R] Add vignette on ExecPlans and how they work
 Key: ARROW-17445
 URL: https://issues.apache.org/jira/browse/ARROW-17445
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


I've been working on a blog post to showcase the new {{show_exec_plan()}} 
function, but there's a lot of information that people think would make a good 
addition, and which would be better placed in a new vignette or pkgdown article.

There's sufficient R-related content here (i.e. about how {{show_exec_plan()}} 
works) that it's worth putting this in the R docs rather than the general docs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17387) [R] Implement dplyr::across() inside filter()

2022-08-11 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17387:


 Summary: [R] Implement dplyr::across() inside filter()
 Key: ARROW-17387
 URL: https://issues.apache.org/jira/browse/ARROW-17387
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane
 Fix For: 10.0.0


ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate().  
Once this is merged, we should also add the ability to do so within 
dplyr::filter().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17384) [R] Additional dplyr functionality

2022-08-11 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17384:


 Summary: [R] Additional dplyr functionality
 Key: ARROW-17384
 URL: https://issues.apache.org/jira/browse/ARROW-17384
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Umbrella ticket to collect together tickets relating to implementing additional 
dplyr verbs or unimplemented arguments for implemented verbs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17371) [R] Remove as.factor to dictionary_encode mapping

2022-08-10 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17371:


 Summary: [R] Remove as.factor to dictionary_encode mapping
 Key: ARROW-17371
 URL: https://issues.apache.org/jira/browse/ARROW-17371
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


There is an NSE func mapping between {{base::as.factor}} and Acero's 
{{dictionary_encode}}.  However, it doesn't work at present - see ARROW-12632: 
calling this function currently results in an error.  We should remove this 
mapping so that, instead of an error being raised, we fall back to calling 
{{as.factor}} in R rather than Acero.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17366) [R] Support purrr-style lambda functions in .fns argument to across()

2022-08-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17366:


 Summary: [R] Support purrr-style lambda functions in .fns argument 
to across()
 Key: ARROW-17366
 URL: https://issues.apache.org/jira/browse/ARROW-17366
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}.  The 
{{.fns}} argument does not yet support purrr-style lambda functions (e.g. 
{{~round(.x, digits = -1)}}) but support should be added.
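
For illustration, the purrr-style lambda form as it works in plain dplyr (the 
behaviour the ticket asks arrow to match):

{code:r}
library(dplyr)

# A purrr-style lambda: ~ round(.x, digits = -1) is shorthand for
# function(x) round(x, digits = -1)
mtcars %>%
  mutate(across(c(mpg, disp), ~ round(.x, digits = -1)))
{code}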



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17365) [R] Implement ... argument inside across()

2022-08-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17365:


 Summary: [R] Implement ... argument inside across()
 Key: ARROW-17365
 URL: https://issues.apache.org/jira/browse/ARROW-17365
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}.  The 
{{...}} argument is not yet supported but should be added.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17364) [R] Implement .names argument inside across()

2022-08-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17364:


 Summary: [R] Implement .names argument inside across()
 Key: ARROW-17364
 URL: https://issues.apache.org/jira/browse/ARROW-17364
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


ARROW-11699 adds support for {{dplyr::across}} inside a {{mutate()}}.  The 
{{.names}} argument is not yet supported but should be added.
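
For illustration, the {{.names}} argument as it works in plain dplyr (the 
behaviour the ticket asks arrow to support):

{code:r}
library(dplyr)

# .names controls output column names via a glue spec; here the transformed
# columns are added alongside the originals as mpg_rounded and disp_rounded
mtcars %>%
  mutate(across(c(mpg, disp), round, .names = "{.col}_rounded"))
{code}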



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17362) [R] Implement dplyr::across() inside summarise()

2022-08-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17362:


 Summary: [R] Implement dplyr::across() inside summarise()
 Key: ARROW-17362
 URL: https://issues.apache.org/jira/browse/ARROW-17362
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


ARROW-11699 adds the ability to call dplyr::across() inside dplyr::mutate().  
Once this is merged, we should also add the ability to do so within 
dplyr::summarise().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17356) [R] Update binding for add_filename() NSE function to error if used on Table

2022-08-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17356:


 Summary: [R] Update binding for add_filename() NSE function to 
error if used on Table
 Key: ARROW-17356
 URL: https://issues.apache.org/jira/browse/ARROW-17356
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


ARROW-15260 adds a function which allows the user to add the filename as an 
output field.  This function only makes sense to use with datasets and not 
tables.  Currently, the error generated from using it with a table is handled 
by {{handle_augmented_field_misuse()}}.  Instead, we should follow [one of the 
suggestions from the 
PR|https://github.com/apache/arrow/pull/12826#issuecomment-1192007298] to 
detect this when the function is called.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17355) [R] Refactor the handle_* utility functions for a better dev experience

2022-08-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17355:


 Summary: [R] Refactor the handle_* utility functions for a better 
dev experience
 Key: ARROW-17355
 URL: https://issues.apache.org/jira/browse/ARROW-17355
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


In ARROW-15260, the utility functions for handling different kinds of reading 
errors (handle_parquet_io_error, handle_csv_read_error, and 
handle_augmented_field_misuse) were refactored so that multiple ones could be 
chained together.  An issue with this is that, if these handlers are used 
without any uncaptured error being manually re-raised afterwards, other errors 
may be silently swallowed.  We should update the code to prevent this from 
being possible.





--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17102) [R] Test fails on test-r-offline-minimal nightly build

2022-07-18 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17102:


 Summary: [R] Test fails on test-r-offline-minimal nightly build
 Key: ARROW-17102
 URL: https://issues.apache.org/jira/browse/ARROW-17102
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
Assignee: Nicola Crane


May be due to a missing option to skip the test if Parquet support is not available.

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=29590=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=17703



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17054) [R] Creating an Array from an object bigger than 2^31 results in an Array of length 0

2022-07-12 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-17054:


 Summary: [R] Creating an Array from an object bigger than 2^31 
results in an Array of length 0
 Key: ARROW-17054
 URL: https://issues.apache.org/jira/browse/ARROW-17054
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Apologies for the lack of a proper reprex, but it crashes my session when I try 
to make one.

I'm working on ARROW-16977, which is about integer overflow in the reporting of 
object size, but this issue also affects object creation.

{code:r}
library(arrow, warn.conflicts = TRUE)

# works - creates a huge array, hurrah
big_logical <- vector(mode = "logical", length = .Machine$integer.max)
big_logical_array <- Array$create(big_logical)

length(big_logical)
## [1] 2147483647
length(big_logical_array)
## [1] 2147483647

# creates an array of length 0, boo!
too_big <- vector(mode = "logical", length = .Machine$integer.max + 1) 
too_big_array <- Array$create(too_big)

length(too_big)
## [1] 2147483648
length(too_big_array)
## [1] 0
 {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16977) [R] Update dataset row counting so no integer overflow on large datasets

2022-07-05 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16977:


 Summary: [R] Update dataset row counting so no integer overflow on 
large datasets
 Key: ARROW-16977
 URL: https://issues.apache.org/jira/browse/ARROW-16977
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
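
For context, a minimal sketch of the underlying problem in base R: 32-bit 
integer arithmetic overflows at 2^31 - 1, while doubles do not.

{code:r}
.Machine$integer.max        # 2147483647, the largest 32-bit integer
.Machine$integer.max + 1L   # NA, with a warning: NAs produced by integer overflow
.Machine$integer.max + 1    # 2147483648, fine as a double
{code}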






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16973) [R] segfault on some CI jobs when calling `flight_put()`

2022-07-04 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16973:


 Summary: [R] segfault on some CI jobs when calling `flight_put()`
 Key: ARROW-16973
 URL: https://issues.apache.org/jira/browse/ARROW-16973
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


CI jobs for PRs unrelated to this area of the codebase have been segfaulting 
recently, e.g.:
 * [https://github.com/apache/arrow/runs/7180218227?check_suite_focus=true]
 * 
[https://github.com/apache/arrow/runs/7139495271?check_suite_focus=true#step:7:22897]
 * 
[https://github.com/apache/arrow/runs/7134531302?check_suite_focus=true#step:7:25791]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-16862) [C++] Add option for casting failure values to default to NULL/NA

2022-06-20 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16862:


 Summary: [C++] Add option for casting failure values to default to 
NULL/NA
 Key: ARROW-16862
 URL: https://issues.apache.org/jira/browse/ARROW-16862
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


In ARROW-16833, a user reports that they are unable to cast their messy string 
data to integers and receive an error message.  In R, it's possible to convert 
this kind of data to integers, with values that fail to parse simply converted 
to NA.  Would it be possible to enable this as an option in Arrow?
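
For comparison, the R behaviour referred to above: coercion of unparseable 
strings yields NA with a warning rather than an error.

{code:r}
as.integer(c("1", "2", "banana"))
#> Warning: NAs introduced by coercion
#> [1]  1  2 NA
{code}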



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16829) [R] Add link to new contributors guide to developer guide

2022-06-14 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16829:


 Summary: [R] Add link to new contributors guide to developer guide
 Key: ARROW-16829
 URL: https://issues.apache.org/jira/browse/ARROW-16829
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
Assignee: Nicola Crane






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16650) [R] Binding for between() is in dplyr-funcs-type.R

2022-05-25 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16650:


 Summary: [R] Binding for between() is in dplyr-funcs-type.R
 Key: ARROW-16650
 URL: https://issues.apache.org/jira/browse/ARROW-16650
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


I was looking for the binding for `dplyr::between()` and was surprised to find 
it in `dplyr-funcs-type.R`; we should move it to somewhere more appropriate, 
such as `dplyr-funcs-math.R`.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16649) [C++] Add support for sorting to the Substrait consumer

2022-05-25 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16649:


 Summary: [C++] Add support for sorting to the Substrait consumer
 Key: ARROW-16649
 URL: https://issues.apache.org/jira/browse/ARROW-16649
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


The streaming execution engine supports sorting (I believe, as a sink node 
option?), but the Substrait consumer does not currently consume sort relations. 
 Please can we have support for this?

Here's the example code/plan I tested with:

 
{code:java}
library(dplyr)
library(substrait)

# create a basic table and order it
out <- tibble::tibble(a = 1, b = 2) %>%
  arrow_substrait_compiler() %>%
  arrange(a)

# take a look at the plan created
out$plan()
#> message of type 'substrait.Plan' with 2 fields set
#> extension_uris {
#>   extension_uri_anchor: 1
#> }
#> relations {
#>   root {
#>     input {
#>       sort {
#>         input {
#>           read {
#>             base_schema {
#>               names: "a"
#>               names: "b"
#>               struct_ {
#>                 types {
#>                   fp64 {
#>                   }
#>                 }
#>                 types {
#>                   fp64 {
#>                   }
#>                 }
#>               }
#>             }
#>             named_table {
#>               names: "named_table_1"
#>             }
#>           }
#>         }
#>         sorts {
#>           expr {
#>             selection {
#>               direct_reference {
#>                 struct_field {
#>                 }
#>               }
#>             }
#>           }
#>           direction: SORT_DIRECTION_ASC_NULLS_LAST
#>         }
#>       }
#>     }
#>     names: "a"
#>     names: "b"
#>   }
#> }

# try to run the plan
collect(out)
#> Error: NotImplemented: conversion to arrow::compute::Declaration from 
Substrait relation sort {
...
#> /home/nic2/arrow/cpp/src/arrow/engine/substrait/serde.cc:73  
FromProto(plan_rel.rel(), ext_set)
{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16560) [Website][Release] Version JSON files not updated in release

2022-05-12 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16560:


 Summary: [Website][Release] Version JSON files not updated in 
release
 Key: ARROW-16560
 URL: https://issues.apache.org/jira/browse/ARROW-16560
 Project: Apache Arrow
  Issue Type: Bug
  Components: Website
Reporter: Nicola Crane


ARROW-15366 added a script to automatically increment the version switchers for 
the docs, which was updated as part of the changes in ARROW-1.  However, the 
latest release did not increment the version numbers (ARROW-1 changed the 
script to update on snapshots instead of releases - could that be the reason 
for it not happening?).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16480) [R] Update read_csv_arrow parse_options, read_options, and convert_options to take lists

2022-05-05 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16480:


 Summary: [R] Update read_csv_arrow parse_options, read_options, 
and convert_options to take lists
 Key: ARROW-16480
 URL: https://issues.apache.org/jira/browse/ARROW-16480
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Currently if we want to specify Arrow-specific read options such as encoding, 
we'd have to do something like this:
{code:java}
df <- read_csv_arrow(tf, read_options = CsvReadOptions$create(encoding = "utf8"))
{code}
We should update the code inside {{read_csv_arrow()}} so that the user can 
specify {{read_options}} as a list which we then pass through to CsvReadOptions 
internally, so we could instead call the much more user-friendly code below:
{code:java}
df <- read_csv_arrow(tf, read_options = list(encoding = "utf8"))
{code}
We should then add an example of this to the function doc examples.

 

We should also do the same for {{parse_options}} and {{convert_options}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16447) [R] Integer overflow causes error - (in dplyr we get an NA with a warning)

2022-05-03 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16447:


 Summary: [R] Integer overflow causes error - (in dplyr we get an 
NA with a warning)
 Key: ARROW-16447
 URL: https://issues.apache.org/jira/browse/ARROW-16447
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


{code:java}
library(dplyr)
library(arrow)

.input = tibble::tibble(
  x = .Machine$integer.max
)

# in dplyr
.input %>%
      mutate(x2 = x + 6L) %>%
      collect()
#> Warning in x + 6L: NAs produced by integer overflow
#> # A tibble: 1 × 2
#>            x    x2
#>         
#> 1 2147483647    NA

# in Arrow via arrow
.input %>%
      arrow_table() %>%
      mutate(x2 = x + 6L) %>%
      collect()
#> Error in `collect()`:
#> ! Invalid: overflow
#> /home/nic2/arrow/cpp/src/arrow/compute/exec.cc:701  
kernel_->exec(kernel_ctx_, batch, )
#> /home/nic2/arrow/cpp/src/arrow/compute/exec.cc:642  ExecuteBatch(batch, 
listener)
#> /home/nic2/arrow/cpp/src/arrow/compute/exec/expression.cc:547  
executor->Execute(arguments, )
#> /home/nic2/arrow/cpp/src/arrow/compute/exec/project_node.cc:90  
ExecuteScalarExpression(simplified_expr, target, plan()->exec_context())
#> /home/nic2/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:463  
iterator_.Next()
#> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext()
#> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()
{code}
Do we want to enable the return of NAs on integer overflow, or just give the 
user a more specific hint in the error message?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16376) [R][CI] Update test-r-devdocs on Windows to build UCRT and don't pin to R 4.1

2022-04-27 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16376:


 Summary: [R][CI] Update test-r-devdocs on Windows to build UCRT 
and don't pin to R 4.1
 Key: ARROW-16376
 URL: https://issues.apache.org/jira/browse/ARROW-16376
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


The failed devdocs builds were fixed by pinning the R version to 4.1 in 
ARROW-16375, but we should instead add UCRT to the build and not pin the 
version.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16375) [R] Pin test-r-devdocs on Windows to R 4.1

2022-04-27 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16375:


 Summary: [R] Pin test-r-devdocs on Windows to R 4.1
 Key: ARROW-16375
 URL: https://issues.apache.org/jira/browse/ARROW-16375
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Nicola Crane


This build is failing on Windows, likely because R 4.2 on Windows requires 
UCRT.  A short-term solution is pinning these builds to R 4.1.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16310) [R] test-fedora-r-clang-sanitizer job fails - possible tzdb installation issue

2022-04-25 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16310:


 Summary: [R] test-fedora-r-clang-sanitizer job fails - possible 
tzdb installation issue
 Key: ARROW-16310
 URL: https://issues.apache.org/jira/browse/ARROW-16310
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Nicola Crane


We're seeing an error on a sanitizer build:

https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=23988=logs=0da5d1d9-276d-5173-c4c4-9d4d4ed14fdb=d9b15392-e4ce-5e4c-0c8c-b69645229181=3034

I think it's something to do with tzdb installation:


{code:java}
make: Target 'all' not remade because of errors.
* installing *source* package ‘tzdb’ ...
** package ‘tzdb’ successfully unpacked and MD5 sums checked
** using staged installation
make[1]: *** [/opt/R-devel/lib64/R/etc/Makeconf:178: api.o] Error 1
make[1]: Leaving directory '/tmp/Rtmp0aqclz/R.INSTALL51cc14b8c441/tzdb/src'
ERROR: compilation failed for package ‘tzdb’
* removing ‘/opt/R-devel/lib64/R/library/tzdb’

The downloaded source packages are in
‘/tmp/Rtmpg6gyGy/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning messages:
1: package ‘’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-devel/R-admin.html#Installing-packages 
2: In i.p(...) : installation of one or more packages failed,
  probably ‘tzdb’
> 
> 
/
+ popd

{code}




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16164) [C++] Pushdown filters on augmented columns like fragment filename

2022-04-11 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16164:


 Summary: [C++] Pushdown filters on augmented columns like fragment 
filename
 Key: ARROW-16164
 URL: https://issues.apache.org/jira/browse/ARROW-16164
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


In the discussion on ARROW-15260, if we run the following code in R, we might 
expect it to push down the filter so we can just read in the relevant files:

{code:r}
  filter = Expression$create(
"match_substring",
Expression$field_ref("__filename"),
options = list(pattern = "cyl=8")
  )
{code}

As mentioned by [~westonpace]:

"You might think we would get the hint and only read files matching that 
pattern. This is not the case. We will read the entire dataset and apply the 
"cyl=8" filter in memory.

If we want to pushdown filters on the filename column we will need to add some 
special logic."
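Until that special logic exists, one workaround is to filter the file list in R before opening the dataset. This is only a sketch with illustrative paths, not part of the ticket's proposal:

```r
library(arrow)

# Sketch of a workaround, assuming a Hive-partitioned directory of Parquet
# files under "mtcars_dataset" (path is illustrative): filter the file
# paths ourselves instead of relying on a "__filename" pushdown.
all_files <- list.files("mtcars_dataset", recursive = TRUE, full.names = TRUE)
cyl8_files <- grep("cyl=8", all_files, value = TRUE)

# Note: opening an explicit list of files loses Hive partition detection,
# so partition columns would need to be recovered separately.
ds <- open_dataset(cyl8_files, format = "parquet")
```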



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16154) [R] Errors which pass through `handle_csv_read_error()` and `handle_parquet_io_error()` need better error tracing

2022-04-08 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16154:


 Summary: [R] Errors which pass through `handle_csv_read_error()` 
and `handle_parquet_io_error()` need better error tracing
 Key: ARROW-16154
 URL: https://issues.apache.org/jira/browse/ARROW-16154
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
 Fix For: 8.0.0


See discussion here for context: 
https://github.com/apache/arrow/pull/12826#issuecomment-1092052001



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16106) [R] Support for filename-based partitioning

2022-04-04 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16106:


 Summary: [R] Support for filename-based partitioning
 Key: ARROW-16106
 URL: https://issues.apache.org/jira/browse/ARROW-16106
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


This was added in ARROW-14612 and now needs implementing in R



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16080) [R][Documentation] Document filename-based partitioning

2022-03-31 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16080:


 Summary: [R][Documentation] Document filename-based partitioning
 Key: ARROW-16080
 URL: https://issues.apache.org/jira/browse/ARROW-16080
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
 Fix For: 8.0.0


Filename-based partitioning has been implemented in C++; we should add 
something to our docs about this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16011) [R] CI jobs should fail if lintr picked up issues

2022-03-23 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16011:


 Summary: [R] CI jobs should fail if lintr picked up issues
 Key: ARROW-16011
 URL: https://issues.apache.org/jira/browse/ARROW-16011
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Currently lintr flags styling issues on every PR, which means an unrelated PR 
can be flagged if a previous R-related PR introduced a linting issue.  We 
should instead make the R CI build fail in these cases, so these problems are 
not merged in the first place.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16000) [C++][Dataset] Support Latin-1 encoding

2022-03-22 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-16000:


 Summary: [C++][Dataset] Support Latin-1 encoding
 Key: ARROW-16000
 URL: https://issues.apache.org/jira/browse/ARROW-16000
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


In ARROW-15992 a user is reporting issues with trying to read in files with 
Latin-1 encoding.  I had a look through the docs for the Dataset API and I 
don't think this is currently supported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15943) [C++] Filter which files to be read in as part of filesystem, filtered using a string

2022-03-15 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15943:


 Summary: [C++] Filter which files to be read in as part of 
filesystem, filtered using a string
 Key: ARROW-15943
 URL: https://issues.apache.org/jira/browse/ARROW-15943
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


There is a report from a user (see this Stack Overflow post [1]) who has used 
the {{basename_template}} parameter to write files to a dataset, some of which 
have the prefix {{"summary"}} and others the prefix {{"prediction"}}.  This 
data is saved in partitioned directories.  They want to be able to read the 
data back in so that, as well as the partition variables in their dataset, 
they can choose which subset (predictions vs. summaries) to read.  

This isn't currently possible; if they try to open a dataset with a list of 
files, they cannot read it in as partitioned data.

A short-term solution is to suggest they change the structure of how their data 
is stored, but it could be useful to be able to pass in some sort of filter to 
determine which files get read in as a dataset.

 

[1] 
https://stackoverflow.com/questions/71355827/arrow-parquet-partitioning-multiple-datasets-in-same-directory-structure-in-r
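The short-term file-filtering approach mentioned above can be sketched as follows (the path and the exact prefix are illustrative):

```r
library(arrow)

# Select only the "summary" files by basename prefix before opening the
# dataset; "mixed_dataset" is an illustrative path.
files <- list.files("mixed_dataset", recursive = TRUE, full.names = TRUE)
summary_files <- files[startsWith(basename(files), "summary")]

# Caveat from the ticket: a dataset opened from an explicit file list
# cannot currently recover the partition variables from directory names.
ds <- open_dataset(summary_files, format = "parquet")
```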



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15880) [C++] Can't open partitioned dataset if the root directory has "=" in its name

2022-03-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15880:


 Summary: [C++] Can't open partitioned dataset if the root 
directory has "=" in its name
 Key: ARROW-15880
 URL: https://issues.apache.org/jira/browse/ARROW-15880
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Nicola Crane


Not sure if this is a bug or "just how Hive style partitioning works" but if I 
try to open a dataset where the root directory has an "=" in it, I have to 
specify that directory in my partitioning to be able to successfully open it.

This has caused users to trip up when they've saved one directory from a 
partitioned dataset somewhere and tried to then open this directory as a 
dataset.

{code:r}
library(arrow)
td <- tempfile()
dir.create(td)
# directory with equals sign in name
subdir <- file.path(td, "foo=bar")
dir.create(subdir)
write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foo=bar/am=0/part-0.parquet" "foo=bar/am=1/part-0.parquet"
# doesn't work
open_dataset(subdir, partitioning = "am")
#> Error:
#> ! "partitioning" does not match the detected Hive-style partitions: c("foo", 
"am")
#> ℹ Omit "partitioning" to use the Hive partitions
#> ℹ Set `hive_style = FALSE` to override what was detected
#> ℹ Or, to rename partition columns, call `select()` or `rename()` after 
opening the dataset
# works
open_dataset(subdir, partitioning = c("foo", "am"))
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> foo: string
#> am: int32
#> 
#> See $metadata for additional Schema metadata
{code}

Compare this with the same example but the folder is just called "foobar" 
instead of "foo=bar".

{code:r}
td <- tempfile()
dir.create(td)
subdir <- file.path(td, "foobar")
dir.create(subdir)
write_dataset(mtcars, subdir, partitioning = "am")
list.files(td, recursive = TRUE)
#> [1] "foobar/am=0/part-0.parquet" "foobar/am=1/part-0.parquet"
# works
open_dataset(subdir, partitioning = "am")
#> FileSystemDataset with 2 Parquet files
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> gear: double
#> carb: double
#> am: int32
#> 
#> See $metadata for additional Schema metadata
{code}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15827) [R] Improve UX of write_dataset(..., max_rows_per_group)

2022-03-02 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15827:


 Summary: [R] Improve UX of write_dataset(..., max_rows_per_group)
 Key: ARROW-15827
 URL: https://issues.apache.org/jira/browse/ARROW-15827
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


When using {{write_dataset()}}, if we set {{max_rows_per_file}} without also 
setting {{max_rows_per_group}}, we always get the error shown below.  

{code:r}
library(arrow)
td <- tempfile()
dir.create(td)
write_dataset(mtcars, td, max_rows_per_file = 5L)
#> Error: Invalid: max_rows_per_group must be less than or equal to 
max_rows_per_file
{code}

We should change the behaviour so we can specify one without having to also 
specify the other.
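In the meantime, a minimal workaround sketch is to supply both arguments explicitly, keeping the group size no larger than the file size:

```r
library(arrow)

td <- tempfile()
dir.create(td)

# Workaround sketch: pass max_rows_per_group alongside max_rows_per_file,
# with a value no larger than the file limit, to avoid the error above.
write_dataset(mtcars, td, max_rows_per_file = 5L, max_rows_per_group = 5L)
list.files(td)
```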




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15819) [R] R docs version switcher doesn't work on Safari on MacOS

2022-03-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15819:


 Summary: [R] R docs version switcher doesn't work on Safari on 
MacOS
 Key: ARROW-15819
 URL: https://issues.apache.org/jira/browse/ARROW-15819
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


Reported as missing on Safari on MacOS by both Ian and Neal



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15812) [R] Allow user to supply col_names argument when reading in a CSV dataset

2022-03-01 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15812:


 Summary: [R] Allow user to supply col_names argument when reading 
in a CSV dataset
 Key: ARROW-15812
 URL: https://issues.apache.org/jira/browse/ARROW-15812
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Allow the user to supply the {{col_names}} argument from {{readr}} when reading 
in a dataset.  

This is already possible when reading in a single CSV file via 
{{arrow::read_csv_arrow()}} via the {{readr_to_csv_read_options}} function, and 
so once the C++ functionality to autogenerate column names for Datasets is 
implemented, we should hook up {{readr_to_csv_read_options}} in 
{{csv_file_format_read_opts}} just like we have with 
{{readr_to_csv_parse_options}} in {{csv_file_format_parse_options}}.
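For reference, a sketch of the single-file behaviour that already works, written against the {{read_csv_arrow()}} interface:

```r
library(arrow)

tf <- tempfile(fileext = ".csv")
write.csv(mtcars, tf, row.names = FALSE)

# readr-style col_names is already honoured for a single CSV file via
# readr_to_csv_read_options; skip = 1 drops the file's existing header row.
df <- read_csv_arrow(
  tf,
  col_names = c("mpg", "cyl", "disp", "hp", "drat", "wt",
                "qsec", "vs", "am", "gear", "carb"),
  skip = 1
)
```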



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15797) [R] Supplying column names to open_dataset results in all columns being read in as strings

2022-02-28 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15797:


 Summary: [R] Supplying column names to open_dataset results in all 
columns being read in as strings
 Key: ARROW-15797
 URL: https://issues.apache.org/jira/browse/ARROW-15797
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane



{code:r}
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#> timestamp
td <- tempfile()
dir.create(td)
write_dataset(mtcars, td, format = "csv")

# Correct column types
open_dataset(td, format = "csv")
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64

# Incorrect column types
open_dataset(td, format = "csv", column_names = c("mpg", "cyl", "disp", "hp", 
"drat", "wt", "qsec", "vs", "am", "gear", "carb"))
#> FileSystemDataset with 1 csv file
#> mpg: string
#> cyl: string
#> disp: string
#> hp: string
#> drat: string
#> wt: string
#> qsec: string
#> vs: string
#> am: string
#> gear: string
#> carb: string

{code}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15743) [R] `skip` not connected up to `skip_rows` on open_dataset despite error messages indicating otherwise

2022-02-21 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15743:


 Summary: [R] `skip` not connected up to `skip_rows` on 
open_dataset despite error messages indicating otherwise
 Key: ARROW-15743
 URL: https://issues.apache.org/jira/browse/ARROW-15743
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane


If I open a dataset of CSVs with a schema, the error message tells me to supply 
{{skip = 1}} if my data contains a header row (to prevent it being read in as 
data), but only {{skip_rows = 1}} actually works.

{code:r}

library(arrow)
library(dplyr)

td <- tempfile()
dir.create(td)
write_dataset(mtcars, td, format = "csv")

schema <- schema(mpg = float64(), cyl = float64(), disp = float64(), hp = 
float64(), 
drat = float64(), wt = float64(), qsec = float64(), vs = float64(), 
am = float64(), gear = float64(), carb = float64())


open_dataset(td, format = "csv", schema = schema) %>%
  collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Could not open CSV input source 
'/tmp/RtmppZbpeF/file6cec135ed29c/part-0.csv': Invalid: In CSV column #0: Row 
#1: CSV conversion error to double: invalid value 'mpg'
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550  decoder_.Decode(data, 
size, quoted, )
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123  status
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554  
parser.VisitColumn(col_index, visit)
#> /home/nic2/arrow/cpp/src/arrow/csv/reader.cc:463  
arrow::internal::UnwrapOrRaise(maybe_decoded_arrays)
#> /home/nic2/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:445  
iterator_.Next()
#> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:336  ReadNext()
#> /home/nic2/arrow/cpp/src/arrow/record_batch.cc:347  ReadAll()
#> ℹ If you have supplied a schema and your data contains a header row, you 
should supply the argument `skip = 1` to prevent the header being read in as 
data.

open_dataset(td, format = "csv", schema = schema, skip = 1) %>%
  collect()
#> Error: The following option is supported in "read_delim_arrow" functions but 
not yet supported here: "skip"

open_dataset(td, format = "csv", schema = schema, skip_rows = 1) %>%
  collect()
#> # A tibble: 32 × 11
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

{code}




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15567) [R] Implement as_substrait() and from_substrait() for integers

2022-02-04 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15567:


 Summary: [R] Implement as_substrait() and from_substrait() for 
integers
 Key: ARROW-15567
 URL: https://issues.apache.org/jira/browse/ARROW-15567
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15566) [R] Create initial implementation

2022-02-04 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15566:


 Summary: [R] Create initial implementation
 Key: ARROW-15566
 URL: https://issues.apache.org/jira/browse/ARROW-15566
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: R
Reporter: Nicola Crane
Assignee: Dewey Dunnington


Create an initial implementation of an R package which will generate Substrait 
plans from dplyr code



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15507) [R] Refactor repeated code into check_match function

2022-01-31 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15507:


 Summary: [R] Refactor repeated code into check_match function 
 Key: ARROW-15507
 URL: https://issues.apache.org/jira/browse/ARROW-15507
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


In https://github.com/apache/arrow/pull/12277#discussion_r794636116 we discuss 
similar reasoning in two different places in the codebase; this should be 
refactored into a function to make the code easier to skim.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15480) [R] Expand on schema/colnames mismatch error messages

2022-01-27 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15480:


 Summary: [R] Expand on schema/colnames mismatch error messages
 Key: ARROW-15480
 URL: https://issues.apache.org/jira/browse/ARROW-15480
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
Assignee: Nicola Crane


In ARROW-14744 extra checks were added for when {{open_dataset()}} is used and 
there are conflicts between column names from the schema vs. passed in 
explicitly - we should expand on the messaging and tests for the different 
possible cases here.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15476) [R][Docs] Update the links in the developing vignette so they don't point to absolute paths

2022-01-27 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15476:


 Summary: [R][Docs] Update the links in the developing vignette so 
they don't point to absolute paths
 Key: ARROW-15476
 URL: https://issues.apache.org/jira/browse/ARROW-15476
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Reporter: Nicola Crane


There are 3 links in the "developing" vignettes which point to absolute paths 
to articles for developers.  This works for the package vignettes but doesn't 
work well on pkgdown versions of the vignettes as they point to the latest 
published version of those articles, and so, for example, in the dev docs, 
result in "Not Found" as those docs are not yet published on the main site.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15470) [C++] Allows user to specify string to be used for missing data when writing CSV dataset

2022-01-26 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15470:


 Summary: [C++] Allows user to specify string to be used for 
missing data when writing CSV dataset
 Key: ARROW-15470
 URL: https://issues.apache.org/jira/browse/ARROW-15470
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


The ability to select the string to be used for missing data was implemented 
for the CSV Writer in ARROW-14903 but would it be possible to also allow this 
when writing CSV datasets?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15366) [R] Automate incrementing of pkgdown version for dropdown menu

2022-01-19 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15366:


 Summary: [R] Automate incrementing of pkgdown version for dropdown 
menu
 Key: ARROW-15366
 URL: https://issues.apache.org/jira/browse/ARROW-15366
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Nicola Crane
Assignee: Nicola Crane






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15337) [Doc] New contributors guide updates

2022-01-14 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15337:


 Summary: [Doc] New contributors guide updates
 Key: ARROW-15337
 URL: https://issues.apache.org/jira/browse/ARROW-15337
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Nicola Crane






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15303) [R] linting errors

2022-01-11 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15303:


 Summary: [R] linting errors
 Key: ARROW-15303
 URL: https://issues.apache.org/jira/browse/ARROW-15303
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
Assignee: Nicola Crane






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15281) [C++] Implement ability to retrieve fragment filename

2022-01-07 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15281:


 Summary: [C++] Implement ability to retrieve fragment filename
 Key: ARROW-15281
 URL: https://issues.apache.org/jira/browse/ARROW-15281
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


A user has requested the ability to include the filename of the CSV in the 
dataset output - see discussion on ARROW-15260 for more context.

Relevant info from that ticket:

 
"From a C++ perspective we've got many of the pieces needed already. One 
challenge is that the datasets API is written to work with "fragments" and not 
"files". For example, a dataset might be an in-memory table in which case we 
are working with InMemoryFragment and not FileFragment so there is no concept 
of "filename".

That being said, the low level ScanBatchesAsync method actually returns a 
generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is a 
struct with the record batch as well as the source fragment for that record 
batch.

So if you were to execute scan, you could inspect the fragment and, if it is a 
FileFragment, you could extract the filename.

Another challenge is that R is moving towards more and more access through an 
exec plan and not directly using a scanner. In order for that to work we would 
need to augment the scan results with the filename in C++ before sending into 
the exec plan. Luckily, we already do this a bit as well. We currently augment 
the scan results with fragment index, batch index, and whether the batch is the 
last batch in the fragment.

Since ExecBatch can work with constants efficiently I don't think there will be 
much performance cost in always including the filename. So the work remaining 
is simply to add a new augmented field {{__fragment_source_name}} which is 
always attached if the underlying fragment is a file. Then users can get this 
field if they want by including "__fragment_source_name" in the list of 
columns they query for."
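A hypothetical sketch of how this could look from R once implemented; the field name {{__fragment_source_name}} comes from the discussion above and does not exist yet, and the path is illustrative:

```r
library(arrow)
library(dplyr)

# Hypothetical usage sketch -- "__fragment_source_name" is the proposed
# augmented field, not an existing column in today's API.
open_dataset("csv_dir", format = "csv") %>%
  mutate(source_file = `__fragment_source_name`) %>%
  collect()
```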



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15278) [R] Reorganise tests for dates and datetimes to test them together

2022-01-07 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15278:


 Summary: [R] Reorganise tests for dates and datetimes to test them 
together
 Key: ARROW-15278
 URL: https://issues.apache.org/jira/browse/ARROW-15278
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
Assignee: Nicola Crane


The tests in {{arrow/r/tests/test-dplyr-funcs-datetime.R}} have dates and 
datetimes tested separately.  Given that both I (the person who originally 
wrote them like that!) and subsequent contributors have ended up accidentally 
forgetting to test one of these classes, it would make more sense for the 
tests to cover both at once.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15279) [R] Update "writing bindings" dev docs based on user feedback

2022-01-07 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15279:


 Summary: [R] Update "writing bindings" dev docs based on user 
feedback
 Key: ARROW-15279
 URL: https://issues.apache.org/jira/browse/ARROW-15279
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
 Fix For: 7.0.0


I would add two comments for the article on [Writing 
bindings|https://ursalabs.org/arrow-r-nightly/articles/developers/bindings.html#writing-bindings]:
* In [Step 
1|https://ursalabs.org/arrow-r-nightly/articles/developers/bindings.html#step-1---add-unit-tests]
 I suggest adding that {{compare_dplyr_binding()}} and 
{{compare_dplyr_error()}} can be found in 
{{arrow/r/tests/testthat/helper-expectation.R}}
* Due to [ARROW-15010|https://github.com/apache/arrow/pull/11904], Step 3b 
should be corrected: {{nse_funcs$startsWith ...}}

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15254) [C++] Ability to skip CSV footer when reading in dataset

2022-01-05 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15254:


 Summary: [C++] Ability to skip CSV footer when reading in dataset
 Key: ARROW-15254
 URL: https://issues.apache.org/jira/browse/ARROW-15254
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


In ARROW-15252 a user reports wanting to be able to skip the final row of a CSV 
(the footer) when reading in a dataset of CSVs - is this possible to implement?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15145) [R] test-r-minimal-build fails due to updated error message

2021-12-17 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15145:


 Summary: [R]  test-r-minimal-build fails due to updated error 
message
 Key: ARROW-15145
 URL: https://issues.apache.org/jira/browse/ARROW-15145
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


In ARROW-15047, the error messaging in {{read_compressed_error()}} was updated 
to be more user-friendly - the corresponding unit test (named "Error messages 
are shown when the compression algorithm lz4 is not found") needs updating to 
reflect this change 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15128) [C++] segfault when writing CSV from RecordBatchReader

2021-12-16 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15128:


 Summary: [C++] segfault when writing CSV from RecordBatchReader
 Key: ARROW-15128
 URL: https://issues.apache.org/jira/browse/ARROW-15128
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Nicola Crane


I'm currently trying to implement functionality in R so that we can open a 
dataset and then write to a CSV file, but I'm getting a segfault when I run my 
tests:

 
{code:r}
tbl <- tibble::tibble(
  dbl = c(1:8, NA, 10) + .1,
  lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
  false = logical(10),
  chr = letters[c(1:5, NA, 7:10)]
)

make_temp_dir <- function() {
  path <- tempfile()
  dir.create(path)
  normalizePath(path, winslash = "/")
}

data_dir <- make_temp_dir()
write_dataset(tbl, data_dir, partitioning = "lgl")
data_in <- open_dataset(data_dir)

csv_file <- tempfile()
tbl_out <- write_csv_arrow(data_in, csv_file)
{code}
 
{code:java}
Thread 1 "R" received signal SIGSEGV, Segmentation fault.
0x7fffee51fdd7 in __gnu_cxx::__exchange_and_add (__mem=0xe9, __val=-1)
at /usr/include/c++/9/ext/atomicity.h:49
49{ return __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15103) [Documentation][C++] Error building docs: "arrow/cpp/src/arrow/csv/options.h:182: error: Found unknown command '\r' "

2021-12-14 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15103:


 Summary: [Documentation][C++] Error building docs: 
"arrow/cpp/src/arrow/csv/options.h:182: error: Found unknown command '\r' "
 Key: ARROW-15103
 URL: https://issues.apache.org/jira/browse/ARROW-15103
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: Nicola Crane


I am trying to build the docs, following the instructions at 
https://arrow.apache.org/docs/developers/documentation.html

However, after running {{pip install -r docs/requirements.txt}} and then going 
to {{cpp/apidoc}} and running {{doxygen}} I get the following error:

{code:java}
warning: ignoring unsupported tag 'HTML_FORMULA_FORMAT' at line 1537, file 
Doxyfile
/home/nic2/arrow/cpp/src/arrow/csv/options.h:182: error: Found unknown command 
'\r' (warning treated as error, aborting now)
{code}





--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15040) [R] Enable write_csv_arrow to take a RecordBatchReader as input

2021-12-09 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15040:


 Summary: [R] Enable write_csv_arrow to take a RecordBatchReader as 
input
 Key: ARROW-15040
 URL: https://issues.apache.org/jira/browse/ARROW-15040
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


Currently, this code fails:
{code:r}
dataset <- open_dataset("some/folder/with/parquet/files")
write_csv_arrow(dataset, sink = "dataset.csv")
{code}

with this error message:
{code:r}
Error: x must be an object of class 'data.frame', 'RecordBatch', or 'Table', 
not 'FileSystemDataset'.
{code}

In ARROW-14741, support was added for reading from a RecordBatchReader, so we 
should be able to now extend {{write_csv_arrow()}} to allow this behaviour.
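Until then, a workaround sketch is to materialise the dataset as an in-memory Table before writing (this loads everything into memory, which the RecordBatchReader path would avoid):

```r
library(arrow)
library(dplyr)

# Workaround sketch: collect the dataset into an Arrow Table, which
# write_csv_arrow() already accepts.
dataset <- open_dataset("some/folder/with/parquet/files")
tbl <- collect(dataset, as_data_frame = FALSE)
write_csv_arrow(tbl, sink = "dataset.csv")
```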



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15022) [R] install vignette and installation dev vignette need alt text for images

2021-12-08 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-15022:


 Summary: [R] install vignette and installation dev vignette need 
alt text for images
 Key: ARROW-15022
 URL: https://issues.apache.org/jira/browse/ARROW-15022
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Nicola Crane
 Fix For: 7.0.0


The installation docs have been updated recently, with images added, but there 
is no alt text to accompany them.  Alt text should be added to all images, and 
extra text should be added to the flowchart describing installation on Windows, 
given that it is too complex for a simple alt text description.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14989) [R] Update num_rows methods to output doubles not integers to prevent integer overflow

2021-12-06 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14989:


 Summary: [R] Update num_rows methods to output doubles not 
integers to prevent integer overflow
 Key: ARROW-14989
 URL: https://issues.apache.org/jira/browse/ARROW-14989
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane


In cases where Arrow objects are particularly large, this can result in an 
integer overflow when returning their size.  See discussion on 
https://github.com/apache/arrow/pull/11783 for more details of a possible 
solution.


{code:r}
library(arrow)
test_array1 <- Array$create(raw(2^31 - 1))
test_array2 <- Array$create(raw(1))
big_chunked <- chunked_array(test_array1, test_array2)

big_table <- Table$create(col = big_chunked)
big_table$num_rows
# NA
{code}
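The underlying issue can be seen in base R alone: 32-bit integer arithmetic overflows past {{.Machine$integer.max}}, while double arithmetic does not.

```r
# Base R illustration: integer arithmetic overflows past 2^31 - 1 and
# returns NA (with a warning), while double arithmetic does not.
.Machine$integer.max        # 2147483647
.Machine$integer.max + 1L   # NA, with an integer-overflow warning
.Machine$integer.max + 1    # 2147483648 (double arithmetic)
```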




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14988) [R] Improve source build experience

2021-12-06 Thread Nicola Crane (Jira)
Nicola Crane created ARROW-14988:


 Summary: [R] Improve source build experience
 Key: ARROW-14988
 URL: https://issues.apache.org/jira/browse/ARROW-14988
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Nicola Crane
Assignee: Nicola Crane


* We should make ARROW_DEPENDENCY_SOURCE=AUTO the default and then document how 
to install the dependencies (where possible) using apt/yum; that will speed 
up source builds
* In the default case where a binary isn't being downloaded, we could 
advertise something like "For a faster installation, set the environment 
variable LIBARROW_BINARY=true before installing". I think that wouldn't be 
against CRAN policy
* We could also message more loudly in the default source build that not all 
features are enabled, and suggest setting LIBARROW_MINIMAL=false and 
reinstalling if they are needed
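The environment variables discussed above can be sketched as a shell snippet; the variable names are the existing arrow install options, and ARROW_DEPENDENCY_SOURCE=AUTO is the proposed default rather than current behaviour:

```shell
# Sketch: configure the proposed install behaviour before reinstalling arrow.
export ARROW_DEPENDENCY_SOURCE=AUTO   # proposed default: prefer system deps
export LIBARROW_BINARY=true           # faster: fetch a prebuilt libarrow
export LIBARROW_MINIMAL=false         # enable optional features in source builds
# then reinstall: Rscript -e 'install.packages("arrow")'
```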



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

