[jira] [Resolved] (ARROW-16605) [CI][R] Fix revdep docker job

2022-10-06 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-16605.

Resolution: Fixed

Issue resolved by pull request 13483
[https://github.com/apache/arrow/pull/13483]

> [CI][R] Fix revdep docker job
> -
>
> Key: ARROW-16605
> URL: https://issues.apache.org/jira/browse/ARROW-16605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> The revdep Crossbow job is currently not functioning correctly. This led to 
> changed behaviour affecting a revdep with the 8.0.0 release, requiring a 
> patch after initial submission.
> cc: [~jonkeane]
> Due to the time and performance constraints on GHA it does not make sense to 
> have a crossbow job for this. A Docker job that lets us run this cleanly 
> and locally does make sense though, so I renamed the ticket.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-10-06 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17613507#comment-17613507
 ] 

Jonathan Keane commented on ARROW-15678:


I thought that [~kou] was going to take a look at this (or at least the 
underlying multiple SIMD instruction ordering issue that causes the failures...)

The only update I have is that I continue to run into the segfault in CI for 
downstream projects I'm working on, so it continues to be an issue for 
pre-built libarrow on machines like GitHub's macOS runners. 

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 13.5h
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-17574) [R] [Docs] [CI] Investigate if we can auto generate Rd files in CI

2022-08-30 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-17574:
--

 Summary: [R] [Docs] [CI] Investigate if we can auto generate Rd 
files in CI
 Key: ARROW-17574
 URL: https://issues.apache.org/jira/browse/ARROW-17574
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Documentation, R
Reporter: Jonathan Keane


Or alternatively, warn + recommend running autotune if they are out of date 
(e.g. any change)





[jira] [Created] (ARROW-17533) [R] Implement asof join

2022-08-25 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-17533:
--

 Summary: [R] Implement asof join
 Key: ARROW-17533
 URL: https://issues.apache.org/jira/browse/ARROW-17533
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Jonathan Keane


With ARROW-16083 we have asof joins, could we expose this in R?

Docs for the node: 
https://arrow.apache.org/docs/cpp/api/compute.html?highlight=asof#_CPPv4N5arrow7compute19AsofJoinNodeOptionsE

A possible syntax might be (there does not appear to be a syntax in dplyr for 
this already): 

{code}
asof_join(table1, table2, by = "field", tolerance = 1) 
{code}
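
For readers unfamiliar with asof semantics, here is a minimal standard-library sketch of the matching rule. It assumes backward matching (each left row takes the latest right row whose key is not greater than its own, as pandas' merge_asof does by default), a single sorted integer key, and a non-negative tolerance. AsofMatch is a hypothetical name for illustration and is not part of Arrow or the proposed R API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an Arrow API. For each key in `left`, returns the
// index of the row in `right` with the largest key <= the left key, or -1 if
// no such row exists within `tolerance`. Both inputs must be sorted ascending.
std::vector<std::ptrdiff_t> AsofMatch(const std::vector<int64_t>& left,
                                      const std::vector<int64_t>& right,
                                      int64_t tolerance) {
  assert(tolerance >= 0);
  std::vector<std::ptrdiff_t> matches;
  matches.reserve(left.size());
  std::size_t seen = 0;  // count of right rows with key <= current left key
  for (int64_t key : left) {
    // Advance past every right row that is not newer than this left row.
    while (seen < right.size() && right[seen] <= key) {
      ++seen;
    }
    if (seen > 0 && key - right[seen - 1] <= tolerance) {
      matches.push_back(static_cast<std::ptrdiff_t>(seen - 1));
    } else {
      matches.push_back(-1);  // nothing close enough: the row stays unmatched
    }
  }
  return matches;
}
```

Because both inputs are sorted, a single forward pass over the right side is enough; an unmatched left row would surface as nulls in the joined columns.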







[jira] [Commented] (ARROW-17533) [R] Implement asof join

2022-08-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17585027#comment-17585027
 ] 

Jonathan Keane commented on ARROW-17533:


A bit more prior art, and folks asking for this: 
https://stackoverflow.com/questions/58538114/is-there-an-r-equivalent-of-pythons-pandas-merge-asof

> [R] Implement asof join
> ---
>
> Key: ARROW-17533
> URL: https://issues.apache.org/jira/browse/ARROW-17533
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> With ARROW-16083 we have asof joins, could we expose this in R?
> Docs for the node: 
> https://arrow.apache.org/docs/cpp/api/compute.html?highlight=asof#_CPPv4N5arrow7compute19AsofJoinNodeOptionsE
> A possible syntax might be (there does not appear to be a syntax in dplyr for 
> this already): 
> {code}
> asof_join(table1, table2, by = "field", tolerance = 1) 
> {code}





[jira] [Created] (ARROW-17526) [R] [Docs] Improve (or really actually document) our Python bridge documentation

2022-08-25 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-17526:
--

 Summary: [R] [Docs] Improve (or really actually document) our 
Python bridge documentation 
 Key: ARROW-17526
 URL: https://issues.apache.org/jira/browse/ARROW-17526
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation, R
Reporter: Jonathan Keane


https://twitter.com/jonkeane/status/1560016227824721920?s=20=g2MhdOOJbh0q0MpxPI4R_Q

When I wrote this, I wished there was a one-pager I could point to that shows 
passing a Table or RecordBatchReader back and forth. 
https://arrow.apache.org/cookbook/r/using-pyarrow-from-r.html#introduction-4 
also has some details, but is more focused on scalars and arrays than tables. 





[jira] [Commented] (ARROW-17458) [C++] CSV Writer: Unsupported cast from decimal to utf8

2022-08-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584853#comment-17584853
 ] 

Jonathan Keane commented on ARROW-17458:


We ran into this issue today as well, working on conversions for benchmarking 
datasets

> [C++] CSV Writer: Unsupported cast from decimal to utf8 
> 
>
> Key: ARROW-17458
> URL: https://issues.apache.org/jira/browse/ARROW-17458
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 6.0.1
>Reporter: Pavel Kovalenko
>Priority: Major
>  Labels: csv, decimal, unsupported
>
> The following code snippet fails with an Unsupported cast error if a table 
> has a decimal column.
> {code:cpp}
> std::shared_ptr<arrow::Table> table;
> ARROW_CHECK_OK(reader->ReadAll(&table));
> std::shared_ptr<arrow::io::FileOutputStream> output = 
> arrow::io::FileOutputStream::Open(csvPath).ValueOrDie();
> auto writeOptions = arrow::csv::WriteOptions::Defaults();
> writeOptions.include_header = false;
> auto status = arrow::csv::WriteCSV(*table, writeOptions, output.get());
> if (!status.ok()) {
> SETHROW_ERROR(std::runtime_error, "Couldn't write table csv: {}", 
> status.message());
> }
> {code}
> {code:cpp}
> Unsupported cast from decimal128(7, 2) to utf8 using function cast_string
> {code}
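
Until such a cast is supported, one way to see what the missing conversion has to do is to format the decimal by hand. The sketch below is only an illustration under stated assumptions: it does not use Arrow's API, FormatDecimal is a hypothetical helper, and it assumes the unscaled value fits in int64_t (true for decimal128(7, 2), not for wider precisions).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not an Arrow API. Renders a decimal stored as an
// unscaled integer with `scale` digits after the decimal point.
std::string FormatDecimal(int64_t unscaled, int scale) {
  assert(scale >= 0);
  const bool negative = unscaled < 0;
  // Take the magnitude in unsigned arithmetic so INT64_MIN does not overflow.
  const uint64_t magnitude = negative ? 0 - static_cast<uint64_t>(unscaled)
                                      : static_cast<uint64_t>(unscaled);
  std::string digits = std::to_string(magnitude);
  const std::size_t frac = static_cast<std::size_t>(scale);
  if (frac > 0) {
    // Left-pad with zeros so at least one digit precedes the point.
    if (digits.size() <= frac) {
      digits = std::string(frac + 1 - digits.size(), '0') + digits;
    }
    digits.insert(digits.size() - frac, ".");
  }
  return negative ? "-" + digits : digits;
}
```

With scale 2, an unscaled value of 12345 is rendered as 123.45 and -5 as -0.05, which is the text a CSV writer would need to emit for a decimal128(7, 2) column.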





[jira] [Resolved] (ARROW-17084) [R] Install the package before linting

2022-08-02 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-17084.

Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13620
[https://github.com/apache/arrow/pull/13620]

> [R] Install the package before linting
> --
>
> Key: ARROW-17084
> URL: https://issues.apache.org/jira/browse/ARROW-17084
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> The R package should be installed before linting. See
> [https://github.com/r-lib/lintr/issues/352#issuecomment-587004345] and
> https://github.com/r-lib/lintr/issues/406#issuecomment-534601141.





[jira] [Commented] (ARROW-12590) [C++][R] Update copies of Homebrew files to reflect recent updates

2022-07-29 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573158#comment-17573158
 ] 

Jonathan Keane commented on ARROW-12590:


Yeah, that should work until the Homebrew maintainers decide to pull it out

> [C++][R] Update copies of Homebrew files to reflect recent updates
> --
>
> Key: ARROW-12590
> URL: https://issues.apache.org/jira/browse/ARROW-12590
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, R
>Reporter: Ian Cook
>Assignee: Jacob Wujciak-Jens
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Our copies of the Homebrew formulae at 
> [https://github.com/apache/arrow/tree/master/dev/tasks/homebrew-formulae] 
> have drifted out of sync with what's currently in 
> [https://github.com/Homebrew/homebrew-core/tree/master/Formula] and 
> [https://github.com/autobrew/homebrew-core/blob/master/Formula|https://github.com/autobrew/homebrew-core/blob/master/Formula/].
>  Get them back in sync and consider automating some method of checking that 
> they are in sync, e.g. by failing the {{homebrew-cpp}} and 
>  {{homebrew-r-autobrew}} nightly tests if our copies don't match what's in 
> the Homebrew and autobrew repos (but only if there were changes there that 
> weren't made in our repo, and not the inverse).
> Update the instructions at 
>  
> [https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingHomebrewpackages]
>  as needed.





[jira] [Resolved] (ARROW-17166) [R] [CI] force_tests() cannot return TRUE

2022-07-29 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-17166.

Resolution: Fixed

Issue resolved by pull request 13680
[https://github.com/apache/arrow/pull/13680]

> [R] [CI] force_tests() cannot return TRUE
> -
>
> Key: ARROW-17166
> URL: https://issues.apache.org/jira/browse/ARROW-17166
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, R
>Reporter: Rok Mihevc
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: CI, pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Update: the OOM has cleared up so the scope of this PR changed.
> Old title: [R] [CI] Exclude large memory tests from the force-tests job on CI
> =
> We have noticed R CI job (AMD64 Ubuntu 20.04 R 4.2 Force-Tests true) failing 
> on master: 
> [1|https://github.com/apache/arrow/runs/7424773120?check_suite_focus=true#step:7:5547],
>  
> [2|https://github.com/apache/arrow/runs/7431821192?check_suite_focus=true#step:7:5804],
>  
> [3|https://github.com/apache/arrow/runs/7445803518?check_suite_focus=true#step:7:16305]
> with:
> {code:java}
> Start test: array uses local timezone for POSIXct without timezone
>   test-Array.R:269:3 [success]
> System has not been booted with systemd as init system (PID 1). Can't operate.
> Failed to create bus connection: Host is down
> {code}





[jira] [Commented] (ARROW-12590) [C++][R] Update copies of Homebrew files to reflect recent updates

2022-07-29 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573003#comment-17573003
 ] 

Jonathan Keane commented on ARROW-12590:


Agreed with syncing (and the original intent of this ticket was basically to 
find a way to detect if and when this happens in order to alert us about it). 
It is ok that the autobrew and the homebrew formulae are different (since in 
the newest versions of the autobrew setup, if we are on a modern enough system 
we _just use brew_).

If I'm remembering correctly, 
https://github.com/apache/arrow/pull/12157/files#diff-4b112dbca2ece7c78e15eb8aff3218e21dd6f4b1fab7cfc9182830488f68ca58R22-R30
 was basically the operative code that fixes this. If I were you, I would take 
the commits on my branch there and create a new branch and push forward with 
that since it will let you run it in CI. Though the R tests will probably 
segfault with the simd issue in ARROW-15678. Maybe that's fine (since it's 
"only" a limited number of computers that this happens on — just so happens the 
GH runners are one of those, apparently) or maybe we'll need to actually 
resolve ARROW-15678? 

> [C++][R] Update copies of Homebrew files to reflect recent updates
> --
>
> Key: ARROW-12590
> URL: https://issues.apache.org/jira/browse/ARROW-12590
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, R
>Reporter: Ian Cook
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Our copies of the Homebrew formulae at 
> [https://github.com/apache/arrow/tree/master/dev/tasks/homebrew-formulae] 
> have drifted out of sync with what's currently in 
> [https://github.com/Homebrew/homebrew-core/tree/master/Formula] and 
> [https://github.com/autobrew/homebrew-core/blob/master/Formula|https://github.com/autobrew/homebrew-core/blob/master/Formula/].
>  Get them back in sync and consider automating some method of checking that 
> they are in sync, e.g. by failing the {{homebrew-cpp}} and 
>  {{homebrew-r-autobrew}} nightly tests if our copies don't match what's in 
> the Homebrew and autobrew repos (but only if there were changes there that 
> weren't made in our repo, and not the inverse).
> Update the instructions at 
>  
> [https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingHomebrewpackages]
>  as needed.





[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-25 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17571125#comment-17571125
 ] 

Jonathan Keane commented on ARROW-15678:


I have no updates beyond what's discussed above: there are a few approaches, 
none of them ideal, we need someone to champion this (or risk the homebrew 
maintainers turning off optimizations on us)

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 13.5h
>  Remaining Estimate: 0h
>






[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570163#comment-17570163
 ] 

Jonathan Keane edited comment on ARROW-15678 at 7/22/22 7:28 PM:
-

Homebrew only accepted that as a temporary workaround and has threatened to 
turn off optimizations if we don't resolve this. They haven't followed 
through yet, though. 
https://github.com/Homebrew/homebrew-core/issues/94724#issuecomment-1063031123 


was (Author: jonkeane):
Homebrew only accepted that as a temporary workaround and has threatened to 
turn off optimizations if we don't resolve this. They haven't followed 
through yet, though.

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-22 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17570163#comment-17570163
 ] 

Jonathan Keane commented on ARROW-15678:


Homebrew only accepted that as a temporary workaround and has threatened to 
turn off optimizations if we don't resolve this. They haven't followed 
through yet, though.

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-17115) [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

2022-07-20 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17569175#comment-17569175
 ] 

Jonathan Keane commented on ARROW-17115:


A reprex that causes this from R (which is effectively the TPC-H 12 query that 
segfaults):


{code:r}
library(arrow)
library(dplyr)
library(arrowbench)

ensure_source("tpch", scale_factor = 10)

open_dataset("data/lineitem_10.parquet") %>%
  filter(
    l_shipmode %in% c("MAIL", "SHIP"),
    l_commitdate < l_receiptdate,
    l_shipdate < l_commitdate,
    l_receiptdate >= as.Date("1994-01-01"),
    l_receiptdate < as.Date("1995-01-01")
  ) %>%
  inner_join(
    open_dataset("data/orders_10.parquet"),
    by = c("l_orderkey" = "o_orderkey")
  ) %>%
  group_by(l_shipmode) %>%
  summarise(
    high_line_count = sum(
      if_else(
        (o_orderpriority == "1-URGENT") | (o_orderpriority == "2-HIGH"),
        1L,
        0L
      )
    ),
    low_line_count = sum(
      if_else(
        (o_orderpriority != "1-URGENT") & (o_orderpriority != "2-HIGH"),
        1L,
        0L
      )
    )
  ) %>%
  ungroup() %>%
  arrange(l_shipmode) %>%
  collect()
{code}

> [C++] HashJoin fails if it encounters a batch with more than 32Ki rows
> --
>
> Key: ARROW-17115
> URL: https://issues.apache.org/jira/browse/ARROW-17115
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Blocker
> Fix For: 9.0.0
>
>
> The new swiss join assumes that batches are being broken according to the 
> morsel/batch model and it assumes those batches have, at most, 32Ki rows 
> (signed 16-bit indices are used in various places).
> However, we are not currently slicing all of our inputs to batches this 
> small.  This is causing conbench to fail and would likely be a problem with 
> any large inputs.
> We should fix this by slicing batches in the engine to the appropriate 
> maximum size.
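
The slicing fix described in the issue can be sketched with the standard library alone. This is an illustration under assumptions, not the engine's actual code: SliceOffsets and kMaxRowsPerBatch are hypothetical names, and the sketch only computes (offset, length) pairs that a caller could hand to a zero-copy slice operation. A signed 16-bit index tops out at 32767, so 32768 rows (indices 0 through 32767) is the most one batch can address.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical constant mirroring the 32Ki-row limit described in the issue.
constexpr int64_t kMaxRowsPerBatch = 32 * 1024;

// Hypothetical helper, not an Arrow API. Returns (offset, length) pairs that
// cover [0, num_rows) in order, with every length <= max_rows.
std::vector<std::pair<int64_t, int64_t>> SliceOffsets(
    int64_t num_rows, int64_t max_rows = kMaxRowsPerBatch) {
  assert(num_rows >= 0 && max_rows > 0);
  std::vector<std::pair<int64_t, int64_t>> slices;
  for (int64_t offset = 0; offset < num_rows; offset += max_rows) {
    // The final slice may be shorter than max_rows.
    slices.emplace_back(offset, std::min(max_rows, num_rows - offset));
  }
  return slices;
}
```

A 70000-row input, for example, yields three slices of 32768, 32768, and 4464 rows, so every row index within a slice fits in a signed 16-bit integer.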





[jira] [Updated] (ARROW-17115) [C++] HashJoin fails if it encounters a batch with more than 32Ki rows

2022-07-20 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-17115:
---
Fix Version/s: 9.0.0

> [C++] HashJoin fails if it encounters a batch with more than 32Ki rows
> --
>
> Key: ARROW-17115
> URL: https://issues.apache.org/jira/browse/ARROW-17115
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Blocker
> Fix For: 9.0.0
>
>
> The new swiss join assumes that batches are being broken according to the 
> morsel/batch model and it assumes those batches have, at most, 32Ki rows 
> (signed 16-bit indices are used in various places).
> However, we are not currently slicing all of our inputs to batches this 
> small.  This is causing conbench to fail and would likely be a problem with 
> any large inputs.
> We should fix this by slicing batches in the engine to the appropriate 
> maximum size.





[jira] [Commented] (ARROW-13062) [Dev] Add a way for people to add information to our saved crossbow data

2022-07-14 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566851#comment-17566851
 ] 

Jonathan Keane commented on ARROW-13062:


IMHO manual would be fine. And honestly, probably will be needed at some level 
since autocreating jiras will result in a bunch of jiras that overlap (in cases 
where multiple failures result from one change), or need to be duplicated 
manually when one job failure is the result of multiple changes or failures.

Anyway, it's not a high priority now, we can wait until someone bumps it again 
— but wanted to make sure you got credit if it was already done as part of the 
work you pushed to get all of the other great stuff out

> [Dev] Add a way for people to add information to our saved crossbow data
> 
>
> Key: ARROW-13062
> URL: https://issues.apache.org/jira/browse/ARROW-13062
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Jonathan Keane
>Priority: Major
>
> We should have a simple + lightweight way to annotate specific builds with 
> information like "won't be fixed until dask has a new release" or "this is 
> supposed to be fixed in ARROW-XXX".
> We should find an easy, lightweight way to add this kind of information. 
> Only relevant in its previous parent: -We *should not* require, ask, or allow 
> people to add this information to the JSON that is saved as part of 
> ARROW-13509. That JSON should be kept pristine and not have manual edits. 
> Instead, we should have a plain-text look up file that matches notes to 
> specific builds (maybe to specific dates?)-





[jira] [Closed] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-8043.
-
Resolution: Fixed

Done with the work on https://crossbow.voltrondata.com

> [Developer] Provide better visibility for failed nightly builds
> ---
>
> Key: ARROW-8043
> URL: https://issues.apache.org/jira/browse/ARROW-8043
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Sam Albers
>Priority: Major
>  Labels: pull-request-available
>
> Emails reporting nightly failures are unsatisfactory in two ways: there is a 
> large click/scroll distance between the links presented in that email and the 
> actual error message. Worse, once one is there it's not clear what JIRAs have 
> been made or which of them are in progress.
> One solution would be to replace or augment the [NIGHTLY] email with a page 
> ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows 
> how many nights it has failed, a shortcut to the actual error line in CI's 
> logs, and useful views of JIRA. We could accomplish this with:
>  - dedicated JIRA tags; one for each nightly job so a JIRA can be easily 
> associated with specific jobs
>  - A static HTML dashboard with client side JavaScript to
>  ** scrape JIRA and update the page dynamically as soon as JIRAs are opened
>  ** show any relationships between failing jobs
>  ** highlight jobs that have not been addressed, along with a counter of how 
> many nights it has gone unaddressed
>  - provide automatic and expedited creation of correctly labelled JIRAs, so 
> that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be 
> fairly straightforward:
>  
> [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626]





[jira] [Reopened] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reopened ARROW-8043:
---
  Assignee: Sam Albers

> [Developer] Provide better visibility for failed nightly builds
> ---
>
> Key: ARROW-8043
> URL: https://issues.apache.org/jira/browse/ARROW-8043
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Developer Tools
>Affects Versions: 0.16.0
>Reporter: Ben Kietzman
>Assignee: Sam Albers
>Priority: Major
>  Labels: pull-request-available
>
> Emails reporting nightly failures are unsatisfactory in two ways: there is a 
> large click/scroll distance between the links presented in that email and the 
> actual error message. Worse, once one is there it's not clear what JIRAs have 
> been made or which of them are in progress.
> One solution would be to replace or augment the [NIGHTLY] email with a page 
> ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows 
> how many nights it has failed, a shortcut to the actual error line in CI's 
> logs, and useful views of JIRA. We could accomplish this with:
>  - dedicated JIRA tags; one for each nightly job so a JIRA can be easily 
> associated with specific jobs
>  - A static HTML dashboard with client side JavaScript to
>  ** scrape JIRA and update the page dynamically as soon as JIRAs are opened
>  ** show any relationships between failing jobs
>  ** highlight jobs that have not been addressed, along with a counter of how 
> many nights it has gone unaddressed
>  - provide automatic and expedited creation of correctly labelled JIRAs, so 
> that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be 
> fairly straightforward:
>  
> [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626]





[jira] [Closed] (ARROW-13936) Add a column to show us the number of time that this job is failing

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-13936.
--
Resolution: Fixed

Done with the work on https://crossbow.voltrondata.com

> Add a column to show us the number of time that this job is failing
> ---
>
> Key: ARROW-13936
> URL: https://issues.apache.org/jira/browse/ARROW-13936
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: David Dali Susanibar Arce
>Assignee: Sam Albers
>Priority: Minor
>
> Try to use an external repository to collect information about the names of failing jobs





[jira] [Reopened] (ARROW-13936) Add a column to show us the number of time that this job is failing

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reopened ARROW-13936:

  Assignee: Sam Albers

> Add a column to show us the number of time that this job is failing
> ---
>
> Key: ARROW-13936
> URL: https://issues.apache.org/jira/browse/ARROW-13936
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: David Dali Susanibar Arce
>Assignee: Sam Albers
>Priority: Minor
>
> Try to use an external repository to collect information about the names of failing jobs





[jira] [Closed] (ARROW-12845) [R] [C++] S3 connections for different providers

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-12845.
--
Resolution: Won't Fix

> [R] [C++] S3 connections for different providers
> 
>
> Key: ARROW-12845
> URL: https://issues.apache.org/jira/browse/ARROW-12845
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Affects Versions: 4.0.0
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Hi
> As a part of my thesis, I want to create an S3 bucket on DigitalOcean (what 
> PUC uses). While I can write parquet files on my laptop and upload them to 
> DigitalOcean Spaces (i.e. an "S3 + Google Drive") from the browser or by 
> using rclone, I could work on editing the existing code that connects to 
> Amazon S3 and provide a function that connects to 
> DigitalOcean/Linode/IBM/etc.
> This could be done in a way that the Amazon URL is the default and the user 
> could specify something like `new_s3_fun(...,  provider = "Tencent")` and 
> connect to an S3 that is not Amazon.
> Also, this involves the need to write more S3 documentation.





[jira] [Closed] (ARROW-12862) [CI] Gather + display reliability of crossbow builds

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-12862.
--
Resolution: Fixed

Done with the work on https://crossbow.voltrondata.com

> [CI] Gather + display reliability of crossbow builds
> 
>
> Key: ARROW-12862
> URL: https://issues.apache.org/jira/browse/ARROW-12862
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Jonathan Keane
>Assignee: Sam Albers
>Priority: Major
>
> From Wes's suggestion on the mailing list:
> Having a website
> dashboard showing build health over time along with a ~ weekly e-mail
> to dev@ indicating currently broken builds and the reliability of each
> build over the trailing 7 or 30 days would be useful. Knowing that a
> particular build is only passing 20% of the time would help steer our
> efforts.
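
The reliability figure suggested above is simple arithmetic over recent runs. The sketch below is a hypothetical illustration, not how the dashboard computes it: TrailingPassRate is an invented name, and it assumes one recorded result per nightly run, stored oldest first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper. `results` holds one entry per nightly run, oldest
// first, where true means the build passed. Returns the fraction of passing
// runs among the most recent `window` runs (for example 7 or 30).
double TrailingPassRate(const std::vector<bool>& results, std::size_t window) {
  assert(window > 0);
  const std::size_t n = results.size() < window ? results.size() : window;
  if (n == 0) {
    return 0.0;  // no runs recorded yet
  }
  std::size_t passes = 0;
  for (std::size_t i = results.size() - n; i < results.size(); ++i) {
    if (results[i]) {
      ++passes;
    }
  }
  return static_cast<double>(passes) / static_cast<double>(n);
}
```

A build that passed once in its last five runs gets a rate of 0.2, the "only passing 20% of the time" case from the suggestion.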





[jira] [Assigned] (ARROW-12862) [CI] Gather + display reliability of crossbow builds

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-12862:
--

Assignee: Sam Albers

> [CI] Gather + display reliability of crossbow builds
> 
>
> Key: ARROW-12862
> URL: https://issues.apache.org/jira/browse/ARROW-12862
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration
>Reporter: Jonathan Keane
>Assignee: Sam Albers
>Priority: Major
>
> From Wes's suggestion on the mailing list:
> Having a website
> dashboard showing build health over time along with a ~ weekly e-mail
> to dev@ indicating currently broken builds and the reliability of each
> build over the trailing 7 or 30 days would be useful. Knowing that a
> particular build is only passing 20% of the time would help steer our
> efforts.





[jira] [Closed] (ARROW-14378) [R] Make custom extension classes for (some) cols with row-level metadata

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-14378.
--
Resolution: Won't Fix

We ended up supporting geo columns using the geoarrow package + extension types

> [R] Make custom extension classes for (some) cols with row-level metadata
> -
>
> Key: ARROW-14378
> URL: https://issues.apache.org/jira/browse/ARROW-14378
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> The major use case for this is SF columns, which have attributes/metadata for 
> each element of a column. We originally stored these in our standard 
> column-level metadata, but that was very fragile and took forever, so we 
> disabled it in ARROW-13189.
> This will likely take some steps to accomplish. I've sketched out some in the 
> subtasks here (though if we have a different approach, we could do that 
> directly)





[jira] [Closed] (ARROW-12182) [R] [Dev] new helpers and suggests for testing

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-12182.
--
Resolution: Won't Fix

> [R] [Dev] new helpers and suggests for testing
> --
>
> Key: ARROW-12182
> URL: https://issues.apache.org/jira/browse/ARROW-12182
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools, R
>Affects Versions: 3.0.0
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Minor
>
> _Related to https://issues.apache.org/jira/browse/ARROW-11705_
> While working on the related tickets I've run into the following blockers:
> 1. Does it make sense to create expect_dplyr_named()? (i.e. to mimic 
> https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L56-L59)
> 2. Does it make sense to create expect_dplyr_identical()? (i.e. to mimic 
> https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L61-L69
>  and 
> https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L83-L91)
> 3. Do we need to add glue to Suggests? (i.e. to replicate 
> https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L95-L100)





[jira] [Closed] (ARROW-14624) [R] [Docs] Remove our tabbing hack now that it's supported by pkgdown

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-14624.
--
Resolution: Fixed

This was fixed as part of the work to update the version switcher in the docs.

> [R] [Docs] Remove our tabbing hack now that it's supported by pkgdown
> -
>
> Key: ARROW-14624
> URL: https://issues.apache.org/jira/browse/ARROW-14624
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> tabsets are now supported natively in pkgdown (with bootstrap 5)
> https://github.com/r-lib/pkgdown/pull/1694
> So we can pull out the hack we have to make that work for our dev docs 
> vignette





[jira] [Closed] (ARROW-16076) [R] Bindings for the new TPC-H generator

2022-07-13 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-16076.
--
Resolution: Won't Fix

Since the TPC-H generator does not generate compliant data, there's not a big 
need to expose this in R.

> [R] Bindings for the new TPC-H generator
> 
>
> Key: ARROW-16076
> URL: https://issues.apache.org/jira/browse/ARROW-16076
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Now that https://github.com/apache/arrow/pull/12537 is merged, we should 
> implement the R changes needed to make that useable from R.
> We should basically do the opposite of 
> https://github.com/apache/arrow/pull/12537/commits/4b16296b4ef8cd3b3d440e8b7f8af32a89a16788
> But also add in the fixes from Weston: 
> https://github.com/westonpace/arrow/commit/7c4c0e0b4e208918eb195701fab5d631b8c9517a





[jira] [Commented] (ARROW-13062) [Dev] Add a way for people to add information to our saved crossbow data

2022-07-13 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566383#comment-17566383
 ] 

Jonathan Keane commented on ARROW-13062:


[~boshek] Did you already add this ability? I know it's a slightly different 
set of tickets than the ones we actually worked, but we should either close it 
as duplicate, done, or won't fix (and feel free to take credit for it if you 
did it elsewhere as part of a larger ticket!)

> [Dev] Add a way for people to add information to our saved crossbow data
> 
>
> Key: ARROW-13062
> URL: https://issues.apache.org/jira/browse/ARROW-13062
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Developer Tools
>Reporter: Jonathan Keane
>Priority: Major
>
> We should have a simple + lightweight way to annotate specific builds with 
> information like "won't be fixed until dask has a new release" or "this is 
> supposed to be fixed in ARROW-XXX".
> We should find an easy, lightweight way to add this kind of information. 
> Only relevant in its previous parent: -We *should not* require, ask, or allow 
> people to add this information to the JSON that is saved as part of 
> ARROW-13509. That JSON should be kept pristine and not have manual edits. 
> Instead, we should have a plain-text look up file that matches notes to 
> specific builds (maybe to specific dates?)-





[jira] [Resolved] (ARROW-17059) [C++] Archery cpp-micro arrow-compute-expression-benchmark fails with Invalid: Value lengths differed from ExecBatch length

2022-07-12 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-17059.

Fix Version/s: 10.0.0
   Resolution: Fixed

Issue resolved by pull request 13584
[https://github.com/apache/arrow/pull/13584]

> [C++] Archery cpp-micro arrow-compute-expression-benchmark fails with 
> Invalid: Value lengths differed from ExecBatch length
> ---
>
> Key: ARROW-17059
> URL: https://issues.apache.org/jira/browse/ARROW-17059
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery, C++
>Reporter: Elena Henderson
>Assignee: Sasha Krassovsky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 10.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  [https://github.com/apache/arrow/pull/13179] causes 
> {{arrow-compute-expression-benchmark}}  to fail with:
> {code:java}
> -- Arrow Fatal Error --
> Invalid: Value lengths differed from ExecBatch length {code}
>  
>  





[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-07-07 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564018#comment-17564018
 ] 

Jonathan Keane commented on ARROW-15678:


Last I checked, the homebrew maintainers have said that they will disable all 
optimization for arrow if we don't get this sorted on our own. So not required 
if we're ok with that (though we should engage with them on this)

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Updated] (ARROW-12059) [R] Accept format-specific scan options in collect()

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-12059:
---
Fix Version/s: (was: 9.0.0)

> [R] Accept format-specific scan options in collect()
> 
>
> Key: ARROW-12059
> URL: https://issues.apache.org/jira/browse/ARROW-12059
> Project: Apache Arrow
>  Issue Type: Task
>  Components: R
>Affects Versions: 4.0.0
>Reporter: David Li
>Priority: Major
>  Labels: dataset, datasets
>
> ARROW-9749 and ARROW-8631 added format/scan-specific options. In R, the most 
> natural place to accept these is in collect(), but this isn't yet done.





[jira] [Commented] (ARROW-15283) [Python][R] Remove deprecated placeholders for UseAsync

2022-06-30 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561113#comment-17561113
 ] 

Jonathan Keane commented on ARROW-15283:


Is this something you think you'll be able to get to before 9.0.0 
[~westonpace]? Happy to push it out if not

> [Python][R] Remove deprecated placeholders for UseAsync
> ---
>
> Key: ARROW-15283
> URL: https://issues.apache.org/jira/browse/ARROW-15283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
> Fix For: 9.0.0
>
>
> In the 7.0.0 release we are marking the UseAsync parameters / functions as 
> deprecated.  In a future release we should remove these entirely.





[jira] [Updated] (ARROW-12137) [R] New/improved vignette on dplyr features

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-12137:
---
Fix Version/s: (was: 9.0.0)

> [R] New/improved vignette on dplyr features
> ---
>
> Key: ARROW-12137
> URL: https://issues.apache.org/jira/browse/ARROW-12137
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>






[jira] [Updated] (ARROW-12213) [R] copy_files doesn't make it easy to copy a single file

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-12213:
---
Fix Version/s: (was: 9.0.0)

> [R] copy_files doesn't make it easy to copy a single file
> -
>
> Key: ARROW-12213
> URL: https://issues.apache.org/jira/browse/ARROW-12213
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Neal Richardson
>Priority: Major
>
> copy_files (i.e. fs::CopyFiles) makes it trivial to recursively copy a 
> directory/bucket to or from S3, but I'm having a hard time downloading a 
> single file.
> cc [~bkietz]





[jira] [Updated] (ARROW-13165) [R] Add bindings for ProjectOptions

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-13165:
---
Fix Version/s: (was: 9.0.0)

> [R] Add bindings for ProjectOptions
> ---
>
> Key: ARROW-13165
> URL: https://issues.apache.org/jira/browse/ARROW-13165
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>
> The {{project}} kernel creates a column of struct (equivalent to a column of 
> named lists in R). Add to {{make_compute_options}} in {{compute.cpp}} so we 
> can pass {{ProjectOptions}} to the {{project}} kernel.
> One practical application of the {{project}} kernel is to create a binding 
> for the stringr function {{str_locate}} which returns a column of named lists.





[jira] [Updated] (ARROW-12711) [R] Bindings for paste(collapse), str_c(collapse), and str_flatten()

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-12711:
---
Fix Version/s: (was: 9.0.0)

> [R] Bindings for paste(collapse), str_c(collapse), and str_flatten()
> 
>
> Key: ARROW-12711
> URL: https://issues.apache.org/jira/browse/ARROW-12711
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>  Labels: query-engine
>
> These are the aggregating versions of string concatenation—they combine 
> values from a set of rows into a single value. 
> The bindings for {{paste()}} and {{str_c()}} might be tricky to implement 
> because when these functions are called with the {{collapse}} argument 
> unset, they do _not_ aggregate.
> In {{summarise()}} we need to be able to use scalar concatenation within 
> aggregate concatenation, like this: 
> {code:java}
> starwars %>%
>   filter(!is.na(hair_color) & !is.na(eye_color)) %>% 
>   group_by(homeworld) %>% 
>   summarise(hair_and_eyes = paste0(paste0(hair_color, "-haired and ", 
> eye_color, "-eyed"), collapse = ", ")){code}
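
To make the scalar versus aggregate distinction concrete, here is a small sketch in plain Python (not Arrow or R code). The helper `paste0` only mimics the R function's behaviour for equal-length inputs; recycling and NA handling are left out, and the sample values are made up.

```python
def paste0(*columns, collapse=None):
    """Mimic R's paste0() for equal-length columns (no recycling, no NA).

    Without `collapse` the result has one string per row (scalar
    concatenation). With `collapse` the per-row strings are joined into
    a single value (aggregate concatenation).
    """
    rows = ["".join(str(v) for v in values) for values in zip(*columns)]
    if collapse is None:
        return rows
    return collapse.join(rows)

hair = ["brown", "blond"]
eyes = ["blue", "green"]
n = len(hair)

# Scalar concatenation: one result per row.
per_row = paste0(hair, ["-haired and "] * n, eyes, ["-eyed"] * n)
# Aggregate concatenation: the rows collapse into one string.
combined = paste0(per_row, collapse=", ")
```

The same function name does two different jobs depending on whether `collapse` is set, which is the source of the difficulty mentioned in the ticket.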





[jira] [Updated] (ARROW-13766) [R] Add Arrow methods slice_min(), slice_max()

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-13766:
---
Fix Version/s: (was: 9.0.0)

> [R] Add Arrow methods slice_min(), slice_max()
> --
>
> Key: ARROW-13766
> URL: https://issues.apache.org/jira/browse/ARROW-13766
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>  Labels: query-engine
>
> Implement [{{slice_min()}} and 
> {{slice_max()}}|https://dplyr.tidyverse.org/reference/slice.html] methods for 
> {{ArrowTabular}}, {{Dataset}}, and {{arrow_dplyr_query}} objects.
> These dplyr functions supersede the older dplyr function 
> [{{top_n()}}|https://dplyr.tidyverse.org/reference/top_n.html] which I 
> suppose we should also consider implementing a method for.





[jira] [Updated] (ARROW-13767) [R] Add Arrow methods slice(), slice_head(), slice_tail()

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-13767:
---
Fix Version/s: (was: 9.0.0)

> [R] Add Arrow methods slice(), slice_head(), slice_tail()
> -
>
> Key: ARROW-13767
> URL: https://issues.apache.org/jira/browse/ARROW-13767
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>  Labels: query-engine
>
> Implement [{{slice()}}, {{slice_head()}}, and 
> {{slice_tail()}}|https://dplyr.tidyverse.org/reference/slice.html] methods 
> for {{ArrowTabular}}, {{Dataset}}, and {{arrow_dplyr_query}} objects. I 
> believe this should be relatively straightforward, using {{Take()}} to return 
> only the specified rows. We already have a {{head()}} method which I believe 
> we can reuse for {{slice_head()}}.





[jira] [Updated] (ARROW-13531) [R] Read CSV with comma as decimal mark

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-13531:
---
Fix Version/s: (was: 9.0.0)

> [R] Read CSV with comma as decimal mark
> ---
>
> Key: ARROW-13531
> URL: https://issues.apache.org/jira/browse/ARROW-13531
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>
> Followup to ARROW-13421. There is a new ConvertOption, that part is easy. 
> There may be some subtleties in emulating the readr way of supporting this 
> since it uses a broader {{locale()}} object, but maybe we just add 
> {{read_csv2_arrow}} (matching {{readr::read_csv2}} and {{base::read.csv2}}) 
> and that's enough.
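
For readers unfamiliar with the convention, this sketch shows what a `read_csv2`-style reader has to do: split on semicolons and treat the comma as the decimal mark. It uses only the Python standard library and is not the Arrow implementation; the function name `read_csv2_rows` is made up for the example.

```python
import csv
import io

def read_csv2_rows(text):
    """Parse semicolon-delimited text whose numbers use ',' as decimal mark.

    Returns a list of dicts; values that parse as numbers after swapping
    ',' for '.' become floats, everything else stays a string.
    """
    def convert(value):
        try:
            return float(value.replace(",", "."))
        except ValueError:
            return value

    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [{key: convert(val) for key, val in row.items()} for row in reader]

rows = read_csv2_rows("name;price\napple;1,5\npear;0,25\n")
```

A full `locale()`-style approach would also cover grouping marks and date formats, which this sketch deliberately ignores.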





[jira] [Updated] (ARROW-14028) [R] Cast of NaN to integer should return NA_integer_

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14028:
---
Fix Version/s: (was: 9.0.0)

> [R] Cast of NaN to integer should return NA_integer_
> 
>
> Key: ARROW-14028
> URL: https://issues.apache.org/jira/browse/ARROW-14028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>
> Casting double {{NaN}} to integer returns a sentinel value:
> {code:r}
> call_function("cast", Scalar$create(NaN), options = list(to_type = int32(), 
> allow_float_truncate = TRUE))
> #> Scalar
> #> -2147483648
> call_function("cast", Scalar$create(NaN), options = list(to_type = int64(), 
> allow_float_truncate = TRUE))
> #> Scalar
> #> -9223372036854775808{code}
> It would be nice if this would instead return {{NA_integer_}}.
> N.B. for some reason this doesn't reproduce in dplyr unless you round-trip it 
> back to double:
> {code:r}
> > Table$create(x = NaN) %>% transmute(as.double(as.integer(x))) %>% pull(1)
> #> [1] -2147483648{code}
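
The requested behaviour can be pictured with a plain-Python sketch, where `None` stands in for `NA_integer_`. This only illustrates the intended semantics; it is not how the Arrow cast kernel is written, and `cast_to_int` is a hypothetical helper.

```python
import math

def cast_to_int(values):
    """Truncate floats to ints, mapping NaN to None (a stand-in for NA)."""
    result = []
    for value in values:
        if value is None or math.isnan(value):
            # Missing stays missing; NaN becomes missing instead of a sentinel.
            result.append(None)
        else:
            # int() truncates toward zero, like a truncating float cast.
            result.append(int(value))
    return result

converted = cast_to_int([1.9, float("nan"), -2.5, None])
```

Infinite values are not handled here; a real cast would need a rule for them as well.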





[jira] [Updated] (ARROW-14067) [R] Add error handling to C++ compute functions listed via list_compute_functions() which don't have bindings in R or options not supplied by user

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14067:
---
Fix Version/s: (was: 9.0.0)

> [R] Add error handling to C++ compute functions listed via 
> list_compute_functions() which don't have bindings in R or options not 
> supplied by user
> --
>
> Key: ARROW-14067
> URL: https://issues.apache.org/jira/browse/ARROW-14067
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> Currently we have the function {{list_compute_functions()}} which lists all 
> available Arrow compute functions.  However, it can return functions which 
> have been implemented in C++ but don't yet have bindings in R.
> A recent ticket added options bindings for (nearly) all of the compute 
> functions that lacked them at that moment, but more could appear.
> Currently the error message shown is:
> {code:java}
> library(dplyr)
> library(arrow) # 5.0.0.2
> Table$create(tibble::tibble(Species = c("versicolor", "virginica", 
> "setosa"))) %>%
> mutate(x = arrow_utf8_trim(Species, options = list(characters = "a")))
> ## Error: Invalid: Attempted to initialize KernelState from null 
> ## FunctionOptions
> {code}
> We should catch this and instead raise a more user-friendly error. 
> Also, if a valid function is called without options supplied, we get a 
> {{could not find function}} error:
> {code:java}
> library(dplyr)
> library(arrow) # dev
> Table$create(tibble::tibble(Species = c("versicolor", "virginica", 
> "setosa"))) %>%
>   mutate(x = arrow_utf8_trim(Species))
> ## Error in arrow_utf8_trim(Species) :  could not find function 
> ## "arrow_utf8_trim"
> {code}
> It'd be great to instead inform the user that the correct options haven't 
> been supplied.
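
One way to picture the requested behaviour is a wrapper that checks for options before calling the kernel and raises a readable message. Everything in this sketch is hypothetical: `REQUIRED_OPTIONS`, `call_with_checked_options` and the error text are stand-ins written in plain Python, not Arrow APIs.

```python
# Hypothetical registry: compute functions that cannot run without options.
REQUIRED_OPTIONS = {"utf8_trim": ("characters",)}

def call_with_checked_options(name, value, options=None):
    """Raise a readable error when required options are missing."""
    options = options or {}
    missing = [opt for opt in REQUIRED_OPTIONS.get(name, ()) if opt not in options]
    if missing:
        raise ValueError(
            f"Function '{name}' requires the option(s): {', '.join(missing)}"
        )
    if name == "utf8_trim":
        # Stand-in for the real kernel: strip the given characters.
        return value.strip(options["characters"])
    raise NotImplementedError(f"No stand-in implemented for '{name}'")

trimmed = call_with_checked_options("utf8_trim", "setosa", {"characters": "a"})
try:
    call_with_checked_options("utf8_trim", "setosa")
    message = None
except ValueError as err:
    message = str(err)
```

The point of the design is that the check happens before the kernel is invoked, so the user sees which option is missing rather than a low-level KernelState error.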





[jira] [Updated] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14045:
---
Fix Version/s: (was: 9.0.0)

> [R] Support for .keep_all = TRUE with distinct() 
> -
>
> Key: ARROW-14045
> URL: https://issues.apache.org/jira/browse/ARROW-14045
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>






[jira] [Updated] (ARROW-14218) [R] More improvements to developer docs

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14218:
---
Fix Version/s: (was: 9.0.0)

> [R] More improvements to developer docs
> ---
>
> Key: ARROW-14218
> URL: https://issues.apache.org/jira/browse/ARROW-14218
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> * Add link to the main contributions guidelines
> * Add a test of "how do I know that my dev setup is OK?" to the end of the 
> R-only step
> * The R-only instructions just have instructions on how to install libarrow 
> but we should add a little about how to connect it up with the repo clone; 
> the instructions mention it in the Linux version but could be more explicit
> * When a user clones the repo via RStudio it creates an .rproj file in the 
> root directory - we should add instructions to clone the arrow fork from 
> the command line so we can use the project's .rproj file
> * We should consider removing the instruction for installing the released 
> version of libarrow (or demoting it to the second place and explain why we'd 
> use it) as typically a dev would want the dev version
> * Mac - you can't just install openssl, you need to add it to your path as 
> LibreSSL is the default - we should add instructions about this
> * Better demarcation between "special instructions for Linux" and the next 
> section - maybe use tabs again?
> * clarification of the difference between the build directory and the 
> installation directory





[jira] [Updated] (ARROW-14288) [R] Implement nrow on some collapsed queries

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14288:
---
Fix Version/s: (was: 9.0.0)

> [R] Implement nrow on some collapsed queries
> 
>
> Key: ARROW-14288
> URL: https://issues.apache.org/jira/browse/ARROW-14288
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>
> collapse() doesn't always mean we can't determine the number of rows. We can 
> try to solve some cases:
> * head/tail: compute number of rows, take the smaller of that and the 
> head/tail number
> * if filter == TRUE, take the number of rows of .data (which may contain a 
> query)
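
The two cases above might be sketched as follows. The dictionary representation of a query is invented for this example and does not reflect the actual `arrow_dplyr_query` structure.

```python
def query_nrow(source_rows, query):
    """Row count for a (hypothetical) query description, or None if unknown.

    `query` is a dict with optional keys:
      "filter": True (no filtering) or any other value (a real filter)
      "head" / "tail": maximum number of rows to keep
    """
    if query.get("filter", True) is not True:
        # A real filter: the count is unknown without running the query.
        return None
    n = source_rows
    for key in ("head", "tail"):
        if key in query:
            # Take the smaller of the row count and the head/tail number.
            n = min(n, query[key])
    return n
```

For instance, `query_nrow(32, {"head": 10})` gives 10, while a query with a real filter expression returns `None` to signal that the count cannot be determined cheaply.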





[jira] [Updated] (ARROW-14847) [R] Implement bindings for lubridate date/time parsing functions

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14847:
---
Fix Version/s: (was: 9.0.0)

> [R] Implement bindings for lubridate date/time parsing functions
> 
>
> Key: ARROW-14847
> URL: https://issues.apache.org/jira/browse/ARROW-14847
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>






[jira] [Updated] (ARROW-15016) [R] show_query() for an arrow_dplyr_query

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15016:
---
Fix Version/s: (was: 9.0.0)

> [R] show_query() for an arrow_dplyr_query
> -
>
> Key: ARROW-15016
> URL: https://issues.apache.org/jira/browse/ARROW-15016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> Now that we can print a query plan (ARROW-13785) we should wire this up in R 
> so we can see what execution plans are being put together for various queries 
> (like the TPC-H queries)





[jira] [Assigned] (ARROW-15016) [R] show_query() for an arrow_dplyr_query

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-15016:
--

Assignee: Dragoș Moldovan-Grünfeld

> [R] show_query() for an arrow_dplyr_query
> -
>
> Key: ARROW-15016
> URL: https://issues.apache.org/jira/browse/ARROW-15016
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> Now that we can print a query plan (ARROW-13785) we should wire this up in R 
> so we can see what execution plans are being put together for various queries 
> (like the TPC-H queries)





[jira] [Updated] (ARROW-15470) [R] Allows user to specify string to be used for missing data when writing CSV dataset

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15470:
---
Fix Version/s: (was: 9.0.0)

> [R] Allows user to specify string to be used for missing data when writing 
> CSV dataset
> --
>
> Key: ARROW-15470
> URL: https://issues.apache.org/jira/browse/ARROW-15470
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Priority: Major
>
> The ability to select the string to be used for missing data was implemented 
> for the CSV Writer in ARROW-14903 and as David Li points out below, is 
> available, so I think we just need to hook it up on the R side.
> This requires the values passed in as the "na" argument to be instead passed 
> through to "null_strings", similarly to what has been done with "skip" and 
> "skip_rows" in ARROW-15743.





[jira] [Updated] (ARROW-15803) [R] Empty JSON object parsed as corrupt data frame

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15803:
---
Fix Version/s: (was: 9.0.0)

> [R] Empty JSON object parsed as corrupt data frame
> --
>
> Key: ARROW-15803
> URL: https://issues.apache.org/jira/browse/ARROW-15803
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Will Jones
>Priority: Major
>
> If you have a JSON object field that is always empty, it does not seem to be 
> handled well, whether or not a schema is provided that tells Arrow what 
> should be in that object.
> {code:r}
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> json_val <- '{
>   "rows": [
> {"empty": {} },
> {"empty": {} },
> {"empty": {} }
>   ]
> }'
> # Remove newlines
> json_val <- gsub("\n", "", json_val)
> json_file <- tempfile()
> writeLines(json_val, json_file)
> schema <- schema(field("rows", list_of(struct(empty = struct(y = int32())))))) 
> raw <- read_json_arrow(json_file, schema=schema)
> raw$rows$empty
> #> Error: Corrupt x: no names
> {code}





[jira] [Updated] (ARROW-15719) [R] Simplify code for handling summarise() with no aggregations

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15719:
---
Fix Version/s: (was: 9.0.0)

> [R] Simplify code for handling summarise() with no aggregations
> ---
>
> Key: ARROW-15719
> URL: https://issues.apache.org/jira/browse/ARROW-15719
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>
> Check whether ARROW-15609 enables us to remove code from 
> {{{}[query-engine.R|https://github.com/apache/arrow/blob/master/r/R/query-engine.R]{}}}.





[jira] [Updated] (ARROW-15822) [C++] Cast duration to string (thus CSV writing) not supported

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15822:
---
Fix Version/s: (was: 9.0.0)

> [C++] Cast duration to string (thus CSV writing) not supported
> --
>
> Key: ARROW-15822
> URL: https://issues.apache.org/jira/browse/ARROW-15822
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 7.0.0, 7.0.1
>Reporter: Carl Boettiger
>Priority: Critical
>
> Edit (Dragos Moldovan-Grünfeld): The issue I opened (ARROW-15833) is 
> basically a duplicate of this. It's fundamentally a C++ issue that happened 
> to surface in the R CSV writer. I hope you don't mind, I modified the 
> components to C++
> ===
> Consider this reprex:
> {code:java}
> arrow::write_csv_arrow(data.frame(time = as.difftime(1, units="secs")), 
> "test.csv"){code}
> This errors with:
> Error: NotImplemented: Unsupported cast from duration[s] to utf8 using 
> function cast_string
>  
> Note that readr::write_csv() has no trouble with this (which renders the data 
> as "1" without a unit).  Arguably the readr rendering is lossy, but then we 
> usually assume units are provided in other metadata anyway.





[jira] [Updated] (ARROW-15879) [R] passing a schema calls open_dataset to fail on hive-partitioned csv files

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15879:
---
Fix Version/s: (was: 9.0.0)

> [R] passing a schema calls open_dataset to fail on hive-partitioned csv files
> -
>
> Key: ARROW-15879
> URL: https://issues.apache.org/jira/browse/ARROW-15879
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0, 7.0.1
>Reporter: Carl Boettiger
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Consider this reprex:
>  
> Create a dataset with hive partitions in csv format with write_dataset() (so 
> cool!):
>  
> {code:java}
> library(arrow)
> library(dplyr)
> path <- fs::dir_create("tmp")
> mtcars %>% group_by(gear) %>% write_dataset(path, format="csv")
> ## works fine, even with 'collect()'
> ds <- open_dataset(path, format="csv")
> ## but pass a schema, and things fail
> df <- open_dataset(path, format="csv", schema = ds$schema, skip_rows=1)
> df %>% collect()
>  {code}
> In the first call to open_dataset, we don't pass a schema and things work as 
> expected. 
> However, csv files often need a schema to be read in correctly, particularly 
> with partitioned data where it is easy to 'guess' the wrong type.  Passing 
> the schema though confuses open_dataset, because the grouping column 
> (partition column) isn't found on the individual files even though it is 
> mentioned in the schema!
> Nor can we just omit the grouping column from the schema, since then it is 
> effectively lost from the data. 
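
To see why the partition column is absent from the individual files: with hive-style partitioning the value is encoded in the directory name (for example gear=4), not in the CSV itself. A minimal base-R sketch of recovering it from a file path follows; partition_values() and the file name part-0.csv are hypothetical and this is not how arrow implements it.

```r
# Extract hive-style key=value segments from a file path.
# Returns a named list such as list(gear = "4"); values stay as strings.
partition_values <- function(path) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]
  kv <- parts[grepl("=", parts, fixed = TRUE)]
  keys <- sub("=.*$", "", kv)
  vals <- sub("^[^=]*=", "", kv)
  stats::setNames(as.list(vals), keys)
}

# Hypothetical file written under the "tmp" directory from the reprex
print(partition_values("tmp/gear=4/part-0.csv"))
```

Because the value only exists as a string in the path, a reader has to both find the column there and cast it to the type named in the schema, which is the step the ticket reports as failing.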





[jira] [Updated] (ARROW-16155) [R] lubridate functions for 9.0.0

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16155:
---
Fix Version/s: (was: 9.0.0)

> [R] lubridate functions for 9.0.0
> -
>
> Key: ARROW-16155
> URL: https://issues.apache.org/jira/browse/ARROW-16155
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Alessandro Molina
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> Umbrella ticket for lubridate functions in 9.0.0
> Future work that is not going to happen in v9 is recorded under 
> https://issues.apache.org/jira/browse/ARROW-16841





[jira] [Updated] (ARROW-16190) [CI][R] Implement CI on Apple M1 for R

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16190:
---
Fix Version/s: (was: 9.0.0)

> [CI][R] Implement CI on Apple M1 for R
> --
>
> Key: ARROW-16190
> URL: https://issues.apache.org/jira/browse/ARROW-16190
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Continuous Integration, R
>Reporter: Jacob Wujciak-Jens
>Priority: Critical
>






[jira] [Updated] (ARROW-16239) [R] $columns on Table and RB should be named

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16239:
---
Fix Version/s: (was: 9.0.0)

> [R] $columns on Table and RB should be named
> 
>
> Key: ARROW-16239
> URL: https://issues.apache.org/jira/browse/ARROW-16239
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Will Jones
>Priority: Minor
>  Labels: good-first-issue
>
> Currently, {{$columns}} method returns columns as a list without names. It 
> would be nice if they were named instead, similar to {{as.list}} on a 
> {{data.frame}}.
> {code:R}
> > library(arrow)
> > names(record_batch(x = 1, y = 'a')$columns)
> NULL
> > names(arrow_table(x = 1, y = 'a')$columns)
> NULL
> > as.list(data.frame(x = 1, y = 'a'))
> $x
> [1] 1
> $y
> [1] "a"
> {code}





[jira] [Updated] (ARROW-16768) [R] Factor levels cannot contain NA

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16768:
---
Fix Version/s: (was: 9.0.0)

> [R] Factor levels cannot contain NA
> ---
>
> Key: ARROW-16768
> URL: https://issues.apache.org/jira/browse/ARROW-16768
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Kieran Martin
>Priority: Minor
>
> If you try to write a data frame with a factor with a missing value to 
> parquet, you get the error: "Error: Invalid: Cannot insert dictionary values 
> containing nulls". 
> This seems likely due to how the metadata for factors is currently captured 
> in parquet files. Reprex follows:
>  
> library(arrow)
> bad_data <- data.frame(A = factor(1, 2, NA))
> write_parquet(bad_data, tempfile())
>  
>  
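
A note on the reprex, sketched in base R: factor(1, 2, NA) passes 2 as levels and NA as labels, so the resulting factor has NA as its only level. A missing value on its own does not put NA into the levels. The relabelling at the end is only a possible workaround and an assumption, not something suggested in the ticket.

```r
# x = 1, levels = 2, labels = NA: the only level is labelled NA
f_reprex <- factor(1, 2, NA)
print(anyNA(levels(f_reprex)))

# A missing *value* alone does not create an NA level
f_missing_value <- factor(c(1, 2, NA))
print(anyNA(levels(f_missing_value)))

# Keeping NA as a level (exclude = NULL) does put NA into the levels
f_na_level <- factor(c("a", NA), exclude = NULL)
print(anyNA(levels(f_na_level)))

# Possible workaround (assumption): give the NA level an explicit label
f_fixed <- f_na_level
levels(f_fixed)[is.na(levels(f_fixed))] <- "(missing)"
print(levels(f_fixed))
```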





[jira] [Updated] (ARROW-16777) [R] printing data in Table/RecordBatch print method

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16777:
---
Fix Version/s: (was: 9.0.0)

> [R] printing data in Table/RecordBatch print method
> ---
>
> Key: ARROW-16777
> URL: https://issues.apache.org/jira/browse/ARROW-16777
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python, R
>Reporter: Thomas Mock
>Priority: Minor
>
> Related to ARROW-16776 but after a brief discussion with Neal Richardson, he 
> requested that I split the improvement request into separate issues.
> When working with Arrow datasets/tables, I often find myself wanting to 
> interactively print or "see" the results of a query or the first few rows of 
> the data without having to fully collect into memory. 
> It would be ideal to lazily print some data with Table/RecordBatch print 
> methods, however, currently, the print methods return schema without data. 
> IE:
> ``` r
> library(dplyr)
> library(arrow)
> mtcars %>% arrow::write_parquet("mtcars.parquet")
> car_ds <- arrow::open_dataset("mtcars.parquet")
> car_ds
> #> FileSystemDataset with 1 Parquet file
> #> mpg: double
> #> cyl: double
> #> disp: double
> #> hp: double
> #> drat: double
> #> wt: double
> #> qsec: double
> #> vs: double
> #> am: double
> #> gear: double
> #> carb: double
> #> 
> #> See $metadata for additional Schema metadata
> car_ds %>%
>   compute()
> #> Table
> #> 32 rows x 11 columns
> #> $mpg 
> #> $cyl 
> #> $disp 
> #> $hp 
> #> $drat 
> #> $wt 
> #> $qsec 
> #> $vs 
> #> $am 
> #> $gear 
> #> $carb 
> #> 
> #> See $metadata for additional Schema metadata
> ```





[jira] [Assigned] (ARROW-16828) [R][Packaging] Turn on all compression libs for binaries

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-16828:
--

Assignee: Will Jones

> [R][Packaging] Turn on all compression libs for binaries
> 
>
> Key: ARROW-16828
> URL: https://issues.apache.org/jira/browse/ARROW-16828
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, R
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
> Fix For: 9.0.0
>
>
> We notably don't ship brotli for MacOS. 





[jira] [Updated] (ARROW-16878) [R] Move Windows GCS dependency building upstream

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16878:
---
Fix Version/s: (was: 9.0.0)

> [R] Move Windows GCS dependency building upstream
> -
>
> Key: ARROW-16878
> URL: https://issues.apache.org/jira/browse/ARROW-16878
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Packaging, R
>Reporter: Neal Richardson
>Priority: Major
>
> On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it 
> in the arrow build. A better solution would be to put google-cloud-cpp in 
> rtools-packages so we don't have to build it every time. 
> There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so 
> either we'd have to make one up for rtools-packages, or we use the bundled 
> google-cloud-cpp in our cmake and see if we can put as many of its 
> dependencies in rtools-packages to ease the build. Either way, we'd want to 
> start by adding its dependencies.
> https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json 
> exists in MINGW-packages and could be brought over, but I don't think it's a 
> big deal if it is bundled.
> https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD
>  exists and could be brought over, but note that it uses C++17. That doesn't 
> seem to be a hard requirement, at least for what we're using, since we're 
> building it with C++11.





[jira] [Assigned] (ARROW-16880) [R] Test GCS auth with gargle/googleAuthR

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-16880:
--

Assignee: Will Jones

> [R] Test GCS auth with gargle/googleAuthR
> -
>
> Key: ARROW-16880
> URL: https://issues.apache.org/jira/browse/ARROW-16880
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Will Jones
>Priority: Major
> Fix For: 9.0.0
>
>
> These are the main packages that let folks work with Google Cloud from R, so 
> we should make sure we can play nicely with their auth methods, how they 
> cache credentials, etc.





[jira] [Assigned] (ARROW-16887) [Doc][R] Document GCSFileSystem for R package

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-16887:
--

Assignee: Will Jones

> [Doc][R] Document GCSFileSystem for R package
> -
>
> Key: ARROW-16887
> URL: https://issues.apache.org/jira/browse/ARROW-16887
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Major
> Fix For: 9.0.0
>
>
> We should update the [cloud storage 
> vignette|https://arrow.apache.org/docs/r/articles/fs.html] and the filesystem 
> RD to show configuration and usage of GCSFileSystem.





[jira] [Updated] (ARROW-16879) [R] Add GCS tests using testbench

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16879:
---
Fix Version/s: (was: 9.0.0)

> [R] Add GCS tests using testbench
> -
>
> Key: ARROW-16879
> URL: https://issues.apache.org/jira/browse/ARROW-16879
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>
> Followup to ARROW-16510. That PR added the bindings and basic R tests that 
> don't require a live GCS connection. GCS has a "testbench" service you can 
> run on localhost to test, like how we use minio to test S3. See the Python 
> bindings PR for reference on how to set it up and run it, as well as some 
> tests we could add: https://github.com/apache/arrow/pull/12763





[jira] [Updated] (ARROW-16883) [R] Move macOS GCS dependency building upstream

2022-06-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16883:
---
Fix Version/s: (was: 9.0.0)

> [R] Move macOS GCS dependency building upstream
> ---
>
> Key: ARROW-16883
> URL: https://issues.apache.org/jira/browse/ARROW-16883
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>
> In ARROW-16510, we turned on ARROW_GCS in the autobrew formula, but it's 
> building it bundled in the arrow build. It would be more efficient if we 
> added dependencies (or google-cloud-cpp even) upstream to the autobrew 
> repositories and then used them like we do for aws-sdk-cpp and other 
> dependencies.





[jira] [Commented] (ARROW-16605) [CI][R] Fix revdep Crossbow job

2022-06-30 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561075#comment-17561075
 ] 

Jonathan Keane commented on ARROW-16605:


Is this something we can do before the release? If not, we should run revdeps 
manually before the release (now?) to catch possible issues with enough time to 
introduce fixes

> [CI][R] Fix revdep Crossbow job
> ---
>
> Key: ARROW-16605
> URL: https://issues.apache.org/jira/browse/ARROW-16605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Blocker
> Fix For: 9.0.0
>
>
> The revdep Crossbow job is currently not functioning correctly. This led to 
> changed behaviour affecting a revdep with the 8.0.0 release, requiring a 
> patch after initial submission.
> cc: [~jonkeane]





[jira] [Comment Edited] (ARROW-15805) [R] Update the as.Date() binding

2022-06-29 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560679#comment-17560679
 ] 

Jonathan Keane edited comment on ARROW-15805 at 6/29/22 11:31 PM:
--

This is alluded to in the PR comments, but taking a step back and thinking 
about the behavior:

{code}
dates_dash_first <- c("2022-01-01", "2022/02/02", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")
dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")

as.Date(dates_dash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-01-01" NA   NA   NA   "2022-01-01"
#> [6] "2022-01-01"

as.Date(dates_slash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-02-02" NA   "2022-02-02" "2022-02-02" NA  
#> [6] NA
{code}

Which format is chosen and used is dependent on the underlying data, and 
critically the order that data is in. Given that we can't always guarantee the 
order of the data we are processing[1] we should not attempt to implement this 
behavior right now. 

Instead, we should have an error message if someone tries to specify 
{{tryFormats}} suggesting that they might use {{lubridate::as_date()}} if they 
want to specify multiple formats (and can accept that you don't get NAs for all 
formats other than the first that matches), or they should pick which format 
they want to use and use that.


[1] and even if we could, it would take some tricky expression writing to pick 
the right format
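
For illustration, an order-independent alternative can be sketched in base R: parse with each format separately and keep the first non-NA result per element. parse_any() is a hypothetical helper, not a proposed binding, and unlike tryFormats it does not return NA for elements that match a later format.

```r
dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02")
formats <- c("%Y-%m-%d", "%Y/%m/%d")

# Base behaviour: one format is picked from the first non-NA element
# and then applied to the whole vector
picked <- as.Date(dates_slash_first, tryFormats = formats)

# Hypothetical helper: try every format on every element still unparsed
parse_any <- function(x, formats) {
  out <- rep(as.Date(NA), length(x))
  for (fmt in formats) {
    todo <- is.na(out)
    out[todo] <- as.Date(x[todo], format = fmt)
  }
  out
}

parsed <- parse_any(dates_slash_first, formats)
print(picked)
print(parsed)
```

The result of parse_any() does not depend on which element comes first, which is the property the tryFormats behaviour lacks.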


was (Author: jonkeane):
This is alluded to in the PR comments, but taking a step back and thinking 
about the behavior:

{code}
dates_dash_first <- c("2022-01-01", "2022/02/02", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")
dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")

as.Date(dates_dash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-01-01" NA   NA   NA   "2022-01-01"
#> [6] "2022-01-01"

as.Date(dates_slash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-02-02" NA   "2022-02-02" "2022-02-02" NA  
#> [6] NA
{code}

Which format is chosen and used is dependent on the underlying data, and 
critically the order that data is in. Given that we can't always guarantee the 
order of the data we are processing[1] we should not attempt to implement this 
behavior right now. 

Instead, we should have an error message if someone tries to specify 
{{tryFormats}} suggesting that they might use {{lubridate:: as_date()}} if they 
want to specify multiple formats (and can accept that you don't get NAs for all 
formats other than the first that matches), or they should pick which format 
they want to use and use that.


[1] and even if we could, it would take some tricky expression writing to pick 
the right format

> [R] Update the as.Date() binding
> 
>
> Key: ARROW-15805
> URL: https://issues.apache.org/jira/browse/ARROW-15805
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>






[jira] [Comment Edited] (ARROW-15805) [R] Update the as.Date() binding

2022-06-29 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560679#comment-17560679
 ] 

Jonathan Keane edited comment on ARROW-15805 at 6/29/22 11:30 PM:
--

This is alluded to in the PR comments, but taking a step back and thinking 
about the behavior:

{code}
dates_dash_first <- c("2022-01-01", "2022/02/02", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")
dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")

as.Date(dates_dash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-01-01" NA   NA   NA   "2022-01-01"
#> [6] "2022-01-01"

as.Date(dates_slash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-02-02" NA   "2022-02-02" "2022-02-02" NA  
#> [6] NA
{code}

Which format is chosen and used is dependent on the underlying data, and 
critically the order that data is in. Given that we can't always guarantee the 
order of the data we are processing[1] we should not attempt to implement this 
behavior right now. 

Instead, we should have an error message if someone tries to specify 
{{tryFormats}} suggesting that they might use {{lubridate:: as_date()}} if they 
want to specify multiple formats (and can accept that you don't get NAs for all 
formats other than the first that matches), or they should pick which format 
they want to use and use that.


[1] and even if we could, it would take some tricky expression writing to pick 
the right format


was (Author: jonkeane):
This is alluded to in the PR comments, but taking a step back and thinking 
about the behavior:

{code}
dates_dash_first <- c("2022-01-01", "2022/02/02", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")
dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")

as.Date(dates_dash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-01-01" NA   NA   NA   "2022-01-01"
#> [6] "2022-01-01"

as.Date(dates_slash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-02-02" NA   "2022-02-02" "2022-02-02" NA  
#> [6] NA
{code}

Which format is chosen and used is dependent on the underlying data, and 
critically the order that data is in. Given that we can't always guarantee the 
order of the data we are processing[1] we should not attempt to implement this 
behavior right now. Instead, we should have an error message if someone tries 
to specify {{tryFormats}} suggesting that they might use {{lubridate:: 
as_date()}} if they want to specify multiple formats (and can accept that you 
don't get NAs for all formats other than the first that matches), or they 
should pick which format they want to use and use that.


[1] and even if we could, it would take some tricky expression writing to pick 
the right format

> [R] Update the as.Date() binding
> 
>
> Key: ARROW-15805
> URL: https://issues.apache.org/jira/browse/ARROW-15805
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>






[jira] [Commented] (ARROW-15805) [R] Update the as.Date() binding

2022-06-29 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560679#comment-17560679
 ] 

Jonathan Keane commented on ARROW-15805:


This is alluded to in the PR comments, but taking a step back and thinking 
about the behavior:

{code}
dates_dash_first <- c("2022-01-01", "2022/02/02", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")
dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02", "2022/02/02", 
"2022-01-01", "2022-01-01")

as.Date(dates_dash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-01-01" NA   NA   NA   "2022-01-01"
#> [6] "2022-01-01"

as.Date(dates_slash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d"))
#> [1] "2022-02-02" NA   "2022-02-02" "2022-02-02" NA  
#> [6] NA
{code}

Which format is chosen and used is dependent on the underlying data, and 
critically the order that data is in. Given that we can't always guarantee the 
order of the data we are processing[1] we should not attempt to implement this 
behavior right now. Instead, we should have an error message if someone tries 
to specify {{tryFormats}} suggesting that they might use {{lubridate:: 
as_date()}} if they want to specify multiple formats (and can accept that you 
don't get NAs for all formats other than the first that matches), or they 
should pick which format they want to use and use that.


[1] and even if we could, it would take some tricky expression writing to pick 
the right format

> [R] Update the as.Date() binding
> 
>
> Key: ARROW-15805
> URL: https://issues.apache.org/jira/browse/ARROW-15805
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>






[jira] [Updated] (ARROW-15158) [R] stringr functions

2022-06-29 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-15158:
---
Fix Version/s: (was: 9.0.0)

> [R] stringr functions
> -
>
> Key: ARROW-15158
> URL: https://issues.apache.org/jira/browse/ARROW-15158
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Alessandro Molina
>Priority: Major
>
> *Umbrella ticket for the initiative aimed at reaching support for the most 
> important stringr functions in the R bindings*





[jira] [Commented] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

2022-06-29 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560667#comment-17560667
 ] 

Jonathan Keane commented on ARROW-16700:


[~westonpace] not sure if this is related to ARROW-16904 or ARROW-16807 but 
another wrong-data ticket we should take a look at

> [C++] [R] [Datasets] aggregates on partitioning columns
> ---
>
> Key: ARROW-16700
> URL: https://issues.apache.org/jira/browse/ARROW-16700
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Jonathan Keane
>Priority: Blocker
> Fix For: 9.0.0, 8.0.1
>
>
> When summarizing a whole dataset (without group_by) with an aggregate, and 
> summarizing a partitioned column, arrow returns wrong data:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> df <- expand.grid(
>   some_nulls = c(0L, 1L, 2L),
>   year = 2010:2023,
>   month = 1:12,
>   day = 1:30
> )
> path <- tempfile()
> dir.create(path)
> write_dataset(df, path, partitioning = c("year", "month"))
> ds <- open_dataset(path)
> # with arrow the mins/maxes are off for partitioning columns
> ds %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month),
>     min_day = min(day), max_year = max(year), max_month = max(month),
>     max_day = max(day)) %>%
>   collect()
> #> # A tibble: 1 × 7
> #>       n min_year min_month min_day max_year max_month max_day
> #>   
> #> 1 15120     2023         1       1     2023        12      30
> # compared to what we get with dplyr
> df %>%
>   summarise(n = n(), min_year = min(year), min_month = min(month),
>     min_day = min(day), max_year = max(year), max_month = max(month),
>     max_day = max(day)) %>%
>   collect()
> #>       n min_year min_month min_day max_year max_month max_day
> #> 1 15120     2010         1       1     2023        12      30
> # even min alone is off:
> ds %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_year
> #>  
> #> 1 2016
>   
> # but non-partitioning columns are fine:
> ds %>%
>   summarise(min_day = min(day)) %>% 
>   collect()
> #> # A tibble: 1 × 1
> #>   min_day
> #> 
> #> 1   1
>   
>   
> # But with a group_by, this seems ok
> ds %>%
>   group_by(some_nulls) %>%
>   summarise(min_year = min(year)) %>% 
>   collect()
> #> # A tibble: 3 × 2
> #>   some_nulls min_year
> #>
> #> 1          0     2010
> #> 2          1     2010
> #> 3          2     2010
> {code}
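
The reference values that the dataset query should return can be checked without arrow or dplyr. A minimal base-R sketch follows, rebuilding the same grid as the reprex; the name expected is hypothetical.

```r
# Same grid as in the reprex: 3 * 14 * 12 * 30 rows
df <- expand.grid(
  some_nulls = c(0L, 1L, 2L),
  year = 2010:2023,
  month = 1:12,
  day = 1:30
)

# Expected aggregates, computed directly on the data frame
expected <- c(
  n = nrow(df),
  min_year = min(df$year), min_month = min(df$month), min_day = min(df$day),
  max_year = max(df$year), max_month = max(df$month), max_day = max(df$day)
)
print(expected)
```

Any min_year other than 2010 from the partitioned dataset, such as the 2023 and 2016 shown above, therefore indicates wrong data rather than a property of the input.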





[jira] [Updated] (ARROW-14071) [R] Try to arrow_eval user-defined functions

2022-06-29 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14071:
---
Fix Version/s: (was: 9.0.0)

> [R] Try to arrow_eval user-defined functions
> 
>
> Key: ARROW-14071
> URL: https://issues.apache.org/jira/browse/ARROW-14071
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> The first test passes but the second one fails, even though they're 
> equivalent. The user's function isn't being evaluated in the nse_funcs 
> environment.
> {code}
>   expect_dplyr_equal(
> input %>%
>   select(-fct) %>%
>   filter(nchar(padded_strings) < 10) %>%
>   collect(),
> tbl
>   )
>   isShortString <- function(x) nchar(x) < 10
>   expect_dplyr_equal(
> input %>%
>   select(-fct) %>%
>   filter(isShortString(padded_strings)) %>%
>   collect(),
> tbl
>   )
> {code}
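
The scoping issue can be illustrated in base R alone. In this sketch, bindings is a hypothetical stand-in for the nse_funcs environment and the fake nchar() just returns a string naming a translated call; none of this is arrow's actual implementation. A user-defined function looks up nchar() in its own enclosing environment, so it only sees the replacement once that environment is swapped.

```r
# Hypothetical stand-in for an environment of translated functions
bindings <- new.env(parent = baseenv())
bindings$nchar <- function(x) paste0("utf8_length(", x, ")")

# A user-defined helper, like isShortString above but without the comparison
count_chars <- function(x) nchar(x)

# As defined, the helper resolves nchar() to base::nchar
plain <- count_chars("abc")

# Re-homing the function makes its body resolve nchar() in `bindings`
rehomed <- count_chars
environment(rehomed) <- bindings
translated <- rehomed("abc")

print(plain)
print(translated)
```

Whether re-homing user functions like this is safe in general (for example, when they rely on other objects in their original environment) is left open here.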





[jira] [Updated] (ARROW-14209) [R] Allow multiple arguments to n_distinct()

2022-06-29 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14209:
---
Fix Version/s: (was: 9.0.0)

> [R] Allow multiple arguments to n_distinct()
> 
>
> Key: ARROW-14209
> URL: https://issues.apache.org/jira/browse/ARROW-14209
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function 
> in the dplyr verb {{summarise()}} but only with a single argument. Add 
> support for multiple arguments to {{n_distinct()}}. This should return the 
> number of unique combinations of values in the specified columns/expressions.
> See the comment about this here: 
> [https://github.com/apache/arrow/pull/11257#discussion_r720873549]
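
As a reference for the intended semantics, counting unique combinations across several vectors can be sketched in base R. count_distinct_combinations() is a hypothetical helper used only to illustrate the expected result; it is not the arrow binding.

```r
# Count unique row-wise combinations of the supplied vectors
count_distinct_combinations <- function(...) {
  cols <- list(...)
  names(cols) <- paste0("col", seq_along(cols))
  nrow(unique(as.data.frame(cols, stringsAsFactors = FALSE)))
}

x <- c(1, 1, 2, 2, 2)
y <- c("a", "a", "a", "b", "b")

# Distinct pairs here are (1, "a"), (2, "a") and (2, "b")
n_pairs <- count_distinct_combinations(x, y)
print(n_pairs)
```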





[jira] [Updated] (ARROW-14588) [R] Create an arrow-specific checklist for a CRAN release

2022-06-29 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14588:
---
Fix Version/s: (was: 9.0.0)

> [R] Create an arrow-specific checklist for a CRAN release  
> ---
>
> Key: ARROW-14588
> URL: https://issues.apache.org/jira/browse/ARROW-14588
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Minor
>
> This would adapt and implement the functionality of 
> {{usethis::use_release_issue()}} for {{arrow}}'s specific context.





[jira] [Updated] (ARROW-16692) [C++] Segfault in datasets

2022-06-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16692:
---
Priority: Blocker  (was: Major)

> [C++] Segfault in datasets
> --
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Weston Pace
>Priority: Blocker
> Attachments: backtrace.txt
>
>
> I'm still working to make a minimal reproducer for this, though I can 
> reliably reproduce it below (though that means needing to download a bunch of 
> data first...). I've cleaned out much of the unnecessary code (so this query 
> below is a bit silly, and not what I'm actually trying to do), but haven't 
> been able to make a constructed dataset that reproduces this.
> Working on some example with the new | more cleaned taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most of the time this ends in a segfault (though I have gotten it to work on 
> occasion). I've tried with smaller files | constructed datasets and haven't 
> been able to replicate it yet. One thing that might be important is:  
> {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or 
> so.
> I've attached a backtrace in case that's enough to see what's going on here.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16692) [C++] Segfault in datasets

2022-06-25 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16692:
---
Fix Version/s: 9.0.0

> [C++] Segfault in datasets
> --
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Weston Pace
>Priority: Blocker
> Fix For: 9.0.0
>
> Attachments: backtrace.txt
>
>
> I'm still working to make a minimal reproducer for this, though I can 
> reliably reproduce it below (though that means needing to download a bunch of 
> data first...). I've cleaned out much of the unnecessary code (so this query 
> below is a bit silly, and not what I'm actually trying to do), but haven't 
> been able to make a constructed dataset that reproduces this.
> Working on some example with the new | more cleaned taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most of the time this ends in a segfault (though I have gotten it to work on 
> occasion). I've tried with smaller files | constructed datasets and haven't 
> been able to replicate it yet. One thing that might be important is:  
> {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or 
> so.
> I've attached a backtrace in case that's enough to see what's going on here.





[jira] [Assigned] (ARROW-16319) [R] [Docs] Document the lubridate functions we support in {arrow}

2022-06-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-16319:
--

Assignee: Stephanie Hazlitt  (was: Dragoș Moldovan-Grünfeld)

> [R] [Docs] Document the lubridate functions we support in {arrow}
> -
>
> Key: ARROW-16319
> URL: https://issues.apache.org/jira/browse/ARROW-16319
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Stephanie Hazlitt
>Priority: Major
> Fix For: 9.0.0
>
>
> Add documentation around the {{lubridate}} functionality supported in 
> {{arrow}}. Could be made up of:
> * a blogpost 
> * a more in-depth piece of documentation





[jira] [Closed] (ARROW-16418) [R] Refactor the difftime() and as.difftime() bindings

2022-06-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane closed ARROW-16418.
--
Resolution: Won't Fix

> [R] Refactor the difftime() and as.difftime() bindings 
> --
>
> Key: ARROW-16418
> URL: https://issues.apache.org/jira/browse/ARROW-16418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> ARROW-16060 is solved and these 2 functions have high cyclomatic complexity





[jira] [Updated] (ARROW-16841) [R] Additional Lubridate Capabilities

2022-06-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16841:
---
Issue Type: Wish  (was: Bug)

> [R] Additional Lubridate Capabilities
> -
>
> Key: ARROW-16841
> URL: https://issues.apache.org/jira/browse/ARROW-16841
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, R
>Affects Versions: 9.0.0
>Reporter: Alessandro Molina
>Priority: Major
>
> Umbrella Ticket for the remaining lubridate work.
> This is functionality that we have scoped, but we have decided to wait to 
> implement until it is requested by someone proactively.





[jira] [Updated] (ARROW-16841) [R] Additional Lubridate Capabilities

2022-06-16 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-16841:
---
Description: 
Umbrella Ticket for the remaining lubridate work.

This is functionality that we have scoped, but we have decided to wait to 
implement until it is requested by someone proactively.

  was:
Umbrella Ticket for the remaining lubridate work.

Most of the work here will be triggered by explicit user requests


> [R] Additional Lubridate Capabilities
> -
>
> Key: ARROW-16841
> URL: https://issues.apache.org/jira/browse/ARROW-16841
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 9.0.0
>Reporter: Alessandro Molina
>Priority: Major
>
> Umbrella Ticket for the remaining lubridate work.
> This is functionality that we have scoped, but we have decided to wait to 
> implement until it is requested by someone proactively.





[jira] [Commented] (ARROW-16440) [R] Implement bindings for lubridate's parse_date_time2

2022-06-16 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555088#comment-17555088
 ] 

Jonathan Keane commented on ARROW-16440:


What's special about `parse_date_time2()` compared to `parse_date_time()`?

> [R] Implement bindings for lubridate's parse_date_time2
> ---
>
> Key: ARROW-16440
> URL: https://issues.apache.org/jira/browse/ARROW-16440
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> Split from ARROW-14848





[jira] [Commented] (ARROW-16653) [R] All formats are supported with the lubridate `parse_date_time` binding

2022-06-16 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555086#comment-17555086
 ] 

Jonathan Keane commented on ARROW-16653:


What formats do we currently not support?

> [R] All formats are supported with the lubridate `parse_date_time` binding
> --
>
> Key: ARROW-16653
> URL: https://issues.apache.org/jira/browse/ARROW-16653
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Affects Versions: 8.0.1
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Critical
> Fix For: 9.0.0
>
>
> Ensure:
> - all formats supported and tested





[jira] [Updated] (ARROW-13370) [R] More special handling for known errors in arrow_eval

2022-06-07 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-13370:
---
Description: We have special handling in arrow_eval that looks for the "not 
supported in Arrow" error, and when that's found it shows the error message 
rather than swallowing it in an "Expression not supported" message. But we have 
other error messages we raise in nse_funcs that are worth showing--bad input 
etc. Use a sentinel error message that we can also detect and subclass as 
"arrow-try-error" like the others, or (better) raise a classed exception (if 
that's supported in all versions of R we support).   (was: We have special 
handling in arrow_eval that looks for the "not supported in Arrow" error, and 
when that's found it shows the error message rather than swallowing it in an 
"Expression not supported" message. But we have other error messages we raise 
in nse_funcs that are worth showing--bad input etc. Use a sentinel error 
message that we can also detect and subclass as "arrow-try-error" like the 
others, or (better) raised a classed exception (if that's supported in all 
versions of R we support). )

> [R] More special handling for known errors in arrow_eval
> 
>
> Key: ARROW-13370
> URL: https://issues.apache.org/jira/browse/ARROW-13370
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
> Fix For: 9.0.0
>
>
> We have special handling in arrow_eval that looks for the "not supported in 
> Arrow" error, and when that's found it shows the error message rather than 
> swallowing it in an "Expression not supported" message. But we have other 
> error messages we raise in nse_funcs that are worth showing--bad input etc. 
> Use a sentinel error message that we can also detect and subclass as 
> "arrow-try-error" like the others, or (better) raise a classed exception (if 
> that's supported in all versions of R we support). 
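
The "(better)" option in the description, raising a classed exception, could look roughly like the base R sketch below. The names {{arrow_known_error()}} and {{eval_or_fallback()}}, and the class name itself, are hypothetical and are not the package's actual helpers; the sketch only shows a handler matching on the condition class rather than on the text of the message.

```r
# Hypothetical constructor: builds an error condition that carries an extra class.
arrow_known_error <- function(msg, call = NULL) {
  structure(
    class = c("arrow_known_error", "error", "condition"),
    list(message = msg, call = call)
  )
}

# Hypothetical evaluator: surfaces known errors verbatim and swallows
# every other error into a generic message.
eval_or_fallback <- function(expr) {
  tryCatch(
    expr,
    arrow_known_error = function(e) conditionMessage(e),
    error = function(e) "Expression not supported in Arrow"
  )
}

# A known, classed error keeps its own message ...
eval_or_fallback(stop(arrow_known_error("`x` must be a string")))
# ... while an unclassified error is replaced by the generic one.
eval_or_fallback(stop("some internal failure"))
```

This uses only base R condition objects, so it does not rely on any particular helper package being available.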





[jira] [Resolved] (ARROW-16415) [R] Update strptime bindings to use tz

2022-06-07 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-16415.

Resolution: Fixed

Issue resolved by pull request 13190
[https://github.com/apache/arrow/pull/13190]

> [R] Update strptime bindings to use tz 
> ---
>
> Key: ARROW-16415
> URL: https://issues.apache.org/jira/browse/ARROW-16415
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> {{strptime}} mentions it does not support {{tz}} - the timezone argument. 
> ARROW-12820 has been addressed and the binding definition needs updating.





[jira] [Resolved] (ARROW-16626) [C++] Name the C++ streaming execution engine

2022-06-01 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-16626.

Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13207
[https://github.com/apache/arrow/pull/13207]

> [C++] Name the C++ streaming execution engine
> -
>
> Key: ARROW-16626
> URL: https://issues.apache.org/jira/browse/ARROW-16626
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> There is some desire on the mailing list to name the C++ execution engine.  
> Although there isn't really any code impact from such a change we should 
> update our documentation to refer to the engine by name.





[jira] [Resolved] (ARROW-14632) [Python] Make write_dataset arguments keyword-only

2022-06-01 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-14632.

Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13289
[https://github.com/apache/arrow/pull/13289]

> [Python] Make write_dataset arguments keyword-only
> --
>
> Key: ARROW-14632
> URL: https://issues.apache.org/jira/browse/ARROW-14632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Austin Dickey
>Priority: Major
>  Labels: good-first-issue, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The 
> [write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811]
>  method has many arguments for customizing the behavior of the write.  Most 
> of them could be made keyword only.





[jira] [Created] (ARROW-16715) [R] Bump default parquet version?

2022-06-01 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-16715:
--

 Summary: [R] Bump default parquet version?
 Key: ARROW-16715
 URL: https://issues.apache.org/jira/browse/ARROW-16715
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Jonathan Keane


With ARROW-12203 the default parquet version was bumped for pyarrow to 2.4. At 
a minimum, we should add 2_4 as a valid version type to 
https://github.com/apache/arrow/blob/9b0afc352e8b3ecb3104d58e4bcf09def256b587/r/R/parquet.R#L239-L242
 and 
https://github.com/apache/arrow/blob/9b0afc352e8b3ecb3104d58e4bcf09def256b587/r/R/enums.R#L122-L126

But do we also want to follow pyarrow's lead and bump up to a newer version by 
default?





[jira] [Updated] (ARROW-14264) [R] Support inequality joins

2022-06-01 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14264:
---
Description: We'll need this -not-yet-merged- merged, but unreleased dplyr 
API to do it: https://github.com/tidyverse/dplyr/pull/5910  (was: We'll need 
this not-yet-merged dplyr API to do it: 
https://github.com/tidyverse/dplyr/pull/5910)

> [R] Support inequality joins
> 
>
> Key: ARROW-14264
> URL: https://issues.apache.org/jira/browse/ARROW-14264
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Priority: Major
>  Labels: query-engine
> Fix For: 9.0.0
>
>
> We'll need this -not-yet-merged- merged, but unreleased dplyr API to do it: 
> https://github.com/tidyverse/dplyr/pull/5910





[jira] [Commented] (ARROW-16605) [CI][R] Fix revdep Crossbow job

2022-06-01 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544966#comment-17544966
 ] 

Jonathan Keane commented on ARROW-16605:


For visibility: 
https://github.com/apache/arrow/blob/master/dev/tasks/r/github.linux.revdepcheck.yml
 is the template for running these revdep checks

> [CI][R] Fix revdep Crossbow job
> ---
>
> Key: ARROW-16605
> URL: https://issues.apache.org/jira/browse/ARROW-16605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Blocker
> Fix For: 9.0.0
>
>
> The revdep Crossbow job is currently not functioning correctly. This led to 
> changed behaviour affecting a revdep with the 8.0.0 release, requiring a 
> patch after initial submission.
> cc: [~jonkeane]





[jira] [Commented] (ARROW-16692) [C++] Segfault in datasets

2022-06-01 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544909#comment-17544909
 ] 

Jonathan Keane commented on ARROW-16692:


Thanks! Is there a rough timeline for when that work might be done?

I came across this prepping some demos for a talk next week — the queries do 
_sometimes_ complete (and tend to complete more reliably with the bigger 
queries). But I might need to change what queries I show if we don't think this 
will be done in the near term.

> [C++] Segfault in datasets
> --
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Assignee: Weston Pace
>Priority: Major
> Attachments: backtrace.txt
>
>
> I'm still working to make a minimal reproducer for this, though I can 
> reliably reproduce it below (though that means needing to download a bunch of 
> data first...). I've cleaned out much of the unnecessary code (so this query 
> below is a bit silly, and not what I'm actually trying to do), but haven't 
> been able to make a constructed dataset that reproduces this.
> While working on some examples with the new, more cleaned taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most of the time this ends in a segfault (though I have gotten it to work on 
> occasion). I've tried with smaller files and constructed datasets and haven't 
> been able to replicate it yet. One thing that might be important is:  
> {{pickup_location_id}} is all NAs/nulls in the first 8 years of the data or 
> so.
> I've attached a backtrace in case that's enough to see what's going on here.





[jira] [Commented] (ARROW-16701) [R] Can we execute SQL in a dplyr pipeline?

2022-05-31 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544618#comment-17544618
 ] 

Jonathan Keane commented on ARROW-16701:


Yes, sorry that conflation was unintentional. We can do this today with duckdb, 
so we should try that — but in principle we should be able to use it with any 
backend that accepts sql + could speak arrow

> [R] Can we execute SQL in a dplyr pipeline?
> ---
>
> Key: ARROW-16701
> URL: https://issues.apache.org/jira/browse/ARROW-16701
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> Now that we have {{to_duckdb()}} and {{to_arrow()}} is it possible to wrap 
> those and allow someone to insert arbitrary SQL into a dplyr query?
> Something like:
> {code:r}
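> sql <- function(data, sql) {
>   # Register the Arrow data with DuckDB and get a lazy dbplyr table back
>   tbl <- to_duckdb(data)
>   # Run the SQL on that table's DuckDB connection, asking for Arrow results
>   res <- DBI::dbSendQuery(dbplyr::remote_con(tbl), sql, arrow = TRUE)
>   # Hand the result back to Arrow as a record batch reader
>   duckdb::duckdb_fetch_record_batch(res)
> }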
> ds %>%
>   filter(year > 2020) %>% 
>   sql("SELECT tip_amount, fare_amount, total_amount FROM ") %>%
>   compute()
> {code}
> This won't work totally, but is vaguely what we're looking for.
> One part that we need to think about is how to deal with the {{from}} clause, 
> a few possibilities:
> * ibis does this by making you "name" the table before doing sql so you can 
> FROM explicitly
> * though maybe you could get away with FROM . like it is a magrittr thing and 
> sub that
> * empty string, and we add it in based on the lazy_tbl object
> Possibly related prior art: 
> https://dbplyr.tidyverse.org/reference/build_sql.html (though the name isn't 
> perfect IMO, and I think this is more geared towards package developers than 
> end users?)





[jira] [Created] (ARROW-16701) [R] Can we execute SQL in a dplyr pipeline?

2022-05-31 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-16701:
--

 Summary: [R] Can we execute SQL in a dplyr pipeline?
 Key: ARROW-16701
 URL: https://issues.apache.org/jira/browse/ARROW-16701
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Jonathan Keane


Now that we have {{to_duckdb()}} and {{to_arrow()}} is it possible to wrap 
those and allow someone to insert arbitrary SQL into a dplyr query?

Something like:

{code:r}
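sql <- function(data, sql) {
  # Register the Arrow data with DuckDB and get a lazy dbplyr table back
  tbl <- to_duckdb(data)
  # Run the SQL on that table's DuckDB connection, asking for Arrow results
  res <- DBI::dbSendQuery(dbplyr::remote_con(tbl), sql, arrow = TRUE)
  # Hand the result back to Arrow as a record batch reader
  duckdb::duckdb_fetch_record_batch(res)
}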

ds %>%
  filter(year > 2020) %>% 
  sql("SELECT tip_amount, fare_amount, total_amount FROM ") %>%
  compute()
{code}

This won't work totally, but is vaguely what we're looking for.

One part that we need to think about is how to deal with the {{from}} clause, a 
few possibilities:

* ibis does this by making you "name" the table before doing sql so you can FROM 
explicitly
* though maybe you could get away with FROM . like it is a magrittr thing and 
sub that
* empty string, and we add it in based on the lazy_tbl object

Possibly related prior art: 
https://dbplyr.tidyverse.org/reference/build_sql.html (though the name isn't 
perfect IMO, and I think this is more geared towards package developers than 
end users?)
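
For the second possibility above (a magrittr-style {{FROM .}}), the substitution step could be sketched in base R as below. {{fill_from()}} and the table name {{arrow_001}} are hypothetical, and the sketch neither quotes the identifier nor protects a {{FROM .}} that happens to sit inside a string literal; it only shows the placeholder being swapped for a registered table name before the SQL is sent.

```r
# Hypothetical helper: replace a standalone "FROM ." placeholder with a table name.
fill_from <- function(sql, table_name) {
  gsub(
    # "FROM ." followed by whitespace or the end of the string
    "\\bFROM\\s+\\.(?=\\s|$)",
    paste("FROM", table_name),
    sql,
    perl = TRUE,
    ignore.case = TRUE
  )
}

# The placeholder is swapped for the (hypothetical) registered table name.
fill_from("SELECT tip_amount, fare_amount, total_amount FROM .", "arrow_001")
```

A real implementation would presumably take the table name from the lazy table created by {{to_duckdb()}} and quote it through the connection.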





[jira] [Created] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns

2022-05-31 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-16700:
--

 Summary: [C++] [R] [Datasets] aggregates on partitioning columns
 Key: ARROW-16700
 URL: https://issues.apache.org/jira/browse/ARROW-16700
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, R
Reporter: Jonathan Keane


When summarizing a whole dataset (without group_by) with an aggregate, and 
summarizing a partitioned column, arrow returns wrong data:

{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- expand.grid(
  some_nulls = c(0L, 1L, 2L),
  year = 2010:2023,
  month = 1:12,
  day = 1:30
)

path <- tempfile()
dir.create(path)
write_dataset(df, path, partitioning = c("year", "month"))

ds <- open_dataset(path)

# with arrow the mins/maxes are off for partitioning columns
ds %>%
  summarise(
    n = n(),
    min_year = min(year), min_month = min(month), min_day = min(day),
    max_year = max(year), max_month = max(month), max_day = max(day)
  ) %>%
  collect()
#> # A tibble: 1 × 7
#>       n min_year min_month min_day max_year max_month max_day
#>
#> 1 15120     2023         1       1     2023        12      30

# compared to what we get with dplyr
df %>%
  summarise(
    n = n(),
    min_year = min(year), min_month = min(month), min_day = min(day),
    max_year = max(year), max_month = max(month), max_day = max(day)
  ) %>%
  collect()
#>       n min_year min_month min_day max_year max_month max_day
#> 1 15120     2010         1       1     2023        12      30

# even min alone is off:
ds %>%
  summarise(min_year = min(year)) %>%
  collect()
#> # A tibble: 1 × 1
#>   min_year
#>
#> 1     2016

# but non-partitioning columns are fine:
ds %>%
  summarise(min_day = min(day)) %>%
  collect()
#> # A tibble: 1 × 1
#>   min_day
#>
#> 1       1

# But with a group_by, this seems ok
ds %>%
  group_by(some_nulls) %>%
  summarise(min_year = min(year)) %>%
  collect()
#> # A tibble: 3 × 2
#>   some_nulls min_year
#>
#> 1          0     2010
#> 2          1     2010
#> 3          2     2010
{code}





[jira] [Commented] (ARROW-16692) [C++] Segfault in datasets

2022-05-31 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544577#comment-17544577
 ] 

Jonathan Keane commented on ARROW-16692:


bq. One thing that might be important is: pickup_location_id is all NAs/nulls 
in the first 8 years of the data or so.

This is almost certainly a red herring now that I come back to it; the following 
*also* segfaults without referencing that specific column.

{code}
ds %>%
  filter(pickup_datetime > as.Date("2017-01-01")) %>%
  summarise(n = n()) %>% collect()
{code}

> [C++] Segfault in datasets
> --
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Priority: Major
> Attachments: backtrace.txt
>
>
> I'm still working to make a minimal reproducer for this, though I can 
> reliably reproduce it below (though that means needing to download a bunch of 
> data first...). I've cleaned out much of the unnecessary code (so this query 
> below is a bit silly, and not what I'm actually trying to do), but haven't 
> been able to make a constructed dataset that reproduces this.
> While working on some examples with the new, more cleaned taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most of the time this ends in a segfault (though I have gotten it to work on 
> occasion). I've tried with smaller files and constructed datasets and haven't 
> been able to replicate it yet. One thing that might be important is:  
> {{pickup_location_id}} is all NAs/nulls in the first 8 years of the data or 
> so.
> I've attached a backtrace in case that's enough to see what's going on here.





[jira] [Commented] (ARROW-16695) [R][C++] Extension types are not supported in joins

2022-05-31 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544517#comment-17544517
 ] 

Jonathan Keane commented on ARROW-16695:


Thanks for the reprex! 

cc [~westonpace]

> [R][C++] Extension types are not supported in joins
> ---
>
> Key: ARROW-16695
> URL: https://issues.apache.org/jira/browse/ARROW-16695
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Dewey Dunnington
>Priority: Major
>
> It looks like extension types are not supported in joins (even if the 
> underlying type is supported)! Reported by [~jonkeane] while making a demo 
> for Arrow + Query engine + geoarrow (R package), which uses extension types 
> liberally:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> rb_non_ext <- record_batch(
>   a = 1:5, 
>   b = letters[1:5]
> )
> rb_ext_storage <- record_batch(
>   b = letters[1:5],
>   c = Array$create(list(as.raw(1:5)), type = binary())
> )
> rb_ext <- record_batch(
>   b = letters[1:5],
>   c = vctrs_extension_array(rb_ext_storage$c$as_vector())
> )
> rb_non_ext %>% 
>   left_join(rb_ext_storage) %>% 
>   collect()
> #> # A tibble: 5 × 3
> #>   a b  c
> #> 
> #> 1 1 a 01, 02, 03, 04, 05
> #> 2 2 b 01, 02, 03, 04, 05
> #> 3 3 c 01, 02, 03, 04, 05
> #> 4 4 d 01, 02, 03, 04, 05
> #> 5 5 e 01, 02, 03, 04, 05
> rb_non_ext %>% 
>   left_join(rb_ext) %>% 
>   collect()
> #> Error in `collect()`:
> #> ! Invalid: Data type  is not supported in join non-key 
> field
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:121
>   ValidateSchemas(join_type, left_schema, left_keys, left_output, 
> right_schema, right_keys, right_output, left_field_name_suffix, 
> right_field_name_suffix)
> #> 
> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:499
>   schema_mgr->Init( join_options.join_type, left_schema, 
> join_options.left_keys, join_options.left_output, right_schema, 
> join_options.right_keys, join_options.right_output, join_options.filter, 
> join_options.output_suffix_for_left, join_options.output_suffix_for_right)
> {code}





[jira] [Assigned] (ARROW-14632) [Python] Make write_dataset arguments keyword-only

2022-05-31 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-14632:
--

Assignee: Austin Dickey  (was: Weston Pace)

> [Python] Make write_dataset arguments keyword-only
> --
>
> Key: ARROW-14632
> URL: https://issues.apache.org/jira/browse/ARROW-14632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Austin Dickey
>Priority: Major
>  Labels: good-first-issue
>
> The 
> [write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811]
>  method has many arguments for customizing the behavior of the write.  Most 
> of them could be made keyword only.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16692) [C++] Segfault in datasets

2022-05-31 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544407#comment-17544407
 ] 

Jonathan Keane commented on ARROW-16692:


cc [~westonpace]

> [C++] Segfault in datasets
> --
>
> Key: ARROW-16692
> URL: https://issues.apache.org/jira/browse/ARROW-16692
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jonathan Keane
>Priority: Major
> Attachments: backtrace.txt
>
>
> I'm still working to make a minimal reproducer for this, though I can 
> reliably reproduce it below (though that means needing to download a bunch of 
> data first...). I've cleaned out much of the unnecessary code (so this query 
> below is a bit silly, and not what I'm actually trying to do), but haven't 
> been able to make a constructed dataset that reproduces this.
> While working on some examples with the new, more cleaned taxi dataset at 
> {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:
> {code}
> library(arrow)
> library(dplyr)
> ds <- open_dataset("path/to/new_taxi/")
> ds %>%
>   filter(!is.na(pickup_location_id)) %>%
>   summarise(n = n()) %>% collect()
> {code}
> Most of the time this ends in a segfault (though I have gotten it to work on 
> occasion). I've tried with smaller files and constructed datasets and haven't 
> been able to replicate it yet. One thing that might be important is:  
> {{pickup_location_id}} is all NAs/nulls in the first 8 years of the data or 
> so.
> I've attached a backtrace in case that's enough to see what's going on here.





[jira] [Created] (ARROW-16692) [C++] Segfault in datasets

2022-05-31 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-16692:
--

 Summary: [C++] Segfault in datasets
 Key: ARROW-16692
 URL: https://issues.apache.org/jira/browse/ARROW-16692
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Jonathan Keane
 Attachments: backtrace.txt

I'm still working to make a minimal reproducer for this, though I can reliably 
reproduce it below (though that means needing to download a bunch of data 
first...). I've cleaned out much of the unnecessary code (so this query below 
is a bit silly, and not what I'm actually trying to do), but haven't been able 
to make a constructed dataset that reproduces this.

While working on some examples with the new, more cleaned taxi dataset at 
{{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault:

{code}
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/new_taxi/")

ds %>%
  filter(!is.na(pickup_location_id)) %>%
  summarise(n = n()) %>% collect()
{code}

Most of the time this ends in a segfault (though I have gotten it to work on 
occasion). I've tried with smaller files and constructed datasets and haven't 
been able to replicate it yet. One thing that might be important is:  
{{pickup_location_id}} is all NAs/nulls in the first 8 years of the data or 
so.

I've attached a backtrace in case that's enough to see what's going on here.








[jira] [Updated] (ARROW-14632) [Python] Make write_dataset arguments keyword-only

2022-05-30 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-14632:
---
Description: The 
[write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811]
 method has many arguments for customizing the behavior of the write.  Most of 
them could be made keyword only.  (was: The write_dataset method has many 
arguments for customizing the behavior of the write.  Most of them could be 
made keyword only.)

> [Python] Make write_dataset arguments keyword-only
> --
>
> Key: ARROW-14632
> URL: https://issues.apache.org/jira/browse/ARROW-14632
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: good-first-issue
>
> The 
> [write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811]
>  method has many arguments for customizing the behavior of the write.  Most 
> of them could be made keyword only.





[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-05-18 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539107#comment-17539107
 ] 

Jonathan Keane edited comment on ARROW-15678 at 5/18/22 10:03 PM:
--

[~kou] Do you think you might be able to take a look at this?

The comment at 
https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good 
explanation of what's going on and following that there are a few possible 
fixes (though none of them were fully implemented or decided).


was (Author: jonkeane):
@kou Do you think you might be able to take a look at this?

The comment at 
https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good 
explanation of what's going on and following that there are a few possible 
fixes (though none of them were fully implemented or decided).

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-05-18 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539107#comment-17539107
 ] 

Jonathan Keane commented on ARROW-15678:


@kou Do you think you might be able to take a look at this?

The comment at 
https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good 
explanation of what's going on and following that there are a few possible 
fixes (though none of them were fully implemented or decided).

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>





