[jira] [Resolved] (ARROW-16605) [CI][R] Fix revdep docker job
[ https://issues.apache.org/jira/browse/ARROW-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-16605. Resolution: Fixed Issue resolved by pull request 13483 [https://github.com/apache/arrow/pull/13483] > [CI][R] Fix revdep docker job > - > > Key: ARROW-16605 > URL: https://issues.apache.org/jira/browse/ARROW-16605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jacob Wujciak-Jens >Assignee: Jacob Wujciak-Jens >Priority: Critical > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 4h 20m > Remaining Estimate: 0h > > The revdep Crossbow job is currently not functioning correctly. This led to > changed behaviour affecting a revdep with the 8.0.0 release, requiring a > patch after initial submission. > cc: [~jonkeane] > Due to the time and performance constraints on GHA it does not make sense to > have a crossbow job for this. A dockeR job to be able to cleanly run this > locally does make sense though, so I renamed the ticket. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17613507#comment-17613507 ] Jonathan Keane commented on ARROW-15678: I thought that [~kou] was going to take a look at this (or at least the underlying multiple SIMD instruction ordering issue that causes the failures...) The only update I have is that I continue to run into the segfault in CI for downstream projects I'm working on, so it continues to be an issue for pre-built libarrow on machines like github's macos runners. > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 13.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17574) [R] [Docs] [CI] Investigate if we can auto generate Rd files in CI
Jonathan Keane created ARROW-17574: -- Summary: [R] [Docs] [CI] Investigate if we can auto generate Rd files in CI Key: ARROW-17574 URL: https://issues.apache.org/jira/browse/ARROW-17574 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, Documentation, R Reporter: Jonathan Keane Or alternatively, warn + recommend running autotune if they are out of date (e.g. any change) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17533) [R] Implement asof join
Jonathan Keane created ARROW-17533: -- Summary: [R] Implement asof join Key: ARROW-17533 URL: https://issues.apache.org/jira/browse/ARROW-17533 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Jonathan Keane With ARROW-16083 we have asof joins, could we expose this in R? Docs for the node: https://arrow.apache.org/docs/cpp/api/compute.html?highlight=asof#_CPPv4N5arrow7compute19AsofJoinNodeOptionsE A possible syntax might be (there does not appear to be a syntax in dplyr for this already): {code} asof_join(table1, table2, by = "field", tolerance = 1) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17533) [R] Implement asof join
[ https://issues.apache.org/jira/browse/ARROW-17533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17585027#comment-17585027 ] Jonathan Keane commented on ARROW-17533: A bit more prior art: folks asking for this on Stack Overflow: https://stackoverflow.com/questions/58538114/is-there-an-r-equivalent-of-pythons-pandas-merge-asof > [R] Implement asof join > --- > > Key: ARROW-17533 > URL: https://issues.apache.org/jira/browse/ARROW-17533 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Jonathan Keane >Priority: Major > > With ARROW-16083 we have asof joins, could we expose this in R? > Docs for the node: > https://arrow.apache.org/docs/cpp/api/compute.html?highlight=asof#_CPPv4N5arrow7compute19AsofJoinNodeOptionsE > A possible syntax might be (there does not appear to be a syntax in dplyr for > this already): > {code} > asof_join(table1, table2, by = "field", tolerance = 1) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
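For reference, a minimal sketch of the pandas merge_asof behaviour linked above, which the proposed asof_join(table1, table2, by = "field", tolerance = 1) syntax would roughly mirror; the frames and values here are invented for illustration:

```python
import pandas as pd

# Toy tables; both must be sorted on the join key for merge_asof.
left = pd.DataFrame({"field": [2, 6, 11], "y": [10, 20, 30]})
right = pd.DataFrame({"field": [1, 5, 10], "x": ["a", "b", "c"]})

# Each left row is matched to the nearest earlier right row whose key
# is within the tolerance (here: at most 1 apart).
result = pd.merge_asof(left, right, on="field", tolerance=1)
print(result)
```

An asof join is an inexact join on an ordered key, so unlike dplyr's equality joins it needs the extra tolerance (and, in the C++ node, direction) knobs.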
[jira] [Created] (ARROW-17526) [R] [Docs] Improve (or really actually document) our Python bridge documentation
Jonathan Keane created ARROW-17526: -- Summary: [R] [Docs] Improve (or really actually document) our Python bridge documentation Key: ARROW-17526 URL: https://issues.apache.org/jira/browse/ARROW-17526 Project: Apache Arrow Issue Type: Improvement Components: Documentation, R Reporter: Jonathan Keane https://twitter.com/jonkeane/status/1560016227824721920?s=20&t=g2MhdOOJbh0q0MpxPI4R_Q When I wrote this, I wished there was a one-pager I could show for passing a table or recordbatchreader back and forth. https://arrow.apache.org/cookbook/r/using-pyarrow-from-r.html#introduction-4 also has some details, but is more focused on scalars and arrays than tables. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17458) [C++] CSV Writer: Unsupported cast from decimal to utf8
[ https://issues.apache.org/jira/browse/ARROW-17458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584853#comment-17584853 ] Jonathan Keane commented on ARROW-17458: We ran into this issue today as well, working on conversions for benchmarking datasets > [C++] CSV Writer: Unsupported cast from decimal to utf8 > > > Key: ARROW-17458 > URL: https://issues.apache.org/jira/browse/ARROW-17458 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 6.0.1 >Reporter: Pavel Kovalenko >Priority: Major > Labels: csv, decimal, unsupported > > The following code snippet fails with an Unsupported cast error if a table > has a decimal column.
> {code:cpp}
> std::shared_ptr<arrow::Table> table;
> ARROW_CHECK_OK(reader->ReadAll(&table));
> std::shared_ptr<arrow::io::FileOutputStream> output =
>     arrow::io::FileOutputStream::Open(csvPath).ValueOrDie();
> auto writeOptions = arrow::csv::WriteOptions::Defaults();
> writeOptions.include_header = false;
> auto status = arrow::csv::WriteCSV(*table, writeOptions, output.get());
> if (!status.ok()) {
>   SETHROW_ERROR(std::runtime_error, "Couldn't write table csv: {}",
>                 status.message());
> }
> {code}
> {code:cpp}
> Unsupported cast from decimal128(7, 2) to utf8 using function cast_string
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17084) [R] Install the package before linting
[ https://issues.apache.org/jira/browse/ARROW-17084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-17084. Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 13620 [https://github.com/apache/arrow/pull/13620] > [R] Install the package before linting > -- > > Key: ARROW-17084 > URL: https://issues.apache.org/jira/browse/ARROW-17084 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > The R package should be installed before linting. See > [https://github.com/r-lib/lintr/issues/352#issuecomment-587004345,] and > https://github.com/r-lib/lintr/issues/406#issuecomment-534601141. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-12590) [C++][R] Update copies of Homebrew files to reflect recent updates
[ https://issues.apache.org/jira/browse/ARROW-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573158#comment-17573158 ] Jonathan Keane commented on ARROW-12590: Yeah, that should work until the homebrew maintainers decide to pull it out > [C++][R] Update copies of Homebrew files to reflect recent updates > -- > > Key: ARROW-12590 > URL: https://issues.apache.org/jira/browse/ARROW-12590 > Project: Apache Arrow > Issue Type: Task > Components: C++, R >Reporter: Ian Cook >Assignee: Jacob Wujciak-Jens >Priority: Critical > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > Our copies of the Homebrew formulae at > [https://github.com/apache/arrow/tree/master/dev/tasks/homebrew-formulae] > have drifted out of sync with what's currently in > [https://github.com/Homebrew/homebrew-core/tree/master/Formula] and > [https://github.com/autobrew/homebrew-core/blob/master/Formula|https://github.com/autobrew/homebrew-core/blob/master/Formula/]. > Get them back in sync and consider automating some method of checking that > they are in sync, e.g. by failing the {{homebrew-cpp}} and > {{homebrew-r-autobrew}} nightly tests if our copies don't match what's in > the Homebrew and autobrew repos (but only if there were changes there that > weren't made in our repo, and not the inverse). > Update the instructions at > > [https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingHomebrewpackages] > as needed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17166) [R] [CI] force_tests() cannot return TRUE
[ https://issues.apache.org/jira/browse/ARROW-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-17166. Resolution: Fixed Issue resolved by pull request 13680 [https://github.com/apache/arrow/pull/13680] > [R] [CI] force_tests() cannot return TRUE > - > > Key: ARROW-17166 > URL: https://issues.apache.org/jira/browse/ARROW-17166 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Rok Mihevc >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: CI, pull-request-available > Fix For: 10.0.0 > > Time Spent: 6.5h > Remaining Estimate: 0h > > Update: the OOM has cleared up so the scope of this PR changed. > Old title: [R] [CI] Exclude large memory tests from the force-tests job on CI > = > We have noticed R CI job (AMD64 Ubuntu 20.04 R 4.2 Force-Tests true) failing > on master: > [1|https://github.com/apache/arrow/runs/7424773120?check_suite_focus=true#step:7:5547], > > [2|https://github.com/apache/arrow/runs/7431821192?check_suite_focus=true#step:7:5804], > > [3|https://github.com/apache/arrow/runs/7445803518?check_suite_focus=true#step:7:16305] > with: > {code:java} > Start test: array uses local timezone for POSIXct without timezone > test-Array.R:269:3 [success] > System has not been booted with systemd as init system (PID 1). Can't operate. > Failed to create bus connection: Host is down > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-12590) [C++][R] Update copies of Homebrew files to reflect recent updates
[ https://issues.apache.org/jira/browse/ARROW-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17573003#comment-17573003 ] Jonathan Keane commented on ARROW-12590: Agreed with syncing (and the original intent of this ticket was basically to find a way to detect if and when this happens in order to alert us about it). It is ok that the autobrew and the homebrew formulae are different (since in the newest versions of the autobrew setup, if we are on a modern enough system we _just use brew_). If I'm remembering correctly, https://github.com/apache/arrow/pull/12157/files#diff-4b112dbca2ece7c78e15eb8aff3218e21dd6f4b1fab7cfc9182830488f68ca58R22-R30 was basically the operative code that fixes this. If I were you, I would take the commits on my branch there and create a new branch and push forward with that since it will let you run it in CI. Though the R tests will probably segfault with the simd issue in ARROW-15678. Maybe that's fine (since it's "only" a limited number of computers that this happens on — just so happens the GH runners are one of those, apparently) or maybe we'll need to actually resolve ARROW-15678? > [C++][R] Update copies of Homebrew files to reflect recent updates > -- > > Key: ARROW-12590 > URL: https://issues.apache.org/jira/browse/ARROW-12590 > Project: Apache Arrow > Issue Type: Task > Components: C++, R >Reporter: Ian Cook >Priority: Critical > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > Our copies of the Homebrew formulae at > [https://github.com/apache/arrow/tree/master/dev/tasks/homebrew-formulae] > have drifted out of sync with what's currently in > [https://github.com/Homebrew/homebrew-core/tree/master/Formula] and > [https://github.com/autobrew/homebrew-core/blob/master/Formula|https://github.com/autobrew/homebrew-core/blob/master/Formula/]. > Get them back in sync and consider automating some method of checking that > they are in sync, e.g. 
by failing the {{homebrew-cpp}} and > {{homebrew-r-autobrew}} nightly tests if our copies don't match what's in > the Homebrew and autobrew repos (but only if there were changes there that > weren't made in our repo, and not the inverse). > Update the instructions at > > [https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingHomebrewpackages] > as needed. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571125#comment-17571125 ] Jonathan Keane commented on ARROW-15678: I have no updates beyond what's discussed above: there are a few approaches, none of them ideal, we need someone to champion this (or risk the homebrew maintainers turning off optimizations on us) > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 13.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570163#comment-17570163 ] Jonathan Keane edited comment on ARROW-15678 at 7/22/22 7:28 PM: - Homebrew only accepted that as a temporary workaround and has threatened to turn off optimizations if we don't resolve this. They haven't followed through yet, though. https://github.com/Homebrew/homebrew-core/issues/94724#issuecomment-1063031123 was (Author: jonkeane): Homebrew only accepted that as a temporary workaround and has threatened to turn off optimizations if we don't resolve this. They haven't followed through yet, though. > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570163#comment-17570163 ] Jonathan Keane commented on ARROW-15678: Homebrew only accepted that as a temporary workaround and has threatened to turn off optimizations if we don't resolve this. They haven't followed through yet, though. > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-17115) [C++] HashJoin fails if it encounters a batch with more than 32Ki rows
[ https://issues.apache.org/jira/browse/ARROW-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569175#comment-17569175 ] Jonathan Keane commented on ARROW-17115: A reprex that causes this from R (which is effectively the TPC-H 12 query that segfaults):
{code:r}
library(arrow)
library(dplyr)
library(arrowbench)

ensure_source("tpch", scale_factor = 10)

open_dataset("data/lineitem_10.parquet") %>%
  filter(
    l_shipmode %in% c("MAIL", "SHIP"),
    l_commitdate < l_receiptdate,
    l_shipdate < l_commitdate,
    l_receiptdate >= as.Date("1994-01-01"),
    l_receiptdate < as.Date("1995-01-01")
  ) %>%
  inner_join(
    open_dataset("data/orders_10.parquet"),
    by = c("l_orderkey" = "o_orderkey")
  ) %>%
  group_by(l_shipmode) %>%
  summarise(
    high_line_count = sum(
      if_else(
        (o_orderpriority == "1-URGENT") | (o_orderpriority == "2-HIGH"),
        1L,
        0L
      )
    ),
    low_line_count = sum(
      if_else(
        (o_orderpriority != "1-URGENT") & (o_orderpriority != "2-HIGH"),
        1L,
        0L
      )
    )
  ) %>%
  ungroup() %>%
  arrange(l_shipmode) %>%
  collect()
{code}
> [C++] HashJoin fails if it encounters a batch with more than 32Ki rows > -- > > Key: ARROW-17115 > URL: https://issues.apache.org/jira/browse/ARROW-17115 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Blocker > Fix For: 9.0.0 > > > The new swiss join assumes that batches are being broken according to the > morsel/batch model and it assumes those batches have, at most, 32Ki rows > (signed 16-bit indices are used in various places). > However, we are not currently slicing all of our inputs to batches this > small. This is causing conbench to fail and would likely be a problem with > any large inputs. > We should fix this by slicing batches in the engine to the appropriate > maximum size. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17115) [C++] HashJoin fails if it encounters a batch with more than 32Ki rows
[ https://issues.apache.org/jira/browse/ARROW-17115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-17115: --- Fix Version/s: 9.0.0 > [C++] HashJoin fails if it encounters a batch with more than 32Ki rows > -- > > Key: ARROW-17115 > URL: https://issues.apache.org/jira/browse/ARROW-17115 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Blocker > Fix For: 9.0.0 > > > The new swiss join assumes that batches are being broken according to the > morsel/batch model and it assumes those batches have, at most, 32Ki rows > (signed 16-bit indices are used in various places). > However, we are not currently slicing all of our inputs to batches this > small. This is causing conbench to fail and would likely be a problem with > any large inputs. > We should fix this by slicing batches in the engine to the appropriate > maximum size. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-13062) [Dev] Add a way for people to add information to our saved crossbow data
[ https://issues.apache.org/jira/browse/ARROW-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566851#comment-17566851 ] Jonathan Keane commented on ARROW-13062: IMHO manual would be fine. And honestly, probably will be needed at some level since autocreating jiras will result in a bunch of jiras that overlap (in cases where multiple failures result from one change), or need to be duplicated manually when one job failure is the result of multiple changes or failures. Anyway, it's not a high priority now, we can wait until someone bumps it again — but wanted to make sure you got credit if it was already done as part of the work you pushed to get all of the other great stuff out > [Dev] Add a way for people to add information to our saved crossbow data > > > Key: ARROW-13062 > URL: https://issues.apache.org/jira/browse/ARROW-13062 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Jonathan Keane >Priority: Major > > We should have a simple + lightweight way to annotate specific builds with > information like "won't be fixed until dask has a new release" or "this is > supposed to be fixed in ARROW-XXX". > We should find an easy, lightweight way to add this kind of information. > Only relevant in its previous parent: -We *should not* require, ask, or allow > people to add this information to the JSON that is saved as part of > ARROW-13509. That JSON should be kept pristine and not have manual edits. > Instead, we should have a plain-text look up file that matches notes to > specific builds (maybe to specific dates?)- -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds
[ https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-8043. - Resolution: Fixed Done with the work on https://crossbow.voltrondata.com > [Developer] Provide better visibility for failed nightly builds > --- > > Key: ARROW-8043 > URL: https://issues.apache.org/jira/browse/ARROW-8043 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Sam Albers >Priority: Major > Labels: pull-request-available > > Emails reporting nightly failures are unsatisfactory in two ways: there is a > large click/scroll distance between the links presented in that email and the > actual error message. Worse, once one is there it's not clear what JIRAs have > been made or which of them are in progress. > One solution would be to replace or augment the [NIGHTLY] email with a page > ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows > how many nights it has failed, a shortcut to the actual error line in CI's > logs, and useful views of JIRA. We could accomplish this with: > - dedicated JIRA tags; one for each nightly job so a JIRA can be easily > associated with specific jobs > - A static HTML dashboard with client side JavaScript to > ** scrape JIRA and update the page dynamically as soon as JIRAs are opened > ** show any relationships between failing jobs > ** highlight jobs that have not been addressed, along with a counter of how > many nights it has gone unaddressed > - provide automatic and expedited creation of correctly labelled JIRAs, so > that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be > fairly straightforward: > > [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (ARROW-8043) [Developer] Provide better visibility for failed nightly builds
[ https://issues.apache.org/jira/browse/ARROW-8043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reopened ARROW-8043: --- Assignee: Sam Albers > [Developer] Provide better visibility for failed nightly builds > --- > > Key: ARROW-8043 > URL: https://issues.apache.org/jira/browse/ARROW-8043 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, Developer Tools >Affects Versions: 0.16.0 >Reporter: Ben Kietzman >Assignee: Sam Albers >Priority: Major > Labels: pull-request-available > > Emails reporting nightly failures are unsatisfactory in two ways: there is a > large click/scroll distance between the links presented in that email and the > actual error message. Worse, once one is there it's not clear what JIRAs have > been made or which of them are in progress. > One solution would be to replace or augment the [NIGHTLY] email with a page > ([https://ursa-labs.github.org/crossbow] would be my favorite) which shows > how many nights it has failed, a shortcut to the actual error line in CI's > logs, and useful views of JIRA. We could accomplish this with: > - dedicated JIRA tags; one for each nightly job so a JIRA can be easily > associated with specific jobs > - A static HTML dashboard with client side JavaScript to > ** scrape JIRA and update the page dynamically as soon as JIRAs are opened > ** show any relationships between failing jobs > ** highlight jobs that have not been addressed, along with a counter of how > many nights it has gone unaddressed > - provide automatic and expedited creation of correctly labelled JIRAs, so > that viewers can quickly organize/take ownership of a failed nightly job. 
> JIRA supports reading form fields from URL parameters, so this would be > fairly straightforward: > > [https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12319525=1=[NIGHTLY:gandiva-jar-osx,gandiva-jar-trusty]=12340948=12347769=12334626] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-13936) Add a column to show us the number of time that this job is failing
[ https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-13936. -- Resolution: Fixed Done with the work on https://crossbow.voltrondata.com > Add a column to show us the number of time that this job is failing > --- > > Key: ARROW-13936 > URL: https://issues.apache.org/jira/browse/ARROW-13936 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: David Dali Susanibar Arce >Assignee: Sam Albers >Priority: Minor > > Try to use an external repository to collect information about failing job names -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (ARROW-13936) Add a column to show us the number of time that this job is failing
[ https://issues.apache.org/jira/browse/ARROW-13936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reopened ARROW-13936: Assignee: Sam Albers > Add a column to show us the number of time that this job is failing > --- > > Key: ARROW-13936 > URL: https://issues.apache.org/jira/browse/ARROW-13936 > Project: Apache Arrow > Issue Type: Sub-task >Reporter: David Dali Susanibar Arce >Assignee: Sam Albers >Priority: Minor > > Try to use an external repository to collect information about failing job names -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-12845) [R] [C++] S3 connections for different providers
[ https://issues.apache.org/jira/browse/ARROW-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12845. -- Resolution: Won't Fix > [R] [C++] S3 connections for different providers > > > Key: ARROW-12845 > URL: https://issues.apache.org/jira/browse/ARROW-12845 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, R >Affects Versions: 4.0.0 >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Minor > Labels: pull-request-available > Time Spent: 3h > Remaining Estimate: 0h > > Hi > As a part of my thesis, I want to create an S3 bucket on DigitalOcean (what > PUC uses), and while I can write parquet files on my laptop and upload to > DigitalOcean Spaces (i.e. an "S3 + Google Drive") from the browser or by > using rclone, I could work in editing the existing code that allows to > connects to Amazon S3, and provide a function that connects to > DigitalOcean/Linode/IBM/etc. > This could be done in a way that amazon URL is the default and the user could > specify something like `new_s3_fun(..., provider = "Tencent")` and connect > to an S3 that is not Amazon. > Also, this involves the need to write more S3 documentation. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-12862) [CI] Gather + display reliability of crossbow builds
[ https://issues.apache.org/jira/browse/ARROW-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12862. -- Resolution: Fixed Done with the work on https://crossbow.voltrondata.com > [CI] Gather + display reliability of crossbow builds > > > Key: ARROW-12862 > URL: https://issues.apache.org/jira/browse/ARROW-12862 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Jonathan Keane >Assignee: Sam Albers >Priority: Major > > From Wes's suggestion on the mailing list: > Having a website > dashboard showing build health over time along with a ~ weekly e-mail > to dev@ indicating currently broken builds and the reliability of each > build over the trailing 7 or 30 days would be useful. Knowing that a > particular build is only passing 20% of the time would help steer our > efforts. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-12862) [CI] Gather + display reliability of crossbow builds
[ https://issues.apache.org/jira/browse/ARROW-12862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-12862: -- Assignee: Sam Albers > [CI] Gather + display reliability of crossbow builds > > > Key: ARROW-12862 > URL: https://issues.apache.org/jira/browse/ARROW-12862 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Jonathan Keane >Assignee: Sam Albers >Priority: Major > > From Wes's suggestion on the mailing list: > Having a website > dashboard showing build health over time along with a ~ weekly e-mail > to dev@ indicating currently broken builds and the reliability of each > build over the trailing 7 or 30 days would be useful. Knowing that a > particular build is only passing 20% of the time would help steer our > efforts. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-14378) [R] Make custom extension classes for (some) cols with row-level metadata
[ https://issues.apache.org/jira/browse/ARROW-14378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-14378. -- Resolution: Won't Fix We ended up supporting geo columns using the geoarrow package + extension types > [R] Make custom extension classes for (some) cols with row-level metadata > - > > Key: ARROW-14378 > URL: https://issues.apache.org/jira/browse/ARROW-14378 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Priority: Major > > The major usecase for this is SF columns which have attributes/metadata for > each element of a column. We originally stored these in our standard > column-level metadata, but that was very fragile and took forever, so we > disabled it ARROW-13189 > This will likely take some steps to accomplish. I've sketched out some in the > subtasks here (though if we have a different approach, we could do that > directly) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-12182) [R] [Dev] new helpers and suggests for testing
[ https://issues.apache.org/jira/browse/ARROW-12182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-12182. -- Resolution: Won't Fix > [R] [Dev] new helpers and suggests for testing > -- > > Key: ARROW-12182 > URL: https://issues.apache.org/jira/browse/ARROW-12182 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools, R >Affects Versions: 3.0.0 >Reporter: Mauricio 'Pachá' Vargas Sepúlveda >Priority: Minor > > _Related to https://issues.apache.org/jira/browse/ARROW-11705_ > While working on the related tickets I've found the next blockers: > 1. Does it make sense to create expect_dplyr_named()? (i.e. to mimic > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L56-L59) > 2. Does it make sense to create expect_dplyr_identical() (i.e. to mimic > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L61-L69 > and > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L83-L91) > 3. Should we need to add glue to Suggests? (i.e. replicate > https://github.com/tidyverse/dplyr/blob/master/tests/testthat/test-mutate.r#L95-L100) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-14624) [R] [Docs] Remove our tabbing hack now that it's supported by pkgdown
[ https://issues.apache.org/jira/browse/ARROW-14624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-14624. -- Resolution: Fixed This was fixed as part of the work to update the version switcher in the docs. > [R] [Docs] Remove our tabbing hack now that it's supported by pkgdown > - > > Key: ARROW-14624 > URL: https://issues.apache.org/jira/browse/ARROW-14624 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Priority: Major > > tabsets are now supported natively in pkgdown (with bootstrap 5) > https://github.com/r-lib/pkgdown/pull/1694 > So we can pull out the hack we have to make that work for our dev docs > vignette -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (ARROW-16076) [R] Bindings for the new TPC-H generator
[ https://issues.apache.org/jira/browse/ARROW-16076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-16076. -- Resolution: Won't Fix Since the TPC-H generator does not generate compliant data, there's not a big need to expose this in R. > [R] Bindings for the new TPC-H generator > > > Key: ARROW-16076 > URL: https://issues.apache.org/jira/browse/ARROW-16076 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 3h 20m > Remaining Estimate: 0h > > Now that https://github.com/apache/arrow/pull/12537 is merged, we should > implement the R changes needed to make that useable from R. > We should basically do the opposite of > https://github.com/apache/arrow/pull/12537/commits/4b16296b4ef8cd3b3d440e8b7f8af32a89a16788 > But also add in the fixes from weston: > https://github.com/westonpace/arrow/commit/7c4c0e0b4e208918eb195701fab5d631b8c9517a -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-13062) [Dev] Add a way for people to add information to our saved crossbow data
[ https://issues.apache.org/jira/browse/ARROW-13062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566383#comment-17566383 ] Jonathan Keane commented on ARROW-13062: [~boshek] Did you already add this ability? I know it's a slightly different set of tickets than the ones we actually worked, but we should either close it as duplicate, done, or won't fix (and feel free to take credit for it if you did it elsewhere as part of a larger ticket!) > [Dev] Add a way for people to add information to our saved crossbow data > > > Key: ARROW-13062 > URL: https://issues.apache.org/jira/browse/ARROW-13062 > Project: Apache Arrow > Issue Type: Sub-task > Components: Developer Tools >Reporter: Jonathan Keane >Priority: Major > > We should have a simple + lightweight way to annotate specific builds with > information like "won't be fixed until dask has a new release" or "this is > supposed to be fixed in ARROW-XXX". > Only relevant in its previous parent: -We *should not* require, ask, or allow > people to add this information to the JSON that is saved as part of > ARROW-13509. That JSON should be kept pristine and not have manual edits. > Instead, we should have a plain-text look-up file that matches notes to > specific builds (maybe to specific dates?)- -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (ARROW-17059) [C++] Archery cpp-micro arrow-compute-expression-benchmark fails with Invalid: Value lengths differed from ExecBatch length
[ https://issues.apache.org/jira/browse/ARROW-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-17059. Fix Version/s: 10.0.0 Resolution: Fixed Issue resolved by pull request 13584 [https://github.com/apache/arrow/pull/13584] > [C++] Archery cpp-micro arrow-compute-expression-benchmark fails with > Invalid: Value lengths differed from ExecBatch length > --- > > Key: ARROW-17059 > URL: https://issues.apache.org/jira/browse/ARROW-17059 > Project: Apache Arrow > Issue Type: Bug > Components: Archery, C++ >Reporter: Elena Henderson >Assignee: Sasha Krassovsky >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > [https://github.com/apache/arrow/pull/13179] causes > {{arrow-compute-expression-benchmark}} to fail with: > {code:java} > -- Arrow Fatal Error -- > Invalid: Value lengths differed from ExecBatch length {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564018#comment-17564018 ] Jonathan Keane commented on ARROW-15678: Last I checked, the homebrew maintainers have said that they will disable all optimization for arrow if we don't get this sorted on our own. So not required if we're ok with that (though we should engage with them on this) > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-12059) [R] Accept format-specific scan options in collect()
[ https://issues.apache.org/jira/browse/ARROW-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12059: --- Fix Version/s: (was: 9.0.0) > [R] Accept format-specific scan options in collect() > > > Key: ARROW-12059 > URL: https://issues.apache.org/jira/browse/ARROW-12059 > Project: Apache Arrow > Issue Type: Task > Components: R >Affects Versions: 4.0.0 >Reporter: David Li >Priority: Major > Labels: dataset, datasets > > ARROW-9749 and ARROW-8631 added format/scan-specific options. In R, the most > natural place to accept these is in collect(), but this isn't yet done. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15283) [Python][R] Remove deprecated placeholders for UseAsync
[ https://issues.apache.org/jira/browse/ARROW-15283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561113#comment-17561113 ] Jonathan Keane commented on ARROW-15283: Is this something you think you'll be able to get to before 9.0.0 [~westonpace]? Happy to push it out if not > [Python][R] Remove deprecated placeholders for UseAsync > --- > > Key: ARROW-15283 > URL: https://issues.apache.org/jira/browse/ARROW-15283 > Project: Apache Arrow > Issue Type: Improvement > Components: Python, R >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Fix For: 9.0.0 > > > In the 7.0.0 release we are marking the UseAsync parameters / functions as > deprecated. In a future release we should remove these entirely. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-12137) [R] New/improved vignette on dplyr features
[ https://issues.apache.org/jira/browse/ARROW-12137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12137: --- Fix Version/s: (was: 9.0.0) > [R] New/improved vignette on dplyr features > --- > > Key: ARROW-12137 > URL: https://issues.apache.org/jira/browse/ARROW-12137 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-12213) [R] copy_files doesn't make it easy to copy a single file
[ https://issues.apache.org/jira/browse/ARROW-12213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12213: --- Fix Version/s: (was: 9.0.0) > [R] copy_files doesn't make it easy to copy a single file > - > > Key: ARROW-12213 > URL: https://issues.apache.org/jira/browse/ARROW-12213 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, R >Reporter: Neal Richardson >Priority: Major > > copy_files (i.e. fs::CopyFiles) makes it trivial to recursively copy a > directory/bucket to or from S3, but I'm having a hard time downloading a > single file. > cc [~bkietz] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-13165) [R] Add bindings for ProjectOptions
[ https://issues.apache.org/jira/browse/ARROW-13165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13165: --- Fix Version/s: (was: 9.0.0) > [R] Add bindings for ProjectOptions > --- > > Key: ARROW-13165 > URL: https://issues.apache.org/jira/browse/ARROW-13165 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > > The {{project}} kernel creates a column of struct (equivalent to a column of > named lists in R). Add to {{make_compute_options}} in {{compute.cpp}} so we > can pass {{ProjectOptions}} to the {{project}} kernel. > One practical application of the {{project}} kernel is to create a binding > for the stringr function {{str_locate}} which returns a column of named lists. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-12711) [R] Bindings for paste(collapse), str_c(collapse), and str_flatten()
[ https://issues.apache.org/jira/browse/ARROW-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-12711: --- Fix Version/s: (was: 9.0.0) > [R] Bindings for paste(collapse), str_c(collapse), and str_flatten() > > > Key: ARROW-12711 > URL: https://issues.apache.org/jira/browse/ARROW-12711 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Ian Cook >Priority: Major > Labels: query-engine > > These are the aggregating versions of string concatenation: they combine > values from a set of rows into a single value. > The bindings for {{paste()}} and {{str_c()}} might be tricky to implement > because when these functions are called with the {{collapse}} argument > unset, they do _not_ aggregate. > In {{summarise()}} we need to be able to use scalar concatenation within > aggregate concatenation, like this: > {code:java} > starwars %>% > filter(!is.na(hair_color) & !is.na(eye_color)) %>% > group_by(homeworld) %>% > summarise(hair_and_eyes = paste0(paste0(hair_color, "-haired and ", > eye_color, "-eyed"), collapse = ", ")){code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
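The distinction the ticket describes can be seen in plain base R (this is ordinary R behaviour, not arrow code; the column values are made up for illustration): without collapse, paste0() is vectorized and returns one value per row; with collapse set, it aggregates the whole column into a single string.

```r
# Base-R illustration only; no arrow involved.
hair <- c("brown", "blond")
eyes <- c("blue", "brown")

# Without `collapse`: vectorized, one result per row (scalar concatenation)
paste0(hair, "-haired and ", eyes, "-eyed")
#> [1] "brown-haired and blue-eyed"  "blond-haired and brown-eyed"

# With `collapse`: aggregates all rows into a single string
paste0(paste0(hair, "-haired and ", eyes, "-eyed"), collapse = ", ")
#> [1] "brown-haired and blue-eyed, blond-haired and brown-eyed"
```

This is why the binding cannot dispatch on the function name alone: whether paste0() is a scalar kernel or an aggregation depends on whether collapse is supplied.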
[jira] [Updated] (ARROW-13766) [R] Add Arrow methods slice_min(), slice_max()
[ https://issues.apache.org/jira/browse/ARROW-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13766: --- Fix Version/s: (was: 9.0.0) > [R] Add Arrow methods slice_min(), slice_max() > -- > > Key: ARROW-13766 > URL: https://issues.apache.org/jira/browse/ARROW-13766 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > Labels: query-engine > > Implement [{{slice_min()}} and > {{slice_max()}}|https://dplyr.tidyverse.org/reference/slice.html] methods for > {{ArrowTabular}}, {{Dataset}}, and {{arrow_dplyr_query}} objects. > These dplyr functions supersede the older dplyr function > [{{top_n()}}|https://dplyr.tidyverse.org/reference/top_n.html] which I > suppose we should also consider implementing a method for. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-13767) [R] Add Arrow methods slice(), slice_head(), slice_tail()
[ https://issues.apache.org/jira/browse/ARROW-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13767: --- Fix Version/s: (was: 9.0.0) > [R] Add Arrow methods slice(), slice_head(), slice_tail() > - > > Key: ARROW-13767 > URL: https://issues.apache.org/jira/browse/ARROW-13767 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > Labels: query-engine > > Implement [{{slice()}}, {{slice_head()}}, and > {{slice_tail()}}|https://dplyr.tidyverse.org/reference/slice.html] methods > for {{ArrowTabular}}, {{Dataset}}, and {{arrow_dplyr_query}} objects. I > believe this should be relatively straightforward, using {{Take()}} to return > only the specified rows. We already have a {{head()}} method which I believe > we can reuse for {{slice_head()}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
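A minimal sketch of the indexing involved, in plain R (the helper names are hypothetical; the idea is that a binding would feed indices like these to Take()):

```r
# Hypothetical helpers sketching which row indices slice_head()/slice_tail()
# would select; plain R vectors, no arrow required.
slice_head_idx <- function(n_rows, n) seq_len(min(n, n_rows))
slice_tail_idx <- function(n_rows, n) seq(to = n_rows, length.out = min(n, n_rows))

slice_head_idx(10, 3)
#> [1] 1 2 3
slice_tail_idx(10, 3)
#> [1]  8  9 10
```

The min() guards against asking for more rows than exist, matching dplyr's behaviour of silently returning fewer rows in that case.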
[jira] [Updated] (ARROW-13531) [R] Read CSV with comma as decimal mark
[ https://issues.apache.org/jira/browse/ARROW-13531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13531: --- Fix Version/s: (was: 9.0.0) > [R] Read CSV with comma as decimal mark > --- > > Key: ARROW-13531 > URL: https://issues.apache.org/jira/browse/ARROW-13531 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > > Follow-up to ARROW-13421. There is a new ConvertOption, so that part is easy. > There may be some subtleties in emulating the readr way of supporting this > since it uses a broader {{locale()}} object, but maybe we just add > {{read_csv2_arrow}} (matching {{readr::read_csv2}} and {{base::read.csv2}}) > and that's enough. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14028) [R] Cast of NaN to integer should return NA_integer_
[ https://issues.apache.org/jira/browse/ARROW-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14028: --- Fix Version/s: (was: 9.0.0) > [R] Cast of NaN to integer should return NA_integer_ > > > Key: ARROW-14028 > URL: https://issues.apache.org/jira/browse/ARROW-14028 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > > Casting double {{NaN}} to integer returns a sentinel value: > {code:r} > call_function("cast", Scalar$create(NaN), options = list(to_type = int32(), > allow_float_truncate = TRUE)) > #> Scalar > #> -2147483648 > call_function("cast", Scalar$create(NaN), options = list(to_type = int64(), > allow_float_truncate = TRUE)) > #> Scalar > #> -9223372036854775808{code} > It would be nice if this would instead return {{NA_integer_}}. > N.B. for some reason this doesn't reproduce in dplyr unless you round-trip it > back to double: > {code:r} > > Table$create(x = NaN) %>% transmute(as.double(as.integer(x))) %>% pull(1) > #> [1] -2147483648{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14067) [R] Add error handling to C++ compute functions listed via list_compute_functions() which don't have bindings in R or options not supplied by user
[ https://issues.apache.org/jira/browse/ARROW-14067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14067: --- Fix Version/s: (was: 9.0.0) > [R] Add error handling to C++ compute functions listed via > list_compute_functions() which don't have bindings in R or options not > supplied by user > -- > > Key: ARROW-14067 > URL: https://issues.apache.org/jira/browse/ARROW-14067 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > Currently we have the function {{list_compute_functions()}} which lists all > available Arrow compute functions. However, it can return functions which > have been implemented in C++ but don't yet have bindings in R. > A recent ticket added bindings for (nearly) all of the compute functions > whose options were not yet bound at that point, but more could appear. > Currently the error message shown is: > {code:java} > library(dplyr) > library(arrow) # 5.0.0.2 > Table$create(tibble::tibble(Species = c("versicolor", "virginica", > "setosa"))) %>% > mutate(x = arrow_utf8_trim(Species, options = list(characters = "a"))) > ## Error: Invalid: Attempted to initialize KernelState from null > FunctionOptions > {code} > We should catch this and instead raise a more user-friendly error. > Also, if a valid function is called without options supplied, we get a > {{could not find function}} error: > {code:java} > library(dplyr) > library(arrow) # dev > Table$create(tibble::tibble(Species = c("versicolor", "virginica", > "setosa"))) %>% > mutate(x = arrow_utf8_trim(Species)) > ## Error in arrow_utf8_trim(Species) : could not find function > "arrow_utf8_trim" > {code} > It'd be great to instead inform the user that the correct options haven't > been supplied. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14045) [R] Support for .keep_all = TRUE with distinct()
[ https://issues.apache.org/jira/browse/ARROW-14045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14045: --- Fix Version/s: (was: 9.0.0) > [R] Support for .keep_all = TRUE with distinct() > - > > Key: ARROW-14045 > URL: https://issues.apache.org/jira/browse/ARROW-14045 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14218) [R] More improvements to developer docs
[ https://issues.apache.org/jira/browse/ARROW-14218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14218: --- Fix Version/s: (was: 9.0.0) > [R] More improvements to developer docs > --- > > Key: ARROW-14218 > URL: https://issues.apache.org/jira/browse/ARROW-14218 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > * Add link to the main contributions guidelines > * Add a "how do I know that my dev setup is OK?" check to the end of the > R-only step > * The R-only instructions just have instructions on how to install libarrow > but we should add a little about how to connect it up with the repo clone; > the instructions mention it in the Linux version but could be more explicit > * When a user clones the repo via RStudio it creates an .rproj file in the > root directory - we should add instructions to clone the arrow fork from > the command line so we can use the project's .rproj file > * We should consider removing the instruction for installing the released > version of libarrow (or demoting it to second place and explaining why we'd > use it) as typically a dev would want the dev version > * Mac - you can't just install openssl, you need to add it to your path as > LibreSSL is the default - we should add instructions about this > * Better demarcation between "special instructions for Linux" and the next > section - maybe use tabs again? > * Clarification of the difference between the build directory and the > installation directory -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14288) [R] Implement nrow on some collapsed queries
[ https://issues.apache.org/jira/browse/ARROW-14288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14288: --- Fix Version/s: (was: 9.0.0) > [R] Implement nrow on some collapsed queries > > > Key: ARROW-14288 > URL: https://issues.apache.org/jira/browse/ARROW-14288 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Major > > collapse() doesn't always mean we can't determine the number of rows. We can > try to solve some cases: > * head/tail: compute number of rows, take the smaller of that and the > head/tail number > * if filter == TRUE, take the number of rows of .data (which may contain a > query) -- This message was sent by Atlassian Jira (v8.20.10#820010)
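The head/tail case described above amounts to taking the minimum of the two counts; a one-line sketch in plain R (hypothetical helper name, not arrow code):

```r
# If the underlying query's row count is known, nrow() after head(n)
# is just the smaller of the two values.
nrow_after_head <- function(n_rows_underlying, head_n) {
  min(n_rows_underlying, head_n)
}

nrow_after_head(100, 6)
#> [1] 6
nrow_after_head(3, 6)
#> [1] 3
```

The same min() applies to tail(); the filter == TRUE case instead falls back to nrow() of .data.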
[jira] [Updated] (ARROW-14847) [R] Implement bindings for lubridate date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14847: --- Fix Version/s: (was: 9.0.0) > [R] Implement bindings for lubridate date/time parsing functions > > > Key: ARROW-14847 > URL: https://issues.apache.org/jira/browse/ARROW-14847 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Nicola Crane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15016) [R] show_query() for an arrow_dplyr_query
[ https://issues.apache.org/jira/browse/ARROW-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15016: --- Fix Version/s: (was: 9.0.0) > [R] show_query() for an arrow_dplyr_query > - > > Key: ARROW-15016 > URL: https://issues.apache.org/jira/browse/ARROW-15016 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > Now that we can print a query plan (ARROW-13785) we should wire this up in R > so we can see what execution plans are being put together for various queries > (like the TPC-H queries) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-15016) [R] show_query() for an arrow_dplyr_query
[ https://issues.apache.org/jira/browse/ARROW-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15016: -- Assignee: Dragoș Moldovan-Grünfeld > [R] show_query() for an arrow_dplyr_query > - > > Key: ARROW-15016 > URL: https://issues.apache.org/jira/browse/ARROW-15016 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > Now that we can print a query plan (ARROW-13785) we should wire this up in R > so we can see what execution plans are being put together for various queries > (like the TPC-H queries) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15470) [R] Allows user to specify string to be used for missing data when writing CSV dataset
[ https://issues.apache.org/jira/browse/ARROW-15470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15470: --- Fix Version/s: (was: 9.0.0) > [R] Allows user to specify string to be used for missing data when writing > CSV dataset > -- > > Key: ARROW-15470 > URL: https://issues.apache.org/jira/browse/ARROW-15470 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Nicola Crane >Priority: Major > > The ability to select the string to be used for missing data was implemented > for the CSV Writer in ARROW-14903 and, as David Li points out below, is > available, so I think we just need to hook it up on the R side. > This requires the values passed in as the "na" argument to be instead passed > through to "null_strings", similarly to what has been done with "skip" and > "skip_rows" in ARROW-15743. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15803) [R] Empty JSON object parsed as corrupt data frame
[ https://issues.apache.org/jira/browse/ARROW-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15803: --- Fix Version/s: (was: 9.0.0) > [R] Empty JSON object parsed as corrupt data frame > -- > > Key: ARROW-15803 > URL: https://issues.apache.org/jira/browse/ARROW-15803 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.0 >Reporter: Will Jones >Priority: Major > > If you have a JSON object field that is always empty, it seems not to be > handled well, whether or not a schema is provided that tells Arrow what > should be in that object. > {code:r} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > json_val <- '{ > "rows": [ > {"empty": {} }, > {"empty": {} }, > {"empty": {} } > ] > }' > # Remove newlines > json_val <- gsub("\n", "", json_val) > json_file <- tempfile() > writeLines(json_val, json_file) > schema <- schema(field("rows", list_of(struct(empty = struct(y = int32()))))) > raw <- read_json_arrow(json_file, schema=schema) > raw$rows$empty > #> Error: Corrupt x: no names > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15719) [R] Simplify code for handling summarise() with no aggregations
[ https://issues.apache.org/jira/browse/ARROW-15719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15719: --- Fix Version/s: (was: 9.0.0) > [R] Simplify code for handling summarise() with no aggregations > --- > > Key: ARROW-15719 > URL: https://issues.apache.org/jira/browse/ARROW-15719 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Priority: Major > > Check whether ARROW-15609 enables us to remove code from > {{{}[query-engine.R|https://github.com/apache/arrow/blob/master/r/R/query-engine.R]{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15822) [C++] Cast duration to string (thus CSV writing) not supported
[ https://issues.apache.org/jira/browse/ARROW-15822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15822: --- Fix Version/s: (was: 9.0.0) > [C++] Cast duration to string (thus CSV writing) not supported > -- > > Key: ARROW-15822 > URL: https://issues.apache.org/jira/browse/ARROW-15822 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 7.0.0, 7.0.1 >Reporter: Carl Boettiger >Priority: Critical > > Edit (Dragos Moldovan-Grünfeld): The issue I opened (ARROW-15833) is > basically a duplicate of this. It's fundamentally a C++ issue that happened > to surface in the R CSV writer. I hope you don't mind, I modified the > components to C++ > === > Consider this reprex: > {code:java} > arrow::write_csv_arrow(data.frame(time = as.difftime(1, units="secs")), > "test.csv"){code} > This errors with: > Error: NotImplemented: Unsupported cast from duration[s] to utf8 using > function cast_string > > Note that readr::write_csv() has no trouble with this (which renders the data > as "1" without a unit). Arguably the readr rendering is lossy, but then we > usually assume units are provided in other metadata anyway. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15879) [R] passing a schema causes open_dataset to fail on hive-partitioned csv files
[ https://issues.apache.org/jira/browse/ARROW-15879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15879: --- Fix Version/s: (was: 9.0.0) > [R] passing a schema causes open_dataset to fail on hive-partitioned csv files > - > > Key: ARROW-15879 > URL: https://issues.apache.org/jira/browse/ARROW-15879 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.0, 7.0.1 >Reporter: Carl Boettiger >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Consider this reprex: > > Create a dataset with hive partitions in csv format with write_dataset() (so > cool!): > > {code:java} > library(arrow) > library(dplyr) > path <- fs::dir_create("tmp") > mtcars %>% group_by(gear) %>% write_dataset(path, format="csv") > ## works fine, even with 'collect()' > ds <- open_dataset(path, format="csv") > ## but pass a schema, and things fail > df <- open_dataset(path, format="csv", schema = ds$schema, skip_rows=1) > df %>% collect() > {code} > In the first call to open_dataset, we don't pass a schema and things work as > expected. > However, csv files often need a schema to be read in correctly, particularly > with partitioned data where it is easy to 'guess' the wrong type. Passing > the schema, though, confuses open_dataset, because the grouping column > (partition column) isn't found on the individual files even though it is > mentioned in the schema! > Nor can we just omit the grouping column from the schema, since then it is > effectively lost from the data. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16155) [R] lubridate functions for 9.0.0
[ https://issues.apache.org/jira/browse/ARROW-16155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16155: --- Fix Version/s: (was: 9.0.0) > [R] lubridate functions for 9.0.0 > - > > Key: ARROW-16155 > URL: https://issues.apache.org/jira/browse/ARROW-16155 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: Alessandro Molina >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > Umbrella ticket for lubridate functions in 9.0.0 > Future work that is not going to happen in v9 is recorded under > https://issues.apache.org/jira/browse/ARROW-16841 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16190) [CI][R] Implement CI on Apple M1 for R
[ https://issues.apache.org/jira/browse/ARROW-16190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16190: --- Fix Version/s: (was: 9.0.0) > [CI][R] Implement CI on Apple M1 for R > -- > > Key: ARROW-16190 > URL: https://issues.apache.org/jira/browse/ARROW-16190 > Project: Apache Arrow > Issue Type: Sub-task > Components: Continuous Integration, R >Reporter: Jacob Wujciak-Jens >Priority: Critical > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16239) [R] $columns on Table and RB should be named
[ https://issues.apache.org/jira/browse/ARROW-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16239: --- Fix Version/s: (was: 9.0.0) > [R] $columns on Table and RB should be named > > > Key: ARROW-16239 > URL: https://issues.apache.org/jira/browse/ARROW-16239 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Will Jones >Priority: Minor > Labels: good-first-issue > > Currently, the {{$columns}} method returns columns as a list without names. It > would be nice if they were named instead, similar to {{as.list}} on a > {{data.frame}}. > {code:R} > > library(arrow) > > names(record_batch(x = 1, y = 'a')$columns) > NULL > > names(arrow_table(x = 1, y = 'a')$columns) > NULL > > as.list(data.frame(x = 1, y = 'a')) > $x > [1] 1 > $y > [1] "a" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
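Until $columns gains names, a possible user-side workaround is to attach them manually. This is a sketch, assuming the arrow package's names() method on tabular objects (which returns the column names):

```r
library(arrow)

tbl <- arrow_table(x = 1, y = "a")
# Attach the table's column names to the otherwise-unnamed column list
cols <- setNames(tbl$columns, names(tbl))
names(cols)
#> [1] "x" "y"
```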
[jira] [Updated] (ARROW-16768) [R] Factor levels cannot contain NA
[ https://issues.apache.org/jira/browse/ARROW-16768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16768: --- Fix Version/s: (was: 9.0.0) > [R] Factor levels cannot contain NA > --- > > Key: ARROW-16768 > URL: https://issues.apache.org/jira/browse/ARROW-16768 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 7.0.0 >Reporter: Kieran Martin >Priority: Minor > > If you try to write a data frame with a factor with a missing value to > parquet, you get the error: "Error: Invalid: Cannot insert dictionary values > containing nulls". > This seems likely due to how the metadata for factors is currently captured > in parquet files. Reprex follows: > > library(arrow) > bad_data <- data.frame(A = factor(1, 2, NA)) > write_parquet(bad_data, tempfile()) > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16777) [R] printing data in Table/RecordBatch print method
[ https://issues.apache.org/jira/browse/ARROW-16777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16777: --- Fix Version/s: (was: 9.0.0) > [R] printing data in Table/RecordBatch print method > --- > > Key: ARROW-16777 > URL: https://issues.apache.org/jira/browse/ARROW-16777 > Project: Apache Arrow > Issue Type: Improvement > Components: Python, R >Reporter: Thomas Mock >Priority: Minor > > Related to ARROW-16776 but after a brief discussion with Neal Richardson, he > requested that I split the improvement request into separate issues. > When working with Arrow datasets/tables, I often find myself wanting to > interactively print or "see" the results of a query or the first few rows of > the data without having to fully collect into memory. > It would be ideal to lazily print some data with Table/RecordBatch print > methods; however, currently the print methods return the schema without data, > i.e.: > ``` r > library(dplyr) > library(arrow) > mtcars %>% arrow::write_parquet("mtcars.parquet") > car_ds <- arrow::open_dataset("mtcars.parquet") > car_ds > #> FileSystemDataset with 1 Parquet file > #> mpg: double > #> cyl: double > #> disp: double > #> hp: double > #> drat: double > #> wt: double > #> qsec: double > #> vs: double > #> am: double > #> gear: double > #> carb: double > #> > #> See $metadata for additional Schema metadata > car_ds %>% > compute() > #> Table > #> 32 rows x 11 columns > #> $mpg > #> $cyl > #> $disp > #> $hp > #> $drat > #> $wt > #> $qsec > #> $vs > #> $am > #> $gear > #> $carb > #> > #> See $metadata for additional Schema metadata > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16828) [R][Packaging] Turn on all compression libs for binaries
[ https://issues.apache.org/jira/browse/ARROW-16828?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-16828: -- Assignee: Will Jones > [R][Packaging] Turn on all compression libs for binaries > > > Key: ARROW-16828 > URL: https://issues.apache.org/jira/browse/ARROW-16828 > Project: Apache Arrow > Issue Type: Improvement > Components: Packaging, R >Affects Versions: 8.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Fix For: 9.0.0 > > > We notably don't ship brotli for MacOS. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16878) [R] Move Windows GCS dependency building upstream
[ https://issues.apache.org/jira/browse/ARROW-16878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16878: --- Fix Version/s: (was: 9.0.0) > [R] Move Windows GCS dependency building upstream > - > > Key: ARROW-16878 > URL: https://issues.apache.org/jira/browse/ARROW-16878 > Project: Apache Arrow > Issue Type: New Feature > Components: Packaging, R >Reporter: Neal Richardson >Priority: Major > > On ARROW-16510, I added the GCS filesystem to the arrow PKGBUILD, bundling it > in the arrow build. A better solution would be to put google-cloud-cpp in > rtools-packages so we don't have to build it every time. > There is no google-cloud-cpp in https://github.com/msys2/MINGW-packages, so > either we'd have to make one up for rtools-packages, or we use the bundled > google-cloud-cpp in our cmake and see if we can put as many of its > dependencies in rtools-packages to ease the build. Either way, we'd want to > start by adding its dependencies. > https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-nlohmann-json > exists in MINGW-packages and could be brought over, but I don't think it's a > big deal if it is bundled. > https://github.com/msys2/MINGW-packages/blob/master/mingw-w64-abseil-cpp/PKGBUILD > exists and could be brought over, but note that it uses C++17. That doesn't > seem to be a hard requirement, at least for what we're using, since we're > building it with C++11. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16880) [R] Test GCS auth with gargle/googleAuthR
[ https://issues.apache.org/jira/browse/ARROW-16880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-16880: -- Assignee: Will Jones > [R] Test GCS auth with gargle/googleAuthR > - > > Key: ARROW-16880 > URL: https://issues.apache.org/jira/browse/ARROW-16880 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Assignee: Will Jones >Priority: Major > Fix For: 9.0.0 > > > These are the main packages that let folks work with Google Cloud from R, so > we should make sure we can play nicely with their auth methods, how they > cache credentials, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (ARROW-16887) [Doc][R] Document GCSFileSystem for R package
[ https://issues.apache.org/jira/browse/ARROW-16887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-16887: -- Assignee: Will Jones > [Doc][R] Document GCSFileSystem for R package > - > > Key: ARROW-16887 > URL: https://issues.apache.org/jira/browse/ARROW-16887 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Reporter: Will Jones >Assignee: Will Jones >Priority: Major > Fix For: 9.0.0 > > > We should update the [cloud storage > vignette|https://arrow.apache.org/docs/r/articles/fs.html] and the filesystem > RD to show configuration and usage of GCSFileSystem. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16879) [R] Add GCS tests using testbench
[ https://issues.apache.org/jira/browse/ARROW-16879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16879: --- Fix Version/s: (was: 9.0.0) > [R] Add GCS tests using testbench > - > > Key: ARROW-16879 > URL: https://issues.apache.org/jira/browse/ARROW-16879 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > > Followup to ARROW-16510. That PR added the bindings and basic R tests that > don't require a live GCS connection. GCS has a "testbench" service you can > run on localhost to test, like how we use minio to test S3. See the Python > bindings PR for reference on how to set it up and run it, as well as some > tests we could add: https://github.com/apache/arrow/pull/12763 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16883) [R] Move macOS GCS dependency building upstream
[ https://issues.apache.org/jira/browse/ARROW-16883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16883: --- Fix Version/s: (was: 9.0.0) > [R] Move macOS GCS dependency building upstream > --- > > Key: ARROW-16883 > URL: https://issues.apache.org/jira/browse/ARROW-16883 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > > In ARROW-16510, we turned on ARROW_GCS in the autobrew formula, but it's > building it bundled in the arrow build. It would be more efficient if we > added dependencies (or google-cloud-cpp even) upstream to the autobrew > repositories and then used them like we do for aws-sdk-cpp and other > dependencies. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16605) [CI][R] Fix revdep Crossbow job
[ https://issues.apache.org/jira/browse/ARROW-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561075#comment-17561075 ] Jonathan Keane commented on ARROW-16605: Is this something we can do before the release? If not, we should run revdeps manually before the release (now?) to catch possible issues with enough time to introduce fixes > [CI][R] Fix revdep Crossbow job > --- > > Key: ARROW-16605 > URL: https://issues.apache.org/jira/browse/ARROW-16605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jacob Wujciak-Jens >Assignee: Jacob Wujciak-Jens >Priority: Blocker > Fix For: 9.0.0 > > > The revdep Crossbow job is currently not functioning correctly. This led to > changed behaviour affecting a revdep with the 8.0.0 release, requiring a > patch after initial submission. > cc: [~jonkeane] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-15805) [R] Update the as.Date() binding
[ https://issues.apache.org/jira/browse/ARROW-15805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560679#comment-17560679 ] Jonathan Keane edited comment on ARROW-15805 at 6/29/22 11:31 PM: -- This is alluded to in the PR comments, but taking a step back and thinking about the behavior: {code} dates_dash_first <- c("2022-01-01", "2022/02/02", "2022/02/02", "2022/02/02", "2022-01-01", "2022-01-01") dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02", "2022/02/02", "2022-01-01", "2022-01-01") as.Date(dates_dash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d")) #> [1] "2022-01-01" NA NA NA "2022-01-01" #> [6] "2022-01-01" as.Date(dates_slash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d")) #> [1] "2022-02-02" NA "2022-02-02" "2022-02-02" NA #> [6] NA {code} Which format is chosen and used is dependent on the underlying data, and critically the order that data is in. Given that we can't always guarantee the order of the data we are processing[1] we should not attempt to implement this behavior right now. Instead, we should have an error message if someone tries to specify {{tryFormats}} suggesting that they might use {{lubridate::as_date()}} if they want to specify multiple formats (and can accept that you don't get NAs for all formats other than the first that matches), or they should pick which format they want to use and use that. 
[1] and even if we could, it would take some tricky expression writing to pick the right format was (Author: jonkeane): This is alluded to in the PR comments, but taking a step back and thinking about the behavior: {code} dates_dash_first <- c("2022-01-01", "2022/02/02", "2022/02/02", "2022/02/02", "2022-01-01", "2022-01-01") dates_slash_first <- c("2022/02/02", "2022-01-01", "2022/02/02", "2022/02/02", "2022-01-01", "2022-01-01") as.Date(dates_dash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d")) #> [1] "2022-01-01" NA NA NA "2022-01-01" #> [6] "2022-01-01" as.Date(dates_slash_first, tryFormats = c("%Y-%m-%d", "%Y/%m/%d")) #> [1] "2022-02-02" NA "2022-02-02" "2022-02-02" NA #> [6] NA {code} Which format is chosen and used is dependent on the underlying data, and critically the order that data is in. Given that we can't always guaranty the order of the data we are processing[1] we should not attempt to implement this behavior right now. Instead, we should have an error message if someone tries to specify {{tryFormats}} suggesting that they might use {{lubridate:: as_date()}} if they want to specify multiple formats (and can accept that you don't get NAs for all formats other than the first that matches), or they should pick which format they want to use and use that. [1] and even if we could, it would take some tricky expression writing to pick the right format > [R] Update the as.Date() binding > > > Key: ARROW-15805 > URL: https://issues.apache.org/jira/browse/ARROW-15805 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
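The lubridate alternative mentioned in the comment above works because lubridate's parsers are separator-agnostic rather than committing to a single format, so mixed-separator vectors parse fully (a minimal sketch):

```r
library(lubridate)

x <- c("2022-01-01", "2022/02/02")

# ymd() (and as_date() on character input) treats "-" and "/" as
# interchangeable separators, unlike base as.Date() with tryFormats,
# which commits to one format for the whole vector.
ymd(x)
```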
[jira] [Updated] (ARROW-15158) [R] stringr functions
[ https://issues.apache.org/jira/browse/ARROW-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15158: --- Fix Version/s: (was: 9.0.0) > [R] stringr functions > - > > Key: ARROW-15158 > URL: https://issues.apache.org/jira/browse/ARROW-15158 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Alessandro Molina >Priority: Major > > *Umbrella ticket for the Initiative aimed at reaching support for the most > important stringr functions in the R bindings* -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns
[ https://issues.apache.org/jira/browse/ARROW-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17560667#comment-17560667 ] Jonathan Keane commented on ARROW-16700: [~westonpace] not sure if this is related to ARROW-16904 or ARROW-16807 but another wrong-data ticket we should take a look at > [C++] [R] [Datasets] aggregates on partitioning columns > --- > > Key: ARROW-16700 > URL: https://issues.apache.org/jira/browse/ARROW-16700 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Reporter: Jonathan Keane >Priority: Blocker > Fix For: 9.0.0, 8.0.1 > > > When summarizing a whole dataset (without group_by) with an aggregate, and > summarizing a partitioned column, arrow returns wrong data: > {code:r} > library(arrow, warn.conflicts = FALSE) > library(dplyr, warn.conflicts = FALSE) > df <- expand.grid( > some_nulls = c(0L, 1L, 2L), > year = 2010:2023, > month = 1:12, > day = 1:30 > ) > path <- tempfile() > dir.create(path) > write_dataset(df, path, partitioning = c("year", "month")) > ds <- open_dataset(path) > # with arrow the mins/maxes are off for partitioning columns > ds %>% > summarise(n = n(), min_year = min(year), min_month = min(month), min_day = > min(day), max_year = max(year), max_month = max(month), max_day = max(day)) > %>% > collect() > #> # A tibble: 1 × 7 > #> n min_year min_month min_day max_year max_month max_day > #> > #> 1 15120 2023 1 1 2023 12 30 > # compared to what we get with dplyr > df %>% > summarise(n = n(), min_year = min(year), min_month = min(month), min_day = > min(day), max_year = max(year), max_month = max(month), max_day = max(day)) > %>% > collect() > #> n min_year min_month min_day max_year max_month max_day > #> 1 15120 2010 1 1 2023 12 30 > # even min alone is off: > ds %>% > summarise(min_year = min(year)) %>% > collect() > #> # A tibble: 1 × 1 > #> min_year > #> > #> 1 2016 > > # but non-partitioning columns are fine: > ds %>% > summarise(min_day = min(day)) %>% > collect() > #> # A tibble: 
1 × 1 > #> min_day > #> > #> 1 1 > > > # But with a group_by, this seems ok > ds %>% > group_by(some_nulls) %>% > summarise(min_year = min(year)) %>% > collect() > #> # A tibble: 3 × 2 > #> some_nulls min_year > #> > #> 1 0 2010 > #> 2 1 2010 > #> 3 2 2010 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-14071) [R] Try to arrow_eval user-defined functions
[ https://issues.apache.org/jira/browse/ARROW-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14071: --- Fix Version/s: (was: 9.0.0) > [R] Try to arrow_eval user-defined functions > > > Key: ARROW-14071 > URL: https://issues.apache.org/jira/browse/ARROW-14071 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > The first test passes but the second one fails, even though they're > equivalent. The user's function isn't being evaluated in the nse_funcs > environment. > {code} > expect_dplyr_equal( > input %>% > select(-fct) %>% > filter(nchar(padded_strings) < 10) %>% > collect(), > tbl > ) > isShortString <- function(x) nchar(x) < 10 > expect_dplyr_equal( > input %>% > select(-fct) %>% > filter(isShortString(padded_strings)) %>% > collect(), > tbl > ) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
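Until user-defined functions are handled, a workaround is to inline the body of the helper so arrow can translate the known bindings (nchar() here) directly. A sketch, assuming a recent arrow version with arrow_table() and hypothetical example data:

```r
library(arrow)
library(dplyr)

tbl <- arrow_table(padded_strings = c("short", "a much longer padded string"))

isShortString <- function(x) nchar(x) < 10

# filter(isShortString(padded_strings)) is not translated, but inlining
# the expression works because nchar() has an arrow binding:
tbl %>%
  filter(nchar(padded_strings) < 10) %>%
  collect()
```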
[jira] [Updated] (ARROW-14209) [R] Allow multiple arguments to n_distinct()
[ https://issues.apache.org/jira/browse/ARROW-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14209: --- Fix Version/s: (was: 9.0.0) > [R] Allow multiple arguments to n_distinct() > > > Key: ARROW-14209 > URL: https://issues.apache.org/jira/browse/ARROW-14209 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Ian Cook >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > > ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function > in the dplyr verb {{summarise()}} but only with a single argument. Add > support for multiple arguments to {{n_distinct()}}. This should return the > number of unique combinations of values in the specified columns/expressions. > See the comment about this here: > [https://github.com/apache/arrow/pull/11257#discussion_r720873549] -- This message was sent by Atlassian Jira (v8.20.10#820010)
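The intended semantics can be illustrated with plain dplyr, where multi-argument n_distinct() already counts unique combinations of its inputs (a dplyr-only sketch with hypothetical data):

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 2, 2), b = c("x", "y", "x", "x"))

# Distinct (a, b) pairs are (1, "x"), (1, "y"), (2, "x"):
n_distinct(df$a, df$b)  # 3
```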
[jira] [Updated] (ARROW-14588) [R] Create an arrow-specific checklist for a CRAN release
[ https://issues.apache.org/jira/browse/ARROW-14588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14588: --- Fix Version/s: (was: 9.0.0) > [R] Create an arrow-specific checklist for a CRAN release > --- > > Key: ARROW-14588 > URL: https://issues.apache.org/jira/browse/ARROW-14588 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Minor > > This would adapt and implement the functionality of > {{usethis::use_release_issue()}} for {{arrow}}'s specific context. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-16692) [C++] Segfault in datasets
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16692: --- Priority: Blocker (was: Major) > [C++] Segfault in datasets > -- > > Key: ARROW-16692 > URL: https://issues.apache.org/jira/browse/ARROW-16692 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jonathan Keane >Assignee: Weston Pace >Priority: Blocker > Attachments: backtrace.txt > > > I'm still working to make a minimal reproducer for this, though I can > reliably reproduce it below (though that means needing to download a bunch of > data first...). I've cleaned out much of the unnecessary code (so this query > below is a bit silly, and not what I'm actually trying to do), but haven't > been able to make a constructed dataset that reproduces this. > Working on some example with the new | more cleaned taxi dataset at > {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: > {code} > library(arrow) > library(dplyr) > ds <- open_dataset("path/to/new_taxi/") > ds %>% > filter(!is.na(pickup_location_id)) %>% > summarise(n = n()) %>% collect() > {code} > Most of the time ends in a segfault (though I have gotten it to work on > occasion). I've tried with smaller files | constructed datasets and haven't > been able to replicate it yet. One thing that might be important is: > {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or > so. > I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16692) [C++] Segfault in datasets
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16692: --- Fix Version/s: 9.0.0 > [C++] Segfault in datasets > -- > > Key: ARROW-16692 > URL: https://issues.apache.org/jira/browse/ARROW-16692 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jonathan Keane >Assignee: Weston Pace >Priority: Blocker > Fix For: 9.0.0 > > Attachments: backtrace.txt > > > I'm still working to make a minimal reproducer for this, though I can > reliably reproduce it below (though that means needing to download a bunch of > data first...). I've cleaned out much of the unnecessary code (so this query > below is a bit silly, and not what I'm actually trying to do), but haven't > been able to make a constructed dataset that reproduces this. > Working on some example with the new | more cleaned taxi dataset at > {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: > {code} > library(arrow) > library(dplyr) > ds <- open_dataset("path/to/new_taxi/") > ds %>% > filter(!is.na(pickup_location_id)) %>% > summarise(n = n()) %>% collect() > {code} > Most of the time ends in a segfault (though I have gotten it to work on > occasion). I've tried with smaller files | constructed datasets and haven't > been able to replicate it yet. One thing that might be important is: > {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or > so. > I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-16319) [R] [Docs] Document the lubridate functions we support in {arrow}
[ https://issues.apache.org/jira/browse/ARROW-16319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-16319: -- Assignee: Stephanie Hazlitt (was: Dragoș Moldovan-Grünfeld) > [R] [Docs] Document the lubridate functions we support in {arrow} > - > > Key: ARROW-16319 > URL: https://issues.apache.org/jira/browse/ARROW-16319 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Stephanie Hazlitt >Priority: Major > Fix For: 9.0.0 > > > Add documentation around the {{lubridate}} functionality supported in > {{arrow}}. Could be made up of: > * a blogpost > * a more in-depth piece of documentation -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Closed] (ARROW-16418) [R] Refactor the difftime() and as.difftime() bindings
[ https://issues.apache.org/jira/browse/ARROW-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane closed ARROW-16418. -- Resolution: Won't Fix > [R] Refactor the difftime() and as.difftime() bindings > -- > > Key: ARROW-16418 > URL: https://issues.apache.org/jira/browse/ARROW-16418 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 8.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > ARROW-16060 is solved and these 2 functions have high cyclomatic complexity -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16841) [R] Additional Lubridate Capabilities
[ https://issues.apache.org/jira/browse/ARROW-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16841: --- Issue Type: Wish (was: Bug) > [R] Additional Lubridate Capabilities > - > > Key: ARROW-16841 > URL: https://issues.apache.org/jira/browse/ARROW-16841 > Project: Apache Arrow > Issue Type: Wish > Components: C++, R >Affects Versions: 9.0.0 >Reporter: Alessandro Molina >Priority: Major > > Umbrella Ticket for the remaining lubridate work. > This is functionality that we have scoped, but we have decided to wait to > implement until it is requested by someone proactively. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-16841) [R] Additional Lubridate Capabilities
[ https://issues.apache.org/jira/browse/ARROW-16841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-16841: --- Description: Umbrella Ticket for the remaining lubridate work. This is functionality that we have scoped, but we have decided to wait to implement until it is requested by someone proactively. was: Umbrella Ticket for the remaining lubridate work. Most of the work here will be triggered by explicit user requests > [R] Additional Lubridate Capabilities > - > > Key: ARROW-16841 > URL: https://issues.apache.org/jira/browse/ARROW-16841 > Project: Apache Arrow > Issue Type: Bug > Components: C++, R >Affects Versions: 9.0.0 >Reporter: Alessandro Molina >Priority: Major > > Umbrella Ticket for the remaining lubridate work. > This is functionality that we have scoped, but we have decided to wait to > implement until it is requested by someone proactively. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16440) [R] Implement bindings for lubridate's parse_date_time2
[ https://issues.apache.org/jira/browse/ARROW-16440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555088#comment-17555088 ] Jonathan Keane commented on ARROW-16440: What's special about `parse_date_time2()` compared to `parse_date_time()`? > [R] Implement bindings for lubridate's parse_date_time2 > --- > > Key: ARROW-16440 > URL: https://issues.apache.org/jira/browse/ARROW-16440 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 9.0.0 > > > Split from ARROW-14848 -- This message was sent by Atlassian Jira (v8.20.7#820007)
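For context on the question above: in lubridate, parse_date_time() is the flexible parser that guesses and validates formats in R, while parse_date_time2() is a stricter, faster C-level parser that supports only a subset of format features. A hedged sketch of the difference (the exact feature gap varies by lubridate version):

```r
library(lubridate)

x <- "2022-01-15 10:30:05"

# Flexible parser: tries and guesses orders, handles more format features.
parse_date_time(x, orders = "Ymd HMS")

# Fast C parser: numeric-style formats only, less guessing, but much
# faster on large vectors.
parse_date_time2(x, orders = "Ymd HMS")
```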
[jira] [Commented] (ARROW-16653) [R] All formats are supported with the lubridate `parse_date_time` binding
[ https://issues.apache.org/jira/browse/ARROW-16653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17555086#comment-17555086 ] Jonathan Keane commented on ARROW-16653: What formats do we currently not support? > [R] All formats are supported with the lubridate `parse_date_time` binding > -- > > Key: ARROW-16653 > URL: https://issues.apache.org/jira/browse/ARROW-16653 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Affects Versions: 8.0.1 >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Critical > Fix For: 9.0.0 > > > Ensure: > - all formats supported and tested -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-13370) [R] More special handling for known errors in arrow_eval
[ https://issues.apache.org/jira/browse/ARROW-13370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-13370: --- Description: We have special handling in arrow_eval that looks for the "not supported in Arrow" error, and when that's found it shows the error message rather than swallowing it in an "Expression not supported" message. But we have other error messages we raise in nse_funcs that are worth showing--bad input etc. Use a sentinel error message that we can also detect and subclass as "arrow-try-error" like the others, or (better) raise a classed exception (if that's supported in all versions of R we support). (was: We have special handling in arrow_eval that looks for the "not supported in Arrow" error, and when that's found it shows the error message rather than swallowing it in an "Expression not supported" message. But we have other error messages we raise in nse_funcs that are worth showing--bad input etc. Use a sentinel error message that we can also detect and subclass as "arrow-try-error" like the others, or (better) raised a classed exception (if that's supported in all versions of R we support). ) > [R] More special handling for known errors in arrow_eval > > > Key: ARROW-13370 > URL: https://issues.apache.org/jira/browse/ARROW-13370 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Neal Richardson >Priority: Major > Fix For: 9.0.0 > > > We have special handling in arrow_eval that looks for the "not supported in > Arrow" error, and when that's found it shows the error message rather than > swallowing it in an "Expression not supported" message. But we have other > error messages we raise in nse_funcs that are worth showing--bad input etc. > Use a sentinel error message that we can also detect and subclass as > "arrow-try-error" like the others, or (better) raise a classed exception (if > that's supported in all versions of R we support). 
-- This message was sent by Atlassian Jira (v8.20.7#820007)
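The classed-exception option described in the ticket above can be sketched in base R with a condition object carrying a custom class, which callers catch by class instead of matching message text (the class names here are hypothetical, not arrow's actual ones):

```r
# Raise a condition with a custom class attached.
arrow_not_supported <- function(msg) {
  cond <- structure(
    list(message = msg, call = sys.call(-1)),
    class = c("arrow_not_supported_error", "error", "condition")
  )
  stop(cond)
}

# Catch by class rather than grepping the message string:
tryCatch(
  arrow_not_supported("nchar() is not supported in Arrow"),
  arrow_not_supported_error = function(e) {
    message("caught: ", conditionMessage(e))
  }
)
```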
[jira] [Resolved] (ARROW-16415) [R] Update strptime bindings to use tz
[ https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-16415. Resolution: Fixed Issue resolved by pull request 13190 [https://github.com/apache/arrow/pull/13190] > [R] Update strptime bindings to use tz > --- > > Key: ARROW-16415 > URL: https://issues.apache.org/jira/browse/ARROW-16415 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 7.0.0 >Reporter: Dragoș Moldovan-Grünfeld >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > {{strptime}} mentions it does not support {{tz}} - the timezone argument. > ARROW-12820 has been addressed and the binding definition needs updating. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-16626) [C++] Name the C++ streaming execution engine
[ https://issues.apache.org/jira/browse/ARROW-16626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-16626. Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13207 [https://github.com/apache/arrow/pull/13207] > [C++] Name the C++ streaming execution engine > - > > Key: ARROW-16626 > URL: https://issues.apache.org/jira/browse/ARROW-16626 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Priority: Major > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > There is some desire on the mailing list to name the C++ execution engine. > Although there isn't really any code impact from such a change we should > update our documentation to refer to the engine by name. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Resolved] (ARROW-14632) [Python] Make write_dataset arguments keyword-only
[ https://issues.apache.org/jira/browse/ARROW-14632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14632. Fix Version/s: 9.0.0 Resolution: Fixed Issue resolved by pull request 13289 [https://github.com/apache/arrow/pull/13289] > [Python] Make write_dataset arguments keyword-only > -- > > Key: ARROW-14632 > URL: https://issues.apache.org/jira/browse/ARROW-14632 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Austin Dickey >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 9.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > The > [write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811] > method has many arguments for customizing the behavior of the write. Most > of them could be made keyword only. -- This message was sent by Atlassian Jira (v8.20.7#820007)
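The mechanism behind "keyword-only" in the resolved ticket above is Python's bare {{*}} marker in a signature: every parameter after it must be passed by name. A minimal sketch — the parameter names below are illustrative, not pyarrow.dataset.write_dataset's actual signature:

```python
# Sketch of keyword-only arguments (names are illustrative, not
# pyarrow's real signature): everything after the bare "*" must be
# passed by keyword, so positional call sites break loudly.
def write_dataset(data, base_dir, *, format="parquet", partitioning=None):
    return {"format": format, "partitioning": partitioning}

write_dataset([], "/tmp/out", format="feather")   # OK: passed by keyword
try:
    write_dataset([], "/tmp/out", "feather")      # positional -> TypeError
except TypeError:
    print("rejected positional argument")
```

This is why the change is (mildly) breaking: existing callers that passed options positionally raise TypeError after the switch.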
[jira] [Created] (ARROW-16715) [R] Bump default parquet version?
Jonathan Keane created ARROW-16715: -- Summary: [R] Bump default parquet version? Key: ARROW-16715 URL: https://issues.apache.org/jira/browse/ARROW-16715 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Jonathan Keane With ARROW-12203 the default parquet version was bumped for pyarrow to 2.4. At a minimum, we should add 2_4 as a valid version type to https://github.com/apache/arrow/blob/9b0afc352e8b3ecb3104d58e4bcf09def256b587/r/R/parquet.R#L239-L242 and https://github.com/apache/arrow/blob/9b0afc352e8b3ecb3104d58e4bcf09def256b587/r/R/enums.R#L122-L126 But do we also want to follow pyarrow's lead and bump up to a newer version by default? -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-14264) [R] Support inequality joins
[ https://issues.apache.org/jira/browse/ARROW-14264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14264: --- Description: We'll need this -not-yet-merged- merged, but unreleased dplyr API to do it: https://github.com/tidyverse/dplyr/pull/5910 (was: We'll need this not-yet-merged dplyr API to do it: https://github.com/tidyverse/dplyr/pull/5910) > [R] Support inequality joins > > > Key: ARROW-14264 > URL: https://issues.apache.org/jira/browse/ARROW-14264 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Neal Richardson >Priority: Major > Labels: query-engine > Fix For: 9.0.0 > > > We'll need this -not-yet-merged- merged, but unreleased dplyr API to do it: > https://github.com/tidyverse/dplyr/pull/5910 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16605) [CI][R] Fix revdep Crossbow job
[ https://issues.apache.org/jira/browse/ARROW-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544966#comment-17544966 ] Jonathan Keane commented on ARROW-16605: For visibility: https://github.com/apache/arrow/blob/master/dev/tasks/r/github.linux.revdepcheck.yml is the template for running these revdep checks > [CI][R] Fix revdep Crossbow job > --- > > Key: ARROW-16605 > URL: https://issues.apache.org/jira/browse/ARROW-16605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jacob Wujciak-Jens >Assignee: Jacob Wujciak-Jens >Priority: Blocker > Fix For: 9.0.0 > > > The revdep Crossbow job is currently not functioning correctly. This led to > changed behaviour affecting a revdep with the 8.0.0 release, requiring a > patch after initial submission. > cc: [~jonkeane] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16692) [C++] Segfault in datasets
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544909#comment-17544909 ] Jonathan Keane commented on ARROW-16692: Thanks! Is there a rough timeline for when that work might be done? I came across this prepping some demos for a talk next week — the queries do _sometimes_ complete (and tend to complete more reliably with the bigger queries). But I might need to change what queries I show if we don't think this will be done in the near term. > [C++] Segfault in datasets > -- > > Key: ARROW-16692 > URL: https://issues.apache.org/jira/browse/ARROW-16692 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jonathan Keane >Assignee: Weston Pace >Priority: Major > Attachments: backtrace.txt > > > I'm still working to make a minimal reproducer for this, though I can > reliably reproduce it below (though that means needing to download a bunch of > data first...). I've cleaned out much of the unnecessary code (so this query > below is a bit silly, and not what I'm actually trying to do), but haven't > been able to make a constructed dataset that reproduces this. > Working on some example with the new | more cleaned taxi dataset at > {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: > {code} > library(arrow) > library(dplyr) > ds <- open_dataset("path/to/new_taxi/") > ds %>% > filter(!is.na(pickup_location_id)) %>% > summarise(n = n()) %>% collect() > {code} > Most of the time ends in a segfault (though I have gotten it to work on > occasion). I've tried with smaller files | constructed datasets and haven't > been able to replicate it yet. One thing that might be important is: > {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or > so. > I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16701) [R] Can we execute SQL in a dplyr pipeline?
[ https://issues.apache.org/jira/browse/ARROW-16701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544618#comment-17544618 ] Jonathan Keane commented on ARROW-16701: Yes, sorry that conflation was unintentional. We can do this today with duckdb, so we should try that — but in principle we should be able to use it with any backend that accepts sql + could speak arrow > [R] Can we execute SQL in a dplyr pipeline? > --- > > Key: ARROW-16701 > URL: https://issues.apache.org/jira/browse/ARROW-16701 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Jonathan Keane >Priority: Major > > Now that we have {{to_duckdb()}} and {{to_arrow()}} is it possible to wrap > those and allow someone to insert arbitrary SQL into a dplyr query? > Something like: > {code:r} > sql <- function(data, sql) { >tbl <- to_duckdb(data) >res <- DBI::dbSendQuery(dbplyr::remote_con(.data), sql, arrow = TRUE) > duckdb::duckdb_fetch_record_batch(res) > } > ds %>% > filter(year > 2020) %>% > sql("SELECT tip_amount, fare_amount, total_amount FROM ") %>% > compute() > {code} > This won't work totally, but is vaguely what we're looking for. > One part that we need to think about is how to deal with the {{from}} clause, > a few possibilities: > * ibis does this by making you "name" the table before doing sql so you can > FROM explicitly > * though maybe you could get away with FROM . like it is a magrittr thing and > sub that > * empty string, and we add it in based on the lazy_tbl object > Possibly related prior art: > https://dbplyr.tidyverse.org/reference/build_sql.html (though the name isn't > perfect IMO, and I think this is more geared towards package developers than > end users?) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16701) [R] Can we execute SQL in a dplyr pipeline?
Jonathan Keane created ARROW-16701: -- Summary: [R] Can we execute SQL in a dplyr pipeline? Key: ARROW-16701 URL: https://issues.apache.org/jira/browse/ARROW-16701 Project: Apache Arrow Issue Type: New Feature Components: R Reporter: Jonathan Keane Now that we have {{to_duckdb()}} and {{to_arrow()}} is it possible to wrap those and allow someone to insert arbitrary SQL into a dplyr query? Something like: {code:r} sql <- function(data, sql) { tbl <- to_duckdb(data) res <- DBI::dbSendQuery(dbplyr::remote_con(.data), sql, arrow = TRUE) duckdb::duckdb_fetch_record_batch(res) } ds %>% filter(year > 2020) %>% sql("SELECT tip_amount, fare_amount, total_amount FROM ") %>% compute() {code} This won't work totally, but is vaguely what we're looking for. One part that we need to think about is how to deal with the {{from}} clause, a few possibilities: * ibis does this by making you "name" the table before doing sql so you can FROM explicitly * though maybe you could get away with FROM . like it is a magrittr thing and sub that * empty string, and we add it in based on the lazy_tbl object Possibly related prior art: https://dbplyr.tidyverse.org/reference/build_sql.html (though the name isn't perfect IMO, and I think this is more geared towards package developers than end users?) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16700) [C++] [R] [Datasets] aggregates on partitioning columns
Jonathan Keane created ARROW-16700: -- Summary: [C++] [R] [Datasets] aggregates on partitioning columns Key: ARROW-16700 URL: https://issues.apache.org/jira/browse/ARROW-16700 Project: Apache Arrow Issue Type: Bug Components: C++, R Reporter: Jonathan Keane When summarizing a whole dataset (without group_by) with an aggregate, and summarizing a partitioned column, arrow returns wrong data: {code:r} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) df <- expand.grid( some_nulls = c(0L, 1L, 2L), year = 2010:2023, month = 1:12, day = 1:30 ) path <- tempfile() dir.create(path) write_dataset(df, path, partitioning = c("year", "month")) ds <- open_dataset(path) # with arrow the mins/maxes are off for partitioning columns ds %>% summarise(n = n(), min_year = min(year), min_month = min(month), min_day = min(day), max_year = max(year), max_month = max(month), max_day = max(day)) %>% collect() #> # A tibble: 1 × 7 #> n min_year min_month min_day max_year max_month max_day #> #> 1 15120 2023 1 1 2023 12 30 # compared to what we get with dplyr df %>% summarise(n = n(), min_year = min(year), min_month = min(month), min_day = min(day), max_year = max(year), max_month = max(month), max_day = max(day)) %>% collect() #> n min_year min_month min_day max_year max_month max_day #> 1 15120 2010 1 1 2023 12 30 # even min alone is off: ds %>% summarise(min_year = min(year)) %>% collect() #> # A tibble: 1 × 1 #> min_year #> #> 1 2016 # but non-partitioning columns are fine: ds %>% summarise(min_day = min(day)) %>% collect() #> # A tibble: 1 × 1 #> min_day #> #> 1 1 # But with a group_by, this seems ok ds %>% group_by(some_nulls) %>% summarise(min_year = min(year)) %>% collect() #> # A tibble: 3 × 2 #> some_nulls min_year #> #> 1 0 2010 #> 2 1 2010 #> 3 2 2010 {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
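For reference, the expected answers in the reprex above can be verified with nothing but a standard library, since {{expand.grid()}} is just a cross product of its inputs. A quick check (Python used here purely to confirm the numbers the dplyr run reports):

```python
from itertools import product

# Same grid as expand.grid(some_nulls = c(0L, 1L, 2L), year = 2010:2023,
# month = 1:12, day = 1:30) in the reprex above.
rows = list(product([0, 1, 2], range(2010, 2024), range(1, 13), range(1, 31)))

n = len(rows)                        # 3 * 14 * 12 * 30 = 15120 rows
min_year = min(r[1] for r in rows)   # 2010 -- what arrow should return
max_year = max(r[1] for r in rows)   # 2023
print(n, min_year, max_year)
```

This confirms the dplyr result (n = 15120, min_year = 2010) is the correct baseline, so arrow's min_year of 2023 on the partitioned dataset is the bug.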
[jira] [Commented] (ARROW-16692) [C++] Segfault in datasets
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544577#comment-17544577 ] Jonathan Keane commented on ARROW-16692: bq. One thing that might be important is: pickup_location_id is all NAs | nulls in the first 8 years of the data or so. This is almost certainly a red herring now that I come back to it; the following *also* segfaults without referencing that specific column. {code} ds %>% filter(pickup_datetime > as.Date("2017-01-01")) %>% summarise(n = n()) %>% collect() {code} > [C++] Segfault in datasets > -- > > Key: ARROW-16692 > URL: https://issues.apache.org/jira/browse/ARROW-16692 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jonathan Keane >Priority: Major > Attachments: backtrace.txt > > > I'm still working to make a minimal reproducer for this, though I can > reliably reproduce it below (though that means needing to download a bunch of > data first...). I've cleaned out much of the unnecessary code (so this query > below is a bit silly, and not what I'm actually trying to do), but haven't > been able to make a constructed dataset that reproduces this. > Working on some example with the new | more cleaned taxi dataset at > {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: > {code} > library(arrow) > library(dplyr) > ds <- open_dataset("path/to/new_taxi/") > ds %>% > filter(!is.na(pickup_location_id)) %>% > summarise(n = n()) %>% collect() > {code} > Most of the time ends in a segfault (though I have gotten it to work on > occasion). I've tried with smaller files | constructed datasets and haven't > been able to replicate it yet. One thing that might be important is: > {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or > so. > I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16695) [R][C++] Extension types are not supported in joins
[ https://issues.apache.org/jira/browse/ARROW-16695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544517#comment-17544517 ] Jonathan Keane commented on ARROW-16695: Thanks for the reprex! cc [~westonpace] > [R][C++] Extension types are not supported in joins > --- > > Key: ARROW-16695 > URL: https://issues.apache.org/jira/browse/ARROW-16695 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, R >Reporter: Dewey Dunnington >Priority: Major > > It looks like extension types are not supported in joins (even if the > underlying type is supported)! Reported by [~jonkeane] while making a demo > for Arrow + Query engine + geoarrow (R package), which uses extension types > liberally: > {code:R} > library(arrow, warn.conflicts = FALSE) > library(dplyr, warn.conflicts = FALSE) > rb_non_ext <- record_batch( > a = 1:5, > b = letters[1:5] > ) > rb_ext_storage <- record_batch( > b = letters[1:5], > c = Array$create(list(as.raw(1:5)), type = binary()) > ) > rb_ext <- record_batch( > b = letters[1:5], > c = vctrs_extension_array(rb_ext_storage$c$as_vector()) > ) > rb_non_ext %>% > left_join(rb_ext_storage) %>% > collect() > #> # A tibble: 5 × 3 > #> a b c > #> > #> 1 1 a 01, 02, 03, 04, 05 > #> 2 2 b 01, 02, 03, 04, 05 > #> 3 3 c 01, 02, 03, 04, 05 > #> 4 4 d 01, 02, 03, 04, 05 > #> 5 5 e 01, 02, 03, 04, 05 > rb_non_ext %>% > left_join(rb_ext) %>% > collect() > #> Error in `collect()`: > #> ! 
Invalid: Data type is not supported in join non-key > field > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:121 > ValidateSchemas(join_type, left_schema, left_keys, left_output, > right_schema, right_keys, right_output, left_field_name_suffix, > right_field_name_suffix) > #> > /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:499 > schema_mgr->Init( join_options.join_type, left_schema, > join_options.left_keys, join_options.left_output, right_schema, > join_options.right_keys, join_options.right_output, join_options.filter, > join_options.output_suffix_for_left, join_options.output_suffix_for_right) > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Assigned] (ARROW-14632) [Python] Make write_dataset arguments keyword-only
[ https://issues.apache.org/jira/browse/ARROW-14632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-14632: -- Assignee: Austin Dickey (was: Weston Pace) > [Python] Make write_dataset arguments keyword-only > -- > > Key: ARROW-14632 > URL: https://issues.apache.org/jira/browse/ARROW-14632 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Austin Dickey >Priority: Major > Labels: good-first-issue > > The > [write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811] > method has many arguments for customizing the behavior of the write. Most > of them could be made keyword only. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16692) [C++] Segfault in datasets
[ https://issues.apache.org/jira/browse/ARROW-16692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17544407#comment-17544407 ] Jonathan Keane commented on ARROW-16692: cc [~westonpace] > [C++] Segfault in datasets > -- > > Key: ARROW-16692 > URL: https://issues.apache.org/jira/browse/ARROW-16692 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Jonathan Keane >Priority: Major > Attachments: backtrace.txt > > > I'm still working to make a minimal reproducer for this, though I can > reliably reproduce it below (though that means needing to download a bunch of > data first...). I've cleaned out much of the unnecessary code (so this query > below is a bit silly, and not what I'm actually trying to do), but haven't > been able to make a constructed dataset that reproduces this. > Working on some example with the new | more cleaned taxi dataset at > {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: > {code} > library(arrow) > library(dplyr) > ds <- open_dataset("path/to/new_taxi/") > ds %>% > filter(!is.na(pickup_location_id)) %>% > summarise(n = n()) %>% collect() > {code} > Most of the time ends in a segfault (though I have gotten it to work on > occasion). I've tried with smaller files | constructed datasets and haven't > been able to replicate it yet. One thing that might be important is: > {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or > so. > I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16692) [C++] Segfault in datasets
Jonathan Keane created ARROW-16692: -- Summary: [C++] Segfault in datasets Key: ARROW-16692 URL: https://issues.apache.org/jira/browse/ARROW-16692 Project: Apache Arrow Issue Type: Bug Components: C++ Reporter: Jonathan Keane Attachments: backtrace.txt I'm still working to make a minimal reproducer for this, though I can reliably reproduce it below (though that means needing to download a bunch of data first...). I've cleaned out much of the unnecessary code (so this query below is a bit silly, and not what I'm actually trying to do), but haven't been able to make a constructed dataset that reproduces this. Working on some example with the new | more cleaned taxi dataset at {{s3://ursa-labs-taxi-data-v2}}, I've run into a segfault: {code} library(arrow) library(dplyr) ds <- open_dataset("path/to/new_taxi/") ds %>% filter(!is.na(pickup_location_id)) %>% summarise(n = n()) %>% collect() {code} Most of the time ends in a segfault (though I have gotten it to work on occasion). I've tried with smaller files | constructed datasets and haven't been able to replicate it yet. One thing that might be important is: {{pickup_location_id}} is all NAs | nulls in the first 8 years of the data or so. I've attached a backtrace in case that's enough to see what's going on here. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (ARROW-14632) [Python] Make write_dataset arguments keyword-only
[ https://issues.apache.org/jira/browse/ARROW-14632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-14632: --- Description: The [write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811] method has many arguments for customizing the behavior of the write. Most of them could be made keyword only. (was: The write_dataset method has many arguments for customizing the behavior of the write. Most of them could be made keyword only.) > [Python] Make write_dataset arguments keyword-only > -- > > Key: ARROW-14632 > URL: https://issues.apache.org/jira/browse/ARROW-14632 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: good-first-issue > > The > [write_dataset|https://github.com/apache/arrow/blob/8a36f0f6cb385c88b637f479cc38b7e51d45c7e7/python/pyarrow/dataset.py#L804-L811] > method has many arguments for customizing the behavior of the write. Most > of them could be made keyword only. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539107#comment-17539107 ] Jonathan Keane edited comment on ARROW-15678 at 5/18/22 10:03 PM: -- [~kou] Do you think you might be able to take a look at this? The comment at https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good explanation of what's going on and following that there are a few possible fixes (though none of them were fully implemented or decided was (Author: jonkeane): @kou Do you think you might be able to take a look at this? The comment at https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good explanation of what's going on and following that there are a few possible fixes (though none of them were fully implemented or decided > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled
[ https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539107#comment-17539107 ] Jonathan Keane commented on ARROW-15678: @kou Do you think you might be able to take a look at this? The comment at https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good explanation of what's going on and following that there are a few possible fixes (though none of them were fully implemented or decided > [C++][CI] a crossbow job with MinRelSize enabled > > > Key: ARROW-15678 > URL: https://issues.apache.org/jira/browse/ARROW-15678 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Continuous Integration >Reporter: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 9.0.0 > > Time Spent: 13h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007)