[jira] [Resolved] (ARROW-14206) [Go] Fix Build for ARM and s390x

2021-10-03 Thread Matthew Topol (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Topol resolved ARROW-14206.
---
Resolution: Fixed

Issue resolved by pull request 11299
[https://github.com/apache/arrow/pull/11299]

> [Go] Fix Build for ARM and s390x
> 
>
> Key: ARROW-14206
> URL: https://issues.apache.org/jira/browse/ARROW-14206
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, Go
>Affects Versions: 6.0.0
>Reporter: Matthew Topol
>Assignee: Matthew Topol
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14210) [C++] CMAKE_AR is not passed to bzip2 thirdparty dependency

2021-10-03 Thread Karl Dunkle Werner (Jira)
Karl Dunkle Werner created ARROW-14210:
--

 Summary: [C++] CMAKE_AR is not passed to bzip2 thirdparty 
dependency
 Key: ARROW-14210
 URL: https://issues.apache.org/jira/browse/ARROW-14210
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 5.0.0
Reporter: Karl Dunkle Werner


It seems like the {{AR}} or {{CMAKE_AR}} variables aren't getting passed for 
the bzip2 build, which causes it to fail if we're doing a {{BUNDLED}} build and 
{{ar}} isn't available in the {{$PATH}} (e.g. in a conda environment).

To replicate:
 1. Download Arrow and start an interactive shell in a container 
 (docker should be fine if you prefer it to podman)
{code:sh}
git clone --depth 1 g...@github.com:apache/arrow.git
podman run -it --rm -v ./arrow:/arrow:Z \
  docker://ursalab/amd64-ubuntu-18.04-conda-python-3.6:worker bash
{code}
2. Build Arrow by running this in the container:
{code:sh}
export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX
export ARROW_HOME=$CONDA_PREFIX
export PARQUET_HOME=$CONDA_PREFIX

cd /arrow
mkdir -p cpp/build
pushd cpp/build

cmake \
  -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DCMAKE_AR=${AR} \
  -DCMAKE_RANLIB=${RANLIB} \
  -DARROW_WITH_BZ2=ON \
  -DARROW_VERBOSE_THIRDPARTY_BUILD=ON \
  -DARROW_JEMALLOC=OFF \
  -DARROW_SIMD_LEVEL=NONE -DARROW_RUNTIME_SIMD_LEVEL=NONE \
  -DARROW_DEPENDENCY_SOURCE=BUNDLED \
  ..
make
# make[3]: ar: No such file or directory
# make[3]: *** [Makefile:48: libbz2.a] Error 127
# make[2]: *** [CMakeFiles/bzip2_ep.dir/build.make:135: bzip2_ep-prefix/src/bzip2_ep-stamp/bzip2_ep-build] Error 2
# make[1]: *** [CMakeFiles/Makefile2:726: CMakeFiles/bzip2_ep.dir/all] Error 2

{code}
In the cmake call above, {{ARROW_JEMALLOC}} and the SIMD flags are just to skip 
compiling irrelevant things.

I think this line in {{ThirdpartyToolchain.cmake}} needs to be changed to pass 
{{CMAKE_AR}}.
 
[https://github.com/apache/arrow/blob/bad8824d5cda0fd8337c7167729c49af868f93a5/cpp/cmake_modules/ThirdpartyToolchain.cmake#L2211]

Other related issues have also needed to pass {{CMAKE_RANLIB}}, in addition to 
{{CMAKE_AR}}. I'm not sure if that applies here.

 
 Related: ARROW-4471, ARROW-4831
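
Before running the full build, it may help to confirm the precondition described above: that no plain {{ar}} is on the {{$PATH}} while the toolchain's archiver is reachable through {{AR}}. The following is a minimal Python sketch under that assumption; the helper name {{resolve_archiver}} and the injectable {{which}} parameter are illustrative and not part of Arrow or CMake.

```python
import os
import shutil


def resolve_archiver(environ=None, which=shutil.which):
    """Report how the archiver would be found.

    `environ` and `which` are injectable so the check can be exercised
    without touching the real environment (an assumption of this sketch).
    """
    environ = os.environ if environ is None else environ
    ar_from_env = environ.get("AR")
    return {
        # Value of $AR, which is what -DCMAKE_AR=${AR} forwards to CMake
        "AR": ar_from_env,
        # Full path that $AR resolves to, or None if it cannot be found
        "AR_resolved": which(ar_from_env) if ar_from_env else None,
        # What a sub-build sees if it simply invokes `ar`
        "plain_ar": which("ar"),
    }


if __name__ == "__main__":
    for key, value in resolve_archiver().items():
        print(f"{key}: {value}")
```

If {{plain_ar}} is empty while {{AR_resolved}} is set, a bundled dependency that calls {{ar}} directly would be expected to fail in the way shown in the log above.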





[jira] [Commented] (ARROW-14188) link error on ubuntu

2021-10-03 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423725#comment-17423725
 ] 

Kouhei Sutou commented on ARROW-14188:
--

[~icook] Could you confirm this vcpkg related problem? It seems that 
{{libarrow_bundled_dependencies.a}} isn't linked automatically.

> link error on ubuntu
> 
>
> Key: ARROW-14188
> URL: https://issues.apache.org/jira/browse/ARROW-14188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0
> Environment: Ubuntu 18.04, gcc-9, and vcpkg installation of arrow
>Reporter: Amir Ghamarian
>Priority: Major
> Attachments: completerr.txt, linkerror.txt
>
>
> I used vcpkg to install Arrow versions 4 and 5. Building my code that uses 
> Parquet fails with undefined-reference link errors.
> The same code works on OSX but fails on ubuntu.
> My cmake snippet is as follows:
>  
> {code:java}
> find_package(Arrow CONFIG REQUIRED)
> get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
> find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
> find_package(Thrift CONFIG REQUIRED)
> {code}
> and the linking: 
>  
> {code:java}
> target_link_libraries(vision_obj PUBLIC  thrift::thrift re2::re2 
> arrow_static parquet_static )
> {code}
>  
>  I get a lot of errors
>  





[jira] [Issue Comment Deleted] (ARROW-14188) link error on ubuntu

2021-10-03 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-14188:
-
Comment: was deleted

(was: [~icook] Could you confirm this vcpkg related problem? It seems that 
{{libarrow_bundled_dependencies.a}} isn't linked automatically.)

> link error on ubuntu
> 
>
> Key: ARROW-14188
> URL: https://issues.apache.org/jira/browse/ARROW-14188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0
> Environment: Ubuntu 18.04, gcc-9, and vcpkg installation of arrow
>Reporter: Amir Ghamarian
>Priority: Major
> Attachments: completerr.txt, linkerror.txt
>
>
> I used vcpkg to install Arrow versions 4 and 5. Building my code that uses 
> Parquet fails with undefined-reference link errors.
> The same code works on OSX but fails on ubuntu.
> My cmake snippet is as follows:
>  
> {code:java}
> find_package(Arrow CONFIG REQUIRED)
> get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
> find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
> find_package(Thrift CONFIG REQUIRED)
> {code}
> and the linking: 
>  
> {code:java}
> target_link_libraries(vision_obj PUBLIC  thrift::thrift re2::re2 
> arrow_static parquet_static )
> {code}
>  
>  I get a lot of errors
>  





[jira] [Comment Edited] (ARROW-14188) link error on ubuntu

2021-10-03 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423722#comment-17423722
 ] 

Kouhei Sutou edited comment on ARROW-14188 at 10/3/21, 8:53 PM:


[~icook] Could you confirm this vcpkg related problem? It seems that 
{{libarrow_bundled_dependencies.a}} isn't linked automatically.


was (Author: kou):
[~ianmcook] Could you confirm this vcpkg related problem? It seems that 
{{libarrow_bundled_dependencies.a}} isn't linked automatically.

> link error on ubuntu
> 
>
> Key: ARROW-14188
> URL: https://issues.apache.org/jira/browse/ARROW-14188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0
> Environment: Ubuntu 18.04, gcc-9, and vcpkg installation of arrow
>Reporter: Amir Ghamarian
>Priority: Major
> Attachments: completerr.txt, linkerror.txt
>
>
> I used vcpkg to install Arrow versions 4 and 5. Building my code that uses 
> Parquet fails with undefined-reference link errors.
> The same code works on OSX but fails on ubuntu.
> My cmake snippet is as follows:
>  
> {code:java}
> find_package(Arrow CONFIG REQUIRED)
> get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
> find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
> find_package(Thrift CONFIG REQUIRED)
> {code}
> and the linking: 
>  
> {code:java}
> target_link_libraries(vision_obj PUBLIC  thrift::thrift re2::re2 
> arrow_static parquet_static )
> {code}
>  
>  I get a lot of errors
>  





[jira] [Commented] (ARROW-14188) link error on ubuntu

2021-10-03 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423722#comment-17423722
 ] 

Kouhei Sutou commented on ARROW-14188:
--

[~ianmcook] Could you confirm this vcpkg related problem? It seems that 
{{libarrow_bundled_dependencies.a}} isn't linked automatically.

> link error on ubuntu
> 
>
> Key: ARROW-14188
> URL: https://issues.apache.org/jira/browse/ARROW-14188
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.0, 5.0.0
> Environment: Ubuntu 18.04, gcc-9, and vcpkg installation of arrow
>Reporter: Amir Ghamarian
>Priority: Major
> Attachments: completerr.txt, linkerror.txt
>
>
> I used vcpkg to install Arrow versions 4 and 5. Building my code that uses 
> Parquet fails with undefined-reference link errors.
> The same code works on OSX but fails on ubuntu.
> My cmake snippet is as follows:
>  
> {code:java}
> find_package(Arrow CONFIG REQUIRED)
> get_filename_component(MY_SEARCH_DIR ${Arrow_CONFIG} DIRECTORY)
> find_package(Parquet CONFIG REQUIRED PATHS ${MY_SEARCH_DIR})
> find_package(Thrift CONFIG REQUIRED)
> {code}
> and the linking: 
>  
> {code:java}
> target_link_libraries(vision_obj PUBLIC  thrift::thrift re2::re2 
> arrow_static parquet_static )
> {code}
>  
>  I get a lot of errors
>  





[jira] [Updated] (ARROW-14200) [R] strftime on a date should not use or be confused by timezones

2021-10-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-14200:
---
Labels: pull-request-available  (was: )

> [R] strftime on a date should not use or be confused by timezones
> -
>
> Key: ARROW-14200
> URL: https://issues.apache.org/jira/browse/ARROW-14200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
> assumed.
> What I think is going on below is the date 1992-01-01 is being interpreted as 
> 1992-01-01 00:00:00 in UTC, and then when {{strftime()}} is being called it's 
> displaying that timestamp as 1991-12-31 ... (since my system is set to a 
> timezone behind UTC), and then taking the year out of it. If I specify {{tz = 
> "utc"}} in the {{strftime()}}, I get the expected result (though that 
> shouldn't be necessary).
> Run in the US central timezone:
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(lubridate, warn.conflicts = FALSE)
> Table$create(
>   data.frame(
> x = as.Date("1992-01-01")
>   )
> ) %>% 
>   mutate(
> as_int_strftime = as.integer(strftime(x, "%Y")),
> strftime = strftime(x, "%Y"),
> as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
> strftime_utc = strftime(x, "%Y", tz = "UTC"),
> year = year(x)
>   ) %>%
>   collect()
> #>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
> #> 1 1992-01-01            1991     1991                1992         1992 1992
> {code}
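
The suspected mechanism in the description (a date promoted to midnight UTC and then formatted in a timezone behind UTC) can be reproduced outside Arrow. The following is a minimal Python sketch of that reasoning, not the Arrow implementation; the fixed UTC-6 offset is an assumption standing in for US Central standard time, and both function names are illustrative.

```python
from datetime import date, datetime, timedelta, timezone

# Fixed UTC-6 offset standing in for US Central (standard time). This is an
# assumption so the sketch does not depend on a timezone database.
CENTRAL = timezone(timedelta(hours=-6), "CST")


def year_via_timestamp(d: date, tz: timezone) -> str:
    """Promote the date to midnight UTC, then format that instant in `tz`."""
    midnight_utc = datetime(d.year, d.month, d.day, tzinfo=timezone.utc)
    return midnight_utc.astimezone(tz).strftime("%Y")


def year_from_date(d: date) -> str:
    """Format the date directly; no timezone is involved."""
    return d.strftime("%Y")


d = date(1992, 1, 1)
print(year_via_timestamp(d, CENTRAL))       # 1991
print(year_via_timestamp(d, timezone.utc))  # 1992
print(year_from_date(d))                    # 1992
```

Only the path that goes through a timestamp and a local timezone yields 1991, which matches the {{strftime}} column in the output above.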





[jira] [Assigned] (ARROW-13588) [R] Empty character attributes not stored

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-13588:
---

Assignee: Neal Richardson

> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Assignee: Neal Richardson
>Priority: Critical
>  Labels: attributes, feather
> Fix For: 6.0.0
>
>
> Date-times in the POSIXct format have a 'tzone' attribute that by default is 
> set to "", an empty character vector (not NULL) when created.
> This however is not stored in the Arrow feather file. When the file is read 
> back, the original and restored dataframes are not identical as per the below 
> reprex.
> I am thinking that this should not be the intention? My workaround at the 
> moment is making a check when reading back to write the empty string if the 
> tzone attribute does not exist.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
> arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  
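
The workaround mentioned in the report (writing the empty string back when the {{tzone}} attribute is missing after reading) can be sketched as follows. This is a Python illustration only: {{write_attrs}} is a hypothetical stand-in that mimics the reported symptom by dropping empty-string attributes, which is an assumption about the cause, and {{read_attrs_with_defaults}} is an illustrative name for the workaround.

```python
def write_attrs(attrs):
    """Hypothetical writer mimicking the reported symptom:
    attributes whose value is the empty string are not stored."""
    return {key: value for key, value in attrs.items() if value != ""}


def read_attrs_with_defaults(stored, defaults):
    """Workaround: restore any expected attribute that is absent after reading."""
    restored = dict(stored)
    for key, value in defaults.items():
        restored.setdefault(key, value)
    return restored


original = {"class": ["POSIXct", "POSIXt"], "tzone": ""}
stored = write_attrs(original)
print("tzone" in stored)  # False

restored = read_attrs_with_defaults(stored, {"tzone": ""})
print(restored == original)  # True
```

A non-empty {{tzone}} passes through both functions unchanged, consistent with the note that the attribute is stored correctly when it is not empty.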





[jira] [Updated] (ARROW-14085) [R] Expose null placement option through sort bindings

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14085:

Fix Version/s: (was: 6.0.0)
   7.0.0

> [R] Expose null placement option through sort bindings
> --
>
> Key: ARROW-14085
> URL: https://issues.apache.org/jira/browse/ARROW-14085
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>  Labels: kernel
> Fix For: 7.0.0
>
>
> ARROW-12063 added a null placement option to the sort kernels and to 
> {{OrderBySinkNode}} in the C++ library. Expose this through the R bindings.





[jira] [Updated] (ARROW-14071) [R] Try to arrow_eval user-defined functions

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14071:

Fix Version/s: (was: 6.0.0)
   7.0.0

> [R] Try to arrow_eval user-defined functions
> 
>
> Key: ARROW-14071
> URL: https://issues.apache.org/jira/browse/ARROW-14071
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
> Fix For: 7.0.0
>
>
> The first test passes but the second one fails, even though they're 
> equivalent. The user's function isn't being evaluated in the nse_funcs 
> environment.
> {code}
>   expect_dplyr_equal(
> input %>%
>   select(-fct) %>%
>   filter(nchar(padded_strings) < 10) %>%
>   collect(),
> tbl
>   )
>   isShortString <- function(x) nchar(x) < 10
>   expect_dplyr_equal(
> input %>%
>   select(-fct) %>%
>   filter(isShortString(padded_strings)) %>%
>   collect(),
> tbl
>   )
> {code}
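
The difference between the two tests comes down to which environment the name {{nchar}} is looked up in. A rough Python analogue, offered as a sketch rather than as how the R package works: a function resolves free names in the globals it was defined with, and rebuilding it with different globals changes what those names mean. The {{Expr}} class, the {{NSE_FUNCS}} mapping and the {{rebind}} helper are all hypothetical names for this illustration.

```python
import types


class Expr:
    """Stand-in for a lazy expression over a column (hypothetical)."""

    def __init__(self, text):
        self.text = text

    def __lt__(self, other):
        return Expr(f"({self.text} < {other!r})")


def nchar(x):
    """'Base' implementation: only understands real strings."""
    if not isinstance(x, str):
        raise TypeError("nchar() expects a string")
    return len(x)


def lazy_nchar(x):
    """Binding that builds an expression instead of computing a value."""
    return Expr(f"utf8_length({x.text})")


# Names that should resolve to expression-building bindings
NSE_FUNCS = {"nchar": lazy_nchar}


def is_short_string(x):
    # `nchar` is a free name, looked up in this function's globals
    return nchar(x) < 10


def rebind(fn, bindings):
    """Rebuild `fn` so its free names are looked up in `bindings` first."""
    env = {**fn.__globals__, **bindings}
    return types.FunctionType(
        fn.__code__, env, fn.__name__, fn.__defaults__, fn.__closure__
    )


field = Expr("padded_strings")
rebound = rebind(is_short_string, NSE_FUNCS)
print(rebound(field).text)  # (utf8_length(padded_strings) < 10)
```

Calling {{is_short_string(field)}} directly raises a {{TypeError}}, which mirrors the failing second test: the user's function still sees the base {{nchar}}.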





[jira] [Closed] (ARROW-13779) [R] Disallow expressions that depend on order

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson closed ARROW-13779.
---
Resolution: Won't Fix

Nothing to do right now.

> [R] Disallow expressions that depend on order
> -
>
> Key: ARROW-13779
> URL: https://issues.apache.org/jira/browse/ARROW-13779
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>
> Because in the current ExecPlan, sorting is only done at the end (in a sink 
> node), we can't sort the data and then do an operation that depends on 
> sorting (like cumsum) without first calling compute(). (This is probably not 
> yet a concern because we don't seem to have any kernels that require order.)
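
To make the ordering concern concrete, here is a small Python sketch (plain lists, not an ExecPlan) comparing a cumulative sum computed on sorted input with one computed first and sorted only at the end. The variable names are illustrative.

```python
from itertools import accumulate

values = [3, 1, 2]

# What the user asked for: sort first, then take the cumulative sum
sort_then_cumsum = list(accumulate(sorted(values)))

# What happens if sorting is deferred to the very end of the pipeline
cumsum_then_sort = sorted(accumulate(values))

print(sort_then_cumsum)  # [1, 3, 6]
print(cumsum_then_sort)  # [3, 4, 6]
```

The two results differ, so an order-dependent operation cannot simply be placed before a sort that only runs in the sink.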





[jira] [Updated] (ARROW-13865) [C++][R] Writing moderate-size parquet files of nested dataframes from R slows down/process hangs

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13865:

Fix Version/s: 7.0.0

> [C++][R] Writing moderate-size parquet files of nested dataframes from R 
> slows down/process hangs
> -
>
> Key: ARROW-13865
> URL: https://issues.apache.org/jira/browse/ARROW-13865
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 5.0.0
>Reporter: John Sheffield
>Priority: Major
> Fix For: 7.0.0
>
> Attachments: Screen Shot 2021-09-02 at 11.21.37 AM.png
>
>
> I observed a significant slowdown in parquet writes (and ultimately the 
> process just hangs for minutes without completion) while writing 
> moderate-size nested dataframes from R. I have replicated the issue on MacOS 
> and Ubuntu so far.
>  
> An example:
> ```
> testdf <- dplyr::tibble(
>  id = uuid::UUIDgenerate(n = 5000),
>  l1 = as.list(lapply(1:5000, (function( x ) runif(1000)))),
>  l2 = as.list(lapply(1:5000, (function( x ) rnorm(1000))))
>  )
> testdf_long <- tidyr::unnest(testdf, cols = c(l1, l2))
>  
>  # This works
> arrow::write_parquet(testdf_long, "testdf_long.parquet")
>  # This write does not complete within a few minutes on my testing but throws 
> no errors
>  arrow::write_parquet(testdf, "testdf.parquet")
> ```
> I can't guess at why this is true, but the slowdown is closely tied to row 
> counts:
> ```
>  # screenshot attached; 12ms, 56ms, and 680ms respectively.
> microbenchmark::microbenchmark(
>  arrow::write_parquet(testdf[1, ], "testdf.parquet"),
>  arrow::write_parquet(testdf[1:10, ], "testdf.parquet"),
>  arrow::write_parquet(testdf[1:100, ], "testdf.parquet"),
>  times = 5
>  )
> ```
> I'm using the CRAN version 5.0.0 in both cases. The sessionInfo() for Ubuntu 
> is
>  R version 4.0.5 (2021-03-31)
>  Platform: x86_64-pc-linux-gnu (64-bit)
>  Running under: Ubuntu 20.04.3 LTS
> Matrix products: default
>  BLAS/LAPACK: 
> /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
> locale:
>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 
> LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C 
>  [7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C 
> LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> attached base packages:
>  [1] stats graphics grDevices utils datasets methods base
> other attached packages:
>  [1] arrow_5.0.0
> And sessionInfo for MacOS is:
>  R version 4.0.1 (2020-06-06) Platform: x86_64-apple-darwin17.0 (64-bit) 
> Running under: macOS Catalina 10.15.7 Matrix products: default BLAS: 
> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>  LAPACK: 
> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
> locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 
> attached base packages: [1] stats graphics grDevices utils datasets methods 
> base other attached packages: [1] arrow_5.0.0





[jira] [Updated] (ARROW-14020) [R] Writing datafames with list columns is slow and scales poorly with nesting level

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14020:

Fix Version/s: 6.0.0

> [R] Writing datafames with list columns is slow and scales poorly with 
> nesting level
> 
>
> Key: ARROW-14020
> URL: https://issues.apache.org/jira/browse/ARROW-14020
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Windows 10 x64
>Reporter: Miles McBain
>Assignee: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Writing data frames that contain list columns seems much slower than expected:
> ``` r
>  library(tidyverse)
>  #> Warning: package 'tidyverse' was built under R version 4.1.1
>  #> Warning: package 'tibble' was built under R version 4.1.1
>  #> Warning: package 'readr' was built under R version 4.1.1
>  library(arrow)
>  #> Warning: package 'arrow' was built under R version 4.1.1
>  #>
>  #> Attaching package: 'arrow'
>  #> The following object is masked from 'package:utils':
>  #>
>  #> timestamp
>  dummy <- tibble(
>  points = rep(list(seq(6)), 2e6),
>  index = seq(2e6)
>  )
>  # very slooow
>  system.time(write_parquet(dummy, "dummy.parquet"))
>  #> user system elapsed
>  #> 55.64 0.11 55.98
> dummy_txt <- mutate(dummy, points = map_chr(points, deparse))
>  # orders of magnitude faster
>  system.time(write_parquet(dummy_txt, "dummytext.parquet"))
>  #> user system elapsed
>  #> 0.24 0.02 0.25
>  ```
> Created on 2021-09-17 by the [reprex package|https://reprex.tidyverse.org] 
> (v2.0.0)
> 
> Session info
> ``` r
>  sessioninfo::session_info()
>  #> - Session info 
> ---
>  #> setting value
>  #> version R version 4.1.0 (2021-05-18)
>  #> os Windows 10 x64
>  #> system x86_64, mingw32
>  #> ui RTerm
>  #> language (EN)
>  #> collate English_Australia.1252
>  #> ctype English_Australia.1252
>  #> tz Australia/Brisbane
>  #> date 2021-09-17
>  #>
>  #> - Packages 
> ---
>  #> package * version date lib source
>  #> arrow * 5.0.0.2 2021-09-05 [1] CRAN (R 4.1.1)
>  #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
>  #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
>  #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
>  #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
>  #> broom 0.7.7 2021-06-13 [1] CRAN (R 4.1.0)
>  #> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
>  #> cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
>  #> colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
>  #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
>  #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
>  #> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
>  #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
>  #> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
>  #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
>  #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
>  #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
>  #> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
>  #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
>  #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
>  #> ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
>  #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
>  #> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
>  #> haven 2.4.1 2021-04-23 [1] CRAN (R 4.1.0)
>  #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
>  #> hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
>  #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
>  #> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
>  #> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
>  #> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
>  #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
>  #> lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
>  #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
>  #> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
>  #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
>  #> pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0)
>  #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
>  #> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
>  #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
>  #> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
>  #> readr * 2.0.1 2021-08-10 [1] CRAN (R 4.1.1)
>  #> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
>  #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
>  #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
>  #> rmarkdown 2.9 2021-06-15 [1] CRAN (R 4.1.0)
>  #> rvest 1.0.1 2021-07-26 [1] CRAN (R 4.1.0)
>  #> scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
>  #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
>  #> stringi 1.7.4 2021-08-25 [1] CRAN (R 4.1.1)
>  #> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
>  #> 

[jira] [Updated] (ARROW-14025) [R][C++] PreBuffer is not enabled when scanning parquet via exec nodes

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14025:

Fix Version/s: 6.0.0

> [R][C++] PreBuffer is not enabled when scanning parquet via exec nodes
> --
>
> Key: ARROW-14025
> URL: https://issues.apache.org/jira/browse/ARROW-14025
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Weston Pace
>Priority: Major
> Fix For: 6.0.0
>
>
> In ExecNode_Scan a ScanOptions object is built up.  If we are reading parquet 
> we should enable pre-buffering.  This is done by creating a 
> ParquetFragmentScanOptions object and enabling pre-buffering.
> Alternatively, we could just default pre-buffering to true for asynchronous 
> scans of parquet data.





[jira] [Updated] (ARROW-13866) [R] Implement Options for all compute kernels available via list_compute_functions

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13866:

Fix Version/s: 6.0.0

> [R] Implement Options for all compute kernels available via 
> list_compute_functions
> --
>
> Key: ARROW-13866
> URL: https://issues.apache.org/jira/browse/ARROW-13866
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
> Fix For: 6.0.0
>
>
> Not all of the compute kernels available via {{list_compute_functions()}} are 
> actually available to use in R, as they haven't been hooked up to the 
> relevant Options class in {{r/src/compute.cpp}}. 
> We should:
>  # Implement all remaining options classes
>  # Go through all the kernels listed by {{list_compute_functions()}} and 
> check that they have either no options classes to implement or that they have 
> been hooked up to the appropriate options class
>  





[jira] [Updated] (ARROW-13901) [R] Implement IndexOptions

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13901:

Fix Version/s: 6.0.0

> [R] Implement IndexOptions
> --
>
> Key: ARROW-13901
> URL: https://issues.apache.org/jira/browse/ARROW-13901
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: R
>Reporter: Nicola Crane
>Assignee: Nicola Crane
>Priority: Major
> Fix For: 6.0.0
>
>






[jira] [Updated] (ARROW-14028) [R] Cast of NaN to integer should return NA_integer_

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14028:

Fix Version/s: 7.0.0

> [R] Cast of NaN to integer should return NA_integer_
> 
>
> Key: ARROW-14028
> URL: https://issues.apache.org/jira/browse/ARROW-14028
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
> Fix For: 7.0.0
>
>
> Casting double {{NaN}} to integer returns a sentinel value:
> {code:r}
> call_function("cast", Scalar$create(NaN), options = list(to_type = int32(), 
> allow_float_truncate = TRUE))
> #> Scalar
> #> -2147483648
> call_function("cast", Scalar$create(NaN), options = list(to_type = int64(), 
> allow_float_truncate = TRUE))
> #> Scalar
> #> -9223372036854775808{code}
> It would be nice if this would instead return {{NA_integer_}}.
> N.B. for some reason this doesn't reproduce in dplyr unless you round-trip it 
> back to double:
> {code:r}
> > Table$create(x = NaN) %>% transmute(as.double(as.integer(x))) %>% pull(1)
> #> [1] -2147483648{code}
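
The requested behaviour can be sketched in a few lines of Python. This only illustrates the semantics asked for (NaN becomes a missing value instead of a sentinel); it is not the Arrow cast kernel, the function name {{cast_double_to_int32}} is hypothetical, and {{None}} stands in for {{NA_integer_}}.

```python
import math
from typing import Optional

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def cast_double_to_int32(x: float) -> Optional[int]:
    """Truncating cast that maps NaN to a missing value (None)."""
    if math.isnan(x):
        return None  # missing, rather than the -2147483648 sentinel
    result = math.trunc(x)  # raises OverflowError for +/- infinity
    if not INT32_MIN <= result <= INT32_MAX:
        raise OverflowError(f"{x!r} does not fit in int32")
    return result


print(cast_double_to_int32(float("nan")))  # None
print(cast_double_to_int32(2.9))           # 2
print(cast_double_to_int32(-2.9))          # -2
```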





[jira] [Updated] (ARROW-14138) [R] update metadata when casting a record batch column

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14138:

Fix Version/s: 7

> [R] update metadata when casting a record batch column
> --
>
> Key: ARROW-14138
> URL: https://issues.apache.org/jira/browse/ARROW-14138
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Romain Francois
>Priority: Minor
> Fix For: 7
>
>
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> raws <- structure(list(
>   as.raw(c(0x70, 0x65, 0x72, 0x73, 0x6f, 0x6e))
> ), class = c("arrow_binary", "vctrs_vctr", "list"))
> batch <- record_batch(b = raws)
> batch$metadata$r
> #>  'arrow_r_metadata' chr 
> "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"|
>  __truncated__
> #> List of 1
> #>  $ columns:List of 1
> #>   ..$ b:List of 2
> #>   .. ..$ attributes:List of 1
> #>   .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list"
> #>   .. ..$ columns   : NULL
> # when casting `b` to a string column, the metadata is kept
> batch$b <- batch$b$cast(utf8())
> batch$metadata$r
> #>  'arrow_r_metadata' chr 
> "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"|
>  __truncated__
> #> List of 1
> #>  $ columns:List of 1
> #>   ..$ b:List of 2
> #>   .. ..$ attributes:List of 1
> #>   .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list"
> #>   .. ..$ columns   : NULL
> # but it should not have
> batch2 <- record_batch(b = "string")
> batch2$metadata$r
> #> NULL
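
The expected behaviour (per-column metadata dropped once the column has been cast) can be sketched with plain dictionaries. This is a Python illustration under stated assumptions, not the R package's data structures: the dictionary layout and the function name {{cast_column_to_utf8}} are hypothetical.

```python
import copy


def cast_column_to_utf8(batch, name):
    """Return a copy of `batch` with column `name` decoded to strings
    and its now-stale per-column metadata removed."""
    out = copy.deepcopy(batch)
    column = out["columns"][name]
    column["values"] = [value.decode("utf-8") for value in column["values"]]
    column["type"] = "utf8"

    metadata = out.get("r_metadata")
    if metadata is not None:
        # The stored class attributes described the old binary column
        metadata["columns"].pop(name, None)
        if not metadata["columns"]:
            out["r_metadata"] = None  # nothing left to describe
    return out


batch = {
    "columns": {
        "b": {"type": "binary", "values": [bytes([0x70, 0x65, 0x72, 0x73, 0x6F, 0x6E])]},
    },
    "r_metadata": {
        "columns": {
            "b": {"attributes": {"class": ["arrow_binary", "vctrs_vctr", "list"]}},
        },
    },
}

casted = cast_column_to_utf8(batch, "b")
print(casted["columns"]["b"]["values"])  # ['person']
print(casted["r_metadata"])              # None
```

After the cast the metadata matches what a freshly created string column would carry, which is the outcome the report asks for.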





[jira] [Assigned] (ARROW-14200) [R] strftime on a date should not use or be confused by timezones

2021-10-03 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane reassigned ARROW-14200:
--

Assignee: Jonathan Keane

> [R] strftime on a date should not use or be confused by timezones
> -
>
> Key: ARROW-14200
> URL: https://issues.apache.org/jira/browse/ARROW-14200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Assignee: Jonathan Keane
>Priority: Major
> Fix For: 7.0.0
>
>
> When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
> assumed.
> What I think is going on below is the date 1992-01-01 is being interpreted as 
> 1992-01-01 00:00:00 in UTC, and then when {{strftime()}} is being called it's 
> displaying that timestamp as 1991-12-31 ... (since my system is set to a 
> timezone behind UTC), and then taking the year out of it. If I specify {{tz = 
> "utc"}} in the {{strftime()}}, I get the expected result (though that 
> shouldn't be necessary).
> Run in the US central timezone:
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(lubridate, warn.conflicts = FALSE)
> Table$create(
>   data.frame(
> x = as.Date("1992-01-01")
>   )
> ) %>% 
>   mutate(
> as_int_strftime = as.integer(strftime(x, "%Y")),
> strftime = strftime(x, "%Y"),
> as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
> strftime_utc = strftime(x, "%Y", tz = "UTC"),
> year = year(x)
>   ) %>%
>   collect()
> #>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
> #> 1 1992-01-01            1991     1991                1992         1992 1992
> {code}
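
The suspected mechanism described above can be reproduced outside of Arrow. The following is only a minimal Python sketch of that hypothesis, not Arrow's implementation: the helper names are hypothetical, and a fixed UTC-6 offset is assumed as a stand-in for US Central (standard time).

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: a fixed UTC-6 offset stands in for US Central (standard time).
CENTRAL = timezone(timedelta(hours=-6))


def year_via_utc_midnight(d: date, display_tz: timezone) -> str:
    """Treat the date as midnight UTC, then format it in display_tz."""
    as_timestamp = datetime(d.year, d.month, d.day, tzinfo=timezone.utc)
    return as_timestamp.astimezone(display_tz).strftime("%Y")


def year_from_date(d: date) -> str:
    """Format the date directly; no timezone is involved."""
    return d.strftime("%Y")


d = date(1992, 1, 1)
print(year_via_utc_midnight(d, CENTRAL))       # 1991
print(year_via_utc_midnight(d, timezone.utc))  # 1992
print(year_from_date(d))                       # 1992
```

Only the first call goes through a timestamp rendered in a zone behind UTC, which is where the year shifts back to 1991.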





[jira] [Updated] (ARROW-14138) [R] update metadata when casting a record batch column

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14138:

Fix Version/s: (was: 7)
   7.0.0

> [R] update metadata when casting a record batch column
> --
>
> Key: ARROW-14138
> URL: https://issues.apache.org/jira/browse/ARROW-14138
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Romain Francois
>Priority: Minor
> Fix For: 7.0.0
>
>
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> raws <- structure(list(
>   as.raw(c(0x70, 0x65, 0x72, 0x73, 0x6f, 0x6e))
> ), class = c("arrow_binary", "vctrs_vctr", "list"))
> batch <- record_batch(b = raws)
> batch$metadata$r
> #>  'arrow_r_metadata' chr "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"| __truncated__
> #> List of 1
> #>  $ columns:List of 1
> #>   ..$ b:List of 2
> #>   .. ..$ attributes:List of 1
> #>   .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list"
> #>   .. ..$ columns   : NULL
> # when casting `b` to a string column, the metadata is kept
> batch$b <- batch$b$cast(utf8())
> batch$metadata$r
> #>  'arrow_r_metadata' chr "A\n3\n262147\n197888\n5\nUTF-8\n531\n1\n531\n1\n531\n2\n531\n1\n16\n3\n262153\n12\narrow_binary\n262153\n10\nvc"| __truncated__
> #> List of 1
> #>  $ columns:List of 1
> #>   ..$ b:List of 2
> #>   .. ..$ attributes:List of 1
> #>   .. .. ..$ class: chr [1:3] "arrow_binary" "vctrs_vctr" "list"
> #>   .. ..$ columns   : NULL
> # but it should not have
> batch2 <- record_batch(b = "string")
> batch2$metadata$r
> #> NULL





[jira] [Updated] (ARROW-14169) [R] altrep for factors

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14169:

Fix Version/s: 7.0.0

> [R] altrep for factors
> --
>
> Key: ARROW-14169
> URL: https://issues.apache.org/jira/browse/ARROW-14169
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Romain Francois
>Priority: Major
> Fix For: 7.0.0
>
>






[jira] [Updated] (ARROW-14200) [R] strftime on a date should not use or be confused by timezones

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14200:

Fix Version/s: 7.0.0

> [R] strftime on a date should not use or be confused by timezones
> -
>
> Key: ARROW-14200
> URL: https://issues.apache.org/jira/browse/ARROW-14200
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
> Fix For: 7.0.0
>
>
> When the input to {{strftime}} is a date, timezones shouldn't be necessary or 
> assumed.
> What I think is going on below is that the date 1992-01-01 is being 
> interpreted as 1992-01-01 00:00:00 in UTC; when {{strftime()}} is called, it 
> displays that timestamp as 1991-12-31 ... (since my system is set to a 
> timezone behind UTC) and then takes the year out of it. If I specify 
> {{tz = "UTC"}} in {{strftime()}}, I get the expected result (though that 
> shouldn't be necessary).
> Run in the US Central timezone:
> {code}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> library(lubridate, warn.conflicts = FALSE)
> Table$create(
>   data.frame(
>     x = as.Date("1992-01-01")
>   )
> ) %>%
>   mutate(
>     as_int_strftime = as.integer(strftime(x, "%Y")),
>     strftime = strftime(x, "%Y"),
>     as_int_strftime_utc = as.integer(strftime(x, "%Y", tz = "UTC")),
>     strftime_utc = strftime(x, "%Y", tz = "UTC"),
>     year = year(x)
>   ) %>%
>   collect()
> #>            x as_int_strftime strftime as_int_strftime_utc strftime_utc year
> #> 1 1992-01-01            1991     1991                1992         1992 1992
> {code}





[jira] [Updated] (ARROW-14199) [R] bindings for format where possible

2021-10-03 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-14199:

Fix Version/s: 7.0.0

> [R] bindings for format where possible
> --
>
> Key: ARROW-14199
> URL: https://issues.apache.org/jira/browse/ARROW-14199
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
> Fix For: 7.0.0
>
>
> Now that we have {{strftime}}, we should also be able to make bindings for 
> {{format()}}. This might be complicated: we might need to punt on a bunch of 
> types that {{format()}} accepts but that arrow doesn't (yet) support 
> formatting; that's OK. 
> Some of those might still be wrappable with a handful of kernels stacked 
> together: for example, {{format(float)}} might be round + cast to character.
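
As a rough illustration of the "round + cast to character" idea only, here is a minimal Python sketch. The function name and the `digits` parameter are hypothetical, and it does not call any Arrow kernels; it just shows the two stacked steps.

```python
def format_float(values, digits=2):
    """Round each value, then cast the rounded value to a string."""
    return [str(round(v, digits)) for v in values]


print(format_float([3.14159, 2.5, 10.0]))  # ['3.14', '2.5', '10.0']
```

Note that a plain round-then-cast does not pad results to a common number of decimals or a common width, so any such behaviour expected from {{format()}} would need extra steps beyond these two.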





[jira] [Updated] (ARROW-14209) [R] Allow multiple arguments to n_distinct()

2021-10-03 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-14209:
-
Description: 
ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function in 
the dplyr verb {{summarise()}} but only with a single argument. Add support for 
multiple arguments to {{n_distinct()}}. This should return the number of unique 
combinations of values in the specified columns/expressions.

See the comment about this here: 
[https://github.com/apache/arrow/pull/11257#discussion_r720873549]

  was:
ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} in function 
in the dplyr verb {{summarise()}} but only with a single argument. Add support 
for multiple arguments to {{n_distinct()}}. This should return the number of 
unique combinations of values in the specified columns/expressions.

See the comment about this here: 
https://github.com/apache/arrow/pull/11257#discussion_r720873549


> [R] Allow multiple arguments to n_distinct()
> 
>
> Key: ARROW-14209
> URL: https://issues.apache.org/jira/browse/ARROW-14209
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
> Fix For: 7.0.0
>
>
> ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function 
> in the dplyr verb {{summarise()}} but only with a single argument. Add 
> support for multiple arguments to {{n_distinct()}}. This should return the 
> number of unique combinations of values in the specified columns/expressions.
> See the comment about this here: 
> [https://github.com/apache/arrow/pull/11257#discussion_r720873549]





[jira] [Updated] (ARROW-14209) [R] Allow multiple arguments to n_distinct()

2021-10-03 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-14209:
-
Description: 
ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function in 
the dplyr verb {{summarise()}} but only with a single argument. Add support for 
multiple arguments to {{n_distinct()}}. This should return the number of unique 
combinations of values in the specified columns/expressions.

See the comment about this here: 
https://github.com/apache/arrow/pull/11257#discussion_r720873549

  was:ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} 
function in the dplyr verb {{summarise()}} but only with a single argument. Add 
support for multiple arguments to {{n_distinct()}}. This should return the 
number of unique combinations of values in the specified columns/expressions.


> [R] Allow multiple arguments to n_distinct()
> 
>
> Key: ARROW-14209
> URL: https://issues.apache.org/jira/browse/ARROW-14209
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
> Fix For: 7.0.0
>
>
> ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function 
> in the dplyr verb {{summarise()}} but only with a single argument. Add 
> support for multiple arguments to {{n_distinct()}}. This should return the 
> number of unique combinations of values in the specified columns/expressions.
> See the comment about this here: 
> https://github.com/apache/arrow/pull/11257#discussion_r720873549





[jira] [Created] (ARROW-14209) [R] Allow multiple arguments to n_distinct()

2021-10-03 Thread Ian Cook (Jira)
Ian Cook created ARROW-14209:


 Summary: [R] Allow multiple arguments to n_distinct()
 Key: ARROW-14209
 URL: https://issues.apache.org/jira/browse/ARROW-14209
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Ian Cook
 Fix For: 7.0.0


ARROW-13620 and ARROW-14036 added support for the {{n_distinct()}} function in 
the dplyr verb {{summarise()}} but only with a single argument. Add support for 
multiple arguments to {{n_distinct()}}. This should return the number of unique 
combinations of values in the specified columns/expressions.
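
To make the requested semantics concrete, here is a minimal Python sketch of counting unique combinations across several columns. It is an illustration of the expected result only, not the Arrow or dplyr implementation; the function below is a hypothetical stand-in and the sample data is made up.

```python
def n_distinct(*columns):
    """Count unique row-wise combinations across equally long columns."""
    if not columns:
        raise ValueError("n_distinct() needs at least one column")
    if len({len(c) for c in columns}) != 1:
        raise ValueError("all columns must have the same length")
    # zip(*columns) yields one tuple per row; a set keeps each tuple once.
    return len(set(zip(*columns)))


x = [1, 1, 2, 2, 2]
y = ["a", "b", "a", "a", "b"]
print(n_distinct(x))     # 2
print(n_distinct(x, y))  # 4
```

With a single argument this reduces to the existing behaviour (the number of distinct values in one column); with two it counts distinct (x, y) pairs.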





[jira] [Commented] (ARROW-14208) [C++] Build errors with Visual Studio 2019

2021-10-03 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423699#comment-17423699
 ] 

Ian Cook commented on ARROW-14208:
--

[~apitrou] could you take a look at this? Thank you

> [C++] Build errors with Visual Studio 2019
> --
>
> Key: ARROW-14208
> URL: https://issues.apache.org/jira/browse/ARROW-14208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>
> On September 10 the *test-build-vcpkg-win* nightly Crossbow job began to fail.
> This job uses the current {{windows-2019}} GHA runner image, so it often 
> catches build errors associated with Visual Studio/MSVC updates:
> The logs show these error messages (simplified for readability):
> {code:java}
> compute/util_internal.h(26,20): warning C4003: not enough arguments for 
> function-like macro invocation 'RtlZeroMemory'
> compute/util_internal.h(26,20): error C2146: syntax error: missing ')' before 
> identifier 'buffer'
> compute/util_internal.h(26,20): error C2065: 'buffer': undeclared identifier
> compute/util_internal.h(26,20): error C2182: 'memset': illegal use of type 
> 'void'
> compute/util_internal.h(26,20): error C7525: inline variables require at 
> least '/std:c++17'
> compute/util_internal.h(26,20): error C2059: syntax error: 'constant'
> compute/util_internal.h(26,20): error C2059: syntax error: ')'
> compute/util_internal.h(26,47): error C2143: syntax error: missing ';' before 
> '{'
> compute/util_internal.h(26,47): error C2447: '{': missing function header 
> (old-style formal list?){code}
> Here is a link to the logs when they first began to fail on September 10: 
> [https://github.com/ursacomputing/crossbow/runs/3564248552#step:4:2985]
>  The error messages have remained the same since then.
> Here is a link to the logs from the previous day (September 9) before they 
> began to fail:
>  [https://github.com/ursacomputing/crossbow/runs/3552742330]
>  
> Possible causes include:
> Updates to MSVC that were applied to the {{windows-2019}} GHA runner image 
> on September 9:
>  [https://github.com/actions/virtual-environments/pull/3452]
> One of these commits on September 9:
>  
> [https://github.com/apache/arrow/search?o=desc=1=committer-date%3A2021-09-09=author-date=commits]
> Changes to one of the vcpkg-installed Arrow dependencies on September 9 (but 
> I don't see any such changes in the {{microsoft/vcpkg}} repo commit history).





[jira] [Updated] (ARROW-14208) [C++] Build errors with Visual Studio 2019

2021-10-03 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-14208:
-
Description: 
On September 10 the *test-build-vcpkg-win* nightly Crossbow job began to fail.

This job uses the current {{windows-2019}} GHA runner image, so it often 
catches build errors associated with Visual Studio/MSVC updates.

The logs show these error messages (simplified for readability):
{code:java}
compute/util_internal.h(26,20): warning C4003: not enough arguments for 
function-like macro invocation 'RtlZeroMemory'
compute/util_internal.h(26,20): error C2146: syntax error: missing ')' before 
identifier 'buffer'
compute/util_internal.h(26,20): error C2065: 'buffer': undeclared identifier
compute/util_internal.h(26,20): error C2182: 'memset': illegal use of type 
'void'
compute/util_internal.h(26,20): error C7525: inline variables require at least 
'/std:c++17'
compute/util_internal.h(26,20): error C2059: syntax error: 'constant'
compute/util_internal.h(26,20): error C2059: syntax error: ')'
compute/util_internal.h(26,47): error C2143: syntax error: missing ';' before 
'{'
compute/util_internal.h(26,47): error C2447: '{': missing function header 
(old-style formal list?){code}
Here is a link to the logs when they first began to fail on September 10: 
[https://github.com/ursacomputing/crossbow/runs/3564248552#step:4:2985]
 The error messages have remained the same since then.

Here is a link to the logs from the previous day (September 9) before they 
began to fail:
 [https://github.com/ursacomputing/crossbow/runs/3552742330]

 

Possible causes include:

Updates to MSVC that were applied to the {{windows-2019}} GHA runner image on 
September 9:
 [https://github.com/actions/virtual-environments/pull/3452]

One of these commits on September 9:
 
[https://github.com/apache/arrow/search?o=desc=1=committer-date%3A2021-09-09=author-date=commits]

Changes to one of the vcpkg-installed Arrow dependencies on September 9 (but I 
don't see any such changes in the {{microsoft/vcpkg}} repo commit history).

  was:
On September 10 the *test-build-vcpkg-win* nightly Crossbow job began to fail.

This job uses the current {{windows-2019}} GHA runner image, so it often 
catches build errors associated with Visual Studio/MSVC updates:

The logs show these error messages (simplified for readability):
{code:java}
compute/util_internal.h(26,20): warning C4003: not enough arguments for 
function-like macro invocation 'RtlZeroMemory'
compute/util_internal.h(26,20): error C2146: syntax error: missing ')' before 
identifier 'buffer'
compute/util_internal.h(26,20): error C2065: 'buffer': undeclared identifier
compute/util_internal.h(26,20): error C2182: 'memset': illegal use of type 
'void'
compute/util_internal.h(26,20): error C7525: inline variables require at least 
'/std:c++17'
compute/util_internal.h(26,20): error C2059: syntax error: 'constant'
compute/util_internal.h(26,20): error C2059: syntax error: ')'
compute/util_internal.h(26,47): error C2143: syntax error: missing ';' before 
'{'
compute/util_internal.h(26,47): error C2447: '{': missing function header 
(old-style formal list?){code}
Here is a link to the logs when they first began to fail on September 10: 
[https://github.com/ursacomputing/crossbow/runs/3564248552#step:4:2985]
 The error messages have remained the same since then.

Here is a link to the logs from the previous day (September 9) before they 
began to fail:
 [https://github.com/ursacomputing/crossbow/runs/3552742330]

 

Possible causes include:

Updates to MSVC that were applied to the {{windows-2019}} GHA runner image on 
September 9:
 [https://github.com/actions/virtual-environments/pull/3452]

One of these commits on September 9:
 
[https://github.com/apache/arrow/search?o=desc=1=committer-date%3A2021-09-09=author-date=commits]

Changes to one of the vcpkg-installed Arrow dependencies on September 9 (but I 
don't see any such changes in the {{microsoft/vcpkg}} repo commit history).


> [C++] Build errors with Visual Studio 2019
> --
>
> Key: ARROW-14208
> URL: https://issues.apache.org/jira/browse/ARROW-14208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ian Cook
>Priority: Major
>
> On September 10 the *test-build-vcpkg-win* nightly Crossbow job began to fail.
> This job uses the current {{windows-2019}} GHA runner image, so it often 
> catches build errors associated with Visual Studio/MSVC updates.
> The logs show these error messages (simplified for readability):
> {code:java}
> compute/util_internal.h(26,20): warning C4003: not enough arguments for 
> function-like macro invocation 'RtlZeroMemory'
> compute/util_internal.h(26,20): error C2146: syntax error: missing ')' before 
> identifier 'buffer'
> compute/util_internal.h(26,20): error 

[jira] [Created] (ARROW-14208) [C++] Build errors with Visual Studio 2019

2021-10-03 Thread Ian Cook (Jira)
Ian Cook created ARROW-14208:


 Summary: [C++] Build errors with Visual Studio 2019
 Key: ARROW-14208
 URL: https://issues.apache.org/jira/browse/ARROW-14208
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Ian Cook


On September 10 the *test-build-vcpkg-win* nightly Crossbow job began to fail.

This job uses the current {{windows-2019}} GHA runner image, so it often 
catches build errors associated with Visual Studio/MSVC updates:

The logs show these error messages (simplified for readability):
{code:java}
compute/util_internal.h(26,20): warning C4003: not enough arguments for 
function-like macro invocation 'RtlZeroMemory'
compute/util_internal.h(26,20): error C2146: syntax error: missing ')' before 
identifier 'buffer'
compute/util_internal.h(26,20): error C2065: 'buffer': undeclared identifier
compute/util_internal.h(26,20): error C2182: 'memset': illegal use of type 
'void'
compute/util_internal.h(26,20): error C7525: inline variables require at least 
'/std:c++17'
compute/util_internal.h(26,20): error C2059: syntax error: 'constant'
compute/util_internal.h(26,20): error C2059: syntax error: ')'
compute/util_internal.h(26,47): error C2143: syntax error: missing ';' before 
'{'
compute/util_internal.h(26,47): error C2447: '{': missing function header 
(old-style formal list?){code}
Here is a link to the logs when they first began to fail on September 10: 
[https://github.com/ursacomputing/crossbow/runs/3564248552#step:4:2985]
 The error messages have remained the same since then.

Here is a link to the logs from the previous day (September 9) before they 
began to fail:
 [https://github.com/ursacomputing/crossbow/runs/3552742330]

 

Possible causes include:

Updates to MSVC that were applied to the {{windows-2019}} GHA runner image on 
September 9:
 [https://github.com/actions/virtual-environments/pull/3452]

One of these commits on September 9:
 
[https://github.com/apache/arrow/search?o=desc=1=committer-date%3A2021-09-09=author-date=commits]

Changes to one of the vcpkg-installed Arrow dependencies on September 9 (but I 
don't see any such changes in the {{microsoft/vcpkg}} repo commit history).





[jira] [Commented] (ARROW-14188) link error on ubuntu

2021-10-03 Thread Amir Ghamarian (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423649#comment-17423649
 ] 

Amir Ghamarian commented on ARROW-14188:


Thanks, same error. It works fine on OSX.
{code:java}
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliDecompressor::~BrotliDecompressor()': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:43:
 undefined reference to `BrotliDecoderDestroyInstance' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliDecompressor::Init()': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:48:
 undefined reference to `BrotliDecoderCreateInstance' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliDecompressor::Reset()': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:57:
 undefined reference to `BrotliDecoderDestroyInstance' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliDecompressor::Decompress(long, unsigned char const*, long, 
unsigned char*)': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:68:
 undefined reference to `BrotliDecoderDecompressStream' 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:71:
 undefined reference to `BrotliDecoderGetErrorCode' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliDecompressor::IsFinished()': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:78:
 undefined reference to `BrotliDecoderIsFinished' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliDecompressor::BrotliError(BrotliDecoderErrorCode, char 
const*)': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:84:
 undefined reference to `BrotliDecoderErrorString' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliCompressor::~BrotliCompressor()': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:100:
 undefined reference to `BrotliEncoderDestroyInstance' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliCompressor::Init()': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:105:
 undefined reference to `BrotliEncoderCreateInstance' 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:109:
 undefined reference to `BrotliEncoderSetParameter' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliCompressor::Compress(long, unsigned char const*, long, 
unsigned char*)': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:121:
 undefined reference to `BrotliEncoderCompressStream' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliCompressor::Flush(long, unsigned char*)': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:136:
 undefined reference to `BrotliEncoderCompressStream' 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:142:
 undefined reference to `BrotliEncoderHasMoreOutput' 
../../vcpkg_installed/x64-linux/debug/lib/libarrow.a(compression_brotli.cc.o): 
In function `arrow::util::internal::(anonymous 
namespace)::BrotliCompressor::End(long, unsigned char*)': 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:152:
 undefined reference to `BrotliEncoderCompressStream' 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:157:
 undefined reference to `BrotliEncoderHasMoreOutput' 
/projects/vcpkg/buildtrees/arrow/src/rrow-5.0.0-fc32d5e3bc.clean/cpp/src/arrow/util/compression_brotli.cc:158:
 undefined reference to