[jira] [Resolved] (ARROW-11931) [Go][CI] Bump CI to use Go 1.15

2021-03-11 Thread Sebastien Binet (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastien Binet resolved ARROW-11931.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9675
[https://github.com/apache/arrow/pull/9675]

> [Go][CI] Bump CI to use Go 1.15
> ---
>
> Key: ARROW-11931
> URL: https://issues.apache.org/jira/browse/ARROW-11931
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: CI, Go
>Reporter: Matt Topol
>Assignee: Matt Topol
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11942) [C++] If tasks are submitted quickly the thread pool may fail to spin up new threads

2021-03-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11942:
---
Labels: pull-request-available  (was: )

> [C++] If tasks are submitted quickly the thread pool may fail to spin up new 
> threads
> 
>
> Key: ARROW-11942
> URL: https://issues.apache.org/jira/browse/ARROW-11942
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Probably only really affects unit tests.  Consider an idle thread pool with 1 
> thread (ready_count_ == 1).  If `Spawn` is called very quickly it may look 
> like `ready_count_` is still greater than 0 (because `ready_count_` doesn't 
> necessarily decrement by the time `Spawn` returns) and so it will not spin up 
> new threads.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11942) [C++] If tasks are submitted quickly the thread pool may fail to spin up new threads

2021-03-11 Thread Weston Pace (Jira)
Weston Pace created ARROW-11942:
---

 Summary: [C++] If tasks are submitted quickly the thread pool may 
fail to spin up new threads
 Key: ARROW-11942
 URL: https://issues.apache.org/jira/browse/ARROW-11942
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Weston Pace
Assignee: Weston Pace


Probably only really affects unit tests.  Consider an idle thread pool with 1 
thread (ready_count_ == 1).  If `Spawn` is called very quickly it may look like 
`ready_count_` is still greater than 0 (because `ready_count_` doesn't 
necessarily decrement by the time `Spawn` returns) and so it will not spin up 
new threads.
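
For illustration, a toy Python model of the described race (names like `ready_count` and `spawn` are stand-ins; the real logic lives in Arrow's C++ ThreadPool):

{code:python}
import threading

class ToyPool:
    """Toy model of the race: ready_count lags behind reality."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ready_count = 1   # one worker is idle
        self._tasks = []

    def spawn(self, task):
        with self._lock:
            self._tasks.append(task)
            # BUG: the idle worker decrements _ready_count only once it
            # wakes up, which may happen after several spawn() calls have
            # already returned. Each of those calls still sees
            # _ready_count > 0 and skips starting a new worker, so tasks
            # pile up behind a single thread.
            if self._ready_count == 0:
                self._start_worker()

    def _start_worker(self):
        threading.Thread(target=self._worker_loop, daemon=True).start()

    def _worker_loop(self):
        pass  # omitted; not needed to show the race
{code}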



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11941) [Dev] "DEBUG=1 merge_arrow_pr.py" updates Jira issue

2021-03-11 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-11941:


 Summary: [Dev] "DEBUG=1 merge_arrow_pr.py" updates Jira issue
 Key: ARROW-11941
 URL: https://issues.apache.org/jira/browse/ARROW-11941
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Yibo Cai
Assignee: Yibo Cai


"DEBUG=1 dev/merge_arrow_pr.py" acts as a dryrun without writing anything.
It doesn't merge PR, but it does updates the Jira issue status. Should be fixed.
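
A hedged sketch of the guard the fix implies (function and variable names are illustrative, not the script's actual structure):

{code:python}
import os

DEBUG = bool(int(os.environ.get("DEBUG", "0")))

def resolve_issue(jira, issue_id):
    # A dry run must skip *every* write, including the Jira transition,
    # not just the git merge step.
    if DEBUG:
        print(f"DEBUG: would resolve {issue_id}")
        return
    jira.transition_issue(issue_id, "Resolve Issue")
{code}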



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11336) [C++][Doc] Improve Developing on Windows docs

2021-03-11 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-11336:
-
Fix Version/s: 4.0.0

> [C++][Doc] Improve Developing on Windows docs
> -
>
> Key: ARROW-11336
> URL: https://issues.apache.org/jira/browse/ARROW-11336
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Documentation
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
> Fix For: 4.0.0
>
>
> Update and improve the "Developing on Windows" docs page:
>  * Add instructions for using Visual Studio 2019
>  * Add instructions for the option to use vcpkg instead of conda for build 
> dependencies
>  ** Mention that when you use {{ARROW_DEPENDENCY_SOURCE=VCPKG}}, vcpkg will 
> (depending on its configuration) actually download, build, and install the 
> C++ library dependencies for you if it can't find them; this differs from 
> other dependency sources which require a prior installation
>  * Describe required Visual Studio configuration
>  * Improve some ambiguous instructions



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11758) [C++][Compute] Summation kernel round-off error

2021-03-11 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai resolved ARROW-11758.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9635
[https://github.com/apache/arrow/pull/9635]

> [C++][Compute] Summation kernel round-off error
> ---
>
> Key: ARROW-11758
> URL: https://issues.apache.org/jira/browse/ARROW-11758
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Yibo Cai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> From below test, summation kernel is of lower precision than numpy.sum.
> Numpy implements pairwise summation [1] with O(log n) round-off error, better 
> than the O(n) error from naive summation.
> *sum.py*
> {code:python}
> import numpy as np
> import pyarrow.compute as pc
> t = np.arange(321000, dtype='float64')
> t2 = t - np.mean(t)
> t2 *= t2
> print('numpy sum:', np.sum(t2))
> print('arrow sum:', pc.sum(t2))
> {code}
> *test result*
> {noformat}
> # Verified with wolfram alpha (arbitrary precision), Numpy's result is 
> correct. 
> $ ARROW_USER_SIMD_LEVEL=SSE4_2 python sum.py
> numpy sum: 2756346749973250.0
> arrow sum: 2756346749973248.0
> $ ARROW_USER_SIMD_LEVEL=AVX2 python sum.py 
> numpy sum: 2756346749973250.0
> arrow sum: 2756346749973249.0
> {noformat}
> [1] https://en.wikipedia.org/wiki/Pairwise_summation
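
For reference, a minimal sketch of pairwise summation (a plain recursive split; NumPy's actual implementation adds blocking and loop unrolling on top of this):

{code:python}
def pairwise_sum(x, block=128):
    # Split the array in half and sum the halves recursively; round-off
    # error grows as O(log n) instead of the O(n) of a naive
    # left-to-right accumulation.
    n = len(x)
    if n <= block:
        total = 0.0
        for v in x:
            total += v
        return total
    mid = n // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)
{code}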



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11328) [R] Collecting zero columns from a dataset returns entire dataset

2021-03-11 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-11328:
-
Fix Version/s: 3.0.0

> [R] Collecting zero columns from a dataset returns entire dataset
> -
>
> Key: ARROW-11328
> URL: https://issues.apache.org/jira/browse/ARROW-11328
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 2.0.0
>Reporter: András Svraka
>Assignee: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Major
> Fix For: 3.0.0
>
>
> Collecting a dataset with zero selected columns returns all columns of the 
> dataset in a data frame without column names.
> {code:r}
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> tmp <- tempfile()
> write_dataset(mtcars, tmp, format = "parquet")
> open_dataset(tmp) %>% select() %>% collect()
> #> 
> #> 1  21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
> #> 2  21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
> #> 3  22.8 4 108.0  93 3.85 2.320 18.61 1 1 4 1
> #> 4  21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
> #> 5  18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
> #> 6  18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
> #> 7  14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
> #> 8  24.4 4 146.7  62 3.69 3.190 20.00 1 0 4 2
> #> 9  22.8 4 140.8  95 3.92 3.150 22.90 1 0 4 2
> #> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
> #> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
> #> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
> #> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
> #> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
> #> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
> #> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
> #> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
> #> 18 32.4 4  78.7  66 4.08 2.200 19.47 1 1 4 1
> #> 19 30.4 4  75.7  52 4.93 1.615 18.52 1 1 4 2
> #> 20 33.9 4  71.1  65 4.22 1.835 19.90 1 1 4 1
> #> 21 21.5 4 120.1  97 3.70 2.465 20.01 1 0 3 1
> #> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
> #> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
> #> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
> #> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
> #> 26 27.3 4  79.0  66 4.08 1.935 18.90 1 1 4 1
> #> 27 26.0 4 120.3  91 4.43 2.140 16.70 0 1 5 2
> #> 28 30.4 4  95.1 113 3.77 1.513 16.90 1 1 5 2
> #> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
> #> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
> #> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
> #> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> {code}
> Empty selections in dplyr return data frames with zero columns, and based on 
> test cases covering [dplyr 
> verbs|https://github.com/apache/arrow/blob/dfee3917dc011e184264187f505da1de3d1d6fbb/r/tests/testthat/test-dplyr.R#L413-L425],
> RecordBatches already handle empty selections in the same way.
> Created on 2021-01-20 by the [reprex package|https://reprex.tidyverse.org] 
> \(v0.3.0)
> Session info
> {code:r}
> devtools::session_info()
> #> ─ Session info 
> ───
> #>  setting  value   
> #>  version  R version 4.0.3 (2020-10-10)
> #>  os   Ubuntu 20.04.1 LTS  
> #>  system   x86_64, linux-gnu   
> #>  ui   X11 
> #>  language (EN)
> #>  collate  en_US.UTF-8 
> #>  ctypeen_US.UTF-8 
> #>  tz   Etc/UTC 
> #>  date 2021-01-20  
> #> 
> #> - Packages 
> ---
> #>  package * versiondate   lib source
> #>  arrow   * 2.0.0.20210119 2021-01-20 [1] local 
> #>  assertthat0.2.1  2019-03-21 [1] RSPM (R 4.0.0)
> #>  bit   4.0.4  2020-08-04 [1] RSPM (R 4.0.2)
> #>  bit64 4.0.5  2020-08-30 [1] RSPM (R 4.0.2)
> #>  callr 3.5.1  2020-10-13 [1] RSPM (R 4.0.2)
> #>  cli   2.2.0  2020-11-20 [1] CRAN (R 4.0.3)
> #>  crayon1.3.4  2017-09-16 [1] RSPM (R 4.0.0)
> #>  DBI   1.1.1  2021-01-15 [1] CRAN (R 4.0.3)
> #>  desc  1.2.0  2018-05-01 [1] RSPM (R 4.0.0)
> #>  devtools  2.3.2  2020-09-18 [1] RSPM (R 4.0.2)
> #>  digest0.6.27 2020-10-24 [1] RSPM (R 4.0.3)
> #>  dplyr   * 1.0.3  2021-01-15 [1] CRAN (R 4.0.3)
> #>  ellipsis  0.3.1  2020-05-15 [1]

[jira] [Updated] (ARROW-10403) [C++] Implement unique kernel for dictionary type

2021-03-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10403:
---
Labels: pull-request-available  (was: )

> [C++] Implement unique kernel for dictionary type
> -
>
> Key: ARROW-10403
> URL: https://issues.apache.org/jira/browse/ARROW-10403
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Rok Mihevc
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Calling the "unique" compute function on a ChunkedArray of dictionary type 
> (as read by the CSV reader) errors with "Only hashing for data with equal 
> dictionaries currently supported". But is it necessary to hash to get unique 
> values from a dictionary type? The dictionary values are the unique values 
> (for each chunk); they're already there.
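
A sketch of that idea in Python terms (a simplification: it ignores nulls and dictionary entries that no index actually references, both of which a real kernel would have to handle):

{code:python}
import pyarrow as pa

def unique_from_dictionary(chunked):
    # Union the per-chunk dictionaries. Each chunk's dictionary already
    # holds that chunk's distinct values, so only the small dictionaries
    # need to be merged, not the full data.
    seen, out = set(), []
    for chunk in chunked.chunks:
        for value in chunk.dictionary.to_pylist():
            if value not in seen:
                seen.add(value)
                out.append(value)
    return pa.array(out)
{code}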



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7364) [Rust] Add cast options to cast kernel

2021-03-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-7364:
--
Labels: pull-request-available  (was: )

> [Rust] Add cast options to cast kernel
> --
>
> Key: ARROW-7364
> URL: https://issues.apache.org/jira/browse/ARROW-7364
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Neville Dipale
>Assignee: Mike Seddon
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The cast kernels currently do not take explicit options, but instead convert 
> overflows and invalid utf8 to nulls. We can create options that customise the 
> behaviour, similar to the CastOptions in C++ 
> ([https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/cast.h#L38])
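
For comparison, the C++ CastOptions referenced above surface in Python as the {{safe}} flag of {{pyarrow.compute.cast}}:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([1, 300])
pc.cast(arr, pa.int8(), safe=False)  # overflow wraps silently
pc.cast(arr, pa.int8())              # safe=True (default) raises on overflow
{code}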



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11328) [R] Collecting zero columns from a dataset returns entire dataset

2021-03-11 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299939#comment-17299939
 ] 

Ian Cook commented on ARROW-11328:
--

I believe this is fixed in arrow version 3.0.0. [~svraka] could you please try 
with the latest version on CRAN and let us know whether or not the problem 
persists?

> [R] Collecting zero columns from a dataset returns entire dataset
> -
>
> Key: ARROW-11328
> URL: https://issues.apache.org/jira/browse/ARROW-11328
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 2.0.0
>Reporter: András Svraka
>Assignee: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Major
>
> Collecting a dataset with zero selected columns returns all columns of the 
> dataset in a data frame without column names.
> {code:r}
> library(dplyr)
> #> 
> #> Attaching package: 'dplyr'
> #> The following objects are masked from 'package:stats':
> #> 
> #> filter, lag
> #> The following objects are masked from 'package:base':
> #> 
> #> intersect, setdiff, setequal, union
> library(arrow)
> #> 
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #> 
> #> timestamp
> tmp <- tempfile()
> write_dataset(mtcars, tmp, format = "parquet")
> open_dataset(tmp) %>% select() %>% collect()
> #> 
> #> 1  21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
> #> 2  21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
> #> 3  22.8 4 108.0  93 3.85 2.320 18.61 1 1 4 1
> #> 4  21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
> #> 5  18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
> #> 6  18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
> #> 7  14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
> #> 8  24.4 4 146.7  62 3.69 3.190 20.00 1 0 4 2
> #> 9  22.8 4 140.8  95 3.92 3.150 22.90 1 0 4 2
> #> 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
> #> 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
> #> 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
> #> 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
> #> 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
> #> 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
> #> 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
> #> 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
> #> 18 32.4 4  78.7  66 4.08 2.200 19.47 1 1 4 1
> #> 19 30.4 4  75.7  52 4.93 1.615 18.52 1 1 4 2
> #> 20 33.9 4  71.1  65 4.22 1.835 19.90 1 1 4 1
> #> 21 21.5 4 120.1  97 3.70 2.465 20.01 1 0 3 1
> #> 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
> #> 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
> #> 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
> #> 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
> #> 26 27.3 4  79.0  66 4.08 1.935 18.90 1 1 4 1
> #> 27 26.0 4 120.3  91 4.43 2.140 16.70 0 1 5 2
> #> 28 30.4 4  95.1 113 3.77 1.513 16.90 1 1 5 2
> #> 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
> #> 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
> #> 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
> #> 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
> {code}
> Empty selections in dplyr return data frames with zero columns, and based on 
> test cases covering [dplyr 
> verbs|https://github.com/apache/arrow/blob/dfee3917dc011e184264187f505da1de3d1d6fbb/r/tests/testthat/test-dplyr.R#L413-L425],
> RecordBatches already handle empty selections in the same way.
> Created on 2021-01-20 by the [reprex package|https://reprex.tidyverse.org] 
> \(v0.3.0)
> Session info
> {code:r}
> devtools::session_info()
> #> ─ Session info 
> ───
> #>  setting  value   
> #>  version  R version 4.0.3 (2020-10-10)
> #>  os   Ubuntu 20.04.1 LTS  
> #>  system   x86_64, linux-gnu   
> #>  ui   X11 
> #>  language (EN)
> #>  collate  en_US.UTF-8 
> #>  ctypeen_US.UTF-8 
> #>  tz   Etc/UTC 
> #>  date 2021-01-20  
> #> 
> #> - Packages 
> ---
> #>  package * versiondate   lib source
> #>  arrow   * 2.0.0.20210119 2021-01-20 [1] local 
> #>  assertthat0.2.1  2019-03-21 [1] RSPM (R 4.0.0)
> #>  bit   4.0.4  2020-08-04 [1] RSPM (R 4.0.2)
> #>  bit64 4.0.5  2020-08-30 [1] RSPM (R 4.0.2)
> #>  callr 3.5.1  2020-10-13 [1] RSPM (R 4.0.2)
> #>  cli   2.2.0  2020-11-20 [1] CRAN (R 4.0.3)
> #>  crayon1.3.4  2017-09-16 [1] RSPM (R 4.0.0)
> #>  DBI   1.1.1  2021-01-15 [1] CRAN (R 4.0.3)
> #>  desc  1.2.0  2018-05-01 [1] RSPM (R 4.0.0)
> #>  devtools  2.3.2  2020-09-18 [1] RSPM (R 4.0.2)
> #>  diges

[jira] [Resolved] (ARROW-11880) [R] Handle empty or NULL transmute() args properly

2021-03-11 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-11880.
-
Resolution: Fixed

Issue resolved by pull request 9681
[https://github.com/apache/arrow/pull/9681]

> [R] Handle empty or NULL transmute() args properly
> --
>
> Key: ARROW-11880
> URL: https://issues.apache.org/jira/browse/ARROW-11880
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> The following examples should all return results with zero columns. Check 
> that this behavior is consistent across Tables, RecordBatches, and Datasets. 
> There are some cases currently where these examples return all columns or 
> throw errors.
>  * {{transmute()}}
>  * {{transmute(NULL)}}
>  * {{transmute(x = NULL, y = NULL, z = NULL)}}
>  * {{x = NULL}}
>  {{transmute(x = !!x)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11940) [Rust][Datafusion] Support joins on TimestampMillisecond columns

2021-03-11 Thread Morgan Cassels (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Morgan Cassels updated ARROW-11940:
---
Description: 
Joining DataFrames on a TimestampMillisecond column gives this error:

{noformat}
called `Result::unwrap()` on an `Err` value: Internal("Unsupported data type in hasher")
arrow/rust/datafusion/src/physical_plan/hash_join.rs:252:30
{noformat}

  was:
Joining DataFrames on a TimestampMillisecond column gives this error:

{noformat}
called `Result::unwrap()` on an `Err` value: Internal("Unsupported data type in hasher")
{noformat}


> [Rust][Datafusion] Support joins on TimestampMillisecond columns
> 
>
> Key: ARROW-11940
> URL: https://issues.apache.org/jira/browse/ARROW-11940
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust - DataFusion
>Reporter: Morgan Cassels
>Priority: Major
>
> Joining DataFrames on a TimestampMillisecond column gives this error:
> {noformat}
> called `Result::unwrap()` on an `Err` value: Internal("Unsupported data type in hasher")
> arrow/rust/datafusion/src/physical_plan/hash_join.rs:252:30
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11940) [Rust][Datafusion] Support joins on TimestampMillisecond columns

2021-03-11 Thread Morgan Cassels (Jira)
Morgan Cassels created ARROW-11940:
--

 Summary: [Rust][Datafusion] Support joins on TimestampMillisecond 
columns
 Key: ARROW-11940
 URL: https://issues.apache.org/jira/browse/ARROW-11940
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Rust - DataFusion
Reporter: Morgan Cassels


Joining DataFrames on a TimestampMillisecond column gives this error:

{noformat}
called `Result::unwrap()` on an `Err` value: Internal("Unsupported data type in hasher")
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11939) Bug in `pa.PythonFile`?

2021-03-11 Thread Dave Hirschfeld (Jira)
Dave Hirschfeld created ARROW-11939:
---

 Summary: Bug in `pa.PythonFile`?
 Key: ARROW-11939
 URL: https://issues.apache.org/jira/browse/ARROW-11939
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 3.0.0
Reporter: Dave Hirschfeld



{code:python}
with pa.PythonFile('deleteme.jnk', 'wb') as f:
    pass
# raises: AttributeError: 'str' object has no attribute 'closed'
{code}
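
For context, {{pa.PythonFile}} wraps an already-open Python file object rather than a path, so a working version of the above looks like this (a sketch; the constructor probes whatever it is given for file attributes such as {{.closed}}, hence the error):

{code:python}
import pyarrow as pa

# PythonFile expects a file-like object, not a path string.
with open('deleteme.jnk', 'wb') as raw:
    with pa.PythonFile(raw) as f:
        f.write(b'some bytes')
{code}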



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11927) [Rust][DataFusion] Support limit push down

2021-03-11 Thread Andrew Lamb (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Lamb resolved ARROW-11927.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 9672
[https://github.com/apache/arrow/pull/9672]

> [Rust][DataFusion]  Support limit push down
> ---
>
> Key: ARROW-11927
> URL: https://issues.apache.org/jira/browse/ARROW-11927
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust - DataFusion
>Reporter: Daniël Heres
>Assignee: Daniël Heres
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11880) [R] Handle empty or NULL transmute() args properly

2021-03-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11880:
---
Labels: pull-request-available  (was: )

> [R] Handle empty or NULL transmute() args properly
> --
>
> Key: ARROW-11880
> URL: https://issues.apache.org/jira/browse/ARROW-11880
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The following examples should all return results with zero columns. Check 
> that this behavior is consistent across Tables, RecordBatches, and Datasets. 
> There are some cases currently where these examples return all columns or 
> throw errors.
>  * {{transmute()}}
>  * {{transmute(NULL)}}
>  * {{transmute(x = NULL, y = NULL, z = NULL)}}
>  * {{x = NULL}}
>  {{transmute(x = !!x)}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9019) [Python] hdfs fails to connect to for HDFS 3.x cluster

2021-03-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299894#comment-17299894
 ] 

Thomas Graves commented on ARROW-9019:
--

Note I was finally able to test this, and on dataproc at least, setting the 
classpath did work around the issue. It must be a jar file ordering issue.

> [Python] hdfs fails to connect to for HDFS 3.x cluster
> --
>
> Key: ARROW-9019
> URL: https://issues.apache.org/jira/browse/ARROW-9019
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thomas Graves
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, hdfs
>
> I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
> error that looks like a protobuf or jar mismatch problem with Hadoop. The 
> same code works on a Hadoop 2.9 cluster.
>  
> I'm wondering if there is something special I need to do or if pyarrow 
> doesn't support Hadoop 3.x yet?
> Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
>  
>     import pyarrow as pa
>     hdfs_kwargs = dict(host="namenodehost",
>                       port=9000,
>                       user="tgraves",
>                       driver='libhdfs',
>                       kerb_ticket=None,
>                       extra_conf=None)
>     fs = pa.hdfs.connect(**hdfs_kwargs)
>     res = fs.exists("/user/tgraves")
>  
> Error that I get on Hadoop 3.x is:
>  
> dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
> ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to 
> org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException:
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-9019) [Python] hdfs fails to connect to for HDFS 3.x cluster

2021-03-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299894#comment-17299894
 ] 

Thomas Graves edited comment on ARROW-9019 at 3/11/21, 9:55 PM:


Note I was finally able to test this, and on dataproc at least, setting the 
classpath did work around the issue. It must be a jar file ordering issue. In 
this case, though, I set it and manually started pyspark.


was (Author: tgraves):
Note I was finally able to test this, and on dataproc at least, setting the 
classpath did work around the issue. It must be a jar file ordering issue.

> [Python] hdfs fails to connect to for HDFS 3.x cluster
> --
>
> Key: ARROW-9019
> URL: https://issues.apache.org/jira/browse/ARROW-9019
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thomas Graves
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, hdfs
>
> I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
> error that looks like a protobuf or jar mismatch problem with Hadoop. The 
> same code works on a Hadoop 2.9 cluster.
>  
> I'm wondering if there is something special I need to do or if pyarrow 
> doesn't support Hadoop 3.x yet?
> Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
>  
>     import pyarrow as pa
>     hdfs_kwargs = dict(host="namenodehost",
>                       port=9000,
>                       user="tgraves",
>                       driver='libhdfs',
>                       kerb_ticket=None,
>                       extra_conf=None)
>     fs = pa.hdfs.connect(**hdfs_kwargs)
>     res = fs.exists("/user/tgraves")
>  
> Error that I get on Hadoop 3.x is:
>  
> dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
> ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to 
> org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException:
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11937) [C++] GZip codec hangs if flushed twice

2021-03-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11937:
---
Labels: pull-request-available  (was: )

> [C++] GZip codec hangs if flushed twice
> ---
>
> Key: ARROW-11937
> URL: https://issues.apache.org/jira/browse/ARROW-11937
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 3.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:java}
> // "If deflate returns with avail_out == 0, this function must be called
> //  again with the same value of the flush parameter and more output space
> //  (updated avail_out), until the flush is complete (deflate returns
> //  with non-zero avail_out)."
> return FlushResult{bytes_written, (bytes_written == 0)}; {code}
> But contrary to the comment, we're checking bytes_written. So if we flush 
> twice, the second time, we won't write any bytes, but we'll erroneously 
> interpret that as zlib asking for a larger buffer, rather than zlib telling 
> us there's no data to decompress. Then we'll enter a loop where we keep 
> doubling the buffer size forever, hanging the program.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9019) [Python] hdfs fails to connect to for HDFS 3.x cluster

2021-03-11 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299881#comment-17299881
 ] 

Thomas Graves commented on ARROW-9019:
--

[~bradmiro] I don't really understand how that fixes the issue; the hadoop 
classpath is already included when a container launches on YARN. In this case I 
launched Spark on YARN, and the hadoop classpath should already be there. The 
only thing I can think of is that this changed the order of things in the 
classpath.

> [Python] hdfs fails to connect to for HDFS 3.x cluster
> --
>
> Key: ARROW-9019
> URL: https://issues.apache.org/jira/browse/ARROW-9019
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Thomas Graves
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: filesystem, hdfs
>
> I'm trying to use the pyarrow hdfs connector with Hadoop 3.1.3 and I get an 
> error that looks like a protobuf or jar mismatch problem with Hadoop. The 
> same code works on a Hadoop 2.9 cluster.
>  
> I'm wondering if there is something special I need to do or if pyarrow 
> doesn't support Hadoop 3.x yet?
> Note I tried with pyarrow 0.15.1, 0.16.0, and 0.17.1.
>  
>     import pyarrow as pa
>     hdfs_kwargs = dict(host="namenodehost",
>                       port=9000,
>                       user="tgraves",
>                       driver='libhdfs',
>                       kerb_ticket=None,
>                       extra_conf=None)
>     fs = pa.hdfs.connect(**hdfs_kwargs)
>     res = fs.exists("/user/tgraves")
>  
> Error that I get on Hadoop 3.x is:
>  
> dfsExists: invokeMethod((Lorg/apache/hadoop/fs/Path;)Z) error:
> ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to 
> org.apache.hadoop.shaded.com.google.protobuf.Messagejava.lang.ClassCastException:
>  
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto
>  cannot be cast to org.apache.hadoop.shaded.com.google.protobuf.Message
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
>         at com.sun.proxy.$Proxy9.getFileInfo(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:904)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
>         at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1661)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1577)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1574)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1589)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1683)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11938) [R] Enable R build process to find locally built C++ library on Windows

2021-03-11 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook reassigned ARROW-11938:


Assignee: Ian Cook

> [R] Enable R build process to find locally built C++ library on Windows
> ---
>
> Key: ARROW-11938
> URL: https://issues.apache.org/jira/browse/ARROW-11938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
>
> Currently, {{configure.win}} and {{tools/winlibs.R}} have two ways of finding 
> the Arrow C++ library:
>  # If {{RWINLIB_LOCAL}} is set, it gets it from that zip file
>  # If not, it downloads it
> Enable and document a third option for the case when the C++ library has been 
> built locally. This will enable R package developers using Windows machines 
> to make changes to code in the C++ library, build and install it, and then 
> build the R package using it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11938) [R] Enable R build process to find locally built C++ library on Windows

2021-03-11 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299872#comment-17299872
 ] 

Ian Cook commented on ARROW-11938:
--

See the {{find_local_source}} function in {{r/inst/linuxlibs.R}}, which handles 
this situation on non-Windows systems. Note that it attempts to build the C++ 
library, not just find it if it's already built, and we probably do not want to 
script a C++ library build on users' Windows environments.

> [R] Enable R build process to find locally built C++ library on Windows
> ---
>
> Key: ARROW-11938
> URL: https://issues.apache.org/jira/browse/ARROW-11938
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
>
> Currently, {{configure.win}} and {{tools/winlibs.R}} have two ways of finding 
> the Arrow C++ library:
>  # If {{RWINLIB_LOCAL}} is set, it gets it from that zip file
>  # If not, it downloads it
> Enable and document a third option for the case when the C++ library has been 
> built locally. This will enable R package developers using Windows machines 
> to make changes to code in the C++ library, build and install it, and then 
> build the R package using it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11938) [R] Enable R build process to find locally built C++ library on Windows

2021-03-11 Thread Ian Cook (Jira)
Ian Cook created ARROW-11938:


 Summary: [R] Enable R build process to find locally built C++ 
library on Windows
 Key: ARROW-11938
 URL: https://issues.apache.org/jira/browse/ARROW-11938
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Ian Cook


Currently, {{configure.win}} and {{tools/winlibs.R}} have two ways of finding 
the Arrow C++ library:
 # If {{RWINLIB_LOCAL}} is set, it gets it from that zip file
 # If not, it downloads it

Enable and document a third option for the case when the C++ library has been 
built locally. This will enable R package developers using Windows machines to 
make changes to code in the C++ library, build and install it, and then build 
the R package using it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11937) [C++] GZip codec hangs if flushed twice

2021-03-11 Thread David Li (Jira)
David Li created ARROW-11937:


 Summary: [C++] GZip codec hangs if flushed twice
 Key: ARROW-11937
 URL: https://issues.apache.org/jira/browse/ARROW-11937
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 3.0.0
Reporter: David Li
Assignee: David Li
 Fix For: 4.0.0


{code:java}
// "If deflate returns with avail_out == 0, this function must be called
//  again with the same value of the flush parameter and more output space
//  (updated avail_out), until the flush is complete (deflate returns
//  with non-zero avail_out)."
return FlushResult{bytes_written, (bytes_written == 0)}; {code}
But contrary to the comment, we're checking bytes_written. So if we flush 
twice, the second time, we won't write any bytes, but we'll erroneously 
interpret that as zlib asking for a larger buffer, rather than zlib telling us 
there's no data to decompress. Then we'll enter a loop where we keep doubling 
the buffer size forever, hanging the program.
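
To illustrate, a Python sketch of the retry loop around such a flush (names are illustrative, not Arrow's C++ API):

{code:python}
def flush_all(flush_raw, initial_size=1024):
    # flush_raw(buf_size) -> (data, out_space_exhausted) models the
    # codec's Flush(). The retry signal must be "the output buffer was
    # filled", not "zero bytes were written": a second flush legitimately
    # writes zero bytes, and treating that as a retry doubles the buffer
    # forever.
    out = b""
    size = initial_size
    while True:
        data, out_space_exhausted = flush_raw(size)
        out += data
        if not out_space_exhausted:
            return out
        size *= 2
{code}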



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11936) Rust/Java incorrect serialization of Struct wrapped Int8Dictionary

2021-03-11 Thread Justin (Jira)
Justin created ARROW-11936:
--

 Summary: Rust/Java incorrect serialization of Struct wrapped 
Int8Dictionary
 Key: ARROW-11936
 URL: https://issues.apache.org/jira/browse/ARROW-11936
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java, Rust
Affects Versions: 3.0.0
Reporter: Justin


Using Rust, I serialized data to a file with a schema of
{code:java}
Field { name: "val", data_type: Struct([Field { name: "val", data_type: Utf8, 
nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }]), 
nullable: false, dict_id: 0, dict_is_ordered: false, metadata: None }{code}
Reading the serialized data with a Java client results in a schema of
{code:java}
Schema<val: Struct<val: Utf8 not null> not null>{code}
whilst calling ArrowFileReader.loadNextBatch() results in
{code:java}
Exception in thread "main" java.util.NoSuchElementExceptionException in thread 
"main" java.util.NoSuchElementException at 
java.base/java.util.ArrayList$Itr.next(ArrayList.java:1000) at 
org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:81) at 
org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:99) at 
org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61) at 
org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:205) 
at 
org.apache.arrow.vector.ipc.ArrowFileReader.loadNextBatch(ArrowFileReader.java:153)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-11477) [R][Doc] Reorganize and improve README and vignette content

2021-03-11 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299820#comment-17299820
 ] 

Ian Cook edited comment on ARROW-11477 at 3/11/21, 7:22 PM:


Consider extracting out the README and dplyr piece of this for the 4.0.0 
release and deferring the remainder for a later release.


was (Author: icook):
Consider extracting out the dplyr piece of this for the 4.0.0 release and 
deferring the remainder for a later release.

> [R][Doc] Reorganize and improve README and vignette content
> ---
>
> Key: ARROW-11477
> URL: https://issues.apache.org/jira/browse/ARROW-11477
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
> Fix For: 4.0.0
>
>
> Collecting various ideas here for general ways to improve the R package 
> README and vignettes for the 4.0.0 release:
>  * Consider moving the "building" and "developing" content out of the README 
> and into a vignette focused on that topic. (Rationale: most users of the R 
> package today are downloading prebuilt binaries, not building their own; most 
> users today are end users, not developers; a more valuable use for the 
> README—especially since it's the homepage of the R docs site—would be as 
> a place to highlight key capabilities of the package, not to show folks all 
> the technical details of building it.)
>  * Get the "Using the Arrow C++ Library in R" vignette to show in the 
> Articles menu on the R docs site.
>  * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear 
> that dplyr verbs can be used with Arrow Tables and RecordBatches (not just 
> Datasets) and describe differences in dplyr support for these different Arrow 
> objects.
>  * Check all the links in the "Project docs" menu on the docs site; some of 
> them are currently broken or go to directory listings



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11477) [R][Doc] Reorganize and improve README and vignette content

2021-03-11 Thread Ian Cook (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299820#comment-17299820
 ] 

Ian Cook commented on ARROW-11477:
--

Consider extracting out the dplyr piece of this for the 4.0.0 release and 
deferring the remainder for a later release.

> [R][Doc] Reorganize and improve README and vignette content
> ---
>
> Key: ARROW-11477
> URL: https://issues.apache.org/jira/browse/ARROW-11477
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, R
>Affects Versions: 3.0.0
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
> Fix For: 4.0.0
>
>
> Collecting various ideas here for general ways to improve the R package 
> README and vignettes for the 4.0.0 release:
>  * Consider moving the "building" and "developing" content out of the README 
> and into a vignette focused on that topic. (Rationale: most users of the R 
> package today are downloading prebuilt binaries, not building their own; most 
> users today are end users, not developers; a more valuable use for the 
> README—especially since it's the homepage of the R docs site—would be as 
> a place to highlight key capabilities of the package, not to show folks all 
> the technical details of building it.)
>  * Get the "Using the Arrow C++ Library in R" vignette to show in the 
> Articles menu on the R docs site.
>  * Edit the "Working with Arrow Datasets and dplyr" vignette to make clear 
> that dplyr verbs can be used with Arrow Tables and RecordBatches (not just 
> Datasets) and describe differences in dplyr support for these different Arrow 
> objects.
>  * Check all the links in the "Project docs" menu on the docs site; some of 
> them are currently broken or go to directory listings



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11935) [C++] Add push generator

2021-03-11 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-11935:
--

 Summary: [C++] Add push generator
 Key: ARROW-11935
 URL: https://issues.apache.org/jira/browse/ARROW-11935
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Sometimes a producer of values just wants to queue futures and let a consumer 
pop them iteratively.
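
A rough Python analogue of the idea (Arrow's version is built on C++ futures; this sketch uses asyncio instead):

{code:python}
import asyncio

class PushGenerator:
    """Producer pushes values; consumer awaits them one at a time."""

    def __init__(self):
        self._queue = asyncio.Queue()

    def push(self, value):
        # Called by the producer; never blocks.
        self._queue.put_nowait(value)

    def close(self):
        self._queue.put_nowait(None)  # terminal marker

    def __aiter__(self):
        return self

    async def __anext__(self):
        value = await self._queue.get()
        if value is None:
            raise StopAsyncIteration
        return value
{code}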




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1722) [C++] Add linting script to look for C++/CLI issues

2021-03-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299775#comment-17299775
 ] 

Antoine Pitrou commented on ARROW-1722:
---

[~wesm] [~tobyshaw] Can you add more context on these exclusions?
While the nullptr case is easy to avoid using the NULLPTR macro, the mutex 
exclusion is potentially annoying.

> [C++] Add linting script to look for C++/CLI issues
> ---
>
> Key: ARROW-1722
> URL: https://issues.apache.org/jira/browse/ARROW-1722
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> This includes:
> * Using {{nullptr}} in header files (we must instead use an appropriate macro 
> to use {{__nullptr}} when the host compiler is C++/CLI)
> * Including {{<mutex>}} in a public header (e.g. header files without "impl" 
> or "internal" in their name)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10440) [C++][Dataset][Python] Add a callback to visit file writers just before Finish()

2021-03-11 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299773#comment-17299773
 ] 

Weston Pace commented on ARROW-10440:
-

Yes, it should be able to satisfy your use case, now that you've mentioned the 
need here. Right now a `FileWriter` only has the `schema`, which wouldn't 
include the path being written, but it seems reasonable that it could have a 
path accessor as well. [~bkietz] can you confirm?

> [C++][Dataset][Python] Add a callback to visit file writers just before 
> Finish()
> 
>
> Key: ARROW-10440
> URL: https://issues.apache.org/jira/browse/ARROW-10440
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 5.0.0
>
>
> This will fill the role of (for example) {{metadata_collector}} or allow 
> stats to be embedded in IPC file footer metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11907) [C++] Use our own executor in S3FileSystem

2021-03-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11907:
---
Labels: pull-request-available  (was: )

> [C++] Use our own executor in S3FileSystem
> --
>
> Key: ARROW-11907
> URL: https://issues.apache.org/jira/browse/ARROW-11907
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We use the AWS SDK async APIs in some places in S3FileSystem, but all they do 
> is spawn a separate thread and run the sync API in it.
> Instead we should use our IO executor, which would:
> 1) put an upper bound on the number of threads started
> 2) (presumably) reduce latency by reusing threads instead of spawning a 
> throwaway thread for each async call
> 3) allow for cancellation
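
The shape of the change, sketched in Python with a boto3-style client for illustration ({{io_executor}} stands in for Arrow's IO thread pool):

{code:python}
from concurrent.futures import ThreadPoolExecutor

# A bounded pool of reused threads, instead of one throwaway thread per call.
io_executor = ThreadPoolExecutor(max_workers=8)

def get_object_async(client, bucket, key):
    # Run the sync SDK call on our own executor and hand back a future;
    # tasks queued on a shared pool can also be cancelled before they start.
    return io_executor.submit(client.get_object, Bucket=bucket, Key=key)
{code}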



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10440) [C++][Dataset][Python] Add a callback to visit file writers just before Finish()

2021-03-11 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299754#comment-17299754
 ] 

Lance Dacey commented on ARROW-10440:
-

Can someone confirm whether this issue would cover my use case, or whether I 
should add a separate feature request? My goal is simply to be able to retrieve 
the list of fragment paths which were saved using the ds.write_dataset() 
function.

I believe it does, since I am using the metadata_collector argument to gather 
this information with the legacy dataset, but let me know if this is different. 
Thanks!

> [C++][Dataset][Python] Add a callback to visit file writers just before 
> Finish()
> 
>
> Key: ARROW-10440
> URL: https://issues.apache.org/jira/browse/ARROW-10440
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 5.0.0
>
>
> This will fill the role of (for example) {{metadata_collector}} or allow 
> stats to be embedded in IPC file footer metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11365) [Rust] [Parquet] Implement parsers for v2 of the text schema

2021-03-11 Thread Neville Dipale (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neville Dipale reassigned ARROW-11365:
--

Assignee: Neville Dipale

> [Rust] [Parquet] Implement parsers for v2 of the text schema
> 
>
> Key: ARROW-11365
> URL: https://issues.apache.org/jira/browse/ARROW-11365
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Neville Dipale
>Assignee: Neville Dipale
>Priority: Major
>
> V2 of the writer produces schema like:
>     required INT32 fieldname INTEGER(32, true);
> We should support parsing this format, as it maps to logical types.
> I'm unsure of what the implications are for fields that don't have a logical 
> type representation, but have a converted type (e.g. INTERVAL). We can try 
> writing a V2 file with parquet-cpp and observe the behaviour.
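
A toy sketch of parsing the v2 field form (illustrative only; the real parser lives in the Rust parquet crate):

{code:python}
import re

V2_FIELD = re.compile(
    r"(?P<repetition>required|optional|repeated)\s+"
    r"(?P<physical>\w+)\s+(?P<name>\w+)"
    r"(?:\s+(?P<logical>\w+)\((?P<args>[^)]*)\))?\s*;"
)

m = V2_FIELD.match("required INT32 fieldname INTEGER(32, true);")
assert m.group("physical") == "INT32"
assert m.group("logical") == "INTEGER"  # maps to a logical type
assert m.group("args") == "32, true"    # bit width, signedness
{code}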



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-7224) [C++][Dataset] Partition level filters should be able to provide filtering to file systems

2021-03-11 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299725#comment-17299725
 ] 

Micah Kornfield commented on ARROW-7224:


[~jorisvandenbossche] I think this is the [relevant 
API|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/SupportsPushDownFilters.java]
 from DataSourceV2.  

 

It seems like a bad user experience to expose a "filter on construction" 
parameter, but it could be a way to mitigate this. I think the workarounds 
[~bkietz] proposed are also workable. As I've said before, I think supporting 
the feature this JIRA is asking for is complex and potentially requires big 
changes to Datasets, so I understand if it isn't immediately prioritized (but I 
think it can have a large impact for common cases).

> [C++][Dataset] Partition level filters should be able to provide filtering to 
> file systems
> --
>
> Key: ARROW-7224
> URL: https://issues.apache.org/jira/browse/ARROW-7224
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: dataset
>
> When providing a filter for partitions, it should be possible in some cases 
> to use it to optimize file system list calls.  This can greatly improve the 
> speed of reading data from partitions because fewer 
> directories/files need to be explored/expanded.  I've fallen behind on the 
> dataset code, but I want to make sure this issue is tracked someplace.  This 
> came up in SO question linked below (feel free to correct my analysis if I 
> missed the functionality someplace).
> Reference: 
> [https://stackoverflow.com/questions/58868584/pyarrow-parquetdataset-read-is-slow-on-a-hive-partitioned-s3-dataset-despite-u/58951477#58951477]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11260) [C++][Dataset] Don't require dictionaries for reading dataset with schema-based Partitioning

2021-03-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11260:
---
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Don't require dictionaries for reading dataset with 
> schema-based Partitioning
> 
>
> Key: ARROW-11260
> URL: https://issues.apache.org/jira/browse/ARROW-11260
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: David Li
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As a follow-up on ARROW-10247 (see also 
> https://github.com/apache/arrow/pull/9130#issuecomment-760801124). We 
> currently require the user to pass manually specified dictionary values when 
> reading a dataset with a Partitioning based on a schema with dictionary-typed 
> fields. 
> In practice that means the user, for example, needs to parse the file paths 
> to get all the possible values the partition field can take, while Arrow will 
> afterwards do the same again to construct the dataset object. 
> _Naively_, it seems that it should be possible to let Arrow infer the 
> dictionary _values_, even when providing an explicit schema with a dictionary 
> field for the Partitioning (i.e. when not letting the partitioning schema 
> itself be inferred from the file paths).
> An example use case is when you have a Partitioning schema with both 
> dictionary and non-dictionary fields. When discovering the schema, you can 
> only have all or nothing (all dictionary fields or no dictionary fields).
> cc [~bkietz]
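
As a rough sketch of the inference this asks for (toy Rust with hypothetical 
names; not the C++ Partitioning code): the discovery pass already sees every 
file path, so it can accumulate the distinct values of a partition field into 
dictionary values plus per-path codes, instead of requiring the caller to 
supply them.

{code}
use std::collections::HashMap;

/// Toy sketch: build dictionary values and per-path codes for one
/// Hive-style partition key from the discovered file paths.
fn infer_dictionary(paths: &[&str], key: &str) -> (Vec<String>, Vec<usize>) {
    let mut values = Vec::new();               // distinct dictionary values
    let mut index: HashMap<String, usize> = HashMap::new();
    let mut codes = Vec::with_capacity(paths.len());
    for path in paths {
        // Extract the "key=value" segment of the path.
        let value = path
            .split('/')
            .find_map(|seg| seg.strip_prefix(&format!("{}=", key)))
            .unwrap_or("")
            .to_string();
        let code = *index.entry(value.clone()).or_insert_with(|| {
            values.push(value);
            values.len() - 1
        });
        codes.push(code);
    }
    (values, codes)
}

fn main() {
    let paths = ["color=red/a.parquet", "color=blue/b.parquet", "color=red/c.parquet"];
    let (values, codes) = infer_dictionary(&paths, "color");
    assert_eq!(values, vec!["red", "blue"]);
    assert_eq!(codes, vec![0, 1, 0]);
}
{code}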



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11718) [Rust] IPC writers shouldn't implicitly finish on drop

2021-03-11 Thread Jorge Leitão (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Leitão updated ARROW-11718:
-
Fix Version/s: 3.0.1

> [Rust] IPC writers shouldn't implicitly finish on drop
> --
>
> Key: ARROW-11718
> URL: https://issues.apache.org/jira/browse/ARROW-11718
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Steven Fackler
>Assignee: Steven Fackler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> The Rust IPC writer types have a destructor that automatically writes the 
> footer if necessary. This is not ideal, though, since it can hide errors. For 
> example, if a web server is streaming data to a client in the Arrow IPC 
> format and it encounters an internal error trying to generate the next batch, 
> the outbound stream will appear valid to the client, since the footer will 
> automatically be written out even though some of the data is actually 
> missing. If the footer were not automatically written, the client would 
> properly detect the truncation.
> For reference, the C++ implementation does not attempt to write the footer 
> implicitly on drop.
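
To make the failure mode concrete, a toy Rust sketch (not the arrow crate's 
actual writer types) of the contract this issue proposes: the footer is 
written only by an explicit finish(), so a producer that errors out before 
finishing leaves a detectably truncated stream rather than a spuriously valid 
one.

{code}
use std::io::{self, Write};

/// Toy IPC-style writer: no implicit finish in Drop.
struct ToyStreamWriter<W: Write> {
    sink: W,
    finished: bool,
}

impl<W: Write> ToyStreamWriter<W> {
    fn new(sink: W) -> Self {
        ToyStreamWriter { sink, finished: false }
    }

    fn write_batch(&mut self, payload: &[u8]) -> io::Result<()> {
        self.sink.write_all(payload)
    }

    /// Writing the footer is the caller's explicit declaration that the
    /// stream is complete; it never happens as a side effect of Drop.
    fn finish(&mut self) -> io::Result<()> {
        self.sink.write_all(b"FOOTER")?;
        self.finished = true;
        Ok(())
    }
}

fn main() -> io::Result<()> {
    let mut buf = Vec::new();
    {
        let mut w = ToyStreamWriter::new(&mut buf);
        w.write_batch(b"BATCH")?;
        // If batch production had failed here, dropping `w` without
        // finish() would leave `buf` without a footer -- detectably
        // truncated, which is the desired behaviour.
        w.finish()?;
    }
    assert!(buf.ends_with(b"FOOTER"));
    Ok(())
}
{code}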



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11577) [Rust] Concat kernel panics on slices of string arrays

2021-03-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11577:
---
Fix Version/s: 3.0.1

> [Rust] Concat kernel panics on slices of string arrays
> --
>
> Key: ARROW-11577
> URL: https://issues.apache.org/jira/browse/ARROW-11577
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Reporter: Ben Chambers
>Assignee: Ben Chambers
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See the test case below:
> {code:java}
> #[test]
> fn test_string_array_slices() -> Result<()> {
> let input_1 = StringArray::from(vec!["hello", "A", "B", "C"]);
> let input_2 = StringArray::from(vec!["world", "D", "E", "Z"]);
> let arr = concat(&[
> input_1.slice(1, 3).as_ref(),
> input_2.slice(1, 2).as_ref(),
> ])?;
> let expected_output = StringArray::from(vec!["A", "B", "C", "D", "E"]);
> let actual_output = arr
> .as_any()
> .downcast_ref::<StringArray>()
> .unwrap();
> assert_eq!(actual_output, &expected_output);
> Ok(())
> }{code}
> Fails with:
> {noformat}
> range end index 8 out of range for slice of length 7
> thread 'compute::kernels::concat::tests::test_string_array_slices' panicked 
> at 'range end index 8 out of range for slice of length 7', 
> arrow/src/array/transform/variable_size.rs:38:23
> {noformat}
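
The panic points at the extend path for variable-size (offset-based) arrays. 
As a hedged illustration of the arithmetic involved (hypothetical helper, not 
the kernel's actual code): a sliced string array has a non-zero logical 
offset, so value ranges must be read relative to the slice's position in the 
parent's offsets buffer, not from index 0.

{code}
/// Illustrative only: the value range of element `i` of a slice must be
/// looked up at `slice_offset + i` in the parent's offsets buffer.
fn value_range(offsets: &[i32], slice_offset: usize, i: usize) -> (usize, usize) {
    let start = offsets[slice_offset + i] as usize;
    let end = offsets[slice_offset + i + 1] as usize;
    (start, end) // byte range into the values buffer
}

fn main() {
    // Offsets for ["hello", "A", "B", "C"]; the slice (1, 3) covers "A".."C".
    let offsets = [0_i32, 5, 6, 7, 8];
    assert_eq!(value_range(&offsets, 1, 0), (5, 6)); // "A"
    // Ignoring the slice offset would wrongly return (0, 5) here.
}
{code}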



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11586) [Rust] [Datafusion] Invalid SQL sometimes panics

2021-03-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11586:
---
Fix Version/s: 3.0.1

> [Rust] [Datafusion] Invalid SQL sometimes panics
> 
>
> Key: ARROW-11586
> URL: https://issues.apache.org/jira/browse/ARROW-11586
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust - DataFusion
>Reporter: Marc Prud'hommeaux
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Executing the invalid SQL "select 1 order by x" will panic rather than 
> returning an Err:
>  ```
> thread '' panicked at 'called `Result::unwrap()` on an `Err` value: 
> Plan("Invalid identifier \'x\' for schema Int64(1)")', 
> /Users/marc/.cargo/git/checkouts/arrow-3a9cfebb6b7b2bdc/643f420/rust/datafusion/src/sql/planner.rs:649:76
> stack backtrace:
>0: _rust_begin_unwind
>1: core::panicking::panic_fmt
>2: core::option::expect_none_failed
>3: core::result::Result::unwrap
>4: datafusion::sql::planner::SqlToRel::order_by::{{closure}}
>5: core::iter::adapters::map_try_fold::{{closure}}
>6: core::iter::traits::iterator::Iterator::try_fold
>7:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>8:  as 
> core::iter::traits::iterator::Iterator>::try_fold
>9: core::iter::traits::iterator::Iterator::find
>   10:  as 
> core::iter::traits::iterator::Iterator>::next
>   11:  as alloc::vec::SpecFromIterNested>::from_iter
>   12:  as alloc::vec::SpecFromIter>::from_iter
>   13:  as 
> core::iter::traits::collect::FromIterator>::from_iter
>   14: core::iter::traits::iterator::Iterator::collect
>   15:  as 
> core::iter::traits::collect::FromIterator>>::from_iter::{{closure}}
>   16: core::iter::adapters::process_results
>   17:  as 
> core::iter::traits::collect::FromIterator>>::from_iter
>   18: core::iter::traits::iterator::Iterator::collect
>   19: datafusion::sql::planner::SqlToRel::order_by
>   20: datafusion::sql::planner::SqlToRel::query_to_plan
>   21: datafusion::sql::planner::SqlToRel::sql_statement_to_plan
>   22: datafusion::sql::planner::SqlToRel::statement_to_plan
>   23: datafusion::execution::context::ExecutionContext::create_logical_plan
> ```
> This is happening because of an `unwrap` at 
> https://github.com/apache/arrow/blob/6cfbd22b457d873365fa60df31905857856608ee/rust/datafusion/src/sql/planner.rs#L652.
>  
> Perhaps the error should be returned as the Result rather than panicking, so 
> the error can be handled? There are a number of other places in the planner 
> where `unwrap()` is used, so they may warrant similar treatment.
>  
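
A sketch of the shape such a fix could take (toy types; 
resolve_order_by_column is a hypothetical stand-in for the lookup that 
currently unwraps): propagate the planner error through the Result instead of 
panicking.

{code}
/// Toy stand-in for the order-by identifier lookup: return an error
/// instead of calling unwrap() on a failed schema lookup.
fn resolve_order_by_column(fields: &[&str], name: &str) -> Result<usize, String> {
    fields
        .iter()
        .position(|f| *f == name)
        .ok_or_else(|| format!("Invalid identifier '{}' for schema", name))
}

fn main() {
    // "select 1 order by x": no column named x, so planning should Err.
    let err = resolve_order_by_column(&["Int64(1)"], "x").unwrap_err();
    assert!(err.contains("Invalid identifier 'x'"));
}
{code}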



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11681) [Rust] IPC writers shouldn't unwrap in destructors

2021-03-11 Thread Andy Grove (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove updated ARROW-11681:
---
Fix Version/s: 3.0.1

> [Rust] IPC writers shouldn't unwrap in destructors
> --
>
> Key: ARROW-11681
> URL: https://issues.apache.org/jira/browse/ARROW-11681
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Rust
>Affects Versions: 3.0.0
>Reporter: Steven Fackler
>Assignee: Steven Fackler
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> FileWriter and StreamWriter call `self.finish().unwrap()` in their `Drop` 
> implementations if the write has not already been finished. However, a common 
> reason for the write to not be finished is an earlier IO error on the 
> underlying stream. In that case, the destructor will panic, which is not 
> desired.
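
Whether Drop should attempt to finish at all is ARROW-11718; independent of 
that, the destructor must not unwrap. A minimal panic-free sketch (toy Rust 
type, not the arrow crate's): finish on a best-effort basis and discard the 
error, since panicking in Drop during unwinding aborts the process.

{code}
use std::io::{self, Write};

struct ToyFileWriter<W: Write> {
    sink: W,
    finished: bool,
}

impl<W: Write> ToyFileWriter<W> {
    fn finish(&mut self) -> io::Result<()> {
        self.sink.write_all(b"FOOTER")?;
        self.finished = true;
        Ok(())
    }
}

impl<W: Write> Drop for ToyFileWriter<W> {
    fn drop(&mut self) {
        if !self.finished {
            // Best effort only: if an earlier I/O error left the stream
            // broken, finish() will fail again here, and unwrap() would
            // turn that into a panic mid-unwind. Discard the error instead.
            let _ = self.finish();
        }
    }
}

fn main() {
    let mut buf = Vec::new();
    {
        let _w = ToyFileWriter { sink: &mut buf, finished: false };
        // Dropped without an explicit finish(): no panic either way.
    }
    assert!(buf.ends_with(b"FOOTER"));
}
{code}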



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11933) [Developer] Provide a dashboard for improved Pull Request management

2021-03-11 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299680#comment-17299680
 ] 

David Li commented on ARROW-11933:
--

It seems somewhat configurable already (it just needs some GitHub and Jira 
credentials), though there's also some Spark-specific CI 
integration with their own Jenkins instance that would need to be removed or 
disabled.

> [Developer] Provide a dashboard for improved Pull Request management
> 
>
> Key: ARROW-11933
> URL: https://issues.apache.org/jira/browse/ARROW-11933
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Affects Versions: 3.0.0
>Reporter: Ben Kietzman
>Priority: Major
>
> The [spark PR dashboard|https://github.com/databricks/spark-pr-dashboard] 
> (instance at
> http://spark-prs.appspot.com/ ) provides a useful view of pull requests. 
> Information is retrieved from the github API and persisted to a database for 
> analyses, including classification of pull requests based on which files they 
> modify. The added context provides greater visibility of PRs to the 
> committers interested in reviewing/merging them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11934) [Rust] Document patch release process

2021-03-11 Thread Andy Grove (Jira)
Andy Grove created ARROW-11934:
--

 Summary: [Rust] Document patch release process
 Key: ARROW-11934
 URL: https://issues.apache.org/jira/browse/ARROW-11934
 Project: Apache Arrow
  Issue Type: Task
  Components: Rust
Reporter: Andy Grove
Assignee: Andy Grove
 Fix For: 3.0.1


Now that we have moved to voting on source releases for patch releases, we need 
to document the process for doing so in the Rust implementation.

 

Google doc for discussion / collaboration: 
https://docs.google.com/document/d/1i2Elk6J0H4nhPeQZdLDyqvHoRbsabx2iOTXLHxxNqRE/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11933) [Developer] Provide a dashboard for improved Pull Request management

2021-03-11 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-11933:


 Summary: [Developer] Provide a dashboard for improved Pull Request 
management
 Key: ARROW-11933
 URL: https://issues.apache.org/jira/browse/ARROW-11933
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Affects Versions: 3.0.0
Reporter: Ben Kietzman


The [spark PR dashboard|https://github.com/databricks/spark-pr-dashboard] 
(instance at
http://spark-prs.appspot.com/ ) provides a useful view of pull requests. 
Information is retrieved from the github API and persisted to a database for 
analyses, including classification of pull requests based on which files they 
modify. The added context provides greater visibility of PRs to the committers 
interested in reviewing/merging them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11932) [C++] Provide ArrayBuilder::AppendScalar

2021-03-11 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-11932:


 Summary: [C++] Provide ArrayBuilder::AppendScalar
 Key: ARROW-11932
 URL: https://issues.apache.org/jira/browse/ARROW-11932
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 3.0.0
Reporter: Ben Kietzman
 Fix For: 5.0.0


It would be useful to be able to append a Scalar (and/or ScalarVector) to an 
ArrayBuilder. For example, in 
https://github.com/apache/arrow/pull/9621#discussion_r587461083 (ARROW-11591) 
this could be used to accumulate an array of expected grouped aggregation 
results using existing scalar aggregate kernels.
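
As a rough illustration of the requested API shape, sketched in Rust with toy 
types (the actual feature targets the C++ ArrayBuilder): appending a 
dynamically typed scalar should type-check against the builder and append 
either a value or a null, rather than panicking on a mismatch.

{code}
/// Toy dynamically typed scalar; None represents a null.
enum Scalar {
    Int32(Option<i32>),
    Utf8(Option<String>),
}

/// Toy Int32 builder.
#[derive(Default)]
struct Int32Builder {
    values: Vec<Option<i32>>,
}

impl Int32Builder {
    /// Append a scalar, rejecting type mismatches instead of panicking.
    fn append_scalar(&mut self, s: &Scalar) -> Result<(), String> {
        match s {
            Scalar::Int32(v) => {
                self.values.push(*v);
                Ok(())
            }
            _ => Err("type mismatch: expected int32 scalar".to_string()),
        }
    }
}

fn main() {
    let mut b = Int32Builder::default();
    b.append_scalar(&Scalar::Int32(Some(7))).unwrap();
    b.append_scalar(&Scalar::Int32(None)).unwrap(); // null
    assert!(b.append_scalar(&Scalar::Utf8(None)).is_err());
    assert_eq!(b.values, vec![Some(7), None]);
}
{code}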



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11907) [C++] Use our own executor in S3FileSystem

2021-03-11 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-11907:
--

Assignee: Antoine Pitrou

> [C++] Use our own executor in S3FileSystem
> --
>
> Key: ARROW-11907
> URL: https://issues.apache.org/jira/browse/ARROW-11907
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> We use the AWS SDK async APIs in some places in S3FileSystem, but all they do 
> is spawn a separate thread and run the sync API in it.
> Instead we should use our IO executor, which would:
> 1) put an upper bound on the number of threads started
> 2) (presumably) reduce latency by reusing threads instead of spawning a 
> throwaway thread for each async call
> 3) allow for cancellation
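
For points 1 and 2, a conceptual sketch in plain Rust std (not Arrow's C++ 
executor or the AWS SDK): closures are submitted to a fixed pool of reusable 
worker threads instead of each call spawning a throwaway thread.

{code}
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send>;

/// Toy fixed-size executor: worker count is bounded up front and threads
/// are reused across submissions, unlike one-thread-per-async-call.
fn spawn_pool(workers: usize) -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..workers {
        let rx = Arc::clone(&rx);
        thread::spawn(move || loop {
            // Hold the lock only while receiving; the job runs unlocked.
            let job = match rx.lock().unwrap().recv() {
                Ok(job) => job,
                Err(_) => break, // sender dropped: executor shut down
            };
            job();
        });
    }
    tx
}

fn main() {
    let pool = spawn_pool(4);
    for i in 0..16 {
        pool.send(Box::new(move || println!("task {}", i))).unwrap();
    }
    // Dropping the sender lets the workers drain their queue and exit.
    drop(pool);
    thread::sleep(std::time::Duration::from_millis(100));
}
{code}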



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11677) [C++][Dataset] Write documentation

2021-03-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299614#comment-17299614
 ] 

Antoine Pitrou commented on ARROW-11677:


cc [~bkietz]

> [C++][Dataset] Write documentation
> --
>
> Key: ARROW-11677
> URL: https://issues.apache.org/jira/browse/ARROW-11677
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 4.0.0
>
>
> The dataset component is currently undocumented. Documentation should be 
> added in two parts:
> * a page in the User Guide
> * a page in the API reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-9513) [Java] Improve documentation in regards to basic-usage / memory-management

2021-03-11 Thread Ben Mosher (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299597#comment-17299597
 ] 

Ben Mosher commented on ARROW-9513:
---

I share Sascha's struggles and agree that it seems the Java API jargon and 
usage is very different from the Python API, which makes it difficult to draw 
analogies from that documentation.

I'll also say that, in particular, I would like documentation about the 
lifecycle of the BufferAllocator/RootAllocator. I don't want to leak memory, 
nor do I want to throw away efficiency by creating and destroying this 
object more often than needed. Most examples either have the `allocator` 
created "offscreen" or place it in a try-with-resources block living exactly 
as long as the vectors that are read. I understand that is always going to 
work and be safe, but it may not be ideal; I don't know for sure either way 
because I can't find specific guidance.

> [Java] Improve documentation in regards to basic-usage / memory-management
> --
>
> Key: ARROW-9513
> URL: https://issues.apache.org/jira/browse/ARROW-9513
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Documentation, Java
>Affects Versions: 0.17.1
>Reporter: sascha schnug
>Priority: Minor
>
> I'm experimenting with Arrow using Java, C++ and Python, the IPC format 
> (Bytestream, File) and Parquet: I am struggling a lot on the Java side, even 
> after looking for external resources and some code reading within the 
> dev repository.
>   
>  Observing the state of the documentation, there seems to be a strong 
> preference for C++ and Python, which is not surprising. The Java part, 
> however, is hard to work with (at least for me; it might be that I'm the 
> problem). Sadly, the Java interface is also the one that diverges the most 
> from what people would usually do in Java.
>   
>  Acknowledging the user-guide-like documentation from 
> [repo/java|https://github.com/apache/arrow/tree/master/java#getting-started] 
> (-I don't think this is referenced in the docs and it might only be 
> referenced by the Java part of the repository > looks "hidden" as the 
> Java link in the docs points to Javadoc-based content- -> known issue: 
> ARROW-9364 ) and its warnings about VectorSchemaRoot being special and 
> temporary, and also reading [this external 
> article|https://www.infoq.com/articles/apache-arrow-java]
>  which also talks about manual memory management, I'm still struggling with a 
> very simple use case:
>   
>  - create and fill a VectorSchemaRoot
>  - write the VectorSchemaRoot in IPC format to disk
>  - read the VectorSchemaRoot back from IPC format from disk
>    - INTO some out-of-scope object not owned by the reader! 
>   
>  I won't put example code here, but refer to my StackOverflow question 
> showing my problem: 
> [StackOverflow|https://stackoverflow.com/q/62938237/2320035]
>   
>  Something about memory ownership is not working as expected for me.
>   
>  No matter what tests (dev repo) or articles (e.g. the second link above) I 
> read, their examples did not help me here, as they all *process* the 
> data read in *within the reader scope* (mostly simple element-wise checks), 
> while I want to read into some *global* object which outlives the 
> reader object (see my code on SO or the second link: printing out read data 
> works as long as the reader is open).
>   
>  The article above also says:
>   
> {code:java}
>  A vector is managed by one allocator. We say that the allocator owns the 
> buffer backing the vector. Vector ownership can be transferred from one 
> allocator to another.
> {code}
>  
>  But how exactly would I populate an empty VectorSchemaRoot (of my class) 
> with whatever I read in, surviving closing the reader? I experimented with 
> VectorLoader and VectorUnloader, including the only call I found which 
> has "ownership" in its docstring (batch.cloneWithTransfer), but no success. 
> And even if that worked, the Java-based RecordBatch 
> [link|https://arrow.apache.org/docs/java/] which would be the one to use for 
> this looks completely different from what Python's looks like 
> [link|https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.htm]).
>  
>   
>  Should I be able to see my problem given the documentation? Is there 
> anything else to read? (I know that there must be something about this in 
> some Flight / Gandiva project code, but I did not find it yet.)
>   
>  Or would it be completely wrong to keep VectorSchemaRoot as the core object 
> to handle all my data? 
>   
>  Feel free to close this issue if you think that the documentation is *not* 
> incomplete.
>   
>  Thanks,
>  Sascha



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-5381) [C++] Crash at arrow::internal::CountSetBits

2021-03-11 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299514#comment-17299514
 ] 

Antoine Pitrou commented on ARROW-5381:
---

[~thamha] No, it isn't fixed. The code hasn't changed that much, though; you 
can still patch these lines:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit_util.h#L25-L29

(this is git master, but you get the idea)
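
The crash signature (and the popcnt_support.png attachment) is consistent with 
an illegal POPCNT instruction on a CPU without SSE4.2, which the reported 
Pentium B950 lacks; patching those lines amounts to forcing a software 
fallback. For illustration only, the classic SWAR population count such a 
fallback uses, sketched here in Rust (bit_util.h itself is C++):

{code}
/// Branch-free software popcount of the kind used as a fallback when the
/// hardware POPCNT instruction is unavailable. Illustration only.
fn popcount64(mut x: u64) -> u32 {
    x -= (x >> 1) & 0x5555_5555_5555_5555;                                // 2-bit sums
    x = (x & 0x3333_3333_3333_3333) + ((x >> 2) & 0x3333_3333_3333_3333); // 4-bit sums
    x = (x + (x >> 4)) & 0x0f0f_0f0f_0f0f_0f0f;                           // 8-bit sums
    (x.wrapping_mul(0x0101_0101_0101_0101) >> 56) as u32                  // sum the bytes
}

fn main() {
    assert_eq!(popcount64(0), 0);
    assert_eq!(popcount64(u64::MAX), 64);
    assert_eq!(popcount64(0b1011), 3);
}
{code}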

> [C++] Crash at arrow::internal::CountSetBits
> 
>
> Key: ARROW-5381
> URL: https://issues.apache.org/jira/browse/ARROW-5381
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
> Environment: Operating System: Windows 7 Professional 64-bit (6.1, 
> Build 7601) Service Pack 1(7601.win7sp1_ldr_escrow.181110-1429)
> Language: English (Regional Setting: English)
> System Manufacturer: SAMSUNG ELECTRONICS CO., LTD.
> System Model: RV420/RV520/RV720/E3530/S3530/E3420/E3520
> BIOS: Phoenix SecureCore-Tiano(tm) NB Version 2.1 05PQ
> Processor: Intel(R) Pentium(R) CPU B950 @ 2.10GHz (2 CPUs), ~2.1GHz
> Memory: 2048MB RAM
> Available OS Memory: 1962MB RAM
>   Page File: 1517MB used, 2405MB available
> Windows Dir: C:\Windows
> DirectX Version: DirectX 11
>Reporter: Tham
>Priority: Major
>  Labels: pull-request-available
> Attachments: bit-util.asm, iMac-late2009.png, popcnt_support.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I've got a lot of crash dumps from a customer's Windows machine. The 
> stack trace shows that it crashed at arrow::internal::CountSetBits.
>  
> {code:java}
> STACK_TEXT:  
> 00c9`5354a4c0 7ff7`2f2830fd : 00c9`544841c0 ` 
> `1e00 ` : 
> CortexService!arrow::internal::CountSetBits+0x16d
> 00c9`5354a550 7ff7`2f2834b7 : 00c9`5337c930 ` 
> ` ` : 
> CortexService!arrow::ArrayData::GetNullCount+0x8d
> 00c9`5354a580 7ff7`2f13df55 : 00c9`54476080 00c9`5354a5d8 
> ` ` : 
> CortexService!arrow::Array::null_count+0x37
> 00c9`5354a5b0 7ff7`2f13fb68 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::Visit >+0xa5
> 00c9`5354a640 7ff7`2f12fa34 : 00c9`5354a6f8 00c9`54476080 
> 00c9`5354ab40 ` : 
> CortexService!arrow::VisitArrayInline namespace'::LevelBuilder>+0x298
> 00c9`5354a680 7ff7`2f14bf03 : 00c9`5354ab40 00c9`5354a6f8 
> 00c9`54476080 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::VisitInline+0x44
> 00c9`5354a6c0 7ff7`2f12fe2a : 00c9`5354ab40 00c9`5354ae18 
> 00c9`54476080 00c9`5354b208 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::LevelBuilder::GenerateLevels+0x93
> 00c9`5354aa00 7ff7`2f14de56 : 00c9`5354b1f8 00c9`5354afc8 
> 00c9`54476080 `1e00 : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x25a
> 00c9`5354af20 7ff7`2f14e66b : 00c9`5354b1f8 00c9`5354b238 
> 00c9`54445c20 ` : 
> CortexService!parquet::arrow::`anonymous 
> namespace'::ArrowColumnWriter::Write+0x2a6
> 00c9`5354b040 7ff7`2f12f137 : 00c9`544041f0 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x70b
> 00c9`5354b400 7ff7`2f14b4d5 : 00c9`54431180 00c9`5354b4d8 
> 00c9`5354b4a8 ` : 
> CortexService!parquet::arrow::FileWriter::WriteColumnChunk+0x67
> 00c9`5354b450 7ff7`2f12eef1 : 00c9`5354b5d8 00c9`5354b648 
> ` `1e00 : 
> CortexService!::operator()+0x195
> 00c9`5354b530 7ff7`2eb8e31e : 00c9`54431180 00c9`5354b760 
> 00c9`54442fb0 `1e00 : 
> CortexService!parquet::arrow::FileWriter::WriteTable+0x521
> 00c9`5354b730 7ff7`2eb58ac5 : 00c9`5307bd88 00c9`54442fb0 
> ` ` : 
> CortexService!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0xfe
> 00c9`5354b860 7ff7`2eafdce6 : 00c9`5307bd80 00c9`5354ba08 
> 00c9`5354b9e0 00c9`5354b9d8 : 
> CortexService!Cortex::Storage::ParquetFileWriter::writeRowGroup+0x545
> 00c9`5354b9a0 7ff7`2eaf8bae : 00c9`53275600 00c9`53077220 
> `fffe ` : 
> CortexService!Cortex::Storage::DataStreamWriteWorker::onNewData+0x1a6
> {code}
> {code:java}
> FAILED_INSTRUCTION_ADDRESS: 
> CortexService!arrow::internal::CountSetBits+16d 
> [c:\jenkins\workspace\cortexv2-dev-win64-service\src\thirdparty\arrow\cpp\src\arrow\util

[jira] [Updated] (ARROW-11750) [Python][Dataset] Add support for project expressions

2021-03-11 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11750:
---
Labels: pull-request-available  (was: )

> [Python][Dataset] Add support for project expressions
> -
>
> Key: ARROW-11750
> URL: https://issues.apache.org/jira/browse/ARROW-11750
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: Ben Kietzman
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-11174 adds support for arbitrary expressions in projected columns. This 
> should be supported in the Python bindings as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-11750) [Python][Dataset] Add support for project expressions

2021-03-11 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-11750:
-

Assignee: Joris Van den Bossche

> [Python][Dataset] Add support for project expressions
> -
>
> Key: ARROW-11750
> URL: https://issues.apache.org/jira/browse/ARROW-11750
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 3.0.0
>Reporter: Ben Kietzman
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 4.0.0
>
>
> ARROW-11174 adds support for arbitrary expressions in projected columns. This 
> should be supported in the Python bindings as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)