[jira] [Updated] (ARROW-11206) [C++][Compute][Python] Rename "project" kernel to "make_struct"

2021-07-15 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-11206:
--
Summary: [C++][Compute][Python] Rename "project" kernel to "make_struct"  
(was: [C++][Dataset][Python] Consider hiding/renaming "project")

> [C++][Compute][Python] Rename "project" kernel to "make_struct"
> ---
>
> Key: ARROW-11206
> URL: https://issues.apache.org/jira/browse/ARROW-11206
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: compute, dataset, pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The "project" compute Function is necessary for ARROW-11174. However, it is
> not intended for direct use outside an Expression ([where the correspondence
> to projection is not immediately
> obvious|https://github.com/apache/arrow/pull/9131#issuecomment-757764173]), so
> it may be preferable to do one or more of the following:
>  * rename the function to "wrap_struct" or similar so it makes sense
> outside Expressions
>  * ensure the function is not exposed to the Python/R bindings except through
> Expressions
>  * remove the function from the default registry



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-11206) [C++][Compute][Python] Rename "project" kernel to "make_struct"

2021-07-15 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-11206.
---
Resolution: Fixed

Issue resolved by pull request 10728
[https://github.com/apache/arrow/pull/10728]

> [C++][Compute][Python] Rename "project" kernel to "make_struct"
> ---
>
> Key: ARROW-11206
> URL: https://issues.apache.org/jira/browse/ARROW-11206
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: compute, dataset, pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The "project" compute Function is necessary for ARROW-11174. However, it is
> not intended for direct use outside an Expression ([where the correspondence
> to projection is not immediately
> obvious|https://github.com/apache/arrow/pull/9131#issuecomment-757764173]), so
> it may be preferable to do one or more of the following:
>  * rename the function to "wrap_struct" or similar so it makes sense
> outside Expressions
>  * ensure the function is not exposed to the Python/R bindings except through
> Expressions
>  * remove the function from the default registry





[jira] [Updated] (ARROW-12513) [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics for dictionary-encoded array with nulls

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12513:
---
Labels: parquet-statistics pull-request-available  (was: parquet-statistics)

> [C++][Parquet] Parquet Writer always puts null_count=0 in Parquet statistics 
> for dictionary-encoded array with nulls
> 
>
> Key: ARROW-12513
> URL: https://issues.apache.org/jira/browse/ARROW-12513
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 1.0.1, 2.0.0, 3.0.0
> Environment: RHEL6
>Reporter: David Beach
>Assignee: Weston Pace
>Priority: Critical
>  Labels: parquet-statistics, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When writing a Table as Parquet, when the table contains columns represented 
> as dictionary-encoded arrays, those columns show an incorrect null_count of 0 
> in the Parquet metadata.  If the same data is saved without 
> dictionary-encoding the array, then the null_count is correct.
> Confirmed bug with PyArrow 1.0.1, 2.0.0, and 3.0.0.
> NOTE: I'm a PyArrow user, but I believe this bug is actually in the C++ 
> implementation of the Arrow/Parquet writer.
> h3. Setup
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet{code}
> h3. Bug
> (writes a dictionary encoded Arrow array to parquet)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> array1dict = array1.dictionary_encode()
> assert array1dict.null_count == 5
> table = pa.Table.from_arrays([array1dict], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 0 (WRONG!){code}
> h3. Correct
> (writes same data without dictionary encoding the Arrow array)
> {code:python}
> array1 = pa.array([None, 'foo', 'bar'] * 5, type=pa.string())
> assert array1.null_count == 5
> table = pa.Table.from_arrays([array1], ["mycol"])
> parquet.write_table(table, "testtable.parquet")
> meta = parquet.read_metadata("testtable.parquet")
> meta.row_group(0).column(0).statistics.null_count  # RESULT: 5 (CORRECT)
> {code}
>  





[jira] [Assigned] (ARROW-12548) [JS] Get rid of columns

2021-07-15 Thread Dominik Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Moritz reassigned ARROW-12548:
--

Assignee: Paul Taylor

> [JS] Get rid of columns
> ---
>
> Key: ARROW-12548
> URL: https://issues.apache.org/jira/browse/ARROW-12548
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Paul Taylor
>Priority: Major
>
> Just use the name Child (as we have for Vectors). 





[jira] [Commented] (ARROW-12270) [JS] remove rxjs and ix dependency or make them lighter

2021-07-15 Thread Dominik Moritz (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381742#comment-17381742
 ] 

Dominik Moritz commented on ARROW-12270:


We updated them in ARROW-13299. 

> [JS] remove rxjs and ix dependency or make them lighter
> ---
>
> Key: ARROW-12270
> URL: https://issues.apache.org/jira/browse/ARROW-12270
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Paul Taylor
>Priority: Minor
>
> We don't use these dependencies extensively so they could be good candidates 
> for being cleaned up to make the dev setup easier to understand for 
> newcomers. 





[jira] [Closed] (ARROW-12270) [JS] remove rxjs and ix dependency or make them lighter

2021-07-15 Thread Dominik Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Moritz closed ARROW-12270.
--
  Assignee: Dominik Moritz  (was: Paul Taylor)
Resolution: Won't Fix

> [JS] remove rxjs and ix dependency or make them lighter
> ---
>
> Key: ARROW-12270
> URL: https://issues.apache.org/jira/browse/ARROW-12270
> Project: Apache Arrow
>  Issue Type: Task
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Dominik Moritz
>Priority: Minor
>
> We don't use these dependencies extensively so they could be good candidates 
> for being cleaned up to make the dev setup easier to understand for 
> newcomers. 





[jira] [Updated] (ARROW-13303) [JS] Revise bundles

2021-07-15 Thread Dominik Moritz (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dominik Moritz updated ARROW-13303:
---
Fix Version/s: 5.0.0

> [JS] Revise bundles
> ---
>
> Key: ARROW-13303
> URL: https://issues.apache.org/jira/browse/ARROW-13303
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: Dominik Moritz
>Assignee: Dominik Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> * Use es2015 sources in the apache-arrow package since webpack 4 does not 
> support esnext and many people still use it
> * Generate .cjs and .mjs files instead of just .js to make it clear what the 
> files are. 





[jira] [Resolved] (ARROW-12955) [C++] Add additional type support for if_else kernel

2021-07-15 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-12955.
--
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10538
[https://github.com/apache/arrow/pull/10538]

> [C++] Add additional type support for if_else kernel
> 
>
> Key: ARROW-12955
> URL: https://issues.apache.org/jira/browse/ARROW-12955
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Niranda Perera
>Assignee: Niranda Perera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> The current if_else kernel supports only fixed-size primitive, temporal,
> boolean, and null types.
> Add support for the following types:
>  * Decimal
>  * Fixed length binary
>  * Variable length binary
>  * Dict, Union





[jira] [Commented] (ARROW-13334) Findzstd.cmake doesnt find zstd on Ubuntu 20.04

2021-07-15 Thread Nick Hortovanyi (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381641#comment-17381641
 ] 

Nick Hortovanyi commented on ARROW-13334:
-

I experimented on another machine last night. That machine had been upgraded
over the years, probably from Ubuntu 16.04 -> Ubuntu 18.04 -> Ubuntu 20.04, and
the cmake-gui configuration messages are a lot more detailed, with pkg-config
messages. However, I didn't complete a full compile (long story: I had
protobufs installed locally which needed upgrading, along with other packages).

This issue is clearly related to a fresh install of Ubuntu 20.04.2 with the
versions of clang, cmake, etc. installed.

I'd also suggest it's pointless looking for Findzstd.cmake, FindRE2.cmake, and
FindC-Ares.cmake in the Arrow cmake modules. They will never exist, as the
package maintainers don't produce them.

So no, I don't think this issue can be closed. The Arrow cmake files just don't
make sense, and/or instructions are required on what to install retroactively.

I don't have the time to work it all out at the moment.

> Findzstd.cmake doesnt find zstd on Ubuntu 20.04
> ---
>
> Key: ARROW-13334
> URL: https://issues.apache.org/jira/browse/ARROW-13334
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.1
> Environment: Ubuntu 20.04.2
>Reporter: Nick Hortovanyi
>Priority: Minor
>
> I'm unable to use the pre-built c++ arrow libraries with a project using a 
> CMakeList.txt containing
> {noformat}
> find_package(Arrow REQUIRED)
> {noformat}
> Giving the following error
> {noformat}
> CMake Error at 
> /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 
> (message):
> Could NOT find zstd (missing: ZSTD_LIB)
> Call Stack (most recent call first):
> /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 
> (_FPHSA_FAILURE_MESSAGE)
> /usr/lib/x86_64-linux-gnu/cmake/arrow/Findzstd.cmake:82 
> (find_package_handle_standard_args)
> /usr/share/cmake-3.16/Modules/CMakeFindDependencyMacro.cmake:47 (find_package)
> /usr/lib/x86_64-linux-gnu/cmake/arrow/ArrowConfig.cmake:96 (find_dependency)
> CMakeLists.txt:12 (find_package)
> {noformat}
> libzstd and libzstd-dev are both installed
>  





[jira] [Commented] (ARROW-13334) Findzstd.cmake doesnt find zstd on Ubuntu 20.04

2021-07-15 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381639#comment-17381639
 ] 

Kouhei Sutou commented on ARROW-13334:
--

It shows that the variable name isn't related.

Are there any more things you want to try on this issue, e.g. {{cmake ...
-DCMAKE_FIND_DEBUG_MODE=ON}}?

Can we close this issue?

> Findzstd.cmake doesnt find zstd on Ubuntu 20.04
> ---
>
> Key: ARROW-13334
> URL: https://issues.apache.org/jira/browse/ARROW-13334
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.1
> Environment: Ubuntu 20.04.2
>Reporter: Nick Hortovanyi
>Priority: Minor
>
> I'm unable to use the pre-built c++ arrow libraries with a project using a 
> CMakeList.txt containing
> {noformat}
> find_package(Arrow REQUIRED)
> {noformat}
> Giving the following error
> {noformat}
> CMake Error at 
> /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 
> (message):
> Could NOT find zstd (missing: ZSTD_LIB)
> Call Stack (most recent call first):
> /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 
> (_FPHSA_FAILURE_MESSAGE)
> /usr/lib/x86_64-linux-gnu/cmake/arrow/Findzstd.cmake:82 
> (find_package_handle_standard_args)
> /usr/share/cmake-3.16/Modules/CMakeFindDependencyMacro.cmake:47 (find_package)
> /usr/lib/x86_64-linux-gnu/cmake/arrow/ArrowConfig.cmake:96 (find_dependency)
> CMakeLists.txt:12 (find_package)
> {noformat}
> libzstd and libzstd-dev are both installed
>  





[jira] [Commented] (ARROW-13348) [C++] Allow timestamp parser to parse offset strings

2021-07-15 Thread Rok Mihevc (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381596#comment-17381596
 ] 

Rok Mihevc commented on ARROW-13348:


Related to ARROW-12820.

> [C++] Allow timestamp parser to parse offset strings
> 
>
> Key: ARROW-13348
> URL: https://issues.apache.org/jira/browse/ARROW-13348
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Priority: Major
>
> Neither the ISO8601 parser nor the strptime parser support parsing timezone 
> offset strings (e.g. {{2017-08-19 12:22:11.802755+00}}).  These are sometimes 
> emitted by other tools (e.g. Postgresql) and should be supported for 
> compatibility.  I think it's acceptable for them to parse to Timestamp(units, 
> "UTC") or Timestamp(units, offset) whichever is more convenient.
> This issue doesn't necessarily require supporting tzdata timezone names (e.g. 
> "America/Denver", etc.) and those technically aren't ISO8601 compliant 
> anyways.
> Support should be added to both the ISO-8601 parser (which should support the
> 00:00, 0000, and 00 offset forms) and the strptime parser (which should add
> support for %z).





[jira] [Resolved] (ARROW-12861) [C++][Compute] Add sign function kernels

2021-07-15 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-12861.
--
Resolution: Fixed

Issue resolved by pull request 10395
[https://github.com/apache/arrow/pull/10395]

> [C++][Compute] Add sign function kernels
> 
>
> Key: ARROW-12861
> URL: https://issues.apache.org/jira/browse/ARROW-12861
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Eduardo Ponce
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Add the sign function to the compute kernels.
> sign(X) =
>  * 1 if X > 0
>  * 0 if X = 0
>  * -1 if X < 0





[jira] [Resolved] (ARROW-13346) [C++] Remove compile time parsing from EnumType

2021-07-15 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-13346.
--
Resolution: Fixed

Issue resolved by pull request 10726
[https://github.com/apache/arrow/pull/10726]

> [C++] Remove compile time parsing from EnumType
> ---
>
> Key: ARROW-13346
> URL: https://issues.apache.org/jira/browse/ARROW-13346
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> EnumType doesn't *need* to parse a string at compile time; remove that logic
> to keep build-time overhead minimal.





[jira] [Updated] (ARROW-13346) [C++] Remove compile time parsing from EnumType

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13346:
---
Labels: pull-request-available  (was: )

> [C++] Remove compile time parsing from EnumType
> ---
>
> Key: ARROW-13346
> URL: https://issues.apache.org/jira/browse/ARROW-13346
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> EnumType doesn't *need* to parse a string at compile time; remove that logic
> to keep build-time overhead minimal.





[jira] [Commented] (ARROW-11206) [C++][Dataset][Python] Consider hiding/renaming "project"

2021-07-15 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381571#comment-17381571
 ] 

Weston Pace commented on ARROW-11206:
-

I think MakeStruct is good and much clearer.  This caused plenty of confusion
for me as well.  I doubt many Python users are using pyarrow.compute.project
anyway, since they probably specify it via the scanner or dataset options.

> [C++][Dataset][Python] Consider hiding/renaming "project"
> -
>
> Key: ARROW-11206
> URL: https://issues.apache.org/jira/browse/ARROW-11206
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: compute, dataset, pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The "project" compute Function is necessary for ARROW-11174. However, it is
> not intended for direct use outside an Expression ([where the correspondence
> to projection is not immediately
> obvious|https://github.com/apache/arrow/pull/9131#issuecomment-757764173]), so
> it may be preferable to do one or more of the following:
>  * rename the function to "wrap_struct" or similar so it makes sense
> outside Expressions
>  * ensure the function is not exposed to the Python/R bindings except through
> Expressions
>  * remove the function from the default registry





[jira] [Updated] (ARROW-11206) [C++][Dataset][Python] Consider hiding/renaming "project"

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-11206:
---
Labels: compute dataset pull-request-available  (was: compute dataset)

> [C++][Dataset][Python] Consider hiding/renaming "project"
> -
>
> Key: ARROW-11206
> URL: https://issues.apache.org/jira/browse/ARROW-11206
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: compute, dataset, pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The "project" compute Function is necessary for ARROW-11174. However, it is
> not intended for direct use outside an Expression ([where the correspondence
> to projection is not immediately
> obvious|https://github.com/apache/arrow/pull/9131#issuecomment-757764173]), so
> it may be preferable to do one or more of the following:
>  * rename the function to "wrap_struct" or similar so it makes sense
> outside Expressions
>  * ensure the function is not exposed to the Python/R bindings except through
> Expressions
>  * remove the function from the default registry





[jira] [Updated] (ARROW-12745) [C++][Compute] Add floor, ceiling, and truncate kernels

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12745:
---
Labels: pull-request-available  (was: )

> [C++][Compute] Add floor, ceiling, and truncate kernels
> ---
>
> Key: ARROW-12745
> URL: https://issues.apache.org/jira/browse/ARROW-12745
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Kernels to round each value in an array of floating point numbers to:
>  * the nearest integer less than or equal to it ({{floor}})
>  * the nearest integer greater than or equal to it ({{ceiling}})
>  * the integral part without its fraction digits ({{truncate}})
> Should return an array of the same type as the input (not an integer type).





[jira] [Created] (ARROW-13348) [C++] Allow timestamp parser to parse offset strings

2021-07-15 Thread Weston Pace (Jira)
Weston Pace created ARROW-13348:
---

 Summary: [C++] Allow timestamp parser to parse offset strings
 Key: ARROW-13348
 URL: https://issues.apache.org/jira/browse/ARROW-13348
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


Neither the ISO8601 parser nor the strptime parser support parsing timezone 
offset strings (e.g. {{2017-08-19 12:22:11.802755+00}}).  These are sometimes 
emitted by other tools (e.g. Postgresql) and should be supported for 
compatibility.  I think it's acceptable for them to parse to Timestamp(units, 
"UTC") or Timestamp(units, offset) whichever is more convenient.

This issue doesn't necessarily require supporting tzdata timezone names (e.g. 
"America/Denver", etc.) and those technically aren't ISO8601 compliant anyways.

Support should be added to both the ISO-8601 parser (which should support the
00:00, 0000, and 00 offset forms) and the strptime parser (which should add
support for %z).
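The requested %z behavior can be sketched with the Python stdlib parser, which already consumes an offset suffix (a reference for the desired semantics, not Arrow code):

```python
from datetime import datetime

# %z consumes a trailing UTC-offset string such as "+0000",
# yielding an aware datetime whose offset is preserved.
dt = datetime.strptime("2017-08-19 12:22:11.802755+0000",
                       "%Y-%m-%d %H:%M:%S.%f%z")
```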





[jira] [Resolved] (ARROW-13064) [C++] Add a general "if, ifelse, ..., else" kernel ("CASE WHEN")

2021-07-15 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman resolved ARROW-13064.
--
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10557
[https://github.com/apache/arrow/pull/10557]

> [C++] Add a general "if, ifelse, ..., else" kernel ("CASE WHEN")
> 
>
> Key: ARROW-13064
> URL: https://issues.apache.org/jira/browse/ARROW-13064
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 9h
>  Remaining Estimate: 0h
>
> ARROW-10640 added a ternary {{if_else}} kernel. Add another kernel that 
> extends this concept to an arbitrary number of conditions and associated 
> results, like a vectorized {{if-ifelse-...-else}} with an arbitrary number of 
> {{ifelse}} and with the {{else}} optional. This is like a SQL {{CASE}} 
> statement.
> How best to achieve this is not obvious. To enable SQL-style uses, it would 
> be most efficient to implement this as a variadic kernel where the 
> even-number arguments (0, 2, ...) are the arrays of boolean conditions, the 
> odd-number arguments (1, 3, ...) are the corresponding arrays of results, and 
> the final argument is the {{else}} result. But I'm not sure if this is 
> practical. Maybe instead we should implement this to operate on listarrays, 
> like NumPy's 
> {{[np.where|https://numpy.org/doc/stable/reference/generated/numpy.where.html]}}
>  or 
> {{[np.select|https://numpy.org/doc/stable/reference/generated/numpy.select.html]}}.
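The np.select shape referenced above can be sketched as follows: a list of boolean condition arrays, a parallel list of result arrays, and a default value playing the role of else:

```python
import numpy as np

# Conditions are checked in order; the first true one wins per element.
conds = [np.array([True, False, False]),
         np.array([False, True, False])]
choices = [np.array([1, 1, 1]),
           np.array([2, 2, 2])]

# Elements matching no condition fall through to the default ("else").
out = np.select(conds, choices, default=0)
```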





[jira] [Commented] (ARROW-12688) [R] Use DuckDB to query an Arrow Dataset

2021-07-15 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381548#comment-17381548
 ] 

Jonathan Keane commented on ARROW-12688:


Or probably `compute()` would be better for this (cf. ARROW-11754 and
ARROW-12282).

> [R] Use DuckDB to query an Arrow Dataset
> 
>
> Key: ARROW-12688
> URL: https://issues.apache.org/jira/browse/ARROW-12688
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Neal Richardson
>Assignee: Jonathan Keane
>Priority: Major
>
> DuckDB can read data from an Arrow C-interface stream. Once we can provide 
> that struct from R, presumably DuckDB could query on that stream. 
> A first step is just connecting the pieces. A second step would be to handle 
> parts of the DuckDB query and push down filtering/projection to Arrow. 
> We need a function something like this:
> {code}
> #' Run a DuckDB query on Arrow data
> #'
> #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or 
> #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
> #' @return A `duckdb::duckdb_connection`
> to_duckdb <- function(.data) {
>   # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a 
> RecordBatchReader 
>   reader <- Scanner$create(.data)$ToRecordBatchReader()
>   # ARROW-12689: [R] Implement ArrowArrayStream C interface
>   stream_ptr <- allocate_arrow_array_stream()
>   on.exit(delete_arrow_array_stream(stream_ptr))
>   ExportRecordBatchReader(x, stream_ptr)
>   # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
>   duckdb::duck_connection_from_arrow_stream(stream_ptr)
> }
> {code}
> Assuming this existed, we could do something like (a variation of 
> https://arrow.apache.org/docs/r/articles/dataset.html):
> {code}
> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
> ds %>%
>   filter(total_amount > 100, year == 2015) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = 100 * tip_amount / total_amount) %>%
>   to_duckdb() %>%
>   group_by(passenger_count) %>%
>   summarise(
> median_tip_pct = median(tip_pct),
> n = n()
>   )
> {code}
> and duckdb would do the aggregation while the data reading, predicate 
> pushdown, filtering, and projection would happen in Arrow. Or you could do 
> {{dbGetQuery(ds, "SOME SQL")}} and that would evaluate on arrow data. 





[jira] [Updated] (ARROW-11206) [C++][Dataset][Python] Consider hiding/renaming "project"

2021-07-15 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-11206:
-
Fix Version/s: (was: 6.0.0)
   5.0.0

> [C++][Dataset][Python] Consider hiding/renaming "project"
> -
>
> Key: ARROW-11206
> URL: https://issues.apache.org/jira/browse/ARROW-11206
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: compute, dataset
> Fix For: 5.0.0
>
>
> The "project" compute Function is necessary for ARROW-11174. However, it is
> not intended for direct use outside an Expression ([where the correspondence
> to projection is not immediately
> obvious|https://github.com/apache/arrow/pull/9131#issuecomment-757764173]), so
> it may be preferable to do one or more of the following:
>  * rename the function to "wrap_struct" or similar so it makes sense
> outside Expressions
>  * ensure the function is not exposed to the Python/R bindings except through
> Expressions
>  * remove the function from the default registry





[jira] [Commented] (ARROW-12688) [R] Use DuckDB to query an Arrow Dataset

2021-07-15 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381546#comment-17381546
 ] 

Jonathan Keane commented on ARROW-12688:


Hmm, I wonder if we (also?) want a `collect(..., .to = c("arrow", "duckdb"))`
that returns either a dataframe/arrow table or a duckdb-based `tbl` reference,
respectively, such that pipelines like the following work:

{code}
ds %>%
  select(...) %>%
  filter(...) %>%
  mutate(...) %>% 
  collect(.to = "duckdb") %>% 
  group_by(...) %>% 
  summarise(...) %>% 
  collect()
{code}




> [R] Use DuckDB to query an Arrow Dataset
> 
>
> Key: ARROW-12688
> URL: https://issues.apache.org/jira/browse/ARROW-12688
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Neal Richardson
>Assignee: Jonathan Keane
>Priority: Major
>
> DuckDB can read data from an Arrow C-interface stream. Once we can provide 
> that struct from R, presumably DuckDB could query on that stream. 
> A first step is just connecting the pieces. A second step would be to handle 
> parts of the DuckDB query and push down filtering/projection to Arrow. 
> We need a function something like this:
> {code}
> #' Run a DuckDB query on Arrow data
> #'
> #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or 
> #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
> #' @return A `duckdb::duckdb_connection`
> to_duckdb <- function(.data) {
>   # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a 
> RecordBatchReader 
>   reader <- Scanner$create(.data)$ToRecordBatchReader()
>   # ARROW-12689: [R] Implement ArrowArrayStream C interface
>   stream_ptr <- allocate_arrow_array_stream()
>   on.exit(delete_arrow_array_stream(stream_ptr))
>   ExportRecordBatchReader(x, stream_ptr)
>   # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
>   duckdb::duck_connection_from_arrow_stream(stream_ptr)
> }
> {code}
> Assuming this existed, we could do something like (a variation of 
> https://arrow.apache.org/docs/r/articles/dataset.html):
> {code}
> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
> ds %>%
>   filter(total_amount > 100, year == 2015) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = 100 * tip_amount / total_amount) %>%
>   to_duckdb() %>%
>   group_by(passenger_count) %>%
>   summarise(
> median_tip_pct = median(tip_pct),
> n = n()
>   )
> {code}
> and duckdb would do the aggregation while the data reading, predicate 
> pushdown, filtering, and projection would happen in Arrow. Or you could do 
> {{dbGetQuery(ds, "SOME SQL")}} and that would evaluate on arrow data. 





[jira] [Comment Edited] (ARROW-12688) [R] Use DuckDB to query an Arrow Dataset

2021-07-15 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381539#comment-17381539
 ] 

Neal Richardson edited comment on ARROW-12688 at 7/15/21, 6:39 PM:
---

Building from the code at 
https://github.com/pdet/duckdb-benchmark/blob/master/arrow/group_by_with_duckdb.R,
 I've worked up a slightly different interface, something we could add to the 
arrow package (adding duckdb and DBI to Suggests):

{code}

summarise.arrow_dplyr_query <- function(.data, ..., .engine = c("arrow", 
"duckdb")) {
  if (match.arg(.engine) == "duckdb") {
summarize_duck(.data, ...)
  } else {
# Continue with the current contents of summarise.arrow_dplyr_query
# ...
  }
}

summarise_duck <- function(.data, ...) {
  # TODO better translation of aggregate functions, parse tree traversal
  aggregates <- vapply(enquos(...), rlang::quo_name, "character")
  tbl_name <- paste0(replicate(10, sample(LETTERS, 1, TRUE)), collapse = "")

  con <- arrow_duck_connection()
  duckdb::duckdb_register_arrow(con, tbl_name, .data$data)
  on.exit(duckdb::duckdb_unregister_arrow(con, tbl_name))

  groups_str <- paste(.data$groups, collapse = ", ")
  aggr_str <- paste(aggregates, collapse = ", ")
  # TODO use relational API instead of SQL string construction
  DBI::dbGetQuery(con, sprintf("SELECT %s, %s FROM %s GROUP BY %s", 
groups_str, aggr_str, tbl_name, groups_str ))
}

arrow_duck_connection <- function() {
  con <- getOption("arrow_duck_con")
  if (is.null(con)) {
con <- dbConnect(duckdb::duckdb())
# Use the same CPU count that the arrow library is set to
DBI::dbExecute(con, paste0("PRAGMA threads=", cpu_count()))
options(arrow_duck_con = con)
  }
  con
}
{code}

Thoughts?


was (Author: npr):
Building from the code at 
https://github.com/pdet/duckdb-benchmark/blob/master/arrow/group_by_with_duckdb.R,
 I've worked up a slightly different interface, something we could add to the 
arrow package (adding duckdb and DBI to Suggests):

{code}

summarise.arrow_dplyr_query <- function(.data, ..., engine = c("arrow", 
"duckdb")) {
  if (match.arg(engine) == "duckdb") {
summarize_duck(.data, ...)
  } else {
# Continue with the current contents of summarise.arrow_dplyr_query
# ...
  }
}

summarise_duck <- function(.data, ...) {
  # TODO better translation of aggregate functions, parse tree traversal
  aggregates <- vapply(enquos(...), rlang::quo_name, "character")
  tbl_name <- paste0(replicate(10, sample(LETTERS, 1, TRUE)), collapse = "")

  con <- arrow_duck_connection()
  duckdb::duckdb_register_arrow(con, tbl_name, .data$data)
  on.exit(duckdb::duckdb_unregister_arrow(con, tbl_name))

  groups_str <- paste(.data$groups, collapse = ", ")
  aggr_str <- paste(aggregates, collapse = ", ")
  # TODO use relational API instead of SQL string construction
  DBI::dbGetQuery(con, sprintf("SELECT %s, %s FROM %s GROUP BY %s", 
groups_str, aggr_str, tbl_name, groups_str ))
}

arrow_duck_connection <- function() {
  con <- getOption("arrow_duck_con")
  if (is.null(con)) {
con <- dbConnect(duckdb::duckdb())
# Use the same CPU count that the arrow library is set to
DBI::dbExecute(con, paste0("PRAGMA threads=", cpu_count()))
options(arrow_duck_con = con)
  }
  con
}
{code}

Thoughts?

> [R] Use DuckDB to query an Arrow Dataset
> 
>
> Key: ARROW-12688
> URL: https://issues.apache.org/jira/browse/ARROW-12688
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Neal Richardson
>Assignee: Jonathan Keane
>Priority: Major
>
> DuckDB can read data from an Arrow C-interface stream. Once we can provide 
> that struct from R, presumably DuckDB could query on that stream. 
> A first step is just connecting the pieces. A second step would be to handle 
> parts of the DuckDB query and push down filtering/projection to Arrow. 
> We need a function something like this:
> {code}
> #' Run a DuckDB query on Arrow data
> #'
> #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or 
> #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
> #' @return A `duckdb::duckdb_connection`
> to_duckdb <- function(.data) {
>   # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a 
> RecordBatchReader 
>   reader <- Scanner$create(.data)$ToRecordBatchReader()
>   # ARROW-12689: [R] Implement ArrowArrayStream C interface
>   stream_ptr <- allocate_arrow_array_stream()
>   on.exit(delete_arrow_array_stream(stream_ptr))
>   ExportRecordBatchReader(reader, stream_ptr)
>   # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
>   duckdb::duck_connection_from_arrow_stream(stream_ptr)
> }
> {code}
> Assuming this existed, we could do something like (a variation of 
> 

[jira] [Commented] (ARROW-12688) [R] Use DuckDB to query an Arrow Dataset

2021-07-15 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381539#comment-17381539
 ] 

Neal Richardson commented on ARROW-12688:
-

Building from the code at 
https://github.com/pdet/duckdb-benchmark/blob/master/arrow/group_by_with_duckdb.R,
 I've worked up a slightly different interface, something we could add to the 
arrow package (adding duckdb and DBI to Suggests):

{code}

summarise.arrow_dplyr_query <- function(.data, ..., engine = c("arrow", 
"duckdb")) {
  if (match.arg(engine) == "duckdb") {
summarize_duck(.data, ...)
  } else {
# Continue with the current contents of summarise.arrow_dplyr_query
# ...
  }
}

summarise_duck <- function(.data, ...) {
  # TODO better translation of aggregate functions, parse tree traversal
  aggregates <- vapply(enquos(...), rlang::quo_name, "character")
  tbl_name <- paste0(replicate(10, sample(LETTERS, 1, TRUE)), collapse = "")

  con <- arrow_duck_connection()
  duckdb::duckdb_register_arrow(con, tbl_name, .data$data)
  on.exit(duckdb::duckdb_unregister_arrow(con, tbl_name))

  groups_str <- paste(.data$groups, collapse = ", ")
  aggr_str <- paste(aggregates, collapse = ", ")
  # TODO use relational API instead of SQL string construction
  DBI::dbGetQuery(con, sprintf("SELECT %s, %s FROM %s GROUP BY %s", 
groups_str, aggr_str, tbl_name, groups_str ))
}

arrow_duck_connection <- function() {
  con <- getOption("arrow_duck_con")
  if (is.null(con)) {
con <- dbConnect(duckdb::duckdb())
# Use the same CPU count that the arrow library is set to
DBI::dbExecute(con, paste0("PRAGMA threads=", cpu_count()))
options(arrow_duck_con = con)
  }
  con
}
{code}

Thoughts?
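For what it's worth, the SQL string assembled in {{summarise_duck}} is simple 
enough to sketch outside R. A pure-Python equivalent of the {{sprintf}} call 
(helper name hypothetical), handy for eyeballing the generated query:

```python
def build_group_by_sql(table, groups, aggregates):
    # Mirror sprintf("SELECT %s, %s FROM %s GROUP BY %s", ...):
    # grouping columns come first, then the aggregate expressions,
    # and the same grouping list is reused in the GROUP BY clause.
    groups_str = ", ".join(groups)
    aggr_str = ", ".join(aggregates)
    return f"SELECT {groups_str}, {aggr_str} FROM {table} GROUP BY {groups_str}"

print(build_group_by_sql("tbl", ["passenger_count"],
                         ["median(tip_pct)", "count(*)"]))
```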

> [R] Use DuckDB to query an Arrow Dataset
> 
>
> Key: ARROW-12688
> URL: https://issues.apache.org/jira/browse/ARROW-12688
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, R
>Reporter: Neal Richardson
>Assignee: Jonathan Keane
>Priority: Major
>
> DuckDB can read data from an Arrow C-interface stream. Once we can provide 
> that struct from R, presumably DuckDB could query on that stream. 
> A first step is just connecting the pieces. A second step would be to handle 
> parts of the DuckDB query and push down filtering/projection to Arrow. 
> We need a function something like this:
> {code}
> #' Run a DuckDB query on Arrow data
> #'
> #' @param .data An `arrow` data object: `Dataset`, `Table`, `RecordBatch`, or 
> #' an `arrow_dplyr_query` containing filter/mutate/etc. expressions
> #' @return A `duckdb::duckdb_connection`
> to_duckdb <- function(.data) {
>   # ARROW-12687: [C++][Python][Dataset] Convert Scanner into a 
> RecordBatchReader 
>   reader <- Scanner$create(.data)$ToRecordBatchReader()
>   # ARROW-12689: [R] Implement ArrowArrayStream C interface
>   stream_ptr <- allocate_arrow_array_stream()
>   on.exit(delete_arrow_array_stream(stream_ptr))
>   ExportRecordBatchReader(reader, stream_ptr)
>   # TODO: DuckDB method to create table/connection from ArrowArrayStream ptr
>   duckdb::duck_connection_from_arrow_stream(stream_ptr)
> }
> {code}
> Assuming this existed, we could do something like (a variation of 
> https://arrow.apache.org/docs/r/articles/dataset.html):
> {code}
> ds <- open_dataset("nyc-taxi", partitioning = c("year", "month"))
> ds %>%
>   filter(total_amount > 100, year == 2015) %>%
>   select(tip_amount, total_amount, passenger_count) %>%
>   mutate(tip_pct = 100 * tip_amount / total_amount) %>%
>   to_duckdb() %>%
>   group_by(passenger_count) %>%
>   summarise(
> median_tip_pct = median(tip_pct),
> n = n()
>   )
> {code}
> and duckdb would do the aggregation while the data reading, predicate 
> pushdown, filtering, and projection would happen in Arrow. Or you could do 
> {{dbGetQuery(ds, "SOME SQL")}} and that would evaluate on arrow data. 





[jira] [Created] (ARROW-13347) [C++][Compute] Add option to return null values for nonexistent and ambiguous times

2021-07-15 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-13347:
--

 Summary: [C++][Compute] Add option to return null values for 
nonexistent and ambiguous times
 Key: ARROW-13347
 URL: https://issues.apache.org/jira/browse/ARROW-13347
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Rok Mihevc


ARROW-13033 implements the TzLocalize kernel, which can handle ambiguous and 
nonexistent times by raising an error or by shifting backwards or forwards to 
the first valid local time.
However, we might want to be able to return null in these cases.
We could implement new flags to do so:
{{compute::TemporalLocalizationOptions::Nonexistent::NONEXISTENT_IGNORE}} and 
{{compute::TemporalLocalizationOptions::Ambiguous::AMBIGUOUS_IGNORE}}
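As a sketch of the proposed semantics (names are illustrative, not the actual 
Arrow C++ API): an ambiguous local time maps to several candidate instants, and 
the new IGNORE policy would yield null instead of raising or picking one:

```python
from enum import Enum

class Ambiguous(Enum):
    RAISE = 0      # current behavior: error out
    EARLIEST = 1   # shift backwards to the first candidate
    LATEST = 2     # shift forwards to the last candidate
    IGNORE = 3     # proposed: produce a null value

def resolve_ambiguous(candidates, policy):
    # candidates: possible UTC instants for one local wall-clock time,
    # in chronological order
    if len(candidates) == 1:
        return candidates[0]
    if policy is Ambiguous.EARLIEST:
        return candidates[0]
    if policy is Ambiguous.LATEST:
        return candidates[-1]
    if policy is Ambiguous.IGNORE:
        return None  # null result, as this ticket proposes
    raise ValueError("ambiguous local time")
```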





[jira] [Assigned] (ARROW-12992) [R] bindings for substr(), substring(), str_sub()

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-12992:
---

Assignee: Nic Crane  (was: Mauricio 'Pachá' Vargas Sepúlveda)

> [R] bindings for substr(), substring(), str_sub()
> -
>
> Key: ARROW-12992
> URL: https://issues.apache.org/jira/browse/ARROW-12992
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Nic Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 4.5h
>  Remaining Estimate: 0h
>
> Followup to ARROW-10557, which implemented the C++ kernel.
> Current state:
> {code:r}
> library(arrow)
> library(dplyr)
> library(stringr)
> # get animal products, year 2019
> open_dataset(
>   "../cepii-datasets-arrow/parquet/baci_hs92",
>   partitioning = c("year", "reporter_iso")
> ) %>% 
>   filter(
> year == 2019,
> str_sub(product_code, 1, 2) == "01"
>   ) %>% 
>   collect()
> Error: Filter expression not supported for Arrow Datasets: 
> str_sub(product_code, 1, 2) == "01"
> Call collect() first to pull data into R.
> {code}
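For reference, {{stringr::str_sub()}} uses 1-based inclusive indices, with 
negative values counting back from the end of the string; that is what the 
binding has to translate to Arrow's 0-based slicing. A rough Python model of 
those semantics:

```python
def str_sub(s, start, end):
    # stringr-style substring: 1-based, inclusive on both ends;
    # negative indices count back from the end of the string
    n = len(s)
    if start < 0:
        start = n + start + 1
    if end < 0:
        end = n + end + 1
    return s[start - 1:end]
```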





[jira] [Resolved] (ARROW-12992) [R] bindings for substr(), substring(), str_sub()

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-12992.
-
Resolution: Fixed

Issue resolved by pull request 10624
[https://github.com/apache/arrow/pull/10624]

> [R] bindings for substr(), substring(), str_sub()
> -
>
> Key: ARROW-12992
> URL: https://issues.apache.org/jira/browse/ARROW-12992
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Followup to ARROW-10557, which implemented the C++ kernel.
> Current state:
> {code:r}
> library(arrow)
> library(dplyr)
> library(stringr)
> # get animal products, year 2019
> open_dataset(
>   "../cepii-datasets-arrow/parquet/baci_hs92",
>   partitioning = c("year", "reporter_iso")
> ) %>% 
>   filter(
> year == 2019,
> str_sub(product_code, 1, 2) == "01"
>   ) %>% 
>   collect()
> Error: Filter expression not supported for Arrow Datasets: 
> str_sub(product_code, 1, 2) == "01"
> Call collect() first to pull data into R.
> {code}





[jira] [Resolved] (ARROW-13280) [R] Bindings for log and trig functions

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13280.
-
Resolution: Fixed

Issue resolved by pull request 10689
[https://github.com/apache/arrow/pull/10689]

> [R] Bindings for log and trig functions
> ---
>
> Key: ARROW-13280
> URL: https://issues.apache.org/jira/browse/ARROW-13280
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Nic Crane
>Assignee: Nic Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-13195) [R] Problem with rlang reverse dependency checks

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13195.
-
Resolution: Not A Problem

> [R] Problem with rlang reverse dependency checks
> 
>
> Key: ARROW-13195
> URL: https://issues.apache.org/jira/browse/ARROW-13195
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Nic Crane
>Priority: Major
> Fix For: 5.0.0
>
>
> From: https://github.com/r-lib/rlang/blob/master/revdep/problems.md#arrow
>  
> arrow
> 
> * Version: 4.0.1
> * GitHub: https://github.com/apache/arrow
> * Source code: https://github.com/cran/arrow
> * Date/Publication: 2021-05-28 09:50:02 UTC
> * Number of recursive dependencies: 61
> Run `cloud_details(, "arrow")` for more info
> 
> ## Newly broken
> * checking tests ... ERROR
>  ```
>  Running ‘testthat.R’
>  Running the tests in ‘tests/testthat.R’ failed.
>  Last 13 lines of output:
>  1. ├─arrow:::expect_dplyr_equal(...) test-dplyr.R:96:2
>  2. │ └─rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl helper-expectation.R:79:4
>  3. ├─input %>% group_by(chr) %>% select() %>% collect()
>  4. ├─dplyr::collect(.)
>  5. └─arrow:::collect.arrow_dplyr_query(.)
>  6. └─arrow:::ensure_group_vars(x)
>  7. ├─arrow:::make_field_refs(gv, dataset = query_on_dataset(.data))
>  8. └─arrow:::query_on_dataset(.data)
>  9. ├─x$.data
>  10. └─rlang:::`$.rlang_fake_data_pronoun`(x, ".data")
>  11. └─rlang:::stop_fake_data_subset()
>  
>  [ FAIL 2 | WARN 0 | SKIP 60 | PASS 3778 ]
>  Error: Test failures
>  Execution halted
>  ```
> ## In both
>  * checking installed package size ... NOTE
>  ```
>  installed size is 58.8Mb
>  sub-directories of 1Mb or more:
>  R 3.6Mb
>  libs 54.5Mb
>  ```
>  





[jira] [Created] (ARROW-13346) [C++] Remove compile time parsing from EnumType

2021-07-15 Thread Ben Kietzman (Jira)
Ben Kietzman created ARROW-13346:


 Summary: [C++] Remove compile time parsing from EnumType
 Key: ARROW-13346
 URL: https://issues.apache.org/jira/browse/ARROW-13346
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Ben Kietzman
Assignee: Ben Kietzman
 Fix For: 5.0.0


EnumType doesn't *need* to parse a string at compile time; remove that logic to 
keep build-time overhead minimal.





[jira] [Updated] (ARROW-9056) [C++] Aggregation methods for Scalars?

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-9056:
--
Labels: pull-request-available  (was: )

> [C++] Aggregation methods for Scalars?
> --
>
> Key: ARROW-9056
> URL: https://issues.apache.org/jira/browse/ARROW-9056
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> See discussion on https://github.com/apache/arrow/pull/7308. Many/most would 
> no-op (sum, mean, min, max), but maybe they should exist and not error? Maybe 
> they're not needed, but I could see how you might invoke a function on the 
> result of a previous aggregation or something.





[jira] [Comment Edited] (ARROW-13195) [R] Problem with rlang reverse dependency checks

2021-07-15 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381458#comment-17381458
 ] 

Nic Crane edited comment on ARROW-13195 at 7/15/21, 4:44 PM:
-

Just tested this locally with the CRAN versions of these packages and didn't get 
the same issue:

arrow 4.0.1
 rlang 0.4.11
 dplyr 1.0.7

However, when I tried it again with dev rlang, I could reproduce it, so looks 
like it's still a problem.

On dev Arrow, it's not an issue.


was (Author: thisisnic):
Just tested this locally with the CRAN versions of these packages and didn't get 
the same issue:

arrow 4.0.1
rlang 0.4.11
dplyr 1.0.7

However, when I tried it again with dev rlang, I could reproduce it, so looks 
like it's still a problem.

> [R] Problem with rlang reverse dependency checks
> 
>
> Key: ARROW-13195
> URL: https://issues.apache.org/jira/browse/ARROW-13195
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Nic Crane
>Priority: Major
> Fix For: 5.0.0
>
>
> From: https://github.com/r-lib/rlang/blob/master/revdep/problems.md#arrow
>  
> arrow
> 
> * Version: 4.0.1
> * GitHub: https://github.com/apache/arrow
> * Source code: https://github.com/cran/arrow
> * Date/Publication: 2021-05-28 09:50:02 UTC
> * Number of recursive dependencies: 61
> Run `cloud_details(, "arrow")` for more info
> 
> ## Newly broken
> * checking tests ... ERROR
>  ```
>  Running ‘testthat.R’
>  Running the tests in ‘tests/testthat.R’ failed.
>  Last 13 lines of output:
>  1. ├─arrow:::expect_dplyr_equal(...) test-dplyr.R:96:2
>  2. │ └─rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl helper-expectation.R:79:4
>  3. ├─input %>% group_by(chr) %>% select() %>% collect()
>  4. ├─dplyr::collect(.)
>  5. └─arrow:::collect.arrow_dplyr_query(.)
>  6. └─arrow:::ensure_group_vars(x)
>  7. ├─arrow:::make_field_refs(gv, dataset = query_on_dataset(.data))
>  8. └─arrow:::query_on_dataset(.data)
>  9. ├─x$.data
>  10. └─rlang:::`$.rlang_fake_data_pronoun`(x, ".data")
>  11. └─rlang:::stop_fake_data_subset()
>  
>  [ FAIL 2 | WARN 0 | SKIP 60 | PASS 3778 ]
>  Error: Test failures
>  Execution halted
>  ```
> ## In both
>  * checking installed package size ... NOTE
>  ```
>  installed size is 58.8Mb
>  sub-directories of 1Mb or more:
>  R 3.6Mb
>  libs 54.5Mb
>  ```
>  





[jira] [Commented] (ARROW-13195) [R] Problem with rlang reverse dependency checks

2021-07-15 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381461#comment-17381461
 ] 

Nic Crane commented on ARROW-13195:
---

(sorry, updated my comment after you posted yours! - only an issue on CRAN 
arrow)

> [R] Problem with rlang reverse dependency checks
> 
>
> Key: ARROW-13195
> URL: https://issues.apache.org/jira/browse/ARROW-13195
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Nic Crane
>Priority: Major
> Fix For: 5.0.0
>
>
> From: https://github.com/r-lib/rlang/blob/master/revdep/problems.md#arrow
>  
> arrow
> 
> * Version: 4.0.1
> * GitHub: https://github.com/apache/arrow
> * Source code: https://github.com/cran/arrow
> * Date/Publication: 2021-05-28 09:50:02 UTC
> * Number of recursive dependencies: 61
> Run `cloud_details(, "arrow")` for more info
> 
> ## Newly broken
> * checking tests ... ERROR
>  ```
>  Running ‘testthat.R’
>  Running the tests in ‘tests/testthat.R’ failed.
>  Last 13 lines of output:
>  1. ├─arrow:::expect_dplyr_equal(...) test-dplyr.R:96:2
>  2. │ └─rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl helper-expectation.R:79:4
>  3. ├─input %>% group_by(chr) %>% select() %>% collect()
>  4. ├─dplyr::collect(.)
>  5. └─arrow:::collect.arrow_dplyr_query(.)
>  6. └─arrow:::ensure_group_vars(x)
>  7. ├─arrow:::make_field_refs(gv, dataset = query_on_dataset(.data))
>  8. └─arrow:::query_on_dataset(.data)
>  9. ├─x$.data
>  10. └─rlang:::`$.rlang_fake_data_pronoun`(x, ".data")
>  11. └─rlang:::stop_fake_data_subset()
>  
>  [ FAIL 2 | WARN 0 | SKIP 60 | PASS 3778 ]
>  Error: Test failures
>  Execution halted
>  ```
> ## In both
>  * checking installed package size ... NOTE
>  ```
>  installed size is 58.8Mb
>  sub-directories of 1Mb or more:
>  R 3.6Mb
>  libs 54.5Mb
>  ```
>  





[jira] [Commented] (ARROW-13195) [R] Problem with rlang reverse dependency checks

2021-07-15 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381459#comment-17381459
 ] 

Neal Richardson commented on ARROW-13195:
-

dev rlang and dev arrow?

> [R] Problem with rlang reverse dependency checks
> 
>
> Key: ARROW-13195
> URL: https://issues.apache.org/jira/browse/ARROW-13195
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Nic Crane
>Priority: Major
> Fix For: 5.0.0
>
>
> From: https://github.com/r-lib/rlang/blob/master/revdep/problems.md#arrow
>  
> arrow
> 
> * Version: 4.0.1
> * GitHub: https://github.com/apache/arrow
> * Source code: https://github.com/cran/arrow
> * Date/Publication: 2021-05-28 09:50:02 UTC
> * Number of recursive dependencies: 61
> Run `cloud_details(, "arrow")` for more info
> 
> ## Newly broken
> * checking tests ... ERROR
>  ```
>  Running ‘testthat.R’
>  Running the tests in ‘tests/testthat.R’ failed.
>  Last 13 lines of output:
>  1. ├─arrow:::expect_dplyr_equal(...) test-dplyr.R:96:2
>  2. │ └─rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl helper-expectation.R:79:4
>  3. ├─input %>% group_by(chr) %>% select() %>% collect()
>  4. ├─dplyr::collect(.)
>  5. └─arrow:::collect.arrow_dplyr_query(.)
>  6. └─arrow:::ensure_group_vars(x)
>  7. ├─arrow:::make_field_refs(gv, dataset = query_on_dataset(.data))
>  8. └─arrow:::query_on_dataset(.data)
>  9. ├─x$.data
>  10. └─rlang:::`$.rlang_fake_data_pronoun`(x, ".data")
>  11. └─rlang:::stop_fake_data_subset()
>  
>  [ FAIL 2 | WARN 0 | SKIP 60 | PASS 3778 ]
>  Error: Test failures
>  Execution halted
>  ```
> ## In both
>  * checking installed package size ... NOTE
>  ```
>  installed size is 58.8Mb
>  sub-directories of 1Mb or more:
>  R 3.6Mb
>  libs 54.5Mb
>  ```
>  





[jira] [Commented] (ARROW-13195) [R] Problem with rlang reverse dependency checks

2021-07-15 Thread Nic Crane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381458#comment-17381458
 ] 

Nic Crane commented on ARROW-13195:
---

Just tested this locally with the CRAN versions of these packages and didn't get 
the same issue:

arrow 4.0.1
rlang 0.4.11
dplyr 1.0.7

However, when I tried it again with dev rlang, I could reproduce it, so looks 
like it's still a problem.

> [R] Problem with rlang reverse dependency checks
> 
>
> Key: ARROW-13195
> URL: https://issues.apache.org/jira/browse/ARROW-13195
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Romain Francois
>Assignee: Nic Crane
>Priority: Major
> Fix For: 5.0.0
>
>
> From: https://github.com/r-lib/rlang/blob/master/revdep/problems.md#arrow
>  
> arrow
> 
> * Version: 4.0.1
> * GitHub: https://github.com/apache/arrow
> * Source code: https://github.com/cran/arrow
> * Date/Publication: 2021-05-28 09:50:02 UTC
> * Number of recursive dependencies: 61
> Run `cloud_details(, "arrow")` for more info
> 
> ## Newly broken
> * checking tests ... ERROR
>  ```
>  Running ‘testthat.R’
>  Running the tests in ‘tests/testthat.R’ failed.
>  Last 13 lines of output:
>  1. ├─arrow:::expect_dplyr_equal(...) test-dplyr.R:96:2
>  2. │ └─rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl helper-expectation.R:79:4
>  3. ├─input %>% group_by(chr) %>% select() %>% collect()
>  4. ├─dplyr::collect(.)
>  5. └─arrow:::collect.arrow_dplyr_query(.)
>  6. └─arrow:::ensure_group_vars(x)
>  7. ├─arrow:::make_field_refs(gv, dataset = query_on_dataset(.data))
>  8. └─arrow:::query_on_dataset(.data)
>  9. ├─x$.data
>  10. └─rlang:::`$.rlang_fake_data_pronoun`(x, ".data")
>  11. └─rlang:::stop_fake_data_subset()
>  
>  [ FAIL 2 | WARN 0 | SKIP 60 | PASS 3778 ]
>  Error: Test failures
>  Execution halted
>  ```
> ## In both
>  * checking installed package size ... NOTE
>  ```
>  installed size is 58.8Mb
>  sub-directories of 1Mb or more:
>  R 3.6Mb
>  libs 54.5Mb
>  ```
>  





[jira] [Resolved] (ARROW-13217) [C++][Gandiva] Correct convert_replace function for invalid chars on string beginning

2021-07-15 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-13217.
---
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10625
[https://github.com/apache/arrow/pull/10625]

> [C++][Gandiva] Correct convert_replace function for invalid chars on string 
> beginning
> -
>
> Key: ARROW-13217
> URL: https://issues.apache.org/jira/browse/ARROW-13217
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: João Pedro Antunes Ferreira
>Assignee: João Pedro Antunes Ferreira
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The CONVERT_REPLACE Gandiva function does not work properly when invalid 
> chars appear at the beginning of the string (e.g. "\xa0\xa1-valid" should 
> become "-valid" given an empty replacement char, but the replacement is not 
> applied correctly).





[jira] [Resolved] (ARROW-13281) [C++][Gandiva] Error on timestampDiffMonth function behavior for negative diff values

2021-07-15 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-13281.
---
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10674
[https://github.com/apache/arrow/pull/10674]

> [C++][Gandiva] Error on timestampDiffMonth function behavior for negative 
> diff values
> -
>
> Key: ARROW-13281
> URL: https://issues.apache.org/jira/browse/ARROW-13281
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: João Pedro Antunes Ferreira
>Assignee: João Pedro Antunes Ferreira
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The TIMESTAMPDIFF function appears to return incorrect values when a negative 
> number should be returned.
> Example:
>  - For the inputs TIMESTAMPDIFFMONTH("2019-06-30", "2019-03-31") it should 
> return *-3*, but it returns *-1*
>  - For the inputs TIMESTAMPDIFFMONTH("2019-06-30", "2019-05-31") it should 
> return *-1*, but it returns *1*
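A rough model of the month-diff semantics this ticket expects (a sketch, not 
the Gandiva implementation): the signed calendar-month difference, where a 
month counts as complete once the end day reaches the start day, or when both 
dates fall on the last day of their respective months:

```python
import calendar

def timestamp_diff_month(y1, m1, d1, y2, m2, d2):
    # Signed whole-month difference from the first date to the second.
    diff = (y2 - y1) * 12 + (m2 - m1)
    last1 = calendar.monthrange(y1, m1)[1]
    last2 = calendar.monthrange(y2, m2)[1]
    # Both dates are month-ends: the raw calendar difference stands.
    if d1 == last1 and d2 == last2:
        return diff
    # Otherwise shrink the difference toward zero when the final
    # partial month is incomplete.
    if diff > 0 and d2 < d1:
        diff -= 1
    elif diff < 0 and d2 > d1:
        diff += 1
    return diff
```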





[jira] [Assigned] (ARROW-13200) [R] Add binding for case_when()

2021-07-15 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13200:
-

Assignee: (was: Nic Crane)

> [R] Add binding for case_when()
> ---
>
> Key: ARROW-13200
> URL: https://issues.apache.org/jira/browse/ARROW-13200
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Priority: Major
> Fix For: 5.0.0
>
>
> ARROW-13064 adds a {{case_when}} kernel that works like a SQL {{CASE WHEN}} 
> expression. This allows us to implement a binding for the {{dplyr}} function 
> {{case_when()}}.
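The intended semantics are those of SQL {{CASE WHEN}}: conditions are tested 
in order and the first true one selects the value. A minimal element-wise 
Python sketch of that behavior (not the Arrow kernel itself):

```python
def case_when(pairs, default=None):
    # pairs: list of (condition_vector, value) tuples, tested in order;
    # at each position the first true condition selects its value,
    # otherwise the default is used -- like SQL CASE WHEN ... ELSE.
    n = len(pairs[0][0])
    out = []
    for i in range(n):
        for cond, value in pairs:
            if cond[i]:
                out.append(value)
                break
        else:
            out.append(default)
    return out
```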





[jira] [Resolved] (ARROW-13162) [C++][Gandiva] Add new alias for extract date functions in Gandiva registry

2021-07-15 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-13162.
---
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10594
[https://github.com/apache/arrow/pull/10594]

> [C++][Gandiva] Add new alias for extract date functions in Gandiva registry
> ---
>
> Key: ARROW-13162
> URL: https://issues.apache.org/jira/browse/ARROW-13162
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++ - Gandiva
>Reporter: João Pedro Antunes Ferreira
>Assignee: João Pedro Antunes Ferreira
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>






[jira] [Resolved] (ARROW-13190) [C++] [Gandiva] Change behavior of INITCAP function

2021-07-15 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-13190.
---
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10604
[https://github.com/apache/arrow/pull/10604]

> [C++] [Gandiva] Change behavior of INITCAP function
> ---
>
> Key: ARROW-13190
> URL: https://issues.apache.org/jira/browse/ARROW-13190
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++ - Gandiva
>Reporter: Anthony Louis Gotlib Ferreira
>Assignee: Anthony Louis Gotlib Ferreira
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> The current behavior of the *INITCAP* function is to turn the first character 
> of each word uppercase and leave the others as they are.
> The desired behavior is to turn the first letter uppercase and the others 
> lowercase. Any character except alphanumeric ones should be considered a 
> word separator.
> That behavior is based on these database systems:
>  * [Oracle|https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions065.htm]
>  * [Postgres|https://w3resource.com/PostgreSQL/initcap-function.php]
>  * [Redshift|https://docs.aws.amazon.com/redshift/latest/dg/r_INITCAP.html]
>  * [Splice Machine|https://doc.splicemachine.com/sqlref_builtinfcns_initcap.html]
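The desired behavior can be sketched in a few lines of Python (an illustration 
of the spec above, not the Gandiva code): alphanumeric runs are words, the 
first character of each is uppercased, and the rest are lowercased:

```python
import re

def initcap(s):
    # Uppercase the first character of each word and lowercase the rest;
    # any non-alphanumeric character acts as a word separator.
    def cap(m):
        w = m.group(0)
        return w[0].upper() + w[1:].lower()
    return re.sub(r"[A-Za-z0-9]+", cap, s)
```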





[jira] [Resolved] (ARROW-13050) [C++][Gandiva] Implement SPACE Hive function on Gandiva

2021-07-15 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-13050.
---
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10517
[https://github.com/apache/arrow/pull/10517]

> [C++][Gandiva] Implement SPACE Hive function on Gandiva
> ---
>
> Key: ARROW-13050
> URL: https://issues.apache.org/jira/browse/ARROW-13050
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: João Pedro Antunes Ferreira
>Assignee: João Pedro Antunes Ferreira
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Implement SPACE Hive function on Gandiva





[jira] [Resolved] (ARROW-13049) [C++][Gandiva] Implement BIN Hive function on Gandiva

2021-07-15 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-13049.
---
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10516
[https://github.com/apache/arrow/pull/10516]

> [C++][Gandiva] Implement BIN Hive function on Gandiva
> -
>
> Key: ARROW-13049
> URL: https://issues.apache.org/jira/browse/ARROW-13049
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: João Pedro Antunes Ferreira
>Assignee: João Pedro Antunes Ferreira
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Implement BIN Hive function on Gandiva





[jira] [Resolved] (ARROW-13074) [Python] Start with deprecating ParquetDataset custom attributes

2021-07-15 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-13074.
--
Resolution: Fixed

Issue resolved by pull request 10549
[https://github.com/apache/arrow/pull/10549]

> [Python] Start with deprecating ParquetDataset custom attributes
> 
>
> Key: ARROW-13074
> URL: https://issues.apache.org/jira/browse/ARROW-13074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As a first step for ARROW-9720, we should start with deprecating 
> attributes/methods of {{pq.ParquetDataset}} that we would definitely not keep 
> / are conflicting with the "dataset API". 
> I am thinking of the {{pieces}} attribute (and the {{ParquetDatasetPiece}} 
> class), the {{partitions}} attribute (and the {{ParquetPartitions}} class). 
> In addition, some of the keywords are also exposed as properties (memory_map, 
> read_dictionary, buffer_size, fs), and could be deprecated.
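Deprecating an attribute while keeping it functional is typically done with a property that warns. A minimal sketch of the pattern (class and attribute names only mirror the ones mentioned above; this is not the actual pyarrow code):

```python
import warnings

class LegacyParquetDataset:
    """Sketch of the deprecation pattern, not the real pq.ParquetDataset."""

    def __init__(self, pieces):
        self._pieces = pieces

    @property
    def pieces(self):
        # The attribute keeps working but nudges users toward the dataset API.
        warnings.warn(
            "'pieces' is deprecated; use the pyarrow.dataset API instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._pieces

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    LegacyParquetDataset(["part-0.parquet"]).pieces
print(caught[0].category.__name__)  # DeprecationWarning
```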





[jira] [Resolved] (ARROW-13006) [C++][Gandiva] Implement BASE64 and UNBASE64 Hive functions on Gandiva

2021-07-15 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-13006.
---
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10479
[https://github.com/apache/arrow/pull/10479]

> [C++][Gandiva] Implement BASE64 and UNBASE64 Hive functions on Gandiva
> --
>
> Key: ARROW-13006
> URL: https://issues.apache.org/jira/browse/ARROW-13006
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: João Pedro Antunes Ferreira
>Assignee: João Pedro Antunes Ferreira
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Implement BASE64 and UNBASE64 Hive functions on Gandiva





[jira] [Resolved] (ARROW-12986) [C++][Gandiva] Implement new cache eviction policy in Gandiva

2021-07-15 Thread Praveen Kumar (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Praveen Kumar resolved ARROW-12986.
---
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10465
[https://github.com/apache/arrow/pull/10465]

> [C++][Gandiva] Implement new cache eviction policy in Gandiva
> -
>
> Key: ARROW-12986
> URL: https://issues.apache.org/jira/browse/ARROW-12986
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++ - Gandiva
>Reporter: João Pedro Antunes Ferreira
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Currently, the eviction policy used by Gandiva's cache is based on LRU.
> I suggest adding a new eviction algorithm option that considers the LLVM 
> build time as a cost and evicts elements based on the GreedyDual-Size 
> algorithm.
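For reference, a minimal sketch of GreedyDual-Size eviction with build time as the cost (pure Python with illustrative names; the Gandiva implementation lives in C++):

```python
class GreedyDualSizeCache:
    """Each entry gets priority H = L + cost; eviction removes the lowest H
    and raises the global inflation L to that value, so cheap-to-rebuild
    entries age out before expensive ones."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inflation = 0.0            # global aging value L
        self.entries = {}               # key -> [priority, cost, value]

    def put(self, key, value, cost):
        if key not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            # Evicting raises L to the victim's priority, aging the rest.
            self.inflation = self.entries[victim][0]
            del self.entries[victim]
        self.entries[key] = [self.inflation + cost, cost, value]

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry[0] = self.inflation + entry[1]   # refresh priority on a hit
        return entry[2]

cache = GreedyDualSizeCache(capacity=2)
cache.put("slow_expr", "module_a", cost=10.0)   # expensive LLVM build
cache.put("fast_expr", "module_b", cost=1.0)    # cheap LLVM build
cache.put("new_expr", "module_c", cost=5.0)     # evicts "fast_expr"
print(sorted(cache.entries))  # ['new_expr', 'slow_expr']
```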





[jira] [Updated] (ARROW-13345) [C++] Implement logN compute function

2021-07-15 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane updated ARROW-13345:
--
Summary: [C++] Implement logN compute function  (was: [C++] Implement logN)

> [C++] Implement logN compute function
> -
>
> Key: ARROW-13345
> URL: https://issues.apache.org/jira/browse/ARROW-13345
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Nic Crane
>Priority: Minor
>
> Just writing bindings from R functions to the various C++ log functions, but 
> one that we don't have is logN (i.e. where N is a user-supplied value); 
> please could this be implemented?
>  
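Until such a kernel exists, the change-of-base identity gives the semantics: log_b(x) = ln(x) / ln(b). A quick sketch (the function name is hypothetical, not an Arrow API):

```python
import math

def logn(x, base):
    # log base N via the change-of-base identity
    return math.log(x) / math.log(base)

print(logn(8, 2))
print(logn(1000, 10))
```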





[jira] [Created] (ARROW-13345) [C++] Implement logN

2021-07-15 Thread Nic Crane (Jira)
Nic Crane created ARROW-13345:
-

 Summary: [C++] Implement logN
 Key: ARROW-13345
 URL: https://issues.apache.org/jira/browse/ARROW-13345
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Nic Crane


Just writing bindings from R functions to the various C++ log functions, but 
one that we don't have is logN (i.e. where N is a user-supplied value); please 
could this be implemented?

 





[jira] [Updated] (ARROW-13237) S3 FileSystem doesn't seem to handle redirects

2021-07-15 Thread Alessandro Molina (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Molina updated ARROW-13237:
--
Fix Version/s: (was: 5.0.0)
   6.0.0

> S3 FileSystem doesn't seem to handle redirects
> --
>
> Key: ARROW-13237
> URL: https://issues.apache.org/jira/browse/ARROW-13237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 4.0.1
>Reporter: Alessandro Molina
>Priority: Major
> Fix For: 6.0.0
>
>
> In some conditions AWS S3 seems to respond with a redirect, but Arrow seems 
> to consider it an error instead of following the redirect.
> For example see
> {code}
> s3, bucket = 
> fs.FileSystem.from_uri("s3://ursa-labs-taxi-data/?region=us-east-1")
> print(s3.get_file_info(fs.FileSelector(bucket+"/2011", recursive=True)))
> {code}
> The error that you get is
> {code}
>  OSError: When listing objects under key '2011' in bucket 
> 'ursa-labs-taxi-data': AWS Error [code 100]: Unable to parse ExceptionName: 
> PermanentRedirect Message: The bucket you are attempting to access must be 
> addressed using the specified endpoint. Please send all future requests to 
> this endpoint.
> {code}
> It should probably follow the `PermanentRedirect` instead of choking on it.
> It is also possible to reproduce it using
> {code}
> from pyarrow import fs
> s3 = fs.SubTreeFileSystem("ursa-labs-taxi-data", fs.S3FileSystem())
> print(s3.get_file_info(fs.FileSelector("2011", recursive=True)))
> {code}





[jira] [Updated] (ARROW-12745) [C++][Compute] Add floor, ceiling, and truncate kernels

2021-07-15 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-12745:
--
Fix Version/s: 5.0.0

> [C++][Compute] Add floor, ceiling, and truncate kernels
> ---
>
> Key: ARROW-12745
> URL: https://issues.apache.org/jira/browse/ARROW-12745
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
> Fix For: 5.0.0
>
>
> Kernels to round each value in an array of floating point numbers to:
>  * the nearest integer less than or equal to it ({{floor}})
>  * the nearest integer greater than or equal to it ({{ceiling}})
>  * the integral part without fraction digits
> Should return an array of the same type as the input (not an integer type)
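The three behaviors map onto the standard C/Python semantics; a small sketch of the expected per-element results (plain Python, not the Arrow kernels themselves):

```python
import math

values = [2.7, -2.7, 2.0]

# Each kernel keeps the input's floating-point type, hence the float() here.
floor_result = [float(math.floor(v)) for v in values]   # [2.0, -3.0, 2.0]
ceil_result  = [float(math.ceil(v))  for v in values]   # [3.0, -2.0, 2.0]
trunc_result = [float(math.trunc(v)) for v in values]   # [2.0, -2.0, 2.0]

print(floor_result, ceil_result, trunc_result)
```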





[jira] [Updated] (ARROW-12712) [C++] String repeat kernel

2021-07-15 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-12712:
--
Fix Version/s: 6.0.0

> [C++] String repeat kernel
> --
>
> Key: ARROW-12712
> URL: https://issues.apache.org/jira/browse/ARROW-12712
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
> Fix For: 6.0.0
>
>
> Like SQL {{replicate}} or Python {{'string' * n}}





[jira] [Updated] (ARROW-12714) [C++] String title case kernel

2021-07-15 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-12714:
--
Fix Version/s: 6.0.0

> [C++] String title case kernel
> --
>
> Key: ARROW-12714
> URL: https://issues.apache.org/jira/browse/ARROW-12714
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: beginner
> Fix For: 6.0.0
>
>
> Capitalizes the first character of each word in the string, like SQL 
> {{initcap}} or Python {{str.title()}}





[jira] [Updated] (ARROW-9392) [C++] Document more of the compute layer

2021-07-15 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-9392:
-
Fix Version/s: (was: 5.0.0)
   6.0.0

> [C++] Document more of the compute layer
> 
>
> Key: ARROW-9392
> URL: https://issues.apache.org/jira/browse/ARROW-9392
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Antoine Pitrou
>Assignee: Eduardo Ponce
>Priority: Major
> Fix For: 6.0.0
>
>
> Ideally, we should add:
> * a description and examples of how to call compute functions
> * an API reference for concrete C++ functions such as {{Cast}}, 
> {{NthToIndices}}, etc.





[jira] [Updated] (ARROW-12744) [C++][Compute] Add rounding kernel

2021-07-15 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-12744:
--
Fix Version/s: 5.0.0

> [C++][Compute] Add rounding kernel
> --
>
> Key: ARROW-12744
> URL: https://issues.apache.org/jira/browse/ARROW-12744
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 4h 50m
>  Remaining Estimate: 0h
>
> Kernel to round an array of floating point numbers. Should return an array of 
> the same type as the input. Should have an option to control how many digits 
> after the decimal point (default value 0 meaning round to the nearest 
> integer).
> Midpoint values (e.g. 0.5 rounded to nearest integer) should round away from 
> zero (up for positive numbers, down for negative numbers).
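Note that Python's built-in round() uses banker's rounding, so it does not match the requested midpoint rule; half-away-from-zero can be expressed directly (a sketch, not the Arrow kernel):

```python
import math

def round_half_away(x, ndigits=0):
    # Scale, push midpoints away from zero with +0.5, then scale back.
    scale = 10.0 ** ndigits
    return math.copysign(math.floor(abs(x) * scale + 0.5), x) / scale

print([round_half_away(v) for v in [0.5, 1.5, -0.5, -1.5]])  # [1.0, 2.0, -1.0, -2.0]
print(round(0.5), round(1.5))  # built-in banker's rounding gives 0 2
```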





[jira] [Updated] (ARROW-12944) [C++] String capitalize kernel

2021-07-15 Thread Eduardo Ponce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eduardo Ponce updated ARROW-12944:
--
Fix Version/s: 6.0.0

> [C++] String capitalize kernel
> --
>
> Key: ARROW-12944
> URL: https://issues.apache.org/jira/browse/ARROW-12944
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ian Cook
>Assignee: Eduardo Ponce
>Priority: Major
>  Labels: beginner
> Fix For: 6.0.0
>
>
> Capitalizes the first character in the string, like Python 
> {{str.capitalize()}} 





[jira] [Assigned] (ARROW-11673) [C++] Casting dictionary type to use different index type

2021-07-15 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-11673:


Assignee: Niranda Perera  (was: Ben Kietzman)

> [C++] Casting dictionary type to use different index type
> -
>
> Key: ARROW-11673
> URL: https://issues.apache.org/jira/browse/ARROW-11673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Niranda Perera
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> It's currently not implemented to cast from one dictionary type to another 
> dictionary type to change the index type. 
> Small example:
> {code}
> In [2]: arr = pa.array(['a', 'b', 'a']).dictionary_encode()
> In [3]: arr.type
> Out[3]: DictionaryType(dictionary<values=string, indices=int32, ordered=0>)
> In [5]: arr.cast(pa.dictionary(pa.int8(), pa.string()))
> ...
> ArrowNotImplementedError: Unsupported cast from dictionary<values=string, 
> indices=int32, ordered=0> to dictionary<values=string, indices=int8, 
> ordered=0> (no available cast function for target type)
> ../src/arrow/compute/cast.cc:112  
> GetCastFunctionInternal(cast_options->to_type, args[0].type().get())
> {code}
> From 
> https://stackoverflow.com/questions/66223730/how-to-change-column-datatype-with-pyarrow
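Conceptually, the cast only needs to reuse the dictionary values and narrow the index storage after a range check against the target width. A pure-Python sketch of that check (pyarrow is deliberately not used here, and the helper name is hypothetical):

```python
def narrow_indices(indices, target_bits=8):
    # A dictionary-to-dictionary cast keeps the values array untouched;
    # only the index storage type changes, after verifying every index
    # fits in the signed target width.
    lo, hi = -(2 ** (target_bits - 1)), 2 ** (target_bits - 1) - 1
    for i in indices:
        if i is not None and not lo <= i <= hi:
            raise OverflowError(f"index {i} does not fit in int{target_bits}")
    return list(indices)

print(narrow_indices([0, 1, 0, None]))  # [0, 1, 0, None]
```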





[jira] [Updated] (ARROW-12579) [Python] Pyarrow 4.0.0 dependency numpy 1.19.4 throws errors on Apple silicon/M1 compilation

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12579:

Fix Version/s: 5.0.0

> [Python] Pyarrow 4.0.0 dependency numpy 1.19.4 throws errors on Apple 
> silicon/M1 compilation
> 
>
> Key: ARROW-12579
> URL: https://issues.apache.org/jira/browse/ARROW-12579
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Jack Wells
>Assignee: Krisztian Szucs
>Priority: Minor
>  Labels: build-failure
> Fix For: 5.0.0
>
> Attachments: pyarrow_build_errors.txt
>
>
> Hi team! I've been unable to install older numpy versions (including 1.19.4 
> as specified [for aarch64 
> machines|https://github.com/apache/arrow/blob/master/python/requirements-wheel-build.txt])
>  on my Apple Silicon machine because the build process throws a number of 
> numpy compilation errors. I've been able to successfully install numpy 1.20.2 
> however - is it possible to bump up the numpy acceptable version number to 
> enable Apple silicon installations?
> Thanks!





[jira] [Assigned] (ARROW-12579) [Python] Pyarrow 4.0.0 dependency numpy 1.19.4 throws errors on Apple silicon/M1 compilation

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-12579:
---

Assignee: Krisztian Szucs

> [Python] Pyarrow 4.0.0 dependency numpy 1.19.4 throws errors on Apple 
> silicon/M1 compilation
> 
>
> Key: ARROW-12579
> URL: https://issues.apache.org/jira/browse/ARROW-12579
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 4.0.0, 4.0.1
>Reporter: Jack Wells
>Assignee: Krisztian Szucs
>Priority: Minor
>  Labels: build-failure
> Attachments: pyarrow_build_errors.txt
>
>
> Hi team! I've been unable to install older numpy versions (including 1.19.4 
> as specified [for aarch64 
> machines|https://github.com/apache/arrow/blob/master/python/requirements-wheel-build.txt])
>  on my Apple Silicon machine because the build process throws a number of 
> numpy compilation errors. I've been able to successfully install numpy 1.20.2 
> however - is it possible to bump up the numpy acceptable version number to 
> enable Apple silicon installations?
> Thanks!





[jira] [Updated] (ARROW-10658) [Python][Packaging] Wheel builds for Apple Silicon

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10658:

Summary: [Python][Packaging] Wheel builds for Apple Silicon  (was: [Python] 
Wheel builds for Apple Silicon)

> [Python][Packaging] Wheel builds for Apple Silicon
> --
>
> Key: ARROW-10658
> URL: https://issues.apache.org/jira/browse/ARROW-10658
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 5.0.0
>
>
> We are only able to create Intel builds at the moment





[jira] [Assigned] (ARROW-11673) [C++] Casting dictionary type to use different index type

2021-07-15 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-11673:


Assignee: Ben Kietzman  (was: Niranda Perera)

> [C++] Casting dictionary type to use different index type
> -
>
> Key: ARROW-11673
> URL: https://issues.apache.org/jira/browse/ARROW-11673
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> It's currently not implemented to cast from one dictionary type to another 
> dictionary type to change the index type. 
> Small example:
> {code}
> In [2]: arr = pa.array(['a', 'b', 'a']).dictionary_encode()
> In [3]: arr.type
> Out[3]: DictionaryType(dictionary<values=string, indices=int32, ordered=0>)
> In [5]: arr.cast(pa.dictionary(pa.int8(), pa.string()))
> ...
> ArrowNotImplementedError: Unsupported cast from dictionary<values=string, 
> indices=int32, ordered=0> to dictionary<values=string, indices=int8, 
> ordered=0> (no available cast function for target type)
> ../src/arrow/compute/cast.cc:112  
> GetCastFunctionInternal(cast_options->to_type, args[0].type().get())
> {code}
> From 
> https://stackoverflow.com/questions/66223730/how-to-change-column-datatype-with-pyarrow





[jira] [Updated] (ARROW-10658) [Python][Packaging] Wheel builds for Apple Silicon

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-10658:
---
Labels: pull-request-available  (was: )

> [Python][Packaging] Wheel builds for Apple Silicon
> --
>
> Key: ARROW-10658
> URL: https://issues.apache.org/jira/browse/ARROW-10658
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We are only able to create Intel builds at the moment





[jira] [Assigned] (ARROW-10658) [Python] Wheel builds for Apple Silicon

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-10658:
---

Assignee: Krisztian Szucs

> [Python] Wheel builds for Apple Silicon
> ---
>
> Key: ARROW-10658
> URL: https://issues.apache.org/jira/browse/ARROW-10658
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 5.0.0
>
>
> We are only able to create Intel builds at the moment





[jira] [Updated] (ARROW-10658) [Python] Wheel builds for Apple Silicon

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-10658:

Fix Version/s: 5.0.0

> [Python] Wheel builds for Apple Silicon
> ---
>
> Key: ARROW-10658
> URL: https://issues.apache.org/jira/browse/ARROW-10658
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging, Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 5.0.0
>
>
> We are only able to create Intel builds at the moment





[jira] [Updated] (ARROW-8999) [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-8999:
---
Fix Version/s: (was: 5.0.0)
   6.0.0

> [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" 
> build
> 
>
> Key: ARROW-8999
> URL: https://issues.apache.org/jira/browse/ARROW-8999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: dataset
> Fix For: 6.0.0
>
>
> I've been seeing this segfault periodically the last week, does anyone have 
> an idea what might be wrong?
> https://github.com/apache/arrow/pull/7273/checks?check_run_id=717249862





[jira] [Commented] (ARROW-8999) [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" build

2021-07-15 Thread Krisztian Szucs (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381383#comment-17381383
 ] 

Krisztian Szucs commented on ARROW-8999:


Me neither. I'm not sure how long we need to keep this open, but I'm going to 
postpone it for now.

> [Python][C++] Non-deterministic segfault in "AMD64 MacOS 10.15 Python 3.7" 
> build
> 
>
> Key: ARROW-8999
> URL: https://issues.apache.org/jira/browse/ARROW-8999
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: dataset
> Fix For: 5.0.0
>
>
> I've been seeing this segfault periodically the last week, does anyone have 
> an idea what might be wrong?
> https://github.com/apache/arrow/pull/7273/checks?check_run_id=717249862





[jira] [Closed] (ARROW-13305) [C++] Unable to install nightly on Ubuntu 21.04 due to CSV options

2021-07-15 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-13305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mauricio 'Pachá' Vargas Sepúlveda closed ARROW-13305.
-
Resolution: Fixed

> [C++] Unable to install nightly on Ubuntu 21.04 due to CSV options
> --
>
> Key: ARROW-13305
> URL: https://issues.apache.org/jira/browse/ARROW-13305
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.1
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Major
> Fix For: 5.0.0
>
>
> Related to ARROW-13304
> Another error seen on Ubuntu 21.04 is
> {code:java}
> dataset.cpp:284:31: error: ‘CsvFileWriteOptions’ is not a member of ‘ds’; did 
> you mean ‘IpcFileWriteOptions’?
>   284 | const std::shared_ptr<ds::CsvFileWriteOptions>& csv_options,
>   |   ^~~
>   |   IpcFileWriteOptions
> dataset.cpp:284:50: error: template argument 1 is invalid
>   284 | const std::shared_ptr<ds::CsvFileWriteOptions>& csv_options,
>   |  ^
> dataset.cpp: In function ‘void dataset___CsvFileWriteOptions__update(const 
> int&, const std::shared_ptr&)’:
> dataset.cpp:286:15: error: base operand of ‘->’ is not a pointer
>   286 |   *csv_options->write_options = *write_options; {code}
>  
> I fetched the last changes up to 91f26
>  





[jira] [Commented] (ARROW-13305) [C++] Unable to install nightly on Ubuntu 21.04 due to CSV options

2021-07-15 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-13305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381382#comment-17381382
 ] 

Mauricio 'Pachá' Vargas Sepúlveda commented on ARROW-13305:
---

hi

yes,

i do these steps:
 # build libarrow
 # reinstall R package

this was fixed with the last changes to ARROW-13304

> [C++] Unable to install nightly on Ubuntu 21.04 due to CSV options
> --
>
> Key: ARROW-13305
> URL: https://issues.apache.org/jira/browse/ARROW-13305
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.1
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Major
> Fix For: 5.0.0
>
>
> Related to ARROW-13304
> Another error seen on Ubuntu 21.04 is
> {code:java}
> dataset.cpp:284:31: error: ‘CsvFileWriteOptions’ is not a member of ‘ds’; did 
> you mean ‘IpcFileWriteOptions’?
>   284 | const std::shared_ptr<ds::CsvFileWriteOptions>& csv_options,
>   |   ^~~
>   |   IpcFileWriteOptions
> dataset.cpp:284:50: error: template argument 1 is invalid
>   284 | const std::shared_ptr<ds::CsvFileWriteOptions>& csv_options,
>   |  ^
> dataset.cpp: In function ‘void dataset___CsvFileWriteOptions__update(const 
> int&, const std::shared_ptr&)’:
> dataset.cpp:286:15: error: base operand of ‘->’ is not a pointer
>   286 |   *csv_options->write_options = *write_options; {code}
>  
> I fetched the last changes up to 91f26
>  





[jira] [Updated] (ARROW-12122) [Python] Cannot install via pip M1 mac

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-12122:

Summary: [Python] Cannot install via pip M1 mac  (was: [Python] Cannot 
install via pip. M1 mac)

> [Python] Cannot install via pip M1 mac
> --
>
> Key: ARROW-12122
> URL: https://issues.apache.org/jira/browse/ARROW-12122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Reporter: Bastien Boutonnet
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> when doing {{pip install pyarrow --no-use-pep517}}
> {noformat}
> Collecting pyarrow
>  Using cached pyarrow-3.0.0.tar.gz (682 kB)
> Requirement already satisfied: numpy>=1.16.6 in 
> /Users/bastienboutonnet/Library/Caches/pypoetry/virtualenvs/dbt-sugar-lJO0x__U-py3.8/lib/python3.8/site-packages
>  (from pyarrow) (1.20.2)
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (setup.py) ... error
>  ERROR: Command errored out with exit status 1:
>  command: 
> /Users/bastienboutonnet/Library/Caches/pypoetry/virtualenvs/dbt-sugar-lJO0x__U-py3.8/bin/python
>  -u -c 'import sys, setuptools, tokenize; sys.argv[0] = 
> '"'"'/private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/setup.py'"'"';
>  
> __file__='"'"'/private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/setup.py'"'"';f=getattr(tokenize,
>  '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', 
> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' 
> bdist_wheel -d 
> /private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-wheel-vpkwqzyi
>  cwd: 
> /private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/
>  Complete output (238 lines):
>  running bdist_wheel
>  running build
>  running build_py
>  creating build
>  creating build/lib.macosx-11.2-arm64-3.8
>  creating build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/orc.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/_generated_version.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/compat.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/benchmark.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/parquet.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/ipc.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/util.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/flight.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/cffi.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/filesystem.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/__init__.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/plasma.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/types.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/dataset.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/cuda.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/feather.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/pandas_compat.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/fs.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/csv.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/jvm.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/hdfs.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/json.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/serialization.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/compute.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  creating build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_tensor.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_ipc.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/conftest.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_convert_builtin.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_misc.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_gandiva.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/strategies.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_adhoc_memory_leak.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/arrow_7980.py -> 
> 

[jira] [Resolved] (ARROW-13283) [Developer Tools] Support passing through memory limits in archery docker run

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-13283.
-
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10690
[https://github.com/apache/arrow/pull/10690]

> [Developer Tools] Support passing through memory limits in archery docker run
> -
>
> Key: ARROW-13283
> URL: https://issues.apache.org/jira/browse/ARROW-13283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery, Developer Tools
>Reporter: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> It can be useful to limit the memory available to one of our Docker 
> containers to more accurately reproduce CI conditions. docker-compose doesn't 
> directly support this (except in 'swarm' deployment mode) but since Archery 
> can directly invoke the docker CLI, we could support this ourselves, either 
> by adding custom metadata into docker-compose.yml or exposing a manual CLI 
> parameter.
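As a rough sketch of the manual-CLI-parameter option: the `--memory` flag is a real `docker run` option, but the `build_run_command` helper below is invented purely for illustration of how Archery might thread the limit through to the docker CLI it invokes.

```python
# Illustrative only: a hypothetical helper that appends a memory limit to the
# `docker run` invocation Archery shells out to. `--memory` is the real
# docker CLI flag; everything else here is an assumption.
def build_run_command(image, memory=None):
    cmd = ["docker", "run", "--rm"]
    if memory is not None:
        cmd += ["--memory", memory]
    return cmd + [image]
```

For example, `build_run_command("arrow-ci:latest", memory="4g")` would yield a command list ending in `--memory 4g arrow-ci:latest`.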



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13283) [Developer Tools] Support passing through memory limits in archery docker run

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-13283:
---

Assignee: David Li

> [Developer Tools] Support passing through memory limits in archery docker run
> -
>
> Key: ARROW-13283
> URL: https://issues.apache.org/jira/browse/ARROW-13283
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Archery, Developer Tools
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> It can be useful to limit the memory available to one of our Docker 
> containers to more accurately reproduce CI conditions. docker-compose doesn't 
> directly support this (except in 'swarm' deployment mode) but since Archery 
> can directly invoke the docker CLI, we could support this ourselves, either 
> by adding custom metadata into docker-compose.yml or exposing a manual CLI 
> parameter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-12667) [Python] Ensure test coverage for conversion of strided numpy arrays

2021-07-15 Thread Krisztian Szucs (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-12667.
-
Resolution: Fixed

Issue resolved by pull request 10709
[https://github.com/apache/arrow/pull/10709]

> [Python] Ensure test coverage for conversion of strided numpy arrays
> 
>
> Key: ARROW-12667
> URL: https://issues.apache.org/jira/browse/ARROW-12667
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 4.0.0
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> {code:python}
> nparray = np.array([b"ab", b"cd", b"ef"])
> arrow_array = pa.array(nparray[::2], pa.binary(2))
> assert [b"ab", b"ef"] == arrow_array.to_pylist()
> {code}
> Instead, the final result is {{[b'ab', b'cd']}}: the conversion does not 
> handle the strided (non-contiguous) numpy view correctly.
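The failure mode can be modeled without Arrow at all. A minimal numpy-only sketch of why a stride-ignoring reader produces `[b'ab', b'cd']`:

```python
import numpy as np

nparray = np.array([b"ab", b"cd", b"ef"])
view = nparray[::2]                 # non-contiguous view: [b"ab", b"ef"]
assert not view.flags["C_CONTIGUOUS"]

# A converter that ignores strides effectively reads the first len(view)
# elements of the underlying buffer, reproducing the wrong [b"ab", b"cd"]:
naive = nparray[: len(view)]
assert naive.tolist() == [b"ab", b"cd"]
assert view.tolist() == [b"ab", b"ef"]  # the correct, stride-aware result
```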



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12964) [R] Add bindings for ifelse() and if_else()

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12964:
---
Labels: pull-request-available  (was: )

> [R] Add bindings for ifelse() and if_else()
> ---
>
> Key: ARROW-12964
> URL: https://issues.apache.org/jira/browse/ARROW-12964
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Nic Crane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> ARROW-10640 adds an {{if_else}} kernel to the C++ library. Add R bindings so 
> users can call {{ifelse()}} or {{if_else()}} (the stricter dplyr variant) in 
> dplyr verbs. I believe the C++ kernel requires the second and third arguments 
> to have the same types, just like {{dplyr::if_else()}}.
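The strictness being described can be sketched in pure Python. This is an illustration of the dplyr-style same-type requirement, not the Arrow kernel or the R binding itself; the `if_else` function here is invented for the sketch.

```python
# Pure-Python model of strict if_else semantics: both branches must share a
# type, mirroring dplyr::if_else(); elementwise selection over a condition
# vector. Not the actual Arrow C++ kernel.
def if_else(conds, yes, no):
    if type(yes) is not type(no):
        raise TypeError("`yes` and `no` must have the same type")
    return [yes if c else no for c in conds]
```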



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13305) [C++] Unable to install nightly on Ubuntu 21.04 due to CSV options

2021-07-15 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381363#comment-17381363
 ] 

Neal Richardson commented on ARROW-13305:
-

Is this on your local build [~pachamaltese]? When the Arrow C++ library 
changes, you need to rebuild it before you rebuild the R library.

> [C++] Unable to install nightly on Ubuntu 21.04 due to CSV options
> --
>
> Key: ARROW-13305
> URL: https://issues.apache.org/jira/browse/ARROW-13305
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 4.0.1
>Reporter: Mauricio 'Pachá' Vargas Sepúlveda
>Priority: Major
> Fix For: 5.0.0
>
>
> Related to ARROW-13304
> Another error seen on Ubuntu 21.04 is
> {code:java}
> dataset.cpp:284:31: error: ‘CsvFileWriteOptions’ is not a member of ‘ds’; did 
> you mean ‘IpcFileWriteOptions’?
>   284 | const std::shared_ptr& csv_options,
>   |   ^~~
>   |   IpcFileWriteOptions
> dataset.cpp:284:50: error: template argument 1 is invalid
>   284 | const std::shared_ptr& csv_options,
>   |  ^
> dataset.cpp: In function ‘void dataset___CsvFileWriteOptions__update(const 
> int&, const std::shared_ptr&)’:
> dataset.cpp:286:15: error: base operand of ‘->’ is not a pointer
>   286 |   *csv_options->write_options = *write_options; {code}
>  
> I fetched the last changes up to 91f26
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11502) [C++] Optimize Arrow ByteStreamSplitDecode with Neon

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11502:

Fix Version/s: (was: 5.0.0)
   6.0.0

> [C++] Optimize Arrow ByteStreamSplitDecode with Neon
> 
>
> Key: ARROW-11502
> URL: https://issues.apache.org/jira/browse/ARROW-11502
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yuqi Gu
>Assignee: Yuqi Gu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> Leverage the Arm64 Neon extension to improve ByteStreamSplitDecode 
> performance.
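For context, byte-stream-split groups the k-th byte of every fixed-width value into stream k; decoding interleaves the streams back. A pure-Python model of the transform the Neon kernel would accelerate (function names are illustrative):

```python
# Byte-stream-split layout sketch: encode groups byte k of each value into
# stream k; decode reverses the interleaving. Pure-Python reference model.
def byte_stream_split(data: bytes, width: int) -> bytes:
    n = len(data) // width
    return bytes(data[i * width + k] for k in range(width) for i in range(n))

def byte_stream_split_decode(split: bytes, width: int) -> bytes:
    n = len(split) // width
    return bytes(split[k * n + i] for i in range(n) for k in range(width))

raw = bytes(range(8))  # two 4-byte values
assert byte_stream_split_decode(byte_stream_split(raw, 4), 4) == raw
```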



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11297) [C++][Python] Add WriterOptions for ORC

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11297:

Fix Version/s: (was: 5.0.0)
   6.0.0

> [C++][Python] Add WriterOptions for ORC
> ---
>
> Key: ARROW-11297
> URL: https://issues.apache.org/jira/browse/ARROW-11297
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Ying Zhou
>Assignee: Ying Zhou
>Priority: Major
>  Labels: orc, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Implement orc::WriterOptions in Arrow and add it to the Arrow ORC reader in 
> C++ and Python.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11197) [C++] Add support for the dictionary type in the C++ ORC writer

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11197:

Fix Version/s: (was: 5.0.0)
   6.0.0

> [C++] Add support for the dictionary type in the C++ ORC writer
> ---
>
> Key: ARROW-11197
> URL: https://issues.apache.org/jira/browse/ARROW-11197
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ying Zhou
>Assignee: Ying Zhou
>Priority: Major
>  Labels: orc
> Fix For: 6.0.0
>
>
> We might need dictionary type support in order to process categorical types 
> in Pandas correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-11118) [C++] Add union support in ORC reader & writer

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11118:

Fix Version/s: (was: 5.0.0)
   6.0.0

> [C++] Add union support in ORC reader & writer
> --
>
> Key: ARROW-11118
> URL: https://issues.apache.org/jira/browse/ARROW-11118
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Ying Zhou
>Assignee: Ying Zhou
>Priority: Minor
>  Labels: orc
> Fix For: 6.0.0
>
>
> Currently the ORC reader does not support the ORC UNION type, which has left 
> the ORC writer under-tested for ORC DENSE_UNION and SPARSE_UNION types. To 
> fix this, union support needs to be added to the ORC reader, and union 
> support in the ORC writer needs to be added and tested.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-9948) [C++] Decimal128 does not check scale range when rescaling; can cause buffer overflow

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-9948:
---
Fix Version/s: (was: 5.0.0)
   6.0.0

> [C++] Decimal128 does not check scale range when rescaling; can cause buffer 
> overflow
> -
>
> Key: ARROW-9948
> URL: https://issues.apache.org/jira/browse/ARROW-9948
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Mingyu Zhong
>Assignee: Andrew Wieteska
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> BasicDecimal128::GetScaleMultiplier has a DCHECK on the scale, but the scale 
> can come from users. For example, Decimal128::FromString("1e100") will cause 
> an out-of-bound read.
> BasicDecimal128::Rescale and BasicDecimal128::GetWholeAndFraction have the 
> same problem.
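The fix being requested amounts to a range check before indexing the precomputed multiplier table. A hedged Python sketch (the table size of 39 mirrors Decimal128's maximum precision of 38; the function name is illustrative, not the C++ API):

```python
# Sketch of validating a user-supplied scale before table lookup, instead of
# relying on a debug-only DCHECK that compiles out in release builds.
MAX_SCALE = 38
POWERS_OF_TEN = [10**i for i in range(MAX_SCALE + 1)]

def get_scale_multiplier(scale: int) -> int:
    if not 0 <= scale <= MAX_SCALE:
        raise ValueError(f"scale {scale} out of range [0, {MAX_SCALE}]")
    return POWERS_OF_TEN[scale]
```

With this check, an input like "1e100" would surface as an error rather than an out-of-bound read.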



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8221) [Python][Dataset] Expose schema inference / validation options in the factory

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-8221:
---
Fix Version/s: (was: 5.0.0)
   6.0.0

> [Python][Dataset] Expose schema inference / validation options in the factory
> -
>
> Key: ARROW-8221
> URL: https://issues.apache.org/jira/browse/ARROW-8221
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> ARROW-8058 added options related to schema inference / validation for the 
> Dataset factory. We should expose this in Python in the {{dataset(..)}} 
> factory function:
> - Add ability to pass a user-specified schema with a {{schema}} keyword, 
> instead of inferring the schema from (one of) the files (to be passed to the 
> factory finish method)
> - Add {{validate_schema}} option to toggle whether the schema is validated 
> against the actual files or not.
> - Expose in some way the number of fragments to be inspected when inferring 
> or validating the schema. Not sure yet what the best API for this would be. 
> Some relevant notes from the original PR: 
> https://github.com/apache/arrow/pull/6687#issuecomment-604394407



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7617) [Python] parquet.write_to_dataset creates empty partitions for non-observed dictionary items (categories)

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7617:
---
Fix Version/s: (was: 5.0.0)
   6.0.0

> [Python] parquet.write_to_dataset creates empty partitions for non-observed 
> dictionary items (categories)
> -
>
> Key: ARROW-7617
> URL: https://issues.apache.org/jira/browse/ARROW-7617
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.15.1
>Reporter: Vladimir
>Priority: Major
>  Labels: dataset, dataset-parquet-write, parquet
> Fix For: 6.0.0
>
>
> Hello,
> it looks like views selected along a categorical column are not properly 
> respected.
> For the following dummy dataframe:
>  
> {code:java}
> d = pd.date_range('1990-01-01', freq='D', periods=1)
> vals = pd.np.random.randn(len(d), 4)
> x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
> x['Year'] = x.index.year
> {code}
> The slice by Year is saved to partitioned parquet properly:
> {code:java}
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_a.parquet', 
> partition_cols=['Year']){code}
> However, if we convert Year to pandas.Categorical - it will save the whole 
> original dataframe, not only slice of Year=1990:
> {code:java}
> x['Year'] = x['Year'].astype('category')
> table = pa.Table.from_pandas(x[x.Year==1990], preserve_index=False)
> pq.write_to_dataset(table, root_path='test_b.parquet', 
> partition_cols=['Year'])
> {code}
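The behaviour in the issue title can be modeled without Arrow: partitioning over a categorical's full category set, rather than its observed values, yields a (possibly empty) partition per category. A pure-Python sketch (the `partitions` helper is invented for illustration):

```python
# Model of the bug: iterating over all categories produces an empty partition
# for the non-observed category 1991; iterating over observed values does not.
def partitions(rows, keys):
    return {k: [r for r in rows if r["Year"] == k] for k in keys}

rows = [{"Year": 1990, "A": 1.0}]
all_categories = [1990, 1991]          # full category set of the dtype
by_category = partitions(rows, all_categories)
by_observed = partitions(rows, sorted({r["Year"] for r in rows}))
assert by_category[1991] == []         # empty partition for unused category
assert set(by_observed) == {1990}
```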
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-7494:
---
Fix Version/s: (was: 5.0.0)

> [Java] Remove reader index and writer index from ArrowBuf
> -
>
> Key: ARROW-7494
> URL: https://issues.apache.org/jira/browse/ARROW-7494
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Jacques Nadeau
>Assignee: Ji Liu
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Reader and writer indexes do not belong on a chunk of memory; they exist 
> only due to inheritance from ByteBuf. As part of removing ByteBuf 
> inheritance, we should also remove reader and writer indexes from ArrowBuf. 
> They waste heap memory for rare utility; in general, a slice can be used 
> instead of a reader/writer index pattern.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12122) [Python] Cannot install via pip. M1 mac

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-12122:
---

Assignee: Krisztian Szucs

> [Python] Cannot install via pip. M1 mac
> ---
>
> Key: ARROW-12122
> URL: https://issues.apache.org/jira/browse/ARROW-12122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Packaging, Python
>Reporter: Bastien Boutonnet
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 8h 20m
>  Remaining Estimate: 0h
>
> when doing {{pip install pyarrow --no-use-pep517}}
> {noformat}
> Collecting pyarrow
>  Using cached pyarrow-3.0.0.tar.gz (682 kB)
> Requirement already satisfied: numpy>=1.16.6 in 
> /Users/bastienboutonnet/Library/Caches/pypoetry/virtualenvs/dbt-sugar-lJO0x__U-py3.8/lib/python3.8/site-packages
>  (from pyarrow) (1.20.2)
> Building wheels for collected packages: pyarrow
>  Building wheel for pyarrow (setup.py) ... error
>  ERROR: Command errored out with exit status 1:
>  command: 
> /Users/bastienboutonnet/Library/Caches/pypoetry/virtualenvs/dbt-sugar-lJO0x__U-py3.8/bin/python
>  -u -c 'import sys, setuptools, tokenize; sys.argv[0] = 
> '"'"'/private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/setup.py'"'"';
>  
> __file__='"'"'/private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/setup.py'"'"';f=getattr(tokenize,
>  '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', 
> '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' 
> bdist_wheel -d 
> /private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-wheel-vpkwqzyi
>  cwd: 
> /private/var/folders/v2/lfkghkc147j06_jd13v1f0yrgn/T/pip-install-ri2w315u/pyarrow_8d01252c437341798da24cfec11f603e/
>  Complete output (238 lines):
>  running bdist_wheel
>  running build
>  running build_py
>  creating build
>  creating build/lib.macosx-11.2-arm64-3.8
>  creating build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/orc.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/_generated_version.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/compat.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/benchmark.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/parquet.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/ipc.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/util.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/flight.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/cffi.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/filesystem.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/__init__.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/plasma.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/types.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/dataset.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/cuda.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/feather.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/pandas_compat.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/fs.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/csv.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/jvm.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/hdfs.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/json.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/serialization.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  copying pyarrow/compute.py -> build/lib.macosx-11.2-arm64-3.8/pyarrow
>  creating build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_tensor.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_ipc.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/conftest.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_convert_builtin.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_misc.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_gandiva.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/strategies.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/test_adhoc_memory_leak.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/arrow_7980.py -> 
> build/lib.macosx-11.2-arm64-3.8/pyarrow/tests
>  copying pyarrow/tests/util.py 

[jira] [Updated] (ARROW-12360) [Dev] Unify source code formatting of Archery

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12360:

Fix Version/s: (was: 5.0.0)

> [Dev] Unify source code formatting of Archery 
> --
>
> Key: ARROW-12360
> URL: https://issues.apache.org/jira/browse/ARROW-12360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Developer Tools
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Using black, isort and flake8.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13254) [Python] Processes killed and semaphore objects leaked when reading pandas data

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13254.
-
Resolution: Duplicate

> [Python] Processes killed and semaphore objects leaked when reading pandas 
> data
> ---
>
> Key: ARROW-13254
> URL: https://issues.apache.org/jira/browse/ARROW-13254
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: OS name and version: macOS 11.4
> Python version: 3.8.10
> Pyarrow version: 4.0.1
>Reporter: Koyomi Akaguro
>Priority: Major
> Fix For: 5.0.0
>
>
> When I run {{pa.Table.from_pandas(df)}} for a >1G dataframe, it reports
>  
>  {{Killed: 9 
> ../anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: 
> UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects 
> to clean up at shutdown}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13254) [Python] Processes killed and semaphore objects leaked when reading pandas data

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13254:

Component/s: Python

> [Python] Processes killed and semaphore objects leaked when reading pandas 
> data
> ---
>
> Key: ARROW-13254
> URL: https://issues.apache.org/jira/browse/ARROW-13254
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: OS name and version: macOS 11.4
> Python version: 3.8.10
> Pyarrow version: 4.0.1
>Reporter: Koyomi Akaguro
>Priority: Major
> Fix For: 5.0.0
>
>
> When I run {{pa.Table.from_pandas(df)}} for a >1G dataframe, it reports
>  
>  {{Killed: 9 
> ../anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: 
> UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects 
> to clean up at shutdown}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13254) [Python] Processes killed and semaphore objects leaked when reading pandas data

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reassigned ARROW-13254:
---

Assignee: Weston Pace

> [Python] Processes killed and semaphore objects leaked when reading pandas 
> data
> ---
>
> Key: ARROW-13254
> URL: https://issues.apache.org/jira/browse/ARROW-13254
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
> Environment: OS name and version: macOS 11.4
> Python version: 3.8.10
> Pyarrow version: 4.0.1
>Reporter: Koyomi Akaguro
>Assignee: Weston Pace
>Priority: Major
> Fix For: 5.0.0
>
>
> When I run {{pa.Table.from_pandas(df)}} for a >1G dataframe, it reports
>  
>  {{Killed: 9 
> ../anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: 
> UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects 
> to clean up at shutdown}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12643) Add documentation for experimental repos

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12643:

Component/s: Documentation

> Add documentation for experimental repos
> 
>
> Key: ARROW-12643
> URL: https://issues.apache.org/jira/browse/ARROW-12643
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Jorge Leitão
>Assignee: Jorge Leitão
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> See 
> https://lists.apache.org/x/thread.html/r73d09033466e4fe2fbff8d4d81bb63e3aa9d16796f4b342f61a30815@%3Cdev.arrow.apache.org%3E
>  for details



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12679) [Java] JDBC adapter does not preserve SQL-nullability

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12679:

Component/s: Java

> [Java] JDBC adapter does not preserve SQL-nullability
> -
>
> Key: ARROW-12679
> URL: https://issues.apache.org/jira/browse/ARROW-12679
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Joris Peeters
>Assignee: Joris Peeters
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> When using the JDBC adapter, the schemas of the VectorSchemaRoot instances 
> in the ArrowVectorIterator mark all columns as nullable, regardless of 
> whether they are nullable in SQL.
> This should be fixed so that the schema marks as nullable only those columns 
> that are also nullable in SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (ARROW-13254) [Python] Processes killed and semaphore objects leaked when reading pandas data

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson reopened ARROW-13254:
-

> [Python] Processes killed and semaphore objects leaked when reading pandas 
> data
> ---
>
> Key: ARROW-13254
> URL: https://issues.apache.org/jira/browse/ARROW-13254
> Project: Apache Arrow
>  Issue Type: Bug
> Environment: OS name and version: macOS 11.4
> Python version: 3.8.10
> Pyarrow version: 4.0.1
>Reporter: Koyomi Akaguro
>Priority: Major
> Fix For: 5.0.0
>
>
> When I run {{pa.Table.from_pandas(df)}} for a >1G dataframe, it reports
>  
>  {{Killed: 9 
> ../anaconda3/envs/py38/lib/python3.8/multiprocessing/resource_tracker.py:216: 
> UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects 
> to clean up at shutdown}}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13194) [Java][Document] Create prose document about Java algorithms

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13194:

Component/s: Java

> [Java][Document] Create prose document about Java algorithms
> 
>
> Key: ARROW-13194
> URL: https://issues.apache.org/jira/browse/ARROW-13194
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The document should include the basics of comparing vector elements, 
> searching values in a vector, and sorting vectors. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13047) [Website] Add kiszk to committer list

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13047:

Component/s: Website

> [Website] Add kiszk to committer list
> -
>
> Key: ARROW-13047
> URL: https://issues.apache.org/jira/browse/ARROW-13047
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Website
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13344) [R] Initial bindings for ExecPlan/ExecNode and ScalarAggregateNode

2021-07-15 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-13344:
---

 Summary: [R] Initial bindings for ExecPlan/ExecNode and 
ScalarAggregateNode
 Key: ARROW-13344
 URL: https://issues.apache.org/jira/browse/ARROW-13344
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 5.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13343) [R] Update NEWS.md for 5.0

2021-07-15 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13343:
---
Labels: pull-request-available  (was: )

> [R] Update NEWS.md for 5.0
> --
>
> Key: ARROW-13343
> URL: https://issues.apache.org/jira/browse/ARROW-13343
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Neal Richardson
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13343) [R] Update NEWS.md for 5.0

2021-07-15 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-13343:
---

 Summary: [R] Update NEWS.md for 5.0
 Key: ARROW-13343
 URL: https://issues.apache.org/jira/browse/ARROW-13343
 Project: Apache Arrow
  Issue Type: New Feature
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 5.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12964) [R] Add bindings for ifelse() and if_else()

2021-07-15 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-12964:
-

Assignee: Nic Crane

> [R] Add bindings for ifelse() and if_else()
> ---
>
> Key: ARROW-12964
> URL: https://issues.apache.org/jira/browse/ARROW-12964
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Nic Crane
>Priority: Major
> Fix For: 5.0.0
>
>
> ARROW-10640 adds an {{if_else}} kernel to the C++ library. Add R bindings so 
> users can call {{ifelse()}} or {{if_else()}} (the stricter dplyr variant) in 
> dplyr verbs. I believe the C++ kernel requires the second and third arguments 
> to have the same types, just like {{dplyr::if_else()}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13200) [R] Add binding for case_when()

2021-07-15 Thread Nic Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nic Crane reassigned ARROW-13200:
-

Assignee: Nic Crane

> [R] Add binding for case_when()
> ---
>
> Key: ARROW-13200
> URL: https://issues.apache.org/jira/browse/ARROW-13200
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Ian Cook
>Assignee: Nic Crane
>Priority: Major
> Fix For: 5.0.0
>
>
> ARROW-13064 adds a {{case_when}} kernel that works like a SQL {{CASE WHEN}} 
> expression. This allows us to implement a binding for the {{dplyr}} function 
> {{case_when()}}.
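SQL CASE WHEN semantics are first-match-wins with an optional default. A pure-Python model of the behaviour the kernel implements (the `case_when` helper is illustrative, not the Arrow or dplyr API):

```python
# First matching condition wins; `default` covers unmatched inputs,
# mirroring SQL's CASE WHEN ... ELSE ... END.
def case_when(pairs, default=None):
    def evaluate(x):
        for cond, value in pairs:
            if cond(x):
                return value
        return default
    return evaluate

f = case_when([(lambda x: x < 0, "neg"), (lambda x: x == 0, "zero")], "pos")
assert [f(v) for v in (-1, 0, 2)] == ["neg", "zero", "pos"]
```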



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13198) [C++][Dataset] Async scanner occasionally segfaulting in CI

2021-07-15 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13198:
--
Priority: Blocker  (was: Major)

> [C++][Dataset] Async scanner occasionally segfaulting in CI
> ---
>
> Key: ARROW-13198
> URL: https://issues.apache.org/jira/browse/ARROW-13198
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: David Li
>Assignee: Weston Pace
>Priority: Blocker
>  Labels: dataset, datasets
> Fix For: 5.0.0
>
> Attachments: AMD64 Conda Python 3.7 Pandas latest.log
>
>
> See attached log; it's failing in 
> {{test_open_dataset_partitioned_directory[threaded-async]}} [^AMD64 Conda 
> Python 3.7 Pandas latest.log]
> {noformat}
> 2021-06-28T15:53:47.0090857Z 
> /opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow/tests/test_dataset.py::test_scan_iterator[True-True]
>  PASSED [ 27%]
> 2021-06-28T15:53:48.5943186Z /arrow/ci/scripts/python_test.sh: line 32:  7137 
> Segmentation fault  (core dumped) pytest -r s -v ${PYTEST_ARGS} --pyargs 
> pyarrow
> 2021-06-28T15:53:49.1303267Z 139
> /pyarrow/tests/test_dataset.py::test_open_dataset_partitioned_directory[threaded-async]
>  
> {noformat}





[jira] [Updated] (ARROW-9997) [Python] StructScalar.as_py() fails if the type has duplicate field names

2021-07-15 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-9997:
-
Fix Version/s: (was: 5.0.0)
   6.0.0

> [Python] StructScalar.as_py() fails if the type has duplicate field names
> -
>
> Key: ARROW-9997
> URL: https://issues.apache.org/jira/browse/ARROW-9997
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
> Fix For: 6.0.0
>
>
> {{StructScalar}} currently extends an abstract Mapping interface. Since 
> struct types allow duplicate field names, we cannot faithfully provide that 
> API.
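
A minimal pure-Python illustration of the conflict: any Mapping-style view (i.e. a dict) necessarily collapses duplicate keys, so it cannot round-trip such a struct value.

```python
# A struct value with duplicate field names, modeled as (name, value) pairs.
fields = [("a", 1), ("a", 2), ("b", 3)]

# Converting to a dict -- the natural target of a Mapping interface --
# silently drops one of the duplicate "a" entries (the last one wins).
as_dict = dict(fields)
print(as_dict)       # {'a': 2, 'b': 3}
print(len(fields))   # 3
print(len(as_dict))  # 2 -- one field was lost in the conversion
```

This is why exposing {{StructScalar}} as a Mapping is lossy whenever the type carries duplicate field names.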





[jira] [Resolved] (ARROW-13253) [C++][FlightRPC] Segfault when sending record batch >2GB

2021-07-15 Thread Yibo Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yibo Cai resolved ARROW-13253.
--
Resolution: Fixed

Issue resolved by pull request 10663
[https://github.com/apache/arrow/pull/10663]

> [C++][FlightRPC] Segfault when sending record batch >2GB
> 
>
> Key: ARROW-13253
> URL: https://issues.apache.org/jira/browse/ARROW-13253
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Affects Versions: 4.0.1
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> When sending a record batch > 2GiB, the server will segfault. Although Flight 
> checks for this case and returns an error, it turns out that gRPC always 
> tries to increment the refcount of the result buffer whether the 
> serialization handler returned successfully or not:
> {code:cpp}
> // From gRPC 1.36
> Status CallOpSendMessage::SendMessagePtr(const M* message,
>  WriteOptions options) {
>   msg_ = message;
>   write_options_ = options;
>   // Store the serializer for later since we have access to the message
>   serializer_ = [this](const void* message) {
> bool own_buf;
> // TODO(vjpai): Remove the void below when possible
> // The void in the template parameter below should not be needed
> // (since it should be implicit) but is needed due to an observed
> // difference in behavior between clang and gcc for certain internal users
> Status result = SerializationTraits<M, void>::Serialize(
> *static_cast<const M*>(message), send_buf_.bbuf_ptr(), &own_buf);
> if (!own_buf) {
>   // XXX(lidavidm): This should perhaps check result.ok(), or Serialize 
> should
>   // unconditionally initialize send_buf_
>   send_buf_.Duplicate();
> }
> return result;
>   };
>   return Status();
> }
> {code}
> Hence when Flight returns an error without initializing the buffer, we get a 
> segfault.
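
A hypothetical Python model of that control flow (all names invented; none of this is gRPC's actual API): the serializer reports an error without initializing the buffer or setting the ownership flag, yet the caller consults the flag — and then touches the buffer — without checking the status first.

```python
class SegfaultError(Exception):
    """Stands in for the C++ crash when Duplicate() touches an
    uninitialized buffer."""

def serialize(message, send_buf, limit=2**31):
    """Model of the serialization handler: on failure it returns an error
    *without* initializing send_buf or setting own_buf."""
    if len(message) >= limit:
        return "error: batch too large", None  # own_buf left "uninitialized"
    send_buf.append(message)
    return "ok", True  # own_buf = True: buffer handed over

def send_message(message, limit=2**31):
    send_buf = []
    status, own_buf = serialize(message, send_buf, limit)
    # The bug modeled here: own_buf is consulted whether or not
    # serialization succeeded.
    if not own_buf:
        if not send_buf:           # buffer was never initialized ...
            raise SegfaultError()  # ... so Duplicate() crashes
        send_buf.append(send_buf[0])  # send_buf_.Duplicate()
    return status

print(send_message("small batch"))  # ok
try:
    send_message("x" * 10, limit=5)
except SegfaultError:
    print("segfault: Duplicate() on uninitialized buffer")
```

The fix direction suggested in the snippet's comment maps directly onto this model: either gate the `own_buf` branch on `status`, or have `serialize` unconditionally initialize the buffer.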
> Originally reported on StackOverflow: 
> [https://stackoverflow.com/questions/68230146/pyarrow-flight-do-get-segfault-when-pandas-dataframe-over-3gb]





[jira] [Updated] (ARROW-12590) [C++][R] Update copies of Homebrew files to reflect recent updates

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12590:

Fix Version/s: (was: 5.0.0)
   6.0.0

> [C++][R] Update copies of Homebrew files to reflect recent updates
> --
>
> Key: ARROW-12590
> URL: https://issues.apache.org/jira/browse/ARROW-12590
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, R
>Reporter: Ian Cook
>Assignee: Ian Cook
>Priority: Major
> Fix For: 6.0.0
>
>
> Our copies of the Homebrew formulae at 
> [https://github.com/apache/arrow/tree/master/dev/tasks/homebrew-formulae] 
> have drifted out of sync with what's currently in 
> [https://github.com/Homebrew/homebrew-core/tree/master/Formula] and 
> [https://github.com/autobrew/homebrew-core/blob/master/Formula|https://github.com/autobrew/homebrew-core/blob/master/Formula/].
>  Get them back in sync and consider automating a check that they stay in 
> sync, e.g. by failing the {{homebrew-cpp}} and {{homebrew-r-autobrew}} 
> nightly tests if our copies don't match what's in the Homebrew and autobrew 
> repos (but only when changes were made upstream that weren't ported to our 
> repo, not the inverse).
> Update the instructions at 
>  
> [https://cwiki.apache.org/confluence/display/ARROW/Release+Management+Guide#ReleaseManagementGuide-UpdatingHomebrewpackages]
>  as needed.





[jira] [Resolved] (ARROW-13215) [R] [CI] Add ENV TZ to docker files

2021-07-15 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-13215.
-
Fix Version/s: 5.0.0
   Resolution: Fixed

Issue resolved by pull request 10703
[https://github.com/apache/arrow/pull/10703]

> [R] [CI] Add ENV TZ to docker files
> ---
>
> Key: ARROW-13215
> URL: https://issues.apache.org/jira/browse/ARROW-13215
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Does adding {{ENV TZ UTC}} to Dockerfiles prevent the following (harmless) 
> error we see from time to time on CI?
> https://github.com/rocker-org/rocker/blob/master/r-edge/gcc-11/Dockerfile#L37-L38
>  suggests it might.
> {code}
> System has not been booted with systemd as init system (PID 1). Can't operate.
> Failed to connect to bus: Host is down
> Warning in system("timedatectl", intern = TRUE) :
>   running command 'timedatectl' had status 1
> {code}




