[jira] [Resolved] (ARROW-15709) [C++] Compilation of ARROW_ENGINE fails if doing an "inline" build
[ https://issues.apache.org/jira/browse/ARROW-15709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-15709. -- Resolution: Fixed Issue resolved by pull request 12457 [https://github.com/apache/arrow/pull/12457] > [C++] Compilation of ARROW_ENGINE fails if doing an "inline" build > -- > > Key: ARROW-15709 > URL: https://issues.apache.org/jira/browse/ARROW-15709 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Jeroen van Straten >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 8.5h > Remaining Estimate: 0h > > Typically with cmake we create a dedicated build directory. > {noformat} > cd cpp > mkdir build > cd build > cmake .. -DARROW_ENGINE=ON > {noformat} > However, it is possible to do an "inline" build: > {noformat} > cd cpp > cmake . -DARROW_ENGINE=ON > {noformat} > In the latter case we end up with a compilation error because when we clone > the substrait repo we clobber our substrait source files. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15625) [C++] Convert underscores to hyphens in example executable names too
[ https://issues.apache.org/jira/browse/ARROW-15625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai reassigned ARROW-15625: Assignee: Yibo Cai > [C++] Convert underscores to hyphens in example executable names too > > > Key: ARROW-15625 > URL: https://issues.apache.org/jira/browse/ARROW-15625 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yibo Cai >Assignee: Yibo Cai >Priority: Major > > While test executable names use dashes, examples use underscores. We could > convert them to be consistent, like what was done in ARROW-4648. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-3476) [Java] mvn test in memory fails on a big-endian platform
[ https://issues.apache.org/jira/browse/ARROW-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved ARROW-3476. - Fix Version/s: 3.0.0 Resolution: Fixed This is already solved, since Arrow 3.0 and later support big-endian platforms in the C++ binding. > [Java] mvn test in memory fails on a big-endian platform > > > Key: ARROW-3476 > URL: https://issues.apache.org/jira/browse/ARROW-3476 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Kazuaki Ishizaki >Priority: Major > Fix For: 3.0.0 > > > Apache Arrow is becoming commonplace to exchange data among important > emerging analytics frameworks such as Pandas, Numpy, and Spark. > [IBM Z|https://en.wikipedia.org/wiki/IBM_Z] is one of the platforms to process > critical transactions such as bank or credit card transactions. Users of IBM Z want to > extract insights from these transactions using the emerging analytics systems > on IBM Z Linux. These analytics pipelines can also be fast and effective on > IBM Z Linux by using Apache Arrow in memory. > From the technical perspective, since IBM Z Linux uses a big-endian data > format, it is not possible to use Apache Arrow in this pipeline. If Apache > Arrow could support big-endian, the use case would be expanded. > When I ran the test cases of Apache Arrow on a big-endian platform (ppc64be), > {{mvn test}} in memory causes a failure due to an assertion. > In the {{TestEndianess.testLittleEndian}} test suite, the assertion occurs during > an allocation of a {{RootAllocator}} class. > {code} > $ uname -a > Linux ppc64be.novalocal 4.5.7-300.fc24.ppc64 #1 SMP Fri Jun 10 20:29:32 UTC > 2016 ppc64 ppc64 ppc64 GNU/Linux > $ arch > ppc64 > $ cd java/memory > $ mvn test > [INFO] Scanning for projects... > [INFO] > > [INFO] > > [INFO] Building Arrow Memory 0.12.0-SNAPSHOT > [INFO] > > [INFO] > ... 
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 > s - in org.apache.arrow.memory.TestAccountant > [INFO] Running org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 > s - in org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Running org.apache.arrow.memory.TestBaseAllocator > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.746 > s <<< FAILURE! - in org.apache.arrow.memory.TestEndianess > [ERROR] testLittleEndian(org.apache.arrow.memory.TestEndianess) Time > elapsed: 0.313 s <<< ERROR! > java.lang.ExceptionInInitializerError > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > Caused by: java.lang.IllegalStateException: Arrow only runs on LittleEndian > systems. > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > [ERROR] Tests run: 22, Failures: 0, Errors: 21, Skipped: 1, Time elapsed: > 0.055 s <<< FAILURE! - in org.apache.arrow.memory.TestBaseAllocator > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-3476) [Java] mvn test in memory fails on a big-endian platform
[ https://issues.apache.org/jira/browse/ARROW-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497953#comment-17497953 ] Kazuaki Ishizaki commented on ARROW-3476: - Sure, let me close this issue since this was already solved. > [Java] mvn test in memory fails on a big-endian platform > > > Key: ARROW-3476 > URL: https://issues.apache.org/jira/browse/ARROW-3476 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Kazuaki Ishizaki >Priority: Major > > Apache Arrow is becoming commonplace to exchange data among important > emerging analytics frameworks such as Pandas, Numpy, and Spark. > [IBM Z|https://en.wikipedia.org/wiki/IBM_Z] is one of platforms to process > critical transactions such as bank or credit card. Users of IBM Z want to > extract insights from these transactions using the emerging analytics systems > on IBM Z Linux. These analytics pipelines can be also fast and effective on > IBM Z Linux by using Apache Arrow on memory. > From the technical perspective, since IBM Z Linux uses big-endian data > format, it is not possible to use Apache Arrow in this pipeline. If Apache > Arrow could support big-endian, the use case would be expanded. > When I ran test case of Apache arrow on a big-endian platform (ppc64be), > {{mvn test}} in memory causes a failure due to an assertion. > In {{TestEndianess.testLittleEndian}} test suite, the assertion occurs during > an allocation of a {{RootAllocator}} class. > {code} > $ uname -a > Linux ppc64be.novalocal 4.5.7-300.fc24.ppc64 #1 SMP Fri Jun 10 20:29:32 UTC > 2016 ppc64 ppc64 ppc64 GNU/Linux > $ arch > ppc64 > $ cd java/memory > $ mvn test > [INFO] Scanning for projects... > [INFO] > > [INFO] > > [INFO] Building Arrow Memory 0.12.0-SNAPSHOT > [INFO] > > [INFO] > ... 
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 > s - in org.apache.arrow.memory.TestAccountant > [INFO] Running org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 > s - in org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Running org.apache.arrow.memory.TestBaseAllocator > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.746 > s <<< FAILURE! - in org.apache.arrow.memory.TestEndianess > [ERROR] testLittleEndian(org.apache.arrow.memory.TestEndianess) Time > elapsed: 0.313 s <<< ERROR! > java.lang.ExceptionInInitializerError > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > Caused by: java.lang.IllegalStateException: Arrow only runs on LittleEndian > systems. > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > [ERROR] Tests run: 22, Failures: 0, Errors: 21, Skipped: 1, Time elapsed: > 0.055 s <<< FAILURE! - in org.apache.arrow.memory.TestBaseAllocator > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497949#comment-17497949 ] Kazuaki Ishizaki commented on ARROW-15778: -- [~apitrou] Thank you. In another issue, I suspect the endianness information in the schema is involved. I will look into this. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_starts_monday when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc updated ARROW-15680: --- Summary: [C++] Temporal floor/ceil/round should accept week_starts_monday when rounding to multiple of week (was: [C++] Temporal floor/ceil/round should accept week_start when rounding to multiple of week) > [C++] Temporal floor/ceil/round should accept week_starts_monday when > rounding to multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_start when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15680: --- Labels: kernel pull-request-available (was: kernel) > [C++] Temporal floor/ceil/round should accept week_start when rounding to > multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15666) [C++][Python][R] Add format inference option to StrptimeOptions
[ https://issues.apache.org/jira/browse/ARROW-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497921#comment-17497921 ] Matthew Roeschke commented on ARROW-15666: -- Speaking from experience on the pandas side, I agree with [~jorisvandenbossche] and would caution against "inference" logic. While convenient for users, the maintenance burden can be quite significant, since inference tends to have an indefinite scope, leading to more custom logic, edge cases, etc. > [C++][Python][R] Add format inference option to StrptimeOptions > --- > > Key: ARROW-15666 > URL: https://issues.apache.org/jira/browse/ARROW-15666 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Rok Mihevc >Priority: Major > > We want to have an option to infer timestamp format. > See > [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html] > and lubridate > [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html] > for examples. -- This message was sent by Atlassian Jira (v8.20.1#820001)
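The trial-parsing approach that format inference usually boils down to can be sketched with the Python standard library alone. This is not the pyarrow or pandas implementation; the candidate list and the function name are hypothetical, purely to illustrate the "indefinite scope" concern raised above:

```python
from datetime import datetime

# Hypothetical candidate list -- a real inference scheme would need far more
# entries, which is exactly where the maintenance burden comes from.
CANDIDATE_FORMATS = [
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
]

def infer_format(sample):
    """Return the first candidate strptime format that parses the sample."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    return None
```

Note that a string like "01/02/2021" would be reported as "%d/%m/%Y" even when the data meant month/day, which illustrates the kind of silent ambiguity any inference logic has to handle.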
[jira] [Created] (ARROW-15786) [Website] Tidy use of and linking to Arrow logo
Danielle Navarro created ARROW-15786: Summary: [Website] Tidy use of and linking to Arrow logo Key: ARROW-15786 URL: https://issues.apache.org/jira/browse/ARROW-15786 Project: Apache Arrow Issue Type: Improvement Components: Website Reporter: Danielle Navarro Assignee: Danielle Navarro Now that ARROW-15684 is merged, it would make sense to be consistent in how the Arrow logo is used on the website, and to link to the Visual Identity page where appropriate (e.g., there is a natural place to link to it from the Powered By page). In addition to improving the cross-linking between pages, look for cases where the site can use the updated files rather than the outdated ones (but do not delete the outdated files). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15784) [C++][Python] Parallel parquet file reading disabled with single file reads
[ https://issues.apache.org/jira/browse/ARROW-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15784: --- Labels: pull-request-available (was: ) > [C++][Python] Parallel parquet file reading disabled with single file reads > --- > > Key: ARROW-15784 > URL: https://issues.apache.org/jira/browse/ARROW-15784 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 7.0.0 >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Labels: pull-request-available > Fix For: 7.0.1 > > Time Spent: 10m > Remaining Estimate: 0h > > There is a flag {{enable_parallel_column_conversion}} which was passed down > from python to C++ when reading parquet datasets which controlled whether we > would read columns in parallel. This was allowed for single files but not > for reading multiple files. This was an old check to help prevent nested > deadlock. > Nested deadlock is no longer an issue and the flag was mostly inert once we > removed the synchronous scanner. > Unfortunately, when we removed the synchronous scanner we forgot to remove > this flag and the result was that a single-file read ended up disabling > parallelism. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15785) [Benchmarks] Add conbench benchmark for single-file parquet reads
Weston Pace created ARROW-15785: --- Summary: [Benchmarks] Add conbench benchmark for single-file parquet reads Key: ARROW-15785 URL: https://issues.apache.org/jira/browse/ARROW-15785 Project: Apache Arrow Issue Type: Improvement Components: Benchmarking Reporter: Weston Pace Assignee: Weston Pace Release 7.0.0 introduced a regression in parquet single file reads. We should add a macro-level benchmark that does single-file reads to help us detect this in the future. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15784) [C++][Python] Parallel parquet file reading disabled with single file reads
[ https://issues.apache.org/jira/browse/ARROW-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace updated ARROW-15784: Issue Type: Bug (was: Improvement) > [C++][Python] Parallel parquet file reading disabled with single file reads > --- > > Key: ARROW-15784 > URL: https://issues.apache.org/jira/browse/ARROW-15784 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 7.0.0 >Reporter: Weston Pace >Assignee: Weston Pace >Priority: Major > Fix For: 7.0.1 > > > There is a flag {{enable_parallel_column_conversion}} which was passed down > from python to C++ when reading parquet datasets which controlled whether we > would read columns in parallel. This was allowed for single files but not > for reading multiple files. This was an old check to help prevent nested > deadlock. > Nested deadlock is no longer an issue and the flag was mostly inert once we > removed the synchronous scanner. > Unfortunately, when we removed the synchronous scanner we forgot to remove > this flag and the result was that a single-file read ended up disabling > parallelism. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15784) [C++][Python] Parallel parquet file reading disabled with single file reads
Weston Pace created ARROW-15784: --- Summary: [C++][Python] Parallel parquet file reading disabled with single file reads Key: ARROW-15784 URL: https://issues.apache.org/jira/browse/ARROW-15784 Project: Apache Arrow Issue Type: Improvement Components: C++, Python Affects Versions: 7.0.0 Reporter: Weston Pace Assignee: Weston Pace Fix For: 7.0.1 There is a flag {{enable_parallel_column_conversion}} that was passed down from Python to C++ when reading parquet datasets and that controlled whether we would read columns in parallel. This was allowed for single files but not for reading multiple files; it was an old check to help prevent nested deadlock. Nested deadlock is no longer an issue, and the flag was mostly inert once we removed the synchronous scanner. Unfortunately, when we removed the synchronous scanner we forgot to remove this flag, and the result was that a single-file read ended up disabling parallelism. -- This message was sent by Atlassian Jira (v8.20.1#820001)
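The effect described above can be illustrated with a small stand-alone sketch (hypothetical names, not the actual scanner code): a per-column conversion helper gated by a flag, so that leaving the flag off silently serializes the work.

```python
from concurrent.futures import ThreadPoolExecutor

def convert_columns(columns, convert, parallel=True):
    """Convert every column with `convert`, in parallel when allowed.

    With parallel=False each column is converted serially -- which is, in
    effect, what the leftover flag did to single-file reads.
    """
    if not parallel:
        return [convert(column) for column in columns]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(convert, columns))
```

The results are identical either way, which is why the regression only shows up as a performance difference rather than a correctness failure.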
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497856#comment-17497856 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: -- [~jorisvandenbossche] the new typing generics look interesting. Is it practical to adopt this now, given the Python versions we currently support? Is it wise to use it in the UDF integration rather than what I am suggesting in this Jira? [~apitrou] The Numba JIT approach is nice, and it looks like an advanced feature for UDFs someday; I will keep it in mind. As [~westonpace] suggested, some of our main motivations are to support the user and provide user-friendly options when we write TPCx-BB queries and similar applications. If [~jorisvandenbossche]'s suggestion to use advanced typing is feasible and solves our underlying problem, is it wise to use that instead of making this change? > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > has some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is a `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497838#comment-17497838 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: -- I want to clarify a point, in case I have not clearly explained earlier in the thread why the typing information is necessary. If I am not mistaken, the main issue here is not what the UDF does internally with the data. We just need to register it in the function registry without taking the input and output types from the user explicitly. It is a nice-to-have feature that could improve presentability and usability as Python evolves. > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > has some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is a `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15783) [Python] Converting arrow MonthDayNanoInterval to pandas fails DCHECK
Micah Kornfield created ARROW-15783: --- Summary: [Python] Converting arrow MonthDayNanoInterval to pandas fails DCHECK Key: ARROW-15783 URL: https://issues.apache.org/jira/browse/ARROW-15783 Project: Apache Arrow Issue Type: Bug Components: C++, Python Reporter: Micah Kornfield Assignee: Micah Kornfield InitPandasStaticData is only called on the Python/pandas -> Arrow path and not the reverse path. This causes the DCHECK that makes sure the pandas type is not null to fail if the import code is never used. A workaround for users of the library is to call pa.array([1]) first, which avoids this issue. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15782) [C++] Findre2Alt.cmake did not honor RE2_ROOT
[ https://issues.apache.org/jira/browse/ARROW-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15782: --- Labels: pull-request-available (was: ) > [C++] Findre2Alt.cmake did not honor RE2_ROOT > - > > Key: ARROW-15782 > URL: https://issues.apache.org/jira/browse/ARROW-15782 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 7.0.0 >Reporter: Haowei Yu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In Findre2Alt.cmake, the CMake module looks for the re2 package config first, which is > likely to find the system re2 package over my customized re2 library. We > should check the RE2_ROOT variable first and, only if it is not set, fall back > to searching for the system re2. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15782) [C++] Findre2Alt.cmake did not honor RE2_ROOT
Haowei Yu created ARROW-15782: - Summary: [C++] Findre2Alt.cmake did not honor RE2_ROOT Key: ARROW-15782 URL: https://issues.apache.org/jira/browse/ARROW-15782 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 7.0.0 Reporter: Haowei Yu In Findre2Alt.cmake, the CMake module looks for the re2 package config first, which is likely to find the system re2 package over my customized re2 library. We should check the RE2_ROOT variable first and, only if it is not set, fall back to searching for the system re2. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15697) [R] Add logo and meta tags to pkgdown site
[ https://issues.apache.org/jira/browse/ARROW-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15697. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12439 [https://github.com/apache/arrow/pull/12439] > [R] Add logo and meta tags to pkgdown site > -- > > Key: ARROW-15697 > URL: https://issues.apache.org/jira/browse/ARROW-15697 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Danielle Navarro >Assignee: Danielle Navarro >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > The pkgdown site currently doesn't use the Arrow logo and doesn't have nice > social media preview images -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15258) [C++] Easy options to create a source node from a table
[ https://issues.apache.org/jira/browse/ARROW-15258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weston Pace resolved ARROW-15258. - Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12267 [https://github.com/apache/arrow/pull/12267] > [C++] Easy options to create a source node from a table > --- > > Key: ARROW-15258 > URL: https://issues.apache.org/jira/browse/ARROW-15258 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > Given a Table there should be a very simple way to create a source node. > Something like: > {code} > std::shared_ptr table = ... > ARROW_RETURN_NOT_OK(arrow::compute::MakeExecNode( > "table", plan, {}, arrow::compute::TableSourceOptions{table.get()})); > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497733#comment-17497733 ] Antoine Pitrou commented on ARROW-15765: > Another dimension to consider is whether a UDF would care if an array were >dictionary encoded or not? We probably want a way to express that too. If you want a UDF to have different implementations based on the parameter types, you can't do that using type annotations. What you could do is use a two-step approach like in Numba's {{generated_jit}}: https://numba.pydata.org/numba-doc/dev/user/generated-jit.html > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > have some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, arrya2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is an `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
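The two-step approach mentioned above can be sketched in plain Python. This is a hypothetical decorator loosely mimicking the shape of Numba's generated_jit, not the Numba API itself: the decorated function receives the argument's *type* and returns the implementation to run, so different input types get different implementations without relying on annotations.

```python
def generated(builder):
    """Two-step dispatch: `builder` receives the argument's concrete type
    and returns the implementation for it (cached per type)."""
    cache = {}
    def wrapper(arg):
        impl = cache.get(type(arg))
        if impl is None:
            impl = cache[type(arg)] = builder(type(arg))
        return impl(arg)
    return wrapper

@generated
def describe(arg_type):
    # Choose an implementation based on the concrete input type,
    # e.g. a "dictionary-encoded" path vs. a plain path.
    if issubclass(arg_type, dict):
        return lambda mapping: sorted(mapping)
    return lambda sequence: list(sequence)
```

This is what makes it possible to express things like "dictionary encoded or not" that a single type annotation cannot capture.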
[jira] [Assigned] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_start when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-15680: -- Assignee: Rok Mihevc > [C++] Temporal floor/ceil/round should accept week_start when rounding to > multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Assignee: Rok Mihevc >Priority: Major > Labels: kernel > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497720#comment-17497720 ] Weston Pace commented on ARROW-15765: - For a concrete use case, consider a user that wants to integrate some kind of Arrow-native geojson library. They would have extension types for geojson data types and custom functions that can do things like normalize coordinates to some kind of different reference or format coordinates in a particular way. In this case the UDFs would be taking in extension arrays for custom data types, which I think would have their own typing-based considerations. Another possible example, from the TPCx-BB benchmark, is doing sentiment analysis on strings (is this user comment a positive comment or a negative comment?). If we had an Arrow-native natural language processing library we could hook in an extract_sentiment operation which takes in strings and returns ? (maybe doubles?). As far as I know the type information itself is only used for validation and casting purposes. Another dimension to consider is whether a UDF would care if an array were dictionary encoded or not? We probably want a way to express that too. > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > has some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is a `pa.Array` of `pa.Int32Type`. 
> At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497696#comment-17497696 ] Joris Van den Bossche commented on ARROW-15765: --- In the context of a full query plan, I think it is important to know the output types given the input types, so that the types in the full query can be resolved. I am wondering if we could make use of some of the newer typing features, which would allow something like {code:python} def simple_function(arrow_array: pa.Array[pa.int32()]) -> pa.Array[pa.int32()]: return call_function("add", [arrow_array, 1]) {code} I think an object that can be subscripted with [] like this is called a "generic" in typing terminology (https://docs.python.org/3.11/library/typing.html#generics), and it would make it easier to get the type of the values in the container. On the other hand, it creates a somewhat separate typing syntax ({{pa.Array}} is not actually itself a useful class; it's always subclasses you get in practice). > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > has some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, array2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is a `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
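Whatever annotation syntax is chosen, reading the declared types off a UDF at registration time can be done with the standard typing module. A minimal sketch, using a hypothetical stand-in class instead of a real pyarrow array type:

```python
import typing

# Hypothetical stand-in for a pyarrow array class; real registration code
# would see pa.Int64Array etc. instead.
class Int64Array:
    pass

def add_arrays(array1: Int64Array, array2: Int64Array) -> Int64Array:
    """UDF body elided; only the signature matters for registration."""

def signature_types(func):
    """Return ([input annotation types], return annotation type)."""
    hints = typing.get_type_hints(func)
    return_type = hints.pop("return", None)
    return list(hints.values()), return_type

in_types, out_type = signature_types(add_arrays)
```

A registry could then map these annotation classes to Arrow DataTypes, which is the part this Jira proposes to expose.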
[jira] [Closed] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sifang Li closed ARROW-15780. - Resolution: Not A Problem > missing header file parquet/parquet_version.h > - > > Key: ARROW-15780 > URL: https://issues.apache.org/jira/browse/ARROW-15780 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 > Environment: Ubuntu 20.04 >Reporter: Sifang Li >Priority: Blocker > > I am following the instructions for writing a table to a parquet file: > [https://arrow.apache.org/docs/cpp/parquet.html] > I need to include #include "parquet/arrow/writer.h" > Apparently one header file is missing from the src - I cannot find it anywhere: > In file included from ../3rd_party/arrow/cpp/src/parquet/arrow/writer.h:24, > ... > ../3rd_party/arrow/cpp/src/parquet/properties.h:31:10: fatal error: > parquet/parquet_version.h: No such file or directory > 31 | #include "parquet/parquet_version.h" -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497677#comment-17497677 ] Sifang Li commented on ARROW-15780: --- Thanks - I will close this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497675#comment-17497675 ] David Li commented on ARROW-15780: -- The header is not generated until install time. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497674#comment-17497674 ] David Li commented on ARROW-15780: -- Try this {noformat} cmake .. -DARROW_PARQUET=ON -DCMAKE_INSTALL_PREFIX=(path to where you want Arrow to be installed) make -j8 install {noformat} Then point your compiler to the install prefix -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15781) [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL
[ https://issues.apache.org/jira/browse/ARROW-15781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15781: --- Labels: pull-request-available (was: ) > [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL > - > > Key: ARROW-15781 > URL: https://issues.apache.org/jira/browse/ARROW-15781 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > See https://github.com/apache/arrow/issues/12501 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15781) [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL
David Li created ARROW-15781: Summary: [Python] ParquetFileFragment.ensure_complete_metadata doesn't release the GIL Key: ARROW-15781 URL: https://issues.apache.org/jira/browse/ARROW-15781 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: David Li Assignee: David Li See https://github.com/apache/arrow/issues/12501 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497657#comment-17497657 ] Sifang Li commented on ARROW-15780: --- I just ran below: (from the manual config instructions) $ mkdir build-release $ cd build-release $ cmake .. $ make -j8 # if you have 8 CPU cores, otherwise adjust -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497650#comment-17497650 ] David Li commented on ARROW-15780: -- What commands exactly did you run? When I {{ninja install}} I do see {{parquet_version.h}} in the install prefix. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497647#comment-17497647 ] Sifang Li commented on ARROW-15780: --- It looks like an installation issue - I followed the manual instructions directly at: [https://github.com/apache/arrow/blob/master/docs/source/developers/cpp/building.rst] The libs build fine in the out-of-source dir, but parquet_version.h is missing; I see there is a .in file, so apparently the process did not convert it to .h. My cmake is 3.16.3 - is that why? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-15748. Resolution: Fixed Issue resolved by pull request 12507 [https://github.com/apache/arrow/pull/12507] > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Assignee: Rok Mihevc >Priority: Minor > Labels: pull-request-available > Fix For: 8.0.0, 7.0.1 > > Time Spent: 20m > Remaining Estimate: 0h > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15774) [C++] [CMake] Missing hiveserver2 ErrorCodes_types
[ https://issues.apache.org/jira/browse/ARROW-15774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497626#comment-17497626 ] Antoine Pitrou commented on ARROW-15774: I took some time looking at this. It appears that {{ARROW_HIVESERVER2}} is currently completely broken (it probably has been for a long time). Observations: * we should track the generated {{ErrorCodes.thrift}} in the repository, instead of having to run the Python script that generates it on each build * we should put the generated C++ files for thrift definitions inside a new {{src/generated/hiveserver2}} directory * we should update {{build-support/update-thrift.sh}} to recreate said C++ generated files * when I tried to do all the above, it appeared that _some_ files are not generated by the Thrift compiler even though they should be, and no error of any sort is printed out; trying to ignore those absent files doesn't work, as there are (predictably) missing symbol errors when linking the tests > [C++] [CMake] Missing hiveserver2 ErrorCodes_types > -- > > Key: ARROW-15774 > URL: https://issues.apache.org/jira/browse/ARROW-15774 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Environment: Arch Linux 5.16.10; All dependencies are system packages >Reporter: Pradeep Garigipati >Priority: Major > Attachments: cmake_config_generate.log > > > With cmake preset `ninja-release-maximal`, one of the auto-generated files > seems to be missing, and that in turn results in the following error > {code:sh} > [96/576] Building CXX object > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o > FAILED: > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o > > /usr/bin/ccache /usr/bin/c++ -DARROW_HAVE_RUNTIME_AVX2 > -DARROW_HAVE_RUNTIME_AVX512 -DARROW_HAVE_RUNTIME_BMI2 > -DARROW_HAVE_RUNTIME_SSE4_2 -DARROW_HAVE_SSE4_2 -DARROW_HDFS -DARROW_JEMALLOC > -DARROW_JEMALLOC_INCLUDE_DIR=""
-DARROW_MIMALLOC -DARROW_WITH_BROTLI > -DARROW_WITH_BZ2 -DARROW_WITH_LZ4 -DARROW_WITH_RE2 -DARROW_WITH_SNAPPY > -DARROW_WITH_TIMING_TESTS -DARROW_WITH_UTF8PROC -DARROW_WITH_ZLIB > -DARROW_WITH_ZSTD -DURI_STATIC_BUILD > -I/home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/src > -I/home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/src > -I/home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/src/generated -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/thirdparty/flatbuffers/include > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/jemalloc_ep-prefix/src > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/mimalloc_ep/src/mimalloc_ep/include/mimalloc-1.7 > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/xsimd_ep/src/xsimd_ep-install/include > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/zstd_ep-install/include > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/thirdparty/hadoop/include > -isystem > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/orc_ep-install/include > -Wno-noexcept-type -fdiagnostics-color=always -O3 -DNDEBUG -Wall > -fno-semantic-interposition -msse4.2 -O3 -DNDEBUG -fPIC -std=c++11 > -Wno-unused-variable -Wno-shadow-field -MD -MT > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o > -MF > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o.d > -o > src/arrow/dbi/hiveserver2/CMakeFiles/arrow_hiveserver2_thrift.dir/ErrorCodes_types.cpp.o > -c > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/src/arrow/dbi/hiveserver2/ErrorCodes_types.cpp > cc1plus: fatal error: > /home/pradeep/gitroot/ArrowWorkspace/arrow/cpp/build/src/arrow/dbi/hiveserver2/ErrorCodes_types.cpp: > No such file or directory > {code} > I have attached the cmake log of configuration/generation steps. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15780) missing header file parquet/parquet_version.h
[ https://issues.apache.org/jira/browse/ARROW-15780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497625#comment-17497625 ] David Li commented on ARROW-15780: -- How did you install Arrow? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15780) missing header file parquet/parquet_version.h
Sifang Li created ARROW-15780: - Summary: missing header file parquet/parquet_version.h Key: ARROW-15780 URL: https://issues.apache.org/jira/browse/ARROW-15780 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Environment: Ubuntu 20.04 Reporter: Sifang Li I am following the instructions for writing a table to a parquet file: [https://arrow.apache.org/docs/cpp/parquet.html] I need to include #include "parquet/arrow/writer.h" Apparently one header file is missing from the src - I cannot find it anywhere: In file included from ../3rd_party/arrow/cpp/src/parquet/arrow/writer.h:24, ... ../3rd_party/arrow/cpp/src/parquet/properties.h:31:10: fatal error: parquet/parquet_version.h: No such file or directory 31 | #include "parquet/parquet_version.h" -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15772) [Go][Flight] Server Basic Auth Middleware/Interceptor wrongly base64 decode
[ https://issues.apache.org/jira/browse/ARROW-15772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-15772. --- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12503 [https://github.com/apache/arrow/pull/12503] > [Go][Flight] Server Basic Auth Middleware/Interceptor wrongly base64 decode > --- > > Key: ARROW-15772 > URL: https://issues.apache.org/jira/browse/ARROW-15772 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Affects Versions: 6.0.1, 7.0.0 >Reporter: Risselin Corentin >Priority: Major > Labels: easyfix, pull-request-available > Fix For: 8.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Currently the implementation of the Auth interceptors uses > `base64.RawStdEncoding.DecodeString` to decode the content of the handshake. > In Go, RawStdEncoding does not use padding (with '='), so trying to authenticate > from pyarrow (with `client.authenticate_basic_token(user, password)`) will > result in an error like: > {quote}{{pyarrow._flight.FlightUnauthenticatedError: gRPC returned > unauthenticated error, with message: invalid basic auth encoding: illegal > base64 data at input byte XX}} > {quote} > StdEncoding would successfully read the content where RawStdEncoding fails. -- This message was sent by Atlassian Jira (v8.20.1#820001)
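The fix boils down to the decoder and the client agreeing on base64 padding. A Python analogy of the same mismatch (Python's base64 module standing in for Go's encoding/base64; this is not the Go code from the PR):

```python
import base64
import binascii

token = b"user:password"
padded = base64.b64encode(token)  # standard alphabet, '=' padding
assert padded.endswith(b"==")

# Decoder and encoder agree on padding: round-trips fine.
assert base64.b64decode(padded) == token

# A decoder expecting the other convention fails, which is the shape of
# the Go bug (a RawStdEncoding decoder fed by a padded pyarrow client).
try:
    base64.b64decode(padded.rstrip(b"="))
    raised = False
except binascii.Error:
    raised = True
assert raised
```

The resolution mirrors the last sentence of the report: decode with the convention that tolerates what real clients actually send.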
[jira] [Closed] (ARROW-15704) [C++] Support static linking with customized jemalloc library
[ https://issues.apache.org/jira/browse/ARROW-15704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haowei Yu closed ARROW-15704. - Resolution: Won't Fix > [C++] Support static linking with customized jemalloc library > - > > Key: ARROW-15704 > URL: https://issues.apache.org/jira/browse/ARROW-15704 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Haowei Yu >Priority: Major > > [https://github.com/apache/arrow/blob/3a8e409385c8455e6c80b867c5730965a501d113/cpp/cmake_modules/Findjemalloc.cmake#L68] > > It seems that Findjemalloc.cmake thinks it has found jemalloc only if a > shared library exists. It would be nice if Findjemalloc could statically link > against libjemalloc.a -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14798) [Python] Limit the size of the repr for large Tables
[ https://issues.apache.org/jira/browse/ARROW-14798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-14798. Resolution: Fixed Issue resolved by pull request 12091 [https://github.com/apache/arrow/pull/12091] > [Python] Limit the size of the repr for large Tables > > > Key: ARROW-14798 > URL: https://issues.apache.org/jira/browse/ARROW-14798 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Joris Van den Bossche >Assignee: Will Jones >Priority: Major > Labels: good-first-issue, pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h 10m > Remaining Estimate: 0h > > The new repr is nice in that it shows a preview of the data, but it can also > become very long, flooding your console output for larger tables. > We already default to 10 preview cols, but each column can still consist of > many chunks, so it might be good to also limit it to 2 chunks. > The ChunkedArray.to_string method already has a {{window}} keyword, but that > seems to control both the number of elements to show per chunk and the number > of chunks (while it would be nice to limit e.g. to 2 chunks but show up to 10 > elements for each chunk). > cc [~amol-] -- This message was sent by Atlassian Jira (v8.20.1#820001)
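The stdlib faces the same "truncate a huge repr" concern for generic containers; as a rough analogy only (this is not the pyarrow implementation), `reprlib.Repr` caps how many elements a repr shows:

```python
import reprlib

r = reprlib.Repr()
r.maxlist = 10  # mirror the "10 preview items" idea discussed in the issue

big = list(range(100_000))
preview = r.repr(big)

# The first elements are shown and the tail is elided,
# instead of flooding the console with 100,000 values.
assert preview.startswith("[0, 1, 2")
assert "..." in preview
assert len(preview) < 100
```

The pyarrow fix applies the same principle one level up: cap both the number of chunks and the number of elements per chunk in the preview.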
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497567#comment-17497567 ] Antoine Pitrou commented on ARROW-15765: Of course, another question is: do you need to know the types at all? Without some concrete use cases it's hard to tell. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497566#comment-17497566 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: -- Should we design this feature, or, as [~jorisvandenbossche] and [~westonpace] suggested, use the inverse option to get the type from the Array type without exposing this to the user? This issue currently focuses mainly on the UDF usability piece rather than on improving core functionality of the Arrow Python API. It could be useful, but beyond the scope of this use case it is not very clear to me how useful it would be to the user. What do you think? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497545#comment-17497545 ] Vibhatha Lakmal Abeykoon commented on ARROW-15765: -- [~apitrou] I see your point. There are pitfalls and limitations to this approach. This is mainly a usability piece. I also doubt whether it is worth investing time in it if the applications turn out to be niche. But it feels like a nice-to-have feature to at least support some widely used UDF function signatures. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497542#comment-17497542 ] Antoine Pitrou commented on ARROW-13168: {{date.h}}'s {{set_install()}} seems to support the text form of the IANA database, which is also what R provides. However, on Python, pytz provides the binary form of the IANA database, which {{date.h}} currently doesn't support on Windows. > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Assignee: Will Jones >Priority: Major > Labels: timestamp > > Note: currently the timezone database is not available on Windows, so > timezone-aware operations will fail. > We're using the tz.h library, which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-Windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > Windows. The database would slowly go stale. > # download it from the IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on Windows) are required. > # local user-provided folder - the user could provide a location at build time. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
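For comparison, PEP 615 answered the same "where does the tz database come from" question for Python's stdlib with a configurable search path plus an optional first-party tzdata package, which is essentially option 5 (runtime configuration) from the issue. A stdlib sketch, not Arrow code:

```python
import zoneinfo

# zoneinfo searches these directories for the compiled IANA database,
# falling back to the `tzdata` package from PyPI if none of them match.
original = zoneinfo.TZPATH
assert isinstance(original, tuple)

# Runtime reconfiguration: "the tzdata can be found at this location".
zoneinfo.reset_tzpath(to=["/usr/share/zoneinfo"])
assert zoneinfo.TZPATH == ("/usr/share/zoneinfo",)

# Restore the default search path.
zoneinfo.reset_tzpath()
assert isinstance(zoneinfo.TZPATH, tuple)
```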
[jira] [Resolved] (ARROW-15440) [Go] Implement 'unpack_bool' with Arm64 GoLang Assembly
[ https://issues.apache.org/jira/browse/ARROW-15440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-15440. --- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12398 [https://github.com/apache/arrow/pull/12398] > [Go] Implement 'unpack_bool' with Arm64 GoLang Assembly > --- > > Key: ARROW-15440 > URL: https://issues.apache.org/jira/browse/ARROW-15440 > Project: Apache Arrow > Issue Type: Task > Components: Go >Reporter: Yuqi Gu >Assignee: Yuqi Gu >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Implement 'unpack_bool' with Arm64 GoLang Assembly. > {code:java} > bytes_to_bools_neon > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15723) [Python] Segfault orcWriter write table
[ https://issues.apache.org/jira/browse/ARROW-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497532#comment-17497532 ] Joris Van den Bossche commented on ARROW-15723: --- Thanks for the report. There are potentially multiple issues here. First, writing null arrays is not actually supported (yet). When using the OrcWriter API directly, we can see this (using the table from the code snippet above): {code} In [3]: writer = orc.ORCWriter("test.orc") In [4]: writer.write(table) ... ArrowNotImplementedError: Unknown or unsupported Arrow type: null ../src/arrow/adapters/orc/util.cc:1062 GetOrcType(*arrow_child_type) ../src/arrow/adapters/orc/adapter.cc:811 GetOrcType(*(table.schema())) {code} But it seems that for some reason this error is not bubbled up when using {{write_table}} (which uses this ORCWriter in a context manager). Then, it further seems that the segfault comes from trying to write (close) an empty file. This can be reproduced with the following as well: {code} In [1]: from pyarrow import orc In [2]: writer = orc.ORCWriter("test.orc") In [3]: writer.close() Segmentation fault (core dumped) {code} > [Python] Segfault orcWriter write table > > > Key: ARROW-15723 > URL: https://issues.apache.org/jira/browse/ARROW-15723 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 7.0.0 >Reporter: patrice >Priority: Major > > pyarrow segfaults when trying to write an ORC file from a table containing a > null array. > > from pyarrow import orc > import pyarrow as pa > a = pa.array([1, None, 3, None]) > b = pa.array([None, None, None, None]) > table = pa.table(\{"int64": a, "utf8": b}) > orc.write_table(table, 'test.orc') > zsh: segmentation fault python3 -- This message was sent by Atlassian Jira (v8.20.1#820001)
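The "error not bubbled up" part of this diagnosis is consistent with the classic context-manager pitfall where `__exit__` returns a truthy value. A minimal stdlib-only sketch of that failure mode, with hypothetical names (this is not pyarrow's actual code):

```python
class LeakyWriter:
    """Hypothetical writer whose exit path hides errors."""

    def __enter__(self):
        return self

    def write(self, table):
        raise NotImplementedError("Unknown or unsupported Arrow type: null")

    def __exit__(self, exc_type, exc, tb):
        return True  # returning a truthy value suppresses the exception

def write_table(table):
    with LeakyWriter() as writer:
        writer.write(table)

# Completes silently: the caller never sees the NotImplementedError,
# and any later close-path bug goes unreported too.
write_table(object())
```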
[jira] [Updated] (ARROW-15723) [Python] Segfault orcWriter write table
[ https://issues.apache.org/jira/browse/ARROW-15723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15723: -- Fix Version/s: 8.0.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15715) [Go] ipc.Writer includes unnecessary offsets with sliced arrays
[ https://issues.apache.org/jira/browse/ARROW-15715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Topol resolved ARROW-15715. --- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12453 [https://github.com/apache/arrow/pull/12453] > [Go] ipc.Writer includes unnecessary offsets with sliced arrays > > > Key: ARROW-15715 > URL: https://issues.apache.org/jira/browse/ARROW-15715 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: Chris Hoff >Assignee: Chris Hoff >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > PR incoming. > > Sliced arrays will be serialized with unnecessary trailing offsets for values > that were sliced off. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15765) [Python] Extracting Type information from Python Objects
[ https://issues.apache.org/jira/browse/ARROW-15765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497523#comment-17497523 ] Antoine Pitrou commented on ARROW-15765: Note that this approach limits the expressivity of the type annotations. For example, if you write: {code:python} def compute_func(a: pa.ListArray) -> pa.ListArray: ... {code} ... you are not able to tell what the value type of the list type is. Similarly with parametrized types such as timestamps or decimals. > [Python] Extracting Type information from Python Objects > > > Key: ARROW-15765 > URL: https://issues.apache.org/jira/browse/ARROW-15765 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python >Reporter: Vibhatha Lakmal Abeykoon >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > > When creating user defined functions or similar exercises where we want to > extract the Arrow data types from the type hints, the existing Python API > have some limitations. > An example case is as follows; > {code:java} > def function(array1: pa.Int64Array, arrya2: pa.Int64Array) -> pa.Int64Array: > return pc.call_function("add", [array1, array2]) > {code} > We want to extract the fact that array1 is an `pa.Array` of `pa.Int32Type`. > At the moment there doesn't exist a straightforward manner to get this done. > So the idea is to expose this feature to Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
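Antoine's point about lost type parameters can be reproduced with the stdlib {{typing}} machinery alone: a bare class annotation exposes no parameters to introspection, while a parameterized generic would. The classes below are hypothetical stand-ins, not actual pyarrow types:

```python
from typing import Generic, TypeVar, get_args, get_type_hints


class ListArray:
    """Hypothetical stand-in for pa.ListArray: not generic, so an
    annotation cannot say what the list's value type is."""


T = TypeVar("T")


class TypedListArray(Generic[T]):
    """Hypothetical parameterized alternative that keeps the value type."""


class Int64Type:
    """Hypothetical stand-in for an Arrow value type."""


def compute_func(a: ListArray) -> ListArray: ...
def typed_func(a: TypedListArray[Int64Type]) -> TypedListArray[Int64Type]: ...


# A bare class annotation carries no type parameters to inspect:
bare = get_type_hints(compute_func)["a"]
assert get_args(bare) == ()

# A parameterized generic retains the value type for introspection:
rich = get_type_hints(typed_func)["a"]
assert get_args(rich) == (Int64Type,)
```

The same limitation applies to parametrized types such as timestamps or decimals, where unit, precision, and scale would likewise be invisible in a bare annotation.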
[jira] [Closed] (ARROW-15729) [R] Reading large files randomly freezes
[ https://issues.apache.org/jira/browse/ARROW-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-15729. - Resolution: Duplicate > [R] Reading large files randomly freezes > > > Key: ARROW-15729 > URL: https://issues.apache.org/jira/browse/ARROW-15729 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Christian >Priority: Critical > > Hi - > I recently upgraded to Arrow 6.0.1 and am using it in R. > Whenever reading a large file (~10gb) in Windows it randomly freezes > sometimes. I can see the memory being allocated in the first 10-20 seconds, > but then nothing happens and R just doesn't respond (the R process becomes > idle too). > I'm using the option options(arrow.use_threads=FALSE). > I didn't have this issue with the previous version (0.15.1) I was using. And > the file reads fine under Linux. > I would post a reproducible example but it happens randomly. I even thought I > would just read large files in pieces by first getting all the distinct > sections of a specific column (with compute>collect) but that hangs too. > Any ideas would be appreciated. > *Edit* > Not sure if it makes sense to anyone but after a few tries it seems that the > issue only happens in Rstudio. In the R console it loads it fine. All I'm > executing is the below. > options(arrow.use_threads=FALSE) > aa <- arrow::read_arrow('.../file.arrow5') > One thing I want to point out that the underlying Rscript process under > Rstudio seems to definitely use more than one core when executing the above. > *Edit2* > Using arrow::set_cpu_count(1) seems to solve the issue. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15729) [R] Reading large files randomly freezes
[ https://issues.apache.org/jira/browse/ARROW-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15729: -- Fix Version/s: (was: 6.0.1) > [R] Reading large files randomly freezes > > > Key: ARROW-15729 > URL: https://issues.apache.org/jira/browse/ARROW-15729 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Christian >Priority: Critical > > Hi - > I recently upgraded to Arrow 6.0.1 and am using it in R. > Whenever reading a large file (~10gb) in Windows it randomly freezes > sometimes. I can see the memory being allocated in the first 10-20 seconds, > but then nothing happens and R just doesn't respond (the R process becomes > idle too). > I'm using the option options(arrow.use_threads=FALSE). > I didn't have this issue with the previous version (0.15.1) I was using. And > the file reads fine under Linux. > I would post a reproducible example but it happens randomly. I even thought I > would just read large files in pieces by first getting all the distinct > sections of a specific column (with compute>collect) but that hangs too. > Any ideas would be appreciated. > *Edit* > Not sure if it makes sense to anyone but after a few tries it seems that the > issue only happens in Rstudio. In the R console it loads it fine. All I'm > executing is the below. > options(arrow.use_threads=FALSE) > aa <- arrow::read_arrow('.../file.arrow5') > One thing I want to point out that the underlying Rscript process under > Rstudio seems to definitely use more than one core when executing the above. > *Edit2* > Using arrow::set_cpu_count(1) seems to solve the issue. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15730) [R] Memory usage in R blows up
[ https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15730: -- Fix Version/s: (was: 6.0.1) > [R] Memory usage in R blows up > -- > > Key: ARROW-15730 > URL: https://issues.apache.org/jira/browse/ARROW-15730 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 6.0.1 >Reporter: Christian >Assignee: Will Jones >Priority: Major > Attachments: image-2022-02-19-09-05-32-278.png > > > Hi, > I'm trying to load a ~10gb arrow file into R (under Windows) > _(The file is generated in the 6.0.1 arrow version under Linux)._ > For whatever reason the memory usage blows up to ~110-120gb (in a fresh and > empty R instance). > The weird thing is that when deleting the object again and running a gc() the > memory usage goes down to 90gb only. The delta of ~20-30gb is what I would > have expected the dataframe to use up in memory (and that's also approx. what > was used - in total during the load - when running the old arrow version of > 0.15.1. And it is also what R shows me when just printing the object size.) > The commands I'm running are simply: > options(arrow.use_threads=FALSE); > arrow::set_cpu_count(1); # need this - otherwise it freezes under windows > arrow::read_arrow('file.arrow5') > Is arrow reserving some resources in the background and not giving them up > again? Are there some settings I need to change for this? > Is this something that is known and fixed in a newer version? > *Note* that this doesn't happen in Linux. There all the resources are freed > up when calling the gc() function - not sure if it matters but there I also > don't need to set the cpu count to 1. > Any help would be appreciated. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Reopened] (ARROW-15729) [R] Reading large files randomly freezes
[ https://issues.apache.org/jira/browse/ARROW-15729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reopened ARROW-15729: --- > [R] Reading large files randomly freezes > > > Key: ARROW-15729 > URL: https://issues.apache.org/jira/browse/ARROW-15729 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Christian >Priority: Critical > Fix For: 6.0.1 > > > Hi - > I recently upgraded to Arrow 6.0.1 and am using it in R. > Whenever reading a large file (~10gb) in Windows it randomly freezes > sometimes. I can see the memory being allocated in the first 10-20 seconds, > but then nothing happens and R just doesn't respond (the R process becomes > idle too). > I'm using the option options(arrow.use_threads=FALSE). > I didn't have this issue with the previous version (0.15.1) I was using. And > the file reads fine under Linux. > I would post a reproducible example but it happens randomly. I even thought I > would just read large files in pieces by first getting all the distinct > sections of a specific column (with compute>collect) but that hangs too. > Any ideas would be appreciated. > *Edit* > Not sure if it makes sense to anyone but after a few tries it seems that the > issue only happens in Rstudio. In the R console it loads it fine. All I'm > executing is the below. > options(arrow.use_threads=FALSE) > aa <- arrow::read_arrow('.../file.arrow5') > One thing I want to point out that the underlying Rscript process under > Rstudio seems to definitely use more than one core when executing the above. > *Edit2* > Using arrow::set_cpu_count(1) seems to solve the issue. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15730) [R] Memory usage in R blows up
[ https://issues.apache.org/jira/browse/ARROW-15730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15730: -- Affects Version/s: 6.0.1 > [R] Memory usage in R blows up > -- > > Key: ARROW-15730 > URL: https://issues.apache.org/jira/browse/ARROW-15730 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 6.0.1 >Reporter: Christian >Assignee: Will Jones >Priority: Major > Fix For: 6.0.1 > > Attachments: image-2022-02-19-09-05-32-278.png > > > Hi, > I'm trying to load a ~10gb arrow file into R (under Windows) > _(The file is generated in the 6.0.1 arrow version under Linux)._ > For whatever reason the memory usage blows up to ~110-120gb (in a fresh and > empty R instance). > The weird thing is that when deleting the object again and running a gc() the > memory usage goes down to 90gb only. The delta of ~20-30gb is what I would > have expected the dataframe to use up in memory (and that's also approx. what > was used - in total during the load - when running the old arrow version of > 0.15.1. And it is also what R shows me when just printing the object size.) > The commands I'm running are simply: > options(arrow.use_threads=FALSE); > arrow::set_cpu_count(1); # need this - otherwise it freezes under windows > arrow::read_arrow('file.arrow5') > Is arrow reserving some resources in the background and not giving them up > again? Are there some settings I need to change for this? > Is this something that is known and fixed in a newer version? > *Note* that this doesn't happen in Linux. There all the resources are freed > up when calling the gc() function - not sure if it matters but there I also > don't need to set the cpu count to 1. > Any help would be appreciated. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15748: --- Labels: pull-request-available (was: ) > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Assignee: Rok Mihevc >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.1, 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15779) [Python] Create python bindings for Substrait consumer
Weston Pace created ARROW-15779: --- Summary: [Python] Create python bindings for Substrait consumer Key: ARROW-15779 URL: https://issues.apache.org/jira/browse/ARROW-15779 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Weston Pace We will want to figure out how to expose the Substrait consumer to python. This could be a single method that accepts a buffer of bytes and returns an iterator of record batches but we might also want a helper method that returns a table. I'm thinking this would go in the compute namespace. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15736) [C++] Aggregate functions for min and max index.
[ https://issues.apache.org/jira/browse/ARROW-15736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497519#comment-17497519 ] Joris Van den Bossche commented on ARROW-15736: --- For reference, we already have a "argsort" kernel ({{sort_to_indices}}, from ARROW-1566, later renamed in ARROW-6232) > [C++] Aggregate functions for min and max index. > > > Key: ARROW-15736 > URL: https://issues.apache.org/jira/browse/ARROW-15736 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: A. Coady >Priority: Major > Labels: kernel > > Numpy and Pandas both have `argmin` and `argmax`, for the common use case of > finding values in parallel arrays which correspond to min or max values. > Proposals: > * `min_max_index` for arrays > * `hash_min_max_index` for aggregations > * some ability to break ties: > ** `min_max_index` for tables with multiple sort keys, similar to > `sort_indices` > ** `min_max_indices` for arrays to match all equal values -- This message was sent by Atlassian Jira (v8.20.1#820001)
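The proposed semantics can be sketched in plain Python (the function and key names are hypothetical, mirroring the proposal rather than any existing kernel): return the indices of the minimum and maximum valid values, skipping nulls, with a defined tie-breaking rule.

```python
def min_max_index(values):
    """Hypothetical sketch of the proposed kernel: indices of the min
    and max values, skipping None (null) entries."""
    valid = [(v, i) for i, v in enumerate(values) if v is not None]
    if not valid:
        return {"min_index": None, "max_index": None}
    # tuple comparison breaks ties: first occurrence wins for min,
    # last occurrence wins for max
    return {"min_index": min(valid)[1], "max_index": max(valid)[1]}


result = min_max_index([3, None, 1, 5, 1])
# -> {"min_index": 2, "max_index": 3}
```

A real kernel would need the configurable tie-breaking the ticket proposes (e.g. returning all matching indices); this sketch just fixes one arbitrary rule.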
[jira] [Commented] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497513#comment-17497513 ] Rok Mihevc commented on ARROW-15748: Thanks for the analysis and suggestion Joris! I'll do that and add a python test. > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Assignee: Rok Mihevc >Priority: Minor > Fix For: 7.0.1, 8.0.0 > > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15281) [C++] Implement ability to retrieve fragment filename
[ https://issues.apache.org/jira/browse/ARROW-15281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sanjiban Sengupta reassigned ARROW-15281: - Assignee: Sanjiban Sengupta > [C++] Implement ability to retrieve fragment filename > - > > Key: ARROW-15281 > URL: https://issues.apache.org/jira/browse/ARROW-15281 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Nicola Crane >Assignee: Sanjiban Sengupta >Priority: Major > Labels: dataset, query-engine > > A user has requested the ability to include the filename of the CSV in the > dataset output - see discussion on ARROW-15260 for more context. > Relevant info from that ticket: > > "From a C++ perspective we've got many of the pieces needed already. One > challenge is that the datasets API is written to work with "fragments" and > not "files". For example, a dataset might be an in-memory table in which case > we are working with InMemoryFragment and not FileFragment so there is no > concept of "filename". > That being said, the low level ScanBatchesAsync method actually returns a > generator of TaggedRecordBatch for this very purpose. A TaggedRecordBatch is > a struct with the record batch as well as the source fragment for that record > batch. > So if you were to execute scan, you could inspect the fragment and, if it is > a FileFragment, you could extract the filename. > Another challenge is that R is moving towards more and more access through an > exec plan and not directly using a scanner. In order for that to work we > would need to augment the scan results with the filename in C++ before > sending into the exec plan. Luckily, we already do this a bit as well. We > currently augment the scan results with fragment index, batch index, and > whether the batch is the last batch in the fragment. > Since ExecBatch can work with constants efficiently I don't think there will > be much performance cost in always including the filename. 
So the work > remaining is simply to add a new augmented field __fragment_source_name > which is always attached if the underlying fragment is a filename. Then users > can get this field if they want by including "__fragment_source_name" in > the list of columns they query for." -- This message was sent by Atlassian Jira (v8.20.1#820001)
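The augmentation described above — attaching the source filename to each scanned batch — can be modeled with a small tagging generator. All names below are hypothetical, modeled on the TaggedRecordBatch idea from the ticket, not the actual C++ API:

```python
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class TaggedBatch:
    """Sketch of TaggedRecordBatch: a batch plus its source fragment name."""
    batch: Any
    fragment_source_name: Optional[str]  # None for in-memory fragments


def scan_batches(fragments) -> Iterator[TaggedBatch]:
    """fragments: iterable of (filename_or_None, batches) pairs."""
    for name, batches in fragments:
        for batch in batches:
            # the name is a per-fragment constant, which is why always
            # attaching the filename should cost little
            yield TaggedBatch(batch, name)


tagged = list(scan_batches([("part-0.csv", ["b0", "b1"]), (None, ["b2"])]))
```

In the real design the constant would ride along as an augmented column, so downstream exec-plan nodes see it like any other field.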
[jira] [Commented] (ARROW-15680) [C++] Temporal floor/ceil/round should accept week_start when rounding to multiple of week
[ https://issues.apache.org/jira/browse/ARROW-15680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497508#comment-17497508 ] Dragoș Moldovan-Grünfeld commented on ARROW-15680: -- [~apitrou] this is the issue I mentioned on the Labs call > [C++] Temporal floor/ceil/round should accept week_start when rounding to > multiple of week > --- > > Key: ARROW-15680 > URL: https://issues.apache.org/jira/browse/ARROW-15680 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: kernel > > See ARROW-14821 and the [related > PR|https://github.com/apache/arrow/pull/12154]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-5248) [Python] support zoneinfo / dateutil timezones
[ https://issues.apache.org/jira/browse/ARROW-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-5248. -- Resolution: Fixed Issue resolved by pull request 12421 [https://github.com/apache/arrow/pull/12421] > [Python] support zoneinfo / dateutil timezones > -- > > Key: ARROW-5248 > URL: https://issues.apache.org/jira/browse/ARROW-5248 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Joris Van den Bossche >Assignee: Alenka Frim >Priority: Minor > Labels: beginner, pull-request-available > Fix For: 8.0.0 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > The {{dateutil}} packages also provides a set of timezone objects > (https://dateutil.readthedocs.io/en/stable/tz.html) in addition to {{pytz}}. > In pyarrow, we only support pytz timezones (and the stdlib datetime.timezone > fixed offset): > {code} > In [2]: import dateutil.tz > > > In [3]: import pyarrow as pa > > > In [5]: pa.timestamp('us', dateutil.tz.gettz('Europe/Brussels')) > > > ... > ~/miniconda3/envs/dev37/lib/python3.7/site-packages/pyarrow/types.pxi in > pyarrow.lib.tzinfo_to_string() > ValueError: Unable to convert timezone > `tzfile('/usr/share/zoneinfo/Europe/Brussels')` to string > {code} > But pandas also supports dateutil timezones. As a consequence, when having a > pandas DataFrame that uses a dateutil timezone, you get an error when > converting to an arrow table. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rok Mihevc reassigned ARROW-15748: -- Assignee: Rok Mihevc > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Assignee: Rok Mihevc >Priority: Minor > Fix For: 7.0.1, 8.0.0 > > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497474#comment-17497474 ] David Li commented on ARROW-15778: -- Ah, I misunderstood then, thanks for clarifying. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497472#comment-17497472 ] Antoine Pitrou commented on ARROW-15778: I'm not sure any form of negotiation is needed? The way it works at the IPC level is that the writer emits data in whichever endianness it chooses (also setting the corresponding metadata field to the appropriate value) and the reader decides to byte-swap the data if required. So it would work similarly at the Flight level. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
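The reader-side rule Antoine describes — decode according to the endianness the stream's metadata declares, byte-swapping to native order as needed — can be sketched with the stdlib {{struct}} module (function and field names are illustrative, not Arrow's API):

```python
import struct


def decode_int64s(payload: bytes, declared_endianness: str) -> list:
    """Decode int64 values written in the endianness declared by the
    stream's metadata; struct converts to native order as needed."""
    prefix = "<" if declared_endianness == "little" else ">"
    count = len(payload) // 8
    return list(struct.unpack(f"{prefix}{count}q", payload))


# A big-endian writer emits [1, 2]; any reader recovers the same values,
# regardless of its own host order, because the metadata says "big".
big_payload = struct.pack(">2q", 1, 2)
values = decode_int64s(big_payload, "big")  # -> [1, 2]
```

The bug here is exactly a wrong declaration: if the Java writer omits the field, a C++ reader assumes "little" and — as in {{decode_int64s(big_payload, "little")}} — gets byte-swapped garbage.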
[jira] [Comment Edited] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497471#comment-17497471 ] Joris Van den Bossche edited comment on ARROW-15748 at 2/24/22, 3:26 PM: - [~coady] thanks for the report! The link you provide for the actual behaviour points to the C++ docs, and while that indeed uses "day", the bindings in Python _do_ use "second": https://github.com/apache/arrow/blob/094c5ba186cddd69d4aa83de5ed2b62d4ed07081/python/pyarrow/_compute.pyx#L892 Now, the confusing part is that this class is not instantiated (I assume) if no options are used at all, and in that case it uses the defaults from C++. You can see this in the following example: {code:python} >>> arr = pa.array([pd.Timestamp("2012-01-01 09:01:02.123456")]) >>> import pyarrow.compute as pc >>> pc.round_temporal(arr)# <--- indeed uses "day" by default [ 2012-01-01 00:00:00.00 ] >>> pc.round_temporal(arr, unit="second")# <--- manually specifying >>> "second" still works [ 2012-01-01 09:01:02.00 ] >>> pc.round_temporal(arr, multiple=5)# <--- but when specifying a >>> different option, it now actually defaults to "second" ... [ 2012-01-01 09:01:00.00 ] {code} Now, long story short, the simple conclusion is of course still that we should align the defaults in C++ and Python was (Author: jorisvandenbossche): The link you provide for the actual behaviour points to the C++ docs, and while that indeed uses "day", the bindings in Python _do_ use "second": https://github.com/apache/arrow/blob/094c5ba186cddd69d4aa83de5ed2b62d4ed07081/python/pyarrow/_compute.pyx#L892 Now, the confusing part is that this class is not instantiated (I assume) if no options are used at all, and in that case it uses the defaults from C++. 
You can see this in the following example: {code:python} >>> arr = pa.array([pd.Timestamp("2012-01-01 09:01:02.123456")]) >>> import pyarrow.compute as pc >>> pc.round_temporal(arr)# <--- indeed uses "day" by default [ 2012-01-01 00:00:00.00 ] >>> pc.round_temporal(arr, unit="second")# <--- manually specifying >>> "second" still works [ 2012-01-01 09:01:02.00 ] >>> pc.round_temporal(arr, multiple=5)# <--- but when specifying a >>> different option, it now actually defaults to "second" ... [ 2012-01-01 09:01:00.00 ] {code} Now, long story short, the simple conclusion is of course still that we should align the defaults in C++ and Python > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Priority: Minor > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15748: -- Fix Version/s: 7.0.1 8.0.0 > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Priority: Minor > Fix For: 7.0.1, 8.0.0 > > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15748) [Python] Round temporal options default unit is `day` but documented as `second`.
[ https://issues.apache.org/jira/browse/ARROW-15748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497471#comment-17497471 ] Joris Van den Bossche commented on ARROW-15748: --- The link you provide for the actual behaviour points to the C++ docs, and while that indeed uses "day", the bindings in Python _do_ use "second": https://github.com/apache/arrow/blob/094c5ba186cddd69d4aa83de5ed2b62d4ed07081/python/pyarrow/_compute.pyx#L892 Now, the confusing part is that this class is not instantiated (I assume) if no options are used at all, and in that case it uses the defaults from C++. You can see this in the following example: {code:python} >>> arr = pa.array([pd.Timestamp("2012-01-01 09:01:02.123456")]) >>> import pyarrow.compute as pc >>> pc.round_temporal(arr)# <--- indeed uses "day" by default [ 2012-01-01 00:00:00.00 ] >>> pc.round_temporal(arr, unit="second")# <--- manually specifying >>> "second" still works [ 2012-01-01 09:01:02.00 ] >>> pc.round_temporal(arr, multiple=5)# <--- but when specifying a >>> different option, it now actually defaults to "second" ... [ 2012-01-01 09:01:00.00 ] {code} Now, long story short, the simple conclusion is of course still that we should align the defaults in C++ and Python > [Python] Round temporal options default unit is `day` but documented as > `second`. > - > > Key: ARROW-15748 > URL: https://issues.apache.org/jira/browse/ARROW-15748 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: A. Coady >Priority: Minor > > The [python documentation for round temporal options > |https://arrow.apache.org/docs/dev/python/generated/pyarrow.compute.RoundTemporalOptions.html] > says the default unit is `second`, but the [actual > behavior|https://arrow.apache.org/docs/dev/cpp/api/compute.html#classarrow_1_1compute_1_1_round_temporal_options] > is a default of `day`. -- This message was sent by Atlassian Jira (v8.20.1#820001)
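Joris's diagnosis — the Python-side default only applies when an options object is actually constructed — can be modeled in a few lines of plain Python. The defaults mirror the report; the dispatch logic is a hypothetical sketch of the behavior, not pyarrow's internals:

```python
CPP_DEFAULT_UNIT = "day"  # default baked into the C++ kernel


class RoundTemporalOptions:
    """Sketch of the Python binding, whose own default is 'second'."""

    def __init__(self, multiple=1, unit="second"):
        self.multiple = multiple
        self.unit = unit


def effective_unit(**kwargs):
    """Which unit actually applies for a round_temporal-style call."""
    if kwargs:
        # any explicit option instantiates the Python options class,
        # so its 'second' default kicks in for unspecified fields
        return RoundTemporalOptions(**kwargs).unit
    # no options passed: the binding never builds an options object,
    # and the C++ default applies instead
    return CPP_DEFAULT_UNIT
```

So {{effective_unit()}} yields "day" while {{effective_unit(multiple=5)}} yields "second" — the same surprising split as in the {{pc.round_temporal}} examples above, and the reason the two defaults need to be aligned.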
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497463#comment-17497463 ] David Li commented on ARROW-15778: -- Flight doesn't do any endianness detection/negotiation anyways (it expects producer/consumer to set appropriate options) though we should eventually fix that. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497458#comment-17497458 ] Antoine Pitrou edited comment on ARROW-15645 at 2/24/22, 3:12 PM: -- If my diagnosis above is correct, then this is really caused by ARROW-15778. You could work it around by disabling endianness conversion on the Flight client side, but unfortunately that option is not exposed in Python (see ARROW-15777). was (Author: pitrou): If my diagnosis above is correct, then this is really caused by ARROW-15778. You could work it around by disable endianness conversion on the Flight client side, but unfortunately that is not exposed in Python (see ARROW-15777). > [Flight][Java][C++] Data read through Flight is having endianness issue on > s390x > > > Key: ARROW-15645 > URL: https://issues.apache.org/jira/browse/ARROW-15645 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC, Java >Affects Versions: 5.0.0 > Environment: Linux s390x (big endian) >Reporter: Ravi Gummadi >Priority: Major > > Am facing an endianness issue on s390x(big endian) when converting the data > read through flight to pandas data frame. > (1) table.validate() fails with error > {code} > Traceback (most recent call last): > File "/tmp/2.py", line 51, in > table.validate() > File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in > binary array > {code} > (2) table.to_pandas() gives a segmentation fault > > Here is a sample code that I am using: > {code:python} > from pyarrow import flight > import os > import json > flight_endpoint = os.environ.get("flight_server_url", > "grpc+tls://...local:443") > print(flight_endpoint) > # > class TokenClientAuthHandler(flight.ClientAuthHandler): > """An example implementation of authentication via handshake. 
> With the default constructor, the user token is read from the > environment: TokenClientAuthHandler(). > You can also pass a user token as parameter to the constructor, > TokenClientAuthHandler(yourtoken). > """ > def __init__(self, token: str = None): > super().__init__() > if token is not None: > strToken = 'Bearer {}'.format(token) > else: > strToken = 'Bearer {}'.format(os.environ.get("some_auth_token")) > self.token = strToken.encode('utf-8') > #print(self.token) > def authenticate(self, outgoing, incoming): > outgoing.write(self.token) > self.token = incoming.read() > def get_token(self): > return self.token > > readClient = flight.FlightClient(flight_endpoint) > readClient.authenticate(TokenClientAuthHandler()) > cmd = json.dumps({...}) > descriptor = flight.FlightDescriptor.for_command(cmd) > flightInfo = readClient.get_flight_info(descriptor) > reader = readClient.do_get(flightInfo.endpoints[0].ticket) > table = reader.read_all() > print(table) > print(table.num_columns) > print(table.num_rows) > table.validate() > table.to_pandas() > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
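The "Negative offsets in binary array" error in the traceback above is what a wrongly applied byte-swap looks like. The following is a minimal stdlib-only sketch (hypothetical offsets, not Arrow's actual IPC code): int32 offsets written on a big-endian machine but decoded as little-endian, because the stream was mislabelled.

```python
import struct

# Hypothetical int32 offsets of a binary array, written on a big-endian
# machine but labelled little-endian in the stream (the mislabelling
# diagnosed in ARROW-15778).
offsets = [0, 130, 260, 390]
payload = struct.pack(">4i", *offsets)   # big-endian bytes on the wire

# A reader that trusts the "little" label decodes the wrong way around:
misread = struct.unpack("<4i", payload)
print(misread)

# The decoded values are garbage, and some come out negative --
# exactly the validation failure reported above.
assert misread != tuple(offsets)
assert any(v < 0 for v in misread)
```

Any offset whose low-order byte has the high bit set (e.g. 130 = 0x82) turns into a negative int32 after the spurious swap, which is why validation flags negative offsets rather than merely wrong ones.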
[jira] [Commented] (ARROW-15757) [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior
[ https://issues.apache.org/jira/browse/ARROW-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497462#comment-17497462 ] Joris Van den Bossche commented on ARROW-15757: --- Indeed, we should probably ensure users can pass that keyword in write_to_dataset as well. Currently, the {{**kwargs}} are passed to the ParquetFileFormat write options (for parquet specific write options). Thanks for raising the issue! > [Python] Missing bindings for existing_data_behavior makes it impossible to > maintain old behavior > -- > > Key: ARROW-15757 > URL: https://issues.apache.org/jira/browse/ARROW-15757 > Project: Apache Arrow > Issue Type: Bug > Components: Parquet, Python >Affects Versions: 7.0.0 >Reporter: christophe bagot >Priority: Major > > Shouldn't the missing bindings reported earlier in > [https://github.com/apache/arrow/pull/11632] be propagated higher up [here in > the parquet.py > module|https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L2217]? > Passing **kwargs as is the case for {{write_table}} would do the trick I > think. > I am finding myself stuck while using pandas.to_parquet with > {{use_legacy_dataset=false}} and no way to set the {{existing_data_behavior}} > flag to {{overwrite_or_ignore}} > -- This message was sent by Atlassian Jira (v8.20.1#820001)
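The fix Joris describes, forwarding a dataset-level keyword instead of folding every keyword into the Parquet write options, amounts to plain keyword plumbing. A sketch with hypothetical stand-in functions (not the real pyarrow signatures):

```python
def write_dataset(table, path, *, existing_data_behavior="error", **format_options):
    # Stand-in for pyarrow.dataset.write_dataset: accepts the
    # dataset-level keyword plus format-specific options.
    return {"behavior": existing_data_behavior, "format": format_options}

def write_to_dataset(table, path, existing_data_behavior=None, **kwargs):
    # Proposed shape of the fix: pass the dataset-level keyword through
    # explicitly, while the remaining **kwargs stay format-specific.
    dataset_kwargs = {}
    if existing_data_behavior is not None:
        dataset_kwargs["existing_data_behavior"] = existing_data_behavior
    return write_dataset(table, path, **dataset_kwargs, **kwargs)

result = write_to_dataset(None, "/tmp/x",
                          existing_data_behavior="overwrite_or_ignore",
                          compression="snappy")
assert result["behavior"] == "overwrite_or_ignore"
assert result["format"] == {"compression": "snappy"}
```

The point of separating the two kinds of keywords is that `existing_data_behavior` controls the dataset writer itself, not the file format, so it cannot live in the ParquetFileFormat write options.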
[jira] [Commented] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497458#comment-17497458 ] Antoine Pitrou commented on ARROW-15645: If my diagnosis above is correct, then this is really caused by ARROW-15778. You could work it around by disabling endianness conversion on the Flight client side, but unfortunately that option is not exposed in Python (see ARROW-15777). -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497459#comment-17497459 ] Antoine Pitrou commented on ARROW-15778: Also cc [~lidavidm] since this affects Flight. > [Java] Endianness field not emitted in IPC stream > - > > Key: ARROW-15778 > URL: https://issues.apache.org/jira/browse/ARROW-15778 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > It seems the Java IPC writer implementation does not emit the Endianness > information at all (making it Little by default). This complicates > interoperability with the C++ IPC reader, which does read this information > and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15776) [Python] Expose IpcReadOptions
[ https://issues.apache.org/jira/browse/ARROW-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497460#comment-17497460 ] Antoine Pitrou commented on ARROW-15776: cc [~alenkaf] [~jorisvandenbossche] > [Python] Expose IpcReadOptions > -- > > Key: ARROW-15776 > URL: https://issues.apache.org/jira/browse/ARROW-15776 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Antoine Pitrou >Priority: Major > Fix For: 8.0.0 > > > {{IpcWriteOptions}} is exposed in Python but {{IpcReadOptions}} is not. The > latter is necessary to change endian conversion behaviour. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15757) [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior
[ https://issues.apache.org/jira/browse/ARROW-15757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15757: -- Summary: [Python] Missing bindings for existing_data_behavior makes it impossible to maintain old behavior (was: Missing bindings for existing_data_behavior makes it impossible to maintain old behavior ) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
[ https://issues.apache.org/jira/browse/ARROW-15778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497456#comment-17497456 ] Antoine Pitrou commented on ARROW-15778: The offending code seems to be there: https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java#L202-L213 This seems reasonably easy to fix (perhaps a one-line fix, though a test should ideally be added as well). [~emkornfield] [~kiszk] -- This message was sent by Atlassian Jira (v8.20.1#820001)
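The interoperability failure caused by the missing endianness field reduces to a two-line decision rule. The sketch below is illustrative stdlib Python, not the actual Java or C++ implementation:

```python
import sys

def reader_must_swap(declared: str, native: str) -> bool:
    # An IPC reader byte-swaps exactly when the stream's declared
    # endianness differs from the reader's native endianness.
    return declared != native

# A correct big-endian writer declares "big"; a big-endian reader
# then leaves the data alone.
assert not reader_must_swap(declared="big", native="big")

# The ARROW-15778 bug: the Java writer always declares "little", so a
# big-endian C++ reader swaps data that is already in native byte order.
assert reader_must_swap(declared="little", native="big")

# This interpreter's own native byte order, for reference.
print(sys.byteorder)
```

Because the declared field is the reader's only signal, a writer that hardcodes it defeats the whole scheme, which is why fixing the writer (rather than second-guessing on the reader side) is the clean resolution.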
[jira] [Created] (ARROW-15778) [Java] Endianness field not emitted in IPC stream
Antoine Pitrou created ARROW-15778: -- Summary: [Java] Endianness field not emitted in IPC stream Key: ARROW-15778 URL: https://issues.apache.org/jira/browse/ARROW-15778 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Antoine Pitrou Fix For: 8.0.0 It seems the Java IPC writer implementation does not emit the Endianness information at all (making it Little by default). This complicates interoperability with the C++ IPC reader, which does read this information and acts on it to decide whether it needs to byteswap the incoming data. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-3476) [Java] mvn test in memory fails on a big-endian platform
[ https://issues.apache.org/jira/browse/ARROW-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497455#comment-17497455 ] Antoine Pitrou commented on ARROW-3476: --- [~kiszk] Does this issue need to remain open? > [Java] mvn test in memory fails on a big-endian platform > > > Key: ARROW-3476 > URL: https://issues.apache.org/jira/browse/ARROW-3476 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Kazuaki Ishizaki >Priority: Major > > Apache Arrow is becoming commonplace to exchange data among important > emerging analytics frameworks such as Pandas, Numpy, and Spark. > [IBM Z|https://en.wikipedia.org/wiki/IBM_Z] is one of platforms to process > critical transactions such as bank or credit card. Users of IBM Z want to > extract insights from these transactions using the emerging analytics systems > on IBM Z Linux. These analytics pipelines can be also fast and effective on > IBM Z Linux by using Apache Arrow on memory. > From the technical perspective, since IBM Z Linux uses big-endian data > format, it is not possible to use Apache Arrow in this pipeline. If Apache > Arrow could support big-endian, the use case would be expanded. > When I ran test case of Apache arrow on a big-endian platform (ppc64be), > {{mvn test}} in memory causes a failure due to an assertion. > In {{TestEndianess.testLittleEndian}} test suite, the assertion occurs during > an allocation of a {{RootAllocator}} class. > {code} > $ uname -a > Linux ppc64be.novalocal 4.5.7-300.fc24.ppc64 #1 SMP Fri Jun 10 20:29:32 UTC > 2016 ppc64 ppc64 ppc64 GNU/Linux > $ arch > ppc64 > $ cd java/memory > $ mvn test > [INFO] Scanning for projects... > [INFO] > > [INFO] > > [INFO] Building Arrow Memory 0.12.0-SNAPSHOT > [INFO] > > [INFO] > ... 
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 > s - in org.apache.arrow.memory.TestAccountant > [INFO] Running org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 > s - in org.apache.arrow.memory.TestLowCostIdentityHashMap > [INFO] Running org.apache.arrow.memory.TestBaseAllocator > [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.746 > s <<< FAILURE! - in org.apache.arrow.memory.TestEndianess > [ERROR] testLittleEndian(org.apache.arrow.memory.TestEndianess) Time > elapsed: 0.313 s <<< ERROR! > java.lang.ExceptionInInitializerError > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > Caused by: java.lang.IllegalStateException: Arrow only runs on LittleEndian > systems. > at > org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) > [ERROR] Tests run: 22, Failures: 0, Errors: 21, Skipped: 1, Time elapsed: > 0.055 s <<< FAILURE! - in org.apache.arrow.memory.TestBaseAllocator > ... > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-15563) [C++] Compilation failure on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou closed ARROW-15563. -- Resolution: Done [~mr.chandureddy] I suggest you follow ARROW-15645 for updates. Thank you for this report! > [C++] Compilation failure on s390x platform > --- > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. > -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > Using ld linker > Configured for 
DEBUG build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > -- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version "1.66.0", minimum > required is "1.58") 
found components: regex system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required by 'virtual:world', not found > -- Could NOT find Brotli (missing: BROTLI_COMMON_LIBRARY BROTLI_ENC_LIBRARY > BROTLI_DEC_LIBRARY BROTLI_INCLUDE_DIR) > -- Building brotli from source > -- Building without OpenSSL support. Minimum OpenSSL vers
[jira] [Updated] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15645: --- Component/s: Java (was: Python) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15645) [Flight][Java][C++] Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15645: --- Summary: [Flight][Java][C++] Data read through Flight is having endianness issue on s390x (was: Data read through Flight is having endianness issue on s390x) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497452#comment-17497452 ] Antoine Pitrou commented on ARROW-15645: Ok, so my guess is that both server (Java) and client (Python/C++) are on s390x, right? On Arrow C++ 3.0.0, no conversion happens in either Java or C++, and it works since client and server have the same endianness (both big endian). On Arrow C++ 4.0.0+, the Flight client reads the endianness information from the IPC stream. If the machine endianness doesn't match the stream endianness, endianness conversion is attempted by default. Here is the problem: Arrow Java (and the Java Flight server) seems to always set the endianness information to "little" (even on a big endian machine). Arrow C++ interprets that information as meaning a conversion is needed, while the data is already in the right format. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15777) [Python][Flight] Allow passing IpcReadOptions to FlightCallOptions
[ https://issues.apache.org/jira/browse/ARROW-15777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15777: --- Description: Once {{IpcReadOptions}} is exposed in Python (ARROW-15776), it should also be accepted as an optional parameter to {{FlightCallOptions}}. (was: Once {{IpcReadOptions}} is exposed in Python, it should also be accepted as an optional parameter to {{FlightCallOptions}}.) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15777) [Python][Flight] Allow passing IpcReadOptions to FlightCallOptions
Antoine Pitrou created ARROW-15777: -- Summary: [Python][Flight] Allow passing IpcReadOptions to FlightCallOptions Key: ARROW-15777 URL: https://issues.apache.org/jira/browse/ARROW-15777 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC, Python Reporter: Antoine Pitrou Fix For: 8.0.0 Once {{IpcReadOptions}} is exposed in Python, it should also be accepted as an optional parameter to {{FlightCallOptions}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15776) [Python] Expose IpcReadOptions
[ https://issues.apache.org/jira/browse/ARROW-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15776: --- Fix Version/s: 8.0.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15776) [Python] Expose IpcReadOptions
Antoine Pitrou created ARROW-15776: -- Summary: [Python] Expose IpcReadOptions Key: ARROW-15776 URL: https://issues.apache.org/jira/browse/ARROW-15776 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Antoine Pitrou {{IpcWriteOptions}} is exposed in Python but {{IpcReadOptions}} is not. The latter is necessary to change endian conversion behaviour. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497439#comment-17497439 ] Antoine Pitrou commented on ARROW-15645: Is the client or the server on s390x? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15645: --- Description: Am facing an endianness issue on s390x(big endian) when converting the data read through flight to pandas data frame. (1) table.validate() fails with error {code} Traceback (most recent call last): File "/tmp/2.py", line 51, in table.validate() File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in binary array {code} (2) table.to_pandas() gives a segmentation fault Here is a sample code that I am using: {code:python} from pyarrow import flight import os import json flight_endpoint = os.environ.get("flight_server_url", "grpc+tls://...local:443") print(flight_endpoint) # class TokenClientAuthHandler(flight.ClientAuthHandler): """An example implementation of authentication via handshake. With the default constructor, the user token is read from the environment: TokenClientAuthHandler(). You can also pass a user token as parameter to the constructor, TokenClientAuthHandler(yourtoken). 
""" def \_\_init\_\_(self, token: str = None): super().\_\_init\__() if( token != None): strToken = strToken = 'Bearer {}'.format(token) else: strToken = 'Bearer {}'.format(os.environ.get("some_auth_token")) self.token = strToken.encode('utf-8') #print(self.token) def authenticate(self, outgoing, incoming): outgoing.write(self.token) self.token = incoming.read() def get_token(self): return self.token readClient = flight.FlightClient(flight_endpoint) readClient.authenticate(TokenClientAuthHandler()) cmd = json.dumps(\{...}) descriptor = flight.FlightDescriptor.for_command(cmd) flightInfo = readClient.get_flight_info(descriptor) reader = readClient.do_get(flightInfo.endpoints[0].ticket) table = reader.read_all() print(table) print(table.num_columns) print(table.num_rows) table.validate() table.to_pandas() {code} was: Am facing an endianness issue on s390x(big endian) when converting the data read through flight to pandas data frame. (1) table.validate() fails with error Traceback (most recent call last): File "/tmp/2.py", line 51, in table.validate() File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in binary array (2) table.to_pandas() gives a segmentation fault Here is a sample code that I am using: from pyarrow import flight import os import json flight_endpoint = os.environ.get("flight_server_url", "grpc+tls://...local:443") print(flight_endpoint) # class TokenClientAuthHandler(flight.ClientAuthHandler): """An example implementation of authentication via handshake. With the default constructor, the user token is read from the environment: TokenClientAuthHandler(). You can also pass a user token as parameter to the constructor, TokenClientAuthHandler(yourtoken). 
""" def \_\_init\_\_(self, token: str = None): super().\_\_init\__() if( token != None): strToken = strToken = 'Bearer {}'.format(token) else: strToken = 'Bearer {}'.format(os.environ.get("some_auth_token")) self.token = strToken.encode('utf-8') #print(self.token) def authenticate(self, outgoing, incoming): outgoing.write(self.token) self.token = incoming.read() def get_token(self): return self.token readClient = flight.FlightClient(flight_endpoint) readClient.authenticate(TokenClientAuthHandler()) cmd = json.dumps(\{...}) descriptor = flight.FlightDescriptor.for_command(cmd) flightInfo = readClient.get_flight_info(descriptor) reader = readClient.do_get(flightInfo.endpoints[0].ticket) table = reader.read_all() print(table) print(table.num_columns) print(table.num_rows) table.validate() table.to_pandas() > Data read through Flight is having endianness issue on s390x > > > Key: ARROW-15645 > URL: https://issues.apache.org/jira/browse/ARROW-15645 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC, Python >Affects Versions: 5.0.0 > Environment: Linux s390x (big endian) >Reporter: Ravi Gummadi >Priority: Major > > Am facing an endianness issue on s390x(big endian) when converting the data > read through flight to pandas data frame. > (1) table.validate() fails with error > {code} > Traceback (most recent call last): > File "/tmp/2.py", line 51, in >
[jira] [Updated] (ARROW-15767) [Python] Arrow Table with DenseUnion fails to convert to Python Pandas DataFrame
[ https://issues.apache.org/jira/browse/ARROW-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15767: -- Summary: [Python] Arrow Table with DenseUnion fails to convert to Python Pandas DataFrame (was: [Python] Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame) > [Python] Arrow Table with DenseUnion fails to convert to Python Pandas > DataFrame > > > Key: ARROW-15767 > URL: https://issues.apache.org/jira/browse/ARROW-15767 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Ben Baumgold >Priority: Major > Attachments: nothing.arrow > > > A feather file containing column of nullable values errors when converting to > a Pandas DataFrame. It can be read into a pyarrow.Table as follows: > {code:python} > In [1]: import pyarrow.feather as feather > In [2]: t = feather.read_table("nothing.arrow") > In [3]: t > Out[3]: > pyarrow.Table > col: dense_union<: null=0, : int32 not null=1> > child 0, : null > child 1, : int32 not null > > col: [ -- is_valid: all not null -- type_ids: [ > 1, > 1, > 1, > 0 > ] -- value_offsets: [ > 0, > 1, > 2, > 0 > ] -- child 0 type: null > 1 nulls -- child 1 type: int32 > [ > 1, > 2, > 3 > ]] > {code} > But when trying to convert the pyarrow.Table into a Pandas DataFrame, I get > the following error: > {code:python} > In [4]: t.to_pandas() > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 t.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi in > pyarrow.lib._PandasConvertible.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table._to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > table_to_blockmanager(options, table, categories, ignore_metadata, > types_mapper) > 787 _check_data_column_metadata_consistency(all_columns) > 788 columns = _deserialize_column_index(table, all_columns, > 
column_indexes) > --> 789 blocks = _table_to_blocks(options, table, categories, > ext_columns_dtypes) > 790 > 791 axes = [columns, index] > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > _table_to_blocks(options, block_table, categories, extension_columns) >1126 # Convert an arrow table to Block from the internal pandas API >1127 columns = block_table.column_names > -> 1128 result = pa.lib.table_to_blocks(options, block_table, categories, >1129 list(extension_columns.keys())) >1130 return [_reconstruct_block(item, columns, extension_columns) > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.table_to_blocks() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi in > pyarrow.lib.check_status() > ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of > type dense_union<: null=0, : int32 not null=1> is known. > {code} > Note the Arrow file is valid and can be read successfully by > [Arrow.jl|https://github.com/apache/arrow-julia]. A related issue is > [arrow-julia#285|https://github.com/apache/arrow-julia/issues/285]. The > [^nothing.arrow] file used in this example is attached for convenience. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15767) [Python] Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame
[ https://issues.apache.org/jira/browse/ARROW-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497425#comment-17497425 ] Joris Van den Bossche commented on ARROW-15767: --- There is nothing wrong with your file (it is indeed valid, as it can also be read by pyarrow into a pyarrow.Table), but as the error type indicates: this conversion is just not yet implemented. Specifically for the union types, there are not yet many utilities implemented for interacting with this kind of data on the Python (numpy, pandas) <-> Arrow interaction layer. For example, converting a Python structure to a union array is also not yet implemented (for this I found ARROW-2774). For the missing conversion to pandas, I didn't directly find an issue. For conversion to Python, only a conversion to a plain python list is supported: {code} >>> t["col"].to_pylist() [1, 2, 3, None] {code} In general, we could convert an arrow union type to an object dtype array in numpy/pandas, but that might also not always be very useful. > [Python] Arrow Table with Nullable DenseUnion fails to convert to Python > Pandas DataFrame > - > > Key: ARROW-15767 > URL: https://issues.apache.org/jira/browse/ARROW-15767 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Ben Baumgold >Priority: Major > Attachments: nothing.arrow > > > A feather file containing a column of nullable values errors when converting to > a Pandas DataFrame. 
It can be read into a pyarrow.Table as follows: > {code:python} > In [1]: import pyarrow.feather as feather > In [2]: t = feather.read_table("nothing.arrow") > In [3]: t > Out[3]: > pyarrow.Table > col: dense_union<: null=0, : int32 not null=1> > child 0, : null > child 1, : int32 not null > > col: [ -- is_valid: all not null -- type_ids: [ > 1, > 1, > 1, > 0 > ] -- value_offsets: [ > 0, > 1, > 2, > 0 > ] -- child 0 type: null > 1 nulls -- child 1 type: int32 > [ > 1, > 2, > 3 > ]] > {code} > But when trying to convert the pyarrow.Table into a Pandas DataFrame, I get > the following error: > {code:python} > In [4]: t.to_pandas() > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 t.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi in > pyarrow.lib._PandasConvertible.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table._to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > table_to_blockmanager(options, table, categories, ignore_metadata, > types_mapper) > 787 _check_data_column_metadata_consistency(all_columns) > 788 columns = _deserialize_column_index(table, all_columns, > column_indexes) > --> 789 blocks = _table_to_blocks(options, table, categories, > ext_columns_dtypes) > 790 > 791 axes = [columns, index] > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > _table_to_blocks(options, block_table, categories, extension_columns) >1126 # Convert an arrow table to Block from the internal pandas API >1127 columns = block_table.column_names > -> 1128 result = pa.lib.table_to_blocks(options, block_table, categories, >1129 list(extension_columns.keys())) >1130 return [_reconstruct_block(item, columns, extension_columns) > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.table_to_blocks() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi in > pyarrow.lib.check_status() > 
ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of > type dense_union<: null=0, : int32 not null=1> is known. > {code} > Note the Arrow file is valid and can be read successfully by > [Arrow.jl|https://github.com/apache/arrow-julia]. A related issue is > [arrow-julia#285|https://github.com/apache/arrow-julia/issues/285]. The > [^nothing.arrow] file used in this example is attached for convenience. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15767) [Python] Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame
[ https://issues.apache.org/jira/browse/ARROW-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15767: -- Summary: [Python] Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame (was: Arrow Table with Nullable DenseUnion fails to convert to Python Pandas DataFrame) > [Python] Arrow Table with Nullable DenseUnion fails to convert to Python > Pandas DataFrame > - > > Key: ARROW-15767 > URL: https://issues.apache.org/jira/browse/ARROW-15767 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Ben Baumgold >Priority: Major > Attachments: nothing.arrow > > > A feather file containing column of nullable values errors when converting to > a Pandas DataFrame. It can be read into a pyarrow.Table as follows: > {code:python} > In [1]: import pyarrow.feather as feather > In [2]: t = feather.read_table("nothing.arrow") > In [3]: t > Out[3]: > pyarrow.Table > col: dense_union<: null=0, : int32 not null=1> > child 0, : null > child 1, : int32 not null > > col: [ -- is_valid: all not null -- type_ids: [ > 1, > 1, > 1, > 0 > ] -- value_offsets: [ > 0, > 1, > 2, > 0 > ] -- child 0 type: null > 1 nulls -- child 1 type: int32 > [ > 1, > 2, > 3 > ]] > {code} > But when trying to convert the pyarrow.Table into a Pandas DataFrame, I get > the following error: > {code:python} > In [4]: t.to_pandas() > --- > ArrowNotImplementedError Traceback (most recent call last) > in > > 1 t.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/array.pxi in > pyarrow.lib._PandasConvertible.to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.Table._to_pandas() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > table_to_blockmanager(options, table, categories, ignore_metadata, > types_mapper) > 787 _check_data_column_metadata_consistency(all_columns) > 788 columns = _deserialize_column_index(table, 
all_columns, > column_indexes) > --> 789 blocks = _table_to_blocks(options, table, categories, > ext_columns_dtypes) > 790 > 791 axes = [columns, index] > ~/miniconda3/lib/python3.9/site-packages/pyarrow/pandas_compat.py in > _table_to_blocks(options, block_table, categories, extension_columns) >1126 # Convert an arrow table to Block from the internal pandas API >1127 columns = block_table.column_names > -> 1128 result = pa.lib.table_to_blocks(options, block_table, categories, >1129 list(extension_columns.keys())) >1130 return [_reconstruct_block(item, columns, extension_columns) > ~/miniconda3/lib/python3.9/site-packages/pyarrow/table.pxi in > pyarrow.lib.table_to_blocks() > ~/miniconda3/lib/python3.9/site-packages/pyarrow/error.pxi in > pyarrow.lib.check_status() > ArrowNotImplementedError: No known equivalent Pandas block for Arrow data of > type dense_union<: null=0, : int32 not null=1> is known. > {code} > Note the Arrow file is valid and can be read successfully by > [Arrow.jl|https://github.com/apache/arrow-julia]. A related issue is > [arrow-julia#285|https://github.com/apache/arrow-julia/issues/285]. The > [^nothing.arrow] file used in this example is attached for convenience. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15098) [R] Add binding for lubridate::duration() and/or as.difftime()
[ https://issues.apache.org/jira/browse/ARROW-15098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15098: --- Labels: pull-request-available (was: ) > [R] Add binding for lubridate::duration() and/or as.difftime() > -- > > Key: ARROW-15098 > URL: https://issues.apache.org/jira/browse/ARROW-15098 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Dewey Dunnington >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > After ARROW-14941 we have support for the duration type; however, there is no > binding for {{lubridate::duration()}} or {{as.difftime()}} available in dplyr > evaluation that could create these objects. I'm actually not sure if we > should bind {{lubridate::duration}} since it returns a custom S4 class that's > identical in function to base R's difftime. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15291) [C++][Python] Segfault in StructArray.to_numpy and to_pandas if it contains an ExtensionArray
[ https://issues.apache.org/jira/browse/ARROW-15291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15291: --- Labels: pull-request-available (was: ) > [C++][Python] Segfault in StructArray.to_numpy and to_pandas if it contains > an ExtensionArray > - > > Key: ARROW-15291 > URL: https://issues.apache.org/jira/browse/ARROW-15291 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 6.0.1 > Environment: pyarrow 6.0.1, macbook pro >Reporter: quentin lhoest >Assignee: Alenka Frim >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Hi ! > If you create a StructArray with an ExtensionArray in it, then both to_numpy > and to_pandas segfault in python: > {code:java} > import pyarrow as pa > class CustomType(pa.PyExtensionType): > def __init__(self): > pa.PyExtensionType.__init__(self, pa.binary()) > def __reduce__(self): > return CustomType, () > arr = pa.ExtensionArray.from_storage(CustomType(), pa.array([b"foo"])) > pa.StructArray.from_arrays([arr], ["name"]).to_numpy(zero_copy_only=False) > {code} > Thanks in advance for the help ! -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14948) [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and subtraction with timestamp
[ https://issues.apache.org/jira/browse/ARROW-14948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dragoș Moldovan-Grünfeld reassigned ARROW-14948: Assignee: Dragoș Moldovan-Grünfeld > [R] Implement lubridate's %m+%, %m-%, add_with_rollback, and addition and > subtraction with timestamp > > > Key: ARROW-14948 > URL: https://issues.apache.org/jira/browse/ARROW-14948 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15771) [C++][Compute] Add window join to execution engine
[ https://issues.apache.org/jira/browse/ARROW-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15771: - Labels: query-engine (was: ) > [C++][Compute] Add window join to execution engine > -- > > Key: ARROW-15771 > URL: https://issues.apache.org/jira/browse/ARROW-15771 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: query-engine > > We would want to support window joins with as-of support. > See https://github.com/substrait-io/substrait/issues/3 for more. -- This message was sent by Atlassian Jira (v8.20.1#820001)
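For reference, an as-of join matches each left-side row to the most recent right-side row at or before its timestamp. A minimal sketch of those semantics over sorted Python lists; asof_join is a hypothetical helper name, unrelated to any eventual C++ API:

```python
import bisect

def asof_join(left_times, right_times, right_values):
    """For each left timestamp, take the latest right row with time <= it.

    Assumes right_times is sorted ascending.
    """
    out = []
    for t in left_times:
        i = bisect.bisect_right(right_times, t) - 1  # rightmost time <= t
        out.append(right_values[i] if i >= 0 else None)
    return out

# Trades at t=5/15/25 matched against quotes at t=0/10/20:
print(asof_join([5, 15, 25], [0, 10, 20], ["a", "b", "c"]))  # ['a', 'b', 'c']
```

A left timestamp earlier than every right row gets None, which is the usual "no match yet" behaviour of as-of joins.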
[jira] [Updated] (ARROW-15771) [C++][Compute] Add window join to execution engine
[ https://issues.apache.org/jira/browse/ARROW-15771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15771: - Component/s: C++ > [C++][Compute] Add window join to execution engine > -- > > Key: ARROW-15771 > URL: https://issues.apache.org/jira/browse/ARROW-15771 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Rok Mihevc >Priority: Major > > We would want to support window joins with as-of support. > See https://github.com/substrait-io/substrait/issues/3 for more. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14947) [C++] Implement maths with timestamps
[ https://issues.apache.org/jira/browse/ARROW-14947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497371#comment-17497371 ] Rok Mihevc edited comment on ARROW-14947 at 2/24/22, 12:41 PM: --- [~dragosmg] I didn't look into how rollback works, but I'm guessing it's kind of like _add/sub/mul/div_ followed by a _floor_ or _ceil_? If that is the case you could just use the add/sub/mul/div kernels from ARROW-11090 and add _floor/ceil_ for rollback cases. In case there is something still missing in C++ we should identify it and write Jiras. was (Author: rokm): [~dragosmg] I didn't look into how rollback works, but I'm guessing it's kind of like _floor_ and _ceil_? If that is the case you could just use the add/sub kernels from ARROW-11090 and add _floor/ceil_ for rollback cases. In case there is something still missing in C++ we should identify it and write Jiras. > [C++] Implement maths with timestamps > - > > Key: ARROW-14947 > URL: https://issues.apache.org/jira/browse/ARROW-14947 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Nicola Crane >Priority: Major > > Please could we have maths with timestamps implemented? > In order to implement some of the functionality I'd like in R, I need to be > able to do maths with dates. For example: > * Addition and subtraction: Timestamp + Duration = Timestamp (with and > without rollback so have ability to do e.g. 2021-03-30 minus 1 month and > either get a null back, or 2021-02-28), plus the ability to specify whether > to rollback to the first or last, and whether to preserve or reset the time. > See https://lubridate.tidyverse.org/reference/mplus.html for documentation of > the R functionality. > * Multiplying Durations: Duration * Numeric = Duration -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14947) [C++] Implement maths with timestamps
[ https://issues.apache.org/jira/browse/ARROW-14947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497371#comment-17497371 ] Rok Mihevc commented on ARROW-14947: [~dragosmg] I didn't look into how rollback works, but I'm guessing it's kind of like _floor_ and _ceil_? If that is the case you could just use the add/sub kernels from ARROW-11090 and add _floor/ceil_ for rollback cases. In case there is something still missing in C++ we should identify it and write Jiras. > [C++] Implement maths with timestamps > - > > Key: ARROW-14947 > URL: https://issues.apache.org/jira/browse/ARROW-14947 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Nicola Crane >Priority: Major > > Please could we have maths with timestamps implemented? > In order to implement some of the functionality I'd like in R, I need to be > able to do maths with dates. For example: > * Addition and subtraction: Timestamp + Duration = Timestamp (with and > without rollback so have ability to do e.g. 2021-03-30 minus 1 month and > either get a null back, or 2021-02-28), plus the ability to specify whether > to rollback to the first or last, and whether to preserve or reset the time. > See https://lubridate.tidyverse.org/reference/mplus.html for documentation of > the R functionality. > * Multiplying Durations: Duration * Numeric = Duration -- This message was sent by Atlassian Jira (v8.20.1#820001)
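The rollback behaviour requested in the issue (2021-03-30 minus 1 month giving 2021-02-28) can be pinned down with a small stdlib sketch; add_months_with_rollback is a hypothetical helper for illustration, not an Arrow or lubridate API:

```python
import calendar
import datetime

def add_months_with_rollback(ts: datetime.date, months: int) -> datetime.date:
    """Shift a date by whole months, clamping to the last valid day of the
    target month (lubridate's %m+% / add_with_rollback behaviour)."""
    # Zero-based month arithmetic handles year carries in both directions.
    month_index = ts.year * 12 + (ts.month - 1) + months
    year, month = divmod(month_index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return ts.replace(year=year, month=month, day=min(ts.day, last_day))

print(add_months_with_rollback(datetime.date(2021, 3, 30), -1))  # 2021-02-28
```

This matches Rok's framing: a plain month shift followed by a floor to the last valid day whenever the nominal day does not exist in the target month.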
[jira] [Commented] (ARROW-15645) Data read through Flight is having endianness issue on s390x
[ https://issues.apache.org/jira/browse/ARROW-15645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497337#comment-17497337 ] Ravi Gummadi commented on ARROW-15645: -- The Flight server side is using Java-based Arrow 6.0.1. On the client side, pyarrow 5.0.0, 6.0.0, and 7.0.0 all hit the reported issue. > Data read through Flight is having endianness issue on s390x > > > Key: ARROW-15645 > URL: https://issues.apache.org/jira/browse/ARROW-15645 > Project: Apache Arrow > Issue Type: Bug > Components: C++, FlightRPC, Python >Affects Versions: 5.0.0 > Environment: Linux s390x (big endian) >Reporter: Ravi Gummadi >Priority: Major > > Am facing an endianness issue on s390x(big endian) when converting the data > read through flight to pandas data frame. > (1) table.validate() fails with error > Traceback (most recent call last): > File "/tmp/2.py", line 51, in > table.validate() > File "pyarrow/table.pxi", line 1232, in pyarrow.lib.Table.validate > File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status > pyarrow.lib.ArrowInvalid: Column 1: In chunk 0: Invalid: Negative offsets in > binary array > (2) table.to_pandas() gives a segmentation fault > > Here is a sample code that I am using: > from pyarrow import flight > import os > import json > flight_endpoint = os.environ.get("flight_server_url", > "grpc+tls://...local:443") > print(flight_endpoint) > # > class TokenClientAuthHandler(flight.ClientAuthHandler): > """An example implementation of authentication via handshake. > With the default constructor, the user token is read from the > environment: TokenClientAuthHandler(). > You can also pass a user token as parameter to the constructor, > TokenClientAuthHandler(yourtoken). 
> """
>     def __init__(self, token: str = None):
>         super().__init__()
>         if token is not None:
>             strToken = 'Bearer {}'.format(token)
>         else:
>             strToken = 'Bearer {}'.format(os.environ.get("some_auth_token"))
>         self.token = strToken.encode('utf-8')
>         #print(self.token)
>     def authenticate(self, outgoing, incoming):
>         outgoing.write(self.token)
>         self.token = incoming.read()
>     def get_token(self):
>         return self.token
>
> readClient = flight.FlightClient(flight_endpoint)
> readClient.authenticate(TokenClientAuthHandler())
> cmd = json.dumps({...})
> descriptor = flight.FlightDescriptor.for_command(cmd)
> flightInfo = readClient.get_flight_info(descriptor)
> reader = readClient.do_get(flightInfo.endpoints[0].ticket)
> table = reader.read_all()
> print(table)
> print(table.num_columns)
> print(table.num_rows)
> table.validate()
> table.to_pandas()
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15563) [C++] Compilation failure on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497299#comment-17497299 ] Chandra Shekhar Reddy commented on ARROW-15563: --- [~apitrou] and [~adeetikaushal] I was able to build PyArrow 7.0.0 without any issues. Thank you! On the other hand, issue https://issues.apache.org/jira/browse/ARROW-15645 makes me wary of moving to newer PyArrow versions on ZLinux. Please advise. Thank you! > [C++] Compilation failure on s390x platform > --- > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. 
> -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > Using ld linker > Configured for DEBUG build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > 
-- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version "1.66.0", minimum > required is "1.58") found components: regex system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required
[jira] [Commented] (ARROW-14947) [C++] Implement maths with timestamps
[ https://issues.apache.org/jira/browse/ARROW-14947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497296#comment-17497296 ] Dragoș Moldovan-Grünfeld commented on ARROW-14947: -- [~rokm] I haven't looked much into this. Do you think it's still a C++ component issue? > [C++] Implement maths with timestamps > - > > Key: ARROW-14947 > URL: https://issues.apache.org/jira/browse/ARROW-14947 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++ >Reporter: Nicola Crane >Priority: Major > > Please could we have maths with timestamps implemented? > In order to implement some of the functionality I'd like in R, I need to be > able to do maths with dates. For example: > * Addition and subtraction: Timestamp + Duration = Timestamp (with and > without rollback so have ability to do e.g. 2021-03-30 minus 1 month and > either get a null back, or 2021-02-28), plus the ability to specify whether > to rollback to the first or last, and whether to preserve or reset the time. > See https://lubridate.tidyverse.org/reference/mplus.html for documentation of > the R functionality. > * Multiplying Durations: Duration * Numeric = Duration -- This message was sent by Atlassian Jira (v8.20.1#820001)