[jira] [Commented] (ARROW-8175) [Python] Setup type checking with mypy

2022-12-07 Thread Kyle Barron (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644561#comment-17644561
 ] 

Kyle Barron commented on ARROW-8175:


Working with pyarrow in e.g. vscode is really painful at the moment because of 
the lack of user-facing types. vscode can see the top-level functions, but as 
soon as you create a class instance, its type becomes `Any`. 

{code:python}
import pyarrow.parquet as pq
table = pq.read_table("example.parquet")
table # <-- Any
table. # <-- no auto completions
{code}

This may be a slightly different ask than the title of this ticket, as I'm 
referring to the developer experience while writing code, not _checking_ 
code. I can create a separate ticket if desired.

For now, I may create my own third-party type hints for this using mypy's 
stubgen.

> [Python] Setup type checking with mypy
> --
>
> Key: ARROW-8175
> URL: https://issues.apache.org/jira/browse/ARROW-8175
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, Python
>Reporter: Uwe Korn
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Get mypy checks running, activate things like {{check_untyped_defs}} later.
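
As a sketch of what "activating things later" could look like, a staged mypy 
configuration (the file name and the flags beyond {{check_untyped_defs}} are 
conventional mypy settings, not taken from the ticket):

```ini
# mypy.ini — hypothetical starting point for a staged rollout
[mypy]
# start permissive; flip to True later, as the ticket suggests
check_untyped_defs = False

[mypy-pyarrow.tests.*]
# test modules are often the noisiest; relax them first
ignore_errors = True
```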



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-17404) [Java] Consolidate JNI compilation #2

2022-12-07 Thread David Dali Susanibar Arce (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Dali Susanibar Arce resolved ARROW-17404.
---
Resolution: Fixed

> [Java] Consolidate JNI compilation #2
> -
>
> Key: ARROW-17404
> URL: https://issues.apache.org/jira/browse/ARROW-17404
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Java
>Reporter: David Dali Susanibar Arce
>Priority: Major
>
> *Umbrella ticket for consolidating Java JNI compilation initiative #2*
> The initial part of the consolidate-JNI Java initiative was: consolidate 
> ORC/Dataset code and separate the JNI CMakeLists.txt compilation.
> This 2nd part consists of:
> 1.- Make the Java library able to compile with a single mvn command
> 2.- Make the Java library able to compile from an installed libarrow
> 3.- Migrate the remaining C++ CMakeLists.txt specific to Java into the Java 
> project: ORC / Dataset / Gandiva
> 4.- Add a Windows build script that produces DLLs
> 5.- Incorporate the Windows DLLs into the Maven packages
> 6.- Migrate ORC JNI to use the C Data Interface



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17464) [C++] Support float16 in writing/reading parquet

2022-12-07 Thread Anja Boskovic (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644536#comment-17644536
 ] 

Anja Boskovic commented on ARROW-17464:
---

An update! PARQUET-1222 (https://issues.apache.org/jira/browse/PARQUET-1222), 
which was a blocker for adding float16 support to Parquet, has been merged.

> [C++] Support float16 in writing/reading parquet
> 
>
> Key: ARROW-17464
> URL: https://issues.apache.org/jira/browse/ARROW-17464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet, Python
>Reporter: Anja Boskovic
>Priority: Major
>  Labels: parquet, pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Half-float values are not supported in Parquet. Here is a previous issue that 
> talks about that: https://issues.apache.org/jira/browse/PARQUET-1647
> So, this will not work:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import numpy as np
> arr = pa.array(np.float16([0.1, 2.2, 3]))
> table = pa.table({'a': arr})
> pq.write_table(table, "test_halffloat.parquet") {code}
> This is a proposal to store float16 values in Parquet as FixedSizeBinary, 
> and then restore them to float16 when reading them back in.
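
The proposed round-trip (float16 stored as two-byte FixedSizeBinary, decoded 
on read) can be sketched with the Python standard library alone; the "<e" 
struct format is IEEE half precision, and the values here are illustrative:

```python
import struct

# Illustrative sketch of the proposal: each float16 becomes a 2-byte
# payload (FixedSizeBinary(2) in Parquet terms) and is decoded on read.
values = [0.1, 2.2, 3.0]

encoded = [struct.pack("<e", v) for v in values]   # "<e": little-endian half
assert all(len(b) == 2 for b in encoded)

decoded = [struct.unpack("<e", b)[0] for b in encoded]
assert decoded[2] == 3.0                 # 3.0 is exactly representable
assert abs(decoded[0] - 0.1) < 1e-3      # half precision is lossy
```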



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17404) [Java] Consolidate JNI compilation #2

2022-12-07 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644519#comment-17644519
 ] 

Kouhei Sutou commented on ARROW-17404:
--

[~dsusanibara] Can we close this? Are there any more tasks to be resolved?

> [Java] Consolidate JNI compilation #2
> -
>
> Key: ARROW-17404
> URL: https://issues.apache.org/jira/browse/ARROW-17404
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Java
>Reporter: David Dali Susanibar Arce
>Priority: Major
>
> *Umbrella ticket for consolidating Java JNI compilation initiative #2*
> The initial part of the consolidate-JNI Java initiative was: consolidate 
> ORC/Dataset code and separate the JNI CMakeLists.txt compilation.
> This 2nd part consists of:
> 1.- Make the Java library able to compile with a single mvn command
> 2.- Make the Java library able to compile from an installed libarrow
> 3.- Migrate the remaining C++ CMakeLists.txt specific to Java into the Java 
> project: ORC / Dataset / Gandiva
> 4.- Add a Windows build script that produces DLLs
> 5.- Incorporate the Windows DLLs into the Maven packages
> 6.- Migrate ORC JNI to use the C Data Interface



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-12724) [C++][Docs] Add documentation for authoring compute kernels

2022-12-07 Thread Aldrin Montana (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644485#comment-17644485
 ] 

Aldrin Montana commented on ARROW-12724:


Back from vacation and other tasks; I will try to cut a reviewable draft this 
week, and maybe break the doc up into incremental pieces to make it easier to 
release.

> [C++][Docs] Add documentation for authoring compute kernels
> ---
>
> Key: ARROW-12724
> URL: https://issues.apache.org/jira/browse/ARROW-12724
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: Eduardo Ponce
>Assignee: Aldrin Montana
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 13h 10m
>  Remaining Estimate: 0h
>
> To help incoming developers work in the compute layer, it would be good 
> to have documentation on the process to follow for authoring a new compute 
> kernel. This document can help demystify the inner workings of the functions 
> and data structures in the compute layer.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17464) [C++] Support float16 in writing/reading parquet

2022-12-07 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644454#comment-17644454
 ] 

Apache Arrow JIRA Bot commented on ARROW-17464:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [C++] Support float16 in writing/reading parquet
> 
>
> Key: ARROW-17464
> URL: https://issues.apache.org/jira/browse/ARROW-17464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet, Python
>Reporter: Anja Boskovic
>Assignee: Anja Boskovic
>Priority: Major
>  Labels: parquet, pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Half-float values are not supported in Parquet. Here is a previous issue that 
> talks about that: https://issues.apache.org/jira/browse/PARQUET-1647
> So, this will not work:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import numpy as np
> arr = pa.array(np.float16([0.1, 2.2, 3]))
> table = pa.table({'a': arr})
> pq.write_table(table, "test_halffloat.parquet") {code}
> This is a proposal to store float16 values in Parquet as FixedSizeBinary, 
> and then restore them to float16 when reading them back in.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17404) [Java] Consolidate JNI compilation #2

2022-12-07 Thread Apache Arrow JIRA Bot (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644455#comment-17644455
 ] 

Apache Arrow JIRA Bot commented on ARROW-17404:
---

This issue was last updated over 90 days ago, which may be an indication it is 
no longer being actively worked. To better reflect the current state, the issue 
is being unassigned per [project 
policy|https://arrow.apache.org/docs/dev/developers/bug_reports.html#issue-assignment].
 Please feel free to re-take assignment of the issue if it is being actively 
worked, or if you plan to start that work soon.

> [Java] Consolidate JNI compilation #2
> -
>
> Key: ARROW-17404
> URL: https://issues.apache.org/jira/browse/ARROW-17404
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Java
>Reporter: David Dali Susanibar Arce
>Assignee: David Dali Susanibar Arce
>Priority: Major
>
> *Umbrella ticket for consolidating Java JNI compilation initiative #2*
> The initial part of the consolidate-JNI Java initiative was: consolidate 
> ORC/Dataset code and separate the JNI CMakeLists.txt compilation.
> This 2nd part consists of:
> 1.- Make the Java library able to compile with a single mvn command
> 2.- Make the Java library able to compile from an installed libarrow
> 3.- Migrate the remaining C++ CMakeLists.txt specific to Java into the Java 
> project: ORC / Dataset / Gandiva
> 4.- Add a Windows build script that produces DLLs
> 5.- Incorporate the Windows DLLs into the Maven packages
> 6.- Migrate ORC JNI to use the C Data Interface



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17464) [C++] Support float16 in writing/reading parquet

2022-12-07 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-17464:
-

Assignee: (was: Anja Boskovic)

> [C++] Support float16 in writing/reading parquet
> 
>
> Key: ARROW-17464
> URL: https://issues.apache.org/jira/browse/ARROW-17464
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet, Python
>Reporter: Anja Boskovic
>Priority: Major
>  Labels: parquet, pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Half-float values are not supported in Parquet. Here is a previous issue that 
> talks about that: https://issues.apache.org/jira/browse/PARQUET-1647
> So, this will not work:
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import numpy as np
> arr = pa.array(np.float16([0.1, 2.2, 3]))
> table = pa.table({'a': arr})
> pq.write_table(table, "test_halffloat.parquet") {code}
> This is a proposal to store float16 values in Parquet as FixedSizeBinary, 
> and then restore them to float16 when reading them back in.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-17404) [Java] Consolidate JNI compilation #2

2022-12-07 Thread Apache Arrow JIRA Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Arrow JIRA Bot reassigned ARROW-17404:
-

Assignee: (was: David Dali Susanibar Arce)

> [Java] Consolidate JNI compilation #2
> -
>
> Key: ARROW-17404
> URL: https://issues.apache.org/jira/browse/ARROW-17404
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Documentation, Java
>Reporter: David Dali Susanibar Arce
>Priority: Major
>
> *Umbrella ticket for consolidating Java JNI compilation initiative #2*
> The initial part of the consolidate-JNI Java initiative was: consolidate 
> ORC/Dataset code and separate the JNI CMakeLists.txt compilation.
> This 2nd part consists of:
> 1.- Make the Java library able to compile with a single mvn command
> 2.- Make the Java library able to compile from an installed libarrow
> 3.- Migrate the remaining C++ CMakeLists.txt specific to Java into the Java 
> project: ORC / Dataset / Gandiva
> 4.- Add a Windows build script that produces DLLs
> 5.- Incorporate the Windows DLLs into the Maven packages
> 6.- Migrate ORC JNI to use the C Data Interface



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-12264:
---
Issue Type: Bug  (was: Task)

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> ---
>
> Key: ARROW-12264
> URL: https://issues.apache.org/jira/browse/ARROW-12264
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Antoine Pitrou
>Priority: Major
>
> The Parquet spec (in parquet.thrift) says the following about handling of 
> floating-point statistics:
> {code}
>* (*) Because the sorting order is not specified properly for floating
>* point values (relations vs. total ordering) the following
>* compatibility rules should be applied when reading statistics:
>* - If the min is a NaN, it should be ignored.
>* - If the max is a NaN, it should be ignored.
>* - If the min is +0, the row group may contain -0 values as well.
>* - If the max is -0, the row group may contain +0 values as well.
>* - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when 
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
> return and_(greater_equal(field_expr, literal(min)),
> less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter and yet may be found in the given Parquet 
> column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if 
> either value is NaN.
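
A minimal sketch of the failure mode and of the suggested fix (the predicate 
names follow the ticket's suggestion; the composition is illustrative, not 
Arrow's implementation):

```python
import math

# The quoted push-down filter: value >= min AND value <= max.
# It rejects NaN, yet the column chunk may still contain NaNs.
min_stat, max_stat = 0.0, 10.0
nan = float("nan")

keep = (nan >= min_stat) and (nan <= max_stat)
assert keep is False  # NaN fails every ordered comparison

# A "greater_equal_or_nan"-style predicate (name from the ticket's
# suggestion) would keep the chunk:
def ge_or_nan(value, bound):
    return math.isnan(value) or value >= bound

def le_or_nan(value, bound):
    return math.isnan(value) or value <= bound

assert ge_or_nan(nan, min_stat) and le_or_nan(nan, max_stat)   # chunk kept
assert ge_or_nan(5.0, min_stat) and le_or_nan(5.0, max_stat)   # in range
assert not ge_or_nan(-1.0, min_stat)   # still prunes out-of-range values
```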



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

2022-12-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644354#comment-17644354
 ] 

Antoine Pitrou edited comment on ARROW-12264 at 12/7/22 2:08 PM:
-

cc [~westonpace]


was (Author: pitrou):
cc @westonpace

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> ---
>
> Key: ARROW-12264
> URL: https://issues.apache.org/jira/browse/ARROW-12264
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Antoine Pitrou
>Priority: Major
>
> The Parquet spec (in parquet.thrift) says the following about handling of 
> floating-point statistics:
> {code}
>* (*) Because the sorting order is not specified properly for floating
>* point values (relations vs. total ordering) the following
>* compatibility rules should be applied when reading statistics:
>* - If the min is a NaN, it should be ignored.
>* - If the max is a NaN, it should be ignored.
>* - If the min is +0, the row group may contain -0 values as well.
>* - If the max is -0, the row group may contain +0 values as well.
>* - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when 
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
> return and_(greater_equal(field_expr, literal(min)),
> less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter and yet may be found in the given Parquet 
> column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if 
> either value is NaN.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-12264) [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down

2022-12-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644354#comment-17644354
 ] 

Antoine Pitrou commented on ARROW-12264:


cc @westonpace

> [C++][Dataset] Handle NaNs correctly in Parquet predicate push-down
> ---
>
> Key: ARROW-12264
> URL: https://issues.apache.org/jira/browse/ARROW-12264
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Antoine Pitrou
>Priority: Major
>
> The Parquet spec (in parquet.thrift) says the following about handling of 
> floating-point statistics:
> {code}
>* (*) Because the sorting order is not specified properly for floating
>* point values (relations vs. total ordering) the following
>* compatibility rules should be applied when reading statistics:
>* - If the min is a NaN, it should be ignored.
>* - If the max is a NaN, it should be ignored.
>* - If the min is +0, the row group may contain -0 values as well.
>* - If the max is -0, the row group may contain +0 values as well.
>* - When looking for NaN values, min and max should be ignored.
> {code}
> It appears that the dataset code uses the following filter expression when 
> doing Parquet predicate push-down (in {{file_parquet.cc}}):
> {code:c++}
> return and_(greater_equal(field_expr, literal(min)),
> less_equal(field_expr, literal(max)));
> {code}
> A NaN value will fail that filter and yet may be found in the given Parquet 
> column chunk.
> We may instead need a "greater_equal_or_nan" comparison that returns true if 
> either value is NaN.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-14799) [C++] Adding tabular pretty printing of Table / RecordBatch

2022-12-07 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644352#comment-17644352
 ] 

Joris Van den Bossche commented on ARROW-14799:
---

If we tackle this in C++, it might be worth checking out DuckDB's 
implementation. If we decide to tackle this in the bindings, for Python it 
might be worth checking out ibis's implementation (built on rich; they recently 
revamped their table representation, including support for nested columns).

> [C++] Adding tabular pretty printing of Table / RecordBatch
> ---
>
> Key: ARROW-14799
> URL: https://issues.apache.org/jira/browse/ARROW-14799
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>
> It would be nice to show a "preview" (e.g. the first and last xx rows) of 
> a Table or RecordBatch in a traditional tabular form (like pandas DataFrames 
> or R data.frame / tibbles have, or in a format that resembles markdown 
> tables). 
> This could also be added in the bindings, but doing it at the C++ 
> level would benefit multiple bindings at once.
> Based on a quick search, there is https://github.com/p-ranav/tabulate, which 
> could be vendored (it has a single-include version).
> I suppose that nested data types could represent a challenge on how to 
> include those in a tabular format, though.
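
For illustration, a tiny markdown-style renderer of the kind of preview the 
ticket describes (the function name and layout are made up; a real 
implementation would live in C++ and handle truncation and nested types):

```python
def render_table(columns):
    """Render a {name: values} mapping as a markdown-style table (sketch)."""
    names = list(columns)
    rows = list(zip(*(columns[n] for n in names)))
    # Each column is as wide as its widest cell or its header.
    widths = [max(len(n), *(len(str(r[i])) for r in rows))
              for i, n in enumerate(names)]
    header = " | ".join(n.ljust(w) for n, w in zip(names, widths))
    sep = "-+-".join("-" * w for w in widths)
    body = "\n".join(" | ".join(str(v).ljust(w) for v, w in zip(r, widths))
                     for r in rows)
    return "\n".join([header, sep, body])

print(render_table({"x": [1, 22, 333], "y": ["a", "b", "c"]}))
```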



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-13240) [C++][Parquet] Page statistics not written in v2?

2022-12-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644350#comment-17644350
 ] 

Antoine Pitrou commented on ARROW-13240:


[~jorgecarleitao] Could you try to check if that still happens with the latest 
PyArrow?

> [C++][Parquet] Page statistics not written in v2?
> -
>
> Key: ARROW-13240
> URL: https://issues.apache.org/jira/browse/ARROW-13240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jorge Leitão
>Priority: Major
>
> While working on integration tests of parquet2 against pyarrow, I noticed 
> that page statistics are only written by pyarrow when using version 1.
> I do not have an easy way to reproduce this within pyarrow, as I am not sure 
> how to access individual pages from a column chunk, but it is something that 
> I observe when trying to integrate.
> The row group stats are still written; this only affects page statistics.
> pyarrow call:
> {code:python}
> pa.parquet.write_table(
>     t,
>     path,
>     version="2.0",
>     data_page_version="2.0",
>     write_statistics=True,
> )
> {code}
> changing version to "1.0" does not impact this behavior, suggesting that the 
> specific option causing this behavior is the data_page_version="2.0".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-13240) [C++][Parquet] Page statistics not written in v2?

2022-12-07 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17644349#comment-17644349
 ] 

Antoine Pitrou commented on ARROW-13240:


[~emkornfield] When would that have happened?

> [C++][Parquet] Page statistics not written in v2?
> -
>
> Key: ARROW-13240
> URL: https://issues.apache.org/jira/browse/ARROW-13240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jorge Leitão
>Priority: Major
>
> While working on integration tests of parquet2 against pyarrow, I noticed 
> that page statistics are only written by pyarrow when using version 1.
> I do not have an easy way to reproduce this within pyarrow, as I am not sure 
> how to access individual pages from a column chunk, but it is something that 
> I observe when trying to integrate.
> The row group stats are still written; this only affects page statistics.
> pyarrow call:
> {code:python}
> pa.parquet.write_table(
>     t,
>     path,
>     version="2.0",
>     data_page_version="2.0",
>     write_statistics=True,
> )
> {code}
> changing version to "1.0" does not impact this behavior, suggesting that the 
> specific option causing this behavior is the data_page_version="2.0".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-13240) [C++][Parquet] Page statistics not written in v2?

2022-12-07 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13240:
---
Priority: Major  (was: Minor)

> [C++][Parquet] Page statistics not written in v2?
> -
>
> Key: ARROW-13240
> URL: https://issues.apache.org/jira/browse/ARROW-13240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jorge Leitão
>Priority: Major
>
> While working on integration tests of parquet2 against pyarrow, I noticed 
> that page statistics are only written by pyarrow when using version 1.
> I do not have an easy way to reproduce this within pyarrow, as I am not sure 
> how to access individual pages from a column chunk, but it is something that 
> I observe when trying to integrate.
> The row group stats are still written; this only affects page statistics.
> pyarrow call:
> {code:python}
> pa.parquet.write_table(
>     t,
>     path,
>     version="2.0",
>     data_page_version="2.0",
>     write_statistics=True,
> )
> {code}
> changing version to "1.0" does not impact this behavior, suggesting that the 
> specific option causing this behavior is the data_page_version="2.0".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18123) [Python] Cannot use multi-byte characters in file names in write_table

2022-12-07 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche resolved ARROW-18123.
---
Resolution: Fixed

Issue resolved by pull request 14764
https://github.com/apache/arrow/pull/14764

> [Python] Cannot use multi-byte characters in file names in write_table
> --
>
> Key: ARROW-18123
> URL: https://issues.apache.org/jira/browse/ARROW-18123
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 9.0.0
>Reporter: SHIMA Tatsuya
>Assignee: Miles Granger
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Error when specifying a file path containing multi-byte characters in 
> {{pyarrow.parquet.write_table}}.
> For example, use {{例.parquet}} as the file path.
> {code:python}
> Python 3.10.7 (main, Oct  5 2022, 14:33:54) [GCC 10.2.1 20210110] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import pandas as pd
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> df = pd.DataFrame({'one': [-1, np.nan, 2.5],
> ...'two': ['foo', 'bar', 'baz'],
> ...'three': [True, False, True]},
> ...index=list('abc'))
> >>> table = pa.Table.from_pandas(df)
> >>> import pyarrow.parquet as pq
> >>> pq.write_table(table, '例.parquet')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 2920, in write_table
> with ParquetWriter(
>   File
> "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/parquet/__init__.py",
> line 911, in __init__
> filesystem, path = _resolve_filesystem_and_path(
>   File "/home/vscode/.local/lib/python3.10/site-packages/pyarrow/fs.py", line
> 184, in _resolve_filesystem_and_path
> filesystem, path = FileSystem.from_uri(path)
>   File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
>   File "pyarrow/error.pxi", line 144, in
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Cannot parse URI: '例.parquet'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-18320) [C++] Flight client may crash due to improper Result/Status conversion

2022-12-07 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-18320.
--
Fix Version/s: 11.0.0
   Resolution: Fixed

Resolved by [https://github.com/apache/arrow/pull/14859]

> [C++] Flight client may crash due to improper Result/Status conversion
> --
>
> Key: ARROW-18320
> URL: https://issues.apache.org/jira/browse/ARROW-18320
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, FlightRPC
>Affects Versions: 6.0.0
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 11.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Reported on user@ 
> https://lists.apache.org/thread/84z329t1djhnbr5bq936v4hr8cyngj2l 
> {noformat}
> I have an issue in my project: we have a query execution engine that
> returns result data as a Flight stream, and a C++ client that receives the
> stream. If a query has no results but the result schema implies
> dictionary-encoded fields, the client app crashes.
> The cause is in cpp/src/arrow/flight/client.cc:461:
> ::arrow::Result<std::unique_ptr<ipc::Message>> ReadNextMessage() override {
>   if (stream_finished_) {
>     return nullptr;
>   }
>   internal::FlightData* data;
>   {
>     auto guard = read_mutex_ ? std::unique_lock<std::mutex>(*read_mutex_)
>                              : std::unique_lock<std::mutex>();
>     peekable_reader_->Next();
>   }
>   if (!data) {
>     stream_finished_ = true;
>     return stream_->Finish(Status::OK());  // Here the issue
>   }
>   // Validate IPC message
>   auto result = data->OpenMessage();
>   if (!result.ok()) {
>     return stream_->Finish(std::move(result).status());
>   }
>   *app_metadata_ = std::move(data->app_metadata);
>   return result;
> }
> The method returns a Result object while stream_->Finish(..) returns a Status,
> so there is an implicit conversion from Status to Result that causes the
> Result(Status) constructor to be called; but that constructor expects only
> error statuses, which in turn causes the app to abort:
> /// Constructs a Result object with the given non-OK Status object. All
> /// calls to ValueOrDie() on this object will abort. The given `status` must
> /// not be an OK status, otherwise this constructor will abort.
> ///
> /// This constructor is not declared explicit so that a function with a return
> /// type of `Result<T>` can return a Status object, and the status will be
> /// implicitly converted to the appropriate return type as a matter of
> /// convenience.
> ///
> /// \param status The non-OK Status object to initialize to.
> Result(const Status& status) noexcept  // NOLINT(runtime/explicit)
>     : status_(status) {
>   if (ARROW_PREDICT_FALSE(status.ok())) {
>     internal::DieWithMessage(
>         std::string("Constructed with a non-error status: ") + status.ToString());
>   }
> }
> Is there a way to work around or fix it? We use Arrow 6.0.0, but it seems
> that the issue exists in all later versions.
> {noformat}
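
The invariant being violated is easy to model outside C++; a Python analogue 
(all names hypothetical) of the "Result constructed from an OK Status" failure:

```python
class Status:
    """Tiny stand-in for a status type: either OK or an error."""
    def __init__(self, ok):
        self._ok = ok
    def ok(self):
        return self._ok

class Result:
    """Holds either a value or a *non-OK* status, mirroring the invariant
    in the report: constructing from an OK status is a fatal error."""
    def __init__(self, value=None, status=None):
        if status is not None and status.ok():
            # The real Arrow code aborts the process here.
            raise RuntimeError("Constructed with a non-error status")
        self.value, self.status = value, status

ok, err = Status(True), Status(False)
assert Result(status=err).status is err   # fine: error status
try:
    Result(status=ok)                     # the reported bug pattern
    raised = False
except RuntimeError:
    raised = True
assert raised
```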



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18427) [C++] Support negative tolerance in `AsofJoinNode`

2022-12-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18427:

Description: Currently, `AsofJoinNode` supports a tolerance that is 
non-negative, allowing past-joining, i.e., joining right-table rows with a 
timestamp at or before that of the left-table row. This issue will add support 
for a negative tolerance, which would allow future-joining too.  (was: 
Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
past-joining, i.e., joining right-table rows with a timestamp at or before that 
of the left-table row. This issue will add support for a positive tolerance, 
which would allow future-joining too.)

> [C++] Support negative tolerance in `AsofJoinNode`
> --
>
> Key: ARROW-18427
> URL: https://issues.apache.org/jira/browse/ARROW-18427
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
> past-joining, i.e., joining right-table rows with a timestamp at or before 
> that of the left-table row. This issue will add support for a negative 
> tolerance, which would allow future-joining too.
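
The intended semantics can be sketched independently of Arrow (function and 
variable names are illustrative, and this simplified matcher ignores join keys 
and ties):

```python
def asof_match(left_t, right_times, tol):
    """Pick the right-row timestamp an as-of join would match (sketch)."""
    if tol >= 0:
        # Past-join: right rows with timestamps in [left_t - tol, left_t].
        candidates = [r for r in right_times if left_t - tol <= r <= left_t]
        return max(candidates, default=None)
    # Negative tolerance (this ticket): future-join over [left_t, left_t + |tol|].
    candidates = [r for r in right_times if left_t <= r <= left_t - tol]
    return min(candidates, default=None)

right = [1, 3, 5, 7]
assert asof_match(4, right, 2) == 3     # past-join: latest right row <= 4
assert asof_match(4, right, -2) == 5    # future-join: earliest right row >= 4
assert asof_match(4, right, 0) is None  # no right row exactly at 4
```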



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18427) [C++] Support negative tolerance in `AsofJoinNode`

2022-12-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18427:

Summary: [C++] Support negative tolerance in `AsofJoinNode`  (was: [C++] 
Support negative toletance in `AsofJoinNode`)

> [C++] Support negative tolerance in `AsofJoinNode`
> --
>
> Key: ARROW-18427
> URL: https://issues.apache.org/jira/browse/ARROW-18427
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
> past-joining, i.e., joining right-table rows with a timestamp at or before 
> that of the left-table row. This issue will add support for a positive 
> tolerance, which would allow future-joining too.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-18427) [C++] Support negative toletance in `AsofJoinNode`

2022-12-07 Thread Yaron Gvili (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaron Gvili updated ARROW-18427:

Summary: [C++] Support negative toletance in `AsofJoinNode`  (was: [C++] 
Suppose negative toletance in `AsofJoinNode`)

> [C++] Support negative toletance in `AsofJoinNode`
> --
>
> Key: ARROW-18427
> URL: https://issues.apache.org/jira/browse/ARROW-18427
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>
> Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
> past-joining, i.e., joining right-table rows with a timestamp at or before 
> that of the left-table row. This issue will add support for a negative 
> tolerance, which would allow future-joining too.





[jira] [Created] (ARROW-18427) [C++] Suppose negative toletance in `AsofJoinNode`

2022-12-07 Thread Yaron Gvili (Jira)
Yaron Gvili created ARROW-18427:
---

 Summary: [C++] Suppose negative toletance in `AsofJoinNode`
 Key: ARROW-18427
 URL: https://issues.apache.org/jira/browse/ARROW-18427
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Yaron Gvili
Assignee: Yaron Gvili


Currently, `AsofJoinNode` supports a tolerance that is non-negative, allowing 
past-joining, i.e., joining right-table rows with a timestamp at or before that 
of the left-table row. This issue will add support for a negative tolerance, 
which would allow future-joining too.





[jira] [Updated] (ARROW-18003) [Python] Add sort_by to RecordBatch

2022-12-07 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18003:
--
Labels: good-first-issue  (was: )

> [Python] Add sort_by to RecordBatch
> ---
>
> Key: ARROW-18003
> URL: https://issues.apache.org/jira/browse/ARROW-18003
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Priority: Major
>  Labels: good-first-issue
> Fix For: 11.0.0
>
>






[jira] [Updated] (ARROW-18003) [Python] Add sort_by to RecordBatch

2022-12-07 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-18003:
--
Summary: [Python] Add sort_by to RecordBatch  (was: [Python] Add sort_by to 
Table and RecordBatch)

> [Python] Add sort_by to RecordBatch
> ---
>
> Key: ARROW-18003
> URL: https://issues.apache.org/jira/browse/ARROW-18003
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Alessandro Molina
>Priority: Major
> Fix For: 11.0.0
>
>






[jira] [Closed] (ARROW-18402) [C++] Expose `DeclarationInfo`

2022-12-07 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-18402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace closed ARROW-18402.
---
Resolution: Fixed

Fixed by PR https://github.com/apache/arrow/pull/14765

> [C++] Expose `DeclarationInfo`
> --
>
> Key: ARROW-18402
> URL: https://issues.apache.org/jira/browse/ARROW-18402
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Yaron Gvili
>Assignee: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> `DeclarationInfo` is just a pair of `Declaration` and `Schema`, which are 
> public APIs, and so can be made public API itself. This can be part of or a 
> follow-up on [https://github.com/apache/arrow/pull/14485], and will allow 
> implementing extension providers, whose API depends on `DeclarationInfo`, 
> outside of the Arrow repo.


