[jira] [Commented] (ARROW-13240) [C++][Parquet] Page statistics not written in v2?

2023-01-06 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17655568#comment-17655568
 ] 

Micah Kornfield commented on ARROW-13240:
-

https://issues.apache.org/jira/browse/ARROW-11634?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20text%20~%20%22dictionary%20statistics%22
 is the issue I was referring to, so it looks like this landed in version 6.0; the last 
update to that JIRA was September 2021, a little after this bug report.

> [C++][Parquet] Page statistics not written in v2?
> -
>
> Key: ARROW-13240
> URL: https://issues.apache.org/jira/browse/ARROW-13240
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jorge Leitão
>Priority: Major
>
> While working in integration tests of parquet2 against pyarrow, I noticed 
> that page statistics are only written by pyarrow when using version 1.
> I do not have an easy way to reproduce this within pyarrow as I am not sure 
> how to access individual pages from a column chunk, but it is something that 
> I observe when trying to integrate.
> The row group stats are still written, this only affects page statistics.
> pyarrow call:
> ```
> pa.parquet.write_table(
>     t,
>     path,
>     version="2.0",
>     data_page_version="2.0",
>     write_statistics=True,
> )
> ```
> changing version to "1.0" does not impact this behavior, suggesting that the 
> specific option causing this behavior is the data_page_version="2.0".
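
For reference, row-group (column-chunk) statistics can at least be inspected from Python even though individual pages can't; a minimal sketch along the lines of the call above (the table contents and file name are placeholders):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder table and path; the real data comes from the parquet2
# integration tests mentioned in the report.
table = pa.table({"x": [1, 2, 3]})
path = "stats_check.parquet"
pq.write_table(table, path, version="2.0",
               data_page_version="2.0", write_statistics=True)

# Row-group (column-chunk) statistics remain readable either way;
# page-level statistics are not exposed through the Python API.
stats = pq.ParquetFile(path).metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)
{code}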



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-18253) [C++][Parquet] Improve bounds checking on some inputs

2022-11-04 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-18253:
---

 Summary: [C++][Parquet] Improve bounds checking on some inputs
 Key: ARROW-18253
 URL: https://issues.apache.org/jira/browse/ARROW-18253
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield


In some cases we don't check for a lower bound of 0; on some non-performance-critical 
paths we only have DCHECKs; and, while unlikely, in some cases we cast 
from size_t to int32, which can overflow.  Adding some safety checks here would 
be useful.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17983) [Parquet][C++][Python] "List index overflow" when read parquet file

2022-10-17 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17619234#comment-17619234
 ] 

Micah Kornfield commented on ARROW-17983:
-

IIRC, the offset type here is inferred from the schema (i.e. List vs 
LargeList) that we are trying to read back into; once the offsets reach the int32 max 
we can't return a result, since the reading path doesn't support chunking at the moment.

A few options to fix this seem to be:
1. Infer that LargeList should be used based on RowGroup/File statistics.
2. Allow overriding the schema (this might already be an option) to take a 
LargeList override (see the sketch below).
3. Modify the code to allow for chunking arrays (I seem to recall this would be a 
fair amount of work based on current assumptions, but it's been a while since I 
dug into the code).

I seem to recall someone tried prototyping 2 recently, but I'm having trouble 
finding the thread/JIRA at the moment.
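
For option 2, a rough sketch of what the override could look like from Python, assuming the schema argument of pq.read_table is honored for this (the file and column names are hypothetical):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical file whose list column overflows int32 offsets when read
# back as a plain List; "col" is a placeholder column name.
schema = pa.schema([pa.field("col", pa.large_list(pa.int64()))])

# If the override is honored, offsets accumulate as int64 and the
# "List index overflow" error should not be hit.
table = pq.read_table("big_lists.parquet", schema=schema)
{code}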

> [Parquet][C++][Python] "List index overflow" when read parquet file
> ---
>
> Key: ARROW-17983
> URL: https://issues.apache.org/jira/browse/ARROW-17983
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Reporter: Yibo Cai
>Priority: Major
>
> From issue https://github.com/apache/arrow/issues/14229.
> The bug looks like this:
> - create a pandas dataframe with *one column* and {{n}} rows, {{n < 
> max(int32)}}
> - each element is a list with {{m}} integers, {{m * n > max(int32)}}
> - save to a parquet file
> - reading from the parquet file fails with "OSError: List index overflow"
> See the comment below for details on reproducing this bug:
> https://github.com/apache/arrow/issues/14229#issuecomment-1272223773
> Tested with a small dataset, the error might come from the code below.
> https://github.com/apache/arrow/blob/master/cpp/src/parquet/level_conversion.cc#L63-L64
> {{OffsetType}} is {{int32}}, but the loop is executed (and {{*offset}} is 
> incremented) {{m * n}} times which is beyond {{max(int32)}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Closed] (ARROW-10784) [Python] Loading pyarrow.compute isn't thread safe

2022-10-15 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield closed ARROW-10784.
---
Resolution: Cannot Reproduce

> [Python] Loading pyarrow.compute isn't thread safe
> --
>
> Key: ARROW-10784
> URL: https://issues.apache.org/jira/browse/ARROW-10784
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: Micah Kornfield
>Priority: Major
>
> When using Arrow in a multithreaded environment it is possible to trigger an 
> initialization race on the pyarrow.compute module when calling Array.flatten.
>  
> Flatten calls _pc() which imports pyarrow compute but if two threads call 
> flatten at the same time it is possible that the global initialization of 
> functions from the registry will be incomplete and therefore cause an 
> AttributeError.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-10784) [Python] Loading pyarrow.compute isn't thread safe

2022-10-15 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618185#comment-17618185
 ] 

Micah Kornfield commented on ARROW-10784:
-

yes, haven't had any luck with a repro
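
A repro attempt along these lines might look like the sketch below (racing Array.flatten, which lazily imports pyarrow.compute, from several threads on a fresh interpreter); this is not the exact code that was tried:

{code:python}
import threading
import pyarrow as pa

# Array.flatten lazily imports pyarrow.compute via _pc(); racing it from
# several threads right after startup is the suspected trigger.
arr = pa.array([[1, 2], [3]])

def worker():
    for _ in range(1000):
        arr.flatten()

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
{code}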

> [Python] Loading pyarrow.compute isn't thread safe
> --
>
> Key: ARROW-10784
> URL: https://issues.apache.org/jira/browse/ARROW-10784
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: Micah Kornfield
>Priority: Major
>
> When using Arrow in a multithreaded environment it is possible to trigger an 
> initialization race on the pyarrow.compute module when calling Array.flatten.
>  
> Flatten calls _pc() which imports pyarrow compute but if two threads call 
> flatten at the same time it is possible that the global initialization of 
> functions from the registry will be incomplete and therefore cause an 
> AttributeError.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17535) [Python] List arrays aren't supported in to_pandas calls

2022-10-15 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618152#comment-17618152
 ] 

Micah Kornfield commented on ARROW-17535:
-

Yeah, so I agree with the conclusion that scalar conversion should not currently 
be used, as it isn't used today except in to_pylist.  I think even using 
the to_pandas call might be tricky, but if it can work then it would be a good 
idea to pursue, as the approach I outlined above could get complicated.




> [Python] List arrays aren't supported in to_pandas calls
> ---
>
> Key: ARROW-17535
> URL: https://issues.apache.org/jira/browse/ARROW-17535
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Micah Kornfield
>Priority: Minor
>
> EXTENSION is not in the list of types allowed.  I think in order to enable 
> EXTENSION we need to be able to call to_pylist or similar on the original 
> extension array from C++ code, in case there were user provided overrides.  
> Off the top of my head one way of doing this would be to pass through an 
> additional std::unordered_map where PyObject is the bound 
> to_pylist python function.  Are there other alternatives that might be cleaner?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-09-02 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-16326:
---

Assignee: Micah Kornfield

> [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
> ---
>
> Key: ARROW-16326
> URL: https://issues.apache.org/jira/browse/ARROW-16326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: good-first-issue, good-second-issue
>
> Follow-up from [https://github.com/apache/arrow/pull/12763] if gcs testbench 
> isn't installed properly the failure mode is tests timeouts because the 
> connection hangs.  We should add a timeout parameter to prevent this



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-09-02 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599613#comment-17599613
 ] 

Micah Kornfield commented on ARROW-16326:
-

This was actually done in the PR.  "retry_limit_seconds" can be passed through 
to the URI and if no connection is established within that time it will fail.
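
A small illustration of passing it through the URI (the bucket name is a placeholder):

{code:python}
import pyarrow.fs as fs

# retry_limit_seconds bounds how long connection attempts are retried
# before the call fails; the bucket name here is a placeholder.
gcs, path = fs.FileSystem.from_uri(
    "gs://anonymous@some-bucket/?retry_limit_seconds=5")
print(gcs.get_file_info(path))
{code}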

> [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
> ---
>
> Key: ARROW-16326
> URL: https://issues.apache.org/jira/browse/ARROW-16326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Micah Kornfield
>Priority: Major
>  Labels: good-first-issue, good-second-issue
>
> Follow-up from [https://github.com/apache/arrow/pull/12763] if gcs testbench 
> isn't installed properly the failure mode is tests timeouts because the 
> connection hangs.  We should add a timeout parameter to prevent this



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-09-02 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-16326.
-
Resolution: Fixed

> [C++][Python] Add GCS Timeout parameter for GCS FileSystem.
> ---
>
> Key: ARROW-16326
> URL: https://issues.apache.org/jira/browse/ARROW-16326
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: good-first-issue, good-second-issue
>
> Follow-up from [https://github.com/apache/arrow/pull/12763] if gcs testbench 
> isn't installed properly the failure mode is tests timeouts because the 
> connection hangs.  We should add a timeout parameter to prevent this



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-09-01 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599111#comment-17599111
 ] 

Micah Kornfield commented on ARROW-17459:
-

It's probably a case of different batch sizes.

Suppose you have 100 rows that have 3GB of string data evenly distributed.  If 
you try to read all 100 rows it will overflow and create a ChunkedArray.  If 
you read 50 rows at a time it wouldn't be an issue, because chunking wouldn't be 
necessary.
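
A sketch of that smaller-batch reading (the file name, batch size, and consuming function are placeholders):

{code:python}
import pyarrow.parquet as pq

# Reading in smaller batches keeps each batch's string data under the
# 2 GiB offset limit, so no chunked-array conversion is needed.
pf = pq.ParquetFile("wide_strings.parquet")
for batch in pf.iter_batches(batch_size=50):
    process(batch)  # placeholder for whatever consumes the data
{code}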

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Assignee: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-09-01 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17599053#comment-17599053
 ] 

Micah Kornfield commented on ARROW-17459:
-

[~arthurpassos] awesome, nice work.  IMO, I don't think we can change the 
default to LargeBinary/LargeString: as you can see from the assertion, there 
is an expectation that the types produced match the schema.  Also, for most 
use-cases they aren't necessary, require extra memory, and might be less 
well supported in other implementations.

I think the right way of approaching this is to have an option users can set 
(maybe one for each type) that works on two levels (see the sketch below):
1.  Translate any non-large types in the schema to their large variants.
2.  Make the changes at the decoder level that you have already done.

So we keep the assertion, but if users run into this issue we can provide 
guidance on how to set the option.
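
A hypothetical sketch of what level 1 (schema translation) could look like from the Python side; the helper name and the exact set of types covered are assumptions, not an existing API:

{code:python}
import pyarrow as pa

def to_large_variants(schema: pa.Schema) -> pa.Schema:
    """Rewrite binary-like and list fields to their 64-bit-offset variants."""
    def convert(t: pa.DataType) -> pa.DataType:
        if pa.types.is_string(t):
            return pa.large_string()
        if pa.types.is_binary(t):
            return pa.large_binary()
        if pa.types.is_list(t):
            return pa.large_list(convert(t.value_type))
        return t
    return pa.schema([f.with_type(convert(f.type)) for f in schema])
{code}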

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Assignee: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598742#comment-17598742
 ] 

Micah Kornfield commented on ARROW-17459:
-

You would have to follow this up the stack from the previous comments.  Without 
seeing the stack trace it is a bit hard to give guidance, but I'd guess there 
are a few places in the linked code that always expected BinaryArray/BinaryBuilder 
and might down_cast; these would need to be adjusted accordingly.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-31 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598501#comment-17598501
 ] 

Micah Kornfield commented on ARROW-17459:
-

Yes, I think there are some code changes needed: we hard-code a non-large 
[BinaryBuilder|https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L1693]
 for accumulating chunks, which is then used when [decoding 
arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/encoding.cc#L1392].

To answer your questions, I don't think the second case applies.  As far as I 
know, Parquet C++ does its own chunking and doesn't try to read back the exact 
chunking that the values were written with.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598058#comment-17598058
 ] 

Micah Kornfield commented on ARROW-17459:
-

i.e. LargeBinary, LargeString, and LargeList; these are distinct types that use 
int64 offsets instead of int32.
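
In pyarrow terms, a quick sketch of the parallel types:

{code:python}
import pyarrow as pa

# 32-bit offsets: the data referenced by one array is capped near 2 GiB.
pa.string()
pa.list_(pa.int64())

# 64-bit offsets: the Large* variants lift that cap at the cost of a
# larger offsets buffer.
pa.large_string()
pa.large_list(pa.int64())
{code}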

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17459) [C++] Support nested data conversions for chunked array

2022-08-30 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17598038#comment-17598038
 ] 

Micah Kornfield commented on ARROW-17459:
-

1.  ChunkedArrays have a Flatten method that will do this, but I don't think it 
will help in this case.  IIRC the challenge here is that parquet only 
yields chunked arrays if the underlying column data cannot fit into the right 
arrow structure.  In this case, for Utf8 arrays, it means the sum of bytes across 
all strings has to be less than INT_MAX.  Otherwise it would need to 
flatten to LargeUtf8, which has implications for schema conversion.  Structs and 
lists always expect Arrays as their inner element types, not chunked 
arrays.
2.  This doesn't necessarily seem like the right approach.
3.  Per 1, this isn't really the issue, I think.  The approach here that could 
work (I don't remember all the code paths) is to vary the number of rows read 
back if not all rows are huge.

One way forward here could be to add an option for reading back arrays to 
always use the Large* variant (or maybe on a per-column basis) to avoid 
chunking.

> [C++] Support nested data conversions for chunked array
> ---
>
> Key: ARROW-17459
> URL: https://issues.apache.org/jira/browse/ARROW-17459
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Arthur Passos
>Priority: Blocker
>
> `FileReaderImpl::ReadRowGroup` fails with "Nested data conversions not 
> implemented for chunked array outputs". It fails on 
> [ChunksToSingle]([https://github.com/apache/arrow/blob/7f6b074b84b1ca519b7c5fc7da318e8d47d44278/cpp/src/parquet/arrow/reader.cc#L95])
> Data schema is: 
> {code:java}
>   optional group fields_map (MAP) = 217 {
>     repeated group key_value {
>       required binary key (STRING) = 218;
>       optional binary value (STRING) = 219;
>     }
>   }
> fields_map.key_value.value-> Size In Bytes: 13243589 Size In Ratio: 0.20541047
> fields_map.key_value.key-> Size In Bytes: 3008860 Size In Ratio: 0.046667963
> {code}
> Is there a way to work around this issue in the cpp lib?
> In any case, I am willing to implement this, but I need some guidance. I am 
> very new to parquet (as in started reading about it yesterday).
>  
> Probably related to: https://issues.apache.org/jira/browse/ARROW-10958



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (ARROW-17535) [Python] List arrays aren't supported in to_pandas calls

2022-08-25 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-17535:
---

 Summary: [Python] List arrays aren't supported in 
to_pandas calls
 Key: ARROW-17535
 URL: https://issues.apache.org/jira/browse/ARROW-17535
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Micah Kornfield


EXTENSION is not in the list of types allowed.  I think in order to enable 
EXTENSION we need to be able to call to_pylist or similar on the original 
extension array from C++ code, in case there were user provided overrides.  Off 
the top of my head one way of doing this would be to pass through an additional 
std::unordered_map where PyObject is the bound to_pylist 
python function.  Are there other alternatives that might be cleaner?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17069) [Python][R] GCSFIleSystem reports cannot resolve host on public buckets

2022-07-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566710#comment-17566710
 ] 

Micah Kornfield commented on ARROW-17069:
-

Also does it help if you increase retry_limit_seconds to something higher?

> [Python][R] GCSFIleSystem reports cannot resolve host on public buckets
> ---
>
> Key: ARROW-17069
> URL: https://issues.apache.org/jira/browse/ARROW-17069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Critical
> Fix For: 9.0.0
>
>
> GCSFileSystem returns {{Couldn't resolve host name}} if you don't supply 
> {{anonymous}} as the user:
> {code:python}
> import pyarrow.dataset as ds
> # Fails:
> dataset = 
> ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 749, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 441, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 408, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in 
> GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name)
> # This works fine:
> >>> dataset = 
> >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> {code}
> I would expect that we could connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17069) [Python][R] GCSFIleSystem reports cannot resolve host on public buckets

2022-07-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566709#comment-17566709
 ] 

Micah Kornfield commented on ARROW-17069:
-

That is surprising, but I'm not an expert on GCS or why this would impact host 
resolution and not permissions.  Just to clarify, [~willjones127], you have a 
similar example working on a private bucket?

> [Python][R] GCSFIleSystem reports cannot resolve host on public buckets
> ---
>
> Key: ARROW-17069
> URL: https://issues.apache.org/jira/browse/ARROW-17069
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python, R
>Affects Versions: 8.0.0
>Reporter: Will Jones
>Assignee: Will Jones
>Priority: Critical
> Fix For: 9.0.0
>
>
> GCSFileSystem returns {{Couldn't resolve host name}} if you don't supply 
> {{anonymous}} as the user:
> {code:python}
> import pyarrow.dataset as ds
> # Fails:
> dataset = 
> ds.dataset("gs://voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 749, in dataset
> return _filesystem_dataset(source, **kwargs)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 441, in _filesystem_dataset
> fs, paths_or_selector = _ensure_single_source(source, filesystem)
>   File "/Users/willjones/Documents/arrows/arrow/python/pyarrow/dataset.py", 
> line 408, in _ensure_single_source
> file_info = filesystem.get_file_info(path)
>   File "pyarrow/_fs.pyx", line 444, in pyarrow._fs.FileSystem.get_file_info
> info = GetResultValue(self.fs.GetFileInfo(path))
>   File "pyarrow/error.pxi", line 144, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
> raise IOError(message)
> OSError: google::cloud::Status(UNAVAILABLE: Retry policy exhausted in 
> GetObjectMetadata: EasyPerform() - CURL error [6]=Couldn't resolve host name)
> # This works fine:
> >>> dataset = 
> >>> ds.dataset("gs://anonymous@voltrondata-labs-datasets/nyc-taxi/?retry_limit_seconds=3")
> {code}
> I would expect that we could connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-7494) [Java] Remove reader index and writer index from ArrowBuf

2022-07-03 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561933#comment-17561933
 ] 

Micah Kornfield commented on ARROW-7494:


It still seems like a valid improvement to me, but it has proved to be a little 
challenging in general.

> [Java] Remove reader index and writer index from ArrowBuf
> -
>
> Key: ARROW-7494
> URL: https://issues.apache.org/jira/browse/ARROW-7494
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Jacques Nadeau
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Reader and writer index and functionality doesn't belong on a chunk of memory 
> and is due to inheritance from ByteBuf. As part of removing ByteBuf 
> inheritance, we should also remove reader and writer indexes from ArrowBuf 
> functionality. It wastes heap memory for rare utility. In general, a slice 
> can be used instead of a reader/writer index pattern.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-16339) [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to Arrow Schema metadata

2022-05-09 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534121#comment-17534121
 ] 

Micah Kornfield commented on ARROW-16339:
-

Q1: I also think the answer is yes.

Q2: Yes, that seems correct to me as well.  I think the assumption is Arrow 
didn't produce this but KV metadata is still useful.

Q3: Yes, it seems like the two things should be separate flags.

Q4: Not sure on this one, it is a little surprising they aren't already covered 
with the Arrow Schema?

> [C++][Parquet] Parquet FileMetaData key_value_metadata not always mapped to 
> Arrow Schema metadata
> -
>
> Key: ARROW-16339
> URL: https://issues.apache.org/jira/browse/ARROW-16339
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Context: I ran into this issue when reading Parquet files created by GDAL 
> (using the Arrow C++ APIs, [https://github.com/OSGeo/gdal/pull/5477]), which 
> writes files that have custom key_value_metadata, but without storing 
> ARROW:schema in those metadata (cc [~paleolimbot])
> —
> Both in reading and writing files, I expected that we would map Arrow 
> {{Schema::metadata}} with Parquet {{FileMetaData::key_value_metadata}}. 
> But apparently this doesn't (always) happen out of the box, and only happens 
> through the "ARROW:schema" field (which stores the original Arrow schema, and 
> thus the metadata stored in this schema).
> For example, when writing a Table with schema metadata, this is not stored 
> directly in the Parquet FileMetaData (code below is using branch from 
> ARROW-16337 to have the {{store_schema}} keyword):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': [1, 2, 3]}, metadata={"key": "value"})
> pq.write_table(table, "test_metadata_with_arrow_schema.parquet")
> pq.write_table(table, "test_metadata_without_arrow_schema.parquet", 
> store_schema=False)
> # original schema has metadata
> >>> table.schema
> a: int64
> -- schema metadata --
> key: 'value'
> # reading back only has the metadata in case we stored ARROW:schema
> >>> pq.read_table("test_metadata_with_arrow_schema.parquet").schema
> a: int64
> -- schema metadata --
> key: 'value'
> # and not if ARROW:schema is absent
> >>> pq.read_table("test_metadata_without_arrow_schema.parquet").schema
> a: int64
> {code}
> It seems that if we store the ARROW:schema, we _also_ store the schema 
> metadata separately. But if {{store_schema}} is False, we also stop writing 
> those metadata (not fully sure if this is the intended behaviour, and that's 
> the reason for the above output):
> {code:python}
> # when storing the ARROW:schema, we ALSO store key:value metadata
> >>> pq.read_metadata("test_metadata_with_arrow_schema.parquet").metadata
> {b'ARROW:schema': b'/7AQAAAKAA4ABgAFAA...',
>  b'key': b'value'}
> # when not storing the schema, we also don't store the key:value
> >>> pq.read_metadata("test_metadata_without_arrow_schema.parquet").metadata 
> >>> is None
> True
> {code}
> On the reading side, it seems that we generally do read custom key/value 
> metadata into schema metadata. We don't have the pyarrow APIs at the moment 
> to create such a file (given the above), but with a small patch I could 
> create such a file:
> {code:python}
> # a Parquet file with ParquetFileMetaData::metadata that ONLY has a custom key
> >>> pq.read_metadata("test_metadata_without_arrow_schema2.parquet").metadata
> {b'key': b'value'}
> # this metadata is now correctly mapped to the Arrow schema metadata
> >>> pq.read_schema("test_metadata_without_arrow_schema2.parquet")
> a: int64
> -- schema metadata --
> key: 'value'
> {code}
> But if you have a file that has both custom key/value metadata and an 
> "ARROW:schema" key, we actually ignore the custom keys, and only look at the 
> "ARROW:schema" one. 
> This was the case that I ran into with GDAL, where I have a file with both 
> keys, but where the custom "geo" key is not also included in the serialized 
> arrow schema in the "ARROW:schema" key:
> {code:python}
> # includes both keys in the Parquet file
> >>> pq.read_metadata("test_gdal.parquet").metadata
> {b'geo': b'{"version":"0.1.0","...',
>  b'ARROW:schema': b'/3gBAAAQ...'}
> # the "geo" key is lost in the Arrow schema
> >>> pq.read_table("test_gdal.parquet").schema.metadata is None
> True
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16484) [Go][Parquet] Ensure a WriterVersion is written out in parquet go.

2022-05-05 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16484:
---

 Summary: [Go][Parquet] Ensure a WriterVersion is written out in 
parquet go.
 Key: ARROW-16484
 URL: https://issues.apache.org/jira/browse/ARROW-16484
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Go
Reporter: Micah Kornfield
Assignee: Matthew Topol


We should ensure that writer version information for parquet files is populated.  
I tried searching the Go code but could only find reading the version back.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16433) [Release][C++] parquet-arrow-test test fails on windows

2022-05-02 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17530800#comment-17530800
 ] 

Micah Kornfield commented on ARROW-16433:
-

Is there a stack trace or exception that can be shared?  I don't have access to 
a Windows box; can a docker image be used here to reproduce?

> [Release][C++] parquet-arrow-test test fails on windows
> ---
>
> Key: ARROW-16433
> URL: https://issues.apache.org/jira/browse/ARROW-16433
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jacob Wujciak-Jens
>Priority: Major
>
> While verifying RC1 on windows this error happened reproducible using 
> "verify-release-candidate.bat"
> {code:java}
> The following tests FAILED: 
> 59 - parquet-arrow-test (Failed) 
> C:/tmp/arrow-verify-release/apache-arrow-8.0.0/cpp/src/parquet/arrow/arrow_reader_writer_test.cc(1236):
>  error: Expected: this->ReadAndCheckSingleColumnFile(*values) doesn't 
> generate new fatal failures in the current thread. 
>   Actual: it does. 
> [ FAILED ] TestInt96ParquetIO.ReadIntoTimestamp (0 ms)
> {code}
> cc [~apitrou] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16326) [C++][Python] Add GCS Timeout parameter for GCS FileSystem.

2022-04-25 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16326:
---

 Summary: [C++][Python] Add GCS Timeout parameter for GCS 
FileSystem.
 Key: ARROW-16326
 URL: https://issues.apache.org/jira/browse/ARROW-16326
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Follow-up from [https://github.com/apache/arrow/pull/12763]: if the gcs testbench 
isn't installed properly, the failure mode is test timeouts because the 
connection hangs.  We should add a timeout parameter to prevent this.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16118) [C++] Reduce memory usage when writing to IPC

2022-04-21 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525893#comment-17525893
 ] 

Micah Kornfield commented on ARROW-16118:
-

Also, we should be careful about how this is enabled: if someone is actually 
consuming the stream in real time, there would need to be some sort of 
coordination to ensure bytes aren't sent prematurely.

> [C++] Reduce memory usage when writing to IPC
> -
>
> Key: ARROW-16118
> URL: https://issues.apache.org/jira/browse/ARROW-16118
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Jorge Leitão
>Priority: Major
>
> Writing a record batch to IPC ([header][buffers]) currently requires O(N*B) 
> where N is the average size of the buffer and B the number of buffers in the 
> recordbatch.
> This is because we need the buffer location and total number of bytes to 
> write the header of the record, which is only known after e.g. knowing by 
> how much the buffers were compressed.
> When the writer supports seeking, this memory usage can be reduced to O(N) 
> where N is the average size of a primitive buffer over all fields. This is 
> done using the following pseudo-code implementation:
> {code:java}
> start = writer.seek(current);
> empty_locations = create_empty_header(schema)
> write_header(writer, empty_locations)
> locations = write_buffers(writer, batch)
> writer.seek(start)
> write_header(writer, locations)
> {code}
> This has a significantly lower memory footprint. O(N) vs O(N*B)
> It could be interesting for the C++ implementation to support this.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16270) [C++][Python][FileSystem] Make directory paths returned uniform

2022-04-21 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16270:
---

 Summary: [C++][Python][FileSystem] Make directory paths returned 
uniform
 Key: ARROW-16270
 URL: https://issues.apache.org/jira/browse/ARROW-16270
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Micah Kornfield


Depending on whether paths are selected with recursion or without, the 
returned directories either include a trailing slash or don't (see the 
code linked below and the sketch after it).  It would be nice to provide 
consistent output here.  It isn't clear if the breaking change is worthwhile here.

 

 [1] 
https://github.com/apache/arrow/blob/3eaa7dd0e8b3dabc5438203331f05e3e6c011e37/python/pyarrow/tests/test_fs.py#L688

  [2] 
https://github.com/apache/arrow/blob/3eaa7dd0e8b3dabc5438203331f05e3e6c011e37/cpp/src/arrow/filesystem/test_util.cc#L767
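
For illustration, the listing call whose output differs; a sketch against the local filesystem (the directory path is a placeholder):

{code:python}
import pyarrow.fs as pafs

local = pafs.LocalFileSystem()
base = "/tmp/some_dir"  # placeholder directory

# Directory entries returned by the two calls may or may not carry a
# trailing slash, which is the inconsistency described above.
non_recursive = local.get_file_info(pafs.FileSelector(base, recursive=False))
recursive = local.get_file_info(pafs.FileSelector(base, recursive=True))
for info in non_recursive + recursive:
    print(info.type, info.path)
{code}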



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-12203) [C++][Python] Switch default Parquet version to 2.4

2022-04-20 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525397#comment-17525397
 ] 

Micah Kornfield commented on ARROW-12203:
-

CC [~willb_google] 

 

I think this will still potentially cause issues with imports into BQ if unsigned 
types are used.  The project has generally been pretty patient, so I 
understand if there is a strong desire to move forward with it.  Will can 
probably give a better timeline on when BQ would be able to handle the logical 
types.
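
Regardless of what the default becomes, a writer can pin the format version explicitly; a small sketch of the difference for an unsigned column (file names are placeholders):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"u": pa.array([1, 2, 3], type=pa.uint32())})

# "1.0" keeps maximum compatibility and does not emit the UINT_32 logical
# type; "2.4" allows it, which is the compatibility concern for downstream
# importers noted above.
pq.write_table(table, "compat_v1.parquet", version="1.0")
pq.write_table(table, "logical_v24.parquet", version="2.4")
{code}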

> [C++][Python] Switch default Parquet version to 2.4
> ---
>
> Key: ARROW-12203
> URL: https://issues.apache.org/jira/browse/ARROW-12203
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 8.0.0
>
>
> Currently, Parquet write APIs default to maximum-compatibility Parquet 
> version "1.0", which disables some logical types such as UINT32. We may want 
> to switch the default to "2.0" instead, to allow faithful representation of 
> more types.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16227) [Archery] Make cpp argument list keyword only

2022-04-18 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16227:
---

 Summary: [Archery] Make cpp argument list keyword only
 Key: ARROW-16227
 URL: https://issues.apache.org/jira/browse/ARROW-16227
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Archery
Reporter: Micah Kornfield
Assignee: Micah Kornfield


cpp params should be keyword-only.  See 
[https://github.com/apache/arrow/pull/12763/files#r852112789] (i.e. adding "*," 
before all keyword options).
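
A tiny illustration of the change (the function and parameter names here are made up, not the actual archery signature):

{code:python}
# Before: callers may pass these positionally, which is easy to get wrong.
def cpp(source_dir, build_dir, cmake_extras=None):
    ...

# After: the bare "*" forces everything after it to be passed by keyword.
def cpp(source_dir, *, build_dir=None, cmake_extras=None):
    ...
{code}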



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16226) [C++] Add better coverage for filesystem tell.

2022-04-18 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16226:
---

 Summary: [C++] Add better coverage for filesystem tell.
 Key: ARROW-16226
 URL: https://issues.apache.org/jira/browse/ARROW-16226
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Add a generic C++ filesystem test that writes N bytes to a file, then 
seeks to N/2 and reads the remainder.  Verify the remainder is N/2 bytes 
and matches the bytes written.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16160) [C++] IPC Stream Reader doesn't check if extra fields are present for RecordBatches

2022-04-08 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519819#comment-17519819
 ] 

Micah Kornfield commented on ARROW-16160:
-

It appears that on the master branch we get:
"pyarrow.lib.ArrowInvalid: Tried reading schema message, was null or length 0"

It seems like the error message could be improved here.

> [C++] IPC Stream Reader doesn't check if extra fields are present for 
> RecordBatches
> ---
>
> Key: ARROW-16160
> URL: https://issues.apache.org/jira/browse/ARROW-16160
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 6.0.1
>Reporter: Micah Kornfield
>Priority: Major
>
> I looked through recent commits and I don't think this issue has been patched 
> since:
> {code:title=test.python|borderStyle=solid}
> import pyarrow as pa
> with pa.output_stream("/tmp/f1") as sink:
>   with pa.RecordBatchStreamWriter(sink, rb1.schema) as writer:
> writer.write(rb1)
> end_rb1 = sink.tell()
> with pa.output_stream("/tmp/f2") as sink:
>   with pa.RecordBatchStreamWriter(sink, rb2.schema) as writer:
> writer.write(rb2)
> start_rb2_only = sink.tell()
> writer.write(rb2)
> end_rb2 = sink.tell()
> # Stitch together rb1.schema, rb1 and rb2 without its schema.
> with pa.output_stream("/tmp/f3") as sink:
>   with pa.input_stream("/tmp/f1") as inp:
>  sink.write(inp.read(end_rb1))
>   with pa.input_stream("/tmp/f2") as inp:
> inp.seek(start_rb2_only)
> sink.write(inp.read(end_rb2 - start_rb2_only))
> with pa.ipc.open_stream("/tmp/f3") as sink:
>   print(sink.read_all())
> {code}
> Yields:
> {code}
> {{pyarrow.Table
> c1: int64
> 
> c1: [[1],[1]]
> {code}
> I would expect this to error because the second stitched-in record batch has 
> more fields than necessary, but it appears to load just fine.  
> Is this intended behavior?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16160) [C++] IPC Stream Reader doesn't check if extra fields are present for RecordBatches

2022-04-08 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519818#comment-17519818
 ] 

Micah Kornfield commented on ARROW-16160:
-

The opposite direction returns an error: "ArrowInvalid: Ran out of field 
metadata, likely malformed".  If we agree this case should be an error, it would 
need an error message that indicates too many fields were provided.

I can provide a patch if we think we want to change this behavior.

> [C++] IPC Stream Reader doesn't check if extra fields are present for 
> RecordBatches
> ---
>
> Key: ARROW-16160
> URL: https://issues.apache.org/jira/browse/ARROW-16160
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 6.0.1
>Reporter: Micah Kornfield
>Priority: Major
>
> I looked through recent commits and I don't think this issue has been patched 
> since:
> {code:title=test.python|borderStyle=solid}
> import pyarrow as pa
> with pa.output_stream("/tmp/f1") as sink:
>   with pa.RecordBatchStreamWriter(sink, rb1.schema) as writer:
> writer.write(rb1)
> end_rb1 = sink.tell()
> with pa.output_stream("/tmp/f2") as sink:
>   with pa.RecordBatchStreamWriter(sink, rb2.schema) as writer:
> writer.write(rb2)
> start_rb2_only = sink.tell()
> writer.write(rb2)
> end_rb2 = sink.tell()
> # Stitch together rb1.schema, rb1 and rb2 without its schema.
> with pa.output_stream("/tmp/f3") as sink:
>   with pa.input_stream("/tmp/f1") as inp:
>  sink.write(inp.read(end_rb1))
>   with pa.input_stream("/tmp/f2") as inp:
> inp.seek(start_rb2_only)
> sink.write(inp.read(end_rb2 - start_rb2_only))
> with pa.ipc.open_stream("/tmp/f3") as sink:
>   print(sink.read_all())
> {code}
> Yields:
> {code}
> {{pyarrow.Table
> c1: int64
> 
> c1: [[1],[1]]
> {code}
> I would expect this to error because the second stitched-in record batch has 
> more fields than necessary, but it appears to load just fine.  
> Is this intended behavior?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16160) [C++] IPC Stream Reader doesn't check if extra fields are present for RecordBatches

2022-04-08 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16160:
---

 Summary: [C++] IPC Stream Reader doesn't check if extra fields are 
present for RecordBatches
 Key: ARROW-16160
 URL: https://issues.apache.org/jira/browse/ARROW-16160
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 6.0.1
Reporter: Micah Kornfield


I looked through recent commits and I don't think this issue has been patched 
since:

{code:title=test.python|borderStyle=solid}
import pyarrow as pa

# (rb1 and rb2 were not defined in the original snippet; representative
# batches are assumed here: rb2 carries one extra field compared to rb1.)
rb1 = pa.RecordBatch.from_pydict({"c1": [1]})
rb2 = pa.RecordBatch.from_pydict({"c1": [1], "c2": [2]})

with pa.output_stream("/tmp/f1") as sink:
  with pa.RecordBatchStreamWriter(sink, rb1.schema) as writer:
    writer.write(rb1)
    end_rb1 = sink.tell()

with pa.output_stream("/tmp/f2") as sink:
  with pa.RecordBatchStreamWriter(sink, rb2.schema) as writer:
    writer.write(rb2)
    start_rb2_only = sink.tell()
    writer.write(rb2)
    end_rb2 = sink.tell()

# Stitch together rb1.schema, rb1, and rb2 without its schema.
with pa.output_stream("/tmp/f3") as sink:
  with pa.input_stream("/tmp/f1") as inp:
    sink.write(inp.read(end_rb1))
  with pa.input_stream("/tmp/f2") as inp:
    inp.seek(start_rb2_only)
    sink.write(inp.read(end_rb2 - start_rb2_only))

with pa.ipc.open_stream("/tmp/f3") as sink:
  print(sink.read_all())
{code}
Yields:
{code}
pyarrow.Table
c1: int64

c1: [[1],[1]]
{code}

I would expect this to error because the second stitched-in record batch has 
more fields than necessary, but it appears to load just fine.  

Is this intended behavior?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16147) [C++] ParquetFileWriter doesn't call sink_.Close when using GcsRandomAccessFile

2022-04-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519176#comment-17519176
 ] 

Micah Kornfield commented on ARROW-16147:
-

Thanks for the thorough tests.  It isn't clear to me why we wouldn't close the 
sink here, but I'd be a little concerned about the change in behavior; 
unfortunately this isn't specified in the contract.  I think we should fix the 
GcsOutputFile to close on destruction.
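
From the user's side, one workaround is to close the sink explicitly rather than relying on the writer's Close; a Python-flavoured sketch of that pattern (the bucket, path, and anonymous credentials are placeholders):

{code:python}
import pyarrow as pa
import pyarrow.fs as fs
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3]})
gcs = fs.GcsFileSystem(anonymous=True)  # placeholder configuration

# Closing the output stream explicitly (here via the context manager)
# guarantees the bytes are flushed to storage, independent of whether the
# parquet writer closes its sink.
with gcs.open_output_stream("some-bucket/out.parquet") as sink:
    with pq.ParquetWriter(sink, table.schema) as writer:
        writer.write_table(table)
{code}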



> [C++] ParquetFileWriter doesn't call sink_.Close when using 
> GcsRandomAccessFile
> ---
>
> Key: ARROW-16147
> URL: https://issues.apache.org/jira/browse/ARROW-16147
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Rok Mihevc
>Priority: Major
>  Labels: GCP
>
> On parquet::arrow::FileWriter::Close the underlying sink is not closed. The 
> implementation goes to FileSerializer::Close:
> {code:cpp}
> void Close() override {
> if (is_open_) {
>   // If any functions here raise an exception, we set is_open_ to be false
>   // so that this does not get called again (possibly causing segfault)
>   is_open_ = false;
>   if (row_group_writer_) {
> num_rows_ += row_group_writer_->num_rows();
> row_group_writer_->Close();
>   }
>   row_group_writer_.reset();
>   // Write magic bytes and metadata
>   auto file_encryption_properties = 
> properties_->file_encryption_properties();
>   if (file_encryption_properties == nullptr) {  // Non encrypted file.
> file_metadata_ = metadata_->Finish();
> WriteFileMetaData(*file_metadata_, sink_.get());
>   } else {  // Encrypted file
> CloseEncryptedFile(file_encryption_properties);
>   }
> }
>   }
> {code}
> It doesn't call sink_->Close(), which leads to resource leaking and bugs.
> With files (they have own close() in destructor) it works fine, but doesn't 
> work with fs::GcsRandomAccessFile. When I calling 
> parquet::arrow::FileWriter::Close the data is not flushed to storage, until 
> manual close of a sink stream (or stack space change).
> Is it done by intention or a bug?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16102) [C++] Builds that use cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS support

2022-04-04 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516633#comment-17516633
 ] 

Micah Kornfield commented on ARROW-16102:
-

https://github.com/apache/arrow/runs/5775653588?check_suite_focus=true is an 
example of the failing build.

https://github.com/apache/arrow/pull/12763 is the PR where I ran into this issue 
(I'm switching the relevant builds to keep GCS off for the time being).



> [C++] Builds that use cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS 
> support
> --
>
> Key: ARROW-16102
> URL: https://issues.apache.org/jira/browse/ARROW-16102
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> cpp/cmake_modules/FindgRPCAlt.cmake somehow exposes the same libraries 
> defined in build_absl_once (defined in 
> cpp/cmake_modules/ThirdpartyToolchain.cmake) causing CMake to fail when GCS 
> client is enabled for building.
> I tried playing around with various options but given my limited CMake skills 
> I could not figure out an easy solution to this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-16102) [C++] Builds that use cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS support

2022-04-04 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-16102:

Component/s: C++

> [C++] Builds that use cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS 
> support
> --
>
> Key: ARROW-16102
> URL: https://issues.apache.org/jira/browse/ARROW-16102
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Priority: Major
>
> cpp/cmake_modules/FindgRPCAlt.cmake somehow exposes the same libraries 
> defined in build_absl_once (defined in 
> cpp/cmake_modules/ThirdpartyToolchain.cmake) causing CMake to fail when GCS 
> client is enabled for building.
> I tried playing around with various options but given my limited CMake skills 
> I could not figure out an easy solution to this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-16102) [C++] Builds that use cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS support

2022-04-04 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-16102:

Summary: [C++] Builds that use cpp/cmake_modules/FindgRPCAlt.cmake cannot 
build GCS support  (was: [C++] Builds that us 
cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS support)

> [C++] Builds that use cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS 
> support
> --
>
> Key: ARROW-16102
> URL: https://issues.apache.org/jira/browse/ARROW-16102
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Priority: Major
>
> cpp/cmake_modules/FindgRPCAlt.cmake somehow exposes the same libraries 
> defined in build_absl_once (defined in 
> cpp/cmake_modules/ThirdpartyToolchain.cmake) causing CMake to fail when GCS 
> client is enabled for building.
> I tried playing around with various options but given my limited CMake skills 
> I could not figure out an easy solution to this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16102) [C++] Builds that us cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS support

2022-04-03 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516576#comment-17516576
 ] 

Micah Kornfield commented on ARROW-16102:
-

This affects mingw and os/x CI builds.

> [C++] Builds that us cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS 
> support
> -
>
> Key: ARROW-16102
> URL: https://issues.apache.org/jira/browse/ARROW-16102
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Micah Kornfield
>Priority: Major
>
> cpp/cmake_modules/FindgRPCAlt.cmake somehow exposes the same libraries 
> defined in build_absl_once (defined in 
> cpp/cmake_modules/ThirdpartyToolchain.cmake) causing CMake to fail when GCS 
> client is enabled for building.
> I tried playing around with various options but given my limited CMake skills 
> I could not figure out an easy solution to this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16102) [C++] Builds that us cpp/cmake_modules/FindgRPCAlt.cmake cannot build GCS support

2022-04-03 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16102:
---

 Summary: [C++] Builds that us cpp/cmake_modules/FindgRPCAlt.cmake 
cannot build GCS support
 Key: ARROW-16102
 URL: https://issues.apache.org/jira/browse/ARROW-16102
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Micah Kornfield


cpp/cmake_modules/FindgRPCAlt.cmake somehow exposes the same libraries defined 
in build_absl_once (defined in cpp/cmake_modules/ThirdpartyToolchain.cmake) 
causing CMake to fail when GCS client is enabled for building.

I tried playing around with various options but given my limited CMake skills I 
could not figure out an easy solution to this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-16048) [PyArrow] Null buffers with Pickle protocol.

2022-03-28 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513531#comment-17513531
 ] 

Micah Kornfield edited comment on ARROW-16048 at 3/28/22, 5:47 PM:
---

I don't think this affects faithfulness.  Another option specifically for pickle 
could be to modify the condition in 
[__reduce_ex|https://github.com/apache/arrow/blob/d327f694d7cd6b5919fe1df7d9dea6e7ebef03e1/python/pyarrow/io.pxi#L1119]
 which returns pybytes also when the underlying buffer has size 0?

 


was (Author: emkornfield):
I don't think this affect faithfulness.  Another option specifically for pickle 
could be to modify the condition in 
[__reduce_ex_]([https://github.com/apache/arrow/blob/d327f694d7cd6b5919fe1df7d9dea6e7ebef03e1/python/pyarrow/io.pxi#L1119)]
 which returns pybytes also when the underlying buffer has size 0?

 

> [PyArrow] Null buffers with Pickle protocol.
> 
>
> Key: ARROW-16048
> URL: https://issues.apache.org/jira/browse/ARROW-16048
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Minor
>
> When underlying buffers are null they populate the buffer protocol ".buf" 
> value with a null value.  In some cases this can violate contracts [asserted 
> in 
> cpython|https://github.com/python/cpython/blob/882d8096c262a5945e0cfdd706e5db3ad2b73543/Modules/_pickle.c#L1072].
>   It might be best to always return an empty non-null buffer when the 
> underlying buffer is null.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-16048) [PyArrow] Null buffers with Pickle protocol.

2022-03-28 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513531#comment-17513531
 ] 

Micah Kornfield commented on ARROW-16048:
-

I don't think this affects faithfulness.  Another option specifically for pickle 
could be to modify the condition in 
[__reduce_ex_]([https://github.com/apache/arrow/blob/d327f694d7cd6b5919fe1df7d9dea6e7ebef03e1/python/pyarrow/io.pxi#L1119)]
 which returns pybytes also when the underlying buffer has size 0?

 

> [PyArrow] Null buffers with Pickle protocol.
> 
>
> Key: ARROW-16048
> URL: https://issues.apache.org/jira/browse/ARROW-16048
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Minor
>
> When underlying buffers are null they populate the buffer protocol ".buf" 
> value with a null value.  In some cases this can violate contracts [asserted 
> in 
> cpython|https://github.com/python/cpython/blob/882d8096c262a5945e0cfdd706e5db3ad2b73543/Modules/_pickle.c#L1072].
>   It might be best to always return an empty non-null buffer when the 
> underlying buffer is null.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-16048) [PyArrow] Null buffers with Pickle protocol.

2022-03-28 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-16048:
---

 Summary: [PyArrow] Null buffers with Pickle protocol.
 Key: ARROW-16048
 URL: https://issues.apache.org/jira/browse/ARROW-16048
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Micah Kornfield
Assignee: Micah Kornfield


When underlying buffers are null they populate the buffer protocol ".buf" value 
with a null value.  In some cases this can violate contracts [asserted in 
cpython|https://github.com/python/cpython/blob/882d8096c262a5945e0cfdd706e5db3ad2b73543/Modules/_pickle.c#L1072].
  It might be best to always return an empty non-null buffer when the 
underlying buffer is null.
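
A minimal sketch of the round trip under discussion, using only public pyarrow APIs (the snippet is illustrative and not taken from the ticket itself):

{code:python}
import pickle

import pyarrow as pa

# A zero-length buffer; its underlying data pointer may be null.
buf = pa.py_buffer(b"")

# Buffer pickling goes through __reduce_ex__; the suggestion above is to have this
# path hand CPython an empty, non-null buffer rather than a null one.
restored = pickle.loads(pickle.dumps(buf))
assert restored.size == 0
{code}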



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14892) [Python] Add bindings for GCS filesystem

2022-03-15 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507198#comment-17507198
 ] 

Micah Kornfield commented on ARROW-14892:
-

Yes, spending some time on it today and hopefully have a PR up this week.  I 
might need some help with python packaging.
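
As a hedged sketch of how the eventual bindings are used (GCS filesystem and gs:// URI support landed in a later pyarrow release than this comment, so treat the version requirement as an assumption; the bucket name is made up):

{code:python}
from pyarrow import fs

# FileSystem.from_uri returns a (filesystem, path) pair once gs:// is registered.
gcs, path = fs.FileSystem.from_uri("gs://my-bucket/some/prefix")
infos = gcs.get_file_info(fs.FileSelector(path, recursive=True))
{code}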

> [Python] Add bindings for GCS filesystem
> 
>
> Key: ARROW-14892
> URL: https://issues.apache.org/jira/browse/ARROW-14892
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15855) [Python] Add dictionary_pagesize_limit to Parquet writer

2022-03-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17502449#comment-17502449
 ] 

Micah Kornfield commented on ARROW-15855:
-

This is a duplicate of https://issues.apache.org/jira/browse/ARROW-7174 I think.

> [Python] Add dictionary_pagesize_limit to Parquet writer
> 
>
> Key: ARROW-15855
> URL: https://issues.apache.org/jira/browse/ARROW-15855
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Parquet, Python
>Reporter: Xinyu Zeng
>Assignee: Alenka Frim
>Priority: Major
> Fix For: 8.0.0
>
>
> Although the python Parquet api is a wrapper of C+\+, there are some tuning 
> knobs not included in python. For example, dictionary_pagesize_limit_. The 
> dictionary page size will easily exceed the limit when any or many of the 
> followings happen: 1. The row_group_size is relatively large e.g. the default 
> is 64M. 2. The size per entry is large e.g large string column 3. the 
> repeatability of data is not so high. This may result in the dictionary 
> encoding not being fully utilized if this parameter cannot be tuned. In C+\+, 
> however, this parameter can be tuned to the optimized setting.
>  
> There are also other parameters not exposed in python, for example, 
> max_statistics_size.
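
A short sketch of the knob described above; the dictionary_pagesize_limit keyword was only exposed in pyarrow with the fix targeted at 8.0.0, so its availability is an assumption here (file name is made up):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"s": ["x" * 100] * 10_000})  # toy data: a wide string column
pq.write_table(
    table,
    "example.parquet",
    use_dictionary=True,
    dictionary_pagesize_limit=4 * 1024 * 1024,  # raise the limit so dictionary encoding stays in effect
)
{code}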



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15783) [Python] Converting arrow MonthDayNanoInterval to pandas fails DCHECK

2022-02-24 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-15783:
---

 Summary: [Python] Converting arrow MonthDayNanoInterval to pandas 
fails DCHECK
 Key: ARROW-15783
 URL: https://issues.apache.org/jira/browse/ARROW-15783
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Reporter: Micah Kornfield
Assignee: Micah Kornfield


InitPandasStaticData is only called on python/pandas -> Arrow and not the 
reverse path

 

This causes the DCHECK that ensures the Pandas type is not null to fail if the 
import code is never used.

 

A workaround for users of the library is to call pa.array([1]), which avoids this 
issue.
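
A hedged sketch of that workaround (assumes pandas is installed; the interval value is arbitrary):

{code:python}
import pyarrow as pa

# Touching the Python/pandas -> Arrow path once initializes the pandas static data.
pa.array([1])

arr = pa.array(
    [pa.MonthDayNano([1, 15, 3600 * 10**9])],
    type=pa.month_day_nano_interval(),
)
series = arr.to_pandas()  # without the warm-up call, this path could hit the DCHECK
{code}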



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15728) [Python] Zstd IPC test is flaky.

2022-02-17 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-15728:
---

 Summary: [Python] Zstd IPC test is flaky.
 Key: ARROW-15728
 URL: https://issues.apache.org/jira/browse/ARROW-15728
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Our internal CI system shows flakes on the test at approximately a 2% rate.  By 
reducing the integer range we can make this much less flaky (zero observed 
flakes in 5000 runs).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15727) [Python] Lists of MonthDayNano Interval can't be converted to Pandas

2022-02-17 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-15727:
---

 Summary: [Python] Lists of MonthDayNano Interval can't be 
converted to Pandas
 Key: ARROW-15727
 URL: https://issues.apache.org/jira/browse/ARROW-15727
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Micah Kornfield
Assignee: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15492) [Python] handle timestamp type in parquet file for compatibility with older HiveQL

2022-02-10 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17490695#comment-17490695
 ] 

Micah Kornfield commented on ARROW-15492:
-

OK, so it isn't a bug with a missing logical type.

 

> So I'm not sure whether Arrow should adjust int96 type data stored in parquet 
> file with the local time zone?

The only solution seems to be to pass in an optional target timezone to convert 
to.  There is no guarantee local time aligns with the writer's timezone.   I 
think the C++ library has started vendoring the necessary utilities to do the 
time zone conversions.  

 

An alternative could also be to provide additional metadata that consumers 
could use to determine the source and pad as necessary outside of pyarrow.  
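
A sketch of the "convert after reading" idea, assuming the consumer knows out of band which timezone the INT96 writer used (file and column names here are made up):

{code:python}
import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("written_by_parquet_mr.parquet")
ts = table["event_time"]                                   # naive timestamps read from INT96
ts_cn = pc.assume_timezone(ts, timezone="Asia/Shanghai")   # attach the writer's UTC+8 zone
{code}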

> [Python] handle timestamp type in parquet file for compatibility with older 
> HiveQL
> --
>
> Key: ARROW-15492
> URL: https://issues.apache.org/jira/browse/ARROW-15492
> Project: Apache Arrow
>  Issue Type: New Feature
>Affects Versions: 6.0.1
>Reporter: nero
>Priority: Major
>
> Hi there,
> I face an issue when I write a parquet file by PyArrow.
> In the older version of Hive, it can only recognize the timestamp type stored 
> in INT96, so I use table.write_to_data with `use_deprecated 
> timestamp_int96_timestamps=True` option to save the parquet file. But the 
> HiveQL will skip conversion when the metadata of parquet file is not 
> created_by "parquet-mr".
> [hive/ParquetRecordReaderBase.java at 
> f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive 
> (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139]
>  
> So I have to save the timestamp columns with timezone info(pad to UTC+8).
> But when pyarrow.parquet read from a dir which contains parquets created by 
> both PyArrow and parquet-mr, Arrow.Table will ignore the timezone info for 
> parquet-mr files.
>  
> Maybe PyArrow can expose the created_by option in pyarrow({*}prefer{*}, 
> parquet::WriterProperties::created_by is available in the C++ ).
> Or handle the timestamp type with timezone which files created by parquet-mr?
>  
> Maybe related to https://issues.apache.org/jira/browse/ARROW-14422



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-12509) [C++] More fine-grained control of file creation in filesystem layer

2022-02-08 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489286#comment-17489286
 ] 

Micah Kornfield edited comment on ARROW-12509 at 2/9/22, 6:52 AM:
--

Note about the use-case on the blocking issue.  In the hadoop ecosystem GCS 
tries to emulate renames by doing a copy and delete.  The 
[copy|https://github.com/GoogleCloudDataproc/hadoop-connectors/pull/622/files#diff-122b4a11389e8967ef7004ddf049fd1e7d688cf1c52fc16e61b0af2e60ed120dL1164]
 uses the if-match feature, but I think this type of operation might be expensive.

 

 


was (Author: emkornfield):
Note about the use-case on the blocking issue.  In the hadoop ecosystem GCS 
tries to emulate renames by doing a copy and delete.  The 
(https://github.com/GoogleCloudDataproc/hadoop-connectors/pull/622/files[copy 
use the if match feature]) but I think this type of operation might be 
expensive.

 

 

> [C++] More fine-grained control of file creation in filesystem layer
> 
>
> Key: ARROW-12509
> URL: https://issues.apache.org/jira/browse/ARROW-12509
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> {{FileSystem::OpenOutputStream}} silently truncates an existing file.
> It would be better to give more control to the user. Ideally, one could 
> choose between several options: "always overwrite and fail if doesn't exist", 
> "overwrite if exists, otherwise create", "creates if doesn't exist, otherwise 
> fails".
> One should research whether e.g. S3 supports such control.
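
For context, a minimal sketch of the current behaviour described above, using the Python bindings (the path is made up):

{code:python}
from pyarrow import fs

local = fs.LocalFileSystem()
# Today this silently truncates the file if it already exists; the issue proposes
# finer-grained create/overwrite modes instead.
with local.open_output_stream("existing_file.bin") as out:
    out.write(b"new contents")
{code}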



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12509) [C++] More fine-grained control of file creation in filesystem layer

2022-02-08 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489286#comment-17489286
 ] 

Micah Kornfield commented on ARROW-12509:
-

Note about the use-case on the blocking issue.  In the hadoop ecosystem GCS 
tries to emulate renames by doing a copy and delete.  The 
(https://github.com/GoogleCloudDataproc/hadoop-connectors/pull/622/files[copy 
use the if match feature]) but I think this type of operation might be 
expensive.

 

 

> [C++] More fine-grained control of file creation in filesystem layer
> 
>
> Key: ARROW-12509
> URL: https://issues.apache.org/jira/browse/ARROW-12509
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> {{FileSystem::OpenOutputStream}} silently truncates an existing file.
> It would be better to give more control to the user. Ideally, one could 
> choose between several options: "always overwrite and fail if doesn't exist", 
> "overwrite if exists, otherwise create", "creates if doesn't exist, otherwise 
> fails".
> One should research whether e.g. S3 supports such control.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-14893) [C++] Allow creating GCS filesystem from URI

2022-02-07 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-14893:
---

Assignee: Micah Kornfield

> [C++] Allow creating GCS filesystem from URI
> 
>
> Key: ARROW-14893
> URL: https://issues.apache.org/jira/browse/ARROW-14893
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>
> Similarly to what already exists for S3. See {{FileSystemFromUri}} and 
> {{S3Options::FromUri}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-14892) [Python] Add bindings for GCS filesystem

2022-02-07 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-14892:
---

Assignee: Micah Kornfield

> [Python] Add bindings for GCS filesystem
> 
>
> Key: ARROW-14892
> URL: https://issues.apache.org/jira/browse/ARROW-14892
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15492) [Python] handle timestamp type in parquet file for compatibility with older HiveQL

2022-02-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17488603#comment-17488603
 ] 

Micah Kornfield commented on ARROW-15492:
-

So this looks like an oversight with int96. The logical type with 
isAdjustedToUtc isn't accounted for when making the [arrow type for 
int96|https://github.com/apache/arrow/blob/85f192a45755b3f15653fdc0a8fbd788086e125f/cpp/src/parquet/arrow/schema_internal.cc#L197].
 It is used for [int64|#L197].  [~amznero]  would you be interested in 
contributing a fix for this?

> [Python] handle timestamp type in parquet file for compatibility with older 
> HiveQL
> --
>
> Key: ARROW-15492
> URL: https://issues.apache.org/jira/browse/ARROW-15492
> Project: Apache Arrow
>  Issue Type: New Feature
>Affects Versions: 6.0.1
>Reporter: nero
>Priority: Major
>
> Hi there,
> I face an issue when I write a parquet file by PyArrow.
> In the older version of Hive, it can only recognize the timestamp type stored 
> in INT96, so I use table.write_to_data with `use_deprecated 
> timestamp_int96_timestamps=True` option to save the parquet file. But the 
> HiveQL will skip conversion when the metadata of parquet file is not 
> created_by "parquet-mr".
> [hive/ParquetRecordReaderBase.java at 
> f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive 
> (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139]
>  
> So I have to save the timestamp columns with timezone info(pad to UTC+8).
> But when pyarrow.parquet read from a dir which contains parquets created by 
> both PyArrow and parquet-mr, Arrow.Table will ignore the timezone info for 
> parquet-mr files.
>  
> Maybe PyArrow can expose the created_by option in pyarrow({*}prefer{*}, 
> parquet::WriterProperties::created_by is available in the C++ ).
> Or handle the timestamp type with timezone which files created by parquet-mr?
>  
> Maybe related to https://issues.apache.org/jira/browse/ARROW-14422



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15596) thrift_internal.h assumes shared_ptr type in some cases

2022-02-06 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-15596:
---

 Summary: thrift_internal.h assumes shared_ptr type in some cases
 Key: ARROW-15596
 URL: https://issues.apache.org/jira/browse/ARROW-15596
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Micah Kornfield
Assignee: Micah Kornfield


Thrift can still be built with boost shared_ptrs so we need to be pointer 
agnostic.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (ARROW-15080) [Python] Allow creation of month_day_nano interval from tuple

2022-02-06 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-15080:
---

Assignee: Micah Kornfield

> [Python] Allow creation of month_day_nano interval from tuple
> -
>
> Key: ARROW-15080
> URL: https://issues.apache.org/jira/browse/ARROW-15080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Minor
>
> This should ideally be allowed but isn't:
> {code:python}
> >>> a = pa.array([(3, 20, 100)], type=pa.month_day_nano_interval())
> Traceback (most recent call last):
>   File "", line 1, in 
> a = pa.array([(3, 20, 100)], type=pa.month_day_nano_interval())
>   File "pyarrow/array.pxi", line 315, in pyarrow.lib.array
> return _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
>   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
> chunked = GetResultValue(
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
> raise ArrowTypeError(message)
> ArrowTypeError: No temporal attributes found on object.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15492) [Python] handle timestamp type in parquet file for compatibility with older HiveQL

2022-02-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487634#comment-17487634
 ] 

Micah Kornfield commented on ARROW-15492:
-

On exposing the writer (created_by) field, per the other Jira I don't think we should 
do it.  It makes it much harder to deal with bugs that might occur in a particular 
version of the library.

 
{quote}Or handle the timestamp type with timezone which files created by 
parquet-mr?
{quote}
I'm not familiar with this, could you link the to specification on this or 
provide more details?  It seems like this might be a better approach.

> [Python] handle timestamp type in parquet file for compatibility with older 
> HiveQL
> --
>
> Key: ARROW-15492
> URL: https://issues.apache.org/jira/browse/ARROW-15492
> Project: Apache Arrow
>  Issue Type: New Feature
>Affects Versions: 6.0.1
>Reporter: nero
>Priority: Major
>
> Hi there,
> I face an issue when I write a parquet file by PyArrow.
> In the older version of Hive, it can only recognize the timestamp type stored 
> in INT96, so I use table.write_to_data with `use_deprecated 
> timestamp_int96_timestamps=True` option to save the parquet file. But the 
> HiveQL will skip conversion when the metadata of parquet file is not 
> created_by "parquet-mr".
> [hive/ParquetRecordReaderBase.java at 
> f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive 
> (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139]
>  
> So I have to save the timestamp columns with timezone info(pad to UTC+8).
> But when pyarrow.parquet read from a dir which contains parquets created by 
> both PyArrow and parquet-mr, Arrow.Table will ignore the timezone info for 
> parquet-mr files.
>  
> Maybe PyArrow can expose the created_by option in pyarrow({*}prefer{*}, 
> parquet::WriterProperties::created_by is available in the C++ ).
> Or handle the timestamp type with timezone which files created by parquet-mr?
>  
> Maybe related to https://issues.apache.org/jira/browse/ARROW-14422



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-9311) [JS] Use feature enum in javascript

2022-02-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487632#comment-17487632
 ] 

Micah Kornfield edited comment on ARROW-9311 at 2/6/22, 6:52 AM:
-

There was a [Feature 
enum|[https://github.com/apache/arrow/blob/master/format/Schema.fbs#L67|https://github.com/apache/arrow/blob/master/format/Schema.fbs#L67]
 added to schema.fbs in the hope of guarding forwards compatible changes.  This 
Jira was to make use of this in Javascript (I don't think any implementation 
has done any work here yet).


was (Author: emkornfield):
There was a [Feature 
enum]([https://github.com/apache/arrow/blob/master/format/Schema.fbs#L67)] 
added to schema.fbs in the hope of guarding forwards compatible changes.  This 
Jira was to make use of this in Javascript (I don't think any implementation 
has done any work here yet).

> [JS] Use feature enum in javascript
> ---
>
> Key: ARROW-9311
> URL: https://issues.apache.org/jira/browse/ARROW-9311
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: JavaScript
>Reporter: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-9311) [JS] Use feature enum in javascript

2022-02-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487632#comment-17487632
 ] 

Micah Kornfield edited comment on ARROW-9311 at 2/6/22, 6:52 AM:
-

There was a [Feature 
enum|https://github.com/apache/arrow/blob/master/format/Schema.fbs#L67] added 
to schema.fbs in the hope of guarding forwards compatible changes.  This Jira 
was to make use of this in Javascript (I don't think any implementation has 
done any work here yet).


was (Author: emkornfield):
There was a [Feature 
enum|[https://github.com/apache/arrow/blob/master/format/Schema.fbs#L67|https://github.com/apache/arrow/blob/master/format/Schema.fbs#L67]
 added to schema.fbs in the hope of guarding forwards compatible changes.  This 
Jira was to make use of this in Javascript (I don't think any implementation 
has done any work here yet).

> [JS] Use feature enum in javascript
> ---
>
> Key: ARROW-9311
> URL: https://issues.apache.org/jira/browse/ARROW-9311
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: JavaScript
>Reporter: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-9311) [JS] Use feature enum in javascript

2022-02-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487632#comment-17487632
 ] 

Micah Kornfield commented on ARROW-9311:


There was a [Feature 
enum]([https://github.com/apache/arrow/blob/master/format/Schema.fbs#L67)] 
added to schema.fbs in the hope of guarding forwards compatible changes.  This 
Jira was to make use of this in Javascript (I don't think any implementation 
has done any work here yet).

> [JS] Use feature enum in javascript
> ---
>
> Key: ARROW-9311
> URL: https://issues.apache.org/jira/browse/ARROW-9311
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: JavaScript
>Reporter: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15548) [C++][Parquet] Field-level metadata are not supported? (ColumnMetadata.key_value_metadata)

2022-02-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487624#comment-17487624
 ] 

Micah Kornfield commented on ARROW-15548:
-

I don't think it is a bad idea to have something to control placing metadata 
for each column on a per row-group basis. Based on the discussion on this 
thread, it seems like this should be opt-in though.

> [C++][Parquet] Field-level metadata are not supported? 
> (ColumnMetadata.key_value_metadata)
> --
>
> Key: ARROW-15548
> URL: https://issues.apache.org/jira/browse/ARROW-15548
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> Due to an application where we are considering to use field-level metadata 
> (so not schema-level metadata), but also want to be able to save this data to 
> Parquet, I was looking into "field-level metadata" for Parquet, which I 
> assumed we supported this. 
> We can roundtrip Arrow's field-level metadata to/from Parquet, as shown with 
> this example:
> {code:python}
> schema = pa.schema([pa.field("column_name", pa.int64(), metadata={"key": 
> "value"})])
> table = pa.table({'column_name': [0, 1, 2]}, schema=schema)
> pq.write_table(table, "test_field_metadata.parquet")
> >>> pq.read_table("test_field_metadata.parquet").schema
> column_name: int64
>   -- field metadata --
>   key: 'value'
> {code}
> However, the reason this is restored is actually because of this being stored 
> in the Arrow schema that we (by default) store in the {{ARROW:schema}} 
> metadata in the Parquet FileMetaData.key_value_metadata.
> With a small patched version to be able to turn this off (currently this is 
> harcoded to be turned on in the python bindings), it is clear this 
> field-level metadata is not restored on roundtrip without this stored arrow 
> schema:
> {code:python}
> pq.write_table(table, "test_field_metadata_without_schema.parquet", 
> store_arrow_schema=False)
> >>> pq.read_table("test_field_metadata_without_schema.parquet").schema
> column_name: int64
> {code}
> So there is currently no mapping from Arrow's field level metadata to 
> Parquet's column-level metadata ({{ColumnMetaData.key_value_metadata}} in 
> Parquet's thrift structures). 
> (which also means that using field-level metadata roundtripping to parquet 
> only works as long as you are using Arrow for writing/reading, but not if you 
> want to be able to also exchange such data with non-Arrow Parquet 
> implementations)
> In addition, it also seems we don't even expose this field in our C++ or 
> Python bindings, to just access that data if you would have a Parquet file 
> (written by another implementation) that has key_value_metadata in the 
> ColumnMetaData.
> cc [~emkornfield] 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12203) [C++][Python] Switch default Parquet version to 2.4

2022-02-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17487623#comment-17487623
 ] 

Micah Kornfield commented on ARROW-12203:
-

8.0 release is targeted in the April time frame?

> [C++][Python] Switch default Parquet version to 2.4
> ---
>
> Key: ARROW-12203
> URL: https://issues.apache.org/jira/browse/ARROW-12203
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 8.0.0
>
>
> Currently, Parquet write APIs default to maximum-compatibility Parquet 
> version "1.0", which disables some logical types such as UINT32. We may want 
> to switch the default to "2.0" instead, to allow faithful representation of 
> more types.
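
A small sketch of opting in to the newer format version while "1.0" remains the default (assumes a pyarrow recent enough to accept the "2.4" version string):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

t = pa.table({"u": pa.array([1, 2, 3], type=pa.uint32())})
# With the "1.0" default the uint32 column gets widened; "2.4" keeps the UINT_32 logical type.
pq.write_table(t, "uint32.parquet", version="2.4")
{code}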



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15511) [Python] GIL not held for Ndarray1DIndexer on

2022-01-31 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-15511:
---

 Summary: [Python] GIL not held for Ndarray1DIndexer on
 Key: ARROW-15511
 URL: https://issues.apache.org/jira/browse/ARROW-15511
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 6.0.1
Reporter: Micah Kornfield


[In _ndarray_to_array the call to 
NdarrayToArrow|https://github.com/apache/arrow/blob/658bec37aa5cbdd53b5e4cdc81b8ba3962e67f11/python/pyarrow/array.pxi#L82]
 is explicitly excluded from the GIL.  In some code paths Ndarray1DIndexer is 
instantiated, which will try to do Py_INCREF and Py_DECREF on initialization and 
destruction.  These code paths do not appear to acquire the GIL.

 

I'm not sure what the best fix is:
 # Acquire GIL as part of Ndarray1DIndexer construction.
 # Eliminate the nogil block in _ndarray_to_array
 # Eliminate the incref and decref calls in Ndarray1DIndexer
 # Something else?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-5569) [C++] import avro C++ code to code base.

2021-12-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-5569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459655#comment-17459655
 ] 

Micah Kornfield commented on ARROW-5569:


[~willjones127] agreed.  If you want to pursue support for iceberg, I'd approve 
and try to help out with reviews.  At the moment it isn't clear when I will get 
to this.

> [C++] import avro C++ code to code base.
> 
>
> Key: ARROW-5569
> URL: https://issues.apache.org/jira/browse/ARROW-5569
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> The goal here is to take code as is without compiling it, but flattening it 
> to conform with Arrow's code base standards.  This will give a basis for 
> future PR.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12203) [C++][Python] Switch default Parquet version to 2.4

2021-12-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459648#comment-17459648
 ] 

Micah Kornfield commented on ARROW-12203:
-

Unfortunately not yet; I think if we could wait until after the 7.0 release, that 
should finally be sufficient.

> [C++][Python] Switch default Parquet version to 2.4
> ---
>
> Key: ARROW-12203
> URL: https://issues.apache.org/jira/browse/ARROW-12203
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 7.0.0
>
>
> Currently, Parquet write APIs default to maximum-compatibility Parquet 
> version "1.0", which disables some logical types such as UINT32. We may want 
> to switch the default to "2.0" instead, to allow faithful representation of 
> more types.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15080) [Python] Allow creation of month_day_nano interval from tuple

2021-12-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459646#comment-17459646
 ] 

Micah Kornfield commented on ARROW-15080:
-

I wasn't sure if we wanted this; if you think it is a good idea I can add it.
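
For reference, a sketch contrasting what worked at the time with the tuple form requested in the issue:

{code:python}
import pyarrow as pa

# Works: the dedicated MonthDayNano value type.
ok = pa.array([pa.MonthDayNano([3, 20, 100])], type=pa.month_day_nano_interval())

# Requested in the issue but rejected at the time of this comment:
# pa.array([(3, 20, 100)], type=pa.month_day_nano_interval())  # raised ArrowTypeError
{code}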

> [Python] Allow creation of month_day_nano interval from tuple
> -
>
> Key: ARROW-15080
> URL: https://issues.apache.org/jira/browse/ARROW-15080
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Priority: Minor
>
> This should ideally be allowed but isn't:
> {code:python}
> >>> a = pa.array([(3, 20, 100)], type=pa.month_day_nano_interval())
> Traceback (most recent call last):
>   File "", line 1, in 
> a = pa.array([(3, 20, 100)], type=pa.month_day_nano_interval())
>   File "pyarrow/array.pxi", line 315, in pyarrow.lib.array
> return _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
>   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
> chunked = GetResultValue(
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
> return check_status(status)
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
> raise ArrowTypeError(message)
> ArrowTypeError: No temporal attributes found on object.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12706) [Python] Drop python 3.6 and numpy 1.16 support

2021-12-14 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17459645#comment-17459645
 ] 

Micah Kornfield commented on ARROW-12706:
-

My understanding is that most Google projects have committed to following the 
end-of-life policy set upstream by the Python maintainers.  IIUC, I think this means 
we can drop Python 3.6 for the next release if this is accurate: 
[https://endoflife.date/python]

 

 

> [Python] Drop python 3.6 and numpy 1.16 support
> ---
>
> Key: ARROW-12706
> URL: https://issues.apache.org/jira/browse/ARROW-12706
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Adam Lippai
>Priority: Major
> Fix For: 7.0.0
>
>
> If we are following [NEP 
> 29|https://numpy.org/neps/nep-0029-deprecation_policy.html] we can safely 
> drop python 3.6 (released in 2016) and numpy 1.16 support (released in 2019), 
> they got unsupported in January by numpy.
> Python 3.6 will reach end of life in 6-7 months anyways, so it's a good 
> target for removal in pyarrow 5.0.0 or 6.0.0. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark

2021-12-13 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458302#comment-17458302
 ] 

Micah Kornfield commented on ARROW-15073:
-

If LZ4 gets translated to LZ4_RAW depending on the version of parquet-mr, I 
wonder if null could indicate an unknown value for that field.

> [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable 
> by (py)spark
> --
>
> Key: ARROW-15073
> URL: https://issues.apache.org/jira/browse/ARROW-15073
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Jorge Leitão
>Priority: Major
>
> The following snipped shows the issue
> {code:java}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
> path = "bla.parquet"
> t = pa.table(
> [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
> schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
> t,
> path,
> use_dictionary=False,
> compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, 
> encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, 
> total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, 
> statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 
> 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 
> 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, 
> encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of 
> https://github.com/pola-rs/polars/issues/2018
> pyarrow reads the file correctly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark

2021-12-13 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17458302#comment-17458302
 ] 

Micah Kornfield edited comment on ARROW-15073 at 12/13/21, 10:56 AM:
-

If LZ4 gets translated to LZ4_RAW depending on the version of parquet-mr, I 
wonder if null could indicate an unknown value for that field.  
[~jorgecarleitao] is it actually the same error message with ZSTD in Spark? The 
code above only shows LZ4.  From the original polars ticket it seems ZSTD parquet 
was not an issue.


was (Author: emkornfield):
If LZ4 gets translated to LZ4_RAW depending on the version of parquet-mr, I 
wonder if null could indicate an unknown value for that field.

> [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable 
> by (py)spark
> --
>
> Key: ARROW-15073
> URL: https://issues.apache.org/jira/browse/ARROW-15073
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Jorge Leitão
>Priority: Major
>
> The following snipped shows the issue
> {code:java}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
> path = "bla.parquet"
> t = pa.table(
> [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
> schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
> t,
> path,
> use_dictionary=False,
> compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, 
> encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, 
> total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, 
> statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 
> 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 
> 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, 
> encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of 
> https://github.com/pola-rs/polars/issues/2018
> pyarrow reads the file correctly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark

2021-12-11 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457700#comment-17457700
 ] 

Micah Kornfield edited comment on ARROW-15073 at 12/11/21, 6:07 PM:


This is expected; LZ4 has always had compatibility issues, which is why LZ4_RAW 
was introduced in https://issues.apache.org/jira/browse/PARQUET-1998.  I'm trying 
to find the Jira for the Java implementation of LZ4_RAW but failing; to my 
knowledge it hasn't been done yet.  
[https://github.com/apache/parquet-format/pull/168/files] has a description.

 

ZSTD is news to me, but haven't been tracking it carefully, I thought it was 
tested at some point but we might need to have a conversation on that one as 
well.

 

 


was (Author: emkornfield):
This is expected LZ4 has always had compatibility issues which is why lz4_raw 
was introduced in https://issues.apache.org/jira/browse/PARQUET-1998  I'm 
trying to find the Jira for it but failing for the Java implementation of 
LZ4_RAW in java but to my knowledge it hasn't been done yet.

 

ZSTD is news to me, but haven't been tracking it carefully, I thought it was 
tested at some point but we might need to have a conversation on that one as 
well.

 

 

> [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable 
> by (py)spark
> --
>
> Key: ARROW-15073
> URL: https://issues.apache.org/jira/browse/ARROW-15073
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Jorge Leitão
>Priority: Major
>
> The following snipped shows the issue
> {code:java}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
> path = "bla.parquet"
> t = pa.table(
> [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
> schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
> t,
> path,
> use_dictionary=False,
> compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, 
> encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, 
> total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, 
> statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 
> 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 
> 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, 
> encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of 
> https://github.com/pola-rs/polars/issues/2018
> pyarrow reads the file correctly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15073) [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable by (py)spark

2021-12-11 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457700#comment-17457700
 ] 

Micah Kornfield commented on ARROW-15073:
-

This is expected; LZ4 has always had compatibility issues, which is why LZ4_RAW 
was introduced in https://issues.apache.org/jira/browse/PARQUET-1998.  I'm trying 
to find the Jira for the Java implementation of LZ4_RAW but failing; to my 
knowledge it hasn't been done yet.

 

ZSTD is news to me, but haven't been tracking it carefully, I thought it was 
tested at some point but we might need to have a conversation on that one as 
well.

 

 

> [C++][Parquet][Python] LZ4- and zstd- compressed parquet files are unreadable 
> by (py)spark
> --
>
> Key: ARROW-15073
> URL: https://issues.apache.org/jira/browse/ARROW-15073
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet
>Reporter: Jorge Leitão
>Priority: Major
>
> The following snipped shows the issue
> {code:java}
> import pyarrow as pa  # pyarrow==6.0.1
> import pyarrow.parquet
> import pyspark.sql  # pyspark==3.1.2
> path = "bla.parquet"
> t = pa.table(
> [pa.array([0, 1, None, 3, None, 5, 6, 7, None, 9])],
> schema=pa.schema([pa.field("int64", pa.int64(), nullable=True)]),
> )
> pyarrow.parquet.write_table(
> t,
> path,
> use_dictionary=False,
> compression="LZ4",
> )
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> result = spark.read.parquet(path).select("int64").collect()
> {code}
> This fails with:
> {code:java}
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'codec' was not present! Struct: ColumnMetaData(type:INT64, 
> encodings:[PLAIN, RLE], path_in_schema:[int64], codec:null, num_values:10, 
> total_uncompressed_size:142, total_compressed_size:104, data_page_offset:4, 
> statistics:Statistics(max:09 00 00 00 00 00 00 00, min:00 00 00 00 00 00 00 
> 00, null_count:0, max_value:09 00 00 00 00 00 00 00, min_value:00 00 00 00 00 
> 00 00 00), encoding_stats:[PageEncodingStats(page_type:DATA_PAGE, 
> encoding:PLAIN, count:1)])
> {code}
> Found while debugging the root cause of 
> https://github.com/pola-rs/polars/issues/2018
> pyarrow reads the file correctly.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14960) [C++] Google style guide allows mutable references now, what do?

2021-12-06 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17454381#comment-17454381
 ] 

Micah Kornfield commented on ARROW-14960:
-

I think the size of Google's code base meant that there were "a lot" of 
optional parameters that got passed by pointer and this caused real production 
bugs.  I think the incidence in Arrow has been a lot lower (as I think most of 
the output parameters are not optional?).

 

The change inside of Google was made a little while ago (many months, maybe 
even 1+ years), and I'm still not fully accustomed to it.  I'm happy to 
document this as a deviation from the Google style guide.  I don't think this 
particular change is a good use of people's time to try to change/update 
existing code.  As others have noted I think the old rule based on readability 
at call-sites still has value.

 

I'd be fine adopting it as a standard for new APIs but don't have a strong 
preference to do so.

> [C++] Google style guide allows mutable references now, what do?
> 
>
> Key: ARROW-14960
> URL: https://issues.apache.org/jira/browse/ARROW-14960
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Ben Kietzman
>Priority: Minor
>
> As of 
> https://github.com/google/styleguide/commit/7a7a2f510efe7d7fc5ea8fbed549ddb31fac8f3e
>  the Google Style Guide no longer forbids use of mutable references for 
> output arguments, and actually encourages using them when the output argument 
> is not optional.
> This puts arrow c++ style out of sync since we've continued to police toward 
> usage of pointers for output arguments. We could:
> - keep the ban and note this as a deviation from google style in 
> [development.rst|https://github.com/bkietz/arrow/blob/392af8aa999f940ab8fd61684820b2c6d89f7871/docs/source/developers/cpp/development.rst#L74-L75]
> - open JIRA(s) for deprecating/replacing pointer-output APIs where applicable



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-8214) [C++] Flatbuffers based serialization protocol for Expressions

2021-11-26 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449665#comment-17449665
 ] 

Micah Kornfield commented on ARROW-8214:


I think the IR model seems like the right way to go with this.

> [C++] Flatbuffers based serialization protocol for Expressions
> --
>
> Key: ARROW-8214
> URL: https://issues.apache.org/jira/browse/ARROW-8214
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: dataset
>
> It might provide a more scalable solution for serialization.
> cc [~bkietz] [~fsaintjacques]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-11829) [C++] Update developer style guide on usage of shared_ptr

2021-11-26 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449664#comment-17449664
 ] 

Micah Kornfield commented on ARROW-11829:
-

Yes, that is the thread.  Sorry, hope to make some time for this soon.

> [C++] Update developer style guide on usage of shared_ptr
> -
>
> Key: ARROW-11829
> URL: https://issues.apache.org/jira/browse/ARROW-11829
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14770) Direct (individualized) access to definition levels, repetition levels, and numeric data of a column

2021-11-22 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447632#comment-17447632
 ] 

Micah Kornfield commented on ARROW-14770:
-

FWIW writing V2 data pages isn't production ready in arrow.  There is at least 
one open bug for incorrect statistics and we don't align pages to row 
boundaries which I believe is a requirement for V2.  My understanding is that 
V2 is not widely used in general, and we certainly haven't put a lot of effort 
into optimizing the read paths either.

 

In regards to addressing the specific issue, would a higher level API that 
returned list lengths be more appropriate? 

I think exposing the "values" column as a raw buffer is not something I would 
really like to support, because while it is easy to get to a representation 
that users would agree with for numeric types, it is a little bit less 
straightforward for string/byte types.   For processing only the repetition 
levels and definition levels, it would take some refactoring to isolate these 
components, but there still might be a performance win if we decode and ignore 
the values buffer (which would in turn allow the use of existing parquet C++ 
APIs).   

 

[~jpivarski] is this something you would like to contribute? I can give you 
some code pointers. 
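
A sketch of the workaround the reporter describes below (reading one column and recovering list lengths from the Arrow array); the file and column names are made up and only public APIs are assumed:

{code:python}
import pyarrow.compute as pc
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
hits = pf.read_row_group(0, columns=["hits"])["hits"]   # a list<...> column
lengths = pc.list_value_length(hits)                    # one length per outer list
{code}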

> Direct (individualized) access to definition levels, repetition levels, and 
> numeric data of a column
> 
>
> Key: ARROW-14770
> URL: https://issues.apache.org/jira/browse/ARROW-14770
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Parquet, Python
>Reporter: Jim Pivarski
>Priority: Minor
>
> It would be useful to have more low-level access to the three components of a 
> Parquet column in Python: the definition levels, the repetition levels, and 
> the numeric data, {_}individually{_}.
> The particular use-case we have in Awkward Array is that users will sometimes 
> lazily read an array of lists of structs without reading any of the fields of 
> those structs. To build the data structure, we need the lengths of the lists 
> independently of the columns (which users can then use in functions like 
> {{{}ak.num{}}}; the number of structs without their field values is useful 
> information).
> What we're doing right now is reading a column, converting it to Arrow 
> ({{{}pa.Array{}}}), and getting the list lengths from that Arrow array. We 
> have been using the schema to try to pick the smallest column (booleans are 
> best!), but that's because we really just want the definition and repetition 
> levels without the numeric data.
> I've heard that the Parquet metadata includes offsets to select just the 
> definition levels, just the repetition levels, or just the numeric data 
> (pre-decompression?). Exposing those in Python as {{pa.Buffer}} objects would 
> be ideal.
> Beyond our use case, such a feature could also help with wide structs in 
> lists: all of the non-nullable fields of the struct would share the same 
> definition and repetition levels, so they don't need to be re-read. For that 
> use-case, the ability to pick out definition, repetition, and numeric data 
> separately would still be useful, but the purpose would be to read the 
> numeric data without the structural integers (opposite of ours).
> The desired interface would be like {{{}ParquetFile.read_row_group{}}}, but 
> would return one, two, or three {{pa.Buffer}} objects depending on three 
> boolean arguments, {{{}definition{}}}, {{{}repetition{}}}, and 
> {{{}numeric{}}}. The {{pa.Buffer}} would be unpacked, with all run-length 
> encodings and fixed-width encodings converted into integers of at least one 
> byte each. It may make more sense for the output to be {{{}np.ndarray{}}}, to 
> carry {{dtype}} information if that can depend on the maximum level (though 
> levels larger than 255 are likely rare!). This information must be available 
> at some level in Arrow's C++ code; the request is to expose it to Python.
> I've labeled this minor because it is for optimizations, but it would be 
> really nice to have!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-11901) [Java] Investigate potential performance improvement of compression codec

2021-11-17 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445041#comment-17445041
 ] 

Micah Kornfield commented on ARROW-11901:
-

{quote}It's not about eliminating anything, it's about developing the existing 
Java API, such as this very specific use case for compression codecs. 
[~benjamin.wilh...@knime.com] was able to wrap LZ4 using JavaCPP, all by 
himself! it's a lot easier to do than code everything manually with JNI:
[https://github.com/bytedeco/javacpp-presets/pull/1094]
{quote}
I think there is some miscommunication on what I thought were 2 separate issues: 
how to implement an efficient LZ4 decoder, and whether to base the Java API as a 
wrapper on the C++ API.  The second would essentially need a heavy rewrite of the 
Java API, as it is fundamentally different from the design of the C++ API.  I think 
there could be some interest from consumers of Arrow in an API that more accurately 
mimics the C++ version, but again that is a different thread.  It could be that for 
some of the more complex bindings (DataSets), JavaCPP might be a better choice than 
hand-coded JNI.

 
{quote}[~emkornfield], since the C++ builds of Arrow already include LZ4, it is 
indeed pretty trivial to expose a few JNI methods to access it. 
{quote}
I was not referring to binding to the C++ implementation here but directly to 
the LZ4 library.  It looks like JavaCPP makes this efficient from a developer 
perspective.  But the 
[API|https://github.com/bytedeco/javacpp-presets/pull/1094/files#diff-3d9af736e997982d68098d986670f05ff40ae0cc62773a1dd0eb418e55990317R38]
 isn't quite what I imagined, it looks like it goes through ByteBuffer, when 
all we really need is something like [ZSTD 
API|https://github.com/luben/zstd-jni/blob/master/src/main/java/com/github/luben/zstd/Zstd.java#L454].
  For such a minimal API I'm ambivalent on taking on a new dependency here.

 
{quote}If you have some ideas as to why most engineers are OK using Cython in 
the case of Python, but not the equivalent in the case of Java, I would be very 
much interested in hearing your opinions.
{quote}
I'm not an expert but a few thoughts:
 # Cython is more than just a C++ wrapper.  It speeds up Python even if you 
never want to write native code, by effectively allowing one to write C code as 
Python.  In Java, at least in theory, the JIT can do some heavy lifting here.
 # The Python GIL is a pain point that Java doesn't have and Cython + Native 
code can effectively work around it.
 # There has always been a tight relationship between Python and native code, 
whereas JNI is much more esoteric and can cause unexpected deployment issues 
(e.g. correctly pointing the JVM to .so files, correctly integrating with the 
JVM's memory capacity features, etc.). 
 # Cython was also a pretty easy way to get compatibility between python 2.x 
and python 3.x

Sometimes there is a watershed moment; more mature projects can be reluctant to 
try new technologies unless they are proven elsewhere and solve a 
significant pain-point.  
{quote}We could do the same for Arrow!
{quote}
The dev@ mailing list is the place to discuss this.  I tried searching and 
couldn't find any previous discussions on the topic there.

> [Java] Investigate potential performance improvement of compression codec
> -
>
> Key: ARROW-11901
> URL: https://issues.apache.org/jira/browse/ARROW-11901
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Benjamin Wilhelm
>Priority: Major
>
> In response to the discussion in 
> https://github.com/apache/arrow/pull/8949/files#r588046787
> There are some performance penalties in the implementation of the compression 
> codecs (e.g. data copying between heap/off-heap data). We need to revise the 
> code to improve the performance. 
> We should also provide some benchmarks to validate that the performance 
> actually improves. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (ARROW-14601) [Java] Error comments for Minor Type of TIMESTAMPSEC

2021-11-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-14601.
-
Resolution: Fixed

Issue resolved by pull request 11618
[https://github.com/apache/arrow/pull/11618]

> [Java] Error comments for Minor Type of TIMESTAMPSEC
> 
>
> Key: ARROW-14601
> URL: https://issues.apache.org/jira/browse/ARROW-14601
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 6.0.0
>Reporter: Kun Liu
>Assignee: Kun Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> For the MinorType of TIMESTAMPSEC, the value should be the time in seconds 
> from 
> `
> the Unix epoch, 00:00:00 on 1 Jan 1970 UTC
> `
> But the comment is 
> ```
> 00:00:00.00
> ``` which is what is present for the type TIMESTAMPMICRO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-14601) [Java] Error comments for Minor Type of TIMESTAMPSEC

2021-11-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-14601:
---

Assignee: Kun Liu

> [Java] Error comments for Minor Type of TIMESTAMPSEC
> 
>
> Key: ARROW-14601
> URL: https://issues.apache.org/jira/browse/ARROW-14601
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 6.0.0
>Reporter: Kun Liu
>Assignee: Kun Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> For the MinorType of TIMESTAMPSEC, the value should be the time in seconds 
> from 
> `
> the Unix epoch, 00:00:00 on 1 Jan 1970 UTC
> `
> But the comment is 
> ```
> 00:00:00.00
> ``` which is what is present for the type TIMESTAMPMICRO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14601) [Java] Error comments for Minor Type of TIMESTAMPSEC

2021-11-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-14601:

Affects Version/s: 6.0.0

> [Java] Error comments for Minor Type of TIMESTAMPSEC
> 
>
> Key: ARROW-14601
> URL: https://issues.apache.org/jira/browse/ARROW-14601
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 6.0.0
>Reporter: Kun Liu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> For the MinorType of TIMESTAMPSEC, the value should be the time in seconds 
> from 
> `
> the Unix epoch, 00:00:00 on 1 Jan 1970 UTC
> `
> But the comment is 
> ```
> 00:00:00.00
> ``` which is what is present for the type TIMESTAMPMICRO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14601) [Java] Error comments for Minor Type of TIMESTAMPSEC

2021-11-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-14601:

Component/s: Java

> [Java] Error comments for Minor Type of TIMESTAMPSEC
> 
>
> Key: ARROW-14601
> URL: https://issues.apache.org/jira/browse/ARROW-14601
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Kun Liu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> For the MinorType of TIMESTAMPSEC, the value should be the time in seconds 
> from 
> `
> the Unix epoch, 00:00:00 on 1 Jan 1970 UTC
> `
> But the comment is 
> ```
> 00:00:00.00
> ``` which is what is present for the type TIMESTAMPMICRO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14601) [Java] Error comments for Minor Type of TIMESTAMPSEC

2021-11-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-14601:

Fix Version/s: 7.0.0

> [Java] Error comments for Minor Type of TIMESTAMPSEC
> 
>
> Key: ARROW-14601
> URL: https://issues.apache.org/jira/browse/ARROW-14601
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 6.0.0
>Reporter: Kun Liu
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> For the MinorType of TIMESTAMPSEC, the value should be the time in seconds 
> from 
> `
> the Unix epoch, 00:00:00 on 1 Jan 1970 UTC
> `
> But the comment is 
> ```
> 00:00:00.00
> ``` which is what is present for the type TIMESTAMPMICRO



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12970) [Python] Efficient "row accessor" for a pyarrow RecordBatch / Table

2021-11-05 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-12970:
---

Assignee: Micah Kornfield

> [Python] Efficient "row accessor" for a pyarrow RecordBatch / Table
> ---
>
> Key: ARROW-12970
> URL: https://issues.apache.org/jira/browse/ARROW-12970
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Luke Higgins
>Assignee: Micah Kornfield
>Priority: Minor
> Fix For: 7.0.0
>
>
> It would be nice to have a nice row accessor for a Table akin to 
> pandas.DataFrame.itertuples.
> I have a lot of code where I am converting a parquet file to pandas just to 
> have access to the rows through iterating with itertuples.  Having this 
> ability in pyarrow natively would be a nice feature and would avoid memory 
> copy in the pandas conversion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11901) [Java] Investigate potential performance improvement of compression codec

2021-11-05 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17439561#comment-17439561
 ] 

Micah Kornfield commented on ARROW-11901:
-

{quote} As Samuel pointed out, it might be a valid idea to base the Java API on 
JavaCPP, but this is not the right place for this discussion (a thread in the 
mailing list?).
{quote}
This would be a dev@ mailing list discussion.  I don't think we would 
eliminate the existing API, but there might be some interest in alternative Java 
APIs.

 
{quote}Seeing where the JavaCPP is used I think it is a viable project. I could 
contribute my {{CompressionCodec}} implementation to Arrow if this is desired. 
Creating JNI bindings for LZ4 in the Arrow repository would take more time and 
I won't be able to do this soon.
{quote}
[~benjamin.wilh...@knime.com] Do you have pointers?  I looked, maybe too quickly, 
and didn't see it used in other Apache projects, for instance.  If you have 
something that works for your use-case that is great, and if you want to 
open-source it, also great, but it might need to live in a KNIME-hosted project 
for the time being.  I believe Arrow is now building JNI bindings for all major 
platforms, so the release story is a little bit better for JNI code hosted by 
Arrow; I'll see how hard it would be to make the bindings at this point.

> [Java] Investigate potential performance improvement of compression codec
> -
>
> Key: ARROW-11901
> URL: https://issues.apache.org/jira/browse/ARROW-11901
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Benjamin Wilhelm
>Priority: Major
>
> In response to the discussion in 
> https://github.com/apache/arrow/pull/8949/files#r588046787
> There are some performance penalties in the implementation of the compression 
> codecs (e.g. data copying between heap/off-heap data). We need to revise the 
> code to improve the performance. 
> We should also provide some benchmarks to validate that the performance 
> actually improves. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-14547) Reading FixedSizeListArray from Parquet with nulls

2021-11-02 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-14547.
-
Resolution: Fixed

> Reading FixedSizeListArray from Parquet with nulls
> --
>
> Key: ARROW-14547
> URL: https://issues.apache.org/jira/browse/ARROW-14547
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 6.0.0
>Reporter: Jim Pivarski
>Priority: Major
>
> This one is easy to describe: given an array of fixed-sized lists, in which 
> some are null,
> {code:python}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.parquet
> >>> a = pa.FixedSizeListArray.from_arrays(np.arange(10), 5).take([0, None])
> >>> a
> 
> [
>   [
> 0,
> 1,
> 2,
> 3,
> 4
>   ],
>   null
> ]
> {code}
> you can write them to a Parquet file, but not read them back:
> {code:python}
> >>> pa.parquet.write_table(pa.table({"": a}), "tmp.parquet")
> >>> pa.parquet.read_table("tmp.parquet")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
> line 1941, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
> line 1776, in read
> table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Expected all lists to be of size=5 but index 2 had 
> size=0
> {code}
> It could be that, at some level, the second list is considered to be empty.
> For completeness, this doesn't happen if the fixed-sized lists have no nulls:
> {code:python}
> >>> b = pa.FixedSizeListArray.from_arrays(np.arange(10), 5)
> >>> b
> 
> [
>   [
> 0,
> 1,
> 2,
> 3,
> 4
>   ],
>   [
> 5,
> 6,
> 7,
> 8,
> 9
>   ]
> ]
> >>> pa.parquet.write_table(pa.table({"": b}), "tmp2.parquet")
> >>> pa.parquet.read_table("tmp2.parquet")
> pyarrow.Table
> : fixed_size_list[5]
>   child 0, item: int64
> 
> : [[[0,1,2,3,4],[5,6,7,8,9]]]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14547) Reading FixedSizeListArray from Parquet with nulls

2021-11-02 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17437162#comment-17437162
 ] 

Micah Kornfield commented on ARROW-14547:
-

Duplicate of https://issues.apache.org/jira/browse/ARROW-9796 

Also, writing of FixedSizeLists is limited as well.  [[1,2], null, [3,4]] 
should fail on writes too.
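
A minimal sketch of that write-side case (the exact failure mode may vary by pyarrow version, so treat the expected error as an assumption):

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Build [[0, 1], null, [2, 3]] -- a null between two valid fixed-size lists,
# analogous to the [[1,2], null, [3,4]] case mentioned above.
c = pa.FixedSizeListArray.from_arrays(np.arange(8), 2).take([0, None, 1])

# On affected versions this write is expected to fail (or round-trip
# incorrectly), mirroring the read-side problem in the report.
pq.write_table(pa.table({"col": c}), "tmp3.parquet")
{code}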

> Reading FixedSizeListArray from Parquet with nulls
> --
>
> Key: ARROW-14547
> URL: https://issues.apache.org/jira/browse/ARROW-14547
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Affects Versions: 6.0.0
>Reporter: Jim Pivarski
>Priority: Major
>
> This one is easy to describe: given an array of fixed-sized lists, in which 
> some are null,
> {code:python}
> >>> import numpy as np
> >>> import pyarrow as pa
> >>> import pyarrow.parquet
> >>> a = pa.FixedSizeListArray.from_arrays(np.arange(10), 5).take([0, None])
> >>> a
> 
> [
>   [
> 0,
> 1,
> 2,
> 3,
> 4
>   ],
>   null
> ]
> {code}
> you can write them to a Parquet file, but not read them back:
> {code:python}
> >>> pa.parquet.write_table(pa.table({"": a}), "tmp.parquet")
> >>> pa.parquet.read_table("tmp.parquet")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
> line 1941, in read_table
> return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/home/jpivarski/miniconda3/lib/python3.9/site-packages/pyarrow/parquet.py", 
> line 1776, in read
> table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 491, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3235, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: Expected all lists to be of size=5 but index 2 had 
> size=0
> {code}
> It could be that, at some level, the second list is considered to be empty.
> For completeness, this doesn't happen if the fixed-sized lists have no nulls:
> {code:python}
> >>> b = pa.FixedSizeListArray.from_arrays(np.arange(10), 5)
> >>> b
> 
> [
>   [
> 0,
> 1,
> 2,
> 3,
> 4
>   ],
>   [
> 5,
> 6,
> 7,
> 8,
> 9
>   ]
> ]
> >>> pa.parquet.write_table(pa.table({"": b}), "tmp2.parquet")
> >>> pa.parquet.read_table("tmp2.parquet")
> pyarrow.Table
> : fixed_size_list[5]
>   child 0, item: int64
> 
> : [[[0,1,2,3,4],[5,6,7,8,9]]]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11901) [Java] Investigate potential performance improvement of compression codec

2021-10-26 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17434646#comment-17434646
 ] 

Micah Kornfield commented on ARROW-11901:
-

Does the presets library add a lot of value?  Could this be done in a new 
package within Arrow?  I'm a little hesitant to take on a new dependency (or 
would at least want to do more research on the viability of the project and how 
widely the packages in the repo are used).

> [Java] Investigate potential performance improvement of compression codec
> -
>
> Key: ARROW-11901
> URL: https://issues.apache.org/jira/browse/ARROW-11901
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Benjamin Wilhelm
>Priority: Major
>
> In response to the discussion in 
> https://github.com/apache/arrow/pull/8949/files#r588046787
> There are some performance penalties in the implementation of the compression 
> codecs (e.g. data copying between heap/off-heap data). We need to revise the 
> code to improve the performance. 
> We should also provide some benchmarks to validate that the performance 
> actually improves. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14303) [C++][Parquet] Do not duplicate Schema metadata in Parquet schema metadata and serialized ARROW:schema value

2021-10-23 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433403#comment-17433403
 ] 

Micah Kornfield commented on ARROW-14303:
-

It's been a while since I looked at this code, but I don't remember if all 
metadata is copied back into the deserialized schema or if we'd have to modify 
the read side as well (or maybe this is already implemented?).

I guess it is a 2X cost, where X is the size of the metadata, so depending on 
parameters and the relative size of the data I suppose this could add up.
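
A quick way to see the duplication in question (a sketch; the "mykey" metadata key is illustrative):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# Attach some schema-level metadata and write it out.
t = pa.table({"x": [1, 2, 3]}).replace_schema_metadata({"mykey": "v" * 1000})
pq.write_table(t, "meta.parquet")

# The Parquet key/value metadata holds the user key directly *and* the
# serialized ARROW:schema value, which embeds the same key again -- roughly
# the 2X cost discussed above.
print(pq.ParquetFile("meta.parquet").metadata.metadata.keys())
{code}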

> [C++][Parquet] Do not duplicate Schema metadata in Parquet schema metadata 
> and serialized ARROW:schema value
> 
>
> Key: ARROW-14303
> URL: https://issues.apache.org/jira/browse/ARROW-14303
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 7.0.0
>
>
> Metadata values are being duplicated in the Parquet file footer — we should 
> either only store them in ARROW:schema or the Parquet schema metadata. 
> Removing them from the Parquet schema metadata may break applications that 
> are expecting that metadata to be there when serialized from Arrow, so 
> dropping the keys from ARROW:schema is probably a safer choice



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14422) [Python] Allow parquet::WriterProperties::created_by to be set via pyarrow.ParquetWriter for compatibility with older parquet-mr

2021-10-23 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433402#comment-17433402
 ] 

Micah Kornfield commented on ARROW-14422:
-

[fastparquet created_by 
string|https://github.com/dask/fastparquet/blob/main/fastparquet/util.py#L19] 
has "build" as part of its string. I'd guess what is happening is that Hive 
only looks at the metafile and then doesn't try to parse the created_by version 
in the data files if the metafile is present (it only parses the metafile value).
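
For reference, the created_by string that pyarrow writes can be inspected directly (a sketch; the exact string depends on the pyarrow build):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": ["a"], "val": ["b"]})
pq.write_table(table, "df.parquet", version="1.0", write_statistics=True)

# This is the string parquet-mr's VersionParser tries to parse; older
# parquet-mr releases choke on it because it lacks a "build" component.
print(pq.ParquetFile("df.parquet").metadata.created_by)
{code}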

> [Python] Allow parquet::WriterProperties::created_by to be set via 
> pyarrow.ParquetWriter for compatibility with older parquet-mr
> 
>
> Key: ARROW-14422
> URL: https://issues.apache.org/jira/browse/ARROW-14422
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Kevin
>Priority: Major
>
> have a couple of files  and using  pyarrow.table (0.17)
>  to save it as parquet on disk (parquet version 1.4)
> colums
>  id : string
>  val : string
> *table = pa.Table.from_pandas(df)* 
>  *pq.write_table(table, "df.parquet", version='1.0', flavor='spark', 
> write_statistics=True, )*
> However, Hive and Spark does not recognize the parquet version:
> {{org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version 
> ((.*) )?(build ?(.*))}}
>  \{{ at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
>  \{{ at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
>  \{{ at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
>  
> +*It seems related to this issue:*+
> It appears you've encountered PARQUET-349 which was fixed in 2015 before 
> Arrow was even started. The underlying C++ code does allow this 
> {{created_by}} field to be customized 
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
>  but the python wrapper does not expose this 
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
>  
>   
> *+EDIT Add infos+*
> Current python wrapper does NOT expose :  created_by builder  (when writing 
> parquet on disk)
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361]
>  
> But, this is available in CPP version:
> [https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320]
>  
> This creates an issue when Hadoop parquet reader reads this pyarrow parquet 
> file:
>  
>  
> +*SO Question here:*+
>  
> [https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12976) [Python] Arrow-to-Python conversion is slow

2021-10-23 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433399#comment-17433399
 ] 

Micah Kornfield commented on ARROW-12976:
-

Yeah, given #1 and #2, I think I'll try to simply replicate existing behavior 
in C++, even though it can lead to unexpected behavior.

> [Python] Arrow-to-Python conversion is slow
> ---
>
> Key: ARROW-12976
> URL: https://issues.apache.org/jira/browse/ARROW-12976
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>
> It seems that we are 20x slower than Numpy for converting the exact same data 
> to a Python list.
> With integers:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.int64)
> >>> %timeit arr.tolist()
> 8.24 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 218 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}
> With floats:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.float64)
> >>> %timeit arr.tolist()
> 10.2 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 199 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14422) [Python] Allow parquet::WriterProperties::created_by to be set via pyarrow.ParquetWriter for compatibility with older parquet-mr

2021-10-23 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17433398#comment-17433398
 ] 

Micah Kornfield commented on ARROW-14422:
-

{quote}Maintaining some regression test
between pyarrow export and "parquet-mr" may be useful
{quote}
Agreed, there were some proposals but it appears no one has had time to devote 
to this.  I'm not sure it would help in this case since, as Weston pointed out, 
the broken parquet version is an older one and we would likely only test a few versions.

 
{quote}I agree that adding the word "build" to the C++ created_by string would 
be another way to solve this issue. We could change "parquet-cpp-arrow version 
6.0.0-SNAPSHOT" to "parquet-cpp-arrow build 6.0.0-SNAPSHOT" but I don't know 
how I feel about that either.
{quote}
I'd be more in favor of adding a build string in C++ than exposing the flag 
in python (or, if we do expose the flag in python, we would at least need to 
validate it to see if it is parseable).  In general, I think 
this is fairly low level so I'd be hesitant to expose it in more places.  Using 
the BUILD field to hold the git SHA could be interesting.

> [Python] Allow parquet::WriterProperties::created_by to be set via 
> pyarrow.ParquetWriter for compatibility with older parquet-mr
> 
>
> Key: ARROW-14422
> URL: https://issues.apache.org/jira/browse/ARROW-14422
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Kevin
>Priority: Major
>
> have a couple of files  and using  pyarrow.table (0.17)
>  to save it as parquet on disk (parquet version 1.4)
> colums
>  id : string
>  val : string
> *table = pa.Table.from_pandas(df)* 
>  *pq.write_table(table, "df.parquet", version='1.0', flavor='spark', 
> write_statistics=True, )*
> However, Hive and Spark does not recognize the parquet version:
> {{org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-cpp version 1.5.1-SNAPSHOT using format: (.+) version 
> ((.*) )?(build ?(.*))}}
>  \{{ at org.apache.parquet.VersionParser.parse(VersionParser.java:112)}}
>  \{{ at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)}}
>  \{{ at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)}}
>  
> +*It seems related to this issue:*+
> It appears you've encountered PARQUET-349 which was fixed in 2015 before 
> Arrow was even started. The underlying C++ code does allow this 
> {{created_by}} field to be customized 
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
>  but the python wrapper does not expose this 
> [source|https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/python/pyarrow/_parquet.pxd#L360].
>  
>   
> *+EDIT Add infos+*
> Current python wrapper does NOT expose :  created_by builder  (when writing 
> parquet on disk)
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L361]
>  
> But, this is available in CPP version:
> [https://github.com/apache/arrow/blob/4591d76fce2846a29dac33bf01e9ba0337b118e9/cpp/src/parquet/properties.h#L249]
> [https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pxd#L320]
>  
> This creates an issue when Hadoop parquet reader reads this pyarrow parquet 
> file:
>  
>  
> +*SO Question here:*+
>  
> [https://stackoverflow.com/questions/69658140/how-to-save-a-parquet-with-pandas-using-same-header-than-hadoop-spark-parquet?noredirect=1#comment123131862_69658140]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14345) [C++] Implement streaming reads for GCS FileSystem

2021-10-20 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield updated ARROW-14345:

Fix Version/s: (was: 6.0.0)
   7.0.0

> [C++] Implement streaming reads for GCS FileSystem
> --
>
> Key: ARROW-14345
> URL: https://issues.apache.org/jira/browse/ARROW-14345
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Implement the {{GcsFileSystem::OpenInputStream()}} functions and tests for 
> them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-14345) [C++] Implement streaming reads for GCS FileSystem

2021-10-20 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-14345.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 11436
[https://github.com/apache/arrow/pull/11436]

> [C++] Implement streaming reads for GCS FileSystem
> --
>
> Key: ARROW-14345
> URL: https://issues.apache.org/jira/browse/ARROW-14345
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Carlos O'Ryan
>Assignee: Carlos O'Ryan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Implement the {{GcsFileSystem::OpenInputStream()}} functions and tests for 
> them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12976) [Python] Arrow-to-Python conversion is slow

2021-10-15 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429520#comment-17429520
 ] 

Micah Kornfield commented on ARROW-12976:
-

One thing we discussed on the sync call is whether a more explicit API should be 
provided to control coercion of timestamp[ns] to pd.Timestamp, instead of the 
current behavior that does the conversion if pandas is installed but falls 
back to datetime (and checks nanoseconds=0) if pandas is not installed.  
[~jorisvandenbossche] any thoughts here?
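
A small illustration of the current behavior being discussed (a sketch; the exact output depends on whether pandas is installed):

{code:python}
import pyarrow as pa

# One nanosecond past a whole second: datetime.datetime cannot represent it.
arr = pa.array([1632787200000000001], type=pa.timestamp("ns"))

# With pandas installed this should produce a pd.Timestamp carrying the
# nanosecond; without pandas it falls back to datetime (after checking that
# the sub-microsecond part is zero), which is the implicit behavior a more
# explicit option would replace.
print(arr.to_pylist())
{code}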



> [Python] Arrow-to-Python conversion is slow
> ---
>
> Key: ARROW-12976
> URL: https://issues.apache.org/jira/browse/ARROW-12976
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>
> It seems that we are 20x slower than Numpy for converting the exact same data 
> to a Python list.
> With integers:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.int64)
> >>> %timeit arr.tolist()
> 8.24 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 218 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}
> With floats:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.float64)
> >>> %timeit arr.tolist()
> 10.2 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 199 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12976) [Python] Arrow-to-Python conversion is slow

2021-10-09 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17426744#comment-17426744
 ] 

Micah Kornfield commented on ARROW-12976:
-

[~apitrou] [~jorisvandenbossche] going to see if I can consolidate this logic in 
C++ (unless you were thinking of taking it up).  Any preference for trying to 
split this up into smaller PRs or one large one to migrate all types to C++ code?

> [Python] Arrow-to-Python conversion is slow
> ---
>
> Key: ARROW-12976
> URL: https://issues.apache.org/jira/browse/ARROW-12976
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>
> It seems that we are 20x slower than Numpy for converting the exact same data 
> to a Python list.
> With integers:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.int64)
> >>> %timeit arr.tolist()
> 8.24 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 218 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}
> With floats:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.float64)
> >>> %timeit arr.tolist()
> 10.2 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 199 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12976) [Python] Arrow-to-Python conversion is slow

2021-10-09 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned ARROW-12976:
---

Assignee: Micah Kornfield

> [Python] Arrow-to-Python conversion is slow
> ---
>
> Key: ARROW-12976
> URL: https://issues.apache.org/jira/browse/ARROW-12976
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Antoine Pitrou
>Assignee: Micah Kornfield
>Priority: Major
>
> It seems that we are 20x slower than Numpy for converting the exact same data 
> to a Python list.
> With integers:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.int64)
> >>> %timeit arr.tolist()
> 8.24 µs ± 3.46 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 218 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> {code}
> With floats:
> {code:python}
> >>> arr = np.arange(0,1000, dtype=np.float64)
> >>> %timeit arr.tolist()
> 10.2 µs ± 25.5 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
> >>> parr = pa.array(arr)
> >>> %timeit parr.to_pylist()
> 199 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1 loops each)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13604) [Java] Remove deprecation annotations for APIs representing unsupported operations

2021-10-07 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-13604.
-
Fix Version/s: 6.0.0
   Resolution: Fixed

Issue resolved by pull request 10911
[https://github.com/apache/arrow/pull/10911]

> [Java] Remove deprecation annotations for APIs representing unsupported 
> operations
> --
>
> Key: ARROW-13604
> URL: https://issues.apache.org/jira/browse/ARROW-13604
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Some APIs representing unsupported operations should not be annotated 
> deprecated, unless the overriden APIs are deprecated in the super 
> classes/interfaces.
> According to the discussion in 
> https://github.com/apache/arrow/pull/10864#issuecomment-895707729, we open a 
> separate issue for this. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (ARROW-13257) [Java][Dataset] Allow passing empty columns for projection

2021-10-07 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved ARROW-13257.
-
Resolution: Fixed

Issue resolved by pull request 10652
[https://github.com/apache/arrow/pull/10652]

> [Java][Dataset] Allow passing empty columns for projection
> --
>
> Key: ARROW-13257
> URL: https://issues.apache.org/jira/browse/ARROW-13257
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Affects Versions: 4.0.1
>Reporter: Hongze Zhang
>Assignee: Hongze Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Projecting on empty columns (functionally like a pure metadata query on 
> source of data) was supported in C++ Dataset API in ARROW-11174. This is to 
> align Java API's behavior within C++.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425938#comment-17425938
 ] 

Micah Kornfield commented on ARROW-14196:
-

Also CC [~jpivarski] on thoughts on how this might impact awkward arrays and 
ways to mitigate.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425937#comment-17425937
 ] 

Micah Kornfield commented on ARROW-14196:
-

I think, similar to other default changes we have discussed, a very public 
advertisement of the intended change and its consequences would be good.  At some 
point we should just bite the bullet.  I think the first feature (taking a 
compliant nested type and renaming it on read) would give users a temporary 
fallback.

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-14196) [C++][Parquet] Default to compliant nested types in Parquet writer

2021-10-07 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425933#comment-17425933
 ] 

Micah Kornfield commented on ARROW-14196:
-

Yeah, I think it is really only the name change that I'm concerned about.  
https://issues.apache.org/jira/browse/ARROW-13151 has another example where 
people were trying to reference things by path, which was broken for other 
reasons.

A few items that don't really solve all cases but would make things better, or at 
least adaptable in the long term:

1.  Add an option that translates "compliant nested type" names back to 
Arrow's naming scheme.

2.  Make it possible to select columns by eliding the list name components.

Another question that is dataset specific: if one file was written with 
compliant nested types and one was not, and both were read in the same dataset, 
are the results sane (do the schemas get unified?)
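
A small sketch of the name difference behind items 1 and 2 (assuming the use_compliant_nested_type flag that ARROW-11497 exposed in Python):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

t = pa.table({"root": [[1, 2], [3]]})
pq.write_table(t, "default.parquet", use_compliant_nested_type=False)
pq.write_table(t, "compliant.parquet", use_compliant_nested_type=True)

# The Parquet-level column path uses "item" in the first file and "element"
# in the second, which is why selections like "root.list.item" break when
# the default flips.
print(pq.ParquetFile("default.parquet").schema)
print(pq.ParquetFile("compliant.parquet").schema)
{code}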

> [C++][Parquet] Default to compliant nested types in Parquet writer
> --
>
> Key: ARROW-14196
> URL: https://issues.apache.org/jira/browse/ARROW-14196
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Parquet
>Reporter: Joris Van den Bossche
>Priority: Major
>
> In C++ there is already an option to get the "compliant_nested_types" (to 
> have the list columns follow the Parquet specification), and ARROW-11497 
> exposed this option in Python.
> This is still set to False by default, but in the source it says "TODO: At 
> some point we should flip this.", and in ARROW-11497 there was also some 
> discussion about what it would take to change the default.
> cc [~emkornfield] [~apitrou]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13151) [Python] Unable to read single child field of struct column from Parquet

2021-10-06 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17425351#comment-17425351
 ] 

Micah Kornfield commented on ARROW-13151:
-

With the PR that is up, this now seems to work:
{code:python}
>>> data = {"root": [[{"addr": {"this": 3, "that": 3}}]]}
>>> table = pa.Table.from_pydict(data)
>>> pq.write_table(table, "/tmp/table.parquet")
>>> file = pq.ParquetFile("/tmp/table.parquet")
>>> array = file.read(["root.list.item.addr.that"])
>>> array
pyarrow.Table
root: list<item: struct<addr: struct<that: int64>>>
  child 0, item: struct<addr: struct<that: int64>>
      child 0, addr: struct<that: int64>
          child 0, that: int64

root: [[  -- is_valid: all not null
  -- child 0 type: struct<that: int64>
    -- is_valid: all not null
    -- child 0 type: int64
      [
        3
      ]]]
{code}

 

> [Python] Unable to read single child field of struct column from Parquet
> 
>
> Key: ARROW-13151
> URL: https://issues.apache.org/jira/browse/ARROW-13151
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Reporter: Angus Hollands
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Given the following table
> {code:java}
> data = {"root": [[{"addr": {"this": 3, "that": 3}}]]}
> table = pa.Table.from_pydict(data)
> {code}
> reading the nested column leads to an {{pyarrow.lib.ArrowInvalid}} error:
> {code:java}
> pq.write_table(table, "/tmp/table.parquet")
> file = pq.ParquetFile("/tmp/table.parquet")
> array = file.read(["root.list.item.addr.that"])
> {code}
> Traceback:
> {code:java}
> Traceback (most recent call last):
>   File "", line 21, in 
> array = file.read(["root.list.item.addr.that"])
>   File 
> "/home/angus/.mambaforge/envs/awkward/lib/python3.9/site-packages/pyarrow/parquet.py",
>  line 383, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1097, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: List child array invalid: Invalid: Struct child 
> array #0 does not match type field: struct vs struct int64, this: int64>
> {code}
> It's possible that I don't quite understand this properly - am I doing 
> something wrong?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-13151) [Python] Unable to read single child field of struct column from Parquet

2021-10-06 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420211#comment-17420211
 ] 

Micah Kornfield edited comment on ARROW-13151 at 10/7/21, 5:38 AM:
---

Looking at this the problem is we do not propagate filtered fields through 
lists or nested structs (only one level of structs).


was (Author: emkornfield):
Looking at this the problem is we do not propatete filtered fields through 
lists or nested structs (only one level of structs.

> [Python] Unable to read single child field of struct column from Parquet
> 
>
> Key: ARROW-13151
> URL: https://issues.apache.org/jira/browse/ARROW-13151
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Parquet, Python
>Reporter: Angus Hollands
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Given the following table
> {code:java}
> data = {"root": [[{"addr": {"this": 3, "that": 3}}]]}
> table = pa.Table.from_pydict(data)
> {code}
> reading the nested column leads to an {{pyarrow.lib.ArrowInvalid}} error:
> {code:java}
> pq.write_table(table, "/tmp/table.parquet")
> file = pq.ParquetFile("/tmp/table.parquet")
> array = file.read(["root.list.item.addr.that"])
> {code}
> Traceback:
> {code:java}
> Traceback (most recent call last):
>   File "", line 21, in 
> array = file.read(["root.list.item.addr.that"])
>   File 
> "/home/angus/.mambaforge/envs/awkward/lib/python3.9/site-packages/pyarrow/parquet.py",
>  line 383, in read
> return self.reader.read_all(column_indices=column_indices,
>   File "pyarrow/_parquet.pyx", line 1097, in 
> pyarrow._parquet.ParquetReader.read_all
>   File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: List child array invalid: Invalid: Struct child 
> array #0 does not match type field: struct vs struct int64, this: int64>
> {code}
> It's possible that I don't quite understand this properly - am I doing 
> something wrong?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-13806) [Python] Add conversion to/from Pandas/Python for Month, Day Nano Interval Type

2021-10-04 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17423775#comment-17423775
 ] 

Micah Kornfield edited comment on ARROW-13806 at 10/4/21, 6:52 AM:
---

[~jorisvandenbossche] I posted a PR for MonthDayNanos interval.  I think 
this was large enough that I will try to do another one for the other types 
(the PR contains a proposal for moving most of the logic to C++ and I didn't want 
to put too much in it; if this looks good I think the other interval types 
probably won't be too bad).


was (Author: emkornfield):
[~jorisvandenbossche] I posted a PR for MonthDayNanos interval.  I'll think 
this was large enough that I will try to do another one for the other types.

> [Python] Add conversion to/from Pandas/Python for Month, Day Nano Interval 
> Type
> ---
>
> Key: ARROW-13806
> URL: https://issues.apache.org/jira/browse/ARROW-13806
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/arrow/pull/10177] has been merged we should 
> support conversion to and from this type for standard python surface areas.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

