[jira] [Comment Edited] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets

2022-02-07 Thread Sarah Gilmore (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488413#comment-17488413
 ] 

Sarah Gilmore edited comment on ARROW-15554 at 2/7/22, 8:21 PM:


Will do!


was (Author: sgilmore):
Wil do!

> [Format][C++] Add "LargeMap" type with 64-bit offsets
> -
>
> Key: ARROW-15554
> URL: https://issues.apache.org/jira/browse/ARROW-15554
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Sarah Gilmore
>Priority: Major
>
> It would be nice if a "LargeMap" type existed alongside the "Map" type for 
> parity. For other datatypes that require offset arrays/buffers, such as 
> String, List, and Binary, Arrow provides a "large" version of these types, 
> i.e. LargeString, LargeList, and LargeBinaryArray. It would be nice to have 
> a "LargeMap" as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets

2022-02-07 Thread Sarah Gilmore (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488413#comment-17488413
 ] 

Sarah Gilmore commented on ARROW-15554:
---

Wil do!

> [Format][C++] Add "LargeMap" type with 64-bit offsets
> -
>
> Key: ARROW-15554
> URL: https://issues.apache.org/jira/browse/ARROW-15554
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Sarah Gilmore
>Priority: Major
>
> It would be nice if a "LargeMap" type existed alongside the "Map" type for 
> parity. For other datatypes that require offset arrays/buffers, such as 
> String, List, and Binary, Arrow provides a "large" version of these types, 
> i.e. LargeString, LargeList, and LargeBinaryArray. It would be nice to have 
> a "LargeMap" as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets

2022-02-04 Thread Sarah Gilmore (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487136#comment-17487136
 ] 

Sarah Gilmore commented on ARROW-15554:
---

Hi [~apitrou],
 
I was thinking more about the future when I created this Jira issue. I don't 
have a concrete need right now, but I can picture a few scenarios in which the 
size limitation imposed by MapArray's 32-bit offsets cannot be worked around.
 
*Scenario 1:*
 
Suppose you have a ListArray of MapArrays. If one of the maps requires more 
than int32::max key-value pairs, there's currently no way to represent it. You 
could try using a ChunkedArray, but you would still need to split the large 
map across multiple rows in the list.
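 
To make Scenario 1 concrete, here's a minimal pyarrow sketch of the nesting 
involved (the values are made up); the map's 32-bit offsets are where the 
limit comes from:
{code:java}
import pyarrow as pa

# A ListArray whose items are MapArrays. Map offsets are 32-bit, so a single
# map cannot hold more than int32::max key-value pairs.
ty = pa.list_(pa.map_(pa.string(), pa.int64()))
arr = pa.array([[[("a", 1), ("b", 2)]], []], type=ty)
print(arr.type)  # list<item: map<string, int64>>
{code}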
 
*Scenario 2:*
 
Even if the MapArray is at the top of the object hierarchy, the same problem 
could potentially arise if a row within the array needs to contain more than 
int32::max key-value pairs. You could try to use a ChunkedArray to resolve the 
issue, but the key-value pairs would still be split across multiple rows.
 
I've seen Parquet files with MAP columns, and I can imagine a situation in 
which someone has a very large MAP as the top-most data structure or within a 
nested one. While running into a situation in which they can't use MapArrays to 
represent their data is probably rare, it's not entirely impossible given 
int32's size restrictions. 
 
I'd honestly be interested in looking into this myself.
 
I hope this helps.
 
Best,
Sarah
 
 

> [Format][C++] Add "LargeMap" type with 64-bit offsets
> -
>
> Key: ARROW-15554
> URL: https://issues.apache.org/jira/browse/ARROW-15554
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Format
>Reporter: Sarah Gilmore
>Priority: Major
>
> It would be nice if a "LargeMap" type existed alongside the "Map" type for 
> parity. For other datatypes that require offset arrays/buffers, such as 
> String, List, and Binary, Arrow provides a "large" version of these types, 
> i.e. LargeString, LargeList, and LargeBinaryArray. It would be nice to have 
> a "LargeMap" as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets

2022-02-03 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-15554:
-

 Summary: [Format][C++] Add "LargeMap" type with 64-bit offsets
 Key: ARROW-15554
 URL: https://issues.apache.org/jira/browse/ARROW-15554
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Format
Reporter: Sarah Gilmore


It would be nice if a "LargeMap" type existed alongside the "Map" type for 
parity. For other datatypes that require offset arrays/buffers, such as String, 
List, and Binary, Arrow provides a "large" version of these types, i.e. 
LargeString, LargeList, and LargeBinaryArray. It would be nice to have a 
"LargeMap" as well.
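 
For reference, pyarrow already exposes 64-bit-offset variants for the other 
offset-based types, but not for map (a sketch; {{large_map}} below is 
hypothetical and does not exist today):
{code:java}
import pyarrow as pa

# 64-bit-offset variants that already exist:
pa.large_string()            # vs. pa.string()
pa.large_binary()            # vs. pa.binary()
pa.large_list(pa.int64())    # vs. pa.list_(pa.int64())

# Map only has the 32-bit-offset form:
pa.map_(pa.string(), pa.int64())
# pa.large_map(pa.string(), pa.int64())  # hypothetical; this is what this issue asks for
{code}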



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14723) [Python] pyarrow cannot import parquet files containing row groups whose lengths exceed int32 max.

2021-11-18 Thread Sarah Gilmore (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446022#comment-17446022
 ] 

Sarah Gilmore commented on ARROW-14723:
---

Hi [~jorisvandenbossche],

Here's code you can use to generate both files: [^main.cpp]. In the terminal, 
you'll be prompted to give the output filename and the number of rows you want 
the Parquet file to have. 

I noticed that if I link the program against the latest version of Arrow (I 
believe 7.0.0), the files it creates can be read in via pyarrow. However, if 
you link against 4.0.1, Parquet files with row groups that exceed 2147483647 
rows in length cannot be read in via pyarrow. I suppose this issue has been 
resolved in a later release of Arrow?
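 
For anyone who doesn't want to build the C++ program, here's a rough pyarrow 
sketch of the same idea (this is *not* the attached main.cpp; the column name 
and row count are made up, and it needs a lot of memory):
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

n = 2_147_483_648  # int32 max + 1 rows
table = pa.table({"x": pa.nulls(n, type=pa.int8())})

# Force all rows into a single row group so its length exceeds int32 max.
pq.write_table(table, "intmax32plus1.parq", row_group_size=n)
{code}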

 

Best,

Sarah

 

> [Python] pyarrow cannot import parquet files containing row groups whose 
> lengths exceed int32 max. 
> ---
>
> Key: ARROW-14723
> URL: https://issues.apache.org/jira/browse/ARROW-14723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
> Attachments: intmax32.parq, intmax32plus1.parq, main.cpp
>
>
> It's possible to create Parquet files containing row groups whose lengths are 
> greater than int32 max (2147483647). However, Pyarrow cannot read these 
> files. 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> # intmax32.parq can be read in without any issues
> >>> t = pq.read_table("intmax32.parq"); 
> # intmax32plus1.parq cannot be read in
> >>> t = pq.read_table("intmax32plus1.parq"); 
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1895, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1744, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3075, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: Negative size (corrupt file?)
> {code}
>  
> However, both files can be imported via the C++ Arrow bindings without any 
> issues.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (ARROW-14723) [Python] pyarrow cannot import parquet files containing row groups whose lengths exceed int32 max.

2021-11-18 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore updated ARROW-14723:
--
Attachment: main.cpp

> [Python] pyarrow cannot import parquet files containing row groups whose 
> lengths exceed int32 max. 
> ---
>
> Key: ARROW-14723
> URL: https://issues.apache.org/jira/browse/ARROW-14723
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
> Attachments: intmax32.parq, intmax32plus1.parq, main.cpp
>
>
> It's possible to create Parquet files containing row groups whose lengths are 
> greater than int32 max (2147483647). However, Pyarrow cannot read these 
> files. 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> # intmax32.parq can be read in without any issues
> >>> t = pq.read_table("intmax32.parq"); 
> # intmax32plus1.parq cannot be read in
> >>> t = pq.read_table("intmax32plus1.parq"); 
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1895, in read_table
>     return dataset.read(columns=columns, use_threads=use_threads,
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py",
>  line 1744, in read
>     table = self._dataset.to_table(
>   File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
>   File "pyarrow/_dataset.pyx", line 3075, in pyarrow._dataset.Scanner.to_table
>   File "pyarrow/error.pxi", line 143, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> OSError: Negative size (corrupt file?)
> {code}
>  
> However, both files can be imported via the C++ Arrow bindings without any 
> issues.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-14723) [Python] pyarrow cannot import parquet files containing row groups whose lengths exceed int32 max.

2021-11-16 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-14723:
-

 Summary: [Python] pyarrow cannot import parquet files containing 
row groups whose lengths exceed int32 max. 
 Key: ARROW-14723
 URL: https://issues.apache.org/jira/browse/ARROW-14723
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 5.0.0
Reporter: Sarah Gilmore
 Attachments: intmax32.parq, intmax32plus1.parq

It's possible to create Parquet files containing row groups whose lengths are 
greater than int32 max (2147483647). However, Pyarrow cannot read these files. 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq

# intmax32.parq can be read in without any issues
>>> t = pq.read_table("intmax32.parq"); 

# intmax32plus1.parq cannot be read in
>>> t = pq.read_table("intmax32plus1.parq"); 
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py",
 line 1895, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File 
"/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pyarrow/parquet.py",
 line 1744, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 465, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3075, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Negative size (corrupt file?)


{code}
 

However, both files can be imported via the C++ Arrow bindings without any 
issues.
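 
To check how large the row groups in a file actually are (a sketch using the 
standard Parquet metadata API; the filename is one of the attachments above):
{code:java}
import pyarrow.parquet as pq

pf = pq.ParquetFile("intmax32plus1.parq")
for i in range(pf.metadata.num_row_groups):
    print(i, pf.metadata.row_group(i).num_rows)
{code}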

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-10-26 Thread Sarah Gilmore (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434368#comment-17434368
 ] 

Sarah Gilmore commented on ARROW-14104:
---

Hi [~jorisvandenbossche],

It actually looks like I was running an older version of pyarrow based on the 
output of {{pa.__version__}}. According to pip, I have pyarrow 5.0.0:

 
{code:java}
Name: pyarrow
Version: 5.0.0
Summary: Python library for Apache Arrow
Home-page: https://arrow.apache.org/
Author: 
Author-email: 
License: Apache License, Version 2.0
Location: /usr/local/lib/python3.9/site-packages
Requires: numpy
Required-by: parquet-tools
{code}
 

But {{pa.__version__}} returns {{'0.17.1'}}. It looks like my system 
configuration got messed up, though I'm not sure how. I was able to confirm 
that the TimeZone is round-tripped in pyarrow 5.0.0 by creating a virtual 
environment with python's venv module and installing pyarrow 5.0.0 there.
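 
The check inside the fresh environment looks roughly like this (a sketch, not 
a verbatim transcript; the file is one of the attachments on this issue):
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)  # reports 5.0.0 in the clean environment

t = pq.read_table("exampleArrow4.parq")
print(t.schema)  # the list's child item keeps timestamp[us, tz=America/New_York]
{code}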

I'm sorry for any confusion I caused. I'll close this issue.

Best,
Sarah 

 

 

 

 

 

> Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to 
> preserve the TimeZone - unlike in Arrow 4.0.0
> 
>
> Key: ARROW-14104
> URL: https://issues.apache.org/jira/browse/ARROW-14104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
> Attachments: exampleArrow4.parq, exampleArrow5.parq
>
>
> In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
> List columns to and from parquet files: 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import datetime 
> >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
> >>> pa.list_(pa.timestamp('us', 'America/New_York')));
> >>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
> >>> pq.write_table(t, "example.parq");
> >>> t2 = pq.read_table("example.parq");
> >>> t2
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=America/New_York]
> {code}
> However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
> set to UTC:
> {code:java}
> >>> t3 = pq.read_table("example.parq");
> >>> t3
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=UTC]
>  {code}
>  
> I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
> timestamp columns. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-10-26 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore closed ARROW-14104.
-
Resolution: Not A Problem

This issue was a result of a configuration problem in my environment. 

> Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to 
> preserve the TimeZone - unlike in Arrow 4.0.0
> 
>
> Key: ARROW-14104
> URL: https://issues.apache.org/jira/browse/ARROW-14104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
> Attachments: exampleArrow4.parq, exampleArrow5.parq
>
>
> In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
> List columns to and from parquet files: 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import datetime 
> >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
> >>> pa.list_(pa.timestamp('us', 'America/New_York')));
> >>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
> >>> pq.write_table(t, "example.parq");
> >>> t2 = pq.read_table("example.parq");
> >>> t2
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=America/New_York]
> {code}
> However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
> set to UTC:
> {code:java}
> >>> t3 = pq.read_table("example.parq");
> >>> t3
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=UTC]
>  {code}
>  
> I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
> timestamp columns. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-10-19 Thread Sarah Gilmore (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430530#comment-17430530
 ] 

Sarah Gilmore edited comment on ARROW-14104 at 10/19/21, 12:57 PM:
---

So sorry about the delay, [~jorisvandenbossche] and [~westonpace].

I've attached two files ([^exampleArrow4.parq] and [^exampleArrow5.parq]) that 
were both created with the following code:
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

import datetime
>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));
>>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
>>> pq.write_table(t, "example.parq"); 

{code}
  

I ran the code snippet above in pyarrow 4.0.0 and pyarrow 5.0.0 to create 
exampleArrow4.parq and exampleArrow5.parq, respectively. 

 

Here's the output of reading both files in pyarrow 4.0.0: 
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

>>> t1 = pq.read_table("exampleArrow4.parq")
>>> t1
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=America/New_York]

>>> t2 = pq.read_table("exampleArrow5.parq")
>>> t2
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=America/New_York]{code}
 The TimeZone is read in properly for both files.

 

Here's the output of reading both files in pyarrow 5.0.0:
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

>>> t1 = pq.read_table("exampleArrow4.parq")
>>> t1
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=UTC]

>>> t2 = pq.read_table("exampleArrow5.parq")
>>> t2
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=UTC]
{code}
 

It looks like pyarrow 5.0.0 writes out the TimeZone information, but doesn't 
read it in properly.
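 
One way to separate the write side from the read side is to look at the file's 
metadata directly (a sketch, not part of the original test):
{code:java}
import pyarrow.parquet as pq

pf = pq.ParquetFile("exampleArrow5.parq")

# Arrow schema as this pyarrow version reconstructs it from the file
print(pf.schema_arrow)

# If the key/value metadata contains an ARROW:schema entry, the writer stored
# the original Arrow schema (time zone included) alongside the Parquet schema.
print(list((pf.metadata.metadata or {}).keys()))
{code}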

 

 

 


was (Author: sgilmore):
So sorry about the delay, [~jorisvandenbossche] and [~westonpace].

I've attached two files ([^exampleArrow4.parq] and [^exampleArrow5.parq]) that 
were both created with the following code:

 
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

import datetime
>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));
>>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
>>> pq.write_table(t, "example.parq"); 

{code}
 

 

I ran the code snippet above in pyarrow 4.0.0 and pyarrow 5.0.0 to create 
exampleArrow4.parq and exampleArrow5.parq, respectively. 

 

Here's the output of reading both files in pyarrow 4.0.0:

 
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

>>> t1 = pq.read_table("exampleArrow4.parq")
>>> t1
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=America/New_York]

>>> t2 = pq.read_table("exampleArrow5.parq")
>>> t2
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=America/New_York]{code}
 

The TimeZone is read in properly for both files.

 

Here's the output of reading both files in pyarrow 5.0.0:

 
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

>>> t1 = pq.read_table("exampleArrow4.parq")
>>> t1
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=UTC]

>>> t2 = pq.read_table("exampleArrow5.parq")
>>> t2
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=UTC]
{code}
 

 

It looks like pyarrow 5.0.0 writes out the TimeZone information, but doesn't 
read it in properly.

 

 

 

> Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to 
> preserve the TimeZone - unlike in Arrow 4.0.0
> 
>
> Key: ARROW-14104
> URL: https://issues.apache.org/jira/browse/ARROW-14104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
> Attachments: exampleArrow4.parq, exampleArrow5.parq
>
>
> In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
> List columns to and from parquet files: 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import datetime 
> >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
> >>> pa.list_(pa.timestamp('us', 'America/New_York')));
> >>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
> >>> pq.write_table(t, "example.parq");
> >>> t2 = pq.read_table("example.parq");
> >>> t2
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=America/New_York]
> {code}
> However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
> set to UTC:
> {code:java}
> >

[jira] [Commented] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-10-19 Thread Sarah Gilmore (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17430530#comment-17430530
 ] 

Sarah Gilmore commented on ARROW-14104:
---

So sorry about the delay, [~jorisvandenbossche] and [~westonpace].

I've attached two files ([^exampleArrow4.parq] and [^exampleArrow5.parq]) that 
were both created with the following code:

 
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

import datetime
>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));
>>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
>>> pq.write_table(t, "example.parq"); 

{code}
 

 

I ran the code snippet above in pyarrow 4.0.0 and pyarrow 5.0.0 to create 
exampleArrow4.parq and exampleArrow5.parq, respectively. 

 

Here's the output of reading both files in pyarrow 4.0.0:

 
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

>>> t1 = pq.read_table("exampleArrow4.parq")
>>> t1
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=America/New_York]

>>> t2 = pq.read_table("exampleArrow5.parq")
>>> t2
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=America/New_York]{code}
 

The TimeZone is read in properly for both files.

 

Here's the output of reading both files in pyarrow 5.0.0:

 
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

>>> t1 = pq.read_table("exampleArrow4.parq")
>>> t1
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=UTC]

>>> t2 = pq.read_table("exampleArrow5.parq")
>>> t2
pyarrow.Table
TimestampColumn: list
  child 0, item: timestamp[us, tz=UTC]
{code}
 

 

It looks like pyarrow 5.0.0 writes out the TimeZone information, but doesn't 
read it in properly.

 

 

 

> Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to 
> preserve the TimeZone - unlike in Arrow 4.0.0
> 
>
> Key: ARROW-14104
> URL: https://issues.apache.org/jira/browse/ARROW-14104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
> Attachments: exampleArrow4.parq, exampleArrow5.parq
>
>
> In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
> List columns to and from parquet files: 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import datetime 
> >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
> >>> pa.list_(pa.timestamp('us', 'America/New_York')));
> >>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
> >>> pq.write_table(t, "example.parq");
> >>> t2 = pq.read_table("example.parq");
> >>> t2
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=America/New_York]
> {code}
> However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
> set to UTC:
> {code:java}
> >>> t3 = pq.read_table("example.parq");
> >>> t3
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=UTC]
>  {code}
>  
> I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
> timestamp columns. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-10-19 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore updated ARROW-14104:
--
Attachment: exampleArrow5.parq
exampleArrow4.parq

> Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to 
> preserve the TimeZone - unlike in Arrow 4.0.0
> 
>
> Key: ARROW-14104
> URL: https://issues.apache.org/jira/browse/ARROW-14104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
> Attachments: exampleArrow4.parq, exampleArrow5.parq
>
>
> In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
> List columns to and from parquet files: 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import datetime 
> >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
> >>> pa.list_(pa.timestamp('us', 'America/New_York')));
> >>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
> >>> pq.write_table(t, "example.parq");
> >>> t2 = pq.read_table("example.parq");
> >>> t2
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=America/New_York]
> {code}
> However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
> set to UTC:
> {code:java}
> >>> t3 = pq.read_table("example.parq");
> >>> t3
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=UTC]
>  {code}
>  
> I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
> timestamp columns. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-10-19 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore updated ARROW-14104:
--
Description: 
In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
List columns to and from parquet files: 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import datetime 

>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));

>>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
>>> pq.write_table(t, "example.parq");

>>> t2 = pq.read_table("example.parq");
>>> t2
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=America/New_York]
{code}
However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
set to UTC:
{code:java}
>>> t3 = pq.read_table("example.parq");
>>> t3
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=UTC]
 {code}
 

I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
timestamp columns. 

  was:
In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
List columns to and from parquet files: 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import datetime 

>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));

>>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
>>> pq.write_table(t, "example.parq");

>>> t2 = pq.read_table("example.parq");
>>> t2
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=America/Denver]
{code}
However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
set to UTC:
{code:java}
>>> t3 = pq.read_table("example.parq");
>>> t3
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=UTC]
 {code}
 

I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
timestamp columns. 


> Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to 
> preserve the TimeZone - unlike in Arrow 4.0.0
> 
>
> Key: ARROW-14104
> URL: https://issues.apache.org/jira/browse/ARROW-14104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
>
> In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
> List columns to and from parquet files: 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import datetime 
> >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
> >>> pa.list_(pa.timestamp('us', 'America/New_York')));
> >>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
> >>> pq.write_table(t, "example.parq");
> >>> t2 = pq.read_table("example.parq");
> >>> t2
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=America/New_York]
> {code}
> However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
> set to UTC:
> {code:java}
> >>> t3 = pq.read_table("example.parq");
> >>> t3
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=UTC]
>  {code}
>  
> I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
> timestamp columns. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-10-19 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore updated ARROW-14104:
--
Description: 
In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
List columns to and from parquet files: 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import datetime 

>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));

>>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
>>> pq.write_table(t, "example.parq");

>>> t2 = pq.read_table("example.parq");
>>> t2
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=America/Denver]
{code}
However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
set to UTC:
{code:java}
>>> t3 = pq.read_table("example.parq");
>>> t3
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=UTC]
 {code}
 

I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
timestamp columns. 

  was:
In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
List columns to and from parquet files: 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import datetime 

>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));

>>> t = pa.Table.from_arrays([column], name=['TimestampColumn']);
>>> pq.write_table(t, "example.parq", version='2.0');

>>> t2 = pq.read_table("example.parq");
>>> t2
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=America/Denver]
{code}
However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
set to UTC:
{code:java}
>>> t3 = pq.read_table("example.parq");
>>> t3
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=UTC]
 {code}
 

I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
timestamp columns. 


> Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to 
> preserve the TimeZone - unlike in Arrow 4.0.0
> 
>
> Key: ARROW-14104
> URL: https://issues.apache.org/jira/browse/ARROW-14104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
>
> In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
> List columns to and from parquet files: 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import datetime 
> >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
> >>> pa.list_(pa.timestamp('us', 'America/New_York')));
> >>> t = pa.Table.from_arrays([column], names=['TimestampColumn']);
> >>> pq.write_table(t, "example.parq");
> >>> t2 = pq.read_table("example.parq");
> >>> t2
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=America/Denver]
> {code}
> However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
> set to UTC:
> {code:java}
> >>> t3 = pq.read_table("example.parq");
> >>> t3
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=UTC]
>  {code}
>  
> I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
> timestamp columns. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-10-19 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore updated ARROW-14104:
--
Description: 
In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
List columns to and from parquet files: 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import datetime 

>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));

>>> t = pa.Table.from_arrays([column], name=['TimestampColumn']);
>>> pq.write_table(t, "example.parq", version='2.0');

>>> t2 = pq.read_table("example.parq");
>>> t2
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=America/Denver]
{code}
However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
set to UTC:
{code:java}
>>> t3 = pq.read_table("example.parq");
>>> t3
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=UTC]
 {code}
 

I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
timestamp columns. 

  was:
In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
List columns to and from parquet files: 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import datetime 

>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));

>>> t = pa.Table.from_arrays([column], name=['TimestampColumn']);
>>> pq.write_table(t, "example.parq");

>>> t2 = pq.read_table("example.parq");
>>> t2
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=America/Denver]
{code}
However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
set to UTC:
{code:java}
>>> t3 = pq.read_table("example.parq");
>>> t3
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=UTC]
 {code}
 

I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
timestamp columns. 


> Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to 
> preserve the TimeZone - unlike in Arrow 4.0.0
> 
>
> Key: ARROW-14104
> URL: https://issues.apache.org/jira/browse/ARROW-14104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
>
> In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
> List columns to and from parquet files: 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import datetime 
> >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
> >>> pa.list_(pa.timestamp('us', 'America/New_York')));
> >>> t = pa.Table.from_arrays([column], name=['TimestampColumn']);
> >>> pq.write_table(t, "example.parq", version='2.0');
> >>> t2 = pq.read_table("example.parq");
> >>> t2
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=America/Denver]
> {code}
> However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
> set to UTC:
> {code:java}
> >>> t3 = pq.read_table("example.parq");
> >>> t3
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=UTC]
>  {code}
>  
> I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
> timestamp columns. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-10-19 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-14104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore updated ARROW-14104:
--
Description: 
In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
List columns to and from parquet files: 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import datetime 

>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York')));

>>> t = pa.Table.from_arrays([column], name=['TimestampColumn']);
>>> pq.write_table(t, "example.parq");

>>> t2 = pq.read_table("example.parq");
>>> t2
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=America/Denver]
{code}
However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
set to UTC:
{code:java}
>>> t3 = pq.read_table("example.parq");
>>> t3
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=UTC]
 {code}
 

I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
timestamp columns. 

  was:
In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
List columns to and from parquet files: 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq

>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York'));

>>> t = pa.Table.from_arrays([column], name=['TimestampColumn']);
>>> pq.write_table(t, "example.parq");

>>> t2 = pq.read_table("example.parq");
>>> t2
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=America/Denver]
{code}
However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
set to UTC:
{code:java}
>>> t3 = pq.read_table("example.parq");
>>> t3
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=UTC]
 {code}
 

I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
timestamp columns. 


> Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to 
> preserve the TimeZone - unlike in Arrow 4.0.0
> 
>
> Key: ARROW-14104
> URL: https://issues.apache.org/jira/browse/ARROW-14104
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Parquet, Python
>Affects Versions: 5.0.0
>Reporter: Sarah Gilmore
>Priority: Minor
>
> In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
> List columns to and from parquet files: 
> {code:java}
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import datetime 
> >>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
> >>> pa.list_(pa.timestamp('us', 'America/New_York')));
> >>> t = pa.Table.from_arrays([column], name=['TimestampColumn']);
> >>> pq.write_table(t, "example.parq");
> >>> t2 = pq.read_table("example.parq");
> >>> t2
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=America/Denver]
> {code}
> However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
> set to UTC:
> {code:java}
> >>> t3 = pq.read_table("example.parq");
> >>> t3
> pyarrow.Table
> Dates: list
>   child 0, item: timestamp[us, tz=UTC]
>  {code}
>  
> I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
> timestamp columns. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-14104) Reading Lists of Timestamps from parquet files in Arrow 5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0

2021-09-23 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-14104:
-

 Summary: Reading Lists of Timestamps from parquet files in Arrow 
5.0.0 fails to preserve the TimeZone - unlike in Arrow 4.0.0
 Key: ARROW-14104
 URL: https://issues.apache.org/jira/browse/ARROW-14104
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Parquet, Python
Affects Versions: 5.0.0
Reporter: Sarah Gilmore


In Arrow 4.0.0 it is possible to round-trip the TimeZone property of 
List columns to and from parquet files: 
{code:java}
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq

>>> column = pa.array([[datetime.datetime(2023, 9, 23, 11)]], 
>>> pa.list_(pa.timestamp('us', 'America/New_York'));

>>> t = pa.Table.from_arrays([column], name=['TimestampColumn']);
>>> pq.write_table(t, "example.parq");

>>> t2 = pq.read_table("example.parq");
>>> t2
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=America/Denver]
{code}
However, if you read the same parquet file in pyarrow 5.0.0, the TimeZone is 
set to UTC:
{code:java}
>>> t3 = pq.read_table("example.parq");
>>> t3
pyarrow.Table
Dates: list
  child 0, item: timestamp[us, tz=UTC]
 {code}
 

I noticed that the TimeZone is preserved in Arrow 5.0 when reading non-nested 
timestamp columns. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13185) [MATLAB] Consider alternatives to placing the MEX binaries within the source tree

2021-09-08 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore reassigned ARROW-13185:
-

Assignee: Sarah Gilmore

> [MATLAB] Consider alternatives to placing the MEX binaries within the source 
> tree
> -
>
> Key: ARROW-13185
> URL: https://issues.apache.org/jira/browse/ARROW-13185
> Project: Apache Arrow
>  Issue Type: Task
>  Components: MATLAB
>Reporter: Sarah Gilmore
>Assignee: Sarah Gilmore
>Priority: Minor
>
> Since modifying the source directory via the build process is generally 
> considered non-optimal, we may want to explore alternative approaches. For 
> example, during the build process, we could create a derived source tree (a 
> copy of the original source tree) within the build area and place our build 
> artifacts within the derived source tree. Then, we could add the derived 
> source tree to the MATLAB search path. That's just one option, but there are 
> others we could explore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12855) error: no member named 'TableReader' in namespace during compilation

2021-06-28 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore reassigned ARROW-12855:
-

Assignee: Sarah Gilmore

> error: no member named 'TableReader' in namespace during compilation
> 
>
> Key: ARROW-12855
> URL: https://issues.apache.org/jira/browse/ARROW-12855
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: MATLAB
>Affects Versions: 4.0.0
> Environment: MATLAB 2020a, Mac OS 11.2.1
>Reporter: Andraž Matkovič
>Assignee: Sarah Gilmore
>Priority: Major
>  Labels: matlab
>
> I followed instructions for compilation of arrow under MATLAB 
> ([https://github.com/apache/arrow/tree/master/matlab).] First I set 
> environment variable ARROW_HOME, e.g.
>  
> {code:java}
> setenv ARROW_HOME ~/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow
> {code}
> (I also tried other pyarrow versions, even /usr/local, it's always the same).
>  
> Next, when I run compile in MATLAB I get the following error:
> {code:java}
> Verbose mode is on.Verbose mode is on Looking for compiler 'Xcode 
> Clang++' .. Looking for environment variable 'DEVELOPER_DIR' ...No 
> Executing command 'xcode-select -print-path' ...Yes 
> ('/Applications/Xcode.app/Contents/Developer') Looking for folder 
> '/Applications/Xcode.app/Contents/Developer' ...Yes Executing command 
> 'which xcrun' ...Yes ('/usr/bin/xcrun') Looking for folder '/usr/bin' 
> ...Yes Executing command 'defaults read com.apple.dt.Xcode 
> IDEXcodeVersionForAgreedToGMLicense' ...No Executing command 'defaults 
> read /Library/Preferences/com.apple.dt.Xcode 
> IDEXcodeVersionForAgreedToGMLicense' ...Yes ('11.0') Executing command 
> 'agreed=11.0  if echo $agreed | grep -E '[\.\"]' >/dev/null; then  lhs=`expr 
> "$agreed" : '\([0-9]*\)[\.].*'`   rhs=`expr "$agreed" : '[0-9]*[\.]\(.*\)$'`  
> if echo $rhs | grep -E '[\."]' >/dev/null; then  rhs=`expr "$rhs" : 
> '\([0-9]*\)[\.].*'`  fi  if [ $lhs -gt 4 ] || ( [ $lhs -eq 4 ] && [ $rhs -ge 
> 3 ] ); then  echo $agreed  else  exit 1 fi  fi' ...Yes ('11.0') Executing 
> command 'xcrun -sdk macosx --show-sdk-path' ...Yes 
> ('/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk')
>  Executing command 'xcrun -sdk macosx --show-sdk-version | awk 'BEGIN 
> {FS="."} ; {print $1"."$2}'' ...Yes ('11.1') Executing command 'clang 
> --version | grep -Eo '[0-9]+\.[0-9]+\.[0-9]'|head -1' ...Yes ('12.0.0').Found 
> installed compiler 'Xcode Clang++'.Set INCLUDE = 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/12.0.0/include;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/usr/include;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/System/Library/Frameworks;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/12.0.0/include;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/usr/include;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/System/Library/Frameworks;Options
>  file 
> details--- 
> Compiler location: /Applications/Xcode.app/Contents/Developer Options file: 
> /Users/andraz/Library/Application 
> Support/MathWorks/MATLAB/R2020a/mex_C++_maci64.xml CMDLINE200 : 
> /usr/bin/xcrun -sdk macosx11.1 clang++ \-Wl,-twolevel_namespace -undefined 
> error -arch x86_64 -mmacosx-version-min=10.9 
> -Wl,-syslibroot,/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk
>  -framework Cocoa -bundle  -stdlib=libc++ -Wl,-rpath '/usr/local/lib' -O 
> -Wl,-exported_symbols_list,"/Applications/MATLAB_R2020a.app/extern/lib/maci64/mexFunction.map"
>  
> -Wl,-exported_symbols_list,"/Applications/MATLAB_R2020a.app/extern/lib/maci64/c_exportsmexfileversion.map"
>  -Wl,-U,_mexCreateMexFunction -Wl,-U,_mexDestroyMexFunction 
> -Wl,-U,_mexFunctionAdapter 
> -Wl,-exported_symbols_list,"/Applications/MATLAB_R2020a.app/extern/lib/maci64/cppMexFunction.map"
>  
> /var/folders/s1/f1fgqkcs6bs4c13v_50btmd4gn/T/mex_5123220870320_661/featherreadmex.o
>  
> /var/folders/s1/f1

[jira] [Commented] (ARROW-12855) error: no member named 'TableReader' in namespace during compilation

2021-06-28 Thread Sarah Gilmore (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370802#comment-17370802
 ] 

Sarah Gilmore commented on ARROW-12855:
---

Hi [~amatkovic],

We actually just fixed this issue in a pull request that was merged yesterday. 

 

Here's a link to the pull request:

[https://github.com/apache/arrow/pull/10305]

 

and here's a link to the JIRA issue associated with it:

https://issues.apache.org/jira/browse/ARROW-12730

 

Could you try pulling in the most recent changes from the master branch and 
building again? Sorry about this. 

 

Best,

Sarah

 

> error: no member named 'TableReader' in namespace during compilation
> 
>
> Key: ARROW-12855
> URL: https://issues.apache.org/jira/browse/ARROW-12855
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: MATLAB
>Affects Versions: 4.0.0
> Environment: MATLAB 2020a, Mac OS 11.2.1
>Reporter: Andraž Matkovič
>Priority: Major
>  Labels: matlab
>
> I followed instructions for compilation of arrow under MATLAB 
> ([https://github.com/apache/arrow/tree/master/matlab).] First I set 
> environment variable ARROW_HOME, e.g.
>  
> {code:java}
> setenv ARROW_HOME ~/.pyenv/versions/3.8.0/lib/python3.8/site-packages/pyarrow
> {code}
> (I also tried other pyarrow versions, even /usr/local, it's always the same).
>  
> Next, when I run compile in MATLAB I get the following error:
> {code:java}
> Verbose mode is on.Verbose mode is on Looking for compiler 'Xcode 
> Clang++' .. Looking for environment variable 'DEVELOPER_DIR' ...No 
> Executing command 'xcode-select -print-path' ...Yes 
> ('/Applications/Xcode.app/Contents/Developer') Looking for folder 
> '/Applications/Xcode.app/Contents/Developer' ...Yes Executing command 
> 'which xcrun' ...Yes ('/usr/bin/xcrun') Looking for folder '/usr/bin' 
> ...Yes Executing command 'defaults read com.apple.dt.Xcode 
> IDEXcodeVersionForAgreedToGMLicense' ...No Executing command 'defaults 
> read /Library/Preferences/com.apple.dt.Xcode 
> IDEXcodeVersionForAgreedToGMLicense' ...Yes ('11.0') Executing command 
> 'agreed=11.0  if echo $agreed | grep -E '[\.\"]' >/dev/null; then  lhs=`expr 
> "$agreed" : '\([0-9]*\)[\.].*'`   rhs=`expr "$agreed" : '[0-9]*[\.]\(.*\)$'`  
> if echo $rhs | grep -E '[\."]' >/dev/null; then  rhs=`expr "$rhs" : 
> '\([0-9]*\)[\.].*'`  fi  if [ $lhs -gt 4 ] || ( [ $lhs -eq 4 ] && [ $rhs -ge 
> 3 ] ); then  echo $agreed  else  exit 1 fi  fi' ...Yes ('11.0') Executing 
> command 'xcrun -sdk macosx --show-sdk-path' ...Yes 
> ('/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk')
>  Executing command 'xcrun -sdk macosx --show-sdk-version | awk 'BEGIN 
> {FS="."} ; {print $1"."$2}'' ...Yes ('11.1') Executing command 'clang 
> --version | grep -Eo '[0-9]+\.[0-9]+\.[0-9]'|head -1' ...Yes ('12.0.0').Found 
> installed compiler 'Xcode Clang++'.Set INCLUDE = 
> /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/12.0.0/include;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/usr/include;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/System/Library/Frameworks;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../lib/clang/12.0.0/include;/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/usr/include;/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk/System/Library/Frameworks;Options
>  file 
> details--- 
> Compiler location: /Applications/Xcode.app/Contents/Developer Options file: 
> /Users/andraz/Library/Application 
> Support/MathWorks/MATLAB/R2020a/mex_C++_maci64.xml CMDLINE200 : 
> /usr/bin/xcrun -sdk macosx11.1 clang++ \-Wl,-twolevel_namespace -undefined 
> error -arch x86_64 -mmacosx-version-min=10.9 
> -Wl,-syslibroot,/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX11.1.sdk
>  -framework Cocoa -bundle  -stdlib=libc++ -Wl,-rpath '/usr/local/lib' -O 
> -Wl,-exported_symbols_list,"/Applications/MATLAB_R2020a.app/extern/lib/maci64/mexFunctio

[jira] [Created] (ARROW-13185) [MATLAB] Consider alternatives to placing the MEX binaries within the source tree

2021-06-25 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-13185:
-

 Summary: [MATLAB] Consider alternatives to placing the MEX 
binaries within the source tree
 Key: ARROW-13185
 URL: https://issues.apache.org/jira/browse/ARROW-13185
 Project: Apache Arrow
  Issue Type: Task
  Components: MATLAB
Reporter: Sarah Gilmore


Since modifying the source directory via the build process is generally 
considered non-optimal, we may want to explore alternative approaches. For 
example, during the build process, we could create a derived source tree (a 
copy of the original source tree) within the build area and place our build 
artifacts within the derived source tree. Then, we could add the derived source 
tree to the MATLAB search path. That's just one option, but there are others we 
could explore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12754) [MATLAB] Create the LifetimeManager C++ class

2021-05-12 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-12754:
-

 Summary: [MATLAB] Create the LifetimeManager C++ class
 Key: ARROW-12754
 URL: https://issues.apache.org/jira/browse/ARROW-12754
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: MATLAB
Reporter: Sarah Gilmore


LifetimeManager is a singleton that each arrow.Array subclass will interact 
with to keep its corresponding C++ data structures alive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12753) [MATLAB] Create a templated ObjectMap for storing arrow C++ data structures with IDs

2021-05-12 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-12753:
-

 Summary: [MATLAB] Create a templated ObjectMap for storing arrow 
C++ data structures with IDs
 Key: ARROW-12753
 URL: https://issues.apache.org/jira/browse/ARROW-12753
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: MATLAB
Reporter: Sarah Gilmore


In order to keep arrow C++ data structures alive for the lifetime of the 
wrapping MATLAB object (e.g. arrow.Array), we can store the arrow C++ data 
structure in a map indexed by a unique ID. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12752) [MATLAB] Implement LifetimeManager for managing arrow memory lifetime

2021-05-12 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-12752:
-

 Summary: [MATLAB] Implement LifetimeManager for managing arrow 
memory lifetime
 Key: ARROW-12752
 URL: https://issues.apache.org/jira/browse/ARROW-12752
 Project: Apache Arrow
  Issue Type: Task
  Components: MATLAB
Reporter: Sarah Gilmore


When we create an arrow object in MATLAB (e.g. arrow.Array), we need to ensure 
the underlying arrow C++ data structures stay alive for the lifetime of the 
wrapping MATLAB object.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12730) [MATLAB] Update featherreadmex and featherwritemex to build against latest arrow c++ APIs

2021-05-10 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-12730:
-

 Summary: [MATLAB] Update featherreadmex and featherwritemex to 
build against latest arrow c++ APIs
 Key: ARROW-12730
 URL: https://issues.apache.org/jira/browse/ARROW-12730
 Project: Apache Arrow
  Issue Type: Task
  Components: MATLAB
Reporter: Sarah Gilmore
Assignee: Sarah Gilmore


The mex functions featherreadmex and featherwritemex currently do not compile 
if you are using the latest arrow c++ APIs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-12349) [MATLAB] add support for converting MATLAB numeric arrays to arrow::NumericArrays

2021-04-14 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore updated ARROW-12349:
--
Summary: [MATLAB] add support for converting MATLAB numeric arrays to 
arrow::NumericArrays   (was: [MATLAB] add support for converting a MATLAB 
uint64 array to an arrow::NumericArrays arrow::NumericArray)

> [MATLAB] add support for converting MATLAB numeric arrays to 
> arrow::NumericArrays 
> --
>
> Key: ARROW-12349
> URL: https://issues.apache.org/jira/browse/ARROW-12349
> Project: Apache Arrow
>  Issue Type: Task
>  Components: MATLAB
>Reporter: Sarah Gilmore
>Assignee: Sarah Gilmore
>Priority: Minor
>
> Create a C++ function that accepts a MATLAB uint64 array and converts it into 
> an arrow::NumericArray.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12349) [MATLAB] add support for converting a MATLAB uint64 array to an arrow::NumericArrays arrow::NumericArray

2021-04-14 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore reassigned ARROW-12349:
-

Assignee: Sarah Gilmore

> [MATLAB] add support for converting a MATLAB uint64 array to an 
> arrow::NumericArrays arrow::NumericArray
> ---
>
> Key: ARROW-12349
> URL: https://issues.apache.org/jira/browse/ARROW-12349
> Project: Apache Arrow
>  Issue Type: Task
>  Components: MATLAB
>Reporter: Sarah Gilmore
>Assignee: Sarah Gilmore
>Priority: Minor
>
> Create a C++ function that accepts a MATLAB uint64 array and converts it into 
> an arrow::NumericArray.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-12368) [MATLAB] create a matlab2mex function

2021-04-14 Thread Sarah Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarah Gilmore reassigned ARROW-12368:
-

Assignee: Sarah Gilmore

> [MATLAB] create a matlab2mex function
> -
>
> Key: ARROW-12368
> URL: https://issues.apache.org/jira/browse/ARROW-12368
> Project: Apache Arrow
>  Issue Type: Task
>  Components: MATLAB
>Reporter: Sarah Gilmore
>Assignee: Sarah Gilmore
>Priority: Minor
>
> Create a function that takes a native numeric MATLAB array and converts it 
> into a form that can be manipulated in a C++ MEX function. Once the data is 
> accessible inside a MEX function, it can be converted into the corresponding  
> arrow::NumericArray type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12368) [MATLAB] create a matlab2mex function

2021-04-13 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-12368:
-

 Summary: [MATLAB] create a matlab2mex function
 Key: ARROW-12368
 URL: https://issues.apache.org/jira/browse/ARROW-12368
 Project: Apache Arrow
  Issue Type: Task
  Components: MATLAB
Reporter: Sarah Gilmore


Create a function that takes a native numeric MATLAB array and converts it into 
a form that can be manipulated in a C++ MEX function. Once the data is 
accessible inside a MEX function, it can be converted into the corresponding  
arrow::NumericArray type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12349) [MATLAB] add support for converting a MATLAB uint64 array to an arrow::NumericArrays arrow::NumericArray

2021-04-12 Thread Sarah Gilmore (Jira)
Sarah Gilmore created ARROW-12349:
-

 Summary: [MATLAB] add support for converting a MATLAB uint64 array 
to an arrow::NumericArrays arrow::NumericArray
 Key: ARROW-12349
 URL: https://issues.apache.org/jira/browse/ARROW-12349
 Project: Apache Arrow
  Issue Type: Task
  Components: MATLAB
Reporter: Sarah Gilmore


Create a C++ function that accepts a MATLAB uint64 array and converts it into 
an arrow::NumericArray.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)