[jira] [Created] (ARROW-13897) [Python] TimestampScalar.as_py() and DurationScalar.as_py() docs inaccurately describe return types

2021-09-03 Thread Tim Swast (Jira)
Tim Swast created ARROW-13897:
-

 Summary: [Python] TimestampScalar.as_py() and 
DurationScalar.as_py() docs inaccurately describe return types
 Key: ARROW-13897
 URL: https://issues.apache.org/jira/browse/ARROW-13897
 Project: Apache Arrow
  Issue Type: Task
Reporter: Tim Swast


If I'm reading the code correctly, pandas data types are only used when the unit 
is nanoseconds. Also, TimestampScalar.as_py() returns a Python datetime.datetime, 
not a datetime.timedelta.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11010) [Python] `np.float` deprecation warning in `_pandas_logical_type_map`

2020-12-22 Thread Tim Swast (Jira)
Tim Swast created ARROW-11010:
-

 Summary: [Python] `np.float` deprecation warning in 
`_pandas_logical_type_map`
 Key: ARROW-11010
 URL: https://issues.apache.org/jira/browse/ARROW-11010
 Project: Apache Arrow
  Issue Type: Task
Reporter: Tim Swast


I get the following warning when converting a floating point column in a pandas 
dataframe into a pyarrow array:

 

```
/Users/swast/src/python-bigquery/.nox/prerelease_deps/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1031: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. Use `float` by itself, which is identical in behavior, to silence this warning. If you specifically wanted the numpy scalar type, use `np.float_` here.
  'floating': np.float,
```
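The fix is mechanical: the deprecated alias `np.float` was nothing more than the builtin `float`, so the mapping can name the builtin directly. A hypothetical excerpt mirroring the `_pandas_logical_type_map` entry:

```python
# Hypothetical excerpt of a mapping like pyarrow's _pandas_logical_type_map.
# The deprecated alias np.float was just the builtin float, so using the
# builtin is behaviorally identical and silences the warning.
_pandas_logical_type_map = {
    'floating': float,  # was: np.float (deprecated in NumPy >= 1.20)
}
print(_pandas_logical_type_map['floating'](3))  # 3.0
```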





[jira] [Commented] (ARROW-6955) [Python] [Packaging] Add Python 3.8 to packaging matrix

2019-11-21 Thread Tim Swast (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979687#comment-16979687
 ] 

Tim Swast commented on ARROW-6955:
--

Sub-task of https://issues.apache.org/jira/browse/ARROW-6920

> [Python] [Packaging] Add Python 3.8 to packaging matrix
> ---
>
> Key: ARROW-6955
> URL: https://issues.apache.org/jira/browse/ARROW-6955
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Packaging, Python
>Reporter: Antoine Pitrou
>Priority: Critical
>






[jira] [Commented] (ARROW-6954) [Python] [CI] Add Python 3.8 to CI matrix

2019-11-21 Thread Tim Swast (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979688#comment-16979688
 ] 

Tim Swast commented on ARROW-6954:
--

Sub-task of https://issues.apache.org/jira/browse/ARROW-6920

> [Python] [CI] Add Python 3.8 to CI matrix
> -
>
> Key: ARROW-6954
> URL: https://issues.apache.org/jira/browse/ARROW-6954
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Continuous Integration, Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>






[jira] [Commented] (ARROW-2607) [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm

2019-06-07 Thread Tim Swast (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858909#comment-16858909
 ] 

Tim Swast commented on ARROW-2607:
--

I'm very interested in this issue, as it would be extremely useful for parsing 
files into Arrow tables from numba. I would expect to be able to do the 
following:
{code:java}
my_string_array = pyarrow.Array.from_buffers(
  pyarrow.string(),
  row_count,
  [
pyarrow.py_buffer(my_string_nullmask),
pyarrow.py_buffer(my_string_offsets),
pyarrow.py_buffer(my_string_bytes),
  ],
){code}
But I get:
{quote}File "pyarrow/array.pxi", line 578, in pyarrow.lib.Array.from_buffers
NotImplementedError: from_buffers is only supported for primitive arrays yet.
{quote}
I suppose if I wanted to contribute this fix, I should start looking at 
pyarrow/array.pxi first?

> [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm
> ---
>
> Key: ARROW-2607
> URL: https://issues.apache.org/jira/browse/ARROW-2607
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java, Python
>Reporter: Uwe L. Korn
>Priority: Major
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249: Currently 
> only primitive arrays are supported in {{pyarrow.Array.from_jvm}} as it uses 
> {{pyarrow.Array.from_buffers}} underneath. We should extend one of the two 
> functions to be able to deal with string arrays. There is a currently failing 
> unit test {{test_jvm_string_array}} in {{pyarrow/tests/test_jvm.py}} to 
> verify the implementation.





[jira] [Comment Edited] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long

2019-06-05 Thread Tim Swast (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857049#comment-16857049
 ] 

Tim Swast edited comment on ARROW-5450 at 6/5/19 9:24 PM:
--

Since datetime.datetime objects don't support nanosecond precision, pandas 
Timestamp is a good default with nanosecond precision columns. But with 
microsecond precision columns, I'd always prefer a datetime.datetime object.


was (Author: tswast):
Since datetime.datetime objects don't support nanosecond precision, pandas 
Timestamp is a good default with nanosecond precision columns. But with 
microsecond precision objects, I'd always prefer a datetime.datetime object.

> [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too 
> large to convert to C long
> ---
>
> Key: ARROW-5450
> URL: https://issues.apache.org/jira/browse/ARROW-5450
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Tim Swast
>Priority: Major
>
> When I attempt to roundtrip from a list of moderately large (beyond what can 
> be represented in nanosecond precision, but within microsecond precision) 
> datetime objects to pyarrow and back, I get an OverflowError: Python int too 
> large to convert to C long.
> pyarrow version:
> {noformat}
> $ pip freeze | grep pyarrow
> pyarrow==0.13.0{noformat}
>  
> Reproduction:
> {code:java}
> import datetime
> import pandas
> import pyarrow
> import pytz
> timestamp_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> None,
> datetime.datetime(9999, 12, 31, 23, 59, 59, 999999, tzinfo=pytz.utc),
> datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> ]
> timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", 
> tz="UTC"))
> timestamp_roundtrip = timestamp_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 timestamp_roundtrip = timestamp_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}
> For good measure, I also tested with timezone-naive timestamps with the same 
> error:
> {code:java}
> naive_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0),
> None,
> datetime.datetime(9999, 12, 31, 23, 59, 59, 999999),
> datetime.datetime(1970, 1, 1, 0, 0, 0),
> ]
> naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None))
> naive_roundtrip = naive_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 naive_roundtrip = naive_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}





[jira] [Commented] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long

2019-06-05 Thread Tim Swast (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857049#comment-16857049
 ] 

Tim Swast commented on ARROW-5450:
--

Since datetime.datetime objects don't support nanosecond precision, pandas 
Timestamp is a good default with nanosecond precision columns. But with 
microsecond precision objects, I'd always prefer a datetime.datetime object.
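The precision gap in question can be seen directly (a small illustration):

```python
import datetime
import pandas as pd

# datetime.datetime bottoms out at microseconds...
dt = datetime.datetime(2019, 6, 5, 21, 24, 0, 123456)
print(dt.microsecond)  # 123456

# ...while pandas.Timestamp keeps the extra nanosecond digits.
ts = pd.Timestamp("2019-06-05 21:24:00.123456789")
print(ts.nanosecond)   # 789
```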

> [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too 
> large to convert to C long
> ---
>
> Key: ARROW-5450
> URL: https://issues.apache.org/jira/browse/ARROW-5450
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Tim Swast
>Priority: Major
>
> When I attempt to roundtrip from a list of moderately large (beyond what can 
> be represented in nanosecond precision, but within microsecond precision) 
> datetime objects to pyarrow and back, I get an OverflowError: Python int too 
> large to convert to C long.
> pyarrow version:
> {noformat}
> $ pip freeze | grep pyarrow
> pyarrow==0.13.0{noformat}
>  
> Reproduction:
> {code:java}
> import datetime
> import pandas
> import pyarrow
> import pytz
> timestamp_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> None,
> datetime.datetime(9999, 12, 31, 23, 59, 59, 999999, tzinfo=pytz.utc),
> datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
> ]
> timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", 
> tz="UTC"))
> timestamp_roundtrip = timestamp_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 timestamp_roundtrip = timestamp_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}
> For good measure, I also tested with timezone-naive timestamps with the same 
> error:
> {code:java}
> naive_rows = [
> datetime.datetime(1, 1, 1, 0, 0, 0),
> None,
> datetime.datetime(9999, 12, 31, 23, 59, 59, 999999),
> datetime.datetime(1970, 1, 1, 0, 0, 0),
> ]
> naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None))
> naive_roundtrip = naive_array.to_pylist()
> # ---
> # OverflowError Traceback (most recent call last)
> #  in 
> # > 1 naive_roundtrip = naive_array.to_pylist()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
>  in __iter__()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib.TimestampValue.as_py()
> #
> # 
> ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
>  in pyarrow.lib._datetime_conversion_functions.lambda5()
> #
> # pandas/_libs/tslibs/timestamps.pyx in 
> pandas._libs.tslibs.timestamps.Timestamp.__new__()
> #
> # pandas/_libs/tslibs/conversion.pyx in 
> pandas._libs.tslibs.conversion.convert_to_tsobject()
> #
> # OverflowError: Python int too large to convert to C long
> {code}





[jira] [Created] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long

2019-05-30 Thread Tim Swast (JIRA)
Tim Swast created ARROW-5450:


 Summary: [Python] TimestampArray.to_pylist() fails with 
OverflowError: Python int too large to convert to C long
 Key: ARROW-5450
 URL: https://issues.apache.org/jira/browse/ARROW-5450
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Tim Swast


When I attempt to roundtrip from a list of moderately large (beyond what can be 
represented in nanosecond precision, but within microsecond precision) datetime 
objects to pyarrow and back, I get an OverflowError: Python int too large to 
convert to C long.

pyarrow version:
{noformat}
$ pip freeze | grep pyarrow
pyarrow==0.13.0{noformat}
 

Reproduction:
{code:java}
import datetime

import pandas
import pyarrow
import pytz


timestamp_rows = [
datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
None,
datetime.datetime(9999, 12, 31, 23, 59, 59, 999999, tzinfo=pytz.utc),
datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
]
timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", 
tz="UTC"))
timestamp_roundtrip = timestamp_array.to_pylist()


# ---
# OverflowError Traceback (most recent call last)
#  in 
# > 1 timestamp_roundtrip = timestamp_array.to_pylist()
#
# 
~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
 in __iter__()
#
# 
~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
 in pyarrow.lib.TimestampValue.as_py()
#
# 
~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
 in pyarrow.lib._datetime_conversion_functions.lambda5()
#
# pandas/_libs/tslibs/timestamps.pyx in 
pandas._libs.tslibs.timestamps.Timestamp.__new__()
#
# pandas/_libs/tslibs/conversion.pyx in 
pandas._libs.tslibs.conversion.convert_to_tsobject()
#
# OverflowError: Python int too large to convert to C long
{code}
For good measure, I also tested with timezone-naive timestamps with the same 
error:
{code:java}
naive_rows = [
datetime.datetime(1, 1, 1, 0, 0, 0),
None,
datetime.datetime(9999, 12, 31, 23, 59, 59, 999999),
datetime.datetime(1970, 1, 1, 0, 0, 0),
]
naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None))
naive_roundtrip = naive_array.to_pylist()

# ---
# OverflowError Traceback (most recent call last)
#  in 
# > 1 naive_roundtrip = naive_array.to_pylist()
#
# 
~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi
 in __iter__()
#
# 
~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
 in pyarrow.lib.TimestampValue.as_py()
#
# 
~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi
 in pyarrow.lib._datetime_conversion_functions.lambda5()
#
# pandas/_libs/tslibs/timestamps.pyx in 
pandas._libs.tslibs.timestamps.Timestamp.__new__()
#
# pandas/_libs/tslibs/conversion.pyx in 
pandas._libs.tslibs.conversion.convert_to_tsobject()
#
# OverflowError: Python int too large to convert to C long
{code}
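The overflow is consistent with pandas' `Timestamp` bounds: it stores nanoseconds since the epoch in a signed 64-bit integer, which cannot represent year 9999 (a quick check, assuming pandas is installed):

```python
import pandas as pd

# Nanoseconds in an int64 cover roughly 584 years centered on 1970,
# so anything outside ~1677..2262 cannot become a pandas.Timestamp.
print(pd.Timestamp.min)  # 1677-09-21 00:12:43.145224193
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807
```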





[jira] [Created] (ARROW-4965) [Python] Timestamp array type detection should use tzname of datetime.datetime objects

2019-03-19 Thread Tim Swast (JIRA)
Tim Swast created ARROW-4965:


 Summary: [Python] Timestamp array type detection should use tzname 
of datetime.datetime objects
 Key: ARROW-4965
 URL: https://issues.apache.org/jira/browse/ARROW-4965
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
 Environment: $ python --version
Python 3.7.2

$ pip freeze
numpy==1.16.2
pyarrow==0.12.1
pytz==2018.9
six==1.12.0

$ sw_vers
ProductName:Mac OS X
ProductVersion: 10.14.3
BuildVersion:   18D109
(pyarrow) 
Reporter: Tim Swast


The type detection from datetime objects to an array appears to ignore the 
presence of tzinfo on the datetime objects, instead storing them as naive 
timestamp columns.

Python code:

{code:python}
import datetime
import pytz
import pyarrow as pa

naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=pytz.utc)
tzaware_datetime = utc_datetime.astimezone(pytz.timezone('America/Los_Angeles'))

def inspect(varname):
print(varname)
arr = globals()[varname]
print(arr.type)
print(arr)
print()

auto_naive_arr = pa.array([naive_datetime])
inspect("auto_naive_arr")

auto_utc_arr = pa.array([utc_datetime])
inspect("auto_utc_arr")

auto_tzaware_arr = pa.array([tzaware_datetime])
inspect("auto_tzaware_arr")

auto_mixed_arr = pa.array([utc_datetime, tzaware_datetime])
inspect("auto_mixed_arr")

naive_type = pa.timestamp("us", naive_datetime.tzname())
utc_type = pa.timestamp("us", utc_datetime.tzname())
tzaware_type = pa.timestamp("us", tzaware_datetime.tzname())

naive_arr = pa.array([naive_datetime], type=naive_type)
inspect("naive_arr")

utc_arr = pa.array([utc_datetime], type=utc_type)
inspect("utc_arr")

tzaware_arr = pa.array([tzaware_datetime], type=tzaware_type)
inspect("tzaware_arr")

mixed_arr = pa.array([utc_datetime, tzaware_datetime], type=utc_type)
inspect("mixed_arr")
{code}

This prints:

{noformat}
$ python detect_timezone.py
auto_naive_arr
timestamp[us]
[
  154738147000
]

auto_utc_arr
timestamp[us]
[
  154738147000
]

auto_tzaware_arr
timestamp[us]
[
  154735267000
]

auto_mixed_arr
timestamp[us]
[
  154738147000,
  154735267000
]

naive_arr
timestamp[us]
[
  154738147000
]

utc_arr
timestamp[us, tz=UTC]
[
  154738147000
]

tzaware_arr
timestamp[us, tz=PST]
[
  154735267000
]

mixed_arr
timestamp[us, tz=UTC]
[
  154738147000,
  154735267000
]
{noformat}

But I would expect the following types instead:

* {{naive_datetime}}: {{timestamp[us]}}
* {{auto_utc_arr}}: {{timestamp[us, tz=UTC]}}
* {{auto_tzaware_arr}}: {{timestamp[us, tz=PST]}} (Or maybe 
{{tz='America/Los_Angeles'}}. I'm not sure why {{pytz}} returns {{PST}} as the 
{{tzname}})
* {{auto_mixed_arr}}: {{timestamp[us, tz=UTC]}}

Also, in the "mixed" case, I'd expect the actual stored microseconds to be the 
same for both rows, since {{utc_datetime}} and {{tzaware_datetime}} both refer 
to the same point in time. It seems reasonable for any naive datetime objects 
mixed in with tz-aware datetimes to be interpreted as UTC.


