[jira] [Created] (ARROW-13897) [Python] TimestampScalar.as_py() and DurationScalar.as_py() docs inaccurately describe return types
Tim Swast created ARROW-13897:

Summary: [Python] TimestampScalar.as_py() and DurationScalar.as_py() docs inaccurately describe return types
Key: ARROW-13897
URL: https://issues.apache.org/jira/browse/ARROW-13897
Project: Apache Arrow
Issue Type: Task
Reporter: Tim Swast

If I'm reading the code correctly, Pandas data types are only used if units are nanoseconds. Also, TimestampScalar returns a Python datetime.datetime, not datetime.timedelta.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
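A rough illustration of the distinction the report draws, for non-nanosecond units only: a timestamp is a count of units since the epoch and maps naturally to a datetime.datetime, whereas a duration is a bare count and maps to a datetime.timedelta. The helpers `timestamp_to_py` and `duration_to_py` are hypothetical stand-ins written with the standard library. They are not pyarrow's implementation, and nanosecond units are deliberately rejected because the report says pandas types are used there.

```python
import datetime

# Naive epoch; timezone handling is out of scope for this sketch.
EPOCH = datetime.datetime(1970, 1, 1)

# Microseconds per unit for the non-nanosecond units only.
MICROS_PER_UNIT = {"s": 10**6, "ms": 10**3, "us": 1}


def _to_micros(value, unit):
    """Convert an integer count of `unit` into microseconds."""
    if unit not in MICROS_PER_UNIT:
        # "ns" lands here: the report says pandas types are used for it.
        raise ValueError(f"unit {unit!r} is not handled in this sketch")
    return value * MICROS_PER_UNIT[unit]


def timestamp_to_py(value, unit):
    """Hypothetical helper: a timestamp becomes a point in time."""
    return EPOCH + datetime.timedelta(microseconds=_to_micros(value, unit))


def duration_to_py(value, unit):
    """Hypothetical helper: a duration becomes a span of time."""
    return datetime.timedelta(microseconds=_to_micros(value, unit))


print(type(timestamp_to_py(1, "s")).__name__)
print(type(duration_to_py(1500, "ms")).__name__)
```

The two helpers share the unit conversion and differ only in whether the result is anchored to the epoch, which is the difference between the two return types the report says the docs confuse.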
[jira] [Created] (ARROW-11010) [Python] `np.float` deprecation warning in `_pandas_logical_type_map`
Tim Swast created ARROW-11010:

Summary: [Python] `np.float` deprecation warning in `_pandas_logical_type_map`
Key: ARROW-11010
URL: https://issues.apache.org/jira/browse/ARROW-11010
Project: Apache Arrow
Issue Type: Task
Reporter: Tim Swast

I get the following warning when converting a floating point column in a pandas dataframe into a pyarrow array:

```
/Users/swast/src/python-bigquery/.nox/prerelease_deps/lib/python3.8/site-packages/pyarrow/pandas_compat.py:1031: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. Use `float` by itself, which is identical in behavior, to silence this warning. If you specifically wanted the numpy scalar type, use `np.float_` here.
  'floating': np.float,
```
[jira] [Commented] (ARROW-6955) [Python] [Packaging] Add Python 3.8 to packaging matrix
[ https://issues.apache.org/jira/browse/ARROW-6955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979687#comment-16979687 ]

Tim Swast commented on ARROW-6955:

Sub-task of https://issues.apache.org/jira/browse/ARROW-6920

> [Python] [Packaging] Add Python 3.8 to packaging matrix
> ---
>
> Key: ARROW-6955
> URL: https://issues.apache.org/jira/browse/ARROW-6955
> Project: Apache Arrow
> Issue Type: Wish
> Components: Packaging, Python
> Reporter: Antoine Pitrou
> Priority: Critical
[jira] [Commented] (ARROW-6954) [Python] [CI] Add Python 3.8 to CI matrix
[ https://issues.apache.org/jira/browse/ARROW-6954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979688#comment-16979688 ]

Tim Swast commented on ARROW-6954:

Sub-task of https://issues.apache.org/jira/browse/ARROW-6920

> [Python] [CI] Add Python 3.8 to CI matrix
> ---
>
> Key: ARROW-6954
> URL: https://issues.apache.org/jira/browse/ARROW-6954
> Project: Apache Arrow
> Issue Type: Wish
> Components: Continuous Integration, Python
> Reporter: Antoine Pitrou
> Assignee: Antoine Pitrou
> Priority: Critical
> Labels: pull-request-available
> Time Spent: 1h 10m
> Remaining Estimate: 0h
[jira] [Commented] (ARROW-2607) [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm
[ https://issues.apache.org/jira/browse/ARROW-2607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858909#comment-16858909 ]

Tim Swast commented on ARROW-2607:

I'm very interested in this issue, as it would be extremely useful for parsing files into Arrow tables from numba. I would expect to be able to do the following:

{code:python}
my_string_array = pyarrow.Array.from_buffers(
    pyarrow.string(),
    row_count,
    [
        pyarrow.py_buffer(my_string_nullmask),
        pyarrow.py_buffer(my_string_offsets),
        pyarrow.py_buffer(my_string_bytes),
    ],
)
{code}

But I get:

{quote}
File "pyarrow/array.pxi", line 578, in pyarrow.lib.Array.from_buffers
NotImplementedError: from_buffers is only supported for primitive arrays yet.
{quote}

I suppose if I wanted to contribute this fix, I should start by looking at pyarrow/array.pxi?

> [Java/Python] Support VarCharVector / StringArray in pyarrow.Array.from_jvm
> ---
>
> Key: ARROW-2607
> URL: https://issues.apache.org/jira/browse/ARROW-2607
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Java, Python
> Reporter: Uwe L. Korn
> Priority: Major
>
> Follow-up after https://issues.apache.org/jira/browse/ARROW-2249: Currently
> only primitive arrays are supported in {{pyarrow.Array.from_jvm}} as it uses
> {{pyarrow.Array.from_buffers}} underneath. We should extend one of the two
> functions to be able to deal with string arrays. There is a currently failing
> unit test {{test_jvm_string_array}} in {{pyarrow/tests/test_jvm.py}} to
> verify the implementation.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
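For context on the three buffers named in the ARROW-2607 comment (null mask, offsets, bytes), here is a minimal standard-library sketch of how they could be packed. It follows the Arrow columnar layout for variable-length UTF-8 strings: a validity bitmap with the least-significant bit first, 32-bit offsets with one more entry than there are rows, and the concatenated string bytes. The helper name `build_string_buffers` is hypothetical, and the sketch ignores buffer padding and alignment.

```python
import struct


def build_string_buffers(values):
    """Pack a list of str-or-None into (validity, offsets, data) bytes."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per row
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)  # set bit i: row i is non-null
            data += value.encode("utf-8")
        # A null row contributes no bytes, so its offset repeats the previous one.
        offsets.append(len(data))
    # Little-endian signed 32-bit integers, len(values) + 1 of them.
    offsets_bytes = struct.pack("<%di" % len(offsets), *offsets)
    return bytes(validity), offsets_bytes, bytes(data)


my_string_nullmask, my_string_offsets, my_string_bytes = build_string_buffers(
    ["foo", None, "héllo"]
)
print(my_string_nullmask, struct.unpack("<4i", my_string_offsets), my_string_bytes)
```

Row `i` is recovered by slicing the data buffer from `offsets[i]` to `offsets[i + 1]`, which is why a string array needs the extra offsets buffer that a primitive array does not.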
[jira] [Comment Edited] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long
[ https://issues.apache.org/jira/browse/ARROW-5450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857049#comment-16857049 ]

Tim Swast edited comment on ARROW-5450 at 6/5/19 9:24 PM:

Since datetime.datetime objects don't support nanosecond precision, pandas Timestamp is a good default with nanosecond precision columns. But with microsecond precision columns, I'd always prefer a datetime.datetime object.

was (Author: tswast):
Since datetime.datetime objects don't support nanosecond precision, pandas Timestamp is a good default with nanosecond precision columns. But with microsecond precision objects, I'd always prefer a datetime.datetime object.

> [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too
> large to convert to C long
> ---
>
> Key: ARROW-5450
> URL: https://issues.apache.org/jira/browse/ARROW-5450
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Tim Swast
> Priority: Major
>
> When I attempt to roundtrip from a list of moderately large (beyond what can
> be represented in nanosecond precision, but within microsecond precision)
> datetime objects to pyarrow and back, I get an OverflowError: Python int too
> large to convert to C long.
[jira] [Created] (ARROW-5450) [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long
Tim Swast created ARROW-5450:

Summary: [Python] TimestampArray.to_pylist() fails with OverflowError: Python int too large to convert to C long
Key: ARROW-5450
URL: https://issues.apache.org/jira/browse/ARROW-5450
Project: Apache Arrow
Issue Type: Bug
Reporter: Tim Swast

When I attempt to roundtrip from a list of moderately large (beyond what can be represented in nanosecond precision, but within microsecond precision) datetime objects to pyarrow and back, I get an OverflowError: Python int too large to convert to C long.

pyarrow version:

{noformat}
$ pip freeze | grep pyarrow
pyarrow==0.13.0
{noformat}

Reproduction:

{code:python}
import datetime
import pandas
import pyarrow
import pytz

timestamp_rows = [
    datetime.datetime(1, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
    None,
    datetime.datetime(9999, 12, 31, 23, 59, 59, 999999, tzinfo=pytz.utc),
    datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=pytz.utc),
]
timestamp_array = pyarrow.array(timestamp_rows, pyarrow.timestamp("us", tz="UTC"))
timestamp_roundtrip = timestamp_array.to_pylist()

# ---
# OverflowError Traceback (most recent call last)
# in
# ----> 1 timestamp_roundtrip = timestamp_array.to_pylist()
#
# ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi in __iter__()
#
# ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi in pyarrow.lib.TimestampValue.as_py()
#
# ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi in pyarrow.lib._datetime_conversion_functions.lambda5()
#
# pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()
#
# pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()
#
# OverflowError: Python int too large to convert to C long
{code}

For good measure, I also tested with timezone-naive timestamps and got the same error:

{code:python}
naive_rows = [
    datetime.datetime(1, 1, 1, 0, 0, 0),
    None,
    datetime.datetime(9999, 12, 31, 23, 59, 59, 999999),
    datetime.datetime(1970, 1, 1, 0, 0, 0),
]
naive_array = pyarrow.array(naive_rows, pyarrow.timestamp("us", tz=None))
naive_roundtrip = naive_array.to_pylist()

# ---
# OverflowError Traceback (most recent call last)
# in
# ----> 1 naive_roundtrip = naive_array.to_pylist()
#
# ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/array.pxi in __iter__()
#
# ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi in pyarrow.lib.TimestampValue.as_py()
#
# ~/.pyenv/versions/3.6.4/envs/scratch/lib/python3.6/site-packages/pyarrow/scalar.pxi in pyarrow.lib._datetime_conversion_functions.lambda5()
#
# pandas/_libs/tslibs/timestamps.pyx in pandas._libs.tslibs.timestamps.Timestamp.__new__()
#
# pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.convert_to_tsobject()
#
# OverflowError: Python int too large to convert to C long
{code}
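The traceback ends in pandas Timestamp construction, and the description attributes the failure to the nanosecond range. A quick standard-library check of that arithmetic, assuming the value is held as a signed 64-bit count of units since the epoch: `datetime.datetime.max` and `datetime.datetime.min` stand in for the far-future and year-1 rows, and `since_epoch` is a hypothetical helper, not a pyarrow or pandas function.

```python
import datetime

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)


def since_epoch(dt, units_per_second):
    """Exact integer count of units between the epoch and a naive datetime."""
    delta = dt - EPOCH
    whole_seconds = delta.days * 86400 + delta.seconds
    return (
        whole_seconds * units_per_second
        + delta.microseconds * units_per_second // 10**6
    )


def fits_int64(value):
    return INT64_MIN <= value <= INT64_MAX


far_future = datetime.datetime.max  # 9999-12-31 23:59:59.999999
far_past = datetime.datetime.min    # 0001-01-01 00:00:00

as_micros = since_epoch(far_future, 10**6)
as_nanos = since_epoch(far_future, 10**9)

print("microseconds fit in int64:", fits_int64(as_micros))
print("nanoseconds fit in int64:", fits_int64(as_nanos))
print("year 1 in nanoseconds fits:", fits_int64(since_epoch(far_past, 10**9)))
```

Both extremes fit comfortably as microseconds but not as nanoseconds, which is consistent with the comment's preference for returning datetime.datetime from microsecond-precision columns.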
[jira] [Created] (ARROW-4965) [Python] Timestamp array type detection should use tzname of datetime.datetime objects
Tim Swast created ARROW-4965:

Summary: [Python] Timestamp array type detection should use tzname of datetime.datetime objects
Key: ARROW-4965
URL: https://issues.apache.org/jira/browse/ARROW-4965
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Environment:
$ python --version
Python 3.7.2
$ pip freeze
numpy==1.16.2
pyarrow==0.12.1
pytz==2018.9
six==1.12.0
$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.3
BuildVersion: 18D109
(pyarrow)
Reporter: Tim Swast

The type detection from datetime objects to array appears to ignore the presence of a tzinfo on the datetime object, instead storing them as naive timestamp columns.

Python code:

{code:python}
import datetime
import pytz
import pyarrow as pa

naive_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10)
utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=pytz.utc)
tzaware_datetime = utc_datetime.astimezone(pytz.timezone('America/Los_Angeles'))


def inspect(varname):
    print(varname)
    arr = globals()[varname]
    print(arr.type)
    print(arr)
    print()


auto_naive_arr = pa.array([naive_datetime])
inspect("auto_naive_arr")

auto_utc_arr = pa.array([utc_datetime])
inspect("auto_utc_arr")

auto_tzaware_arr = pa.array([tzaware_datetime])
inspect("auto_tzaware_arr")

auto_mixed_arr = pa.array([utc_datetime, tzaware_datetime])
inspect("auto_mixed_arr")

naive_type = pa.timestamp("us", naive_datetime.tzname())
utc_type = pa.timestamp("us", utc_datetime.tzname())
tzaware_type = pa.timestamp("us", tzaware_datetime.tzname())

naive_arr = pa.array([naive_datetime], type=naive_type)
inspect("naive_arr")

utc_arr = pa.array([utc_datetime], type=utc_type)
inspect("utc_arr")

tzaware_arr = pa.array([tzaware_datetime], type=tzaware_type)
inspect("tzaware_arr")

mixed_arr = pa.array([utc_datetime, tzaware_datetime], type=utc_type)
inspect("mixed_arr")
{code}

This prints:

{noformat}
$ python detect_timezone.py
auto_naive_arr
timestamp[us]
[
  1547381470000000
]

auto_utc_arr
timestamp[us]
[
  1547381470000000
]

auto_tzaware_arr
timestamp[us]
[
  1547352670000000
]

auto_mixed_arr
timestamp[us]
[
  1547381470000000,
  1547352670000000
]

naive_arr
timestamp[us]
[
  1547381470000000
]

utc_arr
timestamp[us, tz=UTC]
[
  1547381470000000
]

tzaware_arr
timestamp[us, tz=PST]
[
  1547352670000000
]

mixed_arr
timestamp[us, tz=UTC]
[
  1547381470000000,
  1547352670000000
]
{noformat}

But I would expect the following types instead:

* {{naive_datetime}}: {{timestamp[us]}}
* {{auto_utc_arr}}: {{timestamp[us, tz=UTC]}}
* {{auto_tzaware_arr}}: {{timestamp[us, tz=PST]}} (Or maybe {{tz='America/Los_Angeles'}}. I'm not sure why {{pytz}} returns {{PST}} as the {{tzname}})
* {{auto_mixed_arr}}: {{timestamp[us, tz=UTC]}}

Also, in the "mixed" case, I'd expect the actual stored microseconds to be the same for both rows, since {{utc_datetime}} and {{tzaware_datetime}} both refer to the same point in time. It seems reasonable for any naive datetime objects mixed in with tz-aware datetimes to be interpreted as UTC.
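The expectation in the last paragraph can be checked with the standard library alone. The sketch below assumes a fixed UTC-8 offset as a stand-in for pytz's America/Los_Angeles in January, and the helpers `instant_micros` and `wall_clock_micros` are hypothetical names for the two possible readings of a tz-aware datetime; neither is a pyarrow function.

```python
import datetime

UTC = datetime.timezone.utc
# Assumption: fixed UTC-8 offset instead of the pytz zone used in the report.
PST = datetime.timezone(datetime.timedelta(hours=-8), "PST")

EPOCH_UTC = datetime.datetime(1970, 1, 1, tzinfo=UTC)
EPOCH_NAIVE = datetime.datetime(1970, 1, 1)
ONE_MICROSECOND = datetime.timedelta(microseconds=1)

utc_datetime = datetime.datetime(2019, 1, 13, 12, 11, 10, tzinfo=UTC)
tzaware_datetime = utc_datetime.astimezone(PST)


def instant_micros(dt):
    """Microseconds since the epoch for the instant; naive values are read as UTC."""
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=UTC)
    return (dt - EPOCH_UTC) // ONE_MICROSECOND


def wall_clock_micros(dt):
    """Microseconds since the epoch if the UTC offset is simply discarded."""
    return (dt.replace(tzinfo=None) - EPOCH_NAIVE) // ONE_MICROSECOND


print(instant_micros(utc_datetime), instant_micros(tzaware_datetime))
print(wall_clock_micros(utc_datetime), wall_clock_micros(tzaware_datetime))
```

The first pair is identical because both objects name the same instant. The second pair differs by exactly the eight-hour offset, which is consistent with the two different values the report shows for the mixed array.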