[jira] [Created] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
Matthew Roeschke created ARROW-17360:

Summary: [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns
Key: ARROW-17360
URL: https://issues.apache.org/jira/browse/ARROW-17360
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 8.0.1
Reporter: Matthew Roeschke

xref [https://github.com/pandas-dev/pandas/issues/47944]

{code:java}
In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# pandas main branch / 1.5
In [2]: df.to_orc("abc")

In [3]: pd.read_orc("abc", columns=['b', 'a'])
Out[3]:
   a  b
0  1  a
1  2  b
2  3  c

In [4]: import pyarrow.orc as orc

In [5]: orc_file = orc.ORCFile("abc")

# reordered to a, b
In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
Out[6]:
   a  b
0  1  a
1  2  b
2  3  c

# reordered to a, b
In [7]: orc_file.read(columns=['b', 'a'])
Out[7]:
pyarrow.Table
a: int64
b: string
a: [[1,2,3]]
b: [["a","b","c"]]
{code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
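One possible workaround for the reordering reported here is to re-select the columns after reading. The sketch below assumes that `ORCFile.read(columns=...)` returns a `pyarrow.Table` and that `Table.select` returns the named columns in the order given. The helper name `read_in_requested_order` is hypothetical and not part of pyarrow.

```python
def read_in_requested_order(orc_file, columns):
    """Read `columns` from an ORC file and return them in the order requested.

    `orc_file` is assumed to be a pyarrow.orc.ORCFile (any object with a
    compatible `read(columns=...)` method works). The helper name is
    hypothetical and not part of pyarrow.
    """
    table = orc_file.read(columns=columns)
    # Table.select returns a new table whose columns follow the list order,
    # so this undoes any reordering done by the reader.
    return table.select(columns)


# Usage, assuming the "abc" file written in the report above:
#   import pyarrow.orc as orc
#   table = read_in_requested_order(orc.ORCFile("abc"), ["b", "a"])
#   table.column_names
```

The extra `select` only rearranges column references, so it should be cheap compared with the read itself.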
[jira] [Commented] (ARROW-17134) [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when providing an array mask
[ https://issues.apache.org/jira/browse/ARROW-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569124#comment-17569124 ]

Matthew Roeschke commented on ARROW-17134:

Ah okay, that makes sense. When I read "len(replacements) == number of true values in the mask", for some reason I thought the values in {{replacements}} could still correspond positionally to the mask.

> We should maybe consider raising an error if the {{replacements}} are too long?

That would be helpful, or maybe an example in the docstring could help clarify that point. Fine either way.

> [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when providing an array mask
>
> Key: ARROW-17134
> URL: https://issues.apache.org/jira/browse/ARROW-17134
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Affects Versions: 8.0.0
> Reporter: Matthew Roeschke
> Priority: Major
>
> {code:java}
> In [1]: import pyarrow as pa
> In [2]: arr1 = pa.array([1, 0, 1, None, None])
> In [3]: arr2 = pa.array([None, None, 1, 0, 1])
> In [4]: pa.compute.replace_with_mask(arr1, [False, False, False, True, True], arr2)
> Out[4]:
>
> [
>   1,
>   0,
>   1,
>   null, # I would expect 0
>   null  # I would expect 1
> ]
> In [5]: pa.__version__
> Out[5]: '8.0.0'{code}
>
> I have noticed this behavior with the integer, floating-point, bool, and temporal types.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
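The semantics discussed in this comment can be modelled in plain Python. This is only an illustrative sketch of the rule quoted above (replacements are consumed in order, one per true mask entry), not Arrow's implementation. `replace_with_mask_model` is a hypothetical name, and `None` stands in for null.

```python
def replace_with_mask_model(values, mask, replacements):
    """Pure-Python model of the semantics described above.

    Each True in `mask` takes the *next* unused item from `replacements`,
    regardless of that item's position. `None` stands in for null.
    Hypothetical helper for illustration; not Arrow's implementation.
    """
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    replacement_iter = iter(replacements)
    result = []
    for value, replace in zip(values, mask):
        result.append(next(replacement_iter) if replace else value)
    return result


arr1 = [1, 0, 1, None, None]
mask = [False, False, False, True, True]

# Passing the full-length array consumes its first two items (both null).
print(replace_with_mask_model(arr1, mask, [None, None, 1, 0, 1]))
# -> [1, 0, 1, None, None]

# Passing only the values meant for the masked slots gives the expected result.
print(replace_with_mask_model(arr1, mask, [0, 1]))
# -> [1, 0, 1, 0, 1]
```

Under this reading, the pyarrow call in the report would presumably need only the masked values as replacements, for example `arr2.filter(pa.array(mask))`, rather than the full `arr2`.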
[jira] [Created] (ARROW-17134) [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when providing an array mask
Matthew Roeschke created ARROW-17134:

Summary: [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when providing an array mask
Key: ARROW-17134
URL: https://issues.apache.org/jira/browse/ARROW-17134
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Affects Versions: 8.0.0
Reporter: Matthew Roeschke

{code:java}
In [1]: import pyarrow as pa
In [2]: arr1 = pa.array([1, 0, 1, None, None])
In [3]: arr2 = pa.array([None, None, 1, 0, 1])
In [4]: pa.compute.replace_with_mask(arr1, [False, False, False, True, True], arr2)
Out[4]:
[
  1,
  0,
  1,
  null, # I would expect 0
  null  # I would expect 1
]
In [5]: pa.__version__
Out[5]: '8.0.0'{code}

I have noticed this behavior with the integer, floating-point, bool, and temporal types.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17096) pyarrow.compute.mode for boolean arrays does not return true when mixed with false
[ https://issues.apache.org/jira/browse/ARROW-17096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Roeschke updated ARROW-17096:

Description:
{code:java}
In [1]: import pyarrow.compute as pc
In [2]: import pyarrow as pa
In [3]: pa.__version__
Out[3]: '8.0.0'
In [4]: pc.mode(pa.array([True, True]))
# Correct
Out[4]:
-- is_valid: all not null
-- child 0 type: bool
  [
    true
  ]
-- child 1 type: int64
  [
    2
  ]
# Incorrect
In [5]: pc.mode(pa.array([True, False]), 2)
Out[5]:
-- is_valid: all not null
-- child 0 type: bool
  [
    false, # should be true
    false
  ]
-- child 1 type: int64
  [
    1,
    1
  ]
{code}

was:
{code:java}
In [1]: import pyarrow.compute as pc
In [2]: import pyarrow as pa
In [3]: pa.__version__
Out[3]: '8.0.0'
In [4]: pc.mode(pa.array([True, True]))
# Correct
Out[4]:
-- is_valid: all not null
-- child 0 type: bool
  [
    true
  ]
-- child 1 type: int64
  [
    2
  ]
# Incorrect
In [5]: pc.mode(pa.array([True, False]), 2)
Out[5]:
-- is_valid: all not null
-- child 0 type: bool
  [
    false, # should be true
    false
  ]
-- child 1 type: int64
  [
    1,
    1
  ]
{code}

> pyarrow.compute.mode for boolean arrays does not return true when mixed with false
>
> Key: ARROW-17096
> URL: https://issues.apache.org/jira/browse/ARROW-17096
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Affects Versions: 8.0.0
> Reporter: Matthew Roeschke
> Priority: Major
>
> {code:java}
> In [1]: import pyarrow.compute as pc
> In [2]: import pyarrow as pa
> In [3]: pa.__version__
> Out[3]: '8.0.0'
> In [4]: pc.mode(pa.array([True, True]))
> # Correct
> Out[4]:
>
> -- is_valid: all not null
> -- child 0 type: bool
>   [
>     true
>   ]
> -- child 1 type: int64
>   [
>     2
>   ]
> # Incorrect
> In [5]: pc.mode(pa.array([True, False]), 2)
> Out[5]:
>
> -- is_valid: all not null
> -- child 0 type: bool
>   [
>     false, # should be true
>     false
>   ]
> -- child 1 type: int64
>   [
>     1,
>     1
>   ] {code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17096) pyarrow.compute.mode for boolean arrays does not return true when mixed with false
Matthew Roeschke created ARROW-17096:

Summary: pyarrow.compute.mode for boolean arrays does not return true when mixed with false
Key: ARROW-17096
URL: https://issues.apache.org/jira/browse/ARROW-17096
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Affects Versions: 8.0.0
Reporter: Matthew Roeschke

{code:java}
In [1]: import pyarrow.compute as pc
In [2]: import pyarrow as pa
In [3]: pa.__version__
Out[3]: '8.0.0'
In [4]: pc.mode(pa.array([True, True]))
# Correct
Out[4]:
-- is_valid: all not null
-- child 0 type: bool
  [
    true
  ]
-- child 1 type: int64
  [
    2
  ]
# Incorrect
In [5]: pc.mode(pa.array([True, False]), 2)
Out[5]:
-- is_valid: all not null
-- child 0 type: bool
  [
    false, # should be true
    false
  ]
-- child 1 type: int64
  [
    1,
    1
  ]
{code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
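For reference, the result the report expects can be modelled in plain Python with `collections.Counter`. This is an illustrative sketch only, not Arrow's algorithm. The name `mode_model` is hypothetical, and the order of tied values (first-seen order here) is an assumption that may differ from Arrow's tie-breaking.

```python
from collections import Counter


def mode_model(values, n=1):
    """Return up to `n` (value, count) pairs, most frequent first.

    Pure-Python illustration only. Ties keep first-seen order here, which is
    an assumption and may differ from Arrow's tie-breaking.
    """
    non_null = [v for v in values if v is not None]
    return Counter(non_null).most_common(n)


print(mode_model([True, True]))      # -> [(True, 2)]
print(mode_model([True, False], 2))  # -> [(True, 1), (False, 1)]
```

Whatever the tie order, both `true` and `false` should appear once each in the second result, which is the point of the report.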
[jira] [Closed] (ARROW-16645) [Python] Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container
[ https://issues.apache.org/jira/browse/ARROW-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Roeschke closed ARROW-16645.

Resolution: Duplicate

> [Python] Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container
>
> Key: ARROW-16645
> URL: https://issues.apache.org/jira/browse/ARROW-16645
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 7.0.0
> Reporter: Matthew Roeschke
> Priority: Major
>
> Example:
>
> {code:java}
> In [15]: import pyarrow as pa
> In [16]: pa.array([1, pa.NA])
> ArrowInvalid: Could not convert with type pyarrow.lib.NullScalar: did not recognize Python value type when inferring an Arrow data type{code}
>
> It would be great if this could be equivalent to
> {code:java}
> In [17]: pa.array([1, pa.NA], mask=[False, True])
> Out[17]:
>
> [
>   1,
>   null
> ]
> In [18]: pa.__version__
> Out[18]: '7.0.0'{code}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16645) [Python] Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container
[ https://issues.apache.org/jira/browse/ARROW-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542247#comment-17542247 ]

Matthew Roeschke commented on ARROW-16645:

Ah thanks, I didn't see ARROW-5295. Since this issue is just a subset of that larger issue, I'll close this one.

> [Python] Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container
>
> Key: ARROW-16645
> URL: https://issues.apache.org/jira/browse/ARROW-16645
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 7.0.0
> Reporter: Matthew Roeschke
> Priority: Major
>
> Example:
>
> {code:java}
> In [15]: import pyarrow as pa
> In [16]: pa.array([1, pa.NA])
> ArrowInvalid: Could not convert with type pyarrow.lib.NullScalar: did not recognize Python value type when inferring an Arrow data type{code}
>
> It would be great if this could be equivalent to
> {code:java}
> In [17]: pa.array([1, pa.NA], mask=[False, True])
> Out[17]:
>
> [
>   1,
>   null
> ]
> In [18]: pa.__version__
> Out[18]: '7.0.0'{code}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (ARROW-16645) Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container
Matthew Roeschke created ARROW-16645:

Summary: Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container
Key: ARROW-16645
URL: https://issues.apache.org/jira/browse/ARROW-16645
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 7.0.0
Reporter: Matthew Roeschke

Example:

{code:java}
In [15]: import pyarrow as pa
In [16]: pa.array([1, pa.NA])
ArrowInvalid: Could not convert with type pyarrow.lib.NullScalar: did not recognize Python value type when inferring an Arrow data type{code}

It would be great if this could be equivalent to
{code:java}
In [17]: pa.array([1, pa.NA], mask=[False, True])
Out[17]:
[
  1,
  null
]
In [18]: pa.__version__
Out[18]: '7.0.0'{code}

--
This message was sent by Atlassian Jira (v8.20.7#820007)
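Until such inference exists, a caller could pre-process the list before handing it to `pa.array`. The sketch below is one way to do that. The helper names `na_to_none` and `na_mask` are hypothetical, and the identity check assumes the sentinel (here, `pa.NA`) is a singleton, which is believed but not verified here.

```python
def na_to_none(values, na):
    """Replace every occurrence of the sentinel `na` with None.

    Identity comparison is used, which assumes `na` is a singleton
    (believed to hold for pa.NA). Hypothetical helper, not part of pyarrow.
    """
    return [None if v is na else v for v in values]


def na_mask(values, na):
    """Build a boolean mask that is True where the value is the sentinel."""
    return [v is na for v in values]


# Usage with pyarrow (not imported here so the sketch stays dependency-free):
#   import pyarrow as pa
#   pa.array(na_to_none([1, pa.NA], pa.NA))
```

Since `pa.array` already treats Python `None` as null, the first helper alone should be enough for plain lists. The mask helper mirrors the `mask=[False, True]` form used in the example above.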
[jira] [Commented] (ARROW-15666) [C++][Python][R] Add format inference option to StrptimeOptions
[ https://issues.apache.org/jira/browse/ARROW-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497921#comment-17497921 ]

Matthew Roeschke commented on ARROW-15666:

Speaking from experience on the pandas side, I agree with [~jorisvandenbossche] and would caution against "inference" logic. While convenient for users, the maintenance burden can be quite significant since inference tends to have an indefinite scope, leading to more custom logic, edge cases, etc.

> [C++][Python][R] Add format inference option to StrptimeOptions
>
> Key: ARROW-15666
> URL: https://issues.apache.org/jira/browse/ARROW-15666
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Rok Mihevc
> Priority: Major
>
> We want to have an option to infer timestamp format.
> See [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html] and lubridate [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html] for examples.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
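The concern about open-ended scope can be illustrated with a deliberately naive inference routine built on the standard library. This is a sketch for illustration only, not a proposal for Arrow and not how pandas or lubridate work. The candidate list and the name `matching_formats` are hypothetical.

```python
from datetime import datetime

# Hypothetical candidate list; a real inference feature would need many more.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y"]


def matching_formats(text, formats=CANDIDATE_FORMATS):
    """Return every candidate format that parses `text` successfully."""
    matches = []
    for fmt in formats:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue
        matches.append(fmt)
    return matches


print(matching_formats("2022-02-25"))  # -> ['%Y-%m-%d']
print(matching_formats("01/02/2022"))  # -> ['%m/%d/%Y', '%d/%m/%Y']
print(matching_formats("25/02/2022"))  # -> ['%d/%m/%Y']
```

The second call matches two formats, so any inference has to pick a tie-breaking rule, and every such rule becomes behavior that must be documented and maintained.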