[jira] [Created] (ARROW-17360) [Python] pyarrow.orc.ORCFile.read does not preserve ordering of columns

2022-08-09 Thread Matthew Roeschke (Jira)
Matthew Roeschke created ARROW-17360:


 Summary: [Python] pyarrow.orc.ORCFile.read does not preserve 
ordering of columns
 Key: ARROW-17360
 URL: https://issues.apache.org/jira/browse/ARROW-17360
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 8.0.1
Reporter: Matthew Roeschke


xref [https://github.com/pandas-dev/pandas/issues/47944]

 
{code:java}
In [1]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# pandas main branch / 1.5
In [2]: df.to_orc("abc")

In [3]: pd.read_orc("abc", columns=['b', 'a'])
Out[3]:
   a  b
0  1  a
1  2  b
2  3  c

In [4]: import pyarrow.orc as orc

In [5]: orc_file = orc.ORCFile("abc")

# reordered to a, b
In [6]: orc_file.read(columns=['b', 'a']).to_pandas()
Out[6]:
   a  b
0  1  a
1  2  b
2  3  c

# reordered to a, b
In [7]: orc_file.read(columns=['b', 'a'])
Out[7]:
pyarrow.Table
a: int64
b: string

a: [[1,2,3]]
b: [["a","b","c"]] {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-17134) [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when providing an array mask

2022-07-20 Thread Matthew Roeschke (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569124#comment-17569124
 ] 

Matthew Roeschke commented on ARROW-17134:
--

Ah okay, that makes sense. When I read "len(replacements) == number of true 
values in the mask", for some reason I thought the replacement values could 
still be positionally aligned with the mask.

> We should maybe consider raising an error if the {{replacements}} are too 
> long?

That would be helpful, or an example in the docstring could clarify that 
point. Fine either way.
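
The point discussed above (replacements are consumed in order, one per true mask entry, rather than being aligned positionally with the mask) can be modelled in plain Python. This is only an illustrative model of that rule, not Arrow's implementation; replace_with_mask_model is a hypothetical name, and null mask entries are not modelled.

```python
def replace_with_mask_model(values, mask, replacements):
    """Plain-Python model: each True in `mask` takes the NEXT replacement."""
    remaining = iter(replacements)
    # Replacements are consumed sequentially, not looked up by position.
    return [next(remaining) if flag else value for value, flag in zip(values, mask)]


arr1 = [1, 0, 1, None, None]
arr2 = [None, None, 1, 0, 1]
mask = [False, False, False, True, True]

# Passing the full-length arr2: the first two replacements (None, None) are used.
reported = replace_with_mask_model(arr1, mask, arr2)

# Passing exactly one replacement per True mask entry gives the expected values.
expected = replace_with_mask_model(arr1, mask, [0, 1])

print(reported)
print(expected)
```

Under this model, the output in the report follows from passing a replacements array that is longer than the number of true mask values.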

> [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when 
> providing an array mask
> 
>
> Key: ARROW-17134
> URL: https://issues.apache.org/jira/browse/ARROW-17134
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 8.0.0
>Reporter: Matthew Roeschke
>Priority: Major
>
>  
> {code:java}
> In [1]: import pyarrow as pa
> In [2]: arr1 = pa.array([1, 0, 1, None, None])
> In [3]: arr2 = pa.array([None, None, 1, 0, 1])
> In [4]: pa.compute.replace_with_mask(arr1, [False, False, False, True, True], arr2)
> Out[4]:
> 
> [
>   1,
>   0,
>   1,
>   null, # I would expect 0
>   null  # I would expect 1
> ]
> In [5]: pa.__version__
> Out[5]: '8.0.0'{code}
>  
> I have noticed this behavior with the integer, floating-point, bool, and 
> temporal types.
>  





[jira] [Created] (ARROW-17134) [C++(?)/Python] pyarrow.compute.replace_with_mask does not replace null when providing an array mask

2022-07-19 Thread Matthew Roeschke (Jira)
Matthew Roeschke created ARROW-17134:


 Summary: [C++(?)/Python] pyarrow.compute.replace_with_mask does 
not replace null when providing an array mask
 Key: ARROW-17134
 URL: https://issues.apache.org/jira/browse/ARROW-17134
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 8.0.0
Reporter: Matthew Roeschke


 
{code:java}
In [1]: import pyarrow as pa

In [2]: arr1 = pa.array([1, 0, 1, None, None])

In [3]: arr2 = pa.array([None, None, 1, 0, 1])

In [4]: pa.compute.replace_with_mask(arr1, [False, False, False, True, True], arr2)

Out[4]:

[
  1,
  0,
  1,
  null, # I would expect 0
  null  # I would expect 1
]

In [5]: pa.__version__
Out[5]: '8.0.0'{code}
 

I have noticed this behavior with the integer, floating-point, bool, and 
temporal types.

 





[jira] [Updated] (ARROW-17096) pyarrow.compute.mode for boolean arrays does not return true when mixed with false

2022-07-15 Thread Matthew Roeschke (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Roeschke updated ARROW-17096:
-
Description: 
{code:java}
In [1]: import pyarrow.compute as pc

In [2]: import pyarrow as pa

In [3]: pa.__version__
Out[3]: '8.0.0'

In [4]: pc.mode(pa.array([True, True]))
# Correct
Out[4]:

-- is_valid: all not null
-- child 0 type: bool
  [
    true
  ]
-- child 1 type: int64
  [
    2
  ]

# Incorrect
In [5]: pc.mode(pa.array([True, False]), 2)
Out[5]:

-- is_valid: all not null
-- child 0 type: bool
  [
    false, # should be true
    false
  ]
-- child 1 type: int64
  [
    1,
    1
  ] {code}

  was:
{code:java}
In [1]: import pyarrow.compute as pc
In [2]: import pyarrow as pa
In [3]: pa.__version__
Out[3]: '8.0.0'
In [4]: pc.mode(pa.array([True, True]))
# Correct
Out[4]:

-- is_valid: all not null
-- child 0 type: bool
  [
    true
  ]
-- child 1 type: int64
  [
    2
  ]
# Incorrect
In [5]: pc.mode(pa.array([True, False]), 2)
Out[5]:

-- is_valid: all not null
-- child 0 type: bool
  [
    false, # should be true
    false
  ]
-- child 1 type: int64
  [
    1,
    1
  ] {code}


> pyarrow.compute.mode for boolean arrays does not return true when mixed with 
> false
> --
>
> Key: ARROW-17096
> URL: https://issues.apache.org/jira/browse/ARROW-17096
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 8.0.0
>Reporter: Matthew Roeschke
>Priority: Major
>
> {code:java}
> In [1]: import pyarrow.compute as pc
> In [2]: import pyarrow as pa
> In [3]: pa.__version__
> Out[3]: '8.0.0'
> In [4]: pc.mode(pa.array([True, True]))
> # Correct
> Out[4]:
> 
> -- is_valid: all not null
> -- child 0 type: bool
>   [
>     true
>   ]
> -- child 1 type: int64
>   [
>     2
>   ]
> # Incorrect
> In [5]: pc.mode(pa.array([True, False]), 2)
> Out[5]:
> 
> -- is_valid: all not null
> -- child 0 type: bool
>   [
>     false, # should be true
>     false
>   ]
> -- child 1 type: int64
>   [
>     1,
>     1
>   ] {code}





[jira] [Created] (ARROW-17096) pyarrow.compute.mode for boolean arrays does not return true when mixed with false

2022-07-15 Thread Matthew Roeschke (Jira)
Matthew Roeschke created ARROW-17096:


 Summary: pyarrow.compute.mode for boolean arrays does not return 
true when mixed with false
 Key: ARROW-17096
 URL: https://issues.apache.org/jira/browse/ARROW-17096
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Affects Versions: 8.0.0
Reporter: Matthew Roeschke


{code:java}
In [1]: import pyarrow.compute as pc
In [2]: import pyarrow as pa
In [3]: pa.__version__
Out[3]: '8.0.0'
In [4]: pc.mode(pa.array([True, True]))
# Correct
Out[4]:

-- is_valid: all not null
-- child 0 type: bool
  [
    true
  ]
-- child 1 type: int64
  [
    2
  ]
# Incorrect
In [5]: pc.mode(pa.array([True, False]), 2)
Out[5]:

-- is_valid: all not null
-- child 0 type: bool
  [
    false, # should be true
    false
  ]
-- child 1 type: int64
  [
    1,
    1
  ] {code}
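
For comparison, the result one would expect can be computed with a small plain-Python reference. This is a sketch, not Arrow's implementation; mode_model is a hypothetical name. It assumes the ordering I recall from the Arrow compute documentation (descending count, ties broken by ascending value, nulls ignored); treat that ordering as an assumption.

```python
from collections import Counter


def mode_model(values, n=1):
    """Return up to `n` (value, count) pairs, most common first."""
    counts = Counter(v for v in values if v is not None)  # nulls are ignored
    # Assumed ordering: descending count, then ascending value on ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


single = mode_model([True, True])
mixed = mode_model([True, False], 2)
print(single)
print(mixed)
```

Under that assumed ordering, the mixed case should contain both false and true, each with count 1, rather than false twice.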





[jira] [Closed] (ARROW-16645) [Python] Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container

2022-05-25 Thread Matthew Roeschke (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Roeschke closed ARROW-16645.

Resolution: Duplicate

> [Python] Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow 
> container
> ---
>
> Key: ARROW-16645
> URL: https://issues.apache.org/jira/browse/ARROW-16645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Matthew Roeschke
>Priority: Major
>
> Example:
>  
>  
> {code:java}
> In [15]: import pyarrow as pa
> In [16]: pa.array([1, pa.NA])
> ArrowInvalid: Could not convert  with type 
> pyarrow.lib.NullScalar: did not recognize Python value type when inferring an 
> Arrow data type{code}
>  
> It would be great if this could be equivalent to
> {code:java}
> In [17]: pa.array([1, pa.NA], mask=[False, True])
> Out[17]:
> 
> [
>   1,
>   null
> ]
> In [18]: pa.__version__
> Out[18]: '7.0.0'{code}
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16645) [Python] Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container

2022-05-25 Thread Matthew Roeschke (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542247#comment-17542247
 ] 

Matthew Roeschke commented on ARROW-16645:
--

Ah thanks, I didn't see ARROW-5295

Since this issue is just a subset of that larger issue, I'll close this one.

> [Python] Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow 
> container
> ---
>
> Key: ARROW-16645
> URL: https://issues.apache.org/jira/browse/ARROW-16645
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Matthew Roeschke
>Priority: Major
>
> Example:
>  
>  
> {code:java}
> In [15]: import pyarrow as pa
> In [16]: pa.array([1, pa.NA])
> ArrowInvalid: Could not convert  with type 
> pyarrow.lib.NullScalar: did not recognize Python value type when inferring an 
> Arrow data type{code}
>  
> It would be great if this could be equivalent to
> {code:java}
> In [17]: pa.array([1, pa.NA], mask=[False, True])
> Out[17]:
> 
> [
>   1,
>   null
> ]
> In [18]: pa.__version__
> Out[18]: '7.0.0'{code}
>  
>  
>  





[jira] [Created] (ARROW-16645) Allow pa.array/pa.chunked_array to infer pa.NA when in a non pyarrow container

2022-05-24 Thread Matthew Roeschke (Jira)
Matthew Roeschke created ARROW-16645:


 Summary: Allow pa.array/pa.chunked_array to infer pa.NA when in a 
non pyarrow container
 Key: ARROW-16645
 URL: https://issues.apache.org/jira/browse/ARROW-16645
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 7.0.0
Reporter: Matthew Roeschke


Example:

 

 
{code:java}
In [15]: import pyarrow as pa

In [16]: pa.array([1, pa.NA])
ArrowInvalid: Could not convert  with type 
pyarrow.lib.NullScalar: did not recognize Python value type when inferring an 
Arrow data type{code}
 

It would be great if this could be equivalent to
{code:java}
In [17]: pa.array([1, pa.NA], mask=[False, True])
Out[17]:

[
  1,
  null
]


In [18]: pa.__version__
Out[18]: '7.0.0'{code}
 

 

 





[jira] [Commented] (ARROW-15666) [C++][Python][R] Add format inference option to StrptimeOptions

2022-02-24 Thread Matthew Roeschke (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497921#comment-17497921
 ] 

Matthew Roeschke commented on ARROW-15666:
--

Speaking from experience on the pandas side, I agree with [~jorisvandenbossche] 
and would caution against "inference" logic. While convenient for users, the 
maintenance burden can be quite significant since inference tends to have an 
indefinite scope, leading to more custom logic, edge cases, etc.
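
One concrete edge case of the kind mentioned above, shown with the Python standard library rather than Arrow (so this illustrates the ambiguity only, not StrptimeOptions): the same string is valid under two common formats and yields different dates, so any inference has to pick one by convention.

```python
from datetime import datetime

text = "01/02/2022"

# Both explicit formats succeed, but they disagree on the result.
month_first = datetime.strptime(text, "%m/%d/%Y")
day_first = datetime.strptime(text, "%d/%m/%Y")

print(month_first.date())
print(day_first.date())
```

With an explicit format the caller decides which reading applies; with inference that decision moves into library logic.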

> [C++][Python][R] Add format inference option to StrptimeOptions
> ---
>
> Key: ARROW-15666
> URL: https://issues.apache.org/jira/browse/ARROW-15666
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Rok Mihevc
>Priority: Major
>
> We want to have an option to infer timestamp format.
> See 
> [pandas.to_datetime|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html]
>  and lubridate 
> [parse_date_time|https://lubridate.tidyverse.org/reference/parse_date_time.html]
>  for examples.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)