[jira] [Created] (ARROW-12431) [Python] pa.array mask inverted when type is binary and value to be converted in numpy array

2021-04-17 Thread Daniel Nugent (Jira)
Daniel Nugent created ARROW-12431:
-

             Summary: [Python] pa.array mask inverted when type is binary and value to be converted in numpy array
                 Key: ARROW-12431
                 URL: https://issues.apache.org/jira/browse/ARROW-12431
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Daniel Nugent


{code:python}
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pyarrow as pa
>>>
>>> pa.array(np.array([b'\x00']),type=pa.binary(1), mask = np.array([False]))

[
  null
]
>>> pa.array(np.array([b'\x00']),type=pa.binary(1), mask = np.array([True]))

[
  00
]
>>> pa.array([b'\x00'],type=pa.binary(1), mask = np.array([False]))

[
  00
]
>>> pa.__version__
'3.0.0'
>>> np.__version__
'1.20.1'
{code}

Happens with both FixedSizeBinary and variable-sized binary (I was working with 
FixedSizeBinary). Does not happen for integers (and presumably not for other 
types, though I didn't check exhaustively).
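
For reference, the documented meaning of {{mask}} in {{pa.array}} is that True marks a null entry, which matches the list case above but not the numpy case. A minimal pure-Python sketch of that expected behaviour follows. It might also serve as a workaround, by handing {{pa.array}} a list with None in place of the masked values instead of passing a mask (the helper name {{apply_null_mask}} is made up for this illustration and is not part of pyarrow):

```python
def apply_null_mask(values, mask):
    """Return a list with None wherever mask is True (True means null)."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [None if is_null else value for value, is_null in zip(values, mask)]


# Hypothetical usage: pa.array(apply_null_mask(raw, mask), type=pa.binary(1))
print(apply_null_mask([b"\x00"], [False]))  # [b'\x00']
print(apply_null_mask([b"\x00"], [True]))   # [None]
```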



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

2021-03-17 Thread Daniel Nugent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303320#comment-17303320
 ] 

Daniel Nugent commented on ARROW-11989:
---

Saw this on the list and just wanted to point out that accessing a single 
index, or an arbitrary vector of indices at a time, might be less common than 
accessing a sorted vector of indices at a time. Accessing a sorted, contiguous 
vector of indices may be the most common case of all (for example, iterating 
over a table in batches of records that are not aligned to the chunk size).
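
To illustrate why the sorted case is worth treating separately, here is a rough pure-Python sketch (not Arrow code; the function name {{locate_sorted}} and the modelling of a chunked array as a plain list of chunk lengths are assumptions for this example). With ascending indices the chunk cursor only ever moves forward, so resolving a whole vector costs about O(chunks + indices) with no per-index search:

```python
def locate_sorted(chunk_lengths, sorted_indices):
    """Map ascending global indices to (chunk, offset) pairs in one forward pass."""
    result = []
    chunk = 0
    chunk_start = 0  # global index of the first element of the current chunk
    for index in sorted_indices:
        if index < chunk_start:
            raise ValueError("indices must be non-negative and ascending")
        # Skip chunks that end at or before this index; the cursor never rewinds.
        while chunk < len(chunk_lengths) and index >= chunk_start + chunk_lengths[chunk]:
            chunk_start += chunk_lengths[chunk]
            chunk += 1
        if chunk == len(chunk_lengths):
            raise IndexError("index out of range")
        result.append((chunk, index - chunk_start))
    return result


# A batch of records that straddles chunk boundaries.
print(locate_sorted([3, 2, 4], [2, 3, 4, 5]))  # [(0, 2), (1, 0), (1, 1), (2, 0)]
```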

> [C++][Python] Improve ChunkedArray's complexity for the access of elements
> --
>
> Key: ARROW-11989
> URL: https://issues.apache.org/jira/browse/ARROW-11989
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 3.0.0
>Reporter: quentin lhoest
>Priority: Major
>
> Chunked arrays are stored as a C++ vector of Arrays.
> There is currently no indexing structure on top of the vector to allow for 
> anything better than O(number of chunks) access to an arbitrary element.
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
> Then the times it takes to access one example are:
> {code:java}
> Time to access example at i=0%    : 6.7μs
> Time to access example at i=10%   : 7.2μs
> Time to access example at i=20%   : 9.1μs
> Time to access example at i=30%   : 11.4μs
> Time to access example at i=40%   : 13.8μs
> Time to access example at i=50%   : 16.2μs
> Time to access example at i=60%   : 18.7μs
> Time to access example at i=70%   : 21.1μs
> Time to access example at i=80%   : 26.8μs
> Time to access example at i=90%   : 25.2μs
> {code}
> The times measured are the average times to do `table["text"][j]`, depending 
> on the index we want to fetch (from the first example at 0% to the example at 
> 90% of the length of the table).
> You can take a look at the code that produces this benchmark 
> [here|https://pastebin.com/pSkYHQn9].
> Some discussions in [this thread on the mailing 
> list|https://lists.apache.org/thread.html/r82d4cb40d72914977bf4c3c5b4c168ea03f6060d24279a44258a6394%40%3Cuser.arrow.apache.org%3E]
>  suggested different approaches to improve the complexity:
> - use a contiguous array of chunk lengths, since having a contiguous array of 
> lengths makes the iteration over the chunks lengths faster;
> - use a binary search, as in the Julia implementation 
> [here|https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14];
> - use interpolation search.
> Apparently there is also a lookup structure in the compute layer 
> [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94].
> cc [~emkornfield], [~wesm]
> Thanks again for the amazing work!
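
The binary search suggestion in the quoted issue can be sketched in plain Python as follows. This is only an illustration under assumptions, not the Arrow implementation: a chunked array is modelled as a list of chunk lengths, and {{build_offsets}} and {{locate}} are names invented here. The contiguous array of cumulative offsets is built once, after which each lookup is O(log chunks):

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(chunk_lengths):
    """Cumulative end offsets: offsets[i] is the global index just past chunk i."""
    return list(accumulate(chunk_lengths))


def locate(offsets, index):
    """Return (chunk, offset within chunk) for a global index via binary search."""
    total = offsets[-1] if offsets else 0
    if not 0 <= index < total:
        raise IndexError("index out of range")
    # First chunk whose end offset is strictly greater than the index.
    chunk = bisect_right(offsets, index)
    start = offsets[chunk - 1] if chunk else 0
    return chunk, index - start


# Same shape as the example in the issue: 1024 chunks of 1024 rows each.
offsets = build_offsets([1024] * 1024)
print(locate(offsets, 1024 * 512 + 7))  # (512, 7)
```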





[jira] [Created] (ARROW-11634) [Python] Parquet statistics for dictionary columns are incorrect

2021-02-15 Thread Daniel Nugent (Jira)
Daniel Nugent created ARROW-11634:
-

             Summary: [Python] Parquet statistics for dictionary columns are incorrect
                 Key: ARROW-11634
                 URL: https://issues.apache.org/jira/browse/ARROW-11634
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 3.0.0
            Reporter: Daniel Nugent


I would expect to see {{('A','A')}} for the first row group and {{('B','B')}} 
for the second row group.

I suspect this is a C++ issue, but I went looking for where the statistics are 
calculated and was unable to find it.

{code:python}
>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]),["A","B"])
>>> t = pa.table({"col":d})
>>> papq.write_table(t,'sample.parquet',row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min,
...  f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min,
...  f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)

[ 
  -- dictionary:
[
  "A",
  "B"
]
  -- indices:
[
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  ...
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0,
  0
]
]
>>> f.read_row_groups([1]).column(0)

[
  -- dictionary:
[
  "A",
  "B"
]
  -- indices:
[
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  ...
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1
]
]
{code}
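
To spell out the expectation, here is a small pure-Python sketch of per-row-group statistics taken over the values each row group actually references, rather than over the whole dictionary. It does not touch the Parquet writer: the function name {{row_group_stats}} is made up for this illustration, and nulls are not handled.

```python
def row_group_stats(dictionary, indices, row_group_size):
    """Return (min, max) over the values each row group actually references."""
    stats = []
    for start in range(0, len(indices), row_group_size):
        values = [dictionary[i] for i in indices[start:start + row_group_size]]
        stats.append((min(values), max(values)))
    return stats


dictionary = ["A", "B"]
indices = 100 * [0] + 100 * [1]
print(row_group_stats(dictionary, indices, 100))  # [('A', 'A'), ('B', 'B')]
print((min(dictionary), max(dictionary)))         # ('A', 'B')
```

The second result matches the {{('A', 'B')}} reported for both row groups above, which is what min/max over the entire dictionary would give. Whether that is the actual cause is only a guess.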






[jira] [Commented] (ARROW-8025) [C++] Implement cast to Binary and FixedSizeBinary

2021-02-09 Thread Daniel Nugent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-8025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281979#comment-17281979
 ] 

Daniel Nugent commented on ARROW-8025:
--

Was there ever a similar issue for the FixedSizeBinary to String cast? It’s 
nice to have when you want to ensure that single byte records are interpreted 
as characters.

> [C++] Implement cast to Binary and FixedSizeBinary
> --
>
> Key: ARROW-8025
> URL: https://issues.apache.org/jira/browse/ARROW-8025
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.0.0
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> It appears you can cast from Binary to String but not the other way. 





[jira] [Commented] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches

2020-06-18 Thread Daniel Nugent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140027#comment-17140027
 ] 

Daniel Nugent commented on ARROW-7702:
--

[~jorisvandenbossche] Please confirm that issue is now resolved.

> [C++][Dataset] Provide (optional) deterministic order of batches
> 
>
> Key: ARROW-7702
> URL: https://issues.apache.org/jira/browse/ARROW-7702
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
>
> Example with python:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': range(12)}) 
> pq.write_table(table, "test_chunks.parquet", chunk_size=3) 
> # reading with dataset
> import pyarrow.dataset as ds
> ds.dataset("test_chunks.parquet").to_table().to_pandas()
> {code}
> gives a non-deterministic result (the order of the row groups in the parquet file varies):
> {code}
> In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[25]: 
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    4
> 5    5
> 6    6
> 7    7
> 8    8
> 9    9
> 10  10
> 11  11
> In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[26]: 
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    8
> 5    9
> 6   10
> 7   11
> 8    4
> 9    5
> 10   6
> 11   7
> {code}





[jira] [Reopened] (ARROW-9150) Support Dictionary Unification with Concatenate

2020-06-17 Thread Daniel Nugent (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Nugent reopened ARROW-9150:
--

> Support Dictionary Unification with Concatenate
> ---
>
> Key: ARROW-9150
> URL: https://issues.apache.org/jira/browse/ARROW-9150
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Daniel Nugent
>Priority: Minor
>
> Seems to be supported through arrow to pandas conversions at the moment. I 
> *believe* the DictionaryUnifier could be leveraged for this.
> Not sure if there are unintended consequences, but the NYI implies that this 
> is desired. Didn't see an open issue for it already.





[jira] [Closed] (ARROW-9150) Support Dictionary Unification with Concatenate

2020-06-17 Thread Daniel Nugent (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Nugent closed ARROW-9150.

Resolution: Duplicate

> Support Dictionary Unification with Concatenate
> ---
>
> Key: ARROW-9150
> URL: https://issues.apache.org/jira/browse/ARROW-9150
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Daniel Nugent
>Priority: Minor
>
> Seems to be supported through arrow to pandas conversions at the moment. I 
> *believe* the DictionaryUnifier could be leveraged for this.
> Not sure if there are unintended consequences, but the NYI implies that this 
> is desired. Didn't see an open issue for it already.





[jira] [Commented] (ARROW-9150) Support Dictionary Unification with Concatenate

2020-06-17 Thread Daniel Nugent (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138462#comment-17138462
 ] 

Daniel Nugent commented on ARROW-9150:
--

Yes, looks like it. Sorry, I think I searched for "concat" rather than 
"concatenate" when looking for an existing issue.

> Support Dictionary Unification with Concatenate
> ---
>
> Key: ARROW-9150
> URL: https://issues.apache.org/jira/browse/ARROW-9150
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Daniel Nugent
>Priority: Minor
>
> Seems to be supported through arrow to pandas conversions at the moment. I 
> *believe* the DictionaryUnifier could be leveraged for this.
> Not sure if there are unintended consequences, but the NYI implies that this 
> is desired. Didn't see an open issue for it already.





[jira] [Created] (ARROW-9150) Support Dictionary Unification with Concatenate

2020-06-16 Thread Daniel Nugent (Jira)
Daniel Nugent created ARROW-9150:


             Summary: Support Dictionary Unification with Concatenate
                 Key: ARROW-9150
                 URL: https://issues.apache.org/jira/browse/ARROW-9150
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Daniel Nugent


Seems to be supported through arrow to pandas conversions at the moment. I 
*believe* the DictionaryUnifier could be leveraged for this.

Not sure if there are unintended consequences, but the NYI implies that this is 
desired. Didn't see an open issue for it already.
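
As a rough picture of what I have in mind, here is a pure-Python sketch of concatenation with dictionary unification. It deliberately does not use the Arrow DictionaryUnifier API: dictionary arrays are modelled as (dictionary, indices) pairs of plain lists, and the function name {{concat_dictionary_arrays}} is invented for this example. Each input dictionary is merged into one unified dictionary, and each input's indices are remapped through a per-input transpose map before being appended:

```python
def concat_dictionary_arrays(arrays):
    """Concatenate (dictionary, indices) pairs into one pair with a unified dictionary."""
    unified = []      # unified dictionary values, in first-seen order
    position = {}     # value -> its index in the unified dictionary
    out_indices = []
    for dictionary, indices in arrays:
        # Transpose map for this input: old index -> index in the unified dictionary.
        remap = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            remap.append(position[value])
        # Nulls (None) pass through unchanged; everything else is remapped.
        out_indices.extend(None if i is None else remap[i] for i in indices)
    return unified, out_indices


first = (["A", "B"], [0, 1, 0])
second = (["B", "C"], [1, 0])
print(concat_dictionary_arrays([first, second]))
# (['A', 'B', 'C'], [0, 1, 0, 2, 1])
```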


