[jira] [Created] (ARROW-12431) [Python] pa.array mask inverted when type is binary and value to be converted in numpy array
Daniel Nugent created ARROW-12431:
----------------------------------

Summary: [Python] pa.array mask inverted when type is binary and value to be converted in numpy array
Key: ARROW-12431
URL: https://issues.apache.org/jira/browse/ARROW-12431
Project: Apache Arrow
Issue Type: Bug
Reporter: Daniel Nugent

{code:python}
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) [GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pyarrow as pa
>>>
>>> pa.array(np.array([b'\x00']), type=pa.binary(1), mask=np.array([False]))
[
  null
]
>>> pa.array(np.array([b'\x00']), type=pa.binary(1), mask=np.array([True]))
[
  00
]
>>> pa.array([b'\x00'], type=pa.binary(1), mask=np.array([False]))
[
  00
]
>>> pa.__version__
'3.0.0'
>>> np.__version__
'1.20.1'
{code}

This happens with both FixedSizeBinary and variable-sized binary (I was working with FixedSizeBinary). It does not happen for integers, and presumably not for other types, though I didn't check exhaustively.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
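For reference, the documented mask convention (which the integer path above follows) is that True marks the corresponding element as null. A minimal pure-Python sketch of that convention, with a hypothetical helper name:

```python
def apply_null_mask(values, mask):
    # pa.array's mask convention: True means "this element is null"
    return [None if masked else value for value, masked in zip(values, mask)]

# With mask=[False], the value should be kept -- the opposite of what the
# numpy-binary path above produces.
print(apply_null_mask([b"\x00"], [False]))  # [b'\x00']
print(apply_null_mask([b"\x00"], [True]))   # [None]
```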
[jira] [Commented] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements
[ https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17303320#comment-17303320 ]

Daniel Nugent commented on ARROW-11989:
---------------------------------------

I saw this on the list and just wanted to point out that accessing by a single index, or by an arbitrary vector of indices, might be less common than accessing with a sorted vector of indices. Indexing with a sorted, contiguous vector may be the most common case of all (for example, iterating across a table in batches of records not aligned to chunk size).

> [C++][Python] Improve ChunkedArray's complexity for the access of elements
> --------------------------------------------------------------------------
>
> Key: ARROW-11989
> URL: https://issues.apache.org/jira/browse/ARROW-11989
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Affects Versions: 3.0.0
> Reporter: quentin lhoest
> Priority: Major
>
> Chunked arrays are stored as a C++ vector of Arrays.
> There is currently no indexing structure on top of the vector to allow for anything better than O(chunks) access to an arbitrary element.
> For example, with a Table consisting of one column "text" defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
> Then the times it takes to access one example are:
> {code:java}
> Time to access example at i=0%:  6.7μs
> Time to access example at i=10%: 7.2μs
> Time to access example at i=20%: 9.1μs
> Time to access example at i=30%: 11.4μs
> Time to access example at i=40%: 13.8μs
> Time to access example at i=50%: 16.2μs
> Time to access example at i=60%: 18.7μs
> Time to access example at i=70%: 21.1μs
> Time to access example at i=80%: 26.8μs
> Time to access example at i=90%: 25.2μs
> {code}
> The times measured are the average times to do `table["text"][j]` depending on the index we want to fetch (from the first example at 0% to the example at 90% of the length of the table).
> You can take a look at the code that produces this benchmark [here|https://pastebin.com/pSkYHQn9].
> Some discussions in [this thread on the mailing list|https://lists.apache.org/thread.html/r82d4cb40d72914977bf4c3c5b4c168ea03f6060d24279a44258a6394%40%3Cuser.arrow.apache.org%3E] suggested different approaches to improve the complexity:
> - use a contiguous array of chunk lengths, since a contiguous array makes iteration over the chunk lengths faster;
> - use a binary search, as in the Julia implementation [here|https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14];
> - use interpolation search.
> Apparently there is also a lookup structure in the compute layer [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94].
> cc [~emkornfield], [~wesm]
> Thanks again for the amazing work!
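The binary-search suggestion above can be sketched in a few lines of Python: keep a cumulative array of chunk end offsets and bisect it to locate the right chunk in O(log n_chunks) instead of O(n_chunks). The class name here is hypothetical, not an Arrow API:

```python
import bisect


class ChunkedIndex:
    """Random access into a list of chunks via a cumulative-offset index."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.offsets = []  # cumulative end offset of each chunk
        total = 0
        for chunk in chunks:
            total += len(chunk)
            self.offsets.append(total)

    def __getitem__(self, i):
        # Find the first chunk whose end offset exceeds i: O(log n_chunks).
        k = bisect.bisect_right(self.offsets, i)
        start = self.offsets[k - 1] if k else 0
        return self.chunks[k][i - start]


ci = ChunkedIndex([[0, 1, 2], [3, 4], [5]])
print(ci[4])  # 4
```

The contiguous-offsets suggestion falls out of the same structure: `self.offsets` is a flat list, so both the linear scan and the binary search stay cache-friendly.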
[jira] [Created] (ARROW-11634) [Python] Parquet statistics for dictionary columns are incorrect
Daniel Nugent created ARROW-11634:
----------------------------------

Summary: [Python] Parquet statistics for dictionary columns are incorrect
Key: ARROW-11634
URL: https://issues.apache.org/jira/browse/ARROW-11634
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Daniel Nugent

I would expect to see {{('A','A')}} for the first row group and {{('B','B')}} for the second row group. I suspect this is a C++ issue, but I went looking for where the statistics are calculated and was unable to find it.

{code:python}
>>> import pyarrow as pa
>>> import pyarrow.parquet as papq
>>> d = pa.DictionaryArray.from_arrays((100*[0]) + (100*[1]), ["A","B"])
>>> t = pa.table({"col": d})
>>> papq.write_table(t, 'sample.parquet', row_group_size=100)
>>> f = papq.ParquetFile('sample.parquet')
>>> (f.metadata.row_group(0).column(0).statistics.min,
...  f.metadata.row_group(0).column(0).statistics.max)
('A', 'B')
>>> (f.metadata.row_group(1).column(0).statistics.min,
...  f.metadata.row_group(1).column(0).statistics.max)
('A', 'B')
>>> f.read_row_groups([0]).column(0)
[
  -- dictionary:
    [ "A", "B" ]
  -- indices:
    [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
]
>>> f.read_row_groups([1]).column(0)
[
  -- dictionary:
    [ "A", "B" ]
  -- indices:
    [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ]
]
{code}
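For context, the expected statistics would be computed over only the dictionary values a row group's indices actually reference, not over the full shared dictionary. A pure-Python sketch of that expectation (the helper name is hypothetical):

```python
def rowgroup_minmax(indices, dictionary):
    # Statistics should reflect the values this row group actually
    # references, not every entry in the shared dictionary.
    used = {dictionary[i] for i in indices}
    return min(used), max(used)


# Matches the expectation stated in the report above.
print(rowgroup_minmax(100 * [0], ["A", "B"]))  # ('A', 'A')
print(rowgroup_minmax(100 * [1], ["A", "B"]))  # ('B', 'B')
```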
[jira] [Commented] (ARROW-8025) [C++] Implement cast to Binary and FixedSizeBinary
[ https://issues.apache.org/jira/browse/ARROW-8025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281979#comment-17281979 ]

Daniel Nugent commented on ARROW-8025:
--------------------------------------

Was there ever a similar issue for the FixedSizeBinary to String cast? It's nice to have when you want to ensure that single-byte records are interpreted as characters.

> [C++] Implement cast to Binary and FixedSizeBinary
> --------------------------------------------------
>
> Key: ARROW-8025
> URL: https://issues.apache.org/jira/browse/ARROW-8025
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Neal Richardson
> Assignee: Krisztian Szucs
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> It appears you can cast from Binary to String but not the other way.
[jira] [Commented] (ARROW-7702) [C++][Dataset] Provide (optional) deterministic order of batches
[ https://issues.apache.org/jira/browse/ARROW-7702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17140027#comment-17140027 ]

Daniel Nugent commented on ARROW-7702:
--------------------------------------

[~jorisvandenbossche] Can you confirm whether this issue is now resolved?

> [C++][Dataset] Provide (optional) deterministic order of batches
> ----------------------------------------------------------------
>
> Key: ARROW-7702
> URL: https://issues.apache.org/jira/browse/ARROW-7702
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: dataset
>
> Example with python:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> table = pa.table({'a': range(12)})
> pq.write_table(table, "test_chunks.parquet", chunk_size=3)
> # reading with dataset
> import pyarrow.dataset as ds
> ds.dataset("test_chunks.parquet").to_table().to_pandas()
> {code}
> gives a non-deterministic result (order of the row groups in the parquet file):
> {code}
> In [25]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[25]:
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    4
> 5    5
> 6    6
> 7    7
> 8    8
> 9    9
> 10  10
> 11  11
>
> In [26]: ds.dataset("test_chunks.parquet").to_table().to_pandas()
> Out[26]:
>      a
> 0    0
> 1    1
> 2    2
> 3    3
> 4    8
> 5    9
> 6   10
> 7   11
> 8    4
> 9    5
> 10   6
> 11   7
> {code}
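One way to make the result deterministic regardless of scan completion order is to tag each batch with its row-group sequence number and sort before concatenating. A pure-Python sketch of the idea (not the dataset API; the helper name is hypothetical):

```python
def reassemble_in_order(tagged_batches):
    # Batches may complete in any order; sorting by the sequence number
    # restores the original row-group order deterministically.
    return [batch for _, batch in sorted(tagged_batches, key=lambda p: p[0])]


# Row groups finishing out of order, as in the second output above:
out_of_order = [(2, [8, 9, 10, 11]), (0, [0, 1, 2, 3]), (1, [4, 5, 6, 7])]
rows = [x for batch in reassemble_in_order(out_of_order) for x in batch]
print(rows)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```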
[jira] [Reopened] (ARROW-9150) Support Dictionary Unification with Concatenate
[ https://issues.apache.org/jira/browse/ARROW-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Nugent reopened ARROW-9150:
----------------------------------

> Support Dictionary Unification with Concatenate
> -----------------------------------------------
>
> Key: ARROW-9150
> URL: https://issues.apache.org/jira/browse/ARROW-9150
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Daniel Nugent
> Priority: Minor
>
> Seems to be supported through arrow to pandas conversions at the moment. I *believe* the DictionaryUnifier could be leveraged for this.
> Not sure if there are unintended consequences, but the NYI implies that this is desired. Didn't see an open issue for it already.
[jira] [Closed] (ARROW-9150) Support Dictionary Unification with Concatenate
[ https://issues.apache.org/jira/browse/ARROW-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Nugent closed ARROW-9150.
Resolution: Duplicate

> Support Dictionary Unification with Concatenate
> -----------------------------------------------
>
> Key: ARROW-9150
> URL: https://issues.apache.org/jira/browse/ARROW-9150
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Daniel Nugent
> Priority: Minor
>
> Seems to be supported through arrow to pandas conversions at the moment. I *believe* the DictionaryUnifier could be leveraged for this.
> Not sure if there are unintended consequences, but the NYI implies that this is desired. Didn't see an open issue for it already.
[jira] [Commented] (ARROW-9150) Support Dictionary Unification with Concatenate
[ https://issues.apache.org/jira/browse/ARROW-9150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138462#comment-17138462 ]

Daniel Nugent commented on ARROW-9150:
--------------------------------------

Yes, it looks like a duplicate. Sorry, I think I searched for "concat" rather than "concatenate" when looking for an existing issue.

> Support Dictionary Unification with Concatenate
> -----------------------------------------------
>
> Key: ARROW-9150
> URL: https://issues.apache.org/jira/browse/ARROW-9150
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Daniel Nugent
> Priority: Minor
>
> Seems to be supported through arrow to pandas conversions at the moment. I *believe* the DictionaryUnifier could be leveraged for this.
> Not sure if there are unintended consequences, but the NYI implies that this is desired. Didn't see an open issue for it already.
[jira] [Created] (ARROW-9150) Support Dictionary Unification with Concatenate
Daniel Nugent created ARROW-9150:
---------------------------------

Summary: Support Dictionary Unification with Concatenate
Key: ARROW-9150
URL: https://issues.apache.org/jira/browse/ARROW-9150
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Daniel Nugent

This seems to be supported through the Arrow-to-pandas conversions at the moment. I *believe* the DictionaryUnifier could be leveraged for this. I'm not sure if there are unintended consequences, but the NYI error implies that this is desired. I didn't see an open issue for it already.
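To illustrate what dictionary unification during concatenation would do, here is a pure-Python sketch (names hypothetical, not the DictionaryUnifier API): indices from each input are remapped into a single merged dictionary.

```python
def unify_and_concat(arrays):
    """Concatenate (indices, dictionary) pairs, remapping indices into one
    unified dictionary, roughly what a dictionary unifier would do."""
    unified, positions = [], {}
    out_indices = []
    for indices, dictionary in arrays:
        # Map each input dictionary entry to its slot in the unified one.
        remap = []
        for value in dictionary:
            if value not in positions:
                positions[value] = len(unified)
                unified.append(value)
            remap.append(positions[value])
        out_indices.extend(remap[i] for i in indices)
    return out_indices, unified


idx, dictionary = unify_and_concat([([0, 1], ["A", "B"]), ([0, 1], ["B", "C"])])
print(idx)         # [0, 1, 1, 2]
print(dictionary)  # ['A', 'B', 'C']
```

The key property is that overlapping dictionary values ("B" above) are stored once, and indices from every input chunk are rewritten against the merged dictionary.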