[jira] [Created] (ARROW-3659) Clang Travis build (matrix entry 2) might not actually be using clang
Philipp Moritz created ARROW-3659:
-------------------------------------

             Summary: Clang Travis build (matrix entry 2) might not actually be using clang
                 Key: ARROW-3659
                 URL: https://issues.apache.org/jira/browse/ARROW-3659
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Philipp Moritz

See for example [https://travis-ci.org/apache/arrow/jobs/448267169]:

{code:java}
Setting environment variables from .travis.yml
$ export ANACONDA_TOKEN=[secure]
$ export ARROW_TRAVIS_USE_TOOLCHAIN=1
$ export ARROW_TRAVIS_VALGRIND=1
$ export ARROW_TRAVIS_PLASMA=1
$ export ARROW_TRAVIS_ORC=1
$ export ARROW_TRAVIS_COVERAGE=1
$ export ARROW_TRAVIS_PARQUET=1
$ export ARROW_TRAVIS_PYTHON_DOCS=1
$ export ARROW_BUILD_WARNING_LEVEL=CHECKIN
$ export ARROW_TRAVIS_PYTHON_JVM=1
$ export ARROW_TRAVIS_JAVA_BUILD_ONLY=1
$ export CC="clang-6.0"
$ export CXX="clang++-6.0"
$ export TRAVIS_COMPILER=gcc
$ export CXX=g++
$ export CC=gcc
$ export PATH=/usr/lib/ccache:$PATH
cache.1
Setting up build cache
{code}

The CC and CXX environment variables are overwritten by Travis (because the Travis toolchain is set to gcc).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
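The override reported here comes down to assignment order: a later `export` of the same name wins. A minimal Python sketch that replays the relevant lines from the log in order is given below. The helper name `replay_exports` is hypothetical, and the sketch only models plain `NAME=value` exports, not real shell expansion.

```python
import shlex

# Relevant lines from the Travis log, in the order they were executed.
LOG_LINES = [
    'export CC="clang-6.0"',
    'export CXX="clang++-6.0"',
    'export TRAVIS_COMPILER=gcc',
    'export CXX=g++',
    'export CC=gcc',
]


def replay_exports(lines):
    """Apply `export NAME=value` lines in order; later assignments win."""
    env = {}
    for line in lines:
        tokens = shlex.split(line)  # also strips the double quotes
        if len(tokens) != 2 or tokens[0] != "export":
            continue  # ignore anything that is not a simple export
        name, _, value = tokens[1].partition("=")
        env[name] = value
    return env


final_env = replay_exports(LOG_LINES)
print(final_env["CC"], final_env["CXX"])  # gcc g++
```

The clang settings from `.travis.yml` are applied first and then replaced, which matches the order shown in the log.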
[jira] [Created] (ARROW-3658) [Rust] validation of offsets buffer is incorrect for `List`
Paddy Horan created ARROW-3658:
----------------------------------

             Summary: [Rust] validation of offsets buffer is incorrect for `List`
                 Key: ARROW-3658
                 URL: https://issues.apache.org/jira/browse/ARROW-3658
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust
            Reporter: Paddy Horan
            Assignee: Paddy Horan
[jira] [Created] (ARROW-3657) [R] Require bit64 package
Javier Luraschi created ARROW-3657:
--------------------------------------

             Summary: [R] Require bit64 package
                 Key: ARROW-3657
                 URL: https://issues.apache.org/jira/browse/ARROW-3657
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Javier Luraschi
            Assignee: Javier Luraschi

{code:java}
devtools::install_github("apache/arrow", subdir = "r")
{code}

{code:java}
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]) :
  there is no package called ‘bit64’
{code}
[jira] [Created] (ARROW-3656) [C++] Allow whitespace in numeric CSV fields
Antoine Pitrou created ARROW-3656:
-------------------------------------

             Summary: [C++] Allow whitespace in numeric CSV fields
                 Key: ARROW-3656
                 URL: https://issues.apache.org/jira/browse/ARROW-3656
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
    Affects Versions: 0.11.0
            Reporter: Antoine Pitrou
            Assignee: Antoine Pitrou

Pandas allows whitespace before and after numbers in CSV files, but Arrow doesn't:

{code:python}
>>> import io
>>> import pandas as pd
>>> from pyarrow import csv
>>> s = b"a,b,c\n12 , 34 , 56\n"
>>> pd.read_csv(io.BytesIO(s))
    a   b   c
0  12  34  56
>>> csv.read_csv(io.BytesIO(s)).to_pandas()
        a        b       c
0  b'12 '  b' 34 '  b' 56'
{code}
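For illustration only, the requested behaviour (trim surrounding whitespace before attempting numeric conversion) can be sketched with the Python standard library. This is not Arrow's C++ CSV reader. The helpers `parse_field` and `read_numeric_csv` are hypothetical names, and only integer conversion is modelled.

```python
import csv
import io


def parse_field(field):
    """Return an int if the trimmed field is an integer, else the raw text."""
    # Python's int() already tolerates surrounding whitespace; trimming
    # explicitly mirrors what a stricter numeric parser would need to do.
    trimmed = field.strip()
    try:
        return int(trimmed)
    except ValueError:
        return field


def read_numeric_csv(data):
    """Parse CSV bytes into (header, rows), converting integer-like cells."""
    reader = csv.reader(io.StringIO(data.decode("utf-8")))
    header = next(reader)
    rows = [[parse_field(cell) for cell in row] for row in reader]
    return header, rows


header, rows = read_numeric_csv(b"a,b,c\n12 , 34 , 56\n")
print(header, rows)  # ['a', 'b', 'c'] [[12, 34, 56]]
```

Cells that are not integers after trimming are returned unchanged, so genuine string columns keep their whitespace.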
[jira] [Created] (ARROW-3655) [Gandiva] switch away from default_memory_pool
Pindikura Ravindra created ARROW-3655:
-----------------------------------------

             Summary: [Gandiva] switch away from default_memory_pool
                 Key: ARROW-3655
                 URL: https://issues.apache.org/jira/browse/ARROW-3655
             Project: Apache Arrow
          Issue Type: Task
          Components: Gandiva
            Reporter: Pindikura Ravindra

After the changes in ARROW-3519, Gandiva uses default_memory_pool for some allocations. This needs to be replaced with the pool passed in the Evaluate call.

Also, change the signatures of all Evaluate APIs (in both project and filter) to take a pool argument.
[jira] [Created] (ARROW-3654) [Python] Column with CategoricalIndex fails to be read back
Armin Berres created ARROW-3654:
-----------------------------------

             Summary: [Python] Column with CategoricalIndex fails to be read back
                 Key: ARROW-3654
                 URL: https://issues.apache.org/jira/browse/ARROW-3654
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.11.1
            Reporter: Armin Berres

When a column with a {{CategoricalIndex}} is written, the data cannot be read back.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2'])
df['c1'] = df['c1'].astype('category')
df = df.set_index(['c1'])
table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')
pq.read_pandas('test.parquet').to_pandas()
{code}

Results in

{code}
KeyError                                  Traceback (most recent call last)
~/venv/mpptool/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _pandas_type_to_numpy_type(pandas_type)
    676     try:
--> 677         return _pandas_logical_type_map[pandas_type]
    678     except KeyError:

KeyError: 'categorical'
{code}

The schema looks good:

{code}
"column_indexes": [{"name": "c1", "field_name": "c1", "pandas_type": "categorical", "numpy_type": "int8", "metadata": {"num_categories": 2, "ordered": false}}]
{code}
[jira] [Created] (ARROW-3653) [Python/C++] Support data copying between different GPU devices
Pearu Peterson created ARROW-3653:
-------------------------------------

             Summary: [Python/C++] Support data copying between different GPU devices
                 Key: ARROW-3653
                 URL: https://issues.apache.org/jira/browse/ARROW-3653
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Pearu Peterson

Currently, data copying is supported from host to device, from device to host, and from a device to the same device. For multi-GPU systems, copying data from one device to another is also needed.

See also https://github.com/apache/arrow/pull/2844#discussion_r228910757
[jira] [Created] (ARROW-3652) [Python] CategoricalIndex is lost after reading back
Armin Berres created ARROW-3652:
-----------------------------------

             Summary: [Python] CategoricalIndex is lost after reading back
                 Key: ARROW-3652
                 URL: https://issues.apache.org/jira/browse/ARROW-3652
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Armin Berres

When a {{CategoricalIndex}} is written and read back, the resulting index is no longer categorical.

{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2'])
df['c1'] = df['c1'].astype('category')
df = df.set_index(['c1'])
table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')
ref_df = pq.read_pandas('test.parquet').to_pandas()
print(df.index)
# CategoricalIndex(['a', 'c'], categories=['a', 'c'], ordered=False, name='c1', dtype='category')
print(ref_df.index)
# Index(['a', 'c'], dtype='object', name='c1')
{code}

The metadata contains the correct information:

{code:java}
{"name": "c1", "field_name": "c1", "p'
b'andas_type": "categorical", "numpy_type": "int8", "metadata": {"'
b'num_categories": 2, "ordered": false}
{code}
[jira] [Created] (ARROW-3651) [Python] Datetimes from non-DateTimeIndex cannot be deserialized
Armin Berres created ARROW-3651:
-----------------------------------

             Summary: [Python] Datetimes from non-DateTimeIndex cannot be deserialized
                 Key: ARROW-3651
                 URL: https://issues.apache.org/jira/browse/ARROW-3651
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.11.1
            Reporter: Armin Berres

Given an index which contains datetimes but is not a DateTimeIndex, writing the file works but reading it back fails.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(1, index=pd.MultiIndex.from_arrays([[1,2],[3,4]]), columns=[pd.to_datetime("2018/01/01")])
# the columns index is no longer a DateTimeIndex
df = df.reset_index().set_index(['level_0', 'level_1'])
table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')
pq.read_pandas('test.parquet').to_pandas()
{code}

results in

{code}
KeyError                                  Traceback (most recent call last)
~/venv/mpptool/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _pandas_type_to_numpy_type(pandas_type)
    676     try:
--> 677         return _pandas_logical_type_map[pandas_type]
    678     except KeyError:

KeyError: 'datetime'
{code}

The created schema:

{code}
2018-01-01 00:00:00: int64
level_0: int64
level_1: int64
metadata
{b'pandas': b'{"index_columns": ["level_0", "level_1"], "column_indexes": [{"n'
            b'ame": null, "field_name": null, "pandas_type": "datetime", "nump'
            b'y_type": "object", "metadata": null}], "columns": [{"name": "201'
            b'8-01-01 00:00:00", "field_name": "2018-01-01 00:00:00", "pandas_'
            b'type": "int64", "numpy_type": "int64", "metadata": null}, {"name'
            b'": "level_0", "field_name": "level_0", "pandas_type": "int64", "'
            b'numpy_type": "int64", "metadata": null}, {"name": "level_1", "fi'
            b'eld_name": "level_1", "pandas_type": "int64", "numpy_type": "int'
            b'64", "metadata": null}], "pandas_version": "0.23.4"}'}
{code}
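The split byte strings in the schema dump are hard to read by eye. A small sketch using only the standard library decodes the `b'pandas'` value exactly as it appears in this report and prints what was recorded for the column index. The variable names are illustrative and nothing here calls pyarrow.

```python
import json

# The b'pandas' metadata value from the schema dump, copied verbatim.
# Python concatenates adjacent bytes literals into one value.
PANDAS_METADATA = (
    b'{"index_columns": ["level_0", "level_1"], "column_indexes": [{"n'
    b'ame": null, "field_name": null, "pandas_type": "datetime", "nump'
    b'y_type": "object", "metadata": null}], "columns": [{"name": "201'
    b'8-01-01 00:00:00", "field_name": "2018-01-01 00:00:00", "pandas_'
    b'type": "int64", "numpy_type": "int64", "metadata": null}, {"name'
    b'": "level_0", "field_name": "level_0", "pandas_type": "int64", "'
    b'numpy_type": "int64", "metadata": null}, {"name": "level_1", "fi'
    b'eld_name": "level_1", "pandas_type": "int64", "numpy_type": "int'
    b'64", "metadata": null}], "pandas_version": "0.23.4"}'
)

meta = json.loads(PANDAS_METADATA.decode("utf-8"))

# The single entry describing the column index (the DataFrame's columns).
column_index = meta["column_indexes"][0]
print(column_index["pandas_type"], column_index["numpy_type"])  # datetime object

# The data columns themselves are all plain int64.
column_types = {col["name"]: col["pandas_type"] for col in meta["columns"]}
print(column_types)
```

The decoded entry shows `pandas_type` "datetime" paired with `numpy_type` "object", which is the combination that reaches the failing lookup in the traceback.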
[jira] [Created] (ARROW-3650) [Python] Mixed column indexes are read back as strings
Armin Berres created ARROW-3650:
-----------------------------------

             Summary: [Python] Mixed column indexes are read back as strings
                 Key: ARROW-3650
                 URL: https://issues.apache.org/jira/browse/ARROW-3650
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.11.1
            Reporter: Armin Berres

Consider the following example:

{code:java}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['a string', pd.to_datetime('2018/01/02')])
table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')
ref_df = pq.read_pandas('test.parquet').to_pandas()
print(df.columns)
# Index(['a string', 2018-01-02 00:00:00], dtype='object')
print(ref_df.columns)
# Index(['a string', '2018-01-02 00:00:00'], dtype='object')
{code}

The serialized data frame has a column index with a string and a datetime field (this happened when resetting the index of a formerly datetime-only column). When reading the file back, the datetime is converted into a string.

When looking at the schema I find {{"pandas_type": "mixed", "numpy_type": "object"}} before serializing and {{"pandas_type": "unicode", "numpy_type": "object"}} after reading back. So the schema was aware of the mixed type but did not store the actual types.

The same happens with other types like numbers as well. One can produce interesting situations: {{pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1])}} can be written but fails to be read back, as the index is no longer unique with '1' showing up twice.

If this is not a bug but expected, maybe the user should somehow be warned that information is lost? Like a {{NotImplemented}} exception.
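The loss of information and the `['1', 1]` collision can both be reproduced without pyarrow by converting labels to `str`. The sketch below is one possible shape for the kind of warning suggested in this report, under stated assumptions: `stringify_labels` is a hypothetical helper, not pyarrow behaviour, and `datetime.datetime` stands in for the pandas timestamp used in the example.

```python
import datetime
import warnings
from collections import Counter


def stringify_labels(labels):
    """Convert column labels to str.

    Warns when non-string labels are converted (type information is lost)
    and raises when the conversion makes two labels identical.
    """
    converted = [str(label) for label in labels]
    lossy = [label for label in labels if not isinstance(label, str)]
    if lossy:
        warnings.warn(f"{len(lossy)} non-string column label(s) converted to str")
    duplicates = [name for name, count in Counter(converted).items() if count > 1]
    if duplicates:
        raise ValueError(f"column labels are no longer unique: {duplicates}")
    return converted


with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # keep the demo output quiet
    mixed = stringify_labels(["a string", datetime.datetime(2018, 1, 2)])
    try:
        stringify_labels(["1", 1])
        collided = False
    except ValueError:
        collided = True

print(mixed)     # ['a string', '2018-01-02 00:00:00']
print(collided)  # True
```

Whether a warning or a hard error is the better default is a design choice left open by the report; the sketch uses a warning for lossy conversion and an error only when labels collide.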
[jira] [Created] (ARROW-3648) [Plasma] Add API to get metadata and data at the same time
Yuhong Guo created ARROW-3648:
---------------------------------

             Summary: [Plasma] Add API to get metadata and data at the same time
                 Key: ARROW-3648
                 URL: https://issues.apache.org/jira/browse/ARROW-3648
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Java
            Reporter: Yuhong Guo

The current Arrow Java Plasma client has no API to get the metadata and data together in one call. If we split this process into two API calls, the object status could differ between them. Current observation shows that the first call could return empty (object not stored yet) while the second call succeeds, so the metadata and data do not match.
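The mismatch can be illustrated with a toy in-memory store. This is a sketch only: `ToyObjectStore` and its methods are hypothetical and are not the Plasma client API (Java or otherwise). It shows how two separate lookups can observe different object states, while a single combined lookup cannot.

```python
import threading


class ToyObjectStore:
    """In-memory stand-in for an object store; not the Plasma client API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}  # object_id -> (data, metadata)

    def put(self, object_id, data, metadata):
        with self._lock:
            self._objects[object_id] = (data, metadata)

    def get_data(self, object_id):
        with self._lock:
            entry = self._objects.get(object_id)
        return entry[0] if entry is not None else None

    def get_metadata(self, object_id):
        with self._lock:
            entry = self._objects.get(object_id)
        return entry[1] if entry is not None else None

    def get_data_and_metadata(self, object_id):
        """Return (data, metadata) from one lookup, or None if absent."""
        with self._lock:
            return self._objects.get(object_id)


store = ToyObjectStore()

# Two separate calls: the object can be stored between them.
first = store.get_metadata(b"id")      # None, object not stored yet
store.put(b"id", b"payload", b"meta")  # another client stores the object
second = store.get_data(b"id")         # b"payload"

# One call: both values come from the same lookup.
both = store.get_data_and_metadata(b"id")
print(first, second, both)
```

With the two-call pattern the caller ends up holding data but no metadata for the same object; the combined call returns either nothing or a matching pair.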