[jira] [Created] (ARROW-3659) Clang Travis build (matrix entry 2) might not actually be using clang

2018-10-30 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3659:
-

 Summary: Clang Travis build (matrix entry 2) might not actually be 
using clang
 Key: ARROW-3659
 URL: https://issues.apache.org/jira/browse/ARROW-3659
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


See for example https://travis-ci.org/apache/arrow/jobs/448267169:
{code:java}
Setting environment variables from .travis.yml
$ export ANACONDA_TOKEN=[secure]
$ export ARROW_TRAVIS_USE_TOOLCHAIN=1
$ export ARROW_TRAVIS_VALGRIND=1
$ export ARROW_TRAVIS_PLASMA=1
$ export ARROW_TRAVIS_ORC=1
$ export ARROW_TRAVIS_COVERAGE=1
$ export ARROW_TRAVIS_PARQUET=1
$ export ARROW_TRAVIS_PYTHON_DOCS=1
$ export ARROW_BUILD_WARNING_LEVEL=CHECKIN
$ export ARROW_TRAVIS_PYTHON_JVM=1
$ export ARROW_TRAVIS_JAVA_BUILD_ONLY=1
$ export CC="clang-6.0"
$ export CXX="clang++-6.0"
$ export TRAVIS_COMPILER=gcc
$ export CXX=g++
$ export CC=gcc
$ export PATH=/usr/lib/ccache:$PATH
cache.1
Setting up build cache{code}
The CC and CXX environment variables set in .travis.yml are overwritten by 
Travis afterwards (because the Travis compiler is set to gcc).
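
The effect can be reproduced by replaying the export lines from the log in order, since a later export of the same variable wins. A minimal sketch in Python; the helper name final_exports is hypothetical and it only handles the simple "$ export NAME=value" lines shown in the log above:

```python
def final_exports(log_lines):
    """Replay `$ export NAME=value` lines in order; later exports overwrite earlier ones."""
    env = {}
    prefix = "$ export "
    for line in log_lines:
        line = line.strip()
        if not line.startswith(prefix):
            continue  # skip non-export lines such as "cache.1"
        name, _, value = line[len(prefix):].partition("=")
        env[name] = value.strip('"')
    return env


# Export lines copied from the Travis log, in their original order.
log = [
    '$ export CC="clang-6.0"',
    '$ export CXX="clang++-6.0"',
    "$ export TRAVIS_COMPILER=gcc",
    "$ export CXX=g++",
    "$ export CC=gcc",
]
env = final_exports(log)
print(env["CC"], env["CXX"])  # prints: gcc g++
```

The clang values never reach the build because the gcc exports come last.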



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3658) [Rust] validation of offsets buffer is incorrect for `List`

2018-10-30 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-3658:
--

 Summary: [Rust] validation of offsets buffer is incorrect for 
`List`
 Key: ARROW-3658
 URL: https://issues.apache.org/jira/browse/ARROW-3658
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan








[jira] [Created] (ARROW-3657) [R] Require bit64 package

2018-10-30 Thread Javier Luraschi (JIRA)
Javier Luraschi created ARROW-3657:
--

 Summary: [R] Require bit64 package
 Key: ARROW-3657
 URL: https://issues.apache.org/jira/browse/ARROW-3657
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Javier Luraschi
Assignee: Javier Luraschi


{code:java}
devtools::install_github("apache/arrow", subdir = "r")
{code}
{code:java}
Error in loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = 
vI[[j]]) : there is no package called ‘bit64’
{code}





[jira] [Created] (ARROW-3656) [C++] Allow whitespace in numeric CSV fields

2018-10-30 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3656:
-

 Summary: [C++] Allow whitespace in numeric CSV fields
 Key: ARROW-3656
 URL: https://issues.apache.org/jira/browse/ARROW-3656
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.11.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


Pandas allows whitespace before and after numbers in CSV files, but Arrow 
doesn't:
{code:python}
>>> import io
>>> import pandas as pd
>>> from pyarrow import csv
>>> s = b"a,b,c\n12 , 34 , 56\n"
>>> pd.read_csv(io.BytesIO(s))
    a   b   c
0  12  34  56
>>> csv.read_csv(io.BytesIO(s)).to_pandas()
        a        b       c
0  b'12 '  b' 34 '  b' 56'
{code}
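
One way to get the Pandas behavior is to strip surrounding whitespace before attempting the numeric conversion, and to fall back to the raw bytes if any value in the column still fails to parse. A minimal Python sketch of that idea follows; parse_int_field and convert_column are hypothetical names, and this does not reflect how the Arrow C++ CSV reader is implemented:

```python
def parse_int_field(raw):
    """Return the integer in `raw` (bytes), ignoring surrounding spaces/tabs, or None."""
    # Explicit strip: a lower-level parser would have to skip this whitespace itself.
    stripped = raw.strip(b" \t")
    try:
        return int(stripped)
    except ValueError:
        return None


def convert_column(values):
    """Convert a column to ints if every field parses, else keep the raw bytes."""
    parsed = [parse_int_field(v) for v in values]
    if all(p is not None for p in parsed):
        return parsed
    return list(values)


print(convert_column([b"12 ", b" 34 ", b" 56"]))  # [12, 34, 56]
print(convert_column([b"12 ", b"abc"]))           # [b'12 ', b'abc']
```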






[jira] [Created] (ARROW-3655) [Gandiva] switch away from default_memory_pool

2018-10-30 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-3655:
-

 Summary: [Gandiva] switch away from default_memory_pool
 Key: ARROW-3655
 URL: https://issues.apache.org/jira/browse/ARROW-3655
 Project: Apache Arrow
  Issue Type: Task
  Components: Gandiva
Reporter: Pindikura Ravindra


After the changes for ARROW-3519, Gandiva uses default_memory_pool for some 
allocations. This needs to be replaced with the pool passed in the Evaluate 
call.

Also, change the signatures of all Evaluate APIs (both in project and filter) to 
take a pool argument.





[jira] [Created] (ARROW-3654) [Python] Column with CategoricalIndex fails to be read back

2018-10-30 Thread Armin Berres (JIRA)
Armin Berres created ARROW-3654:
---

 Summary: [Python] Column with CategoricalIndex fails to be read 
back
 Key: ARROW-3654
 URL: https://issues.apache.org/jira/browse/ARROW-3654
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.11.1
Reporter: Armin Berres


When a column with a {{CategoricalIndex}} is written, the data can never be read 
back.

 {code:python}
df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2'])
df['c1'] = df['c1'].astype('category')
df = df.set_index(['c1'])

table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')

pq.read_pandas('test.parquet').to_pandas()
{code}

Results in

{code}
KeyError                                  Traceback (most recent call last)
~/venv/mpptool/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _pandas_type_to_numpy_type(pandas_type)
    676     try:
--> 677         return _pandas_logical_type_map[pandas_type]
    678     except KeyError:

KeyError: 'categorical'
{code}

The schema looks good:
{code}
"column_indexes": [{"name": "c1", "field_name": "c1", "pandas_type": 
"categorical", "numpy_type": "int8", "metadata": {"num_categories": 2, 
"ordered": false}}]
{code}





[jira] [Created] (ARROW-3653) [Python/C++] Support data copying between different GPU devices

2018-10-30 Thread Pearu Peterson (JIRA)
Pearu Peterson created ARROW-3653:
-

 Summary: [Python/C++] Support data copying between different GPU 
devices
 Key: ARROW-3653
 URL: https://issues.apache.org/jira/browse/ARROW-3653
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Python
Reporter: Pearu Peterson


Currently, the data copying is supported from host to device, from device to 
host, from device to the same device. For multiple GPU systems, copying data 
from one device to another is needed.

See also
https://github.com/apache/arrow/pull/2844#discussion_r228910757





[jira] [Created] (ARROW-3652) [Python] CategoricalIndex is lost after reading back

2018-10-30 Thread Armin Berres (JIRA)
Armin Berres created ARROW-3652:
---

 Summary: [Python] CategoricalIndex is lost after reading back
 Key: ARROW-3652
 URL: https://issues.apache.org/jira/browse/ARROW-3652
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Armin Berres


When a {{CategoricalIndex}} is written and read back, the resulting index is no 
longer categorical.
{code}
df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2'])
df['c1'] = df['c1'].astype('category')
df = df.set_index(['c1'])

table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')

ref_df = pq.read_pandas('test.parquet').to_pandas()

print(df.index)
# CategoricalIndex(['a', 'c'], categories=['a', 'c'], ordered=False, name='c1', dtype='category')

print(ref_df.index)
# Index(['a', 'c'], dtype='object', name='c1')
{code}
The metadata does contain the categorical information:
{code:java}
{"name": "c1", "field_name": "c1", "pandas_type": "categorical", 
"numpy_type": "int8", "metadata": {"num_categories": 2, "ordered": false}}
{code}
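
Since the entry is plain JSON, a reader could in principle use it to decide whether an index should be restored as categorical. A minimal sketch using only the standard library; describe_index is a hypothetical helper and not part of pyarrow:

```python
import json

# The index entry from the pandas metadata shown above.
entry_json = (
    '{"name": "c1", "field_name": "c1", "pandas_type": "categorical", '
    '"numpy_type": "int8", "metadata": {"num_categories": 2, "ordered": false}}'
)


def describe_index(entry):
    """Return (is_categorical, ordered) for one index entry of the pandas metadata."""
    if entry.get("pandas_type") != "categorical":
        return (False, None)
    extra = entry.get("metadata") or {}
    return (True, extra.get("ordered"))


print(describe_index(json.loads(entry_json)))  # (True, False)
```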
 





[jira] [Created] (ARROW-3651) [Python] Datetimes from non-DateTimeIndex cannot be deserialized

2018-10-30 Thread Armin Berres (JIRA)
Armin Berres created ARROW-3651:
---

 Summary: [Python] Datetimes from non-DateTimeIndex cannot be 
deserialized
 Key: ARROW-3651
 URL: https://issues.apache.org/jira/browse/ARROW-3651
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.11.1
Reporter: Armin Berres


Given an index that contains datetimes but is not a DatetimeIndex, writing the 
file works but reading it back fails.
{code:python}
df = pd.DataFrame(1, index=pd.MultiIndex.from_arrays([[1,2],[3,4]]), 
columns=[pd.to_datetime("2018/01/01")])

# the columns index is no longer a DatetimeIndex
df = df.reset_index().set_index(['level_0', 'level_1'])

table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')

pq.read_pandas('test.parquet').to_pandas()
{code}

results in 
{code}
KeyError                                  Traceback (most recent call last)
~/venv/mpptool/lib/python3.7/site-packages/pyarrow/pandas_compat.py in _pandas_type_to_numpy_type(pandas_type)
    676     try:
--> 677         return _pandas_logical_type_map[pandas_type]
    678     except KeyError:

KeyError: 'datetime'
{code}

The created schema:

{code}
2018-01-01 00:00:00: int64
level_0: int64
level_1: int64
metadata

{b'pandas': b'{"index_columns": ["level_0", "level_1"], "column_indexes": [{"n'
b'ame": null, "field_name": null, "pandas_type": "datetime", "nump'
b'y_type": "object", "metadata": null}], "columns": [{"name": "201'
b'8-01-01 00:00:00", "field_name": "2018-01-01 00:00:00", "pandas_'
b'type": "int64", "numpy_type": "int64", "metadata": null}, {"name'
b'": "level_0", "field_name": "level_0", "pandas_type": "int64", "'
b'numpy_type": "int64", "metadata": null}, {"name": "level_1", "fi'
b'eld_name": "level_1", "pandas_type": "int64", "numpy_type": "int'
b'64", "metadata": null}], "pandas_version": "0.23.4"}'}
{code}





[jira] [Created] (ARROW-3650) [Python] Mixed column indexes are read back as strings

2018-10-30 Thread Armin Berres (JIRA)
Armin Berres created ARROW-3650:
---

 Summary: [Python] Mixed column indexes are read back as strings 
 Key: ARROW-3650
 URL: https://issues.apache.org/jira/browse/ARROW-3650
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.11.1
Reporter: Armin Berres


Consider the following example: 

{code:java}
df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['a string', 
pd.to_datetime('2018/01/02')])

table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')

ref_df = pq.read_pandas('test.parquet').to_pandas()

print(df.columns)
# Index(['a string', 2018-01-02 00:00:00], dtype='object')
print(ref_df.columns)
# Index(['a string', '2018-01-02 00:00:00'], dtype='object')
{code}

The serialized data frame has a column index with a string and a datetime label 
(this happened when resetting the index of a formerly datetime-only column index).
When reading the data back, the datetime is converted into a string.

When looking at the schema I find {{"pandas_type": "mixed", "numpy_type": "object"}} 
before serializing and {{"pandas_type": "unicode", "numpy_type": "object"}} after 
reading back. So the schema was aware of the mixed type but did not store the 
actual types.

The same happens with other types, such as numbers. One can produce 
interesting situations:

{{pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1])}} can 
be written but fails to be read back, as the index is no longer unique, with '1' 
showing up twice.

If this is not a bug but expected behavior, maybe the user should be warned that 
information is lost, for example with a {{NotImplementedError}} exception?
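
The collision in the last example can be detected before writing. The sketch below assumes that every column label ends up as its str() representation on disk (an assumption made for illustration; find_label_collisions is a hypothetical helper, not a pyarrow function) and reports labels that would become indistinguishable:

```python
from datetime import datetime


def find_label_collisions(labels):
    """Group labels by their str() form and return the groups with more than one member."""
    groups = {}
    for label in labels:
        groups.setdefault(str(label), []).append(label)
    return {text: originals for text, originals in groups.items() if len(originals) > 1}


# The string '1' and the integer 1 collapse to the same text.
print(find_label_collisions(["1", 1]))  # {'1': ['1', 1]}
# A string and a datetime stay distinct, but the datetime's type is still lost.
print(find_label_collisions(["a string", datetime(2018, 1, 2)]))  # {}
```

A writer could raise or warn when the returned dict is non-empty instead of producing a file that cannot be read back.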





[jira] [Created] (ARROW-3648) [Plasma] Add API to get metadata and data at the same time

2018-10-30 Thread Yuhong Guo (JIRA)
Yuhong Guo created ARROW-3648:
-

 Summary: [Plasma] Add API to get metadata and data at the same time
 Key: ARROW-3648
 URL: https://issues.apache.org/jira/browse/ARROW-3648
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Yuhong Guo


The current Arrow Java Plasma client has no API to get the metadata and data 
together in one call. If we split this process into two API calls, the 
object status can differ between them. Current observation shows that the first 
call can return empty (object not stored yet) while the second call succeeds, so 
the metadata and data do not match.
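
The race can be illustrated with a toy in-memory store. Everything below is hypothetical Python for illustration (ToyStore and its methods are not the Plasma client API): two separate lookups can straddle a put, while a single call that reads both values under one lock cannot.

```python
import threading


class ToyStore:
    """Toy stand-in for an object store; NOT the Plasma client API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._objects = {}  # object_id -> (data, metadata)

    def put(self, object_id, data, metadata):
        with self._lock:
            self._objects[object_id] = (data, metadata)

    def get_data(self, object_id):
        with self._lock:
            entry = self._objects.get(object_id)
            return None if entry is None else entry[0]

    def get_metadata(self, object_id):
        with self._lock:
            entry = self._objects.get(object_id)
            return None if entry is None else entry[1]

    def get_both(self, object_id):
        # One lookup under one lock: data and metadata come from the same state.
        with self._lock:
            return self._objects.get(object_id)


store = ToyStore()
data = store.get_data(b"id")            # object not stored yet, so None
store.put(b"id", b"payload", b"meta")   # another client stores the object in between
meta = store.get_metadata(b"id")        # second call succeeds
print(data, meta)                       # None b'meta' (mismatched pair)
print(store.get_both(b"id"))            # (b'payload', b'meta')
```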



