[jira] [Created] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-13 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-2308:
---

 Summary: Serialized tensor data should be 64-byte aligned.
 Key: ARROW-2308
 URL: https://issues.apache.org/jira/browse/ARROW-2308
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Robert Nishihara


See [https://github.com/ray-project/ray/issues/1658] for an example of this 
issue. Non-aligned data can trigger an extra copy when fed into TensorFlow and 
similar libraries.
{code}
import pyarrow as pa
import numpy as np

x = np.zeros(10)
y = pa.deserialize(pa.serialize(x).to_buffer())

x.ctypes.data % 64  # 0 (it starts out aligned)
y.ctypes.data % 64  # 48 (it is no longer aligned)
{code}
It should be possible to fix this by calling something like 
{{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. 
Note that we already do this before writing the tensor header, but the tensor 
header is not necessarily a multiple of 64 bytes, so the subsequent data can be 
unaligned.
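
Until that is fixed on the write side, the unaligned result can at least be 
detected (and worked around) on the consumer side; a minimal user-side sketch, 
not the proposed {{AlignStreamPosition}} fix:
{code:python}
import numpy as np
import pyarrow as pa

def ensure_alignment(arr, alignment=64):
    """Return arr, copying it if its data pointer is not aligned.

    Note: np.copy only gives whatever alignment the default allocator
    provides; this avoids sharing the unaligned IPC buffer but does not
    strictly guarantee 64-byte alignment.
    """
    if arr.ctypes.data % alignment == 0:
        return arr
    return np.copy(arr)

x = np.zeros(10)
y = pa.deserialize(pa.serialize(x).to_buffer())
y = ensure_alignment(y)
{code}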



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2304) [C++] MultipleClients test in io-hdfs-test fails on trunk

2018-03-13 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2304:
--
Labels: pull-request-available  (was: )

> [C++] MultipleClients test in io-hdfs-test fails on trunk
> -
>
> Key: ARROW-2304
> URL: https://issues.apache.org/jira/browse/ARROW-2304
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This fails for me locally:
> {code}
> [ RUN  ] TestHadoopFileSystem/0.MultipleClients
> ../src/arrow/io/io-hdfs-test.cc:192: Failure
> Value of: s.ok()
>   Actual: false
> Expected: true
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2304) [C++] MultipleClients test in io-hdfs-test fails on trunk

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398064#comment-16398064
 ] 

ASF GitHub Bot commented on ARROW-2304:
---

wesm opened a new pull request #1743: ARROW-2304: [C++] Fix HDFS 
MultipleClients unit test
URL: https://github.com/apache/arrow/pull/1743
 
 
   This test was failing because the `scratch_dir_` directory did not exist.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] MultipleClients test in io-hdfs-test fails on trunk
> -
>
> Key: ARROW-2304
> URL: https://issues.apache.org/jira/browse/ARROW-2304
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This fails for me locally:
> {code}
> [ RUN  ] TestHadoopFileSystem/0.MultipleClients
> ../src/arrow/io/io-hdfs-test.cc:192: Failure
> Value of: s.ok()
>   Actual: false
> Expected: true
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2304) [C++] MultipleClients test in io-hdfs-test fails on trunk

2018-03-13 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2304:
---

Assignee: Wes McKinney

> [C++] MultipleClients test in io-hdfs-test fails on trunk
> -
>
> Key: ARROW-2304
> URL: https://issues.apache.org/jira/browse/ARROW-2304
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Critical
> Fix For: 0.9.0
>
>
> This fails for me locally:
> {code}
> [ RUN  ] TestHadoopFileSystem/0.MultipleClients
> ../src/arrow/io/io-hdfs-test.cc:192: Failure
> Value of: s.ok()
>   Actual: false
> Expected: true
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2306) [Python] HDFS test failures

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16398063#comment-16398063
 ] 

ASF GitHub Bot commented on ARROW-2306:
---

wesm opened a new pull request #1742: ARROW-2306: [Python] Fix partitioned 
Parquet test against HDFS
URL: https://github.com/apache/arrow/pull/1742
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] HDFS test failures
> ---
>
> Key: ARROW-2306
> URL: https://issues.apache.org/jira/browse/ARROW-2306
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> These weren't caught because we aren't running the HDFS tests in Travis CI
> {code}
> pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_no_partitions 
> FAILED
> >>> traceback 
> >>> 
> self =  testMethod=test_write_to_dataset_no_partitions>
> @test_parquet.parquet
> def test_write_to_dataset_no_partitions(self):
> tmpdir = pjoin(self.tmp_path, 'write-no_partitions-' + guid())
> self.hdfs.mkdir(tmpdir)
> test_parquet._test_write_to_dataset_no_partitions(
> >   tmpdir, filesystem=self.hdfs)
> pyarrow/tests/test_hdfs.py:367: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> pyarrow/tests/test_parquet.py:1475: in _test_write_to_dataset_no_partitions
> filesystem=filesystem)
> pyarrow/parquet.py:1059: in write_to_dataset
> _mkdir_if_not_exists(fs, root_path)
> pyarrow/parquet.py:1006: in _mkdir_if_not_exists
> if fs._isfilestore() and not fs.exists(path):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> self = 
> def _isfilestore(self):
> """
> Returns True if this FileSystem is a unix-style file store with
> directories.
> """
> >   raise NotImplementedError
> E   NotImplementedError
> pyarrow/filesystem.py:143: NotImplementedError
> >> entering PDB 
> >> >>
> > /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore()
> -> raise NotImplementedError
> (Pdb) c
> pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_with_partitions
>  FAILED
> >>> traceback 
> >>> 
> self =  testMethod=test_write_to_dataset_with_partitions>
> @test_parquet.parquet
> def test_write_to_dataset_with_partitions(self):
> tmpdir = pjoin(self.tmp_path, 'write-partitions-' + guid())
> self.hdfs.mkdir(tmpdir)
> test_parquet._test_write_to_dataset_with_partitions(
> >   tmpdir, filesystem=self.hdfs)
> pyarrow/tests/test_hdfs.py:360: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> pyarrow/tests/test_parquet.py:1433: in _test_write_to_dataset_with_partitions
> filesystem=filesystem)
> pyarrow/parquet.py:1059: in write_to_dataset
> _mkdir_if_not_exists(fs, root_path)
> pyarrow/parquet.py:1006: in _mkdir_if_not_exists
> if fs._isfilestore() and not fs.exists(path):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> self = 
> def _isfilestore(self):
> """
> Returns True if this FileSystem is a unix-style file store with
> directories.
> """
> >   raise NotImplementedError
> E   NotImplementedError
> pyarrow/filesystem.py:143: NotImplementedError
> >> entering PDB 
> >> >>
> > /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore()
> -> raise NotImplementedError
> (Pdb) c
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2306) [Python] HDFS test failures

2018-03-13 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2306:
--
Labels: pull-request-available  (was: )

> [Python] HDFS test failures
> ---
>
> Key: ARROW-2306
> URL: https://issues.apache.org/jira/browse/ARROW-2306
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> These weren't caught because we aren't running the HDFS tests in Travis CI
> {code}
> pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_no_partitions 
> FAILED
> >>> traceback 
> >>> 
> self =  testMethod=test_write_to_dataset_no_partitions>
> @test_parquet.parquet
> def test_write_to_dataset_no_partitions(self):
> tmpdir = pjoin(self.tmp_path, 'write-no_partitions-' + guid())
> self.hdfs.mkdir(tmpdir)
> test_parquet._test_write_to_dataset_no_partitions(
> >   tmpdir, filesystem=self.hdfs)
> pyarrow/tests/test_hdfs.py:367: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> pyarrow/tests/test_parquet.py:1475: in _test_write_to_dataset_no_partitions
> filesystem=filesystem)
> pyarrow/parquet.py:1059: in write_to_dataset
> _mkdir_if_not_exists(fs, root_path)
> pyarrow/parquet.py:1006: in _mkdir_if_not_exists
> if fs._isfilestore() and not fs.exists(path):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> self = 
> def _isfilestore(self):
> """
> Returns True if this FileSystem is a unix-style file store with
> directories.
> """
> >   raise NotImplementedError
> E   NotImplementedError
> pyarrow/filesystem.py:143: NotImplementedError
> >> entering PDB 
> >> >>
> > /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore()
> -> raise NotImplementedError
> (Pdb) c
> pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_with_partitions
>  FAILED
> >>> traceback 
> >>> 
> self =  testMethod=test_write_to_dataset_with_partitions>
> @test_parquet.parquet
> def test_write_to_dataset_with_partitions(self):
> tmpdir = pjoin(self.tmp_path, 'write-partitions-' + guid())
> self.hdfs.mkdir(tmpdir)
> test_parquet._test_write_to_dataset_with_partitions(
> >   tmpdir, filesystem=self.hdfs)
> pyarrow/tests/test_hdfs.py:360: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> pyarrow/tests/test_parquet.py:1433: in _test_write_to_dataset_with_partitions
> filesystem=filesystem)
> pyarrow/parquet.py:1059: in write_to_dataset
> _mkdir_if_not_exists(fs, root_path)
> pyarrow/parquet.py:1006: in _mkdir_if_not_exists
> if fs._isfilestore() and not fs.exists(path):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> self = 
> def _isfilestore(self):
> """
> Returns True if this FileSystem is a unix-style file store with
> directories.
> """
> >   raise NotImplementedError
> E   NotImplementedError
> pyarrow/filesystem.py:143: NotImplementedError
> >> entering PDB 
> >> >>
> > /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore()
> -> raise NotImplementedError
> (Pdb) c
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison

2018-03-13 Thread Alex Hagerman (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397909#comment-16397909
 ] 

Alex Hagerman commented on ARROW-640:
-

Thanks [~pitrou] this was actually what I had implemented locally so glad to 
see I was on the right track. Tonight I was working on doing a little bit of 
benchmarking and writing the tests. Any specific loads or types you might want 
to see related to the speed concern? Or is it better to get a consistent hash 
implementation like this setup in a PR and then worry about speed?
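
For what it's worth, the timing question could be probed with a tiny 
micro-benchmark along these lines (illustrative only; it assumes the 
{{as_py()}}-based {{__hash__}} discussed in this issue is in place):
{code:python}
import timeit

import pyarrow as pa

arr = pa.array([1, 1, 1, 2] * 2500)  # 10,000 scalar values

def hash_arrow_values():
    return [hash(v) for v in arr]

def hash_python_values():
    return [hash(v.as_py()) for v in arr]

print("arrow values :", timeit.timeit(hash_arrow_values, number=100))
print("python values:", timeit.timeit(hash_python_values, number=100))
{code}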

> [Python] Arrow scalar values should have a sensible __hash__ and comparison
> ---
>
> Key: ARROW-640
> URL: https://issues.apache.org/jira/browse/ARROW-640
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Miki Tebeka
>Assignee: Alex Hagerman
>Priority: Major
> Fix For: 0.10.0
>
>
> {noformat}
> In [86]: arr = pa.from_pylist([1, 1, 1, 2])
> In [87]: set(arr)
> Out[87]: {1, 2, 1, 1}
> In [88]: arr[0] == arr[1]
> Out[88]: False
> In [89]: arr
> Out[89]: 
> 
> [
>   1,
>   1,
>   1,
>   2
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2307) Unable to read arrow stream containing 0 record batches using pyarrow

2018-03-13 Thread Benjamin Duffield (JIRA)
Benjamin Duffield created ARROW-2307:


 Summary: Unable to read arrow stream containing 0 record batches 
using pyarrow
 Key: ARROW-2307
 URL: https://issues.apache.org/jira/browse/ARROW-2307
 Project: Apache Arrow
  Issue Type: Bug
  Components: C, Python
Affects Versions: 0.8.0
Reporter: Benjamin Duffield


Using the Java stream writer, I'm creating an Arrow stream.

Sometimes I don't have anything to serialize, and so I don't write any record 
batches; the stream then consists of just a schema message.

I am able to deserialize this stream correctly using the Java stream reader, 
but when reading it with Python I instead hit an error
{code}
import pyarrow as pa
# ...
reader = pa.open_stream(stream)
df = reader.read_all().to_pandas()
{code}

produces

{code}
  File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
  File "error.pxi", line 77, in pyarrow.lib.check_status
ArrowInvalid: Must pass at least one record batch
{code}

i.e. we're hitting the check in 
https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284

The workaround we're currently using is to always ensure we serialize at least 
one record batch, even if it's empty. However, I think it would be nice to 
either support a stream without record batches, or explicitly disallow this 
and then match the behaviour in Java.
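
On the reading side, a workaround sketch that tolerates a batch-less stream 
could look like this (API details such as iterating the reader may vary by 
pyarrow version):
{code:python}
import pandas as pd
import pyarrow as pa

def read_stream_to_pandas(stream):
    reader = pa.open_stream(stream)
    batches = [b for b in reader]  # collect batches instead of read_all()
    if batches:
        return pa.Table.from_batches(batches).to_pandas()
    # No record batches: build an empty frame from the schema's column names.
    return pd.DataFrame({field.name: [] for field in reader.schema})
{code}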



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2305) [Python] Cython 0.25.2 compilation failure

2018-03-13 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397652#comment-16397652
 ] 

Uwe L. Korn commented on ARROW-2305:


Would raising the minimum required Cython version be OK for us?

> [Python] Cython 0.25.2 compilation failure 
> ---
>
> Key: ARROW-2305
> URL: https://issues.apache.org/jira/browse/ARROW-2305
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Observed on master branch
> {code}
> Error compiling Cython file:
> 
> ...
> if hasattr(self, 'as_py'):
> return repr(self.as_py())
> else:
> return super(Scalar, self).__repr__()
> def __eq__(self, other):
>^
> 
> /home/wesm/code/arrow/python/pyarrow/scalar.pxi:67:4: Special method __eq__ 
> must be implemented via __richcmp__
> Error compiling Cython file:
> 
> ...
> Return true if the tensors contains exactly equal data
> """
> self._validate()
> return self.tp.Equals(deref(other.tp))
> def __eq__(self, other):
>^
> 
> /home/wesm/code/arrow/python/pyarrow/array.pxi:571:4: Special method __eq__ 
> must be implemented via __richcmp__
> Error compiling Cython file:
> 
> ...
> cdef c_bool result = False
> with nogil:
> result = self.buffer.get().Equals(deref(other.buffer.get()))
> return result
> def __eq__(self, other):
>^
> 
> /home/wesm/code/arrow/python/pyarrow/io.pxi:675:4: Special method __eq__ must 
> be implemented via __richcmp__
> {code}
> Upgrading Cython made this go away. We should probably use {{__richcmp__}} 
> though



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2306) [Python] HDFS test failures

2018-03-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2306:
---

 Summary: [Python] HDFS test failures
 Key: ARROW-2306
 URL: https://issues.apache.org/jira/browse/ARROW-2306
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.9.0


These weren't caught because we aren't running the HDFS tests in Travis CI

{code}
pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_no_partitions 
FAILED
>>> traceback 
>>> 

self = 

@test_parquet.parquet
def test_write_to_dataset_no_partitions(self):
tmpdir = pjoin(self.tmp_path, 'write-no_partitions-' + guid())
self.hdfs.mkdir(tmpdir)
test_parquet._test_write_to_dataset_no_partitions(
>   tmpdir, filesystem=self.hdfs)

pyarrow/tests/test_hdfs.py:367: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/tests/test_parquet.py:1475: in _test_write_to_dataset_no_partitions
filesystem=filesystem)
pyarrow/parquet.py:1059: in write_to_dataset
_mkdir_if_not_exists(fs, root_path)
pyarrow/parquet.py:1006: in _mkdir_if_not_exists
if fs._isfilestore() and not fs.exists(path):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 

def _isfilestore(self):
"""
Returns True if this FileSystem is a unix-style file store with
directories.
"""
>   raise NotImplementedError
E   NotImplementedError

pyarrow/filesystem.py:143: NotImplementedError
>> entering PDB 
>> >>
> /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore()
-> raise NotImplementedError
(Pdb) c

pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_with_partitions 
FAILED
>>> traceback 
>>> 

self = 

@test_parquet.parquet
def test_write_to_dataset_with_partitions(self):
tmpdir = pjoin(self.tmp_path, 'write-partitions-' + guid())
self.hdfs.mkdir(tmpdir)
test_parquet._test_write_to_dataset_with_partitions(
>   tmpdir, filesystem=self.hdfs)

pyarrow/tests/test_hdfs.py:360: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pyarrow/tests/test_parquet.py:1433: in _test_write_to_dataset_with_partitions
filesystem=filesystem)
pyarrow/parquet.py:1059: in write_to_dataset
_mkdir_if_not_exists(fs, root_path)
pyarrow/parquet.py:1006: in _mkdir_if_not_exists
if fs._isfilestore() and not fs.exists(path):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 

def _isfilestore(self):
"""
Returns True if this FileSystem is a unix-style file store with
directories.
"""
>   raise NotImplementedError
E   NotImplementedError

pyarrow/filesystem.py:143: NotImplementedError
>> entering PDB 
>> >>
> /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore()
-> raise NotImplementedError
(Pdb) c
{code}
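
For anyone hitting this before the fix lands, one stopgap is to give the HDFS 
filesystem wrapper a concrete {{_isfilestore}}; a sketch only, not necessarily 
the change made in the eventual patch:
{code:python}
import pyarrow.hdfs

def _isfilestore(self):
    # HDFS is a directory-based store, so write_to_dataset's mkdir/exists
    # logic applies to it.
    return True

# Monkey-patch until pyarrow implements this on the HDFS filesystem class.
pyarrow.hdfs.HadoopFileSystem._isfilestore = _isfilestore
{code}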



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2305) [Python] Cython 0.25.2 compilation failure

2018-03-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2305:
---

 Summary: [Python] Cython 0.25.2 compilation failure 
 Key: ARROW-2305
 URL: https://issues.apache.org/jira/browse/ARROW-2305
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.9.0


Observed on master branch

{code}
Error compiling Cython file:

...
if hasattr(self, 'as_py'):
return repr(self.as_py())
else:
return super(Scalar, self).__repr__()

def __eq__(self, other):
   ^


/home/wesm/code/arrow/python/pyarrow/scalar.pxi:67:4: Special method __eq__ 
must be implemented via __richcmp__

Error compiling Cython file:

...
Return true if the tensors contains exactly equal data
"""
self._validate()
return self.tp.Equals(deref(other.tp))

def __eq__(self, other):
   ^


/home/wesm/code/arrow/python/pyarrow/array.pxi:571:4: Special method __eq__ 
must be implemented via __richcmp__

Error compiling Cython file:

...
cdef c_bool result = False
with nogil:
result = self.buffer.get().Equals(deref(other.buffer.get()))
return result

def __eq__(self, other):
   ^


/home/wesm/code/arrow/python/pyarrow/io.pxi:675:4: Special method __eq__ must 
be implemented via __richcmp__
{code}

Upgrading Cython made this go away. We should probably use {{__richcmp__}} 
though



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2305) [Python] Cython 0.25.2 compilation failure

2018-03-13 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2305:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] Cython 0.25.2 compilation failure 
> ---
>
> Key: ARROW-2305
> URL: https://issues.apache.org/jira/browse/ARROW-2305
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Observed on master branch
> {code}
> Error compiling Cython file:
> 
> ...
> if hasattr(self, 'as_py'):
> return repr(self.as_py())
> else:
> return super(Scalar, self).__repr__()
> def __eq__(self, other):
>^
> 
> /home/wesm/code/arrow/python/pyarrow/scalar.pxi:67:4: Special method __eq__ 
> must be implemented via __richcmp__
> Error compiling Cython file:
> 
> ...
> Return true if the tensors contains exactly equal data
> """
> self._validate()
> return self.tp.Equals(deref(other.tp))
> def __eq__(self, other):
>^
> 
> /home/wesm/code/arrow/python/pyarrow/array.pxi:571:4: Special method __eq__ 
> must be implemented via __richcmp__
> Error compiling Cython file:
> 
> ...
> cdef c_bool result = False
> with nogil:
> result = self.buffer.get().Equals(deref(other.buffer.get()))
> return result
> def __eq__(self, other):
>^
> 
> /home/wesm/code/arrow/python/pyarrow/io.pxi:675:4: Special method __eq__ must 
> be implemented via __richcmp__
> {code}
> Upgrading Cython made this go away. We should probably use {{__richcmp__}} 
> though



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397626#comment-16397626
 ] 

ASF GitHub Bot commented on ARROW-2227:
---

wesm closed pull request #1740: ARROW-2227: [Python] Fix off-by-one error in 
chunked binary conversions
URL: https://github.com/apache/arrow/pull/1740
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc
index ef4e7fde9..aa9f3ce42 100644
--- a/cpp/src/arrow/builder.cc
+++ b/cpp/src/arrow/builder.cc
@@ -1236,7 +1236,7 @@ Status ListBuilder::Append(const int32_t* offsets, 
int64_t length,
 
 Status ListBuilder::AppendNextOffset() {
   int64_t num_values = value_builder_->length();
-  if (ARROW_PREDICT_FALSE(num_values >= std::numeric_limits<int32_t>::max())) {
+  if (ARROW_PREDICT_FALSE(num_values > kListMaximumElements)) {
 std::stringstream ss;
 ss << "ListArray cannot contain more then INT32_MAX - 1 child elements,"
<< " have " << num_values;
@@ -1252,14 +1252,14 @@ Status ListBuilder::Append(bool is_valid) {
 }
 
 Status ListBuilder::Init(int64_t elements) {
-  DCHECK_LT(elements, std::numeric_limits<int32_t>::max());
+  DCHECK_LE(elements, kListMaximumElements);
   RETURN_NOT_OK(ArrayBuilder::Init(elements));
   // one more then requested for offsets
   return offsets_builder_.Resize((elements + 1) * sizeof(int32_t));
 }
 
 Status ListBuilder::Resize(int64_t capacity) {
-  DCHECK_LT(capacity, std::numeric_limits<int32_t>::max());
+  DCHECK_LE(capacity, kListMaximumElements);
   // one more then requested for offsets
   RETURN_NOT_OK(offsets_builder_.Resize((capacity + 1) * sizeof(int32_t)));
   return ArrayBuilder::Resize(capacity);
@@ -1303,14 +1303,14 @@ BinaryBuilder::BinaryBuilder(const 
std::shared_ptr<DataType>& type, MemoryPool*
 BinaryBuilder::BinaryBuilder(MemoryPool* pool) : BinaryBuilder(binary(), pool) 
{}
 
 Status BinaryBuilder::Init(int64_t elements) {
-  DCHECK_LT(elements, std::numeric_limits<int32_t>::max());
+  DCHECK_LE(elements, kListMaximumElements);
   RETURN_NOT_OK(ArrayBuilder::Init(elements));
   // one more then requested for offsets
   return offsets_builder_.Resize((elements + 1) * sizeof(int32_t));
 }
 
 Status BinaryBuilder::Resize(int64_t capacity) {
-  DCHECK_LT(capacity, std::numeric_limits<int32_t>::max());
+  DCHECK_LE(capacity, kListMaximumElements);
   // one more then requested for offsets
   RETURN_NOT_OK(offsets_builder_.Resize((capacity + 1) * sizeof(int32_t)));
   return ArrayBuilder::Resize(capacity);
@@ -1318,7 +1318,7 @@ Status BinaryBuilder::Resize(int64_t capacity) {
 
 Status BinaryBuilder::ReserveData(int64_t elements) {
   if (value_data_length() + elements > value_data_capacity()) {
-if (value_data_length() + elements > std::numeric_limits<int32_t>::max()) {
+if (value_data_length() + elements > kBinaryMemoryLimit) {
   return Status::Invalid("Cannot reserve capacity larger than 2^31 - 1 for 
binary");
 }
 RETURN_NOT_OK(value_data_builder_.Reserve(elements));
@@ -1328,9 +1328,9 @@ Status BinaryBuilder::ReserveData(int64_t elements) {
 
 Status BinaryBuilder::AppendNextOffset() {
   const int64_t num_bytes = value_data_builder_.length();
-  if (ARROW_PREDICT_FALSE(num_bytes > kMaximumCapacity)) {
+  if (ARROW_PREDICT_FALSE(num_bytes > kBinaryMemoryLimit)) {
 std::stringstream ss;
-ss << "BinaryArray cannot contain more than " << kMaximumCapacity << " 
bytes, have "
+ss << "BinaryArray cannot contain more than " << kBinaryMemoryLimit << " 
bytes, have "
<< num_bytes;
 return Status::Invalid(ss.str());
   }
diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h
index dabfb7506..cdcee80be 100644
--- a/cpp/src/arrow/builder.h
+++ b/cpp/src/arrow/builder.h
@@ -41,13 +41,16 @@ namespace arrow {
 class Array;
 class Decimal128;
 
+constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max() - 1;
+constexpr int64_t kListMaximumElements = std::numeric_limits<int32_t>::max() - 1;
+
 namespace internal {
 
 struct ArrayData;
 
 }  // namespace internal
 
-static constexpr int64_t kMinBuilderCapacity = 1 << 5;
+constexpr int64_t kMinBuilderCapacity = 1 << 5;
 
 /// Base class for all data array builders.
 //
@@ -702,8 +705,6 @@ class ARROW_EXPORT BinaryBuilder : public ArrayBuilder {
   TypedBufferBuilder<int32_t> offsets_builder_;
   TypedBufferBuilder<uint8_t> value_data_builder_;
 
-  static constexpr int64_t kMaximumCapacity = std::numeric_limits<int32_t>::max() - 1;
-
   Status AppendNextOffset();
   void Reset();
 };
diff --git a/cpp/src/arrow/python/numpy_to_arrow.cc 
b/cpp/src/arrow/python/numpy_to_arrow.cc
index 4d91e5317..71bf69fc1 100644
--- a/cpp/src/arrow/python/numpy_to_arrow.cc
+++ b/cpp/src/arrow/python/numpy_to_arrow.cc
@@ -60,8 +60,6 @@ namespace py {
 

[jira] [Resolved] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-13 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2227.
-
Resolution: Fixed

Issue resolved by pull request 1740
[https://github.com/apache/arrow/pull/1740]

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.
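
Until conversion chunks automatically, one workaround is to convert the frame 
in row slices and assemble the Table from the resulting record batches; a 
sketch only, not the fix in the linked pull request:
{code:python}
import pyarrow as pa

def table_from_pandas_chunked(df, rows_per_chunk=10000000):
    batches = []
    for start in range(0, len(df), rows_per_chunk):
        piece = df.iloc[start:start + rows_per_chunk]
        batches.append(pa.RecordBatch.from_pandas(piece, preserve_index=False))
    # Each column of the resulting Table is a chunked array with one chunk
    # per batch, keeping each chunk under the 2 GiB BinaryArray limit for
    # reasonable chunk sizes.
    return pa.Table.from_batches(batches)
{code}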



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397624#comment-16397624
 ] 

ASF GitHub Bot commented on ARROW-2227:
---

wesm commented on issue #1740: ARROW-2227: [Python] Fix off-by-one error in 
chunked binary conversions
URL: https://github.com/apache/arrow/pull/1740#issuecomment-372812464
 
 
   Appveyor build: https://ci.appveyor.com/project/wesm/arrow/build/1.0.1770. 
Merging


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2300) [Python] python/testing/test_hdfs.sh no longer works

2018-03-13 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2300:

Fix Version/s: (was: 0.9.0)
   0.10.0

> [Python] python/testing/test_hdfs.sh no longer works
> 
>
> Key: ARROW-2300
> URL: https://issues.apache.org/jira/browse/ARROW-2300
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> Tried this on a fresh Ubuntu 16.04 install:
> {code}
> $ ./test_hdfs.sh 
> + docker build -t arrow-hdfs-test -f hdfs/Dockerfile .
> Sending build context to Docker daemon  36.86kB
> Step 1/6 : FROM cpcloud86/impala:metastore
> manifest for cpcloud86/impala:metastore not found
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397540#comment-16397540
 ] 

ASF GitHub Bot commented on ARROW-2227:
---

wesm commented on a change in pull request #1740: ARROW-2227: [Python] Fix 
off-by-one error in chunked binary conversions
URL: https://github.com/apache/arrow/pull/1740#discussion_r174261484
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -60,7 +60,7 @@ namespace py {
 
 using internal::NumPyTypeSize;
 
-constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max();
+constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max() - 1;
 
 Review comment:
   done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2266) [CI] Improve runtime of integration tests in Travis CI

2018-03-13 Thread Brian Hulette (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397399#comment-16397399
 ] 

Brian Hulette commented on ARROW-2266:
--

The JS consumer is also to blame for the long runtime, since we validate each 
file with the entire matrix of build targets. [~ptaylor]: Do you have any 
ideas to improve runtime? Could we run all the targets in parallel?
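
As a rough illustration of the parallel option, the per-target validation runs 
could be fanned out with a process pool; the command line below is a 
placeholder, not the actual integration harness invocation:
{code:python}
import subprocess
from concurrent.futures import ProcessPoolExecutor

# Hypothetical target names; the real build-target matrix lives in the JS
# integration setup.
TARGETS = ["es5-umd", "es5-cjs", "es2015-umd", "es2015-cjs"]

def validate(target):
    # Placeholder command; returns the process exit code.
    return subprocess.call(["node", "validate.js", "--target", target])

with ProcessPoolExecutor() as pool:
    exit_codes = list(pool.map(validate, TARGETS))

assert all(code == 0 for code in exit_codes)
{code}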

> [CI] Improve runtime of integration tests in Travis CI
> --
>
> Key: ARROW-2266
> URL: https://issues.apache.org/jira/browse/ARROW-2266
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Integration
>Reporter: Wes McKinney
>Priority: Major
>
> I was surprised to see that travis_script_integration.sh is taking over 25 
> minutes to run (https://travis-ci.org/apache/arrow/jobs/349493491). My only 
> real guess about what's going on is that JVM startup time on these hosts is 
> super slow.
> I can think of some things we could do to make things better:
> * Add debugging output so we can see what's slow
> * Write a Java integration test handler that validates multiple files at once
> * Generate a single set of binary files for each producer rather than 
> regenerating them each time (so Java would only need to produce binary files 
> once instead of 3 times like now)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2277) [Python] Tensor.from_numpy doesn't support struct arrays

2018-03-13 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397389#comment-16397389
 ] 

Wes McKinney commented on ARROW-2277:
-

This is possible, eventually, see ARROW-1790: 
https://issues.apache.org/jira/browse/ARROW-1790. 

We don't currently have a memory format defined for more complex array "cells". 
It would be useful to be able to support the full gamut of fixed-size packed 
structs a la NumPy, and potentially also support the more complex 
representations being developed in the xnd/libndtypes projects right now. This 
is a different, though complementary, effort to the columnar analytics problem.

> [Python] Tensor.from_numpy doesn't support struct arrays
> 
>
> Key: ARROW-2277
> URL: https://issues.apache.org/jira/browse/ARROW-2277
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> {code:python}
> >>> dt = np.dtype([('x', np.int8), ('y', np.float32)])
> >>> dt.itemsize
> 5
> >>> arr = np.arange(5*10, dtype=np.int8).view(dt)
> >>> pa.Tensor.from_numpy(arr)
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.Tensor.from_numpy(arr)
>   File "array.pxi", line 523, in pyarrow.lib.Tensor.from_numpy
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_convert.cc:250 code: 
> GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray)), )
> Unsupported numpy type 20
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2276) [Python] Tensor could implement the buffer protocol

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397259#comment-16397259
 ] 

ASF GitHub Bot commented on ARROW-2276:
---

pitrou opened a new pull request #1741: ARROW-2276: [Python] Expose buffer 
protocol on Tensor
URL: https://github.com/apache/arrow/pull/1741
 
 
   Also add a bit_width property to the DataType class.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Tensor could implement the buffer protocol
> ---
>
> Key: ARROW-2276
> URL: https://issues.apache.org/jira/browse/ARROW-2276
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> Tensors have an underlying buffer, a data type, shape and strides. It seems 
> like they could implement the Python buffer protocol.
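
For context, buffer-protocol support would allow zero-copy views like the 
following (sketch; assumes the feature from the pull request above):
{code:python}
import numpy as np
import pyarrow as pa

tensor = pa.Tensor.from_numpy(np.arange(12, dtype=np.float64).reshape(3, 4))

view = memoryview(tensor)   # zero-copy view over the tensor's buffer
arr = np.asarray(tensor)    # NumPy array backed by the same memory
print(view.shape, arr.sum())
{code}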



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2277) [Python] Tensor.from_numpy doesn't support struct arrays

2018-03-13 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-2277.
-
Resolution: Won't Fix

> [Python] Tensor.from_numpy doesn't support struct arrays
> 
>
> Key: ARROW-2277
> URL: https://issues.apache.org/jira/browse/ARROW-2277
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> {code:python}
> >>> dt = np.dtype([('x', np.int8), ('y', np.float32)])
> >>> dt.itemsize
> 5
> >>> arr = np.arange(5*10, dtype=np.int8).view(dt)
> >>> pa.Tensor.from_numpy(arr)
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.Tensor.from_numpy(arr)
>   File "array.pxi", line 523, in pyarrow.lib.Tensor.from_numpy
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_convert.cc:250 code: 
> GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray)), )
> Unsupported numpy type 20
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2276) [Python] Tensor could implement the buffer protocol

2018-03-13 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2276:
--
Labels: pull-request-available  (was: )

> [Python] Tensor could implement the buffer protocol
> ---
>
> Key: ARROW-2276
> URL: https://issues.apache.org/jira/browse/ARROW-2276
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> Tensors have an underlying buffer, a data type, shape and strides. It seems 
> like they could implement the Python buffer protocol.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2277) [Python] Tensor.from_numpy doesn't support struct arrays

2018-03-13 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397257#comment-16397257
 ] 

Antoine Pitrou commented on ARROW-2277:
---

Actually, it is reasonable not to support struct arrays in Tensor.from_numpy(). 
The reason is that the physical layout of Arrow struct arrays is fundamentally 
different from the physical layout of Numpy struct arrays. Arrow struct arrays 
have separate child array data for each struct component, while Numpy packs 
struct elements contiguously. So it's not possible to get a zero-copy view of 
one representation into another.
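
A copying, per-field conversion remains possible, it just cannot be zero-copy; 
a small illustration (ordinary array conversion, not the Tensor API):
{code:python}
import numpy as np
import pyarrow as pa

dt = np.dtype([('x', np.int8), ('y', np.float32)])
arr = np.zeros(10, dtype=dt)

# One Arrow array per struct field; each field view is copied into its own
# contiguous buffer, matching Arrow's struct layout of separate child arrays.
columns = {name: pa.array(np.ascontiguousarray(arr[name])) for name in dt.names}
print(columns['x'].type, columns['y'].type)
{code}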

> [Python] Tensor.from_numpy doesn't support struct arrays
> 
>
> Key: ARROW-2277
> URL: https://issues.apache.org/jira/browse/ARROW-2277
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>
> {code:python}
> >>> dt = np.dtype([('x', np.int8), ('y', np.float32)])
> >>> dt.itemsize
> 5
> >>> arr = np.arange(5*10, dtype=np.int8).view(dt)
> >>> pa.Tensor.from_numpy(arr)
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.Tensor.from_numpy(arr)
>   File "array.pxi", line 523, in pyarrow.lib.Tensor.from_numpy
>   File "error.pxi", line 85, in pyarrow.lib.check_status
> ArrowNotImplementedError: 
> /home/antoine/arrow/cpp/src/arrow/python/numpy_convert.cc:250 code: 
> GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray)), )
> Unsupported numpy type 20
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison

2018-03-13 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397116#comment-16397116
 ] 

Antoine Pitrou commented on ARROW-640:
--

[~alexhagerman], you'll need to be careful that hashing is consistent with 
Python scalars (in other words, for every hashable x and y where {{x == y}}, 
{{hash(x) == hash(y)}} should also be true).

The simplest way to do that is probably to convert the Arrow value to a Python 
scalar, though that may not be the fastest:
{code:python}
def __hash__(self):
return hash(self.as_py())
{code}

Otherwise you'll need to reproduce the exact hashing algorithm that Python uses.
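
A quick check of that invariant could look like this (sketch; assumes the 
{{as_py()}}-based {{__hash__}} and scalar comparisons are in place):
{code:python}
import pyarrow as pa

arr = pa.array([1, 1, 1, 2])

for value in arr:
    py_value = value.as_py()
    assert value == py_value                # equality with Python scalars
    assert hash(value) == hash(py_value)    # hash consistency

assert set(arr) == {1, 2}
{code}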

> [Python] Arrow scalar values should have a sensible __hash__ and comparison
> ---
>
> Key: ARROW-640
> URL: https://issues.apache.org/jira/browse/ARROW-640
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Miki Tebeka
>Assignee: Alex Hagerman
>Priority: Major
> Fix For: 0.10.0
>
>
> {noformat}
> In [86]: arr = pa.from_pylist([1, 1, 1, 2])
> In [87]: set(arr)
> Out[87]: {1, 2, 1, 1}
> In [88]: arr[0] == arr[1]
> Out[88]: False
> In [89]: arr
> Out[89]: 
> 
> [
>   1,
>   1,
>   1,
>   2
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2304) [C++] MultipleClients test in io-hdfs-test fails on trunk

2018-03-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2304:
---

 Summary: [C++] MultipleClients test in io-hdfs-test fails on trunk
 Key: ARROW-2304
 URL: https://issues.apache.org/jira/browse/ARROW-2304
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.9.0


This fails for me locally:

{code}
[ RUN  ] TestHadoopFileSystem/0.MultipleClients
../src/arrow/io/io-hdfs-test.cc:192: Failure
Value of: s.ok()
  Actual: false
Expected: true
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2303) [C++] Disable ASAN when building io-hdfs-test.cc

2018-03-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2303:
---

 Summary: [C++] Disable ASAN when building io-hdfs-test.cc
 Key: ARROW-2303
 URL: https://issues.apache.org/jira/browse/ARROW-2303
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney


ASAN reports spurious memory leaks in this unit test module. I am not sure of 
the easiest way to conditionally scrub the ASAN flags from such a unit test's 
compilation flags.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397101#comment-16397101
 ] 

ASF GitHub Bot commented on ARROW-2227:
---

pitrou commented on a change in pull request #1740: ARROW-2227: [Python] Fix 
off-by-one error in chunked binary conversions
URL: https://github.com/apache/arrow/pull/1740#discussion_r174173270
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -60,7 +60,7 @@ namespace py {
 
 using internal::NumPyTypeSize;
 
-constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max();
+constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max() - 1;
 
 Review comment:
   Yes, there are several places in `builder.cc` that compare against 
`std::numeric_limits<int32_t>::max()` (with possible other off-by-one errors).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397098#comment-16397098
 ] 

ASF GitHub Bot commented on ARROW-2227:
---

wesm commented on a change in pull request #1740: ARROW-2227: [Python] Fix 
off-by-one error in chunked binary conversions
URL: https://github.com/apache/arrow/pull/1740#discussion_r174172314
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -60,7 +60,7 @@ namespace py {
 
 using internal::NumPyTypeSize;
 
-constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max();
+constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max() - 1;
 
 Review comment:
   I can move this constexpr to builder.h and use that consistently


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397095#comment-16397095
 ] 

ASF GitHub Bot commented on ARROW-2227:
---

pitrou commented on a change in pull request #1740: ARROW-2227: [Python] Fix 
off-by-one error in chunked binary conversions
URL: https://github.com/apache/arrow/pull/1740#discussion_r174171492
 
 

 ##
 File path: cpp/src/arrow/python/numpy_to_arrow.cc
 ##
 @@ -60,7 +60,7 @@ namespace py {
 
 using internal::NumPyTypeSize;
 
-constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max();
+constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max() - 1;
 
 Review comment:
   Apparently this is already available as `BinaryBuilder::kMaximumCapacity` 
(though for some reason it's protected).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16397077#comment-16397077
 ] 

ASF GitHub Bot commented on ARROW-2227:
---

wesm opened a new pull request #1740: ARROW-2227: [Python] Fix off-by-one error 
in chunked binary conversions
URL: https://github.com/apache/arrow/pull/1740
 
 
   We were already testing the chunked behavior but we did not exercise the 
off-by-one error edge case


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2227) [Python] Table.from_pandas does not create chunked_arrays.

2018-03-13 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2227:
--
Labels: pull-request-available  (was: )

> [Python] Table.from_pandas does not create chunked_arrays.
> --
>
> Key: ARROW-2227
> URL: https://issues.apache.org/jira/browse/ARROW-2227
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Chris Ellison
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When creating a large enough array, pyarrow raises an exception:
> {code:java}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> x = list('1' * 2**31)
> y = pd.DataFrame({'x': x})
> t = pa.Table.from_pandas(y)
> # ArrowInvalid: BinaryArrow cannot contain more than 2147483646 bytes, have 
> 2147483647{code}
> The array should be chunked for the user. As is, data frames with >2 GiB in 
> binary data will struggle to get into arrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2302) [GLib] Run autotools and meson Linux builds in same Travis CI build entry

2018-03-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2302:
---

 Summary: [GLib] Run autotools and meson Linux builds in same 
Travis CI build entry
 Key: ARROW-2302
 URL: https://issues.apache.org/jira/browse/ARROW-2302
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Wes McKinney
 Fix For: 0.10.0


Since our CI matrix is going to expand, and these builds are fast (< 5 
minutes), I suggest we run these builds in the same job:

https://travis-ci.org/apache/arrow/builds/352848066



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1643) [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect to HDFS

2018-03-13 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-1643.
-
Resolution: Fixed

Issue resolved by pull request 1668
[https://github.com/apache/arrow/pull/1668]
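
With this change, usage along the following lines should work (sketch; host, 
port and path are placeholders):
{code:python}
import pyarrow.parquet as pq

# read_table now infers the filesystem from the URI prefix and connects to
# HDFS instead of assuming the local filesystem.
table = pq.read_table('hdfs://namenode:8020/data/example.parquet')
df = table.to_pandas()
{code}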

> [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect 
> to HDFS
> -
>
> Key: ARROW-1643
> URL: https://issues.apache.org/jira/browse/ARROW-1643
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Ehsan Totoni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1643) [Python] Accept hdfs:// prefixes in parquet.read_table and attempt to connect to HDFS

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396962#comment-16396962
 ] 

ASF GitHub Bot commented on ARROW-1643:
---

wesm closed pull request #1668: ARROW-1643: [Python] Accept hdfs:// prefixes in 
parquet.read_table and attempt to connect to HDFS
URL: https://github.com/apache/arrow/pull/1668
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index 42c558b0b..fd9c740f1 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -21,6 +21,13 @@
 import json
 import re
 import six
+from six.moves.urllib.parse import urlparse
+# pathlib might not be available in Python 2
+try:
+    import pathlib
+    _has_pathlib = True
+except ImportError:
+    _has_pathlib = False
 
 import numpy as np
 
@@ -53,6 +60,7 @@ class ParquetFile(object):
     """
     def __init__(self, source, metadata=None, common_metadata=None):
         self.reader = ParquetReader()
+        source = _ensure_file(source)
         self.reader.open(source, metadata=metadata)
         self.common_metadata = common_metadata
         self._nested_paths_by_prefix = self._build_nested_paths()
@@ -279,8 +287,20 @@ def __init__(self, where, schema, flavor=None,
         self.schema_changed = False
 
         self.schema = schema
+        self.where = where
+
+        # If we open a file using an implied filesystem, so it can be assured
+        # to be closed
+        self.file_handle = None
+
+        if is_path(where):
+            fs = _get_fs_from_path(where)
+            sink = self.file_handle = fs.open(where, 'wb')
+        else:
+            sink = where
+
         self.writer = _parquet.ParquetWriter(
-            where, schema,
+            sink, schema,
             version=version,
             compression=compression,
             use_dictionary=use_dictionary,
@@ -310,6 +330,8 @@ def close(self):
         if self.is_open:
             self.writer.close()
             self.is_open = False
+        if self.file_handle is not None:
+            self.file_handle.close()
 
 
 def _get_pandas_index_columns(keyvalues):
@@ -559,8 +581,9 @@ def get_index(self, level, name, key):
         return self.levels[level].get_index(key)
 
 
-def is_string(x):
-    return isinstance(x, six.string_types)
+def is_path(x):
+    return (isinstance(x, six.string_types)
+            or (_has_pathlib and isinstance(x, pathlib.Path)))
 
 
 class ParquetManifest(object):
@@ -569,7 +592,7 @@ class ParquetManifest(object):
     """
     def __init__(self, dirpath, filesystem=None, pathsep='/',
                  partition_scheme='hive'):
-        self.filesystem = filesystem or LocalFileSystem.get_instance()
+        self.filesystem = filesystem or _get_fs_from_path(dirpath)
         self.pathsep = pathsep
         self.dirpath = dirpath
         self.partition_scheme = partition_scheme
@@ -692,7 +715,10 @@ class ParquetDataset(object):
     def __init__(self, path_or_paths, filesystem=None, schema=None,
                  metadata=None, split_row_groups=False, validate_schema=True):
         if filesystem is None:
-            self.fs = LocalFileSystem.get_instance()
+            a_path = path_or_paths
+            if isinstance(a_path, list):
+                a_path = a_path[0]
+            self.fs = _get_fs_from_path(a_path)
         else:
             self.fs = _ensure_filesystem(filesystem)
 
@@ -851,7 +877,7 @@ def _make_manifest(path_or_paths, fs, pathsep='/'):
         # Dask passes a directory as a list of length 1
         path_or_paths = path_or_paths[0]
 
-    if is_string(path_or_paths) and fs.isdir(path_or_paths):
+    if is_path(path_or_paths) and fs.isdir(path_or_paths):
         manifest = ParquetManifest(path_or_paths, filesystem=fs,
                                    pathsep=fs.pathsep)
         common_metadata_path = manifest.common_metadata_path
@@ -904,11 +930,11 @@ def _make_manifest(path_or_paths, fs, pathsep='/'):
 
 def read_table(source, columns=None, nthreads=1, metadata=None,
                use_pandas_metadata=False):
-    if is_string(source):
-        fs = LocalFileSystem.get_instance()
+    if is_path(source):
+        fs = _get_fs_from_path(source)
+
         if fs.isdir(source):
-            return fs.read_parquet(source, columns=columns,
-                                   metadata=metadata)
+            return fs.read_parquet(source, columns=columns, metadata=metadata)
 
     pf = ParquetFile(source, metadata=metadata)
     return pf.read(columns=columns, nthreads=nthreads,
@@ -957,7 +983,7 @@ def write_table(table, where, row_group_size=None, 
version='1.0',
 **kwargs) as writer:
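
For context, a minimal usage sketch of what the merged change enables; the HDFS host, port, file path and column names below are hypothetical, and a reachable HDFS with libhdfs configured is assumed:
{code}
import pyarrow.parquet as pq

# With ARROW-1643, an hdfs:// prefix makes read_table infer an HDFS
# filesystem instead of assuming the local filesystem.
table = pq.read_table('hdfs://namenode:8020/warehouse/events.parquet',
                      columns=['user_id', 'ts'])
df = table.to_pandas()
{code}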

[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396847#comment-16396847
 ] 

ASF GitHub Bot commented on ARROW-2122:
---

pitrou commented on a change in pull request #1707: ARROW-2122: [Python] 
Pyarrow fails to serialize dataframe with timestamp.
URL: https://github.com/apache/arrow/pull/1707#discussion_r174097652
 
 

 ##
 File path: python/pyarrow/tests/test_convert_pandas.py
 ##
 @@ -1001,6 +1001,17 @@ def test_array_from_pandas_date_with_mask(self):
         assert pa.Array.from_pandas(expected).equals(result)
 
 
+def test_fixed_offset_timezone():
 
 Review comment:
   Please put this under the class above.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
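
For illustration only, a hypothetical sketch of the placement being suggested: the new round-trip check written as a method of the existing test class rather than a module-level function. The class name and assertion style are placeholders, not the actual test suite, and the serialize/deserialize round trip assumes the fix in this PR:
{code}
import pandas as pd
import pandas.util.testing as tm
import pyarrow as pa


class TestConvertDateTimeLikeTypes(object):  # hypothetical class name

    def test_fixed_offset_timezone(self):
        # Round-trip a fixed-offset timestamp through pyarrow serialization,
        # mirroring the reproduction in the bug report.
        df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'),
                                 pd.NaT]})
        buf = pa.serialize(df).to_buffer()
        result = pa.deserialize(buf)
        tm.assert_frame_equal(df, result)
{code}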


> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Albert Shieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> <listcomp>
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396844#comment-16396844
 ] 

ASF GitHub Bot commented on ARROW-2122:
---

pitrou commented on a change in pull request #1707: ARROW-2122: [Python] 
Pyarrow fails to serialize dataframe with timestamp.
URL: https://github.com/apache/arrow/pull/1707#discussion_r174097132
 
 

 ##
 File path: python/pyarrow/types.pxi
 ##
 @@ -847,6 +847,25 @@ cdef timeunit_to_string(TimeUnit unit):
         return 'ns'
 
 
+FIXED_OFFSET_PREFIX = '+'
 
 Review comment:
   We probably want the offset to be encoded as `[+-]HH:MM`.
   See https://github.com/apache/arrow/blob/master/format/Schema.fbs#L162-L166
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
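
As a rough sketch of the `[+-]HH:MM` encoding suggested above (the helper name and signature are illustrative, not pyarrow's actual API):
{code}
import datetime


def tzinfo_to_offset_string(tz):
    # Illustrative helper: render a fixed-offset tzinfo as '[+-]HH:MM'.
    total = int(tz.utcoffset(None).total_seconds())
    sign, total = ('+', total) if total >= 0 else ('-', -total)
    return '{}{:02d}:{:02d}'.format(sign, total // 3600, (total % 3600) // 60)


tz = datetime.timezone(datetime.timedelta(hours=1))
print(tzinfo_to_offset_string(tz))  # '+01:00'
{code}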


> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Albert Shieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> <listcomp>
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396845#comment-16396845
 ] 

ASF GitHub Bot commented on ARROW-2122:
---

pitrou commented on a change in pull request #1707: ARROW-2122: [Python] 
Pyarrow fails to serialize dataframe with timestamp.
URL: https://github.com/apache/arrow/pull/1707#discussion_r174097194
 
 

 ##
 File path: python/pyarrow/types.pxi
 ##
 @@ -847,6 +847,25 @@ cdef timeunit_to_string(TimeUnit unit):
         return 'ns'
 
 
+FIXED_OFFSET_PREFIX = '+'
+
+
+def tzinfo_to_string(tz):
 
 Review comment:
   These two functions would deserve a docstring, and appropriate unit tests.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
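
A hypothetical sketch of the docstring and unit-test shape being asked for; the parsing helper below is illustrative only and is not pyarrow's implementation:
{code}
import datetime

import pytest


def string_to_tzinfo(name):
    """Parse a '[+-]HH:MM' offset string into a fixed-offset tzinfo.

    Illustrative counterpart to tzinfo_to_string; names are placeholders.
    """
    sign = -1 if name.startswith('-') else 1
    hours, minutes = map(int, name[1:].split(':'))
    return datetime.timezone(sign * datetime.timedelta(hours=hours,
                                                       minutes=minutes))


@pytest.mark.parametrize('name, expected_minutes',
                         [('+01:00', 60), ('-05:30', -330), ('+00:00', 0)])
def test_string_to_tzinfo(name, expected_minutes):
    # Each offset string should map back to the expected UTC offset.
    assert (string_to_tzinfo(name).utcoffset(None)
            == datetime.timedelta(minutes=expected_minutes))
{code}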


> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Assignee: Albert Shieh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> <listcomp>
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)