[jira] [Commented] (ARROW-2313) [GLib] Release builds must define NDEBUG

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399956#comment-16399956
 ] 

ASF GitHub Bot commented on ARROW-2313:
---

kou opened a new pull request #1752: ARROW-2313: [C++] Add -DNDEBUG flag to 
arrow.pc
URL: https://github.com/apache/arrow/pull/1752
 
 
   Arrow C++ users should build with the same -DNDEBUG flag as Arrow C++ itself.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [GLib] Release builds must define NDEBUG
> 
>
> Key: ARROW-2313
> URL: https://issues.apache.org/jira/browse/ARROW-2313
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GLib
>Reporter: Wes McKinney
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Ran into another problem with {{verify-release-candidate.sh 0.9.0 0}}: the 
> GLib build is not defining NDEBUG, so depending on whether Arrow was built in 
> release or debug mode, some symbols (like {{Buffer::mutable_data}}) may not 
> be inlined.
> {code}
>   CXX  libarrow_glib_la-compute.lo
>   CC   enums.lo
>   CXXLD    libarrow-glib.la
> ar: `u' modifier ignored since `D' is the default (see `U')
>   GISCAN   Arrow-1.0.gir
> ./.libs/libarrow-glib.so: undefined reference to 
> `arrow::Buffer::mutable_data()'
> collect2: error: ld returned 1 exit status
> linking of temporary binary failed: Command '['/bin/bash', '../libtool', 
> '--mode=link', '--tag=CC', '--silent', 'gcc', '-o', 
> '/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0',
>  '-export-dynamic', '-g', '-O2', 
> 'tmp-introspect7g38ad/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0.o',
>  '-L.', 'libarrow-glib.la', '-Wl,--export-dynamic', '-lgmodule-2.0', 
> '-pthread', '-lgio-2.0', '-lgobject-2.0', '-lglib-2.0']' returned non-zero 
> exit status 1
> /usr/share/gobject-introspection-1.0/Makefile.introspection:155: recipe for 
> target 'Arrow-1.0.gir' failed
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2313) [GLib] Release builds must define NDEBUG

2018-03-14 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2313:
--
Labels: pull-request-available  (was: )

> [GLib] Release builds must define NDEBUG
> 
>
> Key: ARROW-2313
> URL: https://issues.apache.org/jira/browse/ARROW-2313
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GLib
>Reporter: Wes McKinney
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Ran into another problem with {{verify-release-candidate.sh 0.9.0 0}}: the 
> GLib build is not defining NDEBUG, so depending on whether Arrow was built in 
> release or debug mode, some symbols (like {{Buffer::mutable_data}}) may not 
> be inlined.
> {code}
>   CXX  libarrow_glib_la-compute.lo
>   CC   enums.lo
>   CXXLD    libarrow-glib.la
> ar: `u' modifier ignored since `D' is the default (see `U')
>   GISCAN   Arrow-1.0.gir
> ./.libs/libarrow-glib.so: undefined reference to 
> `arrow::Buffer::mutable_data()'
> collect2: error: ld returned 1 exit status
> linking of temporary binary failed: Command '['/bin/bash', '../libtool', 
> '--mode=link', '--tag=CC', '--silent', 'gcc', '-o', 
> '/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0',
>  '-export-dynamic', '-g', '-O2', 
> 'tmp-introspect7g38ad/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0.o',
>  '-L.', 'libarrow-glib.la', '-Wl,--export-dynamic', '-lgmodule-2.0', 
> '-pthread', '-lgio-2.0', '-lgobject-2.0', '-lglib-2.0']' returned non-zero 
> exit status 1
> /usr/share/gobject-introspection-1.0/Makefile.introspection:155: recipe for 
> target 'Arrow-1.0.gir' failed
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2313) [GLib] Release builds must define NDEBUG

2018-03-14 Thread Kouhei Sutou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-2313:
---

Assignee: Kouhei Sutou

> [GLib] Release builds must define NDEBUG
> 
>
> Key: ARROW-2313
> URL: https://issues.apache.org/jira/browse/ARROW-2313
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GLib
>Reporter: Wes McKinney
>Assignee: Kouhei Sutou
>Priority: Major
> Fix For: 0.9.0
>
>
> Ran into another problem with {{verify-release-candidate.sh 0.9.0 0}}: the 
> GLib build is not defining NDEBUG, so depending on whether Arrow was built in 
> release or debug mode, some symbols (like {{Buffer::mutable_data}}) may not 
> be inlined.
> {code}
>   CXX  libarrow_glib_la-compute.lo
>   CC   enums.lo
>   CXXLD    libarrow-glib.la
> ar: `u' modifier ignored since `D' is the default (see `U')
>   GISCAN   Arrow-1.0.gir
> ./.libs/libarrow-glib.so: undefined reference to 
> `arrow::Buffer::mutable_data()'
> collect2: error: ld returned 1 exit status
> linking of temporary binary failed: Command '['/bin/bash', '../libtool', 
> '--mode=link', '--tag=CC', '--silent', 'gcc', '-o', 
> '/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0',
>  '-export-dynamic', '-g', '-O2', 
> 'tmp-introspect7g38ad/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0.o',
>  '-L.', 'libarrow-glib.la', '-Wl,--export-dynamic', '-lgmodule-2.0', 
> '-pthread', '-lgio-2.0', '-lgobject-2.0', '-lglib-2.0']' returned non-zero 
> exit status 1
> /usr/share/gobject-introspection-1.0/Makefile.introspection:155: recipe for 
> target 'Arrow-1.0.gir' failed
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2312) [JS] verify-release-candidate.sh must be updated to include JS in integration tests

2018-03-14 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2312:
---

Assignee: Paul Taylor

> [JS] verify-release-candidate.sh must be updated to include JS in integration 
> tests
> ---
>
> Key: ARROW-2312
> URL: https://issues.apache.org/jira/browse/ARROW-2312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Assignee: Paul Taylor
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I was unable to run verify-release-candidate.sh when working on the first 
> iteration of the 0.9.0 release. 
> JavaScript was added to the integration tests, but the verification script 
> has not been updated yet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2312) [JS] verify-release-candidate.sh must be updated to include JS in integration tests

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399909#comment-16399909
 ] 

ASF GitHub Bot commented on ARROW-2312:
---

wesm closed pull request #1751: ARROW-2312: [JS] run test_js before 
test_integration
URL: https://github.com/apache/arrow/pull/1751
 
 
   

This is a PR merged from a forked repository. As GitHub hides the original
diff on merge, it is displayed below for the sake of provenance:

diff --git a/dev/release/verify-release-candidate.sh 
b/dev/release/verify-release-candidate.sh
index cb9b01b37..0b278e7cf 100755
--- a/dev/release/verify-release-candidate.sh
+++ b/dev/release/verify-release-candidate.sh
@@ -246,12 +246,11 @@ cd ${DIST_NAME}
 test_package_java
 setup_miniconda
 test_and_install_cpp
+test_js
 test_integration
 test_glib
 install_parquet_cpp
 test_python
 
-test_js
-
 echo 'Release candidate looks good!'
 exit 0


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JS] verify-release-candidate.sh must be updated to include JS in integration 
> tests
> ---
>
> Key: ARROW-2312
> URL: https://issues.apache.org/jira/browse/ARROW-2312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I was unable to run verify-release-candidate.sh when working on the first 
> iteration of the 0.9.0 release. 
> JavaScript was added to the integration tests, but the verification script 
> has not been updated yet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2312) [JS] verify-release-candidate.sh must be updated to include JS in integration tests

2018-03-14 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2312.
-
Resolution: Fixed

Issue resolved by pull request 1751
[https://github.com/apache/arrow/pull/1751]

> [JS] verify-release-candidate.sh must be updated to include JS in integration 
> tests
> ---
>
> Key: ARROW-2312
> URL: https://issues.apache.org/jira/browse/ARROW-2312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I was unable to run verify-release-candidate.sh when working on the first 
> iteration of the 0.9.0 release. 
> JavaScript was added to the integration tests, but the verification script 
> has not been updated yet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2313) [GLib] Release builds must define NDEBUG

2018-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2313:
---

 Summary: [GLib] Release builds must define NDEBUG
 Key: ARROW-2313
 URL: https://issues.apache.org/jira/browse/ARROW-2313
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Reporter: Wes McKinney
 Fix For: 0.9.0


Ran into another problem with {{verify-release-candidate.sh 0.9.0 0}}: the 
GLib build is not defining NDEBUG, so depending on whether Arrow was built in 
release or debug mode, some symbols (like {{Buffer::mutable_data}}) may not be 
inlined.

{code}
  CXX  libarrow_glib_la-compute.lo
  CC   enums.lo
  CXXLD    libarrow-glib.la
ar: `u' modifier ignored since `D' is the default (see `U')
  GISCAN   Arrow-1.0.gir
./.libs/libarrow-glib.so: undefined reference to `arrow::Buffer::mutable_data()'
collect2: error: ld returned 1 exit status
linking of temporary binary failed: Command '['/bin/bash', '../libtool', 
'--mode=link', '--tag=CC', '--silent', 'gcc', '-o', 
'/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0',
 '-export-dynamic', '-g', '-O2', 
'tmp-introspect7g38ad/tmp/arrow-0.9.0.hlQDV/apache-arrow-0.9.0/c_glib/arrow-glib/tmp-introspect7g38ad/Arrow-1.0.o',
 '-L.', 'libarrow-glib.la', '-Wl,--export-dynamic', '-lgmodule-2.0', 
'-pthread', '-lgio-2.0', '-lgobject-2.0', '-lglib-2.0']' returned non-zero exit 
status 1
/usr/share/gobject-introspection-1.0/Makefile.introspection:155: recipe for 
target 'Arrow-1.0.gir' failed
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison

2018-03-14 Thread Alex Hagerman (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399783#comment-16399783
 ] 

Alex Hagerman commented on ARROW-640:
-

Sounds good. Just to verify: integers only, or number types in general? I've got a 
deployment happening during the day right now, so I'll hopefully be able to 
wrap up a first version this weekend and do a PR for review.

You mentioned that for items like StructValue the as_py fallback won't work. 
Similarly with ListValue, I would expect both of these to raise a TypeError: 
unhashable type, but I'll check the current behavior. Depending on what that is, 
do you have any thoughts on whether hash() should raise TypeError on mutable 
types, matching standard Python behavior? I wanted to check so I don't conflict 
with any existing expected behavior, in case this has been handled previously, 
and to look at tying it in with __eq__.
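
As a rough sketch of the semantics being discussed (the class below is an 
illustrative stand-in, not the actual pyarrow implementation): deferring 
{{__hash__}} and {{__eq__}} to {{as_py()}} makes scalars behave like their 
Python counterparts, and list/struct values inherit unhashability for free:

{code:python}
# Illustrative stand-in for the pyarrow scalar wrappers; the real
# implementation lives in Cython and may differ.
class ArrayValue(object):
    def __init__(self, py_value):
        self._py_value = py_value

    def as_py(self):
        return self._py_value

    def __eq__(self, other):
        other = other.as_py() if isinstance(other, ArrayValue) else other
        return self.as_py() == other

    def __hash__(self):
        # hash() of a list or dict raises "TypeError: unhashable type",
        # so list/struct values get standard Python behavior for free.
        return hash(self.as_py())

assert ArrayValue(1) == ArrayValue(1)
assert len({ArrayValue(1), ArrayValue(1), ArrayValue(2)}) == 2
{code}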

> [Python] Arrow scalar values should have a sensible __hash__ and comparison
> ---
>
> Key: ARROW-640
> URL: https://issues.apache.org/jira/browse/ARROW-640
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Miki Tebeka
>Assignee: Alex Hagerman
>Priority: Major
> Fix For: 0.10.0
>
>
> {noformat}
> In [86]: arr = pa.from_pylist([1, 1, 1, 2])
> In [87]: set(arr)
> Out[87]: {1, 2, 1, 1}
> In [88]: arr[0] == arr[1]
> Out[88]: False
> In [89]: arr
> Out[89]: 
> 
> [
>   1,
>   1,
>   1,
>   2
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2312) [JS] verify-release-candidate.sh must be updated to include JS in integration tests

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399757#comment-16399757
 ] 

ASF GitHub Bot commented on ARROW-2312:
---

trxcllnt opened a new pull request #1751: ARROW-2312: [JS] run test_js before 
test_integration
URL: https://github.com/apache/arrow/pull/1751
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [JS] verify-release-candidate.sh must be updated to include JS in integration 
> tests
> ---
>
> Key: ARROW-2312
> URL: https://issues.apache.org/jira/browse/ARROW-2312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I was unable to run verify-release-candidate.sh when working on the first 
> iteration of the 0.9.0 release. 
> JavaScript was added to the integration tests, but the verification script 
> has not been updated yet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2312) [JS] verify-release-candidate.sh must be updated to include JS in integration tests

2018-03-14 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2312:
--
Labels: pull-request-available  (was: )

> [JS] verify-release-candidate.sh must be updated to include JS in integration 
> tests
> ---
>
> Key: ARROW-2312
> URL: https://issues.apache.org/jira/browse/ARROW-2312
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> I was unable to run verify-release-candidate.sh when working on the first 
> iteration of the 0.9.0 release. 
> JavaScript was added to the integration tests, but the verification script 
> has not been updated yet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2312) [JS] verify-release-candidate.sh must be updated to include JS in integration tests

2018-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2312:
---

 Summary: [JS] verify-release-candidate.sh must be updated to 
include JS in integration tests
 Key: ARROW-2312
 URL: https://issues.apache.org/jira/browse/ARROW-2312
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Reporter: Wes McKinney
 Fix For: 0.9.0


I was unable to run verify-release-candidate.sh when working on the first 
iteration of the 0.9.0 release. 

JavaScript was added to the integration tests, but the verification script has 
not been updated yet



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1886) [Python] Add function to "flatten" structs within tables

2018-03-14 Thread Antoine Pitrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-1886:
-

Assignee: Antoine Pitrou

> [Python] Add function to "flatten" structs within tables
> 
>
> Key: ARROW-1886
> URL: https://issues.apache.org/jira/browse/ARROW-1886
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 0.10.0
>
>
> See discussion in https://issues.apache.org/jira/browse/ARROW-1873
> When a user has a struct column, it may be more efficient to flatten the 
> struct into multiple columns of the form {{struct_name.field_name}} for each 
> field in the struct. Then when you call {{to_pandas}}, Python dictionaries do 
> not have to be created, and the conversion will be much more efficient
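
As an illustration of the proposed naming scheme, here is a sketch in plain 
Python that deliberately goes through the dict-creating slow path a native 
flatten function would avoid (the column name 'a' is hypothetical):

{code:python}
import pyarrow as pa

# A struct column 'a' with fields 'x' and 'y'.
struct_type = pa.struct([pa.field('x', pa.int16()), pa.field('y', pa.float32())])
arr = pa.array([(1, 2.0), (3, 4.0)], type=struct_type)

# Flatten into columns 'a.x' and 'a.y'. to_pylist() creates one Python
# dict per row -- exactly the overhead the proposed function removes.
rows = arr.to_pylist()
flattened = {'a.' + name: pa.array([row[name] for row in rows])
             for name in ('x', 'y')}
{code}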



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2311) [Python] Struct array slicing defective

2018-03-14 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2311:
-

 Summary: [Python] Struct array slicing defective
 Key: ARROW-2311
 URL: https://issues.apache.org/jira/browse/ARROW-2311
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


{code:python}
>>> arr = pa.array([(1, 2.0), (3, 4.0), (5, 6.0)], 
>>> type=pa.struct([pa.field('x', pa.int16()), pa.field('y', pa.float32())]))
>>> arr

[
  {'x': 1, 'y': 2.0},
  {'x': 3, 'y': 4.0},
  {'x': 5, 'y': 6.0}
]
>>> arr[1:]

[
  {'x': 1, 'y': 2.0},
  {'x': 3, 'y': 4.0}
]
{code}
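
Continuing the snippet above, a minimal sketch of the expected behavior for 
comparison (slicing the equivalent Python list):

{code:python}
# Expected: arr[1:] matches slicing the equivalent Python list.
expected = arr.to_pylist()[1:]
assert expected == [{'x': 3, 'y': 4.0}, {'x': 5, 'y': 6.0}]
# The defect: arr[1:] instead repeats the first two structs, which suggests
# the slice offset is not being applied when child values are extracted.
{code}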



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2310) Source release scripts fail with Java8

2018-03-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-2310:
---

 Summary: Source release scripts fail with Java8
 Key: ARROW-2310
 URL: https://issues.apache.org/jira/browse/ARROW-2310
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Wes McKinney
 Fix For: 0.10.0


It's getting harder and harder to install Java7 these days. On a new install of 
Ubuntu 16.04 I am not even sure how to get Oracle's Java7 installed (though 
Java8 can be installed through a PPA).

In lieu of fixing all the javadoc problems, it would be great if there were some 
other workaround to build the release on Java8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches

2018-03-14 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-2307.
-
   Resolution: Fixed
Fix Version/s: 0.9.0

Issue resolved by pull request 1747
[https://github.com/apache/arrow/pull/1747]

> [Python] Unable to read arrow stream containing 0 record batches
> 
>
> Key: ARROW-2307
> URL: https://issues.apache.org/jira/browse/ARROW-2307
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Benjamin Duffield
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> Using Java Arrow, I'm creating an Arrow stream with the stream writer.
>  
> Sometimes I don't have anything to serialize, and so I don't write any record 
> batches. My arrow stream thus consists of just a schema message. 
> {code:java}
> 
> 
> {code}
> I am able to deserialize this arrow stream correctly using the java stream 
> reader, but when reading it with python I instead hit an error
> {code}
> import pyarrow as pa
> # ...
> reader = pa.open_stream(stream)
> df = reader.read_all().to_pandas()
> {code}
> produces
> {code}
>   File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Must pass at least one record batch
> {code}
> i.e. we're hitting the check in 
> https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284
> The workaround we're currently using is to always ensure we serialize at 
> least one record batch, even if it's empty. However, I think it would be nice 
> to either support a stream without record batches or explicitly disallow this 
> and then match behaviour in java.
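
A minimal reproduction sketch of the failing case using the pyarrow stream API 
named in the report (the single-column schema is an arbitrary example):

{code:python}
import pyarrow as pa

# Write a stream that contains a schema message but zero record batches.
schema = pa.schema([pa.field('x', pa.int64())])
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, schema)
writer.close()  # no batches written

# Before the fix, read_all() raised "ArrowInvalid: Must pass at least one
# record batch"; with it, an empty table with the stream's schema comes back.
reader = pa.open_stream(sink.getvalue())
table = reader.read_all()
assert table.num_rows == 0
{code}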



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399687#comment-16399687
 ] 

ASF GitHub Bot commented on ARROW-2307:
---

wesm closed pull request #1747: ARROW-2307: [Python] Allow reading record batch 
streams with zero record batches
URL: https://github.com/apache/arrow/pull/1747
 
 
   

This is a PR merged from a forked repository. As GitHub hides the original
diff on merge, it is displayed below for the sake of provenance:

diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat
index beefee6c0..a29ef0bad 100644
--- a/ci/msvc-build.bat
+++ b/ci/msvc-build.bat
@@ -69,7 +69,8 @@ if "%JOB%" == "Build_Debug" (
 )
 
 conda create -n arrow -q -y python=%PYTHON% ^
-  six pytest setuptools numpy pandas cython ^
+  six pytest setuptools numpy pandas ^
+  cython=0.27.3 ^
   thrift-cpp=0.11.0
 
 call activate arrow
diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh
index a776c4263..247d10278 100755
--- a/ci/travis_script_python.sh
+++ b/ci/travis_script_python.sh
@@ -41,7 +41,7 @@ conda install -y -q pip \
   cloudpickle \
   numpy=1.13.1 \
   pandas \
-  cython
+  cython=0.27.3
 
 # ARROW-2093: PyTorch increases the size of our conda dependency stack
 # significantly, and so we have disabled these tests in Travis CI for now
diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc
index 24c8d5e15..b1cf6e59a 100644
--- a/cpp/src/arrow/table-test.cc
+++ b/cpp/src/arrow/table-test.cc
@@ -374,6 +374,17 @@ TEST_F(TestTable, FromRecordBatches) {
   ASSERT_RAISES(Invalid, Table::FromRecordBatches({batch1, batch2}, &result));
 }
 
+TEST_F(TestTable, FromRecordBatchesZeroLength) {
+  // ARROW-2307
+  MakeExample1(10);
+
+  std::shared_ptr<Table> result;
+  ASSERT_OK(Table::FromRecordBatches(schema_, {}, &result));
+
+  ASSERT_EQ(0, result->num_rows());
+  ASSERT_TRUE(result->schema()->Equals(*schema_));
+}
+
 TEST_F(TestTable, ConcatenateTables) {
   const int64_t length = 10;
 
diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index ed5858624..f6ac6dd3b 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -297,14 +297,9 @@ std::shared_ptr<Table> Table::Make(const std::shared_ptr<Schema>& schema,
   return std::make_shared<Table>(schema, arrays, num_rows);
 }
 
-Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches,
+Status Table::FromRecordBatches(const std::shared_ptr<Schema>& schema,
+                                const std::vector<std::shared_ptr<RecordBatch>>& batches,
                                 std::shared_ptr<Table>* table) {
-  if (batches.size() == 0) {
-    return Status::Invalid("Must pass at least one record batch");
-  }
-
-  std::shared_ptr<Schema> schema = batches[0]->schema();
-
   const int nbatches = static_cast<int>(batches.size());
   const int ncolumns = static_cast<int>(schema->num_fields());
 
@@ -332,6 +327,15 @@ Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>&
   return Status::OK();
 }
 
+Status Table::FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches,
+                                std::shared_ptr<Table>* table) {
+  if (batches.size() == 0) {
+    return Status::Invalid("Must pass at least one record batch");
+  }
+
+  return FromRecordBatches(batches[0]->schema(), batches, table);
+}
+
 Status ConcatenateTables(const std::vector<std::shared_ptr<Table>>& tables,
                          std::shared_ptr<Table>* table) {
   if (tables.size() == 0) {
diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h
index 7274fca4d..20d027d6a 100644
--- a/cpp/src/arrow/table.h
+++ b/cpp/src/arrow/table.h
@@ -169,9 +169,25 @@ class ARROW_EXPORT Table {
                  const std::vector<std::shared_ptr<Column>>& arrays,
                  int64_t num_rows = -1);
 
-  // Construct table from RecordBatch, but only if all of the batch schemas are
-  // equal. Returns Status::Invalid if there is some problem
+  /// \brief Construct table from RecordBatches, using schema supplied by the first
+  /// RecordBatch.
+  ///
+  /// \param[in] batches a std::vector of record batches
+  /// \param[out] table the returned table
+  /// \return Status Returns Status::Invalid if there is some problem
+  static Status FromRecordBatches(
+      const std::vector<std::shared_ptr<RecordBatch>>& batches,
+      std::shared_ptr<Table>* table);
+
+  /// Construct table from RecordBatches, using supplied schema. There may be
+  /// zero record batches
+  ///
+  /// \param[in] schema the arrow::Schema for each batch
+  /// \param[in] batches a std::vector of record batches
+  /// \param[out] table the returned table
+  /// \return Status
   static Status FromRecordBatches(
+      const std::shared_ptr<Schema>& schema,
       const std::vector<std::shared_ptr<RecordBatch>>& batches,
       std::shared_ptr<Table>* table);
 
diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd
index 3d0c02b89..01a641896 100644
--- a/pyt

[jira] [Commented] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399684#comment-16399684
 ] 

ASF GitHub Bot commented on ARROW-2307:
---

wesm commented on issue #1747: ARROW-2307: [Python] Allow reading record batch 
streams with zero record batches
URL: https://github.com/apache/arrow/pull/1747#issuecomment-373223120
 
 
   +1. Appveyor build looking good: 
https://ci.appveyor.com/project/wesm/arrow/build/1.0.1776


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Unable to read arrow stream containing 0 record batches
> 
>
> Key: ARROW-2307
> URL: https://issues.apache.org/jira/browse/ARROW-2307
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Benjamin Duffield
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> Using Java Arrow, I'm creating an Arrow stream with the stream writer.
>  
> Sometimes I don't have anything to serialize, and so I don't write any record 
> batches. My arrow stream thus consists of just a schema message. 
> {code:java}
> 
> 
> {code}
> I am able to deserialize this arrow stream correctly using the java stream 
> reader, but when reading it with python I instead hit an error
> {code}
> import pyarrow as pa
> # ...
> reader = pa.open_stream(stream)
> df = reader.read_all().to_pandas()
> {code}
> produces
> {code}
>   File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Must pass at least one record batch
> {code}
> i.e. we're hitting the check in 
> https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284
> The workaround we're currently using is to always ensure we serialize at 
> least one record batch, even if it's empty. However, I think it would be nice 
> to either support a stream without record batches or explicitly disallow this 
> and then match behaviour in java.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2309) [C++] Use std::make_unsigned

2018-03-14 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2309:
--
Labels: pull-request-available  (was: )

> [C++] Use std::make_unsigned
> 
>
> Key: ARROW-2309
> URL: https://issues.apache.org/jira/browse/ARROW-2309
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>
> {{arrow/util/bit-util.h}} has a reimplementation of {{boost::make_unsigned}}, 
> but we could simply use {{std::make_unsigned}}, which is C++11.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2309) [C++] Use std::make_unsigned

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399439#comment-16399439
 ] 

ASF GitHub Bot commented on ARROW-2309:
---

pitrou opened a new pull request #1748: ARROW-2309: [C++] Use std::make_unsigned
URL: https://github.com/apache/arrow/pull/1748
 
 
   No need for our own reimplementation.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] Use std::make_unsigned
> 
>
> Key: ARROW-2309
> URL: https://issues.apache.org/jira/browse/ARROW-2309
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Trivial
>  Labels: pull-request-available
>
> {{arrow/util/bit-util.h}} has a reimplementation of {{boost::make_unsigned}}, 
> but we could simply use {{std::make_unsigned}}, which is C++11.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2309) [C++] Use std::make_unsigned

2018-03-14 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2309:
-

 Summary: [C++] Use std::make_unsigned
 Key: ARROW-2309
 URL: https://issues.apache.org/jira/browse/ARROW-2309
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


{{arrow/util/bit-util.h}} has a reimplementation of {{boost::make_unsigned}}, 
but we could simply use {{std::make_unsigned}}, which is C++11.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches

2018-03-14 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2307:
--
Labels: pull-request-available  (was: )

> [Python] Unable to read arrow stream containing 0 record batches
> 
>
> Key: ARROW-2307
> URL: https://issues.apache.org/jira/browse/ARROW-2307
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Benjamin Duffield
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> Using Java Arrow, I'm creating an Arrow stream with the stream writer.
>  
> Sometimes I don't have anything to serialize, and so I don't write any record 
> batches. My arrow stream thus consists of just a schema message. 
> {code:java}
> 
> 
> {code}
> I am able to deserialize this arrow stream correctly using the java stream 
> reader, but when reading it with python I instead hit an error
> {code}
> import pyarrow as pa
> # ...
> reader = pa.open_stream(stream)
> df = reader.read_all().to_pandas()
> {code}
> produces
> {code}
>   File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Must pass at least one record batch
> {code}
> i.e. we're hitting the check in 
> https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284
> The workaround we're currently using is to always ensure we serialize at 
> least one record batch, even if it's empty. However, I think it would be nice 
> to either support a stream without record batches or explicitly disallow this 
> and then match behaviour in java.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16399055#comment-16399055
 ] 

ASF GitHub Bot commented on ARROW-2307:
---

wesm opened a new pull request #1747: ARROW-2307: [Python] Allow reading record 
batch streams with zero record batches
URL: https://github.com/apache/arrow/pull/1747
 
 
   This is a pretty rough edge case -- it would be good to get this fix into 
0.9.0.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Unable to read arrow stream containing 0 record batches
> 
>
> Key: ARROW-2307
> URL: https://issues.apache.org/jira/browse/ARROW-2307
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Benjamin Duffield
>Assignee: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
>
> Using Java Arrow, I'm creating an Arrow stream with the stream writer.
>  
> Sometimes I don't have anything to serialize, and so I don't write any record 
> batches. My arrow stream thus consists of just a schema message. 
> {code:java}
> 
> 
> {code}
> I am able to deserialize this arrow stream correctly using the java stream 
> reader, but when reading it with python I instead hit an error
> {code}
> import pyarrow as pa
> # ...
> reader = pa.open_stream(stream)
> df = reader.read_all().to_pandas()
> {code}
> produces
> {code}
>   File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Must pass at least one record batch
> {code}
> i.e. we're hitting the check in 
> https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284
> The workaround we're currently using is to always ensure we serialize at 
> least one record batch, even if it's empty. However, I think it would be nice 
> to either support a stream without record batches or explicitly disallow this 
> and then match behaviour in java.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2307) Unable to read arrow stream containing 0 record batches using pyarrow

2018-03-14 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2307:
---

Assignee: Wes McKinney

> Unable to read arrow stream containing 0 record batches using pyarrow
> -
>
> Key: ARROW-2307
> URL: https://issues.apache.org/jira/browse/ARROW-2307
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Benjamin Duffield
>Assignee: Wes McKinney
>Priority: Major
>
> Using Java Arrow, I'm creating an Arrow stream with the stream writer.
>  
> Sometimes I don't have anything to serialize, and so I don't write any record 
> batches. My arrow stream thus consists of just a schema message. 
> {code:java}
> 
> 
> {code}
> I am able to deserialize this arrow stream correctly using the java stream 
> reader, but when reading it with python I instead hit an error
> {code}
> import pyarrow as pa
> # ...
> reader = pa.open_stream(stream)
> df = reader.read_all().to_pandas()
> {code}
> produces
> {code}
>   File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Must pass at least one record batch
> {code}
> i.e. we're hitting the check in 
> https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284
> The workaround we're currently using is to always ensure we serialize at 
> least one record batch, even if it's empty. However, I think it would be nice 
> to either support a stream without record batches or explicitly disallow this 
> and then match behaviour in java.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches

2018-03-14 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398813#comment-16398813
 ] 

Wes McKinney commented on ARROW-2307:
-

Working on a fix for this

> [Python] Unable to read arrow stream containing 0 record batches
> 
>
> Key: ARROW-2307
> URL: https://issues.apache.org/jira/browse/ARROW-2307
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Benjamin Duffield
>Assignee: Wes McKinney
>Priority: Major
>
> Using Java Arrow, I'm creating an Arrow stream with the stream writer.
>  
> Sometimes I don't have anything to serialize, and so I don't write any record 
> batches. My arrow stream thus consists of just a schema message. 
> {code:java}
> 
> 
> {code}
> I am able to deserialize this arrow stream correctly using the java stream 
> reader, but when reading it with python I instead hit an error
> {code}
> import pyarrow as pa
> # ...
> reader = pa.open_stream(stream)
> df = reader.read_all().to_pandas()
> {code}
> produces
> {code}
>   File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Must pass at least one record batch
> {code}
> i.e. we're hitting the check in 
> https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284
> The workaround we're currently using is to always ensure we serialize at 
> least one record batch, even if it's empty. However, I think it would be nice 
> to either support a stream without record batches or explicitly disallow this 
> and then match behaviour in java.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches

2018-03-14 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2307:

Component/s: (was: C)

> [Python] Unable to read arrow stream containing 0 record batches
> 
>
> Key: ARROW-2307
> URL: https://issues.apache.org/jira/browse/ARROW-2307
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Benjamin Duffield
>Assignee: Wes McKinney
>Priority: Major
>
> Using Java Arrow, I'm creating an Arrow stream with the stream writer.
>  
> Sometimes I don't have anything to serialize, and so I don't write any record 
> batches. My arrow stream thus consists of just a schema message. 
> {code:java}
> 
> 
> {code}
> I am able to deserialize this arrow stream correctly using the java stream 
> reader, but when reading it with python I instead hit an error
> {code}
> import pyarrow as pa
> # ...
> reader = pa.open_stream(stream)
> df = reader.read_all().to_pandas()
> {code}
> produces
> {code}
>   File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Must pass at least one record batch
> {code}
> i.e. we're hitting the check in 
> https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284
> The workaround we're currently using is to always ensure we serialize at 
> least one record batch, even if it's empty. However, I think it would be nice 
> to either support a stream without record batches or explicitly disallow this 
> and then match behaviour in java.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2307) [Python] Unable to read arrow stream containing 0 record batches

2018-03-14 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2307:

Summary: [Python] Unable to read arrow stream containing 0 record batches  
(was: Unable to read arrow stream containing 0 record batches using pyarrow)

> [Python] Unable to read arrow stream containing 0 record batches
> 
>
> Key: ARROW-2307
> URL: https://issues.apache.org/jira/browse/ARROW-2307
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Benjamin Duffield
>Assignee: Wes McKinney
>Priority: Major
>
> Using Java Arrow, I'm creating an Arrow stream with the stream writer.
>  
> Sometimes I don't have anything to serialize, and so I don't write any record 
> batches. My arrow stream thus consists of just a schema message. 
> {code:java}
> 
> 
> {code}
> I am able to deserialize this arrow stream correctly using the java stream 
> reader, but when reading it with python I instead hit an error
> {code}
> import pyarrow as pa
> # ...
> reader = pa.open_stream(stream)
> df = reader.read_all().to_pandas()
> {code}
> produces
> {code}
>   File "ipc.pxi", line 307, in pyarrow.lib._RecordBatchReader.read_all
>   File "error.pxi", line 77, in pyarrow.lib.check_status
> ArrowInvalid: Must pass at least one record batch
> {code}
> i.e. we're hitting the check in 
> https://github.com/apache/arrow/blob/apache-arrow-0.8.0/cpp/src/arrow/table.cc#L284
> The workaround we're currently using is to always ensure we serialize at 
> least one record batch, even if it's empty. However, I think it would be nice 
> to either support a stream without record batches or explicitly disallow this 
> and then match behaviour in java.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1701) [Serialization] Support zero copy PyTorch Tensor serialization

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398697#comment-16398697
 ] 

ASF GitHub Bot commented on ARROW-1701:
---

ppwwyyxx commented on issue #1223: ARROW-1701: [Serialization] Support zero 
copy PyTorch Tensor serialization
URL: https://github.com/apache/arrow/pull/1223#issuecomment-373049215
 
 
   Can we push the update to pypi?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Serialization] Support zero copy PyTorch Tensor serialization
> --
>
> Key: ARROW-1701
> URL: https://issues.apache.org/jira/browse/ARROW-1701
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see http://pytorch.org/docs/master/tensors.html
> This should be optional and only included if the user has PyTorch installed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2140) [Python] Conversion from Numpy float16 array unimplemented

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398732#comment-16398732
 ] 

ASF GitHub Bot commented on ARROW-2140:
---

pitrou commented on issue #1744: ARROW-2140: [Python] Improve float16 support
URL: https://github.com/apache/arrow/pull/1744#issuecomment-373059009
 
 
   The Travis-CI failure is due to a regression in Cython 0.28: 
https://github.com/cython/cython/issues/2148


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy float16 array unimplemented
> --
>
> Key: ARROW-2140
> URL: https://issues.apache.org/jira/browse/ARROW-2140
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code}
> >>> arr = np.array([1.5], dtype=np.float16)
> >>> pa.array(arr, type=pa.float16())
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.array(arr)
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "array.pxi", line 84, in pyarrow.lib._ndarray_to_array
>   File "public-api.pxi", line 158, in pyarrow.lib.pyarrow_wrap_array
> KeyError: 10
> {code}
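
For reference, a sketch of the intended behavior once float16 support is 
improved; the float16 round-trip dtype is an assumption about the fix, not 
quoted from the PR:

{code:python}
import numpy as np
import pyarrow as pa

arr = np.array([1.5], dtype=np.float16)
a = pa.array(arr, type=pa.float16())  # should no longer raise KeyError
# Assumed post-fix behavior: values come back as float16, not as a cast.
assert a.to_pandas().dtype == np.float16
{code}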



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2140) [Python] Conversion from Numpy float16 array unimplemented

2018-03-14 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2140:
--
Labels: pull-request-available  (was: )

> [Python] Conversion from Numpy float16 array unimplemented
> --
>
> Key: ARROW-2140
> URL: https://issues.apache.org/jira/browse/ARROW-2140
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code}
> >>> arr = np.array([1.5], dtype=np.float16)
> >>> pa.array(arr, type=pa.float16())
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.array(arr)
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "array.pxi", line 84, in pyarrow.lib._ndarray_to_array
>   File "public-api.pxi", line 158, in pyarrow.lib.pyarrow_wrap_array
> KeyError: 10
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1701) [Serialization] Support zero copy PyTorch Tensor serialization

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398699#comment-16398699
 ] 

ASF GitHub Bot commented on ARROW-1701:
---

ppwwyyxx commented on issue #1223: ARROW-1701: [Serialization] Support zero 
copy PyTorch Tensor serialization
URL: https://github.com/apache/arrow/pull/1223#issuecomment-373049215
 
 
   Can we push the update to pypi? (Just bitten by the same issue again)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Serialization] Support zero copy PyTorch Tensor serialization
> --
>
> Key: ARROW-1701
> URL: https://issues.apache.org/jira/browse/ARROW-1701
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> see http://pytorch.org/docs/master/tensors.html
> This should be optional and only included if the user has PyTorch installed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2140) [Python] Conversion from Numpy float16 array unimplemented

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398688#comment-16398688
 ] 

ASF GitHub Bot commented on ARROW-2140:
---

pitrou opened a new pull request #1744: ARROW-2140: [Python] Improve float16 
support
URL: https://github.com/apache/arrow/pull/1744
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] Conversion from Numpy float16 array unimplemented
> --
>
> Key: ARROW-2140
> URL: https://issues.apache.org/jira/browse/ARROW-2140
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> {code}
> >>> arr = np.array([1.5], dtype=np.float16)
> >>> pa.array(arr, type=pa.float16())
> Traceback (most recent call last):
>   File "", line 1, in 
> pa.array(arr)
>   File "array.pxi", line 177, in pyarrow.lib.array
>   File "array.pxi", line 84, in pyarrow.lib._ndarray_to_array
>   File "public-api.pxi", line 158, in pyarrow.lib.pyarrow_wrap_array
> KeyError: 10
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-45) Python: Add unnest/flatten function for List types

2018-03-14 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398663#comment-16398663
 ] 

Wes McKinney commented on ARROW-45:
---

Yes

> Python: Add unnest/flatten function for List types
> --
>
> Key: ARROW-45
> URL: https://issues.apache.org/jira/browse/ARROW-45
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
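
For context, a sketch of what unnest/flatten means for list arrays, written in 
plain Python rather than as the eventual API:

{code:python}
import pyarrow as pa

arr = pa.array([[1, 2], [], [3, 4]], type=pa.list_(pa.int64()))
# Unnest/flatten drops one level of nesting, concatenating the sub-lists.
flat = pa.array([item for sublist in arr.to_pylist()
                 if sublist is not None
                 for item in sublist])
assert flat.to_pylist() == [1, 2, 3, 4]
{code}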




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2308) Serialized tensor data should be 64-byte aligned.

2018-03-14 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398658#comment-16398658
 ] 

Wes McKinney commented on ARROW-2308:
-

Making tensors 64-byte aligned makes sense to me. There's some ongoing 
refactoring related to this in ARROW-1860 -- I suggest we work on resolving all 
of these issues together

> Serialized tensor data should be 64-byte aligned.
> -
>
> Key: ARROW-2308
> URL: https://issues.apache.org/jira/browse/ARROW-2308
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
>
> See [https://github.com/ray-project/ray/issues/1658] for an example of this 
> issue. Non-aligned data can trigger a copy when fed into TensorFlow and 
> things like that.
> {code}
> import pyarrow as pa
> import numpy as np
> x = np.zeros(10)
> y = pa.deserialize(pa.serialize(x).to_buffer())
> x.ctypes.data % 64  # 0 (it starts out aligned)
> y.ctypes.data % 64  # 48 (it is no longer aligned)
> {code}
> It should be possible to fix this by calling something like 
> {{RETURN_NOT_OK(AlignStreamPosition(dst));}} before writing the array data. 
> Note that we already do this before writing the tensor header, but the tensor 
> header is not necessarily a multiple of 64 bytes, so the subsequent data can 
> be unaligned.
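
The padding itself is simple arithmetic; a sketch of rounding a stream position 
up to the next 64-byte boundary, i.e. the adjustment such an alignment call 
would perform before the array data is written:

{code:python}
def align_to(position, alignment=64):
    # Round up to the next multiple of `alignment` (a power of two);
    # positions already on the boundary are unchanged.
    return (position + alignment - 1) & ~(alignment - 1)

assert align_to(0) == 0
assert align_to(48) == 64   # the misaligned offset from the example above
assert align_to(64) == 64
{code}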



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1886) [Python] Add function to "flatten" structs within tables

2018-03-14 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398620#comment-16398620
 ] 

Wes McKinney commented on ARROW-1886:
-

I believe so, yes

> [Python] Add function to "flatten" structs within tables
> 
>
> Key: ARROW-1886
> URL: https://issues.apache.org/jira/browse/ARROW-1886
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> See discussion in https://issues.apache.org/jira/browse/ARROW-1873
> When a user has a struct column, it may be more efficient to flatten the 
> struct into multiple columns of the form {{struct_name.field_name}} for each 
> field in the struct. Then when you call {{to_pandas}}, Python dictionaries do 
> not have to be created, and the conversion will be much more efficient



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-1886) [Python] Add function to "flatten" structs within tables

2018-03-14 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398563#comment-16398563
 ] 

Antoine Pitrou commented on ARROW-1886:
---

Should this happen on the C++ side as well?

> [Python] Add function to "flatten" structs within tables
> 
>
> Key: ARROW-1886
> URL: https://issues.apache.org/jira/browse/ARROW-1886
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>
> See discussion in https://issues.apache.org/jira/browse/ARROW-1873
> When a user has a struct column, it may be more efficient to flatten the 
> struct into multiple columns of the form {{struct_name.field_name}} for each 
> field in the struct. Then when you call {{to_pandas}}, Python dictionaries do 
> not have to be created, and the conversion will be much more efficient



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-45) Python: Add unnest/flatten function for List types

2018-03-14 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398556#comment-16398556
 ] 

Antoine Pitrou commented on ARROW-45:
-

Should this happen on the C++ side as well?

> Python: Add unnest/flatten function for List types
> --
>
> Key: ARROW-45
> URL: https://issues.apache.org/jira/browse/ARROW-45
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.10.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2304) [C++] MultipleClients test in io-hdfs-test fails on trunk

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398362#comment-16398362
 ] 

ASF GitHub Bot commented on ARROW-2304:
---

xhochy closed pull request #1743: ARROW-2304: [C++] Fix HDFS MultipleClients 
unit test
URL: https://github.com/apache/arrow/pull/1743
 
 
   

This is a PR merged from a forked repository. As GitHub hides the original
diff on merge, it is displayed below for the sake of provenance:

diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc
index 610a91fbc..e02215b5e 100644
--- a/cpp/src/arrow/io/io-hdfs-test.cc
+++ b/cpp/src/arrow/io/io-hdfs-test.cc
@@ -181,6 +181,8 @@ TYPED_TEST(TestHadoopFileSystem, ConnectsAgain) {
 TYPED_TEST(TestHadoopFileSystem, MultipleClients) {
   SKIP_IF_NO_DRIVER();
 
+  ASSERT_OK(this->MakeScratchDir());
+
   std::shared_ptr<HadoopFileSystem> client1;
   std::shared_ptr<HadoopFileSystem> client2;
   ASSERT_OK(HadoopFileSystem::Connect(&this->conf_, &client1));
@@ -189,7 +191,7 @@ TYPED_TEST(TestHadoopFileSystem, MultipleClients) {
 
   // client2 continues to function after equivalent client1 has shutdown
   std::vector<HdfsPathInfo> listing;
-  EXPECT_OK(client2->ListDirectory(this->scratch_dir_, &listing));
+  ASSERT_OK(client2->ListDirectory(this->scratch_dir_, &listing));
   ASSERT_OK(client2->Disconnect());
 }
 


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [C++] MultipleClients test in io-hdfs-test fails on trunk
> -
>
> Key: ARROW-2304
> URL: https://issues.apache.org/jira/browse/ARROW-2304
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This fails for me locally:
> {code}
> [ RUN  ] TestHadoopFileSystem/0.MultipleClients
> ../src/arrow/io/io-hdfs-test.cc:192: Failure
> Value of: s.ok()
>   Actual: false
> Expected: true
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2304) [C++] MultipleClients test in io-hdfs-test fails on trunk

2018-03-14 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-2304.

Resolution: Fixed

Issue resolved by pull request 1743
[https://github.com/apache/arrow/pull/1743]

> [C++] MultipleClients test in io-hdfs-test fails on trunk
> -
>
> Key: ARROW-2304
> URL: https://issues.apache.org/jira/browse/ARROW-2304
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> This fails for me locally:
> {code}
> [ RUN  ] TestHadoopFileSystem/0.MultipleClients
> ../src/arrow/io/io-hdfs-test.cc:192: Failure
> Value of: s.ok()
>   Actual: false
> Expected: true
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-2306) [Python] HDFS test failures

2018-03-14 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-2306.

Resolution: Fixed

Issue resolved by pull request 1742
[https://github.com/apache/arrow/pull/1742]

> [Python] HDFS test failures
> ---
>
> Key: ARROW-2306
> URL: https://issues.apache.org/jira/browse/ARROW-2306
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> These weren't caught because we aren't running the HDFS tests in Travis CI
> {code}
> pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_no_partitions 
> FAILED
> >>> traceback >>>
> self = <pyarrow.tests.test_hdfs.TestLibHdfs 
> testMethod=test_write_to_dataset_no_partitions>
> @test_parquet.parquet
> def test_write_to_dataset_no_partitions(self):
> tmpdir = pjoin(self.tmp_path, 'write-no_partitions-' + guid())
> self.hdfs.mkdir(tmpdir)
> test_parquet._test_write_to_dataset_no_partitions(
> >   tmpdir, filesystem=self.hdfs)
> pyarrow/tests/test_hdfs.py:367: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> pyarrow/tests/test_parquet.py:1475: in _test_write_to_dataset_no_partitions
> filesystem=filesystem)
> pyarrow/parquet.py:1059: in write_to_dataset
> _mkdir_if_not_exists(fs, root_path)
> pyarrow/parquet.py:1006: in _mkdir_if_not_exists
> if fs._isfilestore() and not fs.exists(path):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> self = <pyarrow.hdfs.HadoopFileSystem object at 0x...>
> def _isfilestore(self):
> """
> Returns True if this FileSystem is a unix-style file store with
> directories.
> """
> >   raise NotImplementedError
> E   NotImplementedError
> pyarrow/filesystem.py:143: NotImplementedError
> >>> entering PDB >>>
> > /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore()
> -> raise NotImplementedError
> (Pdb) c
> pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_with_partitions
>  FAILED
> >>> traceback >>>
> self = <pyarrow.tests.test_hdfs.TestLibHdfs 
> testMethod=test_write_to_dataset_with_partitions>
> @test_parquet.parquet
> def test_write_to_dataset_with_partitions(self):
> tmpdir = pjoin(self.tmp_path, 'write-partitions-' + guid())
> self.hdfs.mkdir(tmpdir)
> test_parquet._test_write_to_dataset_with_partitions(
> >   tmpdir, filesystem=self.hdfs)
> pyarrow/tests/test_hdfs.py:360: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> pyarrow/tests/test_parquet.py:1433: in _test_write_to_dataset_with_partitions
> filesystem=filesystem)
> pyarrow/parquet.py:1059: in write_to_dataset
> _mkdir_if_not_exists(fs, root_path)
> pyarrow/parquet.py:1006: in _mkdir_if_not_exists
> if fs._isfilestore() and not fs.exists(path):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> self = <pyarrow.hdfs.HadoopFileSystem object at 0x...>
> def _isfilestore(self):
> """
> Returns True if this FileSystem is a unix-style file store with
> directories.
> """
> >   raise NotImplementedError
> E   NotImplementedError
> pyarrow/filesystem.py:143: NotImplementedError
> >>> entering PDB >>>
> > /home/wesm/code/arrow/python/pyarrow/filesystem.py(143)_isfilestore()
> -> raise NotImplementedError
> (Pdb) c
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2306) [Python] HDFS test failures

2018-03-14 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398360#comment-16398360
 ] 

ASF GitHub Bot commented on ARROW-2306:
---

xhochy closed pull request #1742: ARROW-2306: [Python] Fix partitioned Parquet 
test against HDFS
URL: https://github.com/apache/arrow/pull/1742
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/python/pyarrow/hdfs.py b/python/pyarrow/hdfs.py
index 3f2014b65..34ddfaef3 100644
--- a/python/pyarrow/hdfs.py
+++ b/python/pyarrow/hdfs.py
@@ -40,6 +40,13 @@ def __reduce__(self):
         return (HadoopFileSystem, (self.host, self.port, self.user,
                                    self.kerb_ticket, self.driver))
 
+    def _isfilestore(self):
+        """
+        Returns True if this FileSystem is a unix-style file store with
+        directories.
+        """
+        return True
+
     @implements(FileSystem.isdir)
     def isdir(self, path):
         return super(HadoopFileSystem, self).isdir(path)
diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index fd9c740f1..0929a1549 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -1103,6 +1103,9 @@ def write_metadata(schema, where, version='1.0',
     coerce_timestamps : string, default None
         Cast timestamps a particular resolution.
         Valid values: {None, 'ms', 'us'}
+    filesystem : FileSystem, default None
+        If nothing passed, paths assumed to be found in the local on-disk
+        filesystem
     """
     writer = ParquetWriter(
         where, schema, version=version,
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py
index a3da05fe3..b301de606 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -1431,8 +1431,15 @@ def _test_write_to_dataset_with_partitions(base_path, filesystem=None):
     output_table = pa.Table.from_pandas(output_df)
     pq.write_to_dataset(output_table, base_path, partition_by,
                         filesystem=filesystem)
-    pq.write_metadata(output_table.schema,
-                      os.path.join(base_path, '_common_metadata'))
+
+    metadata_path = os.path.join(base_path, '_common_metadata')
+
+    if filesystem is not None:
+        with filesystem.open(metadata_path, 'wb') as f:
+            pq.write_metadata(output_table.schema, f)
+    else:
+        pq.write_metadata(output_table.schema, metadata_path)
+
     dataset = pq.ParquetDataset(base_path, filesystem=filesystem)
     # ARROW-2209: Ensure the dataset schema also includes the partition columns
     dataset_cols = set(dataset.schema.to_arrow_schema().names)
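
For context, the pattern the fixed test follows: hand an explicit file
object to {{write_metadata}} when a remote filesystem is involved, since
plain string paths are resolved against the local filesystem. A sketch
(host, port and paths are assumptions):

{code}
import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect('localhost', 8020)  # assumes a reachable HDFS cluster
table = pa.Table.from_arrays([pa.array([1, 2, 3])], names=['f0'])

with fs.open('/tmp/dataset/_common_metadata', 'wb') as f:
    pq.write_metadata(table.schema, f)
{code}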


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Python] HDFS test failures
> ---
>
> Key: ARROW-2306
> URL: https://issues.apache.org/jira/browse/ARROW-2306
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> These weren't caught because we aren't running the HDFS tests in Travis CI
> {code}
> pyarrow/tests/test_hdfs.py::TestLibHdfs::test_write_to_dataset_no_partitions 
> FAILED
> >>> traceback >>>
> self = <pyarrow.tests.test_hdfs.TestLibHdfs 
> testMethod=test_write_to_dataset_no_partitions>
> @test_parquet.parquet
> def test_write_to_dataset_no_partitions(self):
> tmpdir = pjoin(self.tmp_path, 'write-no_partitions-' + guid())
> self.hdfs.mkdir(tmpdir)
> test_parquet._test_write_to_dataset_no_partitions(
> >   tmpdir, filesystem=self.hdfs)
> pyarrow/tests/test_hdfs.py:367: 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> pyarrow/tests/test_parquet.py:1475: in _test_write_to_dataset_no_partitions
> filesystem=filesystem)
> pyarrow/parquet.py:1059: in write_to_dataset
> _mkdir_if_not_exists(fs, root_path)
> pyarrow/parquet.py:1006: in _mkdir_if_not_exists
> if fs._isfilestore() and not fs.exists(path):
> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> _ _ _ _ _ _ _ _ _ _ 

[jira] [Commented] (ARROW-640) [Python] Arrow scalar values should have a sensible __hash__ and comparison

2018-03-14 Thread Antoine Pitrou (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398355#comment-16398355
 ] 

Antoine Pitrou commented on ARROW-640:
--

I don't think we're concerned about particular workloads for now. Something 
like {{%timeit hash(x)}} (in IPython syntax) is a good micro-benchmark for 
this.

Integers are the main type I think might be used in a hashing context, so you 
may want to write a native hash implementation for them, while letting other 
types defer to {{as_py}}.

Also in some cases (such as StructValue), the {{as_py}} fallback won't work. We 
may or may not care about this immediately (i.e. if you only want to implement 
numbers, we can open an issue for the other types).
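
A minimal sketch of that deferral approach, as a pure-Python
illustration (the real scalar classes live in Cython; the {{ArrayValue}}
shape below is an assumption, not the actual class):

{code}
class ArrayValue(object):
    """Illustrative stand-in for a pyarrow scalar wrapper."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        # The real implementation converts the underlying Arrow scalar.
        return self._value

    def __hash__(self):
        # Defer to the Python value so hash(scalar) == hash(scalar.as_py()),
        # making scalars interchangeable with plain values in sets and dicts.
        return hash(self.as_py())

    def __eq__(self, other):
        if isinstance(other, ArrayValue):
            other = other.as_py()
        return self.as_py() == other

assert len({ArrayValue(1), ArrayValue(1), ArrayValue(2)}) == 2
{code}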

> [Python] Arrow scalar values should have a sensible __hash__ and comparison
> ---
>
> Key: ARROW-640
> URL: https://issues.apache.org/jira/browse/ARROW-640
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Miki Tebeka
>Assignee: Alex Hagerman
>Priority: Major
> Fix For: 0.10.0
>
>
> {noformat}
> In [86]: arr = pa.from_pylist([1, 1, 1, 2])
> In [87]: set(arr)
> Out[87]: {1, 2, 1, 1}
> In [88]: arr[0] == arr[1]
> Out[88]: False
> In [89]: arr
> Out[89]: 
> 
> [
>   1,
>   1,
>   1,
>   2
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)