[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438842#comment-16438842
 ] 

ASF GitHub Bot commented on ARROW-2430:
---

kszucs commented on issue #1869: ARROW-2430: [Packaging] MVP for branch based 
packaging automation
URL: https://github.com/apache/arrow/pull/1869#issuecomment-381438795
 
 
   Ok, cleaning it!


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> MVP for branch based packaging automation
> -
>
> Key: ARROW-2430
> URL: https://issues.apache.org/jira/browse/ARROW-2430
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Packaging
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>
> Described in 
> https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-g9EGPOtcFdtMBzEyDJv48BKc/edit



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438841#comment-16438841
 ] 

ASF GitHub Bot commented on ARROW-2430:
---

wesm commented on issue #1869: ARROW-2430: [Packaging] MVP for branch based 
packaging automation
URL: https://github.com/apache/arrow/pull/1869#issuecomment-381438403
 
 
   There's some library code in crossbow.py that's commingled with the CLI 
interface. It would be better to have a clean library (with logging callbacks) 
and put a CLI on top rather than having a CLI be the only way to use the 
system. 
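
   A minimal sketch of that separation, for illustration only. The names 
`build_packages`, `on_event` and `main` are hypothetical and are not taken from 
crossbow.py, `argparse` stands in for click to keep the example 
standard-library only, and the "build" itself is a placeholder:

```python
import argparse


def build_packages(branches, on_event=None):
    """Library entry point: no printing and no argument parsing.

    `on_event` is an optional logging callback that receives one message
    string. The build step is a placeholder, not a real packaging job.
    """
    log = on_event or (lambda message: None)
    results = {}
    for branch in branches:
        log("building {}".format(branch))
        results[branch] = "ok"  # placeholder result (assumption)
    log("finished {} build(s)".format(len(results)))
    return results


def main(argv=None):
    """Thin CLI layer: parse arguments and plug in `print` as the callback."""
    parser = argparse.ArgumentParser(description="hypothetical packaging CLI")
    parser.add_argument("branches", nargs="+")
    args = parser.parse_args(argv)
    build_packages(args.branches, on_event=print)
    return 0
```

   With this layout, other tooling could call `build_packages(...)` directly 
and collect messages with its own callback, while the command line only goes 
through `main`.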




[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438839#comment-16438839
 ] 

ASF GitHub Bot commented on ARROW-2430:
---

kszucs commented on issue #1869: ARROW-2430: [Packaging] MVP for branch based 
packaging automation
URL: https://github.com/apache/arrow/pull/1869#issuecomment-381438013
 
 
   Currently this only builds the packages; I haven't added the deployment 
steps yet, as I'm trying to develop incrementally here.
   Several issues will likely surface in practice, and they should come to 
light before the next release of Arrow.
   
   




[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438838#comment-16438838
 ] 

ASF GitHub Bot commented on ARROW-2430:
---

kszucs commented on issue #1869: ARROW-2430: [Packaging] MVP for branch based 
packaging automation
URL: https://github.com/apache/arrow/pull/1869#issuecomment-381437573
 
 
   @wesm No particular reason. Should I move it under `dev/cd` or `dev/packaging`?




[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438837#comment-16438837
 ] 

ASF GitHub Bot commented on ARROW-2430:
---

cpcloud commented on issue #1869: ARROW-2430: [Packaging] MVP for branch based 
packaging automation
URL: https://github.com/apache/arrow/pull/1869#issuecomment-381437289
 
 
   @kszucs great, I will take it for a spin next week and do some more review 
here.




[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438836#comment-16438836
 ] 

ASF GitHub Bot commented on ARROW-2430:
---

wesm commented on issue #1869: ARROW-2430: [Packaging] MVP for branch based 
packaging automation
URL: https://github.com/apache/arrow/pull/1869#issuecomment-381437220
 
 
   I will review in some detail when I can (this coming week). Is there a 
reason not to put all of this under the `dev` directory? On this note, the 
`ci/` directory would arguably fit better under `dev/ci/`




[jira] [Commented] (ARROW-2430) MVP for branch based packaging automation

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438834#comment-16438834
 ] 

ASF GitHub Bot commented on ARROW-2430:
---

wesm commented on a change in pull request #1869: ARROW-2430: [Packaging] MVP 
for branch based packaging automation
URL: https://github.com/apache/arrow/pull/1869#discussion_r181596516
 
 

 ##
 File path: cd/crossbow.py
 ##
 @@ -0,0 +1,201 @@
+#!/usr/bin/env python
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import re
+import sys
+import click
+import pygit2
 
 Review comment:
   Since this dependency isn't necessary for the operation of the Arrow 
libraries for normal users, it is not an issue




[jira] [Created] (ARROW-2462) [C++] Segfault when writing a parquet table containing a dictionary column from Record Batch Stream

2018-04-15 Thread Matt Topol (JIRA)
Matt Topol created ARROW-2462:
-

 Summary: [C++] Segfault when writing a parquet table containing a 
dictionary column from Record Batch Stream
 Key: ARROW-2462
 URL: https://issues.apache.org/jira/browse/ARROW-2462
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.9.1
Reporter: Matt Topol


Discovered this through using pyarrow and dealing with RecordBatch Streams and 
parquet. The issue can be replicated as follows:

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# create record batch with 1 dictionary column
indices = pa.array([1,0,1,1,0])
dictionary = pa.array(['Foo', 'Bar'])
dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
rb = pa.RecordBatch.from_arrays( [ dict_array ], [ 'd0' ] )

# write out using RecordBatchStreamWriter
sink = pa.BufferOutputStream()
writer = pa.RecordBatchStreamWriter(sink, rb.schema)
writer.write_batch(rb)
writer.close()
buf = sink.get_result()

# read in and try to write parquet table
reader = pa.open_stream(buf)
tbl = reader.read_all()
pq.write_table(tbl, 'dict_table.parquet') # SEGFAULTS
{code}

When writing record batch streams, if an array contains no nulls, Arrow writes 
a placeholder nullptr instead of a full bitmap of 1s. When that stream is 
deserialized, the validity bitmap isn't populated and remains a nullptr. When 
attempting to write this table via pyarrow.parquet, you end up 
[here|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.cc#L963]
 in the parquet writer code, which attempts to cast the dictionary to a 
non-dictionary representation. Since the null count isn't checked before 
creating a BitmapReader, the BitmapReader is constructed with a nullptr for the 
bitmap_data but a non-zero length, which then segfaults in the constructor 
[here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit-util.h#L415]
 because the bitmap is null.

So a simple check of the null count before constructing the BitmapReader avoids 
the segfault.
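
The guard can be illustrated with a small sketch in Python rather than the C++ 
of the actual fix. The function name and signature below are hypothetical and 
only model the idea: when the null count is zero the bitmap may be absent, so 
it must not be read at all. Bits are read least-significant bit first, as in 
Arrow validity bitmaps.

```python
def validity_flags(bitmap, length, null_count):
    """Return a list of booleans, True where the slot is valid (non-null).

    `bitmap` is a bytes object or None. This is an illustrative model,
    not the Arrow or parquet-cpp implementation.
    """
    if null_count == 0:
        # No nulls: the bitmap may legitimately be missing, never touch it.
        return [True] * length
    if bitmap is None:
        raise ValueError("null_count > 0 but no validity bitmap was provided")
    # Bit i lives in byte i // 8 at position i % 8 (LSB first).
    return [bool((bitmap[i // 8] >> (i % 8)) & 1) for i in range(length)]
```

Calling `validity_flags(None, 5, 0)` takes the early return and never indexes 
into the missing bitmap, which is the situation that crashed in the report 
above.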

Already filed [PR 1896|https://github.com/apache/arrow/pull/1896]





[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438794#comment-16438794
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

pitrou commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381426651
 
 
   I'm not saying we should necessarily make it faster, just wanted to make 
sure people are aware of the inefficiency.




> [Python] from_pandas reads 'str' type as binary Arrow data with Python 2
> 
>
> Key: ARROW-2101
> URL: https://issues.apache.org/jira/browse/ARROW-2101
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Using Python 2, converting Pandas with 'str' data to Arrow results in Arrow 
> data of binary type, even if the user supplies type information.  conversion 
> of 'unicode' type works to create Arrow data of string types.  For example
> {code}
> In [25]: pa.Array.from_pandas(pd.Series(['a'])).type
> Out[25]: DataType(binary)
> In [26]: pa.Array.from_pandas(pd.Series(['a']), type=pa.string()).type
> Out[26]: DataType(binary)
> In [27]: pa.Array.from_pandas(pd.Series([u'a'])).type
> Out[27]: DataType(string)
> {code}





[jira] [Commented] (ARROW-2222) [C++] Add option to validate Flatbuffers messages

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438790#comment-16438790
 ] 

ASF GitHub Bot commented on ARROW-2222:
---

xhochy commented on issue #1763: ARROW-2222: handle untrusted inputs (POC)
URL: https://github.com/apache/arrow/pull/1763#issuecomment-381425486
 
 
   The failure is due to all IPC r/w tests failing on `MakeDeeplyNestedList`. 
As this only happens in the builds where we have flatbuffers 1.7.1 installed, 
flatbuffers itself might be the cause. To rule that out, I'm going to update 
the manylinux1 flatbuffers dependency to 1.9.




> [C++] Add option to validate Flatbuffers messages
> -
>
> Key: ARROW-2222
> URL: https://issues.apache.org/jira/browse/ARROW-2222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Marco Neumann
>Priority: Major
>  Labels: pull-request-available
>
> This is follow up work to ARROW-1589, ARROW-2023, and can be validated by the 
> {{ipc-fuzzer-test}}. Users receiving untrusted input streams can prevent 
> segfaults this way
> As part of this, we should quantify the overhead associated with message 
> validation in regular use





[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438721#comment-16438721
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381410530
 
 
   @pitrou, on second look it won't be more efficient to move the check to 
outside of AppendObjectStrings. When passing check_valid to 
AppendObjectStrings, the UTF-8 decoding/check only happens if the data is 
Python 3 bytes or Python 2 strings. However, if the user passes Python 3 
strings or Python 2 unicode and wants a string type, no extra checks are done. 
In the case where the user wants the output type to be an arrow string, then we 
need to do the check on each bytes object. Otherwise, we will return a 
StringArray that has data that's not actually UTF-8.
   
   Please let me know if that makes sense, and if not, let me know how you 
would make it faster. 
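   
   The rule described above can be modelled in a few lines of Python 3. This 
is only a sketch of the decision logic, not the C++ in AppendObjectStrings; the 
function name `to_utf8_bytes` is hypothetical, while `check_valid` mirrors the 
flag mentioned above:

```python
def to_utf8_bytes(value, check_valid=True):
    """Return the UTF-8 bytes to store for one element of a string array.

    Text objects are encoded and need no extra check. Bytes objects are
    validated by decoding them when `check_valid` is set, so invalid data
    cannot end up in an array that claims to hold UTF-8 strings.
    """
    if isinstance(value, str):
        return value.encode("utf-8")
    if isinstance(value, bytes):
        if check_valid:
            value.decode("utf-8")  # raises UnicodeDecodeError if invalid
        return value
    raise TypeError("expected str or bytes, got {}".format(type(value).__name__))
```

   The per-object cost only appears on the bytes branch, which matches the 
point that text input incurs no additional check.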
   




[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438715#comment-16438715
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

joshuastorck commented on issue #1886: ARROW-2101: [Python/C++] Correctly 
convert numpy arrays of bytes to arrow arrays of strings when user specifies 
arrow type of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381263228
 
 
   I built for Python 2 and confirmed the behavior is the same. 
   
   @pitrou, with regard to the inefficiency of the UTF-8 encoding, it could be 
moved below, into the check of global_have_bytes. Would you prefer this?
   
   ```cpp
   if (global_have_bytes) {
     if (force_string) {
       PyObject* obj;

       Ndarray1DIndexer<PyObject*> objects(arr_);
       Ndarray1DIndexer<uint8_t> mask_values;

       bool have_mask = false;
       if (mask_ != nullptr) {
         mask_values.Init(mask_);
         have_mask = true;
       }

       PyBytesView view;
       for (int64_t offset = 0; offset < objects.size(); ++offset) {
         OwnedRef tmp_obj;
         obj = objects[offset];
         // Skip masked and null entries.
         if ((have_mask && mask_values[offset]) ||
             internal::PandasObjectIsNull(obj)) {
           continue;
         }
         // Fail if the object does not hold valid UTF-8 data.
         RETURN_NOT_OK(view.FromString(obj, true));
       }
     } else {
       // No string type was forced: relabel the output arrays as binary.
       for (size_t i = 0; i < out_arrays_.size(); ++i) {
         auto binary_data = out_arrays_[i]->data()->Copy();
         binary_data->type = ::arrow::binary();
         out_arrays_[i] = std::make_shared<BinaryArray>(binary_data);
       }
     }
   }
   ```
   
   I'm not fond of how much code I had to copy from AppendObjectStrings to 
write that loop. I think it would be helpful to have iterators that look like 
this:
   
   ```cpp
   NdArray1DIndexer array(array_);
   auto mask = NdArray1DIndexer::from_mask(mask_);
   NdArray1DMaskedIterator iterator(array.begin() + offset, array.end(), mask, 
true /* include masked value */);
   for (OwnedRef& obj: iterator)
   {
  // Maybe we use None to indicate masked values?
   }
   ```
   Or even better, we use pybind11 and these are light wrappers over them?




[jira] [Updated] (ARROW-2057) [Python] Configure size of data pages in pyarrow.parquet.write_table

2018-04-15 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-2057:
---
Labels: beginner  (was: )

> [Python] Configure size of data pages in pyarrow.parquet.write_table
> 
>
> Key: ARROW-2057
> URL: https://issues.apache.org/jira/browse/ARROW-2057
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: beginner
> Fix For: 0.10.0
>
>
> It would be useful to be able to set the size of data pages (within Parquet 
> column chunks) from Python





[jira] [Updated] (ARROW-2249) [Java/Python] in-process vector sharing from Java to Python

2018-04-15 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated ARROW-2249:
---
Description: 
Currently, all applications of Arrow seem to use the IPC capabilities to 
move data between a Java process and a Python process. While this is 
zero-serialization, it is not zero-copy. By taking the address and offset, we 
can already create Python buffers from Java buffers: 
https://github.com/apache/arrow/pull/1693. This is still a very low-level 
interface and we should provide the user with:

* A guide on how to load the Apache Arrow Java libraries in Python (either 
through a fat-jar that is shipped with Arrow or by integrating them into the 
user's own Java packaging)
* {{pyarrow.Array.from_jvm}}, {{pyarrow.RecordBatch.from_jvm}}, … functions 
that take the respective Java objects and emit Python objects. These Python 
objects should also ensure that the underlying memory regions are kept alive as 
long as the Python objects exist.

This issue can also be used as a tracker for the various sub-tasks that will 
need to be done to complete this rather large milestone.

  was:
Currently we seem to use in all applications of Arrow the IPC capabilities to 
move data between a Java process and a Python process. While this is 
0-serialization, it is not zero-copy. I'm going to have a first shot at 
exposing Java Vectors in Python as {{pyarrow.Array}}.

This issue can also be used as a tracker for the various sub-tasks that will 
need to be done to complete this rather large milestone.


> [Java/Python] in-process vector sharing from Java to Python
> ---
>
> Key: ARROW-2249
> URL: https://issues.apache.org/jira/browse/ARROW-2249
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java - Vectors, Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: beginner
> Fix For: 0.10.0
>
>
> Currently, all applications of Arrow seem to use the IPC capabilities to 
> move data between a Java process and a Python process. While this is 
> zero-serialization, it is not zero-copy. By taking the address and offset, we 
> can already create Python buffers from Java buffers: 
> https://github.com/apache/arrow/pull/1693. This is still a very low-level 
> interface and we should provide the user with:
> * A guide on how to load the Apache Arrow Java libraries in Python (either 
> through a fat-jar that is shipped with Arrow or by integrating them into the 
> user's own Java packaging)
> * {{pyarrow.Array.from_jvm}}, {{pyarrow.RecordBatch.from_jvm}}, … functions 
> that take the respective Java objects and emit Python objects. These Python 
> objects should also ensure that the underlying memory regions are kept alive 
> as long as the Python objects exist.
> This issue can also be used as a tracker for the various sub-tasks that will 
> need to be done to complete this rather large milestone.





[jira] [Created] (ARROW-2461) [Python] Build wheels for manylinux2010 tag

2018-04-15 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2461:
--

 Summary: [Python] Build wheels for manylinux2010 tag
 Key: ARROW-2461
 URL: https://issues.apache.org/jira/browse/ARROW-2461
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Uwe L. Korn
 Fix For: 0.11.0


There is now work in progress on an updated manylinux tag based on CentOS 6. We 
should provide wheels for both this tag and the old {{manylinux1}} tag for one 
release and then switch to the new tag in the release afterwards. This should 
also enable us to raise the minimum compiler requirement to gcc 4.9 (or higher 
once conda-forge has migrated to a newer compiler).

The relevant PEP is https://www.python.org/dev/peps/pep-0571/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2101) [Python] from_pandas reads 'str' type as binary Arrow data with Python 2

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438623#comment-16438623
 ] 

ASF GitHub Bot commented on ARROW-2101:
---

xhochy commented on issue #1886: ARROW-2101: [Python/C++] Correctly convert 
numpy arrays of bytes to arrow arrays of strings when user specifies arrow type 
of string
URL: https://github.com/apache/arrow/pull/1886#issuecomment-381388117
 
 
   Not sure if there were more comments on it, but I just want to iterate on 
   
   >> Also, this doesn't change anything for Python 2 if using 'str' objects 
and the type is not specified, it will still create a BinaryArray, is this what 
we want?
   
   > Probably. Python 2 str objects are bytestrings just like Python 3 bytes 
objects.
   
   Yes, this is definitely the intended behaviour. We had some discussion in 
the past about it and stuck to the following when no type is specified:
   
   ```
   str(PY2) / bytes(PY3) -> pa.binary
   unicode(PY2) / str(PY3) -> pa.string
   ```




[jira] [Resolved] (ARROW-2435) [Rust] Add memory pool abstraction.

2018-04-15 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-2435.

   Resolution: Fixed
Fix Version/s: 0.10.0

Issue resolved by pull request 1875
[https://github.com/apache/arrow/pull/1875]

> [Rust] Add memory pool abstraction.
> ---
>
> Key: ARROW-2435
> URL: https://issues.apache.org/jira/browse/ARROW-2435
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Affects Versions: 0.9.0
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Add a memory pool abstraction similar to the C++ API.





[jira] [Commented] (ARROW-2435) [Rust] Add memory pool abstraction.

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438618#comment-16438618
 ] 

ASF GitHub Bot commented on ARROW-2435:
---

xhochy closed pull request #1875: ARROW-2435: [Rust] Add memory pool 
abstraction.
URL: https://github.com/apache/arrow/pull/1875
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/.gitignore b/.gitignore
index c902ba39c..d9c69e954 100644
--- a/.gitignore
+++ b/.gitignore
@@ -30,6 +30,7 @@ MANIFEST
 *.vcxproj
 *.vcxproj.*
 *.sln
+*.iml
 
 cpp/.idea/
 python/.eggs/
diff --git a/rust/src/error.rs b/rust/src/error.rs
index 6a342e063..d82ee1190 100644
--- a/rust/src/error.rs
+++ b/rust/src/error.rs
@@ -20,3 +20,5 @@ pub enum ArrowError {
 MemoryError(String),
 ParseError(String),
 }
+
+pub type Result<T> = ::std::result::Result<T, ArrowError>;
diff --git a/rust/src/lib.rs b/rust/src/lib.rs
index 6ab3daabb..2d2274029 100644
--- a/rust/src/lib.rs
+++ b/rust/src/lib.rs
@@ -29,3 +29,4 @@ pub mod datatypes;
 pub mod error;
 pub mod list;
 pub mod memory;
+pub mod memory_pool;
diff --git a/rust/src/memory_pool.rs b/rust/src/memory_pool.rs
new file mode 100644
index 0..acfcc3071
--- /dev/null
+++ b/rust/src/memory_pool.rs
@@ -0,0 +1,105 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use libc;
+use std::mem;
+use std::cmp;
+
+use super::error::ArrowError;
+use super::error::Result;
+
+const ALIGNMENT: usize = 64;
+
+/// Memory pool for allocating memory. It's also responsible for tracking memory usage.
+pub trait MemoryPool {
+    /// Allocate memory.
+    /// The implementation should ensure that the allocated memory is aligned.
+    fn allocate(&self, size: usize) -> Result<*mut u8>;
+
+    /// Reallocate memory.
+    /// If the implementation doesn't support reallocating aligned memory, it allocates new memory
+    /// and copies the old memory to it.
+    fn reallocate(&self, old_size: usize, new_size: usize, pointer: *mut u8) -> Result<*mut u8>;
+
+    /// Free memory.
+    fn free(&self, ptr: *mut u8);
+}
+
+/// Implementation of memory pool using libc api.
+#[allow(dead_code)]
+struct LibcMemoryPool;
+
+impl MemoryPool for LibcMemoryPool {
+    fn allocate(&self, size: usize) -> Result<*mut u8> {
+        unsafe {
+            let mut page: *mut libc::c_void = mem::uninitialized();
+            let result = libc::posix_memalign(&mut page, ALIGNMENT, size);
+            match result {
+                0 => Ok(mem::transmute::<*mut libc::c_void, *mut u8>(page)),
+                _ => Err(ArrowError::MemoryError(
+                    "Failed to allocate memory".to_string(),
+                )),
+            }
+        }
+    }
+
+    fn reallocate(&self, old_size: usize, new_size: usize, pointer: *mut u8) -> Result<*mut u8> {
+        unsafe {
+            let old_src = mem::transmute::<*mut u8, *mut libc::c_void>(pointer);
+            let result = self.allocate(new_size)?;
+            let dst = mem::transmute::<*mut u8, *mut libc::c_void>(result);
+            libc::memcpy(dst, old_src, cmp::min(old_size, new_size));
+            libc::free(old_src);
+            Ok(result)
+        }
+    }
+
+    fn free(&self, ptr: *mut u8) {
+        unsafe { libc::free(mem::transmute::<*mut u8, *mut libc::c_void>(ptr)) }
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_allocate() {
+        let memory_pool = LibcMemoryPool {};
+
+        for _ in 0..10 {
+            let p = memory_pool.allocate(1024).unwrap();
+            // make sure this is 64-byte aligned
+            assert_eq!(0, (p as usize) % ALIGNMENT);
+            memory_pool.free(p);
+        }
+    }
+
+    #[test]
+    fn test_reallocate() {
+        let memory_pool = LibcMemoryPool {};
+
+        for _ in 0..10 {
+            let p1 = memory_pool.allocate(1024).unwrap();
+            let p2 = memory_pool.reallocate(1024, 2048, p1).unwrap();
+            // make sure this is 64-
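
For readers who want to experiment with the pool interface without the `libc` crate, here is a minimal sketch built on `std::alloc`. It is not the code from this pull request: `StdAllocPool`, `PoolError`, and `layout_for` are names assumed for illustration, and `free` takes the allocation size here because `std::alloc::dealloc` needs the original layout. It also uses `realloc` directly rather than the allocate, copy, free sequence in the patch.

```rust
use std::alloc::{alloc, dealloc, realloc, Layout};

/// Same 64-byte alignment as the patch.
pub const ALIGNMENT: usize = 64;

/// Hypothetical stand-in for ArrowError::MemoryError.
#[derive(Debug)]
pub enum PoolError {
    Memory(String),
}

pub type Result<T> = std::result::Result<T, PoolError>;

/// Trait shaped like the one in the patch, except that `free` also takes
/// the size (an assumption needed by std::alloc, not part of the patch).
/// Callers must pass the size the block was last allocated with.
pub trait MemoryPool {
    fn allocate(&self, size: usize) -> Result<*mut u8>;
    fn reallocate(&self, old_size: usize, new_size: usize, pointer: *mut u8) -> Result<*mut u8>;
    fn free(&self, ptr: *mut u8, size: usize);
}

/// Builds a 64-byte-aligned layout. std::alloc forbids zero-sized
/// allocations, so a request for 0 bytes is rounded up to 1.
fn layout_for(size: usize) -> Result<Layout> {
    Layout::from_size_align(size.max(1), ALIGNMENT)
        .map_err(|e| PoolError::Memory(e.to_string()))
}

pub struct StdAllocPool;

impl MemoryPool for StdAllocPool {
    fn allocate(&self, size: usize) -> Result<*mut u8> {
        let layout = layout_for(size)?;
        // alloc returns a null pointer on failure.
        let ptr = unsafe { alloc(layout) };
        if ptr.is_null() {
            Err(PoolError::Memory("Failed to allocate memory".to_string()))
        } else {
            Ok(ptr)
        }
    }

    fn reallocate(&self, old_size: usize, new_size: usize, pointer: *mut u8) -> Result<*mut u8> {
        let old_layout = layout_for(old_size)?;
        // Validate the new size through the same helper before resizing.
        let new_size = layout_for(new_size)?.size();
        // realloc keeps the alignment of old_layout and preserves the
        // first min(old_size, new_size) bytes.
        let ptr = unsafe { realloc(pointer, old_layout, new_size) };
        if ptr.is_null() {
            Err(PoolError::Memory("Failed to reallocate memory".to_string()))
        } else {
            Ok(ptr)
        }
    }

    fn free(&self, ptr: *mut u8, size: usize) {
        if let Ok(layout) = layout_for(size) {
            unsafe { dealloc(ptr, layout) }
        }
    }
}
```

A caller would allocate with `StdAllocPool.allocate(1024)`, grow the block with `reallocate(1024, 2048, p)`, and release it with `free(p, 2048)`, always passing the size the block currently has.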

[jira] [Commented] (ARROW-2458) [Plasma] PlasmaClient uses global variable

2018-04-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438616#comment-16438616
 ] 

ASF GitHub Bot commented on ARROW-2458:
---

xhochy commented on issue #1893: ARROW-2458: [Plasma] Use one thread pool per 
PlasmaClient
URL: https://github.com/apache/arrow/pull/1893#issuecomment-381387911
 
 
   > Note we could also take an existing header-only C++11 thread pool 
implementation (example: https://github.com/inkooboo/thread-pool-cpp), though 
I'm not sure what our policy is for vendoring code.
   
   @pitrou Vendoring code is ok. We need to clearly mark that the code is 
vendored, e.g. by placing it in a `thirdparty` directory. Additionally, this 
should be spelled out in the `NOTICE` and `LICENSE` files. I think an important 
requirement of the ASF policies is that it must be visible which code is merely 
vendored and not submitted as a code donation.




> [Plasma] PlasmaClient uses global variable
> --
>
> Key: ARROW-2458
> URL: https://issues.apache.org/jira/browse/ARROW-2458
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Affects Versions: 0.9.0
>Reporter: Philipp Moritz
>Assignee: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>
> The threadpool threadpool_ that PlasmaClient is using is global at the 
> moment. This prevents us from using multiple PlasmaClients in the same 
> process (one per thread).


