[jira] [Commented] (ARROW-1163) [Plasma] Java client for Plasma

2017-11-07 Thread Lu Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243388#comment-16243388
 ] 

Lu Qi  commented on ARROW-1163:
---

Hi Philipp,
Thanks for providing this material. I see that numpy uses 
"PyArray_NewFromDescr" to wrap a block of memory without copying data. So, on 
the Java side, we will mimic this method and provide a wrapper class for 
viewing or modifying the underlying "mmap" shared memory. But for now, in my 
case, I have an already-defined Tensor backed by a float array, so I have to 
copy data into it, which is pretty sad.
Maybe one day we can drop our Tensor.

> [Plasma] Java client for Plasma
> ---
>
> Key: ARROW-1163
> URL: https://issues.apache.org/jira/browse/ARROW-1163
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Philipp Moritz
>
> We should start thinking about what a Java client for Plasma would look like. 
> Given the focus of arrow to support Python, C++ and Java really well, it is 
> the next important target after Python and C++.
> My preliminary thoughts on it are the following ones: We can either go with 
> JNI and wrap the C++ client or (in my opinion preferable) write a pure Java 
> client. It would communicate with the Plasma store via Java flatbuffers over 
> sockets.
> It seems that the only thing blocking a pure Java client at the moment is the 
> way we ship file descriptors for the memory mapped files between store and 
> client (see the file fling.cc in the Plasma repo). We would need to get rid 
> of that because there is no pure Java API that allows transferring file 
> descriptors over a process boundary. So the way to transfer memory mapped 
> files over process boundaries then is probably to use the file system and 
> keep the memory mapped files in the file system instead of unlinking them 
> immediately (as we do at the moment), so they can be opened by the client 
> process via their path.
> The challenge in this case is how to clean the files up and make sure they 
> are not lying around if the plasma store crashes. One option is to store the 
> plasma store PID with the file (i.e. as part of the file name) and let the 
> plasma store clean them up the next time it is started; maybe there is 
> OS-level support for temporary files we can reuse.
> I probably won't get to this for a while, so if anybody needs this or has 
> free cycles, they should feel free to chime in. Also opinions on the design 
> are appreciated!
> -- Philipp.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (ARROW-1163) [Plasma] Java client for Plasma

2017-11-07 Thread Philipp Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243420#comment-16243420
 ] 

Philipp Moritz edited comment on ARROW-1163 at 11/8/17 5:59 AM:


That makes sense for now, and I agree it's a little sad. For the future, maybe 
you can get some insights from https://github.com/deeplearning4j/deeplearning4j 
on how to write the Tensor class in the "right" way. Unfortunately, Java doesn't 
really have a long tradition of scientific computing like Python has, so there 
is no good standard Tensor class like numpy's.

Edit: This is also an opportunity for Arrow: if we had a good Java tensor class, 
it could be widely used because of the increasing importance of deep learning. 
Another project to look at is https://github.com/intel-analytics/BigDL. We also 
wrote our own in the past: 
https://github.com/amplab/SparkNet/blob/master/src/main/scala/libs/NDArray.scala
 and 
https://github.com/amplab/SparkNet/blob/master/src/main/java/libs/JavaNDArray.java
 to interop with Caffe and TensorFlow, but it might not be too useful for 
shared memory.





[jira] [Commented] (ARROW-1163) [Plasma] Java client for Plasma

2017-11-07 Thread Philipp Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243420#comment-16243420
 ] 

Philipp Moritz commented on ARROW-1163:
---

That makes sense for now, and I agree it's a little sad. For the future, maybe 
you can get some insights from https://github.com/deeplearning4j/deeplearning4j 
on how to write the Tensor class in the "right" way. Unfortunately, Java doesn't 
really have a long tradition of scientific computing like Python has, so there 
is no good standard Tensor class like numpy's.



[jira] [Comment Edited] (ARROW-1163) [Plasma] Java client for Plasma

2017-11-07 Thread Lu Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243388#comment-16243388
 ] 

Lu Qi  edited comment on ARROW-1163 at 11/8/17 5:10 AM:


Hi Philipp,
Thanks for providing this material. I see that numpy uses 
"PyArray_NewFromDescr" to wrap a block of memory without copying data. So, on 
the Java side, we will mimic this method and provide a wrapper class for 
viewing or modifying the underlying "mmap" shared memory. But for now, in my 
case, I have an already-defined Tensor backed by a float array, so I have to 
copy data into it, which is pretty sad.
Maybe one day I can drop my Tensor.





[jira] [Commented] (ARROW-1717) [Java] Remove public static helper method in vector classes for JSONReader/Writer

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243127#comment-16243127
 ] 

ASF GitHub Bot commented on ARROW-1717:
---

icexelloss commented on a change in pull request #1290: ARROW-1717: Refactor 
JsonReader
URL: https://github.com/apache/arrow/pull/1290#discussion_r149543200
 
 

 ##
 File path: 
java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java
 ##
 @@ -183,6 +185,282 @@ public VectorSchemaRoot read() throws IOException {
 }
   }
 
+  private abstract class BufferReader {
+    abstract protected ArrowBuf read(BufferAllocator allocator, int count) throws IOException;
+
+    final ArrowBuf readBuffer(BufferAllocator allocator, int count) throws IOException {
+      readToken(START_ARRAY);
+      ArrowBuf buf = read(allocator, count);
+      readToken(END_ARRAY);
+      return buf;
+    }
+  }
+
+  private class BufferHelper {
+    BufferReader BIT = new BufferReader() {
+
+      @Override
+      protected ArrowBuf read(BufferAllocator allocator, int count) throws IOException {
+        final int bufferSize = BitVectorHelper.getValidityBufferSize(count);
+        ArrowBuf buf = allocator.buffer(bufferSize);
+
+        // C++ integration test fails without this.
+        buf.setZero(0, bufferSize);
 
 Review comment:
   See https://issues.apache.org/jira/browse/ARROW-1779


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [Java] Remove public static helper method in vector classes for 
> JSONReader/Writer
> -
>
> Key: ARROW-1717
> URL: https://issues.apache.org/jira/browse/ARROW-1717
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Li Jin
>Assignee: Li Jin
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Commented] (ARROW-1779) [Java] Integration test breaks without zeroing out validity vectors

2017-11-07 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243121#comment-16243121
 ] 

Li Jin commented on ARROW-1779:
---

cc [~cpcloud] [~wesmckinn]

This is probably a Java issue, but I am stuck figuring out what's wrong 
because the error happens in C++ integration validation (Java producing, C++ 
consuming). I have files that reproduce this:

{code:python}
>>> good = pyarrow.RecordBatchFileReader("/Users/ljin/workspace/arrow/nested.good")
>>> good_batch = good.get_record_batch(1)
>>> good_batch.column(1)

[
  NA,
  {'f1': None, 'f2': 'BSZRpGI'},
  {'f1': None, 'f2': None},
  {'f1': None, 'f2': None},
  NA,
  NA,
  {'f1': None, 'f2': None},
  {'f1': None, 'f2': None},
  {'f1': 416507125, 'f2': None},
  NA
]
{code}

{code:python}
>>> bad = pyarrow.RecordBatchFileReader("/Users/ljin/workspace/arrow/nested.bad")
>>> bad_batch = bad.get_record_batch(1)
>>> bad_batch.column(1)

[
  {'f1': -1345581951, 'f2': None},
  {'f1': None, 'f2': 'BSZRpGI'},
  {'f1': None, 'f2': None},
  {'f1': None, 'f2': None},
  {'f1': -497925054, 'f2': 'E34Dqdr'},
  {'f1': 94270936, 'f2': '5aksGEG'},
  {'f1': None, 'f2': None},
  {'f1': None, 'f2': None},
  {'f1': 416507125, 'f2': None},
  {'f1': None, 'f2': None}
]
{code}

They are supposed to contain the same data, but the validity vector in the bad 
one isn't read correctly. Could you help shed some light?
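
For readers less familiar with validity bitmaps, here is one plausible way the symptom above could arise. This is an assumption for illustration, not a diagnosis confirmed in this thread: if a reader only sets the bits of valid slots and never clears the others, any bits left over in a non-zeroed buffer make null slots look valid. The sketch is in Python and uses Arrow's least-significant-bit ordering; the helper names are made up.

```python
def set_valid_bits(buf, validity):
    """Set (never clear) the bit of each valid slot, LSB-first in each byte."""
    for i, valid in enumerate(validity):
        if valid:
            buf[i // 8] |= 1 << (i % 8)


def read_validity(buf, count):
    """Decode the first `count` bits of a validity bitmap."""
    return [bool(buf[i // 8] & (1 << (i % 8))) for i in range(count)]


validity = [False, True, True, False]

zeroed = bytearray(1)        # buffer explicitly zeroed first
set_valid_bits(zeroed, validity)

dirty = bytearray(b"\xff")   # stands in for uninitialized memory
set_valid_bits(dirty, validity)

print(read_validity(zeroed, 4))  # matches `validity`
print(read_validity(dirty, 4))   # every slot appears valid
```

With the zeroed buffer the decoded bits match the input; with the dirty buffer the two null slots come back as valid, which resembles the NA entries in nested.good turning into struct values in nested.bad.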


> [Java] Integration test breaks without zeroing out validity vectors
> ---
>
> Key: ARROW-1779
> URL: https://issues.apache.org/jira/browse/ARROW-1779
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Li Jin
> Fix For: 0.8.0
>
> Attachments: nested.bad, nested.good, nested.json
>
>
> This was discovered in https://github.com/apache/arrow/pull/1290
> I found that one of the integration tests (nested) fails unless the validity 
> vectors are zeroed out before loading the array from JSON.
> I have created three files to reproduce this:
> (1) nested.json 
> (2) nested.good (zeroing out validity vector before reading)
> (3) nested.bad (not zeroing out validity vector before reading)
> (1) / (2) and (1) / (3) both pass the Java integration test; however, (1) / (3) 
> fails the C++ test: one of the validity vectors in (3) doesn't seem to be read 
> correctly.
> I am not sure what the issue is because I cannot reproduce an error in Java. 
> I am hoping someone more familiar with C++ could take a look and give some 
> insight into what's wrong with (3). 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1776) [C++] arrow::gpu::CudaContext::bytes_allocated() isn't defined

2017-11-07 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1776:
--
Labels: pull-request-available  (was: )

> [C++] arrow::gpu::CudaContext::bytes_allocated() isn't defined
> --
>
> Key: ARROW-1776
> URL: https://issues.apache.org/jira/browse/ARROW-1776
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.1
>Reporter: Kouhei Sutou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> arrow/gpu/cuda_context.h declares arrow::gpu::CudaContext::bytes_allocated() 
> but does not define it.
> Should it be removed or defined?
> CudaContext::CudaContextImpl::bytes_allocated() exists, so it would be easy to 
> define it.





[jira] [Commented] (ARROW-1776) [C++] arrow::gpu::CudaContext::bytes_allocated() isn't defined

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243273#comment-16243273
 ] 

ASF GitHub Bot commented on ARROW-1776:
---

kou opened a new pull request #1293: ARROW-1776: [C++] Define 
arrow::gpu::CudaContext::bytes_allocated()
URL: https://github.com/apache/arrow/pull/1293
 
 
   






[jira] [Commented] (ARROW-1709) [C++] Decimal.ToString is incorrect for negative scale

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243221#comment-16243221
 ] 

ASF GitHub Bot commented on ARROW-1709:
---

cpcloud commented on issue #1292: ARROW-1709: [C++] Decimal.ToString is 
incorrect for negative scale
URL: https://github.com/apache/arrow/pull/1292#issuecomment-342678512
 
 
   Hm looks like another osx build timeout.




> [C++] Decimal.ToString is incorrect for negative scale
> --
>
> Key: ARROW-1709
> URL: https://issues.apache.org/jira/browse/ARROW-1709
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> {{Decimal128::ToString(int precision, int scale)}} doesn't handle {{scale < 
> 0}} correctly. It should tack on an extra {{e}} to the end.





[jira] [Assigned] (ARROW-1717) [Java] Remove public static helper method in vector classes for JSONReader/Writer

2017-11-07 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin reassigned ARROW-1717:
-

Assignee: Li Jin



[jira] [Created] (ARROW-1779) [Java] Integration test breaks without zeroing out validity vectors

2017-11-07 Thread Li Jin (JIRA)
Li Jin created ARROW-1779:
-

 Summary: [Java] Integration test breaks without zeroing out 
validity vectors
 Key: ARROW-1779
 URL: https://issues.apache.org/jira/browse/ARROW-1779
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Li Jin
 Attachments: nested.bad, nested.good, nested.json

This was discovered in https://github.com/apache/arrow/pull/1290

I found that one of the integration tests (nested) fails unless the validity 
vectors are zeroed out before loading the array from JSON.

I have created three files to reproduce this:
(1) nested.json 
(2) nested.good (zeroing out validity vector before reading)
(3) nested.bad (not zeroing out validity vector before reading)

(1) / (2) and (1) / (3) both pass the Java integration test; however, (1) / (3) 
fails the C++ test: one of the validity vectors in (3) doesn't seem to be read 
correctly.

I am not sure what the issue is because I cannot reproduce an error in Java. I 
am hoping someone more familiar with C++ could take a look and give some 
insight into what's wrong with (3). 





[jira] [Resolved] (ARROW-1282) Large memory reallocation by Arrow causes hang in jemalloc

2017-11-07 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-1282.

Resolution: Fixed

Merged upstream and the thirdparty builds all pull the new revision. Will deal 
with the vendoring in a separate story.

> Large memory reallocation by Arrow causes hang in jemalloc
> --
>
> Key: ARROW-1282
> URL: https://issues.apache.org/jira/browse/ARROW-1282
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Jeff Knupp
>Assignee: Uwe L. Korn
> Fix For: 0.8.0
>
>
> When reallocating a large amount of memory, Arrow is either triggering a bug 
> in jemalloc or has a bug itself in the memory manager (many different 
> applications reporting same issue but not clear from jemalloc issue 
> description if they're sure it's in jemalloc or caused by other issues like 
> using multiple memory allocation libraries in the same process, multithreaded 
> access, etc).
> Link to stack trace is here: 
> https://gist.github.com/jeffknupp/73879feacf9c560afd4f1a20213dc6ef
> Link to issue in jemalloc GitHub is here: 
> https://github.com/jemalloc/jemalloc/issues/802
> Originally observed in redis, discussed with jemalloc maintainer here: 
> https://github.com/antirez/redis/issues/3799
> *This is entirely reproducible on Ubuntu 16.04 xenial, which uses version 
> 3.6.0 according to `apt` metadata.*





[jira] [Commented] (ARROW-972) [Python] Add test cases and basic APIs for UnionArray

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242940#comment-16242940
 ] 

ASF GitHub Bot commented on ARROW-972:
--

wesm commented on a change in pull request #1216: ARROW-972: UnionArray in 
pyarrow
URL: https://github.com/apache/arrow/pull/1216#discussion_r149506945
 
 

 ##
 File path: cpp/src/arrow/array.cc
 ##
 @@ -393,6 +393,71 @@ UnionArray::UnionArray(const std::shared_ptr<DataType>& type, int64_t length,
   SetData(internal_data);
 }
 
+std::shared_ptr<DataType> MakeUnionArrayType(
+    UnionMode mode, const std::vector<std::shared_ptr<Array>>& children) {
+  auto types = std::vector<std::shared_ptr<Field>>();
+  std::vector<uint8_t> type_codes;
+  uint8_t counter = 0;
+  for (const auto& child : children) {
+    types.push_back(field(std::to_string(counter), child->type()));
+    type_codes.push_back(counter);
+    counter++;
+  }
+  return union_(types, type_codes, mode);
+}
+
+Status UnionArray::FromDense(const Array& type_ids, const Array& value_offsets,
+                             const std::vector<std::shared_ptr<Array>>& children,
+                             std::shared_ptr<Array>* out) {
+  if (value_offsets.length() == 0) {
+    return Status::Invalid("UnionArray offsets must have non-zero length");
+  }
+
+  if (value_offsets.type_id() != Type::INT32) {
+    return Status::Invalid("UnionArray offsets must be signed int32");
+  }
+
+  if (type_ids.type_id() != Type::INT8) {
+    return Status::Invalid("UnionArray type_ids must be signed int8");
+  }
+
+  BufferVector buffers = {type_ids.null_bitmap(),
+                          static_cast<const Int8Array&>(type_ids).values(),
+                          static_cast<const Int32Array&>(value_offsets).values()};
 
 Review comment:
   May also want to assert that `value_offsets` has 0 null count




> [Python] Add test cases and basic APIs for UnionArray
> -
>
> Key: ARROW-972
> URL: https://issues.apache.org/jira/browse/ARROW-972
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.3.0
>Reporter: Wes McKinney
>Assignee: Philipp Moritz
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> While this is implemented in C++, there isn't any API exposure yet in Python





[jira] [Commented] (ARROW-972) [Python] Add test cases and basic APIs for UnionArray

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242942#comment-16242942
 ] 

ASF GitHub Bot commented on ARROW-972:
--

wesm commented on a change in pull request #1216: ARROW-972: UnionArray in 
pyarrow
URL: https://github.com/apache/arrow/pull/1216#discussion_r149511086
 
 

 ##
 File path: python/pyarrow/scalar.pxi
 ##
 @@ -315,6 +315,26 @@ cdef class ListValue(ArrayValue):
 return result
 
 
+cdef class UnionValue(ArrayValue):
+
+cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
 
 Review comment:
   Is it true that using an underscore saves you from declaring this method in 
the pxd file? If so I've been wasting my time a bunch in the past :)






[jira] [Created] (ARROW-1777) [C++] Add static ctor ArrayData::Make for nicer syntax in places

2017-11-07 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1777:
---

 Summary: [C++] Add static ctor ArrayData::Make for nicer syntax in 
places
 Key: ARROW-1777
 URL: https://issues.apache.org/jira/browse/ARROW-1777
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.8.0


Because of the fickleness of {{std::make_shared}}, we have to store a 
vector of buffers in an lvalue rather than passing the buffers to the 
{{make_shared}} call as an initializer list. This makes things a bit 
awkward and could possibly be improved with a factory function.





[jira] [Commented] (ARROW-1778) Python: Link parquet-cpp statically, privately in manylinux1 wheels

2017-11-07 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16243029#comment-16243029
 ] 

Uwe L. Korn commented on ARROW-1778:


[~wesmckinn] any strong opinion on this? Otherwise I would go ahead and change 
the linking.

> Python: Link parquet-cpp statically, privately in manylinux1 wheels
> ---
>
> Key: ARROW-1778
> URL: https://issues.apache.org/jira/browse/ARROW-1778
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
> Fix For: 0.8.0
>
>
> We currently link parquet-cpp dynamically in the {{manylinux1}} wheels. This 
> also makes us the authority on the distribution of {{parquet-cpp}} inside of 
> the wheel-based ecosystem. Instead of doing this, we should statically, 
> privately link {{parquet-cpp}} inside of the wheels.





[jira] [Commented] (ARROW-1775) Ability to abort created but unsealed Plasma objects

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242967#comment-16242967
 ] 

ASF GitHub Bot commented on ARROW-1775:
---

wesm commented on a change in pull request #1289: ARROW-1775: Ability to abort 
created but unsealed Plasma objects
URL: https://github.com/apache/arrow/pull/1289#discussion_r149513238
 
 

 ##
 File path: cpp/src/plasma/store.h
 ##
 @@ -73,6 +73,8 @@ class PlasmaStore {
   int create_object(const ObjectID& object_id, int64_t data_size, int64_t metadata_size,
                     Client* client, PlasmaObject* result);
 
+  void abort_object(const ObjectID& object_id);
 
 Review comment:
   We should note to `PascalCase` these methods sometime soon




> Ability to abort created but unsealed Plasma objects
> 
>
> Key: ARROW-1775
> URL: https://issues.apache.org/jira/browse/ARROW-1775
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Plasma (C++)
>Reporter: Stephanie Wang
>  Labels: pull-request-available
>
> It would be useful to allow a Plasma client to abort an object that it 
> created but hasn't yet sealed. After the abort, it should appear as if the 
> object was never created at all. The logic is similar to the delete case, except 
> that the client must release the object atomically with the removal of the 
> object from the cache and store.
> In Ray, for example, we need this for the distributed version of the Plasma 
> store, where many Plasma clients transfer objects to each other. If a sending 
> Plasma client fails during a transfer, we want to make sure that the 
> receiving client can abort the transfer, so that we can later recreate the 
> object successfully. Otherwise, we will fail with an error that the object 
> already exists.
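
The create/seal/abort life cycle described above can be sketched with a toy in-memory store. This is a Python illustration only and does not reflect the Plasma API or its atomicity guarantees; the class and method names are hypothetical.

```python
class ToyStore:
    """In-memory stand-in for an object store (not the Plasma API)."""

    def __init__(self):
        # object_id -> {"data": bytearray, "sealed": bool}
        self._objects = {}

    def create(self, object_id, size):
        """Reserve a writable buffer; fails if the ID is already taken."""
        if object_id in self._objects:
            raise KeyError(f"object {object_id!r} already exists")
        buf = bytearray(size)
        self._objects[object_id] = {"data": buf, "sealed": False}
        return buf

    def seal(self, object_id):
        """Mark the object as complete and visible to readers."""
        self._objects[object_id]["sealed"] = True

    def abort(self, object_id):
        """Drop a created-but-unsealed object as if it never existed."""
        entry = self._objects.get(object_id)
        if entry is None or entry["sealed"]:
            raise ValueError("can only abort a created, unsealed object")
        del self._objects[object_id]

    def contains(self, object_id):
        """Only sealed objects count as present."""
        entry = self._objects.get(object_id)
        return entry is not None and entry["sealed"]
```

In the failed-transfer scenario, the receiver would call abort on the partially written object, after which a later create with the same ID succeeds instead of failing with an "already exists" error.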





[jira] [Commented] (ARROW-1775) Ability to abort created but unsealed Plasma objects

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242965#comment-16242965
 ] 

ASF GitHub Bot commented on ARROW-1775:
---

wesm commented on a change in pull request #1289: ARROW-1775: Ability to abort 
created but unsealed Plasma objects
URL: https://github.com/apache/arrow/pull/1289#discussion_r149512244
 
 

 ##
 File path: cpp/src/plasma/client.cc
 ##
 @@ -278,6 +278,43 @@ Status PlasmaClient::Get(const ObjectID* object_ids, 
int64_t num_objects,
   return Status::OK();
 }
 
+/// This is a helper method for unmapping objects for which all references have
+/// gone out of scope, either by calling Release or Abort.
+///
+/// @param object_id The object ID whose data we should unmap.
 
 Review comment:
   This comment may go better in the header file




[jira] [Commented] (ARROW-1775) Ability to abort created but unsealed Plasma objects

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242966#comment-16242966
 ] 

ASF GitHub Bot commented on ARROW-1775:
---

wesm commented on a change in pull request #1289: ARROW-1775: Ability to abort 
created but unsealed Plasma objects
URL: https://github.com/apache/arrow/pull/1289#discussion_r149512465
 
 

 ##
 File path: cpp/src/plasma/client.cc
 ##
 @@ -278,6 +278,43 @@ Status PlasmaClient::Get(const ObjectID* object_ids, 
int64_t num_objects,
   return Status::OK();
 }
 
+/// This is a helper method for unmapping objects for which all references have
+/// gone out of scope, either by calling Release or Abort.
+///
+/// @param object_id The object ID whose data we should unmap.
+Status PlasmaClient::UnmapObject(const ObjectID& object_id) {
+  auto object_entry = objects_in_use_.find(object_id);
+  ARROW_CHECK(object_entry != objects_in_use_.end());
+  ARROW_CHECK(object_entry->second->count == 0);
+
+  // Decrement the count of the number of objects in this memory-mapped file
+  // that the client is using. The corresponding increment should have
+  // happened in plasma_get.
+  int fd = object_entry->second->object.handle.store_fd;
+  auto entry = mmap_table_.find(fd);
+  ARROW_CHECK(entry != mmap_table_.end());
+  ARROW_CHECK(entry->second.count >= 1);
+  if (entry->second.count == 1) {
+// If no other objects are being used, then unmap the file.
+int err = munmap(entry->second.pointer, entry->second.length);
+if (err == -1) {
+  return Status::IOError("Error during munmap");
+}
+// Remove the corresponding entry from the hash table.
+mmap_table_.erase(fd);
 
 Review comment:
   If munmap fails, what should happen to this `fd`? I guess it would be an 
esoteric failure
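
One possible answer to the question above is to keep the table entry whenever the unmap call fails, so the descriptor is still tracked and can be retried or reported. A minimal Python sketch of that policy follows; `MmapTable` and the `unmap` callable are hypothetical stand-ins, not Plasma code, and the real client may choose a different policy.

```python
class MmapTable:
    """Reference-counted table of mapped files, keyed by file descriptor."""

    def __init__(self, unmap):
        self._unmap = unmap    # hypothetical callable: returns True on success
        self._entries = {}     # fd -> number of objects using the mapping

    def acquire(self, fd):
        self._entries[fd] = self._entries.get(fd, 0) + 1

    def release(self, fd):
        count = self._entries[fd]
        if count > 1:
            self._entries[fd] = count - 1
            return
        if not self._unmap(fd):
            # Keep the entry: the mapping may still exist, so dropping the
            # fd here would lose track of it.
            raise OSError(f"munmap failed for fd {fd}")
        del self._entries[fd]

    def count(self, fd):
        return self._entries.get(fd, 0)


calls = []

def flaky_unmap(fd):
    calls.append(fd)
    return len(calls) > 1      # fail the first attempt, succeed afterwards

table = MmapTable(flaky_unmap)
table.acquire(7)
try:
    table.release(7)
except OSError:
    pass
still_tracked = table.count(7)  # the entry survives the failed unmap
table.release(7)                # the retry succeeds and removes the entry
```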




[jira] [Commented] (ARROW-1716) [Format/JSON] Use string integer value for Decimals in JSON

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242907#comment-16242907
 ] 

ASF GitHub Bot commented on ARROW-1716:
---

wesm closed pull request #1267: ARROW-1716: [Format/JSON] Use string integer 
value for Decimals in JSON
URL: https://github.com/apache/arrow/pull/1267
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc 
b/cpp/src/arrow/ipc/ipc-read-write-test.cc
index 6f2f5cf85..40cd3f0ee 100644
--- a/cpp/src/arrow/ipc/ipc-read-write-test.cc
+++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc
@@ -727,7 +727,7 @@ TEST_F(TestTensorRoundTrip, BasicRoundtrip) {
   int64_t size = 24;
 
   std::vector values;
-  test::randint(size, 0, 100, );
+  test::randint(size, 0, 100, );
 
   auto data = test::GetBufferFromVector(values);
 
@@ -748,7 +748,7 @@ TEST_F(TestTensorRoundTrip, NonContiguous) {
   ASSERT_OK(io::MemoryMapFixture::InitMemoryMap(kBufferSize, path, _));
 
   std::vector values;
-  test::randint(24, 0, 100, );
+  test::randint(24, 0, 100, );
 
   auto data = test::GetBufferFromVector(values);
   Tensor tensor(int64(), data, {4, 3}, {48, 16});
diff --git a/cpp/src/arrow/ipc/json-internal.cc 
b/cpp/src/arrow/ipc/json-internal.cc
index 025f6c276..c1c0661d6 100644
--- a/cpp/src/arrow/ipc/json-internal.cc
+++ b/cpp/src/arrow/ipc/json-internal.cc
@@ -33,6 +33,7 @@
 #include "arrow/type.h"
 #include "arrow/type_traits.h"
 #include "arrow/util/bit-util.h"
+#include "arrow/util/decimal.h"
 #include "arrow/util/logging.h"
 #include "arrow/util/string.h"
 #include "arrow/visitor_inline.h"
@@ -448,7 +449,8 @@ class ArrayWriter {
   }
 
   void WriteDataValues(const FixedSizeBinaryArray& arr) {
-int32_t width = arr.byte_width();
+const int32_t width = arr.byte_width();
+
 for (int64_t i = 0; i < arr.length(); ++i) {
   const uint8_t* buf = arr.GetValue(i);
   std::string encoded = HexEncode(buf, width);
@@ -456,6 +458,13 @@ class ArrayWriter {
 }
   }
 
+  void WriteDataValues(const DecimalArray& arr) {
+for (int64_t i = 0; i < arr.length(); ++i) {
+  const Decimal128 value(arr.GetValue(i));
+  writer_->String(value.ToIntegerString());
+}
+  }
+
   void WriteDataValues(const BooleanArray& arr) {
 for (int i = 0; i < arr.length(); ++i) {
   writer_->Bool(arr.Value(i));
@@ -1053,7 +1062,9 @@ class ArrayReader {
   }
 
   template 
-  typename std::enable_if::value, 
Status>::type
+  typename std::enable_if::value &&
+  !std::is_base_of::value,
+  Status>::type
   Visit(const T& type) {
 typename TypeTraits::BuilderType builder(type_, pool_);
 
@@ -1073,22 +1084,52 @@ class ArrayReader {
 for (int i = 0; i < length_; ++i) {
   if (!is_valid_[i]) {
 RETURN_NOT_OK(builder.AppendNull());
-continue;
-  }
+  } else {
+const rj::Value& val = json_data_arr[i];
+DCHECK(val.IsString())
+<< "Found non-string JSON value when parsing FixedSizeBinary 
value";
+std::string hex_string = val.GetString();
+if (static_cast(hex_string.size()) != byte_width * 2) {
+  DCHECK(false) << "Expected size: " << byte_width * 2
+<< " got: " << hex_string.size();
+}
+const char* hex_data = hex_string.c_str();
 
-  const rj::Value& val = json_data_arr[i];
-  DCHECK(val.IsString());
-  std::string hex_string = val.GetString();
-  if (static_cast(hex_string.size()) != byte_width * 2) {
-DCHECK(false) << "Expected size: " << byte_width * 2
-  << " got: " << hex_string.size();
+for (int32_t j = 0; j < byte_width; ++j) {
+  RETURN_NOT_OK(ParseHexValue(hex_data + j * 2, &byte_buffer_data[j]));
+}
+RETURN_NOT_OK(builder.Append(byte_buffer_data));
   }
-  const char* hex_data = hex_string.c_str();
+}
+return builder.Finish(_);
+  }
+
+  template 
+  typename std::enable_if::value, 
Status>::type Visit(
+  const T& type) {
+typename TypeTraits::BuilderType builder(type_, pool_);
+
+const auto& json_data = obj_->FindMember("DATA");
+RETURN_NOT_ARRAY("DATA", json_data, *obj_);
 
-  for (int32_t j = 0; j < byte_width; ++j) {
-RETURN_NOT_OK(ParseHexValue(hex_data + j * 2, &byte_buffer_data[j]));
+const auto& json_data_arr = json_data->value.GetArray();
+
+DCHECK_EQ(static_cast(json_data_arr.Size()), length_);
+
+for (int i = 0; i < length_; ++i) {
+  if (!is_valid_[i]) {
+
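
The idea in the title of this pull request, writing each decimal to JSON as the string form of its unscaled integer, can be sketched in Python with the standard `decimal` and `json` modules. The helper names below are hypothetical and this is not the Arrow implementation. The `"DATA"` key is taken from the diff above; everything else is an assumption made for illustration.

```python
import json
from decimal import Decimal


def to_integer_string(value, scale):
    """Return the unscaled integer of `value` at `scale` as a string."""
    sign, digits, exponent = value.as_tuple()
    shift = exponent + scale
    if shift < 0:
        raise ValueError("value has more fractional digits than the scale")
    unscaled = int("".join(map(str, digits))) * 10 ** shift
    return str(-unscaled if sign else unscaled)


def from_integer_string(text, scale):
    """Rebuild the decimal from its unscaled integer string."""
    unscaled = int(text)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    return Decimal((sign, digits, -scale))


values = [Decimal("123.45"), Decimal("-1.50")]
payload = json.dumps({"DATA": [to_integer_string(v, 2) for v in values]})
decoded = [from_integer_string(s, 2) for s in json.loads(payload)["DATA"]]
```

Working on the digit tuple instead of multiplying by a power of ten avoids rounding under the default `decimal` context, which matters for values with up to 38 digits.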

[jira] [Commented] (ARROW-972) [Python] Add test cases and basic APIs for UnionArray

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242945#comment-16242945
 ] 

ASF GitHub Bot commented on ARROW-972:
--

wesm commented on a change in pull request #1216: ARROW-972: UnionArray in 
pyarrow
URL: https://github.com/apache/arrow/pull/1216#discussion_r149510499
 
 

 ##
 File path: cpp/src/arrow/array.h
 ##
 @@ -628,6 +659,8 @@ class ARROW_EXPORT UnionArray : public Array {
   /// Only use this while the UnionArray is in scope
   const Array* UnsafeChild(int pos) const;
 
+  std::shared_ptr value_type(int pos) const;
 
 Review comment:
   It wasn't initially clear that this is the child type, maybe call this 
`child_type` instead (and add doxygen comment)?




> [Python] Add test cases and basic APIs for UnionArray
> -
>
> Key: ARROW-972
> URL: https://issues.apache.org/jira/browse/ARROW-972
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.3.0
>Reporter: Wes McKinney
>Assignee: Philipp Moritz
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> While this is implemented in C++, there isn't any API exposure yet in Python





[jira] [Commented] (ARROW-972) [Python] Add test cases and basic APIs for UnionArray

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242943#comment-16242943
 ] 

ASF GitHub Bot commented on ARROW-972:
--

wesm commented on a change in pull request #1216: ARROW-972: UnionArray in 
pyarrow
URL: https://github.com/apache/arrow/pull/1216#discussion_r149506569
 
 

 ##
 File path: cpp/src/arrow/array.cc
 ##
 @@ -393,6 +393,71 @@ UnionArray::UnionArray(const std::shared_ptr& 
type, int64_t length,
   SetData(internal_data);
 }
 
+std::shared_ptr MakeUnionArrayType(
 
 Review comment:
   Would this be useful as an alternate version of `arrow::union_`? Maybe with 
the same API (the mode as the last argument) but omitting the type codes




[jira] [Commented] (ARROW-972) [Python] Add test cases and basic APIs for UnionArray

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242939#comment-16242939
 ] 

ASF GitHub Bot commented on ARROW-972:
--

wesm commented on a change in pull request #1216: ARROW-972: UnionArray in 
pyarrow
URL: https://github.com/apache/arrow/pull/1216#discussion_r149511274
 
 

 ##
 File path: python/pyarrow/scalar.pxi
 ##
 @@ -315,6 +315,26 @@ cdef class ListValue(ArrayValue):
 return result
 
 
+cdef class UnionValue(ArrayValue):
+
+cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
+self.sp_array = sp_array
+self.ap =  sp_array.get()
+self.value_types = [pyarrow_wrap_data_type(self.ap.value_type(i))
+for i in range(self.ap.num_fields())]
 
 Review comment:
   This is quite expensive on a per-value basis. Could these wrapped types be 
accessed from the parent pyarrow UnionArray?
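
A rough Python sketch of the suggestion above: wrap the child types once on the parent array and let each value look them up there. All class and function names are hypothetical stand-ins and this is not pyarrow code; `wrap_type` only imitates an expensive wrapper such as `pyarrow_wrap_data_type`.

```python
wrap_calls = []


def wrap_type(raw_type):
    """Stand-in for an expensive per-type wrapping call."""
    wrap_calls.append(raw_type)
    return f"wrapped<{raw_type}>"


class UnionArraySketch:
    def __init__(self, raw_child_types, type_ids):
        self._raw_child_types = raw_child_types
        self._type_ids = type_ids
        self._value_types = None   # built lazily, once per array

    @property
    def value_types(self):
        if self._value_types is None:
            self._value_types = [wrap_type(t) for t in self._raw_child_types]
        return self._value_types

    def __getitem__(self, index):
        return UnionValueSketch(self, index)


class UnionValueSketch:
    def __init__(self, parent, index):
        self._parent = parent
        self._index = index

    @property
    def type(self):
        # No wrapping here: reuse the list cached on the parent array.
        type_id = self._parent._type_ids[self._index]
        return self._parent.value_types[type_id]


arr = UnionArraySketch(["int64", "string"], [0, 1, 1, 0])
types = [arr[i].type for i in range(4)]
```

With this layout the wrapping cost is paid once per array rather than once per value.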




[jira] [Commented] (ARROW-1709) [C++] Decimal.ToString is incorrect for negative scale

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242828#comment-16242828
 ] 

ASF GitHub Bot commented on ARROW-1709:
---

cpcloud opened a new pull request #1292: ARROW-1709: [C++] Decimal.ToString is 
incorrect for negative scale
URL: https://github.com/apache/arrow/pull/1292
 
 
   This is on top of ARROW-1716. Will rebase when that's merged.




> [C++] Decimal.ToString is incorrect for negative scale
> --
>
> Key: ARROW-1709
> URL: https://issues.apache.org/jira/browse/ARROW-1709
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> {{Decimal128::ToString(int precision, int scale)}} doesn't handle {{scale < 
> 0}} correctly. It should tack on an extra {{e}} to the end.
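
To make the negative-scale case concrete, here is a small Python sketch that formats an unscaled integer at a given scale. The issue text does not say exactly what the exponent suffix should look like, so the `e<n>` form used for `scale < 0` is an assumption, and `decimal_to_string` is a hypothetical helper, not the Arrow function.

```python
def decimal_to_string(unscaled, scale):
    """Format `unscaled * 10**(-scale)` as text."""
    sign = "-" if unscaled < 0 else ""
    digits = str(abs(unscaled))
    if scale == 0:
        return sign + digits
    if scale < 0:
        # Assumed rendering: append an exponent instead of a decimal point.
        return f"{sign}{digits}e{-scale}"
    # Left-pad so there is at least one digit before the decimal point.
    digits = digits.rjust(scale + 1, "0")
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"


examples = [
    decimal_to_string(12345, 2),
    decimal_to_string(-5, 3),
    decimal_to_string(123, -2),
]
```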





[jira] [Commented] (ARROW-972) [Python] Add test cases and basic APIs for UnionArray

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242944#comment-16242944
 ] 

ASF GitHub Bot commented on ARROW-972:
--

wesm commented on a change in pull request #1216: ARROW-972: UnionArray in 
pyarrow
URL: https://github.com/apache/arrow/pull/1216#discussion_r149509888
 
 

 ##
 File path: cpp/src/arrow/array.h
 ##
 @@ -612,12 +612,43 @@ class ARROW_EXPORT UnionArray : public Array {
  const std::shared_ptr& null_bitmap = NULLPTR, int64_t 
null_count = 0,
  int64_t offset = 0);
 
+  /// \brief Construct Dense UnionArray from types_ids, value_offsets and 
children
+  ///
+  /// This function does the bare minimum of validation of the offsets and
+  /// input types. The value_offsets are assumed to be well-formed.
+  ///
+  /// \param[in] type_ids An array of 8-bit signed integers, enumerated from
+  /// 0 corresponding to each type.
+  /// \param[in] value_offsets An array of signed int32 values indicating the
+  /// relative offset into the respective child array for the type in a given 
slot.
+  /// The respective offsets for each child value array must be in order / 
increasing.
+  /// \param[in] children Vector of children Arrays containing the data for 
each type.
+  /// \param[out] out Will have length equal to value_offsets.length()
+  static Status FromDense(const Array& type_ids, const Array& value_offsets,
 
 Review comment:
   Hm, this is naming similar to `ListArray::FromArrays`, but because there are 
two versions, I wonder if calling this `UnionArray::MakeDense` (or 
`FromArraysDense`, more verbose) would be clearer
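
For readers unfamiliar with the dense layout described in the doc comment above, the lookup rule is: slot `i` takes its value from child `type_ids[i]` at position `value_offsets[i]`. A minimal pure-Python sketch with plain lists standing in for Arrow arrays (the function name is hypothetical):

```python
def dense_union_to_list(type_ids, value_offsets, children):
    """Resolve every slot of a dense union into a plain Python list."""
    if len(type_ids) != len(value_offsets):
        raise ValueError("type_ids and value_offsets must have equal length")
    return [children[t][o] for t, o in zip(type_ids, value_offsets)]


children = [[10, 20], ["a", "b", "c"]]   # child 0: ints, child 1: strings
type_ids = [0, 1, 1, 0, 1]
value_offsets = [0, 0, 1, 1, 2]          # increasing within each child
result = dense_union_to_list(type_ids, value_offsets, children)
```

The result has the same length as `value_offsets`, matching the `\param[out] out` note in the doc comment.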




[jira] [Commented] (ARROW-972) [Python] Add test cases and basic APIs for UnionArray

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242941#comment-16242941
 ] 

ASF GitHub Bot commented on ARROW-972:
--

wesm commented on a change in pull request #1216: ARROW-972: UnionArray in 
pyarrow
URL: https://github.com/apache/arrow/pull/1216#discussion_r149506402
 
 

 ##
 File path: cpp/src/arrow/array.cc
 ##
 @@ -393,6 +393,71 @@ UnionArray::UnionArray(const std::shared_ptr& 
type, int64_t length,
   SetData(internal_data);
 }
 
+std::shared_ptr MakeUnionArrayType(
+UnionMode mode, const std::vector& children) {
+  auto types = std::vector();
 
 Review comment:
   This might be better with the type on the LHS 
(`std::vector types;`)




[jira] [Updated] (ARROW-1709) [C++] Decimal.ToString is incorrect for negative scale

2017-11-07 Thread Phillip Cloud (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Phillip Cloud updated ARROW-1709:
-
Summary: [C++] Decimal.ToString is incorrect for negative scale  (was: 
Decimal.ToString is incorrect for negative scale)

> [C++] Decimal.ToString is incorrect for negative scale
> --
>
> Key: ARROW-1709
> URL: https://issues.apache.org/jira/browse/ARROW-1709
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.1
>Reporter: Phillip Cloud
>Assignee: Phillip Cloud
> Fix For: 0.8.0
>
>
> {{Decimal128::ToString(int precision, int scale)}} doesn't handle {{scale < 
> 0}} correctly. It should tack on an extra {{e}} to the end.





[jira] [Updated] (ARROW-1709) [C++] Decimal.ToString is incorrect for negative scale

2017-11-07 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1709:
--
Labels: pull-request-available  (was: )



[jira] [Commented] (ARROW-1425) [Python] Document semantic differences between Spark timestamps and Arrow timestamps

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242678#comment-16242678
 ] 

ASF GitHub Bot commented on ARROW-1425:
---

cloud-fan commented on a change in pull request #1095: ARROW-1425 [Python] 
Document semantic differences between Spark and Arrow timestamps
URL: https://github.com/apache/arrow/pull/1095#discussion_r149474373
 
 

 ##
 File path: python/doc/source/other_systems.rst
 ##
 @@ -0,0 +1,182 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+.. _other_systems:
+
+Using Arrow with other systems
+==
+
+Timestamps
+--
+
+Timestamps are data structures that mark a particular point in time
 
 Review comment:
   I don't think this is true. In the SQL standard, a timestamp (by default 
TIMESTAMP WITHOUT TIME ZONE) is a "floating" time, roughly the number of 
seconds since a local epoch: for example, 0 represents "1970-1-1 00:00:00" no 
matter which time zone you are in. In Spark SQL and Parquet, a timestamp is 
the number of seconds since the Unix epoch, which is a particular point in time.
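
The distinction drawn in this comment can be illustrated with Python's standard `datetime` module: an aware datetime is one instant that merely prints differently per time zone, while a naive ("floating") datetime maps to different instants depending on the zone it is later read in. The fixed UTC+9 offset is just an example zone chosen for this sketch.

```python
from datetime import datetime, timedelta, timezone

utc = timezone.utc
tokyo = timezone(timedelta(hours=9), "UTC+9")   # fixed offset, for illustration

# Instant semantics: 0 seconds since the Unix epoch is one point in time.
instant = datetime.fromtimestamp(0, tz=utc)
instant_in_tokyo = instant.astimezone(tokyo)    # same instant, 09:00 wall time

# Floating semantics: the wall-clock reading is fixed, the instant is not.
floating = datetime(1970, 1, 1, 0, 0, 0)        # no tzinfo attached
as_utc = floating.replace(tzinfo=utc).timestamp()
as_tokyo = floating.replace(tzinfo=tokyo).timestamp()
```

Here `as_utc` and `as_tokyo` differ by nine hours even though both come from the same floating value.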




> [Python] Document semantic differences between Spark timestamps and Arrow 
> timestamps
> 
>
> Key: ARROW-1425
> URL: https://issues.apache.org/jira/browse/ARROW-1425
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Heimir Thor Sverrisson
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> The way that Spark treats non-timezone-aware timestamps as session local can 
> be problematic when using pyarrow which may view the data coming from 
> toPandas() as time zone naive (but with fields as though it were UTC, not 
> session local). We should document carefully how to properly handle the data 
> coming from Spark to avoid problems.
> cc [~bryanc] [~holdenkarau]





[jira] [Comment Edited] (ARROW-1710) [Java] Decide what to do with non-nullable vectors in new vector class hierarchy

2017-11-07 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242687#comment-16242687
 ] 

Li Jin edited comment on ARROW-1710 at 11/7/17 7:18 PM:


Seems we agree to remove non nullable vectors.

The next question is, what do people feel about dropping the "Nullable" prefix 
in new vector classes? [~bryanc] brought this up initially.

I am +1 for dropping the "Nullable" prefix. I think it makes the code more 
concise. 


was (Author: icexelloss):
Seems we agree to remove non nullable vectors.

The next question is, what do people feel about dropping the "Nullable" prefix 
in new vector classes? [~bryanc] brought this up initially.

I am +1 for dropping the "Nullable" prefix. It makes the code more concise. 

> [Java] Decide what to do with non-nullable vectors in new vector class 
> hierarchy 
> -
>
> Key: ARROW-1710
> URL: https://issues.apache.org/jira/browse/ARROW-1710
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java - Vectors
>Reporter: Li Jin
> Fix For: 0.8.0
>
>
> So far the consensus seems to be to remove all non-nullable vectors. 





[jira] [Commented] (ARROW-1710) [Java] Decide what to do with non-nullable vectors in new vector class hierarchy

2017-11-07 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242687#comment-16242687
 ] 

Li Jin commented on ARROW-1710:
---

Seems we agree to remove non nullable vectors.

The next question is, what do people feel about dropping the "Nullable" prefix 
in new vector classes? [~bryanc] brought this up initially.

I am +1 for dropping the "Nullable" prefix. It makes the code more concise. 



[jira] [Commented] (ARROW-1431) [Java] JsonFileReader doesn't intialize some vectors approperately

2017-11-07 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242664#comment-16242664
 ] 

Li Jin commented on ARROW-1431:
---

https://github.com/apache/arrow/pull/1290

> [Java] JsonFileReader doesn't intialize some vectors approperately 
> ---
>
> Key: ARROW-1431
> URL: https://issues.apache.org/jira/browse/ARROW-1431
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Li Jin
>
> One example is ListVector: the JsonFileReader sets the validity, offset, 
> and data, but doesn't set the `lastSet` variable in the ListVector instance.
> ArrowFileReader works correctly because it invokes `loadFieldBuffers` and 
> initializes `lastSet` correctly:
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java#L120
> This doesn't break integration tests but can cause weird bugs when people 
> call methods on the vector read from json.





[jira] [Commented] (ARROW-1717) [Java] Remove public static helper method in vector classes for JSONReader/Writer

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242662#comment-16242662
 ] 

ASF GitHub Bot commented on ARROW-1717:
---

icexelloss commented on issue #1290: ARROW-1717: Refactor JsonReader
URL: https://github.com/apache/arrow/pull/1290#issuecomment-342588734
 
 
   This also fixes ARROW-1431




> [Java] Remove public static helper method in vector classes for 
> JSONReader/Writer
> -
>
> Key: ARROW-1717
> URL: https://issues.apache.org/jira/browse/ARROW-1717
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Li Jin
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Updated] (ARROW-1717) [Java] Remove public static helper method in vector classes for JSONReader/Writer

2017-11-07 Thread Li Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Jin updated ARROW-1717:
--
Summary: [Java] Remove public static helper method in vector classes for 
JSONReader/Writer  (was: [Java] Decide what to do with public static helper 
method in vector classes for JSONReader/Writer)



[jira] [Commented] (ARROW-1717) [Java] Decide what to do with public static helper method in vector classes for JSONReader/Writer

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242586#comment-16242586
 ] 

ASF GitHub Bot commented on ARROW-1717:
---

BryanCutler commented on issue #1290: [ARROW-1717] Refactor JsonReader
URL: https://github.com/apache/arrow/pull/1290#issuecomment-342577047
 
 
   +1, I prefer it this way to keep the vector classes cleaner




> [Java] Decide what to do with public static helper method in vector classes 
> for JSONReader/Writer
> -
>
> Key: ARROW-1717
> URL: https://issues.apache.org/jira/browse/ARROW-1717
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Li Jin
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Commented] (ARROW-1694) [Java] Unclosed VectorSchemaRoot in JsonFileReader#readDictionaryBatches()

2017-11-07 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242582#comment-16242582
 ] 

Ted Yu commented on ARROW-1694:
---

Makes sense - reader is closed in execute().

> [Java] Unclosed VectorSchemaRoot in JsonFileReader#readDictionaryBatches()
> --
>
> Key: ARROW-1694
> URL: https://issues.apache.org/jira/browse/ARROW-1694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Ted Yu
>Priority: Minor
> Fix For: 0.8.0
>
>
> {code}
>   VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 
> vector.getAccessor().getValueCount());
>   read(root);
> {code}
> root should be closed upon return from the method.





[jira] [Comment Edited] (ARROW-1694) [Java] Unclosed VectorSchemaRoot in JsonFileReader#readDictionaryBatches()

2017-11-07 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242575#comment-16242575
 ] 

Bryan Cutler edited comment on ARROW-1694 at 11/7/17 6:24 PM:
--

[~tedyu] closing the VectorSchemaRoot will also release the vector data 
buffers, so it needs to be kept open to use the data and cannot be 
automatically closed after {{read(root)}} is called.

Take a look at 
https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L166
Here the JsonFileReader calls {{read(root)}} which loads values into the root 
owned by  {{ArrowFileWriter}} so that it can write back out the values.  The 
VectorSchemaRoot is only closed after all writing is done.

I'm closing this as not a problem


was (Author: bryanc):
[~tedyu]] closing the VectorSchemaRoot will release the vector data buffers 
also, so it needs to be kept open to use the data and can not be automatically 
closed after {{read(root)}} is called.

Take a look at 
https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L166
Here the JsonFileReader calls {{read(root)}} which loads values into the root 
owned by  {{ArrowFileWriter}} so that it can write back out the values.  The 
VectorSchemaRoot is only closed after all writing is done.

I'm closing this as not a problem

> [Java] Unclosed VectorSchemaRoot in JsonFileReader#readDictionaryBatches()
> --
>
> Key: ARROW-1694
> URL: https://issues.apache.org/jira/browse/ARROW-1694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Ted Yu
>Priority: Minor
> Fix For: 0.8.0
>
>
> {code}
>   VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 
> vector.getAccessor().getValueCount());
>   read(root);
> {code}
> root should be closed upon return from the method.





[jira] [Comment Edited] (ARROW-1694) [Java] Unclosed VectorSchemaRoot in JsonFileReader#readDictionaryBatches()

2017-11-07 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242575#comment-16242575
 ] 

Bryan Cutler edited comment on ARROW-1694 at 11/7/17 6:24 PM:
--

[~tedyu]] closing the VectorSchemaRoot will also release the vector data 
buffers, so it needs to be kept open to use the data and cannot be 
automatically closed after {{read(root)}} is called.

Take a look at 
https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L166
Here the JsonFileReader calls {{read(root)}} which loads values into the root 
owned by  {{ArrowFileWriter}} so that it can write back out the values.  The 
VectorSchemaRoot is only closed after all writing is done.

I'm closing this as not a problem


was (Author: bryanc):
[~ted_yu] closing the VectorSchemaRoot will release the vector data buffers 
also, so it needs to be kept open to use the data and can not be automatically 
closed after {{read(root)}} is called.

Take a look at 
https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L166
Here the JsonFileReader calls {{read(root)}} which loads values into the root 
owned by  {{ArrowFileWriter}} so that it can write back out the values.  The 
VectorSchemaRoot is only closed after all writing is done.

I'm closing this as not a problem

> [Java] Unclosed VectorSchemaRoot in JsonFileReader#readDictionaryBatches()
> --
>
> Key: ARROW-1694
> URL: https://issues.apache.org/jira/browse/ARROW-1694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Ted Yu
>Priority: Minor
> Fix For: 0.8.0
>
>
> {code}
>   VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 
> vector.getAccessor().getValueCount());
>   read(root);
> {code}
> root should be closed upon return from the method.





[jira] [Resolved] (ARROW-1694) [Java] Unclosed VectorSchemaRoot in JsonFileReader#readDictionaryBatches()

2017-11-07 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved ARROW-1694.
-
Resolution: Not A Problem

[~ted_yu] closing the VectorSchemaRoot will also release the vector data 
buffers, so it needs to be kept open to use the data and cannot be 
automatically closed after {{read(root)}} is called.

Take a look at 
https://github.com/apache/arrow/blob/master/java/tools/src/main/java/org/apache/arrow/tools/Integration.java#L166
Here the JsonFileReader calls {{read(root)}} which loads values into the root 
owned by  {{ArrowFileWriter}} so that it can write back out the values.  The 
VectorSchemaRoot is only closed after all writing is done.

I'm closing this as not a problem
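
To illustrate the ownership point above, here is a minimal sketch that uses 
stand-in classes rather than the real Arrow API. {{ToyRoot}}, {{ToyReader}} and 
{{OwnershipDemo}} are hypothetical names invented for this example; the only 
assumption carried over from the comment is that closing the root releases its 
buffers, so the reader fills a root it does not own and the owner closes it 
once all consumers are done.

```java
// Hypothetical stand-ins; none of these classes exist in Arrow.
final class OwnershipDemo {
    // The caller owns the root: it stays open while data is read and
    // consumed, and is closed exactly once when the try block exits.
    static int roundTrip() {
        try (ToyRoot root = new ToyRoot(3)) {
            ToyReader.read(root);              // fills the root, does not close it
            int sum = 0;
            for (int i = 0; i < root.size(); i++) {
                sum += root.get(i);            // data is still usable here
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip());
    }
}

final class ToyRoot implements AutoCloseable {
    private int[] values;                      // stands in for the vector data buffers

    ToyRoot(int capacity) {
        values = new int[capacity];
    }

    int size() {
        requireOpen();
        return values.length;
    }

    int get(int i) {
        requireOpen();
        return values[i];
    }

    void set(int i, int v) {
        requireOpen();
        values[i] = v;
    }

    @Override
    public void close() {
        values = null;                         // releasing the buffers makes the data unusable
    }

    private void requireOpen() {
        if (values == null) {
            throw new IllegalStateException("root is closed");
        }
    }
}

final class ToyReader {
    // Loads values 1..n into a root owned by the caller.
    static void read(ToyRoot root) {
        for (int i = 0; i < root.size(); i++) {
            root.set(i, i + 1);
        }
    }
}
```

If {{ToyReader.read}} closed the root before returning, the loop in 
{{roundTrip}} would fail with "root is closed", which is the situation the 
comment describes.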

> [Java] Unclosed VectorSchemaRoot in JsonFileReader#readDictionaryBatches()
> --
>
> Key: ARROW-1694
> URL: https://issues.apache.org/jira/browse/ARROW-1694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Ted Yu
>Priority: Minor
> Fix For: 0.8.0
>
>
> {code}
>   VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 
> vector.getAccessor().getValueCount());
>   read(root);
> {code}
> root should be closed upon return from the method.





[jira] [Commented] (ARROW-1717) [Java] Decide what to do with public static helper method in vector classes for JSONReader/Writer

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242564#comment-16242564
 ] 

ASF GitHub Bot commented on ARROW-1717:
---

icexelloss commented on issue #1290: [ARROW-1717] Refactor JsonReader
URL: https://github.com/apache/arrow/pull/1290#issuecomment-342574257
 
 
   cc @BryanCutler @siddharthteotia 
   
   This patch cleans up JsonReader.




> [Java] Decide what to do with public static helper method in vector classes 
> for JSONReader/Writer
> -
>
> Key: ARROW-1717
> URL: https://issues.apache.org/jira/browse/ARROW-1717
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Li Jin
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Commented] (ARROW-1717) [Java] Decide what to do with public static helper method in vector classes for JSONReader/Writer

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242553#comment-16242553
 ] 

ASF GitHub Bot commented on ARROW-1717:
---

icexelloss opened a new pull request #1290: [ARROW-1717] Refactor JsonReader
URL: https://github.com/apache/arrow/pull/1290
 
 
   This patch:
   
   * Refactor JsonReader and remove static helper function in vector classes
   * Fix integration test
   
   




> [Java] Decide what to do with public static helper method in vector classes 
> for JSONReader/Writer
> -
>
> Key: ARROW-1717
> URL: https://issues.apache.org/jira/browse/ARROW-1717
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Li Jin
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Updated] (ARROW-1717) [Java] Decide what to do with public static helper method in vector classes for JSONReader/Writer

2017-11-07 Thread ASF GitHub Bot (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1717:
--
Labels: pull-request-available  (was: )

> [Java] Decide what to do with public static helper method in vector classes 
> for JSONReader/Writer
> -
>
> Key: ARROW-1717
> URL: https://issues.apache.org/jira/browse/ARROW-1717
> Project: Apache Arrow
>  Issue Type: Sub-task
>Reporter: Li Jin
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>






[jira] [Commented] (ARROW-1163) [Plasma] Java client for Plasma

2017-11-07 Thread Philipp Moritz (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242298#comment-16242298
 ] 

Philipp Moritz commented on ARROW-1163:
---

Hey Lu Qi,

I have very limited experience with Java, but here are some thoughts that I 
hope are helpful:

You can do zero copy reads in Java using an off-heap method like 
http://xcorpion.tech/2016/09/10/It-s-all-about-buffers-zero-copy-mmap-and-Java-NIO/.
 Given the data already lives in (in-memory) memory-mapped files, this might be 
the best way to go forward here.

We would essentially define our own Tensor class and then use code like 
https://github.com/apache/spark/tree/50ada2a4d31609b6c828158cad8e128c2f605b8d/common/unsafe/src/main/java/org/apache/spark/unsafe
 (see for example 
https://github.com/apache/spark/blob/50ada2a4d31609b6c828158cad8e128c2f605b8d/common/unsafe/src/main/java/org/apache/spark/unsafe/array/LongArray.java)
 to access the data without copies.
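
As a rough illustration of the off-heap approach, here is a minimal sketch 
using only the JDK's NIO classes. {{MappedFloatTensor}} and {{writeDemoFile}} 
are hypothetical names made up for this example, and the file layout (raw 
row-major floats in native byte order, two dimensions only) is an assumption; 
it is not the Arrow Tensor format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical read-only 2-D tensor view over a memory-mapped file.
final class MappedFloatTensor {
    private final FloatBuffer data;   // view over the mapped pages, not a heap copy
    private final int rows;
    private final int cols;

    private MappedFloatTensor(FloatBuffer data, int rows, int cols) {
        this.data = data;
        this.rows = rows;
        this.cols = cols;
    }

    // Maps the file and views its bytes as rows * cols row-major floats.
    static MappedFloatTensor map(Path path, int rows, int cols) {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer bytes =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            bytes.order(ByteOrder.nativeOrder());
            FloatBuffer floats = bytes.asFloatBuffer();
            if (floats.remaining() < rows * cols) {
                throw new IllegalArgumentException("file too small for shape");
            }
            // The mapping stays valid after the channel is closed.
            return new MappedFloatTensor(floats, rows, cols);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    float get(int row, int col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException("index outside tensor shape");
        }
        return data.get(row * cols + col);
    }

    // Demo helper: writes raw native-order floats to a temporary file.
    static Path writeDemoFile(float... values) {
        try {
            Path file = Files.createTempFile("tensor", ".bin");
            file.toFile().deleteOnExit();
            ByteBuffer bytes = ByteBuffer.allocate(values.length * Float.BYTES)
                                         .order(ByteOrder.nativeOrder());
            bytes.asFloatBuffer().put(values);
            Files.write(file, bytes.array());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeDemoFile(1f, 2f, 3f, 4f, 5f, 6f);
        MappedFloatTensor tensor = map(file, 2, 3);
        System.out.println(tensor.get(1, 2));
    }
}
```

The Spark code linked above goes through {{sun.misc.Unsafe}} instead; a 
{{FloatBuffer}} view is the plain-NIO alternative and is bounds-checked.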

Arrow already has a Tensor class in C++ that does similar things, and the 
current Python serialization code uses that to read Tensors in a zero copy way 
from the object store and expose them as numpy arrays to the user. On the Java 
side I think not much is available yet for reading tensors; as a point to get 
started, the code for parsing Tensor metadata is generated here: 
https://github.com/apache/arrow/blob/82eea49b3eea6047f53478113ab3ff9a38f0d344/java/format/pom.xml#L108

If you look at the code for reading C++ Tensors, you should be able to get a 
prototype of this working. I'm also cc'ing some of the people who have done 
most work on the Java implementation for more input.

[~bryanc]  [~siddteotia] [~jnadeau]

-- Philipp.

> [Plasma] Java client for Plasma
> ---
>
> Key: ARROW-1163
> URL: https://issues.apache.org/jira/browse/ARROW-1163
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Philipp Moritz
>
> We should start thinking about what a Java client for Plasma would look like. 
> Given the focus of arrow to support Python, C++ and Java really well, it is 
> the next important target after Python and C++.
> My preliminary thoughts on it are the following ones: We can either go with 
> JNI and wrap the C++ client or (in my opinion preferable) write a pure Java 
> client. It would communicate with the Plasma store via Java flatbuffers over 
> sockets.
> It seems that the only thing blocking a pure Java client at the moment is the 
> way we ship file descriptors for the memory mapped files between store and 
> client (see the file fling.cc in the Plasma repo). We would need to get rid 
> of that because there is no pure Java API that allows transferring file 
> descriptors over a process boundary. So the way to transfer memory mapped 
> files over process boundaries then is probably to use the file system and 
> keep the memory mapped files in the file system instead of unlinking them 
> immediately (as we do at the moment), so they can be opened by the client 
> process via their path.
> The challenge in this case is how to clean the files up and make sure they 
> are not lying around if the plasma store crashes. One option is to store the 
> plasma store PID with the file (i.e. as part of the file name) and let the 
> plasma store clean them up the next time it is started; maybe there is OS 
> level support for temporary files we can reuse.
> I probably won't get to this for a while, so if anybody needs this or has 
> free cycles, they should feel free to chime in. Also opinions on the design 
> are appreciated!
> -- Philipp.
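
The PID-in-file-name cleanup idea from the description could be sketched as 
follows. Everything here is hypothetical: the {{StaleMmapCleaner}} class, the 
{{plasma-<pid>-<sequence>.mmap}} naming scheme and the {{demoDir}} helper are 
invented for illustration and are not what the Plasma store does. The liveness 
check is passed in as a predicate; the one-argument overload assumes Java 9+ 
for {{ProcessHandle}}. PID reuse by an unrelated process is not handled.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical startup cleanup for memory-mapped files left behind by a crashed store.
final class StaleMmapCleaner {
    // Assumed naming scheme: plasma-<store pid>-<sequence>.mmap
    private static final Pattern NAME =
        Pattern.compile("plasma-(\\d{1,18})-\\d+\\.mmap");

    static String fileName(long pid, int sequence) {
        return "plasma-" + pid + "-" + sequence + ".mmap";
    }

    // Deletes files whose embedded PID is no longer alive; returns the deleted names.
    static List<String> deleteStale(Path dir, LongPredicate isAlive) {
        List<String> deleted = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "plasma-*.mmap")) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                Matcher m = NAME.matcher(name);
                if (m.matches() && !isAlive.test(Long.parseLong(m.group(1)))) {
                    Files.delete(file);
                    deleted.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return deleted;
    }

    // Convenience overload that asks the OS whether the PID exists (Java 9+).
    static List<String> deleteStale(Path dir) {
        return deleteStale(dir, pid -> ProcessHandle.of(pid).isPresent());
    }

    // Demo helper: creates a temp directory holding one empty file per PID.
    static Path demoDir(long... pids) {
        try {
            Path dir = Files.createTempDirectory("plasma-demo");
            for (long pid : pids) {
                Files.createFile(dir.resolve(fileName(pid, 0)));
            }
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = demoDir(111L, 222L);
        // Pretend only PID 111 is still running.
        System.out.println(deleteStale(dir, pid -> pid == 111L));
    }
}
```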





[jira] [Commented] (ARROW-1775) Ability to abort created but unsealed Plasma objects

2017-11-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242191#comment-16242191
 ] 

ASF GitHub Bot commented on ARROW-1775:
---

pcmoritz commented on issue #1289: ARROW-1775: Ability to abort created but 
unsealed Plasma objects
URL: https://github.com/apache/arrow/pull/1289#issuecomment-342518983
 
 
   +1 LGTM Will leave this open a bit longer in case there are more comments.




> Ability to abort created but unsealed Plasma objects
> 
>
> Key: ARROW-1775
> URL: https://issues.apache.org/jira/browse/ARROW-1775
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Plasma (C++)
>Reporter: Stephanie Wang
>  Labels: pull-request-available
>
> It would be useful to allow a Plasma client to abort an object that it 
> created but hasn't yet sealed. After the abort, it should appear as if the 
> object was never created at all. The logic is similar to the delete case, except 
> that the client must release the object atomically with the removal of the 
> object from the cache and store.
> In Ray, for example, we need this for the distributed version of the Plasma 
> store, where many Plasma clients transfer objects to each other. If a sending 
> Plasma client fails during a transfer, we want to make sure that the 
> receiving client can abort the transfer, so that we can later recreate the 
> object successfully. Otherwise, we will fail with an error that the object 
> already exists.
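
The semantics requested above can be sketched with a toy in-memory store. This 
is not the Plasma client API: {{ToyObjectStore}} and its method names are 
hypothetical, and the sketch only models object state, not shared memory or 
reference counts. What it shows is that after an abort the object ID can be 
created again, whereas creating an existing ID fails.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy model of create / seal / abort; not the Plasma API.
final class ToyObjectStore {
    private enum State { CREATED, SEALED }

    private final Map<String, State> objects = new HashMap<>();

    // Fails if the ID is already known, sealed or not.
    synchronized void create(String id) {
        if (objects.containsKey(id)) {
            throw new IllegalStateException("object already exists: " + id);
        }
        objects.put(id, State.CREATED);
    }

    synchronized void seal(String id) {
        if (objects.get(id) != State.CREATED) {
            throw new IllegalStateException("object is not in the created state: " + id);
        }
        objects.put(id, State.SEALED);
    }

    // Removes a created-but-unsealed object in one step, as if it never existed.
    synchronized void abort(String id) {
        if (objects.get(id) != State.CREATED) {
            throw new IllegalStateException("only unsealed objects can be aborted: " + id);
        }
        objects.remove(id);
    }

    synchronized boolean isSealed(String id) {
        return objects.get(id) == State.SEALED;
    }

    public static void main(String[] args) {
        ToyObjectStore store = new ToyObjectStore();
        store.create("obj");   // a transfer starts
        store.abort("obj");    // the sender failed mid-transfer
        store.create("obj");   // the retry succeeds because the ID is free again
        store.seal("obj");
        System.out.println(store.isSealed("obj"));
    }
}
```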





[jira] [Commented] (ARROW-1163) [Plasma] Java client for Plasma

2017-11-07 Thread Lu Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242190#comment-16242190
 ] 

Lu Qi  commented on ARROW-1163:
---

Hi Philipp Moritz,
I've been working on reading and writing Tensors in Java for several weeks. My 
Tensor structure looks like this:
class Tensor { private float[] storage; private int[] shape; }
I used JNI to leverage the Plasma C++ client. One good thing is that, when 
writing a tensor, the JNI function "GetPrimitiveArrayCritical" returns the 
address of the array in the Java heap (depending on the VM implementation), 
so I can construct the Tensor in C++ easily without copying. This stops GC 
while the array is held, but Plasma writes are non-blocking. On the other 
hand, when reading a tensor, I need to copy the shared memory into the Java 
heap, which costs time. So, in order to save reading time, a pure Java 
client may be a good choice. 

As for a pure Java client, maybe we can use JNI to get the fd first and 
construct a FileDescriptor from it:
https://stackoverflow.com/questions/4845122/using-a-numbered-file-descriptor-from-java
 


> [Plasma] Java client for Plasma
> ---
>
> Key: ARROW-1163
> URL: https://issues.apache.org/jira/browse/ARROW-1163
> Project: Apache Arrow
>  Issue Type: New Feature
>Reporter: Philipp Moritz
>
> We should start thinking about what a Java client for Plasma would look like. 
> Given the focus of arrow to support Python, C++ and Java really well, it is 
> the next important target after Python and C++.
> My preliminary thoughts on it are the following ones: We can either go with 
> JNI and wrap the C++ client or (in my opinion preferable) write a pure Java 
> client. It would communicate with the Plasma store via Java flatbuffers over 
> sockets.
> It seems that the only thing blocking a pure Java client at the moment is the 
> way we ship file descriptors for the memory mapped files between store and 
> client (see the file fling.cc in the Plasma repo). We would need to get rid 
> of that because there is no pure Java API that allows transferring file 
> descriptors over a process boundary. So the way to transfer memory mapped 
> files over process boundaries then is probably to use the file system and 
> keep the memory mapped files in the file system instead of unlinking them 
> immediately (as we do at the moment), so they can be opened by the client 
> process via their path.
> The challenge in this case is how to clean the files up and make sure they 
> are not lying around if the plasma store crashes. One option is to store the 
> plasma store PID with the file (i.e. as part of the file name) and let the 
> plasma store clean them up the next time it is started; maybe there is OS 
> level support for temporary files we can reuse.
> I probably won't get to this for a while, so if anybody needs this or has 
> free cycles, they should feel free to chime in. Also opinions on the design 
> are appreciated!
> -- Philipp.





[jira] [Commented] (ARROW-1776) [C++] arrow::gpu::CudaContext::bytes_allocated() isn't defined

2017-11-07 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16242136#comment-16242136
 ] 

Wes McKinney commented on ARROW-1776:
-

Yes, it should be defined

> [C++] arrow::gpu::CudaContext::bytes_allocated() isn't defined
> --
>
> Key: ARROW-1776
> URL: https://issues.apache.org/jira/browse/ARROW-1776
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.1
>Reporter: Kouhei Sutou
>Priority: Minor
> Fix For: 0.8.0
>
>
> arrow/gpu/cuda_context.h declares arrow::gpu::CudaContext::bytes_allocated() 
> but it's not defined.
> Should it be removed or defined?
> CudaContext::CudaContextImpl::bytes_allocated() exists. So it's easy to 
> define it.





[jira] [Updated] (ARROW-1776) [C++] arrow::gpu::CudaContext::bytes_allocated() isn't defined

2017-11-07 Thread Wes McKinney (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1776:

Fix Version/s: 0.8.0

> [C++] arrow::gpu::CudaContext::bytes_allocated() isn't defined
> --
>
> Key: ARROW-1776
> URL: https://issues.apache.org/jira/browse/ARROW-1776
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.7.1
>Reporter: Kouhei Sutou
>Priority: Minor
> Fix For: 0.8.0
>
>
> arrow/gpu/cuda_context.h declares arrow::gpu::CudaContext::bytes_allocated() 
> but it's not defined.
> Should it be removed or defined?
> CudaContext::CudaContextImpl::bytes_allocated() exists. So it's easy to 
> define it.





[jira] [Created] (ARROW-1776) [C++] arrow::gpu::CudaContext::bytes_allocated() isn't defined

2017-11-07 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-1776:
---

 Summary: [C++] arrow::gpu::CudaContext::bytes_allocated() isn't 
defined
 Key: ARROW-1776
 URL: https://issues.apache.org/jira/browse/ARROW-1776
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.7.1
Reporter: Kouhei Sutou
Priority: Minor


arrow/gpu/cuda_context.h declares arrow::gpu::CudaContext::bytes_allocated() 
but it's not defined.
Should it be removed or defined?

CudaContext::CudaContextImpl::bytes_allocated() exists. So it's easy to define 
it.


