[jira] [Updated] (ARROW-3591) [R] Support to collect decimal type

2018-10-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3591:
--
Labels: pull-request-available  (was: )

> [R] Support to collect decimal type
> ---
>
> Key: ARROW-3591
> URL: https://issues.apache.org/jira/browse/ARROW-3591
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Major
>  Labels: pull-request-available
>
> Collecting decimal types from `sparklyr` through:
>  
> {code:java}
> library(sparklyr)
> sc <- spark_connect(master = "local")
> sdf_len(sc, 3) %>% dplyr::mutate(new = 1) %>% dplyr::collect(){code}
> causes,
>  
> {code:java}
> Error in RecordBatch__to_dataframe(x) : cannot handle Array of type decimal
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3591) [R] Support to collect decimal type

2018-10-22 Thread Javier Luraschi (JIRA)
Javier Luraschi created ARROW-3591:
--

 Summary: [R] Support to collect decimal type
 Key: ARROW-3591
 URL: https://issues.apache.org/jira/browse/ARROW-3591
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Javier Luraschi


Collecting decimal types from `sparklyr` through:

 
{code:java}
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_len(sc, 3) %>% dplyr::mutate(new = 1) %>% dplyr::collect(){code}
causes,

 
{code:java}
Error in RecordBatch__to_dataframe(x) : cannot handle Array of type decimal
{code}





[jira] [Commented] (ARROW-2712) [C#] Initial C# .NET library

2018-10-22 Thread Jamie Elliott (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659971#comment-16659971
 ] 

Jamie Elliott commented on ARROW-2712:
--

Hey! Sorry I let this slide. I had a more or less clear plan in my head but 
just got too busy with my day job. If someone is donating code, that is very 
exciting. Can you give any more details? 

BTW - the name I was going to suggest was SharpArrow. 

> [C#] Initial C# .NET library
> 
>
> Key: ARROW-2712
> URL: https://issues.apache.org/jira/browse/ARROW-2712
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Jamie Elliott
>Priority: Major
>  Labels: features, newbie, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> A feature request. I've seen this pop up in a few places. Want to have a 
> record of discussion on this topic. 
> I may be open to contributing this, but first need some general guidance on 
> approach so I can understand effort level. 
> It looks like there is not a good tool available for GObject Introspection 
> binding to .NET so the easy pathway via Arrow Glib C API appears to be 
> closed. 
> The only GObject integration for .NET appears to be Mono GAPI
> [http://www.mono-project.com/docs/gui/gtksharp/gapi/]
> From what I can see this produces a GIR or similar XML, then generates C# 
> code directly from that. Likely involves many manual fix ups of the XML. 
> Worth a try? 
>  
> Alternatively I could look at generating some other direct binding from .NET 
> to C/C++. Where I work we use Swig [http://www.swig.org/]. Good for vanilla 
> cases, requires hand crafting of the .i files and specialized marshalling 
> strategies for optimizing performance critical cases. 
> Haven't tried CppSharp but it looks more appealing than Swig in some ways 
> [https://github.com/mono/CppSharp/wiki/Users-Manual]
> In either case, not sure if better to use Glib C API or C++ API directly. 
> What would be pros/cons? 
>  
>  
>  
>  
>  





[jira] [Resolved] (ARROW-3574) Fix remaining bug with plasma static versus shared libraries.

2018-10-22 Thread Philipp Moritz (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz resolved ARROW-3574.
---
Resolution: Fixed

> Fix remaining bug with plasma static versus shared libraries.
> -
>
> Key: ARROW-3574
> URL: https://issues.apache.org/jira/browse/ARROW-3574
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Address a few missing pieces in [https://github.com/apache/arrow/pull/2792]. 
> On Mac, moving the {{plasma_store_server}} executable around and then 
> executing it leads to
>  
> {code:java}
> dyld: Library not loaded: @rpath/libarrow.12.dylib
>   Referenced from: 
> /Users/rkn/Workspace/ray/./python/ray/core/src/plasma/plasma_store_server
>   Reason: image not found
> Abort trap: 6{code}





[jira] [Updated] (ARROW-3574) Fix remaining bug with plasma static versus shared libraries.

2018-10-22 Thread Philipp Moritz (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philipp Moritz updated ARROW-3574:
--
Fix Version/s: 0.12.0

> Fix remaining bug with plasma static versus shared libraries.
> -
>
> Key: ARROW-3574
> URL: https://issues.apache.org/jira/browse/ARROW-3574
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Robert Nishihara
>Assignee: Robert Nishihara
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Address a few missing pieces in [https://github.com/apache/arrow/pull/2792]. 
> On Mac, moving the {{plasma_store_server}} executable around and then 
> executing it leads to
>  
> {code:java}
> dyld: Library not loaded: @rpath/libarrow.12.dylib
>   Referenced from: 
> /Users/rkn/Workspace/ray/./python/ray/core/src/plasma/plasma_store_server
>   Reason: image not found
> Abort trap: 6{code}





[jira] [Created] (ARROW-3590) Expose Python API for start and end offset of row group in parquet file

2018-10-22 Thread Heejong Lee (JIRA)
Heejong Lee created ARROW-3590:
--

 Summary: Expose Python API for start and end offset of row group 
in parquet file
 Key: ARROW-3590
 URL: https://issues.apache.org/jira/browse/ARROW-3590
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Heejong Lee


Is there a way to get more detailed metadata from a Parquet file in PyArrow? 
Specifically, I want to access the start and end offset information for each 
row group.





[jira] [Updated] (ARROW-3589) [Gandiva] Make it possible to compile gandiva without JNI

2018-10-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3589:
--
Labels: pull-request-available  (was: )

> [Gandiva] Make it possible to compile gandiva without JNI
> -
>
> Key: ARROW-3589
> URL: https://issues.apache.org/jira/browse/ARROW-3589
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
>  Labels: pull-request-available
>
> When trying to compile arrow with
> {code:java}
> cmake -DARROW_PYTHON=on -DARROW_GANDIVA=on -DARROW_PLASMA=on ..{code}
> I'm seeing the following error right now:
> {code:java}
> CMake Error at 
> /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:137
>  (message):
>   Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_JVM_LIBRARY
>   JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)
> Call Stack (most recent call first):
>   
> /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:378
>  (_FPHSA_FAILURE_MESSAGE)
>   /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindJNI.cmake:356 
> (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
>   src/gandiva/jni/CMakeLists.txt:21 (find_package)
> -- Configuring incomplete, errors occurred{code}
> It should be possible to compile the C++ gandiva code without JNI bindings; 
> how about we introduce a new flag "-DARROW_GANDIVA_JAVA=off" (which could be 
> on by default if desired)?





[jira] [Commented] (ARROW-3589) [Gandiva] Make it possible to compile gandiva without JNI

2018-10-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659803#comment-16659803
 ] 

Wes McKinney commented on ARROW-3589:
-

You could also do -DARROW_JNI=off to disable all JNI extensions as they exist. 

> [Gandiva] Make it possible to compile gandiva without JNI
> -
>
> Key: ARROW-3589
> URL: https://issues.apache.org/jira/browse/ARROW-3589
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Philipp Moritz
>Priority: Major
>
> When trying to compile arrow with
> {code:java}
> cmake -DARROW_PYTHON=on -DARROW_GANDIVA=on -DARROW_PLASMA=on ..{code}
> I'm seeing the following error right now:
> {code:java}
> CMake Error at 
> /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:137
>  (message):
>   Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_JVM_LIBRARY
>   JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)
> Call Stack (most recent call first):
>   
> /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:378
>  (_FPHSA_FAILURE_MESSAGE)
>   /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindJNI.cmake:356 
> (FIND_PACKAGE_HANDLE_STANDARD_ARGS)
>   src/gandiva/jni/CMakeLists.txt:21 (find_package)
> -- Configuring incomplete, errors occurred{code}
> It should be possible to compile the C++ gandiva code without JNI bindings; 
> how about we introduce a new flag "-DARROW_GANDIVA_JAVA=off" (which could be 
> on by default if desired)?





[jira] [Created] (ARROW-3589) [Gandiva] Make it possible to compile gandiva without JNI

2018-10-22 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-3589:
-

 Summary: [Gandiva] Make it possible to compile gandiva without JNI
 Key: ARROW-3589
 URL: https://issues.apache.org/jira/browse/ARROW-3589
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Philipp Moritz


When trying to compile arrow with
{code:java}
cmake -DARROW_PYTHON=on -DARROW_GANDIVA=on -DARROW_PLASMA=on ..{code}
I'm seeing the following error right now:
{code:java}
CMake Error at 
/home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:137
 (message):
  Could NOT find JNI (missing: JAVA_AWT_LIBRARY JAVA_JVM_LIBRARY
  JAVA_INCLUDE_PATH JAVA_INCLUDE_PATH2 JAVA_AWT_INCLUDE_PATH)
Call Stack (most recent call first):
  
/home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindPackageHandleStandardArgs.cmake:378
 (_FPHSA_FAILURE_MESSAGE)
  /home/ubuntu/anaconda3/share/cmake-3.12/Modules/FindJNI.cmake:356 
(FIND_PACKAGE_HANDLE_STANDARD_ARGS)
  src/gandiva/jni/CMakeLists.txt:21 (find_package)
-- Configuring incomplete, errors occurred{code}
It should be possible to compile the C++ gandiva code without JNI bindings; how 
about we introduce a new flag "-DARROW_GANDIVA_JAVA=off" (which could be on by 
default if desired)?





[jira] [Updated] (ARROW-3588) [Java] checkstyle - fix license

2018-10-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3588:
--
Labels: pull-request-available  (was: )

> [Java] checkstyle - fix license
> ---
>
> Key: ARROW-3588
> URL: https://issues.apache.org/jira/browse/ARROW-3588
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Java
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>  Labels: pull-request-available
>
> Make header correspond to the defined Apache license in checkstyle.license





[jira] [Created] (ARROW-3588) [Java] checkstyle - fix license

2018-10-22 Thread Bryan Cutler (JIRA)
Bryan Cutler created ARROW-3588:
---

 Summary: [Java] checkstyle - fix license
 Key: ARROW-3588
 URL: https://issues.apache.org/jira/browse/ARROW-3588
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Bryan Cutler
Assignee: Bryan Cutler


Make header correspond to the defined Apache license in checkstyle.license





[jira] [Updated] (ARROW-3585) [Python] Update the documentation about Schema & Metadata usage

2018-10-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3585:

Summary: [Python] Update the documentation about Schema & Metadata usage  
(was: Update the documentation about Schema & Metadata usage)

> [Python] Update the documentation about Schema & Metadata usage
> ---
>
> Key: ARROW-3585
> URL: https://issues.apache.org/jira/browse/ARROW-3585
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Daniel Haviv
>Assignee: Daniel Haviv
>Priority: Trivial
>  Labels: beginner, documentation, easyfix, newbie
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Reusing the Schema object from a Parquet file written by Spark when writing 
> with Pandas fails due to a Schema mismatch.
> The culprit is the metadata part of the schema, which each component fills 
> according to its implementation. More details can be found here: 
> [https://github.com/apache/arrow/issues/2805]
> The documentation should point that out.





[jira] [Updated] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals

2018-10-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3586:

Fix Version/s: 0.12.0

> [Python] Segmentation fault when converting empty table to pandas with 
> categoricals
> ---
>
> Key: ARROW-3586
> URL: https://issues.apache.org/jira/browse/ARROW-3586
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0, 0.11.0
> Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas 
> 0.23.4
> - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4
>Reporter: Andreas
>Priority: Major
> Fix For: 0.12.0
>
>
> {code:java}
> import pyarrow as pa
> table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], 
> names=['col'])
> table.to_pandas(categories=['col']){code}
> This produces a segmentation fault for certain types (e.g., int{32,64}) 
> while it works for others (e.g., string, binary).





[jira] [Updated] (ARROW-3586) [Python] Segmentation fault when converting empty table to pandas with categoricals

2018-10-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3586:

Summary: [Python] Segmentation fault when converting empty table to pandas 
with categoricals  (was: Segmentation fault when converting empty table to 
pandas with categoricals)

> [Python] Segmentation fault when converting empty table to pandas with 
> categoricals
> ---
>
> Key: ARROW-3586
> URL: https://issues.apache.org/jira/browse/ARROW-3586
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.10.0, 0.11.0
> Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas 
> 0.23.4
> - Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4
>Reporter: Andreas
>Priority: Major
> Fix For: 0.12.0
>
>
> {code:java}
> import pyarrow as pa
> table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], 
> names=['col'])
> table.to_pandas(categories=['col']){code}
> This produces a segmentation fault for certain types (e.g., int{32,64}) 
> while it works for others (e.g., string, binary).





[jira] [Updated] (ARROW-3587) [Python] Efficient serialization for Arrow Objects (array, table, tensor, etc)

2018-10-22 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3587:

Summary: [Python] Efficient serialization for Arrow Objects (array, table, 
tensor, etc)  (was: Efficient serialization for Arrow Objects (array, table, 
tensor, etc))

> [Python] Efficient serialization for Arrow Objects (array, table, tensor, etc)
> --
>
> Key: ARROW-3587
> URL: https://issues.apache.org/jira/browse/ARROW-3587
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Plasma (C++), Python
>Reporter: Siyuan Zhuang
>Priority: Major
>
> Currently, Arrow seems to have poor serialization support for its own objects.
> For example,
>   
> {code}
> import pyarrow 
> arr = pyarrow.array([1, 2, 3, 4]) 
> pyarrow.serialize(arr)
> {code}
> {code}
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "pyarrow/serialization.pxi", line 337, in pyarrow.lib.serialize
>  File "pyarrow/serialization.pxi", line 136, in 
> pyarrow.lib.SerializationContext._serialize_callback
>  pyarrow.lib.SerializationCallbackError: pyarrow does not know how to 
> serialize objects of type .
> {code}
> I am working on the Ray & Modin projects, using Plasma to store Arrow 
> objects. The lack of direct serialization support harms performance, so I 
> would like to push a PR to fix this problem.
> I wonder whether that is welcome, or whether someone else is already doing it?





[jira] [Commented] (ARROW-3587) Efficient serialization for Arrow Objects (array, table, tensor, etc)

2018-10-22 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659749#comment-16659749
 ] 

Wes McKinney commented on ARROW-3587:
-

No objections from me. The kinds of objects supported by {{pyarrow.serialize}} 
as you see are quite limited at the moment

> Efficient serialization for Arrow Objects (array, table, tensor, etc)
> -
>
> Key: ARROW-3587
> URL: https://issues.apache.org/jira/browse/ARROW-3587
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Plasma (C++), Python
>Reporter: Siyuan Zhuang
>Priority: Major
>
> Currently, Arrow seems to have poor serialization support for its own objects.
> For example,
>   
> {code}
> import pyarrow 
> arr = pyarrow.array([1, 2, 3, 4]) 
> pyarrow.serialize(arr)
> {code}
> {code}
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "pyarrow/serialization.pxi", line 337, in pyarrow.lib.serialize
>  File "pyarrow/serialization.pxi", line 136, in 
> pyarrow.lib.SerializationContext._serialize_callback
>  pyarrow.lib.SerializationCallbackError: pyarrow does not know how to 
> serialize objects of type .
> {code}
> I am working on the Ray & Modin projects, using Plasma to store Arrow 
> objects. The lack of direct serialization support harms performance, so I 
> would like to push a PR to fix this problem.
> I wonder whether that is welcome, or whether someone else is already doing it?





[jira] [Resolved] (ARROW-3547) [R] Protect against Null crash when reading from RecordBatch

2018-10-22 Thread Javier Luraschi (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Javier Luraschi resolved ARROW-3547.

Resolution: Fixed

Fixed by [https://github.com/apache/arrow/pull/2795]

 

> [R] Protect against Null crash when reading from RecordBatch
> 
>
> Key: ARROW-3547
> URL: https://issues.apache.org/jira/browse/ARROW-3547
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Minor
>
> Reprex:
>  
> {code:java}
>   tbl <- tibble::tibble(
> int = 1:10, dbl = as.numeric(1:10),
> lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
> chr = letters[1:10]
>   )
>   batch <- record_batch(tbl)
>   bytes <- write_record_batch(batch, raw())
>   stream_reader <- record_batch_stream_reader(bytes)
>   batch1 <- read_record_batch(stream_reader)
>   batch2 <- read_record_batch(stream_reader)
>   
>   # Crash
>   as_tibble(batch2){code}
>  
> While users should check for Null entries by running:
>  
> {code:java}
> if(!batch2$is_null()) as_tibble(batch2)
> {code}
> It's harsh to trigger a crash; we should consider protecting all functions 
> that use RecordBatch pointers to return NULL instead, for instance:
>  
> {code:java}
> List RecordBatch__to_dataframe(const std::shared_ptr& 
> batch) {
>if (batch->get() == nullptr) Rcpp::stop("Can't read from NULL record 
> batch.")
> }{code}
>  
>  





[jira] [Commented] (ARROW-3547) [R] Protect against Null crash when reading from RecordBatch

2018-10-22 Thread Javier Luraschi (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659731#comment-16659731
 ] 

Javier Luraschi commented on ARROW-3547:


This one got fixed by https://github.com/apache/arrow/pull/2795

> [R] Protect against Null crash when reading from RecordBatch
> 
>
> Key: ARROW-3547
> URL: https://issues.apache.org/jira/browse/ARROW-3547
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Minor
>
> Reprex:
>  
> {code:java}
>   tbl <- tibble::tibble(
> int = 1:10, dbl = as.numeric(1:10),
> lgl = sample(c(TRUE, FALSE, NA), 10, replace = TRUE),
> chr = letters[1:10]
>   )
>   batch <- record_batch(tbl)
>   bytes <- write_record_batch(batch, raw())
>   stream_reader <- record_batch_stream_reader(bytes)
>   batch1 <- read_record_batch(stream_reader)
>   batch2 <- read_record_batch(stream_reader)
>   
>   # Crash
>   as_tibble(batch2){code}
>  
> While users should check for Null entries by running:
>  
> {code:java}
> if(!batch2$is_null()) as_tibble(batch2)
> {code}
> It's harsh to trigger a crash; we should consider protecting all functions 
> that use RecordBatch pointers to return NULL instead, for instance:
>  
> {code:java}
> List RecordBatch__to_dataframe(const std::shared_ptr& 
> batch) {
>if (batch->get() == nullptr) Rcpp::stop("Can't read from NULL record 
> batch.")
> }{code}
>  
>  





[jira] [Updated] (ARROW-2712) [C#] Initial C# .NET library

2018-10-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2712:
--
Labels: features newbie pull-request-available  (was: features newbie)

> [C#] Initial C# .NET library
> 
>
> Key: ARROW-2712
> URL: https://issues.apache.org/jira/browse/ARROW-2712
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C#
>Reporter: Jamie Elliott
>Priority: Major
>  Labels: features, newbie, pull-request-available
>
> A feature request. I've seen this pop up in a few places. Want to have a 
> record of discussion on this topic. 
> I may be open to contributing this, but first need some general guidance on 
> approach so I can understand effort level. 
> It looks like there is not a good tool available for GObject Introspection 
> binding to .NET so the easy pathway via Arrow Glib C API appears to be 
> closed. 
> The only GObject integration for .NET appears to be Mono GAPI
> [http://www.mono-project.com/docs/gui/gtksharp/gapi/]
> From what I can see this produces a GIR or similar XML, then generates C# 
> code directly from that. Likely involves many manual fix ups of the XML. 
> Worth a try? 
>  
> Alternatively I could look at generating some other direct binding from .NET 
> to C/C++. Where I work we use Swig [http://www.swig.org/]. Good for vanilla 
> cases, requires hand crafting of the .i files and specialized marshalling 
> strategies for optimizing performance critical cases. 
> Haven't tried CppSharp but it looks more appealing than Swig in some ways 
> [https://github.com/mono/CppSharp/wiki/Users-Manual]
> In either case, not sure if better to use Glib C API or C++ API directly. 
> What would be pros/cons? 
>  
>  
>  
>  
>  





[jira] [Updated] (ARROW-3557) [Python] Set language_level in Cython sources

2018-10-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3557:
--
Labels: pull-request-available  (was: )

> [Python] Set language_level in Cython sources
> -
>
> Key: ARROW-3557
> URL: https://issues.apache.org/jira/browse/ARROW-3557
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.11.0
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
>
> Cython 0.29.0 emits the following warning:
> {code}
> C:\Miniconda36-x64\envs\arrow\lib\site-packages\Cython\Compiler\Main.py:367: 
> FutureWarning: Cython directive 'language_level' not set, using 2 for now 
> (Py2). This will change in a later release! File: 
> C:\projects\arrow\python\pyarrow\_parquet.pxd
> {code}
> We should probably try to switch it to Python 3.
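A sketch of the likely fix, assuming the directive is set either per file or globally at build time (the file pattern is illustrative):

```python
# Option 1: as the first line of each .pyx/.pxd source:
# cython: language_level=3

# Option 2: globally, when building the extensions in setup.py:
from Cython.Build import cythonize

ext_modules = cythonize(
    "pyarrow/*.pyx",
    compiler_directives={"language_level": 3},  # silence the FutureWarning
)
```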





[jira] [Created] (ARROW-3587) Efficient serialization for Arrow Objects (array, table, tensor, etc)

2018-10-22 Thread Siyuan Zhuang (JIRA)
Siyuan Zhuang created ARROW-3587:


 Summary: Efficient serialization for Arrow Objects (array, table, 
tensor, etc)
 Key: ARROW-3587
 URL: https://issues.apache.org/jira/browse/ARROW-3587
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Plasma (C++), Python
Reporter: Siyuan Zhuang


Currently, Arrow seems to have poor serialization support for its own objects.

For example,
  
{code}
import pyarrow 
arr = pyarrow.array([1, 2, 3, 4]) 
pyarrow.serialize(arr)
{code}
{code}
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "pyarrow/serialization.pxi", line 337, in pyarrow.lib.serialize
 File "pyarrow/serialization.pxi", line 136, in 
pyarrow.lib.SerializationContext._serialize_callback
 pyarrow.lib.SerializationCallbackError: pyarrow does not know how to serialize 
objects of type .
{code}

I am working on the Ray & Modin projects, using Plasma to store Arrow objects. 
The lack of direct serialization support harms performance, so I would like to 
push a PR to fix this problem.
I wonder whether that is welcome, or whether someone else is already doing it?





[jira] [Updated] (ARROW-3381) [C++] Implement InputStream for bz2 files

2018-10-22 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3381:
--
Labels: csv pull-request-available  (was: csv)

> [C++] Implement InputStream for bz2 files
> -
>
> Key: ARROW-3381
> URL: https://issues.apache.org/jira/browse/ARROW-3381
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv, pull-request-available
> Fix For: 0.12.0
>
>
> For reading compressed CSV files





[jira] [Created] (ARROW-3586) Segmentation fault when converting empty table to pandas with categoricals

2018-10-22 Thread Andreas (JIRA)
Andreas created ARROW-3586:
--

 Summary: Segmentation fault when converting empty table to pandas 
with categoricals
 Key: ARROW-3586
 URL: https://issues.apache.org/jira/browse/ARROW-3586
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.11.0, 0.10.0
 Environment: - Ubuntu 16.04, Python 2.7.12, pyarrow 0.11.0, pandas 
0.23.4
- Debian9, Python 2.7.13, pyarrow 0.10.0, pandas 0.23.4
Reporter: Andreas


{code:java}
import pyarrow as pa

table = pa.Table.from_arrays(arrays=[pa.array([], type=pa.int32())], 
names=['col'])
table.to_pandas(categories=['col']){code}
This produces a segmentation fault for certain types (e.g., int{32,64}) while 
it works for others (e.g., string, binary).





[jira] [Commented] (ARROW-3585) Update the documentation about Schema & Metadata usage

2018-10-22 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658758#comment-16658758
 ] 

Uwe L. Korn commented on ARROW-3585:


[~danielil] Assigned to you and also gave you permission to self-assign JIRAs 
in the future.

> Update the documentation about Schema & Metadata usage
> --
>
> Key: ARROW-3585
> URL: https://issues.apache.org/jira/browse/ARROW-3585
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Daniel Haviv
>Assignee: Daniel Haviv
>Priority: Trivial
>  Labels: beginner, documentation, easyfix, newbie
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Reusing the Schema object from a Parquet file written by Spark when writing 
> with Pandas fails due to a Schema mismatch.
> The culprit is the metadata part of the schema, which each component fills 
> according to its implementation. More details can be found here: 
> [https://github.com/apache/arrow/issues/2805]
> The documentation should point that out.





[jira] [Assigned] (ARROW-3585) Update the documentation about Schema & Metadata usage

2018-10-22 Thread Uwe L. Korn (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned ARROW-3585:
--

Assignee: Daniel Haviv

> Update the documentation about Schema & Metadata usage
> --
>
> Key: ARROW-3585
> URL: https://issues.apache.org/jira/browse/ARROW-3585
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Daniel Haviv
>Assignee: Daniel Haviv
>Priority: Trivial
>  Labels: beginner, documentation, easyfix, newbie
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Reusing the Schema object from a Parquet file written by Spark when writing 
> with Pandas fails due to a Schema mismatch.
> The culprit is the metadata part of the schema, which each component fills 
> according to its implementation. More details can be found here: 
> [https://github.com/apache/arrow/issues/2805]
> The documentation should point that out.





[jira] [Commented] (ARROW-3585) Update the documentation about Schema & Metadata usage

2018-10-22 Thread Daniel Haviv (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658754#comment-16658754
 ] 

Daniel Haviv commented on ARROW-3585:
-

Please assign to me

> Update the documentation about Schema & Metadata usage
> --
>
> Key: ARROW-3585
> URL: https://issues.apache.org/jira/browse/ARROW-3585
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Daniel Haviv
>Priority: Trivial
>  Labels: beginner, documentation, easyfix, newbie
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Reusing the Schema object from a Parquet file written by Spark when writing 
> with Pandas fails due to a Schema mismatch.
> The culprit is the metadata part of the schema, which each component fills 
> according to its implementation. More details can be found here: 
> [https://github.com/apache/arrow/issues/2805]
> The documentation should point that out.





[jira] [Created] (ARROW-3585) Update the documentation about Schema & Metadata usage

2018-10-22 Thread Daniel Haviv (JIRA)
Daniel Haviv created ARROW-3585:
---

 Summary: Update the documentation about Schema & Metadata usage
 Key: ARROW-3585
 URL: https://issues.apache.org/jira/browse/ARROW-3585
 Project: Apache Arrow
  Issue Type: Task
  Components: Documentation
Reporter: Daniel Haviv


Reusing the Schema object from a Parquet file written by Spark when writing 
with Pandas fails due to a Schema mismatch.

The culprit is the metadata part of the schema, which each component fills 
according to its implementation. More details can be found here: 
[https://github.com/apache/arrow/issues/2805]

The documentation should point that out.


