[jira] [Assigned] (ARROW-3741) [R] Add support for arrow::compute::Cast to convert Arrow arrays from one type to another

2018-11-13 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/ARROW-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Romain François reassigned ARROW-3741:
--

Assignee: Romain François

> [R] Add support for arrow::compute::Cast to convert Arrow arrays from one 
> type to another
> -
>
> Key: ARROW-3741
> URL: https://issues.apache.org/jira/browse/ARROW-3741
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Wes McKinney
>Assignee: Romain François
>Priority: Major
>
> See {{pyarrow.Array.cast}} and {{pyarrow.Table.cast}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3787) Implement From for BinaryArray

2018-11-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3787:
--
Labels: pull-request-available  (was: )

> Implement From for BinaryArray
> -
>
> Key: ARROW-3787
> URL: https://issues.apache.org/jira/browse/ARROW-3787
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Paddy Horan
>Assignee: Paddy Horan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3787) Implement From for BinaryArray

2018-11-13 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-3787:
--

 Summary: Implement From for BinaryArray
 Key: ARROW-3787
 URL: https://issues.apache.org/jira/browse/ARROW-3787
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3786) Enable merge_arrow_pr.py script to run in non-English JIRA accounts.

2018-11-13 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-3786:
---

 Summary: Enable merge_arrow_pr.py script to run in non-English 
JIRA accounts.
 Key: ARROW-3786
 URL: https://issues.apache.org/jira/browse/ARROW-3786
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Yosuke Shiro


I read [https://github.com/apache/arrow/tree/master/dev#arrow-developer-scripts]
and ran the following as instructed there:
{code:java}
dev/merge_arrow_pr.py{code}
I got the following result.
{code:java}
Would you like to update the associated JIRA? (y/n): y
Enter comma-separated fix version(s) [0.12.0]:
=== JIRA ARROW-3748 ===
summary [GLib] Add GArrowCSVReader
assignee    Kouhei Sutou
status  オープン
url https://issues.apache.org/jira/browse/ARROW-3748
 
list index out of range{code}
 
The error appears to come from
[https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py#L181].
My JIRA account language is Japanese, and the script does not seem to work
when the account language is not English. The transitions it sees are:
{code:java}
print(self.jira_con.transitions(self.jira_id))

[{'id': '701', 'name': '課題のクローズ', 'to': {'self': 'https://issues.apache.org/jira/rest/api/2/status/6', 'description': '課題の検討が終了し、解決方法が正しいことを表します。クローズした課題は再オープンすることができます。', 'iconUrl': 'https://issues.apache.org/jira/images/icons/statuses/closed.png', 'name': 'クローズ', 'id': '6', 'statusCategory': {'self': 'https://issues.apache.org/jira/rest/api/2/statuscategory/3', 'id': 3, 'key': 'done', 'colorName': 'green', 'name': '完了'}}},
 {'id': '3', 'name': '課題を再オープンする', 'to': {'self': 'https://issues.apache.org/jira/rest/api/2/status/4', 'description': '課題が一度解決されたが解決に間違いがあったと見なされたことを表します。ここから課題を割り当て済みにするか解決済みに設定できます。', 'iconUrl': 'https://issues.apache.org/jira/images/icons/statuses/reopened.png', 'name': '再オープン', 'id': '4', 'statusCategory': {'self': 'https://issues.apache.org/jira/rest/api/2/statuscategory/2', 'id': 2, 'key': 'new', 'colorName': 'blue-gray', 'name': 'To Do'}}}]{code}
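
A possible language-independent approach, sketched below (the function name and structure are hypothetical, and I have not verified what the script currently does at that line): select the close/resolve transition by its target status category key, which JIRA reports in English ('done') regardless of the account's display language, instead of matching the localized transition name.
{code}
# Hypothetical sketch (not the actual merge_arrow_pr.py code): pick the
# close/resolve transition by its target status category key, which stays
# 'done' even when the account language is Japanese, instead of matching
# the localized transition name.
def find_done_transition(jira_con, jira_id):
    transitions = jira_con.transitions(jira_id)
    done = [t for t in transitions
            if t.get('to', {}).get('statusCategory', {}).get('key') == 'done']
    if not done:
        raise SystemExit("No close/resolve transition found for %s: %r"
                         % (jira_id, [t['name'] for t in transitions]))
    return done[0]
{code}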



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2956) [Python] Arrow plasma throws ArrowIOError and process crashed

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2956:

Summary: [Python] Arrow plasma throws ArrowIOError and process crashed  
(was: [Python]Arrow plasma throws ArrowIOError and process crashed)

> [Python] Arrow plasma throws ArrowIOError and process crashed
> -
>
> Key: ARROW-2956
> URL: https://issues.apache.org/jira/browse/ARROW-2956
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: He Kaisheng
>Priority: Major
>
> Hello,
> We start a plasma store with 100k of memory. When storage is full, it throws 
> ArrowIOError and the *process crashes*, not the expected PlasmaStoreFull 
> error.
> Code:
> {code:java}
> import pyarrow.plasma as plasma
> import numpy as np
> plasma_client = plasma.connect(plasma_socket, '', 0)
> ref = []
> for _ in range(1000):
>     obj_id = plasma_client.put(np.random.randint(100, size=(100, 100),
>                                dtype=np.int16))
>     data = plasma_client.get(obj_id)
>     ref.append(data)
> {code}
> error:
> {noformat}
> ---
> ArrowIOError  Traceback (most recent call last)
>  in ()
>   2 ref = []
>   3 for _ in range(1000):
> > 4 obj_id = plasma_client.put(np.random.randint(100, size=(100, 
> 100), dtype=np.int16))
>   5 data = plasma_client.get(obj_id)
>   6 ref.append(data)
> plasma.pyx in pyarrow.plasma.PlasmaClient.put()
> plasma.pyx in pyarrow.plasma.PlasmaClient.create()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Encountered unexpected EOF{noformat}
> This problem doesn't exist when dtype is np.int64 or when the shared memory is 
> larger (e.g. more than 100M), which seems strange. Does anybody know the reason? 
> Thanks a lot.
>  
>  
>  
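
For reference, a minimal sketch of the behavior the reporter expects, assuming the pyarrow.plasma API of that era (including the {{plasma.PlasmaStoreFull}} exception): a full store should raise an error the client can catch, rather than surfacing ArrowIOError and crashing the store process.
{code}
import numpy as np
import pyarrow.plasma as plasma

# Assumes a store started with roughly 100k bytes of shared memory, e.g.:
#   plasma_store -m 100000 -s /tmp/plasma
client = plasma.connect("/tmp/plasma", "", 0)

ref = []
try:
    for _ in range(1000):
        obj_id = client.put(np.random.randint(100, size=(100, 100), dtype=np.int16))
        ref.append(client.get(obj_id))
except plasma.PlasmaStoreFull:
    # The expected failure mode when the store runs out of memory; the report
    # above instead shows ArrowIOError and a crashed store process.
    print("store full after %d objects" % len(ref))
{code}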



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3785) [C++] Use double-conversion conda package in CI toolchain

2018-11-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3785:
---

 Summary: [C++] Use double-conversion conda package in CI toolchain
 Key: ARROW-3785
 URL: https://issues.apache.org/jira/browse/ARROW-3785
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


This is currently being built from the ExternalProject (EP)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16

2018-11-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3780:
--
Labels: pull-request-available spark  (was: spark)

> [R] Failed to fetch data: invalid data when collecting int16
> 
>
> Key: ARROW-3780
> URL: https://issues.apache.org/jira/browse/ARROW-3780
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Major
>  Labels: pull-request-available, spark
> Fix For: 0.12.0
>
>
> Repro from sparklyr unit test:
> {code:java}
> library(dplyr)
> library(sparklyr)
> library(arrow)
> sc <- spark_connect(master = "local")
> hive_type <- tibble::frame_data(
>  ~stype, ~svalue, ~rtype, ~rvalue, ~arrow,
>  "smallint", "1", "integer", "1", "integer",
> )
> spark_query <- hive_type %>%
>  mutate(
>  query = paste0("cast(", svalue, " as ", stype, ") as ", gsub("\\(|\\)", "", 
> stype), "_col")
>  ) %>%
>  pull(query) %>%
>  paste(collapse = ", ") %>%
>  paste("SELECT", .)
> spark_types <- DBI::dbGetQuery(sc, spark_query) %>%
>  lapply(function(e) class(e)[[1]]) %>%
>  as.character(){code}
> Actual: error: Failed to fetch data: invalid data 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685945#comment-16685945
 ] 

Wes McKinney commented on ARROW-3781:
-

It would definitely require some design work. In 
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.cc#L104, 
you would need to use a buffer pool of some kind, so that while Flush is holding a 
temporary buffer, Write can write to a new buffer. In any case, it's out 
of scope for this issue. Once we have file system implementations for one or 
more cloud services, we can use benchmarks to drive the development. In the 
meantime, a mock remote file system with configurable write latency could help 
with throughput tests.
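
A toy Python sketch of that buffer-pool idea follows (not Arrow's C++ API; the class and method names are made up): write() fills the active buffer while a background thread flushes the previously filled one, so a high-latency sink blocks the writer only when a second flush is requested before the first finishes.
{code}
import threading

class DoubleBufferedWriter:
    """Toy illustration of double buffering: the writer keeps filling a new
    buffer while the previous buffer is being flushed in the background."""

    def __init__(self, sink, buffer_size=4096):
        self.sink = sink              # any object with a .write(bytes) method
        self.buffer_size = buffer_size
        self.active = bytearray()
        self.flush_thread = None

    def write(self, data):
        self.active.extend(data)
        if len(self.active) >= self.buffer_size:
            self._start_flush()

    def _start_flush(self):
        if self.flush_thread is not None:
            self.flush_thread.join()  # allow at most one flush in flight
        buf, self.active = bytes(self.active), bytearray()
        self.flush_thread = threading.Thread(target=self.sink.write, args=(buf,))
        self.flush_thread.start()

    def close(self):
        self._start_flush()
        if self.flush_thread is not None:
            self.flush_thread.join()
{code}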

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is hard-coded to 4096 right now. For higher latency file systems it may 
> be desirable to use a larger buffer. See also ARROW-3777 about performance 
> testing for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3784) [R] Array with type fails with x is not a vector

2018-11-13 Thread Javier Luraschi (JIRA)
Javier Luraschi created ARROW-3784:
--

 Summary: [R] Array with type fails with x is not a vector 
 Key: ARROW-3784
 URL: https://issues.apache.org/jira/browse/ARROW-3784
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Javier Luraschi


{code:java}
array(1:10, type = int32())
{code}
Actual:
{code:java}
 Error: `x` is not a vector 
{code}
Expected:
{code:java}
arrow::Array [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
{code}
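
For comparison, the equivalent pyarrow call, assuming the R binding is intended to mirror pyarrow's behavior here:
{code}
import pyarrow as pa

# pyarrow analogue of array(1:10, type = int32()) in the R binding.
arr = pa.array(list(range(1, 11)), type=pa.int32())
print(arr)       # Int32Array of [1, 2, ..., 10]
print(arr.type)  # int32
{code}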



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3784) [R] Array with type fails with x is not a vector

2018-11-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3784:
--
Labels: pull-request-available  (was: )

> [R] Array with type fails with x is not a vector 
> -
>
> Key: ARROW-3784
> URL: https://issues.apache.org/jira/browse/ARROW-3784
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Major
>  Labels: pull-request-available
>
> {code:java}
> array(1:10, type = int32())
> {code}
> Actual:
> {code:java}
>  Error: `x` is not a vector 
> {code}
> Expected:
> {code:java}
> arrow::Array [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685817#comment-16685817
 ] 

Antoine Pitrou commented on ARROW-3781:
---

We may want to think about flushing in a separate thread, then.

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is hard-coded to 4096 right now. For higher latency file systems it may 
> be desirable to use a larger buffer. See also ARROW-3777 about performance 
> testing for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685812#comment-16685812
 ] 

Wes McKinney edited comment on ARROW-3781 at 11/13/18 10:07 PM:


What I mean is that if I call {{out->Flush()}} it may not be safe to continue 
to call {{out->Write(...)}} until the flush completes. So my proposal was to 
think about devising a buffered output stream where a writer thread can 
continue writing while a Flush is in progress. The current 
{{BufferedOutputStream}} holds a mutex during Flush, so further writes are not 
possible in the meantime.


was (Author: wesmckinn):
What I mean is that if I call {{out->Flush()}} it may not be safe to continue 
to call {{out->Write(...)}} until the flush completes. So my proposal was to 
think about devising a buffered output stream where a writer thread can 
continue writing while a Flush is in progress

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is hard-coded to 4096 right now. For higher latency file systems it may 
> be desirable to use a larger buffer. See also ARROW-3777 about performance 
> testing for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685812#comment-16685812
 ] 

Wes McKinney commented on ARROW-3781:
-

What I mean is that if I call {{out->Flush()}} it may not be safe to continue 
to call {{out->Write(...)}} until the flush completes. So my proposal was to 
think about devising a buffered output stream where a writer thread can 
continue writing while a Flush is in progress

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is hard-coded to 4096 right now. For higher latency file systems it may 
> be desirable to use a larger buffer. See also ARROW-3777 about performance 
> testing for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685809#comment-16685809
 ] 

Wes McKinney commented on ARROW-3781:
-

Sorry, I'm using "file systems" loosely here again. TensorFlow and other 
projects call their integrations with other file storage systems "file 
systems", e.g.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/s3/s3_file_system.h#L25

I am not sure a Write or Flush into S3 is necessarily going to be asynchronous. 
TensorFlow's implementation of Flush blocks until the PutRequest is 
completed:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/s3/s3_file_system.cc#L238

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is hard-coded to 4096 right now. For higher latency file systems it may 
> be desirable to use a larger buffer. See also ARROW-3777 about performance 
> testing for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685791#comment-16685791
 ] 

Antoine Pitrou commented on ARROW-3781:
---

Are you thinking about the `Flush` method? It's as asynchronous as `Write` is.

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is hard-coded to 4096 right now. For higher latency file systems it may 
> be desirable to use a larger buffer. See also ARROW-3777 about performance 
> testing for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685779#comment-16685779
 ] 

Wes McKinney commented on ARROW-3781:
-

I'm thinking about the "file systems" HDFS, AWS S3, Google Cloud Storage, and 
Azure Blob Storage, all of which can have pretty high latency for writes.

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is hard-coded to 4096 right now. For higher latency file systems it may 
> be desirable to use a larger buffer. See also ARROW-3777 about performance 
> testing for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685784#comment-16685784
 ] 

Wes McKinney commented on ARROW-3781:
-

For cloud stores, at some point we might want to consider asynchronous flushing 
also, to mitigate latency when a flush triggers (so the writer thread can begin 
to buffer the next chunk)

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is hard-coded to 4096 right now. For higher latency file systems it may 
> be desirable to use a larger buffer. See also ARROW-3777 about performance 
> testing for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3783) [R] Incorrect collection of float type

2018-11-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3783:
--
Labels: pull-request-available  (was: )

> [R] Incorrect collection of float type
> --
>
> Key: ARROW-3783
> URL: https://issues.apache.org/jira/browse/ARROW-3783
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Major
>  Labels: pull-request-available
>
> Repro from `sparklyr`:
>  
> {code:java}
> library(sparklyr)
> library(arrow)
> sc <- spark_connect(master = "local")
> DBI::dbGetQuery(sc, "SELECT cast(1 as float)"){code}
>  
> Actual:
> {code:java}
>   CAST(1 AS FLOAT)
> 1   1065353216{code}
> Expected:
>  
> {code:java}
>   CAST(1 AS FLOAT)
> 11{code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3783) [R] Incorrect collection of float type

2018-11-13 Thread Javier Luraschi (JIRA)
Javier Luraschi created ARROW-3783:
--

 Summary: [R] Incorrect collection of float type
 Key: ARROW-3783
 URL: https://issues.apache.org/jira/browse/ARROW-3783
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Javier Luraschi


Repro from `sparklyr`:

 
{code:java}
library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local")
DBI::dbGetQuery(sc, "SELECT cast(1 as float)"){code}
 

Actual:
{code:java}
  CAST(1 AS FLOAT)
1   1065353216{code}
Expected:

 
{code:java}
  CAST(1 AS FLOAT)
11{code}
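
Worth noting: 1065353216 is 0x3F800000, the IEEE-754 single-precision bit pattern of 1.0, which suggests (an assumption, not a confirmed diagnosis) that the float column is being reinterpreted as int32 rather than converted. A quick check:
{code}
import struct

# 1065353216 == 0x3F800000, the bit pattern of the single-precision float 1.0,
# reinterpreted as a 32-bit integer.
bits = struct.unpack('<i', struct.pack('<f', 1.0))[0]
print(bits)       # 1065353216
print(hex(bits))  # 0x3f800000
{code}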
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3782) [C++] Implement BufferedReader for C++

2018-11-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3782:
---

 Summary: [C++] Implement BufferedReader for C++
 Key: ARROW-3782
 URL: https://issues.apache.org/jira/browse/ARROW-3782
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


This will be the reader companion to {{arrow::io::BufferedOutputStream}} and a 
C++-like version of the {{io.BufferedReader}} class in the Python standard 
library

https://docs.python.org/3/library/io.html#io.BufferedReader

We already have a partial version of this that's used in the Parquet library

https://github.com/apache/arrow/blob/master/cpp/src/parquet/util/memory.h#L413

In particular we need

* Seek implemented for random access (it will invalidate the buffer)
* Peek method returning {{shared_ptr<Buffer>}}, a zero-copy view into buffered 
memory

This is needed for ARROW-3126
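
Since the issue points to Python's {{io.BufferedReader}} as the model, here is a small example of the Peek/Seek semantics the C++ class would presumably mirror:
{code}
import io

raw = io.BytesIO(b"0123456789abcdef")
reader = io.BufferedReader(raw, buffer_size=8)

print(reader.peek(4)[:4])  # b'0123' - inspect buffered bytes without consuming them
print(reader.read(4))      # b'0123' - read() still returns the same bytes
reader.seek(10)            # random access invalidates the buffer
print(reader.read(3))      # b'abc'
{code}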



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-3306) [R] Objects and support functions different kinds of arrow::Buffer

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-3306.
-
Resolution: Fixed
  Assignee: Romain François

This was resolved in passing. 

> [R] Objects and support functions different kinds of arrow::Buffer
> --
>
> Key: ARROW-3306
> URL: https://issues.apache.org/jira/browse/ARROW-3306
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Wes McKinney
>Assignee: Romain François
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685727#comment-16685727
 ] 

Antoine Pitrou commented on ARROW-3781:
---

I don't think it's dependent on filesystem latency. Unless the filesystem 
implementation is broken, writing should be asynchronous (i.e. the `Write` call 
returns before the OS actually flushed the buffer to disk or to the network). 
The point of the buffer is to avoid paying the cost of a system call (and 
userspace/kernel transition) for every tiny write.

But we can make the buffer size configurable regardless.

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.12.0
>
>
> This is hard-coded to 4096 right now. For higher latency file systems it may 
> be desirable to use a larger buffer. See also ARROW-3777 about performance 
> testing for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2237) [Python] [Plasma] Huge pages test failure

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2237:

Summary: [Python] [Plasma] Huge pages test failure  (was: [Python] Huge 
tables test failure)

> [Python] [Plasma] Huge pages test failure
> -
>
> Key: ARROW-2237
> URL: https://issues.apache.org/jira/browse/ARROW-2237
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.12.0
>
>
> This is a new failure here (Ubuntu 16.04, x86-64):
> {code}
> _ test_use_huge_pages 
> _
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, 
> in test_use_huge_pages
> create_object(plasma_client, 1)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in 
> create_object
> seal=seal)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in 
> create_object_with_id
> memory_buffer = client.create(object_id, data_size, metadata)
>   File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 
> code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, )
> /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, 
> , buffer)
> Encountered unexpected EOF
>  Captured stderr call 
> -
> Allowing the Plasma store to use up to 0.1GB of memory.
> Starting object store with directory /mnt/hugepages and huge page support 
> enabled
> create_buffer failed to open file /mnt/hugepages/plasmapSNc0X
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream

2018-11-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3781:
---

 Summary: [C++] Configure buffer size in 
arrow::io::BufferedOutputStream
 Key: ARROW-3781
 URL: https://issues.apache.org/jira/browse/ARROW-3781
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


This is hard-coded to 4096 right now. For higher latency file systems it may be 
desirable to use a larger buffer. See also ARROW-3777 about performance testing 
for high latency files



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2807) [Python] Enable memory-mapping to be toggled in get_reader when reading Parquet files

2018-11-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-2807:
--
Labels: parquet pull-request-available  (was: parquet)

> [Python] Enable memory-mapping to be toggled in get_reader when reading 
> Parquet files
> -
>
> Key: ARROW-2807
> URL: https://issues.apache.org/jira/browse/ARROW-2807
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
>
> See relevant discussion in ARROW-2654



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3344) [Python] test_plasma.py fails (in test_plasma_list)

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685711#comment-16685711
 ] 

Wes McKinney commented on ARROW-3344:
-

This bug is still present for me on Ubuntu 14.04

{code}
pyarrow/tests/test_plasma.py::test_plasma_list FAILED   

[ 83%]
----- captured stderr -----
../src/plasma/store.cc:1000: Allowing the Plasma store to use up to 0.1GB of 
memory.
../src/plasma/store.cc:1030: Starting object store with directory /dev/shm and 
huge page support disabled
----- traceback -----

@pytest.mark.plasma
def test_plasma_list():
import pyarrow.plasma as plasma

with plasma.start_plasma_store(
plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY) \
as (plasma_store_name, p):
plasma_client = plasma.connect(plasma_store_name, "", 0)

# Test sizes
u, _, _ = create_object(plasma_client, 11, metadata_size=7, 
seal=False)
l1 = plasma_client.list()
assert l1[u]["data_size"] == 11
assert l1[u]["metadata_size"] == 7

# Test ref_count
v = plasma_client.put(np.zeros(3))
l2 = plasma_client.list()
# Ref count has already been released
assert l2[v]["ref_count"] == 0
a = plasma_client.get(v)
l3 = plasma_client.list()
>   assert l3[v]["ref_count"] == 1
E   assert 0 == 1

pyarrow/tests/test_plasma.py:966: AssertionError
 
----- entering PDB -----
> /home/wesm/code/arrow/python/pyarrow/tests/test_plasma.py(966)test_plasma_list()
-> assert l3[v]["ref_count"] == 1
{code}

> [Python] test_plasma.py fails (in test_plasma_list)
> ---
>
> Key: ARROW-3344
> URL: https://issues.apache.org/jira/browse/ARROW-3344
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Plasma (C++), Python
>Reporter: Antoine Pitrou
>Priority: Major
>
> I routinely get the following failure in {{test_plasma.py}}:
> {code}
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 825, 
> in test_plasma_list
> assert l3[v]["ref_count"] == 1
> AssertionError: assert 0 == 1
>  Captured stderr call 
> -
> ../src/plasma/store.cc:926: Allowing the Plasma store to use up to 0.1GB of 
> memory.
> ../src/plasma/store.cc:956: Starting object store with directory /dev/shm and 
> huge page support disabled
> {code}
> I'm not sure whether there's something wrong in my setup (on Ubuntu 18.04, 
> x86-64), or it's a genuine bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-2807) [Python] Enable memory-mapping to be toggled in get_reader when reading Parquet files

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-2807:
---

Assignee: Wes McKinney

> [Python] Enable memory-mapping to be toggled in get_reader when reading 
> Parquet files
> -
>
> Key: ARROW-2807
> URL: https://issues.apache.org/jira/browse/ARROW-2807
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> See relevant discussion in ARROW-2654



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685675#comment-16685675
 ] 

Wes McKinney commented on ARROW-3780:
-

I was pretty sure this non-specific error message was going to rear its ugly 
head

https://github.com/apache/arrow/blob/202265fbb67685f1ed179ba080a85b48fbd53adc/r/src/arrow_types.h#L36

> [R] Failed to fetch data: invalid data when collecting int16
> 
>
> Key: ARROW-3780
> URL: https://issues.apache.org/jira/browse/ARROW-3780
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Major
>  Labels: spark
> Fix For: 0.12.0
>
>
> Repro from sparklyr unit test:
> {code:java}
> library(dplyr)
> library(sparklyr)
> library(arrow)
> sc <- spark_connect(master = "local")
> hive_type <- tibble::frame_data(
>  ~stype, ~svalue, ~rtype, ~rvalue, ~arrow,
>  "smallint", "1", "integer", "1", "integer",
> )
> spark_query <- hive_type %>%
>  mutate(
>  query = paste0("cast(", svalue, " as ", stype, ") as ", gsub("\\(|\\)", "", 
> stype), "_col")
>  ) %>%
>  pull(query) %>%
>  paste(collapse = ", ") %>%
>  paste("SELECT", .)
> spark_types <- DBI::dbGetQuery(sc, spark_query) %>%
>  lapply(function(e) class(e)[[1]]) %>%
>  as.character(){code}
> Actual: error: Failed to fetch data: invalid data 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3779) [Python] Validate timezone passed to pa.timestamp

2018-11-13 Thread Krisztian Szucs (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685682#comment-16685682
 ] 

Krisztian Szucs commented on ARROW-3779:


Renamed. I created the issue before I saw that... 
In the long term, we should validate it on the C++ side.

> [Python] Validate timezone passed to pa.timestamp
> -
>
> Key: ARROW-3779
> URL: https://issues.apache.org/jira/browse/ARROW-3779
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3779) [Python] Validate timezone passed to pa.timestamp

2018-11-13 Thread Krisztian Szucs (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs updated ARROW-3779:
---
Summary: [Python] Validate timezone passed to pa.timestamp  (was: [Format] 
Standardize timezone specification)

> [Python] Validate timezone passed to pa.timestamp
> -
>
> Key: ARROW-3779
> URL: https://issues.apache.org/jira/browse/ARROW-3779
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3780:

Labels: spark  (was: )

> [R] Failed to fetch data: invalid data when collecting int16
> 
>
> Key: ARROW-3780
> URL: https://issues.apache.org/jira/browse/ARROW-3780
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Major
>  Labels: spark
> Fix For: 0.12.0
>
>
> Repro from sparklyr unit test:
> {code:java}
> library(dplyr)
> library(sparklyr)
> library(arrow)
> sc <- spark_connect(master = "local")
> hive_type <- tibble::frame_data(
>  ~stype, ~svalue, ~rtype, ~rvalue, ~arrow,
>  "smallint", "1", "integer", "1", "integer",
> )
> spark_query <- hive_type %>%
>  mutate(
>  query = paste0("cast(", svalue, " as ", stype, ") as ", gsub("\\(|\\)", "", 
> stype), "_col")
>  ) %>%
>  pull(query) %>%
>  paste(collapse = ", ") %>%
>  paste("SELECT", .)
> spark_types <- DBI::dbGetQuery(sc, spark_query) %>%
>  lapply(function(e) class(e)[[1]]) %>%
>  as.character(){code}
> Actual: error: Failed to fetch data: invalid data 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3780:

Fix Version/s: 0.12.0

> [R] Failed to fetch data: invalid data when collecting int16
> 
>
> Key: ARROW-3780
> URL: https://issues.apache.org/jira/browse/ARROW-3780
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Reporter: Javier Luraschi
>Priority: Major
>  Labels: spark
> Fix For: 0.12.0
>
>
> Repro from sparklyr unit test:
> {code:java}
> library(dplyr)
> library(sparklyr)
> library(arrow)
> sc <- spark_connect(master = "local")
> hive_type <- tibble::frame_data(
>  ~stype, ~svalue, ~rtype, ~rvalue, ~arrow,
>  "smallint", "1", "integer", "1", "integer",
> )
> spark_query <- hive_type %>%
>  mutate(
>  query = paste0("cast(", svalue, " as ", stype, ") as ", gsub("\\(|\\)", "", 
> stype), "_col")
>  ) %>%
>  pull(query) %>%
>  paste(collapse = ", ") %>%
>  paste("SELECT", .)
> spark_types <- DBI::dbGetQuery(sc, spark_query) %>%
>  lapply(function(e) class(e)[[1]]) %>%
>  as.character(){code}
> Actual: error: Failed to fetch data: invalid data 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16

2018-11-13 Thread Javier Luraschi (JIRA)
Javier Luraschi created ARROW-3780:
--

 Summary: [R] Failed to fetch data: invalid data when collecting 
int16
 Key: ARROW-3780
 URL: https://issues.apache.org/jira/browse/ARROW-3780
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Javier Luraschi


Repro from sparklyr unit test:
{code:java}
library(dplyr)
library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local")

hive_type <- tibble::frame_data(
 ~stype, ~svalue, ~rtype, ~rvalue, ~arrow,
 "smallint", "1", "integer", "1", "integer",
)

spark_query <- hive_type %>%
 mutate(
 query = paste0("cast(", svalue, " as ", stype, ") as ", gsub("\\(|\\)", "", 
stype), "_col")
 ) %>%
 pull(query) %>%
 paste(collapse = ", ") %>%
 paste("SELECT", .)

spark_types <- DBI::dbGetQuery(sc, spark_query) %>%
 lapply(function(e) class(e)[[1]]) %>%
 as.character(){code}
Actual: error: Failed to fetch data: invalid data 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3779) [Format] Standardize timezone specification

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685672#comment-16685672
 ] 

Wes McKinney commented on ARROW-3779:
-

What do we need to do beyond what's in Schema.fbs? 
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L162

> [Format] Standardize timezone specification
> ---
>
> Key: ARROW-3779
> URL: https://issues.apache.org/jira/browse/ARROW-3779
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Krisztian Szucs
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3779) [Format] Standardize timezone specification

2018-11-13 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-3779:
--

 Summary: [Format] Standardize timezone specification
 Key: ARROW-3779
 URL: https://issues.apache.org/jira/browse/ARROW-3779
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Krisztian Szucs






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3778) [C++] Don't put implementations in test-util.h

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3778:

Fix Version/s: 0.12.0

> [C++] Don't put implementations in test-util.h
> --
>
> Key: ARROW-3778
> URL: https://issues.apache.org/jira/browse/ARROW-3778
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.12.0
>
>
> {{test-util.h}} is included in most (all?) test files, and it's quite long to 
> compile because it includes many other files and recompiles helper functions 
> all the time. Instead we should have only declarations in {{test-util.h}} and 
> put implementations in a separate {{.cc}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3778) [C++] Don't put implementations in test-util.h

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685666#comment-16685666
 ] 

Wes McKinney commented on ARROW-3778:
-

Agreed. I had partly done this in https://github.com/apache/arrow/pull/2704, so 
if you want to use just the arrow/util/testing.h/testing.cc changes from 
there, go ahead.

> [C++] Don't put implementations in test-util.h
> --
>
> Key: ARROW-3778
> URL: https://issues.apache.org/jira/browse/ARROW-3778
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Priority: Major
>
> {{test-util.h}} is included in most (all?) test files, and it's quite long to 
> compile because it includes many other files and recompiles helper functions 
> all the time. Instead we should have only declarations in {{test-util.h}} and 
> put implementations in a separate {{.cc}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings

2018-11-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3738:
--
Labels: csv pull-request-available  (was: csv)

> [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
> ---
>
> Key: ARROW-3738
> URL: https://issues.apache.org/jira/browse/ARROW-3738
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv, pull-request-available
>
> See similar functionality in other libraries. I believe pandas has a fast 
> path for iso8601



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3778) [C++] Don't put implementations in test-util.h

2018-11-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3778:
-

 Summary: [C++] Don't put implementations in test-util.h
 Key: ARROW-3778
 URL: https://issues.apache.org/jira/browse/ARROW-3778
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.11.1
Reporter: Antoine Pitrou


{{test-util.h}} is included in most (all?) test files, and it's quite long to 
compile because it includes many other files and recompiles helper functions 
all the time. Instead we should have only declarations in {{test-util.h}} and 
put implementations in a separate {{.cc}} file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1837) [Java] Unable to read unsigned integers outside signed range for bit width in integration tests

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1837:

Fix Version/s: (was: 0.12.0)
   0.13.0

> [Java] Unable to read unsigned integers outside signed range for bit width in 
> integration tests
> ---
>
> Key: ARROW-1837
> URL: https://issues.apache.org/jira/browse/ARROW-1837
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Reporter: Wes McKinney
>Priority: Blocker
>  Labels: columnar-format-1.0
> Fix For: 0.13.0
>
> Attachments: generated_primitive.json
>
>
> I believe this was introduced recently (perhaps in the refactors), but a 
> problem where the integration tests weren't being run properly hid the error 
> from us.
> see https://github.com/apache/arrow/pull/1294#issuecomment-345553066



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1875) Write 64-bit ints as strings in integration test JSON files

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1875:

Fix Version/s: (was: 0.12.0)
   0.13.0

> Write 64-bit ints as strings in integration test JSON files
> ---
>
> Key: ARROW-1875
> URL: https://issues.apache.org/jira/browse/ARROW-1875
> Project: Apache Arrow
>  Issue Type: Task
>Reporter: Brian Hulette
>Priority: Minor
> Fix For: 0.13.0
>
>
> Javascript can't handle 64-bit integers natively, so writing them as strings 
> in the JSON would make implementing the integration tests a lot simpler.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings

2018-11-13 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-3738:
-

Assignee: Antoine Pitrou

> [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
> ---
>
> Key: ARROW-3738
> URL: https://issues.apache.org/jira/browse/ARROW-3738
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: csv
>
> See similar functionality in other libraries. I believe pandas has a fast 
> path for iso8601



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3085) [Rust] Add an adapter for parquet.

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3085:

Component/s: Rust

> [Rust] Add an adapter for parquet.
> --
>
> Key: ARROW-3085
> URL: https://issues.apache.org/jira/browse/ARROW-3085
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: parquet
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3346) [Python] Segfault when reading parquet files if torch is imported before pyarrow

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3346:

Summary: [Python] Segfault when reading parquet files if torch is imported 
before pyarrow  (was: Segfault when reading parquet files if torch is imported 
before pyarrow)

> [Python] Segfault when reading parquet files if torch is imported before 
> pyarrow
> 
>
> Key: ARROW-3346
> URL: https://issues.apache.org/jira/browse/ARROW-3346
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
>Reporter: Alexey Strokach
>Priority: Major
>
> pyarrow (version 0.10.0) appears to crash sporadically with a segmentation 
> fault when reading parquet files if it is used in a program where torch is 
> imported first.
> A self-contained example is available here: 
> [https://gitlab.com/ostrokach/pyarrow_pytorch_segfault].
> Basically, running
> {{python -X faulthandler -c "import torch; import pyarrow.parquet as pq; _ = 
> pq.ParquetFile('example.parquet').read_row_group(0)"}}
> sooner or later results in a segfault:
> {{Fatal Python error: Segmentation fault}}
> {{Current thread 0x7f52959bb740 (most recent call first):}}
> {{File 
> "/home/kimlab1/strokach/anaconda/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 125 in read_row_group}}
>  {{File "", line 1 in }}
>  {{./test_fail.sh: line 5: 42612 Segmentation fault (core dumped) python -X 
> faulthandler -c "import torch; import pyarrow.parquet as pq; _ = 
> pq.ParquetFile('example.parquet').read_row_group(0)"}}
> The number of iterations before a segfault varies, but it usually happens 
> within the first several calls.
> Running
> {{python -X faulthandler -c "import pyarrow.parquet as pq; import torch; _ = 
> pq.ParquetFile('example.parquet').read_row_group(0)"}}
> works without a problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2786) [JS] Read Parquet files in JavaScript

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2786:

Labels: parquet  (was: )

> [JS] Read Parquet files in JavaScript
> -
>
> Key: ARROW-2786
> URL: https://issues.apache.org/jira/browse/ARROW-2786
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: JavaScript
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
>
> See question in https://github.com/apache/arrow/issues/2209



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2627) [Python] Add option (or some equivalent) to toggle memory mapping functionality when using parquet.ParquetFile or other read entry points

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2627:

Labels: parquet  (was: )

> [Python] Add option (or some equivalent) to toggle memory mapping 
> functionality when using parquet.ParquetFile or other read entry points
> -
>
> Key: ARROW-2627
> URL: https://issues.apache.org/jira/browse/ARROW-2627
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
>
> See issue described in https://github.com/apache/arrow/issues/1946. When 
> passing a filename to {{parquet.ParquetFile}}, one cannot control what kind 
> of file reader internally is created (OSFile or MemoryMappedFile)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2079) [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't available

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2079:

Labels: parquet  (was: )

> [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't 
> available
> --
>
> Key: ARROW-2079
> URL: https://issues.apache.org/jira/browse/ARROW-2079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Minor
>  Labels: parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not 
> `_metadata`. From what I understand these are intended to contain the dataset 
> schema but not any row group information.
>  
> A few (possibly naive) questions:
>  
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:java}
> if self.metadata_path is not None:
> with self.fs.open(self.metadata_path) as f:
> self.common_metadata = ParquetFile(f).metadata
> else:
> self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`, 
> as the latter is never written by `pyarrow`, and is given by the `_metadata` 
> file instead of `_common_metadata` (as seemingly intended?).
>  
> 2. In `validate_schemas` I believe an option should exist for using the 
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently 
> only writes the former, and as far as I can tell `_common_metadata` does 
> include all the schema information needed.
>  
> Perhaps the logic in `validate_schemas` could be ported over to:
>  
> {code:java}
> if self.schema is not None:
> pass  # schema explicitly provided
> elif self.metadata is not None:
> self.schema = self.metadata.schema
> elif self.common_metadata is not None:
> self.schema = self.common_metadata.schema
> else:
> self.schema = self.pieces[0].get_metadata(open_file).schema{code}
> If these changes are valid, I'd be happy to submit a PR. It's not 100% clear 
> to me the difference between `_common_metadata` and `_metadata`, but I 
> believe the schema in both should be the same. Figured I'd open this for 
> discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1957) [Python] Handle nanosecond timestamps in parquet serialization

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1957:

Labels: parquet  (was: )

> [Python] Handle nanosecond timestamps in parquet serialization
> --
>
> Key: ARROW-1957
> URL: https://issues.apache.org/jira/browse/ARROW-1957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Python 3.6.4.  Mac OSX and CentOS Linux release 
> 7.3.1611.  Pandas 0.21.1 .
>Reporter: Jordan Samuels
>Priority: Minor
>  Labels: parquet
>
> The following code
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> n=3
> df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', 
> freq='1n', periods=n))
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code}
> results in:
> {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> 14832288001}}
> The desired effect is that we can save nanosecond resolution without losing 
> precision (e.g. conversion to ms).  Note that if {{freq='1u'}} is used, the 
> code runs properly.
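
A possible workaround, assuming the {{coerce_timestamps}} and {{allow_truncated_timestamps}} options of {{pyarrow.parquet.write_table}} are available in the version in use (nanosecond precision is still lost, but the write succeeds):
{code}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

n = 3
df = pd.DataFrame({'x': range(n)},
                  index=pd.DatetimeIndex(start='2017-01-01', freq='1n', periods=n))

# Explicitly coerce to microseconds and allow the truncation instead of
# raising ArrowInvalid.
pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet',
               coerce_timestamps='us', allow_truncated_timestamps=True)
{code}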



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1957) [Python] Handle nanosecond timestamps in parquet serialization

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1957:

Component/s: Python

> [Python] Handle nanosecond timestamps in parquet serialization
> --
>
> Key: ARROW-1957
> URL: https://issues.apache.org/jira/browse/ARROW-1957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Python 3.6.4.  Mac OSX and CentOS Linux release 
> 7.3.1611.  Pandas 0.21.1 .
>Reporter: Jordan Samuels
>Priority: Minor
>  Labels: parquet
>
> The following code
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> n=3
> df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', 
> freq='1n', periods=n))
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code}
> results in:
> {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> 14832288001}}
> The desired effect is that we can save nanosecond resolution without losing 
> precision (e.g. conversion to ms).  Note that if {{freq='1u'}} is used, the 
> code runs properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3248) [C++] Arrow tests should have label "arrow"

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3248:

Fix Version/s: 0.12.0

> [C++] Arrow tests should have label "arrow"
> ---
>
> Key: ARROW-3248
> URL: https://issues.apache.org/jira/browse/ARROW-3248
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.10.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.12.0
>
>
> It would help executing only them, not Parquet unit tests which for some 
> reason are quite a bit longer to run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3325) [Python] Support reading Parquet binary/string columns as pandas Categorical

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3325:

Labels: parquet  (was: )

> [Python] Support reading Parquet binary/string columns as pandas Categorical
> 
>
> Key: ARROW-3325
> URL: https://issues.apache.org/jira/browse/ARROW-3325
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> Requires PARQUET-1324 and probably quite a bit of extra work  
> Properly implementing this will require dictionary normalization across row 
> groups. When reading a new row group, a fast path that compares the current 
> dictionary with the prior dictionary should be used. This also needs to 
> handle the case where a column chunk "fell back" to PLAIN encoding mid-stream



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1957) [Python] Handle nanosecond timestamps in parquet serialization

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1957:

Summary: [Python] Handle nanosecond timestamps in parquet serialization  
(was: Handle nanosecond timestamps in parquet serialization)

> [Python] Handle nanosecond timestamps in parquet serialization
> --
>
> Key: ARROW-1957
> URL: https://issues.apache.org/jira/browse/ARROW-1957
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Python 3.6.4.  Mac OSX and CentOS Linux release 
> 7.3.1611.  Pandas 0.21.1 .
>Reporter: Jordan Samuels
>Priority: Minor
>  Labels: parquet
>
> The following code
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> n=3
> df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', 
> freq='1n', periods=n))
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code}
> results in:
> {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: 
> 14832288001}}
> The desired effect is that we can save nanosecond resolution without losing 
> precision (e.g. conversion to ms).  Note that if {{freq='1u'}} is used, the 
> code runs properly.
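> Until nanosecond storage is supported, one hedged workaround (a sketch only; it deliberately truncates to microseconds instead of preserving nanoseconds) is to pass {{coerce_timestamps}} and {{allow_truncated_timestamps}} to {{write_table}}:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> n = 3
> df = pd.DataFrame({'x': range(n)},
>                   index=pd.DatetimeIndex(start='2017-01-01', freq='1n', periods=n))
> # Truncate to microseconds explicitly instead of failing on the lossy cast.
> pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet',
>                coerce_timestamps='us', allow_truncated_timestamps=True)
> {code}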



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3085) [Rust] Add an adapter for parquet.

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3085:

Labels: parquet  (was: )

> [Rust] Add an adapter for parquet.
> --
>
> Key: ARROW-3085
> URL: https://issues.apache.org/jira/browse/ARROW-3085
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Rust
>Reporter: Renjie Liu
>Assignee: Renjie Liu
>Priority: Major
>  Labels: parquet
> Fix For: 0.13.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3762) [C++] Arrow table reads error when overflowing capacity of BinaryArray

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3762:

Labels: parquet  (was: )

> [C++] Arrow table reads error when overflowing capacity of BinaryArray
> --
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Chris Ellison
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3139.
---
Resolution: Duplicate

> [Python] ArrowIOError: Arrow error: Capacity error during read
> --
>
> Key: ARROW-3139
> URL: https://issues.apache.org/jira/browse/ARROW-3139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
> Environment: pandas=0.23.1=py36h637b7d7_0
> pyarrow==0.10.0
>Reporter: Frédérique Vanneste
>Priority: Major
>  Labels: parquet
>
> My assumption: the problem is caused by a large object column containing 
> strings up to 27 characters long. (so that column is much larger than 2GB of 
> strings, chunking issue)
> looks similar to
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  
> Code
>  * basket_plateau= pq.read_table("basket_plateau.parquet")
>  * basket_plateau = pd.read_parquet("basket_plateau.parquet")
> Error produced
>  * ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more 
> than 2147483646 bytes, have 2147483655
> Dataset
>  * Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
>  * 2.7 billion records, 4 columns (int64/object/datetime64/float64)
>  * approx. 90 GB in memory
>  * example of object col: "Fresh Vegetables", "Alcohol Beers", ... (think 
> food retail categories)
> History to bug:
>  * was using older version of pyarrow
>  * tried writing dataset to disk (parquet) and failed
>  * stumbled on https://issues.apache.org/jira/browse/ARROW-2227
>  * upgraded to 0.10
>  * tried writing dataset to disk (parquet) and succeeded
>  * tried reading dataset and failed
>  * looks like a similar case as: 
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2360) [C++] Add set_chunksize for RecordBatchReader in arrow/record_batch.h

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2360:

Component/s: C++

> [C++] Add set_chunksize for RecordBatchReader in arrow/record_batch.h
> -
>
> Key: ARROW-2360
> URL: https://issues.apache.org/jira/browse/ARROW-2360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Xianjin YE
>Priority: Major
>
> As discussed in [https://github.com/apache/parquet-cpp/pull/445,] 
> Maybe it's better to expose a chunksize-related API in RecordBatchReader.
>  
> However, RecordBatchStreamReader doesn't conform to this requirement. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3208) [Python] Segmentation fault when reading a Parquet partitioned dataset to a Parquet file

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3208:

Labels: parquet  (was: )

> [Python] Segmentation fault when reading a Parquet partitioned dataset to a 
> Parquet file
> 
>
> Key: ARROW-3208
> URL: https://issues.apache.org/jira/browse/ARROW-3208
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Ubuntu 16.04 LTS; System76 Oryx Pro
>Reporter: Ying Wang
>Priority: Major
>  Labels: parquet
>
> Steps to reproduce:
>  # Create a partitioned dataset with the following code:
> ```python
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({ 'one': [-1, 10, 2.5, 100, 1000, 1, 29.2], 'two': [-1, 10, 
> 2, 100, 1000, 1, 11], 'three': [0, 0, 0, 0, 0, 0, 0] })
> table = pa.Table.from_pandas(df)
> pq.write_to_dataset(table, root_path='/home/yingw787/misc/example_dataset', 
> partition_cols=['one', 'two'])
> ```
>  # Create a Parquet file from a PyArrow Table created from the partitioned 
> Parquet dataset:
> ```python
> import pyarrow.parquet as pq
> table = pq.ParquetDataset('/path/to/dataset').read()
> pq.write_table(table, '/path/to/example.parquet')
> ```
> EXPECTED:
>  * Successful write
> GOT:
>  * Segmentation fault
> Issue reference on GitHub mirror: https://github.com/apache/arrow/issues/2511



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3722) [C++] Allow specifying column types to CSV reader

2018-11-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3722:
--
Labels: pull-request-available  (was: )

> [C++] Allow specifying column types to CSV reader
> -
>
> Key: ARROW-3722
> URL: https://issues.apache.org/jira/browse/ARROW-3722
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> I'm not sure how to expose this. The easiest, implementation-wise, would be 
> to allow passing a {{Schema}} (for example inside the {{ConvertOptions}}).
> Another possibility is to allow specifying the default types for type 
> inference. For example type inference currently infers integers as {{int64}}, 
> but the user might prefer {{int32}}.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3502) [C++] parquet-column_scanner-test failure building ARROW_PARQUET build 11.

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3502:

Labels: parquet  (was: )

> [C++] parquet-column_scanner-test failure building ARROW_PARQUET build 11.
> --
>
> Key: ARROW-3502
> URL: https://issues.apache.org/jira/browse/ARROW-3502
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Tanveer
>Priority: Major
>  Labels: parquet
> Attachments: Screenshot from 2018-10-11 12-25-13.png
>
>
> For building Apache Arrow, I have enabled the following flags and got the error 
> in the attachment (parquet-column_scanner-test failure) when making Arrow build 11.
> cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_PARQUET=ON -DARROW_PLASMA=ON 
> -DARROW_PLASMA_JAVA_CLIENT=ON



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3731) [R] R API for reading and writing Parquet files

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3731:

Labels: parquet  (was: )

> [R] R API for reading and writing Parquet files
> ---
>
> Key: ARROW-3731
> URL: https://issues.apache.org/jira/browse/ARROW-3731
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
>
> To start, this would be at the level of complexity of 
> {{pyarrow.parquet.read_table}} and {{write_table}}
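> For reference, a minimal sketch of the pyarrow calls the R bindings would mirror (file names are placeholders):
> {code:python}
> import pyarrow.parquet as pq
>
> table = pq.read_table('data.parquet')   # read one Parquet file into an Arrow Table
> pq.write_table(table, 'copy.parquet')   # write an Arrow Table back to Parquet
> {code}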



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3703) [Python] DataFrame.to_parquet crashes if datetime column has time zones

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3703:

Labels: parquet  (was: )

> [Python] DataFrame.to_parquet crashes if datetime column has time zones
> ---
>
> Key: ARROW-3703
> URL: https://issues.apache.org/jira/browse/ARROW-3703
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
> Environment: pandas 0.23.4
> pyarrow 0.11.1
> Python 2.7, 3.5 - 3.7
> MacOS High Sierra (10.13.6)
>Reporter: Diego Argueta
>Priority: Major
>  Labels: parquet
>
> On CPython 2.7.15, 3.5.6, 3.6.6, and 3.7.0, creating a Pandas DataFrame with 
> a {{datetime.datetime}} object serializes to Parquet just fine, but crashes 
> with an {{AttributeError}} if you try to use the built-in {{timezone}} 
> objects.
> To reproduce, on Python 3:
> {code:java}
> import datetime as dt
> import pandas as pd
> df = pd.DataFrame({'foo': [dt.datetime(2018, 1, 1, 1, 23, 45, 
> tzinfo=dt.timezone.utc)]})
> df.to_parquet('data.parq')
> {code}
>  
> On Python 2, create a subclass of {{datetime.tzinfo}} as shown 
> [here|https://docs.python.org/2/library/datetime.html#datetime.tzinfo] and 
> try the same thing.
>  
> The following exception results:
> {noformat}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/core/frame.py",
>  line 1945, in to_parquet
> compression=compression, **kwargs)
>   File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 257, in to_parquet
> return impl.write(df, path, compression=compression, **kwargs)
>   File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py",
>  line 118, in write
> table = self.api.Table.from_pandas(df)
>   File "pyarrow/table.pxi", line 1217, in pyarrow.lib.Table.from_pandas
>   File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
>  line 381, in dataframe_to_arrays
> convert_types)]
>   File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
>  line 380, in 
> for c, t in zip(columns_to_convert,
>   File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
>  line 370, in convert_column
> return pa.array(col, type=ty, from_pandas=True, safe=safe)
>   File "pyarrow/array.pxi", line 167, in pyarrow.lib.array
>   File 
> "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py",
>  line 409, in get_datetimetz_type
> type_ = pa.timestamp(unit, tz)
>   File "pyarrow/types.pxi", line 1038, in pyarrow.lib.timestamp
>   File "pyarrow/types.pxi", line 955, in pyarrow.lib.tzinfo_to_string
> AttributeError: 'datetime.timezone' object has no attribute 'zone'
> 'datetime.timezone' object has no attribute 'zone'
> {noformat}
>  
>  This doesn't happen if you use {{pytz.UTC}} as the timezone object.
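> A sketch of that workaround (untested here, using {{pytz}}):
> {code:python}
> import datetime as dt
> import pandas as pd
> import pytz
>
> # Using pytz's UTC object instead of datetime.timezone.utc avoids the crash.
> df = pd.DataFrame({'foo': [dt.datetime(2018, 1, 1, 1, 23, 45, tzinfo=pytz.utc)]})
> df.to_parquet('data.parq')
> {code}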



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1988) [Python] Extend flavor=spark in Parquet writing to handle INT types

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1988:

Labels: parquet  (was: )

> [Python] Extend flavor=spark in Parquet writing to handle INT types
> ---
>
> Key: ARROW-1988
> URL: https://issues.apache.org/jira/browse/ARROW-1988
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: parquet
> Fix For: 0.13.0
>
>
> See the relevant code sections at 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L139
> We should cater for them in the {{pyarrow}} code and also reach out to Spark 
> developers so that they are supported there in the long term.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3166) [C++] Consolidate IO interfaces used in arrow/io and parquet-cpp

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3166:

Labels: parquet  (was: )

> [C++] Consolidate IO interfaces used in arrow/io and parquet-cpp
> 
>
> Key: ARROW-3166
> URL: https://issues.apache.org/jira/browse/ARROW-3166
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> With the codebase consolidation, we have the opportunity to remove cruft from 
> the Parquet codebase. I believe it would be simpler and better for the 
> ecosystem to use the Arrow IO interface classes rather than maintaining 
> separate virtual IO interfaces exported from the {{parquet::}} namespace.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3728:

Labels: parquet  (was: )

> [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
> ---
>
> Key: ARROW-3728
> URL: https://issues.apache.org/jira/browse/ARROW-3728
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0, 0.11.0, 0.11.1
> Environment: Python 3.6.3
> OSX 10.14
>Reporter: Micah Williamson
>Priority: Major
>  Labels: parquet
>
> From: 
> https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch
>  
> I am trying to merge multiple parquet files into one. Their schemas are 
> identical field-wise but my {{ParquetWriter}} is complaining that they are 
> not. After some investigation I found that the pandas metadata in the schemas is 
> different, causing this error.
>  
> Sample-
> {code:python}
> import pyarrow.parquet as pq
> pq_tables=[]
> for file_ in files:
> pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
> pq_tables.append(pq_table)
> if writer is None:
> writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema, 
> use_deprecated_int96_timestamps=True)
> writer.write_table(table=pq_table)
> {code}
> The error-
> {code}
> Traceback (most recent call last):
>   File "{PATH_TO}/main.py", line 68, in lambda_handler
> writer.write_table(table=pq_table)
>   File 
> "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py",
>  line 335, in write_table
> raise ValueError(msg)
> ValueError: Table schema does not match schema used to create file:
> {code}
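> One possible workaround, sketched below and untested, is to strip the pandas-specific schema metadata before writing, so that only the field-level schemas are compared:
> {code:python}
> import pyarrow.parquet as pq
>
> writer = None
> for file_ in files:  # `files`, MESS_DIR and COMPRESSED_FILE as in the snippet above
>     pq_table = pq.read_table(f'{MESS_DIR}/{file_}')
>     # Drop the pandas metadata so the schema comparison only looks at the fields.
>     pq_table = pq_table.replace_schema_metadata(None)
>     if writer is None:
>         writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema,
>                                   use_deprecated_int96_timestamps=True)
>     writer.write_table(table=pq_table)
> writer.close()
> {code}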



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3525) [Packaging] Remove arrow/ and parquet-cpp/ dependencies in dev/run_docker_compose.sh

2018-11-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685329#comment-16685329
 ] 

Wes McKinney commented on ARROW-3525:
-

[~kszucs] was this completed?

> [Packaging] Remove arrow/ and parquet-cpp/ dependencies in 
> dev/run_docker_compose.sh
> 
>
> Key: ARROW-3525
> URL: https://issues.apache.org/jira/browse/ARROW-3525
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Packaging
>Affects Versions: 0.11.0
>Reporter: Kouhei Sutou
>Priority: Minor
>
> Because we merged parquet-cpp into the Apache Arrow repository.
>  
> Can someone work on this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2624) [Python] Random schema and data generator for Arrow conversion and Parquet testing

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2624:

Labels: parquet  (was: )

> [Python] Random schema and data generator for Arrow conversion and Parquet 
> testing
> --
>
> Key: ARROW-2624
> URL: https://issues.apache.org/jira/browse/ARROW-2624
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
>
> See discussion in https://github.com/apache/arrow/issues/2067
> Being able to generate random complex schemas and corresponding example data 
> sets will help with exercising edge cases in many different parts of the 
> codebase. One practical example: reading and writing nested data to Parquet 
> format
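> A toy sketch (not the proposed implementation) of what such a generator might start from:
> {code:python}
> import random
> import pyarrow as pa
>
> def random_schema(depth=0):
>     # Toy generator: a handful of primitive types plus occasional nesting.
>     primitives = [pa.int64(), pa.float64(), pa.string(), pa.bool_()]
>     fields = []
>     for i in range(random.randint(1, 4)):
>         if depth < 2 and random.random() < 0.3:
>             typ = pa.list_(random.choice(primitives))
>         else:
>             typ = random.choice(primitives)
>         fields.append(pa.field('f{}_{}'.format(i, depth), typ))
>     return pa.schema(fields)
>
> print(random_schema())
> {code}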



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1848) [Python] Add documentation examples for reading single Parquet files and datasets from HDFS

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1848:

Labels: parquet  (was: )

> [Python] Add documentation examples for reading single Parquet files and 
> datasets from HDFS
> ---
>
> Key: ARROW-1848
> URL: https://issues.apache.org/jira/browse/ARROW-1848
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> see 
> https://stackoverflow.com/questions/47443151/read-a-parquet-files-from-hdfs-using-pyarrow
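> A sketch of the kind of example the documentation could show (host, port and paths are placeholders; assumes a reachable HDFS cluster):
> {code:python}
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> fs = pa.hdfs.connect('namenode-host', port=8020)  # placeholder cluster
>
> # Single file: open it through HDFS and hand the file object to pyarrow.
> with fs.open('/data/example.parquet', 'rb') as f:
>     table = pq.read_table(f)
>
> # Directory of files: ParquetDataset accepts a filesystem object.
> dataset = pq.ParquetDataset('/data/parquet_dir', filesystem=fs)
> table = dataset.read()
> {code}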



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2728) [Python] Support partitioned Parquet datasets using glob-style file paths

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2728:

Labels: parquet  (was: newbie)

> [Python] Support partitioned Parquet datasets using glob-style file paths
> -
>
> Key: ARROW-2728
> URL: https://issues.apache.org/jira/browse/ARROW-2728
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: pyarrow : 0.9.0.post1
> dask : 0.17.1
> Mac OS
>Reporter: pranav kohli
>Priority: Minor
>  Labels: parquet
>
> I am saving a dask dataframe to parquet with two partition columns using the 
> pyarrow engine. The problem arises in scanning the partition columns. When I 
> scan using the directory path, I get the partition columns in the output 
> dataframe, whereas if I scan using the glob path, I don't get these columns.
>  
> https://github.com/apache/arrow/issues/2147
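> A minimal sketch of the difference, using pyarrow directly (paths are placeholders):
> {code:python}
> import glob
> import pyarrow.parquet as pq
>
> # Directory path: the partition column(s) show up in the resulting table.
> t1 = pq.ParquetDataset('out_dir').read()
>
> # Glob-style list of files: the partition columns are lost, which is the
> # behaviour reported here.
> t2 = pq.ParquetDataset(glob.glob('out_dir/*/*.parquet')).read()
> {code}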



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3210) [Python] Creating ParquetDataset creates partitioned ParquetFiles with mismatched Parquet schemas

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3210:

Labels: parquet  (was: )

> [Python] Creating ParquetDataset creates partitioned ParquetFiles with 
> mismatched Parquet schemas
> -
>
> Key: ARROW-3210
> URL: https://issues.apache.org/jira/browse/ARROW-3210
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Ubuntu 16.04 LTS, System76 Oryx Pro
>Reporter: Ying Wang
>Priority: Major
>  Labels: parquet
> Attachments: environment.yml, repro.csv, repro.py, repro_2.py
>
>
> STEPS TO REPRODUCE:
> 1. Create a conda environment reflecting [^environment.yml]
> 2. Execute script [^repro.py], replacing various config variables to create a 
> ParquetDataset on S3 given [^repro.csv]
> 3. Create reference of ParquetDataset using script [^repro_2.py], again 
> replacing various config variables.
>  
> EXPECTED:
> Reference is created correctly.
> GOT:
> Mismatched Arrow schemas in validate_schemas() method:
>  
> ```python
> *** ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, 
> Heading=1] 
> s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC 
> RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet
>  was different. 
> Record_ID: int64
> y: double
> TRACKID: string
> MMSI: int64
> IMO: int64
> AgeMinutes: double
> SoG: double
> Width: int64
> Length: int64
> Callsign: string
> Destination: string
> ETA: int64
> Status: string
> ExtraInfo: string
> TIMESTAMP: int64
> __index_level_0__: int64
> metadata
> 
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": 
> [{"na'
>  b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
>  b'type": "object", "metadata": \{"encoding": "UTF-8"}}], "columns":'
>  b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
>  b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
>  b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
>  b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
>  b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta'
>  b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ'
>  b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": '
>  b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"'
>  b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name'
>  b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6'
>  b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan'
>  b'das_type": "float64", "numpy_type": "float64", "metadata": null}'
>  b', {"name": "Width", "field_name": "Width", "pandas_type": "int64'
>  b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", '
>  b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i'
>  b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca'
>  b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta'
>  b'data": null}, {"name": "Destination", "field_name": "Destination'
>  b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":'
>  b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int'
>  b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"'
>  b', "field_name": "Status", "pandas_type": "unicode", "numpy_type"'
>  b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name'
>  b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"'
>  b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST'
>  b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":'
>  b' null}, {"name": null, "field_name": "__index_level_0__", "panda'
>  b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa'
>  b'ndas_version": "0.21.0"}'}
> vs
> Record_ID: int64
> y: double
> TRACKID: string
> MMSI: int64
> IMO: int64
> AgeMinutes: double
> SoG: double
> Width: int64
> Length: int64
> Callsign: string
> Destination: string
> ETA: int64
> Status: string
> ExtraInfo: null
> TIMESTAMP: int64
> __index_level_0__: int64
> metadata
> 
> {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": 
> [{"na'
>  b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
>  b'type": "object", "metadata": \{"encoding": "UTF-8"}}], "columns":'
>  b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"'
>  b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y'
>  b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f'
>  b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T'
>  b'RACKID", 

[jira] [Commented] (ARROW-3722) [C++] Allow specifying column types to CSV reader

2018-11-13 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685325#comment-16685325
 ] 

Antoine Pitrou commented on ARROW-3722:
---

> We also need a way to provide column names (or even default to numbering) for 
> files without a header. This topic is related, but maybe a new Jira would be 
> better suited for it.

Yes, I think a separate JIRA is better.

> additional thoughts on passing ColumnBuilder instead of just a type. Ideally, 
> the user would be able to implement their own converters to support, let's say, 
> uncommon date formats or even parse struct types at load time. 

Right now most CSV APIs are internal. APIs like ColumnBuilder and Converter 
expose implementation details that we don't want to set in stone. If there's 
some demand we might think about an API to let people define their conversion 
functions without having to depend on internal APIs.

> [C++] Allow specifying column types to CSV reader
> -
>
> Key: ARROW-3722
> URL: https://issues.apache.org/jira/browse/ARROW-3722
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> I'm not sure how to expose this. The easiest, implementation-wise, would be 
> to allow passing a {{Schema}} (for example inside the {{ConvertOptions}}).
> Another possibility is to allow specifying the default types for type 
> inference. For example type inference currently infers integers as {{int64}}, 
> but the user might prefer {{int32}}.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2628:

Labels: parquet  (was: )

> [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
> --
>
> Key: ARROW-2628
> URL: https://issues.apache.org/jira/browse/ARROW-2628
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
>
> See discussion in https://github.com/apache/arrow/issues/1749. We should 
> consider strategies for writing very large tables to a partitioned directory 
> scheme. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1012) [C++] Create implementation of StreamReader that reads from Apache Parquet files

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1012:

Labels: parquet  (was: )

> [C++] Create implementation of StreamReader that reads from Apache Parquet 
> files
> 
>
> Key: ARROW-1012
> URL: https://issues.apache.org/jira/browse/ARROW-1012
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
>
> This will be enabled by ARROW-1008



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2366) [Python] Support reading Parquet files having a permutation of column order

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2366:

Labels: parquet  (was: )

> [Python] Support reading Parquet files having a permutation of column order
> ---
>
> Key: ARROW-2366
> URL: https://issues.apache.org/jira/browse/ARROW-2366
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> See discussion in https://github.com/dask/fastparquet/issues/320



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2038) [Python] Follow-up bug fixes for s3fs Parquet support

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2038:

Labels: aws parquet  (was: aws)

> [Python] Follow-up bug fixes for s3fs Parquet support
> -
>
> Key: ARROW-2038
> URL: https://issues.apache.org/jira/browse/ARROW-2038
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: aws, parquet
> Fix For: 0.12.0
>
>
> see discussion in 
> https://github.com/apache/arrow/pull/916#issuecomment-360558248



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2077) [Python] Document on how to use Storefact & Arrow to read Parquet from S3/Azure/...

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2077:

Labels: parquet  (was: )

> [Python] Document on how to use Storefact & Arrow to read Parquet from 
> S3/Azure/...
> ---
>
> Key: ARROW-2077
> URL: https://issues.apache.org/jira/browse/ARROW-2077
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> We're using this happily in production, also with column projection down to 
> the storage layer. Others should also benefit from this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1682) [Python] Add documentation / example for reading a directory of Parquet files on S3

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1682:

Labels: parquet  (was: )

> [Python] Add documentation / example for reading a directory of Parquet files 
> on S3
> ---
>
> Key: ARROW-1682
> URL: https://issues.apache.org/jira/browse/ARROW-1682
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> Opened based on comment 
> https://github.com/apache/arrow/pull/916#issuecomment-337563492
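> A sketch of the s3fs-based pattern such an example would document (bucket and path are placeholders; credentials are resolved from the environment):
> {code:python}
> import s3fs
> import pyarrow.parquet as pq
>
> fs = s3fs.S3FileSystem()
> dataset = pq.ParquetDataset('my-bucket/path/to/parquet_dir', filesystem=fs)
> table = dataset.read()
> df = table.to_pandas()
> {code}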



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-3722) [C++] Allow specifying column types to CSV reader

2018-11-13 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-3722:
-

Assignee: Antoine Pitrou

> [C++] Allow specifying column types to CSV reader
> -
>
> Key: ARROW-3722
> URL: https://issues.apache.org/jira/browse/ARROW-3722
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.11.1
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> I'm not sure how to expose this. The easiest, implementation-wise, would be 
> to allow passing a {{Schema}} (for example inside the {{ConvertOptions}}).
> Another possibility is to allow specifying the default types for type 
> inference. For example type inference currently infers integers as {{int64}}, 
> but the user might prefer {{int32}}.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1925) [Python] Wrapping PyArrow Table with Numpy without copy

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1925:

Summary: [Python] Wrapping PyArrow Table with Numpy without copy  (was: 
Wrapping PyArrow Table with Numpy without copy)

> [Python] Wrapping PyArrow Table with Numpy without copy
> ---
>
> Key: ARROW-1925
> URL: https://issues.apache.org/jira/browse/ARROW-1925
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Young-Jun Ko
>Priority: Minor
>  Labels: parquet
>
> The scenario is the following:
> I have a parquet file, which has a column containing a float array of 
> constant size.
> So it can be thought of as a matrix.
> When I read the parquet file, the way I currently access it, is to convert it 
> to pandas, extract the values, giving me a list of np.array and then doing 
> np.vstack to get the matrix.
> This involves a copy that would be nice to avoid.
> When a parquet file (or more generally a parquet dataset) is read, would the 
> values of the array column be contiguous in memory, so that a view on the 
> data could be created without having to copy? That would be neat.
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-976) [Python] Provide API for defining and reading Parquet datasets with more ad hoc partition schemes

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-976:
---
Labels: parquet  (was: )

> [Python] Provide API for defining and reading Parquet datasets with more ad 
> hoc partition schemes
> -
>
> Key: ARROW-976
> URL: https://issues.apache.org/jira/browse/ARROW-976
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-1925) [Python] Wrapping PyArrow Table with Numpy without copy

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-1925:

Labels: parquet  (was: )

> [Python] Wrapping PyArrow Table with Numpy without copy
> ---
>
> Key: ARROW-1925
> URL: https://issues.apache.org/jira/browse/ARROW-1925
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Affects Versions: 0.7.1
>Reporter: Young-Jun Ko
>Priority: Minor
>  Labels: parquet
>
> The scenario is the following:
> I have a parquet file, which has a column containing a float array of 
> constant size.
> So it can be thought of as a matrix.
> When I read the parquet file, the way I currently access it, is to convert it 
> to pandas, extract the values, giving me a list of np.array and then doing 
> np.vstack to get the matrix.
> This involves a copy that would be nice to avoid.
> When a parquet file (or more generally a parquet dataset) is read, would the 
> values of the array column be contiguous in memory, so that a view on the 
> data could be created without having to copy? That would be neat.
> Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2360) [C++] Add set_chunksize for RecordBatchReader in arrow/record_batch.h

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2360:

Summary: [C++] Add set_chunksize for RecordBatchReader in 
arrow/record_batch.h  (was: Add set_chunksize for RecordBatchReader in 
arrow/record_batch.h)

> [C++] Add set_chunksize for RecordBatchReader in arrow/record_batch.h
> -
>
> Key: ARROW-2360
> URL: https://issues.apache.org/jira/browse/ARROW-2360
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Xianjin YE
>Priority: Major
>
> As discussed in [https://github.com/apache/parquet-cpp/pull/445,] 
> Maybe it's better to expose a chunksize-related API in RecordBatchReader.
>  
> However, RecordBatchStreamReader doesn't conform to this requirement. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3762) [C++] Arrow table reads error when overflowing capacity of BinaryArray

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3762:

Component/s: Python
 C++

> [C++] Arrow table reads error when overflowing capacity of BinaryArray
> --
>
> Key: ARROW-3762
> URL: https://issues.apache.org/jira/browse/ARROW-3762
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Reporter: Chris Ellison
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError 
> due to it not creating chunked arrays. Reading each row group individually 
> and then concatenating the tables works, however.
>  
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> x = pa.array(list('1' * 2**30))
> demo = 'demo.parquet'
> def scenario():
> t = pa.Table.from_arrays([x], ['x'])
> writer = pq.ParquetWriter(demo, t.schema)
> for i in range(2):
> writer.write_table(t)
> writer.close()
> pf = pq.ParquetFile(demo)
> # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot 
> contain more than 2147483646 bytes, have 2147483647
> t2 = pf.read()
> # Works, but note, there are 32 row groups, not 2 as suggested by:
> # 
> https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
> t3 = pa.concat_tables(tables)
> scenario()
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2592) [Python] AssertionError in to_pandas()

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2592:

Labels: parquet  (was: )

> [Python] AssertionError in to_pandas()
> --
>
> Key: ARROW-2592
> URL: https://issues.apache.org/jira/browse/ARROW-2592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.11.1
>Reporter: Dima Ryazanov
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
>
> Pyarrow 0.8 and 0.9 raises an AssertionError for one of the datasets I have 
> (created using an older version of pyarrow). Repro steps:
> {{In [1]: from pyarrow.parquet import ParquetDataset}}
> {{In [2]: d = ParquetDataset(['bug.parq'])}}
> {{In [3]: t = d.read()}}
> {{In [4]: t.to_pandas()}}
> {{---}}
> {{AssertionError    Traceback (most recent call 
> last)}}
> {{ in ()}}
> {{> 1 t.to_pandas()}}
> {{table.pxi in pyarrow.lib.Table.to_pandas()}}
> {{~/envs/cli3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in 
> table_to_blockmanager(options, table, memory_pool, nthreads, categories)}}
> {{    529 # There must be the same number of field names and physical 
> names}}
> {{    530 # (fields in the arrow Table)}}
> {{--> 531 assert len(logical_index_names) == len(index_columns_set)}}
> {{    532 }}
> {{    533 # It can never be the case in a released version of pyarrow 
> that}}
> {{AssertionError: }}
>  
> Here's the file: [https://www.dropbox.com/s/oja3khjsc5tycfh/bug.parq]
> (I was not able to attach it here due to a "missing token", whatever that 
> means.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2659:

Labels: parquet  (was: beginner)

> [Python] More graceful reading of empty String columns in ParquetDataset
> 
>
> Key: ARROW-2659
> URL: https://issues.apache.org/jira/browse/ARROW-2659
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: parquet
> Fix For: 0.12.0
>
> Attachments: read_parquet_dataset.error.read_table.novalidation.txt, 
> read_parquet_dataset.error.read_table.txt
>
>
> When currently saving a {{ParquetDataset}} from Pandas, we don't get 
> consistent schemas, even if the source was a single DataFrame. This is due to 
> the fact that in some partitions object columns like string can become empty. 
> Then the resulting Arrow schema will differ. In the central metadata, we will 
> store this column as {{pa.string}}, whereas in the partition file with the 
> empty column, this column will be stored as {{pa.null}}.
> The two schemas are still a valid match in terms of schema evolution and we 
> should respect that in 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754
>  Instead of doing a {{pa.Schema.equals}} in 
> https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778
>  we should introduce a new method {{pa.Schema.can_evolve_to}} that is more 
> graceful and returns {{True}} if a dataset piece has a null column where the 
> main metadata states a nullable column of any type.
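> A rough Python sketch of the relaxed check the proposed {{pa.Schema.can_evolve_to}} could perform (names and behaviour are only illustrative, not the actual API):
> {code:python}
> import pyarrow as pa
>
> def can_evolve_to(piece_schema, dataset_schema):
>     # A piece's null column may match any nullable column of the same name.
>     for field in piece_schema:
>         target = dataset_schema.field_by_name(field.name)
>         if target is None:
>             return False
>         if field.type == pa.null() and target.nullable:
>             continue
>         if field.type != target.type:
>             return False
>     return True
> {code}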



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3538:

Labels: features parquet  (was: features)

> [Python] ability to override the automated assignment of uuid for filenames 
> when writing datasets
> -
>
> Key: ARROW-3538
> URL: https://issues.apache.org/jira/browse/ARROW-3538
> Project: Apache Arrow
>  Issue Type: Wish
>Affects Versions: 0.10.0
>Reporter: Ji Xu
>Priority: Major
>  Labels: features, parquet
>
> Say I have a pandas DataFrame {{df}} that I would like to store on disk as a 
> dataset using pyarrow parquet; I would do this:
> {code:java}
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_to_dataset(table, root_path=some_path, 
> partition_cols=['a',]){code}
> On disk the dataset would look like something like this:
>  {color:#14892c}some_path{color}
>  {color:#14892c}├── a=1{color}
>  {color:#14892c}├── 4498704937d84fe5abebb3f06515ab2d.parquet{color}
>  {color:#14892c}├── a=2{color}
>  {color:#14892c}├── 8bcfaed8986c4bdba587aaaee532370c.parquet{color}
> *Wished Feature:* It'd be great if I can override the auto-assignment of the 
> long UUID as filename somehow during the *dataset* writing. My purpose is to 
> be able to overwrite the dataset on disk when I have a new version of {{df}}. 
> Currently if I try to write the dataset again, another new uniquely named 
> [UUID].parquet file will be placed next to the old one, with the same, 
> redundant data.
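> Until an override exists, a blunt workaround (sketch only; {{some_path}} and {{df}} as above) is to remove the previous dataset before rewriting:
> {code:python}
> import shutil
>
> import pyarrow
> import pyarrow.parquet
>
> # Drop the old version so stale [UUID].parquet files cannot pile up.
> shutil.rmtree(some_path, ignore_errors=True)
>
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_to_dataset(table, root_path=some_path, partition_cols=['a'])
> {code}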



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2098) [Python] Implement "errors as null" option when coercing Python object arrays to Arrow format

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2098:

Labels: parquet  (was: )

> [Python] Implement "errors as null" option when coercing Python object arrays 
> to Arrow format
> -
>
> Key: ARROW-2098
> URL: https://issues.apache.org/jira/browse/ARROW-2098
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
>
> Inspired by 
> https://stackoverflow.com/questions/48611998/type-error-on-first-steps-with-apache-parquet
>  where the user has a string inside a mostly integer column
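> Until such an option exists, the usual pre-pass looks roughly like this (sketch; the sample values are made up):
> {code:python}
> import pandas as pd
> import pyarrow as pa
>
> values = [1, 2, 'oops', 4]  # mostly-integer column with a stray string
>
> # Coerce conversion errors to null in pandas before handing the data to Arrow.
> clean = pd.to_numeric(pd.Series(values), errors='coerce')
> arr = pa.array(clean, from_pandas=True)  # nulls where coercion failed
> print(arr)
> {code}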



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2598) [Python] table.to_pandas segfault

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2598:

Labels: parquet  (was: )

> [Python]  table.to_pandas segfault
> --
>
> Key: ARROW-2598
> URL: https://issues.apache.org/jira/browse/ARROW-2598
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: jacques
>Priority: Major
>  Labels: parquet
>
> Here is a small snippet which produces a segfault:
> {noformat}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.parquet as pq
> In [3]: pa_ar = pa.array([[], []])
> In [4]: pq.write_table(
>    ...: table=pa.Table.from_arrays([pa_ar],["test"]),
>    ...: where="test5.parquet",
>    ...: compression="snappy",
>    ...: flavor="spark"
>    ...: )
> In [5]: pq.read_table("test5.parquet")
> Out[5]: 
> pyarrow.Table
> test: list
>   child 0, item: null
> In [6]: pq.read_table("test5.parquet").to_pydict()
> Out[6]: OrderedDict([(u'test', [None, None])])
> In [7]: pq.read_table("test5.parquet").to_pandas()
> Segmentation fault
> {noformat}
> I thank you in advance for having this fixed.
> Best, 
> Jacques



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3538:

Summary: [Python] ability to override the automated assignment of uuid for 
filenames when writing datasets  (was: ability to override the automated 
assignment of uuid for filenames when writing datasets)

> [Python] ability to override the automated assignment of uuid for filenames 
> when writing datasets
> -
>
> Key: ARROW-3538
> URL: https://issues.apache.org/jira/browse/ARROW-3538
> Project: Apache Arrow
>  Issue Type: Wish
>Affects Versions: 0.10.0
>Reporter: Ji Xu
>Priority: Major
>  Labels: features, parquet
>
> Say I have a pandas DataFrame {{df}} that I would like to store on disk as a 
> dataset using pyarrow parquet; I would do this:
> {code:java}
> table = pyarrow.Table.from_pandas(df)
> pyarrow.parquet.write_to_dataset(table, root_path=some_path, 
> partition_cols=['a',]){code}
> On disk the dataset would look like something like this:
>  {color:#14892c}some_path{color}
>  {color:#14892c}├── a=1{color}
>  {color:#14892c}├── 4498704937d84fe5abebb3f06515ab2d.parquet{color}
>  {color:#14892c}├── a=2{color}
>  {color:#14892c}├── 8bcfaed8986c4bdba587aaaee532370c.parquet{color}
> *Wished Feature:* It'd be great if I can override the auto-assignment of the 
> long UUID as filename somehow during the *dataset* writing. My purpose is to 
> be able to overwrite the dataset on disk when I have a new version of {{df}}. 
> Currently if I try to write the dataset again, another new uniquely named 
> [UUID].parquet file will be placed next to the old one, with the same, 
> redundant data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2026) [Python] µs timestamps saved as int64 even if use_deprecated_int96_timestamps=True

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2026:

Labels: parquet redshift timestamps  (was: redshift timestamps)

> [Python] µs timestamps saved as int64 even if 
> use_deprecated_int96_timestamps=True
> --
>
> Key: ARROW-2026
> URL: https://issues.apache.org/jira/browse/ARROW-2026
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: OS: Mac OS X 10.13.2
> Python: 3.6.4
> PyArrow: 0.8.0
>Reporter: Diego Argueta
>Priority: Major
>  Labels: parquet, redshift, timestamps
> Fix For: 0.12.0
>
>
> When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, 
> timestamps are only written as 96-bit integers if the timestamp has 
> nanosecond resolution. This is a problem because Amazon Redshift timestamps 
> only have microsecond resolution, but Redshift requires them to be stored in 
> 96-bit format in Parquet files.
> I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps 
> to be written as 96 bits, regardless of resolution. If this is a deliberate 
> design decision, it'd be immensely helpful if it were explicitly documented 
> as part of the argument.
>  
> To reproduce:
>  
> 1. Create a table with a timestamp having microsecond or millisecond 
> resolution, and save it to a Parquet file. Be sure to set 
> `use_deprecated_int96_timestamps` to True.
>  
> {code:java}
> import datetime
> import pyarrow
> from pyarrow import parquet
> schema = pyarrow.schema([
> pyarrow.field('last_updated', pyarrow.timestamp('us')),
> ])
> data = [
> pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')),
> ]
> table = pyarrow.Table.from_arrays(data, ['last_updated'])
> with open('test_file.parquet', 'wb') as fdesc:
> parquet.write_table(table, fdesc,
> use_deprecated_int96_timestamps=True)
> {code}
>  
> 2. Inspect the file. I used parquet-tools:
>  
> {noformat}
> dak@tux ~ $ parquet-tools meta test_file.parquet
> file:         file:/Users/dak/test_file.parquet
> creator:      parquet-cpp version 1.3.2-SNAPSHOT
> file schema:  schema
> 
> last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1:  RC:1 TS:76 OFFSET:4
> 
> last_updated:  INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 
> ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2079) [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't available

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2079:

Summary: [Python] Possibly use `_common_metadata` for schema if `_metadata` 
isn't available  (was: Possibly use `_common_metadata` for schema if 
`_metadata` isn't available)

> [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't 
> available
> --
>
> Key: ARROW-2079
> URL: https://issues.apache.org/jira/browse/ARROW-2079
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Jim Crist
>Priority: Minor
>  Labels: parquet
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not 
> `_metadata`. From what I understand these are intended to contain the dataset 
> schema but not any row group information.
>  
> A few (possibly naive) questions:
>  
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:java}
> if self.metadata_path is not None:
> with self.fs.open(self.metadata_path) as f:
> self.common_metadata = ParquetFile(f).metadata
> else:
> self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`, 
> as the latter is never written by `pyarrow`, and is given by the `_metadata` 
> file instead of `_common_metadata` (as seemingly intended?).
>  
> 2. In `validate_schemas` I believe an option should exist for using the 
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently 
> only writes the former, and as far as I can tell `_common_metadata` does 
> include all the schema information needed.
>  
> Perhaps the logic in `validate_schemas` could be ported over to:
>  
> {code:java}
> if self.schema is not None:
> pass  # schema explicitly provided
> elif self.metadata is not None:
> self.schema = self.metadata.schema
> elif self.common_metadata is not None:
> self.schema = self.common_metadata.schema
> else:
> self.schema = self.pieces[0].get_metadata(open_file).schema{code}
> If these changes are valid, I'd be happy to submit a PR. It's not 100% clear 
> to me the difference between `_common_metadata` and `_metadata`, but I 
> believe the schema in both should be the same. Figured I'd open this for 
> discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3139:

Labels: parquet  (was: )

> [Python] ArrowIOError: Arrow error: Capacity error during read
> --
>
> Key: ARROW-3139
> URL: https://issues.apache.org/jira/browse/ARROW-3139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
> Environment: pandas=0.23.1=py36h637b7d7_0
> pyarrow==0.10.0
>Reporter: Frédérique Vanneste
>Priority: Major
>  Labels: parquet
>
> My assumption: the problem is caused by a large object column containing 
> strings up to 27 characters long. (so that column is much larger than 2GB of 
> strings, chunking issue)
> looks similar to
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  
> Code
>  * basket_plateau= pq.read_table("basket_plateau.parquet")
>  * basket_plateau = pd.read_parquet("basket_plateau.parquet")
> Error produced
>  * ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more 
> than 2147483646 bytes, have 2147483655
> Dataset
>  * Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
>  * 2.7 billion records, 4 columns (int64/object/datetime64/float64)
>  * approx. 90 GB in memory
>  * example of object col: "Fresh Vegetables", "Alcohol Beers", ... (think 
> food retail categories)
> History to bug:
>  * was using older version of pyarrow
>  * tried writing dataset to disk (parquet) and failed
>  * stumbled on https://issues.apache.org/jira/browse/ARROW-2227
>  * upgraded to 0.10
>  * tried writing dataset to disk (parquet) and succeeded
>  * tried reading dataset and failed
>  * looks like a similar case as: 
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reopened ARROW-3139:
-

> [Python] ArrowIOError: Arrow error: Capacity error during read
> --
>
> Key: ARROW-3139
> URL: https://issues.apache.org/jira/browse/ARROW-3139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
> Environment: pandas=0.23.1=py36h637b7d7_0
> pyarrow==0.10.0
>Reporter: Frédérique Vanneste
>Priority: Major
>  Labels: parquet
>
> My assumption: the problem is caused by a large object column containing 
> strings up to 27 characters long (so that column holds well over 2 GB of 
> strings, i.e. a chunking issue).
> Looks similar to 
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  
> Code
>  * basket_plateau= pq.read_table("basket_plateau.parquet")
>  * basket_plateau = pd.read_parquet("basket_plateau.parquet")
> Error produced
>  * ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more 
> than 2147483646 bytes, have 2147483655
> Dataset
>  * Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
>  * 2.7 billion records, 4 columns (int64/object/datetime64/float64)
>  * approx. 90 GB in memory
>  * example of an object column: "Fresh Vegetables", "Alcohol Beers", ... (think 
> food retail categories)
> History of the bug:
>  * was using an older version of pyarrow
>  * tried writing the dataset to disk (parquet) and failed
>  * stumbled on https://issues.apache.org/jira/browse/ARROW-2227
>  * upgraded to 0.10
>  * tried writing the dataset to disk (parquet) and succeeded
>  * tried reading the dataset back and failed
>  * looks like a similar case to: 
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2710) [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2710:

Labels: parquet  (was: )

> [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in 
> multiprocessing
> 
>
> Key: ARROW-2710
> URL: https://issues.apache.org/jira/browse/ARROW-2710
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
> Environment: Tested on several Linux OSs.
>Reporter: Michael Andrews
>Priority: Major
>  Labels: parquet
>
> Unable to open a Parquet file via {{pq.ParquetFile(filename)}} when called 
> from the PyTorch DataLoader in multiprocessing mode. Affects pyarrow 
> versions > 0.7.1.
> As detailed in [https://github.com/apache/arrow/issues/1946].
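> A commonly suggested workaround, sketched below under the assumption of a 
> map-style dataset ({{ParquetRowGroupDataset}} and the file name are made up 
> for illustration), is to pass only the path to the workers and open the 
> {{ParquetFile}} lazily inside each worker process:
> {code:python}
> import pyarrow.parquet as pq
> from torch.utils.data import Dataset, DataLoader
>
> class ParquetRowGroupDataset(Dataset):
>     def __init__(self, path):
>         self.path = path      # only the path gets pickled into worker processes
>         self._pf = None       # per-process ParquetFile handle, opened lazily
>         self._num_row_groups = pq.ParquetFile(path).num_row_groups
>
>     def _file(self):
>         if self._pf is None:  # first access inside this process
>             self._pf = pq.ParquetFile(self.path)
>         return self._pf
>
>     def __len__(self):
>         return self._num_row_groups
>
>     def __getitem__(self, idx):
>         return self._file().read_row_group(idx).to_pandas()
>
> loader = DataLoader(ParquetRowGroupDataset("data.parquet"), num_workers=4,
>                     collate_fn=lambda batch: batch)  # keep DataFrames as-is
> {code}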



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2591) [Python] Segmentationfault issue in pq.write_table

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-2591:

Labels: parquet  (was: )

> [Python] Segmentationfault issue in pq.write_table
> --
>
> Key: ARROW-2591
> URL: https://issues.apache.org/jira/browse/ARROW-2591
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0, 0.9.0
>Reporter: jacques
>Priority: Major
>  Labels: parquet
>
> The context is the following: I am currently dealing with sparse column 
> serialization in Parquet. In some cases many rows are empty, and I can also 
> have columns containing only empty lists.
> However, I get a segmentation fault when I try to write to Parquet those 
> columns filled only with empty lists.
> Here is a simple code snippet that reproduces the segmentation fault:
> {noformat}
> In [1]: import pyarrow as pa
> In [2]: import pyarrow.parquet as pq
> In [3]: pa_ar = pa.array([[],[]],pa.list_(pa.int32()))
> In [4]: table = pa.Table.from_arrays([pa_ar],["test"])
> In [5]: pq.write_table(
>    ...:     table=table,
>    ...:     where="test.parquet",
>    ...:     compression="snappy",
>    ...:     flavor="spark"
>    ...:     )
> Segmentation fault
> {noformat}
> May I have it fixed?
> Best
> Jacques
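> Until this is fixed, a defensive guard before writing may help. This is only 
> a sketch continuing the snippet above; {{has_only_empty_lists}} is a 
> hypothetical helper, not part of pyarrow:
> {code:python}
> def has_only_empty_lists(column):
>     # True when every value in the column is an empty (list-like) value.
>     values = column.to_pandas()
>     return len(values) > 0 and all(hasattr(v, '__len__') and len(v) == 0
>                                    for v in values)
>
> if any(has_only_empty_lists(col) for col in table.columns):
>     raise ValueError("refusing to write: a column contains only empty lists "
>                      "(see ARROW-2591)")
> pq.write_table(table=table, where="test.parquet",
>                compression="snappy", flavor="spark")
> {code}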



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3139.
---
Resolution: Duplicate

duplicate of ARROW-3762 (formerly PARQUET-1239)

> [Python] ArrowIOError: Arrow error: Capacity error during read
> --
>
> Key: ARROW-3139
> URL: https://issues.apache.org/jira/browse/ARROW-3139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
> Environment: pandas=0.23.1=py36h637b7d7_0
> pyarrow==0.10.0
>Reporter: Frédérique Vanneste
>Priority: Major
>
> My assumption: the problem is caused by a large object column containing 
> strings up to 27 characters long (so that column holds well over 2 GB of 
> strings, i.e. a chunking issue).
> Looks similar to 
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  
> Code
>  * basket_plateau= pq.read_table("basket_plateau.parquet")
>  * basket_plateau = pd.read_parquet("basket_plateau.parquet")
> Error produced
>  * ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more 
> than 2147483646 bytes, have 2147483655
> Dataset
>  * Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
>  * 2.7 billion records, 4 columns (int64/object/datetime64/float64)
>  * approx. 90 GB in memory
>  * example of an object column: "Fresh Vegetables", "Alcohol Beers", ... (think 
> food retail categories)
> History of the bug:
>  * was using an older version of pyarrow
>  * tried writing the dataset to disk (parquet) and failed
>  * stumbled on https://issues.apache.org/jira/browse/ARROW-2227
>  * upgraded to 0.10
>  * tried writing the dataset to disk (parquet) and succeeded
>  * tried reading the dataset back and failed
>  * looks like a similar case to: 
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3585) [Python] Update the documentation about Schema & Metadata usage

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3585:

Labels: beginner documentation easyfix newbie parquet  (was: beginner 
documentation easyfix newbie)

> [Python] Update the documentation about Schema & Metadata usage
> ---
>
> Key: ARROW-3585
> URL: https://issues.apache.org/jira/browse/ARROW-3585
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Documentation
>Reporter: Daniel Haviv
>Assignee: Daniel Haviv
>Priority: Trivial
>  Labels: beginner, documentation, easyfix, newbie, parquet
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Reusing the Schema object from a Parquet file written with Spark to write 
> with Pandas fails due to a schema mismatch.
> The culprit is the metadata part of the schema, which each component fills 
> according to its own implementation. More details can be found here: 
> [https://github.com/apache/arrow/issues/2805]
> The documentation should point that out.
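> A short sketch of the comparison that trips people up and one way around it 
> (file names are made up; this is not an official recipe): strip the 
> writer-specific key/value metadata before reusing or comparing the schemas.
> {code:python}
> import pyarrow.parquet as pq
>
> spark_schema = pq.read_schema('written_by_spark.parquet')    # carries Spark metadata
> pandas_schema = pq.read_schema('written_by_pandas.parquet')  # carries pandas metadata
>
> # The embedded metadata differs per writer even when the columns match,
> # so compare/reuse the schemas with the metadata removed.
> print(spark_schema.equals(pandas_schema))
> print(spark_schema.remove_metadata().equals(pandas_schema.remove_metadata()))
> {code}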



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3139:

Summary: [Python] ArrowIOError: Arrow error: Capacity error during read  
(was: [Python]ArrowIOError: Arrow error: Capacity error during read)

> [Python] ArrowIOError: Arrow error: Capacity error during read
> --
>
> Key: ARROW-3139
> URL: https://issues.apache.org/jira/browse/ARROW-3139
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.10.0
> Environment: pandas=0.23.1=py36h637b7d7_0
> pyarrow==0.10.0
>Reporter: Frédérique Vanneste
>Priority: Major
>
> My assumption: the problem is caused by a large object column containing 
> strings up to 27 characters long (so that column holds well over 2 GB of 
> strings, i.e. a chunking issue).
> Looks similar to 
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  
> Code
>  * basket_plateau= pq.read_table("basket_plateau.parquet")
>  * basket_plateau = pd.read_parquet("basket_plateau.parquet")
> Error produced
>  * ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more 
> than 2147483646 bytes, have 2147483655
> Dataset
>  * Pandas dataframe (pandas=0.23.1=py36h637b7d7_0)
>  * 2.7 billion records, 4 columns (int64/object/datetime64/float64)
>  * approx. 90 GB in memory
>  * example of an object column: "Fresh Vegetables", "Alcohol Beers", ... (think 
> food retail categories)
> History of the bug:
>  * was using an older version of pyarrow
>  * tried writing the dataset to disk (parquet) and failed
>  * stumbled on https://issues.apache.org/jira/browse/ARROW-2227
>  * upgraded to 0.10
>  * tried writing the dataset to disk (parquet) and succeeded
>  * tried reading the dataset back and failed
>  * looks like a similar case to: 
> https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3652) [Python] CategoricalIndex is lost after reading back

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3652:

Labels: parquet  (was: )

> [Python] CategoricalIndex is lost after reading back
> 
>
> Key: ARROW-3652
> URL: https://issues.apache.org/jira/browse/ARROW-3652
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Armin Berres
>Priority: Major
>  Labels: parquet
>
> When a {{CategoricalIndex}} is written and read back, the resulting index is 
> no longer categorical.
> {code}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2'])
> df['c1'] = df['c1'].astype('category')
> df = df.set_index(['c1'])
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'test.parquet')
> ref_df = pq.read_pandas('test.parquet').to_pandas()
> print(df.index)
> # CategoricalIndex(['a', 'c'], categories=['a', 'c'], ordered=False, 
> name='c1', dtype='category')
> print(ref_df.index)
> # Index(['a', 'c'], dtype='object', name='c1')
> {code}
> In the metadata, the information is correctly present:
> {code:java}
> {"name": "c1", "field_name": "c1", "pandas_type": "categorical", 
> "numpy_type": "int8", "metadata": {"num_categories": 2, "ordered": false}}
> {code}
>  
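> As a stopgap (a sketch, continuing the snippet above), the categorical dtype 
> can be reapplied to the index after reading, since the index values 
> themselves survive the round trip:
> {code:python}
> ref_df = pq.read_pandas('test.parquet').to_pandas()
> ref_df.index = ref_df.index.astype('category')
> print(ref_df.index)
> # CategoricalIndex(['a', 'c'], categories=['a', 'c'], ordered=False, 
> # name='c1', dtype='category')
> {code}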



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3654) [Python] Column with CategoricalIndex fails to be read back

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3654:

Labels: parquet  (was: )

> [Python] Column with CategoricalIndex fails to be read back
> ---
>
> Key: ARROW-3654
> URL: https://issues.apache.org/jira/browse/ARROW-3654
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Armin Berres
>Priority: Major
>  Labels: parquet
>
> When a column with a {{CategoricalIndex}} is written, the data can never be 
> read back.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2'])
> df['c1'] = df['c1'].astype('category')
> df = df.set_index(['c1'])
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'test.parquet')
> pq.read_pandas('test.parquet').to_pandas()
> {code}
> Results in
> {code}
> KeyError  Traceback (most recent call last)
> ~/venv/mpptool/lib/python3.7/site-packages/pyarrow/pandas_compat.py in 
> _pandas_type_to_numpy_type(pandas_type)
> 676 try:
> --> 677 return _pandas_logical_type_map[pandas_type]
> 678 except KeyError:
> KeyError: 'categorical'
> {code}
> The schema looks good:
> {code}
> column_indexes": [{"name": "c1", "field_name": "c1", "pandas_type": 
> "categorical", "numpy_type": "int8", "metadata": {"num_categories": 2, 
> "ordered": false}}]
> {code}
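> A possible workaround sketch (reusing the imports from the snippet above, 
> not verified against every version): keep {{c1}} as a regular categorical 
> column for the round trip and only set it as the index after reading, so the 
> pandas index metadata never mentions the categorical type:
> {code:python}
> df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2'])
> df['c1'] = df['c1'].astype('category')
>
> # Write without making the categorical column the index.
> pq.write_table(pa.Table.from_pandas(df), 'test.parquet')
>
> ref_df = pq.read_pandas('test.parquet').to_pandas()
> ref_df = ref_df.set_index('c1')   # reapply the index after reading
> {code}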



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-3650) [Python] Mixed column indexes are read back as strings

2018-11-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-3650:

Labels: parquet  (was: )

> [Python] Mixed column indexes are read back as strings 
> ---
>
> Key: ARROW-3650
> URL: https://issues.apache.org/jira/browse/ARROW-3650
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.11.1
>Reporter: Armin Berres
>Priority: Major
>  Labels: parquet
>
> Consider the following example: 
> {code:java}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')],
>                   columns=['a string', pd.to_datetime('2018/01/02')])
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'test.parquet')
> ref_df = pq.read_pandas('test.parquet').to_pandas()
> print(df.columns)
> # Index(['a string', 2018-01-02 00:00:00], dtype='object')
> print(ref_df.columns)
> # Index(['a string', '2018-01-02 00:00:00'], dtype='object')
> {code}
> The serialized data frame has a column index with a string and a datetime 
> entry (this happened when resetting the index of a formerly datetime-only 
> column). When reading it back, the datetime is converted into a string.
> When looking at the schema I find {{"pandas_type": "mixed", "numpy_type": 
> "object"}} before serializing and {{"pandas_type": "unicode", "numpy_type": 
> "object"}} after reading back. So the schema was aware of the mixed type but 
> did not store the actual types.
> The same happens with other types, like numbers, as well. One can produce 
> interesting situations: 
> {{pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1])}} 
> can be written but fails to be read back, as the column index is no longer 
> unique, with '1' showing up twice.
> If this is not a bug but expected behaviour, maybe the user should be warned 
> that information is lost, e.g. with a {{NotImplemented}} exception.
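> Until the behaviour is either fixed or rejected with an error, a defensive 
> sketch (reusing the imports from the snippet above) is to normalise the 
> column labels to strings yourself before writing, so the lossy conversion at 
> least happens explicitly and round-trips predictably:
> {code:python}
> df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')],
>                   columns=['a string', pd.to_datetime('2018/01/02')])
>
> # Make the string conversion explicit instead of relying on the writer.
> df.columns = df.columns.map(str)
>
> pq.write_table(pa.Table.from_pandas(df), 'test.parquet')
> ref_df = pq.read_pandas('test.parquet').to_pandas()
> assert list(ref_df.columns) == list(df.columns)
> {code}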



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


  1   2   >