[jira] [Assigned] (ARROW-3741) [R] Add support for arrow::compute::Cast to convert Arrow arrays from one type to another
[ https://issues.apache.org/jira/browse/ARROW-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Romain François reassigned ARROW-3741:
--------------------------------------

    Assignee: Romain François

> [R] Add support for arrow::compute::Cast to convert Arrow arrays from one type to another
> -----------------------------------------------------------------------------------------
>
> Key: ARROW-3741
> URL: https://issues.apache.org/jira/browse/ARROW-3741
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Wes McKinney
> Assignee: Romain François
> Priority: Major
>
> See {{pyarrow.Array.cast}} and {{pyarrow.Table.cast}}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (ARROW-3787) Implement From for BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-3787:
----------------------------------

    Labels: pull-request-available  (was: )

> Implement From for BinaryArray
> ------------------------------
>
> Key: ARROW-3787
> URL: https://issues.apache.org/jira/browse/ARROW-3787
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Paddy Horan
> Assignee: Paddy Horan
> Priority: Major
> Labels: pull-request-available
[jira] [Created] (ARROW-3787) Implement From for BinaryArray
Paddy Horan created ARROW-3787:
-------------------------------

Summary: Implement From for BinaryArray
Key: ARROW-3787
URL: https://issues.apache.org/jira/browse/ARROW-3787
Project: Apache Arrow
Issue Type: Improvement
Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan
[jira] [Created] (ARROW-3786) Enable merge_arrow_pr.py script to run in non-English JIRA accounts.
Yosuke Shiro created ARROW-3786:
--------------------------------

Summary: Enable merge_arrow_pr.py script to run in non-English JIRA accounts.
Key: ARROW-3786
URL: https://issues.apache.org/jira/browse/ARROW-3786
Project: Apache Arrow
Issue Type: Improvement
Components: Developer Tools
Reporter: Yosuke Shiro

I followed the instructions at [https://github.com/apache/arrow/tree/master/dev#arrow-developer-scripts] and ran:

{code:java}
dev/merge_arrow_pr.py{code}

I got the following result:

{code:java}
Would you like to update the associated JIRA? (y/n): y
Enter comma-separated fix version(s) [0.12.0]:
=== JIRA ARROW-3748 ===
summary    [GLib] Add GArrowCSVReader
assignee   Kouhei Sutou
status     オープン
url        https://issues.apache.org/jira/browse/ARROW-3748
list index out of range{code}

It looks like an error at [https://github.com/apache/arrow/blob/master/dev/merge_arrow_pr.py#L181]. My JIRA account language is Japanese, and the script does not seem to work when the account language is not English:

{code:java}
print(self.jira_con.transitions(self.jira_id))
[{'id': '701', 'name': '課題のクローズ',
  'to': {'self': 'https://issues.apache.org/jira/rest/api/2/status/6',
         'description': '課題の検討が終了し、解決方法が正しいことを表します。クローズした課題は再オープンすることができます。',
         'iconUrl': 'https://issues.apache.org/jira/images/icons/statuses/closed.png',
         'name': 'クローズ', 'id': '6',
         'statusCategory': {'self': 'https://issues.apache.org/jira/rest/api/2/statuscategory/3',
                            'id': 3, 'key': 'done', 'colorName': 'green', 'name': '完了'}}},
 {'id': '3', 'name': '課題を再オープンする',
  'to': {'self': 'https://issues.apache.org/jira/rest/api/2/status/4',
         'description': '課題が一度解決されたが解決に間違いがあったと見なされたことを表します。ここから課題を割り当て済みにするか解決済みに設定できます。',
         'iconUrl': 'https://issues.apache.org/jira/images/icons/statuses/reopened.png',
         'name': '再オープン', 'id': '4',
         'statusCategory': {'self': 'https://issues.apache.org/jira/rest/api/2/statuscategory/2',
                            'id': 2, 'key': 'new', 'colorName': 'blue-gray', 'name': 'To Do'}}}]{code}
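The {{list index out of range}} appears to come from looking up transitions by their English display name and indexing the (empty) result. A locale-independent lookup can match on the status category key instead, which JIRA reports in English ("done", "new", ...) regardless of the account language. The following is an illustrative standalone sketch, not the actual merge_arrow_pr.py code; {{find_transition}} is a hypothetical helper:

```python
def find_transition(transitions, category_key="done"):
    """Return the first transition whose target status belongs to the
    given status category, or None if there is no such transition.

    Matching on 'statusCategory' -> 'key' avoids comparing against
    localized transition names like '課題のクローズ'.
    """
    for t in transitions:
        status = t.get("to", {})
        category = status.get("statusCategory", {})
        if category.get("key") == category_key:
            return t
    return None

# Transitions as returned by a Japanese-locale JIRA account (abridged
# from the output quoted in the issue).
transitions = [
    {"id": "701", "name": "課題のクローズ",
     "to": {"statusCategory": {"key": "done"}}},
    {"id": "3", "name": "課題を再オープンする",
     "to": {"statusCategory": {"key": "new"}}},
]

close = find_transition(transitions, "done")
```

Matching on the category key also keeps the script working if an administrator renames a workflow transition.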
[jira] [Updated] (ARROW-2956) [Python] Arrow plasma throws ArrowIOError and process crashed
[ https://issues.apache.org/jira/browse/ARROW-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2956:
--------------------------------

    Summary: [Python] Arrow plasma throws ArrowIOError and process crashed  (was: [Python]Arrow plasma throws ArrowIOError and process crashed)

> [Python] Arrow plasma throws ArrowIOError and process crashed
> -------------------------------------------------------------
>
> Key: ARROW-2956
> URL: https://issues.apache.org/jira/browse/ARROW-2956
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: He Kaisheng
> Priority: Major
>
> Hello,
> We start a plasma store with 100k memory. When storage is full, it throws ArrowIOError and the *process crashed*, not the expected PlasmaStoreFull error.
> Code:
> {code:java}
> import pyarrow.plasma as plasma
> import numpy as np
>
> plasma_client = plasma.connect(plasma_socket, '', 0)
> ref = []
> for _ in range(1000):
>     obj_id = plasma_client.put(np.random.randint(100, size=(100, 100), dtype=np.int16))
>     data = plasma_client.get(obj_id)
>     ref.append(data)
> {code}
> Error:
> {noformat}
> ---------------------------------------------------------------------------
> ArrowIOError                              Traceback (most recent call last)
> in ()
>       2 ref = []
>       3 for _ in range(1000):
> ----> 4     obj_id = plasma_client.put(np.random.randint(100, size=(100, 100), dtype=np.int16))
>       5     data = plasma_client.get(obj_id)
>       6     ref.append(data)
> plasma.pyx in pyarrow.plasma.PlasmaClient.put()
> plasma.pyx in pyarrow.plasma.PlasmaClient.create()
> error.pxi in pyarrow.lib.check_status()
> ArrowIOError: Encountered unexpected EOF{noformat}
> This problem doesn't exist when dtype is np.int64 or shared memory is larger (more than 100M, say). It seems strange; does anybody know the reason? Thanks a lot.
[jira] [Created] (ARROW-3785) [C++] Use double-conversion conda package in CI toolchain
Wes McKinney created ARROW-3785:
--------------------------------

Summary: [C++] Use double-conversion conda package in CI toolchain
Key: ARROW-3785
URL: https://issues.apache.org/jira/browse/ARROW-3785
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 0.12.0

This is currently being built from the EP (CMake ExternalProject).
[jira] [Updated] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16
[ https://issues.apache.org/jira/browse/ARROW-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-3780:
----------------------------------

    Labels: pull-request-available spark  (was: spark)

> [R] Failed to fetch data: invalid data when collecting int16
> ------------------------------------------------------------
>
> Key: ARROW-3780
> URL: https://issues.apache.org/jira/browse/ARROW-3780
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Javier Luraschi
> Priority: Major
> Labels: pull-request-available, spark
> Fix For: 0.12.0
>
> Repro from a sparklyr unit test:
> {code:java}
> library(dplyr)
> library(sparklyr)
> library(arrow)
>
> sc <- spark_connect(master = "local")
>
> hive_type <- tibble::frame_data(
>   ~stype,     ~svalue, ~rtype,    ~rvalue, ~arrow,
>   "smallint", "1",     "integer", "1",     "integer",
> )
>
> spark_query <- hive_type %>%
>   mutate(
>     query = paste0("cast(", svalue, " as ", stype, ") as ", gsub("\\(|\\)", "", stype), "_col")
>   ) %>%
>   pull(query) %>%
>   paste(collapse = ", ") %>%
>   paste("SELECT", .)
>
> spark_types <- DBI::dbGetQuery(sc, spark_query) %>%
>   lapply(function(e) class(e)[[1]]) %>%
>   as.character(){code}
> Actual: error: Failed to fetch data: invalid data
[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
[ https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685945#comment-16685945 ]

Wes McKinney commented on ARROW-3781:
-------------------------------------

It would definitely require some design work. In https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.cc#L104 you would need to use a buffer pool of some kind, so that if Flush is holding a temporary buffer, Write can write to a new buffer.

In any case, it's out of scope for this issue. Once we have file system implementations for one or more cloud services, we can use benchmarks to drive the development. In the meantime, a mock remote file system with configurable write latency could help with throughput tests.

> [C++] Configure buffer size in arrow::io::BufferedOutputStream
> --------------------------------------------------------------
>
> Key: ARROW-3781
> URL: https://issues.apache.org/jira/browse/ARROW-3781
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.12.0
>
> This is hard-coded to 4096 right now. For higher latency file systems it may be desirable to use a larger buffer. See also ARROW-3777 about performance testing for high latency files.
[jira] [Created] (ARROW-3784) [R] Array with type fails with x is not a vector
Javier Luraschi created ARROW-3784:
-----------------------------------

Summary: [R] Array with type fails with x is not a vector
Key: ARROW-3784
URL: https://issues.apache.org/jira/browse/ARROW-3784
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Javier Luraschi

{code:java}
array(1:10, type = int32())
{code}
Actual:
{code:java}
Error: `x` is not a vector
{code}
Expected:
{code:java}
arrow::Array
[ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ]
{code}
[jira] [Updated] (ARROW-3784) [R] Array with type fails with x is not a vector
[ https://issues.apache.org/jira/browse/ARROW-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-3784:
----------------------------------

    Labels: pull-request-available  (was: )

> [R] Array with type fails with x is not a vector
> ------------------------------------------------
>
> Key: ARROW-3784
> URL: https://issues.apache.org/jira/browse/ARROW-3784
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Javier Luraschi
> Priority: Major
> Labels: pull-request-available
[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
[ https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685817#comment-16685817 ]

Antoine Pitrou commented on ARROW-3781:
---------------------------------------

We may want to think about flushing in a separate thread, then.
[jira] [Comment Edited] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
[ https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685812#comment-16685812 ]

Wes McKinney edited comment on ARROW-3781 at 11/13/18 10:07 PM:
----------------------------------------------------------------

What I mean is that if I call {{out->Flush()}} it may not be safe to continue to call {{out->Write(...)}} until the flush completes. So my proposal was to think about devising a buffered output stream where a writer thread can continue writing while a Flush is in progress. The current {{BufferedOutputStream}} holds a mutex during Flush, so further writes are not possible.

was (Author: wesmckinn):
What I mean is that if I call {{out->Flush()}} it may not be safe to continue to call {{out->Write(...)}} until the flush completes. So my proposal was to think about devising a buffered output stream where a writer thread can continue writing while a Flush is in progress.
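The buffer-swap idea discussed in this thread can be sketched in a few lines: take a fresh buffer under the lock before flushing, so the writer can keep appending while the previous buffer drains in a background thread. This is an illustrative standalone sketch in Python (a hypothetical {{DoubleBufferedWriter}}, not Arrow's actual BufferedOutputStream):

```python
import io
import threading

class DoubleBufferedWriter:
    """Writes go to an in-memory buffer; flush_async() swaps in a fresh
    buffer and drains the full one on a background thread, so write()
    never blocks on a slow sink."""

    def __init__(self, sink):
        self.sink = sink            # anything with a write(bytes) method
        self.buffer = bytearray()
        self.lock = threading.Lock()
        self._flusher = None        # at most one flush in flight

    def write(self, data: bytes):
        with self.lock:
            self.buffer.extend(data)

    def flush_async(self):
        # Swap the active buffer under the lock; write() can proceed
        # immediately against the new (empty) buffer.
        with self.lock:
            full, self.buffer = self.buffer, bytearray()
        # Join the previous flush first so sink writes stay ordered.
        if self._flusher is not None:
            self._flusher.join()
        self._flusher = threading.Thread(
            target=self.sink.write, args=(bytes(full),))
        self._flusher.start()

    def close(self):
        self.flush_async()
        if self._flusher is not None:
            self._flusher.join()

# Demo: the writer keeps writing while "hello " is being flushed.
sink = io.BytesIO()
w = DoubleBufferedWriter(sink)
w.write(b"hello ")
w.flush_async()          # hands "hello " to a background thread
w.write(b"world")        # not blocked by the in-flight flush
w.close()
```

Joining the previous flush before starting the next one is the simplest way to keep sink writes ordered; a real implementation would likely use a buffer pool and a dedicated flush thread instead.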
[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
[ https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685812#comment-16685812 ]

Wes McKinney commented on ARROW-3781:
-------------------------------------

What I mean is that if I call {{out->Flush()}} it may not be safe to continue to call {{out->Write(...)}} until the flush completes. So my proposal was to think about devising a buffered output stream where a writer thread can continue writing while a Flush is in progress.
[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
[ https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685809#comment-16685809 ]

Wes McKinney commented on ARROW-3781:
-------------------------------------

Sorry, I'm again using "file systems" loosely here. TensorFlow and other projects call their integrations with other file storage systems "file systems", e.g. https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/s3/s3_file_system.h#L25

I am not sure a Write or Flush into S3 is necessarily going to be asynchronous. The TensorFlow implementation of Flush blocks until the PutRequest is completed: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/platform/s3/s3_file_system.cc#L238
[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
[ https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685791#comment-16685791 ]

Antoine Pitrou commented on ARROW-3781:
---------------------------------------

Are you thinking about the `Flush` method? It's as asynchronous as `Write` is.
[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
[ https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685779#comment-16685779 ]

Wes McKinney commented on ARROW-3781:
-------------------------------------

I'm thinking about the "file systems" HDFS, AWS S3, Google Cloud Storage, and Azure Blob Storage, all of which can be pretty high latency for writes.
[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
[ https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685784#comment-16685784 ]

Wes McKinney commented on ARROW-3781:
-------------------------------------

For cloud stores, at some point we might also want to consider asynchronous flushing, to mitigate latency when a flush triggers (so the writer thread can begin to buffer the next chunk).
[jira] [Updated] (ARROW-3783) [R] Incorrect collection of float type
[ https://issues.apache.org/jira/browse/ARROW-3783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-3783:
----------------------------------

    Labels: pull-request-available  (was: )

> [R] Incorrect collection of float type
> --------------------------------------
>
> Key: ARROW-3783
> URL: https://issues.apache.org/jira/browse/ARROW-3783
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Javier Luraschi
> Priority: Major
> Labels: pull-request-available
>
> Repro from `sparklyr`:
> {code:java}
> library(sparklyr)
> library(arrow)
>
> sc <- spark_connect(master = "local")
> DBI::dbGetQuery(sc, "SELECT cast(1 as float)"){code}
> Actual:
> {code:java}
>   CAST(1 AS FLOAT)
> 1       1065353216{code}
> Expected:
> {code:java}
>   CAST(1 AS FLOAT)
> 1                1{code}
[jira] [Created] (ARROW-3783) [R] Incorrect collection of float type
Javier Luraschi created ARROW-3783:
-----------------------------------

Summary: [R] Incorrect collection of float type
Key: ARROW-3783
URL: https://issues.apache.org/jira/browse/ARROW-3783
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Javier Luraschi

Repro from `sparklyr`:
{code:java}
library(sparklyr)
library(arrow)

sc <- spark_connect(master = "local")
DBI::dbGetQuery(sc, "SELECT cast(1 as float)")
{code}
Actual:
{code:java}
  CAST(1 AS FLOAT)
1       1065353216
{code}
Expected:
{code:java}
  CAST(1 AS FLOAT)
1                1
{code}
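The wrong value is a giveaway: 1065353216 (0x3F800000) is the IEEE 754 bit pattern of the 32-bit float 1.0 reinterpreted as an int32, i.e. the float column's raw bits are being collected as integers instead of being converted. This can be verified with just the standard library:

```python
import struct

# Pack 1.0 as a little-endian 32-bit float, then reinterpret the same
# four bytes as a little-endian int32.
bits = struct.unpack("<i", struct.pack("<f", 1.0))[0]
print(bits)  # 1065353216
```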
[jira] [Created] (ARROW-3782) [C++] Implement BufferedReader for C++
Wes McKinney created ARROW-3782:
--------------------------------

Summary: [C++] Implement BufferedReader for C++
Key: ARROW-3782
URL: https://issues.apache.org/jira/browse/ARROW-3782
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Wes McKinney
Fix For: 0.12.0

This will be the reader companion to {{arrow::io::BufferedOutputStream}} and a C++-like version of the {{io.BufferedReader}} class in the Python standard library: https://docs.python.org/3/library/io.html#io.BufferedReader

We already have a partial version of this that's used in the Parquet library: https://github.com/apache/arrow/blob/master/cpp/src/parquet/util/memory.h#L413

In particular we need:
* Seek implemented for random access (it will invalidate the buffer)
* A Peek method returning {{shared_ptr<Buffer>}}, a zero-copy view into buffered memory

This is needed for ARROW-3126.
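For reference, the Python class cited above already exhibits the desired semantics: io.BufferedReader.peek() returns buffered bytes without advancing the stream position, and seek() gives random access (discarding/refilling the buffer as needed). A quick standard-library demonstration:

```python
import io

raw = io.BytesIO(b"hello world")
reader = io.BufferedReader(raw, buffer_size=4096)

head = reader.peek(5)               # look at buffered data without consuming it
assert head.startswith(b"hello")
assert reader.read(5) == b"hello"   # the position only advances on read()

reader.seek(0)                      # random access; the buffer is refilled
assert reader.read(2) == b"he"
```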
[jira] [Resolved] (ARROW-3306) [R] Objects and support functions different kinds of arrow::Buffer
[ https://issues.apache.org/jira/browse/ARROW-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney resolved ARROW-3306.
---------------------------------

    Resolution: Fixed
    Assignee: Romain François

This was resolved in passing.

> [R] Objects and support functions for different kinds of arrow::Buffer
> ----------------------------------------------------------------------
>
> Key: ARROW-3306
> URL: https://issues.apache.org/jira/browse/ARROW-3306
> Project: Apache Arrow
> Issue Type: New Feature
> Components: R
> Reporter: Wes McKinney
> Assignee: Romain François
> Priority: Major
> Fix For: 0.12.0
[jira] [Commented] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
[ https://issues.apache.org/jira/browse/ARROW-3781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685727#comment-16685727 ]

Antoine Pitrou commented on ARROW-3781:
---------------------------------------

I don't think it's dependent on filesystem latency. Unless the filesystem implementation is broken, writing should be asynchronous (i.e. the `Write` call returns before the OS actually flushes the buffer to disk or to the network). The point of the buffer is to avoid paying the cost of a system call (and userspace/kernel transition) for every tiny write. But we can make the buffer size configurable regardless.
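The amortization described here is easy to observe by counting how often a buffered writer actually touches the underlying raw stream. A standard-library sketch, where {{CountingRaw}} is a stand-in for a real file or socket:

```python
import io

class CountingRaw(io.RawIOBase):
    """A raw sink that counts write() calls (stand-in for a file/socket)."""
    def __init__(self):
        super().__init__()
        self.calls = 0
    def writable(self):
        return True
    def write(self, b):
        self.calls += 1
        return len(b)

raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=4096)
for _ in range(1000):
    buffered.write(b"x")   # 1000 tiny writes land in the buffer...
buffered.flush()           # ...and reach the raw sink in a single call
```

With no buffering, each one-byte write would be a separate call into the sink (and, for a real file, a separate system call).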
[jira] [Updated] (ARROW-2237) [Python] [Plasma] Huge pages test failure
[ https://issues.apache.org/jira/browse/ARROW-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-2237:
--------------------------------

    Summary: [Python] [Plasma] Huge pages test failure  (was: [Python] Huge tables test failure)

> [Python] [Plasma] Huge pages test failure
> -----------------------------------------
>
> Key: ARROW-2237
> URL: https://issues.apache.org/jira/browse/ARROW-2237
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Antoine Pitrou
> Priority: Major
> Fix For: 0.12.0
>
> This is a new failure here (Ubuntu 16.04, x86-64):
> {code}
> _________________________ test_use_huge_pages _________________________
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 779, in test_use_huge_pages
>     create_object(plasma_client, 1)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 80, in create_object
>     seal=seal)
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 69, in create_object_with_id
>     memory_buffer = client.create(object_id, data_size, metadata)
>   File "plasma.pyx", line 302, in pyarrow.plasma.PlasmaClient.create
>   File "error.pxi", line 79, in pyarrow.lib.check_status
> pyarrow.lib.ArrowIOError: /home/antoine/arrow/cpp/src/plasma/client.cc:192 code: PlasmaReceive(store_conn_, MessageType_PlasmaCreateReply, )
> /home/antoine/arrow/cpp/src/plasma/protocol.cc:46 code: ReadMessage(sock, , buffer)
> Encountered unexpected EOF
>
> Captured stderr call
> --------------------
> Allowing the Plasma store to use up to 0.1GB of memory.
> Starting object store with directory /mnt/hugepages and huge page support enabled
> create_buffer failed to open file /mnt/hugepages/plasmapSNc0X
> {code}
[jira] [Created] (ARROW-3781) [C++] Configure buffer size in arrow::io::BufferedOutputStream
Wes McKinney created ARROW-3781:
--------------------------------

Summary: [C++] Configure buffer size in arrow::io::BufferedOutputStream
Key: ARROW-3781
URL: https://issues.apache.org/jira/browse/ARROW-3781
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Wes McKinney
Fix For: 0.12.0

This is hard-coded to 4096 right now. For higher latency file systems it may be desirable to use a larger buffer. See also ARROW-3777 about performance testing for high latency files.
[jira] [Updated] (ARROW-2807) [Python] Enable memory-mapping to be toggled in get_reader when reading Parquet files
[ https://issues.apache.org/jira/browse/ARROW-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-2807:
----------------------------------

    Labels: parquet pull-request-available  (was: parquet)

> [Python] Enable memory-mapping to be toggled in get_reader when reading Parquet files
> -------------------------------------------------------------------------------------
>
> Key: ARROW-2807
> URL: https://issues.apache.org/jira/browse/ARROW-2807
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Labels: parquet, pull-request-available
> Fix For: 0.12.0
>
> See relevant discussion in ARROW-2654
[jira] [Commented] (ARROW-3344) [Python] test_plasma.py fails (in test_plasma_list)
[ https://issues.apache.org/jira/browse/ARROW-3344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685711#comment-16685711 ]

Wes McKinney commented on ARROW-3344:
-------------------------------------

This bug is still present for me on Ubuntu 14.04:

{code}
pyarrow/tests/test_plasma.py::test_plasma_list FAILED [ 83%]

>>> captured stderr >>>
../src/plasma/store.cc:1000: Allowing the Plasma store to use up to 0.1GB of memory.
../src/plasma/store.cc:1030: Starting object store with directory /dev/shm and huge page support disabled

>> traceback >>
    @pytest.mark.plasma
    def test_plasma_list():
        import pyarrow.plasma as plasma
        with plasma.start_plasma_store(
                plasma_store_memory=DEFAULT_PLASMA_STORE_MEMORY) \
                as (plasma_store_name, p):
            plasma_client = plasma.connect(plasma_store_name, "", 0)

            # Test sizes
            u, _, _ = create_object(plasma_client, 11, metadata_size=7, seal=False)
            l1 = plasma_client.list()
            assert l1[u]["data_size"] == 11
            assert l1[u]["metadata_size"] == 7

            # Test ref_count
            v = plasma_client.put(np.zeros(3))
            l2 = plasma_client.list()
            # Ref count has already been released
            assert l2[v]["ref_count"] == 0
            a = plasma_client.get(v)
            l3 = plasma_client.list()
>           assert l3[v]["ref_count"] == 1
E           assert 0 == 1

pyarrow/tests/test_plasma.py:966: AssertionError

entering PDB
> /home/wesm/code/arrow/python/pyarrow/tests/test_plasma.py(966)test_plasma_list()
-> assert l3[v]["ref_count"] == 1
{code}

> [Python] test_plasma.py fails (in test_plasma_list)
> ---------------------------------------------------
>
> Key: ARROW-3344
> URL: https://issues.apache.org/jira/browse/ARROW-3344
> Project: Apache Arrow
> Issue Type: Bug
> Components: Plasma (C++), Python
> Reporter: Antoine Pitrou
> Priority: Major
>
> I routinely get the following failure in {{test_plasma.py}}:
> {code}
> Traceback (most recent call last):
>   File "/home/antoine/arrow/python/pyarrow/tests/test_plasma.py", line 825, in test_plasma_list
>     assert l3[v]["ref_count"] == 1
> AssertionError: assert 0 == 1
>
> Captured stderr call
> --------------------
> ../src/plasma/store.cc:926: Allowing the Plasma store to use up to 0.1GB of memory.
> ../src/plasma/store.cc:956: Starting object store with directory /dev/shm and huge page support disabled
> {code}
> I'm not sure whether there's something wrong in my setup (on Ubuntu 18.04, x86-64), or it's a genuine bug.
[jira] [Assigned] (ARROW-2807) [Python] Enable memory-mapping to be toggled in get_reader when reading Parquet files
[ https://issues.apache.org/jira/browse/ARROW-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-2807:
-----------------------------------

    Assignee: Wes McKinney

> [Python] Enable memory-mapping to be toggled in get_reader when reading Parquet files
> -------------------------------------------------------------------------------------
>
> Key: ARROW-2807
> URL: https://issues.apache.org/jira/browse/ARROW-2807
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Assignee: Wes McKinney
> Priority: Major
> Labels: parquet
> Fix For: 0.12.0
>
> See relevant discussion in ARROW-2654
[jira] [Commented] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16
[ https://issues.apache.org/jira/browse/ARROW-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685675#comment-16685675 ]

Wes McKinney commented on ARROW-3780:
-------------------------------------

I was pretty sure this non-specific error message was going to rear its ugly head: https://github.com/apache/arrow/blob/202265fbb67685f1ed179ba080a85b48fbd53adc/r/src/arrow_types.h#L36
[jira] [Commented] (ARROW-3779) [Python] Validate timezone passed to pa.timestamp
[ https://issues.apache.org/jira/browse/ARROW-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685682#comment-16685682 ]

Krisztian Szucs commented on ARROW-3779:
----------------------------------------

Renamed. I created the issue before I saw that... In the long term we should validate it on the C++ side.

> [Python] Validate timezone passed to pa.timestamp
> -------------------------------------------------
>
> Key: ARROW-3779
> URL: https://issues.apache.org/jira/browse/ARROW-3779
> Project: Apache Arrow
> Issue Type: Improvement
> Reporter: Krisztian Szucs
> Priority: Major
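As a sketch of what such validation could look like on the Python side (a hypothetical {{validate_timezone}} helper, not pyarrow's actual API): accept IANA zone names via the standard-library zoneinfo module, plus fixed offsets of the form "+HH:MM"/"-HH:MM", which are roughly the forms the Arrow format documentation mentions for the timestamp type's timezone string.

```python
import re
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError

# Fixed UTC offsets like "+05:30" or "-08:00".
_OFFSET_RE = re.compile(r"^[+-]\d{2}:\d{2}$")

def validate_timezone(tz: str) -> bool:
    """Return True if tz is a fixed offset or a resolvable IANA name."""
    if _OFFSET_RE.match(tz):
        return True
    try:
        ZoneInfo(tz)          # raises if the zone cannot be found
        return True
    except (ZoneInfoNotFoundError, ValueError):
        return False
```

A real implementation would raise a descriptive error rather than return a boolean, so the user sees which timezone string was rejected.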
[jira] [Updated] (ARROW-3779) [Python] Validate timezone passed to pa.timestamp
[ https://issues.apache.org/jira/browse/ARROW-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krisztian Szucs updated ARROW-3779:
-----------------------------------

    Summary: [Python] Validate timezone passed to pa.timestamp  (was: [Format] Standardize timezone specification)
[jira] [Updated] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16
[ https://issues.apache.org/jira/browse/ARROW-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3780:
--------------------------------

    Labels: spark  (was: )
[jira] [Updated] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16
[ https://issues.apache.org/jira/browse/ARROW-3780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3780: Fix Version/s: 0.12.0 > [R] Failed to fetch data: invalid data when collecting int16 > > > Key: ARROW-3780 > URL: https://issues.apache.org/jira/browse/ARROW-3780 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Javier Luraschi >Priority: Major > Labels: spark > Fix For: 0.12.0 > > > Repro from sparklyr unit test: > {code:java} > library(dplyr) > library(sparklyr) > library(arrow) > sc <- spark_connect(master = "local") > hive_type <- tibble::frame_data( > ~stype, ~svalue, ~rtype, ~rvalue, ~arrow, > "smallint", "1", "integer", "1", "integer", > ) > spark_query <- hive_type %>% > mutate( > query = paste0("cast(", svalue, " as ", stype, ") as ", gsub("\\(|\\)", "", > stype), "_col") > ) %>% > pull(query) %>% > paste(collapse = ", ") %>% > paste("SELECT", .) > spark_types <- DBI::dbGetQuery(sc, spark_query) %>% > lapply(function(e) class(e)[[1]]) %>% > as.character(){code} > Actual: error: Failed to fetch data: invalid data -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3780) [R] Failed to fetch data: invalid data when collecting int16
Javier Luraschi created ARROW-3780: -- Summary: [R] Failed to fetch data: invalid data when collecting int16 Key: ARROW-3780 URL: https://issues.apache.org/jira/browse/ARROW-3780 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Javier Luraschi Repro from sparklyr unit test: {code:java} library(dplyr) library(sparklyr) library(arrow) sc <- spark_connect(master = "local") hive_type <- tibble::frame_data( ~stype, ~svalue, ~rtype, ~rvalue, ~arrow, "smallint", "1", "integer", "1", "integer", ) spark_query <- hive_type %>% mutate( query = paste0("cast(", svalue, " as ", stype, ") as ", gsub("\\(|\\)", "", stype), "_col") ) %>% pull(query) %>% paste(collapse = ", ") %>% paste("SELECT", .) spark_types <- DBI::dbGetQuery(sc, spark_query) %>% lapply(function(e) class(e)[[1]]) %>% as.character(){code} Actual: error: Failed to fetch data: invalid data -- This message was sent by Atlassian JIRA (v7.6.3#76005)
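Editor's note: the report above does not identify the root cause, but "invalid data" when collecting int16 is the classic symptom of a buffer being decoded at the wrong integer width. A small illustration of that failure mode, using only the standard library:

```python
import struct

# Pack four little-endian int16 values into a raw buffer.
buf = struct.pack("<4h", 1, 2, 3, 4)

# Decoding at the correct width recovers the values...
as_int16 = struct.unpack("<4h", buf)
# ...while decoding the same bytes at the wrong width fuses adjacent
# values into garbage integers.
as_int32 = struct.unpack("<2i", buf)

assert as_int16 == (1, 2, 3, 4)
assert as_int32 == (131073, 262147)  # 0x00020001, 0x00040003
```

This is an illustration of the general hazard, not a claim about where the bug in the R bindings actually lies.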
[jira] [Commented] (ARROW-3779) [Format] Standardize timezone specification
[ https://issues.apache.org/jira/browse/ARROW-3779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685672#comment-16685672 ] Wes McKinney commented on ARROW-3779: - What do we need to do beyond what's in Schema.fbs? https://github.com/apache/arrow/blob/master/format/Schema.fbs#L162 > [Format] Standardize timezone specification > --- > > Key: ARROW-3779 > URL: https://issues.apache.org/jira/browse/ARROW-3779 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Krisztian Szucs >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3779) [Format] Standardize timezone specification
Krisztian Szucs created ARROW-3779: -- Summary: [Format] Standardize timezone specification Key: ARROW-3779 URL: https://issues.apache.org/jira/browse/ARROW-3779 Project: Apache Arrow Issue Type: Improvement Reporter: Krisztian Szucs -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3778) [C++] Don't put implementations in test-util.h
[ https://issues.apache.org/jira/browse/ARROW-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3778: Fix Version/s: 0.12.0 > [C++] Don't put implementations in test-util.h > -- > > Key: ARROW-3778 > URL: https://issues.apache.org/jira/browse/ARROW-3778 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.1 >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.12.0 > > > {{test-util.h}} is included in most (all?) test files, and it's quite long to > compile because it includes many other files and recompiles helper functions > all the time. Instead we should have only declarations in {{test-util.h}} and > put implementations in a separate {{.cc}} file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3778) [C++] Don't put implementations in test-util.h
[ https://issues.apache.org/jira/browse/ARROW-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685666#comment-16685666 ] Wes McKinney commented on ARROW-3778: - Agreed. I had partly done this in https://github.com/apache/arrow/pull/2704, so if you wanted to use just the arrow/util/testing.h/testing.cc changes from there go ahead > [C++] Don't put implementations in test-util.h > -- > > Key: ARROW-3778 > URL: https://issues.apache.org/jira/browse/ARROW-3778 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.1 >Reporter: Antoine Pitrou >Priority: Major > > {{test-util.h}} is included in most (all?) test files, and it's quite long to > compile because it includes many other files and recompiles helper functions > all the time. Instead we should have only declarations in {{test-util.h}} and > put implementations in a separate {{.cc}} file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3738: -- Labels: csv pull-request-available (was: csv) > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: csv, pull-request-available > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-3778) [C++] Don't put implementations in test-util.h
Antoine Pitrou created ARROW-3778: - Summary: [C++] Don't put implementations in test-util.h Key: ARROW-3778 URL: https://issues.apache.org/jira/browse/ARROW-3778 Project: Apache Arrow Issue Type: Improvement Components: C++ Affects Versions: 0.11.1 Reporter: Antoine Pitrou {{test-util.h}} is included in most (all?) test files, and it's quite long to compile because it includes many other files and recompiles helper functions all the time. Instead we should have only declarations in {{test-util.h}} and put implementations in a separate {{.cc}} file. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1837) [Java] Unable to read unsigned integers outside signed range for bit width in integration tests
[ https://issues.apache.org/jira/browse/ARROW-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1837: Fix Version/s: (was: 0.12.0) 0.13.0 > [Java] Unable to read unsigned integers outside signed range for bit width in > integration tests > --- > > Key: ARROW-1837 > URL: https://issues.apache.org/jira/browse/ARROW-1837 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Wes McKinney >Priority: Blocker > Labels: columnar-format-1.0 > Fix For: 0.13.0 > > Attachments: generated_primitive.json > > > I believe this was introduced recently (perhaps in the refactors), but there > was a problem where the integration tests weren't being properly run that hid > the error from us > see https://github.com/apache/arrow/pull/1294#issuecomment-345553066 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1875) Write 64-bit ints as strings in integration test JSON files
[ https://issues.apache.org/jira/browse/ARROW-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1875: Fix Version/s: (was: 0.12.0) 0.13.0 > Write 64-bit ints as strings in integration test JSON files > --- > > Key: ARROW-1875 > URL: https://issues.apache.org/jira/browse/ARROW-1875 > Project: Apache Arrow > Issue Type: Task >Reporter: Brian Hulette >Priority: Minor > Fix For: 0.13.0 > > > Javascript can't handle 64-bit integers natively, so writing them as strings > in the JSON would make implementing the integration tests a lot simpler. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3738) [C++] Add CSV conversion option to parse ISO8601-like timestamp strings
[ https://issues.apache.org/jira/browse/ARROW-3738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-3738: - Assignee: Antoine Pitrou > [C++] Add CSV conversion option to parse ISO8601-like timestamp strings > --- > > Key: ARROW-3738 > URL: https://issues.apache.org/jira/browse/ARROW-3738 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Antoine Pitrou >Priority: Major > Labels: csv > > See similar functionality in other libraries. I believe pandas has a fast > path for iso8601 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3085) [Rust] Add an adapter for parquet.
[ https://issues.apache.org/jira/browse/ARROW-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3085: Component/s: Rust > [Rust] Add an adapter for parquet. > -- > > Key: ARROW-3085 > URL: https://issues.apache.org/jira/browse/ARROW-3085 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3346) [Python] Segfault when reading parquet files if torch is imported before pyarrow
[ https://issues.apache.org/jira/browse/ARROW-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3346: Summary: [Python] Segfault when reading parquet files if torch is imported before pyarrow (was: Segfault when reading parquet files if torch is imported before pyarrow) > [Python] Segfault when reading parquet files if torch is imported before > pyarrow > > > Key: ARROW-3346 > URL: https://issues.apache.org/jira/browse/ARROW-3346 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 >Reporter: Alexey Strokach >Priority: Major > > pyarrow (version 0.10.0) appears to crash sporadically with a segmentation > fault when reading parquet files if it is used in a program where torch is > imported first. > A self-contained example is available here: > [https://gitlab.com/ostrokach/pyarrow_pytorch_segfault]. > Basically, running > {{python -X faulthandler -c "import torch; import pyarrow.parquet as pq; _ = > pq.ParquetFile('example.parquet').read_row_group(0)"}} > sooner or later results in a segfault: > {{Fatal Python error: Segmentation fault}} > {{Current thread 0x7f52959bb740 (most recent call first):}} > {{File > "/home/kimlab1/strokach/anaconda/lib/python3.6/site-packages/pyarrow/parquet.py", > line 125 in read_row_group}} > {{File "<string>", line 1 in <module>}} > {{./test_fail.sh: line 5: 42612 Segmentation fault (core dumped) python -X > faulthandler -c "import torch; import pyarrow.parquet as pq; _ = > pq.ParquetFile('example.parquet').read_row_group(0)"}} > The number of iterations before a segfault varies, but it usually happens > within the first several calls. > Running > {{python -X faulthandler -c "import pyarrow.parquet as pq; import torch; _ = > pq.ParquetFile('example.parquet').read_row_group(0)"}} > works without a problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
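Editor's note: because the ARROW-3346 crash is intermittent, the linked repository drives the failing one-liner in a loop. A sketch of such a harness, using a harmless stand-in command so it runs without torch or pyarrow installed (on POSIX, a process killed by a signal such as SIGSEGV reports a negative return code):

```python
import subprocess
import sys

def run_until_crash(cmd, max_iters=20):
    """Run cmd repeatedly; return the iteration that crashed, or None."""
    for i in range(1, max_iters + 1):
        rc = subprocess.run(cmd).returncode
        if rc < 0:  # killed by a signal, e.g. SIGSEGV
            return i
    return None

# Stand-in for the real repro command from the bug report:
assert run_until_crash([sys.executable, "-c", "pass"], max_iters=2) is None
```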
[jira] [Updated] (ARROW-2786) [JS] Read Parquet files in JavaScript
[ https://issues.apache.org/jira/browse/ARROW-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2786: Labels: parquet (was: ) > [JS] Read Parquet files in JavaScript > - > > Key: ARROW-2786 > URL: https://issues.apache.org/jira/browse/ARROW-2786 > Project: Apache Arrow > Issue Type: New Feature > Components: JavaScript >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > See question in https://github.com/apache/arrow/issues/2209 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2627) [Python] Add option (or some equivalent) to toggle memory mapping functionality when using parquet.ParquetFile or other read entry points
[ https://issues.apache.org/jira/browse/ARROW-2627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2627: Labels: parquet (was: ) > [Python] Add option (or some equivalent) to toggle memory mapping > functionality when using parquet.ParquetFile or other read entry points > - > > Key: ARROW-2627 > URL: https://issues.apache.org/jira/browse/ARROW-2627 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > See issue described in https://github.com/apache/arrow/issues/1946. When > passing a filename to {{parquet.ParquetFile}}, one cannot control what kind > of file reader internally is created (OSFile or MemoryMappedFile) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2079) [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't available
[ https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2079: Labels: parquet (was: ) > [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't > available > -- > > Key: ARROW-2079 > URL: https://issues.apache.org/jira/browse/ARROW-2079 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Jim Crist >Priority: Minor > Labels: parquet > > Currently pyarrow's parquet writer only writes `_common_metadata` and not > `_metadata`. From what I understand these are intended to contain the dataset > schema but not any row group information. > > A few (possibly naive) questions: > > 1. In the `__init__` for `ParquetDataset`, the following lines exist: > {code:java} > if self.metadata_path is not None: > with self.fs.open(self.metadata_path) as f: > self.common_metadata = ParquetFile(f).metadata > else: > self.common_metadata = None > {code} > I believe this should use `common_metadata_path` instead of `metadata_path`, > as the latter is never written by `pyarrow`, and is given by the `_metadata` > file instead of `_common_metadata` (as seemingly intended?). > > 2. In `validate_schemas` I believe an option should exist for using the > schema from `_common_metadata` instead of `_metadata`, as pyarrow currently > only writes the former, and as far as I can tell `_common_metadata` does > include all the schema information needed. > > Perhaps the logic in `validate_schemas` could be ported over to: > > {code:java} > if self.schema is not None: > pass # schema explicitly provided > elif self.metadata is not None: > self.schema = self.metadata.schema > elif self.common_metadata is not None: > self.schema = self.common_metadata.schema > else: > self.schema = self.pieces[0].get_metadata(open_file).schema{code} > If these changes are valid, I'd be happy to submit a PR. 
It's not 100% clear > to me the difference between `_common_metadata` and `_metadata`, but I > believe the schema in both should be the same. Figured I'd open this for > discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
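Editor's note: the fallback order proposed in ARROW-2079 can be sketched with plain Python stand-ins for the pyarrow objects (the names and the `SimpleNamespace` stand-ins are illustrative only):

```python
from types import SimpleNamespace

def resolve_schema(explicit, metadata, common_metadata, first_piece_schema):
    """Pick a dataset schema: explicit > _metadata > _common_metadata > first piece."""
    if explicit is not None:
        return explicit
    if metadata is not None:
        return metadata.schema
    if common_metadata is not None:
        return common_metadata.schema
    return first_piece_schema

# Stand-ins for ParquetFile(...).metadata objects:
meta = SimpleNamespace(schema="from _metadata")
common = SimpleNamespace(schema="from _common_metadata")

assert resolve_schema(None, meta, common, "piece") == "from _metadata"
assert resolve_schema(None, None, common, "piece") == "from _common_metadata"
assert resolve_schema(None, None, None, "piece") == "piece"
```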
[jira] [Updated] (ARROW-1957) [Python] Handle nanosecond timestamps in parquet serialization
[ https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1957: Labels: parquet (was: ) > [Python] Handle nanosecond timestamps in parquet serialization > -- > > Key: ARROW-1957 > URL: https://issues.apache.org/jira/browse/ARROW-1957 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.8.0 > Environment: Python 3.6.4. Mac OSX and CentOS Linux release > 7.3.1611. Pandas 0.21.1 . >Reporter: Jordan Samuels >Priority: Minor > Labels: parquet > > The following code > {code} > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > n=3 > df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', > freq='1n', periods=n)) > pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code} > results in: > {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: > 14832288001}} > The desired effect is that we can save nanosecond resolution without losing > precision (e.g. conversion to ms). Note that if {{freq='1u'}} is used, the > code runs properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1957) [Python] Handle nanosecond timestamps in parquet serialization
[ https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1957: Component/s: Python > [Python] Handle nanosecond timestamps in parquet serialization > -- > > Key: ARROW-1957 > URL: https://issues.apache.org/jira/browse/ARROW-1957 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.8.0 > Environment: Python 3.6.4. Mac OSX and CentOS Linux release > 7.3.1611. Pandas 0.21.1 . >Reporter: Jordan Samuels >Priority: Minor > Labels: parquet > > The following code > {code} > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > n=3 > df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', > freq='1n', periods=n)) > pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code} > results in: > {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: > 14832288001}} > The desired effect is that we can save nanosecond resolution without losing > precision (e.g. conversion to ms). Note that if {{freq='1u'}} is used, the > code runs properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3248) [C++] Arrow tests should have label "arrow"
[ https://issues.apache.org/jira/browse/ARROW-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3248: Fix Version/s: 0.12.0 > [C++] Arrow tests should have label "arrow" > --- > > Key: ARROW-3248 > URL: https://issues.apache.org/jira/browse/ARROW-3248 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.10.0 >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.12.0 > > > It would help executing only them, not Parquet unit tests which for some > reason are quite a bit longer to run. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3325) [Python] Support reading Parquet binary/string columns as pandas Categorical
[ https://issues.apache.org/jira/browse/ARROW-3325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3325: Labels: parquet (was: ) > [Python] Support reading Parquet binary/string columns as pandas Categorical > > > Key: ARROW-3325 > URL: https://issues.apache.org/jira/browse/ARROW-3325 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > Requires PARQUET-1324 and probably quite a bit of extra work > Properly implementing this will require dictionary normalization across row > groups. When reading a new row group, a fast path that compares the current > dictionary with the prior dictionary should be used. This also needs to > handle the case where a column chunk "fell back" to PLAIN encoding mid-stream -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1957) [Python] Handle nanosecond timestamps in parquet serialization
[ https://issues.apache.org/jira/browse/ARROW-1957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1957: Summary: [Python] Handle nanosecond timestamps in parquet serialization (was: Handle nanosecond timestamps in parquet serialization) > [Python] Handle nanosecond timestamps in parquet serialization > -- > > Key: ARROW-1957 > URL: https://issues.apache.org/jira/browse/ARROW-1957 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.8.0 > Environment: Python 3.6.4. Mac OSX and CentOS Linux release > 7.3.1611. Pandas 0.21.1 . >Reporter: Jordan Samuels >Priority: Minor > Labels: parquet > > The following code > {code} > import pyarrow as pa > import pyarrow.parquet as pq > import pandas as pd > n=3 > df = pd.DataFrame({'x': range(n)}, index=pd.DatetimeIndex(start='2017-01-01', > freq='1n', periods=n)) > pq.write_table(pa.Table.from_pandas(df), '/tmp/t.parquet'){code} > results in: > {{ArrowInvalid: Casting from timestamp[ns] to timestamp[us] would lose data: > 14832288001}} > The desired effect is that we can save nanosecond resolution without losing > precision (e.g. conversion to ms). Note that if {{freq='1u'}} is used, the > code runs properly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
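Editor's note: the `ArrowInvalid` in ARROW-1957 comes from a safe-cast check: nanosecond timestamps can only be cast to microseconds without loss when every value is a whole multiple of 1,000 ns. A minimal sketch of that semantics (names are illustrative, not the pyarrow API):

```python
def cast_ns_to_us(ns_values, safe=True):
    """Cast nanosecond timestamps to microseconds; refuse lossy casts when safe."""
    if safe and any(v % 1_000 != 0 for v in ns_values):
        raise ValueError(
            "Casting from timestamp[ns] to timestamp[us] would lose data")
    return [v // 1_000 for v in ns_values]

# Whole microseconds cast cleanly; sub-microsecond remainders do not.
assert cast_ns_to_us([1_000, 2_000]) == [1, 2]
assert cast_ns_to_us([1_500], safe=False) == [1]  # unsafe cast truncates
```

This mirrors why the report's `freq='1u'` variant works: microsecond-spaced values carry no sub-microsecond remainder.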
[jira] [Updated] (ARROW-3085) [Rust] Add an adapter for parquet.
[ https://issues.apache.org/jira/browse/ARROW-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3085: Labels: parquet (was: ) > [Rust] Add an adapter for parquet. > -- > > Key: ARROW-3085 > URL: https://issues.apache.org/jira/browse/ARROW-3085 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Renjie Liu >Assignee: Renjie Liu >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3762) [C++] Arrow table reads error when overflowing capacity of BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3762: Labels: parquet (was: ) > [C++] Arrow table reads error when overflowing capacity of BinaryArray > -- > > Key: ARROW-3762 > URL: https://issues.apache.org/jira/browse/ARROW-3762 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Chris Ellison >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError > due to it not creating chunked arrays. Reading each row group individually > and then concatenating the tables works, however. > > {code:java} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > x = pa.array(list('1' * 2**30)) > demo = 'demo.parquet' > def scenario(): > t = pa.Table.from_arrays([x], ['x']) > writer = pq.ParquetWriter(demo, t.schema) > for i in range(2): > writer.write_table(t) > writer.close() > pf = pq.ParquetFile(demo) > # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot > contain more than 2147483646 bytes, have 2147483647 > t2 = pf.read() > # Works, but note, there are 32 row groups, not 2 as suggested by: > # > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing > tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] > t3 = pa.concat_tables(tables) > scenario() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
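Editor's note: the fix the ARROW-3762 reporter expects is chunking — splitting the binary values into chunks whose total payload stays under the BinaryArray limit (2**31 - 1 bytes, per the error message) instead of erroring. A self-contained sketch of that chunking policy:

```python
LIMIT = 2**31 - 1  # max bytes a single BinaryArray can hold, per the error

def chunk_binary(values, limit=LIMIT):
    """Group binary values into chunks whose total size stays within limit."""
    chunks, current, size = [], [], 0
    for v in values:
        if current and size + len(v) > limit:
            chunks.append(current)
            current, size = [], 0
        current.append(v)
        size += len(v)
    if current:
        chunks.append(current)
    return chunks
```

With a tiny limit for demonstration, `chunk_binary([b"aaaa", b"bbbb", b"cc"], limit=6)` splits after the first value, since adding the second would exceed six bytes.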
[jira] [Closed] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read
[ https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-3139. --- Resolution: Duplicate > [Python] ArrowIOError: Arrow error: Capacity error during read > -- > > Key: ARROW-3139 > URL: https://issues.apache.org/jira/browse/ARROW-3139 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 > Environment: pandas=0.23.1=py36h637b7d7_0 > pyarrow==0.10.0 >Reporter: Frédérique Vanneste >Priority: Major > Labels: parquet > > My assumption: the problem is caused by a large object column containing > strings up to 27 characters long. (so that column is much larger than 2GB of > strings, chunking issue) > looks similar as > https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574 > > Code > * basket_plateau= pq.read_table("basket_plateau.parquet") > * basket_plateau = pd.read_parquet("basket_plateau.parquet") > Error produced > * ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more > than 2147483646 bytes, have 2147483655 > Dataset > * Pandas dataframe (pandas=0.23.1=py36h637b7d7_0) > * 2.7 billion record, 4 columns ( int64/object/datetime64/float64) > * aprox 90GB in memory > * example of object col: "Fresh Vegetables", "Alcohol Beers", ... (think > food retail categories) > History to bug: > * was using older version of pyarrow > * tried writing dataset to disk (parquet) and failed > * stumbled on https://issues.apache.org/jira/browse/ARROW-2227 > * upgraded to 0.10 > * tried writing dataset to disk (parquet) and succeeded > * tried reading dataset and failed > * looks like a similar case as: > https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574 > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2360) [C++] Add set_chunksize for RecordBatchReader in arrow/record_batch.h
[ https://issues.apache.org/jira/browse/ARROW-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2360: Component/s: C++ > [C++] Add set_chunksize for RecordBatchReader in arrow/record_batch.h > - > > Key: ARROW-2360 > URL: https://issues.apache.org/jira/browse/ARROW-2360 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Xianjin YE >Priority: Major > > As discussed in [https://github.com/apache/parquet-cpp/pull/445,] > Maybe it's better to expose chunksize related API in RecordBatchReader. > > However RecordBatchStreamReader doesn't conforms to this requirement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3208) [Python] Segmentation fault when reading a Parquet partitioned dataset to a Parquet file
[ https://issues.apache.org/jira/browse/ARROW-3208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3208: Labels: parquet (was: ) > [Python] Segmentation fault when reading a Parquet partitioned dataset to a > Parquet file > > > Key: ARROW-3208 > URL: https://issues.apache.org/jira/browse/ARROW-3208 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Ubuntu 16.04 LTS; System76 Oryx Pro >Reporter: Ying Wang >Priority: Major > Labels: parquet > > Steps to reproduce: > # Create a partitioned dataset with the following code: > ```python > import numpy as np > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > df = pd.DataFrame({ 'one': [-1, 10, 2.5, 100, 1000, 1, 29.2], 'two': [-1, 10, > 2, 100, 1000, 1, 11], 'three': [0, 0, 0, 0, 0, 0, 0] }) > table = pa.Table.from_pandas(df) > pq.write_to_dataset(table, root_path='/home/yingw787/misc/example_dataset', > partition_cols=['one', 'two']) > ``` > # Create a Parquet file from a PyArrow Table created from the partitioned > Parquet dataset: > ```python > import pyarrow.parquet as pq > table = pq.ParquetDataset('/path/to/dataset').read() > pq.write_table(table, '/path/to/example.parquet') > ``` > EXPECTED: > * Successful write > GOT: > * Segmentation fault > Issue reference on GitHub mirror: https://github.com/apache/arrow/issues/2511 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
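Editor's note: for context on the repro above, `write_to_dataset` with `partition_cols=['one', 'two']` lays files out in hive-style directories. A sketch of that path construction (the helper name and root path are illustrative):

```python
def partition_path(root, cols, row):
    """Build the hive-style directory path for one row's partition values."""
    parts = [f"{c}={row[c]}" for c in cols]
    return "/".join([root] + parts)

# One row of the report's DataFrame lands in nested key=value directories:
assert (partition_path("example_dataset", ["one", "two"], {"one": -1, "two": -1})
        == "example_dataset/one=-1/two=-1")
```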
[jira] [Updated] (ARROW-3722) [C++] Allow specifying column types to CSV reader
[ https://issues.apache.org/jira/browse/ARROW-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3722: -- Labels: pull-request-available (was: ) > [C++] Allow specifying column types to CSV reader > - > > Key: ARROW-3722 > URL: https://issues.apache.org/jira/browse/ARROW-3722 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.1 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > I'm not sure how to expose this. The easiest, implementation-wise, would be > to allow passing a {{Schema}} (for example inside the {{ConvertOptions}}). > Another possibility is to allow specifying the default types for type > inference. For example type inference currently infers integers as {{int64}}, > but the user might prefer {{int32}}. > Thoughts? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
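Editor's note: the two options discussed in ARROW-3722 — per-column type overrides (e.g. a schema carried in `ConvertOptions`) layered over inference that defaults integers to int64 — can be sketched as follows. All names here are illustrative, not the eventual C++ or Python API:

```python
def infer_type(values):
    """Naive inference: columns of integer-looking strings default to int64."""
    try:
        for v in values:
            int(v)
        return "int64"
    except ValueError:
        return "string"

def resolve_column_type(name, values, overrides=None):
    """A user-specified type wins; otherwise fall back to inference."""
    if overrides and name in overrides:
        return overrides[name]
    return infer_type(values)

assert resolve_column_type("x", ["1", "2"]) == "int64"
assert resolve_column_type("x", ["1", "2"], overrides={"x": "int32"}) == "int32"
```

The second option in the issue — changing the inference *defaults* (say, int32 instead of int64) — would amount to parameterizing `infer_type` rather than overriding per column.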
[jira] [Updated] (ARROW-3502) [C++] parquet-column_scanner-test failure building ARROW_PARQUET build 11.
[ https://issues.apache.org/jira/browse/ARROW-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3502: Labels: parquet (was: ) > [C++] parquet-column_scanner-test failure building ARROW_PARQUET build 11. > -- > > Key: ARROW-3502 > URL: https://issues.apache.org/jira/browse/ARROW-3502 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Labels: parquet > Attachments: Screenshot from 2018-10-11 12-25-13.png > > > For building Arrow Apache, I have enabled following flags and got error in > the attachment (parquet- > column_scanner-test failure) in making arrow build 11. > cmake .. -DCMAKE_BUILD_TYPE=Release -DARROW_PARQUET=ON -DARROW_PLASMA=ON > -DARROW_PLASMA_JAVA_CLIENT=ON -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3731) [R] R API for reading and writing Parquet files
[ https://issues.apache.org/jira/browse/ARROW-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3731: Labels: parquet (was: ) > [R] R API for reading and writing Parquet files > --- > > Key: ARROW-3731 > URL: https://issues.apache.org/jira/browse/ARROW-3731 > Project: Apache Arrow > Issue Type: New Feature > Components: R >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > To start, this would be at the level of complexity of > {{pyarrow.parquet.read_table}} and {{write_table}} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3703) [Python] DataFrame.to_parquet crashes if datetime column has time zones
[ https://issues.apache.org/jira/browse/ARROW-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3703: Labels: parquet (was: ) > [Python] DataFrame.to_parquet crashes if datetime column has time zones > --- > > Key: ARROW-3703 > URL: https://issues.apache.org/jira/browse/ARROW-3703 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1 > Environment: pandas 0.23.4 > pyarrow 0.11.1 > Python 2.7, 3.5 - 3.7 > MacOS High Sierra (10.13.6) >Reporter: Diego Argueta >Priority: Major > Labels: parquet > > On CPython 2.7.15, 3.5.6, 3.6.6, and 3.7.0, creating a Pandas DataFrame with > a {{datetime.datetime}} object serializes to Parquet just fine, but crashes > with an {{AttributeError}} if you try to use the built-in {{timezone}} > objects. > To reproduce, on Python 3: > {code:java} > import datetime as dt > import pandas as pd > df = pd.DataFrame({'foo': [dt.datetime(2018, 1, 1, 1, 23, 45, > tzinfo=dt.timezone.utc)]}) > df.to_parquet('data.parq') > {code} > > On Python 2, create a subclass of {{datetime.tzinfo}} as shown > [here|https://docs.python.org/2/library/datetime.html#datetime.tzinfo] and > try the same thing. 
> > The following exception results: > {noformat} > Traceback (most recent call last): > File "", line 1, in > File > "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/core/frame.py", > line 1945, in to_parquet > compression=compression, **kwargs) > File > "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", > line 257, in to_parquet > return impl.write(df, path, compression=compression, **kwargs) > File > "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", > line 118, in write > table = self.api.Table.from_pandas(df) > File "pyarrow/table.pxi", line 1217, in pyarrow.lib.Table.from_pandas > File > "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 381, in dataframe_to_arrays > convert_types)] > File > "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 380, in > for c, t in zip(columns_to_convert, > File > "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 370, in convert_column > return pa.array(col, type=ty, from_pandas=True, safe=safe) > File "pyarrow/array.pxi", line 167, in pyarrow.lib.array > File > "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 409, in get_datetimetz_type > type_ = pa.timestamp(unit, tz) > File "pyarrow/types.pxi", line 1038, in pyarrow.lib.timestamp > File "pyarrow/types.pxi", line 955, in pyarrow.lib.tzinfo_to_string > AttributeError: 'datetime.timezone' object has no attribute 'zone' > 'datetime.timezone' object has no attribute 'zone' > {noformat} > > This doesn't happen if you use {{pytz.UTC}} as the timezone object. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1988) [Python] Extend flavor=spark in Parquet writing to handle INT types
[ https://issues.apache.org/jira/browse/ARROW-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1988: Labels: parquet (was: ) > [Python] Extend flavor=spark in Parquet writing to handle INT types > --- > > Key: ARROW-1988 > URL: https://issues.apache.org/jira/browse/ARROW-1988 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > Fix For: 0.13.0 > > > See the relevant code sections at > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L139 > We should cater for them in the {{pyarrow}} code and also reach out to Spark > developers so that they are supported there in the long term. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3166) [C++] Consolidate IO interfaces used in arrow/io and parquet-cpp
[ https://issues.apache.org/jira/browse/ARROW-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3166: Labels: parquet (was: ) > [C++] Consolidate IO interfaces used in arrow/io and parquet-cpp > > > Key: ARROW-3166 > URL: https://issues.apache.org/jira/browse/ARROW-3166 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > With the codebase consolidation, we have the opportunity to remove cruft from > the Parquet codebase. I believe it would be simpler and better for the > ecosystem to use the Arrow IO interface classes rather than maintaining > separate virtual IO interfaces exported from the {{parquet::}} namespace -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3728) [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch
[ https://issues.apache.org/jira/browse/ARROW-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3728: Labels: parquet (was: ) > [Python] Merging Parquet Files - Pandas Meta in Schema Mismatch > --- > > Key: ARROW-3728 > URL: https://issues.apache.org/jira/browse/ARROW-3728 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0, 0.11.0, 0.11.1 > Environment: Python 3.6.3 > OSX 10.14 >Reporter: Micah Williamson >Priority: Major > Labels: parquet > > From: > https://stackoverflow.com/questions/53214288/merging-parquet-files-pandas-meta-in-schema-mismatch > > I am trying to merge multiple parquet files into one. Their schemas are > identical field-wise but my {{ParquetWriter}} is complaining that they are > not. After some investigation I found that the pandas meta in the schemas are > different, causing this error. > > Sample- > {code:python} > import pyarrow.parquet as pq > pq_tables=[] > for file_ in files: > pq_table = pq.read_table(f'{MESS_DIR}/{file_}') > pq_tables.append(pq_table) > if writer is None: > writer = pq.ParquetWriter(COMPRESSED_FILE, schema=pq_table.schema, > use_deprecated_int96_timestamps=True) > writer.write_table(table=pq_table) > {code} > The error- > {code} > Traceback (most recent call last): > File "{PATH_TO}/main.py", line 68, in lambda_handler > writer.write_table(table=pq_table) > File > "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyarrow/parquet.py", > line 335, in write_table > raise ValueError(msg) > ValueError: Table schema does not match schema used to create file: > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3525) [Packaging] Remove arrow/ and parquet-cpp/ dependencies in dev/run_docker_compose.sh
[ https://issues.apache.org/jira/browse/ARROW-3525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685329#comment-16685329 ] Wes McKinney commented on ARROW-3525: - [~kszucs] was this completed? > [Packaging] Remove arrow/ and parquet-cpp/ dependencies in > dev/run_docker_compose.sh > > > Key: ARROW-3525 > URL: https://issues.apache.org/jira/browse/ARROW-3525 > Project: Apache Arrow > Issue Type: Task > Components: Packaging >Affects Versions: 0.11.0 >Reporter: Kouhei Sutou >Priority: Minor > > Because we merged parquet-cpp into the Apache Arrow repository. > > Can someone work on this? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2624) [Python] Random schema and data generator for Arrow conversion and Parquet testing
[ https://issues.apache.org/jira/browse/ARROW-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2624: Labels: parquet (was: ) > [Python] Random schema and data generator for Arrow conversion and Parquet > testing > -- > > Key: ARROW-2624 > URL: https://issues.apache.org/jira/browse/ARROW-2624 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > See discussion in https://github.com/apache/arrow/issues/2067 > Being able to generate random complex schemas and corresponding example data > sets will help with exercising edge cases in many different parts of the > codebase. One practical example: reading and writing nested data to Parquet > format -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1848) [Python] Add documentation examples for reading single Parquet files and datasets from HDFS
[ https://issues.apache.org/jira/browse/ARROW-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1848: Labels: parquet (was: ) > [Python] Add documentation examples for reading single Parquet files and > datasets from HDFS > --- > > Key: ARROW-1848 > URL: https://issues.apache.org/jira/browse/ARROW-1848 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > see > https://stackoverflow.com/questions/47443151/read-a-parquet-files-from-hdfs-using-pyarrow -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2728) [Python] Support partitioned Parquet datasets using glob-style file paths
[ https://issues.apache.org/jira/browse/ARROW-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2728: Labels: parquet (was: newbie) > [Python] Support partitioned Parquet datasets using glob-style file paths > - > > Key: ARROW-2728 > URL: https://issues.apache.org/jira/browse/ARROW-2728 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: pyarrow : 0.9.0.post1 > dask : 0.17.1 > Mac OS >Reporter: pranav kohli >Priority: Minor > Labels: parquet > > I am saving a dask dataframe to parquet with two partition columns using the > pyarrow engine. The problem arises in scanning the partition columns. When I > scan using the directory path, I get the partition columns in the output > dataframe, whereas if I scan using the glob path, I don't get these columns > > https://github.com/apache/arrow/issues/2147 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3210) [Python] Creating ParquetDataset creates partitioned ParquetFiles with mismatched Parquet schemas
[ https://issues.apache.org/jira/browse/ARROW-3210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3210: Labels: parquet (was: ) > [Python] Creating ParquetDataset creates partitioned ParquetFiles with > mismatched Parquet schemas > - > > Key: ARROW-3210 > URL: https://issues.apache.org/jira/browse/ARROW-3210 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 > Environment: Ubuntu 16.04 LTS, System76 Oryx Pro >Reporter: Ying Wang >Priority: Major > Labels: parquet > Attachments: environment.yml, repro.csv, repro.py, repro_2.py > > > STEPS TO REPRODUCE: > 1. Create a conda environment reflecting [^environment.yml] > 2. Execute script [^repro.py], replacing various config variables to create a > ParquetDataset on S3 given [^repro.csv] > 3. Create reference of ParquetDataset using script [^repro_2.py], again > replacing various config variables. > > EXPECTED: > Reference is created correctly. > GOT: > Mismatched Arrow schemas in validate_schemas() method: > > ```python > *** ValueError: Schema in partition[Draught=1, Name=1, VesselType=0, x=1, > Heading=1] > s3://kio-tests-files/_tmp/test_parquet_dataset/Draught=10.3/Name=MSC > RAFAELA/VesselType=Cargo/x=130.43158/Heading=270.0/e9e3cea5a5c24c4da587c263ec817c98.parquet > was different. 
> Record_ID: int64 > y: double > TRACKID: string > MMSI: int64 > IMO: int64 > AgeMinutes: double > SoG: double > Width: int64 > Length: int64 > Callsign: string > Destination: string > ETA: int64 > Status: string > ExtraInfo: string > TIMESTAMP: int64 > __index_level_0__: int64 > metadata > > {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": > [{"na' > b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_' > b'type": "object", "metadata": \{"encoding": "UTF-8"}}], "columns":' > b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"' > b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y' > b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f' > b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T' > b'RACKID", "pandas_type": "unicode", "numpy_type": "object", "meta' > b'data": null}, {"name": "MMSI", "field_name": "MMSI", "pandas_typ' > b'e": "int64", "numpy_type": "int64", "metadata": null}, {"name": ' > b'"IMO", "field_name": "IMO", "pandas_type": "int64", "numpy_type"' > b': "int64", "metadata": null}, {"name": "AgeMinutes", "field_name' > b'": "AgeMinutes", "pandas_type": "float64", "numpy_type": "float6' > b'4", "metadata": null}, {"name": "SoG", "field_name": "SoG", "pan' > b'das_type": "float64", "numpy_type": "float64", "metadata": null}' > b', {"name": "Width", "field_name": "Width", "pandas_type": "int64' > b'", "numpy_type": "int64", "metadata": null}, {"name": "Length", ' > b'"field_name": "Length", "pandas_type": "int64", "numpy_type": "i' > b'nt64", "metadata": null}, {"name": "Callsign", "field_name": "Ca' > b'llsign", "pandas_type": "unicode", "numpy_type": "object", "meta' > b'data": null}, {"name": "Destination", "field_name": "Destination' > b'", "pandas_type": "unicode", "numpy_type": "object", "metadata":' > b' null}, {"name": "ETA", "field_name": "ETA", "pandas_type": "int' > b'64", "numpy_type": "int64", "metadata": null}, {"name": "Status"' > b', 
"field_name": "Status", "pandas_type": "unicode", "numpy_type"' > b': "object", "metadata": null}, {"name": "ExtraInfo", "field_name' > b'": "ExtraInfo", "pandas_type": "unicode", "numpy_type": "object"' > b', "metadata": null}, {"name": "TIMESTAMP", "field_name": "TIMEST' > b'AMP", "pandas_type": "int64", "numpy_type": "int64", "metadata":' > b' null}, {"name": null, "field_name": "__index_level_0__", "panda' > b's_type": "int64", "numpy_type": "int64", "metadata": null}], "pa' > b'ndas_version": "0.21.0"}'} > vs > Record_ID: int64 > y: double > TRACKID: string > MMSI: int64 > IMO: int64 > AgeMinutes: double > SoG: double > Width: int64 > Length: int64 > Callsign: string > Destination: string > ETA: int64 > Status: string > ExtraInfo: null > TIMESTAMP: int64 > __index_level_0__: int64 > metadata > > {b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": > [{"na' > b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_' > b'type": "object", "metadata": \{"encoding": "UTF-8"}}], "columns":' > b' [{"name": "Record_ID", "field_name": "Record_ID", "pandas_type"' > b': "int64", "numpy_type": "int64", "metadata": null}, {"name": "y' > b'", "field_name": "y", "pandas_type": "float64", "numpy_type": "f' > b'loat64", "metadata": null}, {"name": "TRACKID", "field_name": "T' > b'RACKID",
[jira] [Commented] (ARROW-3722) [C++] Allow specifying column types to CSV reader
[ https://issues.apache.org/jira/browse/ARROW-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685325#comment-16685325 ] Antoine Pitrou commented on ARROW-3722: --- > We also need a way to provide column names (or even default to numbering) for > files without a header. This topic is related, but maybe a new Jira would be > better suited for it. Yes, I think a separate JIRA is better. > additional thoughts on passing ColumnBuilder instead of just a type. Ideally, > the user would be able to implement own converters to support, let's say, > uncommon date formats or even parse struct types at load time. Right now most CSV APIs are internal. APIs like ColumnBuilder and Converter expose implementation details that we don't want to set in stone. If there's some demand we might think about an API to let people define their conversion functions without having to depend on internal APIs. > [C++] Allow specifying column types to CSV reader > - > > Key: ARROW-3722 > URL: https://issues.apache.org/jira/browse/ARROW-3722 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.1 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > > I'm not sure how to expose this. The easiest, implementation-wise, would be > to allow passing a {{Schema}} (for example inside the {{ConvertOptions}}). > Another possibility is to allow specifying the default types for type > inference. For example type inference currently infers integers as {{int64}}, > but the user might prefer {{int32}}. > Thoughts? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2628) [Python] parquet.write_to_dataset is memory-hungry on large DataFrames
[ https://issues.apache.org/jira/browse/ARROW-2628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2628: Labels: parquet (was: ) > [Python] parquet.write_to_dataset is memory-hungry on large DataFrames > -- > > Key: ARROW-2628 > URL: https://issues.apache.org/jira/browse/ARROW-2628 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > See discussion in https://github.com/apache/arrow/issues/1749. We should > consider strategies for writing very large tables to a partitioned directory > scheme. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1012) [C++] Create implementation of StreamReader that reads from Apache Parquet files
[ https://issues.apache.org/jira/browse/ARROW-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1012: Labels: parquet (was: ) > [C++] Create implementation of StreamReader that reads from Apache Parquet > files > > > Key: ARROW-1012 > URL: https://issues.apache.org/jira/browse/ARROW-1012 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > This will be enabled by ARROW-1008 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2366) [Python] Support reading Parquet files having a permutation of column order
[ https://issues.apache.org/jira/browse/ARROW-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2366: Labels: parquet (was: ) > [Python] Support reading Parquet files having a permutation of column order > --- > > Key: ARROW-2366 > URL: https://issues.apache.org/jira/browse/ARROW-2366 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Assignee: Uwe L. Korn >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > See discussion in https://github.com/dask/fastparquet/issues/320 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2038) [Python] Follow-up bug fixes for s3fs Parquet support
[ https://issues.apache.org/jira/browse/ARROW-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2038: Labels: aws parquet (was: aws) > [Python] Follow-up bug fixes for s3fs Parquet support > - > > Key: ARROW-2038 > URL: https://issues.apache.org/jira/browse/ARROW-2038 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: aws, parquet > Fix For: 0.12.0 > > > see discussion in > https://github.com/apache/arrow/pull/916#issuecomment-360558248 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2077) [Python] Document on how to use Storefact & Arrow to read Parquet from S3/Azure/...
[ https://issues.apache.org/jira/browse/ARROW-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2077: Labels: parquet (was: ) > [Python] Document on how to use Storefact & Arrow to read Parquet from > S3/Azure/... > --- > > Key: ARROW-2077 > URL: https://issues.apache.org/jira/browse/ARROW-2077 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Uwe L. Korn >Assignee: Uwe L. Korn >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > We're using this happily in production, also with column projection down to > the storage layer. Others should also benefit from this. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1682) [Python] Add documentation / example for reading a directory of Parquet files on S3
[ https://issues.apache.org/jira/browse/ARROW-1682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1682: Labels: parquet (was: ) > [Python] Add documentation / example for reading a directory of Parquet files > on S3 > --- > > Key: ARROW-1682 > URL: https://issues.apache.org/jira/browse/ARROW-1682 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > Opened based on comment > https://github.com/apache/arrow/pull/916#issuecomment-337563492 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-3722) [C++] Allow specifying column types to CSV reader
[ https://issues.apache.org/jira/browse/ARROW-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-3722: - Assignee: Antoine Pitrou > [C++] Allow specifying column types to CSV reader > - > > Key: ARROW-3722 > URL: https://issues.apache.org/jira/browse/ARROW-3722 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.11.1 >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > > I'm not sure how to expose this. The easiest, implementation-wise, would be > to allow passing a {{Schema}} (for example inside the {{ConvertOptions}}). > Another possibility is to allow specifying the default types for type > inference. For example type inference currently infers integers as {{int64}}, > but the user might prefer {{int32}}. > Thoughts? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1925) [Python] Wrapping PyArrow Table with Numpy without copy
[ https://issues.apache.org/jira/browse/ARROW-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1925: Summary: [Python] Wrapping PyArrow Table with Numpy without copy (was: Wrapping PyArrow Table with Numpy without copy) > [Python] Wrapping PyArrow Table with Numpy without copy > --- > > Key: ARROW-1925 > URL: https://issues.apache.org/jira/browse/ARROW-1925 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Affects Versions: 0.7.1 >Reporter: Young-Jun Ko >Priority: Minor > Labels: parquet > > The scenario is the following: > I have a parquet file, which has a column containing a float array of > constant size. > So it can be thought of as a matrix. > When I read the parquet file, the way I currently access it, is to convert it > to pandas, extract the values, giving me a list of np.array and then doing > np.vstack to get the matrix. > This involves a copy that would be nice to avoid. > When a parquet file (or more generally a parquet dataset) is read, would the > values of the array column be contiguous in memory, so that a view on the > data could be created without having to copy? That would be neat. > Thanks! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-976) [Python] Provide API for defining and reading Parquet datasets with more ad hoc partition schemes
[ https://issues.apache.org/jira/browse/ARROW-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-976: --- Labels: parquet (was: ) > [Python] Provide API for defining and reading Parquet datasets with more ad > hoc partition schemes > - > > Key: ARROW-976 > URL: https://issues.apache.org/jira/browse/ARROW-976 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1925) [Python] Wrapping PyArrow Table with Numpy without copy
[ https://issues.apache.org/jira/browse/ARROW-1925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-1925: Labels: parquet (was: ) > [Python] Wrapping PyArrow Table with Numpy without copy > --- > > Key: ARROW-1925 > URL: https://issues.apache.org/jira/browse/ARROW-1925 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Affects Versions: 0.7.1 >Reporter: Young-Jun Ko >Priority: Minor > Labels: parquet > > The scenario is the following: > I have a parquet file, which has a column containing a float array of > constant size. > So it can be thought of as a matrix. > When I read the parquet file, the way I currently access it, is to convert it > to pandas, extract the values, giving me a list of np.array and then doing > np.vstack to get the matrix. > This involves a copy that would be nice to avoid. > When a parquet file (or more generally a parquet dataset) is read, would the > values of the array column be contiguous in memory, so that a view on the > data could be created without having to copy? That would be neat. > Thanks! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2360) [C++] Add set_chunksize for RecordBatchReader in arrow/record_batch.h
[ https://issues.apache.org/jira/browse/ARROW-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2360: Summary: [C++] Add set_chunksize for RecordBatchReader in arrow/record_batch.h (was: Add set_chunksize for RecordBatchReader in arrow/record_batch.h) > [C++] Add set_chunksize for RecordBatchReader in arrow/record_batch.h > - > > Key: ARROW-2360 > URL: https://issues.apache.org/jira/browse/ARROW-2360 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Xianjin YE >Priority: Major > > As discussed in [https://github.com/apache/parquet-cpp/pull/445,] > Maybe it's better to expose a chunksize-related API in RecordBatchReader. > > However, RecordBatchStreamReader doesn't conform to this requirement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3762) [C++] Arrow table reads error when overflowing capacity of BinaryArray
[ https://issues.apache.org/jira/browse/ARROW-3762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3762: Component/s: Python C++ > [C++] Arrow table reads error when overflowing capacity of BinaryArray > -- > > Key: ARROW-3762 > URL: https://issues.apache.org/jira/browse/ARROW-3762 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Reporter: Chris Ellison >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > When reading a parquet file with binary data > 2 GiB, we get an ArrowIOError > due to it not creating chunked arrays. Reading each row group individually > and then concatenating the tables works, however. > > {code:java} > import pandas as pd > import pyarrow as pa > import pyarrow.parquet as pq > x = pa.array(list('1' * 2**30)) > demo = 'demo.parquet' > def scenario(): > t = pa.Table.from_arrays([x], ['x']) > writer = pq.ParquetWriter(demo, t.schema) > for i in range(2): > writer.write_table(t) > writer.close() > pf = pq.ParquetFile(demo) > # pyarrow.lib.ArrowIOError: Arrow error: Invalid: BinaryArray cannot > contain more than 2147483646 bytes, have 2147483647 > t2 = pf.read() > # Works, but note, there are 32 row groups, not 2 as suggested by: > # > https://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing > tables = [pf.read_row_group(i) for i in range(pf.num_row_groups)] > t3 = pa.concat_tables(tables) > scenario() > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2592) [Python] AssertionError in to_pandas()
[ https://issues.apache.org/jira/browse/ARROW-2592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2592: Labels: parquet (was: ) > [Python] AssertionError in to_pandas() > -- > > Key: ARROW-2592 > URL: https://issues.apache.org/jira/browse/ARROW-2592 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0, 0.10.0, 0.11.0, 0.11.1 >Reporter: Dima Ryazanov >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > > Pyarrow 0.8 and 0.9 raises an AssertionError for one of the datasets I have > (created using an older version of pyarrow). Repro steps: > {{In [1]: from pyarrow.parquet import ParquetDataset}} > {{In [2]: d = ParquetDataset(['bug.parq'])}} > {{In [3]: t = d.read()}} > {{In [4]: t.to_pandas()}} > {{---}} > {{AssertionError Traceback (most recent call > last)}} > {{ in ()}} > {{> 1 t.to_pandas()}} > {{table.pxi in pyarrow.lib.Table.to_pandas()}} > {{~/envs/cli3/lib/python3.6/site-packages/pyarrow/pandas_compat.py in > table_to_blockmanager(options, table, memory_pool, nthreads, categories)}} > {{ 529 # There must be the same number of field names and physical > names}} > {{ 530 # (fields in the arrow Table)}} > {{--> 531 assert len(logical_index_names) == len(index_columns_set)}} > {{ 532 }} > {{ 533 # It can never be the case in a released version of pyarrow > that}} > {{AssertionError: }} > > Here's the file: [https://www.dropbox.com/s/oja3khjsc5tycfh/bug.parq] > (I was not able to attach it here due to a "missing token", whatever that > means.) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2659) [Python] More graceful reading of empty String columns in ParquetDataset
[ https://issues.apache.org/jira/browse/ARROW-2659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2659: Labels: parquet (was: beginner) > [Python] More graceful reading of empty String columns in ParquetDataset > > > Key: ARROW-2659 > URL: https://issues.apache.org/jira/browse/ARROW-2659 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.9.0 >Reporter: Uwe L. Korn >Priority: Major > Labels: parquet > Fix For: 0.12.0 > > Attachments: read_parquet_dataset.error.read_table.novalidation.txt, > read_parquet_dataset.error.read_table.txt > > > When currently saving a {{ParquetDataset}} from Pandas, we don't get > consistent schemas, even if the source was a single DataFrame. This is due to > the fact that in some partitions object columns like string can become empty. > Then the resulting Arrow schema will differ. In the central metadata, we will > store this column as {{pa.string}} whereas in the partition file with the > empty columns, this columns will be stored as {{pa.null}}. > The two schemas are still a valid match in terms of schema evolution and we > should respect that in > https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L754 > Instead of doing a {{pa.Schema.equals}} in > https://github.com/apache/arrow/blob/79a22074e0b059a24c5cd45713f8d085e24f826a/python/pyarrow/parquet.py#L778 > we should introduce a new method {{pa.Schema.can_evolve_to}} that is more > graceful and returns {{True}} if a dataset piece has a null column where the > main metadata states a nullable column of any type. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets
[ https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3538: Labels: features parquet (was: features) > [Python] ability to override the automated assignment of uuid for filenames > when writing datasets > - > > Key: ARROW-3538 > URL: https://issues.apache.org/jira/browse/ARROW-3538 > Project: Apache Arrow > Issue Type: Wish >Affects Versions: 0.10.0 >Reporter: Ji Xu >Priority: Major > Labels: features, parquet > > Say I have a pandas DataFrame {{df}} that I would like to store on disk as > dataset using pyarrow parquet, I would do this: > {code:java} > table = pyarrow.Table.from_pandas(df) > pyarrow.parquet.write_to_dataset(table, root_path=some_path, > partition_cols=['a',]){code} > On disk the dataset would look like something like this: > {color:#14892c}some_path{color} > {color:#14892c}├── a=1{color} > {color:#14892c}├── 4498704937d84fe5abebb3f06515ab2d.parquet{color} > {color:#14892c}├── a=2{color} > {color:#14892c}├── 8bcfaed8986c4bdba587aaaee532370c.parquet{color} > *Wished Feature:* It'd be great if I can override the auto-assignment of the > long UUID as filename somehow during the *dataset* writing. My purpose is to > be able to overwrite the dataset on disk when I have a new version of {{df}}. > Currently if I try to write the dataset again, another new uniquely named > [UUID].parquet file will be placed next to the old one, with the same, > redundant data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
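Until the filename can be overridden, a common workaround for the overwrite use case is to clear the dataset root before re-writing, so stale {{[UUID].parquet}} files don't accumulate. A stdlib-only sketch; the actual {{write_to_dataset}} call is left as a comment:

```python
import os
import shutil
import tempfile

# Simulate a previously written dataset with one stale partition file.
some_path = os.path.join(tempfile.mkdtemp(), 'dataset')
os.makedirs(os.path.join(some_path, 'a=1'))
open(os.path.join(some_path, 'a=1', 'old.parquet'), 'w').close()

# Drop the previous version wholesale, then re-create a fresh root.
if os.path.isdir(some_path):
    shutil.rmtree(some_path)
os.makedirs(some_path, exist_ok=True)
# pyarrow.parquet.write_to_dataset(table, root_path=some_path,
#                                  partition_cols=['a'])
```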
[jira] [Updated] (ARROW-2098) [Python] Implement "errors as null" option when coercing Python object arrays to Arrow format
[ https://issues.apache.org/jira/browse/ARROW-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2098: Labels: parquet (was: ) > [Python] Implement "errors as null" option when coercing Python object arrays > to Arrow format > - > > Key: ARROW-2098 > URL: https://issues.apache.org/jira/browse/ARROW-2098 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > > Inspired by > https://stackoverflow.com/questions/48611998/type-error-on-first-steps-with-apache-parquet > where the user has a string inside a mostly integer column -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2598) [Python] table.to_pandas segfault
[ https://issues.apache.org/jira/browse/ARROW-2598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2598: Labels: parquet (was: ) > [Python] table.to_pandas segfault > -- > > Key: ARROW-2598 > URL: https://issues.apache.org/jira/browse/ARROW-2598 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: jacques >Priority: Major > Labels: parquet > > Here is a small snippet which produces a segfault: > {noformat} > In [1]: import pyarrow as pa > In [2]: import pyarrow.parquet as pq > In [3]: pa_ar = pa.array([[], []]) > In [4]: pq.write_table( > ...: table=pa.Table.from_arrays([pa_ar],["test"]), > ...: where="test5.parquet", > ...: compression="snappy", > ...: flavor="spark" > ...: ) > In [5]: pq.read_table("test5.parquet") > Out[5]: > pyarrow.Table > test: list > child 0, item: null > In [6]: pq.read_table("test5.parquet").to_pydict() > Out[6]: OrderedDict([(u'test', [None, None])]) > In [7]: pq.read_table("test5.parquet").to_pandas() > Segmentation fault > {noformat} > I thank you in advance for having this fixed. > Best, > Jacques -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3538) [Python] ability to override the automated assignment of uuid for filenames when writing datasets
[ https://issues.apache.org/jira/browse/ARROW-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3538: Summary: [Python] ability to override the automated assignment of uuid for filenames when writing datasets (was: ability to override the automated assignment of uuid for filenames when writing datasets) > [Python] ability to override the automated assignment of uuid for filenames > when writing datasets > - > > Key: ARROW-3538 > URL: https://issues.apache.org/jira/browse/ARROW-3538 > Project: Apache Arrow > Issue Type: Wish >Affects Versions: 0.10.0 >Reporter: Ji Xu >Priority: Major > Labels: features, parquet > > Say I have a pandas DataFrame {{df}} that I would like to store on disk as a > dataset using pyarrow parquet, I would do this: > {code:java} > table = pyarrow.Table.from_pandas(df) > pyarrow.parquet.write_to_dataset(table, root_path=some_path, > partition_cols=['a',]){code} > On disk the dataset would look something like this: > {noformat} > some_path > ├── a=1 > │   └── 4498704937d84fe5abebb3f06515ab2d.parquet > ├── a=2 > │   └── 8bcfaed8986c4bdba587aaaee532370c.parquet{noformat} > *Wished Feature:* It'd be great if I could override the auto-assignment of the > long UUID as the filename somehow during the *dataset* writing. My purpose is to > be able to overwrite the dataset on disk when I have a new version of {{df}}. > Currently if I try to write the dataset again, another new uniquely named > [UUID].parquet file will be placed next to the old one, with the same, > redundant data. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
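`write_to_dataset` has no filename-override parameter at this point; a stdlib sketch of the hive-style path it builds, with a hypothetical `basename` argument standing in for the wished-for override (`dataset_file_path` is not pyarrow API):

```python
import os
import uuid

def dataset_file_path(root, partition_col, value, basename=None):
    """Build the hive-style path a partitioned dataset writer would use.

    basename=None mimics the current behaviour (a random UUID name);
    passing a fixed basename is the wished-for override, which would
    make rewrites land on the same file instead of piling up UUIDs.
    Sketch only -- not a pyarrow function."""
    name = basename or uuid.uuid4().hex
    part_dir = os.path.join(root, "%s=%s" % (partition_col, value))
    os.makedirs(part_dir, exist_ok=True)
    return os.path.join(part_dir, name + ".parquet")
```

With a fixed basename, writing a new version of the data overwrites the previous file rather than leaving a second, redundant `[UUID].parquet` beside it.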
[jira] [Updated] (ARROW-2026) [Python] µs timestamps saved as int64 even if use_deprecated_int96_timestamps=True
[ https://issues.apache.org/jira/browse/ARROW-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2026: Labels: parquet redshift timestamps (was: redshift timestamps) > [Python] µs timestamps saved as int64 even if > use_deprecated_int96_timestamps=True > -- > > Key: ARROW-2026 > URL: https://issues.apache.org/jira/browse/ARROW-2026 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: OS: Mac OS X 10.13.2 > Python: 3.6.4 > PyArrow: 0.8.0 >Reporter: Diego Argueta >Priority: Major > Labels: parquet, redshift, timestamps > Fix For: 0.12.0 > > > When writing to a Parquet file, if `use_deprecated_int96_timestamps` is True, > timestamps are only written as 96-bit integers if the timestamp has > nanosecond resolution. This is a problem because Amazon Redshift timestamps > only have microsecond resolution but require them to be stored in 96-bit > format in Parquet files. > I'd expect the use_deprecated_int96_timestamps flag to cause _all_ timestamps > to be written as 96 bits, regardless of resolution. If this is a deliberate > design decision, it'd be immensely helpful if it were explicitly documented > as part of the argument. > > To reproduce: > > 1. Create a table with a timestamp having microsecond or millisecond > resolution, and save it to a Parquet file. Be sure to set > `use_deprecated_int96_timestamps` to True. > > {code:java} > import datetime > import pyarrow > from pyarrow import parquet > schema = pyarrow.schema([ > pyarrow.field('last_updated', pyarrow.timestamp('us')), > ]) > data = [ > pyarrow.array([datetime.datetime.now()], pyarrow.timestamp('us')), > ] > table = pyarrow.Table.from_arrays(data, ['last_updated']) > with open('test_file.parquet', 'wb') as fdesc: > parquet.write_table(table, fdesc, > use_deprecated_int96_timestamps=True) > {code} > > 2. Inspect the file. 
I used parquet-tools: > > {noformat} > dak@tux ~ $ parquet-tools meta test_file.parquet > file: file:/Users/dak/test_file.parquet > creator: parquet-cpp version 1.3.2-SNAPSHOT > file schema: schema > > last_updated: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1 > row group 1: RC:1 TS:76 OFFSET:4 > > last_updated: INT64 SNAPPY DO:4 FPO:28 SZ:76/72/0.95 VC:1 > ENC:PLAIN,PLAIN_DICTIONARY,RLE{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
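For context on what the flag is supposed to produce: parquet's deprecated INT96 timestamp packs nanoseconds-within-day into 8 bytes followed by a 4-byte Julian day number, so a microsecond-resolution value fits in it just as well as a nanosecond one. A stdlib sketch of that 12-byte layout (my encoding of the format, not pyarrow's writer):

```python
import datetime
import struct

JULIAN_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01

def datetime_to_int96(dt):
    """Pack a datetime into the 12-byte INT96 parquet timestamp:
    8 bytes nanoseconds-within-day + 4 bytes Julian day number,
    both little-endian. Sketch of the on-disk layout only."""
    julian_day = (dt.date() - datetime.date(1970, 1, 1)).days + JULIAN_UNIX_EPOCH
    midnight = datetime.datetime.combine(dt.date(), datetime.time())
    nanos = int((dt - midnight).total_seconds() * 1_000_000) * 1_000
    return struct.pack("<qi", nanos, julian_day)

# Noon on the Unix epoch day: 12 hours of nanoseconds, Julian day 2440588.
raw = datetime_to_int96(datetime.datetime(1970, 1, 1, 12, 0, 0))
```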
[jira] [Updated] (ARROW-2079) [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't available
[ https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2079: Summary: [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't available (was: Possibly use `_common_metadata` for schema if `_metadata` isn't available) > [Python] Possibly use `_common_metadata` for schema if `_metadata` isn't > available > -- > > Key: ARROW-2079 > URL: https://issues.apache.org/jira/browse/ARROW-2079 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Jim Crist >Priority: Minor > Labels: parquet > > Currently pyarrow's parquet writer only writes `_common_metadata` and not > `_metadata`. From what I understand these are intended to contain the dataset > schema but not any row group information. > > A few (possibly naive) questions: > > 1. In the `__init__` for `ParquetDataset`, the following lines exist: > {code:java} > if self.metadata_path is not None: > with self.fs.open(self.metadata_path) as f: > self.common_metadata = ParquetFile(f).metadata > else: > self.common_metadata = None > {code} > I believe this should use `common_metadata_path` instead of `metadata_path`, > as the latter is never written by `pyarrow`, and is given by the `_metadata` > file instead of `_common_metadata` (as seemingly intended?). > > 2. In `validate_schemas` I believe an option should exist for using the > schema from `_common_metadata` instead of `_metadata`, as pyarrow currently > only writes the former, and as far as I can tell `_common_metadata` does > include all the schema information needed. 
> > Perhaps the logic in `validate_schemas` could be ported over to: > > {code:java} > if self.schema is not None: > pass # schema explicitly provided > elif self.metadata is not None: > self.schema = self.metadata.schema > elif self.common_metadata is not None: > self.schema = self.common_metadata.schema > else: > self.schema = self.pieces[0].get_metadata(open_file).schema{code} > If these changes are valid, I'd be happy to submit a PR. The difference between > `_common_metadata` and `_metadata` is not 100% clear to me, but I > believe the schema in both should be the same. Figured I'd open this for > discussion. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
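The proposed fallback order can be expressed as a small standalone function. A sketch with stand-in objects (anything exposing a `.schema` attribute mirrors `ParquetDataset`'s `metadata`/`common_metadata` attributes; this is not the actual pyarrow code):

```python
def resolve_schema(explicit, metadata, common_metadata, pieces, open_file):
    """Pick the dataset schema with the fallback order proposed above:
    explicit schema, then _metadata, then _common_metadata, then the
    first piece's footer. Sketch of the suggested logic only."""
    if explicit is not None:
        return explicit
    if metadata is not None:
        return metadata.schema
    if common_metadata is not None:
        return common_metadata.schema
    return pieces[0].get_metadata(open_file).schema
```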
[jira] [Updated] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read
[ https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3139: Labels: parquet (was: ) > [Python] ArrowIOError: Arrow error: Capacity error during read > -- > > Key: ARROW-3139 > URL: https://issues.apache.org/jira/browse/ARROW-3139 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.10.0 > Environment: pandas=0.23.1=py36h637b7d7_0 > pyarrow==0.10.0 >Reporter: Frédérique Vanneste >Priority: Major > Labels: parquet > > My assumption: the problem is caused by a large object column containing > strings up to 27 characters long (so that column is much larger than 2GB of > strings; a chunking issue). > Looks similar to > https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574 > > Code > * basket_plateau = pq.read_table("basket_plateau.parquet") > * basket_plateau = pd.read_parquet("basket_plateau.parquet") > Error produced > * ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more > than 2147483646 bytes, have 2147483655 > Dataset > * Pandas dataframe (pandas=0.23.1=py36h637b7d7_0) > * 2.7 billion records, 4 columns (int64/object/datetime64/float64) > * approx. 90 GB in memory > * example of object col: "Fresh Vegetables", "Alcohol Beers", ... (think > food retail categories) > History of the bug: > * was using an older version of pyarrow > * tried writing the dataset to disk (parquet) and failed > * stumbled on https://issues.apache.org/jira/browse/ARROW-2227 > * upgraded to 0.10 > * tried writing the dataset to disk (parquet) and succeeded > * tried reading the dataset and failed > * looks like a case similar to: > https://issues.apache.org/jira/browse/ARROW-2227?focusedCommentId=16379574=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16379574 > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
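The 2147483646-byte figure in the error is the int32 offset limit of a single BinaryArray, so a string column larger than that must be split into chunks. A stdlib sketch of such a chunking rule (illustrative only, not pyarrow's implementation):

```python
def chunk_by_bytes(strings, limit=2**31 - 2):
    """Split a sequence of strings into chunks whose total UTF-8 size
    stays under `limit` bytes -- the kind of split a reader needs
    because BinaryArray offsets are int32 (at most 2**31 - 2 bytes of
    values per array). Sketch of the rule only."""
    chunks, current, size = [], [], 0
    for s in strings:
        b = len(s.encode("utf-8"))
        if current and size + b > limit:
            chunks.append(current)
            current, size = [], 0
        current.append(s)
        size += b
    if current:
        chunks.append(current)
    return chunks

# With a tiny demo limit:
print(chunk_by_bytes(["aa", "bb", "cc"], limit=4))
# [['aa', 'bb'], ['cc']]
```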
[jira] [Reopened] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read
[ https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reopened ARROW-3139: - -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2710) [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in multiprocessing
[ https://issues.apache.org/jira/browse/ARROW-2710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2710: Labels: parquet (was: ) > [Python] pyarrow.lib.ArrowIOError when running PyTorch DataLoader in > multiprocessing > > > Key: ARROW-2710 > URL: https://issues.apache.org/jira/browse/ARROW-2710 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 > Environment: Tested on several Linux OSs. >Reporter: Michael Andrews >Priority: Major > Labels: parquet > > Unable to open a parquet file via {{pq.ParquetFile(filename)}} when called > using the PyTorch DataLoader in multiprocessing mode. Affects versions > pyarrow > 0.7.1. > As detailed in [https://github.com/apache/arrow/issues/1946]. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
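A common workaround for file handles breaking under multiprocessing is to open the file lazily in each worker rather than in the parent process. A stdlib sketch of that pattern (plain `open()` stands in for `pq.ParquetFile`; the class and its attributes are illustrative, not from pyarrow or PyTorch):

```python
import os

class LazyParquetDataset:
    """Open the underlying file on first use in each process, so a
    handle created in the parent is never shared across fork().
    Sketch of the workaround pattern only."""

    def __init__(self, path):
        self.path = path
        self._file = None
        self._pid = None

    def _ensure_open(self):
        # Reopen if this is a different process than last time (or first use).
        if self._file is None or self._pid != os.getpid():
            self._file = open(self.path, "rb")
            self._pid = os.getpid()
        return self._file

    def read(self):
        return self._ensure_open().read()
```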
[jira] [Updated] (ARROW-2591) [Python] Segmentation fault issue in pq.write_table
[ https://issues.apache.org/jira/browse/ARROW-2591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-2591: Labels: parquet (was: ) > [Python] Segmentation fault issue in pq.write_table > -- > > Key: ARROW-2591 > URL: https://issues.apache.org/jira/browse/ARROW-2591 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0, 0.9.0 >Reporter: jacques >Priority: Major > Labels: parquet > > The context is the following: I am currently dealing with sparse column > serialization in parquet. In some cases many lines are empty, and I can also have > columns containing only empty lists. > However, I get a segmentation fault when I try to write those > columns, filled only with empty lists, to parquet. > Here is a simple code snippet that reproduces the segmentation fault: > {noformat} > In [1]: import pyarrow as pa > In [2]: import pyarrow.parquet as pq > In [3]: pa_ar = pa.array([[],[]],pa.list_(pa.int32())) > In [4]: table = pa.Table.from_arrays([pa_ar],["test"]) > In [5]: pq.write_table( > ...: table=table, > ...: where="test.parquet", > ...: compression="snappy", > ...: flavor="spark" > ...: ) > Segmentation fault > {noformat} > May I have it fixed? > Best > Jacques -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read
[ https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-3139. --- Resolution: Duplicate duplicate of ARROW-3762 (formerly PARQUET-1239) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3585) [Python] Update the documentation about Schema & Metadata usage
[ https://issues.apache.org/jira/browse/ARROW-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3585: Labels: beginner documentation easyfix newbie parquet (was: beginner documentation easyfix newbie) > [Python] Update the documentation about Schema & Metadata usage > --- > > Key: ARROW-3585 > URL: https://issues.apache.org/jira/browse/ARROW-3585 > Project: Apache Arrow > Issue Type: Task > Components: Documentation >Reporter: Daniel Haviv >Assignee: Daniel Haviv >Priority: Trivial > Labels: beginner, documentation, easyfix, newbie, parquet > Original Estimate: 24h > Remaining Estimate: 24h > > Reusing the Schema object from a Spark-written Parquet file with Pandas > fails due to a schema mismatch. > The culprit is the metadata part of the schema, which each component fills > according to its implementation. More details can be found here: > [https://github.com/apache/arrow/issues/2805] > The documentation should point that out. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3139) [Python] ArrowIOError: Arrow error: Capacity error during read
[ https://issues.apache.org/jira/browse/ARROW-3139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3139: Summary: [Python] ArrowIOError: Arrow error: Capacity error during read (was: [Python]ArrowIOError: Arrow error: Capacity error during read) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3652) [Python] CategoricalIndex is lost after reading back
[ https://issues.apache.org/jira/browse/ARROW-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3652: Labels: parquet (was: ) > [Python] CategoricalIndex is lost after reading back > > > Key: ARROW-3652 > URL: https://issues.apache.org/jira/browse/ARROW-3652 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1 >Reporter: Armin Berres >Priority: Major > Labels: parquet > > When a {{CategoricalIndex}} is written and read back, the resulting index is > no longer categorical. > {code} > df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2']) > df['c1'] = df['c1'].astype('category') > df = df.set_index(['c1']) > table = pa.Table.from_pandas(df) > pq.write_table(table, 'test.parquet') > ref_df = pq.read_pandas('test.parquet').to_pandas() > print(df.index) > # CategoricalIndex(['a', 'c'], categories=['a', 'c'], ordered=False, > name='c1', dtype='category') > print(ref_df.index) > # Index(['a', 'c'], dtype='object', name='c1') > {code} > The metadata correctly contains the information: > {code:java} > {"name": "c1", "field_name": "c1", "pandas_type": "categorical", > "numpy_type": "int8", "metadata": {"num_categories": 2, "ordered": false}} > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
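Since the pandas metadata survives the round trip, one workaround is to read it back and re-apply `.astype('category')` to the affected fields. A sketch operating on the decoded `b'pandas'` schema-metadata dict (a workaround pattern, not pyarrow behaviour):

```python
def categorical_fields(pandas_metadata):
    """Return the names of fields the pandas schema metadata marks as
    categorical, so they can be restored with .astype('category')
    after to_pandas(). `pandas_metadata` is the decoded dict stored
    under the b'pandas' key of the Arrow schema metadata; sketch of a
    workaround only."""
    fields = pandas_metadata.get("columns", [])
    return [f["name"] for f in fields
            if f.get("pandas_type") == "categorical"]
```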
[jira] [Updated] (ARROW-3654) [Python] Column with CategoricalIndex fails to be read back
[ https://issues.apache.org/jira/browse/ARROW-3654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3654: Labels: parquet (was: ) > [Python] Column with CategoricalIndex fails to be read back > --- > > Key: ARROW-3654 > URL: https://issues.apache.org/jira/browse/ARROW-3654 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1 >Reporter: Armin Berres >Priority: Major > Labels: parquet > > When a column with a {{CategoricalIndex}} is written, the data can never be read > back. > {code:python} > df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['c1', 'c2']) > df['c1'] = df['c1'].astype('category') > df = df.set_index(['c1']) > table = pa.Table.from_pandas(df) > pq.write_table(table, 'test.parquet') > pq.read_pandas('test.parquet').to_pandas() > {code} > Results in > {code} > KeyError Traceback (most recent call last) > ~/venv/mpptool/lib/python3.7/site-packages/pyarrow/pandas_compat.py in > _pandas_type_to_numpy_type(pandas_type) > 676 try: > --> 677 return _pandas_logical_type_map[pandas_type] > 678 except KeyError: > KeyError: 'categorical' > {code} > The schema looks good: > {code} > column_indexes": [{"name": "c1", "field_name": "c1", "pandas_type": > "categorical", "numpy_type": "int8", "metadata": {"num_categories": 2, > "ordered": false}}] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3650) [Python] Mixed column indexes are read back as strings
[ https://issues.apache.org/jira/browse/ARROW-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-3650: Labels: parquet (was: ) > [Python] Mixed column indexes are read back as strings > --- > > Key: ARROW-3650 > URL: https://issues.apache.org/jira/browse/ARROW-3650 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.11.1 >Reporter: Armin Berres >Priority: Major > Labels: parquet > > Consider the following example: > {code:java} > df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['a > string', pd.to_datetime('2018/01/02')]) > table = pa.Table.from_pandas(df) > pq.write_table(table, 'test.parquet') > ref_df = pq.read_pandas('test.parquet').to_pandas() > print(df.columns) > # Index(['a string', 2018-01-02 00:00:00], dtype='object') > print(ref_df.columns) > # Index(['a string', '2018-01-02 00:00:00'], dtype='object') > {code} > The serialized data frame has an index with a string and a datetime field > (this happened when resetting the index of a formerly datetime-only column). > When reading back, the datetime field is converted into a string. > Looking at the schema, I find {{"pandas_type": "mixed", "numpy_type": "object"}} before serializing and {{"pandas_type": "unicode", "numpy_type": "object"}} after reading back. So the schema was aware > of the mixed type but did not store the actual types. > The same happens with other types like numbers as well. One can produce > interesting situations: > {{pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1])}} > can be written but fails to be read back, as the index is no longer unique with > '1' showing up two times. > If this is not a bug but expected behavior, maybe the user should be somehow warned > that information is lost? Like a {{NotImplemented}} exception. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
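The `['1', 1]` failure mode above is easy to demonstrate without pyarrow: once every label is coerced to `str`, distinct labels can collide. A stdlib sketch that detects such collisions before a round trip (the helper is illustrative, not pyarrow API):

```python
def stringified_collisions(labels):
    """Return the labels that become ambiguous when every label is
    coerced to str, as happens to mixed column indexes on read-back.
    Sketch of the failure mode described above."""
    seen = {}
    for lab in labels:
        seen.setdefault(str(lab), []).append(lab)
    return {s: orig for s, orig in seen.items() if len(orig) > 1}

print(stringified_collisions(["1", 1]))
# {'1': ['1', 1]}
```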