[jira] [Created] (ARROW-3668) [R] Namespace dependency 'bit64' is not required
James Lamb created ARROW-3668:

Summary: [R] Namespace dependency 'bit64' is not required
Key: ARROW-3668
URL: https://issues.apache.org/jira/browse/ARROW-3668
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: James Lamb

The 'bit64' dependency was added to the R package recently, but it was not added to the DESCRIPTION file. This error [was caught on Travis|https://travis-ci.org/apache/arrow/jobs/449082414], but the PR that added the dependency wasn't blocked from merging because failures in the R build are allowed.
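For reference, a minimal sketch of the likely fix, assuming bit64 belongs under Imports in the R package's DESCRIPTION ("..." stands in for the existing entries; Suggests would be the right field instead if the dependency were optional):

{noformat}
Imports:
    ...,
    bit64
{noformat}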
[jira] [Created] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column
Brian Hulette created ARROW-3667:

Summary: [JS] Incorrectly reads record batches with an all null column
Key: ARROW-3667
URL: https://issues.apache.org/jira/browse/ARROW-3667
Project: Apache Arrow
Issue Type: Bug
Affects Versions: JS-0.3.1
Reporter: Brian Hulette
Fix For: JS-0.4.0

The JS library seems to incorrectly read any columns that come after an all-null column in IPC buffers produced by pyarrow. Here's a Python script that generates two Arrow buffers: one with an all-null column followed by a utf-8 column, and a second with those two reversed.

{code:python}
import pyarrow as pa
import pandas as pd


def serialize_to_arrow(df, fd, compress=True):
    batch = pa.RecordBatch.from_pandas(df)
    writer = pa.RecordBatchFileWriter(fd, batch.schema)
    writer.write_batch(batch)
    writer.close()


if __name__ == "__main__":
    df = pd.DataFrame(data={'nulls': [None, None, None],
                            'not nulls': ['abc', 'def', 'ghi']},
                      columns=['nulls', 'not nulls'])
    with open('bad.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
    df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
    with open('good.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
{code}

JS incorrectly interprets the [null, not null] case:

{code:javascript}
> var arrow = require('apache-arrow')
undefined
> var fs = require('fs')
undefined
> arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
'abc'
> arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
'\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
{code}

Presumably this is because pyarrow is omitting some (or all) of the buffers associated with the all-null column, but the JS IPC reader is still looking for them, causing the buffer count to get out of sync.
[jira] [Created] (ARROW-3666) [C++] Improve CSV parser performance
Antoine Pitrou created ARROW-3666:

Summary: [C++] Improve CSV parser performance
Key: ARROW-3666
URL: https://issues.apache.org/jira/browse/ARROW-3666
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Affects Versions: 0.11.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou

The CSV parser is currently the bottleneck when reading CSV files. There are a couple of ways to make it a bit faster.
Merging RecordBatches [C++]
Hey, I'm trying to figure out how to merge multiple RecordBatches in order to optimize overly-chunked tables.

A bit of background here... we have a process that streams table rows with a batch size of 1 (because we want to ensure updates are written out in case of a crash). We also have some code that reads this table on startup. Our reading code has logic to access a specific row of a table, which this startup code uses. To access a specific row you need to iterate through all the chunks to find the right one, and we're hitting a bottleneck on this particular file since it has a chunk size of 1.

The simplest solution for us would be to merge all the chunked data into one chunk when we read in the Arrow file on startup. We've looked through the Arrow C++ library and documentation but can't seem to find a clean approach. Is there any clean way to do this? Any other suggestions?

Side note - we did notice there's a method called "RechunkArraysConsistently". We couldn't find much info on it, but if it somehow ensures all chunks are of the same size and we can re-chunk the columns, then row access would be a quick calculation (if all chunks are the same size, computing the chunk and the row within the chunk is cheap).

Thanks - Rob
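A minimal sketch of one way to do this, assuming a version of Arrow C++ that provides arrow::Table::CombineChunks and the Result-based IO APIs (both arrived after 0.11, so older releases would need to concatenate each column's chunks by hand): read every batch from the file, assemble a Table, then collapse each column to a single chunk.

{code:cpp}
#include <arrow/api.h>
#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>

// Read an Arrow IPC file containing many tiny record batches and
// compact it into a table with exactly one chunk per column.
arrow::Result<std::shared_ptr<arrow::Table>> ReadCompacted(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
  ARROW_ASSIGN_OR_RAISE(auto reader, arrow::ipc::RecordBatchFileReader::Open(file));

  std::vector<std::shared_ptr<arrow::RecordBatch>> batches;
  batches.reserve(reader->num_record_batches());
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
    batches.push_back(std::move(batch));
  }

  ARROW_ASSIGN_OR_RAISE(auto table,
                        arrow::Table::FromRecordBatches(reader->schema(), batches));
  // CombineChunks copies every column's data into one contiguous chunk,
  // so subsequent row lookups no longer have to walk a chunk list.
  return table->CombineChunks();
}
{code}

This copies each column's data once at startup, which is usually acceptable for a read-once compaction. As for RechunkArraysConsistently: as far as I know it only aligns chunk boundaries across columns (so every column chunks at the same row offsets); it does not make the chunks equally sized.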
[jira] [Created] (ARROW-3665) Implement StructArrayBuilder
Chao Sun created ARROW-3665:

Summary: Implement StructArrayBuilder
Key: ARROW-3665
URL: https://issues.apache.org/jira/browse/ARROW-3665
Project: Apache Arrow
Issue Type: Sub-task
Components: Rust
Reporter: Chao Sun
[jira] [Created] (ARROW-3664) [Rust] Add benchmark for PrimitiveArrayBuilder
Chao Sun created ARROW-3664:

Summary: [Rust] Add benchmark for PrimitiveArrayBuilder
Key: ARROW-3664
URL: https://issues.apache.org/jira/browse/ARROW-3664
Project: Apache Arrow
Issue Type: Sub-task
Reporter: Chao Sun

We should add a benchmark for the {{PrimitiveArrayBuilder}} to measure and track its performance.
Re: Sync today right now
Recap: Short one today.

# Attendees

Jacques
Pearu
Brendan
Li

# Topics

## Go flatbuffers support

Brendan and company are interested in contributing this. They want to discuss the approach with the existing Go developers. The recommendation was to start a thread on the mailing list and then create follow-up JIRAs as tasks are identified.

## Better Dictionary Support in Java

Li has some ideas on this, but they are complex. He plans to write up a proposal for the mailing list.

On Wed, Oct 31, 2018 at 9:02 AM Jacques Nadeau wrote:
> https://meet.google.com/vtm-teks-phx
Sync today right now
https://meet.google.com/vtm-teks-phx
[jira] [Created] (ARROW-3663) pyarrow install via pip3 fails with error no module named Cython
Rajshekhar K created ARROW-3663:

Summary: pyarrow install via pip3 fails with error no module named Cython
Key: ARROW-3663
URL: https://issues.apache.org/jira/browse/ARROW-3663
Project: Apache Arrow
Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Rajshekhar K

Hi Team,

The issue is reproducible:

# pip3 install pyarrow

The installation fails with "No module named 'Cython'". It seems Cython is not listed in the requirements.

{code:java}
Downloading pyarrow-0.10.0.tar.gz (2.1MB): 2.1MB downloaded
Running setup.py (path:/tmp/pip_build_root/pyarrow/setup.py) egg_info for package pyarrow
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/tmp/pip_build_root/pyarrow/setup.py", line 29, in <module>
    from Cython.Distutils import build_ext as _build_ext
ImportError: No module named 'Cython'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/tmp/pip_build_root/pyarrow/setup.py", line 29, in <module>
    from Cython.Distutils import build_ext as _build_ext
ImportError: No module named 'Cython'
Cleaning up...
{code}

Tested environment: Ubuntu 14.04

Pip version:
{noformat}
pip 1.5.4 from /usr/lib/python3/dist-packages (python 3.4)
{noformat}

Thanks,
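For what it's worth, the likely trigger is that pip 1.5.4 predates manylinux1 wheel support (added in pip 8.1), so pip falls back to the source tarball, whose setup.py imports Cython at the top. A hedged workaround sketch; upgrading pip is the simpler path, since a from-source build would also need the Arrow C++ libraries installed:

{noformat}
# Upgrade pip so it can pick up the prebuilt manylinux wheels:
pip3 install --upgrade pip
pip3 install pyarrow

# Or install Cython first; note that a source build additionally
# needs the Arrow C++ libraries, so this alone may not be enough:
pip3 install Cython
pip3 install pyarrow
{noformat}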
[jira] [Created] (ARROW-3662) [C++] Add a const overload to MemoryMappedFile::GetSize
Dimitri Vorona created ARROW-3662:

Summary: [C++] Add a const overload to MemoryMappedFile::GetSize
Key: ARROW-3662
URL: https://issues.apache.org/jira/browse/ARROW-3662
Project: Apache Arrow
Issue Type: New Feature
Affects Versions: 0.11.1
Reporter: Dimitri Vorona

While GetSize is not a const function in general, it can be on a MemoryMappedFile. I propose adding a const overload directly to MemoryMappedFile. Alternatively, we could add a const version at the RandomAccessFile level which would fail if getting the size without mutation (e.g. without a seek) isn't possible, but that seems to me to be a potential source of hard-to-debug bugs and spurious failures. It would at least require a careful analysis of the platform support for the different ways of getting the size.
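A rough sketch of the proposed shape, using the Status-and-out-parameter style of the 0.11-era IO interfaces (the class below is a standalone illustration, not Arrow's actual MemoryMappedFile):

{code:cpp}
#include <cstdint>

#include <arrow/status.h>

// Standalone illustration of the proposed API shape only.
class MemoryMappedFileSketch {
 public:
  explicit MemoryMappedFileSketch(int64_t mapped_size) : mapped_size_(mapped_size) {}

  // Today's non-const form, inherited from the RandomAccessFile interface.
  arrow::Status GetSize(int64_t* size) {
    *size = mapped_size_;
    return arrow::Status::OK();
  }

  // Proposed const overload: the mapped region's size is already known,
  // so no seek (and no mutation) is needed to report it.
  arrow::Status GetSize(int64_t* size) const {
    *size = mapped_size_;
    return arrow::Status::OK();
  }

 private:
  int64_t mapped_size_;
};
{code}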
[jira] [Created] (ARROW-3661) [Gandiva][GLib] Improve constant name
Kouhei Sutou created ARROW-3661:

Summary: [Gandiva][GLib] Improve constant name
Key: ARROW-3661
URL: https://issues.apache.org/jira/browse/ARROW-3661
Project: Apache Arrow
Issue Type: Improvement
Components: Gandiva, GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
[jira] [Created] (ARROW-3660) [C++] Don't unnecessarily lock MemoryMappedFile for resizing in read-only files
Dimitri Vorona created ARROW-3660:

Summary: [C++] Don't unnecessarily lock MemoryMappedFile for resizing in read-only files
Key: ARROW-3660
URL: https://issues.apache.org/jira/browse/ARROW-3660
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Affects Versions: 0.11.0
Reporter: Dimitri Vorona

We currently lock resize_lock_ on every read, even though it's not possible to resize read-only files. The accompanying patch also inlines the ResizeMap method into Resize.
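A minimal sketch of the idea (standalone illustration with assumed member names resize_lock_, writable_, and data_, not the actual Arrow code): only reads on a writable mapping need to guard against a concurrent remap.

{code:cpp}
#include <cstdint>
#include <cstring>
#include <mutex>

// Illustrative sketch, not Arrow's implementation.
class MappedRegion {
 public:
  int64_t Read(int64_t position, int64_t nbytes, void* out) {
    if (writable_) {
      // A writable mapping can be resized concurrently, so reads must
      // hold the lock to keep a remap from invalidating data_.
      std::lock_guard<std::mutex> guard(resize_lock_);
      return DoRead(position, nbytes, out);
    }
    // A read-only mapping can never be resized: skip the lock entirely.
    return DoRead(position, nbytes, out);
  }

 private:
  int64_t DoRead(int64_t position, int64_t nbytes, void* out) {
    std::memcpy(out, data_ + position, static_cast<size_t>(nbytes));
    return nbytes;
  }

  bool writable_ = false;
  const uint8_t* data_ = nullptr;
  std::mutex resize_lock_;
};
{code}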