[jira] [Created] (ARROW-3668) [R] Namespace dependency 'bit64' is not required

2018-10-31 Thread James Lamb (JIRA)
James Lamb created ARROW-3668:
-

 Summary: [R] Namespace dependency 'bit64' is not required
 Key: ARROW-3668
 URL: https://issues.apache.org/jira/browse/ARROW-3668
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: James Lamb


Dependency 'bit64' was added to the R package recently, but that package was 
not added to the DESCRIPTION file.

This error [was caught on 
Travis|https://travis-ci.org/apache/arrow/jobs/449082414], but merging 
whichever PR added it wasn't blocked because failures in the R build are 
allowed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-3667) [JS] Incorrectly reads record batches with an all null column

2018-10-31 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-3667:


 Summary: [JS] Incorrectly reads record batches with an all null 
column
 Key: ARROW-3667
 URL: https://issues.apache.org/jira/browse/ARROW-3667
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: JS-0.3.1
Reporter: Brian Hulette
 Fix For: JS-0.4.0


The JS library seems to incorrectly read any columns that come after an 
all-null column in IPC buffers produced by pyarrow.

Here's a Python script that generates two Arrow buffers: one with an all-null 
column followed by a utf-8 column, and a second with those two columns reversed.

{code:python}
import pyarrow as pa
import pandas as pd

def serialize_to_arrow(df, fd, compress=True):
    # Write the DataFrame to fd as a single record batch in the Arrow file format.
    batch = pa.RecordBatch.from_pandas(df)
    writer = pa.RecordBatchFileWriter(fd, batch.schema)

    writer.write_batch(batch)
    writer.close()

if __name__ == "__main__":
    # bad.arrow: the all-null column comes before the utf-8 column.
    df = pd.DataFrame(data={'nulls': [None, None, None],
                            'not nulls': ['abc', 'def', 'ghi']},
                      columns=['nulls', 'not nulls'])
    with open('bad.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
    # good.arrow: same data with the column order reversed.
    df = pd.DataFrame(df, columns=['not nulls', 'nulls'])
    with open('good.arrow', 'wb') as fd:
        serialize_to_arrow(df, fd)
{code}

JS incorrectly interprets the [null, not null] case:

{code:javascript}
> var arrow = require('apache-arrow')
undefined
> var fs = require('fs')
undefined
> arrow.Table.from(fs.readFileSync('good.arrow')).getColumn('not nulls').get(0)
'abc'
> arrow.Table.from(fs.readFileSync('bad.arrow')).getColumn('not nulls').get(0)
'\u0000\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u0006\u0000\u0000\u0000\t\u0000\u0000\u0000'
{code}

Presumably this is because pyarrow is omitting some (or all) of the buffers 
associated with the all-null column, but the JS IPC reader is still looking for 
them, causing the buffer count to get out of sync.
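
The suspected failure mode can be illustrated with a toy model. This is only a sketch of how a sequential reader drifts when it expects a different number of buffers than the writer emitted; it is not the real Arrow IPC layout and not code from pyarrow or the JS library. The buffer counts per type and the names `write_batch` and `read_batch` are assumptions made for the illustration.

```python
# Toy model only: buffers are labelled strings, not real Arrow buffers.
WRITER_BUFFERS = {"null": 0, "utf8": 3}  # assumed: writer emits nothing for a null column
READER_BUFFERS = {"null": 1, "utf8": 3}  # assumed: reader still expects one buffer


def write_batch(columns):
    """Flatten (type, buffers) pairs into one buffer list, in column order."""
    flat = []
    for col_type, buffers in columns:
        assert len(buffers) == WRITER_BUFFERS[col_type]
        flat.extend(buffers)
    return flat


def read_batch(flat, schema, expected):
    """Consume buffers sequentially, taking expected[type] buffers per column."""
    columns, pos = [], 0
    for col_type in schema:
        count = expected[col_type]
        columns.append(flat[pos:pos + count])
        pos += count
    return columns


utf8 = ("utf8", ["validity", "offsets", "data"])
null = ("null", [])

# Null column first: the reader takes a utf8 buffer for the null column,
# so every buffer of the following column is shifted by one.
bad = read_batch(write_batch([null, utf8]), ["null", "utf8"], READER_BUFFERS)

# Null column last: the mismatch happens after the utf8 column was read,
# so the utf8 column is unaffected.
good = read_batch(write_batch([utf8, null]), ["utf8", "null"], READER_BUFFERS)
```

In this model the column order alone decides whether the utf-8 column is read correctly, which matches the good.arrow / bad.arrow behaviour described above, although the exact shift in the real reader may differ.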






[jira] [Created] (ARROW-3666) [C++] Improve CSV parser performance

2018-10-31 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-3666:
-

 Summary: [C++] Improve CSV parser performance
 Key: ARROW-3666
 URL: https://issues.apache.org/jira/browse/ARROW-3666
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Affects Versions: 0.11.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


The CSV parser is currently the bottleneck when reading CSV files. There are a 
couple ways to make it a bit faster.





Merging RecordBatches [C++]

2018-10-31 Thread Ambalu, Robert
Hey, I'm trying to figure out how to merge multiple record batches in order to 
optimize overly-chunked tables.
A bit of background here... we have a process that streams table rows with 
a batch size of 1 (because we want to ensure updates are written out in case 
of a crash). We also have some code that reads this table on startup.
Our reading code has logic to access a specific row of a table, which this 
startup code does. To access a specific row you need to iterate through all 
chunks to find the right one. We're hitting a bottleneck on this specific 
file since it has a chunk size of 1. The simplest solution for us would be to 
merge all the chunked data into one chunk on startup when we read in the Arrow 
file. We've tried to find a way to do this using the Arrow C++ library / 
documents but can't seem to find a clean approach.
Is there any clean way to do this? Any other possible suggestions?

Side note - we did notice there's a method called "RechunkArraysConsistently". 
We couldn't find much info on it, but if that somehow ensures all chunks are 
the same size and we can re-chunk the columns, then row access would be a 
quick calculation (if all chunks are the same size, computing the chunk and the 
row within the chunk is quick).


Thanks
- Rob
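
The row lookup discussed in this message can be sketched independently of the Arrow API. The following is a minimal Python illustration, not Arrow code: `build_chunk_offsets`, `locate_row` and `locate_row_uniform` are hypothetical helper names, and chunks are represented only by their lengths. It shows the quick calculation for equal-sized chunks, plus a binary search over cumulative offsets that avoids iterating through every chunk when sizes differ.

```python
from bisect import bisect_right
from itertools import accumulate


def build_chunk_offsets(chunk_lengths):
    """Return the starting row of each chunk, with the total row count appended."""
    return [0] + list(accumulate(chunk_lengths))


def locate_row(offsets, row):
    """Map a global row index to (chunk_index, index_in_chunk) by binary search."""
    if not 0 <= row < offsets[-1]:
        raise IndexError(f"row {row} out of range")
    # The last offset that is <= row marks the chunk containing the row;
    # empty chunks are skipped because their offset equals the next one.
    chunk = bisect_right(offsets, row) - 1
    return chunk, row - offsets[chunk]


def locate_row_uniform(chunk_size, row):
    """Same mapping when every chunk has exactly chunk_size rows."""
    return divmod(row, chunk_size)


# Offsets are computed once on startup, then reused for every lookup.
offsets = build_chunk_offsets([3, 1, 4])
```

With a batch size of 1 the offsets list has one entry per row, so each lookup costs a binary search rather than a linear scan; merging the batches into a single chunk would still be needed to get contiguous data.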










[jira] [Created] (ARROW-3665) Implement StructArrayBuilder

2018-10-31 Thread Chao Sun (JIRA)
Chao Sun created ARROW-3665:
---

 Summary: Implement StructArrayBuilder
 Key: ARROW-3665
 URL: https://issues.apache.org/jira/browse/ARROW-3665
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Chao Sun








[jira] [Created] (ARROW-3664) [Rust] Add benchmark for PrimitiveArrayBuilder

2018-10-31 Thread Chao Sun (JIRA)
Chao Sun created ARROW-3664:
---

 Summary: [Rust] Add benchmark for PrimitiveArrayBuilder
 Key: ARROW-3664
 URL: https://issues.apache.org/jira/browse/ARROW-3664
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Chao Sun


We should add a benchmark for the {{PrimitiveArrayBuilder}} to measure and 
track its performance.





Re: Sync today right now

2018-10-31 Thread Jacques Nadeau
Recap: Short one today.

# Attendees
Jacques
Pearu
Brendan
Li

# Topics
## Go flatbuffers support:
Brendan and company are interested in contributing this. They want to
discuss approach with existing Go developers. Recommendation was to start a
thread on mailing list and then create follow-up jiras as tasks are
identified.

## Better Dictionary Support in Java
Li has some ideas on this but they are complex. Plans to write up a
proposal for the mailing list.

On Wed, Oct 31, 2018 at 9:02 AM Jacques Nadeau wrote:

> https://meet.google.com/vtm-teks-phx
>


Sync today right now

2018-10-31 Thread Jacques Nadeau
https://meet.google.com/vtm-teks-phx


[jira] [Created] (ARROW-3663) pyarrow install via pip3 fails with error no module named Cython

2018-10-31 Thread Rajshekhar K (JIRA)
Rajshekhar K created ARROW-3663:
---

 Summary: pyarrow install via pip3 fails with error no module named 
Cython
 Key: ARROW-3663
 URL: https://issues.apache.org/jira/browse/ARROW-3663
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 0.11.1
Reporter: Rajshekhar K


Hi Team,

 

The issue is reproducible:

# pip3 install pyarrow

Installation fails with "No module named 'Cython'". It seems Cython is not 
declared in the package's requirements.

 
{code:java}
Downloading pyarrow-0.10.0.tar.gz (2.1MB): 2.1MB downloaded
Running setup.py (path:/tmp/pip_build_root/pyarrow/setup.py) egg_info for 
package pyarrow
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/pyarrow/setup.py", line 29, in <module>
from Cython.Distutils import build_ext as _build_ext
ImportError: No module named 'Cython'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/pyarrow/setup.py", line 29, in <module>
from Cython.Distutils import build_ext as _build_ext
ImportError: No module named 'Cython'

Cleaning up...
{code}
 

 

Tested on environment: Ubuntu 14.04

Pip version:
{noformat}
pip 1.5.4 from /usr/lib/python3/dist-packages (python 3.4){noformat}
 

Thanks,

 





[jira] [Created] (ARROW-3662) [C++] Add a const overload to MemoryMappedFile::GetSize

2018-10-31 Thread Dimitri Vorona (JIRA)
Dimitri Vorona created ARROW-3662:
-

 Summary: [C++] Add a const overload to MemoryMappedFile::GetSize
 Key: ARROW-3662
 URL: https://issues.apache.org/jira/browse/ARROW-3662
 Project: Apache Arrow
  Issue Type: New Feature
Affects Versions: 0.11.1
Reporter: Dimitri Vorona


 

While GetSize in general is not a const function, it can be on a 
MemoryMappedFile. I propose to add a const overload directly to 
MemoryMappedFile.

Alternatively we could add a const version on the RandomAccessFile level which 
would fail if getting the size in a const way (e.g. without a seek) isn't 
possible, but it seems to me to be a potential source of hard-to-debug bugs and 
spurious failures. It would at least require a careful analysis of the platform 
support for the different ways of getting the size.





[jira] [Created] (ARROW-3661) [Gandiva][GLib] Improve constant name

2018-10-31 Thread Kouhei Sutou (JIRA)
Kouhei Sutou created ARROW-3661:
---

 Summary: [Gandiva][GLib] Improve constant name
 Key: ARROW-3661
 URL: https://issues.apache.org/jira/browse/ARROW-3661
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Gandiva, GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-3660) [C++] Don't unnecessary lock MemoryMappedFile for resizing in readonly files

2018-10-31 Thread Dimitri Vorona (JIRA)
Dimitri Vorona created ARROW-3660:
-

 Summary: [C++] Don't unnecessary lock MemoryMappedFile for 
resizing in readonly files
 Key: ARROW-3660
 URL: https://issues.apache.org/jira/browse/ARROW-3660
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Affects Versions: 0.11.0
Reporter: Dimitri Vorona


We lock the resize_lock_ on every read, even though it's not possible to resize 
read-only files. 

 

This also inlines the ResizeMap method into Resize.
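
The idea of skipping the lock for read-only files can be sketched as follows. This is a hypothetical Python stand-in, not Arrow's C++ MemoryMappedFile: the class `MappedRegion`, its methods and the `lock_acquisitions` counter are all invented for the illustration. Reads take the resize lock only when the mapping is writable, since only a writable mapping can be resized concurrently.

```python
import threading
from contextlib import nullcontext


class MappedRegion:
    """Hypothetical stand-in for a memory-mapped file (not Arrow's class)."""

    def __init__(self, data, writable):
        self._data = bytearray(data)
        self._writable = writable
        self._resize_lock = threading.Lock()
        self.lock_acquisitions = 0  # instrumentation for this sketch only

    def _guard(self):
        # A read-only mapping can never be resized, so no lock is needed.
        if not self._writable:
            return nullcontext()
        self.lock_acquisitions += 1
        return self._resize_lock

    def read_at(self, position, nbytes):
        with self._guard():
            return bytes(self._data[position:position + nbytes])

    def resize(self, new_size):
        if not self._writable:
            raise IOError("cannot resize a read-only mapping")
        with self._guard():
            if new_size < len(self._data):
                del self._data[new_size:]
            else:
                self._data.extend(bytes(new_size - len(self._data)))


read_only = MappedRegion(b"hello world", writable=False)
read_write = MappedRegion(b"hello world", writable=True)
```

Calling `read_only.read_at(0, 5)` never touches the lock, while the same call on `read_write` still serializes with `resize`.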


