[jira] [Created] (ARROW-4581) [C++] gbenchmark_ep is a dependency of unit tests when ARROW_BUILD_BENCHMARKS=ON

2019-02-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4581:
---

 Summary: [C++] gbenchmark_ep is a dependency of unit tests when 
ARROW_BUILD_BENCHMARKS=ON
 Key: ARROW-4581
 URL: https://issues.apache.org/jira/browse/ARROW-4581
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.13.0


I hit this issue when trying to use clang-7 from conda-forge, and I wasn't sure 
why gbenchmark_ep was getting built when I was building only a single unit test 
executable like arrow-array-test.

https://github.com/google/benchmark/issues/351





Re: building py-arrow with CUDA

2019-02-14 Thread Andrew Palumbo
Sorry, I forgot to include the gist:

https://gist.github.com/andrewpalumbo/d85d57063e58ae81134426ca640aded9



Thanks very much,

Andy

From: Andrew Palumbo 
Sent: Thursday, February 14, 2019 7:48 PM
To: dev@arrow.apache.org
Subject: building py-arrow with CUDA

Hello,
I've been trying to get py-arrow built with CUDA support. I've had help from 
Wes and Perau on user@, and it seems that the docs for building with CUDA are 
out of date; Wes suggested that I try here.

I have a script:

conda create -n pyarrow-dev
conda activate pyarrow-dev
conda install python numpy six setuptools cython pandas pytest \
  cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \
  gflags brotli jemalloc lz4-c zstd \
  double-conversion glog autoconf hypothesis numba \
  clangdev=6 flake8 gtest gmock \
  -c conda-forge

git clone https://github.com/apache/arrow.git

conda activate pyarrow-dev
cd arrow
export ARROW_BUILD_TYPE=release
export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX
export ARROW_HOME=$CONDA_PREFIX
export PARQUET_HOME=$CONDA_PREFIX
export NUMBAPRO_LIBDEVICE=/usr/local/cuda-9.0/nvvm/libdevice
export NUMBAPRO_NVVM=/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so

cd cpp

cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DARROW_PARQUET=off  -DARROW_PYTHON=on  \
  -DARROW_PLASMA=off -DARROW_BUILD_TESTS=OFF \
  -DARROW_CUDA=on \
  -DCLANG_FORMAT_BIN=`which clang-format` \
  .
make -j3
make install
cd ../python
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-cuda develop
py.test -sv pyarrow/


This is a slightly modified version of a script Perau gave me (I'd earlier been 
trying to build strictly with cmake).

I'm working on the Amazon Deep Learning AMI:
conda 4.6.4
Clang 6.0.1 (need to upgrade this)
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11)
cmake version 3.13.2
GNU Make 4.1
Python 3.6.7
CUDA 9.0


I can build py-arrow without CUDA, but my needs require CUDA support.

Any help would be appreciated.

Thanks in advance,

Andy
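
Once the build succeeds, here is a minimal sketch for checking that the CUDA 
extension actually works (an illustration only — it assumes a CUDA-capable GPU 
and that the pyarrow.cuda module was built):

import pyarrow as pa
from pyarrow import cuda

ctx = cuda.Context(0)                 # first GPU device
dbuf = ctx.new_buffer(1024)           # allocate 1 KiB of device memory
print("device buffer size:", dbuf.size)

# Round-trip some host data through device memory
host = pa.py_buffer(b"hello arrow")
cbuf = ctx.buffer_from_data(host)
assert cbuf.copy_to_host().to_pybytes() == b"hello arrow"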




building py-arrow with CUDA

2019-02-14 Thread Andrew Palumbo
Hello,
I've been trying to get py-arrow built with CUDA support. I've had help from 
Wes and Perau on user@, and it seems that the docs for building with CUDA are 
out of date; Wes suggested that I try here.

I have a script:

conda create -n pyarrow-dev
conda activate pyarrow-dev
conda install python numpy six setuptools cython pandas pytest \
  cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \
  gflags brotli jemalloc lz4-c zstd \
  double-conversion glog autoconf hypothesis numba \
  clangdev=6 flake8 gtest gmock \
  -c conda-forge

git clone https://github.com/apache/arrow.git

conda activate pyarrow-dev
cd arrow
export ARROW_BUILD_TYPE=release
export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX
export ARROW_HOME=$CONDA_PREFIX
export PARQUET_HOME=$CONDA_PREFIX
export NUMBAPRO_LIBDEVICE=/usr/local/cuda-9.0/nvvm/libdevice
export NUMBAPRO_NVVM=/usr/local/cuda-9.0/nvvm/lib64/libnvvm.so

cd cpp

cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
  -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DARROW_PARQUET=off  -DARROW_PYTHON=on  \
  -DARROW_PLASMA=off -DARROW_BUILD_TESTS=OFF \
  -DARROW_CUDA=on \
  -DCLANG_FORMAT_BIN=`which clang-format` \
  .
make -j3
make install
cd ../python
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-cuda develop
py.test -sv pyarrow/


This is a slightly modified version of a script Perau gave me (I'd earlier been 
trying to build strictly with cmake).

I'm working on the Amazon Deep Learning AMI:
conda 4.6.4
Clang 6.0.1 (need to upgrade this)
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11)
cmake version 3.13.2
GNU Make 4.1
Python 3.6.7
CUDA 9.0


I can build py-arrow without CUDA, but my needs require CUDA support.

Any help would be appreciated.

Thanks in advance,

Andy




Re: pyarrow schema in protobuf

2019-02-14 Thread Wes McKinney
Hi Ryan,

See

http://arrow.apache.org/docs/python/generated/pyarrow.Buffer.html#pyarrow.Buffer

You can call to_pybytes() on the result of serialize().

HTH
Wes
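
A minimal sketch of that round trip (assuming a pyarrow version where 
pa.ipc.read_schema() accepts a buffer; the field names are just for illustration):

import pyarrow as pa

schema = pa.schema([("id", pa.int64()), ("name", pa.string())])

# Schema.serialize() returns an Arrow Buffer; to_pybytes() copies it to raw
# Python bytes that can be stored in a protobuf `bytes` field.
raw = schema.serialize().to_pybytes()

# Read it back: wrap the raw bytes in a Buffer and parse the schema message.
restored = pa.ipc.read_schema(pa.py_buffer(raw))
assert restored.equals(schema)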

On Thu, Feb 14, 2019 at 4:24 PM Ryan White  wrote:
>
> I've found the deserialize in pa.ipc.read_schema(). From schema.serialize,
> I get a pyarrow Buffer. Do I need to write this to a BufferOutputStream to
> get raw bytes (e.g. sink.getvalue())?
> In Flight, I see this is cast to std::string
> https://github.com/apache/arrow/blob/69d595ae4c61902b3f2778e536fca6675350c88c/cpp/src/arrow/flight/internal.cc#L219
> for the protobuf.
>
> Thanks
>
> Ryan M. White, Ph. D
> www.linkedin.com/in/ryanmwhitephd
> @ryanmwhitephd
> ryanmwhite...@gmail.com
>
>
> On Thu, Feb 14, 2019 at 4:11 PM Ryan White  wrote:
>
> > Hi,
> >
> > I'm using protocol buffers to retain metadata, and I would like to store
> > the Arrow Schema in the protobuf as Arrow is doing in Flight. Looking at
> > the Flight perf.proto, I can do the same and define a bytes field in my
> > proto. From pyarrow, can I serialize/deserialize a pyarrow Schema? I've only
> > found pyarrow.Schema.serialize() in python/pyarrow/types.pxi.
> >
> > Thanks,
> >
> >


Re: pyarrow schema in protobuf

2019-02-14 Thread Ryan White
I've found the deserialize in pa.ipc.read_schema(). From schema.serialize,
I get a pyarrow Buffer. Do I need to write this to a BufferOutputStream to
get raw bytes (e.g. sink.getvalue())?
In Flight, I see this is cast to std::string
https://github.com/apache/arrow/blob/69d595ae4c61902b3f2778e536fca6675350c88c/cpp/src/arrow/flight/internal.cc#L219
for the protobuf.

Thanks

Ryan M. White, Ph. D
www.linkedin.com/in/ryanmwhitephd
@ryanmwhitephd
ryanmwhite...@gmail.com


On Thu, Feb 14, 2019 at 4:11 PM Ryan White  wrote:

> Hi,
>
> I'm using protocol buffers to retain metadata, and I would like to store
> the Arrow Schema in the protobuf as Arrow is doing in Flight. Looking at
> the Flight perf.proto, I can do the same and define a bytes field in my
> proto. From pyarrow, can I serialize/deserialize a pyarrow Schema? I've only
> found pyarrow.Schema.serialize() in python/pyarrow/types.pxi.
>
> Thanks,
>
>


[jira] [Created] (ARROW-4579) [JS] Add more interop with BigInt/BigInt64Array/BigUint64Array

2019-02-14 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4579:
--

 Summary: [JS] Add more interop with 
BigInt/BigInt64Array/BigUint64Array
 Key: ARROW-4579
 URL: https://issues.apache.org/jira/browse/ARROW-4579
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


We should use or return the new native [BigInt 
types|https://developers.google.com/web/updates/2018/05/bigint] whenever they're 
available.

* Use the native {{BigInt}} to convert/stringify i64s/u64s
* Support the {{BigInt}} type in element comparator and {{indexOf()}}
* Add zero-copy {{toBigInt64Array()}} and {{toBigUint64Array()}} methods to 
{{Int64Vector}} and {{Uint64Vector}}, respectively






[jira] [Created] (ARROW-4580) [JS] Accept Iterables in IntVector/FloatVector from() signatures

2019-02-14 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4580:
--

 Summary: [JS] Accept Iterables in IntVector/FloatVector from() 
signatures
 Key: ARROW-4580
 URL: https://issues.apache.org/jira/browse/ARROW-4580
 Project: Apache Arrow
  Issue Type: Improvement
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


Right now {{IntVector.from()}} and {{FloatVector.from()}} expect the data to 
already be in typed-array form. But if we know the desired Vector type beforehand 
(e.g. if {{Int32Vector.from()}} is called), we can accept any JS iterable of 
the values.

In order to do this, we should ensure {{Float16Vector.from()}} properly clamps 
incoming f32/f64 values to u16s, in case the source is a vanilla 64-bit JS 
float.





[jira] [Created] (ARROW-4578) [JS] Float16Vector toArray should be zero-copy

2019-02-14 Thread Paul Taylor (JIRA)
Paul Taylor created ARROW-4578:
--

 Summary: [JS] Float16Vector toArray should be zero-copy
 Key: ARROW-4578
 URL: https://issues.apache.org/jira/browse/ARROW-4578
 Project: Apache Arrow
  Issue Type: Bug
  Components: JavaScript
Affects Versions: JS-0.4.0
Reporter: Paul Taylor
Assignee: Paul Taylor
 Fix For: JS-0.4.1


The {{Float16Vector#toArray()}} implementation currently converts each half 
float into a single-precision float and returns a Float32Array. All the other 
{{toArray()}} implementations are zero-copy, and this deviation would break 
anyone expecting to pass two-byte half floats to native APIs like WebGL. We 
should instead add {{Float16Vector#toFloat32Array()}} and 
{{Float16Vector#toFloat64Array()}} convenience methods that do perform the copy.
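
For intuition, a NumPy sketch of the same view-vs-copy distinction (an analogy 
only — Arrow stores float16 values as raw uint16 words; nothing here is the JS 
implementation):

{code:python}
import numpy as np

# Half floats stored as raw uint16 words, as Arrow does for Float16 data
storage = np.array([0x3C00, 0x4000, 0x4200], dtype=np.uint16)

# Zero-copy: reinterpret the same bytes as float16 (what toArray() should keep doing)
as_f16 = storage.view(np.float16)      # [1.0, 2.0, 3.0], shares memory with `storage`

# Copying: widen to float32 (what a toFloat32Array() convenience method would do)
as_f32 = as_f16.astype(np.float32)     # new allocation

assert as_f16.base is storage
assert as_f32.base is None
{code}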





pyarrow schema in protobuf

2019-02-14 Thread Ryan White
Hi,

I'm using protocol buffers to retain metadata, and I would like to store
the Arrow Schema in the protobuf as Arrow is doing in Flight. Looking at
the Flight perf.proto, I can do the same and define a bytes field in my
proto. From pyarrow, can I serialize/deserialize a pyarrow Schema? I've only
found pyarrow.Schema.serialize() in python/pyarrow/types.pxi.

Thanks,


Re: Arrow Flight protocol/API questions

2019-02-14 Thread Antoine Pitrou


Perhaps authentication can wait until we have proper requirements?
There are many authentication schemes around.

Regards

Antoine.


On 14/02/2019 at 20:44, David Ming Li wrote:
> Back to extending the protocol, all we should need, and the simple thing 
> (IMO) to do, would be:
> 
> - Add a `bytes data_application = 3` to FlightData 
> (https://github.com/apache/arrow/blob/master/format/Flight.proto#L286)
> - Add a `bytes data_application = 1` to PutResult
> - Change `DoPut` to `rpc DoPut(stream FlightData) returns (stream PutResult) 
> {}`
> 
> The bigger question is how to change the C++/Java APIs to expose this, as 
> they kind of assume the only thing around is RecordBatches.
> 
> It does sound interesting to have different underlying transports, which 
> would then preclude ever exposing gRPC. Would the thought then be to do 
> token-based authentication in DoGet/DoPut? I suppose the Ticket in DoGet and 
> the command in DoPut could serve that purpose.
> 
> Best,
> David
> 
> On 2019/02/12 20:48:10, Antoine Pitrou  wrote: 
>>
>> Hi David,
>>
>> I think allowing to send application-specific ancillary data in addition
>> to Arrow data makes sense.
>>
>> (I'm also wondering whether the choice of gRPC is appropriate at all -
>> the current C++ hacks around "zero-copy" are not pretty and they may not
>> translate to other languages either)
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 12/02/2019 at 21:44, David Ming Li wrote:
>>> Hi all,
>>>
>>>
>>>
>>> We've been evaluating Flight for our use, and we're wondering if the 
>>> protocol is still open to extensions, as having a few application-defined 
>>> metadata fields would help our use cases a lot.
>>>
>>>
>>>
>>> (Apologies if this is a repost - was having issues with the spam filter.)
>>>
>>>
>>>
>>> Specifically, in DoGet, having a metadata binary blob in the server->client 
>>> messages would help implement resumable requests, especially as we have 
>>> non-monotonically-indexed data streams. This would also help us reuse 
>>> server-side state if we do have to resume a stream.
>>>
>>>
>>>
>>> In DoPut, we think making this call bidirectional would be useful to 
>>> support application-level ACKs, again to implement resumable uploads. The 
>>> server would thus have the opt


Re: Arrow Flight protocol/API questions

2019-02-14 Thread David Ming Li
Back to extending the protocol, all we should need, and the simple thing (IMO) 
to do, would be:

- Add a `bytes data_application = 3` to FlightData 
(https://github.com/apache/arrow/blob/master/format/Flight.proto#L286)
- Add a `bytes data_application = 1` to PutResult
- Change `DoPut` to `rpc DoPut(stream FlightData) returns (stream PutResult) {}`

The bigger question is how to change the C++/Java APIs to expose this, as they 
kind of assume the only thing around is RecordBatches.

It does sound interesting to have different underlying transports, which would 
then preclude ever exposing gRPC. Would the thought then be to do token-based 
authentication in DoGet/DoPut? I suppose the Ticket in DoGet and the command in 
DoPut could serve that purpose.

Best,
David

On 2019/02/12 20:48:10, Antoine Pitrou  wrote: 
> 
> Hi David,
> 
> I think allowing to send application-specific ancillary data in addition
> to Arrow data makes sense.
> 
> (I'm also wondering whether the choice of gRPC is appropriate at all -
> the current C++ hacks around "zero-copy" are not pretty and they may not
> translate to other languages either)
> 
> Regards
> 
> Antoine.
> 
> 
> On 12/02/2019 at 21:44, David Ming Li wrote:
> > Hi all,
> > 
> > 
> > 
> > We've been evaluating Flight for our use, and we're wondering if the 
> > protocol is still open to extensions, as having a few application-defined 
> > metadata fields would help our use cases a lot.
> > 
> > 
> > 
> > (Apologies if this is a repost - was having issues with the spam filter.)
> > 
> > 
> > 
> > Specifically, in DoGet, having a metadata binary blob in the server->client 
> > messages would help implement resumable requests, especially as we have 
> > non-monotonically-indexed data streams. This would also help us reuse 
> > server-side state if we do have to resume a stream.
> > 
> > 
> > 
> > In DoPut, we think making this call bidirectional would be useful to 
> > support application-level ACKs, again to implement resumable uploads. The 
> > server would thus have the opt


[jira] [Created] (ARROW-4577) [C++] Interface link libraries declared on arrow_shared target that are actually non-interface

2019-02-14 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-4577:
--

 Summary: [C++] Interface link libraries declared on arrow_shared 
target that are actually non-interface
 Key: ARROW-4577
 URL: https://issues.apache.org/jira/browse/ARROW-4577
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Affects Versions: 0.12.0
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.13.0


We are pulling in {{jemalloc_static}} as an interface link dependency in 
{{arrowTargets.cmake}}. But since it is statically linked inside the shared 
library, consumers don't need to link against it.





[jira] [Created] (ARROW-4576) [Python] Benchmark failures

2019-02-14 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-4576:
-

 Summary: [Python] Benchmark failures
 Key: ARROW-4576
 URL: https://issues.apache.org/jira/browse/ARROW-4576
 Project: Apache Arrow
  Issue Type: Bug
  Components: Benchmarking, Python
Reporter: Antoine Pitrou


I get the following error when running the benchmarks:
{code}
Traceback (most recent call last):
  File "/home/antoine/asv/asv/benchmark.py", line 1170, in main_run_server
    main_run(run_args)
  File "/home/antoine/asv/asv/benchmark.py", line 1038, in main_run
    skip = benchmark.do_setup()
  File "/home/antoine/asv/asv/benchmark.py", line 569, in do_setup
    result = Benchmark.do_setup(self)
  File "/home/antoine/asv/asv/benchmark.py", line 501, in do_setup
    setup(*self._current_params)
  File "/home/antoine/arrow/python/benchmarks/streaming.py", line 65, in setup
    self.source = sink.get_result()
AttributeError: 'pyarrow.lib.BufferOutputStream' object has no attribute 'get_result'

{code}
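
For reference, a minimal sketch of the pattern the benchmark presumably intends — 
this assumes the fix is simply to use {{getvalue()}}, which returns the accumulated 
buffer, rather than the non-existent {{get_result()}}:

{code:python}
import pyarrow as pa

sink = pa.BufferOutputStream()
batch = pa.RecordBatch.from_arrays([pa.array([1, 2, 3])], ["f0"])

writer = pa.RecordBatchStreamWriter(sink, batch.schema)
writer.write_batch(batch)
writer.close()

# BufferOutputStream exposes getvalue(), not get_result(); it returns the
# accumulated Buffer, which can then be re-read as a stream.
source = sink.getvalue()
reader = pa.ipc.open_stream(source)
{code}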






[jira] [Created] (ARROW-4575) Add Python Flight implementation to integration testing

2019-02-14 Thread David Li (JIRA)
David Li created ARROW-4575:
---

 Summary: Add Python Flight implementation to integration testing
 Key: ARROW-4575
 URL: https://issues.apache.org/jira/browse/ARROW-4575
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC, Integration, Python
Reporter: David Li








[jira] [Created] (ARROW-4572) [C++] Remove memory zeroing from PrimitiveAllocatingUnaryKernel

2019-02-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4572:
---

 Summary: [C++] Remove memory zeroing from 
PrimitiveAllocatingUnaryKernel
 Key: ARROW-4572
 URL: https://issues.apache.org/jira/browse/ARROW-4572
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.13.0


Follow-up work to ARROW-1896.





[jira] [Created] (ARROW-4574) [Doc] Add Flight documentation

2019-02-14 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-4574:
-

 Summary: [Doc] Add Flight documentation
 Key: ARROW-4574
 URL: https://issues.apache.org/jira/browse/ARROW-4574
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation, FlightRPC
Reporter: Antoine Pitrou


Should add documentation for the Flight RPC system. At least high-level docs. 
Perhaps per-language docs can wait until the APIs are stabilized.





[jira] [Created] (ARROW-4573) [Python] Add Flight unit tests

2019-02-14 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-4573:
-

 Summary: [Python] Add Flight unit tests
 Key: ARROW-4573
 URL: https://issues.apache.org/jira/browse/ARROW-4573
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Python
Reporter: Antoine Pitrou


Should add simple unit tests for the Python Flight bindings.





[jira] [Created] (ARROW-4571) [Format] Tensor.fbs file has multiple root_type declarations

2019-02-14 Thread Eric Erhardt (JIRA)
Eric Erhardt created ARROW-4571:
---

 Summary: [Format] Tensor.fbs file has multiple root_type 
declarations
 Key: ARROW-4571
 URL: https://issues.apache.org/jira/browse/ARROW-4571
 Project: Apache Arrow
  Issue Type: Bug
  Components: Format
Reporter: Eric Erhardt


Looking at [the flatbuffers 
doc|https://google.github.io/flatbuffers/flatbuffers_guide_tutorial.html], it 
appears there should only be one `root_type` declaration in an fbs file:
{code:java}
The last part of the schema is the root_type. The root type declares what will 
be the root table for the serialized data. In our case, the root type is our 
Monster table.{code}
However, the Tensor.fbs file has multiple `root_type` declarations:

[https://github.com/apache/arrow/blob/69d595ae4c61902b3f2778e536fca6675350c88c/format/Tensor.fbs#L53]

[https://github.com/apache/arrow/blob/69d595ae4c61902b3f2778e536fca6675350c88c/format/Tensor.fbs#L146]

 

See the discussion here: 
https://github.com/apache/arrow/pull/2546#discussion_r256549256





Re: [Rust] Rust 0.13.0 release

2019-02-14 Thread Renjie Liu
Then I'm expecting to finish it in 0.14

Wes McKinney  wrote on Wed, Feb 13, 2019 at 11:08 PM:

> > BTW, what's the time line of 0.13.0?
>
> See
> https://lists.apache.org/thread.html/7890bd7aebd2d2018fa68a78630280581a544346ce80e4002cd9e548@%3Cdev.arrow.apache.org%3E
>
> Since 0.12 was ~January 20 I think it would be good to release again
> by the end of March
>
> On Wed, Feb 13, 2019 at 7:29 AM Renjie Liu 
> wrote:
> >
> > Hi, Andy:
> >  Thanks for bringing this thread. I'm working on the arrow reader for
> > parquet and expecting to make progress recently. BTW, what's the time
> line
> > of 0.13.0?
> >
> > Chao Sun  wrote on Wed, Feb 13, 2019 at 10:34 AM:
> >
> > > I’m also interested in the Parquet/Arrow integration and may help
> there.
> > > This is however a relative large feature and I’m not sure if it can be
> done
> > > in 0.13.
> > >
> > > Another area I’d like to work in is high level Parquet writer support.
> This
> > > issue has been discussed several times in the past. People should not
> need
> > > to specify definition & repetition levels in order to write data in
> Parquet
> > > format.
> > >
> > > Chao
> > >
> > >
> > >
> > > On Wed, Feb 13, 2019 at 10:24 AM paddy horan 
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > The focus for me for 0.13.0 is SIMD.  I would like to port all the
> "ops"
> > > > in "array_ops" to the new "compute" module and leverage SIMD for them
> > > all.
> > > > I have most of this done in various forks.
> > > >
> > > > Past 0.13.0 I would really like to work toward getting Rust running
> in
> > > the
> > > > integration tests.  The thing I am most excited about regarding
> Arrow is
> > > > the concept of defining computational libraries in say Rust and being
> > > able
> > > > to use them from any implementation, pyarrow probably for me.  This
> all
> > > > starts and ends with the integration tests.
> > > >
> > > > Also, Gandiva is fascinating I would love to have robust support for
> this
> > > > in Rust (via bindings)...
> > > >
> > > > Regards,
> > > > P
> > > >
> > > >
> > > > 
> > > > From: Neville Dipale 
> > > > Sent: Tuesday, February 12, 2019 11:33 AM
> > > > To: dev@arrow.apache.org
> > > > Subject: Re: [Rust] Rust 0.13.0 release
> > > >
> > > > Thanks for bringing this up Andy.
> > > >
> > > > I'm unemployed/on recovery leave, so I've had some surplus time to
> work
> > > on
> > > > Rust.
> > > >
> > > > There's a lot of features that I've wanted to work on, some which
> I've
> > > > spent some time attempting, but struggled with. A few block
> additional
> > > work
> > > > that I could contribute.
> > > >
> > > > In 0.13.0 and the release thereafter: I'd like to see:
> > > >
> > > > Date/time support. I've spent a lot of time trying to implement this,
> > > but I
> > > > get the feeling that my Rust isn't good enough yet to pull this
> together.
> > > >
> > > > More IO support.
> > > > I'm working on JSON reader, and want to work on JSON and CSV
> (continuing
> > > > where you left off) writers after this.
> > > > With date/time support, I can also work on date/time parsing so we
> can
> > > have
> > > > these in CSV and JSON.
> > > > Parquet support isn't on my radar at the moment. JSON and CSV are
> more
> > > > commonly used, so I'm hoping that with concrete support for these,
> more
> > > > people using Rust can choose to integrate Arrow. That could bring us
> more
> > > > hands to help.
> > > >
> > > > Array slicing (https://issues.apache.org/jira/browse/ARROW-3954). I
> > > tried
> > > > working on it but failed. Related to this would be array chunking.
> > > > I need these in order to be able to operate on "Tables" like CPP,
> Python
> > > > and others. I've got ChunkedArray, Column and Table roughly
> implemented
> > > in
> > > > my fork, but without zero-copy slicing, I can't upstream them.
> > > >
> > > > I've made good progress on scalar and array operations. I have trig
> > > > functions, some string operators and other functions that one can
> run on
> > > a
> > > > Spark-esque dataframe.
> > > > These will fit in well with DataFusion's SQL operations, but from a
> > > > decision-perspective, I think it would help if we join heads and
> think
> > > > about the direction we want to take on compute.
> > > >
> > > > SIMD is great, and when Paddy's hashed out how it works, more of us
> will
> > > be
> > > > able to contribute SIMD compatible compute operators.
> > > >
> > > > Thanks,
> > > > Neville
> > > >
> > > > On Tue, 12 Feb 2019 at 18:12, Andy Grove 
> wrote:
> > > >
> > > > > I was curious what our Rust committers and contributors are excited
> > > about
> > > > > for 0.13.0.
> > > > >
> > > > > The feature I would most like to see is that ability for
> DataFusion to
> > > > run
> > > > > SQL against Parquet files again, as that would give me an excuse
> for a
> > > > PoC
> > > > > in my day job using Arrow.
> > > > >
> > > > > I know there were some efforts underway to build arrow array
> readers
> > > for
> > > > > Parquet an

[jira] [Created] (ARROW-4570) [Gandiva] Add overflow checks for decimals

2019-02-14 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-4570:
-

 Summary: [Gandiva] Add overflow checks for decimals
 Key: ARROW-4570
 URL: https://issues.apache.org/jira/browse/ARROW-4570
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra


For decimals, overflows can occur in two places:
 # The input array can have values that are out of bounds (e.g. > 38 digits).
 # An operation can overflow, e.g. adding two decimals of (38, 6) can overflow if 
the input numbers are very large.

In both cases, just verifying whether an overflow occurred adds a performance 
overhead, so the check should be controlled by a configuration variable.





[jira] [Created] (ARROW-4569) [Gandiva] validate that the precision/scale are within bounds

2019-02-14 Thread Pindikura Ravindra (JIRA)
Pindikura Ravindra created ARROW-4569:
-

 Summary: [Gandiva] validate that the precision/scale are within 
bounds
 Key: ARROW-4569
 URL: https://issues.apache.org/jira/browse/ARROW-4569
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Pindikura Ravindra
Assignee: Pindikura Ravindra





