[jira] [Created] (ARROW-2153) decimal conversion not working for exponential notation

2018-02-13 Thread Antony Mayi (JIRA)
Antony Mayi created ARROW-2153:
--

 Summary: decimal conversion not working for exponential notation
 Key: ARROW-2153
 URL: https://issues.apache.org/jira/browse/ARROW-2153
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Antony Mayi


{code:python}
import pyarrow as pa
import pandas as pd
import decimal

pa.Table.from_pandas(pd.DataFrame({'a': [decimal.Decimal('1.1'), 
decimal.Decimal('2E+1')]}))
{code}
 
{code}
Traceback (most recent call last):
  File "", line 1, in 
  File "pyarrow/table.pxi", line 875, in pyarrow.lib.Table.from_pandas 
(/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:44927)
  File 
"/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
line 350, in dataframe_to_arrays
convert_types)]
  File 
"/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
line 349, in 
for c, t in zip(columns_to_convert,
  File 
"/home/skadlec/.local/lib/python3.6/site-packages/pyarrow/pandas_compat.py", 
line 345, in convert_column
return pa.array(col, from_pandas=True, type=ty)
  File "pyarrow/array.pxi", line 170, in pyarrow.lib.array 
(/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:29224)
  File "pyarrow/array.pxi", line 70, in pyarrow.lib._ndarray_to_array 
(/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:28465)
  File "pyarrow/error.pxi", line 77, in pyarrow.lib.check_status 
(/arrow/python/build/temp.linux-x86_64-3.6/lib.cxx:8270)
pyarrow.lib.ArrowInvalid: Expected base ten digit or decimal point but found 
'E' instead.
{code}
In manual cases we can clearly write {{decimal.Decimal('20')}} instead of 
{{decimal.Decimal('2E+1')}}, but during arithmetic operations inside an 
application the exponential notation can be produced outside our control (it 
is actually the _normalized_ form of the decimal number). Moreover, for some 
values the exponential notation is the only form that expresses the 
significance, so it should be accepted.
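
For illustration, a small standard-library sketch of how the exponent form shows up without anyone writing it:

```python
from decimal import Decimal

# normalize() yields the canonical form, which for 20 is exponential:
d = Decimal('20').normalize()
assert str(d) == '2E+1'

# Plain arithmetic produces it too; decimal multiplication adds exponents
# without rescaling the coefficient:
e = Decimal('2') * Decimal('1E+1')
assert str(e) == '2E+1'
```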

The [documentation|https://docs.python.org/3/library/decimal.html] suggests 
the following transformation, but that is only possible when the significance 
information doesn't need to be kept:
{code:python}
from decimal import Decimal

def remove_exponent(d):
    return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()
{code}
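
Applying that helper (a quick check; the second case shows the significance loss the text above warns about):

```python
from decimal import Decimal

def remove_exponent(d):
    return d.quantize(Decimal(1)) if d == d.to_integral() else d.normalize()

# Integral values come back without the exponent:
assert str(remove_exponent(Decimal('2E+1'))) == '20'
# Trailing zeros (significance) are dropped for non-integral values:
assert str(remove_exponent(Decimal('1.100'))) == '1.1'
```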



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Merge multiple record batches

2018-02-13 Thread Rares Vernica
Hi,

If I have multiple RecordBatchStreamReader inputs, what is the recommended
way to get all the RecordBatch from all the inputs together, maybe in a
Table? They all have the same schema. The source for the readers are
different files.

So, I do something like:

reader1 = pa.open_stream('foo')
table1 = reader1.read_all()

reader2 = pa.open_stream('bar')
table2 = reader2.read_all()

# table_all = ???
# OR maybe I don't need to create table1 and table2
# table_all = pa.Table.from_batches( ??? )

Thanks!
Rares


Decimal NaNs

2018-02-13 Thread Phillip Cloud
Recently someone opened ARROW-2145 asking for support for non-finite values,
such as NaN and infinity.
It may seem like a “no-brainer” to implement this, but there’s no real
consistency in how to implement it, or *whether to implement it at all*:

   - Java BigDecimal: raises an exception for NaN or Inf, as per the docs
   - Boost.Multiprecision supports it, but not for fixed-precision decimal
   numbers (cpp_bin_float/cpp_dec_float are arbitrary-precision floating
   point, not fixed point)
   - Python supports it using flags and special string exponents (and it
   supports both signaling and quiet NaNs)
   - Impala doesn’t support it (returns NULL when you try to perform
   CAST(CAST('NaN' AS DOUBLE) AS DECIMAL))
   - Postgres supports it with its numeric type, by using the sign member of
   the C struct backing numeric values
   - MySQL: doesn’t even support NaN/Inf!
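
Python's flag-and-special-exponent representation mentioned above can be seen directly (a standard-library sketch):

```python
import decimal

q = decimal.Decimal('NaN')        # quiet NaN
s = decimal.Decimal('sNaN')       # signaling NaN
inf = decimal.Decimal('Infinity')

assert q.is_qnan() and s.is_snan() and inf.is_infinite()
# The special values are encoded via the exponent field of the value tuple:
# 'n' for quiet NaN, 'N' for signaling NaN, 'F' for infinity.
assert q.as_tuple().exponent == 'n'
assert s.as_tuple().exponent == 'N'
assert inf.as_tuple().exponent == 'F'
```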

The lack of support for these values across languages likely stems from the
fact that fixed-precision arithmetic by definition must happen on finite
values; NaN/Inf are not finite, so they are not supported.

We could go down this rabbit hole in the name of providing support for
Python's decimal.Decimal, but I’m not sure how useful it is.

No other system except in-memory C++ arrow arrays would be able to operate
on these values (I suppose we could add a wrapper around BigDecimal that
has the desired behavior).

For example, writing arrow arrays containing Decimal128 values (with nans
or infs) to a parquet file seems untenable.

Additionally, if we decided to implement it, we’d likely have to take
something like the flag approach, which would require a change to the
metadata (not necessarily a bad thing) adding two bitmaps to Arrow
Decimal arrays: one indicating NaN-ness and one indicating Inf-ness
(that’s a ton of overhead IMO, given that most values are likely always
finite).

I’m skeptical about whether we should support this.

Thoughts?


RE: JDBC Adapter for Apache-Arrow

2018-02-13 Thread Atul Dambalkar
Hi Uwe,

Sorry for late response on this thread. We have started some discussions 
internally. I wanted to know what help you would need specifically on the JDBC 
Adapter front, we would be happy to collaborate. At this time, we were mainly 
trying to model it around the C++ work that has gone in. Are there any 
particular use-cases/requirements you have in mind?

-Atul

-Original Message-
From: Jacques Nadeau [mailto:jacq...@apache.org] 
Sent: Tuesday, January 09, 2018 7:41 PM
To: dev@arrow.apache.org
Subject: Re: JDBC Adapter for Apache-Arrow

We have some stuff in Dremio that we've planned on open sourcing but haven't 
yet done so. We should try to get that out for others to consume.

On Jan 7, 2018 11:49 AM, "Uwe L. Korn"  wrote:

> Has anyone made progress on the JDBC adapter yet?
>
> I recently came across a lot of projects with good JDBC drivers but not 
> so good drivers in Python. Having an Arrow-JDBC adaptor would make 
> these query engines much more useful to the Python community. Being an 
> Arrow committer and one of the turbodbc authors, I have quite some 
> knowledge in this area, but my Java is a bit rusty and I have never 
> dealt with JDBC, so I'm looking for someone to collaborate on this feature.
>
> Also this might be my ultimate chance to also get contributing to the 
> Java part of Apache Arrow.
>
> Uwe
>
> > Am 07.11.2017 um 20:01 schrieb Julian Hyde :
> >
> > I have logged https://issues.apache.org/jira/browse/CALCITE-2040 (I 
> > logged it within Calcite because it makes more sense as an Arrow 
> > adapter within Calcite than as a Calcite adapter within Arrow).
> >
> > Note the last paragraph about
> > https://issues.apache.org/jira/browse/CALCITE-2025 and 
> > bioinformatics file formats. Readers for these formats would be 
> > useful extensions to Arrow regardless of whether the data was 
> > ultimately going to be queried using SQL. (Contributions welcome!) 
> > Calcite's bio adapter would build upon the Arrow readers in two 
> > respects:  (1) to read metadata from these files (e.g. are there any 
> > extra fields?) and (2) to push down processing (filters, projects) into the 
> > reader.
> >
> > Julian
> >
> >
> > On Tue, Nov 7, 2017 at 10:21 AM, Atul Dambalkar 
> >  wrote:
> >> Hi,
> >>
> >> Don't mean to interrupt the current discussion threads. But, based 
> >> on
> the discussions so far on the JDBC Adapter piece, are we in a position 
> to create a JIRA ticket for this as well as the other piece about 
> adding a direct Arrow objects creation support from JDBC drivers? If 
> yes, I can certainly go ahead and create JIRA for JDBC Adapter work.
> >>
> >> Julian, would you like to create the JIRA for the other item that you
> proposed?
> >>
> >> -Atul
> >>
> >> -Original Message-
> >> From: Atul Dambalkar
> >> Sent: Thursday, November 02, 2017 2:59 PM
> >> To: dev@arrow.apache.org
> >> Subject: RE: JDBC Adapter for Apache-Arrow
> >>
> >> I also like the approach of adding an interface and making it part 
> >> of
> Arrow, so any specific JDBC driver can implement that interface to 
> directly expose Arrow objects without having to create JDBC objects in 
> the first place. One such implementation could be for Avatica itself, 
> which Julian was suggesting earlier.
> >>
> >> -Original Message-
> >> From: Julian Hyde [mailto:jh...@apache.org]
> >> Sent: Tuesday, October 31, 2017 4:28 PM
> >> To: dev@arrow.apache.org
> >> Subject: Re: JDBC Adapter for Apache-Arrow
> >>
> >> Yeah, I agree, it should be an interface defined as part of Arrow. 
> >> Not
> driver-specific.
> >>
> >>> On Oct 31, 2017, at 1:37 PM, Laurent Goujon 
> wrote:
> >>>
> >>> I really like Julian's idea of unwrapping Arrow objects out of the 
> >>> JDBC ResultSet, but I wonder if the unwrap class has to be 
> >>> specific to the driver and if an interface can be designed to be 
> >>> used by multiple
> drivers:
> >>> for drivers based on Arrow, it means you could totally skip the 
> >>> serialization/deserialization from/to JDBC records.
> >>> If such an interface exists, I would propose to add it to the 
> >>> Arrow project, with Arrow product/projects in charge of adding 
> >>> support for it in their own JDBC driver.
> >>>
> >>> Laurent
> >>>
> >>> On Tue, Oct 31, 2017 at 1:18 PM, Atul Dambalkar 
> >>> 
> >>> wrote:
> >>>
>  Thanks for your thoughts Julian. I think, adding support for 
>  Arrow objects for Avatica Remote Driver (AvaticaToArrowConverter) 
>  can be certainly taken up as another activity. And you are right, 
>  we will have to look at specific JDBC driver to really optimize 
>  it
> individually.
> 
>  I would be curious if there are any further inputs/comments from 
>  other Dev folks, on the JDBC adapter aspect.
> 
>  -Atul
> 
>  -Original Message-
>  From: Julian Hyde 

Add a UUID type to the Arrow format

2018-02-13 Thread Uwe L. Korn
Hello,

I just opened https://issues.apache.org/jira/browse/ARROW-2152 to start the 
discussion about adding a UUID type to the Arrow format specification. In 
essence a UUID is simply a 128-bit value, but there are often special classes 
used for it, e.g. java.util.UUID in Java and uuid.UUID in Python. These provide 
special functions, and the knowledge that a column is a UUID can also be 
beneficial during computations. Other data systems like Postgres and Parquet 
also have a special UUID type.

While there is only a small difference from a 128-bit fixed-size binary array, 
I think providing the respective object model accessor is already a good benefit.

Uwe


[jira] [Created] (ARROW-2152) [Format] UUID type

2018-02-13 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-2152:
--

 Summary: [Format] UUID type
 Key: ARROW-2152
 URL: https://issues.apache.org/jira/browse/ARROW-2152
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Uwe L. Korn


Add a UUID type that is physically represented as a 128-bit fixed-size binary. 
The addition of the type should add the annotation that a value in this column 
is a UUID. The important benefit would be that in the native language object 
models we could return the respective UUID class, e.g. {{java.util.UUID}} in 
Java and {{uuid.UUID}} in Python. For reference, Postgres and Parquet both have 
a UUID type, as do other data systems with this small specialization.





Re: Arrow for MATLAB?

2018-02-13 Thread Phillip Cloud
The MathWorks is in the process of starting to contribute. I spoke with
them a couple weeks ago about this and they were excited about it. I can
ping them to see if they are still interested.

On Tue, Feb 13, 2018, 09:24 Uwe L. Korn  wrote:

> Hello Joris,
>
> this is only due to the lack of someone doing it, and probably the lack of
> people with the experience to do it. I had a short look at Matlab's C++ API
> https://de.mathworks.com/help/matlab/matlab-data-array.html and the
> interfaces seem promising enough that once someone attempts it, it should
> not be hard to build.
>
> If you want to take a shot at it, we are happy to help if there are
> problems with the Arrow side of things.
>
> Uwe
>
> On Tue, Feb 13, 2018, at 2:41 PM, Joris Peeters wrote:
> > Hello,
> >
> > Is anyone aware of plans (or concrete projects) to add MATLAB bindings
> for
> > Arrow? I'm interested in exchanging data between Java, Python, ..., and
> > MATLAB - and Arrow sounds like a great solution.
> >
> > I couldn't find any pre-existing effort, though, so curious if that is
> due
> > to a lack of interest or because there might be underlying reasons that
> > would make this very hard to achieve.
> >
> > Best,
> > -Joris.
>


Re: Arrow for MATLAB?

2018-02-13 Thread Uwe L. Korn
Hello Joris,

this is only due to the lack of someone doing it, and probably the lack of 
people with the experience to do it. I had a short look at Matlab's C++ API 
https://de.mathworks.com/help/matlab/matlab-data-array.html and the interfaces 
seem promising enough that once someone attempts it, it should not be hard to 
build.

If you want to take a shot at it, we are happy to help if there are problems 
with the Arrow side of things.

Uwe

On Tue, Feb 13, 2018, at 2:41 PM, Joris Peeters wrote:
> Hello,
> 
> Is anyone aware of plans (or concrete projects) to add MATLAB bindings for
> Arrow? I'm interested in exchanging data between Java, Python, ..., and
> MATLAB - and Arrow sounds like a great solution.
> 
> I couldn't find any pre-existing effort, though, so curious if that is due
> to a lack of interest or because there might be underlying reasons that
> would make this very hard to achieve.
> 
> Best,
> -Joris.


[jira] [Created] (ARROW-2151) [Python] Error when converting from list of uint64 arrays

2018-02-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2151:
-

 Summary: [Python] Error when converting from list of uint64 arrays
 Key: ARROW-2151
 URL: https://issues.apache.org/jira/browse/ARROW-2151
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou


{code:python}
>>> pa.array(np.uint64([0,1,2]), type=pa.uint64())

[
  0,
  1,
  2
]
>>> pa.array([np.uint64([0,1,2])], type=pa.list_(pa.uint64()))
Traceback (most recent call last):
  File "", line 1, in 
pa.array([np.uint64([0,1,2])], type=pa.list_(pa.uint64()))
  File "array.pxi", line 181, in pyarrow.lib.array
  File "array.pxi", line 36, in pyarrow.lib._sequence_to_array
  File "error.pxi", line 98, in pyarrow.lib.check_status
ArrowException: Unknown error: 
/home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:979 code: 
AppendPySequence(seq, size, real_type, builder.get())
/home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:402 code: 
static_cast(this)->AppendSingle(ref.obj())
/home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:402 code: 
static_cast(this)->AppendSingle(ref.obj())
/home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:542 code: 
CheckPyError()
an integer is required
{code}





[jira] [Created] (ARROW-2150) [Python] array equality defaults to identity

2018-02-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2150:
-

 Summary: [Python] array equality defaults to identity
 Key: ARROW-2150
 URL: https://issues.apache.org/jira/browse/ARROW-2150
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou


I'm not sure this is deliberate, but it doesn't look very desirable to me:
{code}
>>> pa.array([1,2,3], type=pa.int32()) == pa.array([1,2,3], type=pa.int32())
False
{code}





[jira] [Created] (ARROW-2149) [Python] reorganize test_convert_pandas.py

2018-02-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2149:
-

 Summary: [Python] reorganize test_convert_pandas.py
 Key: ARROW-2149
 URL: https://issues.apache.org/jira/browse/ARROW-2149
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou
Assignee: Antoine Pitrou


{{test_convert_pandas.py}} is getting painful to navigate through. We should 
reorganize the tests into various classes / categories.





[jira] [Created] (ARROW-2148) [Python] to_pandas() on struct array returns object array

2018-02-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2148:
-

 Summary: [Python] to_pandas() on struct array returns object array
 Key: ARROW-2148
 URL: https://issues.apache.org/jira/browse/ARROW-2148
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Antoine Pitrou


This should probably return a Numpy struct array instead:

{code:python}
>>> arr = pa.array([{'a': 1, 'b': 2.5}, {'a': 2, 'b': 3.5}], 
>>> type=pa.struct([pa.field('a', pa.int32()), pa.field('b', pa.float64())]))
>>> arr.type
StructType(struct)
>>> arr.to_pandas()
array([{'a': 1, 'b': 2.5}, {'a': 2, 'b': 3.5}], dtype=object)
{code}
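
For comparison, the NumPy structured array the conversion could plausibly produce would look like this (a sketch of the desired output, not current behavior):

```python
import numpy as np

# struct<a: int32, b: double> expressed as a NumPy structured dtype.
expected = np.array([(1, 2.5), (2, 3.5)],
                    dtype=[('a', '<i4'), ('b', '<f8')])
assert expected['a'].tolist() == [1, 2]
assert expected['b'].tolist() == [2.5, 3.5]
```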





[jira] [Created] (ARROW-2147) [Python] Type inference doesn't work on lists of Numpy arrays

2018-02-13 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-2147:
-

 Summary: [Python] Type inference doesn't work on lists of Numpy 
arrays
 Key: ARROW-2147
 URL: https://issues.apache.org/jira/browse/ARROW-2147
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Affects Versions: 0.8.0
Reporter: Antoine Pitrou


{code:python}
>>> arr = np.int16([2, 3, 4])
>>> pa.array(arr)

[
  2,
  3,
  4
]
>>> pa.array([arr])
Traceback (most recent call last):
  File "", line 1, in 
    pa.array([arr])
  File "array.pxi", line 181, in pyarrow.lib.array
  File "array.pxi", line 26, in pyarrow.lib._sequence_to_array
  File "error.pxi", line 77, in pyarrow.lib.check_status
ArrowInvalid: /home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:964 
code: InferArrowType(seq, _type)
/home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:321 code: 
seq_visitor.Visit(obj)
/home/antoine/arrow/cpp/src/arrow/python/builtin_convert.cc:195 code: 
VisitElem(ref, level)
Error inferring Arrow data type for collection of Python objects. Got Python 
object of type ndarray but can only handle these types: bool, float, integer, 
date, datetime, bytes, unicode
{code}





[jira] [Created] (ARROW-2146) [GLib] Implement Slice for ChunkedArray

2018-02-13 Thread yosuke shiro (JIRA)
yosuke shiro created ARROW-2146:
---

 Summary: [GLib] Implement Slice for ChunkedArray
 Key: ARROW-2146
 URL: https://issues.apache.org/jira/browse/ARROW-2146
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: yosuke shiro


Add a {{Slice}} API for ChunkedArray.


