[jira] [Created] (ARROW-6320) [C++] Arrow utilities are linked statically

2019-08-22 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-6320:
-

 Summary: [C++] Arrow utilities are linked statically
 Key: ARROW-6320
 URL: https://issues.apache.org/jira/browse/ARROW-6320
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Developer Tools
Reporter: Antoine Pitrou


Even though other executables are linked dynamically with {{libarrow}} and 
friends, the Arrow utilities are linked statically on Linux:

{code}
$ ldd build-test/debug/arrow-stream-to-file 
linux-vdso.so.1 (0x7ffe353a8000)
libboost_filesystem.so.1.67.0 => 
/home/antoine/miniconda3/envs/pyarrow/lib/libboost_filesystem.so.1.67.0 
(0x7f7baf7a1000)
libboost_system.so.1.67.0 => 
/home/antoine/miniconda3/envs/pyarrow/lib/libboost_system.so.1.67.0 
(0x7f7baf59c000)
libstdc++.so.6 => 
/home/antoine/miniconda3/envs/pyarrow/lib/libstdc++.so.6 (0x7f7bb0522000)
libgcc_s.so.1 => 
/home/antoine/miniconda3/envs/pyarrow/lib/libgcc_s.so.1 (0x7f7bb050e000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 
(0x7f7baf37d000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x7f7baef8c000)
/lib64/ld-linux-x86-64.so.2 (0x7f7bb0471000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x7f7baed84000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x7f7bae9e6000)
{code}
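
For comparison, a dynamically linked utility would be expected to list 
{{libarrow}} among its dependencies, along the lines of (path and soversion 
illustrative only):

{code}
$ ldd build-test/debug/arrow-stream-to-file
libarrow.so.14 => /home/antoine/arrow/cpp/build-test/debug/libarrow.so.14 (0x...)
...
{code}

The absence of any {{libarrow.so}} line in the output above is what shows the 
utilities are being linked statically.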




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6321) [Python] Ability to create ExtensionBlock on conversion to pandas

2019-08-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6321:


 Summary: [Python] Ability to create ExtensionBlock on conversion 
to pandas
 Key: ARROW-6321
 URL: https://issues.apache.org/jira/browse/ARROW-6321
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Joris Van den Bossche


To be able to create a pandas DataFrame in {{to_pandas()}} that holds 
ExtensionArrays (e.g. towards ARROW-2428 to register a conversion), we first 
need to add to the {{table_to_blockmanager}} / {{ConvertTableToPandas}} 
conversion utilities the ability to create a pandas {{ExtensionBlock}} that 
can hold a pandas {{ExtensionArray}}.
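
As a rough sketch of what the conversion layer needs to produce on the pandas 
side (this pokes at pandas internals directly, not the eventual pyarrow API; 
names are subject to change between pandas versions):

{code:python}
import pandas as pd
from pandas.core.internals import BlockManager, make_block

# An ExtensionArray for the block to hold (nullable integers, pandas >= 0.24).
arr = pd.array([1, 2, None], dtype="Int64")

# make_block wraps an ExtensionArray in an ExtensionBlock automatically.
block = make_block(arr, placement=[0])

# Assemble a one-column DataFrame from the block.
axes = [pd.Index(["a"]), pd.RangeIndex(3)]
df = pd.DataFrame(BlockManager([block], axes))
{code}

{{table_to_blockmanager}} / {{ConvertTableToPandas}} would need to do the 
equivalent of the above for each column that should become an 
{{ExtensionBlock}}.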



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


In-memory sorting of plasma objects

2019-08-22 Thread Tanveer Ahmad - EWI
Hi,


I need some help regarding data exchange between Arrow-based Plasma 
shared-memory objects on cluster nodes.


I have two Plasma shared-memory objects, each containing a RecordBatch, on 
different nodes of a cluster.

I want to use pandas DataFrames or something like that (Dask) on a single node 
to sort them together. Is there any way to access these Plasma objects on a 
single node and sort them in-memory?

Thanks.


Regards,
Tanveer Ahmad


Re: In-memory sorting of plasma objects

2019-08-22 Thread Wes McKinney
hi Tanveer,

IIUC there is logic for moving data that's managed by Plasma servers
between nodes in the Ray project (https://github.com/ray-project/ray)
-- if you need to move the bytes from one node to another, you need to
use some kind of messaging / RPC tool. The Ray developers might have
some advice -- I think their implementation is specific to Ray's
internals, which is why we don't have this implemented (yet) natively
in Apache Arrow.

- Wes

On Thu, Aug 22, 2019 at 8:34 AM Tanveer Ahmad - EWI  wrote:
>
> Hi,
>
>
> I need some help regarding data exchange between Arrow-based Plasma 
> shared-memory objects on cluster nodes.
>
>
> I have two Plasma shared-memory objects, each containing a RecordBatch, on 
> different nodes of a cluster.
>
> I want to use pandas DataFrames or something like that (Dask) on a single 
> node to sort them together. Is there any way to access these Plasma objects 
> on a single node and sort them in-memory?
>
> Thanks.
>
>
> Regards,
> Tanveer Ahmad
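
For concreteness, once the bytes for both batches have been shipped to a 
single node's store (as Wes notes, Plasma itself has no cross-node transfer), 
the single-node sort could look like the sketch below. The socket path, object 
IDs and column name are hypothetical, and it assumes each object was written 
as an IPC stream containing one RecordBatch:

{code:python}
import pyarrow as pa
import pyarrow.plasma as plasma

# Connect to the local plasma store (socket path is illustrative).
client = plasma.connect("/tmp/plasma")

# Hypothetical 20-byte IDs of the two RecordBatch objects.
object_ids = [plasma.ObjectID(20 * b"a"), plasma.ObjectID(20 * b"b")]
buffers = client.get_buffers(object_ids)

# Reconstruct the batches, assuming each buffer holds an IPC stream.
batches = [pa.ipc.open_stream(buf).read_next_batch() for buf in buffers]

# Combine into one table and sort in pandas on a single node.
df = pa.Table.from_batches(batches).to_pandas()
df_sorted = df.sort_values("key")  # "key" is a hypothetical column name
{code}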


[jira] [Created] (ARROW-6322) [C#] Implement a plasma client

2019-08-22 Thread Eric Erhardt (Jira)
Eric Erhardt created ARROW-6322:
---

 Summary: [C#] Implement a plasma client
 Key: ARROW-6322
 URL: https://issues.apache.org/jira/browse/ARROW-6322
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C#
Reporter: Eric Erhardt
Assignee: Eric Erhardt


We should create a C# plasma client, so .NET code can get and put objects into 
the plasma store.

An easy-ish way of implementing this would be to build on the c_glib C APIs 
already exposed for the plasma client. Unfortunately, I haven't found a decent 
C# GObject generator, so I think the C bindings will need to be written by 
hand, but there aren't too many of them.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6323) [R] Expand file paths when passing to readers

2019-08-22 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6323:
--

 Summary: [R] Expand file paths when passing to readers
 Key: ARROW-6323
 URL: https://issues.apache.org/jira/browse/ARROW-6323
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 0.15.0


All file paths in R are wrapped in {{fs::path_abs()}}, which handles relative 
paths, but it doesn't expand {{~}}, so this fails:
{code:java}
> df <- read_parquet("~/Downloads/demofile.parquet")
 Error in io___MemoryMappedFile__Open(fs::path_abs(path), mode) :
   IOError: Failed to open local file '~/Downloads/demofile.parquet', error: No 
such file or directory
{code}
This is fixed by using {{fs::path_real()}} instead.

Should this be properly handled in C++ though? cc [~pitrou]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6324) [C++] File system API should expand paths

2019-08-22 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-6324:
--

 Summary: [C++] File system API should expand paths
 Key: ARROW-6324
 URL: https://issues.apache.org/jira/browse/ARROW-6324
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Neal Richardson


See ARROW-6323



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6325) [Python] wrong conversion of DataFrame with boolean values

2019-08-22 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-6325:


 Summary: [Python] wrong conversion of DataFrame with boolean values
 Key: ARROW-6325
 URL: https://issues.apache.org/jira/browse/ARROW-6325
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.14.1
Reporter: Joris Van den Bossche
 Fix For: 0.15.0


From https://github.com/pandas-dev/pandas/issues/28090

{code}
In [19]: df = pd.DataFrame(np.ones((5, 2), dtype=bool), columns=['a', 'b']) 

In [20]: df  
Out[20]: 
  a b
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True

In [21]: table = pa.table(df) 

In [23]: table.column(0)
Out[23]: 

[
  [
true,
false,
false,
false,
false
  ]
]
{code}

The resulting table has False values while the original DataFrame had only True 
values. 
It seems this has to do with the fact that there are multiple columns, as with a 
single column it converts correctly.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6326) [C++] Nullable fields when converting std::tuple to Table

2019-08-22 Thread Omer Ozarslan (Jira)
Omer Ozarslan created ARROW-6326:


 Summary: [C++] Nullable fields when converting std::tuple to Table
 Key: ARROW-6326
 URL: https://issues.apache.org/jira/browse/ARROW-6326
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Omer Ozarslan


{{std::optional}} isn't used for representing nullable fields in Arrow's 
current STL conversion API since it requires C++17. Also, there are other ways 
to represent an optional field other than {{std::optional}}, such as using 
pointers or external implementations of optional ({{boost::optional}}, 
{{type_safe::optional}} and the like). 

Since it is hard to maintain so many different kinds of specializations, 
introducing an {{Optional}} concept covering these classes could solve this 
issue and allow implementing nullable fields consistently.

So, the gist of the proposed change will be something along the lines of:

{code:cpp}
template <typename T>
constexpr bool is_optional_like_v = ...;

template <typename T>
struct CTypeTraits<T, std::enable_if_t<is_optional_like_v<T>>> {
   //...
};

template <typename T>
struct ConversionTraits<T, std::enable_if_t<is_optional_like_v<T>>>
    : public CTypeTraits<typename T::value_type> {
   //...
};
{code}

For a type {{T}} to be considered an {{Optional}}:
1) It should be convertible (implicitly or explicitly) to {{bool}}, i.e. it 
implements {{[explicit] operator bool()}},
2) It should be dereferenceable, i.e. it implements {{operator*()}}.

These two requirements provide a generalized way of templating nullable fields 
based on pointers, {{std::optional}}, {{boost::optional}} etc. However, it 
would be better (necessary?) if this implementation should act as a default 
while not breaking existing specializations of users (e.g. an existing  
implementation in which {{std::optional}} is specialized by user).

Are there any issues this approach may cause that I may have missed?

I will open a draft PR to work on this in the meantime.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6327) [Python] Conversion of pandas.SparseArray columns in pandas.DataFrames to pyarrow.Table and back

2019-08-22 Thread Rok Mihevc (Jira)
Rok Mihevc created ARROW-6327:
-

 Summary: [Python] Conversion of pandas.SparseArray columns in 
pandas.DataFrames to pyarrow.Table and back
 Key: ARROW-6327
 URL: https://issues.apache.org/jira/browse/ARROW-6327
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Rok Mihevc


We would like to convert sparse columns from Pandas to Arrow:

{code:python}
import numpy as np
import pandas
import pyarrow

arr = pandas.Series([1, 2, 3])
sparr = pandas.SparseArray(np.array([1, 0, 0], dtype='int64'))
df = pandas.DataFrame({'sparr': sparr, 'arr': arr})

table = pyarrow.table(df)
df == table.to_pandas()
{code}

I assume `pandas.SparseArray` is a 1D sparse COO Tensor that would map to 
`pyarrow.SparseTensorCOO`.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6328) Click.option-s should have help text

2019-08-22 Thread Ulzii O (Jira)
Ulzii O created ARROW-6328:
--

 Summary: Click.option-s should have help text
 Key: ARROW-6328
 URL: https://issues.apache.org/jira/browse/ARROW-6328
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Ulzii O


`click.option`-s should have `help` text.

## What?
Add `help` text to each `click.option`.

## Why?
A `click.option` should ideally have `help` text defined to be useful.
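
For illustration, the requested change on a generic option (the option name 
and help string are hypothetical):

{code:python}
import click

@click.command()
@click.option("--iterations", default=1,
              help="Number of benchmark iterations to run.")
def run(iterations):
    """Run the benchmark."""
    click.echo("Running %d iteration(s)" % iterations)
{code}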



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

2019-08-22 Thread Wes McKinney
The vote carries with 4 binding +1 votes and 1 non-binding +1

I'll merge the specification patch later today and we can begin
working on implementations so we can get this done for 0.15.0

On Tue, Aug 20, 2019 at 12:30 PM Bryan Cutler  wrote:
>
> +1 (non-binding)
>
> On Tue, Aug 20, 2019, 7:43 AM Antoine Pitrou  wrote:
>
> >
> > Sorry, had forgotten to send my vote on this.
> >
> > +1 from me.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Wed, 14 Aug 2019 17:42:33 -0500
> > Wes McKinney  wrote:
> > > hi all,
> > >
> > > As we've been discussing [1], there is a need to introduce 4 bytes of
> > > padding into the preamble of the "encapsulated IPC message" format to
> > > ensure that the Flatbuffers metadata payload begins on an 8-byte
> > > aligned memory offset. The alternative to this would be for Arrow
> > > implementations where alignment is important (e.g. C or C++) to copy
> > > the metadata (which is not always small) into memory when it is
> > > unaligned.
> > >
> > > Micah has proposed to address this by adding a
> > > 4-byte "continuation" value at the beginning of the payload
> > > having the value 0xFFFFFFFF. The reason to do it this way is that
> > > old clients will see an invalid length (what is currently the
> > > first 4 bytes of the message -- a 32-bit little endian signed
> > > integer indicating the metadata length) rather than potentially
> > > crashing on a valid length. We also propose to expand the "end of
> > > stream" marker used in the stream and file format from 4 to 8
> > > bytes. This has the additional effect of aligning the file footer
> > > defined in File.fbs.
> > >
> > > This would be a backwards incompatible protocol change, so older Arrow
> > > libraries would not be able to read these new messages. Maintaining
> > > forward compatibility (reading data produced by older libraries) would
> > > be possible as we can reason that a value other than the continuation
> > > value was produced by an older library (and then validate the
> > > Flatbuffer message of course). Arrow implementations could offer a
> > > backward compatibility mode for the sake of old readers if they desire
> > > (this may also assist with testing).
> > >
> > > Additionally with this vote, we want to formally approve the change to
> > > the Arrow "file" format to always write the (new 8-byte) end-of-stream
> > > marker, which enables code that processes Arrow streams to safely read
> > > the file's internal messages as though they were a normal stream.
> > >
> > > The PR making these changes to the IPC documentation is here
> > >
> > > https://github.com/apache/arrow/pull/4951
> > >
> > > Please vote to accept these changes. This vote will be open for at
> > > least 72 hours
> > >
> > > [ ] +1 Adopt these Arrow protocol changes
> > > [ ] +0
> > > [ ] -1 I disagree because...
> > >
> > > Here is my vote: +1
> > >
> > > Thanks,
> > > Wes
> > >
> > > [1]:
> > https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
> > >
> >
> >
> >
> >
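
To make the adopted layout concrete, here is a byte-level sketch (in Python, 
as an illustration only -- the normative definition is the spec patch linked 
above; this omits the 8-byte padding of the Flatbuffer body):

{code:python}
import struct

CONTINUATION = 0xFFFFFFFF  # the 4-byte continuation value

def write_message(sink, metadata: bytes):
    # New preamble: continuation marker, then the 32-bit little-endian
    # metadata length, then the Flatbuffer. With the 4-byte marker, the
    # Flatbuffer payload starts on an 8-byte boundary.
    sink.write(struct.pack("<I", CONTINUATION))
    sink.write(struct.pack("<i", len(metadata)))
    sink.write(metadata)

def write_end_of_stream(sink):
    # The end-of-stream marker grows from 4 to 8 bytes:
    # the continuation value followed by a zero length.
    sink.write(struct.pack("<I", CONTINUATION))
    sink.write(struct.pack("<i", 0))
{code}

An old-format reader sees 0xFFFFFFFF (-1 as a signed length) where it expects 
a metadata length and fails fast on the invalid value, while a new reader can 
fall back to the old layout when the first 4 bytes are not the continuation 
value.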


[jira] [Created] (ARROW-6329) [Format] Add 4-byte "stream continuation" to IPC message format to align Flatbuffers

2019-08-22 Thread Wes McKinney (Jira)
Wes McKinney created ARROW-6329:
---

 Summary: [Format] Add 4-byte "stream continuation" to IPC message 
format to align Flatbuffers
 Key: ARROW-6329
 URL: https://issues.apache.org/jira/browse/ARROW-6329
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Wes McKinney
Assignee: Micah Kornfield
 Fix For: 0.15.0


This is the JIRA corresponding to the mailing list discussion



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: [RESULT] [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements (2nd vote)

2019-08-22 Thread Micah Kornfield
I created https://issues.apache.org/jira/browse/ARROW-6313 as a tracking
issue with sub-issues for the development work.  So far no one has claimed
the Java and JavaScript tasks.

Would it make sense to have a separate dev branch for this work?

Thanks,
Micah

On Thu, Aug 22, 2019 at 3:24 PM Wes McKinney  wrote:

> The vote carries with 4 binding +1 votes and 1 non-binding +1
>
> I'll merge the specification patch later today and we can begin
> working on implementations so we can get this done for 0.15.0
>
> On Tue, Aug 20, 2019 at 12:30 PM Bryan Cutler  wrote:
> >
> > +1 (non-binding)
> >
> > On Tue, Aug 20, 2019, 7:43 AM Antoine Pitrou 
> wrote:
> >
> > >
> > > Sorry, had forgotten to send my vote on this.
> > >
> > > +1 from me.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Wed, 14 Aug 2019 17:42:33 -0500
> > > Wes McKinney  wrote:
> > > > hi all,
> > > >
> > > > As we've been discussing [1], there is a need to introduce 4 bytes of
> > > > padding into the preamble of the "encapsulated IPC message" format to
> > > > ensure that the Flatbuffers metadata payload begins on an 8-byte
> > > > aligned memory offset. The alternative to this would be for Arrow
> > > > implementations where alignment is important (e.g. C or C++) to copy
> > > > the metadata (which is not always small) into memory when it is
> > > > unaligned.
> > > >
> > > > Micah has proposed to address this by adding a
> > > > 4-byte "continuation" value at the beginning of the payload
> > > > having the value 0xFFFFFFFF. The reason to do it this way is that
> > > > old clients will see an invalid length (what is currently the
> > > > first 4 bytes of the message -- a 32-bit little endian signed
> > > > integer indicating the metadata length) rather than potentially
> > > > crashing on a valid length. We also propose to expand the "end of
> > > > stream" marker used in the stream and file format from 4 to 8
> > > > bytes. This has the additional effect of aligning the file footer
> > > > defined in File.fbs.
> > > >
> > > > This would be a backwards incompatible protocol change, so older
> Arrow
> > > > libraries would not be able to read these new messages. Maintaining
> > > > forward compatibility (reading data produced by older libraries)
> would
> > > > be possible as we can reason that a value other than the continuation
> > > > value was produced by an older library (and then validate the
> > > > Flatbuffer message of course). Arrow implementations could offer a
> > > > backward compatibility mode for the sake of old readers if they
> desire
> > > > (this may also assist with testing).
> > > >
> > > > Additionally with this vote, we want to formally approve the change
> to
> > > > the Arrow "file" format to always write the (new 8-byte)
> end-of-stream
> > > > marker, which enables code that processes Arrow streams to safely
> read
> > > > the file's internal messages as though they were a normal stream.
> > > >
> > > > The PR making these changes to the IPC documentation is here
> > > >
> > > > https://github.com/apache/arrow/pull/4951
> > > >
> > > > Please vote to accept these changes. This vote will be open for at
> > > > least 72 hours
> > > >
> > > > [ ] +1 Adopt these Arrow protocol changes
> > > > [ ] +0
> > > > [ ] -1 I disagree because...
> > > >
> > > > Here is my vote: +1
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > > > [1]:
> > >
> https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
> > > >
> > >
> > >
> > >
> > >
>


Re: [DISCUSS][Java] Design of RLE vector

2019-08-22 Thread Micah Kornfield
I'm in favor of this, but still think we are gathering feedback on the
proposal, so we should hold off on coding this up until we have consensus
on the approach.

Thanks,
Micah

On Wed, Aug 21, 2019 at 9:22 PM Fan Liya  wrote:

> Hi Micah,
>
> Thanks for the comments.
> By storing the run-length ends (partial sums of run-lengths), it provides
> better support for random access (O(log(n))), at the expense of larger
> buffer width.
>
> Generally, I think this is a better design, so the design should be
> changed as follows:
>
> 2. the data structure of RleVector includes an inner vector, plus a buffer
> storing the end indices for runs.
> 3. we provide random access, with time complexity O(log(n)), so it should
> not be used frequently.
>
> What do you think?
>
> Best,
> Liya Fan
>
> On Thu, Aug 22, 2019 at 11:45 AM Micah Kornfield 
> wrote:
>
>> Hi Liya Fan,
>> Perhaps comment on the original thread?  This differs from my proposal in
>> terms of details of encoding.  For RLE, I proposed encoding run end
>> indices
>> instead of run-lengths.  This allows for sublinear access to elements at
>> the cost of potentially larger bit-widths for the lengths.
>>
>>
>> Thanks,
>> Micah
>>
>> On Wed, Aug 21, 2019 at 6:50 PM Fan Liya  wrote:
>>
>> > Hi Wes,
>> >
>> > Thanks for the good suggestion.
>> > It is intended to be sent through IPC. So it should implement
>> FieldVector,
>> > not just ValueVector.
>> >
>> > This can be considered a sub-item of Micah's proposal about
>> > compression/decompression.
>> > I will spend more time on that discussion.
>> >
>> > Best,
>> > Liya Fan
>> >
>> > On Wed, Aug 21, 2019 at 9:34 PM Wes McKinney 
>> wrote:
>> >
>> > > hi Liya,
>> > >
>> > > Do you intend to be able to send RLE vectors using the IPC protocol?
>> > > If so, we need to spend some time on Micah's discussion about
>> > > sparseness and encodings/compression.
>> > >
>> > > - Wes
>> > >
>> > > On Wed, Aug 21, 2019 at 7:33 AM Fan Liya 
>> wrote:
>> > > >
>> > > > Dear all,
>> > > >
>> > > > RLE (run length encoding) is a widely used encoding/decoding
>> technique.
>> > > > Compared with other encoding/decoding techniques, it is easier to
>> work
>> > > with
>> > > > the encoded data.
>> > > >
>> > > > We want to provide an RLE vector implementation in Arrow. The design
>> > > > details include:
>> > > >
>> > > > 1. RleVector implements ValueVector.
>> > > > 2. the data structure of RleVector includes an inner vector, plus a
>> > > > repetition buffer.
>> > > > 3. we do not provide random access over the RleVector
>> > > > 4. In the future, we will provide iterators to access the vector in
>> > > > sequence.
>> > > > 5. RleVector does not support update, but supports appending.
>> > > > 6. In the future, we will provide encoder/decoder to efficiently
>> > > transform
>> > > > encoded/decoded vectors.
>> > > >
>> > > > Please give your valuable feedback.
>> > > >
>> > > > Best,
>> > > > Liya Fan
>> > >
>> >
>>
>
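
To make the run-end layout above concrete, a small sketch of the O(log(n)) 
random access (in Python for brevity; buffer names are illustrative, not the 
eventual Java API):

{code:python}
import bisect

values   = ["a", "b", "c"]  # inner vector: one entry per run
run_ends = [3, 5, 9]        # exclusive end index of each run

def get(i):
    # Element i belongs to the first run whose end index exceeds i,
    # found by binary search over the run-end buffer.
    run = bisect.bisect_right(run_ends, i)
    return values[run]

assert [get(i) for i in range(9)] == ["a"] * 3 + ["b"] * 2 + ["c"] * 4
{code}

Compared with storing raw run lengths, the partial-sum buffer trades wider 
entries for this sublinear lookup.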


Re: [DISCUSS][Java] Design of RLE vector

2019-08-22 Thread Fan Liya
Hi Micah,

Sounds good. Thanks.

I have prepared some initial code, in the hope that it will make
discussions easier.
Anyway, we can ignore it for now, until we have consensus.

Best,
Liya Fan

On Fri, Aug 23, 2019 at 11:05 AM Micah Kornfield 
wrote:

> I'm in favor of this, but still think we are gathering feedback on the
> proposal, so we should hold off on coding this up until we have consensus
> on the approach.
>
> Thanks,
> Micah
>
> On Wed, Aug 21, 2019 at 9:22 PM Fan Liya  wrote:
>
>> Hi Micah,
>>
>> Thanks for the comments.
>> By storing the run-length ends (partial sums of run-lengths), it provides
>> better support for random access (O(log(n))), at the expense of larger
>> buffer width.
>>
>> Generally, I think this is a better design, so the design should be
>> changed as follows:
>>
>> 2. the data structure of RleVector includes an inner vector, plus a
>> buffer storing the end indices for runs.
>> 3. we provide random access, with time complexity O(log(n)), so it should
>> not be used frequently.
>>
>> What do you think?
>>
>> Best,
>> Liya Fan
>>
>> On Thu, Aug 22, 2019 at 11:45 AM Micah Kornfield 
>> wrote:
>>
>>> Hi Liya Fan,
>>> Perhaps comment on the original thread?  This differs from my proposal in
>>> terms of details of encoding.  For RLE, I proposed encoding run end
>>> indices
>>> instead of run-lengths.  This allows for sublinear access to elements at
>>> the cost of potentially larger bit-widths for the lengths.
>>>
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Wed, Aug 21, 2019 at 6:50 PM Fan Liya  wrote:
>>>
>>> > Hi Wes,
>>> >
>>> > Thanks for the good suggestion.
>>> > It is intended to be sent through IPC. So it should implement
>>> FieldVector,
>>> > not just ValueVector.
>>> >
>>> > This can be considered a sub-item of Micah's proposal about
>>> > compression/decompression.
>>> > I will spend more time on that discussion.
>>> >
>>> > Best,
>>> > Liya Fan
>>> >
>>> > On Wed, Aug 21, 2019 at 9:34 PM Wes McKinney 
>>> wrote:
>>> >
>>> > > hi Liya,
>>> > >
>>> > > Do you intend to be able to send RLE vectors using the IPC protocol?
>>> > > If so, we need to spend some time on Micah's discussion about
>>> > > sparseness and encodings/compression.
>>> > >
>>> > > - Wes
>>> > >
>>> > > On Wed, Aug 21, 2019 at 7:33 AM Fan Liya 
>>> wrote:
>>> > > >
>>> > > > Dear all,
>>> > > >
>>> > > > RLE (run length encoding) is a widely used encoding/decoding
>>> technique.
>>> > > > Compared with other encoding/decoding techniques, it is easier to
>>> work
>>> > > with
>>> > > > the encoded data.
>>> > > >
>>> > > > We want to provide an RLE vector implementation in Arrow. The
>>> design
>>> > > > details include:
>>> > > >
>>> > > > 1. RleVector implements ValueVector.
>>> > > > 2. the data structure of RleVector includes an inner vector, plus a
>>> > > > repetition buffer.
>>> > > > 3. we do not provide random access over the RleVector
>>> > > > 4. In the future, we will provide iterators to access the vector in
>>> > > > sequence.
>>> > > > 5. RleVector does not support update, but supports appending.
>>> > > > 6. In the future, we will provide encoder/decoder to efficiently
>>> > > transform
>>> > > > encoded/decoded vectors.
>>> > > >
>>> > > > Please give your valuable feedback.
>>> > > >
>>> > > > Best,
>>> > > > Liya Fan
>>> > >
>>> >
>>>
>>


Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-22 Thread Jacques Nadeau
I don't think we should couple this discussion with the implementation of
large list, etc., since I think those two concepts are independent.

I've asked some others on my team their opinions on the risk here. I think
we should probably review some of our more complex vector interactions and see
how the JVM's assembly changes with this kind of change. Using
microbenchmarking is good, but I think we also need to see whether we're
constantly inserting additional instructions or if, in most cases, this
actually doesn't impact instruction count.



On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield 
wrote:

>
>> With regards to the reference implementation point. It is a good point.
>> I'm on vacation this week. Unless you're pushing hard on this, can we pick
>> this up and discuss more next week?
>
>
> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
> reference implementation aspect of this?
>
>
>> To copy the sentiments from the 0.15.0 release thread, I think it
>> would be best to decouple this discussion from the release timeline
>> given how many people we have relying on regular releases coming out.
>> We can continue making major 0.x releases until we're ready to
>> release 1.0.0.
>
>
> I'm OK with it as long as other stakeholders are. Timed releases are the
> way to go.  As stated on the release thread [1] we need a better mechanism
> to avoid this type of issue arising again.  The release thread also had
> some more discussion on compatibility.
>
> Thanks,
> Micah
>
> [1]
> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>
>
> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney  wrote:
>
>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield 
>> wrote:
>> >
>> > Hi Wes and Jacques,
>> > See responses below.
>> >
>> > With regards to the reference implementation point. It is a good point.
>> I'm
>> > > on vacation this week. Unless you're pushing hard on this, can we
>> pick this
>> > > up and discuss more next week?
>> >
>> >
>> > Sure thing, enjoy your vacation.  I think the only practical
>> implications
>> > are that it delays choices around implementing LargeList, LargeBinary,
>> > LargeString in Java, which in turn might push out the 0.15.0 release.
>> >
>>
>> To copy the sentiments from the 0.15.0 release thread, I think it
>> would be best to decouple this discussion from the release timeline
>> given how many people we have relying on regular releases coming out.
>> We can continue making major 0.x releases until we're ready to
>> release 1.0.0.
>>
>> > My stance on this is that I don't know how important it is for Java to
>> > > support vectors over INT32_MAX elements. The use cases enabled by
>> > > having very large arrays seem to be concentrated in the native code
>> > > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
>> > > on my part, though.
>> >
>> >
>> > A data point against this view is that Spark has done work to eliminate 2GB
>> > memory limits on its block sizes [1].  I don't claim to understand the
>> > implications of this. Bryan might you have any thoughts here?  I'm OK
>> with
>> > INT32_MAX, as well, I think we should think about what this means for
>> > adding Large types to Java and implications for reference
>> implementations.
>> >
>> > Thanks,
>> > Micah
>> >
>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>> >
>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau 
>> wrote:
>> >
>> > > Hey Micah,
>> > >
>> > > Appreciate the offer on the compiling. The reality is I'm more
>> concerned
>> > > about the unknowns than the compiling issue itself. Any time you've
>> been
>> > > tuning for a while, changing something like this could be totally
>> fine or
>> > > cause a couple of major issues. For example, we've done a very large
>> amount
>> > > of work reducing heap memory footprint of the vectors. Our target is
>> to
>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per
>> vector
>> > > (not including arrow bufs).
>> > >
>> > > With regards to the reference implementation point. It is a good
>> point.
>> > > I'm on vacation this week. Unless you're pushing hard on this, can we
>> pick
>> > > this up and discuss more next week?
>> > >
>> > > thanks,
>> > > Jacques
>> > >
>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>> emkornfi...@gmail.com>
>> > > wrote:
>> > >
>> > >> Hi Jacques,
>> > >> I definitely understand these concerns and this change is risky
>> because it
>> > >> is so large.  Perhaps creating a new hierarchy might be the
>> cleanest way
>> > >> of dealing with this.  This could have other benefits like cleaning
>> up
>> > >> some
>> > >> cruft around dictionary encode and "orphaned" method.   Per past
>> e-mail
>> > >> threads I agree it is beneficial to have 2 separate reference
>> > >> implementations that can communicate fully, and my intent here was to
>> > >> close
>> > >> that gap.
>> > >>
>> > >> Trying to
>> > >> > determi

Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-22 Thread Jacques Nadeau
>
>  Hi Jacques, I hope you had a good rest.


I did, thanks!

On Fri, Aug 23, 2019 at 9:25 AM Jacques Nadeau  wrote:

> I don't think we should couple this discussion with the implementation of
> large list, etc., since I think those two concepts are independent.
>
> I've asked some others on my team their opinions on the risk here. I think
> we should probably review some of our more complex vector interactions and see
> how the JVM's assembly changes with this kind of change. Using
> microbenchmarking is good, but I think we also need to see whether we're
> constantly inserting additional instructions or if, in most cases, this
> actually doesn't impact instruction count.
>
>
>
> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield 
> wrote:
>
>>
>>> With regards to the reference implementation point. It is a good point.
>>> I'm on vacation this week. Unless you're pushing hard on this, can we pick
>>> this up and discuss more next week?
>>
>>
>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
>> reference implementation aspect of this?
>>
>>
>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>> would be best to decouple this discussion from the release timeline
>>> given how many people we have relying on regular releases coming out.
>>> We can continue making major 0.x releases until we're ready to
>>> release 1.0.0.
>>
>>
>> I'm OK with it as long as other stakeholders are. Timed releases are the
>> way to go.  As stated on the release thread [1] we need a better mechanism
>> to avoid this type of issue arising again.  The release thread also had
>> some more discussion on compatibility.
>>
>> Thanks,
>> Micah
>>
>> [1]
>> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>
>>
>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney  wrote:
>>
>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield 
>>> wrote:
>>> >
>>> > Hi Wes and Jacques,
>>> > See responses below.
>>> >
>>> > With regards to the reference implementation point. It is a good
>>> point. I'm
>>> > > on vacation this week. Unless you're pushing hard on this, can we
>>> pick this
>>> > > up and discuss more next week?
>>> >
>>> >
>>> > Sure thing, enjoy your vacation.  I think the only practical
>>> implications
>>> > are that it delays choices around implementing LargeList, LargeBinary,
>>> > LargeString in Java, which in turn might push out the 0.15.0 release.
>>> >
>>>
>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>> would be best to decouple this discussion from the release timeline
>>> given how many people we have relying on regular releases coming out.
>>> We can continue making major 0.x releases until we're ready to
>>> release 1.0.0.
>>>
>>> > My stance on this is that I don't know how important it is for Java to
>>> > > support vectors over INT32_MAX elements. The use cases enabled by
>>> > > having very large arrays seem to be concentrated in the native code
>>> > > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
>>> > > on my part, though.
>>> >
>>> >
>>> > A data point against this view is that Spark has done work to eliminate 2GB
>>> > memory limits on its block sizes [1].  I don't claim to understand the
>>> > implications of this. Bryan might you have any thoughts here?  I'm OK
>>> with
>>> > INT32_MAX, as well, I think we should think about what this means for
>>> > adding Large types to Java and implications for reference
>>> implementations.
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>> >
>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau 
>>> wrote:
>>> >
>>> > > Hey Micah,
>>> > >
>>> > > Appreciate the offer on the compiling. The reality is I'm more
>>> concerned
>>> > > about the unknowns than the compiling issue itself. Any time you've
>>> been
>>> > > tuning for a while, changing something like this could be totally
>>> fine or
>>> > > cause a couple of major issues. For example, we've done a very large
>>> amount
>>> > > of work reducing heap memory footprint of the vectors. Our target is
>>> to
>>> > > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per
>>> vector
>>> > > (not including arrow bufs).
>>> > >
>>> > > With regards to the reference implementation point. It is a good
>>> point.
>>> > > I'm on vacation this week. Unless you're pushing hard on this, can
>>> we pick
>>> > > this up and discuss more next week?
>>> > >
>>> > > thanks,
>>> > > Jacques
>>> > >
>>> > > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <
>>> emkornfi...@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> Hi Jacques,
>>> > >> I definitely understand these concerns and this change is risky
>>> because it
>>> > >> is so large.  Perhaps creating a new hierarchy might be the
>>> cleanest way
>>> > >> of dealing with this.  This could have other benefits like cleaning
>>> up
>>> > >> some
>>> > >> cruft around dictionary encode and "o

[jira] [Created] (ARROW-6330) [C++] Include missing headers in api.h

2019-08-22 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6330:
--

 Summary: [C++] Include missing headers in api.h
 Key: ARROW-6330
 URL: https://issues.apache.org/jira/browse/ARROW-6330
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Micah Kornfield


I think result.h and array/concatenate.h should be included as they export 
public symbols.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6331) [Java] Incorporate ErrorProne into the java build

2019-08-22 Thread Micah Kornfield (Jira)
Micah Kornfield created ARROW-6331:
--

 Summary: [Java] Incorporate ErrorProne into the java build
 Key: ARROW-6331
 URL: https://issues.apache.org/jira/browse/ARROW-6331
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, Java
Reporter: Micah Kornfield


Using [error-prone|https://github.com/google/error-prone] seems like it would 
be a good idea to automatically catch more errors.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Binary compatibility of pyarrow.serialize

2019-08-22 Thread Yevgeni Litvin
In our system we are using Arrow serialization as it showed excellent
deserialization speed. However, it seems that we made a mistake by persisting
the streams into long-term storage, as the serialized data appears to be
incompatible between versions. According to the release notes of 0.14.0, it
appears that binary compatibility will be maintained starting with 1.0.0. My
question is whether pyarrow.serialize is also guaranteed to maintain binary
compatibility starting with Arrow 1.0, and whether it would then be safe to
persist its output (or maybe even starting now, with 0.14)?

(From my quick test, 0.13 is not compatible with 0.12 and before, while it is
compatible with 0.14.)

Thank you,

- Yevgeni


Re: [Discuss][Java] 64-bit lengths for ValueVectors

2019-08-22 Thread Micah Kornfield
>
> I don't think we should couple this discussion with the implementation of
> large list, etc., since I think those two concepts are independent.

I'm still trying to balance in my mind which is the worse experience for
consumers of the libraries for these types: claiming that Java supports
these types but throwing an exception when the vectors exceed 32 bits, or
just saying they aren't supported until we have 64-bit support in Java.


> I've asked some others on my team their opinions on the risk here. I think
> we should probably review some of our more complex vector interactions and see
> how the JVM's assembly changes with this kind of change. Using
> microbenchmarking is good, but I think we also need to see whether we're
> constantly inserting additional instructions or if, in most cases, this
> actually doesn't impact instruction count.


Is this something that your team will take on?  Do you need a rebased
version of the PR or is the existing one sufficient?

Thanks,
Micah

On Thu, Aug 22, 2019 at 8:56 PM Jacques Nadeau  wrote:

> I don't think we should couple this discussion with the implementation of
> large list, etc., since I think those two concepts are independent.
>
> I've asked some others on my team their opinions on the risk here. I think
> we should probably review some of our more complex vector interactions and see
> how the JVM's assembly changes with this kind of change. Using
> microbenchmarking is good, but I think we also need to see whether we're
> constantly inserting additional instructions or if, in most cases, this
> actually doesn't impact instruction count.
>
>
>
> On Wed, Aug 21, 2019 at 12:18 PM Micah Kornfield 
> wrote:
>
>>
>>> With regards to the reference implementation point. It is a good point.
>>> I'm on vacation this week. Unless you're pushing hard on this, can we pick
>>> this up and discuss more next week?
>>
>>
>> Hi Jacques, I hope you had a good rest.  Any more thoughts on the
>> reference implementation aspect of this?
>>
>>
>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>> would be best to decouple this discussion from the release timeline
>>> given how many people we have relying on regular releases coming out.
>>> We can continue making major 0.x releases until we're ready to
>>> release 1.0.0.
>>
>>
>> I'm OK with it as long as other stakeholders are. Timed releases are the
>> way to go.  As stated on the release thread [1] we need a better mechanism
>> to avoid this type of issue arising again.  The release thread also had
>> some more discussion on compatibility.
>>
>> Thanks,
>> Micah
>>
>> [1]
>> https://lists.apache.org/thread.html/d70feeceaf2570906ade117030b29887af7c77ca5c4a976e6d555920@%3Cdev.arrow.apache.org%3E
>>
>>
>> On Wed, Aug 14, 2019 at 3:23 PM Wes McKinney  wrote:
>>
>>> On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield 
>>> wrote:
>>> >
>>> > Hi Wes and Jacques,
>>> > See responses below.
>>> >
>>> > With regards to the reference implementation point. It is a good
>>> point. I'm
>>> > > on vacation this week. Unless you're pushing hard on this, can we
>>> pick this
>>> > > up and discuss more next week?
>>> >
>>> >
>>> > Sure thing, enjoy your vacation.  I think the only practical
>>> implications
>>> > are that it delays choices around implementing LargeList, LargeBinary,
>>> > LargeString in Java, which in turn might push out the 0.15.0 release.
>>> >
>>>
>>> To copy the sentiments from the 0.15.0 release thread, I think it
>>> would be best to decouple this discussion from the release timeline
>>> given how many people we have relying on regular releases coming out.
>>> We can continue making major 0.x releases until we're ready to
>>> release 1.0.0.
>>>
>>> > My stance on this is that I don't know how important it is for Java to
>>> > > support vectors over INT32_MAX elements. The use cases enabled by
>>> > > having very large arrays seem to be concentrated in the native code
>>> > > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
>>> > > on my part, though.
>>> >
>>> >
>>> > A data point against this view is that Spark has done work to eliminate 2GB
>>> > memory limits on its block sizes [1].  I don't claim to understand the
>>> > implications of this. Bryan might you have any thoughts here?  I'm OK
>>> with
>>> > INT32_MAX, as well, I think we should think about what this means for
>>> > adding Large types to Java and implications for reference
>>> implementations.
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>> > [1] https://issues.apache.org/jira/browse/SPARK-6235
>>> >
>>> > On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau 
>>> wrote:
>>> >
>>> > > Hey Micah,
>>> > >
>>> > > Appreciate the offer on the compiling. The reality is I'm more
>>> concerned
>>> > > about the unknowns than the compiling issue itself. Any time you've
>>> been
>>> > > tuning for a while, changing something like this could be totally
>>> fine or
>>> > > cause a couple of major issues. For example, we've done a very large
>>> amount
>>