[jira] [Created] (ARROW-3574) Fix remaining bug with plasma static versus shared libraries.

2018-10-19 Thread Robert Nishihara (JIRA)
Robert Nishihara created ARROW-3574:
---

 Summary: Fix remaining bug with plasma static versus shared 
libraries.
 Key: ARROW-3574
 URL: https://issues.apache.org/jira/browse/ARROW-3574
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Plasma (C++)
Reporter: Robert Nishihara
Assignee: Robert Nishihara


Address a few missing pieces in [https://github.com/apache/arrow/pull/2792]. On
Mac, moving the {{plasma_store_server}} executable around and then executing it
leads to

{code}
dyld: Library not loaded: @rpath/libarrow.12.dylib
  Referenced from: /Users/rkn/Workspace/ray/./python/ray/core/src/plasma/plasma_store_server
  Reason: image not found

Abort trap: 6
{code}
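
A quick diagnostic for this class of failure is to list the binary's library
load commands with {{otool}}; any {{@rpath}} entries must resolve against the
rpaths embedded in the relocated executable. A sketch (the binary path is a
placeholder for the relocated executable):

{code:python}
import subprocess

# Sketch: list the dylibs the relocated binary was linked against.
# @rpath-relative entries stop resolving when the executable is moved
# away from the libraries (or rpaths) it was built with.
output = subprocess.check_output(
    ['otool', '-L', 'path/to/plasma_store_server'])  # the relocated binary
print(output.decode())
{code}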





[jira] [Created] (ARROW-3573) [Rust] with_bitset does not set valid bits correctly

2018-10-19 Thread Paddy Horan (JIRA)
Paddy Horan created ARROW-3573:
--

 Summary: [Rust] with_bitset does not set valid bits correctly
 Key: ARROW-3573
 URL: https://issues.apache.org/jira/browse/ARROW-3573
 Project: Apache Arrow
  Issue Type: Bug
  Components: Rust
Reporter: Paddy Horan
Assignee: Paddy Horan


The boundary check is off by one: `MutableBuffer::new(64).with_bitset(64, false);`
will fail. The issue only occurs when the arguments to `new` and `with_bitset`
are equal and a multiple of 64.

In addition, `write_bytes` currently writes 1 instead of 255 when `val` is `true`,
so it sets only the lowest bit of each byte rather than all of them.
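
For illustration, here are the two failure modes sketched in Python (illustrative
only; this is not the Rust implementation):

{code:python}
# Sketch of the two bugs described above (not the actual Rust code).

# Setting all eight bits of a byte means writing 255 (0xFF);
# writing 1 sets only the lowest bit.
assert bin(255) == '0b11111111'
assert bin(1) == '0b1'

# Converting a bit count to a byte count needs ceiling division; an
# off-by-one boundary check fails exactly when the count lands on the
# boundary, e.g. a multiple of 64.
def bytes_for_bits(n_bits):
    return (n_bits + 7) // 8

assert bytes_for_bits(63) == 8
assert bytes_for_bits(64) == 8
assert bytes_for_bits(65) == 9
{code}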





[jira] [Created] (ARROW-3572) [Packaging] Correctly handle ssh origin urls for crossbow

2018-10-19 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-3572:
--

 Summary: [Packaging] Correctly handle ssh origin urls for crossbow 
 Key: ARROW-3572
 URL: https://issues.apache.org/jira/browse/ARROW-3572
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs
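
A hypothetical sketch of the normalization the summary implies, i.e. rewriting
ssh-style origin urls into https form (illustrative only; not crossbow's actual
code, and the helper name is made up):

{code:python}
import re

def normalize_origin_url(url):
    """Rewrite an ssh-style git origin (git@host:user/repo.git) as https.

    Hypothetical helper for illustration; crossbow's real handling may differ.
    """
    m = re.match(r'^git@([^:]+):(.+?)(?:\.git)?$', url)
    if m:
        return 'https://{}/{}'.format(m.group(1), m.group(2))
    return url

assert (normalize_origin_url('git@github.com:apache/arrow.git')
        == 'https://github.com/apache/arrow')
assert (normalize_origin_url('https://github.com/apache/arrow')
        == 'https://github.com/apache/arrow')
{code}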








[jira] [Created] (ARROW-3571) [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions

2018-10-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3571:
---

 Summary: [Wiki] Release management guide does not explain how to 
set up Crossbow or where to find instructions
 Key: ARROW-3571
 URL: https://issues.apache.org/jira/browse/ARROW-3571
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Wiki
Reporter: Wes McKinney
 Fix For: 0.12.0


If you follow the guide, at one point it says "Launch a Crossbow build" but
provides no link to the setup instructions for doing so.





[jira] [Created] (ARROW-3570) [Packaging] Don't bundle test data files with python wheels

2018-10-19 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-3570:
--

 Summary: [Packaging] Don't bundle test data files with python 
wheels
 Key: ARROW-3570
 URL: https://issues.apache.org/jira/browse/ARROW-3570
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Packaging
Reporter: Krisztian Szucs


https://travis-ci.org/kszucs/crossbow/builds/443856122#L2153

By the way, what is the current practice for bundling test files?
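
For illustration, one way to keep test data out of built wheels with setuptools
(hypothetical package name and paths; the real pyarrow build may be wired up
differently):

{code:python}
from setuptools import setup, find_packages

# Hypothetical sketch: keep test data directories out of built
# distributions while still shipping the rest of the package data.
setup(
    name='example-package',
    version='0.1',
    packages=find_packages(),
    include_package_data=True,
    exclude_package_data={'': ['tests/data/*']},
)
{code}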





[jira] [Created] (ARROW-3569) [Packaging] Run pyarrow unittests when building the conda package

2018-10-19 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-3569:
--

 Summary: [Packaging] Run pyarrow unittests when building the conda package
 Key: ARROW-3569
 URL: https://issues.apache.org/jira/browse/ARROW-3569
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs








[jira] [Created] (ARROW-3568) [Packaging] Run pyarrow unittests for Windows wheels

2018-10-19 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-3568:
--

 Summary: [Packaging] Run pyarrow unittests for Windows wheels
 Key: ARROW-3568
 URL: https://issues.apache.org/jira/browse/ARROW-3568
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Packaging
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs
 Fix For: 0.11.1








[jira] [Created] (ARROW-3567) [Gandiva] [GLib] Add GLib bindings for Gandiva

2018-10-19 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-3567:
---

 Summary: [Gandiva] [GLib] Add GLib bindings for Gandiva
 Key: ARROW-3567
 URL: https://issues.apache.org/jira/browse/ARROW-3567
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Gandiva, GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro
 Fix For: 0.12.0








[jira] [Created] (ARROW-3566) Clarify that the type of a dictionary-encoded field should be the encoded (index) type

2018-10-19 Thread Li Jin (JIRA)
Li Jin created ARROW-3566:
-

 Summary: Clarify that the type of a dictionary-encoded field should be the encoded (index) type
 Key: ARROW-3566
 URL: https://issues.apache.org/jira/browse/ARROW-3566
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Li Jin








Re: Efficient Pandas serialization for mixed object and numeric DataFrames

2018-10-19 Thread Wes McKinney
hi Mitar -- to Robert's point, we aren't sure which code path you are
referring to.

Perhaps related, I'm interested in handling Python pickling for
"other" kinds of Python objects when converting to or from the Arrow
format. So "Python object" would be defined as a user-defined type
that's embedded in the Arrow BINARY type. The relevant JIRA for this
is https://issues.apache.org/jira/browse/ARROW-823

Thanks
Wes
On Fri, Oct 19, 2018 at 6:26 AM Antoine Pitrou  wrote:
>
>
> Slightly off-topic, but the recent work on PEP 574 (*) should allow
> efficient serialization of Pandas dataframes (**) with standard pickle
> (or the pickle5 backport).  Experimental support for pickle5 has
> already been merged in Arrow and Numpy (and Pandas uses Numpy as its
> storage backend).  My personal goal is to have the PEP accepted and
> integrated into Python 3.8.
>
> Regards
>
> Antoine.
>
> (*) Pickle protocol 5 with out-of-band data:
> https://www.python.org/dev/peps/pep-0574/
>
> (**) No-copy semantics for pandas dataframes:
> https://github.com/numpy/numpy/pull/12011#issuecomment-428915852
>
>
> On Thu, 18 Oct 2018 21:22:04 -0700
> Robert Nishihara  wrote:
> > How are you serializing the dataframe? If you use *pyarrow.serialize(df)*,
> > then each column should be serialized separately and numeric columns will
> > be handled efficiently.
> >
> > On Thu, Oct 18, 2018 at 9:10 PM Mitar  wrote:
> >
> > > Hi!
> > >
> > > It seems that if a DataFrame contains both numeric and object columns,
> > > the whole DataFrame is pickled, rather than only the object columns.
> > > Is this right? Are there any plans to improve this?
> > >
> > >
> > > Mitar
> > >
> > > --
> > > http://mitar.tnode.com/
> > > https://twitter.com/mitar_m
> > >
> >
>
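
For reference, here is a minimal sketch of the columnar path Robert describes,
using the pyarrow 0.11-era API (pyarrow.serialize has since been deprecated in
favor of pickle protocol 5):

{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'ints': np.arange(3),
                   'objs': ['foo', 'bar', 'baz']})

# Columns are serialized individually: numeric columns take the fast
# Arrow path, while object columns fall back to pickling.
buf = pa.serialize(df).to_buffer()
df_roundtrip = pa.deserialize(buf)
print(df_roundtrip)
{code}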


Parquet format in Java

2018-10-19 Thread Tanveer Ahmad - EWI
Hi,

In Java, I'm receiving a Plasma object from C++ (in Parquet format) as a byte[]
buffer. How can I convert it back to an Arrow schema/columns? Thanks.

--
Regards,
Tanveer Ahmad


Re: Making a bugfix 0.11.1 release

2018-10-19 Thread Wes McKinney
I prepared the maintenance branch here

https://github.com/apache/arrow/tree/maint-0.11.x

I'm not fully set up to create a release candidate with Crossbow yet,
but I'll work on it today and try to get a vote started by EOD.
On Fri, Oct 19, 2018 at 3:45 AM Antoine Pitrou  wrote:
>
>
> I would recommend cherry-picking a minimal number of patches for the
> bugfix and for packaging to work.  It's better not to include API
> additions or changes.
>
> Regards
>
> Antoine.
>
>
> On 17/10/2018 at 03:32, Wes McKinney wrote:
> > hi folks,
> >
> > As a result of ARROW-3514, we need to release new Python packages
> > quite urgently since major functionality (Parquet writing on many
> > Linux platforms) is broken out of the box
> >
> > https://github.com/apache/arrow/commit/66d9a30a26e1659d9e992037339515e59a6ae518
> >
> > We have a couple options:
> >
> > * Release from master
> > * Release 0.11.0 + minimum patches to include the ARROW-3514 fix and
> > any follow up patches to fix packaging
> >
> > There is the option to "not" release but it could cause confusion for
> > people because PyPI does not allow replacing wheels; a new version
> > number has to be created.
> >
> > What would folks like to do? Who can help with the RM duties? Since a
> > 72 hour vote is a _should_ rather than _must_, we could reasonably
> > close the release vote in < 72 hours and push out packages faster if
> > it is scope limited to the zlib bug fix
> >
> > Thanks,
> > Wes
> >


[jira] [Created] (ARROW-3565) [Python] Pin tensorflow to 1.11.0 in manylinux1 container

2018-10-19 Thread Uwe L. Korn (JIRA)
Uwe L. Korn created ARROW-3565:
--

 Summary: [Python] Pin tensorflow to 1.11.0 in manylinux1 container
 Key: ARROW-3565
 URL: https://issues.apache.org/jira/browse/ARROW-3565
 Project: Apache Arrow
  Issue Type: Task
  Components: Python
Affects Versions: 0.11.0
Reporter: Uwe L. Korn
Assignee: Uwe L. Korn
 Fix For: 0.11.1


Just enough to get {{pyarrow}} in a releasable state.





[jira] [Created] (ARROW-3564) pyarrow: writing version 2.0 parquet format with dictionary encoding enabled

2018-10-19 Thread Hatem Helal (JIRA)
Hatem Helal created ARROW-3564:
--

 Summary: pyarrow: writing version 2.0 parquet format with 
dictionary encoding enabled
 Key: ARROW-3564
 URL: https://issues.apache.org/jira/browse/ARROW-3564
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.11.0
Reporter: Hatem Helal
 Attachments: example_v1.0_dict_False.parquet, 
example_v1.0_dict_True.parquet, example_v2.0_dict_False.parquet, 
example_v2.0_dict_True.parquet, pyarrow_repro.py

Using pyarrow v0.11.0, the following script writes a simple table (lifted from 
the [pyarrow doc|https://arrow.apache.org/docs/python/parquet.html]) to both 
parquet format versions 1.0 and 2.0, with and without dictionary encoding 
enabled.
{code:python}
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import itertools

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))

table = pa.Table.from_pandas(df)

use_dict = [True, False]
version = ['1.0', '2.0']

for tf, v in itertools.product(use_dict, version):
    filename = 'example_v' + v + '_dict_' + str(tf) + '.parquet'
    pq.write_table(table, filename, use_dictionary=tf, version=v)
{code}

Inspecting the written files using 
[parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] 
appears to show that dictionary encoding is not used in either of the version 
2.0 files.  Both files report that the columns are encoded using {{PLAIN,RLE}} 
and that the dictionary page offset is zero.  I was expecting that the column 
encoding would include {{RLE_DICTIONARY}}. Attached are the script with repro 
steps and the files that were generated by it.
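
As a cross-check that does not require parquet-tools, the encodings recorded in
the file can also be read back with pyarrow itself. A sketch, assuming a pyarrow
version that exposes {{encodings}} on the column chunk metadata:

{code:python}
import pyarrow.parquet as pq

# Read back the per-column encodings from the first row group.
meta = pq.ParquetFile('example_v2.0_dict_True.parquet').metadata
row_group = meta.row_group(0)
for i in range(row_group.num_columns):
    column = row_group.column(i)
    print(column.path_in_schema, column.encodings)
{code}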

Below is the output of using {{parquet-tools meta}} on the version 2.0 files.

version='2.0', use_dictionary=True:
{code}
% parquet-tools meta example_v2.0_dict_True.parquet
file:        file:.../example_v2.0_dict_True.parquet
creator:     parquet-cpp version 1.5.1-SNAPSHOT
extra:       pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}

file schema:       schema
--------------------------------------------------------------------------------
one:               OPTIONAL DOUBLE R:0 D:1
three:             OPTIONAL BOOLEAN R:0 D:1
two:               OPTIONAL BINARY R:0 D:1
__index_level_0__: OPTIONAL BINARY R:0 D:1

row group 1:       RC:3 TS:211 OFFSET:4
--------------------------------------------------------------------------------
one:               DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
three:             BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
two:               BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
__index_level_0__: BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
{code}

version='2.0', use_dictionary=False:
{code}
% parquet-tools meta example_v2.0_dict_False.parquet
file:        file:.../example_v2.0_dict_False.parquet
creator:     parquet-cpp version 1.5.1-SNAPSHOT
extra:       pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null,
{code}

[jira] [Created] (ARROW-3563) [C++] Declare public link dependencies so arrow_static, plasma_static automatically pull in transitive dependencies

2018-10-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3563:
---

 Summary: [C++] Declare public link dependencies so arrow_static, 
plasma_static automatically pull in transitive dependencies
 Key: ARROW-3563
 URL: https://issues.apache.org/jira/browse/ARROW-3563
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.12.0


see comments in https://github.com/apache/arrow/pull/2792





[jira] [Created] (ARROW-3562) [R] Disallow creation of objects with a null shared_ptr

2018-10-19 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3562:
---

 Summary: [R] Disallow creation of objects with a null shared_ptr
 Key: ARROW-3562
 URL: https://issues.apache.org/jira/browse/ARROW-3562
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Wes McKinney
Assignee: Romain François
 Fix For: 0.12.0


Follow up work to ARROW-3490





Re: Efficient Pandas serialization for mixed object and numeric DataFrames

2018-10-19 Thread Antoine Pitrou


Slightly off-topic, but the recent work on PEP 574 (*) should allow
efficient serialization of Pandas dataframes (**) with standard pickle
(or the pickle5 backport).  Experimental support for pickle5 has
already been merged in Arrow and Numpy (and Pandas uses Numpy as its
storage backend).  My personal goal is to have the PEP accepted and
integrated into Python 3.8.

Regards

Antoine.

(*) Pickle protocol 5 with out-of-band data:
https://www.python.org/dev/peps/pep-0574/

(**) No-copy semantics for pandas dataframes:
https://github.com/numpy/numpy/pull/12011#issuecomment-428915852


On Thu, 18 Oct 2018 21:22:04 -0700
Robert Nishihara  wrote:
> How are you serializing the dataframe? If you use *pyarrow.serialize(df)*,
> then each column should be serialized separately and numeric columns will
> be handled efficiently.
> 
> On Thu, Oct 18, 2018 at 9:10 PM Mitar  wrote:
> 
> > Hi!
> >
> > It seems that if a DataFrame contains both numeric and object columns,
> > the whole DataFrame is pickled, rather than only the object columns.
> > Is this right? Are there any plans to improve this?
> >
> >
> > Mitar
> >
> > --
> > http://mitar.tnode.com/
> > https://twitter.com/mitar_m
> >  
> 



[jira] [Created] (ARROW-3561) [JS] Update ts-jest

2018-10-19 Thread Dominik Moritz (JIRA)
Dominik Moritz created ARROW-3561:
-

 Summary: [JS] Update ts-jest
 Key: ARROW-3561
 URL: https://issues.apache.org/jira/browse/ARROW-3561
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Dominik Moritz








[jira] [Created] (ARROW-3560) Remove @std/esm

2018-10-19 Thread Dominik Moritz (JIRA)
Dominik Moritz created ARROW-3560:
-

 Summary: Remove @std/esm
 Key: ARROW-3560
 URL: https://issues.apache.org/jira/browse/ARROW-3560
 Project: Apache Arrow
  Issue Type: Task
  Components: JavaScript
Reporter: Dominik Moritz


When I run npm install, I get this warning:

@std/esm@0.26.0: This package is discontinued. Use https://npmjs.com/esm





Re: [VOTE] Accept donation of Ruby bindings to Parquet GLib

2018-10-19 Thread Krisztián Szűcs
+1

On Fri, Oct 19, 2018, 9:44 AM Antoine Pitrou  wrote:

>
> +1 from me.  Makes complete sense.
>
>
> > On 18/10/2018 at 23:02, Uwe L. Korn wrote:
> > +1
> >
> >> On 18.10.2018 at 22:59, Wes McKinney wrote:
> >>
> >> hello,
> >>
> >> Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib
> >> library, which was received as a donation in September. This Ruby
> >> library was originally developed at
> >>
> >> https://github.com/red-data-tools/red-parquet/
> >>
> >> Kou has submitted the work as a pull request
> >> https://github.com/apache/arrow/pull/2772
> >>
> >> This vote is to determine if the Arrow PMC is in favor of accepting
> >> this donation, subject to the fulfillment of the ASF IP Clearance
> process.
> >>
> >>[ ] +1 : Accept contribution of Ruby Parquet bindings
> >>[ ]  0 : No opinion
> >>[ ] -1 : Reject contribution because...
> >>
> >> Here is my vote: +1
> >>
> >> The vote will be open for at least 72 hours.
> >>
> >> Thanks,
> >> Wes
>


Re: Making a bugfix 0.11.1 release

2018-10-19 Thread Antoine Pitrou


I would recommend cherry-picking a minimal number of patches for the
bugfix and for packaging to work.  It's better not to include API
additions or changes.

Regards

Antoine.


On 17/10/2018 at 03:32, Wes McKinney wrote:
> hi folks,
> 
> As a result of ARROW-3514, we need to release new Python packages
> quite urgently since major functionality (Parquet writing on many
> Linux platforms) is broken out of the box
> 
> https://github.com/apache/arrow/commit/66d9a30a26e1659d9e992037339515e59a6ae518
> 
> We have a couple options:
> 
> * Release from master
> * Release 0.11.0 + minimum patches to include the ARROW-3514 fix and
> any follow up patches to fix packaging
> 
> There is the option to "not" release but it could cause confusion for
> people because PyPI does not allow replacing wheels; a new version
> number has to be created.
> 
> What would folks like to do? Who can help with the RM duties? Since a
> 72 hour vote is a _should_ rather than _must_, we could reasonably
> close the release vote in < 72 hours and push out packages faster if
> it is scope limited to the zlib bug fix
> 
> Thanks,
> Wes
> 


Re: [VOTE] Accept donation of Ruby bindings to Parquet GLib

2018-10-19 Thread Antoine Pitrou


+1 from me.  Makes complete sense.


On 18/10/2018 at 23:02, Uwe L. Korn wrote:
> +1 
> 
>> On 18.10.2018 at 22:59, Wes McKinney wrote:
>>
>> hello,
>>
>> Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib
>> library, which was received as a donation in September. This Ruby
>> library was originally developed at
>>
>> https://github.com/red-data-tools/red-parquet/
>>
>> Kou has submitted the work as a pull request
>> https://github.com/apache/arrow/pull/2772
>>
>> This vote is to determine if the Arrow PMC is in favor of accepting
>> this donation, subject to the fulfillment of the ASF IP Clearance process.
>>
>>[ ] +1 : Accept contribution of Ruby Parquet bindings
>>[ ]  0 : No opinion
>>[ ] -1 : Reject contribution because...
>>
>> Here is my vote: +1
>>
>> The vote will be open for at least 72 hours.
>>
>> Thanks,
>> Wes