[jira] [Created] (ARROW-3574) Fix remaining bug with plasma static versus shared libraries.
Robert Nishihara created ARROW-3574:
---------------------------------------

             Summary: Fix remaining bug with plasma static versus shared libraries.
                 Key: ARROW-3574
                 URL: https://issues.apache.org/jira/browse/ARROW-3574
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Plasma (C++)
            Reporter: Robert Nishihara
            Assignee: Robert Nishihara

Address a few missing pieces in https://github.com/apache/arrow/pull/2792

On Mac, moving the {{plasma_store_server}} executable around and then executing it leads to

{code:java}
dyld: Library not loaded: @rpath/libarrow.12.dylib
  Referenced from: /Users/rkn/Workspace/ray/./python/ray/core/src/plasma/plasma_store_server
  Reason: image not found
Abort trap: 6
{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (ARROW-3573) [Rust] with_bitset does not set valid bits correctly
Paddy Horan created ARROW-3573:
----------------------------------

             Summary: [Rust] with_bitset does not set valid bits correctly
                 Key: ARROW-3573
                 URL: https://issues.apache.org/jira/browse/ARROW-3573
             Project: Apache Arrow
          Issue Type: Bug
          Components: Rust
            Reporter: Paddy Horan
            Assignee: Paddy Horan

The boundary check is off by one: `MutableBuffer::new(64).with_bitset(64, false);` will fail. This only happens when the arguments to `new` and `with_bitset` are the same and a multiple of 64.

In addition, `write_bytes` is currently writing 1 instead of 255 to set all the bits when `val` is `true`.
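To make the second bug concrete, here is a small Python sketch (not the Rust implementation; the function name and layout are illustrative) of the intended bit-setting semantics, showing why filling bytes with the value 1 rather than 255 (0xFF) leaves most validity bits clear:

```python
def set_valid_bits(n_bytes: int, count: int, val: bool) -> bytearray:
    """Sketch of with_bitset-style semantics: mark the first `count` bits."""
    buf = bytearray(n_bytes)  # all bytes start at 0x00 (all bits clear)
    if not val:
        return buf
    full, rem = divmod(count, 8)
    for i in range(full):
        buf[i] = 0xFF  # 255 sets all eight bits; writing 1 would set only bit 0
    if rem:
        buf[full] = (1 << rem) - 1  # partial trailing byte
    return buf

# 64 bits across 8 bytes: every bit must end up set
buf = set_valid_bits(8, 64, True)
assert sum(bin(b).count("1") for b in buf) == 64

# The buggy behavior: bytes filled with 1 instead of 0xFF set only 8 of 64 bits
buggy = bytearray([1] * 8)
assert sum(bin(b).count("1") for b in buggy) == 8
```

The boundary case from the report (`new(64)` plus `with_bitset(64, ...)`) is exactly the `rem == 0` path, where every byte in the buffer is a "full" byte.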
[jira] [Created] (ARROW-3572) [Packaging] Correctly handle ssh origin urls for crossbow
Krisztian Szucs created ARROW-3572:
--------------------------------------

             Summary: [Packaging] Correctly handle ssh origin urls for crossbow
                 Key: ARROW-3572
                 URL: https://issues.apache.org/jira/browse/ARROW-3572
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Packaging
            Reporter: Krisztian Szucs
[jira] [Created] (ARROW-3571) [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions
Wes McKinney created ARROW-3571:
-----------------------------------

             Summary: [Wiki] Release management guide does not explain how to set up Crossbow or where to find instructions
                 Key: ARROW-3571
                 URL: https://issues.apache.org/jira/browse/ARROW-3571
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Wiki
            Reporter: Wes McKinney
             Fix For: 0.12.0

If you follow the guide, at one point it says "Launch a Crossbow build" but provides no link to the setup instructions for this.
[jira] [Created] (ARROW-3570) [Packaging] Don't bundle test data files with python wheels
Krisztian Szucs created ARROW-3570:
--------------------------------------

             Summary: [Packaging] Don't bundle test data files with python wheels
                 Key: ARROW-3570
                 URL: https://issues.apache.org/jira/browse/ARROW-3570
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Packaging
            Reporter: Krisztian Szucs

https://travis-ci.org/kszucs/crossbow/builds/443856122#L2153

BTW, what's the practice about bundling the test files?
[jira] [Created] (ARROW-3569) [Packaging] Run pyarrow unittests when building conda packages
Krisztian Szucs created ARROW-3569:
--------------------------------------

             Summary: [Packaging] Run pyarrow unittests when building conda packages
                 Key: ARROW-3569
                 URL: https://issues.apache.org/jira/browse/ARROW-3569
             Project: Apache Arrow
          Issue Type: Sub-task
            Reporter: Krisztian Szucs
            Assignee: Krisztian Szucs
[jira] [Created] (ARROW-3568) [Packaging] Run pyarrow unittests for windows wheels
Krisztian Szucs created ARROW-3568:
--------------------------------------

             Summary: [Packaging] Run pyarrow unittests for windows wheels
                 Key: ARROW-3568
                 URL: https://issues.apache.org/jira/browse/ARROW-3568
             Project: Apache Arrow
          Issue Type: Sub-task
          Components: Packaging
            Reporter: Krisztian Szucs
            Assignee: Krisztian Szucs
             Fix For: 0.11.1
[jira] [Created] (ARROW-3567) [Gandiva] [GLib] Add GLib bindings of Gandiva
Yosuke Shiro created ARROW-3567:
-----------------------------------

             Summary: [Gandiva] [GLib] Add GLib bindings of Gandiva
                 Key: ARROW-3567
                 URL: https://issues.apache.org/jira/browse/ARROW-3567
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Gandiva, GLib
            Reporter: Yosuke Shiro
            Assignee: Yosuke Shiro
             Fix For: 0.12.0
[jira] [Created] (ARROW-3566) Clarify that the type of dictionary encoded field should be the encoded(index) type
Li Jin created ARROW-3566:
-----------------------------

             Summary: Clarify that the type of dictionary encoded field should be the encoded(index) type
                 Key: ARROW-3566
                 URL: https://issues.apache.org/jira/browse/ARROW-3566
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Li Jin
Re: Efficient Pandas serialization for mixed object and numeric DataFrames
hi Mitar -- to Robert's point, we aren't sure which code path you are
referring to.

Perhaps related, I'm interested in handling Python pickling for "other"
kinds of Python objects when converting to or from the Arrow format. So
"Python object" would be defined as a user-defined type that's embedded
in the Arrow BINARY type. The relevant JIRA for this is
https://issues.apache.org/jira/browse/ARROW-823

Thanks
Wes

On Fri, Oct 19, 2018 at 6:26 AM Antoine Pitrou wrote:
>
>
> Slightly off-topic, but the recent work on PEP 574 (*) should allow
> efficient serialization of Pandas dataframes (**) with standard pickle
> (or the pickle5 backport). Experimental support for pickle5 has
> already been merged in Arrow and Numpy (and Pandas uses Numpy as its
> storage backend). My personal goal is to have the PEP accepted and
> integrated into Python 3.8.
>
> Regards
>
> Antoine.
>
> (*) Pickle protocol 5 with out-of-band data:
> https://www.python.org/dev/peps/pep-0574/
>
> (**) No-copy semantics for pandas dataframes:
> https://github.com/numpy/numpy/pull/12011#issuecomment-428915852
>
>
> On Thu, 18 Oct 2018 21:22:04 -0700
> Robert Nishihara wrote:
> > How are you serializing the dataframe? If you use *pyarrow.serialize(df)*,
> > then each column should be serialized separately and numeric columns will
> > be handled efficiently.
> >
> > On Thu, Oct 18, 2018 at 9:10 PM Mitar wrote:
> >
> > > Hi!
> > >
> > > It seems that if a DataFrame contains both numeric and object columns,
> > > the whole DataFrame is pickled and not that only object columns are
> > > pickled? Is this right? Are there any plans to improve this?
> > >
> > >
> > > Mitar
> > >
> > > --
> > > http://mitar.tnode.com/
> > > https://twitter.com/mitar_m
> > >
Parquet format in Java
Hi,

In Java, I'm getting a plasma object from C++ (in Parquet format) as a
byte[] buffer. How can I convert it back to Arrow Schema/columns?

Thanks.

--
Regards,
Tanveer Ahmad
Re: Making a bugfix 0.11.1 release
I prepared the maintenance branch here
https://github.com/apache/arrow/tree/maint-0.11.x

I'm not fully set up to create a release candidate with Crossbow yet, but
I'll work on it today and try to get a vote started by EOD.

On Fri, Oct 19, 2018 at 3:45 AM Antoine Pitrou wrote:
>
>
> I would recommend cherry-picking a minimal number of patches for the
> bugfix and for packaging to work. It's better not to include API
> additions or changes.
>
> Regards
>
> Antoine.
>
>
> Le 17/10/2018 à 03:32, Wes McKinney a écrit :
> > hi folks,
> >
> > As a result of ARROW-3514, we need to release new Python packages
> > quite urgently since major functionality (Parquet writing on many
> > Linux platforms) is broken out of the box
> >
> > https://github.com/apache/arrow/commit/66d9a30a26e1659d9e992037339515e59a6ae518
> >
> > We have a couple of options:
> >
> > * Release from master
> > * Release 0.11.0 + minimum patches to include the ARROW-3514 fix and
> > any follow-up patches to fix packaging
> >
> > There is the option to "not" release, but it could cause confusion for
> > people because PyPI does not allow replacing wheels; a new version
> > number has to be created.
> >
> > What would folks like to do? Who can help with the RM duties? Since a
> > 72-hour vote is a _should_ rather than _must_, we could reasonably
> > close the release vote in < 72 hours and push out packages faster if
> > it is scope-limited to the zlib bug fix.
> >
> > Thanks,
> > Wes
> >
[jira] [Created] (ARROW-3565) [Python] Pin tensorflow to 1.11.0 in manylinux1 container
Uwe L. Korn created ARROW-3565:
----------------------------------

             Summary: [Python] Pin tensorflow to 1.11.0 in manylinux1 container
                 Key: ARROW-3565
                 URL: https://issues.apache.org/jira/browse/ARROW-3565
             Project: Apache Arrow
          Issue Type: Task
          Components: Python
    Affects Versions: 0.11.0
            Reporter: Uwe L. Korn
            Assignee: Uwe L. Korn
             Fix For: 0.11.1

Just enough to get {{pyarrow}} in a releasable state.
[jira] [Created] (ARROW-3564) pyarrow: writing version 2.0 parquet format with dictionary encoding enabled
Hatem Helal created ARROW-3564:
----------------------------------

             Summary: pyarrow: writing version 2.0 parquet format with dictionary encoding enabled
                 Key: ARROW-3564
                 URL: https://issues.apache.org/jira/browse/ARROW-3564
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, Python
    Affects Versions: 0.11.0
            Reporter: Hatem Helal
         Attachments: example_v1.0_dict_False.parquet, example_v1.0_dict_True.parquet, example_v2.0_dict_False.parquet, example_v2.0_dict_True.parquet, pyarrow_repro.py

Using pyarrow v0.11.0, the following script writes a simple table (lifted from the [pyarrow doc|https://arrow.apache.org/docs/python/parquet.html]) to both parquet format versions 1.0 and 2.0, with and without dictionary encoding enabled.

{code:python}
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import itertools

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))
table = pa.Table.from_pandas(df)

use_dict = [True, False]
version = ['1.0', '2.0']
for tf, v in itertools.product(use_dict, version):
    filename = 'example_v' + v + '_dict_' + str(tf) + '.parquet'
    pq.write_table(table, filename, use_dictionary=tf, version=v)
{code}

Inspecting the written files using [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] appears to show that dictionary encoding is not used in either of the version 2.0 files. Both files report that the columns are encoded using {{PLAIN,RLE}} and that the dictionary page offset is zero. I was expecting that the column encoding would include {{RLE_DICTIONARY}}.

Attached are the script with repro steps and the files that were generated by it. Below is the output of using {{parquet-tools meta}} on the version 2.0 files.

{panel:title=version='2.0', use_dictionary = True}
{code}
% parquet-tools meta example_v2.0_dict_True.parquet
file:        file:.../example_v2.0_dict_True.parquet
creator:     parquet-cpp version 1.5.1-SNAPSHOT
extra:       pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}

file schema: schema
--------------------------------------------------------------------------------
one:               OPTIONAL DOUBLE R:0 D:1
three:             OPTIONAL BOOLEAN R:0 D:1
two:               OPTIONAL BINARY R:0 D:1
__index_level_0__: OPTIONAL BINARY R:0 D:1

row group 1: RC:3 TS:211 OFFSET:4
--------------------------------------------------------------------------------
one:               DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
three:             BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
two:               BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
__index_level_0__: BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
{code}
{panel}

{panel:title=version='2.0', use_dictionary = False}
{code}
% parquet-tools meta example_v2.0_dict_False.parquet
file:        file:.../example_v2.0_dict_False.parquet
creator:     parquet-cpp version 1.5.1-SNAPSHOT
extra:       pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null,
{code}
{panel}
[jira] [Created] (ARROW-3563) [C++] Declare public link dependencies so arrow_static, plasma_static automatically pull in transitive dependencies
Wes McKinney created ARROW-3563:
-----------------------------------

             Summary: [C++] Declare public link dependencies so arrow_static, plasma_static automatically pull in transitive dependencies
                 Key: ARROW-3563
                 URL: https://issues.apache.org/jira/browse/ARROW-3563
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Wes McKinney
             Fix For: 0.12.0

See comments in https://github.com/apache/arrow/pull/2792
[jira] [Created] (ARROW-3562) [R] Disallow creation of objects with null shared_ptr
Wes McKinney created ARROW-3562:
-----------------------------------

             Summary: [R] Disallow creation of objects with null shared_ptr
                 Key: ARROW-3562
                 URL: https://issues.apache.org/jira/browse/ARROW-3562
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
            Reporter: Wes McKinney
            Assignee: Romain François
             Fix For: 0.12.0

Follow-up work to ARROW-3490
Re: Efficient Pandas serialization for mixed object and numeric DataFrames
Slightly off-topic, but the recent work on PEP 574 (*) should allow
efficient serialization of Pandas dataframes (**) with standard pickle
(or the pickle5 backport). Experimental support for pickle5 has
already been merged in Arrow and Numpy (and Pandas uses Numpy as its
storage backend). My personal goal is to have the PEP accepted and
integrated into Python 3.8.

Regards

Antoine.

(*) Pickle protocol 5 with out-of-band data:
https://www.python.org/dev/peps/pep-0574/

(**) No-copy semantics for pandas dataframes:
https://github.com/numpy/numpy/pull/12011#issuecomment-428915852

On Thu, 18 Oct 2018 21:22:04 -0700
Robert Nishihara wrote:
> How are you serializing the dataframe? If you use *pyarrow.serialize(df)*,
> then each column should be serialized separately and numeric columns will
> be handled efficiently.
>
> On Thu, Oct 18, 2018 at 9:10 PM Mitar wrote:
>
> > Hi!
> >
> > It seems that if a DataFrame contains both numeric and object columns,
> > the whole DataFrame is pickled and not that only object columns are
> > pickled? Is this right? Are there any plans to improve this?
> >
> >
> > Mitar
> >
> > -- 
> > http://mitar.tnode.com/
> > https://twitter.com/mitar_m
> >
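For readers unfamiliar with PEP 574, here is a minimal stdlib sketch (Python 3.8+, where protocol 5 landed) of out-of-band buffer serialization — the mechanism that lets a consumer like Arrow move large column buffers around without copying them through the pickle stream:

```python
import pickle

# A large, contiguous buffer (stand-in for a column's backing memory)
data = bytearray(b"\x01" * 1_000_000)

# With protocol 5, buffers can travel out-of-band: the pickle stream
# holds only a reference, and buffer_callback collects the actual data
buffers = []
payload = pickle.dumps(pickle.PickleBuffer(data), protocol=5,
                       buffer_callback=buffers.append)

# The pickle payload itself stays tiny; the megabyte went out-of-band
assert len(payload) < 100

# The receiver supplies the buffers back; nothing was copied through the stream
restored = pickle.loads(payload, buffers=buffers)
assert bytes(restored) == bytes(data)
```

Libraries like NumPy implement `__reduce_ex__` on top of `PickleBuffer` so their array objects get this behavior transparently, which is what the experimental Arrow and NumPy support mentioned above builds on.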
[jira] [Created] (ARROW-3561) [JS] Update ts-jest
Dominik Moritz created ARROW-3561:
-------------------------------------

             Summary: [JS] Update ts-jest
                 Key: ARROW-3561
                 URL: https://issues.apache.org/jira/browse/ARROW-3561
             Project: Apache Arrow
          Issue Type: Task
          Components: JavaScript
            Reporter: Dominik Moritz
[jira] [Created] (ARROW-3560) Remove @std/esm
Dominik Moritz created ARROW-3560:
-------------------------------------

             Summary: Remove @std/esm
                 Key: ARROW-3560
                 URL: https://issues.apache.org/jira/browse/ARROW-3560
             Project: Apache Arrow
          Issue Type: Task
          Components: JavaScript
            Reporter: Dominik Moritz

When I run {{npm install}}, I get this warning:

@std/esm@0.26.0: This package is discontinued. Use https://npmjs.com/esm
Re: [VOTE] Accept donation of Ruby bindings to Parquet GLib
+1

On Fri, Oct 19, 2018, 9:44 AM Antoine Pitrou wrote:
>
> +1 from me. Makes entire sense.
>
>
> Le 18/10/2018 à 23:02, Uwe L. Korn a écrit :
> > +1
> >
> >> Am 18.10.2018 um 22:59 schrieb Wes McKinney :
> >>
> >> hello,
> >>
> >> Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib
> >> library, which was received as a donation in September. This Ruby
> >> library was originally developed at
> >>
> >> https://github.com/red-data-tools/red-parquet/
> >>
> >> Kou has submitted the work as a pull request
> >> https://github.com/apache/arrow/pull/2772
> >>
> >> This vote is to determine if the Arrow PMC is in favor of accepting
> >> this donation, subject to the fulfillment of the ASF IP Clearance
> >> process.
> >>
> >>   [ ] +1 : Accept contribution of Ruby Parquet bindings
> >>   [ ] 0  : No opinion
> >>   [ ] -1 : Reject contribution because...
> >>
> >> Here is my vote: +1
> >>
> >> The vote will be open for at least 72 hours.
> >>
> >> Thanks,
> >> Wes
>
Re: Making a bugfix 0.11.1 release
I would recommend cherry-picking a minimal number of patches for the
bugfix and for packaging to work. It's better not to include API
additions or changes.

Regards

Antoine.

Le 17/10/2018 à 03:32, Wes McKinney a écrit :
> hi folks,
>
> As a result of ARROW-3514, we need to release new Python packages
> quite urgently since major functionality (Parquet writing on many
> Linux platforms) is broken out of the box
>
> https://github.com/apache/arrow/commit/66d9a30a26e1659d9e992037339515e59a6ae518
>
> We have a couple options:
>
> * Release from master
> * Release 0.11.0 + minimum patches to include the ARROW-3514 fix and
> any follow up patches to fix packaging
>
> There is the option to "not" release but it could cause confusion for
> people because PyPI does not allow replacing wheels; a new version
> number has to be created.
>
> What would folks like to do? Who can help with the RM duties? Since a
> 72 hour vote is a _should_ rather than _must_, we could reasonably
> close the release vote in < 72 hours and push out packages faster if
> it is scope limited to the zlib bug fix
>
> Thanks,
> Wes
>
Re: [VOTE] Accept donation of Ruby bindings to Parquet GLib
+1 from me. Makes entire sense.

Le 18/10/2018 à 23:02, Uwe L. Korn a écrit :
> +1
>
>> Am 18.10.2018 um 22:59 schrieb Wes McKinney :
>>
>> hello,
>>
>> Kouhei Sutou is proposing to donate Ruby bindings to the Parquet GLib
>> library, which was received as a donation in September. This Ruby
>> library was originally developed at
>>
>> https://github.com/red-data-tools/red-parquet/
>>
>> Kou has submitted the work as a pull request
>> https://github.com/apache/arrow/pull/2772
>>
>> This vote is to determine if the Arrow PMC is in favor of accepting
>> this donation, subject to the fulfillment of the ASF IP Clearance process.
>>
>>   [ ] +1 : Accept contribution of Ruby Parquet bindings
>>   [ ] 0  : No opinion
>>   [ ] -1 : Reject contribution because...
>>
>> Here is my vote: +1
>>
>> The vote will be open for at least 72 hours.
>>
>> Thanks,
>> Wes