[jira] [Updated] (ARROW-8820) [C++][Gandiva] fix date_trunc functions to return date types
[ https://issues.apache.org/jira/browse/ARROW-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8820: -- Labels: pull-request-available (was: ) > [C++][Gandiva] fix date_trunc functions to return date types > > > Key: ARROW-8820 > URL: https://issues.apache.org/jira/browse/ARROW-8820 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > date_trunc functions return int64 instead of date types -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8820) [C++][Gandiva] fix date_trunc functions to return date types
Prudhvi Porandla created ARROW-8820: --- Summary: [C++][Gandiva] fix date_trunc functions to return date types Key: ARROW-8820 URL: https://issues.apache.org/jira/browse/ARROW-8820 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla date_trunc functions return int64 instead of date types
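For readers unfamiliar with the function family: date_trunc rounds a timestamp down to a given unit (year, month, day, ...). A minimal Python sketch of month truncation follows; the function name and epoch-milliseconds input are illustrative assumptions only. Gandiva's actual implementation is C++ codegen, and the bug here is solely that its declared return type was int64 rather than a date type.

```python
from datetime import datetime, timezone

def date_trunc_month(epoch_ms: int) -> datetime:
    """Truncate an epoch-milliseconds timestamp to the start of its month.

    Hypothetical sketch of date_trunc('month', ...) semantics; the JIRA fix
    concerns only the *declared return type* of the Gandiva functions.
    """
    dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

# 2020-05-14T12:34:56Z truncated to the start of May 2020
ts = int(datetime(2020, 5, 14, 12, 34, 56, tzinfo=timezone.utc).timestamp() * 1000)
print(date_trunc_month(ts))  # 2020-05-01 00:00:00+00:00
```

The point of the fix is that the result of such a function should be carried with a date/timestamp type tag, not as a bare 64-bit integer.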
[jira] [Resolved] (ARROW-8121) [Java] Enhance code style checking for Java code (add space after commas, semi-colons and type casts)
[ https://issues.apache.org/jira/browse/ARROW-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-8121. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6622 [https://github.com/apache/arrow/pull/6622] > [Java] Enhance code style checking for Java code (add space after commas, > semi-colons and type casts) > - > > Key: ARROW-8121 > URL: https://issues.apache.org/jira/browse/ARROW-8121 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > This is in response to a discussion in > https://github.com/apache/arrow/pull/6039#discussion_r375161992 > We found that the current style checking for Java code is not sufficient, so we > want to enhance it in a series of "small" steps, in order to avoid having to > change too many files at once. > In this issue, we add spaces after commas, semi-colons and type casts.
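For context, whitespace-after rules of this kind are typically expressed as a Checkstyle configuration. The fragment below is a hypothetical sketch: the `WhitespaceAfter` module and the `COMMA`, `SEMI`, and `TYPECAST` token names are standard Checkstyle, but Arrow's actual checkstyle.xml may organize the rule differently.

```xml
<!-- Illustrative Checkstyle fragment: flag missing whitespace after
     commas, semicolons, and type casts. Arrow's real config may differ. -->
<module name="TreeWalker">
  <module name="WhitespaceAfter">
    <property name="tokens" value="COMMA, SEMI, TYPECAST"/>
  </module>
</module>
```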
[jira] [Updated] (ARROW-8788) [C#] Array builders to use bit-packed buffer builder rather than boolean array builder for validity map
[ https://issues.apache.org/jira/browse/ARROW-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8788: -- Labels: pull-request-available (was: ) > [C#] Array builders to use bit-packed buffer builder rather than boolean > array builder for validity map > --- > > Key: ARROW-8788 > URL: https://issues.apache.org/jira/browse/ARROW-8788 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Affects Versions: 0.17.0 >Reporter: Adam Szmigin >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The C# array builders were recently enhanced to have support for adding > nullable values easily, under [PR > #7032|https://github.com/apache/arrow/pull/7032]. > However, the builders internally referenced {{BooleanArray.Builder}}, which > itself then had logic "baked-in" for efficient bit-packing of boolean values > into a byte buffer. > It would be cleaner for there to be a general-purpose bit-packed buffer > builder, and for all array builders to use that for their validity map. The > boolean array builder would use it twice: once for values, once for validity.
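The general-purpose bit-packed buffer builder proposed above can be sketched in a few lines. This Python model (the class and method names are hypothetical; the real builder is C#) packs booleans LSB-first, eight per byte, matching Arrow's validity-bitmap layout:

```python
class BitPackedBufferBuilder:
    """Append booleans one at a time, packing 8 per byte, LSB-first.

    Hypothetical Python sketch of the general-purpose builder the
    issue describes for the C# implementation.
    """

    def __init__(self) -> None:
        self._buf = bytearray()
        self._count = 0

    def append(self, bit: bool) -> None:
        # Start a fresh byte every 8 bits.
        if self._count % 8 == 0:
            self._buf.append(0)
        if bit:
            self._buf[-1] |= 1 << (self._count % 8)
        self._count += 1

    def build(self) -> bytes:
        return bytes(self._buf)

b = BitPackedBufferBuilder()
for v in [True, False, True, True, False, False, False, True, True]:
    b.append(v)
print(b.build().hex())  # '8d01'
```

A boolean array builder would then hold two of these, one for values and one for validity, exactly as the issue suggests.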
[jira] [Closed] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-8805. --- Resolution: Won't Fix Apache Arrow has ceased support for Python 2.7 since it reached EOL in January 2020 > [C++] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks.
[jira] [Commented] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108788#comment-17108788 ] Wes McKinney commented on ARROW-8805: - You'll also have to enable the optional components that GLib depends on > [C++] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks.
[jira] [Commented] (ARROW-8374) [R] Table to vector of DictionaryType will error when Arrays don't have the same Dictionary per array
[ https://issues.apache.org/jira/browse/ARROW-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108786#comment-17108786 ] Wes McKinney commented on ARROW-8374: - Oof, this would be good to fix > [R] Table to vector of DictionaryType will error when Arrays don't have the > same Dictionary per array > > > Key: ARROW-8374 > URL: https://issues.apache.org/jira/browse/ARROW-8374 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Francois Saint-Jacques >Priority: Major > Fix For: 1.0.0 > > > The conversion should unify the dictionaries before converting, > otherwise the indices are simply broken
[jira] [Updated] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanveer updated ARROW-8805: --- Description: Building Arrow C++ from sources (with following flags: cmake -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors as shown in the attached figure. Can someone fix them or suggest me some solution? Thanks. was: !Screenshot from 2020-05-14 22-22-01.png! Building Arrow C++ from sources (with following flags: cmake -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors as shown in the attached figure. Can someone fix them or suggest me some solution? Thanks. > [C++] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks.
[jira] [Commented] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108785#comment-17108785 ] Tanveer commented on ARROW-8805: With CMake command: {code:java} cmake -DCMAKE_BUILD_TYPE=Release -DARROW_PARQUET=ON -DARROW_PLASMA=ON -DARROW_PLASMA_JAVA_CLIENT=ON -DARROW_PYTHON=ON ..{code} {code:java} $ git clone https://github.com/apache/arrow.git Cloning into 'arrow'... remote: Enumerating objects: 86, done. remote: Counting objects: 100% (86/86), done. remote: Compressing objects: 100% (70/70), done. remote: Total 99863 (delta 20), reused 45 (delta 12), pack-reused 99777 Receiving objects: 100% (99863/99863), 53.01 MiB | 1.22 MiB/s, done. Resolving deltas: 100% (68594/68594), done. Checking connectivity... done. tahmad@Rezkuh-7480: ~ $ cd arrow/cpp/ tahmad@Rezkuh-7480: ~/arrow/cpp(master) $ mkdir release tahmad@Rezkuh-7480: ~/arrow/cpp(master) $ cd release/ tahmad@Rezkuh-7480: ~/arrow/cpp/release(master) $ cmake -DCMAKE_BUILD_TYPE=Release -DARROW_PARQUET=ON -DARROW_PLASMA=ON -DARROW_PLASMA_JAVA_CLIENT=ON -DARROW_PYTHON=ON .. 
-- Building using CMake version: 3.5.1 -- The C compiler identification is GNU 5.5.0 -- The CXX compiler identification is GNU 5.5.0 -- Check for working C compiler: /usr/bin/cc -- Check for working C compiler: /usr/bin/cc -- works -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Detecting C compile features -- Detecting C compile features - done -- Check for working CXX compiler: /usr/bin/c++ -- Check for working CXX compiler: /usr/bin/c++ -- works -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Detecting CXX compile features -- Detecting CXX compile features - done -- Arrow version: 0.18.0 (full: '0.18.0-SNAPSHOT') -- Arrow SO version: 18 (full: 18.0.0) -- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") -- clang-tidy not found -- clang-format not found -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) -- infer not found -- Found PythonInterp: /usr/bin/python (found version "2.7.12") -- Found cpplint executable at /home/tahmad/arrow/cpp/build-support/cpplint.py -- System processor: x86_64 -- Performing Test CXX_SUPPORTS_SSE4_2 -- Performing Test CXX_SUPPORTS_SSE4_2 - Success -- Performing Test CXX_SUPPORTS_AVX2 -- Performing Test CXX_SUPPORTS_AVX2 - Success -- Performing Test CXX_SUPPORTS_AVX512 -- Performing Test CXX_SUPPORTS_AVX512 - Failed -- Arrow build warning level: PRODUCTION Using ld linker Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...}) -- Build Type: RELEASE -- Using AUTO approach to find dependencies -- ARROW_AWSSDK_BUILD_VERSION: 1.7.160 -- ARROW_BOOST_BUILD_VERSION: 1.71.0 -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 -- ARROW_CARES_BUILD_VERSION: 1.15.0 -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.0 -- ARROW_GFLAGS_BUILD_VERSION: v2.2.0 -- ARROW_GLOG_BUILD_VERSION: v0.3.5 -- ARROW_GRPC_BUILD_VERSION: v1.25.0 -- ARROW_GTEST_BUILD_VERSION: 1.8.1 -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 -- 
ARROW_LZ4_BUILD_VERSION: v1.9.2 -- ARROW_MIMALLOC_BUILD_VERSION: 270e765454f98e8bab9d42609b153425f749fff6 -- ARROW_ORC_BUILD_VERSION: 1.6.2 -- ARROW_PROTOBUF_BUILD_VERSION: v3.7.1 -- ARROW_RAPIDJSON_BUILD_VERSION: 2bbd33b33217ff4a73434ebf10cdac41e2ef5e34 -- ARROW_RE2_BUILD_VERSION: 2019-08-01 -- ARROW_SNAPPY_BUILD_VERSION: 1.1.7 -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 -- ARROW_ZSTD_BUILD_VERSION: v1.4.3 -- Looking for pthread.h -- Looking for pthread.h - found -- Looking for pthread_create -- Looking for pthread_create - not found -- Check if compiler accepts -pthread -- Check if compiler accepts -pthread - yes -- Found Threads: TRUE -- Checking for module 'thrift' -- No package 'thrift' found -- Could NOT find Thrift: Found unsuitable version "", but required is at least "0.11.0" (found THRIFT_STATIC_LIB-NOTFOUND) -- Boost version: 1.58.0 -- Found the following Boost libraries: -- regex -- system -- filesystem -- Boost include dir: /usr/include -- Boost libraries: Boost::system;Boost::filesystem -- Building without OpenSSL support. Minimum OpenSSL version 1.0.2 required. Building Apache Thrift from source -- Building (vendored) jemalloc from source -- Could NOT find RapidJSONAlt (missing: RAPIDJSON_INCLUDE_DIR) (Required is at least version "1.1.0") -- Building rapidjson from source -- Found hdfs.h at: /home/tahmad/arrow/cpp/thirdparty/hadoop/include/hdfs.h -- CMAKE_C_FLAGS: -O3 -DNDEBUG -Wall -Wno-attributes -msse4.2 -- CMAKE_CXX_FLAGS: -fdiagnostics-color=always -O3 -DNDEBUG -Wall -Wno-attributes -msse4.2 -- Found JNI: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libjawt.so -- JNI_INCLUDE_DIRS =
[jira] [Created] (ARROW-8819) [Rust] Rust docs don't compile for the Arrow crate
Paddy Horan created ARROW-8819: -- Summary: [Rust] Rust docs don't compile for the Arrow crate Key: ARROW-8819 URL: https://issues.apache.org/jira/browse/ARROW-8819 Project: Apache Arrow Issue Type: New Feature Components: Rust Affects Versions: 0.17.0 Reporter: Paddy Horan See Github [issue|https://github.com/apache/arrow/issues/7194]
[jira] [Updated] (ARROW-8818) [Rust] Failing to build on master due to Flatbuffers/Union issues
[ https://issues.apache.org/jira/browse/ARROW-8818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8818: -- Labels: pull-request-available (was: ) > [Rust] Failing to build on master due to Flatbuffers/Union issues > - > > Key: ARROW-8818 > URL: https://issues.apache.org/jira/browse/ARROW-8818 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Created] (ARROW-8818) [Rust] Failing to build on master due to Flatbuffers/Union issues
Paddy Horan created ARROW-8818: -- Summary: [Rust] Failing to build on master due to Flatbuffers/Union issues Key: ARROW-8818 URL: https://issues.apache.org/jira/browse/ARROW-8818 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Paddy Horan Assignee: Paddy Horan
[jira] [Created] (ARROW-8817) [Rust] Add support for Union arrays in Parquet
Paddy Horan created ARROW-8817: -- Summary: [Rust] Add support for Union arrays in Parquet Key: ARROW-8817 URL: https://issues.apache.org/jira/browse/ARROW-8817 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Paddy Horan
[jira] [Commented] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas
[ https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108736#comment-17108736 ] Rauli Ruohonen commented on ARROW-8816: --- Ah, I see. I thought that the output was wrong, because fastparquet also reads it incorrectly. But using both from pandas is not an independent test, because pandas is shared between the tests. Checking with parquet-tools, the output does look correct (924618240 is 2263-01-01 00:00:00, and the extra field gives "datetime" for pandas_type and "object" for numpy_type; AFAICS the reader has no basis to assume that unchecked cast to datetime64 would be safe). Still, it's something of a pitfall that you can successfully save data (using default options), and when you later try to load it using the same software (using default options), it fails. If timestamp_as_object is required to read the data, one could symmetrically also require it to write the data, and avoid surprises upon loading. OTOH raising an exception when you actually can produce correct output would also be slightly odd. One solution would be to have a timestamp_as_object='infer' option (instead of just True/False) that would be the default, so that the current writing behavior would be matched with symmetric reading behavior that would produce datetime64 when possible, and datetime when not. From one pragmatic perspective it'd be safer to raise an exception when trying to write these things unless explicitly requested, because there are readers that fail with them in common use (such as current pyarrow and fastparquet). Maybe the reasoning why write_table defaults to parquet version 1.0 output instead of 2.0 is similar...? IMHO the important thing is to always be able to read back in what one wrote (possibly with wider types) if the write was successful, provided that one uses the same pyarrow version and the default options for both reading and writing. 
> [Python] Year 2263 or later datetimes get mangled when written using pandas > --- > > Key: ARROW-8816 > URL: https://issues.apache.org/jira/browse/ARROW-8816 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0, 0.17.0 > Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, > python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, > python 3.8.2, ubuntu 20.04 (linux). >Reporter: Rauli Ruohonen >Priority: Major > > Using pyarrow 0.17.0, this > > {code:java} > import datetime > import pandas as pd > def try_with_year(year): > print(f'Year {year:_}:') > df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]}) > df.to_parquet('foo.parquet', engine='pyarrow', compression=None) > try: > print(pd.read_parquet('foo.parquet', engine='pyarrow')) > except Exception as exc: > print(repr(exc)) > print() > try_with_year(2_263) > try_with_year(2_262) > {code} > > prints > > {noformat} > Year 2_263: > ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out > of bounds timestamp: 924618240') > Year 2_262: > x > 0 2262-01-01{noformat} > and using pyarrow 0.16.0, it prints > > > {noformat} > Year 2_263: > x > 0 1678-06-12 00:25:26.290448384 > Year 2_262: >x > 0 2262-01-01{noformat} > The issue is that 2263-01-01 is out of bounds for a timestamp stored using > epoch nanoseconds, but not out of bounds for a Python datetime. > While pyarrow 0.17.0 refuses to read the erroneous output, it is still > possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or > fastparquet), yielding the same result as with 0.16.0 above (i.e. only > reading has changed in 0.17.0, not writing). It would be better if an error > was raised when attempting to write the file instead of silently producing > erroneous output. 
> The reason I suspect this is a pyarrow issue instead of a pandas issue is > this modified example: > > {code:java} > import datetime > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]}) > table = pa.Table.from_pandas(df) > print(table[0]) > try: > print(table.to_pandas()) > except Exception as exc: > print(repr(exc)) > {code} > which prints > > > {noformat} > [ > [ > 2263-01-01 00:00:00.00 > ] > ] > ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out > of bounds timestamp: 92461824'){noformat} > on pyarrow 0.17.0 and > > > {noformat} > [ > [ > 2263-01-01 00:00:00.00 > ] > ] > x > 0 1678-06-12 00:25:26.290448384{noformat} > on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, >
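The out-of-bounds errors above all trace back to one fact: pandas' datetime64[ns] stores timestamps as signed 64-bit epoch nanoseconds. The exact cutoff between the failing year 2263 and the working year 2262 can be derived with the standard library alone, no pyarrow required:

```python
from datetime import datetime, timedelta, timezone

# datetime64[ns] holds signed 64-bit epoch nanoseconds, so the latest
# representable instant is (2**63 - 1) ns after 1970-01-01T00:00:00Z.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_NS_INSTANT = EPOCH + timedelta(microseconds=(2**63 - 1) // 1000)

print(MAX_NS_INSTANT)  # 2262-04-11 23:47:16.854775+00:00
print(datetime(2262, 1, 1, tzinfo=timezone.utc) <= MAX_NS_INSTANT)  # True
print(datetime(2263, 1, 1, tzinfo=timezone.utc) <= MAX_NS_INSTANT)  # False
```

This is why 2262-01-01 round-trips while 2263-01-01 cannot be represented as datetime64[ns] and must either raise or be kept as a Python datetime object.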
[jira] [Resolved] (ARROW-8757) [C++] Plasma header is written in native endian
[ https://issues.apache.org/jira/browse/ARROW-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8757. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7146 [https://github.com/apache/arrow/pull/7146] > [C++] Plasma header is written in native endian > --- > > Key: ARROW-8757 > URL: https://issues.apache.org/jira/browse/ARROW-8757 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > The current Plasma header (version, type, and length) is written in native > endian at > [here|https://github.com/apache/arrow/blob/master/cpp/src/plasma/io.cc#L65-L71]. > It will be hard to interpret the Plasma data among different endian > platforms in the future. > At least, the header should be written in the pre-defined endian.
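The difference between native-endian and fixed-endian serialization can be illustrated with Python's struct module. The three int64 fields below mirror the header fields named in the issue (version, type, and length); treating them all as int64 is an assumption for illustration, not the actual Plasma wire layout:

```python
import struct

version, msg_type, length = 1, 42, 128

# Native-endian packing ('='): byte order depends on the host CPU.
native = struct.pack('=qqq', version, msg_type, length)
# Fixed little-endian packing ('<'): same bytes on every platform.
little = struct.pack('<qqq', version, msg_type, length)

# On a little-endian host the two agree; on a big-endian host they differ,
# which is exactly why a cross-platform wire format must pin one byte order.
print(struct.unpack('<qqq', little))  # (1, 42, 128)
```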
[jira] [Assigned] (ARROW-8757) [C++] Plasma header is written in native endian
[ https://issues.apache.org/jira/browse/ARROW-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-8757: --- Assignee: Kazuaki Ishizaki > [C++] Plasma header is written in native endian > --- > > Key: ARROW-8757 > URL: https://issues.apache.org/jira/browse/ARROW-8757 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The current Plasma header (version, type, and length) is written in native > endian at > [here|https://github.com/apache/arrow/blob/master/cpp/src/plasma/io.cc#L65-L71]. > It will be hard to interpret the Plasma data among different endian > platforms in the future. > At least, the header should be written in the pre-defined endian.
[jira] [Updated] (ARROW-8757) [C++] Plasma header is written in native endian
[ https://issues.apache.org/jira/browse/ARROW-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8757: Summary: [C++] Plasma header is written in native endian (was: [c++] Plasma header is written in native endian) > [C++] Plasma header is written in native endian > --- > > Key: ARROW-8757 > URL: https://issues.apache.org/jira/browse/ARROW-8757 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The current Plasma header (version, type, and length) is written in native > endian at > [here|https://github.com/apache/arrow/blob/master/cpp/src/plasma/io.cc#L65-L71]. > It will be hard to interpret the Plasma data among different endian > platforms in the future. > At least, the header should be written in the pre-defined endian.
[jira] [Resolved] (ARROW-7967) [CI][Crossbow] Pin macOS version in autobrew job to match CRAN
[ https://issues.apache.org/jira/browse/ARROW-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7967. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7197 [https://github.com/apache/arrow/pull/7197] > [CI][Crossbow] Pin macOS version in autobrew job to match CRAN > -- > > Key: ARROW-7967 > URL: https://issues.apache.org/jira/browse/ARROW-7967 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Followup to ARROW-7923. After hopefully fixing the underlying issue somewhere > in Travis, revert the changes in that issue so that we're still testing on > old macOS.
[jira] [Assigned] (ARROW-8556) [R] zstd symbol not found if there are multiple installations of zstd
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-8556: -- Assignee: Neal Richardson > [R] zstd symbol not found if there are multiple installations of zstd > - > > Key: ARROW-8556 > URL: https://issues.apache.org/jira/browse/ARROW-8556 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: Ubuntu 19.10 > R 3.6.1 >Reporter: Karl Dunkle Werner >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > I would like to install the `arrow` R package on my Ubuntu 19.10 system. > Prebuilt binaries are unavailable, and I want to enable compression, so I set > the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks > like the package is able to compile, but can't be loaded. I'm able to install > correctly if I don't set the {{LIBARROW_MINIMAL}} variable. > Here's the error I get: > {code:java} > ** testing if installed package can be loaded from temporary location > Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath > = DLLpath, ...): > unable to load shared object > '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so': > ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: > ZSTD_initCStream > Error: loading failed > Execution halted > ERROR: loading failed > * removing ‘~/.R/3.6/arrow’ > {code} >
[jira] [Resolved] (ARROW-8556) [R] zstd symbol not found if there are multiple installations of zstd
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8556. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7196 [https://github.com/apache/arrow/pull/7196] > [R] zstd symbol not found if there are multiple installations of zstd > - > > Key: ARROW-8556 > URL: https://issues.apache.org/jira/browse/ARROW-8556 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: Ubuntu 19.10 > R 3.6.1 >Reporter: Karl Dunkle Werner >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > I would like to install the `arrow` R package on my Ubuntu 19.10 system. > Prebuilt binaries are unavailable, and I want to enable compression, so I set > the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks > like the package is able to compile, but can't be loaded. I'm able to install > correctly if I don't set the {{LIBARROW_MINIMAL}} variable. > Here's the error I get: > {code:java} > ** testing if installed package can be loaded from temporary location > Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath > = DLLpath, ...): > unable to load shared object > '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so': > ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: > ZSTD_initCStream > Error: loading failed > Execution halted > ERROR: loading failed > * removing ‘~/.R/3.6/arrow’ > {code} >
[jira] [Resolved] (ARROW-8814) [Dev][Release] Binary upload script keeps raising locale warnings
[ https://issues.apache.org/jira/browse/ARROW-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8814. - Resolution: Fixed Issue resolved by pull request 7191 [https://github.com/apache/arrow/pull/7191] > [Dev][Release] Binary upload script keeps raising locale warnings > - > > Key: ARROW-8814 > URL: https://issues.apache.org/jira/browse/ARROW-8814 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The console output is filled with warnings, which makes it hard to follow what > happens.
[jira] [Updated] (ARROW-7967) [CI][Crossbow] Pin macOS version in autobrew job to match CRAN
[ https://issues.apache.org/jira/browse/ARROW-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7967: -- Labels: pull-request-available (was: ) > [CI][Crossbow] Pin macOS version in autobrew job to match CRAN > -- > > Key: ARROW-7967 > URL: https://issues.apache.org/jira/browse/ARROW-7967 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Followup to ARROW-7923. After hopefully fixing the underlying issue somewhere > in Travis, revert the changes in that issue so that we're still testing on > old macOS.
[jira] [Updated] (ARROW-7967) [CI][Crossbow] Pin macOS version in autobrew job to match CRAN
[ https://issues.apache.org/jira/browse/ARROW-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7967: --- Summary: [CI][Crossbow] Pin macOS version in autobrew job to match CRAN (was: [CI][Crossbow] Move autobrew job back to old macOS) > [CI][Crossbow] Pin macOS version in autobrew job to match CRAN > -- > > Key: ARROW-7967 > URL: https://issues.apache.org/jira/browse/ARROW-7967 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > > Followup to ARROW-7923. After hopefully fixing the underlying issue somewhere > in Travis, revert the changes in that issue so that we're still testing on > old macOS.
[jira] [Resolved] (ARROW-7803) [R][CI] Autobrew/homebrew tests should not always install from master
[ https://issues.apache.org/jira/browse/ARROW-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7803. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7195 [https://github.com/apache/arrow/pull/7195] > [R][CI] Autobrew/homebrew tests should not always install from master > - > > Key: ARROW-7803 > URL: https://issues.apache.org/jira/browse/ARROW-7803 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Figure out how to get the formula to check out a branch when building > {{--head}}
[jira] [Closed] (ARROW-7825) [R] Update docs to clarify that stringsAsFactors isn't relevant for parquet/feather
[ https://issues.apache.org/jira/browse/ARROW-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson closed ARROW-7825. -- Assignee: Neal Richardson Resolution: Won't Fix Now that wisdom has prevailed and {{stringsAsFactors=FALSE}} is the default in R 4.0, I don't think we need to add anything to the arrow docs. Feel free to reopen and submit a PR if you feel strongly otherwise. > [R] Update docs to clarify that stringsAsFactors isn't relevant for > parquet/feather > --- > > Key: ARROW-7825 > URL: https://issues.apache.org/jira/browse/ARROW-7825 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.16.0 > Environment: Linux 64-bit 5.4.15 >Reporter: Keith Hughitt >Assignee: Neal Richardson >Priority: Major > Labels: R, parquet > > Same issue as reported for feather::read_feather > (https://issues.apache.org/jira/browse/ARROW-7823); > > For the R arrow package, the "read_parquet()" function currently does not > respect "options(stringsAsFactors = FALSE)", leading to > unexpected/inconsistent behavior. > > *Example:* > > > {code:java} > library(arrow) > library(readr) > options(stringsAsFactors = FALSE) > write_tsv(head(iris), 'test.tsv') > write_parquet(head(iris), 'test.parquet') > head(read.delim('test.tsv', sep='\t')$Species) > # [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" > head(read_tsv('test.tsv', col_types = cols())$Species) > # [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" > head(read_parquet('test.parquet')$Species) > # [1] setosa setosa setosa setosa setosa setosa > # Levels: setosa versicolor virginica > {code} > > > *Versions:* > - R 3.6.2 > - arrow_0.15.1.9000
[jira] [Updated] (ARROW-8374) [R] Table to vector of DictionaryType will error when Arrays don't have the same Dictionary per array
[ https://issues.apache.org/jira/browse/ARROW-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8374: --- Fix Version/s: 1.0.0 > [R] Table to vector of DictionaryType will error when Arrays don't have the > same Dictionary per array > > > Key: ARROW-8374 > URL: https://issues.apache.org/jira/browse/ARROW-8374 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Francois Saint-Jacques >Priority: Major > Fix For: 1.0.0 > > > The conversion should accommodate unifying the dictionaries before converting; > otherwise the indices are simply broken -- This message was sent by Atlassian Jira (v8.3.4#803005)
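[Editor's note] The unification ARROW-8374 asks for can be sketched in pure Python: each dictionary-encoded chunk carries its own dictionary, so converting naively mixes incompatible indices; the fix is to build one unified dictionary and remap each chunk's indices into it. The helper name `unify_chunks` is hypothetical, not an Arrow API.

```python
# Sketch of dictionary unification for dictionary-encoded (categorical) chunks.
# Pure-Python illustration; `unify_chunks` is a hypothetical name, not an Arrow API.

def unify_chunks(chunks):
    """chunks: list of (dictionary, indices) pairs, one dictionary per chunk.
    Returns (unified_dictionary, remapped_indices_per_chunk)."""
    unified = []
    position = {}  # value -> index in the unified dictionary
    remapped = []
    for dictionary, indices in chunks:
        # Map each chunk-local dictionary slot to a slot in the unified dictionary.
        local_to_unified = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            local_to_unified.append(position[value])
        remapped.append([local_to_unified[i] for i in indices])
    return unified, remapped

# Two chunks whose dictionaries differ, as in the bug report:
chunk_a = (["setosa", "versicolor"], [0, 1, 0])
chunk_b = (["virginica", "setosa"], [0, 1])
dictionary, indices = unify_chunks([chunk_a, chunk_b])
# dictionary -> ["setosa", "versicolor", "virginica"]; indices -> [[0, 1, 0], [2, 0]]
```

Without the remapping step, index 0 would mean "setosa" in one chunk and "virginica" in the other, which is exactly the breakage described above.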
[jira] [Updated] (ARROW-8556) [R] zstd symbol not found if there are multiple installations of zstd
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8556: -- Labels: pull-request-available (was: ) > [R] zstd symbol not found if there are multiple installations of zstd > - > > Key: ARROW-8556 > URL: https://issues.apache.org/jira/browse/ARROW-8556 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: Ubuntu 19.10 > R 3.6.1 >Reporter: Karl Dunkle Werner >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > I would like to install the `arrow` R package on my Ubuntu 19.10 system. > Prebuilt binaries are unavailable, and I want to enable compression, so I set > the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks > like the package is able to compile, but can't be loaded. I'm able to install > correctly if I don't set the {{LIBARROW_MINIMAL}} variable. > Here's the error I get: > {code:java} > ** testing if installed package can be loaded from temporary location > Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath > = DLLpath, ...): > unable to load shared object > '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so': > ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: > ZSTD_initCStream > Error: loading failed > Execution halted > ERROR: loading failed > * removing ‘~/.R/3.6/arrow’ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8805: Summary: [C++] Arrow (master) build error from sources (was: [CPP] Arrow (master) build error from sources) > [C++] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > !Screenshot from 2020-05-14 22-22-01.png! > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8805) [CPP] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108651#comment-17108651 ] Kouhei Sutou commented on ARROW-8805: - Could you attach the full build log as text instead of screenshot? And could you also show the full CMake command line you specified? > [CPP] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > !Screenshot from 2020-05-14 22-22-01.png! > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8662) [CI] Consolidate appveyor scripts
[ https://issues.apache.org/jira/browse/ARROW-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8662. Resolution: Fixed Issue resolved by pull request 7080 [https://github.com/apache/arrow/pull/7080] > [CI] Consolidate appveyor scripts > - > > Key: ARROW-8662 > URL: https://issues.apache.org/jira/browse/ARROW-8662 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > The appveyor scripts are a bit outdated and contain unreasonable amount of > indirections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8777) [Rust] Parquet.rs does not support reading fixed-size binary fields.
[ https://issues.apache.org/jira/browse/ARROW-8777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8777: --- Component/s: Rust > [Rust] Parquet.rs does not support reading fixed-size binary fields. > > > Key: ARROW-8777 > URL: https://issues.apache.org/jira/browse/ARROW-8777 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Max Burke >Assignee: Max Burke >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7803) [R][CI] Autobrew/homebrew tests should not always install from master
[ https://issues.apache.org/jira/browse/ARROW-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7803: -- Labels: pull-request-available (was: ) > [R][CI] Autobrew/homebrew tests should not always install from master > - > > Key: ARROW-7803 > URL: https://issues.apache.org/jira/browse/ARROW-7803 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Figure out how to get the formula to check out a branch when building > {{--head}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7803) [R][CI] Autobrew/homebrew tests should not always install from master
[ https://issues.apache.org/jira/browse/ARROW-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7803: --- Summary: [R][CI] Autobrew/homebrew tests should not always install from master (was: [R][CI] Autobrew/homebrew tests always install from master) > [R][CI] Autobrew/homebrew tests should not always install from master > - > > Key: ARROW-7803 > URL: https://issues.apache.org/jira/browse/ARROW-7803 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > > Figure out how to get the formula to check out a branch when building > {{--head}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas
[ https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108548#comment-17108548 ] Joris Van den Bossche commented on ARROW-8816: -- > It would be better if an error was raised when attempting to write the file > instead of silently producing erroneous output. The file is correct (so we shouldn't error when writing), it is only after reading in that the conversion to pandas causes the issue given pandas' limitation on the range of timestamps. As you can see, in pyarrow 0.17 it was at least fixed to not produces garbage dates but an error is raised instead (which I would say is better than garbage). But it is a known issue that there should be a way to still convert to pandas but with converting to datetime objects instead of to datetime64[ns] dtype. This is covered by ARROW-5359 with the idea to add a {{timestamp_as_object}} keyword. > [Python] Year 2263 or later datetimes get mangled when written using pandas > --- > > Key: ARROW-8816 > URL: https://issues.apache.org/jira/browse/ARROW-8816 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0, 0.17.0 > Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, > python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, > python 3.8.2, ubuntu 20.04 (linux). 
>Reporter: Rauli Ruohonen >Priority: Major > > Using pyarrow 0.17.0, this > > {code:java} > import datetime > import pandas as pd > def try_with_year(year): > print(f'Year {year:_}:') > df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]}) > df.to_parquet('foo.parquet', engine='pyarrow', compression=None) > try: > print(pd.read_parquet('foo.parquet', engine='pyarrow')) > except Exception as exc: > print(repr(exc)) > print() > try_with_year(2_263) > try_with_year(2_262) > {code} > > prints > > {noformat} > Year 2_263: > ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out > of bounds timestamp: 924618240') > Year 2_262: > x > 0 2262-01-01{noformat} > and using pyarrow 0.16.0, it prints > > > {noformat} > Year 2_263: > x > 0 1678-06-12 00:25:26.290448384 > Year 2_262: >x > 0 2262-01-01{noformat} > The issue is that 2263-01-01 is out of bounds for a timestamp stored using > epoch nanoseconds, but not out of bounds for a Python datetime. > While pyarrow 0.17.0 refuses to read the erroneous output, it is still > possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or > fastparquet), yielding the same result as with 0.16.0 above (i.e. only > reading has changed in 0.17.0, not writing). It would be better if an error > was raised when attempting to write the file instead of silently producing > erroneous output. 
> The reason I suspect this is a pyarrow issue instead of a pandas issue is > this modified example: > > {code:java} > import datetime > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]}) > table = pa.Table.from_pandas(df) > print(table[0]) > try: > print(table.to_pandas()) > except Exception as exc: > print(repr(exc)) > {code} > which prints > > > {noformat} > [ > [ > 2263-01-01 00:00:00.00 > ] > ] > ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out > of bounds timestamp: 92461824'){noformat} > on pyarrow 0.17.0 and > > > {noformat} > [ > [ > 2263-01-01 00:00:00.00 > ] > ] > x > 0 1678-06-12 00:25:26.290448384{noformat} > on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, > pyarrow prints the correct timestamp when asked to produce it as a string (so > it was not lost inside pandas), but the pa.Table.from_pandas(df).to_pandas() > round-trip fails. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
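[Editor's note] The 2262/2263 boundary in the report falls directly out of the datetime64[ns] representation: nanoseconds since the epoch in a signed 64-bit integer. A small stdlib-only check of the bound (no pandas/pyarrow required):

```python
from datetime import datetime, timedelta

# datetime64[ns] stores nanoseconds since 1970-01-01 in a signed 64-bit int,
# so the largest representable instant is (2**63 - 1) ns after the epoch.
EPOCH = datetime(1970, 1, 1)
max_ns = 2**63 - 1
# timedelta has microsecond resolution, which is enough to locate the bound
# (it lands in mid-2262).
upper_bound = EPOCH + timedelta(microseconds=max_ns // 1000)

in_range = datetime(2262, 1, 1) <= upper_bound       # representable as ns
out_of_range = datetime(2263, 1, 1) > upper_bound    # overflows int64 ns
```

This is why 2262-01-01 round-trips while 2263-01-01 either errors (0.17.0) or silently wraps around (0.16.0); Python `datetime` itself has no such limit.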
[jira] [Resolved] (ARROW-8777) [Rust] Parquet.rs does not support reading fixed-size binary fields.
[ https://issues.apache.org/jira/browse/ARROW-8777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8777. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7159 [https://github.com/apache/arrow/pull/7159] > [Rust] Parquet.rs does not support reading fixed-size binary fields. > > > Key: ARROW-8777 > URL: https://issues.apache.org/jira/browse/ARROW-8777 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Max Burke >Assignee: Max Burke >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8810) [R] Add documentation about Parquet format, appending to stream format
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8810: --- Summary: [R] Add documentation about Parquet format, appending to stream format (was: [R] Append to parquet file?) > [R] Add documentation about Parquet format, appending to stream format > -- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Minor > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8810) [R] Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8810: --- Priority: Minor (was: Major) > [R] Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Minor > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7924) [Rust] Add sort for float types
[ https://issues.apache.org/jira/browse/ARROW-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7924: -- Labels: pull-request-available (was: ) > [Rust] Add sort for float types > --- > > Key: ARROW-7924 > URL: https://issues.apache.org/jira/browse/ARROW-7924 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Neville Dipale >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Floats need a different sort approach than other primitives, and this ticket > will implement them separately -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-3827) [Rust] Implement UnionArray
[ https://issues.apache.org/jira/browse/ARROW-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-3827. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7004 [https://github.com/apache/arrow/pull/7004] > [Rust] Implement UnionArray > --- > > Key: ARROW-3827 > URL: https://issues.apache.org/jira/browse/ARROW-3827 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Dennenmoser updated ARROW-8813: --- Description: I think it would be reasonable to implement an interface to the {{tidyr}} package. The implementation would allow to lazily process ArrowTables before put it back into the memory. However, currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine: {code:r} library(magrittr) arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) nested_df <- arrow_table %>% dplyr::select(ID, 4:7, Value) %>% dplyr::filter(Value >= 5) %>% dplyr::group_by(ID) %>% dplyr::collect() %>% tidyr::nest(){code} The main focus might be the following three methods: * {{tidyr::[un]nest()}}, * {{tidyr::pivot_[longer|wider]()}}, and * {{tidyr::seperate()}}. I suppose the last two can be fairly quickly implemented, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before conversion to List will be accessible. was: I think it would be reasonable to implement an interface to the {{tidyr}} package. The implementation would allow to lazily process ArrowTables before put it back into the memory. However, currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine: {code:r} library(magrittr) arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) nested_df <- arrow_table %>% dplyr::select(ID, 4:7, Value) %>% dplyr::filter(Value >= 5) %>% dplyr::group_by(ID) %>% dplyr::collect() %>% tidyr::nest(){code} The main focus might be the following three methods: * {{tidyr::[un]nest()}}, * {{tidyr::pivot_[longer|wider]()}}, and * {{tidyr::seperate()}}. I suppose the last two can be fairly quickly implemented, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before [conversion to List|ARROW-8779] will be accessible. 
> [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before > conversion to List will be accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Dennenmoser updated ARROW-8813: --- Description: I think it would be reasonable to implement an interface to the {{tidyr}} package. The implementation would allow to lazily process ArrowTables before put it back into the memory. However, currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine: {code:r} library(magrittr) arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) nested_df <- arrow_table %>% dplyr::select(ID, 4:7, Value) %>% dplyr::filter(Value >= 5) %>% dplyr::group_by(ID) %>% dplyr::collect() %>% tidyr::nest(){code} The main focus might be the following three methods: * {{tidyr::[un]nest()}}, * {{tidyr::pivot_[longer|wider]()}}, and * {{tidyr::seperate()}}. I suppose the last two can be fairly quickly implemented, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before [conversion to List|ARROW-8779] will be accessible. was: I think it would be reasonable to implement an interface to the {{tidyr}} package. The implementation would allow to lazily process ArrowTables before put it back into the memory. However, currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine: {code:r} library(magrittr) arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) nested_df <- arrow_table %>% dplyr::select(ID, 4:7, Value) %>% dplyr::filter(Value >= 5) %>% dplyr::group_by(ID) %>% dplyr::collect() %>% tidyr::nest(){code} The main focus might be the following three methods: * {{tidyr::[un]nest()}}, * {{tidyr::pivot_[longer|wider]()}}, and * {{tidyr::seperate()}}. I suppose the last two can be fairly quickly implemented, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before conversion to List will be accessible. 
> [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before > [conversion to List|ARROW-8779] will be accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8810) [R] Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108424#comment-17108424 ] Carl Boettiger commented on ARROW-8810: --- Thanks all, this is a great answer. Would love to see some of these details mentioned in the R vignettes, as no doubt other R users might also be unclear how this differs from other compressed/encoded filetypes (e.g. the issue of metadata in the file footer). Writing multiple files makes sense for larger chunks. My current use case is effectively streaming (currently just to .tsv.gz compressed table), so I'm definitely following the discussion in ARROW-8784. Please feel free to close, and thanks again for this fantastic library and the R bindings. > [R] Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108422#comment-17108422 ] Dominic Dennenmoser edited comment on ARROW-8813 at 5/15/20, 4:05 PM: -- Thanks for refering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version ([#800|https://github.com/tidyverse/tidyr/issues/800]). was (Author: domiden): Thanks for revering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version ([#800|https://github.com/tidyverse/tidyr/issues/800]). > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. 
> I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented before > conversion to List becomes accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108422#comment-17108422 ] Dominic Dennenmoser edited comment on ARROW-8813 at 5/15/20, 4:04 PM: -- Thanks for revering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version ([#800|https://github.com/tidyverse/tidyr/issues/800]). was (Author: domiden): Thanks for revering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version (#800). > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. 
> I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented before > conversion to List becomes accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108422#comment-17108422 ] Dominic Dennenmoser commented on ARROW-8813: Thanks for revering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version (#800). > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before > conversion to List will be accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8553) [C++] Optimize unaligned bitmap operations
[ https://issues.apache.org/jira/browse/ARROW-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108382#comment-17108382 ] Wes McKinney commented on ARROW-8553: - Thanks for looking into it, sounds good to me > [C++] Optimize unaligned bitmap operations > -- > > Key: ARROW-8553 > URL: https://issues.apache.org/jira/browse/ARROW-8553 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.17.0 >Reporter: Antoine Pitrou >Assignee: Yibo Cai >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > Currently, {{BitmapAnd}} uses a bit-by-bit loop for unaligned inputs. Using > {{Bitmap::VisitWords}} instead would probably yield a manyfold performance > increase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
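[Editor's note] The optimization discussed in ARROW-8553 is the classic word-at-a-time trick behind {{Bitmap::VisitWords}}: combine 64 bits per loop iteration instead of one. A pure-Python sketch of both paths (this is an illustration of the idea, not Arrow's C++ code):

```python
# Bitmap AND two ways: bit-by-bit (the slow unaligned path) vs. word-at-a-time
# (the Bitmap::VisitWords idea). Pure-Python sketch, not Arrow's implementation.

def bitmap_and_bitwise(a: bytes, b: bytes) -> bytes:
    # Baseline: extract, AND, and re-set one bit per iteration.
    out = bytearray(len(a))
    for i in range(len(a) * 8):
        bit = ((a[i // 8] >> (i % 8)) & 1) & ((b[i // 8] >> (i % 8)) & 1)
        out[i // 8] |= bit << (i % 8)
    return bytes(out)

def bitmap_and_words(a: bytes, b: bytes) -> bytes:
    # 64 bits (8 bytes) per iteration; a short trailing slice is handled
    # naturally because int.from_bytes accepts fewer than 8 bytes.
    out = bytearray()
    for i in range(0, len(a), 8):
        wa = int.from_bytes(a[i:i + 8], "little")
        wb = int.from_bytes(b[i:i + 8], "little")
        out += (wa & wb).to_bytes(len(a[i:i + 8]), "little")
    return bytes(out)

a = bytes([0b10101010, 0b11110000, 0b00001111])
b = bytes([0b11001100, 0b10101010, 0b11111111])
# Both paths agree: [0b10001000, 0b10100000, 0b00001111]
```

In C++ the word loop additionally has to shift-and-stitch words when the two bitmaps start at different bit offsets, which is the unaligned case the ticket targets.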
[jira] [Updated] (ARROW-8812) [Python] Columns of type CategoricalIndex fails to be read back
[ https://issues.apache.org/jira/browse/ARROW-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8812: Summary: [Python] Columns of type CategoricalIndex fails to be read back (was: Columns of type CategoricalIndex fails to be read back) > [Python] Columns of type CategoricalIndex fails to be read back > --- > > Key: ARROW-8812 > URL: https://issues.apache.org/jira/browse/ARROW-8812 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 > Environment: Python 3.7.7 > MacOS (Darwin-19.4.0-x86_64-i386-64bit) > Pandas 1.0.3 > Pyarrow 0.15.1 >Reporter: Jonas Nelle >Priority: Minor > Labels: parquet > > When columns are of type {{CategoricalIndex}}, saving and reading the table > back causes a {{TypeError: data type "categorical" not understood}}: > {code:python} > import pandas as pd > from pyarrow import parquet, Table > base_df = pd.DataFrame([['foo', 'j', "1"], > ['bar', 'j', "1"], > ['foo', 'j', "1"], > ['foobar', 'j', "1"]], >columns=['my_cat', 'var', 'for_count']) > base_df['my_cat'] = base_df['my_cat'].astype('category') > df = ( > base_df > .groupby(["my_cat", "var"], observed=True) > .agg({"for_count": "count"}) > .rename(columns={"for_count": "my_cat_counts"}) > .unstack(level="my_cat", fill_value=0) > ) > print(df) > {code} > The resulting data frame looks something like this: > || ||my_cat_counts|| || || > |my_cat|foo|bar|foobar| > |var| | | | > |j|2|1|1| > Then, writing and reading causes the {{KeyError}}: > {code:python} > parquet.write_table(Table.from_pandas(df), "test.pqt") > parquet.read_table("test.pqt").to_pandas() > > TypeError: data type "categorical" not understood > {code} > In the example, the column is also a MultiIndex, but that isn't the problem: > {code:python} > df.columns = df.columns.get_level_values(1) > parquet.write_table(Table.from_pandas(df), "test.pqt") > parquet.read_table("test.pqt").to_pandas() > > TypeError: data type "categorical" not understood 
> {code} > This is the workaround [suggested on > stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]: > {code:python} > df.columns = pd.Index(list(df.columns)) # suggested fix for the time being > parquet.write_table(Table.from_pandas(df), "test.pqt") > parquet.read_table("test.pqt").to_pandas() # no error > {code} > Are there any plans to support the pattern described here in the future? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8810) [R] Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108381#comment-17108381 ] Wes McKinney commented on ARROW-8810: - Since it's not possible to append data to an existing file (without a great deal of effort in the C++ library) I would suggest closing this. There might be some documentation we could add to clarify that Parquet datasets are intended to consist of many files, with appending done by writing additional files > [R] Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8810) [R] Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8810: Summary: [R] Append to parquet file? (was: Append to parquet file?) > [R] Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8734) [R] improve nightly build installation
[ https://issues.apache.org/jira/browse/ARROW-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8734. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7184 [https://github.com/apache/arrow/pull/7184] > [R] improve nightly build installation > -- > > Key: ARROW-8734 > URL: https://issues.apache.org/jira/browse/ARROW-8734 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > I've tried to install / build from source (both from a git checkout and using > the built-in `install_arrow()`) and when compiling I'm getting the following > error reliably during the autobrew process: > {code:bash} > x System command 'R' failed, exit status: 1, stdout + stderr: > E> * checking for file ‘/Users/jkeane/Dropbox/arrow/r/DESCRIPTION’ ... OK > E> * preparing ‘arrow’: > E> * checking DESCRIPTION meta-information ... OK > E> * cleaning src > E> * running ‘cleanup’ > E> * installing the package to build vignettes > E> --- > E> * installing *source* package ‘arrow’ ... > E> ** using staged installation > E> *** Generating code with data-raw/codegen.R > E> There were 27 warnings (use warnings() to see them) > E> *** > 375 functions decorated with [[arrow|s3::export]] > E> *** > generated file `src/arrowExports.cpp` > E> *** > generated file `R/arrowExports.R` > E> *** Downloading apache-arrow > E> Using local manifest for apache-arrow > E> Thu May 7 13:13:42 CDT 2020: Auto-brewing apache-arrow in > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T//build-apache-arrow... > E> ==> Tapping autobrew/core from https://github.com/autobrew/homebrew-core > E> Tapped 2 commands and 4639 formulae (4,888 files, 12.7MB). 
> E> lz4 > E> openssl > E> thrift > E> snappy > E> ==> Downloading > https://homebrew.bintray.com/bottles/lz4-1.8.3.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/b4158ef68d619dbf78935df6a42a70b8339a65bc8876cbb4446355ccd40fa5de--lz4-1.8.3.mojave.bottle.tar.gz > E> ==> Pouring lz4-1.8.3.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/lz4/1.8.3: > 22 files, 512.7KB > E> ==> Downloading > https://homebrew.bintray.com/bottles/openssl-1.0.2p.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/fbb493745981c8b26c0fab115c76c2a70142bfde9e776c450277e9dfbbba0bb2--openssl-1.0.2p.mojave.bottle.tar.gz > E> ==> Pouring openssl-1.0.2p.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> ==> Caveats > E> openssl is keg-only, which means it was not symlinked into > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow, > E> because Apple has deprecated use of OpenSSL in favor of its own TLS and > crypto libraries. 
> E> > E> If you need to have openssl first in your PATH run: > E> echo 'export > PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/bin:$PATH"' > >> ~/.zshrc > E> > E> For compilers to find openssl you may need to set: > E> export > LDFLAGS="-L/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib" > E> export > CPPFLAGS="-I/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/include" > E> > E> For pkg-config to find openssl you may need to set: > E> export > PKG_CONFIG_PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib/pkgconfig" > E> > E> ==> Summary > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/openssl/1.0.2p: > 1,793 files, 12MB > E> ==> Downloading > https://homebrew.bintray.com/bottles/thrift-0.11.0.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/7e05ea11a9f7f924dd7f8f36252ec73a24958b7f214f71e3752a355e75e589bd--thrift-0.11.0.mojave.bottle.tar.gz > E> ==> Pouring thrift-0.11.0.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> ==> Caveats > E> To install Ruby binding: > E> gem install thrift > E> ==> Summary > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/thrift/0.11.0: > 102 files, 7MB > E> ==> Downloading > https://homebrew.bintray.com/bottles/snappy-1.1.7_1.mojave.bottle.tar.gz > E> Already downloaded: >
[jira] [Updated] (ARROW-8734) [R] improve nightly build installation
[ https://issues.apache.org/jira/browse/ARROW-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8734: --- Summary: [R] improve nightly build installation (was: [R] autobrew script always builds from master) > [R] improve nightly build installation > -- > > Key: ARROW-8734 > URL: https://issues.apache.org/jira/browse/ARROW-8734 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > I've tried to install / build from source (both from a git checkout and using > the built-in `install_arrow()`) and when compiling I'm getting the following > error reliably during the autobrew process: > {code:bash} > x System command 'R' failed, exit status: 1, stdout + stderr: > E> * checking for file ‘/Users/jkeane/Dropbox/arrow/r/DESCRIPTION’ ... OK > E> * preparing ‘arrow’: > E> * checking DESCRIPTION meta-information ... OK > E> * cleaning src > E> * running ‘cleanup’ > E> * installing the package to build vignettes > E> --- > E> * installing *source* package ‘arrow’ ... > E> ** using staged installation > E> *** Generating code with data-raw/codegen.R > E> There were 27 warnings (use warnings() to see them) > E> *** > 375 functions decorated with [[arrow|s3::export]] > E> *** > generated file `src/arrowExports.cpp` > E> *** > generated file `R/arrowExports.R` > E> *** Downloading apache-arrow > E> Using local manifest for apache-arrow > E> Thu May 7 13:13:42 CDT 2020: Auto-brewing apache-arrow in > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T//build-apache-arrow... > E> ==> Tapping autobrew/core from https://github.com/autobrew/homebrew-core > E> Tapped 2 commands and 4639 formulae (4,888 files, 12.7MB). 
> E> lz4 > E> openssl > E> thrift > E> snappy > E> ==> Downloading > https://homebrew.bintray.com/bottles/lz4-1.8.3.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/b4158ef68d619dbf78935df6a42a70b8339a65bc8876cbb4446355ccd40fa5de--lz4-1.8.3.mojave.bottle.tar.gz > E> ==> Pouring lz4-1.8.3.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/lz4/1.8.3: > 22 files, 512.7KB > E> ==> Downloading > https://homebrew.bintray.com/bottles/openssl-1.0.2p.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/fbb493745981c8b26c0fab115c76c2a70142bfde9e776c450277e9dfbbba0bb2--openssl-1.0.2p.mojave.bottle.tar.gz > E> ==> Pouring openssl-1.0.2p.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> ==> Caveats > E> openssl is keg-only, which means it was not symlinked into > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow, > E> because Apple has deprecated use of OpenSSL in favor of its own TLS and > crypto libraries. 
> E> > E> If you need to have openssl first in your PATH run: > E> echo 'export > PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/bin:$PATH"' > >> ~/.zshrc > E> > E> For compilers to find openssl you may need to set: > E> export > LDFLAGS="-L/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib" > E> export > CPPFLAGS="-I/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/include" > E> > E> For pkg-config to find openssl you may need to set: > E> export > PKG_CONFIG_PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib/pkgconfig" > E> > E> ==> Summary > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/openssl/1.0.2p: > 1,793 files, 12MB > E> ==> Downloading > https://homebrew.bintray.com/bottles/thrift-0.11.0.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/7e05ea11a9f7f924dd7f8f36252ec73a24958b7f214f71e3752a355e75e589bd--thrift-0.11.0.mojave.bottle.tar.gz > E> ==> Pouring thrift-0.11.0.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> ==> Caveats > E> To install Ruby binding: > E> gem install thrift > E> ==> Summary > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/thrift/0.11.0: > 102 files, 7MB > E> ==> Downloading > https://homebrew.bintray.com/bottles/snappy-1.1.7_1.mojave.bottle.tar.gz > E> Already downloaded: >
[jira] [Commented] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108367#comment-17108367 ] Neal Richardson commented on ARROW-8813: If you wanted to explore this, one challenge I see is that pivot_longer and pivot_wider aren't generics, so you can't just make arrow methods for them. > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow ArrowTables to be processed lazily > before being pulled back into memory. Currently, however, you need to collect > the table first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::separate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented until > conversion to List becomes available. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8813: --- Summary: [R] Implementing tidyr interface (was: Implementing tidyr interface) > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow ArrowTables to be processed lazily > before being pulled back into memory. Currently, however, you need to collect > the table first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::separate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented until > conversion to List becomes available. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8810) Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108356#comment-17108356 ] Neal Richardson commented on ARROW-8810: Multi-file (Parquet and other format) datasets in R: http://arrow.apache.org/docs/r/articles/dataset.html If appending to a single file is important for your use case, you could use the Arrow stream format. See discussion on ARROW-8748 for what that would look like. > Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8783) [Rust] [DataFusion] Logical plan should have ParquetScan and CsvScan entries
[ https://issues.apache.org/jira/browse/ARROW-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8783: -- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Logical plan should have ParquetScan and CsvScan entries > > > Key: ARROW-8783 > URL: https://issues.apache.org/jira/browse/ARROW-8783 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The LogicalPlan currently has a TableScan entry which references a Table (any > logical plan registered with an ExecutionContext) and is often backed by a > Parquet or CSV data source. > I am finding it increasingly inconvenient that we can't just create a logical > plan referencing a Parquet or CSV file, without having to create an execution > context first and register the data sources with it. > This addition will not remove any existing behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8783) [Rust] [DataFusion] Logical plan should have ParquetScan and CsvScan entries
[ https://issues.apache.org/jira/browse/ARROW-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8783. --- Resolution: Fixed Issue resolved by pull request 7164 [https://github.com/apache/arrow/pull/7164] > [Rust] [DataFusion] Logical plan should have ParquetScan and CsvScan entries > > > Key: ARROW-8783 > URL: https://issues.apache.org/jira/browse/ARROW-8783 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 1.0.0 > > > The LogicalPlan currently has a TableScan entry which references a Table (any > logical plan registered with an ExecutionContext) and is often backed by a > Parquet or CSV data source. > I am finding it increasingly inconvenient that we can't just create a logical > plan referencing a Parquet or CSV file, without having to create an execution > context first and register the data sources with it. > This addition will not remove any existing behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas
Rauli Ruohonen created ARROW-8816: - Summary: [Python] Year 2263 or later datetimes get mangled when written using pandas Key: ARROW-8816 URL: https://issues.apache.org/jira/browse/ARROW-8816 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.0, 0.16.0 Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, ubuntu 20.04 (linux). Reporter: Rauli Ruohonen Using pyarrow 0.17.0, this {code:java} import datetime import pandas as pd def try_with_year(year): print(f'Year {year:_}:') df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]}) df.to_parquet('foo.parquet', engine='pyarrow', compression=None) try: print(pd.read_parquet('foo.parquet', engine='pyarrow')) except Exception as exc: print(repr(exc)) print() try_with_year(2_263) try_with_year(2_262) {code} prints {noformat} Year 2_263: ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp: 924618240') Year 2_262: x 0 2262-01-01{noformat} and using pyarrow 0.16.0, it prints {noformat} Year 2_263: x 0 1678-06-12 00:25:26.290448384 Year 2_262: x 0 2262-01-01{noformat} The issue is that 2263-01-01 is out of bounds for a timestamp stored using epoch nanoseconds, but not out of bounds for a Python datetime. While pyarrow 0.17.0 refuses to read the erroneous output, it is still possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or fastparquet), yielding the same result as with 0.16.0 above (i.e. only reading has changed in 0.17.0, not writing). It would be better if an error was raised when attempting to write the file instead of silently producing erroneous output. 
The reason I suspect this is a pyarrow issue instead of a pandas issue is this modified example: {code:java} import datetime import pandas as pd import pyarrow as pa df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]}) table = pa.Table.from_pandas(df) print(table[0]) try: print(table.to_pandas()) except Exception as exc: print(repr(exc)) {code} which prints {noformat} [ [ 2263-01-01 00:00:00.00 ] ] ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 92461824'){noformat} on pyarrow 0.17.0 and {noformat} [ [ 2263-01-01 00:00:00.00 ] ] x 0 1678-06-12 00:25:26.290448384{noformat} on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, pyarrow prints the correct timestamp when asked to produce it as a string (so it was not lost inside pandas), but the pa.Table.from_pandas(df).to_pandas() round-trip fails. -- This message was sent by Atlassian Jira (v8.3.4#803005)
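The bound involved is easy to see from pandas itself; this quick check (a sketch, not part of the report) shows why year 2262 round-trips while 2263 cannot:

```python
import pandas as pd

# pandas nanosecond timestamps are signed 64-bit offsets from the epoch,
# so the representable range runs from late 1677 to early 2262.
print(pd.Timestamp.min)
print(pd.Timestamp.max)

# 2262-01-01 is inside the range; 2263-01-01 is not, which is why the
# report's example fails only from year 2263 onward.
assert pd.Timestamp.max.year == 2262
```

Python's own `datetime` covers years 1 through 9999, so any datetime past `pd.Timestamp.max` is representable in Arrow's microsecond timestamps but not in pandas' nanosecond ones, matching the failing `to_pandas()` cast.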
[jira] [Resolved] (ARROW-7574) [Rust] FileSource read implementation is seeking for each single byte
[ https://issues.apache.org/jira/browse/ARROW-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jörn Horstmann resolved ARROW-7574. --- Resolution: Fixed > [Rust] FileSource read implementation is seeking for each single byte > - > > Key: ARROW-7574 > URL: https://issues.apache.org/jira/browse/ARROW-7574 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.16.0 >Reporter: Jörn Horstmann >Priority: Major > > on current master branch > {code:java} > $ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet > ... > lseek(3, -8, SEEK_END) = 2937 > read(3, ",\10\0\0PAR1", 8192) = 8 > lseek(3, 845, SEEK_SET) = 845 > read(3, "\25\2\31\334H schema"..., 8192) = 2100 > ... > lseek(5, 4, SEEK_SET) = 4 > read(5, > "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) > = 2941 > lseek(5, 5, SEEK_SET) = 5 > read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020"..., > 8192) = 2940 > lseek(5, 6, SEEK_SET) = 6 > read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200"..., > 8192) = 2939 > lseek(5, 7, SEEK_SET) = 7 > read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000"..., > 8192) = 2938 > lseek(5, 8, SEEK_SET) = 8 > read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) > = 2937 > lseek(5, 9, SEEK_SET) = 9 > read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\024"..., 8192) = > 2936 > lseek(5, 10, SEEK_SET) = 10 > read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\024\30"..., 8192) = > 2935 > {code} > Notice the seek position being incremented by one, despite reading up to > 8192 bytes at a time. Interestingly this does not seem to have a big > performance impact on a local file system with linux, but becomes a problem > when working with a custom implementation of ParquetReader, for example for > reading from s3. > The problem seems to be in > {code} > impl Read for FileSource > {code} > which is unconditionally calling > {code} > reader.seek(SeekFrom::Start(self.start as u64))? 
> {code} > Instead it should probably keep track of the current position and only seek > on the first read. -- This message was sent by Atlassian Jira (v8.3.4#803005)
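The suggested fix, tracking the current position and seeking only on the first read, can be sketched as follows. The reporter's code is Rust; this is a language-neutral Python illustration with hypothetical names (`OffsetReader` is not part of any Arrow API):

```python
import io

class OffsetReader:
    """Read from file `f` starting at offset `start`, seeking only when needed."""

    def __init__(self, f, start):
        self.f = f
        self.pos = start     # where the next read should happen
        self.synced = False  # has the underlying file been positioned yet?

    def read(self, n):
        if not self.synced:
            self.f.seek(self.pos)  # seek exactly once, on the first read
            self.synced = True
        data = self.f.read(n)
        self.pos += len(data)      # afterwards, just track the offset
        return data

r = OffsetReader(io.BytesIO(b'0123456789'), 4)
print(r.read(3), r.read(3))  # b'456' b'789'
```

Each subsequent `read` relies on the file position advancing naturally, which avoids the per-call `seek` that made remote readers (e.g. over S3) slow.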
[jira] [Commented] (ARROW-7574) [Rust] FileSource read implementation is seeking for each single byte
[ https://issues.apache.org/jira/browse/ARROW-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108299#comment-17108299 ] Jörn Horstmann commented on ARROW-7574: --- I retested this with the current master and it seems indeed to be fixed. There are still seeks where the file position should already be at the right position, but doing those for every 8192 bytes should not be a problem. {code:java} lseek(5, 4, SEEK_SET) = 4 read(5, "\25\0\25\260\200\200\1\25\272\354\37,\25\234\263\6\25\0\25\10\25\10\0346\0(\0200"..., 8192) = 8192 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f66eca9 lseek(5, 8196, SEEK_SET)= 8196 read(5, "J N\0001J N\00416J\320\7\0006J\240\17\0006J\320\7\0006J\320\7\0006J"..., 252765) = 252765 {code} > [Rust] FileSource read implementation is seeking for each single byte > - > > Key: ARROW-7574 > URL: https://issues.apache.org/jira/browse/ARROW-7574 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.16.0 >Reporter: Jörn Horstmann >Priority: Major > > on current master branch > {code:java} > $ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet > ... > lseek(3, -8, SEEK_END) = 2937 > read(3, ",\10\0\0PAR1", 8192) = 8 > lseek(3, 845, SEEK_SET) = 845 > read(3, "\25\2\31\334H schema"..., 8192) = 2100 > ... 
> lseek(5, 4, SEEK_SET) = 4 > read(5, > "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) > = 2941 > lseek(5, 5, SEEK_SET) = 5 > read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020"..., > 8192) = 2940 > lseek(5, 6, SEEK_SET) = 6 > read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200"..., > 8192) = 2939 > lseek(5, 7, SEEK_SET) = 7 > read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000"..., > 8192) = 2938 > lseek(5, 8, SEEK_SET) = 8 > read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) > = 2937 > lseek(5, 9, SEEK_SET) = 9 > read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\024"..., 8192) = > 2936 > lseek(5, 10, SEEK_SET) = 10 > read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\024\30"..., 8192) = > 2935 > {code} > Notice the seek position being incremented by one, despite reading up to > 8192 bytes at a time. Interestingly this does not seem to have a big > performance impact on a local file system with linux, but becomes a problem > when working with a custom implementation of ParquetReader, for example for > reading from s3. > The problem seems to be in > {code} > impl Read for FileSource > {code} > which is unconditionally calling > {code} > reader.seek(SeekFrom::Start(self.start as u64))? > {code} > Instead it should probably keep track of the current position and only seek > on the first read. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8809) [Rust] schema mismatch in integration test
[ https://issues.apache.org/jira/browse/ARROW-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8809. --- Resolution: Fixed Issue resolved by pull request 7187 [https://github.com/apache/arrow/pull/7187] > [Rust] schema mismatch in integration test > -- > > Key: ARROW-8809 > URL: https://issues.apache.org/jira/browse/ARROW-8809 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > I apologize for the vagueness here, will flesh out details when I learn more > but it looks like Rust is specifying an int64 as a 32 bit type somewhere. > {code:java} > diff schema1.txt schema2.txt > 15c15 > < int64_nullable: Int(32, > --- > > int64_nullable: Int(64, > 17c17 > < int64_nonnullable: Int(32, > --- > > int64_nonnullable: Int(64, > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8815) [Dev][Release] Binary upload script should retry on unexpected bintray request error
[ https://issues.apache.org/jira/browse/ARROW-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8815: -- Labels: pull-request-available (was: ) > [Dev][Release] Binary upload script should retry on unexpected bintray > request error > > > Key: ARROW-8815 > URL: https://issues.apache.org/jira/browse/ARROW-8815 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > During uploading the binaries to bintray the script exited multiple times > because of unhandled HTTP errors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8815) [Dev][Release] Binary upload script should retry on unexpected bintray request error
Krisztian Szucs created ARROW-8815: -- Summary: [Dev][Release] Binary upload script should retry on unexpected bintray request error Key: ARROW-8815 URL: https://issues.apache.org/jira/browse/ARROW-8815 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 During uploading the binaries to bintray the script exited multiple times because of unhandled HTTP errors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8814) [Dev][Release] Binary upload script keeps raising locale warnings
[ https://issues.apache.org/jira/browse/ARROW-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8814: -- Labels: pull-request-available (was: ) > [Dev][Release] Binary upload script keeps raising locale warnings > - > > Key: ARROW-8814 > URL: https://issues.apache.org/jira/browse/ARROW-8814 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The console output is filled with warnings, which makes it hard to follow what > is happening. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8814) [Dev][Release] Binary upload script keeps raising locale warnings
Krisztian Szucs created ARROW-8814: -- Summary: [Dev][Release] Binary upload script keeps raising locale warnings Key: ARROW-8814 URL: https://issues.apache.org/jira/browse/ARROW-8814 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 The console output is filled with warnings, which makes it hard to follow what is happening. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8803) [Java] Row count should be set before loading buffers in VectorLoader
[ https://issues.apache.org/jira/browse/ARROW-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rong Ma closed ARROW-8803. -- Resolution: Won't Do > [Java] Row count should be set before loading buffers in VectorLoader > - > > Key: ARROW-8803 > URL: https://issues.apache.org/jira/browse/ARROW-8803 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Rong Ma >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Hi guys! I'm new to the community, and I've been using Arrow for some time. > In my use case, I need to read RecordBatches with *compressed* underlying > buffers using Java's IPC API, and I'm finally blocked by the VectorLoader's > "load" method. In this method, > {quote}{{root.setRowCount(recordBatch.getLength());}} > {quote} > It not only sets the rowCount for the root, but also sets the valueCount for > the vectors the root holds, *which have already been set once when loading the > buffers.* > It's not a bug... I know. But if I try to load some compressed buffers, I > will get the following exception: > {quote}java.lang.IndexOutOfBoundsException: index: 0, length: 512 (expected: > range(0, 504)) > at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:718) > at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:965) > at > org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:439) > at > org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:708) > at > org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:226) > at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61) > at > org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:205) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:122) > {quote} > This makes me wonder whether it would make more sense to call > root.setRowCount before loadBuffers. 
> In root.setRowCount it also calls each vector's setValueCount, which I think is unnecessary here since the vectors are already fully formed after loadBuffers.
> Another existing piece of code upstream is similar to this change:
> [link|https://github.com/apache/arrow/blob/ed1f771dccdde623ce85e212eccb2b573185c461/java/vector/src/main/java/org/apache/arrow/vector/ipc/JsonFileReader.java#L170-L178]
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8803) [Java] Row count should be set before loading buffers in VectorLoader
[ https://issues.apache.org/jira/browse/ARROW-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108172#comment-17108172 ] Rong Ma commented on ARROW-8803: [~fan_li_ya] Yes, you're right... It indeed is not a nice way to solve the problem. Will close this and wait for the updates. Thanks :)
> [Java] Row count should be set before loading buffers in VectorLoader
> -
[jira] [Commented] (ARROW-8803) [Java] Row count should be set before loading buffers in VectorLoader
[ https://issues.apache.org/jira/browse/ARROW-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108163#comment-17108163 ] Liya Fan commented on ARROW-8803: - As you have indicated, {{root.setRowCount}} calls {{setValueCount}} on the underlying vectors, and {{setValueCount}} may involve allocation for those vectors. If we moved the {{root.setRowCount}} call to the front, it would lead to unnecessary vector allocations, because the underlying buffers are populated shortly afterwards. In fact, we are working on support for data compression in IPC scenarios (ARROW-8672). Hope it will solve your problem.
> [Java] Row count should be set before loading buffers in VectorLoader
> -
[jira] [Commented] (ARROW-8762) [C++][Gandiva] Replace Gandiva's BitmapAnd with common implementation
[ https://issues.apache.org/jira/browse/ARROW-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108160#comment-17108160 ] Yibo Cai commented on ARROW-8762: - Benchmarked processing in uint8 and uint64; no obvious difference found. https://issues.apache.org/jira/browse/ARROW-8553?focusedCommentId=17108159&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17108159
> [C++][Gandiva] Replace Gandiva's BitmapAnd with common implementation
> -
>
> Key: ARROW-8762
> URL: https://issues.apache.org/jira/browse/ARROW-8762
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, C++ - Gandiva
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
> Now that the arrow/util/bit_util.h implementation has been optimized, we should just use that one.
[jira] [Commented] (ARROW-8553) [C++] Optimize unaligned bitmap operations
[ https://issues.apache.org/jira/browse/ARROW-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108159#comment-17108159 ] Yibo Cai commented on ARROW-8553: - [~wesm], the aligned case is simple enough for the compiler to auto-vectorize the code. A quick test with the patch below showed no obvious performance difference.
{code:c}
diff --git a/cpp/src/arrow/util/bit_util.cc b/cpp/src/arrow/util/bit_util.cc
index 395801f5e..8beaf6cb8 100644
--- a/cpp/src/arrow/util/bit_util.cc
+++ b/cpp/src/arrow/util/bit_util.cc
@@ -261,7 +261,7 @@ template <template <typename> class BitOp>
 void AlignedBitmapOp(const uint8_t* left, int64_t left_offset, const uint8_t* right,
                      int64_t right_offset, uint8_t* out, int64_t out_offset,
                      int64_t length) {
-  BitOp<uint8_t> op;
+  BitOp<uint64_t> op;
   DCHECK_EQ(left_offset % 8, right_offset % 8);
   DCHECK_EQ(left_offset % 8, out_offset % 8);
@@ -269,8 +269,11 @@ void AlignedBitmapOp(const uint8_t* left, int64_t left_offset, const uint8_t* ri
   left += left_offset / 8;
   right += right_offset / 8;
   out += out_offset / 8;
-  for (int64_t i = 0; i < nbytes; ++i) {
-    out[i] = op(left[i], right[i]);
+  uint64_t* out64 = (uint64_t*)out;
+  uint64_t* left64 = (uint64_t*)left;
+  uint64_t* right64 = (uint64_t*)right;
+  for (int64_t i = 0; i < nbytes / 8; ++i) {
+    out64[i] = op(left64[i], right64[i]);
   }
 }
{code}
Benchmark before this patch (in uint8):
{code:c}
BenchmarkBitmapAnd/32768/0     4253 ns    4251 ns   164715   bytes_per_second=7.17813G/s
BenchmarkBitmapAnd/131072/0   16767 ns   16760 ns    41875   bytes_per_second=7.28348G/s
BenchmarkBitmapAnd/32768/0     4264 ns    4262 ns   165145   bytes_per_second=7.15959G/s
BenchmarkBitmapAnd/131072/0   16702 ns   16695 ns    41952   bytes_per_second=7.31158G/s
{code}
Benchmark after this patch (in uint64):
{code:c}
BenchmarkBitmapAnd/32768/0     4133 ns    4131 ns   171808   bytes_per_second=7.38787G/s
BenchmarkBitmapAnd/131072/0   17167 ns   17157 ns    40529   bytes_per_second=7.11491G/s
BenchmarkBitmapAnd/32768/0     4103 ns    4101 ns   171883   bytes_per_second=7.44151G/s
BenchmarkBitmapAnd/131072/0   17351 ns   17343 ns    43299   bytes_per_second=7.0385G/s
{code}
> [C++] Optimize unaligned bitmap operations
> --
>
> Key: ARROW-8553
> URL: https://issues.apache.org/jira/browse/ARROW-8553
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.17.0
> Reporter: Antoine Pitrou
> Assignee: Yibo Cai
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> Currently, {{BitmapAnd}} uses a bit-by-bit loop for unaligned inputs. Using {{Bitmap::VisitWords}} instead would probably yield a manyfold performance increase.
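The word-widening trick in the patch above can be illustrated outside C++. A minimal Python sketch using numpy (the function names and sizes below are illustrative, not Arrow's): AND-ing a bitmap one byte at a time and eight bytes (one uint64) at a time produce identical results, which is the invariant the benchmark relies on.

```python
import numpy as np

def bitmap_and_u8(left, right):
    # Byte-at-a-time AND, as in the original aligned loop.
    return left & right

def bitmap_and_u64(left, right):
    # Reinterpret the byte buffers as 64-bit words, AND the words,
    # then view the result as bytes again (length must be a multiple of 8).
    return (left.view(np.uint64) & right.view(np.uint64)).view(np.uint8)

rng = np.random.default_rng(8553)
left = rng.integers(0, 256, size=4096, dtype=np.uint8)
right = rng.integers(0, 256, size=4096, dtype=np.uint8)

# Both variants compute the same bitmap.
assert np.array_equal(bitmap_and_u8(left, right), bitmap_and_u64(left, right))
```

That both variants benchmark about the same in C++ is consistent with the compiler already auto-vectorizing the byte-wise loop.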
[jira] [Created] (ARROW-8813) Implementing tidyr interface
Dominic Dennenmoser created ARROW-8813: -- Summary: Implementing tidyr interface Key: ARROW-8813 URL: https://issues.apache.org/jira/browse/ARROW-8813 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dominic Dennenmoser
I think it would be reasonable to implement an interface to the {{tidyr}} package. This would allow Arrow Tables to be processed lazily before being pulled back into memory; currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine:
{code:r}
library(magrittr)
arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE)
nested_df <- arrow_table %>%
  dplyr::select(ID, 4:7, Value) %>%
  dplyr::filter(Value >= 5) %>%
  dplyr::group_by(ID) %>%
  dplyr::collect() %>%
  tidyr::nest()
{code}
The main focus might be the following three methods:
* {{tidyr::[un]nest()}},
* {{tidyr::pivot_[longer|wider]()}}, and
* {{tidyr::separate()}}.
I suppose the last two can be implemented fairly quickly, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented until conversion to list columns is available.
[jira] [Created] (ARROW-8812) Columns of type CategoricalIndex fail to be read back
Jonas Nelle created ARROW-8812: -- Summary: Columns of type CategoricalIndex fail to be read back Key: ARROW-8812 URL: https://issues.apache.org/jira/browse/ARROW-8812 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.1 Environment: Python 3.7.7 MacOS (Darwin-19.4.0-x86_64-i386-64bit) Pandas 1.0.3 Pyarrow 0.15.1 Reporter: Jonas Nelle
When columns are of type {{CategoricalIndex}}, saving a table and reading it back causes a {{TypeError: data type "categorical" not understood}}:
{code:python}
import pandas as pd
from pyarrow import parquet, Table

base_df = pd.DataFrame([['foo', 'j', "1"],
                        ['bar', 'j', "1"],
                        ['foo', 'j', "1"],
                        ['foobar', 'j', "1"]],
                       columns=['my_cat', 'var', 'for_count'])
base_df['my_cat'] = base_df['my_cat'].astype('category')
df = (
    base_df
    .groupby(["my_cat", "var"], observed=True)
    .agg({"for_count": "count"})
    .rename(columns={"for_count": "my_cat_counts"})
    .unstack(level="my_cat", fill_value=0)
)
print(df)
{code}
The resulting data frame looks something like this:
|| ||my_cat_counts|| || ||
|my_cat|foo|bar|foobar|
|var| | | |
|j|2|1|1|
Then, writing and reading causes the {{TypeError}}:
{code:python}
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
> TypeError: data type "categorical" not understood
{code}
In the example, the column is also a MultiIndex, but that isn't the problem:
{code:python}
df.columns = df.columns.get_level_values(1)
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
> TypeError: data type "categorical" not understood
{code}
This is the workaround [suggested on stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]:
{code:python}
df.columns = pd.Index(list(df.columns))  # suggested fix for the time being
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()  # no error
{code}
Are there any plans to support the pattern described here in the future?
[jira] [Comment Edited] (ARROW-8810) Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108087#comment-17108087 ] Uwe Korn edited comment on ARROW-8810 at 5/15/20, 8:47 AM: ---
Generally, you should see Parquet files as immutable. If you want to change a file's contents, it is almost always simpler and faster to rewrite it completely or (much better) to write a second file and treat a directory of Parquet files as a single dataset. This comes down to two major properties:
* Values in a Parquet file are encoded and compressed, so they don't adhere to a fixed size per row/value; in some cases a column chunk of a million values may be stored in just 64 bytes.
* The metadata that contains all essential information (e.g. where row groups start, what schema the data has) is stored at the end of the file, i.e. the footer. Especially the last four bytes are needed, as they indicate the start position of the footer.
Technically, you could still write code that appends to an existing Parquet file, but this has the drawbacks that:
* Writing wouldn't be faster than writing to a second, separate file. It would probably even be slower, as we would need to deserialize the existing metadata and serialize it again with slight modifications.
* Reading wouldn't be faster than reading from a second file, even when done sequentially.
* While appending to a Parquet file, the file would be unreadable.
* If your process crashes during the write, all existing data in the Parquet file would be lost.
* It would give users the impression that you can efficiently insert row-by-row into a file. With a columnar data format that can only leverage its techniques on large chunks of rows, this would generate a massive overhead.
Still, if one were to implement this, it would work as follows:
# Read in the footer/metadata of the existing file.
# Seek to the start position of the existing footer and overwrite it with the new data.
# Merge (or rather concat) the existing metadata with the newly computed metadata and write it at the end of the file.
If you look at how a completely fresh Parquet file is written, this is identical except that we wouldn't need to read in and overwrite any existing metadata. With newer Arrow releases, there will be better support for Parquet datasets in R; I'll leave it to [~npr] or [~jorisvandenbossche] to link to the right docs.
> Append to parquet file?
> ---
>
> Key: ARROW-8810
> URL: https://issues.apache.org/jira/browse/ARROW-8810
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Carl Boettiger
> Priority: Major
>
> Is it possible to append new
[jira] [Commented] (ARROW-8810) Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108087#comment-17108087 ] Uwe Korn commented on ARROW-8810: - Generally, you should see Parquet files as immutable. If you want to change a file's contents, it is almost always simpler and faster to rewrite it completely or (much better) to write a second file and treat a directory of Parquet files as a single dataset.
> Append to parquet file?
> ---
>
> Key: ARROW-8810
> URL: https://issues.apache.org/jira/browse/ARROW-8810
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Carl Boettiger
> Priority: Major
>
> Is it possible to append new rows to an existing .parquet file using the R client's arrow::write_parquet(), in a manner similar to the `append=TRUE` argument in text-based output formats like write.table()?
>
> Apologies, as this is perhaps more a question of documentation or user interface, or maybe just my ignorance.
[jira] [Commented] (ARROW-8774) [Rust] [DataFusion] Improve threading model
[ https://issues.apache.org/jira/browse/ARROW-8774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108081#comment-17108081 ] Adam Lippai commented on ARROW-8774: [~andygrove] I don't have edit access, so my addition is pending as a suggestion in the doc.
> [Rust] [DataFusion] Improve threading model
> ---
>
> Key: ARROW-8774
> URL: https://issues.apache.org/jira/browse/ARROW-8774
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Fix For: 1.0.0
>
> DataFusion currently spawns one thread per partition, which results in poor performance when there are more partitions than available cores/threads. It would be better to have a thread pool that defaults to the number of available cores.
> Here is a Google doc where we can collaborate on a design discussion:
> https://docs.google.com/document/d/1_wc6diy3YrRgEIhVIGzrO5AK8yhwfjWlmKtGnvbsrrY/edit?usp=sharing
[jira] [Resolved] (ARROW-8811) [Java] Fix build on master
[ https://issues.apache.org/jira/browse/ARROW-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-8811. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7190 [https://github.com/apache/arrow/pull/7190]
> [Java] Fix build on master
> ---
>
> Key: ARROW-8811
> URL: https://issues.apache.org/jira/browse/ARROW-8811
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h