[jira] [Updated] (ARROW-8820) [C++][Gandiva] fix date_trunc functions to return date types
[ https://issues.apache.org/jira/browse/ARROW-8820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8820: -- Labels: pull-request-available (was: ) > [C++][Gandiva] fix date_trunc functions to return date types > > > Key: ARROW-8820 > URL: https://issues.apache.org/jira/browse/ARROW-8820 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Prudhvi Porandla >Assignee: Prudhvi Porandla >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > date_trunc functions return int64 instead of date types -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8820) [C++][Gandiva] fix date_trunc functions to return date types
Prudhvi Porandla created ARROW-8820: --- Summary: [C++][Gandiva] fix date_trunc functions to return date types Key: ARROW-8820 URL: https://issues.apache.org/jira/browse/ARROW-8820 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla date_trunc functions return int64 instead of date types
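For readers unfamiliar with the function family: date_trunc rounds a timestamp down to a given unit (year, month, day, ...). A minimal Python sketch of month truncation follows; the function name and epoch-milliseconds input are illustrative assumptions only. Gandiva's actual implementation is C++ codegen, and the bug here is solely that its declared return type was int64 rather than a date type.

```python
from datetime import datetime, timezone

def date_trunc_month(epoch_ms: int) -> datetime:
    """Truncate an epoch-milliseconds timestamp to the start of its month.

    Hypothetical sketch of date_trunc('month', ...) semantics; the JIRA fix
    concerns only the *declared return type* of the Gandiva functions.
    """
    dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return dt.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

# 2020-05-14T12:34:56Z truncated to the start of May 2020
ts = int(datetime(2020, 5, 14, 12, 34, 56, tzinfo=timezone.utc).timestamp() * 1000)
print(date_trunc_month(ts))  # 2020-05-01 00:00:00+00:00
```

The point of the fix is that the result of such a function should be carried with a date/timestamp type tag, not as a bare 64-bit integer.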
[jira] [Resolved] (ARROW-8121) [Java] Enhance code style checking for Java code (add space after commas, semi-colons and type casts)
[ https://issues.apache.org/jira/browse/ARROW-8121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-8121. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 6622 [https://github.com/apache/arrow/pull/6622] > [Java] Enhance code style checking for Java code (add space after commas, > semi-colons and type casts) > - > > Key: ARROW-8121 > URL: https://issues.apache.org/jira/browse/ARROW-8121 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > This is in response to a discussion in > https://github.com/apache/arrow/pull/6039#discussion_r375161992 > We found that the current style checking for Java code is not sufficient, so we > want to enhance it in a series of "small" steps, in order to avoid having to > change too many files at once. > In this issue, we add spaces after commas, semi-colons and type casts.
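For context, whitespace-after rules of this kind are typically expressed as a Checkstyle configuration. The fragment below is a hypothetical sketch: the `WhitespaceAfter` module and the `COMMA`, `SEMI`, and `TYPECAST` token names are standard Checkstyle, but Arrow's actual checkstyle.xml may organize the rule differently.

```xml
<!-- Illustrative Checkstyle fragment: flag missing whitespace after
     commas, semicolons, and type casts. Arrow's real config may differ. -->
<module name="TreeWalker">
  <module name="WhitespaceAfter">
    <property name="tokens" value="COMMA, SEMI, TYPECAST"/>
  </module>
</module>
```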
[jira] [Updated] (ARROW-8788) [C#] Array builders to use bit-packed buffer builder rather than boolean array builder for validity map
[ https://issues.apache.org/jira/browse/ARROW-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8788: -- Labels: pull-request-available (was: ) > [C#] Array builders to use bit-packed buffer builder rather than boolean > array builder for validity map > --- > > Key: ARROW-8788 > URL: https://issues.apache.org/jira/browse/ARROW-8788 > Project: Apache Arrow > Issue Type: Improvement > Components: C# >Affects Versions: 0.17.0 >Reporter: Adam Szmigin >Priority: Minor > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The C# array builders were recently enhanced to have support for adding > nullable values easily, under [PR > #7032|https://github.com/apache/arrow/pull/7032]. > However, the builders internally referenced {{BooleanArray.Builder}}, which > itself then had logic "baked-in" for efficient bit-packing of boolean values > into a byte buffer. > It would be cleaner for there to be a general-purpose bit-packed buffer > builder, and for all array builders to use that for their validity map. The > boolean array builder would use it twice: once for values, once for validity.
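The general-purpose bit-packed buffer builder proposed above can be sketched in a few lines. This Python model (the class and method names are hypothetical; the real builder is C#) packs booleans LSB-first, eight per byte, matching Arrow's validity-bitmap layout:

```python
class BitPackedBufferBuilder:
    """Append booleans one at a time, packing 8 per byte, LSB-first.

    Hypothetical Python sketch of the general-purpose builder the
    issue describes for the C# implementation.
    """

    def __init__(self) -> None:
        self._buf = bytearray()
        self._count = 0

    def append(self, bit: bool) -> None:
        # Start a fresh byte every 8 bits.
        if self._count % 8 == 0:
            self._buf.append(0)
        if bit:
            self._buf[-1] |= 1 << (self._count % 8)
        self._count += 1

    def build(self) -> bytes:
        return bytes(self._buf)

b = BitPackedBufferBuilder()
for v in [True, False, True, True, False, False, False, True, True]:
    b.append(v)
print(b.build().hex())  # '8d01'
```

A boolean array builder would then hold two of these, one for values and one for validity, exactly as the issue suggests.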
[jira] [Closed] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-8805. --- Resolution: Won't Fix Apache Arrow has ceased support for Python 2.7 since it reached EOL in January 2020 > [C++] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks.
[jira] [Commented] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108788#comment-17108788 ] Wes McKinney commented on ARROW-8805: - You'll also have to enable the optional components that GLib depends on > [C++] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks.
[jira] [Commented] (ARROW-8374) [R] Table to vector of DictionaryType will error when Arrays don't have the same Dictionary per array
[ https://issues.apache.org/jira/browse/ARROW-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108786#comment-17108786 ] Wes McKinney commented on ARROW-8374: - Oof, this would be good to fix > [R] Table to vector of DictionaryType will error when Arrays don't have the > same Dictionary per array > > > Key: ARROW-8374 > URL: https://issues.apache.org/jira/browse/ARROW-8374 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Francois Saint-Jacques >Priority: Major > Fix For: 1.0.0 > > > The conversion should unify the dictionaries before converting, > otherwise the indices are simply broken
[jira] [Updated] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanveer updated ARROW-8805: --- Description: Building Arrow C++ from sources (with following flags: cmake -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors as shown in the attached figure. Can someone fix them or suggest me some solution? Thanks. was: !Screenshot from 2020-05-14 22-22-01.png! Building Arrow C++ from sources (with following flags: cmake -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors as shown in the attached figure. Can someone fix them or suggest me some solution? Thanks. > [C++] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks.
[jira] [Commented] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108785#comment-17108785 ] Tanveer commented on ARROW-8805: With CMake command: {code:java} cmake -DCMAKE_BUILD_TYPE=Release -DARROW_PARQUET=ON -DARROW_PLASMA=ON -DARROW_PLASMA_JAVA_CLIENT=ON -DARROW_PYTHON=ON ..{code} {code:java} $ git clone https://github.com/apache/arrow.git Cloning into 'arrow'... remote: Enumerating objects: 86, done. remote: Counting objects: 100% (86/86), done. remote: Compressing objects: 100% (70/70), done. remote: Total 99863 (delta 20), reused 45 (delta 12), pack-reused 99777 Receiving objects: 100% (99863/99863), 53.01 MiB | 1.22 MiB/s, done. Resolving deltas: 100% (68594/68594), done. Checking connectivity... done. tahmad@Rezkuh-7480: ~ $ cd arrow/cpp/ tahmad@Rezkuh-7480: ~/arrow/cpp(master) $ mkdir release tahmad@Rezkuh-7480: ~/arrow/cpp(master) $ cd release/ tahmad@Rezkuh-7480: ~/arrow/cpp/release(master) $ cmake -DCMAKE_BUILD_TYPE=Release -DARROW_PARQUET=ON -DARROW_PLASMA=ON -DARROW_PLASMA_JAVA_CLIENT=ON -DARROW_PYTHON=ON .. 
-- Building using CMake version: 3.5.1 -- The C compiler identification is GNU 5.5.0 -- The CXX compiler identification is GNU 5.5.0 -- Check for working C compiler: /usr/bin/cc -- Check for working C compiler: /usr/bin/cc -- works -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Detecting C compile features -- Detecting C compile features - done -- Check for working CXX compiler: /usr/bin/c++ -- Check for working CXX compiler: /usr/bin/c++ -- works -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Detecting CXX compile features -- Detecting CXX compile features - done -- Arrow version: 0.18.0 (full: '0.18.0-SNAPSHOT') -- Arrow SO version: 18 (full: 18.0.0) -- Found PkgConfig: /usr/bin/pkg-config (found version "0.29.1") -- clang-tidy not found -- clang-format not found -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) -- infer not found -- Found PythonInterp: /usr/bin/python (found version "2.7.12") -- Found cpplint executable at /home/tahmad/arrow/cpp/build-support/cpplint.py -- System processor: x86_64 -- Performing Test CXX_SUPPORTS_SSE4_2 -- Performing Test CXX_SUPPORTS_SSE4_2 - Success -- Performing Test CXX_SUPPORTS_AVX2 -- Performing Test CXX_SUPPORTS_AVX2 - Success -- Performing Test CXX_SUPPORTS_AVX512 -- Performing Test CXX_SUPPORTS_AVX512 - Failed -- Arrow build warning level: PRODUCTION Using ld linker Configured for RELEASE build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...}) -- Build Type: RELEASE -- Using AUTO approach to find dependencies -- ARROW_AWSSDK_BUILD_VERSION: 1.7.160 -- ARROW_BOOST_BUILD_VERSION: 1.71.0 -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 -- ARROW_CARES_BUILD_VERSION: 1.15.0 -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.0 -- ARROW_GFLAGS_BUILD_VERSION: v2.2.0 -- ARROW_GLOG_BUILD_VERSION: v0.3.5 -- ARROW_GRPC_BUILD_VERSION: v1.25.0 -- ARROW_GTEST_BUILD_VERSION: 1.8.1 -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 -- 
ARROW_LZ4_BUILD_VERSION: v1.9.2 -- ARROW_MIMALLOC_BUILD_VERSION: 270e765454f98e8bab9d42609b153425f749fff6 -- ARROW_ORC_BUILD_VERSION: 1.6.2 -- ARROW_PROTOBUF_BUILD_VERSION: v3.7.1 -- ARROW_RAPIDJSON_BUILD_VERSION: 2bbd33b33217ff4a73434ebf10cdac41e2ef5e34 -- ARROW_RE2_BUILD_VERSION: 2019-08-01 -- ARROW_SNAPPY_BUILD_VERSION: 1.1.7 -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 -- ARROW_ZSTD_BUILD_VERSION: v1.4.3 -- Looking for pthread.h -- Looking for pthread.h - found -- Looking for pthread_create -- Looking for pthread_create - not found -- Check if compiler accepts -pthread -- Check if compiler accepts -pthread - yes -- Found Threads: TRUE -- Checking for module 'thrift' -- No package 'thrift' found -- Could NOT find Thrift: Found unsuitable version "", but required is at least "0.11.0" (found THRIFT_STATIC_LIB-NOTFOUND) -- Boost version: 1.58.0 -- Found the following Boost libraries: -- regex -- system -- filesystem -- Boost include dir: /usr/include -- Boost libraries: Boost::system;Boost::filesystem -- Building without OpenSSL support. Minimum OpenSSL version 1.0.2 required. Building Apache Thrift from source -- Building (vendored) jemalloc from source -- Could NOT find RapidJSONAlt (missing: RAPIDJSON_INCLUDE_DIR) (Required is at least version "1.1.0") -- Building rapidjson from source -- Found hdfs.h at: /home/tahmad/arrow/cpp/thirdparty/hadoop/include/hdfs.h -- CMAKE_C_FLAGS: -O3 -DNDEBUG -Wall -Wno-attributes -msse4.2 -- CMAKE_CXX_FLAGS: -fdiagnostics-color=always -O3 -DNDEBUG -Wall -Wno-attributes -msse4.2 -- Found JNI: /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/libjawt.so -- JNI_INCLUDE_DIRS =
[jira] [Created] (ARROW-8819) [Rust] Rust docs don't compile for the Arrow crate
Paddy Horan created ARROW-8819: -- Summary: [Rust] Rust docs don't compile for the Arrow crate Key: ARROW-8819 URL: https://issues.apache.org/jira/browse/ARROW-8819 Project: Apache Arrow Issue Type: New Feature Components: Rust Affects Versions: 0.17.0 Reporter: Paddy Horan See Github [issue|https://github.com/apache/arrow/issues/7194]
[jira] [Updated] (ARROW-8818) [Rust] Failing to build on master due to Flatbuffers/Union issues
[ https://issues.apache.org/jira/browse/ARROW-8818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8818: -- Labels: pull-request-available (was: ) > [Rust] Failing to build on master due to Flatbuffers/Union issues > - > > Key: ARROW-8818 > URL: https://issues.apache.org/jira/browse/ARROW-8818 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h >
[jira] [Created] (ARROW-8818) [Rust] Failing to build on master due to Flatbuffers/Union issues
Paddy Horan created ARROW-8818: -- Summary: [Rust] Failing to build on master due to Flatbuffers/Union issues Key: ARROW-8818 URL: https://issues.apache.org/jira/browse/ARROW-8818 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Paddy Horan Assignee: Paddy Horan
[jira] [Created] (ARROW-8817) [Rust] Add support for Union arrays in Parquet
Paddy Horan created ARROW-8817: -- Summary: [Rust] Add support for Union arrays in Parquet Key: ARROW-8817 URL: https://issues.apache.org/jira/browse/ARROW-8817 Project: Apache Arrow Issue Type: New Feature Components: Rust Reporter: Paddy Horan
[jira] [Commented] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas
[ https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108736#comment-17108736 ] Rauli Ruohonen commented on ARROW-8816: --- Ah, I see. I thought that the output was wrong, because fastparquet also reads it incorrectly. But using both from pandas is not an independent test, because pandas is shared between the tests. Checking with parquet-tools, the output does look correct (924618240 is 2263-01-01 00:00:00, and the extra field gives "datetime" for pandas_type and "object" for numpy_type; AFAICS the reader has no basis to assume that unchecked cast to datetime64 would be safe). Still, it's something of a pitfall that you can successfully save data (using default options), and when you later try to load it using the same software (using default options), it fails. If timestamp_as_object is required to read the data, one could symmetrically also require it to write the data, and avoid surprises upon loading. OTOH raising an exception when you actually can produce correct output would also be slightly odd. One solution would be to have a timestamp_as_object='infer' option (instead of just True/False) that would be the default, so that the current writing behavior would be matched with symmetric reading behavior that would produce datetime64 when possible, and datetime when not. From one pragmatic perspective it'd be safer to raise an exception when trying to write these things unless explicitly requested, because there are readers that fail with them in common use (such as current pyarrow and fastparquet). Maybe the reasoning why write_table defaults to parquet version 1.0 output instead of 2.0 is similar...? IMHO the important thing is to always be able to read back in what one wrote (possibly with wider types) if the write was successful, provided that one uses the same pyarrow version and the default options for both reading and writing. 
> [Python] Year 2263 or later datetimes get mangled when written using pandas > --- > > Key: ARROW-8816 > URL: https://issues.apache.org/jira/browse/ARROW-8816 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0, 0.17.0 > Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, > python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, > python 3.8.2, ubuntu 20.04 (linux). >Reporter: Rauli Ruohonen >Priority: Major > > Using pyarrow 0.17.0, this > > {code:java} > import datetime > import pandas as pd > def try_with_year(year): > print(f'Year {year:_}:') > df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]}) > df.to_parquet('foo.parquet', engine='pyarrow', compression=None) > try: > print(pd.read_parquet('foo.parquet', engine='pyarrow')) > except Exception as exc: > print(repr(exc)) > print() > try_with_year(2_263) > try_with_year(2_262) > {code} > > prints > > {noformat} > Year 2_263: > ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out > of bounds timestamp: 924618240') > Year 2_262: > x > 0 2262-01-01{noformat} > and using pyarrow 0.16.0, it prints > > > {noformat} > Year 2_263: > x > 0 1678-06-12 00:25:26.290448384 > Year 2_262: >x > 0 2262-01-01{noformat} > The issue is that 2263-01-01 is out of bounds for a timestamp stored using > epoch nanoseconds, but not out of bounds for a Python datetime. > While pyarrow 0.17.0 refuses to read the erroneous output, it is still > possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or > fastparquet), yielding the same result as with 0.16.0 above (i.e. only > reading has changed in 0.17.0, not writing). It would be better if an error > was raised when attempting to write the file instead of silently producing > erroneous output. 
> The reason I suspect this is a pyarrow issue instead of a pandas issue is > this modified example: > > {code:java} > import datetime > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]}) > table = pa.Table.from_pandas(df) > print(table[0]) > try: > print(table.to_pandas()) > except Exception as exc: > print(repr(exc)) > {code} > which prints > > > {noformat} > [ > [ > 2263-01-01 00:00:00.00 > ] > ] > ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out > of bounds timestamp: 92461824'){noformat} > on pyarrow 0.17.0 and > > > {noformat} > [ > [ > 2263-01-01 00:00:00.00 > ] > ] > x > 0 1678-06-12 00:25:26.290448384{noformat} > on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, >
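The out-of-bounds errors above all trace back to one fact: pandas' datetime64[ns] stores timestamps as signed 64-bit epoch nanoseconds. The exact cutoff between the failing year 2263 and the working year 2262 can be derived with the standard library alone, no pyarrow required:

```python
from datetime import datetime, timedelta, timezone

# datetime64[ns] holds signed 64-bit epoch nanoseconds, so the latest
# representable instant is (2**63 - 1) ns after 1970-01-01T00:00:00Z.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_NS_INSTANT = EPOCH + timedelta(microseconds=(2**63 - 1) // 1000)

print(MAX_NS_INSTANT)  # 2262-04-11 23:47:16.854775+00:00
print(datetime(2262, 1, 1, tzinfo=timezone.utc) <= MAX_NS_INSTANT)  # True
print(datetime(2263, 1, 1, tzinfo=timezone.utc) <= MAX_NS_INSTANT)  # False
```

This is why 2262-01-01 round-trips while 2263-01-01 cannot be represented as datetime64[ns] and must either raise or be kept as a Python datetime object.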
[jira] [Resolved] (ARROW-8757) [C++] Plasma header is written in native endian
[ https://issues.apache.org/jira/browse/ARROW-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8757. - Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7146 [https://github.com/apache/arrow/pull/7146] > [C++] Plasma header is written in native endian > --- > > Key: ARROW-8757 > URL: https://issues.apache.org/jira/browse/ARROW-8757 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > The current Plasma header (version, type, and length) is written in native > endian at > [here|https://github.com/apache/arrow/blob/master/cpp/src/plasma/io.cc#L65-L71]. > It will be hard to interpret the Plasma data among different endian > platforms in the future. > At least, the header should be written in the pre-defined endian.
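The difference between native-endian and fixed-endian serialization can be illustrated with Python's struct module. The three int64 fields below mirror the header fields named in the issue (version, type, and length); treating them all as int64 is an assumption for illustration, not the actual Plasma wire layout:

```python
import struct

version, msg_type, length = 1, 42, 128

# Native-endian packing ('='): byte order depends on the host CPU.
native = struct.pack('=qqq', version, msg_type, length)
# Fixed little-endian packing ('<'): same bytes on every platform.
little = struct.pack('<qqq', version, msg_type, length)

# On a little-endian host the two agree; on a big-endian host they differ,
# which is exactly why a cross-platform wire format must pin one byte order.
print(struct.unpack('<qqq', little))  # (1, 42, 128)
```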
[jira] [Assigned] (ARROW-8757) [C++] Plasma header is written in native endian
[ https://issues.apache.org/jira/browse/ARROW-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou reassigned ARROW-8757: --- Assignee: Kazuaki Ishizaki > [C++] Plasma header is written in native endian > --- > > Key: ARROW-8757 > URL: https://issues.apache.org/jira/browse/ARROW-8757 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The current Plasma header (version, type, and length) is written in native > endian at > [here|https://github.com/apache/arrow/blob/master/cpp/src/plasma/io.cc#L65-L71]. > It will be hard to interpret the Plasma data among different endian > platforms in the future. > At least, the header should be written in the pre-defined endian.
[jira] [Updated] (ARROW-8757) [C++] Plasma header is written in native endian
[ https://issues.apache.org/jira/browse/ARROW-8757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8757: Summary: [C++] Plasma header is written in native endian (was: [c++] Plasma header is written in native endian) > [C++] Plasma header is written in native endian > --- > > Key: ARROW-8757 > URL: https://issues.apache.org/jira/browse/ARROW-8757 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ - Plasma >Reporter: Kazuaki Ishizaki >Priority: Minor > Labels: pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The current Plasma header (version, type, and length) is written in native > endian at > [here|https://github.com/apache/arrow/blob/master/cpp/src/plasma/io.cc#L65-L71]. > It will be hard to interpret the Plasma data among different endian > platforms in the future. > At least, the header should be written in the pre-defined endian.
[jira] [Resolved] (ARROW-7967) [CI][Crossbow] Pin macOS version in autobrew job to match CRAN
[ https://issues.apache.org/jira/browse/ARROW-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7967. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7197 [https://github.com/apache/arrow/pull/7197] > [CI][Crossbow] Pin macOS version in autobrew job to match CRAN > -- > > Key: ARROW-7967 > URL: https://issues.apache.org/jira/browse/ARROW-7967 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > Followup to ARROW-7923. After hopefully fixing the underlying issue somewhere > in Travis, revert the changes in that issue so that we're still testing on > old macOS.
[jira] [Assigned] (ARROW-8556) [R] zstd symbol not found if there are multiple installations of zstd
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson reassigned ARROW-8556: -- Assignee: Neal Richardson > [R] zstd symbol not found if there are multiple installations of zstd > - > > Key: ARROW-8556 > URL: https://issues.apache.org/jira/browse/ARROW-8556 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: Ubuntu 19.10 > R 3.6.1 >Reporter: Karl Dunkle Werner >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > I would like to install the `arrow` R package on my Ubuntu 19.10 system. > Prebuilt binaries are unavailable, and I want to enable compression, so I set > the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks > like the package is able to compile, but can't be loaded. I'm able to install > correctly if I don't set the {{LIBARROW_MINIMAL}} variable. > Here's the error I get: > {code:java} > ** testing if installed package can be loaded from temporary location > Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath > = DLLpath, ...): > unable to load shared object > '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so': > ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: > ZSTD_initCStream > Error: loading failed > Execution halted > ERROR: loading failed > * removing ‘~/.R/3.6/arrow’ > {code} >
[jira] [Resolved] (ARROW-8556) [R] zstd symbol not found if there are multiple installations of zstd
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8556. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7196 [https://github.com/apache/arrow/pull/7196] > [R] zstd symbol not found if there are multiple installations of zstd > - > > Key: ARROW-8556 > URL: https://issues.apache.org/jira/browse/ARROW-8556 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: Ubuntu 19.10 > R 3.6.1 >Reporter: Karl Dunkle Werner >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > I would like to install the `arrow` R package on my Ubuntu 19.10 system. > Prebuilt binaries are unavailable, and I want to enable compression, so I set > the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks > like the package is able to compile, but can't be loaded. I'm able to install > correctly if I don't set the {{LIBARROW_MINIMAL}} variable. > Here's the error I get: > {code:java} > ** testing if installed package can be loaded from temporary location > Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath > = DLLpath, ...): > unable to load shared object > '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so': > ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: > ZSTD_initCStream > Error: loading failed > Execution halted > ERROR: loading failed > * removing ‘~/.R/3.6/arrow’ > {code} >
[jira] [Resolved] (ARROW-8814) [Dev][Release] Binary upload script keeps raising locale warnings
[ https://issues.apache.org/jira/browse/ARROW-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-8814. - Resolution: Fixed Issue resolved by pull request 7191 [https://github.com/apache/arrow/pull/7191] > [Dev][Release] Binary upload script keeps raising locale warnings > - > > Key: ARROW-8814 > URL: https://issues.apache.org/jira/browse/ARROW-8814 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > The console output is filled with warnings, which makes it hard to follow what > happens.
[jira] [Updated] (ARROW-7967) [CI][Crossbow] Pin macOS version in autobrew job to match CRAN
[ https://issues.apache.org/jira/browse/ARROW-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7967: -- Labels: pull-request-available (was: ) > [CI][Crossbow] Pin macOS version in autobrew job to match CRAN > -- > > Key: ARROW-7967 > URL: https://issues.apache.org/jira/browse/ARROW-7967 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Followup to ARROW-7923. After hopefully fixing the underlying issue somewhere > in Travis, revert the changes in that issue so that we're still testing on > old macOS.
[jira] [Updated] (ARROW-7967) [CI][Crossbow] Pin macOS version in autobrew job to match CRAN
[ https://issues.apache.org/jira/browse/ARROW-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7967: --- Summary: [CI][Crossbow] Pin macOS version in autobrew job to match CRAN (was: [CI][Crossbow] Move autobrew job back to old macOS) > [CI][Crossbow] Pin macOS version in autobrew job to match CRAN > -- > > Key: ARROW-7967 > URL: https://issues.apache.org/jira/browse/ARROW-7967 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > > Followup to ARROW-7923. After hopefully fixing the underlying issue somewhere > in Travis, revert the changes in that issue so that we're still testing on > old macOS.
[jira] [Resolved] (ARROW-7803) [R][CI] Autobrew/homebrew tests should not always install from master
[ https://issues.apache.org/jira/browse/ARROW-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-7803. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7195 [https://github.com/apache/arrow/pull/7195] > [R][CI] Autobrew/homebrew tests should not always install from master > - > > Key: ARROW-7803 > URL: https://issues.apache.org/jira/browse/ARROW-7803 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > Figure out how to get the formula to check out a branch when building > {{--head}}
[jira] [Closed] (ARROW-7825) [R] Update docs to clarify that stringsAsFactors isn't relevant for parquet/feather
[ https://issues.apache.org/jira/browse/ARROW-7825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson closed ARROW-7825. -- Assignee: Neal Richardson Resolution: Won't Fix Now that wisdom has prevailed and {{stringsAsFactors=FALSE}} is the default in R 4.0, I don't think we need to add anything to the arrow docs. Feel free to reopen and submit a PR if you feel strongly otherwise. > [R] Update docs to clarify that stringsAsFactors isn't relevant for > parquet/feather > --- > > Key: ARROW-7825 > URL: https://issues.apache.org/jira/browse/ARROW-7825 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Affects Versions: 0.16.0 > Environment: Linux 64-bit 5.4.15 >Reporter: Keith Hughitt >Assignee: Neal Richardson >Priority: Major > Labels: R, parquet > > Same issue as reported for feather::read_feather > (https://issues.apache.org/jira/browse/ARROW-7823); > > For the R arrow package, the "read_parquet()" function currently does not > respect "options(stringsAsFactors = FALSE)", leading to > unexpected/inconsistent behavior. > > *Example:* > > > {code:java} > library(arrow) > library(readr) > options(stringsAsFactors = FALSE) > write_tsv(head(iris), 'test.tsv') > write_parquet(head(iris), 'test.parquet') > head(read.delim('test.tsv', sep='\t')$Species) > # [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" > head(read_tsv('test.tsv', col_types = cols())$Species) > # [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa" > head(read_parquet('test.parquet')$Species) > # [1] setosa setosa setosa setosa setosa setosa > # Levels: setosa versicolor virginica > {code} > > > *Versions:* > - R 3.6.2 > - arrow_0.15.1.9000
[jira] [Updated] (ARROW-8374) [R] Table to vector of DictionaryType will error when Arrays don't have the same Dictionary per array
[ https://issues.apache.org/jira/browse/ARROW-8374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8374: --- Fix Version/s: 1.0.0 > [R] Table to vector of DictionaryType will error when Arrays don't have the > same Dictionary per array > > > Key: ARROW-8374 > URL: https://issues.apache.org/jira/browse/ARROW-8374 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Francois Saint-Jacques >Priority: Major > Fix For: 1.0.0 > > > The conversion should accommodate unifying the dictionaries before converting; > otherwise the indices are simply broken -- This message was sent by Atlassian Jira (v8.3.4#803005)
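[Editor's note] The unification ARROW-8374 asks for can be sketched in pure Python: each dictionary-encoded chunk carries its own dictionary, so converting naively mixes incompatible indices; the fix is to build one unified dictionary and remap each chunk's indices into it. The helper name `unify_chunks` is hypothetical, not an Arrow API.

```python
# Sketch of dictionary unification for dictionary-encoded (categorical) chunks.
# Pure-Python illustration; `unify_chunks` is a hypothetical name, not an Arrow API.

def unify_chunks(chunks):
    """chunks: list of (dictionary, indices) pairs, one dictionary per chunk.
    Returns (unified_dictionary, remapped_indices_per_chunk)."""
    unified = []
    position = {}  # value -> index in the unified dictionary
    remapped = []
    for dictionary, indices in chunks:
        # Map each chunk-local dictionary slot to a slot in the unified dictionary.
        local_to_unified = []
        for value in dictionary:
            if value not in position:
                position[value] = len(unified)
                unified.append(value)
            local_to_unified.append(position[value])
        remapped.append([local_to_unified[i] for i in indices])
    return unified, remapped

# Two chunks whose dictionaries differ, as in the bug report:
chunk_a = (["setosa", "versicolor"], [0, 1, 0])
chunk_b = (["virginica", "setosa"], [0, 1])
dictionary, indices = unify_chunks([chunk_a, chunk_b])
# dictionary -> ["setosa", "versicolor", "virginica"]; indices -> [[0, 1, 0], [2, 0]]
```

Without the remapping step, index 0 would mean "setosa" in one chunk and "virginica" in the other, which is exactly the breakage described above.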
[jira] [Updated] (ARROW-8556) [R] zstd symbol not found if there are multiple installations of zstd
[ https://issues.apache.org/jira/browse/ARROW-8556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8556: -- Labels: pull-request-available (was: ) > [R] zstd symbol not found if there are multiple installations of zstd > - > > Key: ARROW-8556 > URL: https://issues.apache.org/jira/browse/ARROW-8556 > Project: Apache Arrow > Issue Type: Bug > Components: R >Affects Versions: 0.17.0 > Environment: Ubuntu 19.10 > R 3.6.1 >Reporter: Karl Dunkle Werner >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > I would like to install the `arrow` R package on my Ubuntu 19.10 system. > Prebuilt binaries are unavailable, and I want to enable compression, so I set > the {{LIBARROW_MINIMAL=false}} environment variable. When I do so, it looks > like the package is able to compile, but can't be loaded. I'm able to install > correctly if I don't set the {{LIBARROW_MINIMAL}} variable. > Here's the error I get: > {code:java} > ** testing if installed package can be loaded from temporary location > Error: package or namespace load failed for ‘arrow’ in dyn.load(file, DLLpath > = DLLpath, ...): > unable to load shared object > '~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so': > ~/.R/3.6/00LOCK-arrow/00new/arrow/libs/arrow.so: undefined symbol: > ZSTD_initCStream > Error: loading failed > Execution halted > ERROR: loading failed > * removing ‘~/.R/3.6/arrow’ > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8805) [C++] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou updated ARROW-8805: Summary: [C++] Arrow (master) build error from sources (was: [CPP] Arrow (master) build error from sources) > [C++] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > !Screenshot from 2020-05-14 22-22-01.png! > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8805) [CPP] Arrow (master) build error from sources
[ https://issues.apache.org/jira/browse/ARROW-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108651#comment-17108651 ] Kouhei Sutou commented on ARROW-8805: - Could you attach the full build log as text instead of screenshot? And could you also show the full CMake command line you specified? > [CPP] Arrow (master) build error from sources > - > > Key: ARROW-8805 > URL: https://issues.apache.org/jira/browse/ARROW-8805 > Project: Apache Arrow > Issue Type: Bug >Reporter: Tanveer >Priority: Major > Attachments: Screenshot from 2020-05-14 22-22-01.png > > > !Screenshot from 2020-05-14 22-22-01.png! > Building Arrow C++ from sources (with following flags: cmake > -DARROW_PLASMA=ON -DARROW_PYTHON=ON ..) is not possible due to some errors > as shown in the attached figure. > Can someone fix them or suggest me some solution? Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8662) [CI] Consolidate appveyor scripts
[ https://issues.apache.org/jira/browse/ARROW-8662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8662. Resolution: Fixed Issue resolved by pull request 7080 [https://github.com/apache/arrow/pull/7080] > [CI] Consolidate appveyor scripts > - > > Key: ARROW-8662 > URL: https://issues.apache.org/jira/browse/ARROW-8662 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > The appveyor scripts are a bit outdated and contain unreasonable amount of > indirections. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8777) [Rust] Parquet.rs does not support reading fixed-size binary fields.
[ https://issues.apache.org/jira/browse/ARROW-8777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8777: --- Component/s: Rust > [Rust] Parquet.rs does not support reading fixed-size binary fields. > > > Key: ARROW-8777 > URL: https://issues.apache.org/jira/browse/ARROW-8777 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Max Burke >Assignee: Max Burke >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7803) [R][CI] Autobrew/homebrew tests should not always install from master
[ https://issues.apache.org/jira/browse/ARROW-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7803: -- Labels: pull-request-available (was: ) > [R][CI] Autobrew/homebrew tests should not always install from master > - > > Key: ARROW-7803 > URL: https://issues.apache.org/jira/browse/ARROW-7803 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Figure out how to get the formula to check out a branch when building > {{--head}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7803) [R][CI] Autobrew/homebrew tests should not always install from master
[ https://issues.apache.org/jira/browse/ARROW-7803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-7803: --- Summary: [R][CI] Autobrew/homebrew tests should not always install from master (was: [R][CI] Autobrew/homebrew tests always install from master) > [R][CI] Autobrew/homebrew tests should not always install from master > - > > Key: ARROW-7803 > URL: https://issues.apache.org/jira/browse/ARROW-7803 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > > Figure out how to get the formula to check out a branch when building > {{--head}} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas
[ https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108548#comment-17108548 ] Joris Van den Bossche commented on ARROW-8816: -- > It would be better if an error was raised when attempting to write the file > instead of silently producing erroneous output. The file is correct (so we shouldn't error when writing), it is only after reading in that the conversion to pandas causes the issue given pandas' limitation on the range of timestamps. As you can see, in pyarrow 0.17 it was at least fixed to not produces garbage dates but an error is raised instead (which I would say is better than garbage). But it is a known issue that there should be a way to still convert to pandas but with converting to datetime objects instead of to datetime64[ns] dtype. This is covered by ARROW-5359 with the idea to add a {{timestamp_as_object}} keyword. > [Python] Year 2263 or later datetimes get mangled when written using pandas > --- > > Key: ARROW-8816 > URL: https://issues.apache.org/jira/browse/ARROW-8816 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.16.0, 0.17.0 > Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, > python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, > python 3.8.2, ubuntu 20.04 (linux). 
>Reporter: Rauli Ruohonen >Priority: Major > > Using pyarrow 0.17.0, this > > {code:java} > import datetime > import pandas as pd > def try_with_year(year): > print(f'Year {year:_}:') > df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]}) > df.to_parquet('foo.parquet', engine='pyarrow', compression=None) > try: > print(pd.read_parquet('foo.parquet', engine='pyarrow')) > except Exception as exc: > print(repr(exc)) > print() > try_with_year(2_263) > try_with_year(2_262) > {code} > > prints > > {noformat} > Year 2_263: > ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out > of bounds timestamp: 924618240') > Year 2_262: > x > 0 2262-01-01{noformat} > and using pyarrow 0.16.0, it prints > > > {noformat} > Year 2_263: > x > 0 1678-06-12 00:25:26.290448384 > Year 2_262: >x > 0 2262-01-01{noformat} > The issue is that 2263-01-01 is out of bounds for a timestamp stored using > epoch nanoseconds, but not out of bounds for a Python datetime. > While pyarrow 0.17.0 refuses to read the erroneous output, it is still > possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or > fastparquet), yielding the same result as with 0.16.0 above (i.e. only > reading has changed in 0.17.0, not writing). It would be better if an error > was raised when attempting to write the file instead of silently producing > erroneous output. 
> The reason I suspect this is a pyarrow issue instead of a pandas issue is > this modified example: > > {code:java} > import datetime > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]}) > table = pa.Table.from_pandas(df) > print(table[0]) > try: > print(table.to_pandas()) > except Exception as exc: > print(repr(exc)) > {code} > which prints > > > {noformat} > [ > [ > 2263-01-01 00:00:00.00 > ] > ] > ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out > of bounds timestamp: 92461824'){noformat} > on pyarrow 0.17.0 and > > > {noformat} > [ > [ > 2263-01-01 00:00:00.00 > ] > ] > x > 0 1678-06-12 00:25:26.290448384{noformat} > on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, > pyarrow prints the correct timestamp when asked to produce it as a string (so > it was not lost inside pandas), but the pa.Table.from_pandas(df).to_pandas() > round-trip fails. > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
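[Editor's note] The 2262/2263 boundary in the report falls directly out of the datetime64[ns] representation: nanoseconds since the epoch in a signed 64-bit integer. A small stdlib-only check of the bound (no pandas/pyarrow required):

```python
from datetime import datetime, timedelta

# datetime64[ns] stores nanoseconds since 1970-01-01 in a signed 64-bit int,
# so the largest representable instant is (2**63 - 1) ns after the epoch.
EPOCH = datetime(1970, 1, 1)
max_ns = 2**63 - 1
# timedelta has microsecond resolution, which is enough to locate the bound
# (it lands in mid-2262).
upper_bound = EPOCH + timedelta(microseconds=max_ns // 1000)

in_range = datetime(2262, 1, 1) <= upper_bound       # representable as ns
out_of_range = datetime(2263, 1, 1) > upper_bound    # overflows int64 ns
```

This is why 2262-01-01 round-trips while 2263-01-01 either errors (0.17.0) or silently wraps around (0.16.0); Python `datetime` itself has no such limit.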
[jira] [Resolved] (ARROW-8777) [Rust] Parquet.rs does not support reading fixed-size binary fields.
[ https://issues.apache.org/jira/browse/ARROW-8777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8777. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7159 [https://github.com/apache/arrow/pull/7159] > [Rust] Parquet.rs does not support reading fixed-size binary fields. > > > Key: ARROW-8777 > URL: https://issues.apache.org/jira/browse/ARROW-8777 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Max Burke >Assignee: Max Burke >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8810) [R] Add documentation about Parquet format, appending to stream format
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8810: --- Summary: [R] Add documentation about Parquet format, appending to stream format (was: [R] Append to parquet file?) > [R] Add documentation about Parquet format, appending to stream format > -- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Minor > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8810) [R] Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8810: --- Priority: Minor (was: Major) > [R] Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Minor > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-7924) [Rust] Add sort for float types
[ https://issues.apache.org/jira/browse/ARROW-7924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-7924: -- Labels: pull-request-available (was: ) > [Rust] Add sort for float types > --- > > Key: ARROW-7924 > URL: https://issues.apache.org/jira/browse/ARROW-7924 > Project: Apache Arrow > Issue Type: Sub-task > Components: Rust >Reporter: Neville Dipale >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Floats need a different sort approach than other primitives, and this ticket > will implement them separately -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-3827) [Rust] Implement UnionArray
[ https://issues.apache.org/jira/browse/ARROW-3827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neville Dipale resolved ARROW-3827. --- Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7004 [https://github.com/apache/arrow/pull/7004] > [Rust] Implement UnionArray > --- > > Key: ARROW-3827 > URL: https://issues.apache.org/jira/browse/ARROW-3827 > Project: Apache Arrow > Issue Type: New Feature > Components: Rust >Reporter: Paddy Horan >Assignee: Paddy Horan >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Dennenmoser updated ARROW-8813: --- Description: I think it would be reasonable to implement an interface to the {{tidyr}} package. The implementation would allow to lazily process ArrowTables before put it back into the memory. However, currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine: {code:r} library(magrittr) arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) nested_df <- arrow_table %>% dplyr::select(ID, 4:7, Value) %>% dplyr::filter(Value >= 5) %>% dplyr::group_by(ID) %>% dplyr::collect() %>% tidyr::nest(){code} The main focus might be the following three methods: * {{tidyr::[un]nest()}}, * {{tidyr::pivot_[longer|wider]()}}, and * {{tidyr::seperate()}}. I suppose the last two can be fairly quickly implemented, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before conversion to List will be accessible. was: I think it would be reasonable to implement an interface to the {{tidyr}} package. The implementation would allow to lazily process ArrowTables before put it back into the memory. However, currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine: {code:r} library(magrittr) arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) nested_df <- arrow_table %>% dplyr::select(ID, 4:7, Value) %>% dplyr::filter(Value >= 5) %>% dplyr::group_by(ID) %>% dplyr::collect() %>% tidyr::nest(){code} The main focus might be the following three methods: * {{tidyr::[un]nest()}}, * {{tidyr::pivot_[longer|wider]()}}, and * {{tidyr::seperate()}}. I suppose the last two can be fairly quickly implemented, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before [conversion to List|ARROW-8779] will be accessible. 
> [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before > conversion to List will be accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Dennenmoser updated ARROW-8813: --- Description: I think it would be reasonable to implement an interface to the {{tidyr}} package. The implementation would allow to lazily process ArrowTables before put it back into the memory. However, currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine: {code:r} library(magrittr) arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) nested_df <- arrow_table %>% dplyr::select(ID, 4:7, Value) %>% dplyr::filter(Value >= 5) %>% dplyr::group_by(ID) %>% dplyr::collect() %>% tidyr::nest(){code} The main focus might be the following three methods: * {{tidyr::[un]nest()}}, * {{tidyr::pivot_[longer|wider]()}}, and * {{tidyr::seperate()}}. I suppose the last two can be fairly quickly implemented, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before [conversion to List|ARROW-8779] will be accessible. was: I think it would be reasonable to implement an interface to the {{tidyr}} package. The implementation would allow to lazily process ArrowTables before put it back into the memory. However, currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine: {code:r} library(magrittr) arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) nested_df <- arrow_table %>% dplyr::select(ID, 4:7, Value) %>% dplyr::filter(Value >= 5) %>% dplyr::group_by(ID) %>% dplyr::collect() %>% tidyr::nest(){code} The main focus might be the following three methods: * {{tidyr::[un]nest()}}, * {{tidyr::pivot_[longer|wider]()}}, and * {{tidyr::seperate()}}. I suppose the last two can be fairly quickly implemented, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before conversion to List will be accessible. 
> [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before > [conversion to List|ARROW-8779] will be accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8810) [R] Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108424#comment-17108424 ] Carl Boettiger commented on ARROW-8810: --- Thanks all, this is a great answer. Would love to see some of these details mentioned in the R vignettes, as no doubt other R users might also be unclear how this differs from other compressed/encoded filetypes (e.g. the issue of metadata in the file footer). Writing multiple files makes sense for larger chunks. My current use case is effectively streaming (currently just to .tsv.gz compressed table), so I'm definitely following the discussion in ARROW-8784. Please feel free to close, and thanks again for this fantastic library and the R bindings. > [R] Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108422#comment-17108422 ] Dominic Dennenmoser edited comment on ARROW-8813 at 5/15/20, 4:05 PM: -- Thanks for refering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version ([#800|https://github.com/tidyverse/tidyr/issues/800]). was (Author: domiden): Thanks for revering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version ([#800|https://github.com/tidyverse/tidyr/issues/800]). > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. 
> I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented before > conversion to List becomes accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108422#comment-17108422 ] Dominic Dennenmoser edited comment on ARROW-8813 at 5/15/20, 4:04 PM: -- Thanks for revering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version ([#800|https://github.com/tidyverse/tidyr/issues/800]). was (Author: domiden): Thanks for revering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version (#800). > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. 
> I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented before > conversion to List becomes accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108422#comment-17108422 ] Dominic Dennenmoser commented on ARROW-8813: Thanks for revering to that. I've just looked for issues or pull-requests mention anything in that direction. Fortunately, a generic version of {{pivot_[longer|wider]()}} will be available in the upcoming version of {{tidyr}}, and is already implemented into the development version (#800). > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow to lazily process ArrowTables before > put it back into the memory. However, currently you need to collect the table > first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::seperate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implement before > conversion to List will be accessible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8553) [C++] Optimize unaligned bitmap operations
[ https://issues.apache.org/jira/browse/ARROW-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108382#comment-17108382 ] Wes McKinney commented on ARROW-8553: - Thanks for looking into it, sounds good to me > [C++] Optimize unaligned bitmap operations > -- > > Key: ARROW-8553 > URL: https://issues.apache.org/jira/browse/ARROW-8553 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.17.0 >Reporter: Antoine Pitrou >Assignee: Yibo Cai >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > Currently, {{BitmapAnd}} uses a bit-by-bit loop for unaligned inputs. Using > {{Bitmap::VisitWords}} instead would probably yield a manyfold performance > increase. -- This message was sent by Atlassian Jira (v8.3.4#803005)
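[Editor's note] The optimization discussed in ARROW-8553 is the classic word-at-a-time trick behind {{Bitmap::VisitWords}}: combine 64 bits per loop iteration instead of one. A pure-Python sketch of both paths (this is an illustration of the idea, not Arrow's C++ code):

```python
# Bitmap AND two ways: bit-by-bit (the slow unaligned path) vs. word-at-a-time
# (the Bitmap::VisitWords idea). Pure-Python sketch, not Arrow's implementation.

def bitmap_and_bitwise(a: bytes, b: bytes) -> bytes:
    # Baseline: extract, AND, and re-set one bit per iteration.
    out = bytearray(len(a))
    for i in range(len(a) * 8):
        bit = ((a[i // 8] >> (i % 8)) & 1) & ((b[i // 8] >> (i % 8)) & 1)
        out[i // 8] |= bit << (i % 8)
    return bytes(out)

def bitmap_and_words(a: bytes, b: bytes) -> bytes:
    # 64 bits (8 bytes) per iteration; a short trailing slice is handled
    # naturally because int.from_bytes accepts fewer than 8 bytes.
    out = bytearray()
    for i in range(0, len(a), 8):
        wa = int.from_bytes(a[i:i + 8], "little")
        wb = int.from_bytes(b[i:i + 8], "little")
        out += (wa & wb).to_bytes(len(a[i:i + 8]), "little")
    return bytes(out)

a = bytes([0b10101010, 0b11110000, 0b00001111])
b = bytes([0b11001100, 0b10101010, 0b11111111])
# Both paths agree: [0b10001000, 0b10100000, 0b00001111]
```

In C++ the word loop additionally has to shift-and-stitch words when the two bitmaps start at different bit offsets, which is the unaligned case the ticket targets.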
[jira] [Updated] (ARROW-8812) [Python] Columns of type CategoricalIndex fails to be read back
[ https://issues.apache.org/jira/browse/ARROW-8812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8812: Summary: [Python] Columns of type CategoricalIndex fails to be read back (was: Columns of type CategoricalIndex fails to be read back) > [Python] Columns of type CategoricalIndex fails to be read back > --- > > Key: ARROW-8812 > URL: https://issues.apache.org/jira/browse/ARROW-8812 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.15.1 > Environment: Python 3.7.7 > MacOS (Darwin-19.4.0-x86_64-i386-64bit) > Pandas 1.0.3 > Pyarrow 0.15.1 >Reporter: Jonas Nelle >Priority: Minor > Labels: parquet > > When columns are of type {{CategoricalIndex}}, saving and reading the table > back causes a {{TypeError: data type "categorical" not understood}}: > {code:python} > import pandas as pd > from pyarrow import parquet, Table > base_df = pd.DataFrame([['foo', 'j', "1"], > ['bar', 'j', "1"], > ['foo', 'j', "1"], > ['foobar', 'j', "1"]], >columns=['my_cat', 'var', 'for_count']) > base_df['my_cat'] = base_df['my_cat'].astype('category') > df = ( > base_df > .groupby(["my_cat", "var"], observed=True) > .agg({"for_count": "count"}) > .rename(columns={"for_count": "my_cat_counts"}) > .unstack(level="my_cat", fill_value=0) > ) > print(df) > {code} > The resulting data frame looks something like this: > || ||my_cat_counts|| || || > |my_cat|foo|bar|foobar| > |var| | | | > |j|2|1|1| > Then, writing and reading causes the {{KeyError}}: > {code:python} > parquet.write_table(Table.from_pandas(df), "test.pqt") > parquet.read_table("test.pqt").to_pandas() > > TypeError: data type "categorical" not understood > {code} > In the example, the column is also a MultiIndex, but that isn't the problem: > {code:python} > df.columns = df.columns.get_level_values(1) > parquet.write_table(Table.from_pandas(df), "test.pqt") > parquet.read_table("test.pqt").to_pandas() > > TypeError: data type "categorical" not understood 
> {code} > This is the workaround [suggested on > stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]: > {code:python} > df.columns = pd.Index(list(df.columns)) # suggested fix for the time being > parquet.write_table(Table.from_pandas(df), "test.pqt") > parquet.read_table("test.pqt").to_pandas() # no error > {code} > Are there any plans to support the pattern described here in the future? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8810) [R] Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108381#comment-17108381 ] Wes McKinney commented on ARROW-8810: - Since it's not possible to append data to an existing file (without a great deal of effort in the C++ library) I would suggest closing this. There might be some documentation we could add to clarify that Parquet datasets are intended to consist of many files, with appending done by writing additional files > [R] Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8810) [R] Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-8810: Summary: [R] Append to parquet file? (was: Append to parquet file?) > [R] Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8734) [R] improve nightly build installation
[ https://issues.apache.org/jira/browse/ARROW-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson resolved ARROW-8734. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7184 [https://github.com/apache/arrow/pull/7184] > [R] improve nightly build installation > -- > > Key: ARROW-8734 > URL: https://issues.apache.org/jira/browse/ARROW-8734 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > I've tried to install / build from source (both from a git checkout and using > the built-in `install_arrow()`) and when compiling I'm getting the following > error reliably during the autobrew process: > {code:bash} > x System command 'R' failed, exit status: 1, stdout + stderr: > E> * checking for file ‘/Users/jkeane/Dropbox/arrow/r/DESCRIPTION’ ... OK > E> * preparing ‘arrow’: > E> * checking DESCRIPTION meta-information ... OK > E> * cleaning src > E> * running ‘cleanup’ > E> * installing the package to build vignettes > E> --- > E> * installing *source* package ‘arrow’ ... > E> ** using staged installation > E> *** Generating code with data-raw/codegen.R > E> There were 27 warnings (use warnings() to see them) > E> *** > 375 functions decorated with [[arrow|s3::export]] > E> *** > generated file `src/arrowExports.cpp` > E> *** > generated file `R/arrowExports.R` > E> *** Downloading apache-arrow > E> Using local manifest for apache-arrow > E> Thu May 7 13:13:42 CDT 2020: Auto-brewing apache-arrow in > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T//build-apache-arrow... > E> ==> Tapping autobrew/core from https://github.com/autobrew/homebrew-core > E> Tapped 2 commands and 4639 formulae (4,888 files, 12.7MB). 
> E> lz4 > E> openssl > E> thrift > E> snappy > E> ==> Downloading > https://homebrew.bintray.com/bottles/lz4-1.8.3.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/b4158ef68d619dbf78935df6a42a70b8339a65bc8876cbb4446355ccd40fa5de--lz4-1.8.3.mojave.bottle.tar.gz > E> ==> Pouring lz4-1.8.3.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/lz4/1.8.3: > 22 files, 512.7KB > E> ==> Downloading > https://homebrew.bintray.com/bottles/openssl-1.0.2p.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/fbb493745981c8b26c0fab115c76c2a70142bfde9e776c450277e9dfbbba0bb2--openssl-1.0.2p.mojave.bottle.tar.gz > E> ==> Pouring openssl-1.0.2p.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> ==> Caveats > E> openssl is keg-only, which means it was not symlinked into > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow, > E> because Apple has deprecated use of OpenSSL in favor of its own TLS and > crypto libraries. 
> E> > E> If you need to have openssl first in your PATH run: > E> echo 'export > PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/bin:$PATH"' > >> ~/.zshrc > E> > E> For compilers to find openssl you may need to set: > E> export > LDFLAGS="-L/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib" > E> export > CPPFLAGS="-I/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/include" > E> > E> For pkg-config to find openssl you may need to set: > E> export > PKG_CONFIG_PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib/pkgconfig" > E> > E> ==> Summary > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/openssl/1.0.2p: > 1,793 files, 12MB > E> ==> Downloading > https://homebrew.bintray.com/bottles/thrift-0.11.0.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/7e05ea11a9f7f924dd7f8f36252ec73a24958b7f214f71e3752a355e75e589bd--thrift-0.11.0.mojave.bottle.tar.gz > E> ==> Pouring thrift-0.11.0.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> ==> Caveats > E> To install Ruby binding: > E> gem install thrift > E> ==> Summary > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/thrift/0.11.0: > 102 files, 7MB > E> ==> Downloading > https://homebrew.bintray.com/bottles/snappy-1.1.7_1.mojave.bottle.tar.gz > E> Already downloaded: >
[jira] [Updated] (ARROW-8734) [R] improve nightly build installation
[ https://issues.apache.org/jira/browse/ARROW-8734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8734: --- Summary: [R] improve nightly build installation (was: [R] autobrew script always builds from master) > [R] improve nightly build installation > -- > > Key: ARROW-8734 > URL: https://issues.apache.org/jira/browse/ARROW-8734 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Jonathan Keane >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > I've tried to install / build from source (both from a git checkout and using > the built-in `install_arrow()`) and when compiling I'm getting the following > error reliably during the autobrew process: > {code:bash} > x System command 'R' failed, exit status: 1, stdout + stderr: > E> * checking for file ‘/Users/jkeane/Dropbox/arrow/r/DESCRIPTION’ ... OK > E> * preparing ‘arrow’: > E> * checking DESCRIPTION meta-information ... OK > E> * cleaning src > E> * running ‘cleanup’ > E> * installing the package to build vignettes > E> --- > E> * installing *source* package ‘arrow’ ... > E> ** using staged installation > E> *** Generating code with data-raw/codegen.R > E> There were 27 warnings (use warnings() to see them) > E> *** > 375 functions decorated with [[arrow|s3::export]] > E> *** > generated file `src/arrowExports.cpp` > E> *** > generated file `R/arrowExports.R` > E> *** Downloading apache-arrow > E> Using local manifest for apache-arrow > E> Thu May 7 13:13:42 CDT 2020: Auto-brewing apache-arrow in > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T//build-apache-arrow... > E> ==> Tapping autobrew/core from https://github.com/autobrew/homebrew-core > E> Tapped 2 commands and 4639 formulae (4,888 files, 12.7MB). 
> E> lz4 > E> openssl > E> thrift > E> snappy > E> ==> Downloading > https://homebrew.bintray.com/bottles/lz4-1.8.3.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/b4158ef68d619dbf78935df6a42a70b8339a65bc8876cbb4446355ccd40fa5de--lz4-1.8.3.mojave.bottle.tar.gz > E> ==> Pouring lz4-1.8.3.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/lz4/1.8.3: > 22 files, 512.7KB > E> ==> Downloading > https://homebrew.bintray.com/bottles/openssl-1.0.2p.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/fbb493745981c8b26c0fab115c76c2a70142bfde9e776c450277e9dfbbba0bb2--openssl-1.0.2p.mojave.bottle.tar.gz > E> ==> Pouring openssl-1.0.2p.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> ==> Caveats > E> openssl is keg-only, which means it was not symlinked into > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow, > E> because Apple has deprecated use of OpenSSL in favor of its own TLS and > crypto libraries. 
> E> > E> If you need to have openssl first in your PATH run: > E> echo 'export > PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/bin:$PATH"' > >> ~/.zshrc > E> > E> For compilers to find openssl you may need to set: > E> export > LDFLAGS="-L/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib" > E> export > CPPFLAGS="-I/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/include" > E> > E> For pkg-config to find openssl you may need to set: > E> export > PKG_CONFIG_PATH="/private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/opt/openssl/lib/pkgconfig" > E> > E> ==> Summary > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/openssl/1.0.2p: > 1,793 files, 12MB > E> ==> Downloading > https://homebrew.bintray.com/bottles/thrift-0.11.0.mojave.bottle.tar.gz > E> Already downloaded: > /var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/downloads/7e05ea11a9f7f924dd7f8f36252ec73a24958b7f214f71e3752a355e75e589bd--thrift-0.11.0.mojave.bottle.tar.gz > E> ==> Pouring thrift-0.11.0.mojave.bottle.tar.gz > E> ==> Skipping post_install step for autobrew... > E> ==> Caveats > E> To install Ruby binding: > E> gem install thrift > E> ==> Summary > E> > /private/var/folders/45/n5gfjjtn05j877spnpbnhqqwgn/T/build-apache-arrow/Cellar/thrift/0.11.0: > 102 files, 7MB > E> ==> Downloading > https://homebrew.bintray.com/bottles/snappy-1.1.7_1.mojave.bottle.tar.gz > E> Already downloaded: >
[jira] [Commented] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108367#comment-17108367 ] Neal Richardson commented on ARROW-8813: If you wanted to explore this, one challenge I see is that pivot_longer and pivot_wider aren't generics, so you can't just make arrow methods for them. > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow ArrowTables to be processed lazily > before being pulled back into memory. Currently, however, you need to collect > the table first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::separate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented until > conversion to List becomes available. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8813) [R] Implementing tidyr interface
[ https://issues.apache.org/jira/browse/ARROW-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neal Richardson updated ARROW-8813: --- Summary: [R] Implementing tidyr interface (was: Implementing tidyr interface) > [R] Implementing tidyr interface > > > Key: ARROW-8813 > URL: https://issues.apache.org/jira/browse/ARROW-8813 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Dominic Dennenmoser >Priority: Major > Labels: extension, feature, improvement > > I think it would be reasonable to implement an interface to the {{tidyr}} > package. The implementation would allow ArrowTables to be processed lazily > before being pulled back into memory. Currently, however, you need to collect > the table first before applying tidyr methods. The following code chunk shows an > example routine: > {code:r} > library(magrittr) > arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE) > nested_df <- >arrow_table %>% >dplyr::select(ID, 4:7, Value) %>% >dplyr::filter(Value >= 5) %>% >dplyr::group_by(ID) %>% >dplyr::collect() %>% >tidyr::nest(){code} > The main focus might be the following three methods: > * {{tidyr::[un]nest()}}, > * {{tidyr::pivot_[longer|wider]()}}, and > * {{tidyr::separate()}}. > I suppose the last two can be fairly quickly implemented, but > {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented until > conversion to List becomes available. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8810) Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108356#comment-17108356 ] Neal Richardson commented on ARROW-8810: Multi-file (Parquet and other format) datasets in R: http://arrow.apache.org/docs/r/articles/dataset.html If appending to a single file is important for your use case, you could use the Arrow stream format. See discussion on ARROW-8748 for what that would look like. > Append to parquet file? > --- > > Key: ARROW-8810 > URL: https://issues.apache.org/jira/browse/ARROW-8810 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Carl Boettiger >Priority: Major > > Is it possible to append new rows to an existing .parquet file using the R > client's arrow::write_parquet(), in a manner similar to the `append=TRUE` > argument in text-based output formats like write.table()? > > Apologies as this is perhaps more a question of documentation or user > interface, or maybe just my ignorance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8783) [Rust] [DataFusion] Logical plan should have ParquetScan and CsvScan entries
[ https://issues.apache.org/jira/browse/ARROW-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8783: -- Labels: pull-request-available (was: ) > [Rust] [DataFusion] Logical plan should have ParquetScan and CsvScan entries > > > Key: ARROW-8783 > URL: https://issues.apache.org/jira/browse/ARROW-8783 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The LogicalPlan currently has a TableScan entry which references a Table (any > logical plan registered with an ExecutionContext) and is often backed by a > Parquet or CSV data source. > I am finding it increasingly inconvenient that we can't just create a logical > plan referencing a Parquet or CSV file, without having to create an execution > context first and register the data sources with it. > This addition will not remove any existing behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8783) [Rust] [DataFusion] Logical plan should have ParquetScan and CsvScan entries
[ https://issues.apache.org/jira/browse/ARROW-8783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8783. --- Resolution: Fixed Issue resolved by pull request 7164 [https://github.com/apache/arrow/pull/7164] > [Rust] [DataFusion] Logical plan should have ParquetScan and CsvScan entries > > > Key: ARROW-8783 > URL: https://issues.apache.org/jira/browse/ARROW-8783 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust, Rust - DataFusion >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Fix For: 1.0.0 > > > The LogicalPlan currently has a TableScan entry which references a Table (any > logical plan registered with an ExecutionContext) and is often backed by a > Parquet or CSV data source. > I am finding it increasingly inconvenient that we can't just create a logical > plan referencing a Parquet or CSV file, without having to create an execution > context first and register the data sources with it. > This addition will not remove any existing behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas
Rauli Ruohonen created ARROW-8816: - Summary: [Python] Year 2263 or later datetimes get mangled when written using pandas Key: ARROW-8816 URL: https://issues.apache.org/jira/browse/ARROW-8816 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.17.0, 0.16.0 Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, ubuntu 20.04 (linux). Reporter: Rauli Ruohonen Using pyarrow 0.17.0, this {code:java} import datetime import pandas as pd def try_with_year(year): print(f'Year {year:_}:') df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]}) df.to_parquet('foo.parquet', engine='pyarrow', compression=None) try: print(pd.read_parquet('foo.parquet', engine='pyarrow')) except Exception as exc: print(repr(exc)) print() try_with_year(2_263) try_with_year(2_262) {code} prints {noformat} Year 2_263: ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp: 924618240') Year 2_262: x 0 2262-01-01{noformat} and using pyarrow 0.16.0, it prints {noformat} Year 2_263: x 0 1678-06-12 00:25:26.290448384 Year 2_262: x 0 2262-01-01{noformat} The issue is that 2263-01-01 is out of bounds for a timestamp stored using epoch nanoseconds, but not out of bounds for a Python datetime. While pyarrow 0.17.0 refuses to read the erroneous output, it is still possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or fastparquet), yielding the same result as with 0.16.0 above (i.e. only reading has changed in 0.17.0, not writing). It would be better if an error was raised when attempting to write the file instead of silently producing erroneous output. 
The reason I suspect this is a pyarrow issue instead of a pandas issue is this modified example: {code:java} import datetime import pandas as pd import pyarrow as pa df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]}) table = pa.Table.from_pandas(df) print(table[0]) try: print(table.to_pandas()) except Exception as exc: print(repr(exc)) {code} which prints {noformat} [ [ 2263-01-01 00:00:00.00 ] ] ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 92461824'){noformat} on pyarrow 0.17.0 and {noformat} [ [ 2263-01-01 00:00:00.00 ] ] x 0 1678-06-12 00:25:26.290448384{noformat} on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, pyarrow prints the correct timestamp when asked to produce it as a string (so it was not lost inside pandas), but the pa.Table.from_pandas(df).to_pandas() round-trip fails. -- This message was sent by Atlassian Jira (v8.3.4#803005)
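The bound involved is easy to see from pandas itself; this quick check (a sketch, not part of the report) shows why year 2262 round-trips while 2263 cannot:

```python
import pandas as pd

# pandas nanosecond timestamps are signed 64-bit offsets from the epoch,
# so the representable range runs from late 1677 to early 2262.
print(pd.Timestamp.min)
print(pd.Timestamp.max)

# 2262-01-01 is inside the range; 2263-01-01 is not, which is why the
# report's example fails only from year 2263 onward.
assert pd.Timestamp.max.year == 2262
```

Python's own `datetime` covers years 1 through 9999, so any datetime past `pd.Timestamp.max` is representable in Arrow's microsecond timestamps but not in pandas' nanosecond ones, matching the failing `to_pandas()` cast.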
[jira] [Resolved] (ARROW-7574) [Rust] FileSource read implementation is seeking for each single byte
[ https://issues.apache.org/jira/browse/ARROW-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jörn Horstmann resolved ARROW-7574. --- Resolution: Fixed > [Rust] FileSource read implementation is seeking for each single byte > - > > Key: ARROW-7574 > URL: https://issues.apache.org/jira/browse/ARROW-7574 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.16.0 >Reporter: Jörn Horstmann >Priority: Major > > on current master branch > {code:java} > $ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet > ... > lseek(3, -8, SEEK_END) = 2937 > read(3, ",\10\0\0PAR1", 8192) = 8 > lseek(3, 845, SEEK_SET) = 845 > read(3, "\25\2\31\334H schema"..., 8192) = 2100 > ... > lseek(5, 4, SEEK_SET) = 4 > read(5, > "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) > = 2941 > lseek(5, 5, SEEK_SET) = 5 > read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020"..., > 8192) = 2940 > lseek(5, 6, SEEK_SET) = 6 > read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200"..., > 8192) = 2939 > lseek(5, 7, SEEK_SET) = 7 > read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000"..., > 8192) = 2938 > lseek(5, 8, SEEK_SET) = 8 > read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) > = 2937 > lseek(5, 9, SEEK_SET) = 9 > read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\024"..., 8192) = > 2936 > lseek(5, 10, SEEK_SET) = 10 > read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\024\30"..., 8192) = > 2935 > {code} > Notice the seek position being incremented by one, despite reading up to > 8192 bytes at a time. Interestingly this does not seem to have a big > performance impact on a local file system with linux, but becomes a problem > when working with a custom implementation of ParquetReader, for example for > reading from s3. > The problem seems to be in > {code} > impl Read for FileSource > {code} > which is unconditionally calling > {code} > reader.seek(SeekFrom::Start(self.start as u64))? 
> {code} > Instead it should probably keep track of the current position and only seek > on the first read. -- This message was sent by Atlassian Jira (v8.3.4#803005)
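The suggested fix, tracking the current position and seeking only on the first read, can be sketched as follows. The reporter's code is Rust; this is a language-neutral Python illustration with hypothetical names (`OffsetReader` is not part of any Arrow API):

```python
import io

class OffsetReader:
    """Read from file `f` starting at offset `start`, seeking only when needed."""

    def __init__(self, f, start):
        self.f = f
        self.pos = start     # where the next read should happen
        self.synced = False  # has the underlying file been positioned yet?

    def read(self, n):
        if not self.synced:
            self.f.seek(self.pos)  # seek exactly once, on the first read
            self.synced = True
        data = self.f.read(n)
        self.pos += len(data)      # afterwards, just track the offset
        return data

r = OffsetReader(io.BytesIO(b'0123456789'), 4)
print(r.read(3), r.read(3))  # b'456' b'789'
```

Each subsequent `read` relies on the file position advancing naturally, which avoids the per-call `seek` that made remote readers (e.g. over S3) slow.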
[jira] [Commented] (ARROW-7574) [Rust] FileSource read implementation is seeking for each single byte
[ https://issues.apache.org/jira/browse/ARROW-7574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108299#comment-17108299 ] Jörn Horstmann commented on ARROW-7574: --- I retested this with the current master and it seems indeed to be fixed. There are still seeks where the file position should already be at the right position, but doing those for every 8192 bytes should not be a problem. {code:java} lseek(5, 4, SEEK_SET) = 4 read(5, "\25\0\25\260\200\200\1\25\272\354\37,\25\234\263\6\25\0\25\10\25\10\0346\0(\0200"..., 8192) = 8192 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f66eca9 lseek(5, 8196, SEEK_SET)= 8196 read(5, "J N\0001J N\00416J\320\7\0006J\240\17\0006J\320\7\0006J\320\7\0006J"..., 252765) = 252765 {code} > [Rust] FileSource read implementation is seeking for each single byte > - > > Key: ARROW-7574 > URL: https://issues.apache.org/jira/browse/ARROW-7574 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Affects Versions: 0.16.0 >Reporter: Jörn Horstmann >Priority: Major > > on current master branch > {code:java} > $ RUST_BACKTRACE=1 strace target/debug/parquet-read tripdata.parquet > ... > lseek(3, -8, SEEK_END) = 2937 > read(3, ",\10\0\0PAR1", 8192) = 8 > lseek(3, 845, SEEK_SET) = 845 > read(3, "\25\2\31\334H schema"..., 8192) = 2100 > ... 
> lseek(5, 4, SEEK_SET) = 4 > read(5, > "\25\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) > = 2941 > lseek(5, 5, SEEK_SET) = 5 > read(5, "\0\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\020"..., > 8192) = 2940 > lseek(5, 6, SEEK_SET) = 6 > read(5, "\25\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\0200"..., > 8192) = 2939 > lseek(5, 7, SEEK_SET) = 7 > read(5, "\310\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02000"..., > 8192) = 2938 > lseek(5, 8, SEEK_SET) = 8 > read(5, "\1\25P,\25\n\25\0\25\10\25\10\0346\0(\02"..., 8192) > = 2937 > lseek(5, 9, SEEK_SET) = 9 > read(5, "\25P,\25\n\25\0\25\10\25\10\0346\0(\024"..., 8192) = > 2936 > lseek(5, 10, SEEK_SET) = 10 > read(5, "P,\25\n\25\0\25\10\25\10\0346\0(\024\30"..., 8192) = > 2935 > {code} > Notice the seek position being incremented by one, despite reading up to > 8192 bytes at a time. Interestingly this does not seem to have a big > performance impact on a local file system with linux, but becomes a problem > when working with a custom implementation of ParquetReader, for example for > reading from s3. > The problem seems to be in > {code} > impl Read for FileSource > {code} > which is unconditionally calling > {code} > reader.seek(SeekFrom::Start(self.start as u64))? > {code} > Instead it should probably keep track of the current position and only seek > on the first read. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (ARROW-8809) [Rust] schema mismatch in integration test
[ https://issues.apache.org/jira/browse/ARROW-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Grove resolved ARROW-8809. --- Resolution: Fixed Issue resolved by pull request 7187 [https://github.com/apache/arrow/pull/7187] > [Rust] schema mismatch in integration test > -- > > Key: ARROW-8809 > URL: https://issues.apache.org/jira/browse/ARROW-8809 > Project: Apache Arrow > Issue Type: Bug > Components: Rust >Reporter: Andy Grove >Assignee: Andy Grove >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > I apologize for the vagueness here, will flesh out details when I learn more > but it looks like Rust is specifying an int64 as a 32 bit type somewhere. > {code:java} > diff schema1.txt schema2.txt > 15c15 > < int64_nullable: Int(32, > --- > > int64_nullable: Int(64, > 17c17 > < int64_nonnullable: Int(32, > --- > > int64_nonnullable: Int(64, > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8815) [Dev][Release] Binary upload script should retry on unexpected bintray request error
[ https://issues.apache.org/jira/browse/ARROW-8815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8815: -- Labels: pull-request-available (was: ) > [Dev][Release] Binary upload script should retry on unexpected bintray > request error > > > Key: ARROW-8815 > URL: https://issues.apache.org/jira/browse/ARROW-8815 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > During uploading the binaries to bintray the script exited multiple times > because of unhandled HTTP errors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8815) [Dev][Release] Binary upload script should retry on unexpected bintray request error
Krisztian Szucs created ARROW-8815: -- Summary: [Dev][Release] Binary upload script should retry on unexpected bintray request error Key: ARROW-8815 URL: https://issues.apache.org/jira/browse/ARROW-8815 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 During uploading the binaries to bintray the script exited multiple times because of unhandled HTTP errors. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (ARROW-8814) [Dev][Release] Binary upload script keeps raising locale warnings
[ https://issues.apache.org/jira/browse/ARROW-8814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-8814: -- Labels: pull-request-available (was: ) > [Dev][Release] Binary upload script keeps raising locale warnings > - > > Key: ARROW-8814 > URL: https://issues.apache.org/jira/browse/ARROW-8814 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > The console output is filled with warnings, which makes it hard to follow what > is happening. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-8814) [Dev][Release] Binary upload script keeps raising locale warnings
Krisztian Szucs created ARROW-8814: -- Summary: [Dev][Release] Binary upload script keeps raising locale warnings Key: ARROW-8814 URL: https://issues.apache.org/jira/browse/ARROW-8814 Project: Apache Arrow Issue Type: Improvement Components: Developer Tools Reporter: Krisztian Szucs Assignee: Krisztian Szucs Fix For: 1.0.0 The console output is filled with warnings, which makes it hard to follow what is happening. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-8803) [Java] Row count should be set before loading buffers in VectorLoader
[ https://issues.apache.org/jira/browse/ARROW-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rong Ma closed ARROW-8803. -- Resolution: Won't Do > [Java] Row count should be set before loading buffers in VectorLoader > - > > Key: ARROW-8803 > URL: https://issues.apache.org/jira/browse/ARROW-8803 > Project: Apache Arrow > Issue Type: Bug > Components: Java >Reporter: Rong Ma >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > Hi guys! I'm new to the community, and I've been using Arrow for some time. > In my use case, I need to read RecordBatches with *compressed* underlying > buffers using Java's IPC API, and I'm finally blocked by the VectorLoader's > "load" method. In this method, > {quote}{{root.setRowCount(recordBatch.getLength());}} > {quote} > It not only sets the rowCount for the root, but also sets the valueCount for > the vectors the root holds, *which have already been set once when loading the > buffers.* > It's not a bug... I know. But if I try to load some compressed buffers, I > will get the following exception: > {quote}java.lang.IndexOutOfBoundsException: index: 0, length: 512 (expected: > range(0, 504)) > at io.netty.buffer.ArrowBuf.checkIndex(ArrowBuf.java:718) > at io.netty.buffer.ArrowBuf.setBytes(ArrowBuf.java:965) > at > org.apache.arrow.vector.BaseFixedWidthVector.reAlloc(BaseFixedWidthVector.java:439) > at > org.apache.arrow.vector.BaseFixedWidthVector.setValueCount(BaseFixedWidthVector.java:708) > at > org.apache.arrow.vector.VectorSchemaRoot.setRowCount(VectorSchemaRoot.java:226) > at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61) > at > org.apache.arrow.vector.ipc.ArrowReader.loadRecordBatch(ArrowReader.java:205) > at > org.apache.arrow.vector.ipc.ArrowStreamReader.loadNextBatch(ArrowStreamReader.java:122) > {quote} > This makes me wonder whether it would make more sense to call > root.setRowCount before loadBuffers. 
> In root.setRowCount it also calls each vector's setValueCount, which I think is unnecessary here since the vectors are already fully formed after loadBuffers.
> Another existing piece of code upstream is similar to this change:
> [link|https://github.com/apache/arrow/blob/ed1f771dccdde623ce85e212eccb2b573185c461/java/vector/src/main/java/org/apache/arrow/vector/ipc/JsonFileReader.java#L170-L178]
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-8803) [Java] Row count should be set before loading buffers in VectorLoader
[ https://issues.apache.org/jira/browse/ARROW-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108172#comment-17108172 ] Rong Ma commented on ARROW-8803: [~fan_li_ya] Yes, you're right... It indeed is not a nice way to solve the problem. Will close this and wait for the updates. Thanks :)
> [Java] Row count should be set before loading buffers in VectorLoader
> -
[jira] [Commented] (ARROW-8803) [Java] Row count should be set before loading buffers in VectorLoader
[ https://issues.apache.org/jira/browse/ARROW-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108163#comment-17108163 ] Liya Fan commented on ARROW-8803: - As you have indicated, {{root.setRowCount}} calls {{setValueCount}} on the underlying vectors, and {{setValueCount}} may involve allocation for those vectors. If we moved the {{root.setRowCount}} call to the front, it would lead to unnecessary vector allocations, because the underlying buffers are populated shortly afterwards. In fact, we are working on support for data compression in IPC scenarios (ARROW-8672). Hope it will solve your problem.
> [Java] Row count should be set before loading buffers in VectorLoader
> -
[jira] [Commented] (ARROW-8762) [C++][Gandiva] Replace Gandiva's BitmapAnd with common implementation
[ https://issues.apache.org/jira/browse/ARROW-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108160#comment-17108160 ] Yibo Cai commented on ARROW-8762: - Benchmarked processing in uint8 and uint64; no obvious difference found. https://issues.apache.org/jira/browse/ARROW-8553?focusedCommentId=17108159&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17108159
> [C++][Gandiva] Replace Gandiva's BitmapAnd with common implementation
> -
>
> Key: ARROW-8762
> URL: https://issues.apache.org/jira/browse/ARROW-8762
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, C++ - Gandiva
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
> Now that the arrow/util/bit_util.h implementation has been optimized, we should just use that one.
[jira] [Commented] (ARROW-8553) [C++] Optimize unaligned bitmap operations
[ https://issues.apache.org/jira/browse/ARROW-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108159#comment-17108159 ] Yibo Cai commented on ARROW-8553: - [~wesm], the aligned case is simple enough for the compiler to auto-vectorize the code. A quick test with the patch below showed no obvious performance difference.
{code:c}
diff --git a/cpp/src/arrow/util/bit_util.cc b/cpp/src/arrow/util/bit_util.cc
index 395801f5e..8beaf6cb8 100644
--- a/cpp/src/arrow/util/bit_util.cc
+++ b/cpp/src/arrow/util/bit_util.cc
@@ -261,7 +261,7 @@ template <template <typename> class BitOp>
 void AlignedBitmapOp(const uint8_t* left, int64_t left_offset, const uint8_t* right,
                      int64_t right_offset, uint8_t* out, int64_t out_offset,
                      int64_t length) {
-  BitOp<uint8_t> op;
+  BitOp<uint64_t> op;
   DCHECK_EQ(left_offset % 8, right_offset % 8);
   DCHECK_EQ(left_offset % 8, out_offset % 8);
@@ -269,8 +269,11 @@ void AlignedBitmapOp(const uint8_t* left, int64_t left_offset, const uint8_t* ri
   left += left_offset / 8;
   right += right_offset / 8;
   out += out_offset / 8;
-  for (int64_t i = 0; i < nbytes; ++i) {
-    out[i] = op(left[i], right[i]);
+  uint64_t* out64 = (uint64_t*)out;
+  uint64_t* left64 = (uint64_t*)left;
+  uint64_t* right64 = (uint64_t*)right;
+  for (int64_t i = 0; i < nbytes / 8; ++i) {
+    out64[i] = op(left64[i], right64[i]);
   }
 }
{code}
Benchmark before this patch (in uint8):
{code:c}
BenchmarkBitmapAnd/32768/0     4253 ns    4251 ns   164715   bytes_per_second=7.17813G/s
BenchmarkBitmapAnd/131072/0   16767 ns   16760 ns    41875   bytes_per_second=7.28348G/s
BenchmarkBitmapAnd/32768/0     4264 ns    4262 ns   165145   bytes_per_second=7.15959G/s
BenchmarkBitmapAnd/131072/0   16702 ns   16695 ns    41952   bytes_per_second=7.31158G/s
{code}
Benchmark after this patch (in uint64):
{code:c}
BenchmarkBitmapAnd/32768/0     4133 ns    4131 ns   171808   bytes_per_second=7.38787G/s
BenchmarkBitmapAnd/131072/0   17167 ns   17157 ns    40529   bytes_per_second=7.11491G/s
BenchmarkBitmapAnd/32768/0     4103 ns    4101 ns   171883   bytes_per_second=7.44151G/s
BenchmarkBitmapAnd/131072/0   17351 ns   17343 ns    43299   bytes_per_second=7.0385G/s
{code}
> [C++] Optimize unaligned bitmap operations
> --
>
> Key: ARROW-8553
> URL: https://issues.apache.org/jira/browse/ARROW-8553
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 0.17.0
> Reporter: Antoine Pitrou
> Assignee: Yibo Cai
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> Currently, {{BitmapAnd}} uses a bit-by-bit loop for unaligned inputs. Using {{Bitmap::VisitWords}} instead would probably yield a manyfold performance increase.
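The word-widening trick in the patch above can be illustrated outside C++. A minimal Python sketch using numpy (the function names and sizes below are illustrative, not Arrow's): AND-ing a bitmap one byte at a time and eight bytes (one uint64) at a time produce identical results, which is the invariant the benchmark relies on.

```python
import numpy as np

def bitmap_and_u8(left, right):
    # Byte-at-a-time AND, as in the original aligned loop.
    return left & right

def bitmap_and_u64(left, right):
    # Reinterpret the byte buffers as 64-bit words, AND the words,
    # then view the result as bytes again (length must be a multiple of 8).
    return (left.view(np.uint64) & right.view(np.uint64)).view(np.uint8)

rng = np.random.default_rng(8553)
left = rng.integers(0, 256, size=4096, dtype=np.uint8)
right = rng.integers(0, 256, size=4096, dtype=np.uint8)

# Both variants compute the same bitmap.
assert np.array_equal(bitmap_and_u8(left, right), bitmap_and_u64(left, right))
```

That both variants benchmark about the same in C++ is consistent with the compiler already auto-vectorizing the byte-wise loop.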
[jira] [Created] (ARROW-8813) Implementing tidyr interface
Dominic Dennenmoser created ARROW-8813: -- Summary: Implementing tidyr interface Key: ARROW-8813 URL: https://issues.apache.org/jira/browse/ARROW-8813 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Dominic Dennenmoser
I think it would be reasonable to implement an interface to the {{tidyr}} package. This would allow Arrow Tables to be processed lazily before being pulled back into memory; currently you need to collect the table first before applying tidyr methods. The following code chunk shows an example routine:
{code:r}
library(magrittr)
arrow_table <- arrow::read_feather("table.feather", as_data_frame = FALSE)
nested_df <- arrow_table %>%
  dplyr::select(ID, 4:7, Value) %>%
  dplyr::filter(Value >= 5) %>%
  dplyr::group_by(ID) %>%
  dplyr::collect() %>%
  tidyr::nest()
{code}
The main focus might be the following three methods:
* {{tidyr::[un]nest()}},
* {{tidyr::pivot_[longer|wider]()}}, and
* {{tidyr::separate()}}.
I suppose the last two can be implemented fairly quickly, but {{tidyr::nest()}} and {{tidyr::unnest()}} cannot be implemented until conversion to list columns is available.
[jira] [Created] (ARROW-8812) Columns of type CategoricalIndex fail to be read back
Jonas Nelle created ARROW-8812: -- Summary: Columns of type CategoricalIndex fail to be read back Key: ARROW-8812 URL: https://issues.apache.org/jira/browse/ARROW-8812 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.15.1 Environment: Python 3.7.7 MacOS (Darwin-19.4.0-x86_64-i386-64bit) Pandas 1.0.3 Pyarrow 0.15.1 Reporter: Jonas Nelle
When columns are of type {{CategoricalIndex}}, saving a table and reading it back causes a {{TypeError: data type "categorical" not understood}}:
{code:python}
import pandas as pd
from pyarrow import parquet, Table

base_df = pd.DataFrame([['foo', 'j', "1"],
                        ['bar', 'j', "1"],
                        ['foo', 'j', "1"],
                        ['foobar', 'j', "1"]],
                       columns=['my_cat', 'var', 'for_count'])
base_df['my_cat'] = base_df['my_cat'].astype('category')
df = (
    base_df
    .groupby(["my_cat", "var"], observed=True)
    .agg({"for_count": "count"})
    .rename(columns={"for_count": "my_cat_counts"})
    .unstack(level="my_cat", fill_value=0)
)
print(df)
{code}
The resulting data frame looks something like this:
|| ||my_cat_counts|| || ||
|my_cat|foo|bar|foobar|
|var| | | |
|j|2|1|1|
Then, writing and reading causes the {{TypeError}}:
{code:python}
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
> TypeError: data type "categorical" not understood
{code}
In the example, the column is also a MultiIndex, but that isn't the problem:
{code:python}
df.columns = df.columns.get_level_values(1)
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()
> TypeError: data type "categorical" not understood
{code}
This is the workaround [suggested on stackoverflow|https://stackoverflow.com/questions/55749399/how-to-fix-the-issue-of-categoricalindex-column-in-pandas]:
{code:python}
df.columns = pd.Index(list(df.columns))  # suggested fix for the time being
parquet.write_table(Table.from_pandas(df), "test.pqt")
parquet.read_table("test.pqt").to_pandas()  # no error
{code}
Are there any plans to support the pattern described here in the future?
[jira] [Comment Edited] (ARROW-8810) Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108087#comment-17108087 ] Uwe Korn edited comment on ARROW-8810 at 5/15/20, 8:47 AM: ---
Generally, you should see Parquet files as immutable. If you want to change a file's contents, it is almost always simpler and faster to rewrite it completely or (much better) to write a second file and treat a directory of Parquet files as a single dataset. This comes down to two major properties:
* Values in a Parquet file are encoded and compressed, so they don't adhere to a fixed size per row/value; in some cases a column chunk of a million values may be stored in just 64 bytes.
* The metadata that contains all essential information (e.g. where row groups start, what schema the data has) is stored at the end of the file, i.e. the footer. Especially the last four bytes are needed, as they indicate the start position of the footer.
Technically, you could still write code that appends to an existing Parquet file, but this has the drawbacks that:
* Writing wouldn't be faster than writing to a second, separate file. It would probably even be slower, as we would need to deserialize the existing metadata and serialize it again with slight modifications.
* Reading wouldn't be faster than reading from a second file, even when done sequentially.
* While appending to a Parquet file, the file would be unreadable.
* If your process crashes during the write, all existing data in the Parquet file would be lost.
* It would give users the impression that you can efficiently insert row-by-row into a file. With a columnar data format that can only leverage its techniques on large chunks of rows, this would generate a massive overhead.
Still, if one were to implement this, it would work as follows:
# Read in the footer/metadata of the existing file.
# Seek to the start position of the existing footer and overwrite it with the new data.
# Merge (or rather concat) the existing metadata with the newly computed metadata and write it at the end of the file.
If you look at how a completely fresh Parquet file is written, this is identical except that we wouldn't need to read in and overwrite any existing metadata. With newer Arrow releases, there will be better support for Parquet datasets in R; I'll leave it to [~npr] or [~jorisvandenbossche] to link to the right docs.
> Append to parquet file?
> ---
>
> Key: ARROW-8810
> URL: https://issues.apache.org/jira/browse/ARROW-8810
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Carl Boettiger
> Priority: Major
>
> Is it possible to append new
[jira] [Commented] (ARROW-8810) Append to parquet file?
[ https://issues.apache.org/jira/browse/ARROW-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108087#comment-17108087 ] Uwe Korn commented on ARROW-8810: - Generally, you should see Parquet files as immutable. If you want to change a file's contents, it is almost always simpler and faster to rewrite it completely or (much better) to write a second file and treat a directory of Parquet files as a single dataset.
> Append to parquet file?
> ---
>
> Key: ARROW-8810
> URL: https://issues.apache.org/jira/browse/ARROW-8810
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Carl Boettiger
> Priority: Major
>
> Is it possible to append new rows to an existing .parquet file using the R client's arrow::write_parquet(), in a manner similar to the `append=TRUE` argument in text-based output formats like write.table()?
>
> Apologies, as this is perhaps more a question of documentation or user interface, or maybe just my ignorance.
[jira] [Commented] (ARROW-8774) [Rust] [DataFusion] Improve threading model
[ https://issues.apache.org/jira/browse/ARROW-8774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108081#comment-17108081 ] Adam Lippai commented on ARROW-8774: [~andygrove] I don't have edit access, so my addition is pending as a suggestion in the doc.
> [Rust] [DataFusion] Improve threading model
> ---
>
> Key: ARROW-8774
> URL: https://issues.apache.org/jira/browse/ARROW-8774
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust, Rust - DataFusion
> Reporter: Andy Grove
> Assignee: Andy Grove
> Priority: Major
> Fix For: 1.0.0
>
> DataFusion currently spawns one thread per partition, which results in poor performance when there are more partitions than available cores/threads. It would be better to have a thread pool that defaults to the number of available cores.
> Here is a Google doc where we can collaborate on a design discussion:
> https://docs.google.com/document/d/1_wc6diy3YrRgEIhVIGzrO5AK8yhwfjWlmKtGnvbsrrY/edit?usp=sharing
[jira] [Resolved] (ARROW-8811) [Java] Fix build on master
[ https://issues.apache.org/jira/browse/ARROW-8811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield resolved ARROW-8811. Fix Version/s: 1.0.0 Resolution: Fixed Issue resolved by pull request 7190 [https://github.com/apache/arrow/pull/7190]
> [Java] Fix build on master
> ---
>
> Key: ARROW-8811
> URL: https://issues.apache.org/jira/browse/ARROW-8811
> Project: Apache Arrow
> Issue Type: Bug
> Components: Java
> Reporter: Micah Kornfield
> Assignee: Micah Kornfield
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h