[jira] [Created] (ARROW-15460) [R] Add as.data.frame.Dataset method
Dragoș Moldovan-Grünfeld created ARROW-15460:

Summary: [R] Add as.data.frame.Dataset method
Key: ARROW-15460
URL: https://issues.apache.org/jira/browse/ARROW-15460
Project: Apache Arrow
Issue Type: New Feature
Components: R
Reporter: Dragoș Moldovan-Grünfeld

Started with a question from Jim Hester on Twitter:

bq. Is there a way to take an arrow::Dataset and collect all the data into a data.frame without using `dplyr::collect()`?

bq. I have a code path where I just want to return a regular data.frame, but I don't really want to add a soft dplyr dependency just for this.

Twitter thread: https://twitter.com/jimhester_/status/1484624519612579841?s=21

This might also be useful for pillar/tibble. Maybe add a {{max_memory}} argument to avoid allocating too much memory (see suggestion from Kirill Müller).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Created] (ARROW-15461) [C++] arrow-utility-test fails with clang-12 (TestCopyAndReverseBitmapPreAllocated)
Yibo Cai created ARROW-15461:

Summary: [C++] arrow-utility-test fails with clang-12 (TestCopyAndReverseBitmapPreAllocated)
Key: ARROW-15461
URL: https://issues.apache.org/jira/browse/ARROW-15461
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Yibo Cai

Unit test {{BitUtilTests.TestCopyAndReverseBitmapPreAllocated}} fails if Arrow is built in release mode with clang-12, on both x86 and Arm. From my debugging, it is related to the {{GetReversedBlock}} function [1], when right-shifting a uint8 value by 8 bits. I think it's a compiler bug. For the test code [2], clang-12 returns 1, which is wrong; clang-11 and clang-13 both return 2, the correct answer. It looks like clang-12 over-optimizes the code; there should be no UB here (uint8 is promoted to int before the shift). A workaround is to treat shifting by 8 bits as a special case. Or we can simply ignore this error if the compiler bug is confirmed (I didn't find a clang bug report).

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bitmap_ops.cc#L101
[2] https://godbolt.org/z/TzYWfcP1E
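The promotion point in the report above can be checked with a minimal sketch. This is not the actual {{GetReversedBlock}} code, only an illustration of why shifting a uint8 operand by 8 bits is well-defined:

```cpp
#include <cassert>
#include <cstdint>

// Illustration only (not the real GetReversedBlock): the uint8_t operand is
// promoted to int before the shift, so a shift count of 8 is less than the
// promoted type's width and the shift is well-defined, yielding 0 for any
// uint8_t input.
static inline uint8_t ShiftRightPromoted(uint8_t value, int bits) {
    // The shift happens on the promoted int; we then narrow back to uint8_t.
    return static_cast<uint8_t>(value >> bits);
}
```

A conforming compiler must therefore produce 0 for a shift count of 8, regardless of the input value.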
[jira] [Created] (ARROW-15462) [GLib] Add GArrow{Month,DayTime,MonthDayNano}Scalar,Array,Arraybuilder
Keisuke Okada created ARROW-15462:

Summary: [GLib] Add GArrow{Month,DayTime,MonthDayNano}Scalar,Array,Arraybuilder
Key: ARROW-15462
URL: https://issues.apache.org/jira/browse/ARROW-15462
Project: Apache Arrow
Issue Type: Sub-task
Components: GLib
Affects Versions: 8.0.0
Reporter: Keisuke Okada
[jira] [Created] (ARROW-15463) [GLib] Add arrow::compute::Utf8NormalizeOptions bindings
Keisuke Okada created ARROW-15463:

Summary: [GLib] Add arrow::compute::Utf8NormalizeOptions bindings
Key: ARROW-15463
URL: https://issues.apache.org/jira/browse/ARROW-15463
Project: Apache Arrow
Issue Type: Improvement
Components: GLib, Ruby
Reporter: Keisuke Okada
[jira] [Created] (ARROW-15464) [Python] CSV cancellation test flaky on macOS ARM64
Antoine Pitrou created ARROW-15464:

Summary: [Python] CSV cancellation test flaky on macOS ARM64
Key: ARROW-15464
URL: https://issues.apache.org/jira/browse/ARROW-15464
Project: Apache Arrow
Issue Type: Bug
Components: C++, Python
Reporter: Antoine Pitrou
Fix For: 8.0.0

See for example this build where the test was un-skipped on Apple M1 hardware:
https://github.com/ursacomputing/crossbow/runs/4943189166?check_suite_focus=true

{code}
test-arm64-env/lib/python3.8/site-packages/pyarrow/tests/test_csv.py ... [ 21%]
arrow/ci/scripts/python_wheel_unix_test.sh: line 84: 73197 Killed: 9 python -m pytest -r s --pyargs pyarrow
...
Error: Process completed with exit code 137.
{code}
[jira] [Created] (ARROW-15465) [Python][CI] Dataset tests when Parquet is disabled
Antoine Pitrou created ARROW-15465:

Summary: [Python][CI] Dataset tests when Parquet is disabled
Key: ARROW-15465
URL: https://issues.apache.org/jira/browse/ARROW-15465
Project: Apache Arrow
Issue Type: Bug
Components: Continuous Integration, Python
Reporter: Antoine Pitrou
Fix For: 8.0.0

Example build at https://app.travis-ci.com/github/apache/arrow/jobs/557089817#L7819
[jira] [Created] (ARROW-15466) [Go] Please tag versions for Go modules to recognize
Jonathan A Sternberg created ARROW-15466:

Summary: [Go] Please tag versions for Go modules to recognize
Key: ARROW-15466
URL: https://issues.apache.org/jira/browse/ARROW-15466
Project: Apache Arrow
Issue Type: Improvement
Reporter: Jonathan A Sternberg

Please tag v7 of Arrow for Go using the tag format that Go expects modules to use for their versions. At the moment, if you want to upgrade to v7, you have to give a specific hash or a specific tag as part of the `go get` command instead of running `go get github.com/apache/arrow/go/arrow/v7@latest`. This is because there is no `go/v7.0.0` tag pointing at the commit; there is a `go/v6.0.1`. This request is to tag the v7 release in that same tag format alongside the `apache-arrow-7.0.0` tag.

See this page for an example of the tag being recognized properly by Go modules: https://pkg.go.dev/github.com/apache/arrow/go/v6. If I replace that with `v7`, it does not currently recognize a stable version: https://pkg.go.dev/github.com/apache/arrow/go/v7
[jira] [Created] (ARROW-15467) [Go][Parquet] pqarrow decimal Test fails on s390x
Matthew Topol created ARROW-15467:

Summary: [Go][Parquet] pqarrow decimal Test fails on s390x
Key: ARROW-15467
URL: https://issues.apache.org/jira/browse/ARROW-15467
Project: Apache Arrow
Issue Type: Bug
Components: Go, Parquet
Reporter: Matthew Topol
Assignee: Matthew Topol
Fix For: 8.0.0

Faulty random decimal generation on big-endian platforms is causing tests to fail.
[jira] [Created] (ARROW-15468) [R] [CI] A crossbow job that tests against DuckDB's dev branch
Jonathan Keane created ARROW-15468:

Summary: [R] [CI] A crossbow job that tests against DuckDB's dev branch
Key: ARROW-15468
URL: https://issues.apache.org/jira/browse/ARROW-15468
Project: Apache Arrow
Issue Type: Improvement
Components: Continuous Integration, R
Reporter: Jonathan Keane

It would be good to test against DuckDB's dev branch to warn us if there are impending changes that break something. Currently some of our jobs already do this:
https://github.com/apache/arrow/blob/f9f6fdbb7518c09b833cb6b78bc202008d28e865/ci/scripts/r_deps.sh#L45-L51

We should clean this up so that _generally_ builds use the released DuckDB, but we can toggle dev DuckDB (and optionally run a separate build that uses the dev DuckDB).
[jira] [Created] (ARROW-15469) Unable to build pyarrow wheels with manylinux2014 for ppc64le arch
Marvin Giessing created ARROW-15469:

Summary: Unable to build pyarrow wheels with manylinux2014 for ppc64le arch
Key: ARROW-15469
URL: https://issues.apache.org/jira/browse/ARROW-15469
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Marvin Giessing

Hi, I'm trying to build wheels for ppc64le with manylinux2014 following the [documentation|https://arrow.apache.org/docs/developers/python.html#building-on-linux-and-macos], but when I execute the cmake command I get this error:

{code}
[...]
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DARROW_WITH_BZ2=ON \
      -DARROW_WITH_ZLIB=ON \
      -DARROW_WITH_ZSTD=ON \
      -DARROW_WITH_LZ4=ON \
      -DARROW_WITH_SNAPPY=ON \
      -DARROW_WITH_BROTLI=ON \
      -DARROW_PARQUET=ON \
      -DARROW_PYTHON=ON \
      -DARROW_BUILD_TESTS=ON \
      -DPython3_EXECUTABLE=/opt/python/cp37-cp37m/bin/python3 \
      ..
[...]
-- Creating bundled static library target arrow_bundled_dependencies at /repos/arrow/cpp/build/release/libarrow_bundled_dependencies.a
CMake Error at /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find Python3 (missing: Development NumPy Development.Module
  Development.Embed) (found version "3.7.12")
Call Stack (most recent call first):
  /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
  /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPython/Support.cmake:3166 (find_package_handle_standard_args)
  /opt/_internal/pipx/venvs/cmake/lib/python3.9/site-packages/cmake/data/share/cmake-3.22/Modules/FindPython3.cmake:490 (include)
  cmake_modules/FindPython3Alt.cmake:46 (find_package)
  src/arrow/python/CMakeLists.txt:22 (find_package)
{code}

Does anyone know what is going wrong here? I installed numpy via the requirements files.
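A hedged sketch of a possible fix for the failure above, on the assumption that CMake's FindPython3 resolved an interpreter that does not have NumPy available. The paths are the ones from the report; {{Python3_FIND_STRATEGY}} is a standard CMake 3.15+ variable, but whether this resolves the particular failure is unverified:

{code}
# Make sure numpy is installed for the exact interpreter passed to CMake.
/opt/python/cp37-cp37m/bin/python3 -m pip install numpy

# Re-run CMake, forcing FindPython3 to use that location rather than
# searching by version (the other -DARROW_* flags stay as in the report).
cmake -DPython3_EXECUTABLE=/opt/python/cp37-cp37m/bin/python3 \
      -DPython3_FIND_STRATEGY=LOCATION \
      ..
{code}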
[jira] [Created] (ARROW-15470) [C++] Allow user to specify string to be used for missing data when writing CSV dataset
Nicola Crane created ARROW-15470:

Summary: [C++] Allow user to specify string to be used for missing data when writing CSV dataset
Key: ARROW-15470
URL: https://issues.apache.org/jira/browse/ARROW-15470
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Nicola Crane

The ability to select the string used for missing data was implemented for the CSV writer in ARROW-14903; would it be possible to also allow this when writing CSV datasets?
[jira] [Created] (ARROW-15471) [R] ExtensionType support in R
Dewey Dunnington created ARROW-15471:

Summary: [R] ExtensionType support in R
Key: ARROW-15471
URL: https://issues.apache.org/jira/browse/ARROW-15471
Project: Apache Arrow
Issue Type: Improvement
Components: R
Reporter: Dewey Dunnington

In Python there is support for extension types, which consists of a registration step that defines functions to handle metadata serialization and deserialization. In R, any extension name or metadata at the top level is currently obliterated on import. To implement geometry reading and writing to Parquet, IPC, and/or Feather, we will at the very least need the extension name and metadata preserved (in R), and at best provide a registration step to customize the behaviour of the resulting Array/DataType.

Reprex for R:

{code:R}
# remotes::install_github("paleolimbot/narrow")
library(narrow)

carray <- as_narrow_array(1:5)
carray$schema$metadata[["ARROW:extension:name"]] <- "extension name!"
carray$schema$metadata[["ARROW:extension:metadata"]] <- "bananas"
carray$schema$metadata[["something else"]] <- "more bananas"

array <- from_narrow_array(carray, arrow::Array)
carray2 <- as_narrow_array(array)
carray2$schema$metadata[["ARROW:extension:name"]]
#> NULL
carray2$schema$metadata[["ARROW:extension:metadata"]]
#> NULL
carray2$schema$metadata[["something else"]]
#> NULL
{code}

There is some discussion of this as a solution to ARROW-14378, including an example of how pandas implements the 'interval' extension type (example contributed by [~jorisvandenbossche]).

For the Interval example, the different parts live in different places:
- The Arrow extension type definition for pandas' interval type: https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/_arrow_utils.py#L88-L136
- The `__arrow_array__` implementation (conversion pandas -> arrow): https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/arrays/interval.py#L1405-L1455
- The `__from_arrow__` implementation (conversion arrow -> pandas): https://github.com/pandas-dev/pandas/blob/fc6b441ba527ca32b460ae4f4f5a6802335497f9/pandas/core/dtypes/dtypes.py#L1227-L1255
[jira] [Created] (ARROW-15472) [Website] Add Flight SQL blog post
David Li created ARROW-15472:

Summary: [Website] Add Flight SQL blog post
Key: ARROW-15472
URL: https://issues.apache.org/jira/browse/ARROW-15472
Project: Apache Arrow
Issue Type: Task
Components: Website
Reporter: David Li

To go along with/right after the 7.0.0 release announcement.
[jira] [Created] (ARROW-15473) [C++][FlightRPC] Expose a way to terminate DoExchange stream client side
Rok Mihevc created ARROW-15473:

Summary: [C++][FlightRPC] Expose a way to terminate DoExchange stream client side
Key: ARROW-15473
URL: https://issues.apache.org/jira/browse/ARROW-15473
Project: Apache Arrow
Issue Type: New Feature
Components: C++, FlightRPC
Reporter: Rok Mihevc

We want a mechanism to close DoExchange streams from the client side in the case of long-running connections. This would be handy for testing, and for cases where e.g. the user wants to disconnect.
[jira] [Created] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?
Lance Dacey created ARROW-15474:

Summary: [Python] Possibility of a table.drop_duplicates() function?
Key: ARROW-15474
URL: https://issues.apache.org/jira/browse/ARROW-15474
Project: Apache Arrow
Issue Type: Wish
Affects Versions: 6.0.1
Reporter: Lance Dacey
Fix For: 8.0.0

I noticed that there are group_by() and sort_by() functions in the 7.0.0 branch. Would it be possible to include a drop_duplicates() function as well?

||id||updated_at||
|1|2022-01-01 04:23:57|
|2|2022-01-01 07:19:21|
|2|2022-01-10 22:14:01|

Something like the snippet below, which would return a table without the second row in the example above, would be great. I usually read an append-only dataset and then need to report on the latest version of each row. To drop duplicates, I currently convert the append-only table to a pandas DataFrame temporarily, then convert it back to a table and save a separate "latest-version" dataset.

{code:python}
table.sort_by(sorting=[("id", "ascending"), ("updated_at", "ascending")]).drop_duplicates(subset=["id"], keep="last")
{code}
[GitHub] [arrow-julia] sl-solution opened a new issue #280: Allow missing type without converting to vector
sl-solution opened a new issue #280:
URL: https://github.com/apache/arrow-julia/issues/280

Not sure if it makes sense, but would it be possible to allow the missing type without copying the underlying arrow vector? As far as I understand, allowing missing only changes the `Type` of the arrow vector (e.g. in `Primitive`), not the underlying data.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org