[jira] [Assigned] (ARROW-14893) [C++] Allow creating GCS filesystem from URI
[ https://issues.apache.org/jira/browse/ARROW-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-14893: --- Assignee: Micah Kornfield > [C++] Allow creating GCS filesystem from URI > > > Key: ARROW-14893 > URL: https://issues.apache.org/jira/browse/ARROW-14893 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Micah Kornfield >Priority: Major > > Similarly to what already exists for S3. See {{FileSystemFromUri}} and > {{S3Options::FromUri}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14892) [Python] Add bindings for GCS filesystem
[ https://issues.apache.org/jira/browse/ARROW-14892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-14892: --- Assignee: Micah Kornfield > [Python] Add bindings for GCS filesystem > > > Key: ARROW-14892 > URL: https://issues.apache.org/jira/browse/ARROW-14892 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Antoine Pitrou >Assignee: Micah Kornfield >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15492) [Python] handle timestamp type in parquet file for compatibility with older HiveQL
[ https://issues.apache.org/jira/browse/ARROW-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488603#comment-17488603 ] Micah Kornfield commented on ARROW-15492: - So this looks like an oversight with int96. The logical type with isAdjustedToUtc isn't accounted for when making the [arrow type for int96|https://github.com/apache/arrow/blob/85f192a45755b3f15653fdc0a8fbd788086e125f/cpp/src/parquet/arrow/schema_internal.cc#L197]. It is used for [int64|#L197]. [~amznero] would you be interested in contributing a fix for this? > [Python] handle timestamp type in parquet file for compatibility with older > HiveQL > -- > > Key: ARROW-15492 > URL: https://issues.apache.org/jira/browse/ARROW-15492 > Project: Apache Arrow > Issue Type: New Feature >Affects Versions: 6.0.1 >Reporter: nero >Priority: Major > > Hi there, > I face an issue when I write a parquet file by PyArrow. > In the older version of Hive, it can only recognize the timestamp type stored > in INT96, so I use table.write_to_data with `use_deprecated > timestamp_int96_timestamps=True` option to save the parquet file. But the > HiveQL will skip conversion when the metadata of parquet file is not > created_by "parquet-mr". > [hive/ParquetRecordReaderBase.java at > f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive > (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139] > > So I have to save the timestamp columns with timezone info(pad to UTC+8). > But when pyarrow.parquet read from a dir which contains parquets created by > both PyArrow and parquet-mr, Arrow.Table will ignore the timezone info for > parquet-mr files. > > Maybe PyArrow can expose the created_by option in pyarrow({*}prefer{*}, > parquet::WriterProperties::created_by is available in the C++ ). > Or handle the timestamp type with timezone which files created by parquet-mr? 
> > Maybe related to https://issues.apache.org/jira/browse/ARROW-14422 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15492) [Python] handle timestamp type in parquet file for compatibility with older HiveQL
[ https://issues.apache.org/jira/browse/ARROW-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488578#comment-17488578 ] nero commented on ARROW-15492: -- [~emkornfield] In Parquet format, there is a flag named "isAdjustedToUTC" to indicate whether the timestamp type is local timezone or UTC. Ref: [parquet-format/LogicalTypes.md at master · apache/parquet-format (github.com)|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc] > [Python] handle timestamp type in parquet file for compatibility with older > HiveQL > -- > > Key: ARROW-15492 > URL: https://issues.apache.org/jira/browse/ARROW-15492 > Project: Apache Arrow > Issue Type: New Feature >Affects Versions: 6.0.1 >Reporter: nero >Priority: Major > > Hi there, > I face an issue when I write a parquet file by PyArrow. > In the older version of Hive, it can only recognize the timestamp type stored > in INT96, so I use table.write_to_data with `use_deprecated > timestamp_int96_timestamps=True` option to save the parquet file. But the > HiveQL will skip conversion when the metadata of parquet file is not > created_by "parquet-mr". > [hive/ParquetRecordReaderBase.java at > f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive > (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139] > > So I have to save the timestamp columns with timezone info(pad to UTC+8). > But when pyarrow.parquet read from a dir which contains parquets created by > both PyArrow and parquet-mr, Arrow.Table will ignore the timezone info for > parquet-mr files. > > Maybe PyArrow can expose the created_by option in pyarrow({*}prefer{*}, > parquet::WriterProperties::created_by is available in the C++ ). > Or handle the timestamp type with timezone which files created by parquet-mr? 
> > Maybe related to https://issues.apache.org/jira/browse/ARROW-14422 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15593) [C++] An unhandled race condition exists in ThreadPool
[ https://issues.apache.org/jira/browse/ARROW-15593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai resolved ARROW-15593. -- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12358 [https://github.com/apache/arrow/pull/12358] > [C++] An unhandled race condition exists in ThreadPool > -- > > Key: ARROW-15593 > URL: https://issues.apache.org/jira/browse/ARROW-15593 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 > Environment: linux >Reporter: Huxley Hu >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > A race condition exists at the ThreadPool which may lead to the loss of > pending tasks after a process forks. > See this issue for more detail: https://github.com/apache/arrow/issues/12329 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15607) [C++] Fix incorrect CPUID flag for AVX detection
[ https://issues.apache.org/jira/browse/ARROW-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai resolved ARROW-15607. -- Resolution: Fixed Issue resolved by pull request 12347 [https://github.com/apache/arrow/pull/12347] > [C++] Fix incorrect CPUID flag for AVX detection > > > Key: ARROW-15607 > URL: https://issues.apache.org/jira/browse/ARROW-15607 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Yibo Cai >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/apache/arrow/pull/12347 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15607) [C++] Fix incorrect CPUID flag for AVX detection
[ https://issues.apache.org/jira/browse/ARROW-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15607: --- Labels: pull-request-available (was: ) > [C++] Fix incorrect CPUID flag for AVX detection > > > Key: ARROW-15607 > URL: https://issues.apache.org/jira/browse/ARROW-15607 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Yibo Cai >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/apache/arrow/pull/12347 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15607) [C++] Fix incorrect CPUID flag for AVX detection
Yibo Cai created ARROW-15607: Summary: [C++] Fix incorrect CPUID flag for AVX detection Key: ARROW-15607 URL: https://issues.apache.org/jira/browse/ARROW-15607 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Reporter: Yibo Cai Fix For: 8.0.0 https://github.com/apache/arrow/pull/12347 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488548#comment-17488548 ] Weston Pace commented on ARROW-15604:
-

I ran into bugs like this before. I don't think the cause is really OT but it seems to increase the likelihood of failure. Basically we have async tasks that do something like...
* Run task
* Mark future finished with result (at this point the main thread is free to exit and start shutdown)
* Cleanup task

If anything in the Cleanup task accesses global state we could get this error. In the past the problem was that a task was accessing the default memory pool in its cleanup (I don't recall why). A short term fix is to update the test so it isn't using the eternal thread pool or to call WaitForIdle on the CPU thread pool, but these feel more like hacks than real fixes, as a real customer would still have a segfault at shutdown.

In this case it seems the cleanup step is doing something with OT (which makes perfect sense). I don't suppose there is any way to block the shutdown until the eternal thread pool is idle? It could probably be signal safe if we waited with a busy loop, but then I think you run the risk of shutdown delays.

> [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
> ---
>
> Key: ARROW-15604
> URL: https://issues.apache.org/jira/browse/ARROW-15604
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Continuous Integration
> Reporter: Antoine Pitrou
> Priority: Major
>
> The error is a heap-use-after-free and involves an OpenTracing structure that
> was deleted by an atexit hook.
> https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843
> Summary:
> {code}
> Atomic write of size 4 at 0x7b08000136a8 by thread T2:
> [...]
> #10 opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 (libarrow.so.800+0x1e62ef7)
> #11 opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 (libarrow.so.800+0x1e70178)
> #12 opentelemetry::v1::context::Token::~Token() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 (libarrow.so.800+0x1e7012f)
> [...]
> {code}
> {code}
> Previous write of size 8 at 0x7b08000136a8 by main thread:
> #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e)
> [...]
> #7 opentelemetry::v1::nostd::shared_ptr::~shared_ptr() /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 (libarrow.so.800+0x1e62fb3)
> #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f)
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
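The task lifecycle described in the comment can be sketched with stdlib Python (hypothetical names; this is not Arrow's actual ThreadPool): the worker fulfills the future first, then runs a cleanup step, and the waiter can proceed to "shutdown" while cleanup is still touching shared state — the window the sanitizer reports on.

```python
# Sketch of the race window: future is marked finished before cleanup runs,
# so the main thread may start tearing down global state during cleanup.
import threading
from concurrent.futures import Future

shared_state = {"alive": True}      # stands in for global state / atexit-managed data
cleanup_done = threading.Event()

def run_task(fut: Future) -> None:
    fut.set_result(42)              # 1) mark future finished; waiter may now exit
    # 2) cleanup still reads shared/global state afterwards; if the main
    #    thread has already destroyed it, this is the use-after-free.
    _ = shared_state["alive"]
    cleanup_done.set()

fut = Future()
t = threading.Thread(target=run_task, args=(fut,))
t.start()
result = fut.result()               # main thread unblocks here, before cleanup
cleanup_done.wait()                 # a real fix must block shutdown like this
t.join()
```

Waiting on `cleanup_done` before exiting is the moral equivalent of the "block shutdown until the pool is idle" idea floated above.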
[jira] [Commented] (ARROW-15563) [C++] Compilation failure on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488531#comment-17488531 ] Kazuaki Ishizaki commented on ARROW-15563: -- Sure, I will look at this > [C++] Compilation failure on s390x platform > --- > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. > -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > Using ld linker > Configured for DEBUG build (set with cmake > 
-DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > -- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version "1.66.0", minimum > required is "1.58") found components: regex 
system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required by 'virtual:world', not found > -- Could NOT find Brotli (missing: BROTLI_COMMON_LIBRARY BROTLI_ENC_LIBRARY > BROTLI_DEC_LIBRARY BROTLI_INCLUDE_DIR) > -- Building brotli from source > -- Building without OpenSSL support. Minimum OpenSSL version 1.0.2 required. > CM
[jira] [Updated] (ARROW-15468) [R] [CI] A crossbow job that tests against DuckDB's dev branch
[ https://issues.apache.org/jira/browse/ARROW-15468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15468: --- Labels: pull-request-available (was: ) > [R] [CI] A crossbow job that tests against DuckDB's dev branch > -- > > Key: ARROW-15468 > URL: https://issues.apache.org/jira/browse/ARROW-15468 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > It would be good to test against DuckDB's dev branch to warn us if there are > impending changes that break something. > While we're doing this, we should clean up the existing logic; currently some of our > jobs already do this: > https://github.com/apache/arrow/blob/f9f6fdbb7518c09b833cb6b78bc202008d28e865/ci/scripts/r_deps.sh#L45-L51 > > We should clean this up so that _generally_ builds use the released DuckDB, > but we can toggle dev DuckDB (and run a separate build that uses the dev > DuckDB optionally) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
[ https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15605. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12363 [https://github.com/apache/arrow/pull/12363] > [CI] [R] Keep using old macos runners on our autobrew CI job > > > Key: ARROW-15605 > URL: https://issues.apache.org/jira/browse/ARROW-15605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15606) [CI] [R] Add brew build that exercises the R package
[ https://issues.apache.org/jira/browse/ARROW-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15606: --- Labels: pull-request-available (was: ) > [CI] [R] Add brew build that exercises the R package > > > Key: ARROW-15606 > URL: https://issues.apache.org/jira/browse/ARROW-15606 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15606) [CI] [R] Add brew build that exercises the R package
[ https://issues.apache.org/jira/browse/ARROW-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15606: --- Summary: [CI] [R] Add brew build that exercises the R package (was: [CI] [R] Add brew release build) > [CI] [R] Add brew build that exercises the R package > > > Key: ARROW-15606 > URL: https://issues.apache.org/jira/browse/ARROW-15606 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15599) [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or other delimited) file
[ https://issues.apache.org/jira/browse/ARROW-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488436#comment-17488436 ] Nicola Crane commented on ARROW-15599:
--

Here's a reprex with more verbose output.

{code:r}
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp

tf <- tempfile()
write.csv(data.frame(x = '2018-10-07 19:04:05.005'), tf, row.names = FALSE)

# successfully read in file
read_csv_arrow(tf, as_data_frame = FALSE)
#> Table
#> 1 rows x 1 columns
#> $x

# successfully read in with col_names and col_types specified
read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "?",
  skip = 1,
  as_data_frame = FALSE
)
#> Table
#> 1 rows x 1 columns
#> $x

read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "T",
  skip = 1,
  as_data_frame = FALSE
)
#> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550  decoder_.Decode(data, size, quoted, &value)
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123  status
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554  parser.VisitColumn(col_index, visit)

read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "T",
  as_data_frame = FALSE,
  skip = 1,
  timestamp_parsers = "%Y-%m-%d %H:%M:%S"
)
#> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550  decoder_.Decode(data, size, quoted, &value)
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123  status
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554  parser.VisitColumn(col_index, visit)
{code}

> [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or other delimited) file
> ---
>
> Key: ARROW-15599
> URL: https://issues.apache.org/jira/browse/ARROW-15599
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 6.0.1
> Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
> Reporter: SHIMA Tatsuya
> Priority: Major
>
> I tried to read the csv column type as timestamp, but I could only get it to work well when `col_types` was not specified.
> I'm sorry if I missed something and this is the expected behavior. (It would be great if you could add an example with `col_types` in the documentation.)
> {code:r}
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> t_string <- tibble::tibble(
>   x = "2018-10-07 19:04:05.005"
> )
> write_csv_arrow(t_string, "tmp.csv")
> read_csv_arrow(
>   "tmp.csv",
>   as_data_frame = FALSE
> )
> #> Table
> #> 1 rows x 1 columns
> #> $x
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "?",
>   skip = 1,
>   as_data_frame = FALSE
> )
> #> Table
> #> 1 rows x 1 columns
> #> $x
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "T",
>   skip = 1,
>   as_data_frame = FALSE
> )
> #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "T",
>   as_data_frame = FALSE,
>   skip = 1,
>   timestamp_parsers = "%Y-%m-%d %H:%M:%S"
> )
> #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15606) [CI] [R] Add brew release build
Jonathan Keane created ARROW-15606: -- Summary: [CI] [R] Add brew release build Key: ARROW-15606 URL: https://issues.apache.org/jira/browse/ARROW-15606 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane Assignee: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets
[ https://issues.apache.org/jira/browse/ARROW-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488413#comment-17488413 ] Sarah Gilmore edited comment on ARROW-15554 at 2/7/22, 8:21 PM: Will do! was (Author: sgilmore): Wil do! > [Format][C++] Add "LargeMap" type with 64-bit offsets > - > > Key: ARROW-15554 > URL: https://issues.apache.org/jira/browse/ARROW-15554 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Sarah Gilmore >Priority: Major > > It would be nice if a "LargeMap" type existed along side the "Map" type for > parity. For other datatypes that require offset arrays/buffers, such as > String, List, BinaryArray, provides a "large" version of these types, i.e. > LargeString, LargeList, and LargeBinaryArray. It would be nice to have a > "LargeMap" for parity. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets
[ https://issues.apache.org/jira/browse/ARROW-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488413#comment-17488413 ] Sarah Gilmore commented on ARROW-15554: --- Wil do! > [Format][C++] Add "LargeMap" type with 64-bit offsets > - > > Key: ARROW-15554 > URL: https://issues.apache.org/jira/browse/ARROW-15554 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Sarah Gilmore >Priority: Major > > It would be nice if a "LargeMap" type existed along side the "Map" type for > parity. For other datatypes that require offset arrays/buffers, such as > String, List, BinaryArray, provides a "large" version of these types, i.e. > LargeString, LargeList, and LargeBinaryArray. It would be nice to have a > "LargeMap" for parity. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15570) [CI][Nightly] Drop centos-8 R nightly job
[ https://issues.apache.org/jira/browse/ARROW-15570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15570. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12337 [https://github.com/apache/arrow/pull/12337] > [CI][Nightly] Drop centos-8 R nightly job > - > > Key: ARROW-15570 > URL: https://issues.apache.org/jira/browse/ARROW-15570 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > It has started failing since CentOS 8 went EOL. Followup to ARROW-15038, > which removed all of the other CentOS 8 jobs. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14745) [R] Enable true duckdb streaming
[ https://issues.apache.org/jira/browse/ARROW-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14745. Resolution: Fixed Issue resolved by pull request 11730 [https://github.com/apache/arrow/pull/11730] > [R] Enable true duckdb streaming > > > Key: ARROW-14745 > URL: https://issues.apache.org/jira/browse/ARROW-14745 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 9h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.
[ https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488402#comment-17488402 ] Alenka Frim commented on ARROW-14488:
-

Thank you Joris! An example would be:

{code:python}
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> import pyarrow as pa
>>>
>>> schema = pa.schema([
...     pa.field('a', pa.string()),
...     pa.field('b', pa.int64()),
...     pa.field('c', pa.float64())])
>>>
>>> pa.Table.from_pandas(df, schema=schema)
pyarrow.Table
a: string
b: int64
c: double
a: [["a"]]
b: [[1]]
c: [[1]]
>>> pa.Table.from_pandas(df.head(0), schema=schema)
pyarrow.Table
a: string
b: int64
c: double
a: [[]]
b: [[]]
c: [[]]
{code}

> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> ---
>
> Key: ARROW-14488
> URL: https://issues.apache.org/jira/browse/ARROW-14488
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Environment: OS: Windows 10, CentOS 7
> Reporter: Yuan Zhou
> Priority: Major
>
> We use pandas (with pyarrow engine) to write out parquet files and those outputs will be consumed by other applications such as Java apps using org.apache.parquet.hadoop.ParquetFileReader. We found that some empty dataframes would get incorrect schema for string columns in other applications. After some investigation, we narrowed down the issue to the schema inference by pyarrow:
> {code:java}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
> Out[4]:
> a: string
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
> In [5]: pa.Schema.from_pandas(df.head(0))
> Out[5]:
> a: null
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
> In [6]: pa.__version__
> Out[6]: '5.0.0'
> {code}
> As you can see, the column 'a', which should be string type, is inferred as null type and is converted to int32 while writing to parquet files.
> Is this an expected behavior? Or do we have any workaround for this issue? Could anyone take a look please. Thanks!
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
[ https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15605: --- Labels: pull-request-available (was: ) > [CI] [R] Keep using old macos runners on our autobrew CI job > > > Key: ARROW-15605 > URL: https://issues.apache.org/jira/browse/ARROW-15605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
[ https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15605: -- Assignee: Jonathan Keane > [CI] [R] Keep using old macos runners on our autobrew CI job > > > Key: ARROW-15605 > URL: https://issues.apache.org/jira/browse/ARROW-15605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
Jonathan Keane created ARROW-15605: -- Summary: [CI] [R] Keep using old macos runners on our autobrew CI job Key: ARROW-15605 URL: https://issues.apache.org/jira/browse/ARROW-15605 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15080) [Python] Allow creation of month_day_nano interval from tuple
[ https://issues.apache.org/jira/browse/ARROW-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-15080.
--
Fix Version/s: 8.0.0
Resolution: Fixed

Issue resolved by pull request 12348
[https://github.com/apache/arrow/pull/12348]

> [Python] Allow creation of month_day_nano interval from tuple
> -
>
> Key: ARROW-15080
> URL: https://issues.apache.org/jira/browse/ARROW-15080
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Antoine Pitrou
> Assignee: Micah Kornfield
> Priority: Minor
> Labels: pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> This should ideally be allowed but isn't:
> {code:python}
> >>> a = pa.array([(3, 20, 100)], type=pa.month_day_nano_interval())
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     a = pa.array([(3, 20, 100)], type=pa.month_day_nano_interval())
>   File "pyarrow/array.pxi", line 315, in pyarrow.lib.array
>     return _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
>   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
>     chunked = GetResultValue(
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>     return check_status(status)
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
>     raise ArrowTypeError(message)
> ArrowTypeError: No temporal attributes found on object.
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-13168: -- Assignee: Will Jones > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Assignee: Will Jones >Priority: Major > Labels: timestamp > > Note: currently the timezone database is not available on Windows, so timezone > aware operations will fail. > We're using the tz.h library, which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-Windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > Windows. Database would slowly go stale. > # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on Windows) are required. > # local user-provided folder - user could provide a location at build time. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13204) [MATLAB] Update documentation for the MATLAB Interface to reflect latest CMake build system changes
[ https://issues.apache.org/jira/browse/ARROW-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gurney updated ARROW-13204: - Description: The MATLAB Interface should have clear documentation detailing how to build and run the interface on all platforms using the latest CMake build system changes merged in [https://github.com/apache/arrow/pull/12004.] The current documentation is out of date and is in need of some quality of life improvements as we transition toward building out the MATLAB interface beyond just supporting basic Feather file reading and writing. was: [https://github.com/apache/arrow/pull/10614] integrates GoogleTest into the CMake build system to support building and running C++ tests for the MATLAB Interface. The MATLAB interface should have clear documentation detailing how to build and run the C++ tests on all platforms. This should include instructions for building with and without specifying a custom `GTEST_ROOT` value. Summary: [MATLAB] Update documentation for the MATLAB Interface to reflect latest CMake build system changes (was: [MATLAB] Add documentation for building and running C++ tests) > [MATLAB] Update documentation for the MATLAB Interface to reflect latest > CMake build system changes > --- > > Key: ARROW-13204 > URL: https://issues.apache.org/jira/browse/ARROW-13204 > Project: Apache Arrow > Issue Type: Task > Components: MATLAB >Reporter: Kevin Gurney >Assignee: Kevin Gurney >Priority: Minor > > The MATLAB Interface should have clear documentation detailing how to build > and run the interface on all platforms using the latest CMake build system > changes merged in [https://github.com/apache/arrow/pull/12004.] > The current documentation is out of date and is in need of some quality of > life improvements as we transition toward building out the MATLAB interface > beyond just supporting basic Feather file reading and writing. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488349#comment-17488349 ] David Li commented on ARROW-15604: -- It also seems the main thread is being destroyed during/before the thread pools, so maybe this is a static destructor order pitfall… > [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing > --- > > Key: ARROW-15604 > URL: https://issues.apache.org/jira/browse/ARROW-15604 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > The error is a heap-use-after-free and involves an OpenTracing structure that > was deleted by an atexit hook. > https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 > Summary: > {code} > Atomic write of size 4 at 0x7b08000136a8 by thread T2: > [...] > #10 > opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 > (libarrow.so.800+0x1e62ef7) > #11 > opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 > (libarrow.so.800+0x1e70178) > #12 opentelemetry::v1::context::Token::~Token() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 > (libarrow.so.800+0x1e7012f) > [...] > {code} > {code} > Previous write of size 8 at 0x7b08000136a8 by main thread: > #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) > [...] > #7 > opentelemetry::v1::nostd::shared_ptr::~shared_ptr() > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 > (libarrow.so.800+0x1e62fb3) > #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488347#comment-17488347 ] David Li commented on ARROW-15604: -- Hmm, I think I ran into something similar when I was working on my PR. https://github.com/apache/arrow/pull/11964#issuecomment-995043666 CC [~mbrobbel] Should we disable OT in CI for now? > [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing > --- > > Key: ARROW-15604 > URL: https://issues.apache.org/jira/browse/ARROW-15604 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > The error is a heap-use-after-free and involves an OpenTracing structure that > was deleted by an atexit hook. > https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 > Summary: > {code} > Atomic write of size 4 at 0x7b08000136a8 by thread T2: > [...] > #10 > opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 > (libarrow.so.800+0x1e62ef7) > #11 > opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 > (libarrow.so.800+0x1e70178) > #12 opentelemetry::v1::context::Token::~Token() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 > (libarrow.so.800+0x1e7012f) > [...] > {code} > {code} > Previous write of size 8 at 0x7b08000136a8 by main thread: > #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) > [...] 
> #7 > opentelemetry::v1::nostd::shared_ptr::~shared_ptr() > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 > (libarrow.so.800+0x1e62fb3) > #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15603) [C++] Clang 13 build fails on unused var
[ https://issues.apache.org/jira/browse/ARROW-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-15603. Resolution: Fixed Issue resolved by pull request 12359 [https://github.com/apache/arrow/pull/12359] > [C++] Clang 13 build fails on unused var > > > Key: ARROW-15603 > URL: https://issues.apache.org/jira/browse/ARROW-15603 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Minor > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Just a small issue. When I build with clang 13 I get the following error from > a unused var warning: > {code:java} > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15552) [Docs][Format] Unclear wording about base64 encoding requirement of metadata values
[ https://issues.apache.org/jira/browse/ARROW-15552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15552: --- Labels: pull-request-available (was: ) > [Docs][Format] Unclear wording about base64 encoding requirement of metadata > values > --- > > Key: ARROW-15552 > URL: https://issues.apache.org/jira/browse/ARROW-15552 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Format >Reporter: Joris Van den Bossche >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The C Data Interface docs indicate that the values in key-value metadata > should be base64 encoded, which is mentioned in the section about which > key-value metadata to use for extension types > (https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays): > bq. The base64 encoding of metadata values ensures that any possible > serialization is representable. > This might not be fully correct, though (or at least not required, which is > implied with the current wording). While a binary blob (like a serialized > schema) can be base64 encoded, as we do when putting the Arrow schema in the > Parquet metadata, this is not required? > cc [~apitrou] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15552) [Docs][Format] Unclear wording about base64 encoding requirement of metadata values
[ https://issues.apache.org/jira/browse/ARROW-15552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-15552: -- Assignee: Antoine Pitrou > [Docs][Format] Unclear wording about base64 encoding requirement of metadata > values > --- > > Key: ARROW-15552 > URL: https://issues.apache.org/jira/browse/ARROW-15552 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Format >Reporter: Joris Van den Bossche >Assignee: Antoine Pitrou >Priority: Major > > The C Data Interface docs indicate that the values in key-value metadata > should be base64 encoded, which is mentioned in the section about which > key-value metadata to use for extension types > (https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays): > bq. The base64 encoding of metadata values ensures that any possible > serialization is representable. > This might not be fully correct, though (or at least not required, which is > implied with the current wording). While a binary blob (like a serialized > schema) can be base64 encoded, as we do when putting the Arrow schema in the > Parquet metadata, this is not required? > cc [~apitrou] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15552) [Docs][Format] Unclear wording about base64 encoding requirement of metadata values
[ https://issues.apache.org/jira/browse/ARROW-15552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488330#comment-17488330 ] Antoine Pitrou commented on ARROW-15552: This is probably a leftover from a draft version of the C data interface where key-value metadata has a different encoding (perhaps JSON with base64-encoded values). > [Docs][Format] Unclear wording about base64 encoding requirement of metadata > values > --- > > Key: ARROW-15552 > URL: https://issues.apache.org/jira/browse/ARROW-15552 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Format >Reporter: Joris Van den Bossche >Priority: Major > > The C Data Interface docs indicate that the values in key-value metadata > should be base64 encoded, which is mentioned in the section about which > key-value metadata to use for extension types > (https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays): > bq. The base64 encoding of metadata values ensures that any possible > serialization is representable. > This might not be fully correct, though (or at least not required, which is > implied with the current wording). While a binary blob (like a serialized > schema) can be base64 encoded, as we do when putting the Arrow schema in the > Parquet metadata, this is not required? > cc [~apitrou] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15195) [MATLAB] Enable GitHub Actions CI for MATLAB Interface on macOS
[ https://issues.apache.org/jira/browse/ARROW-15195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fiona La reassigned ARROW-15195: Assignee: Sreehari Hegden > [MATLAB] Enable GitHub Actions CI for MATLAB Interface on macOS > --- > > Key: ARROW-15195 > URL: https://issues.apache.org/jira/browse/ARROW-15195 > Project: Apache Arrow > Issue Type: Task > Components: MATLAB >Reporter: Fiona La >Assignee: Sreehari Hegden >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Enable CI and Testing for MATLAB submissions to the [Apache Arrow > project|https://github.com/apache/arrow] on GitHub. This task can be worked > on after [MATLAB Actions|https://github.com/matlab-actions/setup-matlab] is > enabled on macOS. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488325#comment-17488325 ] Antoine Pitrou commented on ARROW-15604: So, basically, it seems using OpenTracing in an asynchronous setup where code may run after process teardown has started may be quite delicate. [~lidavidm] [~westonpace] > [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing > --- > > Key: ARROW-15604 > URL: https://issues.apache.org/jira/browse/ARROW-15604 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > The error is a heap-use-after-free and involves an OpenTracing structure that > was deleted by an atexit hook. > https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 > Summary: > {code} > Atomic write of size 4 at 0x7b08000136a8 by thread T2: > [...] > #10 > opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 > (libarrow.so.800+0x1e62ef7) > #11 > opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 > (libarrow.so.800+0x1e70178) > #12 opentelemetry::v1::context::Token::~Token() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 > (libarrow.so.800+0x1e7012f) > [...] > {code} > {code} > Previous write of size 8 at 0x7b08000136a8 by main thread: > #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) > [...] 
> #7 > opentelemetry::v1::nostd::shared_ptr::~shared_ptr() > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 > (libarrow.so.800+0x1e62fb3) > #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488321#comment-17488321 ] Antoine Pitrou commented on ARROW-15604: The "atexit hook" I mentioned simply seems to be a standard C++ exit hook that destroys global/static variables. Here the static singleton that's stored in {{RuntimeContext::GetStorage}}. > [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing > --- > > Key: ARROW-15604 > URL: https://issues.apache.org/jira/browse/ARROW-15604 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > The error is a heap-use-after-free and involves an OpenTracing structure that > was deleted by an atexit hook. > https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 > Summary: > {code} > Atomic write of size 4 at 0x7b08000136a8 by thread T2: > [...] > #10 > opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 > (libarrow.so.800+0x1e62ef7) > #11 > opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 > (libarrow.so.800+0x1e70178) > #12 opentelemetry::v1::context::Token::~Token() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 > (libarrow.so.800+0x1e7012f) > [...] > {code} > {code} > Previous write of size 8 at 0x7b08000136a8 by main thread: > #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) > [...] 
> #7 > opentelemetry::v1::nostd::shared_ptr::~shared_ptr() > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 > (libarrow.so.800+0x1e62fb3) > #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
Antoine Pitrou created ARROW-15604: -- Summary: [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing Key: ARROW-15604 URL: https://issues.apache.org/jira/browse/ARROW-15604 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Antoine Pitrou The error is a heap-use-after-free and involves an OpenTracing structure that was deleted by an atexit hook. https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 Summary: {code} Atomic write of size 4 at 0x7b08000136a8 by thread T2: [...] #10 opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 (libarrow.so.800+0x1e62ef7) #11 opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 (libarrow.so.800+0x1e70178) #12 opentelemetry::v1::context::Token::~Token() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 (libarrow.so.800+0x1e7012f) [...] {code} {code} Previous write of size 8 at 0x7b08000136a8 by main thread: #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) [...] #7 opentelemetry::v1::nostd::shared_ptr::~shared_ptr() /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 (libarrow.so.800+0x1e62fb3) #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15081) [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488306#comment-17488306 ] Dewey Dunnington commented on ARROW-15081: -- There was another user who reported an issue with count on a parquet file that seems to have been fixed in the development version (which is about to be released to CRAN). Perhaps ARROW-15201 is the same issue? If it is not, when I try to reproduce the above I get an error (see below). Is there a more recent bucket with the files we can use to reproduce? {code:R} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) server <- arrow::s3_bucket( "ebird", endpoint_override = "minio.cirrus.carlboettiger.info" ) path <- server$path("Oct-2021/observations") path$ls() #> Error: IOError: Path does not exist 'ebird/Oct-2021/observations' #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:1913 collector.Finish(this) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:2275 impl_->Walk(select, base_path.bucket, base_path.key, &results) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector) path <- server$path("partitioned") path$ls() #> Error: IOError: Path does not exist 'ebird/partitioned' #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:1913 collector.Finish(this) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:2275 impl_->Walk(select, base_path.bucket, base_path.key, &results) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector) 
{code} > [R][C++] Arrow crashes (OOM) on R client with large remote parquet files > > > Key: ARROW-15081 > URL: https://issues.apache.org/jira/browse/ARROW-15081 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Carl Boettiger >Assignee: Weston Pace >Priority: Major > > The below should be a reproducible crash: > {code:R} > library(arrow) > library(dplyr) > server <- arrow::s3_bucket("ebird", endpoint_override = > "minio.cirrus.carlboettiger.info") > path <- server$path("Oct-2021/observations") > obs <- arrow::open_dataset(path) > path$ls() # observe -- 1 parquet file > obs %>% count() # CRASH > obs %>% to_duckdb() # also crash{code} > I have attempted to split this large (~100 GB) parquet file into some smaller > files, which helps: > {code:R} > path <- server$path("partitioned") > obs <- arrow::open_dataset(path) > obs$ls() # observe, multiple parquet files now > obs %>% count() > {code} > (These parquet files have also been created by arrow, btw, from a single > large csv file provided by the original data provider (eBird). Unfortunately > generating the partitioned versions is cumbersome as the data is very > unevenly distributed; there are few columns that can avoid creating 1000s of > parquet partition files, and even so the bulk of the 1-billion rows fall > within the same group. But all the same I think this is a bug, as there's no > indication why arrow cannot handle a single 100GB parquet file.) > > Let me know if I can provide more info! I'm testing in R with latest CRAN > version of arrow on a machine with 200 GB RAM. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15563) [C++] Compilation failure on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15563: --- Summary: [C++] Compilation failure on s390x platform (was: Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform) > [C++] Compilation failure on s390x platform > --- > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. > -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > 
Using ld linker > Configured for DEBUG build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > -- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version 
"1.66.0", minimum > required is "1.58") found components: regex system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required by 'virtual:world', not found > -- Could NOT find Brotli (missing: BROTLI_COMMON_LIBRARY BROTLI_ENC_LIBRARY > BROTLI_DEC_LIBRARY BROTLI_INCLUDE_DIR) > -- Building brotli from source > -- Building without OpenS
[jira] [Commented] (ARROW-15563) Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488295#comment-17488295 ] Antoine Pitrou commented on ARROW-15563: [~mr.chandureddy] Have you tried with Arrow 6.0.0 or 7.0.0? > Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform > - > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. > -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > 
Using ld linker > Configured for DEBUG build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > -- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version 
"1.66.0", minimum > required is "1.58") found components: regex system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required by 'virtual:world', not found > -- Could NOT find Brotli (missing: BROTLI_COMMON_LIBRARY BROTLI_ENC_LIBRARY > BROTLI_DEC_LIBRARY BROTLI_INCLUDE_DIR) > -- Building bro
[jira] [Commented] (ARROW-15563) Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488294#comment-17488294 ] Antoine Pitrou commented on ARROW-15563: cc [~kiszk] > Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform > - > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major >
[jira] [Commented] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets
[ https://issues.apache.org/jira/browse/ARROW-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488293#comment-17488293 ] Antoine Pitrou commented on ARROW-15554: Hi [~sgilmore], thank you for the explanation. In any case, format additions have to be discussed and voted on via the development mailing list. I encourage you to create a new discussion there: see [https://arrow.apache.org/community/] > [Format][C++] Add "LargeMap" type with 64-bit offsets > - > > Key: ARROW-15554 > URL: https://issues.apache.org/jira/browse/ARROW-15554 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Sarah Gilmore >Priority: Major > > It would be nice if a "LargeMap" type existed alongside the "Map" type, for parity. For other datatypes that require offset arrays/buffers, such as > String, List, and BinaryArray, Arrow provides a "large" version of these types, i.e. > LargeString, LargeList, and LargeBinaryArray. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15016) [R] show_query() for an arrow_dplyr_query
[ https://issues.apache.org/jira/browse/ARROW-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488280#comment-17488280 ] Dewey Dunnington commented on ARROW-15016: -- It seems like a route here would be to implement a {{ToString()}} and {{print()}} as R6 methods here: https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/r/R/dplyr.R#L91 and here: https://github.com/apache/arrow/blob/master/r/src/compute-exec.cpp#L47 ...and add a {{show_dplyr_query()}} function here (maybe like this): {code:R} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) show_exec_plan <- function(.data) { adq <- arrow:::as_adq(.data) plan <- arrow:::ExecPlan$create() final_node <- plan$Build(.data) print(plan$ToString()) invisible(.data) } ggplot2::mpg %>% arrow_table() %>% filter(year > 2007) %>% show_exec_plan() #> Error in print(plan$ToString()): attempt to apply non-function {code} Maybe here: https://github.com/apache/arrow/blob/master/r/R/dplyr.R#L91 > [R] show_query() for an arrow_dplyr_query > - > > Key: ARROW-15016 > URL: https://issues.apache.org/jira/browse/ARROW-15016 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Priority: Major > Fix For: 8.0.0 > > > Now that we can print a query plan (ARROW-13785) we should wire this up in R > so we can see what execution plans are being put together for various queries > (like the TPC-H queries) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15603) [C++] Clang 13 build fails on unused var
[ https://issues.apache.org/jira/browse/ARROW-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15603: --- Labels: pull-request-available (was: ) > [C++] Clang 13 build fails on unused var > > > Key: ARROW-15603 > URL: https://issues.apache.org/jira/browse/ARROW-15603 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Minor > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Just a small issue. When I build with clang 13 I get the following error from > an unused var warning: > {code:java} > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15593) [C++] An unhandled race condition exists in ThreadPool
[ https://issues.apache.org/jira/browse/ARROW-15593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15593: --- Summary: [C++] An unhandled race condition exists in ThreadPool (was: An unhandled race condition exists in ThreadPool) > [C++] An unhandled race condition exists in ThreadPool > -- > > Key: ARROW-15593 > URL: https://issues.apache.org/jira/browse/ARROW-15593 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 > Environment: linux >Reporter: Huxley Hu >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > A race condition exists at the ThreadPool which may lead to the loss of > pending tasks after a process forks. > See this issue for more detail: https://github.com/apache/arrow/issues/12329 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15603) [C++] Clang 13 build fails on unused var
[ https://issues.apache.org/jira/browse/ARROW-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-15603: -- Assignee: Will Jones > [C++] Clang 13 build fails on unused var > > > Key: ARROW-15603 > URL: https://issues.apache.org/jira/browse/ARROW-15603 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Minor > Fix For: 8.0.0 > > > Just a small issue. When I build with clang 13 I get the following error from > an unused var warning: > {code:java} > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15603) [C++] Clang 13 build fails on unused var
Will Jones created ARROW-15603: -- Summary: [C++] Clang 13 build fails on unused var Key: ARROW-15603 URL: https://issues.apache.org/jira/browse/ARROW-15603 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 Just a small issue. When I build with clang 13 I get the following error from an unused var warning: {code:java} /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13: error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] int64_t n = 0; ^ /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13: error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] int64_t n = 0; ^ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14567) [C++][Python][R] PrettyPrint ignores timezone
[ https://issues.apache.org/jira/browse/ARROW-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488247#comment-17488247 ] Rok Mihevc commented on ARROW-14567: Agreed! Offset ("1970-01-01 02:00:00+02:00") per value plus timezone string () in the header would be great. > [C++][Python][R] PrettyPrint ignores timezone > - > > Key: ARROW-14567 > URL: https://issues.apache.org/jira/browse/ARROW-14567 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python, R >Reporter: Alenka Frim >Priority: Major > > When printing TimestampArray in pyarrow the timezone information is ignored > by PrettyPrint (__str__ calls to_string() in array.pxi). > {code:python} > import pyarrow as pa > a = pa.array([0], pa.timestamp('s', tz='+02:00')) > print(a) # representation not correct? > # > # [ > # 1970-01-01 00:00:00 > # ] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.
[ https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488237#comment-17488237 ] Joris Van den Bossche commented on ARROW-14488: --- bq. the conversion from empty Pandas series to pa.array is wrong in the case of a string dtype. The main problem is that the example code is not using a "string dtype". By default, pandas uses the generic "object" dtype to store strings. But this data type basically means that it can hold _any_ Python object. So it is not guaranteed to be strings (eg it could also be decimals, bytes, ..., for some python types that pyarrow also infers). As long as the array is not empty, the conversion to a pyarrow array will try to infer the appropriate type based on the values in the input array (eg in case of an object dtype array with strings, it will indeed convert that to a {{pa.string()}} type). But if the array is empty, there are no values to infer the type from. And that is the reason why pyarrow defaults to using the generic "null" data type for such an array (or column in a DataFrame). If you know that you have strings for a certain column, and want the pandas->pyarrow conversion to robustly work (regardless of having empty dataframes/arrays), the {{from_pandas}} method has a {{schema}} argument, and this way you can specify a schema to use (and so pyarrow will not try to infer the types based on the values in the array). You will have to construct this schema manually, though, in this case. > [Python] Incorrect inferred schema from pandas dataframe with length 0. 
> --- > > Key: ARROW-14488 > URL: https://issues.apache.org/jira/browse/ARROW-14488 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 5.0.0 > Environment: OS: Windows 10, CentOS 7 >Reporter: Yuan Zhou >Priority: Major > > We use pandas (with pyarrow engine) to write out parquet files and those > outputs will be consumed by other applications such as Java apps using > org.apache.parquet.hadoop.ParquetFileReader. We found that some empty > dataframes would get incorrect schema for string columns in other > applications. After some investigation, we narrow down the issue to the > schema inference by pyarrow: > {code:java} > In [1]: import pandas as pd > In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c']) > In [3]: import pyarrow as pa > In [4]: pa.Schema.from_pandas(df) > Out[4]: > a: string > b: int64 > c: double > -- schema metadata -- > pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + > 562 > In [5]: pa.Schema.from_pandas(df.head(0)) > Out[5]: > a: null > b: int64 > c: double > -- schema metadata -- > pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + > 560 > In [6]: pa.__version__ > Out[6]: '5.0.0' > {code} > As you can see, the column 'a', which should be string type, is inferred as > null type and is converted to int32 while writing to parquet files. > Is this an expected behavior? Or do we have any workaround for this issue? > Could anyone take a look, please? Thanks! -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14567) [C++][Python][R] PrettyPrint ignores timezone
[ https://issues.apache.org/jira/browse/ARROW-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488228#comment-17488228 ] Joris Van den Bossche commented on ARROW-14567: --- I agree with [~jonkeane] that IMO the least confusing display would be to use localized strings (with timezone offset indication, so like "1970-01-01 02:00:00+02:00"). Adding "Z" is certainly better than the current situation, but it still doesn't give a quick idea about the local time that the timestamp actually represents. > [C++][Python][R] PrettyPrint ignores timezone > - > > Key: ARROW-14567 > URL: https://issues.apache.org/jira/browse/ARROW-14567 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python, R >Reporter: Alenka Frim >Priority: Major > > When printing TimestampArray in pyarrow the timezone information is ignored > by PrettyPrint (__str__ calls to_string() in array.pxi). > {code:python} > import pyarrow as pa > a = pa.array([0], pa.timestamp('s', tz='+02:00')) > print(a) # representation not correct? > # > # [ > # 1970-01-01 00:00:00 > # ] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
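The display both commenters prefer can be sketched with the standard library (this is an illustration of the desired output, not Arrow's PrettyPrint code): the stored UTC instant localized to the array's timezone, with the offset shown per value.

```python
# Epoch value 0 in a timestamp('s', tz='+02:00') array: localized it
# reads 02:00+02:00, not the bare 00:00 that PrettyPrint currently shows.
from datetime import datetime, timedelta, timezone

tz = timezone(timedelta(hours=2))  # the "+02:00" from the example
localized = datetime.fromtimestamp(0, tz=tz).isoformat(sep=" ")
assert localized == "1970-01-01 02:00:00+02:00"
```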
[jira] [Updated] (ARROW-15215) [C++] Consolidate kernel data-copy utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward
[ https://issues.apache.org/jira/browse/ARROW-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15215: - Description: All six kernels use two sets of otherwise very similar kernel utilities for copying slices of an array into an output array. However, there's no reason they can't use the same utilities. The first set are here: "CopyFixedWidth" https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/cpp/src/arrow/compute/kernels/scalar_if_else.cc#L1282-L1284 The second set are here: "ReplaceWithMask::CopyData" https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/cpp/src/arrow/compute/kernels/vector_replace.cc#L208-L209 (This is a little confusing because the utilities are intertwined into the kernel implementation) They would need to be moved into a new header to share them between the codegen units. Also, their interfaces would need to be consolidated. Additionally, the utilities may be excessively verbose, or generate too much code for what they do. For instance, some of the utilities are templated out for every Arrow type. Instead, we could replace all instantiations for numbers, decimals, temporal types, and so on with a single one for FixedWidthType (an abstract base class). Care should be taken to evaluate the benchmarks for these kernels to ensure there is not a regression. was:All four kernels (and soon to be fill_null_forward/backward) make use of a set of very similar utilities for copying data between arrays; we should consolidate those into a single set of helpers instead of duplicating them, and consider whether they could be further consolidated (e.g. 
making use of the FixedWidthType hierarchy instead of specializing for every type) > [C++] Consolidate kernel data-copy utilities between replace_with_mask, > case_when, coalesce, choose, fill_null_forward, fill_null_backward > -- > > Key: ARROW-15215 > URL: https://issues.apache.org/jira/browse/ARROW-15215 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: Jabari Booker >Priority: Major > Labels: good-second-issue > > All six kernels use two sets of otherwise very similar kernel utilities for > copying slices of an array into an output array. However, there's no reason > they can't use the same utilities. > The first set are here: "CopyFixedWidth" > https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/cpp/src/arrow/compute/kernels/scalar_if_else.cc#L1282-L1284 > The second set are here: "ReplaceWithMask::CopyData" > https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/cpp/src/arrow/compute/kernels/vector_replace.cc#L208-L209 > (This is a little confusing because the utilities are intertwined into the > kernel implementation) > They would need to be moved into a new header to share them between the > codegen units. Also, their interfaces would need to be consolidated. > Additionally, the utilities may be excessively verbose, or generate too much > code for what they do. For instance, some of the utilities are templated out > for every Arrow type. Instead, we could replace all instantiations for > numbers, decimals, temporal types, and so on with a single one for > FixedWidthType (an abstract base class). Care should be taken to evaluate the > benchmarks for these kernels to ensure there is not a regression. -- This message was sent by Atlassian Jira (v8.20.1#820001)
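The consolidation the description proposes, one copy routine keyed on byte width serving every fixed-width type, can be modeled outside of C++. This is an illustrative sketch, not Arrow's actual kernel code:

```python
# Instead of one templated copy routine per concrete type, a single routine
# that only needs the value's byte width can serve ints, floats, decimals,
# temporal types, and any other fixed-width layout.
def copy_fixed_width(dst, dst_offset, src, src_offset, length, byte_width):
    """Copy `length` fixed-width values between raw byte buffers."""
    dst_start = dst_offset * byte_width
    src_start = src_offset * byte_width
    nbytes = length * byte_width
    dst[dst_start:dst_start + nbytes] = src[src_start:src_start + nbytes]

# int32-like values (byte_width=4): copy values at slots 1..2 of src into
# dst starting at slot 1.
src = bytearray(b"".join(v.to_bytes(4, "little") for v in (1, 2, 3)))
dst = bytearray(12)
copy_fixed_width(dst, 1, src, 1, 2, 4)
assert int.from_bytes(dst[4:8], "little") == 2
assert int.from_bytes(dst[8:12], "little") == 3
```

The C++ version would additionally handle validity bitmaps and boolean bit-packing, which is where the per-kernel duplication currently lives.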
[jira] [Updated] (ARROW-15215) [C++] Consolidate kernel data-copy utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward
[ https://issues.apache.org/jira/browse/ARROW-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15215: - Summary: [C++] Consolidate kernel data-copy utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward (was: [C++] Consolidate kernel utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward) > [C++] Consolidate kernel data-copy utilities between replace_with_mask, > case_when, coalesce, choose, fill_null_forward, fill_null_backward > -- > > Key: ARROW-15215 > URL: https://issues.apache.org/jira/browse/ARROW-15215 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: Jabari Booker >Priority: Major > Labels: good-second-issue > > All four kernels (and soon to be fill_null_forward/backward) make use of a > set of very similar utilities for copying data between arrays; we should > consolidate those into a single set of helpers instead of duplicating them, > and consider whether they could be further consolidated (e.g. making use of > the FixedWidthType hierarchy instead of specializing for every type) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15215) [C++] Consolidate kernel utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward
[ https://issues.apache.org/jira/browse/ARROW-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15215: - Summary: [C++] Consolidate kernel utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward (was: [C++] Consolidate kernel utilities between replace_with_mask, case_when, coalesce, choose) > [C++] Consolidate kernel utilities between replace_with_mask, case_when, > coalesce, choose, fill_null_forward, fill_null_backward > > > Key: ARROW-15215 > URL: https://issues.apache.org/jira/browse/ARROW-15215 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: Jabari Booker >Priority: Major > Labels: good-second-issue > > All four kernels (and soon to be fill_null_forward/backward) make use of a > set of very similar utilities for copying data between arrays; we should > consolidate those into a single set of helpers instead of duplicating them, > and consider whether they could be further consolidated (e.g. making use of > the FixedWidthType hierarchy instead of specializing for every type) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14082) [Python] Expose Arrow C++ Consumer API to pyarrow
[ https://issues.apache.org/jira/browse/ARROW-14082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-14082: -- Summary: [Python] Expose Arrow C++ Consumer API to pyarrow (was: Expose Arrow C++ Consumer API to pyarrow) > [Python] Expose Arrow C++ Consumer API to pyarrow > - > > Key: ARROW-14082 > URL: https://issues.apache.org/jira/browse/ARROW-14082 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Compute IR, Python >Reporter: Phillip Cloud >Assignee: Ben Kietzman >Priority: Major > > Once we have ARROW-14081, we need to add pyarrow bindings to allow use from > Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14082) Expose Arrow C++ Consumer API to pyarrow
[ https://issues.apache.org/jira/browse/ARROW-14082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-14082: -- Description: Once we have ARROW-14081, we need to add pyarrow bindings to allow use from Python. (was: Once we have https://issues.apache.org/jira/browse/ARROW-14081, we need to add pyarrow bindings to allow use from Python.) > Expose Arrow C++ Consumer API to pyarrow > > > Key: ARROW-14082 > URL: https://issues.apache.org/jira/browse/ARROW-14082 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Compute IR, Python >Reporter: Phillip Cloud >Assignee: Ben Kietzman >Priority: Major > > Once we have ARROW-14081, we need to add pyarrow bindings to allow use from > Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14292) [Python] Minimal ExecPlan to perform joins in pyarrow
[ https://issues.apache.org/jira/browse/ARROW-14292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462691#comment-17462691 ] Joris Van den Bossche edited comment on ARROW-14292 at 2/7/22, 4:11 PM: ARROW-14082 is probably a clone of this one was (Author: amol-): https://issues.apache.org/jira/browse/ARROW-14082 is probably a clone of this one > [Python] Minimal ExecPlan to perform joins in pyarrow > - > > Key: ARROW-14292 > URL: https://issues.apache.org/jira/browse/ARROW-14292 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Alessandro Molina >Assignee: Alessandro Molina >Priority: Major > Fix For: 8.0.0 > > > At the moment pyarrow doesn't provide any way to leverage the query execution > engine that the C++ layer provides. The goal is to allow a minimal > implementation (unexposed to end users) that permits to create an exec plan > with multiple inputs and that produces a single output. In between. It should > allow to inject as intermediate steps one of the nodes to perform data > manipulation. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-10643) [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe
[ https://issues.apache.org/jira/browse/ARROW-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-10643. --- Resolution: Fixed Issue resolved by pull request 12311 [https://github.com/apache/arrow/pull/12311] > [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty > dataframe > - > > Key: ARROW-10643 > URL: https://issues.apache.org/jira/browse/ARROW-10643 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Joris Van den Bossche >Assignee: Alenka Frim >Priority: Major > Labels: conversion, pandas, pull-request-available > Fix For: 8.0.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > From https://github.com/pandas-dev/pandas/issues/37897 > The roundtrip of an empty pandas.DataFrame _with_ and index (so no columns, > but a non-zero shape for the rows) isn't faithful: > {code} > In [33]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1)) > In [34]: df > Out[34]: > Empty DataFrame > Columns: [] > Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] > In [35]: df.shape > Out[35]: (10, 0) > In [36]: table = pa.table(df) > In [37]: table.to_pandas() > Out[37]: > Empty DataFrame > Columns: [] > Index: [] > In [38]: table.to_pandas().shape > Out[38]: (0, 0) > {code} > Since the pandas metadata in the Table actually have this RangeIndex > information: > {code} > In [39]: table.schema.pandas_metadata > Out[39]: > {'index_columns': [{'kind': 'range', >'name': None, >'start': 0, >'stop': 10, >'step': 1}], > 'column_indexes': [{'name': None, >'field_name': None, >'pandas_type': 'empty', >'numpy_type': 'object', >'metadata': None}], > 'columns': [], > 'creator': {'library': 'pyarrow', 'version': '3.0.0.dev162+g305160495'}, > 'pandas_version': '1.2.0.dev0+1225.g91f5bfcdc4'} > {code} > we should in principle be able to correctly roundtrip this case. -- This message was sent by Atlassian Jira (v8.20.1#820001)
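The pandas metadata shown above already carries everything needed to rebuild the index; a sketch with a hypothetical `rebuild_index` helper (using pandas directly, not Arrow's actual restoration path) makes that concrete:

```python
# The 'range' entry stored in table.schema.pandas_metadata is enough to
# reconstruct the original RangeIndex on to_pandas().
import pandas as pd

index_meta = {"kind": "range", "name": None, "start": 0, "stop": 10, "step": 1}

def rebuild_index(meta):
    if meta["kind"] == "range":
        return pd.RangeIndex(meta["start"], meta["stop"], meta["step"],
                             name=meta["name"])
    raise NotImplementedError(meta["kind"])

idx = rebuild_index(index_meta)
assert isinstance(idx, pd.RangeIndex)
assert len(idx) == 10
```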
[jira] [Updated] (ARROW-14817) [R] Implement bindings for lubridate::tz
[ https://issues.apache.org/jira/browse/ARROW-14817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14817: --- Labels: good-first-issue pull-request-available (was: good-first-issue) > [R] Implement bindings for lubridate::tz > > > Key: ARROW-14817 > URL: https://issues.apache.org/jira/browse/ARROW-14817 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: good-first-issue, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This can be achieved via strftime -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15602) [R] can't read timestamp with timezone from CSV (or other delimited) file
SHIMA Tatsuya created ARROW-15602: - Summary: [R] can't read timestamp with timezone from CSV (or other delimited) file Key: ARROW-15602 URL: https://issues.apache.org/jira/browse/ARROW-15602 Project: Apache Arrow Issue Type: Improvement Environment: R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Reporter: SHIMA Tatsuya The following values in a csv file can be read as timestamp by `pyarrow.csv.read_csv` and `readr::read_csv`, but not by `arrow::read_csv_arrow`. {code} "x" "2004-04-01T12:00+09:00" {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
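For reference, the value in the report is a valid ISO 8601 timestamp with a UTC offset; the standard library parses it (this illustrates what a correct parse should yield, independently of the Arrow CSV reader):

```python
# "2004-04-01T12:00+09:00" denotes noon at UTC+9, i.e. 03:00 UTC.
from datetime import datetime, timezone

ts = datetime.fromisoformat("2004-04-01T12:00+09:00")
assert ts.utcoffset().total_seconds() == 9 * 3600
assert ts.astimezone(timezone.utc) == datetime(2004, 4, 1, 3, 0,
                                               tzinfo=timezone.utc)
```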
[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488048#comment-17488048 ] Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/7/22, 3:08 PM: --- [~paleolimbot] I don't think we can rely on {{coalesce()}} to iterate through the various formats supported by {{ymd()}}. That would rely on the assumption that {{strptime()}} either parses the data with the passed {{format}} or fails. Sadly, arrow parses with a mismatched format and produces incorrect timestamps instead of failing:
{code:r}
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(lubridate))

df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03"))
df
#> # A tibble: 3 × 1
#>   x
#>   <chr>
#> 1 09-01-01
#> 2 09-01-02
#> 3 09-01-03

# lubridate::ymd()
df %>% mutate(y = ymd(x))
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <date>
#> 1 09-01-01 2009-01-01
#> 2 09-01-02 2009-01-02
#> 3 09-01-03 2009-01-03

# y = short year: correct
df %>%
  record_batch() %>%
  mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>%
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 2009-01-01 00:00:00
#> 2 09-01-02 2009-01-02 00:00:00
#> 3 09-01-03 2009-01-03 00:00:00

# Y = long year: this should fail in order for us to rely on coalesce()
df %>%
  record_batch() %>%
  mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>%
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 0008-12-31 23:58:45
#> 2 09-01-02 0009-01-01 23:58:45
#> 3 09-01-03 0009-01-02 23:58:45
{code}
Therefore, my early (and somewhat naive) conclusion would be that we cannot implement the {{arrow::ymd()}} binding as {{coalesce(strptime(x, format1), strptime(x, format2), ...)}}. What do you think?
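As an aside for readers of this thread: the coalesce-over-formats idea presupposes a parser that rejects a mismatched format. A minimal Python sketch using stdlib {{datetime.strptime}}, which does fail on mismatch, shows what the binding would look like if Arrow's {{strptime}} behaved the same way (the candidate format list is hypothetical):

```python
from datetime import datetime

# Hypothetical candidate formats; a real binding would derive these from ymd().
CANDIDATE_FORMATS = ["%Y-%m-%d", "%y-%m-%d", "%m-%d-%y"]

def parse_first_match(value, formats=CANDIDATE_FORMATS):
    """Mimic coalesce(strptime(x, f1), strptime(x, f2), ...).

    This relies on strptime raising ValueError on a mismatch. CPython's %Y
    requires four digits, so "09-01-01" falls through to the %y format; the
    comment above shows Arrow instead "succeeds" with a wrong value, which is
    exactly why this trick cannot be ported directly.
    """
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None  # the null that coalesce() would propagate

print(parse_first_match("09-01-01"))  # 2009-01-01 00:00:00
```

The sketch only works because the first, wrong format raises instead of returning year 9; that is the behavior the comment argues Arrow's kernel would need before a coalesce-based binding is viable.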
> [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488164#comment-17488164 ] Dragoș Moldovan-Grünfeld commented on ARROW-14471: -- {{lubridate}} has {{guess_formats()}} to identify the likely candidates. We could try something similar, where we have a list of supported formats (something similar to [this|https://github.com/dragosmg/arrow/blob/cfba9e1dfbedd5dfdf652c805e93692808dd092e/r/R/dplyr-funcs-datetime.R#L152-L196]), which we then narrow down to the most likely ones. Only then use something like {{{}coalesce(){}}}. > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
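The narrowing step described in the comment above can be sketched in stdlib Python (the candidate list and function name are illustrative, not lubridate's actual implementation):

```python
from datetime import datetime

# Illustrative candidates; lubridate::guess_formats() derives its candidates
# from the requested orders (e.g. "ymd") rather than a fixed list.
CANDIDATES = ["%y-%m-%d", "%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"]

def viable_formats(samples, candidates=CANDIDATES):
    """Keep only the formats that successfully parse every sample value.

    Running this narrowing first means a later coalesce() only has to choose
    among formats that are at least plausible for the column.
    """
    viable = []
    for fmt in candidates:
        try:
            for s in samples:
                datetime.strptime(s, fmt)
        except ValueError:
            continue  # one sample failed, so the format is not viable
        viable.append(fmt)
    return viable

print(viable_formats(["09-01-01", "09-01-02"]))  # ['%y-%m-%d']
```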
[jira] [Closed] (ARROW-1921) [Doc] Build API docs on a per-release basis
[ https://issues.apache.org/jira/browse/ARROW-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-1921. Assignee: (was: Krisztian Szucs) Resolution: Duplicate > [Doc] Build API docs on a per-release basis > --- > > Key: ARROW-1921 > URL: https://issues.apache.org/jira/browse/ARROW-1921 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Uwe Korn >Priority: Major > > Currently we build the docs from time to time manually from master. We should > also build them per release so that you can have a look at the latest > released API version. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-1921) [Doc] Build API docs on a per-release basis
[ https://issues.apache.org/jira/browse/ARROW-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488163#comment-17488163 ] Joris Van den Bossche commented on ARROW-1921: -- In the meantime, we have versioned docs for older versions, and we also have the latest dev version of the docs. Closing as a duplicate of ARROW-13260. > [Doc] Build API docs on a per-release basis > --- > > Key: ARROW-1921 > URL: https://issues.apache.org/jira/browse/ARROW-1921 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Uwe Korn >Assignee: Krisztian Szucs >Priority: Major > > Currently we build the docs from time to time manually from master. We should > also build them per release so that you can have a look at the latest > released API version. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-8533) [Release] Don't commit doctrees in the docs post release script
[ https://issues.apache.org/jira/browse/ARROW-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-8533. Resolution: Fixed > [Release] Don't commit doctrees in the docs post release script > --- > > Key: ARROW-8533 > URL: https://issues.apache.org/jira/browse/ARROW-8533 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Priority: Major > > A .gitignore file would be enough to prevent committing pickled binaries. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-8533) [Release] Don't commit doctrees in the docs post release script
[ https://issues.apache.org/jira/browse/ARROW-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488160#comment-17488160 ] Joris Van den Bossche commented on ARROW-8533: -- This is done in the meantime (see https://github.com/apache/arrow-site/blob/asf-site/docs/.gitignore) > [Release] Don't commit doctrees in the docs post release script > --- > > Key: ARROW-8533 > URL: https://issues.apache.org/jira/browse/ARROW-8533 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Priority: Major > > A .gitignore file would be enough to prevent committing pickled binaries. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15601) [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs
[ https://issues.apache.org/jira/browse/ARROW-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15601: --- Labels: pull-request-available (was: ) > [Docs][Release] Update post release script to move stable docs to versioned + > keep dev docs > --- > > Key: ARROW-15601 > URL: https://issues.apache.org/jira/browse/ARROW-15601 > Project: Apache Arrow > Issue Type: Sub-task > Components: Documentation >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0, 7.0.1 > > Time Spent: 10m > Remaining Estimate: 0h > > xref https://github.com/apache/arrow-site/pull/187 > We need to update the {{post-09-docs.sh}} script to keep the dev docs and to > move the current stable docs to a versioned sub-directory -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15601) [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs
[ https://issues.apache.org/jira/browse/ARROW-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-15601: - Assignee: Joris Van den Bossche > [Docs][Release] Update post release script to move stable docs to versioned + > keep dev docs > --- > > Key: ARROW-15601 > URL: https://issues.apache.org/jira/browse/ARROW-15601 > Project: Apache Arrow > Issue Type: Sub-task > Components: Documentation >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 8.0.0, 7.0.1 > > > xref https://github.com/apache/arrow-site/pull/187 > We need to update the {{post-09-docs.sh}} script to keep the dev docs and to > move the current stable docs to a versioned sub-directory -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15601) [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs
Joris Van den Bossche created ARROW-15601: - Summary: [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs Key: ARROW-15601 URL: https://issues.apache.org/jira/browse/ARROW-15601 Project: Apache Arrow Issue Type: Sub-task Components: Documentation Reporter: Joris Van den Bossche Fix For: 8.0.0, 7.0.1 xref https://github.com/apache/arrow-site/pull/187 We need to update the {{post-09-docs.sh}} script to keep the dev docs and to move the current stable docs to a versioned sub-directory -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15600) [C++][FlightRPC] Add a simple Flight SQL example
[ https://issues.apache.org/jira/browse/ARROW-15600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15600: --- Labels: pull-request-available (was: ) > [C++][FlightRPC] Add a simple Flight SQL example > > > Key: ARROW-15600 > URL: https://issues.apache.org/jira/browse/ARROW-15600 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15600) [C++][FlightRPC] Add a simple Flight SQL example
David Li created ARROW-15600: Summary: [C++][FlightRPC] Add a simple Flight SQL example Key: ARROW-15600 URL: https://issues.apache.org/jira/browse/ARROW-15600 Project: Apache Arrow Issue Type: Improvement Components: C++, FlightRPC Reporter: David Li Assignee: David Li -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15571) [C++] Add min/max/sqrt scalar kernels to execution engine
[ https://issues.apache.org/jira/browse/ARROW-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488129#comment-17488129 ] Yaron Gvili commented on ARROW-15571: - Since NaN values are plentiful in real data, I find returning an input NaN value more user-friendly than returning a null, which just drops information; dropping to null can always be done in a later step. However, if there is a convention that other operations conform to on this issue, then it should probably be followed. > [C++] Add min/max/sqrt scalar kernels to execution engine > - > > Key: ARROW-15571 > URL: https://issues.apache.org/jira/browse/ARROW-15571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The list of execution engine's scalar kernels currently available in > `cpp/src/arrow/compute/kernels/scalar_arithmetic.cc` does not cover the > common minimum, maximum, and square-root functions. -- This message was sent by Atlassian Jira (v8.20.1#820001)
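For readers skimming the thread, the two semantics under discussion can be sketched in stdlib Python (function names are illustrative, not Arrow's API; {{math.isnan}} stands in for the kernel's NaN check):

```python
import math

def min_propagate_nan(values):
    """NaN-propagating minimum: keep the input NaN instead of dropping information."""
    out = None
    for v in values:
        if math.isnan(v):
            return v  # an input NaN poisons the result
        if out is None or v < out:
            out = v
    return out

def min_skip_nan(values):
    """NaN-skipping minimum: NaNs are treated like nulls and ignored."""
    finite = [v for v in values if not math.isnan(v)]
    return min(finite) if finite else float("nan")

print(min_propagate_nan([3.0, float("nan"), 1.0]))  # nan
print(min_skip_nan([3.0, float("nan"), 1.0]))       # 1.0
```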
[jira] [Updated] (ARROW-15599) [R] can't explicitly convert a column as a sub-second timestamp from CSV (or other delimited) file
[ https://issues.apache.org/jira/browse/ARROW-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15599: -- Description: I tried to read the csv column type as timestamp, but I could only get it to work well when `col_types` was not specified. I'm sorry if I missed something and this is the expected behavior. (It would be great if you could add an example with `col_types` in the documentation.) {code:r} library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp t_string <- tibble::tibble( x = "2018-10-07 19:04:05.005" ) write_csv_arrow(t_string, "tmp.csv") read_csv_arrow( "tmp.csv", as_data_frame = FALSE ) #> Table #> 1 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "?", skip = 1, as_data_frame = FALSE ) #> Table #> 1 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", skip = 1, as_data_frame = FALSE ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005' read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE, skip = 1, timestamp_parsers = "%Y-%m-%d %H:%M:%S" ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005' {code} was: I tried to read the csv column type as timestamp, but I could only get it to work well when `col_types` was not specified. I'm sorry if I missed something and this is the expected behavior. (It would be great if you could add an example with `col_types` in the documentation.) 
{code:r} library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp t_string <- tibble::tibble( x = "2018-10-07 19:04:05" ) write_csv_arrow(t_string, "tmp.csv") read_csv_arrow( "tmp.csv", as_data_frame = FALSE ) #> Table #> 1 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "?", as_data_frame = FALSE ) #> Table #> 2 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value 'x' read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE, timestamp_parsers = "%Y-%m-%d %H:%M:%S" ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value 'x' {code} > [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or > other delimited) file > --- > > Key: ARROW-15599 > URL: https://issues.apache.org/jira/browse/ARROW-15599 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 6.0.1 > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Priority: Major > > I tried to read the csv column type as timestamp, but I could only get it to > work well when `col_types` was not specified. > I'm sorry if I missed something and this is the expected behavior. (It would > be great if you could add an example with `col_types` in the documentation.) 
> {code:r} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > t_string <- tibble::tibble( > x = "2018-10-07 19:04:05.005" > ) > write_csv_arrow(t_string, "tmp.csv") > read_csv_arrow( > "tmp.csv", > as_data_frame = FALSE > ) > #> Table > #> 1 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "?", > skip = 1, > as_data_frame = FALSE > ) > #> Table > #> 1 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > skip = 1, > as_data_frame = FALSE > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value '2018-10-07 19:04:05.005' > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > as_data_frame = FALSE, > skip = 1, > timestamp_parsers = "%Y-%m-%d %H:%M:%S" > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value '2018-10-07 19:04:05.005' > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
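The sub-second mismatch reported above has a stdlib-Python analogue, which makes the failure easy to reproduce outside Arrow (this sketches the parsing rule, not Arrow's CSV reader):

```python
from datetime import datetime

ts = "2018-10-07 19:04:05.005"

# A seconds-resolution format leaves the fractional part unconsumed, so the
# parse fails -- the analogue of "CSV conversion error to timestamp[s]":
try:
    datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
except ValueError as exc:
    print(exc)  # unconverted data remains: .005

# A format that consumes the fraction succeeds:
parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
print(parsed.microsecond)  # 5000
```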
[jira] [Updated] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14471: --- Labels: pull-request-available (was: ) > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15599) [R] can't explicitly convert a column as a sub-second timestamp from CSV (or other delimited) file
[ https://issues.apache.org/jira/browse/ARROW-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15599: -- Summary: [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or other delimited) file (was: [R] can't explicitly convert a column as a typestamp from CSV (or other delimited) file) > [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or > other delimited) file > --- > > Key: ARROW-15599 > URL: https://issues.apache.org/jira/browse/ARROW-15599 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 6.0.1 > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Priority: Major > > I tried to read the csv column type as timestamp, but I could only get it to > work well when `col_types` was not specified. > I'm sorry if I missed something and this is the expected behavior. (It would > be great if you could add an example with `col_types` in the documentation.) 
> {code:r} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > t_string <- tibble::tibble( > x = "2018-10-07 19:04:05" > ) > write_csv_arrow(t_string, "tmp.csv") > read_csv_arrow( > "tmp.csv", > as_data_frame = FALSE > ) > #> Table > #> 1 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "?", > as_data_frame = FALSE > ) > #> Table > #> 2 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > as_data_frame = FALSE > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value 'x' > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > as_data_frame = FALSE, > timestamp_parsers = "%Y-%m-%d %H:%M:%S" > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value 'x' > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15599) [R] can't explicitly convert a column as a timestamp from CSV (or other delimited) file
SHIMA Tatsuya created ARROW-15599: - Summary: [R] can't explicitly convert a column as a typestamp from CSV (or other delimited) file Key: ARROW-15599 URL: https://issues.apache.org/jira/browse/ARROW-15599 Project: Apache Arrow Issue Type: Bug Affects Versions: 6.0.1 Environment: R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Reporter: SHIMA Tatsuya I tried to read the csv column type as timestamp, but I could only get it to work well when `col_types` was not specified. I'm sorry if I missed something and this is the expected behavior. (It would be great if you could add an example with `col_types` in the documentation.) {code:r} library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp t_string <- tibble::tibble( x = "2018-10-07 19:04:05" ) write_csv_arrow(t_string, "tmp.csv") read_csv_arrow( "tmp.csv", as_data_frame = FALSE ) #> Table #> 1 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "?", as_data_frame = FALSE ) #> Table #> 2 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value 'x' read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE, timestamp_parsers = "%Y-%m-%d %H:%M:%S" ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value 'x' {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15547) Regression: Decimal type inference
[ https://issues.apache.org/jira/browse/ARROW-15547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488109#comment-17488109 ] Joris Van den Bossche edited comment on ARROW-15547 at 2/7/22, 1:30 PM: Can you provide a reproducible code example of the issue you encounter? With the data that you currently provided, the function works fine for me using pyarrow 6.0 (but there are no decimals in the resulting table, as it doesn't infer this type automatically from numbers):
{code}
In [3]: null = None

In [4]: data = [{"accounted_at": # data as provided above

In [6]: create_dataframe(data)
Out[6]:
pyarrow.Table
booked_by: string
invoice_recipient_id: string
created_at: string
due_date: string
lines: list<item: struct<amount: double, commission: double, commissionUnit: string, description: string, soldPrice: double, type: string>>
  child 0, item: struct<amount: double, commission: double, commissionUnit: string, description: string, soldPrice: double, type: string>
      child 0, amount: double
      child 1, commission: double
      child 2, commissionUnit: string
      child 3, description: string
      child 4, soldPrice: double
      child 5, type: string
deleted_at: null
internal_code: string
type: string
id: string
payment_term: string
franchise_id: string
teamleader_id: string
created_by: string
parent_id: null
sent_by: string
accounted_at: string
recipient_emails: null
booked_at: string
status: string
description: string
sent_at: string
{code}
> Regression: Decimal type inference > -- > > Key: ARROW-15547 > URL: https://issues.apache.org/jira/browse/ARROW-15547 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Charley Guillaume >Priority: Major > > While trying to ingest data with pyarrow 6.0.1 using this function:
> {code:java}
> def create_dataframe(list_dict: dict) -> pa.table:
>     fields = set()
>     for d in list_dict:
>         fields = fields.union(d.keys())
>     dataframe = pa.table({f: [row.get(f) for row in list_dict] for f in fields})
>     return dataframe {code}
> I had the following error:
> {code:java}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 7 does not fit into precision inferred from first array element: 8 {code}
> After downgrading to v4.0.1 the error was gone.
> The data looked like that : > {noformat} > [{"accounted_at": "2022-01-31T22:55:25.702000+00:00", "booked_at": > "2022-01-27T09:24:17.539000+00:00", "booked_by": > "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "created_at": > "2022-01-27T09:08:22.306000+00:00", "created_by": > "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "deleted_at": null, "description": > "description of the record", "due_date": "2022-02-10T00:00:00+00:00", > "franchise_id": "9a2858c4-5c71-43d3-b28f-2352de47ff9f", "id": > "ba3f6d3a-12f4-4d78-acc5-2e59ca384c1e", "internal_code": "A.2022 / 9", > "invoice_recipient_id": "7169cef9-9cb2-461f-a38f-a4d1ce3ca1c3", "lines": > [{"type": "property", "amount": 7800, "soldPrice": 26, "commission": 3, > "description": "Honoraires de l'agence", "commissionUnit": "PERCENT"}], > "parent_id": null, "payment_term": "14-days", "recipient_emails": null, > "sent_at": null, "sent_by": null, "status": "booked", "teamleader_id": > "xxx-yyy-www-zzz", "type": "out"}, {"accounted_at": null, "booked_at": > "2022-01-05T09:23:03.274000+00:00", "booked_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "created_at": > "2022-01-05T09:21:32.503000+00:00", "created_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "deleted_at": null, "description": > "Description content", "due_date": "2022-02-04T00:00:00+00:00", > "franchise_id": "929d47a3-c30f-404b-aaff-c96cff1bdd10", "id": > "828cd056-6aa7-4cea-9c94-ffa2db4498df", "internal_code": "BXC22 / 3", > "invoice_recipient_id": "5f90aa24-4c32-401d-927c-db9d4a9f90bf", "lines": > [{"type": "property", "amount": 92.55, "soldPrice": 3702.02, "commis
[jira] [Updated] (ARROW-15547) Regression: Decimal type inference
[ https://issues.apache.org/jira/browse/ARROW-15547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15547: -- Description: While trying to ingest data using pyarrow 6.0.1 using this function :{{{}{}}} {code:java} def create_dataframe(list_dict: dict) -> pa.table: fields = set() for d in list_dict: fields = fields.union(d.keys()) dataframe = pa.table({f: [row.get(f) for row in list_dict] for f in fields}) return dataframe {code} {{I had the following error: }} {code:java} pyarrow.lib.ArrowInvalid: Decimal type with precision 7 does not fit into precision inferred from first array element: 8 {code} After downgrading too v4.0.1 the error was gone. The data looked like that : {noformat} [{"accounted_at": "2022-01-31T22:55:25.702000+00:00", "booked_at": "2022-01-27T09:24:17.539000+00:00", "booked_by": "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "created_at": "2022-01-27T09:08:22.306000+00:00", "created_by": "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "deleted_at": null, "description": "description of the record", "due_date": "2022-02-10T00:00:00+00:00", "franchise_id": "9a2858c4-5c71-43d3-b28f-2352de47ff9f", "id": "ba3f6d3a-12f4-4d78-acc5-2e59ca384c1e", "internal_code": "A.2022 / 9", "invoice_recipient_id": "7169cef9-9cb2-461f-a38f-a4d1ce3ca1c3", "lines": [{"type": "property", "amount": 7800, "soldPrice": 26, "commission": 3, "description": "Honoraires de l'agence", "commissionUnit": "PERCENT"}], "parent_id": null, "payment_term": "14-days", "recipient_emails": null, "sent_at": null, "sent_by": null, "status": "booked", "teamleader_id": "xxx-yyy-www-zzz", "type": "out"}, {"accounted_at": null, "booked_at": "2022-01-05T09:23:03.274000+00:00", "booked_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "created_at": "2022-01-05T09:21:32.503000+00:00", "created_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "deleted_at": null, "description": "Description content", "due_date": "2022-02-04T00:00:00+00:00", "franchise_id": 
"929d47a3-c30f-404b-aaff-c96cff1bdd10", "id": "828cd056-6aa7-4cea-9c94-ffa2db4498df", "internal_code": "BXC22 / 3", "invoice_recipient_id": "5f90aa24-4c32-401d-927c-db9d4a9f90bf", "lines": [{"type": "property", "amount": 92.55, "soldPrice": 3702.02, "commission": 2.5, "description": "description2", "commissionUnit": "PERCENT"}], "parent_id": null, "payment_term": "30-days", "recipient_emails": null, "sent_at": "2022-01-05T09:27:34.077000+00:00", "sent_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "status": "credited", "teamleader_id": "xxx-yzyzy-zzz-www", "type": "out"}]{noformat} was: While trying to ingest data using pyarrow 6.0.1 using this function :{{{}{}}} {code:java} def create_dataframe(list_dict: dict) -> pa.table: fields = set() for d in list_dict: fields = fields.union(d.keys()) dataframe = pa.table({f: [row.get(f) for row in list_dict] for f in fields}) return dataframe {code} {{I had the following error: }} {code:java} pyarrow.lib.ArrowInvalid: Decimal type with precision 7 does not fit into precision inferred from first array element: 8 {code} {{}} {{After downgrading too v4.0.1 the error was gone.}} {{}} {{The data looked like that : }} {noformat} [{"accounted_at": "2022-01-31T22:55:25.702000+00:00", "booked_at": "2022-01-27T09:24:17.539000+00:00", "booked_by": "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "created_at": "2022-01-27T09:08:22.306000+00:00", "created_by": "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "deleted_at": null, "description": "description of the record", "due_date": "2022-02-10T00:00:00+00:00", "franchise_id": "9a2858c4-5c71-43d3-b28f-2352de47ff9f", "id": "ba3f6d3a-12f4-4d78-acc5-2e59ca384c1e", "internal_code": "A.2022 / 9", "invoice_recipient_id": "7169cef9-9cb2-461f-a38f-a4d1ce3ca1c3", "lines": [{"type": "property", "amount": 7800, "soldPrice": 26, "commission": 3, "description": "Honoraires de l'agence", "commissionUnit": "PERCENT"}], "parent_id": null, "payment_term": "14-days", "recipient_emails": null, "sent_at": null, "sent_by": 
null, "status": "booked", "teamleader_id": "xxx-yyy-www-zzz", "type": "out"}, {"accounted_at": null, "booked_at": "2022-01-05T09:23:03.274000+00:00", "booked_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "created_at": "2022-01-05T09:21:32.503000+00:00", "created_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "deleted_at": null, "description": "Description content", "due_date": "2022-02-04T00:00:00+00:00", "franchise_id": "929d47a3-c30f-404b-aaff-c96cff1bdd10", "id": "828cd056-6aa7-4cea-9c94-ffa2db4498df", "internal_code": "BXC22 / 3", "invoice_recipient_id": "5f90aa24-4c32-401d-927c-db9d4a9f90bf", "lines": [{"type": "property", "amount": 92.55, "soldPrice": 3702.02, "commission": 2.5, "description": "description2", "commissionUnit": "PERCENT"}], "parent_id": null, "payment_term": "30-days", "recipient_emails": null, "sent_at":
[jira] [Commented] (ARROW-15547) Regression: Decimal type inference
[ https://issues.apache.org/jira/browse/ARROW-15547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488109#comment-17488109 ] Joris Van den Bossche commented on ARROW-15547: --- Can you provide a reproducible code example of the issue you encounter? With the data that you currently provided, the function works fine for me (but there are no decimals in the resulting table, as it doesn't infer this type automatically from numbers):
{code}
In [3]: null = None

In [4]: data = [{"accounted_at": # data as provided above

In [6]: create_dataframe(data)
Out[6]:
pyarrow.Table
booked_by: string
invoice_recipient_id: string
created_at: string
due_date: string
lines: list<item: struct<amount: double, commission: double, commissionUnit: string, description: string, soldPrice: double, type: string>>
  child 0, item: struct<amount: double, commission: double, commissionUnit: string, description: string, soldPrice: double, type: string>
      child 0, amount: double
      child 1, commission: double
      child 2, commissionUnit: string
      child 3, description: string
      child 4, soldPrice: double
      child 5, type: string
deleted_at: null
internal_code: string
type: string
id: string
payment_term: string
franchise_id: string
teamleader_id: string
created_by: string
parent_id: null
sent_by: string
accounted_at: string
recipient_emails: null
booked_at: string
status: string
description: string
sent_at: string
{code}
> Regression: Decimal type inference > -- > > Key: ARROW-15547 > URL: https://issues.apache.org/jira/browse/ARROW-15547 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Charley Guillaume >Priority: Major > > While trying to ingest data with pyarrow 6.0.1 using this function:
> {code:java}
> def create_dataframe(list_dict: dict) -> pa.table:
>     fields = set()
>     for d in list_dict:
>         fields = fields.union(d.keys())
>     dataframe = pa.table({f: [row.get(f) for row in list_dict] for f in fields})
>     return dataframe {code}
> I had the following error:
> {code:java}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 7 does not fit into precision inferred from first array element: 8 {code}
> {{After downgrading to v4.0.1 the
error was gone.}} > {{}} > {{The data looked like that : }} > {noformat} > [{"accounted_at": "2022-01-31T22:55:25.702000+00:00", "booked_at": > "2022-01-27T09:24:17.539000+00:00", "booked_by": > "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "created_at": > "2022-01-27T09:08:22.306000+00:00", "created_by": > "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "deleted_at": null, "description": > "description of the record", "due_date": "2022-02-10T00:00:00+00:00", > "franchise_id": "9a2858c4-5c71-43d3-b28f-2352de47ff9f", "id": > "ba3f6d3a-12f4-4d78-acc5-2e59ca384c1e", "internal_code": "A.2022 / 9", > "invoice_recipient_id": "7169cef9-9cb2-461f-a38f-a4d1ce3ca1c3", "lines": > [{"type": "property", "amount": 7800, "soldPrice": 26, "commission": 3, > "description": "Honoraires de l'agence", "commissionUnit": "PERCENT"}], > "parent_id": null, "payment_term": "14-days", "recipient_emails": null, > "sent_at": null, "sent_by": null, "status": "booked", "teamleader_id": > "xxx-yyy-www-zzz", "type": "out"}, {"accounted_at": null, "booked_at": > "2022-01-05T09:23:03.274000+00:00", "booked_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "created_at": > "2022-01-05T09:21:32.503000+00:00", "created_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "deleted_at": null, "description": > "Description content", "due_date": "2022-02-04T00:00:00+00:00", > "franchise_id": "929d47a3-c30f-404b-aaff-c96cff1bdd10", "id": > "828cd056-6aa7-4cea-9c94-ffa2db4498df", "internal_code": "BXC22 / 3", > "invoice_recipient_id": "5f90aa24-4c32-401d-927c-db9d4a9f90bf", "lines": > [{"type": "property", "amount": 92.55, "soldPrice": 3702.02, "commission": > 2.5, "description": "description2", "commissionUnit": "PERCENT"}], > "parent_id": null, "payment_term": "30-days", "recipient_emails": null, > "sent_at": "2022-01-05T09:27:34.077000+00:00", "sent_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "status": "credited", > "teamleader_id": "xxx-yzyzy-zzz-www", "type": "out"}]{noformat} > {{}} > -- This message was sent by Atlassian 
Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15580) package does not include pytz dependency
[ https://issues.apache.org/jira/browse/ARROW-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488102#comment-17488102 ] Joris Van den Bossche commented on ARROW-15580: --- Actually, looking at the code, we do indeed seem to import pytz unconditionally and assume it to be present if you have timestamp data with a timezone; see e.g. https://github.com/apache/arrow/blob/4144c1739ec2e58d5f076fa63a0b61653324dc02/cpp/src/arrow/python/datetime.cc#L394 > package does not include pytz dependency > > > Key: ARROW-15580 > URL: https://issues.apache.org/jira/browse/ARROW-15580 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Affects Versions: 7.0.0 >Reporter: Mark Wood >Priority: Minor > > When installing pyarrow via poetry and executing it, a ModuleNotFoundError > can result. > Pyarrow has a dependency on pytz, but this dependency is not declared (the > only dependency I see declared is numpy, see > [https://github.com/apache/arrow/blob/47d55752172af99fce629f8fe6177df6b41af1d3/python/setup.py#L576)] > It would be helpful if pyarrow could declare its dependencies in such a way > that tools such as poetry could automatically ensure they were present. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15598) [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated gandiva IR code
[ https://issues.apache.org/jira/browse/ARROW-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Projjal Chanda reassigned ARROW-15598: -- Assignee: Projjal Chanda > [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated > gandiva IR code > --- > > Key: ARROW-15598 > URL: https://issues.apache.org/jira/browse/ARROW-15598 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15580) package does not include pytz dependency
[ https://issues.apache.org/jira/browse/ARROW-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488096#comment-17488096 ] Joris Van den Bossche commented on ARROW-15580: --- Can you show the error you get? (and with which code snippet you get it) I would assume that pytz is only an optional dependency (i.e. only used when installed), which doesn't require adding it to setup.py install_requires. And if we accidentally depend on it unconditionally, that sounds like a bug. > package does not include pytz dependency > > > Key: ARROW-15580 > URL: https://issues.apache.org/jira/browse/ARROW-15580 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Affects Versions: 7.0.0 >Reporter: Mark Wood >Priority: Minor > > When installing pyarrow via poetry and executing it, a ModuleNotFoundError > can result. > Pyarrow has a dependency on pytz, but this dependency is not declared (the > only dependency I see declared is numpy, see > [https://github.com/apache/arrow/blob/47d55752172af99fce629f8fe6177df6b41af1d3/python/setup.py#L576)] > It would be helpful if pyarrow could declare its dependencies in such a way > that tools such as poetry could automatically ensure they were present. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
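A minimal sketch of the lazy-import pattern Joris describes here, using a hypothetical `optional_import` helper. This is illustrative only, not pyarrow's actual code; the point is that an optional dependency is imported only when the feature needing it is used, with a clear error otherwise:

```python
import importlib

def optional_import(name: str, feature: str):
    """Import an optional dependency lazily, failing with a clear message.

    Only called on the code path that actually needs the dependency, so
    the package's install_requires need not list it.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"'{name}' is required for {feature}; please install it separately"
        ) from exc

# Hypothetical usage on a timezone-handling code path:
#   pytz = optional_import("pytz", "timezone-aware timestamps")
```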
[jira] [Commented] (ARROW-15571) [C++] Add min/max/sqrt scalar kernels to execution engine
[ https://issues.apache.org/jira/browse/ARROW-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488098#comment-17488098 ] David Li commented on ARROW-15571: -- There are no operations that return NaN instead of null in this case. It could perhaps be added as an option (that would also be a separate JIRA) > [C++] Add min/max/sqrt scalar kernels to execution engine > - > > Key: ARROW-15571 > URL: https://issues.apache.org/jira/browse/ARROW-15571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The list of execution engine's scalar kernels currently available in > `cpp/src/arrow/compute/kernels/scalar_arithmetic.cc` does not cover the > common minimum, maximum, and square-root functions. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15598) [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated gandiva IR code
[ https://issues.apache.org/jira/browse/ARROW-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15598: --- Labels: pull-request-available (was: ) > [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated > gandiva IR code > --- > > Key: ARROW-15598 > URL: https://issues.apache.org/jira/browse/ARROW-15598 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Projjal Chanda >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-13993) [C++] Hash aggregate function that returns value from first row in group
[ https://issues.apache.org/jira/browse/ARROW-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488095#comment-17488095 ] David Li edited comment on ARROW-13993 at 2/7/22, 1:17 PM: --- Yes, we just want a single row per group. Any row will do; the point above is that we can't implement anything else (because the query engine currently lacks support for ordering, beyond sorting outputs at the very end). All hash_ kernels ("hash aggregate kernels") are in [{{hash_aggregate.cc}}|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/hash_aggregate.cc] and it will be very similar to the CountDistinct/Distinct implementation there. was (Author: lidavidm): Yes, we just want a single tuple. Any tuple will do; the point above is that we can't implement anything else (because the query engine currently lacks support for ordering, beyond sorting outputs at the very end). All hash_ kernels ("hash aggregate kernels") are in [{{hash_aggregate.cc}}|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/hash_aggregate.cc] and it will be very similar to the CountDistinct/Distinct implementation there. > [C++] Hash aggregate function that returns value from first row in group > > > Key: ARROW-13993 > URL: https://issues.apache.org/jira/browse/ARROW-13993 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Dhruv Vats >Priority: Major > Labels: good-second-issue, kernel > Fix For: 8.0.0 > > > It would be nice to have a hash aggregate function that returns the first > value of a column within each hash group. > If row order within groups is non-deterministic, then effectively this would > return one arbitrary value. This is a very computationally cheap operation. > This can be quite useful when querying a non-normalized table. 
For example if > you have a table with a {{country}} column and also a {{country_abbr}} column > and you want to group by either/both of those columns but return the values > from both columns, you could do > {code:java} > SELECT country, country_abbr FROM table GROUP BY country, country_abbr{code} > but it would be more efficient to do > {code:java} > SELECT country, first(country_abbr) FROM table GROUP BY country{code} > because then the engine does not need to scan all the values of the > {{country_abbr}} column. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15598) [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated gandiva IR code
Projjal Chanda created ARROW-15598: -- Summary: [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated gandiva IR code Key: ARROW-15598 URL: https://issues.apache.org/jira/browse/ARROW-15598 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Projjal Chanda -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13993) [C++] Hash aggregate function that returns value from first row in group
[ https://issues.apache.org/jira/browse/ARROW-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488095#comment-17488095 ] David Li commented on ARROW-13993: -- Yes, we just want a single tuple. Any tuple will do; the point above is that we can't implement anything else (because the query engine currently lacks support for ordering, beyond sorting outputs at the very end). All hash_ kernels ("hash aggregate kernels") are in [{{hash_aggregate.cc}}|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/hash_aggregate.cc] and it will be very similar to the CountDistinct/Distinct implementation there. > [C++] Hash aggregate function that returns value from first row in group > > > Key: ARROW-13993 > URL: https://issues.apache.org/jira/browse/ARROW-13993 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Dhruv Vats >Priority: Major > Labels: good-second-issue, kernel > Fix For: 8.0.0 > > > It would be nice to have a hash aggregate function that returns the first > value of a column within each hash group. > If row order within groups is non-deterministic, then effectively this would > return one arbitrary value. This is a very computationally cheap operation. > This can be quite useful when querying a non-normalized table. For example if > you have a table with a {{country}} column and also a {{country_abbr}} column > and you want to group by either/both of those columns but return the values > from both columns, you could do > {code:java} > SELECT country, country_abbr FROM table GROUP BY country, country_abbr{code} > but it would be more efficient to do > {code:java} > SELECT country, first(country_abbr) FROM table GROUP BY country{code} > because then the engine does not need to scan all the values of the > {{country_abbr}} column. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13993) [C++] Hash aggregate function that returns value from first row in group
[ https://issues.apache.org/jira/browse/ARROW-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488089#comment-17488089 ] Dhruv Vats commented on ARROW-13993: Just so I understand this correctly (as I don't have a very formal CS background), when we do: {code:sql} SELECT country, COUNT(customerID) FROM db_table GROUP BY country{code} from a supposed sales table {{db_table}} that has fields {{country}} and {{customerID}}, we get the number of customers _per_ country/group. So here, instead of aggregating all tuples in a group, we just want to return a single tuple from each group/country? And, it seems _which_ tuple (like either the first or a specific one) to return is yet to be finalised, right? Also, is there a PR or an existing kernel that has similar boilerplate code to what this will have? (That'll save a disproportionate amount of time going through all the abstractions.) > [C++] Hash aggregate function that returns value from first row in group > > > Key: ARROW-13993 > URL: https://issues.apache.org/jira/browse/ARROW-13993 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Dhruv Vats >Priority: Major > Labels: good-second-issue, kernel > Fix For: 8.0.0 > > > It would be nice to have a hash aggregate function that returns the first > value of a column within each hash group. > If row order within groups is non-deterministic, then effectively this would > return one arbitrary value. This is a very computationally cheap operation. > This can be quite useful when querying a non-normalized table. 
For example if > you have a table with a {{country}} column and also a {{country_abbr}} column > and you want to group by either/both of those columns but return the values > from both columns, you could do > {code:java} > SELECT country, country_abbr FROM table GROUP BY country, country_abbr{code} > but it would be more efficient to do > {code:java} > SELECT country, first(country_abbr) FROM table GROUP BY country{code} > because then the engine does not need to scan all the values of the > {{country_abbr}} column. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488059#comment-17488059 ] Dragoș Moldovan-Grünfeld commented on ARROW-14471: -- Another alternative would be for {{strptime}} to error when the selected format does not match the data (for example, attempting to parse {{"09-12-31"}} with {{"%Y-%m-%d"}} should error due to a mismatch in the length of the year). Then we could rely on this behaviour with {{coalesce}}. Should I create a ticket for this? > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488050#comment-17488050 ] Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/7/22, 11:41 AM: We could do some processing to figure out how many characters we have (in the string to be parsed) in between the separators (or how many characters we have in total, in the cases where we have no separator) and only try with the suitable formats, i.e. in the example above, not try to parse with {{%Y}}, only {{%y}}. was (Author: dragosmg): We could do some processing to figure out how many characters we have in between the separators (or how many characters we have in total, in the cases where we have no separator) and only try with the suitable formats, i.e. in the example above, not try to parse with {{%Y}}, only {{%y}}. > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488050#comment-17488050 ] Dragoș Moldovan-Grünfeld commented on ARROW-14471: -- We could do some processing to figure out how many characters we have in between the separators (or how many characters we have in total, in the cases where we have no separator) and only try with the suitable formats, i.e. in the example above, not try to parse with {{%Y}}, only {{%y}}. > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488048#comment-17488048 ] Dragoș Moldovan-Grünfeld commented on ARROW-14471: -- [~paleolimbot] I don't think we can rely on {{coalesce()}} to iterate through the various formats supported for {{ymd()}}. It would need to rely on the assumption that the passed {{format}} matches the data or otherwise fail. Sadly, arrow works with a wrong format resulting in weird timestamps: {code:r} suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(arrow)) suppressPackageStartupMessages(library(lubridate)) df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03")) df #> # A tibble: 3 × 1 #> x #> #> 1 09-01-01 #> 2 09-01-02 #> 3 09-01-03 # lubridate::ymd() df %>% mutate(y = ymd(x)) #> # A tibble: 3 × 2 #> xy #> #> 1 09-01-01 2009-01-01 #> 2 09-01-02 2009-01-02 #> 3 09-01-03 2009-01-03 # y = short year correct df %>% record_batch() %>% mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>% collect() #> # A tibble: 3 × 2 #> xy #> #> 1 09-01-01 2009-01-01 00:00:00 #> 2 09-01-02 2009-01-02 00:00:00 #> 3 09-01-03 2009-01-03 00:00:00 # Y = long year this should fail in order for us to rely on coalesce df %>% record_batch() %>% mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>% collect() #> # A tibble: 3 × 2 #> xy #> #> 1 09-01-01 0008-12-31 23:58:45 #> 2 09-01-02 0009-01-01 23:58:45 #> 3 09-01-03 0009-01-02 23:58:45 {code} Therefore, my conclusion would be that we cannot implement {{arrow::ymd()}} binding as {{coalesce(strptime(x, format1), strptime(x, format2), ...)}}. What do you think? 
> [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15591) [C++] Add support for aggregation to the Substrait consumer
[ https://issues.apache.org/jira/browse/ARROW-15591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-15591: Assignee: Vibhatha Lakmal Abeykoon > [C++] Add support for aggregation to the Substrait consumer > --- > > Key: ARROW-15591 > URL: https://issues.apache.org/jira/browse/ARROW-15591 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: substrait > > The streaming execution engine supports aggregation (i.e. group by). The > Substrait consumer does not currently consume aggregation relations. We > should add support for this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15590) [C++] Add support for joins to the Substrait consumer
[ https://issues.apache.org/jira/browse/ARROW-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-15590: Assignee: Vibhatha Lakmal Abeykoon > [C++] Add support for joins to the Substrait consumer > - > > Key: ARROW-15590 > URL: https://issues.apache.org/jira/browse/ARROW-15590 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: substrait > > The streaming execution engine supports joins. The Substrait consumer does > not currently consume joins. We should add support for this. We may want to > split this PR into subtasks as there are many different kinds of joins and we > may not support all of them immediately. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15255) [C++][Developer Tools] Create native ubuntu-lint container for ARM
[ https://issues.apache.org/jira/browse/ARROW-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487940#comment-17487940 ] Yibo Cai commented on ARROW-15255: -- One blocking issue is that we use {{hadolint}}, but it has no arm64 docker image. [1] There's an open PR to support building and publishing an {{hadolint}} arm64 image, but it does not look active now. [2] [1] https://github.com/apache/arrow/blob/master/ci/docker/linux-apt-lint.dockerfile#L19 [2] https://github.com/hadolint/hadolint/pull/694 > [C++][Developer Tools] Create native ubuntu-lint container for ARM > -- > > Key: ARROW-15255 > URL: https://issues.apache.org/jira/browse/ARROW-15255 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Developer Tools >Reporter: David Li >Priority: Major > > It apparently runs via emulation, but would presumably be faster if a native > version were available. That said, I don't actually have an ARM machine to > experience this. -- This message was sent by Atlassian Jira (v8.20.1#820001)