[jira] [Assigned] (ARROW-14893) [C++] Allow creating GCS filesystem from URI
[ https://issues.apache.org/jira/browse/ARROW-14893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-14893: --- Assignee: Micah Kornfield > [C++] Allow creating GCS filesystem from URI > > > Key: ARROW-14893 > URL: https://issues.apache.org/jira/browse/ARROW-14893 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Antoine Pitrou >Assignee: Micah Kornfield >Priority: Major > > Similarly to what already exists for S3. See {{FileSystemFromUri}} and > {{S3Options::FromUri}}. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-14892) [Python] Add bindings for GCS filesystem
[ https://issues.apache.org/jira/browse/ARROW-14892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Micah Kornfield reassigned ARROW-14892: --- Assignee: Micah Kornfield > [Python] Add bindings for GCS filesystem > > > Key: ARROW-14892 > URL: https://issues.apache.org/jira/browse/ARROW-14892 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Antoine Pitrou >Assignee: Micah Kornfield >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15492) [Python] handle timestamp type in parquet file for compatibility with older HiveQL
[ https://issues.apache.org/jira/browse/ARROW-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488603#comment-17488603 ] Micah Kornfield commented on ARROW-15492: - So this looks like an oversight with int96. The logical type with isAdjustedToUtc isn't accounted for when making the [arrow type for int96|https://github.com/apache/arrow/blob/85f192a45755b3f15653fdc0a8fbd788086e125f/cpp/src/parquet/arrow/schema_internal.cc#L197]. It is used for [int64|#L197]. [~amznero] would you be interested in contributing a fix for this? > [Python] handle timestamp type in parquet file for compatibility with older > HiveQL > -- > > Key: ARROW-15492 > URL: https://issues.apache.org/jira/browse/ARROW-15492 > Project: Apache Arrow > Issue Type: New Feature >Affects Versions: 6.0.1 >Reporter: nero >Priority: Major > > Hi there, > I face an issue when I write a parquet file by PyArrow. > In the older version of Hive, it can only recognize the timestamp type stored > in INT96, so I use table.write_to_data with `use_deprecated > timestamp_int96_timestamps=True` option to save the parquet file. But the > HiveQL will skip conversion when the metadata of parquet file is not > created_by "parquet-mr". > [hive/ParquetRecordReaderBase.java at > f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive > (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139] > > So I have to save the timestamp columns with timezone info(pad to UTC+8). > But when pyarrow.parquet read from a dir which contains parquets created by > both PyArrow and parquet-mr, Arrow.Table will ignore the timezone info for > parquet-mr files. > > Maybe PyArrow can expose the created_by option in pyarrow({*}prefer{*}, > parquet::WriterProperties::created_by is available in the C++ ). > Or handle the timestamp type with timezone which files created by parquet-mr? 
> > Maybe related to https://issues.apache.org/jira/browse/ARROW-14422 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15492) [Python] handle timestamp type in parquet file for compatibility with older HiveQL
[ https://issues.apache.org/jira/browse/ARROW-15492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488578#comment-17488578 ] nero commented on ARROW-15492: -- [~emkornfield] In Parquet format, there is a flag named "isAdjustedToUTC" to indicate whether the timestamp type is local timezone or UTC. Ref: [parquet-format/LogicalTypes.md at master · apache/parquet-format (github.com)|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc] > [Python] handle timestamp type in parquet file for compatibility with older > HiveQL > -- > > Key: ARROW-15492 > URL: https://issues.apache.org/jira/browse/ARROW-15492 > Project: Apache Arrow > Issue Type: New Feature >Affects Versions: 6.0.1 >Reporter: nero >Priority: Major > > Hi there, > I face an issue when I write a parquet file by PyArrow. > In the older version of Hive, it can only recognize the timestamp type stored > in INT96, so I use table.write_to_data with `use_deprecated > timestamp_int96_timestamps=True` option to save the parquet file. But the > HiveQL will skip conversion when the metadata of parquet file is not > created_by "parquet-mr". > [hive/ParquetRecordReaderBase.java at > f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive > (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139] > > So I have to save the timestamp columns with timezone info(pad to UTC+8). > But when pyarrow.parquet read from a dir which contains parquets created by > both PyArrow and parquet-mr, Arrow.Table will ignore the timezone info for > parquet-mr files. > > Maybe PyArrow can expose the created_by option in pyarrow({*}prefer{*}, > parquet::WriterProperties::created_by is available in the C++ ). > Or handle the timestamp type with timezone which files created by parquet-mr? 
> > Maybe related to https://issues.apache.org/jira/browse/ARROW-14422 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15593) [C++] An unhandled race condition exists in ThreadPool
[ https://issues.apache.org/jira/browse/ARROW-15593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai resolved ARROW-15593. -- Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12358 [https://github.com/apache/arrow/pull/12358] > [C++] An unhandled race condition exists in ThreadPool > -- > > Key: ARROW-15593 > URL: https://issues.apache.org/jira/browse/ARROW-15593 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 > Environment: linux >Reporter: Huxley Hu >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > A race condition exists at the ThreadPool which may lead to the loss of > pending tasks after a process forks. > See this issue for more detail: https://github.com/apache/arrow/issues/12329 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15607) [C++] Fix incorrect CPUID flag for AVX detection
[ https://issues.apache.org/jira/browse/ARROW-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yibo Cai resolved ARROW-15607. -- Resolution: Fixed Issue resolved by pull request 12347 [https://github.com/apache/arrow/pull/12347] > [C++] Fix incorrect CPUID flag for AVX detection > > > Key: ARROW-15607 > URL: https://issues.apache.org/jira/browse/ARROW-15607 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Yibo Cai >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/apache/arrow/pull/12347 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15607) [C++] Fix incorrect CPUID flag for AVX detection
[ https://issues.apache.org/jira/browse/ARROW-15607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15607: --- Labels: pull-request-available (was: ) > [C++] Fix incorrect CPUID flag for AVX detection > > > Key: ARROW-15607 > URL: https://issues.apache.org/jira/browse/ARROW-15607 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Yibo Cai >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > https://github.com/apache/arrow/pull/12347 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15607) [C++] Fix incorrect CPUID flag for AVX detection
Yibo Cai created ARROW-15607: Summary: [C++] Fix incorrect CPUID flag for AVX detection Key: ARROW-15607 URL: https://issues.apache.org/jira/browse/ARROW-15607 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Reporter: Yibo Cai Fix For: 8.0.0 https://github.com/apache/arrow/pull/12347 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488548#comment-17488548 ] Weston Pace commented on ARROW-15604:
-

I ran into bugs like this before. I don't think the cause is really OT but it seems to increase the likelihood of failure. Basically we have async tasks that do something like...
* Run task
* Mark future finished with result (at this point the main thread is free to exit and start shutdown)
* Cleanup task

If anything in the Cleanup task accesses global state we could get this error. In the past the problem was that a task was accessing the default memory pool in its cleanup (I don't recall why). A short term fix is to update the test so it isn't using the eternal thread pool or to call WaitForIdle on the CPU thread pool, but these feel more like hacks than real fixes, as a real customer would still have a segfault at shutdown.

In this case it seems the cleanup step is doing something with OT (which makes perfect sense). I don't suppose there is any way to block the shutdown until the eternal thread pool is idle? It could probably be signal safe if we waited with a busy loop, but then I think you run the risk of shutdown delays.

> [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
> ---
>
> Key: ARROW-15604
> URL: https://issues.apache.org/jira/browse/ARROW-15604
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Continuous Integration
> Reporter: Antoine Pitrou
> Priority: Major
>
> The error is a heap-use-after-free and involves an OpenTracing structure that
> was deleted by an atexit hook.
> https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843
> Summary:
> {code}
> Atomic write of size 4 at 0x7b08000136a8 by thread T2:
> [...]
> #10 opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 (libarrow.so.800+0x1e62ef7)
> #11 opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 (libarrow.so.800+0x1e70178)
> #12 opentelemetry::v1::context::Token::~Token() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 (libarrow.so.800+0x1e7012f)
> [...]
> {code}
> {code}
> Previous write of size 8 at 0x7b08000136a8 by main thread:
> #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e)
> [...]
> #7 opentelemetry::v1::nostd::shared_ptr::~shared_ptr() /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 (libarrow.so.800+0x1e62fb3)
> #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f)
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
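The task lifecycle described in the comment can be sketched with stdlib Python (hypothetical names; this is not Arrow's actual ThreadPool): the worker fulfills the future first, then runs a cleanup step, and the waiter can proceed to "shutdown" while cleanup is still touching shared state — the window the sanitizer reports on.

```python
# Sketch of the race window: future is marked finished before cleanup runs,
# so the main thread may start tearing down global state during cleanup.
import threading
from concurrent.futures import Future

shared_state = {"alive": True}      # stands in for global state / atexit-managed data
cleanup_done = threading.Event()

def run_task(fut: Future) -> None:
    fut.set_result(42)              # 1) mark future finished; waiter may now exit
    # 2) cleanup still reads shared/global state afterwards; if the main
    #    thread has already destroyed it, this is the use-after-free.
    _ = shared_state["alive"]
    cleanup_done.set()

fut = Future()
t = threading.Thread(target=run_task, args=(fut,))
t.start()
result = fut.result()               # main thread unblocks here, before cleanup
cleanup_done.wait()                 # a real fix must block shutdown like this
t.join()
```

Waiting on `cleanup_done` before exiting is the moral equivalent of the "block shutdown until the pool is idle" idea floated above.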
[jira] [Commented] (ARROW-15563) [C++] Compilation failure on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488531#comment-17488531 ] Kazuaki Ishizaki commented on ARROW-15563: -- Sure, I will look at this > [C++] Compilation failure on s390x platform > --- > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. > -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > Using ld linker > Configured for DEBUG build (set with cmake > 
-DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > -- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version "1.66.0", minimum > required is "1.58") found components: regex 
system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required by 'virtual:world', not found > -- Could NOT find Brotli (missing: BROTLI_COMMON_LIBRARY BROTLI_ENC_LIBRARY > BROTLI_DEC_LIBRARY BROTLI_INCLUDE_DIR) > -- Building brotli from source > -- Building without OpenSSL support. Minimum OpenSSL version 1.0.2 required. > CM
[jira] [Updated] (ARROW-15468) [R] [CI] A crossbow job that tests against DuckDB's dev branch
[ https://issues.apache.org/jira/browse/ARROW-15468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15468: --- Labels: pull-request-available (was: ) > [R] [CI] A crossbow job that tests against DuckDB's dev branch > -- > > Key: ARROW-15468 > URL: https://issues.apache.org/jira/browse/ARROW-15468 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > It would be good to test against DuckDB's dev branch to warn us if there are > impending changes that break something. > While we're doing this, we should clean up the existing logic; currently some of our > jobs already do this: > https://github.com/apache/arrow/blob/f9f6fdbb7518c09b833cb6b78bc202008d28e865/ci/scripts/r_deps.sh#L45-L51 > > We should clean this up so that _generally_ builds use the released DuckDB, > but we can toggle dev DuckDB (and run a separate build that uses the dev > DuckDB optionally) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
[ https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15605. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12363 [https://github.com/apache/arrow/pull/12363] > [CI] [R] Keep using old macos runners on our autobrew CI job > > > Key: ARROW-15605 > URL: https://issues.apache.org/jira/browse/ARROW-15605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15606) [CI] [R] Add brew build that exercises the R package
[ https://issues.apache.org/jira/browse/ARROW-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15606: --- Labels: pull-request-available (was: ) > [CI] [R] Add brew build that exercises the R package > > > Key: ARROW-15606 > URL: https://issues.apache.org/jira/browse/ARROW-15606 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15606) [CI] [R] Add brew build that exercises the R package
[ https://issues.apache.org/jira/browse/ARROW-15606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane updated ARROW-15606: --- Summary: [CI] [R] Add brew build that exercises the R package (was: [CI] [R] Add brew release build) > [CI] [R] Add brew build that exercises the R package > > > Key: ARROW-15606 > URL: https://issues.apache.org/jira/browse/ARROW-15606 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15599) [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or other delimited) file
[ https://issues.apache.org/jira/browse/ARROW-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488436#comment-17488436 ] Nicola Crane commented on ARROW-15599:
--

Here's a reprex with more verbose output.

{code:r}
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp

tf <- tempfile()
write.csv(data.frame(x = '2018-10-07 19:04:05.005'), tf, row.names = FALSE)

# successfully read in file
read_csv_arrow(tf, as_data_frame = FALSE)
#> Table
#> 1 rows x 1 columns
#> $x

# successfully read in with col_names and col_types specified
read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "?",
  skip = 1,
  as_data_frame = FALSE
)
#> Table
#> 1 rows x 1 columns
#> $x

read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "T",
  skip = 1,
  as_data_frame = FALSE
)
#> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550  decoder_.Decode(data, size, quoted, &value)
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123  status
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554  parser.VisitColumn(col_index, visit)

read_csv_arrow(
  tf,
  col_names = "x",
  col_types = "T",
  as_data_frame = FALSE,
  skip = 1,
  timestamp_parsers = "%Y-%m-%d %H:%M:%S"
)
#> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:550  decoder_.Decode(data, size, quoted, &value)
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.h:123  status
#> /home/nic2/arrow/cpp/src/arrow/csv/converter.cc:554  parser.VisitColumn(col_index, visit)
{code}

> [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or other delimited) file
> ---
>
> Key: ARROW-15599
> URL: https://issues.apache.org/jira/browse/ARROW-15599
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 6.0.1
> Environment: R version 4.1.2 (2021-11-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
> Running under: Ubuntu 20.04.3 LTS
> Reporter: SHIMA Tatsuya
> Priority: Major
>
> I tried to read the csv column type as timestamp, but I could only get it to work well when `col_types` was not specified.
> I'm sorry if I missed something and this is the expected behavior. (It would be great if you could add an example with `col_types` in the documentation.)
> {code:r}
> library(arrow)
> #>
> #> Attaching package: 'arrow'
> #> The following object is masked from 'package:utils':
> #>
> #>     timestamp
> t_string <- tibble::tibble(
>   x = "2018-10-07 19:04:05.005"
> )
> write_csv_arrow(t_string, "tmp.csv")
> read_csv_arrow(
>   "tmp.csv",
>   as_data_frame = FALSE
> )
> #> Table
> #> 1 rows x 1 columns
> #> $x
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "?",
>   skip = 1,
>   as_data_frame = FALSE
> )
> #> Table
> #> 1 rows x 1 columns
> #> $x
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "T",
>   skip = 1,
>   as_data_frame = FALSE
> )
> #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
> read_csv_arrow(
>   "tmp.csv",
>   col_names = "x",
>   col_types = "T",
>   as_data_frame = FALSE,
>   skip = 1,
>   timestamp_parsers = "%Y-%m-%d %H:%M:%S"
> )
> #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005'
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15606) [CI] [R] Add brew release build
Jonathan Keane created ARROW-15606: -- Summary: [CI] [R] Add brew release build Key: ARROW-15606 URL: https://issues.apache.org/jira/browse/ARROW-15606 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane Assignee: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets
[ https://issues.apache.org/jira/browse/ARROW-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488413#comment-17488413 ] Sarah Gilmore edited comment on ARROW-15554 at 2/7/22, 8:21 PM: Will do! was (Author: sgilmore): Wil do! > [Format][C++] Add "LargeMap" type with 64-bit offsets > - > > Key: ARROW-15554 > URL: https://issues.apache.org/jira/browse/ARROW-15554 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Sarah Gilmore >Priority: Major > > It would be nice if a "LargeMap" type existed along side the "Map" type for > parity. For other datatypes that require offset arrays/buffers, such as > String, List, BinaryArray, provides a "large" version of these types, i.e. > LargeString, LargeList, and LargeBinaryArray. It would be nice to have a > "LargeMap" for parity. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets
[ https://issues.apache.org/jira/browse/ARROW-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488413#comment-17488413 ] Sarah Gilmore commented on ARROW-15554: --- Wil do! > [Format][C++] Add "LargeMap" type with 64-bit offsets > - > > Key: ARROW-15554 > URL: https://issues.apache.org/jira/browse/ARROW-15554 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Sarah Gilmore >Priority: Major > > It would be nice if a "LargeMap" type existed along side the "Map" type for > parity. For other datatypes that require offset arrays/buffers, such as > String, List, BinaryArray, provides a "large" version of these types, i.e. > LargeString, LargeList, and LargeBinaryArray. It would be nice to have a > "LargeMap" for parity. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15570) [CI][Nightly] Drop centos-8 R nightly job
[ https://issues.apache.org/jira/browse/ARROW-15570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-15570. Fix Version/s: 8.0.0 Resolution: Fixed Issue resolved by pull request 12337 [https://github.com/apache/arrow/pull/12337] > [CI][Nightly] Drop centos-8 R nightly job > - > > Key: ARROW-15570 > URL: https://issues.apache.org/jira/browse/ARROW-15570 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Neal Richardson >Assignee: Neal Richardson >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > It has started failing since CentOS 8 went EOL. Followup to ARROW-15038, > which removed all of the other CentOS 8 jobs. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-14745) [R] Enable true duckdb streaming
[ https://issues.apache.org/jira/browse/ARROW-14745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane resolved ARROW-14745. Resolution: Fixed Issue resolved by pull request 11730 [https://github.com/apache/arrow/pull/11730] > [R] Enable true duckdb streaming > > > Key: ARROW-14745 > URL: https://issues.apache.org/jira/browse/ARROW-14745 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Blocker > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 9h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.
[ https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488402#comment-17488402 ] Alenka Frim commented on ARROW-14488:
-

Thank you Joris! An example would be:

{code:python}
>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
>>> import pyarrow as pa
>>>
>>> schema = pa.schema([
...     pa.field('a', pa.string()),
...     pa.field('b', pa.int64()),
...     pa.field('c', pa.float64())])
>>>
>>> pa.Table.from_pandas(df, schema=schema)
pyarrow.Table
a: string
b: int64
c: double
a: [["a"]]
b: [[1]]
c: [[1]]
>>> pa.Table.from_pandas(df.head(0), schema=schema)
pyarrow.Table
a: string
b: int64
c: double
a: [[]]
b: [[]]
c: [[]]
{code}

> [Python] Incorrect inferred schema from pandas dataframe with length 0.
> ---
>
> Key: ARROW-14488
> URL: https://issues.apache.org/jira/browse/ARROW-14488
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Environment: OS: Windows 10, CentOS 7
> Reporter: Yuan Zhou
> Priority: Major
>
> We use pandas (with pyarrow engine) to write out parquet files and those outputs will be consumed by other applications such as Java apps using org.apache.parquet.hadoop.ParquetFileReader. We found that some empty dataframes would get incorrect schema for string columns in other applications. After some investigation, we narrowed down the issue to the schema inference by pyarrow:
> {code:java}
> In [1]: import pandas as pd
> In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c'])
> In [3]: import pyarrow as pa
> In [4]: pa.Schema.from_pandas(df)
> Out[4]:
> a: string
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 562
> In [5]: pa.Schema.from_pandas(df.head(0))
> Out[5]:
> a: null
> b: int64
> c: double
> -- schema metadata --
> pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 560
> In [6]: pa.__version__
> Out[6]: '5.0.0'
> {code}
> As you can see, the column 'a', which should be string type, is inferred as null type and is converted to int32 while writing to parquet files.
> Is this an expected behavior? Or do we have any workaround for this issue? Could anyone take a look please. Thanks!
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
[ https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15605: --- Labels: pull-request-available (was: ) > [CI] [R] Keep using old macos runners on our autobrew CI job > > > Key: ARROW-15605 > URL: https://issues.apache.org/jira/browse/ARROW-15605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
[ https://issues.apache.org/jira/browse/ARROW-15605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathan Keane reassigned ARROW-15605: -- Assignee: Jonathan Keane > [CI] [R] Keep using old macos runners on our autobrew CI job > > > Key: ARROW-15605 > URL: https://issues.apache.org/jira/browse/ARROW-15605 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, R >Reporter: Jonathan Keane >Assignee: Jonathan Keane >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15605) [CI] [R] Keep using old macos runners on our autobrew CI job
Jonathan Keane created ARROW-15605: -- Summary: [CI] [R] Keep using old macos runners on our autobrew CI job Key: ARROW-15605 URL: https://issues.apache.org/jira/browse/ARROW-15605 Project: Apache Arrow Issue Type: Improvement Components: Continuous Integration, R Reporter: Jonathan Keane -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15080) [Python] Allow creation of month_day_nano interval from tuple
[ https://issues.apache.org/jira/browse/ARROW-15080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-15080.
--
Fix Version/s: 8.0.0
Resolution: Fixed

Issue resolved by pull request 12348
[https://github.com/apache/arrow/pull/12348]

> [Python] Allow creation of month_day_nano interval from tuple
> -
>
> Key: ARROW-15080
> URL: https://issues.apache.org/jira/browse/ARROW-15080
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Antoine Pitrou
> Assignee: Micah Kornfield
> Priority: Minor
> Labels: pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> This should ideally be allowed but isn't:
> {code:python}
> >>> a = pa.array([(3, 20, 100)], type=pa.month_day_nano_interval())
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>     a = pa.array([(3, 20, 100)], type=pa.month_day_nano_interval())
>   File "pyarrow/array.pxi", line 315, in pyarrow.lib.array
>     return _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
>   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
>     chunked = GetResultValue(
>   File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
>     return check_status(status)
>   File "pyarrow/error.pxi", line 122, in pyarrow.lib.check_status
>     raise ArrowTypeError(message)
> ArrowTypeError: No temporal attributes found on object.
> {code}
-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-13168) [C++] Timezone database configuration and access
[ https://issues.apache.org/jira/browse/ARROW-13168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-13168: -- Assignee: Will Jones > [C++] Timezone database configuration and access > > > Key: ARROW-13168 > URL: https://issues.apache.org/jira/browse/ARROW-13168 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Rok Mihevc >Assignee: Will Jones >Priority: Major > Labels: timestamp > > Note: currently the timezone database is not available on Windows, so timezone > aware operations will fail. > We're using the tz.h library, which needs an updated timezone database to > correctly handle timezoned timestamps. See [installation > instructions|https://howardhinnant.github.io/date/tz.html#Installation]. > We have the following options for getting a timezone database: > # local (non-Windows) OS timezone database - no work required. > # arrow bundled folder - we could bundle the database at build time for > Windows. Database would slowly go stale. > # download it from IANA Time Zone Database at runtime - tz.h gets the > database at runtime, but curl (and 7-zip on Windows) are required. > # local user-provided folder - user could provide a location at build time. > Nice to have. > # allow runtime configuration - at runtime say: "the tzdata can be found at > this location" > For more context see: > [ARROW-12980|https://github.com/apache/arrow/pull/10457] and [PEP > 615|https://www.python.org/dev/peps/pep-0615/#sources-for-time-zone-data]. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-13204) [MATLAB] Update documentation for the MATLAB Interface to reflect latest CMake build system changes
[ https://issues.apache.org/jira/browse/ARROW-13204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Gurney updated ARROW-13204: - Description: The MATLAB Interface should have clear documentation detailing how to build and run the interface on all platforms using the latest CMake build system changes merged in [https://github.com/apache/arrow/pull/12004.] The current documentation is out of date and is in need of some quality of life improvements as we transition toward building out the MATLAB interface beyond just supporting basic Feather file reading and writing. was: [https://github.com/apache/arrow/pull/10614] integrates GoogleTest into the CMake build system to support building and running C++ tests for the MATLAB Interface. The MATLAB interface should have clear documentation detailing how to build and run the C++ tests on all platforms. This should include instructions for building with and without specifying a custom `GTEST_ROOT` value. Summary: [MATLAB] Update documentation for the MATLAB Interface to reflect latest CMake build system changes (was: [MATLAB] Add documentation for building and running C++ tests) > [MATLAB] Update documentation for the MATLAB Interface to reflect latest > CMake build system changes > --- > > Key: ARROW-13204 > URL: https://issues.apache.org/jira/browse/ARROW-13204 > Project: Apache Arrow > Issue Type: Task > Components: MATLAB >Reporter: Kevin Gurney >Assignee: Kevin Gurney >Priority: Minor > > The MATLAB Interface should have clear documentation detailing how to build > and run the interface on all platforms using the latest CMake build system > changes merged in [https://github.com/apache/arrow/pull/12004.] > The current documentation is out of date and is in need of some quality of > life improvements as we transition toward building out the MATLAB interface > beyond just supporting basic Feather file reading and writing. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488349#comment-17488349 ] David Li commented on ARROW-15604: -- It also seems the main thread is being destroyed during/before the thread pools, so maybe this is a static destructor order pitfall… > [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing > --- > > Key: ARROW-15604 > URL: https://issues.apache.org/jira/browse/ARROW-15604 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > The error is a heap-use-after-free and involves an OpenTracing structure that > was deleted by an atexit hook. > https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 > Summary: > {code} > Atomic write of size 4 at 0x7b08000136a8 by thread T2: > [...] > #10 > opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 > (libarrow.so.800+0x1e62ef7) > #11 > opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 > (libarrow.so.800+0x1e70178) > #12 opentelemetry::v1::context::Token::~Token() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 > (libarrow.so.800+0x1e7012f) > [...] > {code} > {code} > Previous write of size 8 at 0x7b08000136a8 by main thread: > #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) > [...] > #7 > opentelemetry::v1::nostd::shared_ptr::~shared_ptr() > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 > (libarrow.so.800+0x1e62fb3) > #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488347#comment-17488347 ] David Li commented on ARROW-15604: -- Hmm, I think I ran into something similar when I was working on my PR. https://github.com/apache/arrow/pull/11964#issuecomment-995043666 CC [~mbrobbel] Should we disable OT in CI for now? > [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing > --- > > Key: ARROW-15604 > URL: https://issues.apache.org/jira/browse/ARROW-15604 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > The error is a heap-use-after-free and involves an OpenTracing structure that > was deleted by an atexit hook. > https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 > Summary: > {code} > Atomic write of size 4 at 0x7b08000136a8 by thread T2: > [...] > #10 > opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 > (libarrow.so.800+0x1e62ef7) > #11 > opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 > (libarrow.so.800+0x1e70178) > #12 opentelemetry::v1::context::Token::~Token() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 > (libarrow.so.800+0x1e7012f) > [...] > {code} > {code} > Previous write of size 8 at 0x7b08000136a8 by main thread: > #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) > [...] 
> #7 > opentelemetry::v1::nostd::shared_ptr::~shared_ptr() > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 > (libarrow.so.800+0x1e62fb3) > #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-15603) [C++] Clang 13 build fails on unused var
[ https://issues.apache.org/jira/browse/ARROW-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-15603. Resolution: Fixed Issue resolved by pull request 12359 [https://github.com/apache/arrow/pull/12359] > [C++] Clang 13 build fails on unused var > > > Key: ARROW-15603 > URL: https://issues.apache.org/jira/browse/ARROW-15603 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Minor > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Just a small issue. When I build with clang 13 I get the following error from > a unused var warning: > {code:java} > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15552) [Docs][Format] Unclear wording about base64 encoding requirement of metadata values
[ https://issues.apache.org/jira/browse/ARROW-15552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15552: --- Labels: pull-request-available (was: ) > [Docs][Format] Unclear wording about base64 encoding requirement of metadata > values > --- > > Key: ARROW-15552 > URL: https://issues.apache.org/jira/browse/ARROW-15552 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Format >Reporter: Joris Van den Bossche >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > The C Data Interface docs indicate that the values in key-value metadata > should be base64 encoded, which is mentioned in the section about which > key-value metadata to use for extension types > (https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays): > bq. The base64 encoding of metadata values ensures that any possible > serialization is representable. > This might not be fully correct, though (or at least not required, which is > implied with the current wording). While a binary blob (like a serialized > schema) can be base64 encoded, as we do when putting the Arrow schema in the > Parquet metadata, this is not required? > cc [~apitrou] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15552) [Docs][Format] Unclear wording about base64 encoding requirement of metadata values
[ https://issues.apache.org/jira/browse/ARROW-15552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-15552: -- Assignee: Antoine Pitrou > [Docs][Format] Unclear wording about base64 encoding requirement of metadata > values > --- > > Key: ARROW-15552 > URL: https://issues.apache.org/jira/browse/ARROW-15552 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Format >Reporter: Joris Van den Bossche >Assignee: Antoine Pitrou >Priority: Major > > The C Data Interface docs indicate that the values in key-value metadata > should be base64 encoded, which is mentioned in the section about which > key-value metadata to use for extension types > (https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays): > bq. The base64 encoding of metadata values ensures that any possible > serialization is representable. > This might not be fully correct, though (or at least not required, which is > implied with the current wording). While a binary blob (like a serialized > schema) can be base64 encoded, as we do when putting the Arrow schema in the > Parquet metadata, this is not required? > cc [~apitrou] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15552) [Docs][Format] Unclear wording about base64 encoding requirement of metadata values
[ https://issues.apache.org/jira/browse/ARROW-15552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488330#comment-17488330 ] Antoine Pitrou commented on ARROW-15552: This is probably a leftover from a draft version of the C data interface where key-value metadata has a different encoding (perhaps JSON with base64-encoded values). > [Docs][Format] Unclear wording about base64 encoding requirement of metadata > values > --- > > Key: ARROW-15552 > URL: https://issues.apache.org/jira/browse/ARROW-15552 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation, Format >Reporter: Joris Van den Bossche >Priority: Major > > The C Data Interface docs indicate that the values in key-value metadata > should be base64 encoded, which is mentioned in the section about which > key-value metadata to use for extension types > (https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays): > bq. The base64 encoding of metadata values ensures that any possible > serialization is representable. > This might not be fully correct, though (or at least not required, which is > implied with the current wording). While a binary blob (like a serialized > schema) can be base64 encoded, as we do when putting the Arrow schema in the > Parquet metadata, this is not required? > cc [~apitrou] -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15195) [MATLAB] Enable GitHub Actions CI for MATLAB Interface on macOS
[ https://issues.apache.org/jira/browse/ARROW-15195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fiona La reassigned ARROW-15195: Assignee: Sreehari Hegden > [MATLAB] Enable GitHub Actions CI for MATLAB Interface on macOS > --- > > Key: ARROW-15195 > URL: https://issues.apache.org/jira/browse/ARROW-15195 > Project: Apache Arrow > Issue Type: Task > Components: MATLAB >Reporter: Fiona La >Assignee: Sreehari Hegden >Priority: Minor > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Enable CI and Testing for MATLAB submissions to the [Apache Arrow > project|https://github.com/apache/arrow] on GitHub. This task can be worked > on after [MATLAB Actions|https://github.com/matlab-actions/setup-matlab] is > enabled on macOS. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488325#comment-17488325 ] Antoine Pitrou commented on ARROW-15604: So, basically, it seems using OpenTracing in an asynchronous setup where code may run after process teardown has started may be quite delicate. [~lidavidm] [~westonpace] > [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing > --- > > Key: ARROW-15604 > URL: https://issues.apache.org/jira/browse/ARROW-15604 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > The error is a heap-use-after-free and involves an OpenTracing structure that > was deleted by an atexit hook. > https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 > Summary: > {code} > Atomic write of size 4 at 0x7b08000136a8 by thread T2: > [...] > #10 > opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 > (libarrow.so.800+0x1e62ef7) > #11 > opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 > (libarrow.so.800+0x1e70178) > #12 opentelemetry::v1::context::Token::~Token() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 > (libarrow.so.800+0x1e7012f) > [...] > {code} > {code} > Previous write of size 8 at 0x7b08000136a8 by main thread: > #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) > [...] 
> #7 > opentelemetry::v1::nostd::shared_ptr::~shared_ptr() > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 > (libarrow.so.800+0x1e62fb3) > #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
[ https://issues.apache.org/jira/browse/ARROW-15604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488321#comment-17488321 ] Antoine Pitrou commented on ARROW-15604: The "atexit hook" I mentioned simply seems to be a standard C++ exit hook that destroys global/static variables. Here the static singleton that's stored in {{RuntimeContext::GetStorage}}. > [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing > --- > > Key: ARROW-15604 > URL: https://issues.apache.org/jira/browse/ARROW-15604 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Major > > The error is a heap-use-after-free and involves an OpenTracing structure that > was deleted by an atexit hook. > https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 > Summary: > {code} > Atomic write of size 4 at 0x7b08000136a8 by thread T2: > [...] > #10 > opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 > (libarrow.so.800+0x1e62ef7) > #11 > opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 > (libarrow.so.800+0x1e70178) > #12 opentelemetry::v1::context::Token::~Token() > /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 > (libarrow.so.800+0x1e7012f) > [...] > {code} > {code} > Previous write of size 8 at 0x7b08000136a8 by main thread: > #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) > [...] 
> #7 > opentelemetry::v1::nostd::shared_ptr::~shared_ptr() > > /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 > (libarrow.so.800+0x1e62fb3) > #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15604) [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing
Antoine Pitrou created ARROW-15604: -- Summary: [C++][CI] Sporadic ThreadSanitizer failure with OpenTracing Key: ARROW-15604 URL: https://issues.apache.org/jira/browse/ARROW-15604 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration Reporter: Antoine Pitrou The error is a heap-use-after-free and involves an OpenTracing structure that was deleted by an atexit hook. https://github.com/ursacomputing/crossbow/runs/5097362072?check_suite_focus=true#step:5:4843 Summary: {code} Atomic write of size 4 at 0x7b08000136a8 by thread T2: [...] #10 opentelemetry::v1::context::RuntimeContext::GetRuntimeContextStorage() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:156:12 (libarrow.so.800+0x1e62ef7) #11 opentelemetry::v1::context::RuntimeContext::Detach(opentelemetry::v1::context::Token&) /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:97:54 (libarrow.so.800+0x1e70178) #12 opentelemetry::v1::context::Token::~Token() /build/cpp/opentelemetry_ep-install/include/opentelemetry/context/runtime_context.h:168:3 (libarrow.so.800+0x1e7012f) [...] {code} {code} Previous write of size 8 at 0x7b08000136a8 by main thread: #0 operator delete(void*) (arrow-dataset-scanner-test+0x16a69e) [...] #7 opentelemetry::v1::nostd::shared_ptr::~shared_ptr() /build/cpp/opentelemetry_ep-install/include/opentelemetry/nostd/shared_ptr.h:98:30 (libarrow.so.800+0x1e62fb3) #8 cxa_at_exit_wrapper(void*) (arrow-dataset-scanner-test+0x11866f) {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15081) [R][C++] Arrow crashes (OOM) on R client with large remote parquet files
[ https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488306#comment-17488306 ] Dewey Dunnington commented on ARROW-15081: -- There was another user who reported an issue with count on a parquet file that seems to have been fixed in the development version (which is about to be released to CRAN). Perhaps ARROW-15201 is the same issue? If it is not, when I try to reproduce the above I get an error (see below). Is there a more recent bucket with the files we can use to reproduce? {code:R} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) server <- arrow::s3_bucket( "ebird", endpoint_override = "minio.cirrus.carlboettiger.info" ) path <- server$path("Oct-2021/observations") path$ls() #> Error: IOError: Path does not exist 'ebird/Oct-2021/observations' #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:1913 collector.Finish(this) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:2275 impl_->Walk(select, base_path.bucket, base_path.key, &results) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector) path <- server$path("partitioned") path$ls() #> Error: IOError: Path does not exist 'ebird/partitioned' #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:1913 collector.Finish(this) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/s3fs.cc:2275 impl_->Walk(select, base_path.bucket, base_path.key, &results) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector) #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/filesystem/filesystem.cc:341 base_fs_->GetFileInfo(selector) 
{code} > [R][C++] Arrow crashes (OOM) on R client with large remote parquet files > > > Key: ARROW-15081 > URL: https://issues.apache.org/jira/browse/ARROW-15081 > Project: Apache Arrow > Issue Type: Bug > Components: R >Reporter: Carl Boettiger >Assignee: Weston Pace >Priority: Major > > The below should be a reproducible crash: > {code:R} > library(arrow) > library(dplyr) > server <- arrow::s3_bucket("ebird", endpoint_override = > "minio.cirrus.carlboettiger.info") > path <- server$path("Oct-2021/observations") > obs <- arrow::open_dataset(path) > path$ls() # observe -- 1 parquet file > obs %>% count() # CRASH > obs %>% to_duckdb() # also crash{code} > I have attempted to split this large (~100 GB) parquet file into some smaller > files, which helps: > {code:R} > path <- server$path("partitioned") > obs <- arrow::open_dataset(path) > obs$ls() # observe, multiple parquet files now > obs %>% count() > {code} > (These parquet files have also been created by arrow, btw, from a single > large csv file provided by the original data provider (eBird). Unfortunately > generating the partitioned versions is cumbersome as the data is very > unevenly distributed; there are few columns that can avoid creating 1000s of > parquet partition files, and even so the bulk of the 1-billion rows fall > within the same group. But all the same I think this is a bug, as there's no > indication why arrow cannot handle a single 100GB parquet file.) > > Let me know if I can provide more info! I'm testing in R with latest CRAN > version of arrow on a machine with 200 GB RAM. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15563) [C++] Compilation failure on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15563: --- Summary: [C++] Compilation failure on s390x platform (was: Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform) > [C++] Compilation failure on s390x platform > --- > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. > -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > 
Using ld linker > Configured for DEBUG build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > -- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version 
"1.66.0", minimum > required is "1.58") found components: regex system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required by 'virtual:world', not found > -- Could NOT find Brotli (missing: BROTLI_COMMON_LIBRARY BROTLI_ENC_LIBRARY > BROTLI_DEC_LIBRARY BROTLI_INCLUDE_DIR) > -- Building brotli from source > -- Building without OpenS
[jira] [Commented] (ARROW-15563) Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488295#comment-17488295 ] Antoine Pitrou commented on ARROW-15563: [~mr.chandureddy] Have you tried with Arrow 6.0.0 or 7.0.0? > Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform > - > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major > > > {code:java} > (pyarrow-dev) [root@s390x]# cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME > -DCMAKE_INSTALL_LIBDIR=lib -DCMAKE_BUILD_TYPE=debug > -DARROW_WITH_BZ2=ON -DARROW_WITH_ZLIB=ON -DARROW_WITH_ZSTD=ON > -DARROW_WITH_LZ4=ON -DARROW_WITH_SNAPPY=ON > -DARROW_WITH_BROTLI=ON -DARROW_PARQUET=ON -DARROW_PYTHON=ON > -DARROW_BUILD_TESTS=ON .. > -- Building using CMake version: 3.22.2 > -- The C compiler identification is GNU 9.2.1 > -- The CXX compiler identification is GNU 8.5.0 > -- Detecting C compiler ABI info > -- Detecting C compiler ABI info - done > -- Check for working C compiler: /usr/bin/cc - skipped > -- Detecting C compile features > -- Detecting C compile features - done > -- Detecting CXX compiler ABI info > -- Detecting CXX compiler ABI info - done > -- Check for working CXX compiler: /usr/bin/c++ - skipped > -- Detecting CXX compile features > -- Detecting CXX compile features - done > -- Arrow version: 3.0.0 (full: '3.0.0') > -- Arrow SO version: 300 (full: 300.0.0) > -- clang-tidy not found > -- clang-format not found > -- Could NOT find ClangTools (missing: CLANG_FORMAT_BIN CLANG_TIDY_BIN) > -- infer not found > -- Found Python3: /usr/bin/python3.9 (found version "3.9.6") found > components: Interpreter > -- Found cpplint executable at > /root/git/repos/arrow/cpp/build-support/cpplint.py > -- System processor: s390x > -- Arrow build warning level: CHECKIN > 
Using ld linker > Configured for DEBUG build (set with cmake > -DCMAKE_BUILD_TYPE={release,debug,...}) > -- Build Type: DEBUG > -- Using AUTO approach to find dependencies > -- ARROW_ABSL_BUILD_VERSION: 0f3bb466b868b523cf1dc9b2aaaed65c77b28862 > -- ARROW_AWSSDK_BUILD_VERSION: 1.8.90 > -- ARROW_AWS_CHECKSUMS_BUILD_VERSION: v0.1.5 > -- ARROW_AWS_C_COMMON_BUILD_VERSION: v0.4.59 > -- ARROW_AWS_C_EVENT_STREAM_BUILD_VERSION: v0.1.5 > -- ARROW_BOOST_BUILD_VERSION: 1.71.0 > -- ARROW_BROTLI_BUILD_VERSION: v1.0.7 > -- ARROW_BZIP2_BUILD_VERSION: 1.0.8 > -- ARROW_CARES_BUILD_VERSION: 1.16.1 > -- ARROW_GBENCHMARK_BUILD_VERSION: v1.5.2 > -- ARROW_GFLAGS_BUILD_VERSION: v2.2.2 > -- ARROW_GLOG_BUILD_VERSION: v0.4.0 > -- ARROW_GRPC_BUILD_VERSION: v1.33.2 > -- ARROW_GTEST_BUILD_VERSION: 1.10.0 > -- ARROW_JEMALLOC_BUILD_VERSION: 5.2.1 > -- ARROW_LZ4_BUILD_VERSION: v1.9.2 > -- ARROW_MIMALLOC_BUILD_VERSION: v1.6.4 > -- ARROW_ORC_BUILD_VERSION: 1.6.2 > -- ARROW_PROTOBUF_BUILD_VERSION: v3.13.0 > -- ARROW_RAPIDJSON_BUILD_VERSION: 1a803826f1197b5e30703afe4b9c0e7dd48074f5 > -- ARROW_RE2_BUILD_VERSION: 2019-08-01 > -- ARROW_SNAPPY_BUILD_VERSION: 1.1.8 > -- ARROW_THRIFT_BUILD_VERSION: 0.12.0 > -- ARROW_THRIFT_BUILD_MD5_CHECKSUM: 3deebbb4d1ca77dd9c9e009a1ea02183 > -- ARROW_UTF8PROC_BUILD_VERSION: v2.5.0 > -- ARROW_ZLIB_BUILD_VERSION: 1.2.11 > -- ARROW_ZSTD_BUILD_VERSION: v1.4.5 > -- Looking for pthread.h > -- Looking for pthread.h - found > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD > -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed > -- Check if compiler accepts -pthread > -- Check if compiler accepts -pthread - yes > -- Found Threads: TRUE > -- Checking for module 'thrift' > -- Package 'thrift', required by 'virtual:world', not found > -- Could NOT find Thrift: Found unsuitable version "", but required is at > least "0.11.0" (found THRIFT_LIB-NOTFOUND) > -- Looking for __SIZEOF_INT128__ > -- Looking for __SIZEOF_INT128__ - found > -- Found Boost: /usr/include (found suitable version 
"1.66.0", minimum > required is "1.58") found components: regex system filesystem > -- Boost include dir: /usr/include > -- Boost libraries: Boost::system;Boost::filesystem > -- Could NOT find Snappy (missing: Snappy_LIB Snappy_INCLUDE_DIR) > -- Building snappy from source > -- Checking for modules 'libbrotlicommon;libbrotlienc;libbrotlidec' > -- Package 'libbrotlicommon', required by 'virtual:world', not found > -- Package 'libbrotlienc', required by 'virtual:world', not found > -- Package 'libbrotlidec', required by 'virtual:world', not found > -- Could NOT find Brotli (missing: BROTLI_COMMON_LIBRARY BROTLI_ENC_LIBRARY > BROTLI_DEC_LIBRARY BROTLI_INCLUDE_DIR) > -- Building bro
[jira] [Commented] (ARROW-15563) Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform
[ https://issues.apache.org/jira/browse/ARROW-15563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488294#comment-17488294 ] Antoine Pitrou commented on ARROW-15563: cc [~kiszk] > Compilation of Arrow Cpp code fails with array-bounds issue on s390x platform > - > > Key: ARROW-15563 > URL: https://issues.apache.org/jira/browse/ARROW-15563 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 3.0.0 > Environment: s390x (IBM LinuxONE) >Reporter: Chandra Shekhar Reddy >Priority: Major >
[jira] [Commented] (ARROW-15554) [Format][C++] Add "LargeMap" type with 64-bit offsets
[ https://issues.apache.org/jira/browse/ARROW-15554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488293#comment-17488293 ] Antoine Pitrou commented on ARROW-15554: Hi [~sgilmore], thank you for the explanation. In any case, format additions have to be discussed and voted on via the development mailing list. I encourage you to create a new discussion there: see [https://arrow.apache.org/community/] > [Format][C++] Add "LargeMap" type with 64-bit offsets > - > > Key: ARROW-15554 > URL: https://issues.apache.org/jira/browse/ARROW-15554 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Format >Reporter: Sarah Gilmore >Priority: Major > > It would be nice if a "LargeMap" type existed alongside the "Map" type, for parity. For other datatypes that require offset arrays/buffers, such as > String, List, and BinaryArray, Arrow provides a "large" version of these types, i.e. > LargeString, LargeList, and LargeBinaryArray. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15016) [R] show_query() for an arrow_dplyr_query
[ https://issues.apache.org/jira/browse/ARROW-15016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488280#comment-17488280 ] Dewey Dunnington commented on ARROW-15016: -- It seems like a route here would be to implement a {{ToString()}} and {{print()}} as R6 methods here: https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/r/R/dplyr.R#L91 and here: https://github.com/apache/arrow/blob/master/r/src/compute-exec.cpp#L47 ...and add a {{show_dplyr_query()}} function here (maybe like this): {code:R} library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) show_exec_plan <- function(.data) { adq <- arrow:::as_adq(.data) plan <- arrow:::ExecPlan$create() final_node <- plan$Build(.data) print(plan$ToString()) invisible(.data) } ggplot2::mpg %>% arrow_table() %>% filter(year > 2007) %>% show_exec_plan() #> Error in print(plan$ToString()): attempt to apply non-function {code} Maybe here: https://github.com/apache/arrow/blob/master/r/R/dplyr.R#L91 > [R] show_query() for an arrow_dplyr_query > - > > Key: ARROW-15016 > URL: https://issues.apache.org/jira/browse/ARROW-15016 > Project: Apache Arrow > Issue Type: Improvement > Components: R >Reporter: Jonathan Keane >Priority: Major > Fix For: 8.0.0 > > > Now that we can print a query plan (ARROW-13785) we should wire this up in R > so we can see what execution plans are being put together for various queries > (like the TPC-H queries) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15603) [C++] Clang 13 build fails on unused var
[ https://issues.apache.org/jira/browse/ARROW-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15603: --- Labels: pull-request-available (was: ) > [C++] Clang 13 build fails on unused var > > > Key: ARROW-15603 > URL: https://issues.apache.org/jira/browse/ARROW-15603 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Minor > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Just a small issue. When I build with clang 13 I get the following error from > an unused var warning: > {code:java} > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15593) [C++] An unhandled race condition exists in ThreadPool
[ https://issues.apache.org/jira/browse/ARROW-15593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-15593: --- Summary: [C++] An unhandled race condition exists in ThreadPool (was: An unhandled race condition exists in ThreadPool) > [C++] An unhandled race condition exists in ThreadPool > -- > > Key: ARROW-15593 > URL: https://issues.apache.org/jira/browse/ARROW-15593 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 > Environment: linux >Reporter: Huxley Hu >Priority: Major > Labels: pull-request-available > Time Spent: 1h 20m > Remaining Estimate: 0h > > A race condition exists at the ThreadPool which may lead to the loss of > pending tasks after a process forks. > See this issue for more detail: https://github.com/apache/arrow/issues/12329 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15603) [C++] Clang 13 build fails on unused var
[ https://issues.apache.org/jira/browse/ARROW-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Will Jones reassigned ARROW-15603: -- Assignee: Will Jones > [C++] Clang 13 build fails on unused var > > > Key: ARROW-15603 > URL: https://issues.apache.org/jira/browse/ARROW-15603 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 7.0.0 >Reporter: Will Jones >Assignee: Will Jones >Priority: Minor > Fix For: 8.0.0 > > > Just a small issue. When I build with clang 13 I get the following error from > an unused var warning: > {code:java} > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ > /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13: > error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] > int64_t n = 0; > ^ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15603) [C++] Clang 13 build fails on unused var
Will Jones created ARROW-15603: -- Summary: [C++] Clang 13 build fails on unused var Key: ARROW-15603 URL: https://issues.apache.org/jira/browse/ARROW-15603 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 7.0.0 Reporter: Will Jones Fix For: 8.0.0 Just a small issue. When I build with clang 13 I get the following error from an unused var warning: {code:java} /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:791:13: error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] int64_t n = 0; ^ /Users/willjones/Documents/arrows/arrow/cpp/src/arrow/python/numpy_to_arrow.cc:799:13: error: variable 'n' set but not used [-Werror,-Wunused-but-set-variable] int64_t n = 0; ^ {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14567) [C++][Python][R] PrettyPrint ignores timezone
[ https://issues.apache.org/jira/browse/ARROW-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488247#comment-17488247 ] Rok Mihevc commented on ARROW-14567: Agreed! Offset ("1970-01-01 02:00:00+02:00") per value plus timezone string () in the header would be great. > [C++][Python][R] PrettyPrint ignores timezone > - > > Key: ARROW-14567 > URL: https://issues.apache.org/jira/browse/ARROW-14567 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python, R >Reporter: Alenka Frim >Priority: Major > > When printing TimestampArray in pyarrow the timezone information is ignored > by PrettyPrint (__str__ calls to_string() in array.pxi). > {code:python} > import pyarrow as pa > a = pa.array([0], pa.timestamp('s', tz='+02:00')) > print(a) # representation not correct? > # > # [ > # 1970-01-01 00:00:00 > # ] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14488) [Python] Incorrect inferred schema from pandas dataframe with length 0.
[ https://issues.apache.org/jira/browse/ARROW-14488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488237#comment-17488237 ] Joris Van den Bossche commented on ARROW-14488: --- bq. the conversion from empty Pandas series to pa.array is wrong in the case of a string dtype. The main problem is that the example code is not using a "string dtype". By default, pandas uses the generic "object" dtype to store strings. But this data type basically means that it can hold _any_ Python object. So it is not guaranteed to be strings (eg it could also be decimals, bytes, ..., for some python types that pyarrow also infers). As long as the array is not empty, the conversion to a pyarrow array will try to infer the appropriate type based on the values in the input array (eg in case of an object dtype array with strings, it will indeed convert that to a {{pa.string()}} type). But if the array is empty, there are no values to infer the type from. And that is the reason why pyarrow defaults to using the generic "null" data type for such an array (or column in a DataFrame). If you know that you have strings for a certain column, and want the pandas->pyarrow conversion to robustly work (regardless of having empty dataframes/arrays), the {{from_pandas}} method has a {{schema}} argument, and this way you can specify a schema to use (and so pyarrow will not try to infer the types based on the values in the array). You will have to construct this schema manually, though, in this case. > [Python] Incorrect inferred schema from pandas dataframe with length 0. 
> --- > > Key: ARROW-14488 > URL: https://issues.apache.org/jira/browse/ARROW-14488 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 5.0.0 > Environment: OS: Windows 10, CentOS 7 >Reporter: Yuan Zhou >Priority: Major > > We use pandas (with pyarrow engine) to write out parquet files and those > outputs will be consumed by other applications such as Java apps using > org.apache.parquet.hadoop.ParquetFileReader. We found that some empty > dataframes would get incorrect schema for string columns in other > applications. After some investigation, we narrow down the issue to the > schema inference by pyarrow: > {code:java} > In [1]: import pandas as pd > In [2]: df = pd.DataFrame([['a', 1, 1.0]], columns=['a', 'b', 'c']) > In [3]: import pyarrow as pa > In [4]: pa.Schema.from_pandas(df) > Out[4]: > a: string > b: int64 > c: double > -- schema metadata -- > pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + > 562 > In [5]: pa.Schema.from_pandas(df.head(0)) > Out[5]: > a: null > b: int64 > c: double > -- schema metadata -- > pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + > 560 > In [6]: pa.__version__ > Out[6]: '5.0.0' > {code} > As you can see, the column 'a', which should be string type, is inferred as > null type and is converted to int32 while writing to parquet files. > Is this an expected behavior? Or do we have any workaround for this issue? > Could anyone take a look, please? Thanks! -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14567) [C++][Python][R] PrettyPrint ignores timezone
[ https://issues.apache.org/jira/browse/ARROW-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488228#comment-17488228 ] Joris Van den Bossche commented on ARROW-14567: --- I agree with [~jonkeane] that IMO the least confusing display would be to use localized strings (with timezone offset indication, so like "1970-01-01 02:00:00+02:00"). Adding "Z" is certainly better than the current situation, but it still doesn't give a quick idea about the local time that the timestamp actually represents. > [C++][Python][R] PrettyPrint ignores timezone > - > > Key: ARROW-14567 > URL: https://issues.apache.org/jira/browse/ARROW-14567 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Python, R >Reporter: Alenka Frim >Priority: Major > > When printing TimestampArray in pyarrow the timezone information is ignored > by PrettyPrint (__str__ calls to_string() in array.pxi). > {code:python} > import pyarrow as pa > a = pa.array([0], pa.timestamp('s', tz='+02:00')) > print(a) # representation not correct? > # > # [ > # 1970-01-01 00:00:00 > # ] > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
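The display both commenters prefer can be sketched with the standard library (this is an illustration of the desired output, not Arrow's PrettyPrint code): the stored UTC instant localized to the array's timezone, with the offset shown per value.

```python
# Epoch value 0 in a timestamp('s', tz='+02:00') array: localized it
# reads 02:00+02:00, not the bare 00:00 that PrettyPrint currently shows.
from datetime import datetime, timedelta, timezone

tz = timezone(timedelta(hours=2))  # the "+02:00" from the example
localized = datetime.fromtimestamp(0, tz=tz).isoformat(sep=" ")
assert localized == "1970-01-01 02:00:00+02:00"
```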
[jira] [Updated] (ARROW-15215) [C++] Consolidate kernel data-copy utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward
[ https://issues.apache.org/jira/browse/ARROW-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15215: - Description: All six kernels use two sets of otherwise very similar kernel utilities for copying slices of an array into an output array. However, there's no reason they can't use the same utilities. The first set are here: "CopyFixedWidth" https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/cpp/src/arrow/compute/kernels/scalar_if_else.cc#L1282-L1284 The second set are here: "ReplaceWithMask::CopyData" https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/cpp/src/arrow/compute/kernels/vector_replace.cc#L208-L209 (This is a little confusing because the utilities are intertwined into the kernel implementation) They would need to be moved into a new header to share them between the codegen units. Also, their interfaces would need to be consolidated. Additionally, the utilities may be excessively verbose, or generate too much code for what they do. For instance, some of the utilities are templated out for every Arrow type. Instead, we could replace all instantiations for numbers, decimals, temporal types, and so on with a single one for FixedWidthType (an abstract base class). Care should be taken to evaluate the benchmarks for these kernels to ensure there is not a regression. was:All four kernels (and soon to be fill_null_forward/backward) make use of a set of very similar utilities for copying data between arrays; we should consolidate those into a single set of helpers instead of duplicating them, and consider whether they could be further consolidated (e.g. 
making use of the FixedWidthType hierarchy instead of specializing for every type) > [C++] Consolidate kernel data-copy utilities between replace_with_mask, > case_when, coalesce, choose, fill_null_forward, fill_null_backward > -- > > Key: ARROW-15215 > URL: https://issues.apache.org/jira/browse/ARROW-15215 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: Jabari Booker >Priority: Major > Labels: good-second-issue > > All six kernels use two sets of otherwise very similar kernel utilities for > copying slices of an array into an output array. However, there's no reason > they can't use the same utilities. > The first set are here: "CopyFixedWidth" > https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/cpp/src/arrow/compute/kernels/scalar_if_else.cc#L1282-L1284 > The second set are here: "ReplaceWithMask::CopyData" > https://github.com/apache/arrow/blob/bd356295f6beaba744a2c6b498455701f53a64f8/cpp/src/arrow/compute/kernels/vector_replace.cc#L208-L209 > (This is a little confusing because the utilities are intertwined into the > kernel implementation) > They would need to be moved into a new header to share them between the > codegen units. Also, their interfaces would need to be consolidated. > Additionally, the utilities may be excessively verbose, or generate too much > code for what they do. For instance, some of the utilities are templated out > for every Arrow type. Instead, we could replace all instantiations for > numbers, decimals, temporal types, and so on with a single one for > FixedWidthType (an abstract base class). Care should be taken to evaluate the > benchmarks for these kernels to ensure there is not a regression. -- This message was sent by Atlassian Jira (v8.20.1#820001)
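The consolidation the description proposes, one copy routine keyed on byte width serving every fixed-width type, can be modeled outside of C++. This is an illustrative sketch, not Arrow's actual kernel code:

```python
# Instead of one templated copy routine per concrete type, a single routine
# that only needs the value's byte width can serve ints, floats, decimals,
# temporal types, and any other fixed-width layout.
def copy_fixed_width(dst, dst_offset, src, src_offset, length, byte_width):
    """Copy `length` fixed-width values between raw byte buffers."""
    dst_start = dst_offset * byte_width
    src_start = src_offset * byte_width
    nbytes = length * byte_width
    dst[dst_start:dst_start + nbytes] = src[src_start:src_start + nbytes]

# int32-like values (byte_width=4): copy values at slots 1..2 of src into
# dst starting at slot 1.
src = bytearray(b"".join(v.to_bytes(4, "little") for v in (1, 2, 3)))
dst = bytearray(12)
copy_fixed_width(dst, 1, src, 1, 2, 4)
assert int.from_bytes(dst[4:8], "little") == 2
assert int.from_bytes(dst[8:12], "little") == 3
```

The C++ version would additionally handle validity bitmaps and boolean bit-packing, which is where the per-kernel duplication currently lives.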
[jira] [Updated] (ARROW-15215) [C++] Consolidate kernel data-copy utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward
[ https://issues.apache.org/jira/browse/ARROW-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15215: - Summary: [C++] Consolidate kernel data-copy utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward (was: [C++] Consolidate kernel utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward) > [C++] Consolidate kernel data-copy utilities between replace_with_mask, > case_when, coalesce, choose, fill_null_forward, fill_null_backward > -- > > Key: ARROW-15215 > URL: https://issues.apache.org/jira/browse/ARROW-15215 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: Jabari Booker >Priority: Major > Labels: good-second-issue > > All four kernels (and soon to be fill_null_forward/backward) make use of a > set of very similar utilities for copying data between arrays; we should > consolidate those into a single set of helpers instead of duplicating them, > and consider whether they could be further consolidated (e.g. making use of > the FixedWidthType hierarchy instead of specializing for every type) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15215) [C++] Consolidate kernel utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward
[ https://issues.apache.org/jira/browse/ARROW-15215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Li updated ARROW-15215: - Summary: [C++] Consolidate kernel utilities between replace_with_mask, case_when, coalesce, choose, fill_null_forward, fill_null_backward (was: [C++] Consolidate kernel utilities between replace_with_mask, case_when, coalesce, choose) > [C++] Consolidate kernel utilities between replace_with_mask, case_when, > coalesce, choose, fill_null_forward, fill_null_backward > > > Key: ARROW-15215 > URL: https://issues.apache.org/jira/browse/ARROW-15215 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: David Li >Assignee: Jabari Booker >Priority: Major > Labels: good-second-issue > > All four kernels (and soon to be fill_null_forward/backward) make use of a > set of very similar utilities for copying data between arrays; we should > consolidate those into a single set of helpers instead of duplicating them, > and consider whether they could be further consolidated (e.g. making use of > the FixedWidthType hierarchy instead of specializing for every type) -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14082) [Python] Expose Arrow C++ Consumer API to pyarrow
[ https://issues.apache.org/jira/browse/ARROW-14082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-14082: -- Summary: [Python] Expose Arrow C++ Consumer API to pyarrow (was: Expose Arrow C++ Consumer API to pyarrow) > [Python] Expose Arrow C++ Consumer API to pyarrow > - > > Key: ARROW-14082 > URL: https://issues.apache.org/jira/browse/ARROW-14082 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Compute IR, Python >Reporter: Phillip Cloud >Assignee: Ben Kietzman >Priority: Major > > Once we have ARROW-14081, we need to add pyarrow bindings to allow use from > Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-14082) Expose Arrow C++ Consumer API to pyarrow
[ https://issues.apache.org/jira/browse/ARROW-14082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-14082: -- Description: Once we have ARROW-14081, we need to add pyarrow bindings to allow use from Python. (was: Once we have https://issues.apache.org/jira/browse/ARROW-14081, we need to add pyarrow bindings to allow use from Python.) > Expose Arrow C++ Consumer API to pyarrow > > > Key: ARROW-14082 > URL: https://issues.apache.org/jira/browse/ARROW-14082 > Project: Apache Arrow > Issue Type: Sub-task > Components: C++, Compute IR, Python >Reporter: Phillip Cloud >Assignee: Ben Kietzman >Priority: Major > > Once we have ARROW-14081, we need to add pyarrow bindings to allow use from > Python. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14292) [Python] Minimal ExecPlan to perform joins in pyarrow
[ https://issues.apache.org/jira/browse/ARROW-14292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17462691#comment-17462691 ] Joris Van den Bossche edited comment on ARROW-14292 at 2/7/22, 4:11 PM: ARROW-14082 is probably a clone of this one was (Author: amol-): https://issues.apache.org/jira/browse/ARROW-14082 is probably a clone of this one > [Python] Minimal ExecPlan to perform joins in pyarrow > - > > Key: ARROW-14292 > URL: https://issues.apache.org/jira/browse/ARROW-14292 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Reporter: Alessandro Molina >Assignee: Alessandro Molina >Priority: Major > Fix For: 8.0.0 > > > At the moment pyarrow doesn't provide any way to leverage the query execution > engine that the C++ layer provides. The goal is to allow a minimal > implementation (unexposed to end users) that permits to create an exec plan > with multiple inputs and that produces a single output. In between. It should > allow to inject as intermediate steps one of the nodes to perform data > manipulation. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (ARROW-10643) [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty dataframe
[ https://issues.apache.org/jira/browse/ARROW-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche resolved ARROW-10643. --- Resolution: Fixed Issue resolved by pull request 12311 [https://github.com/apache/arrow/pull/12311] > [Python] Pandas<->pyarrow roundtrip failing to recreate index for empty > dataframe > - > > Key: ARROW-10643 > URL: https://issues.apache.org/jira/browse/ARROW-10643 > Project: Apache Arrow > Issue Type: New Feature > Components: Python >Reporter: Joris Van den Bossche >Assignee: Alenka Frim >Priority: Major > Labels: conversion, pandas, pull-request-available > Fix For: 8.0.0 > > Time Spent: 4h 10m > Remaining Estimate: 0h > > From https://github.com/pandas-dev/pandas/issues/37897 > The roundtrip of an empty pandas.DataFrame _with_ and index (so no columns, > but a non-zero shape for the rows) isn't faithful: > {code} > In [33]: df = pd.DataFrame(index=pd.RangeIndex(0, 10, 1)) > In [34]: df > Out[34]: > Empty DataFrame > Columns: [] > Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] > In [35]: df.shape > Out[35]: (10, 0) > In [36]: table = pa.table(df) > In [37]: table.to_pandas() > Out[37]: > Empty DataFrame > Columns: [] > Index: [] > In [38]: table.to_pandas().shape > Out[38]: (0, 0) > {code} > Since the pandas metadata in the Table actually have this RangeIndex > information: > {code} > In [39]: table.schema.pandas_metadata > Out[39]: > {'index_columns': [{'kind': 'range', >'name': None, >'start': 0, >'stop': 10, >'step': 1}], > 'column_indexes': [{'name': None, >'field_name': None, >'pandas_type': 'empty', >'numpy_type': 'object', >'metadata': None}], > 'columns': [], > 'creator': {'library': 'pyarrow', 'version': '3.0.0.dev162+g305160495'}, > 'pandas_version': '1.2.0.dev0+1225.g91f5bfcdc4'} > {code} > we should in principle be able to correctly roundtrip this case. -- This message was sent by Atlassian Jira (v8.20.1#820001)
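The pandas metadata shown above already carries everything needed to rebuild the index; a sketch with a hypothetical `rebuild_index` helper (using pandas directly, not Arrow's actual restoration path) makes that concrete:

```python
# The 'range' entry stored in table.schema.pandas_metadata is enough to
# reconstruct the original RangeIndex on to_pandas().
import pandas as pd

index_meta = {"kind": "range", "name": None, "start": 0, "stop": 10, "step": 1}

def rebuild_index(meta):
    if meta["kind"] == "range":
        return pd.RangeIndex(meta["start"], meta["stop"], meta["step"],
                             name=meta["name"])
    raise NotImplementedError(meta["kind"])

idx = rebuild_index(index_meta)
assert isinstance(idx, pd.RangeIndex)
assert len(idx) == 10
```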
[jira] [Updated] (ARROW-14817) [R] Implement bindings for lubridate::tz
[ https://issues.apache.org/jira/browse/ARROW-14817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14817: --- Labels: good-first-issue pull-request-available (was: good-first-issue) > [R] Implement bindings for lubridate::tz > > > Key: ARROW-14817 > URL: https://issues.apache.org/jira/browse/ARROW-14817 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: good-first-issue, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > This can be achieved via strftime -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15602) [R] can't read timestamp with timezone from CSV (or other delimited) file
SHIMA Tatsuya created ARROW-15602: - Summary: [R] can't read timestamp with timezone from CSV (or other delimited) file Key: ARROW-15602 URL: https://issues.apache.org/jira/browse/ARROW-15602 Project: Apache Arrow Issue Type: Improvement Environment: R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Reporter: SHIMA Tatsuya The following values in a csv file can be read as timestamp by `pyarrow.csv.read_csv` and `readr::read_csv`, but not by `arrow::read_csv_arrow`. {code} "x" "2004-04-01T12:00+09:00" {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
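For reference, the value in the report is a valid ISO 8601 timestamp with a UTC offset; the standard library parses it (this illustrates what a correct parse should yield, independently of the Arrow CSV reader):

```python
# "2004-04-01T12:00+09:00" denotes noon at UTC+9, i.e. 03:00 UTC.
from datetime import datetime, timezone

ts = datetime.fromisoformat("2004-04-01T12:00+09:00")
assert ts.utcoffset().total_seconds() == 9 * 3600
assert ts.astimezone(timezone.utc) == datetime(2004, 4, 1, 3, 0,
                                               tzinfo=timezone.utc)
```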
[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488048#comment-17488048 ] Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/7/22, 3:08 PM: --- [~paleolimbot] I don't think we can rely on {{coalesce()}} to iterate through the various formats supported by {{ymd()}}. That would rely on the assumption that {{strptime()}} either parses the data with the passed {{format}} or fails. Sadly, arrow parses with a mismatched format and produces incorrect timestamps instead of failing:
{code:r}
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(arrow))
suppressPackageStartupMessages(library(lubridate))

df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03"))
df
#> # A tibble: 3 × 1
#>   x
#>   <chr>
#> 1 09-01-01
#> 2 09-01-02
#> 3 09-01-03

# lubridate::ymd()
df %>% mutate(y = ymd(x))
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <date>
#> 1 09-01-01 2009-01-01
#> 2 09-01-02 2009-01-02
#> 3 09-01-03 2009-01-03

# y = short year: correct
df %>%
  record_batch() %>%
  mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>%
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 2009-01-01 00:00:00
#> 2 09-01-02 2009-01-02 00:00:00
#> 3 09-01-03 2009-01-03 00:00:00

# Y = long year: this should fail in order for us to rely on coalesce()
df %>%
  record_batch() %>%
  mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>%
  collect()
#> # A tibble: 3 × 2
#>   x        y
#>   <chr>    <dttm>
#> 1 09-01-01 0008-12-31 23:58:45
#> 2 09-01-02 0009-01-01 23:58:45
#> 3 09-01-03 0009-01-02 23:58:45
{code}
Therefore, my early (and somewhat naive) conclusion would be that we cannot implement the {{arrow::ymd()}} binding as {{coalesce(strptime(x, format1), strptime(x, format2), ...)}}. What do you think?
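As an aside for readers of this thread: the coalesce-over-formats idea presupposes a parser that rejects a mismatched format. A minimal Python sketch using stdlib {{datetime.strptime}}, which does fail on mismatch, shows what the binding would look like if Arrow's {{strptime}} behaved the same way (the candidate format list is hypothetical):

```python
from datetime import datetime

# Hypothetical candidate formats; a real binding would derive these from ymd().
CANDIDATE_FORMATS = ["%Y-%m-%d", "%y-%m-%d", "%m-%d-%y"]

def parse_first_match(value, formats=CANDIDATE_FORMATS):
    """Mimic coalesce(strptime(x, f1), strptime(x, f2), ...).

    This relies on strptime raising ValueError on a mismatch. CPython's %Y
    requires four digits, so "09-01-01" falls through to the %y format; the
    comment above shows Arrow instead "succeeds" with a wrong value, which is
    exactly why this trick cannot be ported directly.
    """
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None  # the null that coalesce() would propagate

print(parse_first_match("09-01-01"))  # 2009-01-01 00:00:00
```

The sketch only works because the first, wrong format raises instead of returning year 9; that is the behavior the comment argues Arrow's kernel would need before a coalesce-based binding is viable.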
> [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488164#comment-17488164 ] Dragoș Moldovan-Grünfeld commented on ARROW-14471: -- {{lubridate}} has {{guess_formats()}} to identify the likely candidates. We could try something similar, where we have a list of supported formats (something similar to [this|https://github.com/dragosmg/arrow/blob/cfba9e1dfbedd5dfdf652c805e93692808dd092e/r/R/dplyr-funcs-datetime.R#L152-L196]), which we then narrow down to the most likely ones. Only then use something like {{{}coalesce(){}}}. > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 20m > Remaining Estimate: 0h > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
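The narrowing step described in the comment above can be sketched in stdlib Python (the candidate list and function name are illustrative, not lubridate's actual implementation):

```python
from datetime import datetime

# Illustrative candidates; lubridate::guess_formats() derives its candidates
# from the requested orders (e.g. "ymd") rather than a fixed list.
CANDIDATES = ["%y-%m-%d", "%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"]

def viable_formats(samples, candidates=CANDIDATES):
    """Keep only the formats that successfully parse every sample value.

    Running this narrowing first means a later coalesce() only has to choose
    among formats that are at least plausible for the column.
    """
    viable = []
    for fmt in candidates:
        try:
            for s in samples:
                datetime.strptime(s, fmt)
        except ValueError:
            continue  # one sample failed, so the format is not viable
        viable.append(fmt)
    return viable

print(viable_formats(["09-01-01", "09-01-02"]))  # ['%y-%m-%d']
```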
[jira] [Closed] (ARROW-1921) [Doc] Build API docs on a per-release basis
[ https://issues.apache.org/jira/browse/ARROW-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-1921. Assignee: (was: Krisztian Szucs) Resolution: Duplicate > [Doc] Build API docs on a per-release basis > --- > > Key: ARROW-1921 > URL: https://issues.apache.org/jira/browse/ARROW-1921 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Uwe Korn >Priority: Major > > Currently we build the docs from time to time manually from master. We should > also build them per release so that you can have a look at the latest > released API version. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-1921) [Doc] Build API docs on a per-release basis
[ https://issues.apache.org/jira/browse/ARROW-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488163#comment-17488163 ] Joris Van den Bossche commented on ARROW-1921: -- In the meantime, we have versioned docs for older versions, and we also have the latest dev version of the docs. Closing as a duplicate of ARROW-13260. > [Doc] Build API docs on a per-release basis > --- > > Key: ARROW-1921 > URL: https://issues.apache.org/jira/browse/ARROW-1921 > Project: Apache Arrow > Issue Type: Improvement > Components: Java >Reporter: Uwe Korn >Assignee: Krisztian Szucs >Priority: Major > > Currently we build the docs from time to time manually from master. We should > also build them per release so that you can have a look at the latest > released API version. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-8533) [Release] Don't commit doctrees in the docs post release script
[ https://issues.apache.org/jira/browse/ARROW-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche closed ARROW-8533. Resolution: Fixed > [Release] Don't commit doctrees in the docs post release script > --- > > Key: ARROW-8533 > URL: https://issues.apache.org/jira/browse/ARROW-8533 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Priority: Major > > A .gitignore file would be enough to prevent committing pickled binaries. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-8533) [Release] Don't commit doctrees in the docs post release script
[ https://issues.apache.org/jira/browse/ARROW-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488160#comment-17488160 ] Joris Van den Bossche commented on ARROW-8533: -- This is done in the meantime (see https://github.com/apache/arrow-site/blob/asf-site/docs/.gitignore) > [Release] Don't commit doctrees in the docs post release script > --- > > Key: ARROW-8533 > URL: https://issues.apache.org/jira/browse/ARROW-8533 > Project: Apache Arrow > Issue Type: Improvement > Components: Developer Tools >Reporter: Krisztian Szucs >Priority: Major > > A .gitignore file would be enough to prevent committing pickled binaries. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15601) [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs
[ https://issues.apache.org/jira/browse/ARROW-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15601: --- Labels: pull-request-available (was: ) > [Docs][Release] Update post release script to move stable docs to versioned + > keep dev docs > --- > > Key: ARROW-15601 > URL: https://issues.apache.org/jira/browse/ARROW-15601 > Project: Apache Arrow > Issue Type: Sub-task > Components: Documentation >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0, 7.0.1 > > Time Spent: 10m > Remaining Estimate: 0h > > xref https://github.com/apache/arrow-site/pull/187 > We need to update the {{post-09-docs.sh}} script to keep the dev docs and to > move the current stable docs to a versioned sub-directory -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15601) [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs
[ https://issues.apache.org/jira/browse/ARROW-15601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche reassigned ARROW-15601: - Assignee: Joris Van den Bossche > [Docs][Release] Update post release script to move stable docs to versioned + > keep dev docs > --- > > Key: ARROW-15601 > URL: https://issues.apache.org/jira/browse/ARROW-15601 > Project: Apache Arrow > Issue Type: Sub-task > Components: Documentation >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Fix For: 8.0.0, 7.0.1 > > > xref https://github.com/apache/arrow-site/pull/187 > We need to update the {{post-09-docs.sh}} script to keep the dev docs and to > move the current stable docs to a versioned sub-directory -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15601) [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs
Joris Van den Bossche created ARROW-15601: - Summary: [Docs][Release] Update post release script to move stable docs to versioned + keep dev docs Key: ARROW-15601 URL: https://issues.apache.org/jira/browse/ARROW-15601 Project: Apache Arrow Issue Type: Sub-task Components: Documentation Reporter: Joris Van den Bossche Fix For: 8.0.0, 7.0.1 xref https://github.com/apache/arrow-site/pull/187 We need to update the {{post-09-docs.sh}} script to keep the dev docs and to move the current stable docs to a versioned sub-directory -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15600) [C++][FlightRPC] Add a simple Flight SQL example
[ https://issues.apache.org/jira/browse/ARROW-15600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15600: --- Labels: pull-request-available (was: ) > [C++][FlightRPC] Add a simple Flight SQL example > > > Key: ARROW-15600 > URL: https://issues.apache.org/jira/browse/ARROW-15600 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15600) [C++][FlightRPC] Add a simple Flight SQL example
David Li created ARROW-15600: Summary: [C++][FlightRPC] Add a simple Flight SQL example Key: ARROW-15600 URL: https://issues.apache.org/jira/browse/ARROW-15600 Project: Apache Arrow Issue Type: Improvement Components: C++, FlightRPC Reporter: David Li Assignee: David Li -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15571) [C++] Add min/max/sqrt scalar kernels to execution engine
[ https://issues.apache.org/jira/browse/ARROW-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488129#comment-17488129 ] Yaron Gvili commented on ARROW-15571: - Since NaN values are plentiful in real data, I find returning an input NaN value more user-friendly than returning a null, which just drops information; dropping to null can always be done in a later step. However, if there is a convention that other operations conform to on this issue, then it should probably be followed. > [C++] Add min/max/sqrt scalar kernels to execution engine > - > > Key: ARROW-15571 > URL: https://issues.apache.org/jira/browse/ARROW-15571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The list of execution engine's scalar kernels currently available in > `cpp/src/arrow/compute/kernels/scalar_arithmetic.cc` does not cover the > common minimum, maximum, and square-root functions. -- This message was sent by Atlassian Jira (v8.20.1#820001)
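For readers skimming the thread, the two semantics under discussion can be sketched in stdlib Python (function names are illustrative, not Arrow's API; {{math.isnan}} stands in for the kernel's NaN check):

```python
import math

def min_propagate_nan(values):
    """NaN-propagating minimum: keep the input NaN instead of dropping information."""
    out = None
    for v in values:
        if math.isnan(v):
            return v  # an input NaN poisons the result
        if out is None or v < out:
            out = v
    return out

def min_skip_nan(values):
    """NaN-skipping minimum: NaNs are treated like nulls and ignored."""
    finite = [v for v in values if not math.isnan(v)]
    return min(finite) if finite else float("nan")

print(min_propagate_nan([3.0, float("nan"), 1.0]))  # nan
print(min_skip_nan([3.0, float("nan"), 1.0]))       # 1.0
```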
[jira] [Updated] (ARROW-15599) [R] can't explicitly convert a column as a sub-second timestamp from CSV (or other delimited) file
[ https://issues.apache.org/jira/browse/ARROW-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15599: -- Description: I tried to read the csv column type as timestamp, but I could only get it to work well when `col_types` was not specified. I'm sorry if I missed something and this is the expected behavior. (It would be great if you could add an example with `col_types` in the documentation.) {code:r} library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp t_string <- tibble::tibble( x = "2018-10-07 19:04:05.005" ) write_csv_arrow(t_string, "tmp.csv") read_csv_arrow( "tmp.csv", as_data_frame = FALSE ) #> Table #> 1 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "?", skip = 1, as_data_frame = FALSE ) #> Table #> 1 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", skip = 1, as_data_frame = FALSE ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005' read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE, skip = 1, timestamp_parsers = "%Y-%m-%d %H:%M:%S" ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value '2018-10-07 19:04:05.005' {code} was: I tried to read the csv column type as timestamp, but I could only get it to work well when `col_types` was not specified. I'm sorry if I missed something and this is the expected behavior. (It would be great if you could add an example with `col_types` in the documentation.) 
{code:r} library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp t_string <- tibble::tibble( x = "2018-10-07 19:04:05" ) write_csv_arrow(t_string, "tmp.csv") read_csv_arrow( "tmp.csv", as_data_frame = FALSE ) #> Table #> 1 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "?", as_data_frame = FALSE ) #> Table #> 2 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value 'x' read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE, timestamp_parsers = "%Y-%m-%d %H:%M:%S" ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value 'x' {code} > [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or > other delimited) file > --- > > Key: ARROW-15599 > URL: https://issues.apache.org/jira/browse/ARROW-15599 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 6.0.1 > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Priority: Major > > I tried to read the csv column type as timestamp, but I could only get it to > work well when `col_types` was not specified. > I'm sorry if I missed something and this is the expected behavior. (It would > be great if you could add an example with `col_types` in the documentation.) 
> {code:r} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > t_string <- tibble::tibble( > x = "2018-10-07 19:04:05.005" > ) > write_csv_arrow(t_string, "tmp.csv") > read_csv_arrow( > "tmp.csv", > as_data_frame = FALSE > ) > #> Table > #> 1 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "?", > skip = 1, > as_data_frame = FALSE > ) > #> Table > #> 1 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > skip = 1, > as_data_frame = FALSE > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value '2018-10-07 19:04:05.005' > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > as_data_frame = FALSE, > skip = 1, > timestamp_parsers = "%Y-%m-%d %H:%M:%S" > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value '2018-10-07 19:04:05.005' > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
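The sub-second mismatch reported above has a stdlib-Python analogue, which makes the failure easy to reproduce outside Arrow (this sketches the parsing rule, not Arrow's CSV reader):

```python
from datetime import datetime

ts = "2018-10-07 19:04:05.005"

# A seconds-resolution format leaves the fractional part unconsumed, so the
# parse fails -- the analogue of "CSV conversion error to timestamp[s]":
try:
    datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
except ValueError as exc:
    print(exc)  # unconverted data remains: .005

# A format that consumes the fraction succeeds:
parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")
print(parsed.microsecond)  # 5000
```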
[jira] [Updated] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-14471: --- Labels: pull-request-available (was: ) > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Labels: pull-request-available > Fix For: 8.0.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15599) [R] can't explicitly convert a column as a sub-second timestamp from CSV (or other delimited) file
[ https://issues.apache.org/jira/browse/ARROW-15599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SHIMA Tatsuya updated ARROW-15599: -- Summary: [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or other delimited) file (was: [R] can't explicitly convert a column as a typestamp from CSV (or other delimited) file) > [R] can't explicitly convert a column as a sub-seconds typestamp from CSV (or > other delimited) file > --- > > Key: ARROW-15599 > URL: https://issues.apache.org/jira/browse/ARROW-15599 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 6.0.1 > Environment: R version 4.1.2 (2021-11-01) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.3 LTS >Reporter: SHIMA Tatsuya >Priority: Major > > I tried to read the csv column type as timestamp, but I could only get it to > work well when `col_types` was not specified. > I'm sorry if I missed something and this is the expected behavior. (It would > be great if you could add an example with `col_types` in the documentation.) 
> {code:r} > library(arrow) > #> > #> Attaching package: 'arrow' > #> The following object is masked from 'package:utils': > #> > #> timestamp > t_string <- tibble::tibble( > x = "2018-10-07 19:04:05" > ) > write_csv_arrow(t_string, "tmp.csv") > read_csv_arrow( > "tmp.csv", > as_data_frame = FALSE > ) > #> Table > #> 1 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "?", > as_data_frame = FALSE > ) > #> Table > #> 2 rows x 1 columns > #> $x > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > as_data_frame = FALSE > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value 'x' > read_csv_arrow( > "tmp.csv", > col_names = "x", > col_types = "T", > as_data_frame = FALSE, > timestamp_parsers = "%Y-%m-%d %H:%M:%S" > ) > #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: > invalid value 'x' > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15599) [R] can't explicitly convert a column as a timestamp from CSV (or other delimited) file
SHIMA Tatsuya created ARROW-15599: - Summary: [R] can't explicitly convert a column as a typestamp from CSV (or other delimited) file Key: ARROW-15599 URL: https://issues.apache.org/jira/browse/ARROW-15599 Project: Apache Arrow Issue Type: Bug Affects Versions: 6.0.1 Environment: R version 4.1.2 (2021-11-01) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.3 LTS Reporter: SHIMA Tatsuya I tried to read the csv column type as timestamp, but I could only get it to work well when `col_types` was not specified. I'm sorry if I missed something and this is the expected behavior. (It would be great if you could add an example with `col_types` in the documentation.) {code:r} library(arrow) #> #> Attaching package: 'arrow' #> The following object is masked from 'package:utils': #> #> timestamp t_string <- tibble::tibble( x = "2018-10-07 19:04:05" ) write_csv_arrow(t_string, "tmp.csv") read_csv_arrow( "tmp.csv", as_data_frame = FALSE ) #> Table #> 1 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "?", as_data_frame = FALSE ) #> Table #> 2 rows x 1 columns #> $x read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value 'x' read_csv_arrow( "tmp.csv", col_names = "x", col_types = "T", as_data_frame = FALSE, timestamp_parsers = "%Y-%m-%d %H:%M:%S" ) #> Error: Invalid: In CSV column #0: CSV conversion error to timestamp[s]: invalid value 'x' {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-15547) Regression: Decimal type inference
[ https://issues.apache.org/jira/browse/ARROW-15547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488109#comment-17488109 ] Joris Van den Bossche edited comment on ARROW-15547 at 2/7/22, 1:30 PM: Can you provide a reproducible code example of the issue you encounter? With the data that you currently provided, the function works fine for me using pyarrow 6.0 (but there are no decimals in the resulting table, as it doesn't infer this type automatically from numbers):
{code}
In [3]: null = None

In [4]: data = [{"accounted_at": # data as provided above

In [6]: create_dataframe(data)
Out[6]:
pyarrow.Table
booked_by: string
invoice_recipient_id: string
created_at: string
due_date: string
lines: list<item: struct<amount: double, commission: double, commissionUnit: string, description: string, soldPrice: double, type: string>>
  child 0, item: struct<amount: double, commission: double, commissionUnit: string, description: string, soldPrice: double, type: string>
      child 0, amount: double
      child 1, commission: double
      child 2, commissionUnit: string
      child 3, description: string
      child 4, soldPrice: double
      child 5, type: string
deleted_at: null
internal_code: string
type: string
id: string
payment_term: string
franchise_id: string
teamleader_id: string
created_by: string
parent_id: null
sent_by: string
accounted_at: string
recipient_emails: null
booked_at: string
status: string
description: string
sent_at: string
{code}
> Regression: Decimal type inference > -- > > Key: ARROW-15547 > URL: https://issues.apache.org/jira/browse/ARROW-15547 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Charley Guillaume >Priority: Major > > While trying to ingest data with pyarrow 6.0.1 using this function:
> {code:java}
> def create_dataframe(list_dict: dict) -> pa.table:
>     fields = set()
>     for d in list_dict:
>         fields = fields.union(d.keys())
>     dataframe = pa.table({f: [row.get(f) for row in list_dict] for f in fields})
>     return dataframe {code}
> I had the following error:
> {code:java}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 7 does not fit into precision inferred from first array element: 8 {code}
> After downgrading to v4.0.1 the error was gone.
> The data looked like that : > {noformat} > [{"accounted_at": "2022-01-31T22:55:25.702000+00:00", "booked_at": > "2022-01-27T09:24:17.539000+00:00", "booked_by": > "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "created_at": > "2022-01-27T09:08:22.306000+00:00", "created_by": > "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "deleted_at": null, "description": > "description of the record", "due_date": "2022-02-10T00:00:00+00:00", > "franchise_id": "9a2858c4-5c71-43d3-b28f-2352de47ff9f", "id": > "ba3f6d3a-12f4-4d78-acc5-2e59ca384c1e", "internal_code": "A.2022 / 9", > "invoice_recipient_id": "7169cef9-9cb2-461f-a38f-a4d1ce3ca1c3", "lines": > [{"type": "property", "amount": 7800, "soldPrice": 26, "commission": 3, > "description": "Honoraires de l'agence", "commissionUnit": "PERCENT"}], > "parent_id": null, "payment_term": "14-days", "recipient_emails": null, > "sent_at": null, "sent_by": null, "status": "booked", "teamleader_id": > "xxx-yyy-www-zzz", "type": "out"}, {"accounted_at": null, "booked_at": > "2022-01-05T09:23:03.274000+00:00", "booked_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "created_at": > "2022-01-05T09:21:32.503000+00:00", "created_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "deleted_at": null, "description": > "Description content", "due_date": "2022-02-04T00:00:00+00:00", > "franchise_id": "929d47a3-c30f-404b-aaff-c96cff1bdd10", "id": > "828cd056-6aa7-4cea-9c94-ffa2db4498df", "internal_code": "BXC22 / 3", > "invoice_recipient_id": "5f90aa24-4c32-401d-927c-db9d4a9f90bf", "lines": > [{"type": "property", "amount": 92.55, "soldPrice": 3702.02, "commis
[jira] [Updated] (ARROW-15547) Regression: Decimal type inference
[ https://issues.apache.org/jira/browse/ARROW-15547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-15547: -- Description: While trying to ingest data using pyarrow 6.0.1 using this function :{{{}{}}} {code:java} def create_dataframe(list_dict: dict) -> pa.table: fields = set() for d in list_dict: fields = fields.union(d.keys()) dataframe = pa.table({f: [row.get(f) for row in list_dict] for f in fields}) return dataframe {code} {{I had the following error: }} {code:java} pyarrow.lib.ArrowInvalid: Decimal type with precision 7 does not fit into precision inferred from first array element: 8 {code} After downgrading too v4.0.1 the error was gone. The data looked like that : {noformat} [{"accounted_at": "2022-01-31T22:55:25.702000+00:00", "booked_at": "2022-01-27T09:24:17.539000+00:00", "booked_by": "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "created_at": "2022-01-27T09:08:22.306000+00:00", "created_by": "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "deleted_at": null, "description": "description of the record", "due_date": "2022-02-10T00:00:00+00:00", "franchise_id": "9a2858c4-5c71-43d3-b28f-2352de47ff9f", "id": "ba3f6d3a-12f4-4d78-acc5-2e59ca384c1e", "internal_code": "A.2022 / 9", "invoice_recipient_id": "7169cef9-9cb2-461f-a38f-a4d1ce3ca1c3", "lines": [{"type": "property", "amount": 7800, "soldPrice": 26, "commission": 3, "description": "Honoraires de l'agence", "commissionUnit": "PERCENT"}], "parent_id": null, "payment_term": "14-days", "recipient_emails": null, "sent_at": null, "sent_by": null, "status": "booked", "teamleader_id": "xxx-yyy-www-zzz", "type": "out"}, {"accounted_at": null, "booked_at": "2022-01-05T09:23:03.274000+00:00", "booked_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "created_at": "2022-01-05T09:21:32.503000+00:00", "created_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "deleted_at": null, "description": "Description content", "due_date": "2022-02-04T00:00:00+00:00", "franchise_id": 
"929d47a3-c30f-404b-aaff-c96cff1bdd10", "id": "828cd056-6aa7-4cea-9c94-ffa2db4498df", "internal_code": "BXC22 / 3", "invoice_recipient_id": "5f90aa24-4c32-401d-927c-db9d4a9f90bf", "lines": [{"type": "property", "amount": 92.55, "soldPrice": 3702.02, "commission": 2.5, "description": "description2", "commissionUnit": "PERCENT"}], "parent_id": null, "payment_term": "30-days", "recipient_emails": null, "sent_at": "2022-01-05T09:27:34.077000+00:00", "sent_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "status": "credited", "teamleader_id": "xxx-yzyzy-zzz-www", "type": "out"}]{noformat} was: While trying to ingest data using pyarrow 6.0.1 using this function :{{{}{}}} {code:java} def create_dataframe(list_dict: dict) -> pa.table: fields = set() for d in list_dict: fields = fields.union(d.keys()) dataframe = pa.table({f: [row.get(f) for row in list_dict] for f in fields}) return dataframe {code} {{I had the following error: }} {code:java} pyarrow.lib.ArrowInvalid: Decimal type with precision 7 does not fit into precision inferred from first array element: 8 {code} {{}} {{After downgrading too v4.0.1 the error was gone.}} {{}} {{The data looked like that : }} {noformat} [{"accounted_at": "2022-01-31T22:55:25.702000+00:00", "booked_at": "2022-01-27T09:24:17.539000+00:00", "booked_by": "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "created_at": "2022-01-27T09:08:22.306000+00:00", "created_by": "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "deleted_at": null, "description": "description of the record", "due_date": "2022-02-10T00:00:00+00:00", "franchise_id": "9a2858c4-5c71-43d3-b28f-2352de47ff9f", "id": "ba3f6d3a-12f4-4d78-acc5-2e59ca384c1e", "internal_code": "A.2022 / 9", "invoice_recipient_id": "7169cef9-9cb2-461f-a38f-a4d1ce3ca1c3", "lines": [{"type": "property", "amount": 7800, "soldPrice": 26, "commission": 3, "description": "Honoraires de l'agence", "commissionUnit": "PERCENT"}], "parent_id": null, "payment_term": "14-days", "recipient_emails": null, "sent_at": null, "sent_by": 
null, "status": "booked", "teamleader_id": "xxx-yyy-www-zzz", "type": "out"}, {"accounted_at": null, "booked_at": "2022-01-05T09:23:03.274000+00:00", "booked_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "created_at": "2022-01-05T09:21:32.503000+00:00", "created_by": "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "deleted_at": null, "description": "Description content", "due_date": "2022-02-04T00:00:00+00:00", "franchise_id": "929d47a3-c30f-404b-aaff-c96cff1bdd10", "id": "828cd056-6aa7-4cea-9c94-ffa2db4498df", "internal_code": "BXC22 / 3", "invoice_recipient_id": "5f90aa24-4c32-401d-927c-db9d4a9f90bf", "lines": [{"type": "property", "amount": 92.55, "soldPrice": 3702.02, "commission": 2.5, "description": "description2", "commissionUnit": "PERCENT"}], "parent_id": null, "payment_term": "30-days", "recipient_emails": null, "sent_at":
[jira] [Commented] (ARROW-15547) Regression: Decimal type inference
[ https://issues.apache.org/jira/browse/ARROW-15547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488109#comment-17488109 ] Joris Van den Bossche commented on ARROW-15547: --- Can you provide a reproducible code example of the issue you encounter? With the data that you currently provided, the function works fine for me (but there are no decimals in the resulting table, as it doesn't infer this type automatically from numbers):
{code}
In [3]: null = None

In [4]: data = [{"accounted_at": # data as provided above

In [6]: create_dataframe(data)
Out[6]:
pyarrow.Table
booked_by: string
invoice_recipient_id: string
created_at: string
due_date: string
lines: list<item: struct<amount: double, commission: double, commissionUnit: string, description: string, soldPrice: double, type: string>>
  child 0, item: struct<amount: double, commission: double, commissionUnit: string, description: string, soldPrice: double, type: string>
      child 0, amount: double
      child 1, commission: double
      child 2, commissionUnit: string
      child 3, description: string
      child 4, soldPrice: double
      child 5, type: string
deleted_at: null
internal_code: string
type: string
id: string
payment_term: string
franchise_id: string
teamleader_id: string
created_by: string
parent_id: null
sent_by: string
accounted_at: string
recipient_emails: null
booked_at: string
status: string
description: string
sent_at: string
{code}
> Regression: Decimal type inference > -- > > Key: ARROW-15547 > URL: https://issues.apache.org/jira/browse/ARROW-15547 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 >Reporter: Charley Guillaume >Priority: Major > > While trying to ingest data with pyarrow 6.0.1 using this function:
> {code:java}
> def create_dataframe(list_dict: dict) -> pa.table:
>     fields = set()
>     for d in list_dict:
>         fields = fields.union(d.keys())
>     dataframe = pa.table({f: [row.get(f) for row in list_dict] for f in fields})
>     return dataframe {code}
> I had the following error:
> {code:java}
> pyarrow.lib.ArrowInvalid: Decimal type with precision 7 does not fit into precision inferred from first array element: 8 {code}
> {{After downgrading to v4.0.1 the
error was gone.}} > {{}} > {{The data looked like that : }} > {noformat} > [{"accounted_at": "2022-01-31T22:55:25.702000+00:00", "booked_at": > "2022-01-27T09:24:17.539000+00:00", "booked_by": > "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "created_at": > "2022-01-27T09:08:22.306000+00:00", "created_by": > "7b3ce009-728d-4fbc-9120-00fa8c1c8655", "deleted_at": null, "description": > "description of the record", "due_date": "2022-02-10T00:00:00+00:00", > "franchise_id": "9a2858c4-5c71-43d3-b28f-2352de47ff9f", "id": > "ba3f6d3a-12f4-4d78-acc5-2e59ca384c1e", "internal_code": "A.2022 / 9", > "invoice_recipient_id": "7169cef9-9cb2-461f-a38f-a4d1ce3ca1c3", "lines": > [{"type": "property", "amount": 7800, "soldPrice": 26, "commission": 3, > "description": "Honoraires de l'agence", "commissionUnit": "PERCENT"}], > "parent_id": null, "payment_term": "14-days", "recipient_emails": null, > "sent_at": null, "sent_by": null, "status": "booked", "teamleader_id": > "xxx-yyy-www-zzz", "type": "out"}, {"accounted_at": null, "booked_at": > "2022-01-05T09:23:03.274000+00:00", "booked_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "created_at": > "2022-01-05T09:21:32.503000+00:00", "created_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "deleted_at": null, "description": > "Description content", "due_date": "2022-02-04T00:00:00+00:00", > "franchise_id": "929d47a3-c30f-404b-aaff-c96cff1bdd10", "id": > "828cd056-6aa7-4cea-9c94-ffa2db4498df", "internal_code": "BXC22 / 3", > "invoice_recipient_id": "5f90aa24-4c32-401d-927c-db9d4a9f90bf", "lines": > [{"type": "property", "amount": 92.55, "soldPrice": 3702.02, "commission": > 2.5, "description": "description2", "commissionUnit": "PERCENT"}], > "parent_id": null, "payment_term": "30-days", "recipient_emails": null, > "sent_at": "2022-01-05T09:27:34.077000+00:00", "sent_by": > "8a91a22d-ddb9-491a-bc2d-c06ff3f256b4", "status": "credited", > "teamleader_id": "xxx-yzyzy-zzz-www", "type": "out"}]{noformat} > {{}} > -- This message was sent by Atlassian 
Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15580) package does not include pytz dependency
[ https://issues.apache.org/jira/browse/ARROW-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488102#comment-17488102 ] Joris Van den Bossche commented on ARROW-15580: --- Actually, looking at the code, we do indeed seem to import pytz unconditionally and assume it to be present if you have timestamp data with a timezone; see e.g. https://github.com/apache/arrow/blob/4144c1739ec2e58d5f076fa63a0b61653324dc02/cpp/src/arrow/python/datetime.cc#L394 > package does not include pytz dependency > > > Key: ARROW-15580 > URL: https://issues.apache.org/jira/browse/ARROW-15580 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Affects Versions: 7.0.0 >Reporter: Mark Wood >Priority: Minor > > When installing pyarrow via poetry and executing it, a ModuleNotFoundError > can result. > Pyarrow has a dependency on pytz, but this dependency is not declared (the > only dependency I see declared is numpy, see > [https://github.com/apache/arrow/blob/47d55752172af99fce629f8fe6177df6b41af1d3/python/setup.py#L576)] > It would be helpful if pyarrow could declare its dependencies in such a way > that tools such as poetry could automatically ensure they were present. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15598) [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated gandiva IR code
[ https://issues.apache.org/jira/browse/ARROW-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Projjal Chanda reassigned ARROW-15598: -- Assignee: Projjal Chanda > [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated > gandiva IR code > --- > > Key: ARROW-15598 > URL: https://issues.apache.org/jira/browse/ARROW-15598 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Projjal Chanda >Assignee: Projjal Chanda >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15580) package does not include pytz dependency
[ https://issues.apache.org/jira/browse/ARROW-15580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488096#comment-17488096 ] Joris Van den Bossche commented on ARROW-15580: --- Can you show the error you get? (and with which code snippet you get it) I would assume that pytz is only an optional dependency (i.e. only used when installed), which doesn't require adding it to setup.py install_requires. And if we accidentally depend on it unconditionally, that sounds like a bug. > package does not include pytz dependency > > > Key: ARROW-15580 > URL: https://issues.apache.org/jira/browse/ARROW-15580 > Project: Apache Arrow > Issue Type: Bug > Components: Packaging, Python >Affects Versions: 7.0.0 >Reporter: Mark Wood >Priority: Minor > > When installing pyarrow via poetry and executing it, a ModuleNotFoundError > can result. > Pyarrow has a dependency on pytz, but this dependency is not declared (the > only dependency I see declared is numpy, see > [https://github.com/apache/arrow/blob/47d55752172af99fce629f8fe6177df6b41af1d3/python/setup.py#L576)] > It would be helpful if pyarrow could declare its dependencies in such a way > that tools such as poetry could automatically ensure they were present. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
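A minimal sketch of the lazy-import pattern Joris describes here, using a hypothetical `optional_import` helper. This is illustrative only, not pyarrow's actual code; the point is that an optional dependency is imported only when the feature needing it is used, with a clear error otherwise:

```python
import importlib

def optional_import(name: str, feature: str):
    """Import an optional dependency lazily, failing with a clear message.

    Only called on the code path that actually needs the dependency, so
    the package's install_requires need not list it.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"'{name}' is required for {feature}; please install it separately"
        ) from exc

# Hypothetical usage on a timezone-handling code path:
#   pytz = optional_import("pytz", "timezone-aware timestamps")
```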
[jira] [Commented] (ARROW-15571) [C++] Add min/max/sqrt scalar kernels to execution engine
[ https://issues.apache.org/jira/browse/ARROW-15571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488098#comment-17488098 ] David Li commented on ARROW-15571: -- There are no operations that return NaN instead of null in this case. It could perhaps be added as an option (that would also be a separate JIRA) > [C++] Add min/max/sqrt scalar kernels to execution engine > - > > Key: ARROW-15571 > URL: https://issues.apache.org/jira/browse/ARROW-15571 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Priority: Major > Labels: kernel, pull-request-available > Time Spent: 50m > Remaining Estimate: 0h > > The list of execution engine's scalar kernels currently available in > `cpp/src/arrow/compute/kernels/scalar_arithmetic.cc` does not cover the > common minimum, maximum, and square-root functions. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15598) [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated gandiva IR code
[ https://issues.apache.org/jira/browse/ARROW-15598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-15598: --- Labels: pull-request-available (was: ) > [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated > gandiva IR code > --- > > Key: ARROW-15598 > URL: https://issues.apache.org/jira/browse/ARROW-15598 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Projjal Chanda >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-13993) [C++] Hash aggregate function that returns value from first row in group
[ https://issues.apache.org/jira/browse/ARROW-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488095#comment-17488095 ] David Li edited comment on ARROW-13993 at 2/7/22, 1:17 PM: --- Yes, we just want a single row per group. Any row will do; the point above is that we can't implement anything else (because the query engine currently lacks support for ordering, beyond sorting outputs at the very end). All hash_ kernels ("hash aggregate kernels") are in [{{hash_aggregate.cc}}|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/hash_aggregate.cc] and it will be very similar to the CountDistinct/Distinct implementation there. was (Author: lidavidm): Yes, we just want a single tuple. Any tuple will do; the point above is that we can't implement anything else (because the query engine currently lacks support for ordering, beyond sorting outputs at the very end). All hash_ kernels ("hash aggregate kernels") are in [{{hash_aggregate.cc}}|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/hash_aggregate.cc] and it will be very similar to the CountDistinct/Distinct implementation there. > [C++] Hash aggregate function that returns value from first row in group > > > Key: ARROW-13993 > URL: https://issues.apache.org/jira/browse/ARROW-13993 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Dhruv Vats >Priority: Major > Labels: good-second-issue, kernel > Fix For: 8.0.0 > > > It would be nice to have a hash aggregate function that returns the first > value of a column within each hash group. > If row order within groups is non-deterministic, then effectively this would > return one arbitrary value. This is a very computationally cheap operation. > This can be quite useful when querying a non-normalized table. 
For example if > you have a table with a {{country}} column and also a {{country_abbr}} column > and you want to group by either/both of those columns but return the values > from both columns, you could do > {code:java} > SELECT country, country_abbr FROM table GROUP BY country, country_abbr{code} > but it would be more efficient to do > {code:java} > SELECT country, first(country_abbr) FROM table GROUP BY country{code} > because then the engine does not need to scan all the values of the > {{country_abbr}} column. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15598) [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated gandiva IR code
Projjal Chanda created ARROW-15598: -- Summary: [C++][Gandiva] Avoid using hardcoded raw pointer addresses in generated gandiva IR code Key: ARROW-15598 URL: https://issues.apache.org/jira/browse/ARROW-15598 Project: Apache Arrow Issue Type: Bug Components: C++ - Gandiva Reporter: Projjal Chanda -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13993) [C++] Hash aggregate function that returns value from first row in group
[ https://issues.apache.org/jira/browse/ARROW-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488095#comment-17488095 ] David Li commented on ARROW-13993: -- Yes, we just want a single tuple. Any tuple will do; the point above is that we can't implement anything else (because the query engine currently lacks support for ordering, beyond sorting outputs at the very end). All hash_ kernels ("hash aggregate kernels") are in [{{hash_aggregate.cc}}|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/hash_aggregate.cc] and it will be very similar to the CountDistinct/Distinct implementation there. > [C++] Hash aggregate function that returns value from first row in group > > > Key: ARROW-13993 > URL: https://issues.apache.org/jira/browse/ARROW-13993 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Dhruv Vats >Priority: Major > Labels: good-second-issue, kernel > Fix For: 8.0.0 > > > It would be nice to have a hash aggregate function that returns the first > value of a column within each hash group. > If row order within groups is non-deterministic, then effectively this would > return one arbitrary value. This is a very computationally cheap operation. > This can be quite useful when querying a non-normalized table. For example if > you have a table with a {{country}} column and also a {{country_abbr}} column > and you want to group by either/both of those columns but return the values > from both columns, you could do > {code:java} > SELECT country, country_abbr FROM table GROUP BY country, country_abbr{code} > but it would be more efficient to do > {code:java} > SELECT country, first(country_abbr) FROM table GROUP BY country{code} > because then the engine does not need to scan all the values of the > {{country_abbr}} column. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-13993) [C++] Hash aggregate function that returns value from first row in group
[ https://issues.apache.org/jira/browse/ARROW-13993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488089#comment-17488089 ] Dhruv Vats commented on ARROW-13993: Just so I understand this correctly (as I don't have a very formal CS background), when we do: {code:sql} SELECT country, COUNT(customerID) FROM db_table GROUP BY country{code} from a supposed sales table {{db_table}} that has fields {{country}} and {{customerID}}, we get the number of customers _per_ country/group. So here, instead of aggregating all tuples in a group, we just want to return a single tuple from each group/country? And, it seems _which_ tuple (like either the first or a specific one) to return is yet to be finalised, right? Also, is there a PR or an existing kernel that has similar boilerplate code to what this will have? (That'll save a disproportionate amount of time going through all the abstractions.) > [C++] Hash aggregate function that returns value from first row in group > > > Key: ARROW-13993 > URL: https://issues.apache.org/jira/browse/ARROW-13993 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ian Cook >Assignee: Dhruv Vats >Priority: Major > Labels: good-second-issue, kernel > Fix For: 8.0.0 > > > It would be nice to have a hash aggregate function that returns the first > value of a column within each hash group. > If row order within groups is non-deterministic, then effectively this would > return one arbitrary value. This is a very computationally cheap operation. > This can be quite useful when querying a non-normalized table. 
For example if > you have a table with a {{country}} column and also a {{country_abbr}} column > and you want to group by either/both of those columns but return the values > from both columns, you could do > {code:java} > SELECT country, country_abbr FROM table GROUP BY country, country_abbr{code} > but it would be more efficient to do > {code:java} > SELECT country, first(country_abbr) FROM table GROUP BY country{code} > because then the engine does not need to scan all the values of the > {{country_abbr}} column. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488059#comment-17488059 ] Dragoș Moldovan-Grünfeld commented on ARROW-14471: -- Another alternative would be for {{strptime}} to error when the selected format does not match the data (for example, attempting to parse {{"09-12-31"}} with {{"%Y-%m-%d"}} should error due to a mismatch in the length of the year). Then we could rely on this behaviour with {{coalesce}}. Should I create a ticket for this? > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488050#comment-17488050 ] Dragoș Moldovan-Grünfeld edited comment on ARROW-14471 at 2/7/22, 11:41 AM: We could do some processing to figure out how many characters we have (in the string to be parsed) in between the separators (or how many characters we have in total, in the cases where we have no separator) and only try with the suitable formats, i.e. in the example above, not try to parse with {{%Y}}, only {{%y}}. was (Author: dragosmg): We could do some processing to figure out how many characters we have in between the separators (or how many characters we have in total, in the cases where we have no separator) and only try with the suitable formats, i.e. in the example above, not try to parse with {{%Y}}, only {{%y}}. > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488050#comment-17488050 ] Dragoș Moldovan-Grünfeld commented on ARROW-14471: -- We could do some processing to figure out how many characters we have in between the separators (or how many characters we have in total, in the cases where we have no separator) and only try with the suitable formats, i.e. in the example above, not try to parse with {{%Y}}, only {{%y}}. > [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14471) [R] Implement lubridate's date/time parsing functions
[ https://issues.apache.org/jira/browse/ARROW-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488048#comment-17488048 ] Dragoș Moldovan-Grünfeld commented on ARROW-14471: -- [~paleolimbot] I don't think we can rely on {{coalesce()}} to iterate through the various formats supported for {{ymd()}}. It would need to rely on the assumption that the passed {{format}} matches the data or otherwise fail. Sadly, arrow works with a wrong format resulting in weird timestamps: {code:r} suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(arrow)) suppressPackageStartupMessages(library(lubridate)) df <- tibble(x = c("09-01-01", "09-01-02", "09-01-03")) df #> # A tibble: 3 × 1 #> x #> #> 1 09-01-01 #> 2 09-01-02 #> 3 09-01-03 # lubridate::ymd() df %>% mutate(y = ymd(x)) #> # A tibble: 3 × 2 #> xy #> #> 1 09-01-01 2009-01-01 #> 2 09-01-02 2009-01-02 #> 3 09-01-03 2009-01-03 # y = short year correct df %>% record_batch() %>% mutate(y = strptime(x, format = "%y-%m-%d", unit = "us")) %>% collect() #> # A tibble: 3 × 2 #> xy #> #> 1 09-01-01 2009-01-01 00:00:00 #> 2 09-01-02 2009-01-02 00:00:00 #> 3 09-01-03 2009-01-03 00:00:00 # Y = long year this should fail in order for us to rely on coalesce df %>% record_batch() %>% mutate(y = strptime(x, format = "%Y-%m-%d", unit = "us")) %>% collect() #> # A tibble: 3 × 2 #> xy #> #> 1 09-01-01 0008-12-31 23:58:45 #> 2 09-01-02 0009-01-01 23:58:45 #> 3 09-01-03 0009-01-02 23:58:45 {code} Therefore, my conclusion would be that we cannot implement {{arrow::ymd()}} binding as {{coalesce(strptime(x, format1), strptime(x, format2), ...)}}. What do you think? 
> [R] Implement lubridate's date/time parsing functions > - > > Key: ARROW-14471 > URL: https://issues.apache.org/jira/browse/ARROW-14471 > Project: Apache Arrow > Issue Type: Sub-task > Components: R >Reporter: Nicola Crane >Assignee: Dragoș Moldovan-Grünfeld >Priority: Major > Fix For: 8.0.0 > > > Parse dates with year, month, and day components: > ymd() ydm() mdy() myd() dmy() dym() yq() ym() my() > > Parse date-times with year, month, and day, hour, minute, and second > components: > ymd_hms() ymd_hm() ymd_h() dmy_hms() dmy_hm() dmy_h() mdy_hms() mdy_hm() > mdy_h() ydm_hms() ydm_hm() ydm_h() > Parse periods with hour, minute, and second components: > ms() hm() hms() > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15591) [C++] Add support for aggregation to the Substrait consumer
[ https://issues.apache.org/jira/browse/ARROW-15591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-15591: Assignee: Vibhatha Lakmal Abeykoon > [C++] Add support for aggregation to the Substrait consumer > --- > > Key: ARROW-15591 > URL: https://issues.apache.org/jira/browse/ARROW-15591 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: substrait > > The streaming execution engine supports aggregation (i.e. group by). The > Substrait consumer does not currently consume aggregation relations. We > should add support for this. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (ARROW-15590) [C++] Add support for joins to the Substrait consumer
[ https://issues.apache.org/jira/browse/ARROW-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vibhatha Lakmal Abeykoon reassigned ARROW-15590: Assignee: Vibhatha Lakmal Abeykoon > [C++] Add support for joins to the Substrait consumer > - > > Key: ARROW-15590 > URL: https://issues.apache.org/jira/browse/ARROW-15590 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Weston Pace >Assignee: Vibhatha Lakmal Abeykoon >Priority: Major > Labels: substrait > > The streaming execution engine supports joins. The Substrait consumer does > not currently consume joins. We should add support for this. We may want to > split this PR into subtasks as there are many different kinds of joins and we > may not support all of them immediately. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15255) [C++][Developer Tools] Create native ubuntu-lint container for ARM
[ https://issues.apache.org/jira/browse/ARROW-15255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17487940#comment-17487940 ] Yibo Cai commented on ARROW-15255: -- One blocking issue is that we use {{hadolint}}, but it has no arm64 docker image. [1] There's an open PR to support building and publishing an {{hadolint}} arm64 image, but it does not look active now. [2] [1] https://github.com/apache/arrow/blob/master/ci/docker/linux-apt-lint.dockerfile#L19 [2] https://github.com/hadolint/hadolint/pull/694 > [C++][Developer Tools] Create native ubuntu-lint container for ARM > -- > > Key: ARROW-15255 > URL: https://issues.apache.org/jira/browse/ARROW-15255 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, Developer Tools >Reporter: David Li >Priority: Major > > It apparently runs via emulation, but would presumably be faster if a native > version were available. That said, I don't actually have an ARM machine to > experience this. -- This message was sent by Atlassian Jira (v8.20.1#820001)