[jira] [Updated] (ARROW-13603) [GLib] GARROW_VERSION_CHECK() always returns false

2021-08-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13603:
---
Labels: pull-request-available  (was: )

> [GLib] GARROW_VERSION_CHECK() always returns false
> --
>
> Key: ARROW-13603
> URL: https://issues.apache.org/jira/browse/ARROW-13603
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: GLib
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13603) [GLib] GARROW_VERSION_CHECK() always returns false

2021-08-10 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-13603:


 Summary: [GLib] GARROW_VERSION_CHECK() always returns false
 Key: ARROW-13603
 URL: https://issues.apache.org/jira/browse/ARROW-13603
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou
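
The report carries no description, but GARROW_VERSION_CHECK() is a GLib-style compile-time version check, and "always returns false" is the classic symptom of a comparison chain that is wrong in one branch or compares against the wrong constant. The sketch below shows the intended shape of such a macro; the version numbers and macro bodies here are illustrative assumptions, not Arrow GLib's actual definitions.

```cpp
#include <cassert>

// Hypothetical reconstruction of a GLib-style version-check macro.
// Names mirror Arrow GLib; the values are placeholders for this sketch.
#define GARROW_VERSION_MAJOR 5
#define GARROW_VERSION_MINOR 0
#define GARROW_VERSION_MICRO 0

// True when the compile-time version is >= major.minor.micro.
// Each branch must compare the right constant; a typo in any one of
// them (e.g. MAJOR where MINOR was meant) can make the whole
// expression constant-false for every realistic input.
#define GARROW_VERSION_CHECK(major, minor, micro)        \
  (GARROW_VERSION_MAJOR > (major) ||                     \
   (GARROW_VERSION_MAJOR == (major) &&                   \
    GARROW_VERSION_MINOR > (minor)) ||                   \
   (GARROW_VERSION_MAJOR == (major) &&                   \
    GARROW_VERSION_MINOR == (minor) &&                   \
    GARROW_VERSION_MICRO >= (micro)))
```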








[jira] [Updated] (ARROW-13601) [C++] Tests maybe uninitialized compiler warnings

2021-08-10 Thread Keith Kraus (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Keith Kraus updated ARROW-13601:

Description: 
Using gcc9 I get the following:

{code}
/home/keith/git/arrow/cpp/src/arrow/testing/json_internal.cc: In function 
'arrow::Status arrow::testing::json::{anonymous}::GetField(const Value&, 
arrow::ipc::internal::FieldPosition, arrow::ipc::DictionaryMemo*, 
std::shared_ptr*)':
/home/keith/git/arrow/cpp/src/arrow/testing/json_internal.cc:1128:31: warning: 
'is_ordered' may be used uninitialized in this function [-Wmaybe-uninitialized]
 1128 | type = ::arrow::dictionary(index_type, type, is_ordered);
  |~~~^~
{code}
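
All of these warnings share one shape: a variable that is only assigned on some control-flow paths and then read unconditionally. A minimal sketch of the `is_ordered` case and its usual fix follows; `ParseOrderedFlag` is a hypothetical stand-in for the JSON-parsing logic in json_internal.cc, not the actual Arrow code.

```cpp
#include <cassert>

// The warned-about pattern, conceptually:
//
//   bool is_ordered;                  // no initializer
//   if (HasKey(json, "isOrdered")) {  // only one path assigns it
//     is_ordered = ...;
//   }
//   type = ::arrow::dictionary(index_type, type, is_ordered);  // gcc 9 warns
//
// The usual fix is a well-defined default at the point of declaration:
bool ParseOrderedFlag(bool has_key, bool key_value) {
  bool is_ordered = false;  // default both silences the warning and
                            // defines behavior on the untaken path
  if (has_key) {
    is_ordered = key_value;
  }
  return is_ordered;
}
```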

{code}
In file included from 
/home/keith/git/arrow/cpp/src/arrow/util/bit_run_reader.h:26,
 from 
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:45:
/home/keith/git/arrow/cpp/src/arrow/util/bitmap_reader.h: In member function 
'void arrow::TestBitmapUInt64Reader::AssertWords(const arrow::Buffer&, int64_t, 
int64_t, const std::vector&)':
/home/keith/git/arrow/cpp/src/arrow/util/bitmap_reader.h:99:16: warning: 
'reader.arrow::internal::BitmapUInt64Reader::carry_bits_' may be used 
uninitialized in this function [-Wmaybe-uninitialized]
   99 |   uint64_t word = carry_bits_ | (next_word << num_carry_bits_);
  |^~~~
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:242:34: note: 
'reader.arrow::internal::BitmapUInt64Reader::carry_bits_' was declared here
  242 | internal::BitmapUInt64Reader reader(buffer.data(), start_offset, 
length);
{code}

{code}
In file included from 
/home/keith/git/arrow/cpp/src/arrow/util/async_generator.h:30,
 from 
/home/keith/git/arrow/cpp/src/arrow/util/iterator_test.cc:30:
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h: In member function 
'arrow::Result arrow::TransformIterator::Next() [with T = 
std::shared_ptr; V = std::shared_ptr]':
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h:288:12: warning: 
'*((void*)(& next)+24).std::__shared_count<>::_M_pi' may be used uninitialized 
in this function [-Wmaybe-uninitialized]
  288 |   auto next = *next_res;
  |^~~~
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc: In function 
'arrow::util::UTF8FindIf_Basics_Test::TestBody()::':
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:463:35: warning: 
'right' may be used uninitialized in this function [-Wmaybe-uninitialized]
  463 | EXPECT_EQ(offset_right, right - begin);
  |   ^
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:462:33: warning: 
'left' may be used uninitialized in this function [-Wmaybe-uninitialized]
  462 | EXPECT_EQ(offset_left, left - begin);
  | ^
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc: In function 
'arrow::util::UTF8FindIf_Basics_Test::TestBody()::':
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:447:35: warning: 
'right' may be used uninitialized in this function [-Wmaybe-uninitialized]
  447 | EXPECT_EQ(offset_right, right - begin);
  |   ^
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:446:33: warning: 
'left' may be used uninitialized in this function [-Wmaybe-uninitialized]
  446 | EXPECT_EQ(offset_left, left - begin);
  | ^
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc: In member function 
'virtual void arrow::util::{anonymous}::VariantTest_Visit_Test::TestBody()':
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc:260:3: warning: 
'*((void*)& +8)' may be used uninitialized in this function 
[-Wmaybe-uninitialized]
  260 |   for (auto Assert :
  |   ^~~
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc:230:8: warning: 
'.arrow::util::{anonymous}::AssertVisitOne >, const char*>, std::vector >::expected_' may be 
used uninitialized in this function [-Wmaybe-uninitialized]
  230 | struct AssertVisitOne {
  |^~
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/gpu/cuda_benchmark.cc: In function 'void 
arrow::cuda::CudaBufferWriterBenchmark(benchmark::State&, int64_t, int64_t, 
int64_t)':
/home/keith/git/arrow/cpp/src/arrow/gpu/cuda_benchmark.cc:42:35: warning: 
'manager' may be used uninitialized in this function [-Wmaybe-uninitialized]
   42 |   ABORT_NOT_OK(manager->GetContext(kGpuNumber).Value());
  |   ^
{code}

{code}
/home/keith/git/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc: In function 
'arrow::Result > 
parquet::arrow::RootToTreeLeafLevels(const parquet::arrow::SchemaManifest&, 
int)':
/home/keith/git/arrow/cpp/src/parquet/arrow/arrow_schema_test.cc:1343:16: 
{code}

[jira] [Created] (ARROW-13602) [C++] Tests dereferencing type-punned pointer compiler warnings

2021-08-10 Thread Keith Kraus (Jira)
Keith Kraus created ARROW-13602:
---

 Summary: [C++] Tests dereferencing type-punned pointer compiler 
warnings
 Key: ARROW-13602
 URL: https://issues.apache.org/jira/browse/ARROW-13602
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Keith Kraus


Using gcc9:
 
{code}
In file included from /home/keith/miniconda3/envs/dev/include/gtest/gtest.h:375,
 from 
/home/keith/miniconda3/envs/dev/include/gmock/internal/gmock-internal-utils.h:47,
 from 
/home/keith/miniconda3/envs/dev/include/gmock/gmock-actions.h:51,
 from /home/keith/miniconda3/envs/dev/include/gmock/gmock.h:59,
 from 
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:30:
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc: In member function 
'virtual void arrow::BitUtil_ByteSwap_Test::TestBody()':
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:1835:32: warning: 
dereferencing type-punned pointer will break strict-aliasing rules 
[-Wstrict-aliasing]
 1835 |   EXPECT_EQ(BitUtil::ByteSwap(*reinterpret_cast()),
  |^
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:1836:14: warning: 
dereferencing type-punned pointer will break strict-aliasing rules 
[-Wstrict-aliasing]
 1836 | *reinterpret_cast());
  |  ^~
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:1838:32: warning: 
dereferencing type-punned pointer will break strict-aliasing rules 
[-Wstrict-aliasing]
 1838 |   EXPECT_EQ(BitUtil::ByteSwap(*reinterpret_cast()),
  |^~
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:1839:14: warning: 
dereferencing type-punned pointer will break strict-aliasing rules 
[-Wstrict-aliasing]
 1839 | *reinterpret_cast());
  |  ^~~
{code}
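
The warnings above come from reading a value's bytes through a pointer of an unrelated type (the template arguments were lost from the report, but the pattern is `*reinterpret_cast<SomeInt*>(&value)`). Reading, say, a float through a `uint32_t*` violates strict aliasing; `std::memcpy` (or C++20 `std::bit_cast`) expresses the same bit reinterpretation with defined behavior, and compilers fold it into a single load. A sketch:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Well-defined type punning: copy the object representation instead of
// dereferencing a type-punned pointer.
uint32_t FloatBits(float f) {
  uint32_t bits;
  static_assert(sizeof(bits) == sizeof(f), "float must be 32-bit here");
  std::memcpy(&bits, &f, sizeof(bits));  // optimizes to a plain load
  return bits;
}
```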





[jira] [Created] (ARROW-13601) [C++] Tests maybe uninitialized compiler warnings

2021-08-10 Thread Keith Kraus (Jira)
Keith Kraus created ARROW-13601:
---

 Summary: [C++] Tests maybe uninitialized compiler warnings
 Key: ARROW-13601
 URL: https://issues.apache.org/jira/browse/ARROW-13601
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Keith Kraus


Using gcc9 I get the following:

{code}
/home/keith/git/arrow/cpp/src/arrow/testing/json_internal.cc: In function 
'arrow::Status arrow::testing::json::{anonymous}::GetField(const Value&, 
arrow::ipc::internal::FieldPosition, arrow::ipc::DictionaryMemo*, 
std::shared_ptr*)':
/home/keith/git/arrow/cpp/src/arrow/testing/json_internal.cc:1128:31: warning: 
'is_ordered' may be used uninitialized in this function [-Wmaybe-uninitialized]
 1128 | type = ::arrow::dictionary(index_type, type, is_ordered);
  |~~~^~
{code}

{code}
In file included from 
/home/keith/git/arrow/cpp/src/arrow/util/bit_run_reader.h:26,
 from 
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:45:
/home/keith/git/arrow/cpp/src/arrow/util/bitmap_reader.h: In member function 
'void arrow::TestBitmapUInt64Reader::AssertWords(const arrow::Buffer&, int64_t, 
int64_t, const std::vector&)':
/home/keith/git/arrow/cpp/src/arrow/util/bitmap_reader.h:99:16: warning: 
'reader.arrow::internal::BitmapUInt64Reader::carry_bits_' may be used 
uninitialized in this function [-Wmaybe-uninitialized]
   99 |   uint64_t word = carry_bits_ | (next_word << num_carry_bits_);
  |^~~~
/home/keith/git/arrow/cpp/src/arrow/util/bit_util_test.cc:242:34: note: 
'reader.arrow::internal::BitmapUInt64Reader::carry_bits_' was declared here
  242 | internal::BitmapUInt64Reader reader(buffer.data(), start_offset, 
length);
{code}

{code}
In file included from 
/home/keith/git/arrow/cpp/src/arrow/util/async_generator.h:30,
 from 
/home/keith/git/arrow/cpp/src/arrow/util/iterator_test.cc:30:
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h: In member function 
'arrow::Result arrow::TransformIterator::Next() [with T = 
std::shared_ptr; V = std::shared_ptr]':
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h:288:12: warning: 
'*((void*)(& next)+24).std::__shared_count<>::_M_pi' may be used uninitialized 
in this function [-Wmaybe-uninitialized]
  288 |   auto next = *next_res;
  |^~~~
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/iterator.h:288:12: warning: 
'*((void*)(& next)+16).std::__shared_ptr::_M_ptr' 
may be used uninitialized in this function [-Wmaybe-uninitialized]
[359/671] Building CXX object 
src/arrow/util/CMakeFiles/arrow-utility-test.dir/utf8_util_test.cc.o
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc: In function 
'arrow::util::UTF8FindIf_Basics_Test::TestBody()::':
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:463:35: warning: 
'right' may be used uninitialized in this function [-Wmaybe-uninitialized]
  463 | EXPECT_EQ(offset_right, right - begin);
  |   ^
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:462:33: warning: 
'left' may be used uninitialized in this function [-Wmaybe-uninitialized]
  462 | EXPECT_EQ(offset_left, left - begin);
  | ^
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc: In function 
'arrow::util::UTF8FindIf_Basics_Test::TestBody()::':
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:447:35: warning: 
'right' may be used uninitialized in this function [-Wmaybe-uninitialized]
  447 | EXPECT_EQ(offset_right, right - begin);
  |   ^
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/utf8_util_test.cc:446:33: warning: 
'left' may be used uninitialized in this function [-Wmaybe-uninitialized]
  446 | EXPECT_EQ(offset_left, left - begin);
  | ^
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc: In member function 
'virtual void arrow::util::{anonymous}::VariantTest_Visit_Test::TestBody()':
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc:260:3: warning: 
'*((void*)& +8)' may be used uninitialized in this function 
[-Wmaybe-uninitialized]
  260 |   for (auto Assert :
  |   ^~~
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/util/variant_test.cc:230:8: warning: 
'.arrow::util::{anonymous}::AssertVisitOne >, const char*>, std::vector >::expected_' may be 
used uninitialized in this function [-Wmaybe-uninitialized]
  230 | struct AssertVisitOne {
  |^~
{code}

{code}
/home/keith/git/arrow/cpp/src/arrow/gpu/cuda_benchmark.cc: In function 'void 
arrow::cuda::CudaBufferWriterBenchmark(benchmark::State&, int64_t, int64_t, 
int64_t)':
/home/keith/git/arrow/cpp/src/arrow/gpu/cuda_benchmark.cc:42:35: warning: 
'manager' 
{code}

[jira] [Updated] (ARROW-13600) [C++] Maybe uninitialized warnings

2021-08-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13600:
---
Labels: pull-request-available  (was: )

> [C++] Maybe uninitialized warnings
> --
>
> Key: ARROW-13600
> URL: https://issues.apache.org/jira/browse/ARROW-13600
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Keith Kraus
>Assignee: Keith Kraus
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code}
> Building CXX object 
> src/arrow/CMakeFiles/arrow_objlib.dir/compute/exec/key_hash.cc.o
> /home/keith/git/arrow/cpp/src/arrow/compute/exec/key_hash.cc: In static 
> member function 'static void 
> arrow::compute::Hashing::hash_varlen_helper(uint32_t, const uint8_t*, 
> uint32_t*)':
> /home/keith/git/arrow/cpp/src/arrow/compute/exec/key_hash.cc:202:16: warning: 
> 'last_stripe' may be used uninitialized in this function 
> [-Wmaybe-uninitialized]
>   202 |   uint32_t lane = reinterpret_cast<const uint32_t*>(last_stripe)[j];
>   |^~~~
> {code}
> {code}
> Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/tensor.cc.o
> /home/keith/git/arrow/cpp/src/arrow/tensor.cc: In member function 
> 'arrow::Result arrow::Tensor::CountNonZero() const':
> /home/keith/git/arrow/cpp/src/arrow/tensor.cc:337:18: warning: '*((void*)& 
> counter +8)' may be used uninitialized in this function 
> [-Wmaybe-uninitialized]
>   337 |   NonZeroCounter counter(*this);
> {code}





[jira] [Created] (ARROW-13600) [C++] Maybe uninitialized warnings

2021-08-10 Thread Keith Kraus (Jira)
Keith Kraus created ARROW-13600:
---

 Summary: [C++] Maybe uninitialized warnings
 Key: ARROW-13600
 URL: https://issues.apache.org/jira/browse/ARROW-13600
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Keith Kraus
Assignee: Keith Kraus


{code}
Building CXX object 
src/arrow/CMakeFiles/arrow_objlib.dir/compute/exec/key_hash.cc.o
/home/keith/git/arrow/cpp/src/arrow/compute/exec/key_hash.cc: In static member 
function 'static void arrow::compute::Hashing::hash_varlen_helper(uint32_t, 
const uint8_t*, uint32_t*)':
/home/keith/git/arrow/cpp/src/arrow/compute/exec/key_hash.cc:202:16: warning: 
'last_stripe' may be used uninitialized in this function [-Wmaybe-uninitialized]
  202 |   uint32_t lane = reinterpret_cast<const uint32_t*>(last_stripe)[j];
  |^~~~
{code}

{code}
Building CXX object src/arrow/CMakeFiles/arrow_objlib.dir/tensor.cc.o
/home/keith/git/arrow/cpp/src/arrow/tensor.cc: In member function 
'arrow::Result arrow::Tensor::CountNonZero() const':
/home/keith/git/arrow/cpp/src/arrow/tensor.cc:337:18: warning: '*((void*)& 
counter +8)' may be used uninitialized in this function [-Wmaybe-uninitialized]
  337 |   NonZeroCounter counter(*this);
{code}





[jira] [Commented] (ARROW-13518) Identify selected row when using filters

2021-08-10 Thread Weston Pace (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396886#comment-17396886
 ] 

Weston Pace commented on ARROW-13518:
-

ARROW-13599 is somewhat related.

> Identify selected row when using filters
> 
>
> Key: ARROW-13518
> URL: https://issues.apache.org/jira/browse/ARROW-13518
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Parquet, Python
>Reporter: Yair Lenga
>Priority: Major
>
> I created a proposed enhancement to speed up reading of specific rows 
> arrow-13517 https://issues.apache.org/jira/browse/ARROW-13517
> proposing extending the functions that provides filter parquet.read_table 
> ([https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table])
>  to support returning actual row numbers (e.g, row_group and row_index). 
> with the proposed enhancement, this can provide for faster reading of the 
> data (e.g. by caching the return indices, and reading the full data when 
> needed). 
> proposed implementation will be to add 2 pseudo columns, which can be 
> requested in the columns list. E.g., columns=[ ‘$row_group’, ‘$row_index’, 
> ‘dealid’, …] or similar.
>  * $row_group - 0 based row group index
>  * $row_index - 0  based position within the row group
>  * $row_file_index - 0 based position in the file (not critical), can be 
> constructed from the other two
>  
> not sure if this requires change to the c++ interface, or just to the python 
> part of pyarrow.
>  
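
The mapping the proposed pseudo-columns encode can be stated concretely: given the row counts of each row group (available in Parquet metadata), a file-level row number decomposes into a ($row_group, $row_index) pair. The sketch below is illustrative bookkeeping, not a pyarrow or Parquet API:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Decompose a file-level row number into (row group index, index within
// that row group), given per-group row counts from the file metadata.
std::pair<int64_t, int64_t> LocateRow(
    const std::vector<int64_t>& group_num_rows, int64_t file_row) {
  int64_t group = 0;
  for (int64_t n : group_num_rows) {
    if (file_row < n) return {group, file_row};  // containing group found
    file_row -= n;
    ++group;
  }
  return {-1, -1};  // row number past the end of the file
}
```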





[jira] [Created] (ARROW-13599) [C++] [Dataset] Add optional scan type that tags batches with locational information

2021-08-10 Thread Weston Pace (Jira)
Weston Pace created ARROW-13599:
---

 Summary: [C++] [Dataset] Add optional scan type that tags batches 
with locational information
 Key: ARROW-13599
 URL: https://issues.apache.org/jira/browse/ARROW-13599
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Weston Pace


Currently there are two types of scans:

 * Ordered scan - Yields batches in order (includes batch index and fragment 
index)
 * Unordered scan - Yields batches in any order (no batch index or fragment 
index)

There is a third type of scan (Tagged scan?  Indexed scan?) which could tag 
each batch with the starting row # of the batch.  Certain file types (like 
parquet & IPC) should be able to support this with similar performance to an 
unordered scan (since the # of rows is in the metadata).

Other file types (like CSV) could fall back to an ordered scan or do something 
like a two pass approach to count the # of newlines in a file and then scan the 
file itself (not sure if this makes sense yet).
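
The bookkeeping such a tagged scan needs is cheap for metadata-bearing formats: given per-batch row counts (readable from Parquet/IPC metadata without touching the data), the absolute starting row number of each batch is a running sum. Names below are illustrative, not the dataset API:

```cpp
#include <cstdint>
#include <vector>

// Compute each batch's absolute starting row number from per-batch row
// counts; batch i covers rows [starts[i], starts[i] + batch_num_rows[i]).
std::vector<int64_t> BatchStartRows(const std::vector<int64_t>& batch_num_rows) {
  std::vector<int64_t> starts;
  starts.reserve(batch_num_rows.size());
  int64_t offset = 0;
  for (int64_t n : batch_num_rows) {
    starts.push_back(offset);  // this batch begins at the running total
    offset += n;
  }
  return starts;
}
```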



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13013) [C++][Compute][Python] Move (majority of) kernel unit tests to python

2021-08-10 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace reassigned ARROW-13013:
---

Assignee: Weston Pace

> [C++][Compute][Python] Move (majority of) kernel unit tests to python
> -
>
> Key: ARROW-13013
> URL: https://issues.apache.org/jira/browse/ARROW-13013
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Reporter: Ben Kietzman
>Assignee: Weston Pace
>Priority: Major
>
> mailing list discussion: 
> [https://lists.apache.org/thread.html/r09e0e0fbb8b655bbec8cf5662d224f3dfc4fba894a312900f73ae3bf%40%3Cdev.arrow.apache.org%3E]
> Writing unit tests for compute functions in c++ is laborious, entails a lot 
> of boilerplate, and slows iteration since it requires recompilation when 
> adding new tests. The majority of these test cases need not be written in C++ 
> at all and could instead be made part of the pyarrow test suite.
> In order to make the kernels' C++ implementations easily debuggable from unit 
> tests, we'll have to expose a c++ function named {{AssertCallFunction}} or 
> so. {{AssertCallFunction}} will invoke the named compute::Function and 
> compare actual results to expected without crossing the C++/python boundary, 
> allowing a developer to step through all relevant code with a single 
> breakpoint in GDB. Construction of scalars/arrays/function options and any 
> other inputs to the function is amply supported by {{pyarrow}}, and will 
> happen outside the scope of {{AssertCallFunction}}.
> {{AssertCallFunction}} should not try to derive additional assertions from 
> its arguments - for example {{CheckScalar("add", {left, right}, expected)}} 
> will first assert that {{left + right == expected}} then 
> {{left.slice(1) + right.slice(1) == expected.slice(1)}} to ensure that 
> offsets are handled correctly. This has value but can be easily expressed in 
> Python and configuration of such behavior would overcomplicate the interface 
> of {{AssertCallFunction}}.
> Unit tests for kernels would then be written in 
> {{arrow/python/pyarrow/test/kernels/test_*.py}}. The C++ unit test for 
> [addition with implicit 
> casts|https://github.com/apache/arrow/blob/b38ab81cb96e393a026d05a22e5a2f62ff6c23d7/cpp/src/arrow/compute/kernels/scalar_arithmetic_test.cc#L897-L918]
>  could then be rewritten as
> {code:python}
> def test_addition_implicit_casts():
>     AssertCallFunction("add", [[0,    1,   2,    None],
>                                [0.25, 1.5, 2.75, None]],
>                        expected=[0.25, 2.5, 4.75, None])
>     # ...
> {code}
> NB: Some unit tests will probably still reside in C++ since we'll need to 
> test things we don't wish to expose in a user facing API, such as "whether a 
> boolean kernel avoids clobbering bits when outputting into a slice". These 
> should be far more manageable since they won't need to assert correct logic 
> across all possible input types





[jira] [Commented] (ARROW-12922) [Java][FlightSQL] Create stubbed APIs for Flight SQL

2021-08-10 Thread Kyle Porter (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396858#comment-17396858
 ] 

Kyle Porter commented on ARROW-12922:
-

Work has now moved to https://github.com/apache/arrow/pull/10906.

> [Java][FlightSQL] Create stubbed APIs for Flight SQL
> 
>
> Key: ARROW-12922
> URL: https://issues.apache.org/jira/browse/ARROW-12922
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Tiffany Lam
>Assignee: Kyle Porter
>Priority: Major
>  Labels: FlightSQL, pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This task is to create stubbed APIs for a Flight SQL client, server and 
> sample application.
>  * Contain TODOs referencing implementation Jira tasks.
>  * Should also be accompanied by Javadocs describing behaviour of the 
> methods/APIs.
>  * TODO: breakdown poc PR 
>  
> *Acceptance Criteria*
>  * TODO





[jira] [Updated] (ARROW-12922) [Java][FlightSQL] Create stubbed APIs for Flight SQL

2021-08-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-12922:
---
Labels: FlightSQL pull-request-available  (was: FlightSQL)

> [Java][FlightSQL] Create stubbed APIs for Flight SQL
> 
>
> Key: ARROW-12922
> URL: https://issues.apache.org/jira/browse/ARROW-12922
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Java
>Reporter: Tiffany Lam
>Assignee: Kyle Porter
>Priority: Major
>  Labels: FlightSQL, pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This task is to create stubbed APIs for a Flight SQL client, server and 
> sample application.
>  * Contain TODOs referencing implementation Jira tasks.
>  * Should also be accompanied by Javadocs describing behaviour of the 
> methods/APIs.
>  * TODO: breakdown poc PR 
>  
> *Acceptance Criteria*
>  * TODO





[jira] [Commented] (ARROW-9226) [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or hdfs-site.xml if available

2021-08-10 Thread Itamar Turner-Trauring (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396856#comment-17396856
 ] 

Itamar Turner-Trauring commented on ARROW-9226:
---

Looking through the code:

**The deprecated API:**

1. `pyarrow.hdfs` imports from `_hdfsio.pyx`.
2. This is a thin wrapper around `CIOHadoopFileSystem` and `HdfsConnectionConfig`.
3. The former is a wrapper around `arrow::io::HadoopFileSystem` (see 
`libarrow.pxd`).
 
**The new API:**

1. `pyarrow.fs` imports from `_hdfs.pyx`.
2. This builds on the Cython classes `CHdfsOptions` and `CHadoopFileSystem`, with 
a very small amount of wrapper code.
3. These are synonyms for the C++ classes `arrow::fs::HdfsOptions` and 
`arrow::fs::HadoopFileSystem` (see `libarrow_fs.pxd`).

---

Looking through the old code, the connection code path is in 
`cpp/src/arrow/io/hdfs.cc` and mostly interacts with the driver, which comes 
from either `libhdfs` or `libhdfs3` via a shim 
(https://github.com/apache/arrow/blob/7b66f97330215fe020ec536671ee50f41aa1af35/cpp/src/arrow/io/hdfs_internal.h)
 that `dlopen()`s the underlying `libhdfs` library.

... and `libhdfs` then calls into the Java implementation.

Further digging suggests that the new code path (`arrow::fs::HadoopFileSystem`) 
still uses the `libhdfs`/`libhdfs3` drivers from `arrow::io`:

https://github.com/apache/arrow/blob/7b66f97330215fe020ec536671ee50f41aa1af35/cpp/src/arrow/filesystem/hdfs.cc#L56

Given that all the heavy lifting seems to be done by the underlying libraries, 
this functionality could likely be exposed again; the issue is less about 
implementing the logic and more about re-exposing the underlying API.

I am probably going to try to do this.

> [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or 
> hdfs-site.xml if available
> 
>
> Key: ARROW-9226
> URL: https://issues.apache.org/jira/browse/ARROW-9226
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.17.1
>Reporter: Bruno Quinart
>Priority: Minor
>  Labels: hdfs
>
> 'Legacy' pyarrow.hdfs.connect was somehow able to get the namenode info from 
> the hadoop configuration files.
> The new pyarrow.fs.HadoopFileSystem requires the host to be specified.
> Inferring this info from "the environment" makes it easier to deploy 
> pipelines.
> But more important, for HA namenodes it is almost impossible to know for sure 
> what to specify. If a rolling restart is ongoing, the namenode is changing. 
> There is no guarantee on which will be active in a HA setup.
> I tried connecting to the standby namenode. The connection gets established, 
> but when writing a file an error is raised that standby namenodes are not 
> allowed to write to.
>  





[jira] [Commented] (ARROW-9226) [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or hdfs-site.xml if available

2021-08-10 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396847#comment-17396847
 ] 

Antoine Pitrou commented on ARROW-9226:
---

[~itamarst] Could you explain how this logic can be exposed? Most of us here 
don't have any deep HDFS knowledge.

> [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or 
> hdfs-site.xml if available
> 
>
> Key: ARROW-9226
> URL: https://issues.apache.org/jira/browse/ARROW-9226
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.17.1
>Reporter: Bruno Quinart
>Priority: Minor
>  Labels: hdfs
>
> 'Legacy' pyarrow.hdfs.connect was somehow able to get the namenode info from 
> the hadoop configuration files.
> The new pyarrow.fs.HadoopFileSystem requires the host to be specified.
> Inferring this info from "the environment" makes it easier to deploy 
> pipelines.
> But more important, for HA namenodes it is almost impossible to know for sure 
> what to specify. If a rolling restart is ongoing, the namenode is changing. 
> There is no guarantee on which will be active in a HA setup.
> I tried connecting to the standby namenode. The connection gets established, 
> but when writing a file an error is raised that standby namenodes are not 
> allowed to write to.
>  





[jira] [Commented] (ARROW-9226) [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or hdfs-site.xml if available

2021-08-10 Thread Itamar Turner-Trauring (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-9226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396846#comment-17396846
 ] 

Itamar Turner-Trauring commented on ARROW-9226:
---

Digging through the code, it doesn't seem like this logic was ever implemented 
in Arrow itself; deep down enough, it's logic from `libhdfs`/`libhdfs3`. If I 
read this correctly, since the new API still uses those underneath, it's 
probably just a matter of (re)exposing the low-level logic in the Arrow wrapper.

> [Python] pyarrow.fs.HadoopFileSystem - retrieve options from core-site.xml or 
> hdfs-site.xml if available
> 
>
> Key: ARROW-9226
> URL: https://issues.apache.org/jira/browse/ARROW-9226
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Python
>Affects Versions: 0.17.1
>Reporter: Bruno Quinart
>Priority: Minor
>  Labels: hdfs
>
> 'Legacy' pyarrow.hdfs.connect was somehow able to get the namenode info from 
> the hadoop configuration files.
> The new pyarrow.fs.HadoopFileSystem requires the host to be specified.
> Inferring this info from "the environment" makes it easier to deploy 
> pipelines.
> But more important, for HA namenodes it is almost impossible to know for sure 
> what to specify. If a rolling restart is ongoing, the namenode is changing. 
> There is no guarantee on which will be active in a HA setup.
> I tried connecting to the standby namenode. The connection gets established, 
> but when writing a file an error is raised that standby namenodes are not 
> allowed to write to.
>  





[jira] [Updated] (ARROW-13574) [C++] Add 'count all' option to count (hash) aggregate kernel

2021-08-10 Thread Ian Cook (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ian Cook updated ARROW-13574:
-
Description: The current "count" hash aggregate kernel counts either all 
non-null or all null values, but doesn't count all values regardless of nullity 
(i.e. SQL "count(\*)" or Pandas size()).  (was: The current "count" hash 
aggregate kernel counts either all non-null or all null values, but doesn't 
count all values regardless of nullity (i.e. SQL "count(*)" or Pandas size()).)

> [C++] Add 'count all' option to count (hash) aggregate kernel
> -
>
> Key: ARROW-13574
> URL: https://issues.apache.org/jira/browse/ARROW-13574
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> The current "count" hash aggregate kernel counts either all non-null or all 
> null values, but doesn't count all values regardless of nullity (i.e. SQL 
> "count(\*)" or Pandas size()).
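
The distinction is easy to state concretely. Over a nullable column, the three count modes differ only in which rows they admit; the sketch below uses `std::optional` to model nullability and is illustrative, not the kernel API:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// The three count semantics discussed above.
struct CountResult {
  int64_t valid = 0;  // count of non-null values (current default)
  int64_t null = 0;   // count of null values
  int64_t all = 0;    // SQL count(*): every row, null or not
};

CountResult CountModes(const std::vector<std::optional<int>>& column) {
  CountResult r;
  for (const auto& v : column) {
    ++r.all;
    v.has_value() ? ++r.valid : ++r.null;
  }
  return r;
}
```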





[jira] [Updated] (ARROW-13541) [C++][Python] Implement ExtensionScalar

2021-08-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13541:
---
Labels: pull-request-available  (was: )

> [C++][Python] Implement ExtensionScalar
> ---
>
> Key: ARROW-13541
> URL: https://issues.apache.org/jira/browse/ARROW-13541
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, ExtensionScalar is just an empty shell around the base Scalar 
> class.
> It should have a ValueType member, and support the various usual operations 
> (hashing, equality, validation, GetScalar, etc.).





[jira] [Updated] (ARROW-13541) [C++] Implement ExtensionScalar

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13541:
---
Component/s: Python

> [C++] Implement ExtensionScalar
> ---
>
> Key: ARROW-13541
> URL: https://issues.apache.org/jira/browse/ARROW-13541
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 6.0.0
>
>
> Currently, ExtensionScalar is just an empty shell around the base Scalar 
> class.
> It should have a ValueType member, and support the various usual operations 
> (hashing, equality, validation, GetScalar, etc.).





[jira] [Updated] (ARROW-13541) [C++][Python] Implement ExtensionScalar

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13541:
---
Summary: [C++][Python] Implement ExtensionScalar  (was: [C++] Implement 
ExtensionScalar)

> [C++][Python] Implement ExtensionScalar
> ---
>
> Key: ARROW-13541
> URL: https://issues.apache.org/jira/browse/ARROW-13541
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++, Python
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 6.0.0
>
>
> Currently, ExtensionScalar is just an empty shell around the base Scalar 
> class.
> It should have a ValueType member, and support the various usual operations 
> (hashing, equality, validation, GetScalar, etc.).





[jira] [Updated] (ARROW-13597) [C++] [R] ExecNode factory named source not present in registry

2021-08-10 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman updated ARROW-13597:
-
Fix Version/s: 6.0.0

> [C++] [R] ExecNode factory named source not present in registry
> ---
>
> Key: ARROW-13597
> URL: https://issues.apache.org/jira/browse/ARROW-13597
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Jonathan Keane
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 6.0.0
>
>
> {code}
> ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
> ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:114:4
> ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
> ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:114:4
> [ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ]
> {code}
> Link to an example: 
> https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875
> https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701
>  might be the culprit, it was merged the day before the failures





[jira] [Assigned] (ARROW-13597) [C++] [R] ExecNode factory named source not present in registry

2021-08-10 Thread Ben Kietzman (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Kietzman reassigned ARROW-13597:


Assignee: Ben Kietzman

> [C++] [R] ExecNode factory named source not present in registry
> ---
>
> Key: ARROW-13597
> URL: https://issues.apache.org/jira/browse/ARROW-13597
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Jonathan Keane
>Assignee: Ben Kietzman
>Priority: Major
>
> {code}
> ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
> ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:114:4
> ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
> ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:114:4
> [ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ]
> {code}
> Link to an example: 
> https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875
> https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701
>  might be the culprit, it was merged the day before the failures





[jira] [Updated] (ARROW-13574) [C++] Add 'count all' option to count (hash) aggregate kernel

2021-08-10 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-13574:
-
Summary: [C++] Add 'count all' option to count (hash) aggregate kernel  
(was: [C++] Implement "count(*)" hash aggregate kernel)

> [C++] Add 'count all' option to count (hash) aggregate kernel
> -
>
> Key: ARROW-13574
> URL: https://issues.apache.org/jira/browse/ARROW-13574
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current "count" hash aggregate kernel counts either all non-null or all 
> null values, but doesn't count all values regardless of nullity (i.e. SQL 
> "count(*)" or Pandas size()).





[jira] [Updated] (ARROW-13595) [C++] Add debug mode check for compute kernel output type

2021-08-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13595:
---
Labels: pull-request-available  (was: )

> [C++] Add debug mode check for compute kernel output type
> -
>
> Key: ARROW-13595
> URL: https://issues.apache.org/jira/browse/ARROW-13595
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> As discovered in https://github.com/apache/arrow/pull/10890, it's currently 
> possible for a kernel to declare an output type and actually return another 
> one.
> It would be useful to add a debug-mode check in the kernel or function 
> execution machinery to validate the concrete output type returned by the 
> kernel exec function.





[jira] [Assigned] (ARROW-13541) [C++] Implement ExtensionScalar

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-13541:
--

Assignee: Antoine Pitrou

> [C++] Implement ExtensionScalar
> ---
>
> Key: ARROW-13541
> URL: https://issues.apache.org/jira/browse/ARROW-13541
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 6.0.0
>
>
> Currently, ExtensionScalar is just an empty shell around the base Scalar 
> class.
> It should have a ValueType member, and support the various usual operations 
> (hashing, equality, validation, GetScalar, etc.).





[jira] [Closed] (ARROW-13596) [C++] Remove util/logging.h in compute/exec/util.h

2021-08-10 Thread Niranda Perera (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niranda Perera closed ARROW-13596.
--
Resolution: Invalid

> [C++] Remove util/logging.h in compute/exec/util.h
> --
>
> Key: ARROW-13596
> URL: https://issues.apache.org/jira/browse/ARROW-13596
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Niranda Perera
>Assignee: Niranda Perera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> util/logging.h is included in compute/exec/util.h. Remove it and move the 
> TempVectorStack and AtomicCounter implementations to util.cc.
>  
> [https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31]





[jira] [Updated] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array

2021-08-10 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Percy Camilo Triveño Aucahuasi updated ARROW-9773:
--
Labels: kernel  (was: )

> [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
> ---
>
> Key: ARROW-9773
> URL: https://issues.apache.org/jira/browse/ARROW-9773
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: David Li
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
>  Labels: kernel
> Fix For: 6.0.0
>
>
> Take() currently concatenates ChunkedArrays first. However, this breaks down 
> when calling Take() from a ChunkedArray or Table where concatenating the 
> arrays would result in an array that's too large. While inconvenient to 
> implement, it would be useful if this case were handled.
> This could be done as a higher-level wrapper around Take(), perhaps.
> Example in Python:
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '1.0.0'
> >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
> >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
> >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
> >>> table.take([1, 0])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
>   File 
> "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py",
>  line 268, in take
> return call_function('take', [data, indices], options)
>   File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
>   File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
>   File "pyarrow/error.pxi", line 122, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
> {code}
> In this example, it would be useful if Take() or a higher-level wrapper could 
> generate multiple record batches as output.
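The higher-level wrapper suggested above can be sketched in plain Python: resolve each logical index to a (chunk, offset) pair so no concatenation is needed. `chunked_take` and the list-of-lists chunk representation are illustrative stand-ins, not pyarrow API:

```python
from bisect import bisect_right
from itertools import accumulate

def chunked_take(chunks, indices):
    """Take elements by logical index across a list of chunks without
    concatenating them first (illustrative stand-in, not pyarrow API)."""
    ends = list(accumulate(len(c) for c in chunks))  # cumulative chunk end offsets
    out = []
    for i in indices:
        k = bisect_right(ends, i)            # which chunk holds logical index i
        start = ends[k - 1] if k > 0 else 0  # logical offset where that chunk begins
        out.append(chunks[k][i - start])
    return out

print(chunked_take([["a"], ["b"]], [1, 0]))  # → ['b', 'a']
```

A real implementation would additionally chunk its *output* so that no single result array exceeds the offset limit.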





[jira] [Updated] (ARROW-9773) [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array

2021-08-10 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-9773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Percy Camilo Triveño Aucahuasi updated ARROW-9773:
--
Fix Version/s: 6.0.0

> [C++] Take kernel can't handle ChunkedArrays that don't fit in an Array
> ---
>
> Key: ARROW-9773
> URL: https://issues.apache.org/jira/browse/ARROW-9773
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 1.0.0
>Reporter: David Li
>Assignee: Percy Camilo Triveño Aucahuasi
>Priority: Major
> Fix For: 6.0.0
>
>
> Take() currently concatenates ChunkedArrays first. However, this breaks down 
> when calling Take() from a ChunkedArray or Table where concatenating the 
> arrays would result in an array that's too large. While inconvenient to 
> implement, it would be useful if this case were handled.
> This could be done as a higher-level wrapper around Take(), perhaps.
> Example in Python:
> {code:python}
> >>> import pyarrow as pa
> >>> pa.__version__
> '1.0.0'
> >>> rb1 = pa.RecordBatch.from_arrays([["a" * 2**30]], names=["a"])
> >>> rb2 = pa.RecordBatch.from_arrays([["b" * 2**30]], names=["a"])
> >>> table = pa.Table.from_batches([rb1, rb2], schema=rb1.schema)
> >>> table.take([1, 0])
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/table.pxi", line 1145, in pyarrow.lib.Table.take
>   File 
> "/home/lidavidm/Code/twosigma/arrow/venv/lib/python3.8/site-packages/pyarrow/compute.py",
>  line 268, in take
> return call_function('take', [data, indices], options)
>   File "pyarrow/_compute.pyx", line 298, in pyarrow._compute.call_function
>   File "pyarrow/_compute.pyx", line 192, in pyarrow._compute.Function.call
>   File "pyarrow/error.pxi", line 122, in 
> pyarrow.lib.pyarrow_internal_check_status
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
> {code}
> In this example, it would be useful if Take() or a higher-level wrapper could 
> generate multiple record batches as output.





[jira] [Updated] (ARROW-13595) [C++] Add debug mode check for compute kernel output type

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13595:
---
Summary: [C++] Add debug mode check for compute kernel output type  (was: 
[C++] Add debug mode check for compile kernel output type)

> [C++] Add debug mode check for compute kernel output type
> -
>
> Key: ARROW-13595
> URL: https://issues.apache.org/jira/browse/ARROW-13595
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: 6.0.0
>
>
> As discovered in https://github.com/apache/arrow/pull/10890, it's currently 
> possible for a kernel to declare an output type and actually return another 
> one.
> It would be useful to add a debug-mode check in the kernel or function 
> execution machinery to validate the concrete output type returned by the 
> kernel exec function.





[jira] [Updated] (ARROW-13595) [C++] Add debug mode check for compile kernel output type

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13595:
---
Fix Version/s: 6.0.0

> [C++] Add debug mode check for compile kernel output type
> -
>
> Key: ARROW-13595
> URL: https://issues.apache.org/jira/browse/ARROW-13595
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
> Fix For: 6.0.0
>
>
> As discovered in https://github.com/apache/arrow/pull/10890, it's currently 
> possible for a kernel to declare an output type and actually return another 
> one.
> It would be useful to add a debug-mode check in the kernel or function 
> execution machinery to validate the concrete output type returned by the 
> kernel exec function.





[jira] [Assigned] (ARROW-13595) [C++] Add debug mode check for compile kernel output type

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-13595:
--

Assignee: Antoine Pitrou

> [C++] Add debug mode check for compile kernel output type
> -
>
> Key: ARROW-13595
> URL: https://issues.apache.org/jira/browse/ARROW-13595
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Minor
>
> As discovered in https://github.com/apache/arrow/pull/10890, it's currently 
> possible for a kernel to declare an output type and actually return another 
> one.
> It would be useful to add a debug-mode check in the kernel or function 
> execution machinery to validate the concrete output type returned by the 
> kernel exec function.





[jira] [Updated] (ARROW-13598) [C++] Deprecate Datum::COLLECTION

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13598:
---
Description: 
It looks like "collection" datums are not used anywhere. Where we want to 
return several pieces of data, we generally return a Struct array or scalar 
wrapping them.

Perhaps we should simply deprecate or even remove them.


  was:
It looks like "collection" datums are not used anywhere. Where we want to 
return several pieces of data, we generally return a Struct array or scalar 
wrapping them.

Perhaps we should simply deprecate them.



> [C++] Deprecate Datum::COLLECTION
> -
>
> Key: ARROW-13598
> URL: https://issues.apache.org/jira/browse/ARROW-13598
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 6.0.0
>
>
> It looks like "collection" datums are not used anywhere. Where we want to 
> return several pieces of data, we generally return a Struct array or scalar 
> wrapping them.
> Perhaps we should simply deprecate or even remove them.
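The Struct-wrapping convention described above can be sketched in plain Python; `min_max` is a hypothetical helper (not Arrow code) that returns one struct-like value instead of a loose collection of separate results:

```python
def min_max(values):
    # Hypothetical helper: return several related results as one
    # struct-like value (here a dict), mirroring the Struct-scalar
    # convention, rather than a "collection" of separate datums.
    return {"min": min(values), "max": max(values)}

print(min_max([3, 1, 4, 1, 5]))  # → {'min': 1, 'max': 5}
```

Keeping the results in one struct means callers get a single typed value rather than an untyped list they must index positionally.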





[jira] [Commented] (ARROW-13598) [C++] Deprecate Datum::COLLECTION

2021-08-10 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396745#comment-17396745
 ] 

Antoine Pitrou commented on ARROW-13598:


[~wesm] [~bkietz] What do you think?

> [C++] Deprecate Datum::COLLECTION
> -
>
> Key: ARROW-13598
> URL: https://issues.apache.org/jira/browse/ARROW-13598
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Antoine Pitrou
>Priority: Minor
> Fix For: 6.0.0
>
>
> It looks like "collection" datums are not used anywhere. Where we want to 
> return several pieces of data, we generally return a Struct array or scalar 
> wrapping them.
> Perhaps we should simply deprecate or even remove them.





[jira] [Created] (ARROW-13598) [C++] Deprecate Datum::COLLECTION

2021-08-10 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-13598:
--

 Summary: [C++] Deprecate Datum::COLLECTION
 Key: ARROW-13598
 URL: https://issues.apache.org/jira/browse/ARROW-13598
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Antoine Pitrou
 Fix For: 6.0.0


It looks like "collection" datums are not used anywhere. Where we want to 
return several pieces of data, we generally return a Struct array or scalar 
wrapping them.

Perhaps we should simply deprecate them.






[jira] [Updated] (ARROW-13597) [C++] [R] ExecNode factory named source not present in registry

2021-08-10 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-13597:
---
Summary: [C++] [R] ExecNode factory named source not present in registry  
(was: [R] ExecNode factory named source not present in registry)

> [C++] [R] ExecNode factory named source not present in registry
> ---
>
> Key: ARROW-13597
> URL: https://issues.apache.org/jira/browse/ARROW-13597
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Reporter: Jonathan Keane
>Priority: Major
>
> {code}
> ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
> ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:114:4
> ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
> ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:114:4
> [ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ]
> {code}
> Link to an example: 
> https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875
> https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701
>  might be the culprit, it was merged the day before the failures





[jira] [Updated] (ARROW-13597) [C++] [R] ExecNode factory named source not present in registry

2021-08-10 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane updated ARROW-13597:
---
Component/s: C++

> [C++] [R] ExecNode factory named source not present in registry
> ---
>
> Key: ARROW-13597
> URL: https://issues.apache.org/jira/browse/ARROW-13597
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Reporter: Jonathan Keane
>Priority: Major
>
> {code}
> ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
> ── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:114:4
> ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
> `via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> record_batch(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
> ── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
> `via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
> Table$create(tbl` generated warnings:
> * Error : Key error: ExecNode factory named source not present in registry.; 
> pulling data into R
> Backtrace:
> █
>  1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
>  2.   └─testthat::expect_warning(...) helper-expectation.R:114:4
> [ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ]
> {code}
> Link to an example: 
> https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875
> https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701
>  might be the culprit, it was merged the day before the failures





[jira] [Updated] (ARROW-13596) [C++] Remove util/logging.h in compute/exec/util.h

2021-08-10 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-13596:
---
Labels: pull-request-available  (was: )

> [C++] Remove util/logging.h in compute/exec/util.h
> --
>
> Key: ARROW-13596
> URL: https://issues.apache.org/jira/browse/ARROW-13596
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Niranda Perera
>Assignee: Niranda Perera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> util/logging.h is included in compute/exec/util.h. Remove it and move the 
> TempVectorStack and AtomicCounter implementations to util.cc.
>  
> [https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31]





[jira] [Updated] (ARROW-13596) [C++] Remove util/logging.h in compute/exec/util.h

2021-08-10 Thread Niranda Perera (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niranda Perera updated ARROW-13596:
---
Description: 
util/logging.h is included in compute/exec/util.h. Remove it and move the 
TempVectorStack and AtomicCounter implementations to util.cc.

 

[https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31]

  was:
util/logging.h is included in the compute/exec/util.h. Remove it and move 
AtomicCounter impl to util.cc

 

https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31


> [C++] Remove util/logging.h in compute/exec/util.h
> --
>
> Key: ARROW-13596
> URL: https://issues.apache.org/jira/browse/ARROW-13596
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Niranda Perera
>Assignee: Niranda Perera
>Priority: Major
>
> util/logging.h is included in compute/exec/util.h. Remove it and move the 
> TempVectorStack and AtomicCounter implementations to util.cc.
>  
> [https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31]





[jira] [Resolved] (ARROW-13575) [C++] Implement product aggregate & hash aggregate kernels

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-13575.

Resolution: Fixed

Issue resolved by pull request 10890
[https://github.com/apache/arrow/pull/10890]

> [C++] Implement product aggregate & hash aggregate kernels
> --
>
> Key: ARROW-13575
> URL: https://issues.apache.org/jira/browse/ARROW-13575
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: kernel, pull-request-available
> Fix For: 6.0.0
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Like Pandas 
> [prod|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.prod.html].
>  Note that Pandas has a min_count option.





[jira] [Created] (ARROW-13597) [R] ExecNode factory named source not present in registry

2021-08-10 Thread Jonathan Keane (Jira)
Jonathan Keane created ARROW-13597:
--

 Summary: [R] ExecNode factory named source not present in registry
 Key: ARROW-13597
 URL: https://issues.apache.org/jira/browse/ARROW-13597
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Reporter: Jonathan Keane


{code}
── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
`via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
record_batch(tbl` generated warnings:
* Error : Key error: ExecNode factory named source not present in registry.; 
pulling data into R
Backtrace:
█
 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
 2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
── Failure (test-dplyr-aggregate.R:166:3): Filter and aggregate 
`via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
Table$create(tbl` generated warnings:
* Error : Key error: ExecNode factory named source not present in registry.; 
pulling data into R
Backtrace:
█
 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:166:2
 2.   └─testthat::expect_warning(...) helper-expectation.R:114:4
── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
`via_batch <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
record_batch(tbl` generated warnings:
* Error : Key error: ExecNode factory named source not present in registry.; 
pulling data into R
Backtrace:
█
 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
 2.   └─testthat::expect_warning(...) helper-expectation.R:101:4
── Failure (test-dplyr-aggregate.R:176:3): Filter and aggregate 
`via_table <- rlang::eval_tidy(expr, rlang::new_data_mask(rlang::env(input = 
Table$create(tbl` generated warnings:
* Error : Key error: ExecNode factory named source not present in registry.; 
pulling data into R
Backtrace:
█
 1. └─arrow:::expect_dplyr_equal(...) test-dplyr-aggregate.R:176:2
 2.   └─testthat::expect_warning(...) helper-expectation.R:114:4

[ FAIL 23 | WARN 0 | SKIP 32 | PASS 4954 ]
{code}

Link to an example: 
https://github.com/ursacomputing/crossbow/runs/3287857304#step:7:875

https://github.com/apache/arrow/commit/2aa94c4712ce406d7c87d361b5c655a6ea585701 
might be the culprit, it was merged the day before the failures





[jira] [Created] (ARROW-13596) [C++] Remove util/logging.h in compute/exec/util.h

2021-08-10 Thread Niranda Perera (Jira)
Niranda Perera created ARROW-13596:
--

 Summary: [C++] Remove util/logging.h in compute/exec/util.h
 Key: ARROW-13596
 URL: https://issues.apache.org/jira/browse/ARROW-13596
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Niranda Perera
Assignee: Niranda Perera


util/logging.h is included in compute/exec/util.h. Remove it and move the 
AtomicCounter implementation to util.cc.

 

https://github.com/apache/arrow/blob/c2e198b84d6752733bdd20089195dc9c47df73a1/cpp/src/arrow/compute/exec/util.h#L31





[jira] [Commented] (ARROW-13588) [R] Empty character attributes not stored

2021-08-10 Thread Charlie Gao (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396714#comment-17396714
 ] 

Charlie Gao commented on ARROW-13588:
-

Hi Neal, I guess that would likely be the issue. In R, the empty character 
vector "" is different from NULL. An R attribute cannot be stored as NULL; 
setting an attribute to NULL removes it.

The wider issue is that although you _can_ remove the tzone attribute for 
dates, certain object classes (including those popular for time series e.g. 
'xts') _require_ the tzone attribute to be set.

> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Priority: Minor
>  Labels: attributes, feather
> Fix For: 6.0.0
>
>
> Date-times in the POSIXct format have a 'tzone' attribute that by default is 
> set to "", an empty character vector (not NULL) when created.
> This however is not stored in the Arrow feather file. When the file is read 
> back, the original and restored dataframes are not identical as per the below 
> reprex.
> I assume this is not intended? My current workaround is to check on reading 
> back and write the empty string if the tzone attribute does not exist.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
> arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  





[jira] [Updated] (ARROW-1888) [C++] Implement casts from one struct type to another (with same field names and number of fields)

2021-08-10 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-1888:
---
Labels: analytics kernel  (was: analytics)

> [C++] Implement casts from one struct type to another (with same field names 
> and number of fields)
> --
>
> Key: ARROW-1888
> URL: https://issues.apache.org/jira/browse/ARROW-1888
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Fernando Rodriguez
>Priority: Major
>  Labels: analytics, kernel
> Fix For: 6.0.0
>
>






[jira] [Assigned] (ARROW-1888) [C++] Implement casts from one struct type to another (with same field names and number of fields)

2021-08-10 Thread Fernando Rodriguez (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-1888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fernando Rodriguez reassigned ARROW-1888:
-

Assignee: Fernando Rodriguez  (was: David Li)

> [C++] Implement casts from one struct type to another (with same field names 
> and number of fields)
> --
>
> Key: ARROW-1888
> URL: https://issues.apache.org/jira/browse/ARROW-1888
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Fernando Rodriguez
>Priority: Major
>  Labels: analytics
> Fix For: 6.0.0
>
>






[jira] [Created] (ARROW-13595) [C++] Add debug mode check for compile kernel output type

2021-08-10 Thread Antoine Pitrou (Jira)
Antoine Pitrou created ARROW-13595:
--

 Summary: [C++] Add debug mode check for compile kernel output type
 Key: ARROW-13595
 URL: https://issues.apache.org/jira/browse/ARROW-13595
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Antoine Pitrou


As discovered in https://github.com/apache/arrow/pull/10890, it's currently 
possible for a kernel to declare an output type and actually return another one.

It would be useful to add a debug-mode check in the kernel or function execution 
machinery to validate the concrete output type returned by the kernel exec 
function.
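
For illustration, the proposed check amounts to something like the following plain-Python model (the {{call_kernel}} helper is hypothetical; the real check would live in the C++ execution machinery):

```python
def call_kernel(declared_type, kernel, *args):
    """Run a kernel and assert, debug-style, that the concrete output
    type matches the type the kernel declared.  Hypothetical sketch of
    the proposed check, not Arrow code."""
    out = kernel(*args)
    assert type(out).__name__ == declared_type, (
        f"kernel declared {declared_type} "
        f"but returned {type(out).__name__}")
    return out

# A well-behaved kernel passes the check...
print(call_kernel("int", lambda a, b: a + b, 1, 2))  # 3
# ...while one that lies about its output type trips the assertion:
try:
    call_kernel("int", lambda a, b: float(a + b), 1, 2)
except AssertionError as exc:
    print(exc)
```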





[jira] [Assigned] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce

2021-08-10 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-7179:
---

Assignee: David Li

> [C++][Compute] Consolidate fill_null and coalesce
> -
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: David Li
>Priority: Major
>  Labels: analytics, kernel
> Fix For: 6.0.0
>
>
> fill_null and coalesce are essentially the same kernel, except the former is 
> binary and doesn't support an array fill value, and the latter is variadic 
> and supports scalar and array fill values.
> We should consolidate them into one kernel, picking the faster implementation.





[jira] [Commented] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce

2021-08-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396656#comment-17396656
 ] 

Joris Van den Bossche commented on ARROW-7179:
--

Yes, that sounds good

> [C++][Compute] Consolidate fill_null and coalesce
> -
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: analytics, kernel
> Fix For: 6.0.0
>
>
> fill_null and coalesce are essentially the same kernel, except the former is 
> binary and doesn't support an array fill value, and the latter is variadic 
> and supports scalar and array fill values.
> We should consolidate them into one kernel, picking the faster implementation.





[jira] [Updated] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce

2021-08-10 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-7179:

Description: 
fill_null and coalesce are essentially the same kernel, except the former is 
binary and doesn't support an array fill value, and the latter is variadic and 
supports scalar and array fill values.

We should consolidate them into one kernel, picking the faster implementation.

  was:
Add kernels to support replacing null values in an array with values 
taken from corresponding slots in another array:

{code}
fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
{code}


> [C++][Compute] Consolidate fill_null and coalesce
> -
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: analytics
>
> fill_null and coalesce are essentially the same kernel, except the former is 
> binary and doesn't support an array fill value, and the latter is variadic 
> and supports scalar and array fill values.
> We should consolidate them into one kernel, picking the faster implementation.





[jira] [Updated] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce

2021-08-10 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-7179:

Labels: analytics kernel  (was: analytics)

> [C++][Compute] Consolidate fill_null and coalesce
> -
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: analytics, kernel
>
> fill_null and coalesce are essentially the same kernel, except the former is 
> binary and doesn't support an array fill value, and the latter is variadic 
> and supports scalar and array fill values.
> We should consolidate them into one kernel, picking the faster implementation.





[jira] [Updated] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce

2021-08-10 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-7179:

Fix Version/s: 6.0.0

> [C++][Compute] Consolidate fill_null and coalesce
> -
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: analytics, kernel
> Fix For: 6.0.0
>
>
> fill_null and coalesce are essentially the same kernel, except the former is 
> binary and doesn't support an array fill value, and the latter is variadic 
> and supports scalar and array fill values.
> We should consolidate them into one kernel, picking the faster implementation.





[jira] [Reopened] (ARROW-7179) [C++][Compute] Array support for fill_null

2021-08-10 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reopened ARROW-7179:
-
  Assignee: (was: David Li)

> [C++][Compute] Array support for fill_null
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: analytics
> Fix For: 5.0.0
>
>
> Add kernels to support replacing null values in an array with values 
> taken from corresponding slots in another array:
> {code}
> fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}





[jira] [Updated] (ARROW-7179) [C++][Compute] Consolidate fill_null and coalesce

2021-08-10 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-7179:

Summary: [C++][Compute] Consolidate fill_null and coalesce  (was: 
[C++][Compute] Array support for fill_null)

> [C++][Compute] Consolidate fill_null and coalesce
> -
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: analytics
>
> Add kernels to support replacing null values in an array with values 
> taken from corresponding slots in another array:
> {code}
> fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}





[jira] [Updated] (ARROW-7179) [C++][Compute] Array support for fill_null

2021-08-10 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-7179:

Fix Version/s: (was: 5.0.0)

> [C++][Compute] Array support for fill_null
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Priority: Major
>  Labels: analytics
>
> Add kernels to support replacing null values in an array with values 
> taken from corresponding slots in another array:
> {code}
> fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}





[jira] [Commented] (ARROW-7179) [C++][Compute] Array support for fill_null

2021-08-10 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396652#comment-17396652
 ] 

David Li commented on ARROW-7179:
-

Ah, I see, thanks Joris. It looks like coalesce has slightly more complete type 
support; it's also variadic instead of binary. Maybe the path here then should 
be to benchmark the common implementations and choose the faster one, then 
consolidate both into one kernel. (And maybe provide aliases from one to the 
other?)
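
For reference, the semantics of the two kernels as described in this thread can be modelled in plain Python: coalesce is variadic and takes the first non-null value element-wise, while fill_null is the binary special case. This is an illustrative sketch only (the helpers below are hypothetical, not the Arrow C++ implementation):

```python
def coalesce(*arrays):
    """Variadic: element-wise first non-null value across the inputs."""
    return [next((v for v in row if v is not None), None)
            for row in zip(*arrays)]

def fill_null(values, fill):
    """Binary special case of coalesce: 'fill' may be a scalar or an
    equal-length array of fallback values."""
    if not isinstance(fill, list):
        fill = [fill] * len(values)
    return coalesce(values, fill)

# The example from the issue description:
print(fill_null([1, None, None, 3], [5, 6, None, 8]))  # [1, 6, None, 3]
```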

> [C++][Compute] Array support for fill_null
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: David Li
>Priority: Major
>  Labels: analytics
> Fix For: 5.0.0
>
>
> Add kernels to support replacing null values in an array with values 
> taken from corresponding slots in another array:
> {code}
> fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}





[jira] [Updated] (ARROW-13590) [C++] Ensure dataset writing applies back pressure

2021-08-10 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13590:

Summary: [C++] Ensure dataset writing applies back pressure  (was: Ensure 
dataset writing applies back pressure)

> [C++] Ensure dataset writing applies back pressure
> --
>
> Key: ARROW-13590
> URL: https://issues.apache.org/jira/browse/ARROW-13590
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Weston Pace
>Assignee: Weston Pace
>Priority: Major
>  Labels: query-engine
> Fix For: 6.0.0
>
>
> Dataset writing via exec plan (ARROW-13542) does not apply back pressure 
> currently and will take up far more RAM than it should when writing a large 
> dataset.  The node should be applying back pressure.  However, the preferred 
> back pressure method (via scheduling) will need to wait for ARROW-13576.
> Once those two are finished this can be studied in more detail.  Also, the 
> vm.dirty_ratio might be experimented with.  In theory we should be applying 
> our own back pressure and have no need of dirty pages.  In practice, it may 
> be more work than we want to tackle right now and we just let it do its thing.
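
For illustration, back pressure in general can be demonstrated with a bounded queue, where a fast producer blocks until the consumer catches up. This generic Python sketch is unrelated to Arrow's actual scheduling machinery:

```python
import queue
import threading
import time

q = queue.Queue(maxsize=2)  # bounded buffer: this is the back pressure
consumed = []

def producer():
    for i in range(6):
        q.put(i)   # blocks whenever the queue is full
    q.put(None)    # sentinel to stop the consumer

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        time.sleep(0.01)  # deliberately slow consumer
        consumed.append(item)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(consumed)  # [0, 1, 2, 3, 4, 5]
```

Because the producer can never be more than two items ahead, memory use stays bounded no matter how much faster it runs than the consumer.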





[jira] [Updated] (ARROW-13588) [R] Empty character attributes not stored

2021-08-10 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-13588:

Fix Version/s: 6.0.0

> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Priority: Minor
>  Labels: attributes, feather
> Fix For: 6.0.0
>
>
> Date-times in the POSIXct format have a 'tzone' attribute that by default is 
> set to "", an empty character vector (not NULL) when created.
> This however is not stored in the Arrow feather file. When the file is read 
> back, the original and restored dataframes are not identical as per the below 
> reprex.
> I assume this is not intended? My current workaround is to check on reading 
> back and write the empty string if the tzone attribute does not exist.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
> arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  





[jira] [Commented] (ARROW-13588) [R] Empty character attributes not stored

2021-08-10 Thread Neal Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396644#comment-17396644
 ] 

Neal Richardson commented on ARROW-13588:
-

I am guessing that the issue is that in C++, the empty string "" means that the 
field is not set. Provided that there is no different meaning of {{tzone = 
NULL}} from {{tzone = ""}} in R, we can handle this field specially on the 
round trip. 

> [R] Empty character attributes not stored
> -
>
> Key: ARROW-13588
> URL: https://issues.apache.org/jira/browse/ARROW-13588
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 5.0.0
> Environment: Ubuntu 20.04 R 4.1 release
>Reporter: Charlie Gao
>Priority: Minor
>  Labels: attributes, feather
>
> Date-times in the POSIXct format have a 'tzone' attribute that by default is 
> set to "", an empty character vector (not NULL) when created.
> This however is not stored in the Arrow feather file. When the file is read 
> back, the original and restored dataframes are not identical as per the below 
> reprex.
> I assume this is not intended? My current workaround is to check on reading 
> back and write the empty string if the tzone attribute does not exist.
> Just to confirm, the attribute is stored correctly when it is not empty.
> Thanks.
> {code:java}
> ``` r
>  dates <- as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-02"))
>  attributes(dates)
>  #> $class
>  #> [1] "POSIXct" "POSIXt" 
>  #> 
>  #> $tzone
>  #> [1] ""
>  values <- c(1:3)
>  original <- data.frame(dates, values)
>  original
>  #> dates values
>  #> 1 2020-01-01 1
>  #> 2 2020-01-02 2
>  #> 3 2020-01-02 3
> tempfile <- tempfile()
> arrow::write_feather(original, tempfile)
> restored <- arrow::read_feather(tempfile)
> identical(original, restored)
>  #> [1] FALSE
>  waldo::compare(original, restored)
>  #> `attr(old$dates, 'tzone')` is a character vector ('')
>  #> `attr(new$dates, 'tzone')` is absent
> unlink(tempfile)
>  ```
> {code}
>  





[jira] [Updated] (ARROW-12959) [C++][R] Option for is_null(NaN) to evaluate to true

2021-08-10 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-12959:

Labels: kernel pull-request-available  (was: pull-request-available)

> [C++][R] Option for is_null(NaN) to evaluate to true
> 
>
> Key: ARROW-12959
> URL: https://issues.apache.org/jira/browse/ARROW-12959
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Reporter: Ian Cook
>Assignee: Christian Cordova
>Priority: Major
>  Labels: kernel, pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> (This is the flip side of ARROW-12960.)
> Currently the Arrow compute kernel {{is_null}} always treats {{NaN}} as a 
> non-missing value, returning {{false}} at positions of the input datum with 
> value {{NaN}}.
> It would be helpful to be able to control this behavior with an option. The 
> option could be named {{nan_is_null}} or something similar.  It would default 
> to {{false}}, consistent with current behavior. When set to {{true}}, it 
> should check if the input datum has a floating point data type, and if so, 
> return {{true}} at positions where the input is {{NaN}}. If the input datum 
> has some other type, the option should be silently ignored.
> Among other things, this would enable the {{arrow}} R package to evaluate 
> {{is.na()}} consistently with the way base R does. In base R, {{is.na()}} 
> returns {{TRUE}} on {{NaN}}. But in the {{arrow}} R package, it returns 
> {{FALSE}}:
> {code:r}
> is.na(c(3.14, NA, NaN))
> ## [1] FALSE TRUE TRUE
> as.vector(is.na(Array$create(c(3.14, NA, NaN))))
> ## [1] FALSE TRUE FALSE{code}
> I think solving this with an option in the C++ kernel is the best solution, 
> because I suspect there are other cases in which users might want to treat 
> {{NaN}} as a missing value. However, it would also be possible to solve this 
> just in the R package, by defining a mapping of {{is.na}} in the R package 
> that checks if the input {{x}} has a floating point data type, and if so, 
> evaluates {{is.na\(x\) | is.nan\(x\)}}. If we choose to go that route, we 
> should change this Jira issue summary to "[R] Make is.na(NaN) consistent with 
> base R".
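
For reference, the proposed option semantics can be sketched in plain Python: null is always null, while NaN counts as null only when the option is enabled. This is an illustrative model only; the {{is_null}} helper below is hypothetical, not the Arrow C++ kernel:

```python
import math

def is_null(values, nan_is_null=False):
    """Model of the proposed option: None (modelling Arrow null) is
    always null; NaN counts as null only when nan_is_null=True.
    Hypothetical sketch, not the Arrow implementation."""
    return [v is None or (nan_is_null
                          and isinstance(v, float) and math.isnan(v))
            for v in values]

# Mirrors the R example in the issue:
print(is_null([3.14, None, float("nan")]))                    # [False, True, False]
print(is_null([3.14, None, float("nan")], nan_is_null=True))  # [False, True, True]
```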





[jira] [Commented] (ARROW-13479) [Format] Make requirement around dense union offsets less ambiguous

2021-08-10 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396606#comment-17396606
 ] 

Antoine Pitrou commented on ARROW-13479:


Same opinion. We should probably be conservative. [~wesm] What do you say?

> [Format] Make requirement around dense union offsets less ambiguous
> ---
>
> Key: ARROW-13479
> URL: https://issues.apache.org/jira/browse/ARROW-13479
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Format
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 6.0.0
>
>
> Currently, the spec states that dense union offsets for each child array must 
> be "in order / increasing". There is an ambiguity: should they be strictly 
> increasing, or are equal values supported?
> The C++ implementation currently considers that equal offsets are acceptable.
>  
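
For illustration, the two readings of "in order / increasing" can be expressed as a small validity check over a child array's offsets. This plain-Python helper is hypothetical, not Arrow code:

```python
def offsets_in_order(offsets, strict=False):
    """Check a dense-union child's offsets: non-decreasing by default
    (equal values allowed, as C++ currently accepts), or strictly
    increasing when strict=True.  Hypothetical helper for illustration."""
    return all(b > a if strict else b >= a
               for a, b in zip(offsets, offsets[1:]))

print(offsets_in_order([0, 1, 1, 2]))               # True: equal offsets allowed
print(offsets_in_order([0, 1, 1, 2], strict=True))  # False: strict reading rejects them
```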





[jira] [Comment Edited] (ARROW-13578) [Python] Inconsistent handling of integer-valued partitions in dataset filters API

2021-08-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396598#comment-17396598
 ] 

Joris Van den Bossche edited comment on ARROW-13578 at 8/10/21, 10:34 AM:
--

[~mnizol] thanks a lot for the clear and detailed report. 

What's causing the confusion here is that we have both a legacy Python 
implementation of ParquetDataset and a new generic Datasets API, and that we 
are still in the middle of moving to the new implementation: {{ParquetDataset}} 
still uses the legacy implementation by default (but you can use 
{{use_legacy_dataset=False}} to opt in to the new), while {{pq.read_table}} 
(which is what {{pd.read_parquet}} uses under the hood) is already defaulting 
to the new implementation (but you can fall back to the old with 
{{use_legacy_dataset=True}}).

In the legacy ParquetDataset implementation, all partition keys are indeed 
parsed as strings as you show with the output of 
{{ParquetDataset.partitions.levels}}. So when passing 
{{use_legacy_dataset=True}} to the read function, using a string actually works:

{code}
In [19]: pd.read_parquet('./test.parquet', filters=[('key1','=','1')], 
use_legacy_dataset=True)
Out[19]: 
  data key1 key2 key3 key4   key5                 key6
0  bar    1    1    b  2.2  False  2021-06-03 00:00:00
{code}

BTW, also using an integer works here ({{('key1', '=', 1)}}), because the 
legacy implementation will try to interpret the value with the type of the 
partition levels.

In the new Datasets API, the parsing of the directory paths currently supports 
int32 and strings (when inferring, you can use other types when explicitly 
passing the schema for the partition keys). 
So when creating a ParquetDataset with use_legacy_dataset=False, we see:

{code}
In [21]: ds = pq.ParquetDataset('test_partitions', use_legacy_dataset=False)

In [22]: ds._dataset.partitioning
Out[22]: 

In [23]: ds._dataset.partitioning.schema
Out[23]: 
key1: dictionary
key2: dictionary
key3: dictionary
key4: dictionary
key5: dictionary
key6: dictionary
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 340

In [24]: ds._dataset.partitioning.dictionaries
Out[24]: 
[
 [
   0,
   1,
   2
 ],
 
 [
   0,
   1,
   2
 ],
 
 [
   "a",
   "b",
   "c"
 ],
 
 [
   "1.1",
   "2.2",
   "3.3"
 ],
 
 [
   "True",
   "False"
 ],
 
 [
   "2021-06-02 00:00:00",
   "2021-06-03 00:00:00",
   "2021-06-04 00:00:00"
 ]]
{code}

So the first two partition keys are inferred as int, the others as string. And 
that's also the reason that for this case, you actually need to specify the 
filter using an integer (we decided to not do such automatic casting here in 
the new implementation).
Sidenote: I am using {{ds._dataset.partitioning}} above, but this will become 
{{ds.partitioning}} after ARROW-13525. 

So with an integer value in the filter this works (adding 
{{use_legacy_dataset=False}} explicitly, but so this is the default in 
{{pq.read_table}} / {{pd.read_parquet}}):

{code}
In [20]: pd.read_parquet('./test_partitions/', filters=[('key1','=', 1)], 
use_legacy_dataset=False)
Out[20]: 
  data key1 key2 key3 key4   key5                 key6
0  bar    1    1    b  2.2  False  2021-06-03 00:00:00
{code}

Using the new datasets API directly, this looks like:

{code}
In [25]: import pyarrow.dataset as ds

In [26]: dataset = ds.dataset("test_partitions/", format="parquet", 
partitioning="hive")

In [28]: dataset.to_table(filter=ds.field("key1") == 1).to_pandas()
Out[28]: 
  data  key1  key2 key3 key4   key5                 key6
0  bar     1     1    b  2.2  False  2021-06-03 00:00:00
{code}


was (Author: jorisvandenbossche):
[~mnizol] thanks a lot for the clear and detailed report. 

What's causing the confusion here is that we have both a legacy Python 
implementation of ParquetDataset and a new generic Datasets API, and that we 
are still in the middle of moving to the new implementation: {{ParquetDataset}} 
still uses the legacy implementation by default (but you can use 
{{use_legacy_dataset=False}} to opt in to the new), while {{pq.read_table}} 
(which is what {{pd.read_parquet}} uses under the hood) is already defaulting 
to the new implementation (but you can fall back to the old with 
{{use_legacy_dataset=True}}).

In the legacy ParquetDataset implementation, all partition keys are indeed 
parsed as strings as you show with the output of 
{{ParquetDataset.partitions.levels}}. So when passing 
{{use_legacy_dataset=True}} to the read function, using a string actually works:

{code}
In [19]: pd.read_parquet('./test.parquet', filters=[('key1','=','1')], 
use_legacy_dataset=True)
Out[19]: 
  data key1 key2 key3 key4   key5                 key6
0  bar    1    1    b  2.2  False  2021-06-03 00:00:00
{code}

BTW, also using an integer works here ({{('key1', '=', 1)}}), because the 
legacy implementation will try to interpret the 

[jira] [Updated] (ARROW-13525) [Python] Mention alternatives in deprecation message of ParquetDataset attributes

2021-08-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13525:
--
Description: 
Follow-up on ARROW-13074. 

We should maybe also expose the {{partitioning}} attribute on ParquetDataset 
(if constructed with {{use_legacy_dataset=False}}), as I did for the 
{{filesystem}}/{{files}}/{{fragments}} attributes. 

  was:Follow-up on ARROW-13074. Better mention the alternatives (eg pieces -> 
fragments)


> [Python] Mention alternatives in deprecation message of ParquetDataset 
> attributes
> -
>
> Key: ARROW-13525
> URL: https://issues.apache.org/jira/browse/ARROW-13525
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 5.0.1
>
>
> Follow-up on ARROW-13074. 
> We should maybe also expose the {{partitioning}} attribute on ParquetDataset 
> (if constructed with {{use_legacy_dataset=False}}), as I did for the 
> {{filesystem}}/{{files}}/{{fragments}} attributes. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13525) [Python] Mention alternatives in deprecation message of ParquetDataset attributes

2021-08-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13525:
--
Description: Follow-up on ARROW-13074. Better mention the alternatives (eg 
pieces -> fragments  (was: Follow-up on ARROW-13074)

> [Python] Mention alternatives in deprecation message of ParquetDataset 
> attributes
> -
>
> Key: ARROW-13525
> URL: https://issues.apache.org/jira/browse/ARROW-13525
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 5.0.1
>
>
> Follow-up on ARROW-13074. Better mention the alternatives (eg pieces -> 
> fragments



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (ARROW-13525) [Python] Mention alternatives in deprecation message of ParquetDataset attributes

2021-08-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-13525:
-

Assignee: Joris Van den Bossche

> [Python] Mention alternatives in deprecation message of ParquetDataset 
> attributes
> -
>
> Key: ARROW-13525
> URL: https://issues.apache.org/jira/browse/ARROW-13525
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 5.0.1
>
>
> Follow-up on ARROW-13074



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13578) [Python] Inconsistent handling of integer-valued partitions in dataset filters API

2021-08-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13578:
--
Component/s: Python

> [Python] Inconsistent handling of integer-valued partitions in dataset 
> filters API
> --
>
> Key: ARROW-13578
> URL: https://issues.apache.org/jira/browse/ARROW-13578
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Matt Nizol
>Priority: Minor
>
> When creating a partitioned data set via the pandas.to_parquet() method, 
> partition columns are ostensibly cast to strings in the partition metadata.  
> When reading specific partitions via the filters parameter in 
> pandas.read_parquet(), string values must be used for filter operands _except 
> when_ the partition column has an integer value.  
> Consider the following example:
> {code:python}
> import datetime
> import pandas as pd
> df = pd.DataFrame({
> "key1": ['0', '1', '2'], 
> "key2": [0, 1, 2],
> "key3": ['a', 'b', 'c'],
> "key4": [1.1, 2.2, 3.3],
> "key5": [True, False, True],
> "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3), 
> datetime.date(2021, 6, 4)],
> "data": ["foo", "bar", "baz"]
> })
> df['key6'] = pd.to_datetime(df['key6'])
> df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', 
> 'key4', 'key5', 'key6'])
> {code}
> Reading into a ParquetDataset and inspecting the partition levels suggests 
> that partition keys have been cast to string, regardless of the original type:
> {code:python}
> import pyarrow.parquet as pq
> ds = pq.ParquetDataset('./test.parquet')
> for level in ds.partitions.levels:
> print(f"{level.name}: {level.keys}")
> {code}
> Output:
> {noformat}
> key1: ['0', '1', '2']
> key2: ['0', '1', '2']
> key3: ['a', 'b', 'c']
> key4: ['1.1', '2.2', '3.3']
> key5: ['True', 'False']
> key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04 
> 00:00:00']{noformat}
> Filtering the dataset using any of the non-integer partition keys along with 
> string-valued operands works as expected:
> {code:python}
> df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5', 
> '=', 'True')])
> df2.head()
> {code}
> Output:
> {noformat}
>   data key1 key2 key3 key4  key5 key6
> 0  foo    0    0    a  1.1  True 2021-06-02 00:00:00
> {noformat}
> However, filtering the dataset using either of the integer-valued partition 
> keys with a string-valued operand raises an exception, *even when the 
> original column's data type is string*:
> {code:python}
> df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')])
> df2.head()
> {code}
> {noformat}
> ArrowNotImplementedError: Function equal has no kernel matching input types 
> (array[int32], scalar[string])
> {noformat}
> It would seem to be less surprising / more consistent if filter operands 
> either (a) are always cast to string, or (b) always retain their original 
> type.
> Note, this issue may be related to ARROW-12114.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13578) [Python] Inconsistent handling of integer-valued partitions in dataset filters API

2021-08-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13578:
--
Summary: [Python] Inconsistent handling of integer-valued partitions in 
dataset filters API  (was: Inconsistent handling of integer-valued partitions 
in dataset filters API)

> [Python] Inconsistent handling of integer-valued partitions in dataset 
> filters API
> --
>
> Key: ARROW-13578
> URL: https://issues.apache.org/jira/browse/ARROW-13578
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Matt Nizol
>Priority: Minor
>
> When creating a partitioned data set via the pandas.to_parquet() method, 
> partition columns are ostensibly cast to strings in the partition metadata.  
> When reading specific partitions via the filters parameter in 
> pandas.read_parquet(), string values must be used for filter operands _except 
> when_ the partition column has an integer value.  
> Consider the following example:
> {code:python}
> import datetime
> import pandas as pd
> df = pd.DataFrame({
> "key1": ['0', '1', '2'], 
> "key2": [0, 1, 2],
> "key3": ['a', 'b', 'c'],
> "key4": [1.1, 2.2, 3.3],
> "key5": [True, False, True],
> "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3), 
> datetime.date(2021, 6, 4)],
> "data": ["foo", "bar", "baz"]
> })
> df['key6'] = pd.to_datetime(df['key6'])
> df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', 
> 'key4', 'key5', 'key6'])
> {code}
> Reading into a ParquetDataset and inspecting the partition levels suggests 
> that partition keys have been cast to string, regardless of the original type:
> {code:python}
> import pyarrow.parquet as pq
> ds = pq.ParquetDataset('./test.parquet')
> for level in ds.partitions.levels:
> print(f"{level.name}: {level.keys}")
> {code}
> Output:
> {noformat}
> key1: ['0', '1', '2']
> key2: ['0', '1', '2']
> key3: ['a', 'b', 'c']
> key4: ['1.1', '2.2', '3.3']
> key5: ['True', 'False']
> key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04 
> 00:00:00']{noformat}
> Filtering the dataset using any of the non-integer partition keys along with 
> string-valued operands works as expected:
> {code:python}
> df2=pd.read_parquet('./test.parquet', filters=[('key4','=','1.1'), ('key5', 
> '=', 'True')])
> df2.head()
> {code}
> Output:
> {noformat}
>   data key1 key2 key3 key4  key5 key6
> 0  foo    0    0    a  1.1  True 2021-06-02 00:00:00
> {noformat}
> However, filtering the dataset using either of the integer-valued partition 
> keys with a string-valued operand raises an exception, *even when the 
> original column's data type is string*:
> {code:python}
> df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')])
> df2.head()
> {code}
> {noformat}
> ArrowNotImplementedError: Function equal has no kernel matching input types 
> (array[int32], scalar[string])
> {noformat}
> It would seem to be less surprising / more consistent if filter operands 
> either (a) are always cast to string, or (b) always retain their original 
> type.
> Note, this issue may be related to ARROW-12114.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13578) Inconsistent handling of integer-valued partitions in dataset filters API

2021-08-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396598#comment-17396598
 ] 

Joris Van den Bossche commented on ARROW-13578:
---

[~mnizol] thanks a lot for the clear and detailed report. 

What's causing the confusion here is that we have both a legacy Python 
implementation of ParquetDataset and a new generic Datasets API, and that we 
are still in the middle of moving to the new implementation: {{ParquetDataset}} 
still uses the legacy implementation by default (but you can use 
{{use_legacy_dataset=False}} to opt in to the new), while {{pq.read_table}} 
(which is what {{pd.read_parquet}} uses under the hood) is already defaulting 
to the new implementation (but you can fall back to the old with 
{{use_legacy_dataset=True}}).

In the legacy ParquetDataset implementation, all partition keys are indeed 
parsed as strings as you show with the output of 
{{ParquetDataset.partitions.levels}}. So when passing 
{{use_legacy_dataset=True}} to the read function, using a string actually works:

{code}
In [19]: pd.read_parquet('./test.parquet', filters=[('key1','=','1')], 
use_legacy_dataset=True)
Out[19]: 
  data key1 key2 key3 key4   key5 key6
0  bar11b  2.2  False  2021-06-03 00:00:00
{code}

BTW, also using an integer works here ({{('key1', '=', 1)}}), because the 
legacy implementation will try to interpret the value with the type of the 
partition levels.
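The coercion the legacy implementation performs can be sketched in plain Python. This is a hedged illustration of the behavior described above (interpret the filter operand with the type of the stored partition keys), not the actual pyarrow code; the helper name {{coerce_operand}} is hypothetical:

```python
def coerce_operand(operand, level_keys):
    """Interpret `operand` using the type of the partition level's keys.

    The legacy ParquetDataset stores all partition keys as strings, so
    both an integer and a string operand end up comparable against the
    stored key. Hypothetical sketch, not the pyarrow implementation.
    """
    sample = level_keys[0]
    return type(sample)(operand)

# Partition keys as stored by the legacy implementation:
keys = ['0', '1', '2']
assert coerce_operand(1, keys) == '1'    # integer operand is cast to str
assert coerce_operand('1', keys) == '1'  # string operand passes through
```

This is why, with {{use_legacy_dataset=True}}, both {{('key1', '=', 1)}} and {{('key1', '=', '1')}} match the same partition.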

In the new Datasets API, the parsing of the directory paths currently supports 
int32 and strings (when inferring, you can use other types when explicitly 
passing the schema for the partition keys). 
So when creating a ParquetDataset with use_legacy_dataset=False, we see:

{code}
In [21]: ds = pq.ParquetDataset('test_partitions', use_legacy_dataset=False)

In [22]: ds._dataset.partitioning
Out[22]: 

In [23]: ds._dataset.partitioning.schema
Out[23]: 
key1: dictionary
key2: dictionary
key3: dictionary
key4: dictionary
key5: dictionary
key6: dictionary
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 340

In [24]: ds._dataset.partitioning.dictionaries
Out[24]: 
[
 [
   0,
   1,
   2
 ],
 
 [
   0,
   1,
   2
 ],
 
 [
   "a",
   "b",
   "c"
 ],
 
 [
   "1.1",
   "2.2",
   "3.3"
 ],
 
 [
   "True",
   "False"
 ],
 
 [
   "2021-06-02 00:00:00",
   "2021-06-03 00:00:00",
   "2021-06-04 00:00:00"
 ]]
{code}

So the first two partition keys are inferred as int, the others as string. And 
that's also the reason that for this case, you actually need to specify the 
filter using an integer (we decided to not do such automatic casting here in 
the new implementation).

Sidenote: I am using {{ds._dataset.partitioning}} above, but this will become 
{{ds.partitioning}} after ARROW-13525. 

Using the new datasets API directly, this looks like:

{code}
In [25]: import pyarrow.dataset as ds

In [26]: dataset = ds.dataset("test_partitions/", format="parquet", 
partitioning="hive")

In [28]: dataset.to_table(filter=ds.field("key1") == 1).to_pandas()
Out[28]: 
  data  key1  key2 key3 key4   key5 key6
0  bar 1 1b  2.2  False  2021-06-03 00:00:00
{code}
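The type inference the new Datasets API applies when discovering hive-style paths (try integer, otherwise keep the string, as shown by the partitioning schema above) can be sketched as a small pure-Python function. The name {{infer_hive_value}} is hypothetical; the real logic lives in the Arrow C++ partitioning code:

```python
def infer_hive_value(raw: str):
    """Infer a partition value from a hive-style `key=value` segment:
    try integer first, fall back to string. Loose sketch of the
    inference described above, not the Arrow C++ implementation."""
    try:
        return int(raw)
    except ValueError:
        return raw

# "key1=1" infers an integer; floats, booleans and dates stay strings,
# matching the dictionaries shown in the output above.
segments = "key1=1/key4=1.1/key5=True".split("/")
parsed = {k: infer_hive_value(v) for k, v in (s.split("=") for s in segments)}
assert parsed == {"key1": 1, "key4": "1.1", "key5": "True"}
```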

> Inconsistent handling of integer-valued partitions in dataset filters API
> -
>
> Key: ARROW-13578
> URL: https://issues.apache.org/jira/browse/ARROW-13578
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Matt Nizol
>Priority: Minor
>
> When creating a partitioned data set via the pandas.to_parquet() method, 
> partition columns are ostensibly cast to strings in the partition metadata.  
> When reading specific partitions via the filters parameter in 
> pandas.read_parquet(), string values must be used for filter operands _except 
> when_ the partition column has an integer value.  
> Consider the following example:
> {code:python}
> import datetime
> import pandas as pd
> df = pd.DataFrame({
> "key1": ['0', '1', '2'], 
> "key2": [0, 1, 2],
> "key3": ['a', 'b', 'c'],
> "key4": [1.1, 2.2, 3.3],
> "key5": [True, False, True],
> "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3), 
> datetime.date(2021, 6, 4)],
> "data": ["foo", "bar", "baz"]
> })
> df['key6'] = pd.to_datetime(df['key6'])
> df.to_parquet('./test.parquet', partition_cols=['key1', 'key2', 'key3', 
> 'key4', 'key5', 'key6'])
> {code}
> Reading into a ParquetDataset and inspecting the partition levels suggests 
> that partition keys have been cast to string, regardless of the original type:
> {code:python}
> import pyarrow.parquet as pq
> ds = pq.ParquetDataset('./test.parquet')
> for level in ds.partitions.levels:
> print(f"{level.name}: {level.keys}")
> {code}
> Output:
> {noformat}
> key1: ['0', '1', '2']
> key2: ['0', '1', '2']
> key3: ['a', 

[jira] [Commented] (ARROW-7179) [C++][Compute] Array support for fill_null

2021-08-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-7179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396577#comment-17396577
 ] 

Joris Van den Bossche commented on ARROW-7179:
--

We actually already have a "fill_null" kernel for filling with a scalar (added 
in ), so this issue was about expanding that kernel to also support array fill 
values. That's indeed the same as coalesce. But, "fill_null" is currently a 
fully separate implementation (scalar_fill_null.cc), while I suppose "coalesce" 
might also have a specialized path for scalar fill values? That could 
potentially be consolidated?
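For reference, the array-valued semantics requested in this issue can be sketched in plain Python (this only illustrates the expected behavior for {{None}}-as-null; it is not the Arrow kernel):

```python
def fill_null(values, fill):
    """Array-valued fill_null: replace each null (None) slot in `values`
    with the corresponding slot of `fill`; slots that are null in both
    arrays stay null. Pure-Python sketch of the requested semantics."""
    return [f if v is None else v for v, f in zip(values, fill)]

# Matches the example from the issue description:
assert fill_null([1, None, None, 3], [5, 6, None, 8]) == [1, 6, None, 3]
```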



> [C++][Compute] Array support for fill_null
> --
>
> Key: ARROW-7179
> URL: https://issues.apache.org/jira/browse/ARROW-7179
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.15.1
>Reporter: Ben Kietzman
>Assignee: David Li
>Priority: Major
>  Labels: analytics
> Fix For: 5.0.0
>
>
> Add kernels to support replacing null values in an array with values 
> taken from corresponding slots in another array:
> {code}
> fill_null([1, null, null, 3], [5, 6, null, 8]) -> [1, 6, null, 3]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13594) [CI] Turbodbc integration builds are failing due to use of deprecated/removed APIs

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-13594:
---
Component/s: Python

> [CI] Turbodbc integration builds are failing due to use of deprecated/removed 
> APIs
> --
>
> Key: ARROW-13594
> URL: https://issues.apache.org/jira/browse/ARROW-13594
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++, Continuous Integration, Python
>Reporter: Joris Van den Bossche
>Priority: Major
>
> See eg https://github.com/ursacomputing/crossbow/runs/3277446055
> ARROW-13552 removed some deprecated C++ APIs, and turbodbc was still using 
> some of those. See https://github.com/apache/arrow/pull/10868/files#r685800679
> cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-13581) [Python] pyarrow array equals return False if there's nan

2021-08-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche closed ARROW-13581.
-
Resolution: Duplicate

> [Python] pyarrow array equals return False if there's nan 
> --
>
> Key: ARROW-13581
> URL: https://issues.apache.org/jira/browse/ARROW-13581
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: David Zhang
>Priority: Major
>
> pyarrow array / chunked array / table `.equals` would return False if there 
> is nan value(s) in the data. 
> Example:
> {code:java}
> pa.array([1, np.nan]).equals(pa.array([1, np.nan])) {code}
> Above will return False instead of True



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13581) [Python] pyarrow array equals return False if there's nan

2021-08-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396553#comment-17396553
 ] 

Joris Van den Bossche commented on ARROW-13581:
---

See the discussion in ARROW-6043. This is currently somewhat deliberate, but 
the plan is to add an option to consider NaNs as equal if they occur in the 
same location. 

Closing this as a duplicate of ARROW-6043.

> [Python] pyarrow array equals return False if there's nan 
> --
>
> Key: ARROW-13581
> URL: https://issues.apache.org/jira/browse/ARROW-13581
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: David Zhang
>Priority: Major
>
> pyarrow array / chunked array / table `.equals` would return False if there 
> is nan value(s) in the data. 
> Example:
> {code:java}
> pa.array([1, np.nan]).equals(pa.array([1, np.nan])) {code}
> Above will return False instead of True



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13581) [Python] pyarrow array equals return False if there's nan

2021-08-10 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-13581:
--
Summary: [Python] pyarrow array equals return False if there's nan   (was: 
pyarrow array equals return False if there's nan )

> [Python] pyarrow array equals return False if there's nan 
> --
>
> Key: ARROW-13581
> URL: https://issues.apache.org/jira/browse/ARROW-13581
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: David Zhang
>Priority: Major
>
> pyarrow array / chunked array / table `.equals` would return False if there 
> is nan value(s) in the data. 
> Example:
> {code:java}
> pa.array([1, np.nan]).equals(pa.array([1, np.nan])) {code}
> Above will return False instead of True



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13594) [CI] Turbodbc integration builds are failing due to use of deprecated/removed APIs

2021-08-10 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396535#comment-17396535
 ] 

Joris Van den Bossche commented on ARROW-13594:
---

I opened https://github.com/blue-yonder/turbodbc/issues/317 on the turbodbc 
side.

> [CI] Turbodbc integration builds are failing due to use of deprecated/removed 
> APIs
> --
>
> Key: ARROW-13594
> URL: https://issues.apache.org/jira/browse/ARROW-13594
> Project: Apache Arrow
>  Issue Type: Test
>  Components: C++, Continuous Integration
>Reporter: Joris Van den Bossche
>Priority: Major
>
> See eg https://github.com/ursacomputing/crossbow/runs/3277446055
> ARROW-13552 removed some deprecated C++ APIs, and turbodbc was still using 
> some of those. See https://github.com/apache/arrow/pull/10868/files#r685800679
> cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13594) [CI] Turbodbc integration builds are failing due to use of deprecated/removed APIs

2021-08-10 Thread Joris Van den Bossche (Jira)
Joris Van den Bossche created ARROW-13594:
-

 Summary: [CI] Turbodbc integration builds are failing due to use 
of deprecated/removed APIs
 Key: ARROW-13594
 URL: https://issues.apache.org/jira/browse/ARROW-13594
 Project: Apache Arrow
  Issue Type: Test
  Components: C++, Continuous Integration
Reporter: Joris Van den Bossche


See eg https://github.com/ursacomputing/crossbow/runs/3277446055

ARROW-13552 removed some deprecated C++ APIs, and turbodbc was still using some 
of those. See https://github.com/apache/arrow/pull/10868/files#r685800679

cc [~uwe]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-13593) [C++][Dataset][Parquet] Support parquet modular encryption in the new Dataset API

2021-08-10 Thread Maya Anderson (Jira)
Maya Anderson created ARROW-13593:
-

 Summary: [C++][Dataset][Parquet] Support parquet modular 
encryption in the new Dataset API
 Key: ARROW-13593
 URL: https://issues.apache.org/jira/browse/ARROW-13593
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Parquet
Reporter: Maya Anderson
Assignee: Maya Anderson


In order for the new Dataset API to fully support PME, the same writer 
properties that include file_encryption_properties shouldn’t be used for the 
whole dataset. file_encryption_properties should be per file, for example in 
order to support key rotation https://issues.apache.org/jira/browse/ARROW-9960 .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-6908) [C++] Add support for Bazel

2021-08-10 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou closed ARROW-6908.
-
Assignee: (was: Micah Kornfield)

> [C++] Add support for Bazel
> ---
>
> Key: ARROW-6908
> URL: https://issues.apache.org/jira/browse/ARROW-6908
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Aryan Naraghi
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> I would like to use Arrow in a C++ project that uses Bazel.
>  
> Would it be possible to add support for building Arrow using Bazel?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13592) How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow

2021-08-10 Thread wangdapeng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangdapeng updated ARROW-13592:
---
Description: 
One of the key libraries in myproject uses "-D_GLIBCXX_USE_CXX11_ABI=0", and 
there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow with 
"-D_GLIBCXX_USE_CXX11_ABI=0".

env: GCC 7.5 cmake 3.16

Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
not recognized, and the file was truncated  

commands:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
 make arrow
 make parquet

  was:
env: GCC 7.5 cmake 3.16

Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
not recognized, and the file was truncated  

commands:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
 make arrow
 make parquet


> How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow
> --
>
> Key: ARROW-13592
> URL: https://issues.apache.org/jira/browse/ARROW-13592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: wangdapeng
>Priority: Blocker
>
> One of the key libraries in myproject uses "-D_GLIBCXX_USE_CXX11_ABI=0", and 
> there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow 
> with "-D_GLIBCXX_USE_CXX11_ABI=0".
> env: GCC 7.5 cmake 3.16
> Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
> not recognized, and the file was truncated  
> commands:
> cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
> -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
>  make arrow
>  make parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13592) How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow

2021-08-10 Thread wangdapeng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangdapeng updated ARROW-13592:
---
Description: 
One of the key libraries in my project uses "-D_GLIBCXX_USE_CXX11_ABI=0", and 
there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow with 
"-D_GLIBCXX_USE_CXX11_ABI=0".

env: GCC 7.5 cmake 3.16

Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
not recognized, and the file was truncated  

commands:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
 make arrow
 make parquet

  was:
One of the key libraries in myproject uses "-D_GLIBCXX_USE_CXX11_ABI=0", and 
there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow with 
"-D_GLIBCXX_USE_CXX11_ABI=0".

env: GCC 7.5 cmake 3.16

Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
not recognized, and the file was truncated  

commands:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
 make arrow
 make parquet


> How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow
> --
>
> Key: ARROW-13592
> URL: https://issues.apache.org/jira/browse/ARROW-13592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: wangdapeng
>Priority: Blocker
>
> One of the key libraries in my project uses "-D_GLIBCXX_USE_CXX11_ABI=0", and 
> there is no version "-D_GLIBCXX_USE_CXX11_ABI=1",so I had to compile arrow 
> with "-D_GLIBCXX_USE_CXX11_ABI=0".
> env: GCC 7.5 cmake 3.16
> Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
> not recognized, and the file was truncated  
> commands:
> cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
> -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
>  make arrow
>  make parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13592) How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow

2021-08-10 Thread wangdapeng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangdapeng updated ARROW-13592:
---
Summary: How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow  (was: 
How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0")

> How to use "-D_GLIBCXX_USE_CXX11_ABI=0" correctly in arrow
> --
>
> Key: ARROW-13592
> URL: https://issues.apache.org/jira/browse/ARROW-13592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: wangdapeng
>Priority: Blocker
>
> env: GCC 7.5 cmake 3.16
> Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
> not recognized, and the file was truncated  
> commands:
> cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
> -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
>  make arrow
>  make parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-13592) How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"

2021-08-10 Thread wangdapeng (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangdapeng updated ARROW-13592:
---
Description: 
env: GCC 7.5 cmake 3.16

Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
not recognized, and the file was truncated  

commands:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
 make arrow
 make parquet

  was:
Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
not recognized, and the file was truncated  

commands:

cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
-DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
-DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
 make arrow
 make parquet


> How to compile arrow with "-D_GLIBCXX_USE_CXX11_ABI=0"
> --
>
> Key: ARROW-13592
> URL: https://issues.apache.org/jira/browse/ARROW-13592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 5.0.0
>Reporter: wangdapeng
>Priority: Blocker
>
> env: GCC 7.5 cmake 3.16
> Arrow was compiled with -D_GLIBCXX_USE_CXX11_ABI=0, the link error : file was 
> not recognized, and the file was truncated  
> commands:
> cmake .. -DARROW_DATASET=ON -DARROW_FILESYSTEM=ON -DARROW_PARQUET=ON 
> -DARROW_WITH_SNAPPY=ON -DARROW_WITH_ZLIB=ON 
> -DCMAKE_CXX_FLAGS='-D_GLIBCXX_USE_CXX11_ABI=0'
>  make arrow
>  make parquet



--
This message was sent by Atlassian Jira
(v8.3.4#803005)