[jira] [Created] (ARROW-15500) parquet link undefined reference to base64_encode(), unpack32(), etc

2022-01-28 Thread Ryan Seghers (Jira)
Ryan Seghers created ARROW-15500:


 Summary: parquet link undefined reference to base64_encode(), 
unpack32(), etc
 Key: ARROW-15500
 URL: https://issues.apache.org/jira/browse/ARROW-15500
 Project: Apache Arrow
  Issue Type: Bug
  Components: Parquet
Affects Versions: 6.0.1
 Environment: ubuntu 20.04
vcpkg master latest as of 2022-01-08, and tag 2022.01.01
gcc 9 and 10 latest
cmake 3.22.1

Reporter: Ryan Seghers


I'm trying to build on Ubuntu 20.04, using the latest vcpkg master, the latest 
gcc-9 and gcc-10, and cmake 3.22.1. I can build and link Arrow in a small test 
program and write a CSV file. When I try to build with Parquet I get several 
link-time errors. Here is the first full linker error:

/usr/bin/ld: /home/ryans/src/vcpkg/installed/x64-linux/lib/libparquet.a(writer.cc.o): in function `parquet::arrow::GetSchemaMetadata(arrow::Schema const&, arrow::MemoryPool*, parquet::ArrowWriterProperties const&, std::shared_ptr<arrow::KeyValueMetadata const>*) [clone .localalias]':
writer.cc:(.text+0x179): undefined reference to `arrow::util::base64_encode[abi:cxx11](nonstd::sv_lite::basic_string_view<char, std::char_traits<char> >)'

Here are snippets from the rest:

undefined reference to `arrow::internal::unpack32(unsigned int const*, unsigned int*, int, int)'

undefined reference to `arrow::internal::unpack64(unsigned char const*, unsigned long*, int, int)'

undefined reference to `arrow::io::BufferedInputStream::Create(long, arrow::MemoryPool*, std::shared_ptr<arrow::io::InputStream>, long)'

undefined reference to `arrow::util::base64_decode[abi:cxx11](nonstd::sv_lite::basic_string_view<char, std::char_traits<char> >)'

I have also tried vcpkg tag 2022.01.01 (I think it is Arrow 6.0.0) and it 
produced what looked like the same set of undefined symbols.
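
For reference, here is a minimal sketch of how I'm consuming the libraries (the 
{{my_app}} target and source name are placeholders, and I'm assuming the 
non-namespaced {{arrow_static}}/{{parquet_static}} targets that the vcpkg port 
exports). With static archives the link order matters: libparquet.a depends on 
symbols in libarrow.a, so parquet has to come before arrow on the link line:

{code:cmake}
cmake_minimum_required(VERSION 3.16)
project(my_app CXX)

# vcpkg installs separate CMake configs for Arrow and Parquet.
find_package(Arrow CONFIG REQUIRED)
find_package(Parquet CONFIG REQUIRED)

add_executable(my_app main.cpp)

# Order matters for static archives: the linker resolves libparquet.a's
# references (base64_encode, unpack32, ...) against libarrow.a only if
# arrow appears after parquet on the command line.
target_link_libraries(my_app PRIVATE parquet_static arrow_static)
{code}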





[jira] [Created] (ARROW-15499) [Python] Fix import error in pyarrow._orc

2022-01-28 Thread Krisztian Szucs (Jira)
Krisztian Szucs created ARROW-15499:
---

 Summary: [Python] Fix import error in pyarrow._orc
 Key: ARROW-15499
 URL: https://issues.apache.org/jira/browse/ARROW-15499
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs








[GitHub] [arrow-testing] westonpace opened a new pull request #74: ARROW-15425: [Integration] Add delta dictionaries in file format to integration tests

2022-01-28 Thread GitBox


westonpace opened a new pull request #74:
URL: https://github.com/apache/arrow-testing/pull/74


   This adds example IPC files containing a delta dictionary, for both the 
file and the streaming IPC formats.  It requires a small change to the 
integration programs (https://github.com/apache/arrow/pull/12291) to work 
correctly.
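
   For reference, here is roughly how such a file can be produced from Python (a sketch, assuming pyarrow's `emit_dictionary_deltas` IPC option; only the streaming format is shown):

```python
import pyarrow as pa

# Two batches over the same dictionary-encoded field; the second batch's
# dictionary extends the first, so with delta emission enabled the writer
# sends only the new entry ("c") as a delta dictionary message.
schema = pa.schema([("col", pa.dictionary(pa.int8(), pa.utf8()))])
batch1 = pa.record_batch(
    [pa.DictionaryArray.from_arrays(
        pa.array([0, 1], type=pa.int8()), pa.array(["a", "b"]))],
    schema=schema)
batch2 = pa.record_batch(
    [pa.DictionaryArray.from_arrays(
        pa.array([0, 2], type=pa.int8()), pa.array(["a", "b", "c"]))],
    schema=schema)

sink = pa.BufferOutputStream()
options = pa.ipc.IpcWriteOptions(emit_dictionary_deltas=True)
with pa.ipc.new_stream(sink, schema, options=options) as writer:
    writer.write_batch(batch1)
    writer.write_batch(batch2)  # written as a delta, not a replacement
```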






[jira] [Created] (ARROW-15498) Implement Bloom filter pushdown between hash joins

2022-01-28 Thread Sasha Krassovsky (Jira)
Sasha Krassovsky created ARROW-15498:


 Summary: Implement Bloom filter pushdown between hash joins 
 Key: ARROW-15498
 URL: https://issues.apache.org/jira/browse/ARROW-15498
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Sasha Krassovsky
Assignee: Sasha Krassovsky


When there is a chain of hash joins, it's often worthwhile to create Bloom 
filters and push them to the earliest possible point in the chain of joins to 
minimize the number of materialized rows.
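
To illustrate the structure involved (a toy sketch only, not the planned 
implementation): the join's build side inserts its key hashes into a compact 
bit array, and upstream operators probe it to discard rows that cannot 
possibly find a match.

{code:cpp}
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Toy Bloom filter: false positives are possible, false negatives are not,
// so dropping rows whose keys fail MightContain() is always safe.
class ToyBloomFilter {
 public:
  explicit ToyBloomFilter(size_t num_bits) : bits_(num_bits, false) {}

  void Insert(int64_t key) {
    for (uint64_t h : Hashes(key)) bits_[h % bits_.size()] = true;
  }

  bool MightContain(int64_t key) const {
    for (uint64_t h : Hashes(key)) {
      if (!bits_[h % bits_.size()]) return false;
    }
    return true;
  }

 private:
  static std::array<uint64_t, 2> Hashes(int64_t key) {
    uint64_t h = std::hash<int64_t>{}(key);
    return {h, h * 0x9e3779b97f4a7c15ULL + 1};
  }

  std::vector<bool> bits_;
};
{code}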





[jira] [Created] (ARROW-15497) [C++][Homebrew] Use Clang Tools 12

2022-01-28 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-15497:


 Summary: [C++][Homebrew] Use Clang Tools 12
 Key: ARROW-15497
 URL: https://issues.apache.org/jira/browse/ARROW-15497
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Developer Tools
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou








[jira] [Created] (ARROW-15496) [Python] Log warning when user tries to write parquet table with incompatible type

2022-01-28 Thread Grant Williams (Jira)
Grant Williams created ARROW-15496:
--

 Summary: [Python] Log warning when user tries to write parquet 
table with incompatible type
 Key: ARROW-15496
 URL: https://issues.apache.org/jira/browse/ARROW-15496
 Project: Apache Arrow
  Issue Type: Wish
  Components: Parquet, Python
Reporter: Grant Williams


Could we get a logged warning when a user calls 
`pyarrow.parquet.write_table()` with `version='1.0'` and a schema that contains 
an incompatible `uint32()` type? I don't think the behavior of upcasting to 
`int64()` is immediately obvious (although the docs are clear on it), and I 
think it would help prevent confusion for other users.
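
A minimal sketch of the surprising behavior (the file name is arbitrary):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": pa.array([1, 2, 3], type=pa.uint32())})
pq.write_table(table, "example.parquet", version="1.0")

# Parquet format 1.0 has no unsigned 32-bit type, so the column is
# silently upcast and comes back as int64, with no warning logged.
print(pq.read_table("example.parquet").schema)  # x: int64
{code}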





[jira] [Created] (ARROW-15495) [C++][FlightRPC] Ensure system gRPC is only used with system Protobuf

2022-01-28 Thread David Li (Jira)
David Li created ARROW-15495:


 Summary: [C++][FlightRPC] Ensure system gRPC is only used with 
system Protobuf
 Key: ARROW-15495
 URL: https://issues.apache.org/jira/browse/ARROW-15495
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, FlightRPC
Reporter: David Li


See Kou's post on the ML: 
[https://lists.apache.org/thread/dg2nm7r9vpo42toygg8o8rzf8gkg6knb]

We should ensure system gRPC doesn't get mixed with bundled Protobuf, which 
can cause test failures (it is also not really a valid combination: it will 
likely link in two copies of Protobuf).
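
For example, in terms of the per-dependency source overrides in the C++ build 
(I'm assuming the usual {{gRPC_SOURCE}}/{{Protobuf_SOURCE}} variables here):

{noformat}
# Consistent: both resolved from the system.
cmake .. -DgRPC_SOURCE=SYSTEM -DProtobuf_SOURCE=SYSTEM

# Problematic: system gRPC was itself built against the system Protobuf,
# so pairing it with bundled Protobuf likely links two copies of Protobuf.
cmake .. -DgRPC_SOURCE=SYSTEM -DProtobuf_SOURCE=BUNDLED
{noformat}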





[jira] [Created] (ARROW-15494) [Docs] Clarify {{existing_data_behavior}} docstring

2022-01-28 Thread Jira
Martin Thøgersen created ARROW-15494:


 Summary: [Docs] Clarify {{existing_data_behavior}} docstring
 Key: ARROW-15494
 URL: https://issues.apache.org/jira/browse/ARROW-15494
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 7.0.1
Reporter: Martin Thøgersen


Slightly clarify the wording of the {{pyarrow.dataset.write_dataset()}} 
parameter {{existing_data_behavior}}.

[https://github.com/apache/arrow/blob/a27c55660e575a3987283d5d9e443642db48f215/python/pyarrow/dataset.py#L812-L827]

Proposed wording:

{noformat}
existing_data_behavior : 'error' | 'overwrite_or_ignore' | \
'delete_matching'
Controls how the dataset will handle data that already exists in
the destination.  The default behavior ('error') is to raise an error
if any data exists in the `base_dir` destination.

'overwrite_or_ignore' will ignore any existing data and will
overwrite files with the same name as an output file.  Other
existing files will be ignored.  This behavior, in combination
with a unique basename_template for each write, will allow for
an append workflow.

'delete_matching' is useful when you are writing a partitioned
dataset.  The first time each partition leaf-level directory is 
encountered the entire leaf-level directory will be deleted.  This
allows you to overwrite old partitions completely.
{noformat}

I.e. clarify that:
- {{error}} applies to the base_dir level.
- {{delete_matching}} applies to the leaf-level directory.
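
For illustration, a minimal usage sketch of the behavior described above 
(paths, schema, and partitioning are made up):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2021, 2021, 2022], "value": [1.0, 2.0, 3.0]})

# 'error' (the default) raises if anything already exists under base_dir;
# 'delete_matching' clears only the leaf directories (year=2021, ...)
# that this particular write is about to refill.
ds.write_dataset(
    table,
    "out_dir",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int64())]),
                                 flavor="hive"),
    existing_data_behavior="delete_matching",
)
{code}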





[jira] [Created] (ARROW-15493) [C++][Gandiva] Uninitialized data member causes random gandiva-filter-test failures

2022-01-28 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-15493:


 Summary: [C++][Gandiva] Uninitialized data member causes random 
gandiva-filter-test failures
 Key: ARROW-15493
 URL: https://issues.apache.org/jira/browse/ARROW-15493
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva
Reporter: Yibo Cai
Assignee: Yibo Cai


gandiva-filter-test {{TestFilter.TestFilterCache}} fails on Arm, though the bug 
is not architecture dependent.
Class member *mode_* is not initialized in one of the ExpressionCacheKey 
constructors [1], but it is used to compare equality of two instances [2]. 
This causes flaky gandiva-filter-test failures.

[1] 
https://github.com/apache/arrow/blob/master/cpp/src/gandiva/expression_cache_key.h#L55
[2] 
https://github.com/apache/arrow/blob/master/cpp/src/gandiva/expression_cache_key.h#L92
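
The failure pattern, distilled (an illustrative sketch, not the actual Gandiva 
code):

{code:cpp}
#include <iostream>

struct CacheKey {
  explicit CacheKey(int expr) : expr_(expr) {}  // mode_ left uninitialized!
  CacheKey(int expr, int mode) : expr_(expr), mode_(mode) {}

  bool operator==(const CacheKey& other) const {
    // Reads an indeterminate value when either side used the first
    // constructor, so the comparison (and the test) is nondeterministic.
    return expr_ == other.expr_ && mode_ == other.mode_;
  }

  int expr_;
  int mode_;
};

int main() {
  std::cout << (CacheKey(42) == CacheKey(42)) << std::endl;  // 0 or 1, randomly
  return 0;
}
{code}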


