[jira] [Updated] (ARROW-16614) [C++] Use lz4::lz4 for lz4's CMake target name

2022-05-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16614:
---
Labels: pull-request-available  (was: )

> [C++] Use lz4::lz4 for lz4's CMake target name
> --
>
> Key: ARROW-16614
> URL: https://issues.apache.org/jira/browse/ARROW-16614
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Kouhei Sutou
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Because upstream's CMake package exports the target {{lz4::lz4}}, not {{LZ4::lz4}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16614) [C++] Use lz4::lz4 for lz4's CMake target name

2022-05-18 Thread Kouhei Sutou (Jira)
Kouhei Sutou created ARROW-16614:


 Summary: [C++] Use lz4::lz4 for lz4's CMake target name
 Key: ARROW-16614
 URL: https://issues.apache.org/jira/browse/ARROW-16614
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kouhei Sutou
Assignee: Kouhei Sutou


Because upstream's CMake package exports the target {{lz4::lz4}}, not {{LZ4::lz4}}.
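
As a hedged illustration (not taken from Arrow's build files), this is how the naming difference looks to a CMake consumer; the {{find_package}} spelling is an assumption:

{code}
# Sketch only: consuming lz4 through its upstream CMake package config.
find_package(lz4 CONFIG REQUIRED)

add_executable(demo demo.cc)
# Upstream exports the imported target as lz4::lz4 (lowercase namespace),
# so the link line must spell it that way; LZ4::lz4 would not resolve.
target_link_libraries(demo PRIVATE lz4::lz4)
{code}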



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-11135) [Java][Gandiva] Using Maven Central artifacts as dependencies produces runtime errors

2022-05-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou updated ARROW-11135:
-
Fix Version/s: 7.0.0
   (was: 8.0.0)

> [Java][Gandiva] Using Maven Central artifacts as dependencies produces 
> runtime errors
> -
>
> Key: ARROW-11135
> URL: https://issues.apache.org/jira/browse/ARROW-11135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Michael Mior
>Assignee: Anthony Louis Gotlib Ferreira
>Priority: Major
> Fix For: 7.0.0
>
>
> I'm working on connecting Arrow/Gandiva with Apache Calcite. Overall the 
> integration is working well, but I'm having issues. As [suggested on the 
> mailing 
> list|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E],
>  using Dremio's public artifacts solves the problem. However, between two 
> Apache projects, there would be a strong preference to use Apache artifacts 
> as a dependency.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-11135) [Java][Gandiva] Using Maven Central artifacts as dependencies produces runtime errors

2022-05-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-11135.
--
Fix Version/s: 8.0.0
   Resolution: Fixed

Thanks for confirming this!

I'll close this.

> [Java][Gandiva] Using Maven Central artifacts as dependencies produces 
> runtime errors
> -
>
> Key: ARROW-11135
> URL: https://issues.apache.org/jira/browse/ARROW-11135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Michael Mior
>Assignee: Anthony Louis Gotlib Ferreira
>Priority: Major
> Fix For: 8.0.0
>
>
> I'm working on connecting Arrow/Gandiva with Apache Calcite. Overall the 
> integration is working well, but I'm having issues. As [suggested on the 
> mailing 
> list|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E],
>  using Dremio's public artifacts solves the problem. However, between two 
> Apache projects, there would be a strong preference to use Apache artifacts 
> as a dependency.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-11135) [Java][Gandiva] Using Maven Central artifacts as dependencies produces runtime errors

2022-05-18 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539295#comment-17539295
 ] 

Jonathan Swenson commented on ARROW-11135:
--

I may be missing something, but it appears as though this solution does resolve 
the issue. 

Upgrading to arrow-gandiva 8.0.0 (haven't tried 7.0.0) from Maven Central 
appears to work on Intel Macs, but fails with a different linker error when 
running on an M1 Mac (Apple Silicon). Filed 
https://issues.apache.org/jira/browse/ARROW-16608 to track this additional 
issue, but I believe that this particular issue is solved. 

> [Java][Gandiva] Using Maven Central artifacts as dependencies produces 
> runtime errors
> -
>
> Key: ARROW-11135
> URL: https://issues.apache.org/jira/browse/ARROW-11135
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 2.0.0, 3.0.0
>Reporter: Michael Mior
>Assignee: Anthony Louis Gotlib Ferreira
>Priority: Major
>
> I'm working on connecting Arrow/Gandiva with Apache Calcite. Overall the 
> integration is working well, but I'm having issues. As [suggested on the 
> mailing 
> list|https://lists.apache.org/thread.html/r93a4fedb499c746917ab8d62cf5a8db8c93a7f24bc9fac81f90bedaa%40%3Cuser.arrow.apache.org%3E],
>  using Dremio's public artifacts solves the problem. However, between two 
> Apache projects, there would be a strong preference to use Apache artifacts 
> as a dependency.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-16609) [C++] xxhash not installed into dist/lib/include when building C++

2022-05-18 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539285#comment-17539285
 ] 

Alenka Frim edited comment on ARROW-16609 at 5/19/22 4:47 AM:
--

Also 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h]
 would be needed in {{dist/include}} for 
[{{arrow/python/python_to_arrow.cc:44}}|https://github.com/apache/arrow/blob/1cdedc4cbf0709ce440d69242afd47474a7148c7/cpp/src/arrow/python/python_to_arrow.cc#L44]
 and {{arrow/vendored/portable-snippets}} for 
{{arrow/util/int_util_internal.h:30}}.


was (Author: alenkaf):
Also 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h]
 would be needed in {{dist/include}} for 
[{{arrow/python/python_to_arrow.cc:44}}|https://github.com/apache/arrow/blob/1cdedc4cbf0709ce440d69242afd47474a7148c7/cpp/src/arrow/python/python_to_arrow.cc#L44].

> [C++] xxhash not installed into dist/lib/include when building C++
> --
>
> Key: ARROW-16609
> URL: https://issues.apache.org/jira/browse/ARROW-16609
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Alenka Frim
>Priority: Blocker
> Fix For: 9.0.0
>
>
> My C++ build setup doesn’t install {{dist/include/arrow/vendored/xxhash/}} 
> but only {{dist/include/arrow/vendored/xxhash.h}}. The last time the module 
> was installed was in November 2021.
> As {{arrow/python/arrow_to_pandas.cc}} includes {{arrow/util/hashing.h}} ->  
> {{arrow/vendored/xxhash.h}} -> {{arrow/vendored/xxhash/xxhash.h}}, this 
> module is needed when trying to build the Python C++ API separately from C++ 
> (https://issues.apache.org/jira/browse/ARROW-16340).
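
For illustration, a hypothetical CMake install rule of roughly this shape is what would put the vendored directory, and not just the shim header, under {{dist/include}}; all paths here are assumptions, not Arrow's actual build code:

{code}
# Hypothetical sketch: install the vendored xxhash directory alongside the
# xxhash.h shim so that the include chain
# arrow/vendored/xxhash.h -> arrow/vendored/xxhash/xxhash.h resolves
# after installation.
install(DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/vendored/xxhash"
        DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow/vendored"
        FILES_MATCHING PATTERN "*.h")
{code}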



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16609) [C++] xxhash not installed into dist/lib/include when building C++

2022-05-18 Thread Alenka Frim (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539285#comment-17539285
 ] 

Alenka Frim commented on ARROW-16609:
-

Also 
[https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/int_util_internal.h]
 would be needed in {{dist/include}} for 
[{{arrow/python/python_to_arrow.cc:44}}|https://github.com/apache/arrow/blob/1cdedc4cbf0709ce440d69242afd47474a7148c7/cpp/src/arrow/python/python_to_arrow.cc#L44].

> [C++] xxhash not installed into dist/lib/include when building C++
> --
>
> Key: ARROW-16609
> URL: https://issues.apache.org/jira/browse/ARROW-16609
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Alenka Frim
>Priority: Blocker
> Fix For: 9.0.0
>
>
> My C++ build setup doesn’t install {{dist/include/arrow/vendored/xxhash/}} 
> but only {{dist/include/arrow/vendored/xxhash.h}}. The last time the module 
> was installed was in November 2021.
> As {{arrow/python/arrow_to_pandas.cc}} includes {{arrow/util/hashing.h}} ->  
> {{arrow/vendored/xxhash.h}} -> {{arrow/vendored/xxhash/xxhash.h}}, this 
> module is needed when trying to build the Python C++ API separately from C++ 
> (https://issues.apache.org/jira/browse/ARROW-16340).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16612) [R] parquet files with compression extensions should use parquet writer for compression

2022-05-18 Thread Alenka Frim (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alenka Frim updated ARROW-16612:

Summary: [R] parquet files with compression extensions should use parquet 
writer for compression  (was: parquet files with compression extensions should 
use parquet writer for compression)

> [R] parquet files with compression extensions should use parquet writer for 
> compression
> ---
>
> Key: ARROW-16612
> URL: https://issues.apache.org/jira/browse/ARROW-16612
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 8.0.0
>Reporter: Sam Albers
>Priority: Minor
>
> Right now arrow will silently write a file with a .gz extension to a 
> CompressedOutputStream rather than passing the compression option to the 
> parquet writer itself. The internal detect_compression() function detects the 
> extension, and that is what routes the file incorrectly. However, it only 
> fails at the read_parquet stage, which could lead to confusion. 
> {code:r}
> library(arrow, warn.conflicts = FALSE) 
> tf <- tempfile(fileext = ".parquet.gz") 
> write_parquet(data.frame(x = 1:5), tf, compression = "gzip", compression_level = 5)
> read_parquet(tf) 
> #> Error: file must be a "RandomAccessFile"{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-15222) [Ruby] Use Compute functions for Enumerable operations on Column

2022-05-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-15222.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 12053
[https://github.com/apache/arrow/pull/12053]

> [Ruby] Use Compute functions for Enumerable operations on Column
> 
>
> Key: ARROW-15222
> URL: https://issues.apache.org/jira/browse/ARROW-15222
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Ruby
>Reporter: Kanstantsin Ilchanka
>Assignee: Kanstantsin Ilchanka
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently operations like 
> {code:java}
> table['column'].sum{code}
>  use the Enumerable module and are much slower than using the 
> Arrow::Function {{sum}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16604) [C++] Boost not included when build benchmarks

2022-05-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-16604.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13192
[https://github.com/apache/arrow/pull/13192]

> [C++] Boost not included when build benchmarks
> --
>
> Key: ARROW-16604
> URL: https://issues.apache.org/jira/browse/ARROW-16604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> {code:bash}
> cmake -GNinja -DARROW_BUILD_BENCHMARKS=ON ..
> {code}
> failed with many Boost-related errors, as below:
> {code:bash}
> CMake Error at cmake_modules/BuildUtils.cmake:522 (add_executable):
>   Target "arrow-json-parser-benchmark" links to target "Boost::system" but
>   the target was not found.  Perhaps a find_package() call is missing for an
>   IMPORTED target, or an ALIAS target is missing?
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:114 (add_benchmark)
>   src/arrow/json/CMakeLists.txt:28 (add_arrow_benchmark)
> {code}
> The error is gone if tests are also built with {{-DARROW_BUILD_TESTS=ON}}. It 
> looks like Boost is not included when only benchmarks are built.
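
Not Arrow's actual dependency logic, but a minimal sketch of the kind of guard that avoids this class of error: make the Boost lookup depend on every consumer of Boost targets, not only on tests. The option names are the ones mentioned above; the component list is an assumption:

{code}
# Hypothetical sketch: resolve Boost whenever any Boost consumer is enabled,
# instead of only when tests are built.
if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS)
  find_package(Boost REQUIRED COMPONENTS system filesystem)
endif()
{code}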



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16608) [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using mavencentral artifacts

2022-05-18 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539166#comment-17539166
 ] 

Kouhei Sutou commented on ARROW-16608:
--

It seems that we need to build bundled binaries on an M1 Mac like we did for 
wheels.

Related files:
* 
https://github.com/apache/arrow/blob/master/dev/tasks/python-wheels/github.osx.arm64.yml
 for wheel
* https://github.com/apache/arrow/blob/master/dev/tasks/java-jars/github.yml 
for jars

Should we create one {{libgandiva_jni.dylib}} that contains binaries for x86_64 
and arm64? Or separate files such as {{libgandiva_jni_x86_64.dylib}} and 
{{libgandiva_jni_arm64.dylib}} or {{x86_64/libgandiva_jni.dylib}} and 
{{arm64/libgandiva_jni.dylib}}?
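
For the single-file option, a hedged sketch of how a universal (fat) dylib can be produced with CMake on macOS; the target and source names are illustrative only:

{code}
# Hypothetical sketch: build one libgandiva_jni.dylib containing both
# architectures by compiling x86_64 and arm64 slices in a single configure.
set(CMAKE_OSX_ARCHITECTURES "x86_64;arm64")
add_library(gandiva_jni SHARED jni/gandiva_jni.cc)
# The resulting dylib contains both slices; `lipo -info` can verify.
{code}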

[~anthonylouis] Do you want to work on this?

> [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using mavencentral 
> artifacts
> --
>
> Key: ARROW-16608
> URL: https://issues.apache.org/jira/browse/ARROW-16608
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Java
>Affects Versions: 8.0.0
>Reporter: Jonathan Swenson
>Priority: Major
>
> Potentially a blocker for Arrow integration into Calcite (CALCITE-2040); 
> however, it may be possible to move forward without M1 Mac support. 
> Potentially somewhat related to ARROW-11135.
> Getting an instance of the JniLoader throws an UnsatisfiedLinkError when it 
> tries to load the libgandiva_jni.dylib that it has extracted from the jar 
> into a temporary directory. 
> Simplified error:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError: 
> /tmp_dir/libgandiva_jni.dylib_uuid: 
> dlopen(/tmp_dir/libgandiva_jni.dylib_uuid, 0x0001): tried: 
> '/tmp_dir/libgandiva_jni.dylib_uuid' (mach-o file, but is an incompatible 
> architecture (have 'x86_64', need 'arm64e')){code}
>  
> Full error and stack trace:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError: 
> /private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe:
>  
> dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe,
>  0x0001): tried: 
> '/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe'
>  (mach-o file, but is an incompatible architecture (have 'x86_64', need 
> 'arm64e'))
>     at java.lang.ClassLoader$NativeLibrary.load(Native Method)
>     at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950)
>     at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832)
>     at java.lang.Runtime.load0(Runtime.java:811)
>     at java.lang.System.load(System.java:1088)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144)
>     at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67)
>     at io.acme.Main.main(Main.java:26) {code}
>  
> This example loads three libraries from Maven Central using Gradle: 
> {code:java}
> repositories {
> mavenCentral()
> }
> dependencies {
> implementation("org.apache.arrow:arrow-memory-netty:8.0.0")
> implementation("org.apache.arrow:arrow-vector:8.0.0")
> implementation("org.apache.arrow.gandiva:arrow-gandiva:8.0.0")
> } {code}
> Example code: 
> {code:java}
> public class Main {
>   public static void main(String[] args) throws GandivaException {
> Field field = new Field("int_field", FieldType.nullable(new 
> ArrowType.Int(32, true)), null);
> Schema schema = makeSchema(field);
> Condition condition = makeCondition(field);
> Filter.make(schema, condition);
>   }
>   private static Schema makeSchema(Field field) {
> List<Field> fieldList = new ArrayList<>();
> fieldList.add(field);
> return new Schema(fieldList, null);
>   }
>   private static Condition makeCondition(Field f) {
> List<TreeNode> treeNodes = new ArrayList<>(2);
> treeNodes.add(TreeBuilder.makeField(f));
> treeNodes.add(TreeBuilder.makeLiteral(4));
> TreeNode comparison = TreeBuilder.makeFunction("less_than", treeNodes, 
> new ArrowType.Bool());
> return TreeBuilder.makeCondition(comparison);
>   }
> } {code}
> While I haven't tested this exact example, a similar example executes without 
> issue on an Intel x86 Mac.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16585) [C++] CMake and pkg-config files are broken when CMAKE_INSTALL_{BIN,INCLUDE,LIB}DIR is absolute

2022-05-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-16585.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13182
[https://github.com/apache/arrow/pull/13182]

> [C++] CMake and pkg-config files are broken when 
> CMAKE_INSTALL_{BIN,INCLUDE,LIB}DIR is absolute
> ---
>
> Key: ARROW-16585
> URL: https://issues.apache.org/jira/browse/ARROW-16585
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Alexander Shpilkin
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> As per title: {{{}cpp/src/gandiva/gandiva.pc.in{}}}, 
> {{{}cpp/src/parquet/parquet.pc.in{}}}, {{{}cpp/src/plasma/plasma.pc.in{}}}, 
> and {{cpp/src/skyhook/skyhook.pc.in}} have
> {code:java}
> prefix=@CMAKE_INSTALL_PREFIX@
> libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@
> includedir=${prefix}/@CMAKE_INSTALL_INCLUDEDIR@ # not in plasma.pc.in{code}
> while {{cpp/src/plasma/PlasmaConfig.cmake.in}} has
> {code:java}
> set(PLASMA_STORE_SERVER 
> "@CMAKE_INSTALL_PREFIX@/@CMAKE_INSTALL_BINDIR@/plasma-store-server@CMAKE_EXECUTABLE_SUFFIX@"){code}
> and so they can’t handle absolute paths in 
> {{{}CMAKE_INSTALL_\{BIN,INCLUDE,LIB}DIR{}}}. This leads to broken .pc files 
> on NixOS in particular.
> See “[Concatenating paths when building pkg-config 
> files|https://github.com/jtojnar/cmake-snips#concatenating-paths-when-building-pkg-config-files]”
>  for a thorough discussion of the problem and a suggested fix, or [KDE’s 
> extra-cmake-modules|https://invent.kde.org/frameworks/extra-cmake-modules/-/blob/master/modules/ECMGeneratePkgConfigFile.cmake#L166]
>  for a simpler approach.
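
A hedged sketch of the join logic those references describe: only prepend ${prefix} when the configured directory is relative, and pass absolute paths through unchanged. The variable name is invented for this sketch:

{code}
# Hypothetical sketch: compute a pkg-config libdir that works whether
# CMAKE_INSTALL_LIBDIR is relative (the usual case) or absolute (e.g. NixOS).
if(IS_ABSOLUTE "${CMAKE_INSTALL_LIBDIR}")
  set(ARROW_PKG_CONFIG_LIBDIR "${CMAKE_INSTALL_LIBDIR}")
else()
  set(ARROW_PKG_CONFIG_LIBDIR "\${prefix}/${CMAKE_INSTALL_LIBDIR}")
endif()
# ARROW_PKG_CONFIG_LIBDIR would then be substituted into the .pc.in template
# in place of the hard-coded ${prefix}/@CMAKE_INSTALL_LIBDIR@.
{code}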



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-15936) [Ruby] Add test for Arrow::DictionaryArray#raw_records

2022-05-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-15936.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 12904
[https://github.com/apache/arrow/pull/12904]

> [Ruby] Add test for Arrow::DictionaryArray#raw_records
> --
>
> Key: ARROW-15936
> URL: https://issues.apache.org/jira/browse/ARROW-15936
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Ruby
>Reporter: Keisuke Okada
>Assignee: Keisuke Okada
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-05-18 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539142#comment-17539142
 ] 

Kouhei Sutou commented on ARROW-15678:
--

How about using a template to distinguish the implementation for each architecture?

{noformat}
diff --git a/cpp/src/arrow/compute/kernels/codegen_internal.h 
b/cpp/src/arrow/compute/kernels/codegen_internal.h
index fa50427bc3..a4bd0eb586 100644
--- a/cpp/src/arrow/compute/kernels/codegen_internal.h
+++ b/cpp/src/arrow/compute/kernels/codegen_internal.h
@@ -710,8 +710,8 @@ struct ScalarUnaryNotNullStateful {
Datum* out) {
   Status st = Status::OK();
   ArrayData* out_arr = out->mutable_array();
-  FirstTimeBitmapWriter out_writer(out_arr->buffers[1]->mutable_data(),
-   out_arr->offset, out_arr->length);
+  FirstTimeBitmapWriter<> out_writer(out_arr->buffers[1]->mutable_data(),
+ out_arr->offset, out_arr->length);
   VisitArrayValuesInline(
   arg0,
   [&](Arg0Value v) {
diff --git a/cpp/src/arrow/compute/kernels/row_encoder.cc 
b/cpp/src/arrow/compute/kernels/row_encoder.cc
index 10a1f4cda5..26316ec315 100644
--- a/cpp/src/arrow/compute/kernels/row_encoder.cc
+++ b/cpp/src/arrow/compute/kernels/row_encoder.cc
@@ -42,7 +42,7 @@ Status KeyEncoder::DecodeNulls(MemoryPool* pool, int32_t 
length, uint8_t** encod
 ARROW_ASSIGN_OR_RAISE(*null_bitmap, AllocateBitmap(length, pool));
 uint8_t* validity = (*null_bitmap)->mutable_data();
 
-FirstTimeBitmapWriter writer(validity, 0, length);
+FirstTimeBitmapWriter<> writer(validity, 0, length);
 for (int32_t i = 0; i < length; ++i) {
   if (encoded_bytes[i][0] == kValidByte) {
 writer.Set();
diff --git a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc 
b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
index 7d8d2edc4b..433df0f1b7 100644
--- a/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_set_lookup.cc
@@ -353,8 +353,8 @@ struct IsInVisitor {
     const auto& state = checked_cast<const SetLookupState<Type>&>(*ctx->state());
 ArrayData* output = out->mutable_array();
 
-    FirstTimeBitmapWriter writer(output->buffers[1]->mutable_data(), output->offset,
-                                 output->length);
+    FirstTimeBitmapWriter<> writer(output->buffers[1]->mutable_data(), output->offset,
+                                   output->length);
 
 VisitArrayDataInline(
 this->data,
diff --git a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc 
b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
index 611601cab8..da7de1c277 100644
--- a/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
+++ b/cpp/src/arrow/compute/kernels/scalar_string_ascii.cc
@@ -1456,7 +1456,7 @@ struct MatchSubstringImpl {
         [](const void* raw_offsets, const uint8_t* data, int64_t length,
            int64_t output_offset, uint8_t* output) {
           const offset_type* offsets = reinterpret_cast<const offset_type*>(raw_offsets);
-          FirstTimeBitmapWriter bitmap_writer(output, output_offset, length);
+          FirstTimeBitmapWriter<> bitmap_writer(output, output_offset, length);
           for (int64_t i = 0; i < length; ++i) {
             const char* current_data = reinterpret_cast<const char*>(data + offsets[i]);
             int64_t current_length = offsets[i + 1] - offsets[i];
diff --git a/cpp/src/arrow/util/bit_util_benchmark.cc 
b/cpp/src/arrow/util/bit_util_benchmark.cc
index 258fd27785..66a81b4e04 100644
--- a/cpp/src/arrow/util/bit_util_benchmark.cc
+++ b/cpp/src/arrow/util/bit_util_benchmark.cc
@@ -386,7 +386,7 @@ static void BitmapWriter(benchmark::State& state) {
 }
 
 static void FirstTimeBitmapWriter(benchmark::State& state) {
-  BenchmarkBitmapWriter<internal::FirstTimeBitmapWriter>(state, state.range(0));
+  BenchmarkBitmapWriter<internal::FirstTimeBitmapWriter<>>(state, state.range(0));
 }
 
 struct GenerateBitsFunctor {
diff --git a/cpp/src/arrow/util/bit_util_test.cc 
b/cpp/src/arrow/util/bit_util_test.cc
index 6c2aff4fbe..9b9f19feb1 100644
--- a/cpp/src/arrow/util/bit_util_test.cc
+++ b/cpp/src/arrow/util/bit_util_test.cc
@@ -832,14 +832,14 @@ TEST(FirstTimeBitmapWriter, NormalOperation) {
 const uint8_t fill_byte = static_cast(fill_byte_int);
 {
   uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
-  auto writer = internal::FirstTimeBitmapWriter(bitmap, 0, 12);
+  auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 0, 12);
   WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1});
   //  {0b00110110, 0b1010, 0, 0}
   ASSERT_BYTES_EQ(bitmap, {0x36, 0x0a});
 }
 {
   uint8_t bitmap[] = {fill_byte, fill_byte, fill_byte, fill_byte};
-  auto writer = internal::FirstTimeBitmapWriter(bitmap, 4, 12);
+  auto writer = internal::FirstTimeBitmapWriter<>(bitmap, 4, 12);
   WriteVectorToWriter(writer, {0, 1, 1, 0, 1, 

[jira] [Updated] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

2022-05-18 Thread Kyle Barron (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Barron updated ARROW-16613:

Description: 
Hello!

 

I've noticed that when writing a `_metadata` file with 
`pyarrow.parquet.write_metadata`, it is very slow with a large 
`metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that 
the concatenation inside `metadata.append_row_groups` is very slow. The writer 
first [iterates over every item of the 
list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
 and then [concatenates them on each 
iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

 

Would it be possible to make a vectorized implementation of this? Where 
`append_row_groups` accepts a list of `FileMetaData` objects, and where 
concatenation happens only once?

 

Repro (in IPython to use `%time`)
{code:java}
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq

def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]

schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s
{code}

  was:
Hello!

 

I've noticed that when writing a `_metadata` file with 
`pyarrow.parquet.write_metadata`, it is very slow with a large 
`metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that 
the concatenation inside `metadata.append_row_groups` is very slow. The writer 
first and [iterates over every item of the 
list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
 and then [concatenates them on each 
iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

 

Would it be possible to make a vectorized implementation of this? Where 
`append_row_groups` accepts a list of `FileMetaData` objects, and where 
concatenation happens only once?

 

Repro (in IPython to use `%time`)
{code:java}
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq

def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]

schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s
{code}


> [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector 
> appears to be O(n^2)
> -
>
> Key: ARROW-16613
>  

[jira] [Updated] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

2022-05-18 Thread Kyle Barron (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyle Barron updated ARROW-16613:

Description: 
Hello!

 

I've noticed that when writing a `_metadata` file with 
`pyarrow.parquet.write_metadata`, it is very slow with a large 
`metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that 
the concatenation inside `metadata.append_row_groups` is very slow. The writer 
first [iterates over every item of the 
list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
 and then [concatenates them on each 
iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

 

Would it be possible to make a vectorized implementation of this? Where 
`append_row_groups` accepts a list of `FileMetaData` objects, and where 
concatenation happens only once?

 

Repro (in IPython to use `%time`)
{code:java}
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq

def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]

schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s
{code}

  was:
Hello!

 

I've noticed that when writing a `_metadata` file with 
`pyarrow.parquet.write_metadata`, it is very slow with a large 
`metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that 
the concatenation inside `metadata.append_row_groups` is very slow. The writer 
first and [iterates over every item of the 
list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
 and then [concatenates them on each 
iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

 

Would it be possible to make a vectorized implementation of this? Where 
`append_row_groups` accepts a list of `FileMetaData` objects, and where 
concatenation happens only once?

 

Repro (in IPython to use `%time`)

```

from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]

 

schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s

```


> [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector 
> appears to be O(n^2)
> -
>
> Key: ARROW-16613
>  

[jira] [Created] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

2022-05-18 Thread Kyle Barron (Jira)
Kyle Barron created ARROW-16613:
---

 Summary: [Python][Parquet] pyarrow.parquet.write_metadata with 
metadata_collector appears to be O(n^2)
 Key: ARROW-16613
 URL: https://issues.apache.org/jira/browse/ARROW-16613
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Parquet, Python
Affects Versions: 8.0.0
Reporter: Kyle Barron


Hello!

 

I've noticed that when writing a `_metadata` file with 
`pyarrow.parquet.write_metadata`, it is very slow with a large 
`metadata_collector`, exhibiting O(n^2) behavior. Specifically, it appears that 
the concatenation inside `metadata.append_row_groups` is very slow. The writer 
first [iterates over every item of the 
list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302]
 and then [concatenates them on each 
iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

 

Would it be possible to make a vectorized implementation of this? Where 
`append_row_groups` accepts a list of `FileMetaData` objects, and where 
concatenation happens only once?

 

Repro (in IPython to use `%time`)

```

from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]

 

schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), 
metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s

```



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-15271) [R] Refactor do_exec_plan to return a RecordBatchReader

2022-05-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-15271.
-
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13170
[https://github.com/apache/arrow/pull/13170]

> [R] Refactor do_exec_plan to return a RecordBatchReader
> ---
>
> Key: ARROW-15271
> URL: https://issues.apache.org/jira/browse/ARROW-15271
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 6.0.1
>Reporter: Will Jones
>Assignee: Neal Richardson
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Right now 
> [{{do_exec_plan}}|https://github.com/apache/arrow/blob/master/r/R/query-engine.R#L18]
>  returns an Arrow table because {{head}}, {{tail}}, and {{arrange}} do. If 
> ARROW-14289 is completed and similar work is done for {{arrange}}, we may be 
> able to alter {{do_exec_plan}} to return a RBR instead.
> The {{map_batches()}} implementation (ARROW-14029) could benefit from this 
> refactor. And it might make ARROW-15040 more useful.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16612) parquet files with compression extensions should use parquet writer for compression

2022-05-18 Thread Sam Albers (Jira)
Sam Albers created ARROW-16612:
--

 Summary: parquet files with compression extensions should use 
parquet writer for compression
 Key: ARROW-16612
 URL: https://issues.apache.org/jira/browse/ARROW-16612
 Project: Apache Arrow
  Issue Type: Bug
  Components: R
Affects Versions: 8.0.0
Reporter: Sam Albers


Right now arrow will silently write a file with a .gz extension to a 
CompressedOutputStream rather than passing the compression option to the 
parquet writer itself. The internal detect_compression() function detects the 
extension, and that is what routes the file incorrectly. However, it only fails 
at the read_parquet stage, which could lead to confusion. 


{code:r}
library(arrow, warn.conflicts = FALSE) 
tf <- tempfile(fileext = ".parquet.gz") 
write_parquet(data.frame(x = 1:5), tf, compression = "gzip", compression_level = 5)
read_parquet(tf) 
#> Error: file must be a "RandomAccessFile"{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-05-18 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539107#comment-17539107
 ] 

Jonathan Keane edited comment on ARROW-15678 at 5/18/22 10:03 PM:
--

[~kou] Do you think you might be able to take a look at this?

The comment at 
https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good 
explanation of what's going on, and following that there are a few possible 
fixes (though none of them were fully implemented or decided on).


was (Author: jonkeane):
@kou Do you think you might be able to take a look at this?

The comment at 
https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good 
explanation of what's going on, and following that there are a few possible 
fixes (though none of them were fully implemented or decided on).

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-15678) [C++][CI] a crossbow job with MinRelSize enabled

2022-05-18 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539107#comment-17539107
 ] 

Jonathan Keane commented on ARROW-15678:


@kou Do you think you might be able to take a look at this?

The comment at 
https://github.com/apache/arrow/pull/12928#issuecomment-1105955726 has a good 
explanation of what's going on, and following that there are a few possible 
fixes (though none of them were fully implemented or decided on).

> [C++][CI] a crossbow job with MinRelSize enabled
> 
>
> Key: ARROW-15678
> URL: https://issues.apache.org/jira/browse/ARROW-15678
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Continuous Integration
>Reporter: Jonathan Keane
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 13h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16144) [R] Write compressed data streams (particularly over S3)

2022-05-18 Thread Neal Richardson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson resolved ARROW-16144.
-
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13183
[https://github.com/apache/arrow/pull/13183]

> [R] Write compressed data streams (particularly over S3)
> 
>
> Key: ARROW-16144
> URL: https://issues.apache.org/jira/browse/ARROW-16144
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Assignee: Sam Albers
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> The Python bindings have `CompressedOutputStream`, but I don't see how we 
> can do this on the R side (e.g. with `write_csv_arrow()`).  It would be 
> wonderful if we could both read and write compressed streams, particularly 
> for CSV and particularly for remote filesystems, where this can provide 
> considerable performance improvements.  
> (For comparison, readr will write a compressed stream automatically based on 
> the extension for the given filename, e.g. `readr::write_csv(data, 
> "file.csv.gz")` or `write_csv("data.file.xz")`  )



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16596) [C++] Add a strptime option to control the cutoff between 1900 and 2000 when %y

2022-05-18 Thread Jira


[ 
https://issues.apache.org/jira/browse/ARROW-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539077#comment-17539077
 ] 

Dragoș Moldovan-Grünfeld commented on ARROW-16596:
--

Maybe related. I find the following unexpected:
{code:r}
library(arrow, warn.conflicts = FALSE)

a <- Array$create("68-10-07 19:04:0")
call_function("strptime", a, options = list(format = "%Y-%m-%d %H:%M:%S", unit 
= 0L))
#> Array
#> 
#> [
#>   0068-10-07 19:04:00
#> ]
call_function("strptime", a, options = list(format = "%y-%m-%d %H:%M:%S", unit 
= 0L))
#> Array
#> 
#> [
#>   2068-10-07 19:04:00
#> ]
{code}
I would expect an error when there is a mismatch between the string and the 
format, i.e. the string has a short year ({{{}%y{}}}) and we try to parse it 
using a long format ({{{}%Y{}}}). I think it would be much better to error or 
return a null in this situation.

> [C++] Add a strptime option to control the cutoff between 1900 and 2000 when 
> %y 
> 
>
> Key: ARROW-16596
> URL: https://issues.apache.org/jira/browse/ARROW-16596
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> When parsing a string with a year in the short format ({{{}%y{}}}) to 
> datetime, it would be great if we could have control over the cutoff point 
> between 1900 and 2000. Currently it is implicitly set to 68:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> a <- Array$create(c("68-05-17", "69-05-17"))
> call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L))
> #> Array
> #> 
> #> [
> #>   2068-05-17 00:00:00,
> #>   1969-05-17 00:00:00
> #> ]
> {code}
> For example, lubridate named this argument {{cutoff_2000}} (e.g. for 
> {{{}fast_strptime){}}}. This works as follows:
> {code:r}
> library(lubridate, warn.conflicts = FALSE)
> dates_vector <- c("68-05-17", "69-05-17", "55-05-17")
> fast_strptime(dates_vector, format = "%y-%m-%d")
> #> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC"
> fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50)
> #> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC"
> fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70)
> #> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC"
> {code}
> In the {{lubridate::fast_strptime()}} documentation it is described as 
> follows:
> {quote}cutoff_2000 
> integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are 
> parsed as though starting with 20, otherwise parsed as though starting with 
> 19. {-}Available only for functions relying on lubridates internal parser{-}.
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16596) [C++] Add a strptime option to control the cutoff between 1900 and 2000 when %y

2022-05-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-16596:
-
Description: 
When parsing a string with a year in the short format ({{{}%y{}}}) to 
datetime, it would be great if we could have control over the cutoff point 
between 1900 and 2000. Currently it is implicitly set to 68:
{code:r}
library(arrow, warn.conflicts = FALSE)

a <- Array$create(c("68-05-17", "69-05-17"))
call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L))
#> Array
#> 
#> [
#>   2068-05-17 00:00:00,
#>   1969-05-17 00:00:00
#> ]
{code}
For example, lubridate named this argument {{cutoff_2000}} (e.g. for 
{{{}fast_strptime){}}}. This works as follows:
{code:r}
library(lubridate, warn.conflicts = FALSE)

dates_vector <- c("68-05-17", "69-05-17", "55-05-17")
fast_strptime(dates_vector, format = "%y-%m-%d")
#> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC"
fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50)
#> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC"
fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70)
#> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC"
{code}
In the {{lubridate::fast_strptime()}} documentation it is described as follows:
{quote}cutoff_2000 
integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are 
parsed as though starting with 20, otherwise parsed as though starting with 19. 
{-}Available only for functions relying on lubridates internal parser{-}.
{quote}

  was:
When parsing a string with a year in the short format ({{{}%y{}}}) to 
datetime, it would be great if we could have control over the cutoff point 
between 1900 and 2000. Currently it is implicitly set to 68:
{code:r}
library(arrow, warn.conflicts = FALSE)

a <- Array$create(c("68-05-17", "69-05-17"))
call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L))
#> Array
#> 
#> [
#>   2068-05-17 00:00:00,
#>   1969-05-17 00:00:00
#> ]
{code}
For example, lubridate named this argument {{cutoff_2000}} (e.g. for 
{{{}fast_strptime){}}}. This works as follows:
{code:r}
library(lubridate, warn.conflicts = FALSE)

dates_vector <- c("68-05-17", "69-05-17", "55-05-17")
fast_strptime(dates_vector, format = "%y-%m-%d")
#> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC"
fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50)
#> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC"
fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70)
#> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC"
{code}
In the {{lubridate::fast_strptime()}} documentation it is described as follows:
{quote}cutoff_2000 
integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are 
parsed as though starting with 20, otherwise parsed as though starting with 19. 
Available only for functions relying on lubridates internal parser.
{quote}


> [C++] Add a strptime option to control the cutoff between 1900 and 2000 when 
> %y 
> 
>
> Key: ARROW-16596
> URL: https://issues.apache.org/jira/browse/ARROW-16596
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, R
>Affects Versions: 8.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
>
> When parsing a string with a year in the short format ({{{}%y{}}}) to 
> datetime, it would be great if we could have control over the cutoff point 
> between 1900 and 2000. Currently it is implicitly set to 68:
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> a <- Array$create(c("68-05-17", "69-05-17"))
> call_function("strptime", a, options = list(format = "%y-%m-%d", unit = 0L))
> #> Array
> #> 
> #> [
> #>   2068-05-17 00:00:00,
> #>   1969-05-17 00:00:00
> #> ]
> {code}
> For example, lubridate named this argument {{cutoff_2000}} (e.g. for 
> {{{}fast_strptime){}}}. This works as follows:
> {code:r}
> library(lubridate, warn.conflicts = FALSE)
> dates_vector <- c("68-05-17", "69-05-17", "55-05-17")
> fast_strptime(dates_vector, format = "%y-%m-%d")
> #> [1] "2068-05-17 UTC" "1969-05-17 UTC" "2055-05-17 UTC"
> fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 50)
> #> [1] "1968-05-17 UTC" "1969-05-17 UTC" "1955-05-17 UTC"
> fast_strptime(dates_vector, format = "%y-%m-%d", cutoff_2000 = 70)
> #> [1] "2068-05-17 UTC" "2069-05-17 UTC" "2055-05-17 UTC"
> {code}
> In the {{lubridate::fast_strptime()}} documentation it is described as 
> follows:
> {quote}cutoff_2000 
> integer. For y format, two-digit numbers smaller or equal to cutoff_2000 are 
> parsed as though starting with 20, otherwise parsed as though starting with 
> 19. {-}Available only for functions relying on lubridates internal parser{-}.
> {quote}



--
This 

[jira] [Updated] (ARROW-16604) [C++] Boost not included when build benchmarks

2022-05-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16604:
---
Labels: pull-request-available  (was: )

> [C++] Boost not included when build benchmarks
> --
>
> Key: ARROW-16604
> URL: https://issues.apache.org/jira/browse/ARROW-16604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Kouhei Sutou
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {code:bash}
> cmake -GNinja -DARROW_BUILD_BENCHMARKS=ON ..
> {code}
> failed with many Boost-related errors, as below:
> {code:bash}
> CMake Error at cmake_modules/BuildUtils.cmake:522 (add_executable):
>   Target "arrow-json-parser-benchmark" links to target "Boost::system" but
>   the target was not found.  Perhaps a find_package() call is missing for an
>   IMPORTED target, or an ALIAS target is missing?
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:114 (add_benchmark)
>   src/arrow/json/CMakeLists.txt:28 (add_arrow_benchmark)
> {code}
> The error is gone if tests are also built with {{-DARROW_BUILD_TESTS=ON}}. It 
> looks like Boost is not included when only benchmarks are built.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16608) [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using mavencentral artifacts

2022-05-18 Thread Jonathan Swenson (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Swenson updated ARROW-16608:
-
Summary: [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using 
mavencentral artifacts  (was: [Gandiva][Java] Unsatisfied Link Error on M1 Mac 
for aarch)

> [Gandiva][Java] Unsatisfied Link Error on M1 Mac when using mavencentral 
> artifacts
> --
>
> Key: ARROW-16608
> URL: https://issues.apache.org/jira/browse/ARROW-16608
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Java
>Affects Versions: 8.0.0
>Reporter: Jonathan Swenson
>Priority: Major
>
> Potentially a blocker for Arrow integration into Calcite (CALCITE-2040); 
> however, it may be possible to move forward without M1 Mac support. 
> Potentially somewhat related to ARROW-11135.
> Getting an instance of the JniLoader throws an UnsatisfiedLinkError when it 
> tries to load the libgandiva_jni.dylib that it has extracted from the jar 
> into a temporary directory. 
> Simplified error:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError: 
> /tmp_dir/libgandiva_jni.dylib_uuid: 
> dlopen(/tmp_dir/libgandiva_jni.dylib_uuid, 0x0001): tried: 
> '/tmp_dir/libgandiva_jni.dylib_uuid' (mach-o file, but is an incompatible 
> architecture (have 'x86_64', need 'arm64e')){code}
>  
> Full error and stack trace:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError: 
> /private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe:
>  
> dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe,
>  0x0001): tried: 
> '/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe'
>  (mach-o file, but is an incompatible architecture (have 'x86_64', need 
> 'arm64e'))
>     at java.lang.ClassLoader$NativeLibrary.load(Native Method)
>     at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950)
>     at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832)
>     at java.lang.Runtime.load0(Runtime.java:811)
>     at java.lang.System.load(System.java:1088)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144)
>     at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67)
>     at io.acme.Main.main(Main.java:26) {code}
>  
> This example loads three libraries from Maven Central using Gradle: 
> {code:java}
> repositories {
> mavenCentral()
> }
> dependencies {
> implementation("org.apache.arrow:arrow-memory-netty:8.0.0")
> implementation("org.apache.arrow:arrow-vector:8.0.0")
> implementation("org.apache.arrow.gandiva:arrow-gandiva:8.0.0")
> } {code}
> Example code: 
> {code:java}
> public class Main {
>   public static void main(String[] args) throws GandivaException {
> Field field = new Field("int_field", FieldType.nullable(new 
> ArrowType.Int(32, true)), null);
> Schema schema = makeSchema(field);
> Condition condition = makeCondition(field);
> Filter.make(schema, condition);
>   }
>   private static Schema makeSchema(Field field) {
> List fieldList = new ArrayList<>();
> fieldList.add(field);
> return new Schema(fieldList, null);
>   }
>   private static Condition makeCondition(Field f) {
> List treeNodes = new ArrayList<>(2);
> treeNodes.add(TreeBuilder.makeField(f));
> treeNodes.add(TreeBuilder.makeLiteral(4));
> TreeNode comparison = TreeBuilder.makeFunction("less_than", treeNodes, 
> new ArrowType.Bool());
> return TreeBuilder.makeCondition(comparison);
>   }
> } {code}
> While I haven't tested this exact example, a similar example executes without 
> issue on an intel x86 mac.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16608) [Gandiva][Java] Unsatisfied Link Error on M1 Mac for aarch

2022-05-18 Thread Jonathan Swenson (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539073#comment-17539073
 ] 

Jonathan Swenson commented on ARROW-16608:
--

Building the Gandiva library / jar from source on the M1 Mac (on master) and 
then loading it manually works, but the dependencies hosted on Maven Central do 
not seem to be deployed in a way that permits usage from a project running on 
Apple silicon.

> [Gandiva][Java] Unsatisfied Link Error on M1 Mac for aarch
> --
>
> Key: ARROW-16608
> URL: https://issues.apache.org/jira/browse/ARROW-16608
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva, Java
>Affects Versions: 8.0.0
>Reporter: Jonathan Swenson
>Priority: Major
>
> Potentially a blocker for Arrow integration into Calcite (CALCITE-2040); 
> however, it may be possible to move forward without M1 Mac support. 
> Potentially somewhat related to ARROW-11135.
> Getting an instance of the JniLoader throws an UnsatisfiedLinkError when it 
> tries to load the libgandiva_jni.dylib that it has extracted from the jar 
> into a temporary directory. 
> Simplified error:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError: 
> /tmp_dir/libgandiva_jni.dylib_uuid: 
> dlopen(/tmp_dir/libgandiva_jni.dylib_uuid, 0x0001): tried: 
> '/tmp_dir/libgandiva_jni.dylib_uuid' (mach-o file, but is an incompatible 
> architecture (have 'x86_64', need 'arm64e')){code}
>  
> Full error and stack trace:
> {code:java}
> Exception in thread "main" java.lang.UnsatisfiedLinkError: 
> /private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe:
>  
> dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe,
>  0x0001): tried: 
> '/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe'
>  (mach-o file, but is an incompatible architecture (have 'x86_64', need 
> 'arm64e'))
>     at java.lang.ClassLoader$NativeLibrary.load(Native Method)
>     at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950)
>     at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832)
>     at java.lang.Runtime.load0(Runtime.java:811)
>     at java.lang.System.load(System.java:1088)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53)
>     at 
> org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144)
>     at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67)
>     at io.acme.Main.main(Main.java:26) {code}
>  
> This example loads three libraries from mavencentral using gradle: 
> {code:java}
> repositories {
> mavenCentral()
> }
> dependencies {
> implementation("org.apache.arrow:arrow-memory-netty:8.0.0")
> implementation("org.apache.arrow:arrow-vector:8.0.0")
> implementation("org.apache.arrow.gandiva:arrow-gandiva:8.0.0")
> } {code}
> Example code: 
> {code:java}
> public class Main {
>   public static void main(String[] args) throws GandivaException {
> Field field = new Field("int_field", FieldType.nullable(new 
> ArrowType.Int(32, true)), null);
> Schema schema = makeSchema(field);
> Condition condition = makeCondition(field);
> Filter.make(schema, condition);
>   }
>   private static Schema makeSchema(Field field) {
> List fieldList = new ArrayList<>();
> fieldList.add(field);
> return new Schema(fieldList, null);
>   }
>   private static Condition makeCondition(Field f) {
> List treeNodes = new ArrayList<>(2);
> treeNodes.add(TreeBuilder.makeField(f));
> treeNodes.add(TreeBuilder.makeLiteral(4));
> TreeNode comparison = TreeBuilder.makeFunction("less_than", treeNodes, 
> new ArrowType.Bool());
> return TreeBuilder.makeCondition(comparison);
>   }
> } {code}
> While I haven't tested this exact example, a similar example executes without 
> issue on an intel x86 mac.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16604) [C++] Boost not included when build benchmarks

2022-05-18 Thread Kouhei Sutou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou reassigned ARROW-16604:


Assignee: Kouhei Sutou

> [C++] Boost not included when build benchmarks
> --
>
> Key: ARROW-16604
> URL: https://issues.apache.org/jira/browse/ARROW-16604
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Yibo Cai
>Assignee: Kouhei Sutou
>Priority: Major
>
> {code:bash}
> cmake -GNinja -DARROW_BUILD_BENCHMARKS=ON ..
> {code}
> failed with many boost related error, as below
> {code:bash}
> CMake Error at cmake_modules/BuildUtils.cmake:522 (add_executable):
>   Target "arrow-json-parser-benchmark" links to target "Boost::system" but
>   the target was not found.  Perhaps a find_package() call is missing for an
>   IMPORTED target, or an ALIAS target is missing?
> Call Stack (most recent call first):
>   src/arrow/CMakeLists.txt:114 (add_benchmark)
>   src/arrow/json/CMakeLists.txt:28 (add_arrow_benchmark)
> {code}
> The error is gone if tests are also built with {{-DARROW_BUILD_TESTS=ON}}. It 
> looks like Boost is not included when only benchmarks are built.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16281) [R] [CI] Bump versions with the release of 4.2

2022-05-18 Thread Jonathan Keane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Keane resolved ARROW-16281.

Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 12980
[https://github.com/apache/arrow/pull/12980]

> [R] [CI] Bump versions with the release of 4.2
> --
>
> Key: ARROW-16281
> URL: https://issues.apache.org/jira/browse/ARROW-16281
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jonathan Keane
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 11h 40m
>  Remaining Estimate: 0h
>
> Now that R 4.2 is released, we should bump all of our R versions where we 
> have ones hardcoded.
> This will mean dropping support for 3.4 entirely and adding in 4.0 to 
> https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/r/github.linux.versions.yml#L34
> There are a few other places that we have hard-coded versions (we might need 
> to wait a few days for these to catch up):
> https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/dev/tasks/tasks.yml#L1291-L1295
> https://github.com/apache/arrow/blob/c4b646e715d155c1f77d34804796864465caa97b/.github/workflows/r.yml#L60
>  (and a few other places in that file — though one note: we build an old 
> version of windows that uses rtools35 in the GHA CI so that we catch when we 
> break that — we'll want to keep that!)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16611) [Python] MapArray pandas round trip is broken

2022-05-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539066#comment-17539066
 ] 

Antoine Pitrou commented on ARROW-16611:


Also, did it work with previous versions of PyArrow? I don't think so.

> [Python] MapArray pandas round trip is broken
> -
>
> Key: ARROW-16611
> URL: https://issues.apache.org/jira/browse/ARROW-16611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Robbie Gruener
>Priority: Major
>
> A pyarrow.MapArray, once converted to pandas, cannot be successfully 
> converted back.
> The following snippet does not work:
> {code:python}
> import pyarrow as pa
> data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]
> ty = pa.map_(pa.string(), pa.int64())
> map_col = pa.array(data, type=ty)
> pa.MapArray.from_pandas(map_col.to_pandas())
> {code}
> {{Uncaught exception: ArrowTypeError: Expected bytes, got a 'int' object 
> (java.lang.RuntimeException)}}
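For reference, {{to_pandas}} on a map array yields a pandas Series whose cells 
are plain Python lists of (key, value) tuples, so the reverse conversion has to 
re-infer the map type. A minimal sketch of the shapes involved; the 
explicit-type call at the end is a workaround to try, not a confirmed fix on 
8.0.0:

{code:python}
import pyarrow as pa

ty = pa.map_(pa.string(), pa.int64())
map_col = pa.array([[('x', 1), ('y', 0)]], type=ty)

series = map_col.to_pandas()
# Each cell is a Python list of (key, value) tuples; the map type
# itself is not preserved in the pandas representation.
print(series[0])

# Passing the explicit type avoids type re-inference on the way back;
# per this report it may still fail on 8.0.0.
restored = pa.Array.from_pandas(series, type=ty)
{code}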



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16611) [Python] MapArray pandas round trip is broken

2022-05-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16611:
---
Priority: Major  (was: Minor)

> [Python] MapArray pandas round trip is broken
> -
>
> Key: ARROW-16611
> URL: https://issues.apache.org/jira/browse/ARROW-16611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Robbie Gruener
>Priority: Major
>
> A pyarrow.MapArray, once converted to pandas, cannot be successfully 
> converted back.
> The following snippet does not work:
> {code:python}
> import pyarrow as pa
> data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]
> ty = pa.map_(pa.string(), pa.int64())
> map_col = pa.array(data, type=ty)
> pa.MapArray.from_pandas(map_col.to_pandas())
> {code}
> {{Uncaught exception: ArrowTypeError: Expected bytes, got a 'int' object 
> (java.lang.RuntimeException)}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16611) [Python] MapArray pandas round trip is broken

2022-05-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16611:
---
Summary: [Python] MapArray pandas round trip is broken  (was: MapArray 
pandas round trip is broken)

> [Python] MapArray pandas round trip is broken
> -
>
> Key: ARROW-16611
> URL: https://issues.apache.org/jira/browse/ARROW-16611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Robbie Gruener
>Priority: Minor
>
> A pyarrow.MapArray, once converted to pandas, cannot be successfully 
> converted back.
> The following snippet does not work:
> {code:python}
> import pyarrow as pa
> data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]
> ty = pa.map_(pa.string(), pa.int64())
> map_col = pa.array(data, type=ty)
> pa.MapArray.from_pandas(map_col.to_pandas())
> {code}
> {{Uncaught exception: ArrowTypeError: Expected bytes, got a 'int' object 
> (java.lang.RuntimeException)}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16611) MapArray pandas round trip is broken

2022-05-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16611:
---
Component/s: Python

> MapArray pandas round trip is broken
> 
>
> Key: ARROW-16611
> URL: https://issues.apache.org/jira/browse/ARROW-16611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Robbie Gruener
>Priority: Minor
>
> A pyarrow.MapArray, once converted to pandas, cannot be successfully 
> converted back.
> The following snippet does not work:
> {code:python}
> import pyarrow as pa
> data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]
> ty = pa.map_(pa.string(), pa.int64())
> map_col = pa.array(data, type=ty)
> pa.MapArray.from_pandas(map_col.to_pandas())
> {code}
> {{Uncaught exception: ArrowTypeError: Expected bytes, got a 'int' object 
> (java.lang.RuntimeException)}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16611) [Python] MapArray pandas round trip is broken

2022-05-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539065#comment-17539065
 ] 

Antoine Pitrou commented on ARROW-16611:


Hmm, where does the "java.lang.RuntimeException" come from?

> [Python] MapArray pandas round trip is broken
> -
>
> Key: ARROW-16611
> URL: https://issues.apache.org/jira/browse/ARROW-16611
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 8.0.0
>Reporter: Robbie Gruener
>Priority: Major
>
> A pyarrow.MapArray, once converted to pandas, cannot be successfully 
> converted back.
> The following snippet does not work:
> {code:python}
> import pyarrow as pa
> data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]
> ty = pa.map_(pa.string(), pa.int64())
> map_col = pa.array(data, type=ty)
> pa.MapArray.from_pandas(map_col.to_pandas())
> {code}
> {{Uncaught exception: ArrowTypeError: Expected bytes, got a 'int' object 
> (java.lang.RuntimeException)}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16611) MapArray pandas round trip is broken

2022-05-18 Thread Robbie Gruener (Jira)
Robbie Gruener created ARROW-16611:
--

 Summary: MapArray pandas round trip is broken
 Key: ARROW-16611
 URL: https://issues.apache.org/jira/browse/ARROW-16611
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 8.0.0
Reporter: Robbie Gruener


A pyarrow.MapArray, once converted to pandas, cannot be successfully converted 
back.

The following snippet does not work:

{code:python}
import pyarrow as pa

data = [[('x', 1), ('y', 0)], [('a', 2), ('b', 45)]]
ty = pa.map_(pa.string(), pa.int64())
map_col = pa.array(data, type=ty)
pa.MapArray.from_pandas(map_col.to_pandas())
{code}
{{Uncaught exception: ArrowTypeError: Expected bytes, got a 'int' object 
(java.lang.RuntimeException)}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-05-18 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539052#comment-17539052
 ] 

Yaron Gvili commented on ARROW-16211:
-

This second-layer-registry approach is good for another use case in which the 
user runs multiple execution engine invocations, either in sequence or in 
parallel, from the same Python interpreter and wants to keep the UDFs 
registered in each invocation separate.
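To illustrate the idea, here is a conceptual sketch in plain Python (not 
Arrow's actual registry API): each invocation gets its own registry layer 
whose lookups fall back to the shared global registry, so UDFs stay isolated 
per invocation and can be dropped wholesale.

{code:python}
# Conceptual sketch only; class and method names are illustrative.
class LayeredRegistry:
    def __init__(self, parent=None):
        self._functions = {}
        self._parent = parent

    def register(self, name, func):
        self._functions[name] = func

    def unregister(self, name):
        self._functions.pop(name, None)

    def lookup(self, name):
        # Local layer first, then fall back to the shared registry.
        if name in self._functions:
            return self._functions[name]
        if self._parent is not None:
            return self._parent.lookup(name)
        raise KeyError(name)

global_registry = LayeredRegistry()
invocation_a = LayeredRegistry(parent=global_registry)
invocation_b = LayeredRegistry(parent=global_registry)
invocation_a.register("my_udf", lambda x: x + 1)  # invisible to invocation_b
{code}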

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. As the program evolves beyond what was originally 
> planned, there needs to be a way to update existing function kernels. In such 
> situations, there should be a way to remove the existing definition and add a 
> new one. To enable this, unregister functionality has to be included. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16592) [FlightRPC][Python] Regression in DoPut error handling

2022-05-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16592:
---
Labels: pull-request-available  (was: )

> [FlightRPC][Python] Regression in DoPut error handling
> --
>
> Key: ARROW-16592
> URL: https://issues.apache.org/jira/browse/ARROW-16592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Python
>Reporter: Lubo Slivka
>Assignee: David Li
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In PyArrow 8.0.0, any error raised while handling DoPut on the server results 
> in FlightInternalError on the client.
> In PyArrow 7.0.0, errors raised while handling DoPut are propagated/converted 
> to non-internal errors.
> —
> Example: on 7.0.0, raising FlightCancelledError while handling DoPut on the 
> server would propagate that error including extra_info all the way to the 
> FlightClient. This is not the case anymore on 8.0.0.
> The FlightInternalError contains extra detail that is derived from the 
> cancelled error though:
> {code:java}
> /arrow/cpp/src/arrow/flight/client.cc:363: Close() failed: IOError: <custom 
> message from FlightError is here>. Detail: Cancelled. gRPC client debug 
> context: {"created":"@1652777650.446052211","description":"Error received 
> from peer 
> ipv4:127.0.0.1:16001","file":"/opt/vcpkg/buildtrees/grpc/src/85a295989c-6cf7bf442d.clean/src/core/lib/surface/call.cc","file_line":903,"grpc_message":"<custom 
> message from FlightError is here>. Detail: Cancelled","grpc_status":1}. 
> Client context: OK. Detail: Cancelled
>  {code}
> Note: skimming through the code, it seems this problem is not unique to 
> PyArrow.
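A minimal repro sketch of the reported pattern (the server/client pair and 
descriptor are illustrative): on 7.0.0 the client should receive the cancelled 
error, while per this report 8.0.0 surfaces FlightInternalError instead.

{code:python}
import pyarrow as pa
import pyarrow.flight as flight

class CancellingServer(flight.FlightServerBase):
    def do_put(self, context, descriptor, reader, writer):
        # Any error raised here should propagate to the client as-is.
        raise flight.FlightCancelledError("put rejected")

with CancellingServer("grpc://localhost:0") as server:
    client = flight.connect(f"grpc://localhost:{server.port}")
    schema = pa.schema([("x", pa.int64())])
    writer, _ = client.do_put(flight.FlightDescriptor.for_path("t"), schema)
    try:
        writer.write_table(pa.table({"x": [1]}))
        writer.close()
    except flight.FlightCancelledError:
        print("7.0.0 behaviour: cancelled error propagated")
    except flight.FlightInternalError:
        print("8.0.0 behaviour per this report: internal error")
{code}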



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16211) [C++][Python] Unregister compute functions

2022-05-18 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539044#comment-17539044
 ] 

Yaron Gvili commented on ARROW-16211:
-

Another alternative to consider is registering Python UDFs in an extension 
registry instance that (1) is specific to the Python interpreter and (2) is 
linked to the default global one (so it can find both UDF and normal 
functions). This Python-specific registry would then be passed to the 
execution engine. I think this way (only) the Python-specific registry would 
naturally get cleaned up on finalization of the Python interpreter.
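The lifecycle part of that idea can be sketched in plain Python as well (again, 
not Arrow API): tie cleanup of the interpreter-specific layer to interpreter 
shutdown, leaving the global registry untouched.

{code:python}
import atexit

# Stand-in for the interpreter-specific registry layer.
interpreter_udfs = {}

def register_udf(name, func):
    interpreter_udfs[name] = func

@atexit.register
def _cleanup_interpreter_udfs():
    # Runs on finalization of the Python interpreter; only the
    # Python-specific layer is cleared.
    interpreter_udfs.clear()
{code}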

> [C++][Python] Unregister compute functions
> --
>
> Key: ARROW-16211
> URL: https://issues.apache.org/jira/browse/ARROW-16211
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++, Python
>Reporter: Vibhatha Lakmal Abeykoon
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Major
>
> In general, when using UDFs, the user defines a function expecting a 
> particular outcome. As the program evolves beyond what was originally 
> planned, there needs to be a way to update existing function kernels. In such 
> situations, there should be a way to remove the existing definition and add a 
> new one. To enable this, unregister functionality has to be included. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16605) [CI][R] Fix revdep Crossbow job

2022-05-18 Thread Jonathan Keane (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17539021#comment-17539021
 ] 

Jonathan Keane commented on ARROW-16605:


Thanks for this! One thing I found while running these is that the {targets} 
package does not behave well when we use multiple workers for the revdep 
checks. It would be awesome if we could run that one with a single worker and 
then all the others with multiple. 

> [CI][R] Fix revdep Crossbow job
> ---
>
> Key: ARROW-16605
> URL: https://issues.apache.org/jira/browse/ARROW-16605
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, R
>Reporter: Jacob Wujciak-Jens
>Assignee: Jacob Wujciak-Jens
>Priority: Blocker
> Fix For: 9.0.0
>
>
> The revdep Crossbow job is currently not functioning correctly. This led to 
> changed behaviour affecting a revdep with the 8.0.0 release, requiring a 
> patch after initial submission.
> cc: [~jonkeane]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16610) [Python] Raise an error for conflicting options in pq.write_to_dataset

2022-05-18 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16610:
---

 Summary: [Python] Raise an error for conflicting options in 
pq.write_to_dataset
 Key: ARROW-16610
 URL: https://issues.apache.org/jira/browse/ARROW-16610
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Alenka Frim
Assignee: Alenka Frim
 Fix For: 9.0.0


A follow-up to https://issues.apache.org/jira/browse/ARROW-16420:

If the user passes 'conflicting' options, for instance both 'partitioning' and 
'partition_cols', or both 'metadata_collector' and 'file_visitor', an error 
should be raised.

See: [https://github.com/apache/arrow/pull/13062#pullrequestreview-966014225] 
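A sketch of what the requested check could look like (the names follow the 
issue text; the actual placement inside {{pyarrow.parquet}} may differ):

{code:python}
# Hypothetical helper illustrating the requested behaviour.
def _check_conflicting_options(partitioning=None, partition_cols=None,
                               metadata_collector=None, file_visitor=None):
    if partitioning is not None and partition_cols is not None:
        raise ValueError(
            "pass only one of 'partitioning' or 'partition_cols'")
    if metadata_collector is not None and file_visitor is not None:
        raise ValueError(
            "pass only one of 'metadata_collector' or 'file_visitor'")
{code}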



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16609) [C++] xxhash not installed into dist/lib/include when building C++

2022-05-18 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16609:
---

 Summary: [C++] xxhash not installed into dist/lib/include when 
building C++
 Key: ARROW-16609
 URL: https://issues.apache.org/jira/browse/ARROW-16609
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Alenka Frim
 Fix For: 9.0.0


My C++ build setup doesn’t install {{dist/include/arrow/vendored/xxhash/}} but 
only {{dist/include/arrow/vendored/xxhash.h}}. The last time the module was 
installed was in November 2021.

As {{arrow/python/arrow_to_pandas.cc}} includes {{arrow/util/hashing.h}} -> 
{{arrow/vendored/xxhash.h}} -> {{arrow/vendored/xxhash/xxhash.h}}, this module 
is needed when trying to build the Python C++ API separately from C++ 
(https://issues.apache.org/jira/browse/ARROW-16340).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16427) [Java] jdbcToArrowVectors / sqlToArrowVectorIterator fails to handle variable decimal precision / scale

2022-05-18 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li resolved ARROW-16427.
--
Fix Version/s: 9.0.0
   Resolution: Fixed

Issue resolved by pull request 13166
[https://github.com/apache/arrow/pull/13166]

> [Java] jdbcToArrowVectors / sqlToArrowVectorIterator fails to handle variable 
> decimal precision / scale
> ---
>
> Key: ARROW-16427
> URL: https://issues.apache.org/jira/browse/ARROW-16427
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java
>Affects Versions: 7.0.0
>Reporter: Jonathan Swenson
>Assignee: Todd Farmer
>Priority: Major
>  Labels: JDBC, Java, pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When a JDBC driver returns a Numeric type that doesn't exactly align with 
> what is in the JDBC metadata, jdbcToArrowVectors / sqlToArrowVectorIterator 
> fails to process the result (failing when serializing the value into the 
> DecimalVector). 
> It appears as though this is because JDBC drivers can return BigDecimal / 
> Numeric values whose precision / scale differ from the metadata and are not 
> consistent across rows. 
> Is there a recommended course of action for representing a variable precision 
> / scale decimal vector? In any case, it does not seem possible to convert 
> JDBC data that uses these numeric types in this form with the built-in 
> utilities. 
> It seems like both the Oracle and the Postgres JDBC driver also returns 
> metadata with a 0,0 precision / scale when values in the result set have 
> different (and varied) precision / scale. 
> An example: 
> Against postgres, running a simple SQL query that produces numeric types can 
> lead to a JDBC result set with BigDecimal values with variable decimal 
> precision/scale. 
> {code:java}
> SELECT value FROM (
>   SELECT 1000.01 AS "value" 
>   UNION SELECT 10300.001
> ) a {code}
>  
> The postgres JDBC adapter produces a result set that looks like the 
> following: 
>  
> || ||value||precision||scale||
> |metadata|N/A|0|0|
> |row 1|1000.01|18|2|
> |row 2|10300.001|20|7|
>  
> Even a result set that returns a single value may contain Numeric values 
> with a precision / scale that does not match the precision / scale in the 
> ResultSetMetadata. 
>  
> {code:java}
> SELECT AVG(one) from (
>   SELECT 1000.01 as "one" 
>   UNION select 10300.001
> ) a {code}
> produces a result set that looks like this
>  
> || ||value||precision||scale||
> |metadata|N/A|0|0|
> |row 1|5005150.0050001|22|7|
>  
> When processing the result set using the simple jdbcToArrowVectors (or 
> sqlToArrowVectorIterator), this fails to set the values extracted from the 
> result set into the DecimalVector:
>  
> {code:java}
> val calendar = JdbcToArrowUtils.getUtcCalendar()
> val schema = JdbcToArrowUtils.jdbcToArrowSchema(rs.metaData, calendar)
> val root = VectorSchemaRoot.create(schema, RootAllocator())
> val vectors = JdbcToArrowUtils.jdbcToArrowVectors(rs, root, calendar) {code}
> Error:
>  
> {code:java}
> Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 0, 
> length: 1 (expected: range(0, 0))
>     at org.apache.arrow.memory.ArrowBuf.checkIndexD(ArrowBuf.java:318)
>     at org.apache.arrow.memory.ArrowBuf.chk(ArrowBuf.java:305)
>     at org.apache.arrow.memory.ArrowBuf.getByte(ArrowBuf.java:507)
>     at org.apache.arrow.vector.BitVectorHelper.setBit(BitVectorHelper.java:85)
>     at org.apache.arrow.vector.DecimalVector.set(DecimalVector.java:354)
>     at 
> org.apache.arrow.adapter.jdbc.consumer.DecimalConsumer$NullableDecimalConsumer.consume(DecimalConsumer.java:61)
>     at 
> org.apache.arrow.adapter.jdbc.consumer.CompositeJdbcConsumer.consume(CompositeJdbcConsumer.java:46)
>     at 
> org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToArrowVectors(JdbcToArrowUtils.java:369)
>     at 
> org.apache.arrow.adapter.jdbc.JdbcToArrowUtils.jdbcToArrowVectors(JdbcToArrowUtils.java:321)
>  {code}
>  
> Using {{sqlToArrowVectorIterator}} also fails with an error when trying to 
> set data into the vector (this requires a little trickery to force creation 
> of the package-private configuration):
>  
> {code:java}
> Exception in thread "main" java.lang.RuntimeException: Error occurred while 
> getting next schema root.
>     at 
> org.apache.arrow.adapter.jdbc.ArrowVectorIterator.next(ArrowVectorIterator.java:179)
>     at 
> com.acme.dataformat.ArrowResultSetProcessor.processResultSet(ArrowResultSetProcessor.kt:31)
>     at com.acme.AppKt.main(App.kt:54)
>     at com.acme.AppKt.main(App.kt)
> Caused by: java.lang.RuntimeException: Error occurred while 

[jira] [Created] (ARROW-16608) [Gandiva][Java] Unsatisfied Link Error on M1 Mac for aarch

2022-05-18 Thread Jonathan Swenson (Jira)
Jonathan Swenson created ARROW-16608:


 Summary: [Gandiva][Java] Unsatisfied Link Error on M1 Mac for aarch
 Key: ARROW-16608
 URL: https://issues.apache.org/jira/browse/ARROW-16608
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++ - Gandiva, Java
Affects Versions: 8.0.0
Reporter: Jonathan Swenson


Potentially a blocker for Arrow integration into Calcite (CALCITE-2040); 
however, it may be possible to move forward without M1 Mac support. 

Potentially somewhat related to ARROW-11135.

Getting an instance of the JniLoader throws an UnsatisfiedLinkError when it 
tries to load the libgandiva_jni.dylib that it has extracted from the jar into 
a temporary directory. 

Simplified error:
{code:java}
Exception in thread "main" java.lang.UnsatisfiedLinkError: 
/tmp_dir/libgandiva_jni.dylib_uuid: dlopen(/tmp_dir/libgandiva_jni.dylib_uuid, 
0x0001): tried: '/tmp_dir/libgandiva_jni.dylib_uuid' (mach-o file, but is an 
incompatible architecture (have 'x86_64', need 'arm64e')){code}
 

Full error and stack trace:
{code:java}
Exception in thread "main" java.lang.UnsatisfiedLinkError: 
/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe:
 
dlopen(/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe,
 0x0001): tried: 
'/private/var/folders/fj/63_6n5dx10n4b5x7jtdj6tvhgn/T/libgandiva_jni.dylib526a47e1-7306-440f-8bbf-378877abe5fe'
 (mach-o file, but is an incompatible architecture (have 'x86_64', need 
'arm64e'))
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1950)
    at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1832)
    at java.lang.Runtime.load0(Runtime.java:811)
    at java.lang.System.load(System.java:1088)
    at 
org.apache.arrow.gandiva.evaluator.JniLoader.loadGandivaLibraryFromJar(JniLoader.java:74)
    at 
org.apache.arrow.gandiva.evaluator.JniLoader.setupInstance(JniLoader.java:63)
    at 
org.apache.arrow.gandiva.evaluator.JniLoader.getInstance(JniLoader.java:53)
    at 
org.apache.arrow.gandiva.evaluator.JniLoader.getDefaultConfiguration(JniLoader.java:144)
    at org.apache.arrow.gandiva.evaluator.Filter.make(Filter.java:67)
    at io.acme.Main.main(Main.java:26) {code}
 

This example loads three libraries from mavencentral using gradle: 
{code:java}
repositories {
mavenCentral()
}

dependencies {
implementation("org.apache.arrow:arrow-memory-netty:8.0.0")
implementation("org.apache.arrow:arrow-vector:8.0.0")
implementation("org.apache.arrow.gandiva:arrow-gandiva:8.0.0")
} {code}
Example code: 
{code:java}
public class Main {
  public static void main(String[] args) throws GandivaException {
Field field = new Field("int_field", FieldType.nullable(new 
ArrowType.Int(32, true)), null);

Schema schema = makeSchema(field);
Condition condition = makeCondition(field);

Filter.make(schema, condition);
  }

  private static Schema makeSchema(Field field) {
List fieldList = new ArrayList<>();
fieldList.add(field);

return new Schema(fieldList, null);
  }

  private static Condition makeCondition(Field f) {
List treeNodes = new ArrayList<>(2);
treeNodes.add(TreeBuilder.makeField(f));
treeNodes.add(TreeBuilder.makeLiteral(4));
TreeNode comparison = TreeBuilder.makeFunction("less_than", treeNodes, new 
ArrowType.Bool());
return TreeBuilder.makeCondition(comparison);
  }
} {code}
While I haven't tested this exact example, a similar example executes without 
issue on an intel x86 mac.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16592) [FlightRPC][Python] Regression in DoPut error handling

2022-05-18 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li updated ARROW-16592:
-
Component/s: FlightRPC
 Python

> [FlightRPC][Python] Regression in DoPut error handling
> --
>
> Key: ARROW-16592
> URL: https://issues.apache.org/jira/browse/ARROW-16592
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Python
>Reporter: Lubo Slivka
>Assignee: David Li
>Priority: Major
>
> In PyArrow 8.0.0, any error raised while handling DoPut on the server results 
> in FlightInternalError on the client.
> In PyArrow 7.0.0, errors raised while handling DoPut are propagated/converted 
> to non-internal errors.
> —
> Example: on 7.0.0, raising FlightCancelledError while handling DoPut on the 
> server would propagate that error including extra_info all the way to the 
> FlightClient. This is not the case anymore on 8.0.0.
> The FlightInternalError contains extra detail that is derived from the 
> cancelled error though:
> {code:java}
> /arrow/cpp/src/arrow/flight/client.cc:363: Close() failed: IOError: <custom 
> message from FlightError is here>. Detail: Cancelled. gRPC client debug 
> context: {"created":"@1652777650.446052211","description":"Error received 
> from peer 
> ipv4:127.0.0.1:16001","file":"/opt/vcpkg/buildtrees/grpc/src/85a295989c-6cf7bf442d.clean/src/core/lib/surface/call.cc","file_line":903,"grpc_message":"<custom 
> message from FlightError is here>. Detail: Cancelled","grpc_status":1}. 
> Client context: OK. Detail: Cancelled
>  {code}
> Note: skimming through the code, it seems this problem is not unique to 
> PyArrow.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16415) [R] Update strptime bindings to use tz

2022-05-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16415:
---
Labels: pull-request-available  (was: )

> [R] Update strptime bindings to use tz 
> ---
>
> Key: ARROW-16415
> URL: https://issues.apache.org/jira/browse/ARROW-16415
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{strptime}} mentions that it does not support {{tz}}, the timezone 
> argument. ARROW-12820 has been addressed, so the binding definition needs 
> updating.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16521) [C++][R] Configure curl timeout policy for S3

2022-05-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16521:
---
Labels: good-first-issue good-second-issue  (was: )

> [C++][R] Configure curl timeout policy for S3
> -
>
> Key: ARROW-16521
> URL: https://issues.apache.org/jira/browse/ARROW-16521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python, R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Priority: Major
>  Labels: good-first-issue, good-second-issue
> Fix For: 9.0.0
>
>
> Is it possible for the user to increase the timeout allowed on the curl 
> settings when accessing S3 records?  The default setting appears to be more 
> aggressive than most other S3 clients I use, which means that I see a lot 
> more failures on arrow-based operations than the other clients see.  I'm not 
> seeing how this can be increased though?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16521) [C++][R] Configure curl timeout policy for S3

2022-05-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-16521:
--

Assignee: (was: Antoine Pitrou)

> [C++][R] Configure curl timeout policy for S3
> -
>
> Key: ARROW-16521
> URL: https://issues.apache.org/jira/browse/ARROW-16521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python, R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Priority: Major
> Fix For: 9.0.0
>
>
> Is it possible for the user to increase the timeout allowed on the curl 
> settings when accessing S3 records?  The default setting appears to be more 
> aggressive than most other S3 clients I use, which means that I see a lot 
> more failures on arrow-based operations than the other clients see.  I'm not 
> seeing how this can be increased though?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16521) [C++][R] Configure curl timeout policy for S3

2022-05-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538949#comment-17538949
 ] 

Antoine Pitrou commented on ARROW-16521:


We should make the timeout configurable from the API without having to set a 
CURL environment variable.

Also, it looks like the AWS SDK's own {{DefaultRetryStrategy}} has a default 
timeout of around 25 seconds, so we should also use that by default.
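As a sketch of what a user-facing knob might look like from Python (the 
keyword names here are assumptions based on this discussion, not a released 
API):

{code:python}
from pyarrow import fs

# Hypothetical timeout options; the eventual API may differ.
s3 = fs.S3FileSystem(
    region="us-east-1",
    connect_timeout=5,    # seconds to establish a connection
    request_timeout=30,   # seconds per request, in line with the AWS
)                         # SDK's ~25 s default retry window
{code}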

> [C++][R] Configure curl timeout policy for S3
> -
>
> Key: ARROW-16521
> URL: https://issues.apache.org/jira/browse/ARROW-16521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python, R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 9.0.0
>
>
> Is it possible for the user to increase the timeout allowed on the curl 
> settings when accessing S3 records?  The default setting appears to be more 
> aggressive than most other S3 clients I use, which means that I see a lot 
> more failures on arrow-based operations than the other clients see.  I'm not 
> seeing how this can be increased though?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16521) [C++][R] Configure curl timeout policy for S3

2022-05-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-16521:
--

Assignee: Antoine Pitrou

> [C++][R] Configure curl timeout policy for S3
> -
>
> Key: ARROW-16521
> URL: https://issues.apache.org/jira/browse/ARROW-16521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python, R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Assignee: Antoine Pitrou
>Priority: Major
> Fix For: 9.0.0
>
>
> Is it possible for the user to increase the timeout allowed on the curl 
> settings when accessing S3 records?  The default setting appears to be more 
> aggressive than most other S3 clients I use, which means that I see a lot 
> more failures on arrow-based operations than the other clients see.  I'm not 
> seeing how this can be increased though?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16521) [C++][R] Configure curl timeout policy for S3

2022-05-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16521:
---
Component/s: Python

> [C++][R] Configure curl timeout policy for S3
> -
>
> Key: ARROW-16521
> URL: https://issues.apache.org/jira/browse/ARROW-16521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python, R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Priority: Major
> Fix For: 9.0.0
>
>
> Is it possible for the user to increase the timeout allowed on the curl 
> settings when accessing S3 records?  The default setting appears to be more 
> aggressive than most other S3 clients I use, which means that I see a lot 
> more failures on arrow-based operations than the other clients see.  I'm not 
> seeing how this can be increased though?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16521) [C++][R] Configure curl timeout policy for S3

2022-05-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16521:
---
Component/s: C++

> [C++][R] Configure curl timeout policy for S3
> -
>
> Key: ARROW-16521
> URL: https://issues.apache.org/jira/browse/ARROW-16521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Priority: Major
> Fix For: 9.0.0
>
>
> Is it possible for the user to increase the timeout allowed on the curl 
> settings when accessing S3 records?  The default setting appears to be more 
> aggressive than most other S3 clients I use, which means that I see a lot 
> more failures on arrow-based operations than the other clients see.  I'm not 
> seeing how this can be increased though?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16521) [C++][R] Configure curl timeout policy for S3

2022-05-18 Thread Antoine Pitrou (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-16521:
---
Fix Version/s: 9.0.0

> [C++][R] Configure curl timeout policy for S3
> -
>
> Key: ARROW-16521
> URL: https://issues.apache.org/jira/browse/ARROW-16521
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Carl Boettiger
>Priority: Major
> Fix For: 9.0.0
>
>
> Is it possible for the user to increase the timeout allowed on the curl 
> settings when accessing S3 records?  The default setting appears to be more 
> aggressive than most other S3 clients I use, which means that I see a lot 
> more failures on arrow-based operations than the other clients see.  I'm not 
> seeing how this can be increased though?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16161) [C++] Overhead of std::shared_ptr copies is causing thread contention

2022-05-18 Thread Tobias Zagorni (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538927#comment-17538927
 ] 

Tobias Zagorni commented on ARROW-16161:


I created a PR for avoiding calls to Slice(), tracked as ARROW-16562.

> [C++] Overhead of std::shared_ptr copies is causing thread 
> contention
> ---
>
> Key: ARROW-16161
> URL: https://issues.apache.org/jira/browse/ARROW-16161
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: C++
>Reporter: Weston Pace
>Assignee: Tobias Zagorni
>Priority: Major
> Attachments: ExecArrayData-difference.txt
>
>
> We created a benchmark to measure ExecuteScalarExpression performance in 
> ARROW-16014.  We noticed significant thread contention (even though there 
> shouldn't be much, if any, for this task) As part of ARROW-16138 we have been 
> investigating possible causes.
> One cause seems to be contention from copying shared_ptr objects.
> Two possible solutions jump to mind and I'm sure there are many more.
> ExecBatch is an internal type and used inside of ExecuteScalarExpression as 
> well as inside of the execution engine.  In the former we can safely assume 
> the data types will exist for the duration of the call.  In the latter we can 
> safely assume the data types will exist for the duration of the execution 
> plan.  Thus we can probably take a more targeted fix and migrate only 
> ExecBatch to using DataType* (or const DataType&).
> On the other hand, we might consider a more global approach.  All of our 
> "stock" data types are assumed to have static storage duration.  However, we 
> must use std::shared_ptr<DataType> because users could create their own 
> extension types.  We could invent an "extension type registration" system 
> where extension types must first be registered with the C++ lib before being 
> used.  Then we could have long-lived DataType instances and we could replace 
> std::shared_ptr<DataType> with DataType* (or const DataType&) throughout 
> most of the entire code base.
> But, as I mentioned, I'm sure there are many approaches to take.  CC 
> [~lidavidm] and [~apitrou] and [~yibocai] for thoughts but this might be 
> interesting for just about any C++ dev.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16415) [R] Update strptime bindings to use tz

2022-05-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld reassigned ARROW-16415:


Assignee: Dragoș Moldovan-Grünfeld

> [R] Update strptime bindings to use tz 
> ---
>
> Key: ARROW-16415
> URL: https://issues.apache.org/jira/browse/ARROW-16415
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> {{strptime}} mentions that it does not support {{tz}}, the timezone 
> argument. ARROW-12820 has been addressed, so the binding definition needs 
> updating.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16606) [FlightRPC][Python] Flight RPC crashes when a middleware sends an authorization header written with an upper-case A as in 'Authorization'

2022-05-18 Thread David Li (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538883#comment-17538883
 ] 

David Li commented on ARROW-16606:
--

Thanks for the report. I'll try to reproduce it soon, but yes, headers should 
absolutely be case-insensitive and we shouldn't be crashing in either case.
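Until a fix lands, one client-side workaround to try is emitting the key in 
lower case, since gRPC expects custom metadata keys to be lower-case ASCII. A 
minimal sketch based on the reporter's middleware:

{code:python}
import pyarrow.flight as flight

class LowerCaseAuthMiddleware(flight.ClientMiddleware):
    def sending_headers(self):
        # "authorization" instead of "Authorization" avoids the abort.
        return {"authorization": "Basic dXNlcjpwYXNz"}

class LowerCaseAuthMiddlewareFactory(flight.ClientMiddlewareFactory):
    def start_call(self, info):
        return LowerCaseAuthMiddleware()
{code}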

> [FlightRPC][Python] Flight RPC crashes when a middleware sends an 
> authorization header written with an upper-case A as in 'Authorization' 
> --
>
> Key: ARROW-16606
> URL: https://issues.apache.org/jira/browse/ARROW-16606
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Python
>Affects Versions: 7.0.0, 8.0.0
> Environment: Python 3.9.12 on macOS 12.3.1
>Reporter: Paul Horn
>Assignee: David Li
>Priority: Major
>
> Sending a custom {{Authorization}} header leads to a crash of the client.
>  
> Running this Python code, for example:
>  
> {code:python}
> import pyarrow.flight as flight
> 
> class TestMiddlewareFactory(flight.ClientMiddlewareFactory):
>     def __init__(self, *args, **kwargs):
>         super().__init__(*args, **kwargs)
> 
>     def start_call(self, info):
>         return TestMiddleware()
> 
> class TestMiddleware(flight.ClientMiddleware):
>     def __init__(self, *args, **kwargs):
>         super().__init__(*args, **kwargs)
> 
>     def sending_headers(self):
>         return {"Authorization": "Basic dXNlcjpwYXNz"}
> 
> def test():
>     client = flight.FlightClient("grpc://localhost:8491", 
>                                  middleware=[TestMiddlewareFactory()])
>     client.do_get(flight.Ticket(""))
> {code}
>  
>  
> Results in
>  
>  
> {noformat}
> tests/rpc_repro.py Fatal Python error: AbortedCurrent thread 
> 0x000202ecc600 (most recent call first):
>   File "tests/rpc_repro.py", line 22 in test
>   File "venv/lib/python3.9/site-packages/_pytest/python.py", line 183 in 
> pytest_pyfunc_call
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/python.py", line 1641 in 
> runtest
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 162 in 
> pytest_runtest_call
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 255 in 
> 
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 311 in 
> from_call
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 254 in 
> call_runtest_hook
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 215 in 
> call_and_report
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 126 in 
> runtestprotocol
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 109 in 
> pytest_runtest_protocol
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/main.py", line 348 in 
> pytest_runtestloop
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/main.py", line 323 in _main
>   File "venv/lib/python3.9/site-packages/_pytest/main.py", line 269 in 
> wrap_session
>   File "venv/lib/python3.9/site-packages/_pytest/main.py", line 316 in 
> pytest_cmdline_main
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 
> 162 in main
>   File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 
> 185 in console_main
>   File "venv/bin/pytest", line 8 in 
> Abort trap: 6 {noformat}
>  
>  
> With an additional crash report from the OS
>  
> {noformat}
> Process:               Python [26728]
> Path:                  
> 

[jira] [Assigned] (ARROW-16606) [FlightRPC][Python] Flight RPC crashes when a middleware sends an authorization header written with an upper-case A as in 'Authorization'

2022-05-18 Thread David Li (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Li reassigned ARROW-16606:


Assignee: David Li

> [FlightRPC][Python] Flight RPC crashes when a middleware sends an 
> authorization header written with an upper-case A as in 'Authorization' 
> --
>
> Key: ARROW-16606
> URL: https://issues.apache.org/jira/browse/ARROW-16606
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC, Python
>Affects Versions: 7.0.0, 8.0.0
> Environment: Python 3.9.12 on macOS 12.3.1
>Reporter: Paul Horn
>Assignee: David Li
>Priority: Major
>
> Sending a custom {{Authorization}} header leads to a crash of the client.
>  
> Running this Python code, for example:
>  
> {code:python}
> import pyarrow.flight as flight
> 
> class TestMiddlewareFactory(flight.ClientMiddlewareFactory):
>     def __init__(self, *args, **kwargs):
>         super().__init__(*args, **kwargs)
> 
>     def start_call(self, info):
>         return TestMiddleware()
> 
> class TestMiddleware(flight.ClientMiddleware):
>     def __init__(self, *args, **kwargs):
>         super().__init__(*args, **kwargs)
> 
>     def sending_headers(self):
>         return {"Authorization": "Basic dXNlcjpwYXNz"}
> 
> def test():
>     client = flight.FlightClient("grpc://localhost:8491", 
>                                  middleware=[TestMiddlewareFactory()])
>     client.do_get(flight.Ticket(""))
> {code}
>  
>  
> Results in
>  
>  
> {noformat}
> tests/rpc_repro.py Fatal Python error: Aborted
> Current thread 0x000202ecc600 (most recent call first):
>   File "tests/rpc_repro.py", line 22 in test
>   File "venv/lib/python3.9/site-packages/_pytest/python.py", line 183 in 
> pytest_pyfunc_call
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/python.py", line 1641 in 
> runtest
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 162 in 
> pytest_runtest_call
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 255 in 
> 
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 311 in 
> from_call
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 254 in 
> call_runtest_hook
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 215 in 
> call_and_report
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 126 in 
> runtestprotocol
>   File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 109 in 
> pytest_runtest_protocol
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/main.py", line 348 in 
> pytest_runtestloop
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/main.py", line 323 in _main
>   File "venv/lib/python3.9/site-packages/_pytest/main.py", line 269 in 
> wrap_session
>   File "venv/lib/python3.9/site-packages/_pytest/main.py", line 316 in 
> pytest_cmdline_main
>   File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
> _multicall
>   File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
> _hookexec
>   File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in 
> __call__
>   File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 
> 162 in main
>   File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 
> 185 in console_main
>   File "venv/bin/pytest", line 8 in 
> Abort trap: 6 {noformat}
>  
>  
> With an additional crash report from the OS
>  
> {noformat}
> Process:               Python [26728]
> Path:                  
> /usr/local/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/Resources/Python.app/Contents/MacOS/Python
> Identifier:            org.python.python
> Version:               3.9.12 (3.9.12)
> Code Type:         

[jira] [Updated] (ARROW-16606) [FlightRPC][Python] Flight RPC crashes when a middleware sends an authorization header written with an upper-case A as in 'Authorization'

2022-05-18 Thread Paul Horn (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Horn updated ARROW-16606:
--
Description: 
Sending a custom {{Authorization}} header leads to a crash of the client.

Running this Python code, for example:

{code:python}
import pyarrow.flight as flight

class TestMiddlewareFactory(flight.ClientMiddlewareFactory):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def start_call(self, info):
        return TestMiddleware()


class TestMiddleware(flight.ClientMiddleware):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def sending_headers(self):
        return {"Authorization": "Basic dXNlcjpwYXNz"}


def test():
    client = flight.FlightClient("grpc://localhost:8491",
                                 middleware=[TestMiddlewareFactory()])
    client.do_get(flight.Ticket(""))
{code}
 

 

Results in

 

 
{noformat}
tests/rpc_repro.py Fatal Python error: Aborted
Current thread 0x000202ecc600 (most recent call first):
  File "tests/rpc_repro.py", line 22 in test
  File "venv/lib/python3.9/site-packages/_pytest/python.py", line 183 in 
pytest_pyfunc_call
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/python.py", line 1641 in 
runtest
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 162 in 
pytest_runtest_call
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 255 in 

  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 311 in 
from_call
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 254 in 
call_runtest_hook
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 215 in 
call_and_report
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 126 in 
runtestprotocol
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 109 in 
pytest_runtest_protocol
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/main.py", line 348 in 
pytest_runtestloop
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/main.py", line 323 in _main
  File "venv/lib/python3.9/site-packages/_pytest/main.py", line 269 in 
wrap_session
  File "venv/lib/python3.9/site-packages/_pytest/main.py", line 316 in 
pytest_cmdline_main
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 162 
in main
  File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 185 
in console_main
  File "venv/bin/pytest", line 8 in 
Abort trap: 6 {noformat}
 

 

With an additional crash report from the OS

 
{noformat}
Process:               Python [26728]
Path:                  
/usr/local/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/Resources/Python.app/Contents/MacOS/Python
Identifier:            org.python.python
Version:               3.9.12 (3.9.12)
Code Type:             X86-64 (Translated)
Parent Process:        bash [4683]
Responsible:           iTerm2 [99236]
User ID:               501
Date/Time:             2022-05-18 15:35:10.1978 +0200
OS Version:            macOS 12.3.1 (21E258)
Report Version:        12
Anonymous UUID:        4A72633D-06AC-F2CE-0E3F-0AD87FA611CE
Sleep/Wake UUID:       3D7BD416-99A9-41B3-8163-5544AEF31FF5
Time Awake Since Boot: 100 seconds
Time Since Wake:       22827 seconds
System Integrity Protection: enabled
Crashed Thread:        0  Dispatch queue: com.apple.main-thread
Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x, 0x
Exception Note:        EXC_CORPSE_NOTIFY
Application Specific Information:
abort() called
Thread 0 Crashed::  Dispatch queue: com.apple.main-thread
0   ???                                 0x7ff8a597e940 ???
1   libsystem_kernel.dylib       

[jira] [Created] (ARROW-16607) [R] Improve KeyValueMetadata handling

2022-05-18 Thread Neal Richardson (Jira)
Neal Richardson created ARROW-16607:
---

 Summary: [R] Improve KeyValueMetadata handling
 Key: ARROW-16607
 URL: https://issues.apache.org/jira/browse/ARROW-16607
 Project: Apache Arrow
  Issue Type: Improvement
  Components: R
Reporter: Neal Richardson
Assignee: Neal Richardson
 Fix For: 9.0.0


Follow-up to ARROW-15271. Among the objectives:

* Push KVM handling into ExecPlan so that Run() preserves the R metadata we 
want; also remove the duplicate handling of it for Write()
* Better encapsulate KVM for the $metadata and $r_metadata so that as a 
user/developer, you never have to touch the serialize/deserialize functions; 
you just have a list to work with
* Factor out a common utility in r/src for taking cpp11::strings (named 
character vector) and producing arrow::KeyValueMetadata



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16606) [FlightRPC][Python] Flight RPC crashes when a middleware sends an authorization header written with an upper-case A as in 'Authorization'

2022-05-18 Thread Paul Horn (Jira)
Paul Horn created ARROW-16606:
-

 Summary: [FlightRPC][Python] Flight RPC crashes when a middleware 
sends an authorization header written with an upper-case A as in 
'Authorization' 
 Key: ARROW-16606
 URL: https://issues.apache.org/jira/browse/ARROW-16606
 Project: Apache Arrow
  Issue Type: Bug
  Components: FlightRPC, Python
Affects Versions: 8.0.0, 7.0.0
 Environment: Python 3.9.12 on macOS 12.3.1
Reporter: Paul Horn


Sending a custom `Authorization` header crashes the client.

 

Running this Python code, for example:

 
{code:python}
import pyarrow.flight as flight


class TestMiddlewareFactory(flight.ClientMiddlewareFactory):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def start_call(self, info):
        return TestMiddleware()


class TestMiddleware(flight.ClientMiddleware):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def sending_headers(self):
        return {"Authorization": "Basic dXNlcjpwYXNz"}


def test():
    client = flight.FlightClient("grpc://localhost:8491",
                                 middleware=[TestMiddlewareFactory()])
    client.do_get(flight.Ticket(""))
{code}
 

 

Results in

 

 
{noformat}
tests/rpc_repro.py Fatal Python error: Aborted
Current thread 0x000202ecc600 (most recent call first):
  File "tests/rpc_repro.py", line 22 in test
  File "venv/lib/python3.9/site-packages/_pytest/python.py", line 183 in 
pytest_pyfunc_call
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/python.py", line 1641 in 
runtest
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 162 in 
pytest_runtest_call
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 255 in 

  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 311 in 
from_call
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 254 in 
call_runtest_hook
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 215 in 
call_and_report
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 126 in 
runtestprotocol
  File "venv/lib/python3.9/site-packages/_pytest/runner.py", line 109 in 
pytest_runtest_protocol
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/main.py", line 348 in 
pytest_runtestloop
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/main.py", line 323 in _main
  File "venv/lib/python3.9/site-packages/_pytest/main.py", line 269 in 
wrap_session
  File "venv/lib/python3.9/site-packages/_pytest/main.py", line 316 in 
pytest_cmdline_main
  File "venv/lib/python3.9/site-packages/pluggy/_callers.py", line 39 in 
_multicall
  File "venv/lib/python3.9/site-packages/pluggy/_manager.py", line 80 in 
_hookexec
  File "venv/lib/python3.9/site-packages/pluggy/_hooks.py", line 265 in __call__
  File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 162 
in main
  File "venv/lib/python3.9/site-packages/_pytest/config/__init__.py", line 185 
in console_main
  File "venv/bin/pytest", line 8 in 
Abort trap: 6 {noformat}
 

 

With an additional crash report from the OS

 
{noformat}
Process:               Python [26728]
Path:                  /usr/local/Cellar/python@3.9/3.9.12/Frameworks/Python.framework/Versions/3.9/Resources/Python.app/Contents/MacOS/Python
Identifier:            org.python.python
Version:               3.9.12 (3.9.12)
Code Type:             X86-64 (Translated)
Parent Process:        bash [4683]
Responsible:           iTerm2 [99236]
User ID:               501
Date/Time:             2022-05-18 15:35:10.1978 +0200
OS Version:            macOS 12.3.1 (21E258)
Report Version:        12
Anonymous UUID:        4A72633D-06AC-F2CE-0E3F-0AD87FA611CE
Sleep/Wake UUID:       3D7BD416-99A9-41B3-8163-5544AEF31FF5
Time Awake Since Boot: 100 seconds
Time Since Wake:       22827 seconds
System Integrity Protection: enabled
Crashed Thread:        0  Dispatch queue: com.apple.main-thread

[jira] [Updated] (ARROW-16415) [R] Update strptime bindings to use tz

2022-05-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-16415:
-
Description: {{strptime}} mentions it does not support {{tz}} - the 
timezone argument. ARROW-12820 has been addressed and the binding definition 
needs updating.  (was: Both functions mention they do not support {{tz}} - the 
timezone argument. ARROW-12820 has been addressed and the bindings definitions 
need updating.)

> [R] Update strptime bindings to use tz 
> ---
>
> Key: ARROW-16415
> URL: https://issues.apache.org/jira/browse/ARROW-16415
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> {{strptime}} mentions it does not support {{tz}} - the timezone argument. 
> ARROW-12820 has been addressed and the binding definition needs updating.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16415) [R] Update strptime bindings to use tz

2022-05-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-16415:
-
Summary: [R] Update strptime bindings to use tz   (was: [R] Update strptime 
and fast_strptime bindings to use tz )

> [R] Update strptime bindings to use tz 
> ---
>
> Key: ARROW-16415
> URL: https://issues.apache.org/jira/browse/ARROW-16415
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 7.0.0
>Reporter: Dragoș Moldovan-Grünfeld
>Priority: Major
> Fix For: 9.0.0
>
>
> Both functions mention they do not support {{tz}} - the timezone argument. 
> ARROW-12820 has been addressed and the bindings definitions need updating.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16317) [Archery][CI] Fix possible race condition when submitting crossbow builds

2022-05-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16317:
---
Labels: pull-request-available  (was: )

> [Archery][CI] Fix possible race condition when submitting crossbow builds
> -
>
> Key: ARROW-16317
> URL: https://issues.apache.org/jira/browse/ARROW-16317
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery, Continuous Integration
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Sometimes, when using GitHub Actions to submit crossbow jobs, an error like 
> the following is raised:
> {code:java}
> Failed to push updated references, potentially because of credential issues: 
> ['refs/heads/actions-1883-github-wheel-windows-cp310-amd64', 
> 'refs/tags/actions-1883-github-wheel-windows-cp310-amd64', 
> 'refs/heads/actions-1883-github-wheel-windows-cp39-amd64', 
> 'refs/tags/actions-1883-github-wheel-windows-cp39-amd64', 
> 'refs/heads/actions-1883-github-wheel-windows-cp37-amd64', 
> 'refs/tags/actions-1883-github-wheel-windows-cp37-amd64', 
> 'refs/heads/actions-1883-github-wheel-windows-cp38-amd64', 
> 'refs/tags/actions-1883-github-wheel-windows-cp38-amd64', 
> 'refs/heads/actions-1883']
> The Archery job run can be found at: 
> https://github.com/apache/arrow/actions/runs/2195038965{code}
> As discussed in this GitHub comment 
> ([https://github.com/apache/arrow/pull/12930#issuecomment-1103772507]), we 
> should remove the auto-incremented IDs entirely and use unique hashes 
> instead, e.g. actions-<hash>-github-wheel-windows-cp310-amd64 instead 
> of actions-1883-github-wheel-windows-cp310-amd64. Then we wouldn't need to 
> fetch the new references either, making remote crossbow builds and local 
> submission much quicker.
> The error can also be seen here: 
> https://github.com/apache/arrow/pull/12987#issuecomment-1108516668
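A minimal sketch of the proposed naming scheme, with hypothetical helper and variable names, replacing the auto-incremented ID with a random hash so that concurrent submissions cannot race on the same refs:

{code:python}
import secrets


def unique_task_prefix(queue_name: str = "actions") -> str:
    # An 8-hex-digit random token instead of an auto-incremented ID means
    # two concurrent submissions cannot collide on branch/tag names.
    return f"{queue_name}-{secrets.token_hex(4)}"


# e.g. "actions-9f3c21ab-github-wheel-windows-cp310-amd64"
branch = f"{unique_task_prefix()}-github-wheel-windows-cp310-amd64"
{code}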



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16317) [Archery][CI] Fix possible race condition when submitting crossbow builds

2022-05-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido reassigned ARROW-16317:
-

Assignee: Raúl Cumplido

> [Archery][CI] Fix possible race condition when submitting crossbow builds
> -
>
> Key: ARROW-16317
> URL: https://issues.apache.org/jira/browse/ARROW-16317
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Archery, Continuous Integration
>Reporter: Raúl Cumplido
>Assignee: Raúl Cumplido
>Priority: Major
> Fix For: 9.0.0
>
>
> Sometimes, when using GitHub Actions to submit crossbow jobs, an error like 
> the following is raised:
> {code:java}
> Failed to push updated references, potentially because of credential issues: 
> ['refs/heads/actions-1883-github-wheel-windows-cp310-amd64', 
> 'refs/tags/actions-1883-github-wheel-windows-cp310-amd64', 
> 'refs/heads/actions-1883-github-wheel-windows-cp39-amd64', 
> 'refs/tags/actions-1883-github-wheel-windows-cp39-amd64', 
> 'refs/heads/actions-1883-github-wheel-windows-cp37-amd64', 
> 'refs/tags/actions-1883-github-wheel-windows-cp37-amd64', 
> 'refs/heads/actions-1883-github-wheel-windows-cp38-amd64', 
> 'refs/tags/actions-1883-github-wheel-windows-cp38-amd64', 
> 'refs/heads/actions-1883']
> The Archery job run can be found at: 
> https://github.com/apache/arrow/actions/runs/2195038965{code}
> As discussed in this GitHub comment 
> ([https://github.com/apache/arrow/pull/12930#issuecomment-1103772507]), we 
> should remove the auto-incremented IDs entirely and use unique hashes 
> instead, e.g. actions-<hash>-github-wheel-windows-cp310-amd64 instead 
> of actions-1883-github-wheel-windows-cp310-amd64. Then we wouldn't need to 
> fetch the new references either, making remote crossbow builds and local 
> submission much quicker.
> The error can also be seen here: 
> https://github.com/apache/arrow/pull/12987#issuecomment-1108516668



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (ARROW-16582) [Python] Include DATASET in list of components in PyArrow's dev page

2022-05-18 Thread Jira


 [ 
https://issues.apache.org/jira/browse/ARROW-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raúl Cumplido reassigned ARROW-16582:
-

Assignee: Raúl Cumplido

> [Python] Include DATASET in list of components in PyArrow's dev page
> 
>
> Key: ARROW-16582
> URL: https://issues.apache.org/jira/browse/ARROW-16582
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Yaron Gvili
>Assignee: Raúl Cumplido
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> PyArrow's dev page has a [build-and-test 
> section|https://arrow.apache.org/docs/developers/python.html#build-and-test] 
> that currently does not list DATASET as a component. Using a recent Arrow 
> version (commit e5e490), I observed DATASET was mandatory for the successful 
> completion of the test suite run by `python -m pytest pyarrow/`, as 
> recommended on the page. Without `export PYARROW_WITH_DATASET=1`, I observed 
> errors with `test_dataset.py`, `test_exec_plan.py`, and a couple of others.
> Since DATASET is intended to be an optional component, it should be listed in 
> this section. In addition, the documented test suite command should be 
> updated to one that doesn't fail without the DATASET component being selected 
> (or else the test suite itself should be fixed).
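One concrete way to make the test suite tolerate a missing DATASET component, sketched here as an assumption rather than the project's chosen fix, is a module-level skip guard:

{code:python}
import pytest

try:
    import pyarrow.dataset  # noqa: F401
    HAVE_DATASET = True
except ImportError:
    # PyArrow was built without PYARROW_WITH_DATASET=1.
    HAVE_DATASET = False

# Skip every test in the module instead of failing when the optional
# DATASET component is not built in.
pytestmark = pytest.mark.skipif(
    not HAVE_DATASET, reason="pyarrow built without dataset support"
)
{code}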



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (ARROW-16582) [Python] Include DATASET in list of components in PyArrow's dev page

2022-05-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-16582:
---
Labels: pull-request-available  (was: )

> [Python] Include DATASET in list of components in PyArrow's dev page
> 
>
> Key: ARROW-16582
> URL: https://issues.apache.org/jira/browse/ARROW-16582
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Yaron Gvili
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> PyArrow's dev page has a [build-and-test 
> section|https://arrow.apache.org/docs/developers/python.html#build-and-test] 
> that currently does not list DATASET as a component. Using a recent Arrow 
> version (commit e5e490), I observed DATASET was mandatory for the successful 
> completion of the test suite run by `python -m pytest pyarrow/`, as 
> recommended on the page. Without `export PYARROW_WITH_DATASET=1`, I observed 
> errors with `test_dataset.py`, `test_exec_plan.py`, and a couple of others.
> Since DATASET is intended to be an optional component, it should be listed in 
> this section. In addition, the documented test suite command should be 
> updated to one that doesn't fail without the DATASET component being selected 
> (or else the test suite itself should be fixed).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16605) [CI][R] Fix revdep Crossbow job

2022-05-18 Thread Jacob Wujciak-Jens (Jira)
Jacob Wujciak-Jens created ARROW-16605:
--

 Summary: [CI][R] Fix revdep Crossbow job
 Key: ARROW-16605
 URL: https://issues.apache.org/jira/browse/ARROW-16605
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Continuous Integration, R
Reporter: Jacob Wujciak-Jens
Assignee: Jacob Wujciak-Jens
 Fix For: 9.0.0


The revdep Crossbow job is currently not functioning correctly. This allowed a 
behaviour change affecting a reverse dependency to ship with the 8.0.0 release, 
requiring a patch after the initial submission.
cc: [~jonkeane]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16604) [C++] Boost not included when building benchmarks

2022-05-18 Thread Yibo Cai (Jira)
Yibo Cai created ARROW-16604:


 Summary: [C++] Boost not included when building benchmarks
 Key: ARROW-16604
 URL: https://issues.apache.org/jira/browse/ARROW-16604
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Yibo Cai


{code:bash}
cmake -GNinja -DARROW_BUILD_BENCHMARKS=ON ..
{code}
fails with many Boost-related errors, as below:
{code:bash}
CMake Error at cmake_modules/BuildUtils.cmake:522 (add_executable):
  Target "arrow-json-parser-benchmark" links to target "Boost::system" but
  the target was not found.  Perhaps a find_package() call is missing for an
  IMPORTED target, or an ALIAS target is missing?
Call Stack (most recent call first):
  src/arrow/CMakeLists.txt:114 (add_benchmark)
  src/arrow/json/CMakeLists.txt:28 (add_arrow_benchmark)
{code}

The errors are gone if tests are also built with {{-DARROW_BUILD_TESTS=ON}}. It 
looks like Boost is not included when only benchmarks are built.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-14314) [C++] Sorting dictionary array not implemented

2022-05-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538702#comment-17538702
 ] 

Antoine Pitrou commented on ARROW-14314:


bq. But to avoid that, we can replace null values into indices, so the problem 
will look like this:

We can indeed. Another possibility is to partition nulls away first, then work 
on non-null values (partitioning is how the sorting implementation already 
deals with null values for other data types). That might be a bit faster as 
well.

bq. btw, why do we allow nulls in values? Shouldn't it be easier to have them 
only in indices?

Probably for compatibility with various data sources.
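Until a dictionary sort kernel is implemented, one workaround from Python is to decode the dictionary column and sort the decoded values instead; a minimal sketch, assuming the decoded column fits comfortably in memory:

{code:python}
import pyarrow as pa
import pyarrow.compute as pc

dict_arr = pa.array(["b", "a", "b", None]).dictionary_encode()

# sort_indices is not implemented for dictionary arrays, so decode back
# to the plain value type and compute the sort order on that instead.
decoded = dict_arr.dictionary_decode()
indices = pc.sort_indices(decoded)
print(dict_arr.take(indices))
{code}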


> [C++] Sorting dictionary array not implemented
> --
>
> Key: ARROW-14314
> URL: https://issues.apache.org/jira/browse/ARROW-14314
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Neal Richardson
>Priority: Major
>  Labels: kernel
> Fix For: 9.0.0
>
>
> From R, taking the stock {{mtcars}} dataset and giving it a dictionary type 
> column:
> {code}
> mtcars %>% 
>   mutate(cyl = as.factor(cyl)) %>% 
>   Table$create() %>% 
>   arrange(cyl) %>% 
>   collect()
> Error: Type error: Sorting not supported for type dictionary<values=string, indices=int8, ordered=0>
> ../src/arrow/compute/kernels/vector_array_sort.cc:427  VisitTypeInline(type, 
> this)
> ../src/arrow/compute/kernels/vector_sort.cc:148  
> GetArraySorter(*physical_type_)
> ../src/arrow/compute/kernels/vector_sort.cc:1206  sorter.Sort()
> ../src/arrow/compute/api_vector.cc:259  CallFunction("sort_indices", {datum}, 
> , ctx)
> ../src/arrow/compute/exec/order_by_impl.cc:53  SortIndices(table, options_, 
> ctx_)
> ../src/arrow/compute/exec/sink_node.cc:292  impl_->DoFinish()
> ../src/arrow/compute/exec/exec_plan.cc:297  iterator_.Next()
> ../src/arrow/record_batch.cc:318  ReadNext()
> ../src/arrow/record_batch.cc:329  ReadAll()
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Resolved] (ARROW-16516) [R] Implement ym() my() and yq() parsers

2022-05-18 Thread Nicola Crane (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-16516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicola Crane resolved ARROW-16516.
--
Resolution: Fixed

Issue resolved by pull request 13163
[https://github.com/apache/arrow/pull/13163]

> [R] Implement ym() my() and yq() parsers
> 
>
> Key: ARROW-16516
> URL: https://issues.apache.org/jira/browse/ARROW-16516
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Dragoș Moldovan-Grünfeld
>Assignee: Dragoș Moldovan-Grünfeld
>Priority: Major
>  Labels: pull-request-available
> Fix For: 9.0.0
>
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (ARROW-16603) [Python] pyarrow.json.read_json ignores nullable=False in explicit_schema parse_options

2022-05-18 Thread Alenka Frim (Jira)
Alenka Frim created ARROW-16603:
---

 Summary: [Python] pyarrow.json.read_json ignores nullable=False in 
explicit_schema parse_options
 Key: ARROW-16603
 URL: https://issues.apache.org/jira/browse/ARROW-16603
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Alenka Frim


Reproducible example:
{code:python}
import json
import pyarrow.json as pj
import pyarrow as pa

s = {"id": "value", "nested": {"value": 1}}

with open("issue.json", "w") as write_file:
    json.dump(s, write_file, indent=4)

schema = pa.schema([
    pa.field("id", pa.string(), nullable=False),
    pa.field("nested", pa.struct([pa.field("value", pa.int64(),
                                           nullable=False)]))
])

table = pj.read_json('issue.json',
                     parse_options=pj.ParseOptions(explicit_schema=schema))

print(schema)
print(table.schema)
{code}
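Until this is fixed, a possible workaround, assuming the data genuinely contains no nulls, is to cast the table back to the intended schema after reading:

{code:python}
# A sketch of a workaround, not an official fix: casting the table to the
# desired schema reinstates the nullable=False flags. Note that Arrow does
# not verify the data against the non-nullable fields.
fixed = table.cast(schema)
print(fixed.schema)
{code}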



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-16582) [Python] Include DATASET in list of components in PyArrow's dev page

2022-05-18 Thread Yaron Gvili (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538603#comment-17538603
 ] 

Yaron Gvili commented on ARROW-16582:
-

Another possible fix is for the build to automatically select DATASET if some 
other component, like PARQUET, is selected.

> [Python] Include DATASET in list of components in PyArrow's dev page
> 
>
> Key: ARROW-16582
> URL: https://issues.apache.org/jira/browse/ARROW-16582
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation, Python
>Reporter: Yaron Gvili
>Priority: Major
> Fix For: 9.0.0
>
>
> PyArrow's dev page has a [build-and-test 
> section|https://arrow.apache.org/docs/developers/python.html#build-and-test] 
> that currently does not list DATASET as a component. Using a recent Arrow 
> version (commit e5e490), I observed DATASET was mandatory for the successful 
> completion of the test suite run by `python -m pytest pyarrow/`, as 
> recommended on the page. Without `export PYARROW_WITH_DATASET=1`, I observed 
> errors with `test_dataset.py`, `test_exec_plan.py`, and a couple of others.
> Since DATASET is intended to be an optional component, it should be listed in 
> this section. In addition, the documented test suite command should be 
> updated to one that doesn't fail without the DATASET component being selected 
> (or else the test suite itself should be fixed).



--
This message was sent by Atlassian Jira
(v8.20.7#820007)